Characters from the Past: from ASCII to ISO Encodings

The American Standard Code for Information Interchange (ASCII) was developed in the early '60s as a standard encoding of computer characters, encompassing the 26 letters of the English alphabet, both lowercase and uppercase, the digits, common punctuation symbols, and a number of control characters [4].

ASCII uses a 7-bit encoding system to represent 128 different characters. Only characters between #32 (Space) and #126 (Tilde) have a visual representation, as shown in the following table:

       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0
  16
  32      !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
  48   0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
  64   @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
  80   P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
  96   `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
 112   p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
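The printable range can also be listed from a plain console program. What follows is only a minimal sketch, assuming a console application instead of the StringGrid used by the book's program; the program name PrintableAscii and the variable Code are illustrative and not part of the original example.

```delphi
program PrintableAscii;

{$APPTYPE CONSOLE}

var
  Code: Integer;
begin
  // Codes 32 (space) through 126 (tilde) are the printable ASCII range
  for Code := 32 to 126 do
  begin
    // Start a new row every 16 codes and label it with its first code
    if Code mod 16 = 0 then
    begin
      Writeln;
      Write(Code:4, ' ');
    end;
    Write(' ', AnsiChar(Code));
  end;
  Writeln;
end.
```

Each output row starts with the code of its first character (32, 48, 64, and so on), which mirrors the row labels of the table.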

While ASCII was certainly a foundation (with its basic set of 128 characters that are still part of the core of Unicode), it was soon superseded by extended versions that used the 8th bit to add another 128 characters to the set.

The problem is that, with so many languages around the world, there was no simple way to decide which other characters to include in the set (at times indicated as ASCII-8). To make a long story short, Windows adopts a different set of characters, called a code page, depending on your locale configuration and version of Windows. Besides Windows code pages, there are many other standards based on a similar paging approach.

[4] While most control characters have lost any meaning (like the File Separator or the Vertical Tab), some like the Carriage Return (#13), Line Feed (#10), Tab (#9), and Backspace (#8) are still in everyday use.

The most relevant is certainly the ISO 8859 standard, which defines several regional sets. The most used set (or, to be a little more precise, the one used in most Western countries) is the Latin-1 set, referenced as ISO 8859-1. Even if partially similar, the Windows 1252 code page doesn't fully conform to the ISO 8859-1 set. Windows adds extra characters like the € symbol, as we'll see later.
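The difference between the two sets can be checked by decoding the same byte under both of them. The following is a sketch under stated assumptions: it relies on the TEncoding class available in SysUtils from Delphi 2009 onward, it assumes a Windows system, and it uses 28591 as the Windows code page identifier for ISO 8859-1. The DecodeByte helper is hypothetical, written only for this illustration.

```delphi
program EuroByte;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: decodes a single byte with the given code page
// and returns the resulting Unicode code point as a hex string.
function DecodeByte(Value: Byte; CodePage: Integer): string;
var
  Encoding: TEncoding;
  Bytes: TBytes;
  Decoded: string;
begin
  SetLength(Bytes, 1);
  Bytes[0] := Value;
  // Encodings obtained with GetEncoding must be freed by the caller
  Encoding := TEncoding.GetEncoding(CodePage);
  try
    Decoded := Encoding.GetString(Bytes);
    Result := 'U+' + IntToHex(Ord(Decoded[1]), 4);
  finally
    Encoding.Free;
  end;
end;

begin
  // 1252 is Windows 1252; 28591 is assumed to identify ISO 8859-1
  Writeln('Byte $80 as Windows 1252: ', DecodeByte($80, 1252));
  Writeln('Byte $80 as ISO 8859-1:   ', DecodeByte($80, 28591));
end.
```

On a Windows system the first line should report U+20AC (the € symbol), while the second should report U+0080, a control character, since ISO 8859-1 has no printable character in that position.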

If I keep printing all 8-bit characters on my computer (which uses the Windows 1252 code page by default), I get the following output (yours might be different) [5]:

[Figure: a 16-column grid of all 256 character codes as displayed under the Windows 1252 code page, with row labels from 0 to 240. Rows 0 through 112 repeat the ASCII table above; rows 128 through 240 show the extended characters, including Œ, œ, £, §, «, », ±, Æ, and the accented Latin letters.]

How did I get this and the previous image? Using a simple Delphi 2009 program (called FromAsciiToUnicode) that displays characters on a StringGrid component, initially with the number of the corresponding columns and rows painted on the borders. The program forces some type casts to the AnsiChar type [6] to be able to manage traditional 8-bit characters (more on this in the next chapter):

[5] If the system default is a multi-byte code page, the code of this program becomes meaningless, because most of the characters #$80 through #$FF are lead bytes, which can't be displayed on their own.

[6] As we'll see in detail in the next chapter, in Delphi 2009 the Char type has changed, and the old Char type of Delphi 1 through Delphi 2007 is now called AnsiChar.

  procedure TForm30.btnAscii8Click(Sender: TObject);
  var
    I: Integer;
  begin
    ClearGrid;
    // NOTE: the loop header is missing from this listing; the 32..255
    // range is an assumption based on the output shown above
    for I := 32 to 255 do
      StringGrid1.Cells [I mod 16 + 1, I div 16 + 1] := AnsiChar (I);
  end;

In previous versions of Delphi you could obtain the same output by writing the following simpler version of the assignment (which uses Char rather than AnsiChar for the conversion):

  StringGrid1.Cells [I mod 16 + 1, I div 16 + 1] := Char (I);
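The index arithmetic in the grid assignment places 16 characters per row and shifts everything by one cell to leave room for the border labels. As a small worked example, here is a console sketch of the same arithmetic; the program name GridPosition and the ShowCell procedure are illustrative and not part of the FromAsciiToUnicode program.

```delphi
program GridPosition;

{$APPTYPE CONSOLE}

// Prints the grid column and row for a character code, using the same
// arithmetic as the StringGrid assignment (the + 1 skips the border cells).
procedure ShowCell(Code: Integer);
begin
  Writeln('Code ', Code, ' -> column ', Code mod 16 + 1,
    ', row ', Code div 16 + 1);
end;

begin
  ShowCell(32);   // column 1, row 3
  ShowCell(65);   // column 2, row 5
  ShowCell(255);  // column 16, row 16
end.
```

The letter A (code 65) therefore ends up in the second column of the fifth row, right below the row labeled 48 and next to the label 64.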

I don't think I really need to tell you how messy the situation is with the various ISO 8859 encodings (there are 16 of them, still unable to cover the more complex alphabets), Windows code pages, and multi-byte representations to cover Chinese and other languages. With Unicode, this is all behind us, even though the new standard has its own complexity and potential problems.
