CERM - Business Management Software for Narrow Web …



Unicode

Some history

ASCII

In order to display numbers, letters and other kinds of symbols on a screen, a character set named ASCII was invented that assigns one number to every character. The appearance of a character is not part of this definition; the chosen font determines how it looks.

The numbers 0 to 31 of the ASCII table (and 127, delete) are control codes, such as carriage return. The remaining numbers, 32 to 126, represent printable characters, with 32 being the space.

[pic]

One number was stored in 1 byte, so 8 bits. In the original ASCII definition only 7 bits were used, since the 8th bit was used as a check (parity) bit. So 2^7 = 128 possibilities.

However, in many languages the 26 letters of the basic alphabet weren't sufficient. To solve that problem, the ASCII table was changed per language, resulting in multiple 'language dependent' character sets.
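As a small illustration of the "one number per character" idea, the following Python sketch looks up a few ASCII codes. The variable names are chosen for this example only.

```python
# Every ASCII character corresponds to exactly one number (its code).
code_of_A = ord("A")       # number assigned to the letter 'A'
char_of_97 = chr(97)       # character assigned to the number 97

# 7 bits give 2**7 distinct codes, numbered 0 to 127.
ascii_size = 2 ** 7

# Codes 0 to 31 are control codes; carriage return is one of them.
carriage_return = ord("\r")

print(code_of_A, repr(char_of_97), ascii_size, carriage_return)
```

Note that nothing here says how 'A' looks on screen; that is left to the font.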

8th bit extension

Soon, IBM took the initiative to really use the 8th bit. This resulted in 2^8 = 256 characters. Even this wasn't sufficient, so, as before, multiple 'language dependent' character sets were developed. IBM identified all these variants as Code Pages, each with an identification number. In MS-DOS for Western Europe, for example, Code Page 437 was applied since it contains the 'é' character. Later on, ISO also created a standard: the popular ISO 8859-1, which was sufficient for most languages used in Western Europe. Microsoft based their Code Page 1252 on this ISO standard.
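The effect of code pages can be sketched in Python, whose standard codecs include "cp437", "latin-1" (ISO 8859-1) and "cp1252". The example below is only an illustration: it shows that the same character 'é' is stored as a different byte value depending on the code page, and that reading a byte with the wrong code page yields a different character.

```python
text = "\u00e9"  # the character 'é'

# The same character gets a different byte value in different code pages.
cp437_bytes = text.encode("cp437")      # MS-DOS code page 437
latin1_bytes = text.encode("latin-1")   # ISO 8859-1
cp1252_bytes = text.encode("cp1252")    # Windows code page 1252

print(cp437_bytes.hex(), latin1_bytes.hex(), cp1252_bytes.hex())

# Reading a byte with the wrong code page gives the wrong character.
misread = cp437_bytes.decode("cp1252")
print(f"intended U+{ord(text):04X}, misread as U+{ord(misread):04X}")
```

This is why a text file written under one code page can show wrong accented characters when opened under another.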

[pic]

From Windows 95 on, Microsoft used the term 'ANSI' to identify the national extended ASCII character set supported by the operating system.

Unicode

Unicode extension

The goal became to develop a system capable of supporting all the world's scripts (Greek, Chinese, etc.). The international standard Unicode was developed to meet this goal.

The use of 2 bytes per character would result in 2^16 = 65,536 characters. But this wasn't enough to support all the world's scripts (Chinese alone already consists of more than 25,000 characters). So a system of encodings was put in place that uses additional bytes only when necessary:

- UTF-8: characters are described as one to four 8-bit numbers. This encoding keeps existing ASCII text intact, which is an important property.

- UTF-16: uses one or two (depending on needs) 16-bit numbers.

- UTF-32: uses one 32-bit number. This is sufficient to describe every character of every script.

All possible character sets were added to the Unicode standard, and new ones are still being added today. Even ancient scripts and Braille are part of the system. This is why Unicode has version numbers.

Unicode characters are often identified by their hexadecimal representation. The Greek script, for example, uses the range 0370 to 03FF.
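The difference between the three encodings can be made concrete by counting bytes. The sketch below uses Python's built-in codecs; the helper name `encoded_sizes` and the sample characters are chosen for this illustration only.

```python
# Sample characters: 'A', 'é', Greek theta, a Chinese character, an emoji.
samples = ["A", "\u00e9", "\u0398", "\u4e2d", "\U0001F600"]

def encoded_sizes(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants are used so that no byte order mark is prepended.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for ch in samples:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8}  UTF-16: {utf16}  UTF-32: {utf32}")

# Plain ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_unchanged = "Cerm".encode("ascii") == "Cerm".encode("utf-8")
print(ascii_unchanged)
```

UTF-8 grows from one to four bytes as the code point gets larger, UTF-16 uses two or four bytes, and UTF-32 always uses four.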
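The hexadecimal notation maps directly onto code points. As an illustration, Python's standard `unicodedata` module can look up the character at 0398 and walk through the Greek range; the variable names are assumptions made for this sketch.

```python
import unicodedata

# A code point written in hexadecimal, converted to its character.
theta = chr(0x0398)
print(f"U+{ord(theta):04X}", unicodedata.name(theta))

# The Greek range spans 0370 to 03FF; not every position is assigned.
greek_block = range(0x0370, 0x03FF + 1)
assigned = [cp for cp in greek_block if unicodedata.name(chr(cp), "")]
print(len(greek_block), "positions,", len(assigned), "assigned")

# The version of the Unicode standard bundled with this Python build.
print(unicodedata.unidata_version)
```

The number of assigned positions depends on the Unicode version, which is one practical reason the standard carries version numbers.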

[pic]

Tip: without changing your keyboard settings, you can type any Unicode character on a PC by holding the Alt and '+' keys. If you open Notepad and type 0398 while holding the Alt and '+' keys, the Θ character appears on screen. Word (RichEdit) also allows this in another way: first type 0398, then press Alt and X together.


The Unicode standard took over the 256 characters of the ISO 8859-1 standard, so these 256 characters still have the same value. In UTF-8, the first 128 characters are encoded in 1 byte. The following 128 characters cannot be encoded in 1 byte in any of the Unicode encodings; at least a second byte is necessary.
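Both statements can be checked mechanically. The following sketch, using only Python's built-in codecs, verifies that every ISO 8859-1 byte value equals the Unicode value of the same character, and counts the bytes UTF-8 needs for each half of the range.

```python
# Each ISO 8859-1 byte value decodes to the Unicode code point with the same number.
same_values = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# In UTF-8, code points 0 to 127 take one byte and 128 to 255 take two bytes.
one_byte = all(len(chr(n).encode("utf-8")) == 1 for n in range(128))
two_bytes = all(len(chr(n).encode("utf-8")) == 2 for n in range(128, 256))

print(same_values, one_byte, two_bytes)
```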

Consequences of Unicode

This means that when an existing text file that only consists of characters below 128 is converted to Unicode UTF-8, the file size remains the same. However, if the text contains even one 'é', the file size grows. By the way: to identify a file as Unicode, it can contain a "BOM" (byte order mark) at the beginning. This mark adds 3 extra bytes in UTF-8 and 2 extra bytes in UTF-16.

Data in the Cerm SQL Server database has always been stored as varchar: 1 character = 1 byte. SQL Server also supports the datatype nvarchar, which is based on UTF-16 (UCS-2). Each character is then represented by at least 16 bits (2 bytes) and at most 2 x 16 bits (4 bytes). This means that when database fields are converted from varchar to nvarchar, the size of string fields doubles. The database can therefore grow considerably, so the growth and the available disk space should be checked before the conversion.

From Windows XP on, the operating system supports Unicode. For emailing, however, the MAPI functions didn't follow this evolution, whereas mails sent via SMTP are Unicode compatible. This means that a mail with Chinese content cannot be sent via MAPI from a Belgian sender to a Chinese addressee.
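The effect on file size can be sketched by comparing encoded lengths in Python. The sample strings are made up for this illustration; "utf-8-sig" is Python's name for UTF-8 with a byte order mark prepended. Note that the length of the mark depends on the encoding.

```python
import codecs

plain = "cafe"            # only characters below 128
accented = "caf\u00e9"    # 'café', contains one 'é'

# Text below 128 keeps its size; one 'é' adds one byte in UTF-8.
plain_sizes = (len(plain.encode("latin-1")), len(plain.encode("utf-8")))
accented_sizes = (len(accented.encode("latin-1")), len(accented.encode("utf-8")))
print(plain_sizes, accented_sizes)

# A byte order mark at the start of the file adds a few more bytes.
bom_utf8_len = len(codecs.BOM_UTF8)
bom_utf16_len = len(codecs.BOM_UTF16_LE)
with_bom_size = len(accented.encode("utf-8-sig"))
print(bom_utf8_len, bom_utf16_len, with_bom_size)
```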
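A rough way to estimate the growth before converting is to compare the encoded length of sample values. The sketch below is not Cerm or SQL Server code: the helper `storage_bytes` and the sample values are hypothetical, and it assumes that varchar data uses a single-byte code page (cp1252 here) and that nvarchar data is UTF-16, ignoring any per-row or per-column overhead.

```python
def storage_bytes(value):
    """Return (varchar_bytes, nvarchar_bytes) for a string value.

    Assumption: varchar uses the single-byte code page cp1252 and
    nvarchar uses UTF-16; storage overhead is ignored.
    """
    return len(value.encode("cp1252")), len(value.encode("utf-16-le"))

# Hypothetical sample values, standing in for real column contents.
samples = ["Cerm", "\u00e9tiquette", "Narrow Web"]

total_varchar = 0
total_nvarchar = 0
for value in samples:
    as_varchar, as_nvarchar = storage_bytes(value)
    total_varchar += as_varchar
    total_nvarchar += as_nvarchar
    print(f"{len(value):2d} characters: {as_varchar:2d} -> {as_nvarchar:2d} bytes")

print("growth factor:", total_nvarchar / total_varchar)

# A character outside the first 65,536 needs 2 x 16 bits in UTF-16.
supplementary_bytes = len("\U0001F600".encode("utf-16-le"))
print(supplementary_bytes)
```

For text that fits in a single-byte code page the factor comes out at exactly 2, which matches the doubling described above.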
