Information, Characters, Unicode


Unicode

3 April 2023


Hidden Moral

Small mistakes can be catastrophic!

Style: Care about every character of your program.

Tip (printf): Care about every character in the program's output.

(Be reasonably tolerant and defensive about the input. "Fail early" and clearly.)


Imperative

Thou shalt care about every character in your program.


Corollaries

Thou shalt know every character in the input.

Thou shalt care about every character in your output.


Information ? Characters

In modern computing, natural-language text is very important information. ("Number-crunching" is less important.) Characters of text are represented in several different ways, so a known character encoding is necessary to exchange text information. For many years an important encoding standard for characters has been US-ASCII, a 7-bit encoding. Since 7 does not divide 32, the ubiquitous word size of computers, 8-bit encodings are more common. Very common is ISO 8859-1, also known as "Latin-1," along with other 8-bit encodings of character sets for languages other than English. Currently, a very large multilingual character repertoire known as Unicode is important.
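To make the need for a known encoding concrete, here is a minimal Python sketch. The sample string is an arbitrary choice for illustration and does not come from the slides. It encodes the same text under different encodings and shows that decoding with the wrong one silently yields the wrong characters.

```python
text = "café"  # sample text (an assumption); the é lies outside US-ASCII

latin1_bytes = text.encode("latin-1")  # ISO 8859-1: one byte per character
utf8_bytes = text.encode("utf-8")      # UTF-8: the é takes two bytes

print(latin1_bytes)
print(utf8_bytes)

# US-ASCII has no bit pattern for é, so encoding fails outright.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Reading UTF-8 bytes as Latin-1 does not fail, but the result is garbled.
misread = utf8_bytes.decode("latin-1")
print(misread)
```

The sender and the receiver see the same bytes; only agreement on the encoding makes them the same text.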


Character Sets


Information ? Characters

Bits are not information until the relevant parties agree on what they represent. A standard is required to successfully communicate a character of text. The choice of bit patterns is mostly arbitrary.

binary     oct   dec  hex   char
0110 0001  0141   97  0x61  a     the letter 'a'
0110 0010  0142   98  0x62  b     the letter 'b'
0110 0011  0143   99  0x63  c     the letter 'c'

Blocks of n bits have 2^n different bit patterns, and so up to 2^n characters can be represented.
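The counting argument and the different notations for one bit pattern can be checked with a short Python sketch. The helper names `bit_patterns` and `table_row` are made up for this illustration.

```python
from itertools import product

def bit_patterns(n):
    """Return every pattern of n bits as a string such as '010'."""
    return ["".join(bits) for bits in product("01", repeat=n)]

# n bits give exactly 2**n distinct patterns.
for n in (1, 2, 3, 7, 8):
    assert len(bit_patterns(n)) == 2 ** n

def table_row(ch):
    """Format one character's code in binary, octal, decimal, and hex."""
    code = ord(ch)
    return f"{code:08b} {code:04o} {code:3d} {code:#04x} {ch}"

for ch in "abc":
    print(table_row(ch))
```

Binary, octal, decimal, and hexadecimal are only different ways of writing the same number; the agreed meaning of that number is what makes it the letter 'a'.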


ASCII (American Standard Code for Information Interchange) is a 7-bit character encoding standard for digital communication. It defines 2^7 = 128 bit patterns.

It was one of the first standards for encoding symbols (the letters, numbers, and punctuation used in English text). This fixed-width encoding was developed in the 1960s by the national standards body of the United States. It has been in widespread use for information exchange ever since, but it has now been largely supplanted by other standards. A survey (2023) suggests that US-ASCII is used by far less than 1% of websites and UTF-8 (described later) by 98% of websites. (But UTF-8 retains US-ASCII.)
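The remark that UTF-8 retains US-ASCII can be verified directly. The following Python sketch, offered as an illustration only, encodes all 128 US-ASCII characters both ways and compares the bytes.

```python
# All 128 US-ASCII code points, 0 through 127.
ascii_text = "".join(chr(code) for code in range(128))

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# Every US-ASCII character is encoded as the same single byte in UTF-8.
same_bytes = ascii_bytes == utf8_bytes
print(same_bytes, len(utf8_bytes))
```

A file that contains only US-ASCII characters is therefore already a valid UTF-8 file, byte for byte.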

The Internet Assigned Numbers Authority (IANA) prefers the name US-ASCII for this character encoding.


Some US-ASCII Characters

Each character has a unique bit pattern used to represent it (and a Unicode name as we shall see later).

binary     oct   dec  char  Unicode
0000 1001  0011    9  HT    U+0009  horizontal tabulation
0010 0000  0040   32        U+0020  space
0010 1110  0056   46  .     U+002E  full stop
0010 1111  0057   47  /     U+002F  solidus
0011 0000  0060   48  0     U+0030  digit zero
0011 0001  0061   49  1     U+0031  digit one

Although 8 bits are shown above, only 7 bits are used in the US-ASCII standard.
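As a rough check of the table, this Python sketch looks up the same characters with the standard `unicodedata` module and confirms that the eighth (high) bit is zero for each. The helper name `describe` is hypothetical. Python reports Unicode names in upper case, and control characters such as HT have no formal name in its database, so the sketch supplies a placeholder for them.

```python
import unicodedata

def describe(ch):
    """Format one character's code in several bases with its Unicode name."""
    code = ord(ch)
    # Control characters have no formal name here, so pass a fallback.
    name = unicodedata.name(ch, "<control>")
    return f"{code:08b} {code:04o} {code:3d} U+{code:04X} {name}"

for ch in "\t ./01":
    # The high bit is never set for a US-ASCII character.
    assert (ord(ch) & 0x80) == 0
    print(describe(ch))
```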

