Crash Course on Character Encodings - New York University

Yusuke Shinyama

NYCNLP Oct. 27, 2006

Introduction

Are they the same?

• Unicode
• UTF

Two Mappings

Character    Character Code
@            64
ض            1590
美           32654

Byte Sequence: 64 216 182 231 190 142
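The two steps can be tried out in Python. This is a minimal sketch, assuming Unicode as the character set and UTF-8 as the encoding scheme; the helper name `two_mappings` is made up for illustration.

```python
def two_mappings(text):
    """Return (character code, byte values) for each character in text."""
    result = []
    for ch in text:
        code = ord(ch)            # mapping 1: character -> character code
        data = ch.encode("utf-8")  # mapping 2: character code -> byte sequence
        result.append((code, list(data)))
    return result


# The three characters from the slide: "@", U+0636 and U+7F8E.
for code, data in two_mappings("@\u0636\u7f8e"):
    print(code, data)
# 64 [64]
# 1590 [216, 182]
# 32654 [231, 190, 142]
```

Concatenating the per-character bytes gives the byte sequence 64 216 182 231 190 142.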

Two Mappings

Character → Character Code: Unicode ("Character Set")
Character Code → Byte Sequence: UTF-8 ("Encoding Scheme")

Character    Character Code    Byte Sequence
@            64                64
ض            1590              216 182
美           32654             231 190 142

Terminology

• Character Set

- Mapping from abstract characters to numbers (character codes).

• Encoding Scheme

- A way to represent (encode) a number as a byte sequence that can be decoded back unambiguously.

- Only necessary for character sets that have more than 256 characters, since their codes do not all fit in a single byte.
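To make "decodable" concrete, here is a hand-written sketch of one encoding scheme, UTF-8, chosen as the example because it is the scheme used in the earlier slides. The function names are illustrative, and the code skips the validation a real decoder needs (overlong forms, surrogates, truncated input); in practice Python's built-in `str.encode` and `bytes.decode` do this work.

```python
def encode_utf8(code):
    """Encode one Unicode code point as UTF-8 bytes (sketch, no surrogate check)."""
    if code < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([code])
    if code < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    if code < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    if code < 0x110000:                   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code >> 18),
                      0x80 | ((code >> 12) & 0x3F),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    raise ValueError("not a Unicode code point")


def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of code points (no validation)."""
    codes = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte alone tells the decoder how many bytes belong together.
        if lead < 0x80:
            n, code = 1, lead
        elif lead < 0xE0:
            n, code = 2, lead & 0x1F
        elif lead < 0xF0:
            n, code = 3, lead & 0x0F
        else:
            n, code = 4, lead & 0x07
        for b in data[i + 1:i + n]:       # continuation bytes carry 6 bits each
            code = (code << 6) | (b & 0x3F)
        codes.append(code)
        i += n
    return codes


print(list(encode_utf8(1590)))                               # [216, 182]
print(decode_utf8(bytes([64, 216, 182, 231, 190, 142])))     # [64, 1590, 32654]
```

The stream is decodable because every lead byte announces its own length, so the decoder never has to guess where one character code ends and the next begins.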

In ASCII...

Character → Character Code: ASCII

Character    Character Code    Byte Sequence
5            53                53
A            65                65
m            109               109
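The ASCII case is easy to check in Python, where each character code is stored directly as one byte. A small sketch using the slide's three characters:

```python
text = "5Am"

codes = [ord(ch) for ch in text]   # character -> character code
data = text.encode("ascii")        # character code -> byte sequence

print(codes)       # [53, 65, 109]
print(list(data))  # [53, 65, 109]

# With ASCII the byte values are the character codes themselves.
assert list(data) == codes
```

Encoding a character outside ASCII this way, for example `"\u7f8e".encode("ascii")`, raises `UnicodeEncodeError`, because ASCII has no code for it.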

Character Sets
