Unicode in Python

Simon Funke, Center for Biomedical Computing, Simula Research Laboratory & Dept. of Informatics, University of Oslo,
based on Kumar McMillan, Unicode In Python, Completely Demystified

Sep 22, 2015

Introduction

Unicode is useful if you want to handle non-English languages in your program.

Seen this before?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

Then you are not handling strings correctly in Python!
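The error shown above can be reproduced in a few lines. This is a minimal sketch, not the original failing program: the byte string `raw` is a hypothetical input, built so that the non-ASCII byte 0xc4 sits at position 10.

```python
# Hypothetical input: ten ASCII bytes followed by the non-ASCII byte 0xc4.
raw = b"0123456789\xc4"

try:
    # ASCII only covers byte values 0..127, so 0xc4 cannot be decoded.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message names the codec, the offending byte, and its position, which is usually enough to locate the string that was decoded with the wrong encoding.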

Some important terms

Unicode: Unicode is a coded character set. It defines all characters of the major languages in use today, and defines a mapping between these characters and the integer codes (code points) representing them.
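The mapping from characters to integer codes can be inspected with the built-in `ord()`. A small sketch, assuming the example word Bokmål, written with the escape `\u00e5` so the snippet does not depend on the encoding of the source file:

```python
# u"..." is a Unicode string literal; \u00e5 is the letter a-with-ring.
word = u"Bokm\u00e5l"

# ord() returns the integer code point of a single character.
code_points = [ord(ch) for ch in word]

print(code_points)  # [66, 111, 107, 109, 229, 108]
```

Only the fifth character has a code point above 127, that is, outside the ASCII range.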

UTF-8: UTF-8 is a character encoding capable of encoding all possible characters, or code points, in Unicode.

ASCII: ASCII is an old character encoding. It maps the characters used in the English language to numbers ranging from 0 to 127.
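The difference in coverage between ASCII and UTF-8 can be illustrated as follows. This is only a sketch; the two sample words are chosen for illustration.

```python
text = u"Bokm\u00e5l"  # contains U+00E5, which is outside ASCII

# For pure-English text the ASCII and UTF-8 bytes are identical.
same_for_english = u"Book".encode("ascii") == u"Book".encode("utf-8")

# ASCII has no number for U+00E5, so encoding the word fails ...
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ... while UTF-8 can encode every Unicode code point.
utf8_bytes = text.encode("utf-8")

print(same_for_english, ascii_ok, repr(utf8_bytes))
```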

When saving a string to a file/database/..., Python needs to encode the string with a character encoding.

When reading a string from a file/database/..., Python needs to decode the string with a character encoding.

The same Unicode string might have different representations for different character encodings.
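The encode/decode round trip can be sketched as below. The choice of Latin-1 as the second encoding is an assumption made for illustration; any encoding that contains the letter would do.

```python
text = u"Bokm\u00e5l"

# Encoding: Unicode string -> bytes. The bytes depend on the encoding.
as_utf8 = text.encode("utf-8")      # b'Bokm\xc3\xa5l'
as_latin1 = text.encode("latin-1")  # b'Bokm\xe5l'

# Decoding: bytes -> Unicode string. The matching encoding restores the text.
roundtrip_utf8 = as_utf8.decode("utf-8")
roundtrip_latin1 = as_latin1.decode("latin-1")

# Decoding with the wrong encoding does not fail here, but yields garbage.
garbled = as_utf8.decode("latin-1")  # u'Bokm\xc3\xa5l', displayed as "BokmÃ¥l"

print(repr(as_utf8), repr(as_latin1), repr(garbled))
```

The bytes alone do not say which encoding produced them, so the encoding has to be known when decoding.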

Bokmål

Let's read a UTF-8 file with the word Bokmål.

#!/usr/bin/env python
import sys

# wget
# noreg.txt is encoded in the UTF-8 character encoding
f = open("noreg.txt", "r")
s_utf8 = f.readline().split("\t")[12]
s_utf8        # Out: 'Bokm\xc3\xa5l'
type(s_utf8)  # Out: str

s_utf8 is a string encoded in UTF-8 format

The encoding assigns a numeric value to each character

Note that å takes 2 bytes (\xc3\xa5)
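The two-byte representation of å can be checked by comparing character and byte counts. A short sketch, using the same word as above:

```python
text = u"Bokm\u00e5l"
encoded = text.encode("utf-8")

n_chars = len(text)     # number of characters: 6
n_bytes = len(encoded)  # number of bytes after UTF-8 encoding: 7

# Number of UTF-8 bytes needed for each character individually.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

print(n_chars, n_bytes, bytes_per_char)  # 6 7 [1, 1, 1, 1, 2, 1]
```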

Python supports many encodings (over 100)

See the default encoding with sys.getdefaultencoding(). In Python 2 it is typically 'ascii'.
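One way to avoid depending on the default encoding is to name the encoding explicitly when opening a file. The sketch below uses `io.open`, which accepts an `encoding` argument on both Python 2.6+ and Python 3. The file `sample.txt` and its two-column layout are hypothetical stand-ins, not the noreg.txt file used above.

```python
import io
import os
import sys
import tempfile

# The value depends on the interpreter: 'ascii' on Python 2, 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Hypothetical stand-in file: one tab-separated line, written as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"no\tBokm\u00e5l\n")

# With an explicit encoding, readline() returns an already decoded Unicode string.
with io.open(path, "r", encoding="utf-8") as f:
    word = f.readline().rstrip(u"\n").split(u"\t")[1]

print(repr(word))
```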

Python string types (Python 2)

<type 'basestring'>
   |
   +-- <type 'str'>
   |
   +-- <type 'unicode'>
