Unicode in Python

Simon Funke, Center for Biomedical Computing, Simula Research Laboratory & Dept. of Informatics, University of Oslo

based on Kumar McMillan, Unicode In Python, Completely Demystified

Sep 22, 2015

Introduction

Unicode is useful if you want to handle non-English languages in your program. Seen this before?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

Then you are not handling strings correctly in Python!
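A minimal sketch of how this kind of error arises, assuming a byte string holding UTF-8 encoded text is decoded with the 'ascii' codec. The sample bytes are illustrative and not taken from the original error message. In Python 2 the same failure typically appears when a byte string is mixed with a unicode string, because the bytes are then decoded implicitly with the default 'ascii' codec.

```python
# Illustrative bytes (an assumption, not from the slides): the UTF-8
# encoding of u"Bokm\u00e5l", where \xc3\xa5 encodes a single character.
data = b"Bokm\xc3\xa5l"

try:
    # The ASCII codec only accepts byte values 0..127, so 0xc3 is rejected.
    data.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
text = data.decode("utf-8")
```

The fix is not to suppress the error but to name the correct encoding at the point where the bytes enter the program.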

Some important terms

Unicode: Unicode is a coded character set. It defines all characters of the major languages in use today, and defines a mapping between these characters and the integer codes (code points) representing them.

UTF-8: UTF-8 is a character encoding capable of encoding all possible characters, or code points, in Unicode.

ASCII: ASCII is an old character encoding. It maps the characters used in the English language to numbers ranging from 0 to 127.

When saving a string to a file/database/... Python needs to encode the string with a character encoding.

When reading a string from a file/database/... Python needs to decode the string with a character encoding.

The same Unicode string might have different representations in different character encodings.
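As a small illustration of that last point, the sketch below encodes one Unicode string with three codecs. The sample word and the choice of codecs are assumptions made for the example; the u"" and b"" prefixes are intended to keep it valid on both Python 2.7 and Python 3.

```python
text = u"Bokm\u00e5l"  # one Unicode string; \u00e5 is a-with-ring-above

# Encoding: Unicode string -> bytes. Each codec yields different bytes.
utf8_bytes = text.encode("utf-8")        # 42 6f 6b 6d c3 a5 6c
latin1_bytes = text.encode("latin-1")    # 42 6f 6b 6d e5 6c
utf16_bytes = text.encode("utf-16-le")   # two bytes per character here

# Decoding: bytes -> Unicode string. The matching codec recovers the text.
same_text = (utf8_bytes.decode("utf-8") == text and
             latin1_bytes.decode("latin-1") == text and
             utf16_bytes.decode("utf-16-le") == text)
```

Decoding with a codec other than the one used for encoding either raises an error or silently produces the wrong characters, which is why the encoding of stored data has to be known.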

Bokmål

Let's read a UTF-8 file with the word Bokmål.

    #!/usr/bin/env python
    import sys

    # wget
    # noreg.txt is encoded in the UTF-8 character encoding
    f = open("noreg.txt", "r")
    s_utf8 = f.readline().split("\t")[12]
    s_utf8        # Out: 'Bokm\xc3\xa5l'
    type(s_utf8)  # Out: str

s_utf8 is a string encoded in UTF-8 format

The encoding assigns a numeric value to each character

Note that å takes 2 bytes

Python supports many encodings (over 100)

See the default encoding with sys.getdefaultencoding(). It is typically 'ascii' in Python 2.
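The byte counts can be checked directly. This sketch, which assumes the sample word from the file example, lists each character's code point together with the number of bytes UTF-8 needs for it.

```python
import sys

text = u"Bokm\u00e5l"

# For each character: (Unicode code point, number of bytes in UTF-8).
widths = [(ord(ch), len(ch.encode("utf-8"))) for ch in text]

# Interpreter-wide default used for implicit conversions:
# 'ascii' on Python 2, 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()
```

Only the fifth entry, code point 229 (å), needs two bytes; the ASCII letters need one byte each.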

Python string types (Python 2)

    basestring
       |
       +-- str
       |
       +-- unicode

Important methods

s.decode(encoding)

Converts a str (byte string) to unicode

u.encode(encoding)

Converts a unicode string to a str (byte string)
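The decode and encode conversions can be sketched as a round trip. The names bytes_type and text_type are hypothetical helpers introduced here so the example reads the same on Python 2 (str / unicode) and Python 3 (bytes / str).

```python
# Version-neutral names for the two string types (hypothetical helpers):
# Python 2: str / unicode, Python 3: bytes / str.
bytes_type = type(b"")
text_type = type(u"")

raw = b"Bokm\xc3\xa5l"               # bytes, e.g. as read from a UTF-8 file

decoded = raw.decode("utf-8")        # byte string -> Unicode string
encoded = decoded.encode("utf-8")    # Unicode string -> byte string

round_trip_ok = (isinstance(decoded, text_type) and
                 isinstance(encoded, bytes_type) and
                 encoded == raw)
```

Note that the lengths differ: the decoded string has 6 characters, while the byte string has 7 bytes.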

Take home message

1. Decode early

2. Unicode everywhere

3. Encode late

Decode early

    f = open('noreg.txt', 'r')
    for line in f:
        line_uni = line.decode('utf-8')
        ...
    f.close()
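One possible way to apply all three rules from the take-home message is to let io.open do the decoding and encoding at the file boundary. This is a sketch under assumptions: the function name shout_lines, the upper-casing step and the temporary file names are invented for illustration and do not come from the slides.

```python
import io
import os
import shutil
import tempfile


def shout_lines(src_path, dst_path, encoding="utf-8"):
    """Hypothetical helper: copy src_path to dst_path, upper-casing lines."""
    # 1. Decode early: io.open decodes bytes to Unicode while reading.
    with io.open(src_path, "r", encoding=encoding) as src:
        lines = [line.rstrip(u"\n") for line in src]
    # 2. Unicode everywhere: all processing works on Unicode strings.
    lines = [line.upper() for line in lines]
    # 3. Encode late: io.open encodes to bytes only when writing.
    with io.open(dst_path, "w", encoding=encoding, newline="\n") as dst:
        for line in lines:
            dst.write(line + u"\n")
    return lines


# Usage with throwaway files (paths are placeholders for the example).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt")
dst_path = os.path.join(tmp_dir, "out.txt")
with io.open(src_path, "wb") as f:
    f.write(b"Bokm\xc3\xa5l\n")          # UTF-8 bytes on disk

result = shout_lines(src_path, dst_path)

with io.open(dst_path, "rb") as f:
    written = f.read()                    # raw bytes that were written
shutil.rmtree(tmp_dir)
```

Because upper-casing happens on the decoded text, å becomes Å correctly; applied to the raw UTF-8 bytes, the same operation would leave the two-byte sequence untouched.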
