Unicode in Python

Simon Funke, Center for Biomedical Computing, Simula Research Laboratory & Dept. of Informatics, University of Oslo,

based on Kumar McMillan, Unicode In Python, Completely Demystified

Sep 22, 2015

Introduction

Unicode is useful if you want to handle non-English languages in your program.

Seen this before?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

Then you are not handling strings correctly in Python!
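The error above can be reproduced with a minimal sketch. The byte string is made up for illustration: it places the byte 0xc4 at position 10 and then decodes it with the 'ascii' codec, which only accepts values below 128. The snippet is written so that it should run on both Python 2 and Python 3.

```python
# Hypothetical input: ten ASCII bytes followed by the non-ASCII byte 0xc4.
data = b"Some text \xc4"

try:
    data.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    # The ascii codec rejects every byte value >= 128.
    message = str(exc)

print(message)
```

In Python 2 the same decoding step often happens implicitly, for example when a byte string and a unicode string are combined, which is why the error can appear far away from the code that read the data.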

Some important terms

Unicode: Unicode is a coded character set. It defines all characters of the major languages in use today, and defines a mapping between these characters and the integer codes (code points) representing them.
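As a small illustration of the character-to-integer mapping, the built-in ord() returns the code point of a single character. The example word is chosen for illustration, and the escape \u00e5 is used so the source file itself stays pure ASCII.

```python
# Each character maps to exactly one integer code point,
# independent of how the text is later stored as bytes.
word = u"Bokm\u00e5l"  # \u00e5 is the letter "a with ring above"

code_points = [ord(ch) for ch in word]
print(code_points)          # [66, 111, 107, 109, 229, 108]
print(hex(ord(u"\u00e5")))  # 0xe5
```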

UTF-8: UTF-8 is a character encoding capable of encoding all possible characters, or code points, in Unicode.
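UTF-8 is a variable-length encoding: different code points need a different number of bytes. The sketch below encodes three sample characters (picked here only as examples) and counts the resulting bytes.

```python
# Sample characters: plain "a", "a with ring above", and the euro sign.
samples = [u"a", u"\u00e5", u"\u20ac"]

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3]

# The two bytes that represent U+00E5 in UTF-8.
encoded = u"\u00e5".encode("utf-8")
print(repr(encoded))
```

Characters in the ASCII range keep their single-byte ASCII value, so any pure ASCII text is also valid UTF-8.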

ASCII: ASCII is an old character encoding. It maps the characters used in the English language to numbers ranging from 0 to 127.
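The 0 to 127 limit can be checked directly. In this sketch (example strings are illustrative), a plain English word encodes to ASCII without trouble, while a word containing a non-English letter cannot be encoded at all.

```python
# A word that uses only English letters fits into ASCII.
ascii_bytes = u"Bokmal".encode("ascii")
codes = list(bytearray(ascii_bytes))  # integer value of every byte
print(codes, max(codes) < 128)

# A word with a non-English letter has no ASCII representation.
try:
    u"Bokm\u00e5l".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(encode_failed)  # True
```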

When saving a string to a file/database/... Python needs to encode the string with a character encoding.

When reading a string from a file/database/... Python needs to decode the string with a character encoding.

The same unicode string might have different representations in different character encodings.
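The encode/decode round trip can be sketched as follows. The choice of UTF-8 and Latin-1 is only an example pair of encodings; the point is that one unicode string yields different byte sequences, and that decoding with the wrong encoding silently produces garbage.

```python
text = u"Bokm\u00e5l"

# Encoding: unicode string -> bytes. The result depends on the encoding.
as_utf8 = text.encode("utf-8")      # b'Bokm\xc3\xa5l' (7 bytes)
as_latin1 = text.encode("latin-1")  # b'Bokm\xe5l'     (6 bytes)
print(len(as_utf8), len(as_latin1))

# Decoding: bytes -> unicode string, using the SAME encoding.
round_trip = as_utf8.decode("utf-8")
print(round_trip == text)  # True

# Decoding with a different encoding raises no error here,
# but yields the wrong characters.
garbled = as_utf8.decode("latin-1")
print(garbled == text)  # False
```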

Bokmål

Let's read a UTF-8 file with the word Bokmål.

#!/usr/bin/env python
import sys

# wget
# noreg.txt is encoded in the UTF-8 character encoding
f = open("noreg.txt", "r")
s_utf8 = f.readline().split("\t")[12]  # 13th tab-separated field of the first line
s_utf8        # Out: 'Bokm\xc3\xa5l'
type(s_utf8)  # Out: str

s_utf8 is a string encoded in UTF-8 format

The encoding assigns a numeric value to each character

Note that å takes 2 bytes (\xc3\xa5)

Python supports many encodings (over 100)

See the default encoding with sys.getdefaultencoding(). In Python 2 it is typically 'ascii'.
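Rather than relying on the default encoding, the bytes read from a file can be decoded explicitly. The following is a minimal sketch under stated assumptions: it does not use noreg.txt but writes a small, made-up two-column stand-in file to a temporary directory, and it uses io.open, which accepts an encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the data file: one tab-separated line in UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "wb") as f:
    f.write(u"nb\tBokm\u00e5l\n".encode("utf-8"))

# Option 1: read raw bytes, then decode them explicitly.
with io.open(path, "rb") as f:
    raw = f.readline()
word = raw.decode("utf-8").rstrip(u"\n").split(u"\t")[1]

# Option 2: let io.open decode while reading.
with io.open(path, "r", encoding="utf-8") as f:
    same_word = f.readline().rstrip(u"\n").split(u"\t")[1]

print(len(word))                  # 6 characters
print(len(word.encode("utf-8")))  # 7 bytes
```

Either way the program ends up with a unicode string whose length counts characters, not bytes.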

Python string types (Python 2)

basestring
 |
 +-- str
 |
 +-- unicode
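Python 2 has two string types: str holds bytes and unicode holds text, and both derive from basestring. The sketch below shows the two types side by side. The names text_type and bytes_type are helper aliases introduced here for illustration only, so that the snippet also runs on Python 3, where the corresponding types are called str and bytes.

```python
try:
    text_type, bytes_type = unicode, str  # Python 2 names
except NameError:
    text_type, bytes_type = str, bytes    # Python 3 equivalents

raw = b"Bokm\xc3\xa5l"      # encoded bytes (str in Python 2)
text = raw.decode("utf-8")  # decoded text (unicode in Python 2)

print(isinstance(raw, bytes_type))  # True
print(isinstance(text, text_type))  # True
print(len(raw))   # 7, counts bytes
print(len(text))  # 6, counts characters
```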
