Unicode in Python
Simon Funke, Center for Biomedical Computing, Simula
Research Laboratory & Dept. of Informatics, University of Oslo,
based on Kumar McMillan, Unicode In Python, Completely
Demystified
Sep 22, 2015
Introduction
Unicode is useful if you want to handle non-English languages in
your program.
Seen this before?
UnicodeDecodeError: 'ascii' codec
can't decode byte 0xc4 in position
10: ordinal not in range(128)
Then you are not handling strings correctly in Python!
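The error above can be reproduced in a few lines. This is a minimal sketch, not taken from the original slides: the byte string raw is a made-up example whose byte at position 10 is 0xc4, and the code is written so that it should run under both Python 2.7 and Python 3.

```python
# Made-up input (an assumption for illustration): ten ASCII bytes
# followed by the non-ASCII byte 0xc4 at position 10.
raw = b"0123456789\xc4"

try:
    # ASCII only covers byte values 0..127, so 0xc4 cannot be decoded.
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as err:
    message = str(err)

print(message)
```

Decoding the same bytes with an encoding that actually matches the data is what avoids the error.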
Some important terms
Unicode:
Unicode is a coded character set. It defines all characters of the
major languages in use today, and defines a mapping between these
characters and the integer codes representing them.
UTF-8:
UTF-8 is a character encoding capable of encoding all possible
characters, or code points, in Unicode.
ASCII:
ASCII is an old character encoding. It maps the characters used in
the English language to numbers ranging from 0 to 127.
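The three terms can be told apart with a single character. The sketch below is an illustration rather than part of the original slides; the variable names are hypothetical, and the character å is written as the escape \u00e5 so that the source stays pure ASCII and runs under both Python 2.7 and Python 3.

```python
a_ring = u"\u00e5"  # the character "a with ring above", as used in Norwegian

# Unicode: every character has an integer code (its code point).
code_point = ord(a_ring)

# UTF-8: a character encoding that turns the code point into bytes.
utf8_bytes = a_ring.encode("utf-8")

# ASCII: only knows the codes 0..127, so this character cannot be encoded.
try:
    a_ring.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A plain English letter fits in ASCII without trouble.
ascii_bytes = u"a".encode("ascii")

print(code_point, len(utf8_bytes), ascii_ok)
```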
When saving a string to a file/database/... Python needs to
encode the string with a character encoding.
When reading a string from a file/database/... Python needs to
decode the string with a character encoding.
The same Unicode string might have different representations in
different character encodings.
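As a hedged illustration of encoding and decoding, the sketch below encodes one Unicode string with two encodings and decodes it again. The choice of UTF-8 and Latin-1 is an assumption made for this example, as are the variable names; the code should run under both Python 2.7 and Python 3.

```python
text = u"Bokm\u00e5l"  # a Unicode string (escape used to keep the source ASCII)

# Encoding: Unicode string -> bytes. The result depends on the encoding.
utf8_bytes = text.encode("utf-8")      # the a-ring becomes two bytes
latin1_bytes = text.encode("latin-1")  # the a-ring becomes one byte

# Decoding: bytes -> Unicode string. The same encoding must be used.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding raises no error here, but the
# resulting text is wrong.
wrong_text = utf8_bytes.decode("latin-1")

print(len(utf8_bytes), len(latin1_bytes), round_trip == text, wrong_text == text)
```

The last line shows why the encoding of stored data has to be known: decoding with a different encoding can succeed without an error and still produce the wrong characters.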
Bokmål
Let's read a UTF-8 file containing the word Bokmål.
#!/usr/bin/env python
import sys
# wget
# noreg.txt is encoded in the UTF-8 character encoding
f = open("noreg.txt", "r")
s_utf8 = f.readline().split("\t")[12]
s_utf8 # Out: 'Bokm\xc3\xa5l'
type(s_utf8) # Out: str
s_utf8 is a string encoded in UTF-8 format
The encoding assigns a numeric value to each character
Note that å takes 2 bytes (\xc3\xa5)
Python supports many encodings (over 100)
See the default encoding with sys.getdefaultencoding(). In Python 2 it is
typically 'ascii'.
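The points above can be checked without the original data file. The following sketch is an assumption-laden stand-in: it writes a made-up, two-field, tab-separated line to a temporary file (the real noreg.txt and its column layout are not reproduced here), reads it back once as raw bytes and once decoded via io.open, and compares lengths. It should run under both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical stand-in for noreg.txt: one made-up tab-separated line,
# written as raw UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "wb") as f:
    f.write(b"no\tBokm\xc3\xa5l\n")

# Binary mode: the bytes come back exactly as stored, nothing is decoded.
with io.open(path, "rb") as f:
    s_utf8 = f.readline().rstrip(b"\n").split(b"\t")[1]

# Text mode with an explicit encoding: io.open decodes while reading.
with io.open(path, "r", encoding="utf-8") as f:
    s_text = f.readline().rstrip(u"\n").split(u"\t")[1]

# The encoded form is one byte longer than the text is in characters,
# because the a-ring takes two bytes in UTF-8.
print(len(s_utf8), len(s_text))
```

Passing the encoding explicitly avoids relying on whatever default the interpreter or platform happens to use.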
Python string types (Python 2)
<type 'basestring'>
   |
   +-- <type 'str'>
   |
   +-- <type 'unicode'>