Python and Chinese Text Processing


Tseng Yuen-Hsien (曾元顯)




2011/10/27

Contents

Python and Chinese Text Processing
Methods for Chinese Text Processing in Python
Chinese Encodings in Python Internals, Programs, Program Files, and Screen Output
Chinese Encodings in Input and Output Files
When to Use encode() and When to Use decode()
How to Determine the Encoding of a String (or File)
Unicode, utf-8, utf-16, utf-32


Python and Chinese Text Processing

Methods for Chinese Text Processing in Python

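While learning to process Chinese text with the Python programming language, I ran into a number of problems, and the information I found online was scattered across many places. After resolving the problems one by one, I wrote them up in this document for everyone's reference.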

Chinese Encodings in Python Internals, Programs, Program Files, and Screen Output

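Internally, Python can represent a Chinese string either as a Unicode string or as a byte string. When a Chinese string is represented as Unicode, Python can correctly retrieve a single Chinese character at run time. For example, at the Python prompt: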

>>> s=u'中文' # Unicode string
>>> print len(s), s[0] # index 0 fetches the first character
2 中
>>> t='中文' # byte string
>>> print len(t), t[0:3] # in utf-8, each of these Chinese characters takes 3 bytes
6 中
>>> print type(s), type(t)
<type 'unicode'> <type 'str'>
>>> print s, t # this line works fine
中文 中文
>>> print s + t.decode('utf-8') # convert t to Unicode
中文中文


>>> print s + unicode(t, 'utf-8') # convert t to Unicode; same effect as the previous line
中文中文
>>> print s.encode('utf-8')+ t # convert s to a byte string
中文中文
>>> print s+t # this will cause an error due to type mismatch
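The session above uses Python 2 syntax. As a minimal sketch of the same idea in Python 3 syntax (an assumption made here so the example runs on current interpreters), `str` plays the role of the Unicode string and `bytes` the role of the byte string. The sample text '中文' and the variable names are illustrative only.

```python
# Python 3 sketch: str is the Unicode string type, bytes is the byte string type.
s = '中文'                 # Unicode string (illustrative sample text)
t = s.encode('utf-8')      # byte string holding the utf-8 encoding of s

first_char = s[0]          # indexing a str yields one character
first_bytes = t[0:3]       # the first three bytes encode the first character
print(len(s), len(t))      # character count, then byte count

# Convert explicitly so both operands have the same type before joining.
joined_text = s + t.decode('utf-8')     # bytes -> str
joined_bytes = s.encode('utf-8') + t    # str -> bytes

# Mixing the two types without converting raises an error.
try:
    s + t
except TypeError as exc:
    mixed_error = exc
    print('cannot mix str and bytes:', exc)
```

The explicit `decode()` and `encode()` calls correspond to the conversions in the session: decode turns bytes into Unicode text, and encode turns Unicode text into bytes.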

When Chinese text appears in a Python program file, two things need attention:

1. The encoding of the program file. If the program file is encoded in big5, the second line of the program must tell Python so:

#!/usr/bin/env python
# -*- coding: big5 -*-
# Note: the first line above is for the operating system; the
# second line is for the Python interpreter
s=u'中文' # big5 code will be stored in Unicode in Python
print len(s), s[0]

If the program file is encoded in utf-8, it should instead be written as:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
s=u'中文' # utf-8 code will be stored in Unicode in Python
print len(s), s[0]
t='中文'
print len(t), t[:3]

In UltraEdit, the file must also be saved in utf-8 format; in cmd, set the code page with chcp 65001.


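The purpose of the second line is to tell the Python interpreter which encoding to use when processing the strings in the file. When the second line specifies utf-8, Python treats the two Chinese characters in the string assigned to s as utf-8 encoded and converts them to Unicode internally, because the string is prefixed with u; without the prefix, the string is a byte string. On Mac or Unix, program files are typically saved and output in utf-8. If the file is edited on Windows and then saved in big5, the second line must also be changed to big5. If the file's actual encoding differs from the encoding declared on the second line, Python raises an error when it reads the program file.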

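2. The encoding of the output window. The Windows Command Prompt (DOS window) displays Chinese in big5. Terminals on Mac or Unix can be set to display Chinese in big5, utf-8, or other encodings. For a Python string to appear as Chinese on screen, the rule is: know the encoding of the output window, and output the string in that same encoding. Only then are the characters displayed correctly.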

#!/usr/bin/env python
# -*- coding: utf-8 -*-
s=u'中文' # utf-8 code will be stored in Unicode in Python
print s.encode('utf-8') # assume output screen is utf-8
t='中文'
print t # no need for encode(), assume output screen is utf-8

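Running the program above on Mac or Unix prints the Chinese correctly, but on Windows it prints garbled text. The rule is to check whether the encoding of the output data matches the encoding of the output window.

Is there a way to produce correct output in different output windows without modifying the program? Output sent to a browser does not have this problem, because a browser can adjust the encoding, automatically or manually, to display the text correctly.

In addition, when the string is Unicode, printing it directly sometimes shows the Chinese as well: Python's print can automatically encode a Unicode string before writing it out.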
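The two points above, the encoding in which text is stored and the encoding the output window expects, can be sketched as follows. This is a rough illustration in Python 3 syntax, not the author's method: `encode_for_output` is a hypothetical helper, the sample text '中文' is illustrative, and falling back to `sys.stdout.encoding` is an assumption about how a program might discover the window's encoding.

```python
import sys

def encode_for_output(text, encoding=None):
    """Encode text for a window that expects `encoding` (hypothetical helper)."""
    # Fall back to the encoding Python reports for stdout, then to utf-8.
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # errors='replace' substitutes '?' for characters the target cannot represent.
    return text.encode(target, errors='replace')

sample = '中文'  # illustrative sample text
as_utf8 = encode_for_output(sample, 'utf-8')
as_big5 = encode_for_output(sample, 'big5')
print(as_utf8, as_big5)  # the same text yields different bytes per encoding

# Bytes produced in one encoding cannot be reliably read back as another:
# here the big5 bytes are not valid utf-8, so decoding fails.
try:
    as_big5.decode('utf-8')
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True
print('mismatch detected:', mismatch_detected)
```

Keeping text as Unicode inside the program and encoding it only at the point of output means the target encoding is chosen in one place, which is one way to avoid editing the program for each kind of output window.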
