How to Use UTF-8 with Python

[ 2005-October-01 20:15 ]

Tim Bray describes why Unicode and UTF-8 are wonderful much better than I could, so go read that for an overview of what Unicode is, and why all your programs should support it. What I'm going to tell you is how to use Unicode, and specifically UTF-8, with one of the coolest programming languages, Python, but I have also written an introduction to Using Unicode in C/C++.

Python has good support for Unicode, but there are a few tricks that you need to be aware of. I spent more than a few hours learning these tricks, and I'm hoping that by reading this you won't have to. This is a very quick and dirty introduction. If you need in-depth knowledge, or need to learn about Unicode in Java or Windows, see Unicode for Programmers.

[Updated 2005-09-01: Updated information about XML encoding declarations.]

The Basics

There are two types of strings in Python: byte strings and Unicode strings. As you may have guessed, a byte string is a sequence of bytes. When needed, Python uses your computer's default locale to convert the bytes into characters. On Mac OS X, the default locale is actually UTF-8, but everywhere else, the default is probably ASCII.

This creates a byte string:

    byteString = "hello world! (in my default locale)"

And this creates a Unicode string:

    unicodeString = u"hello Unicode world!"

Convert a byte string into a Unicode string and back again:

    s = "hello byte string"
    u = unicode( s )
    backToBytes = u.encode()

The previous code uses your default character set to perform the conversions. However, relying on the locale's character set is a bad idea, since your application is likely to break as soon as someone from Thailand tries to run it on their computer. In most cases it is probably better to explicitly specify the encoding of the string:

    s = "hello normal string"
    u = unicode( s, "utf-8" )
    backToBytes = u.encode( "utf-8" )

Now, the byte string s will be treated as a sequence of UTF-8 bytes to create the Unicode string u. The next line stores the UTF-8 representation of u in the byte string backToBytes.
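To make the round trip concrete, here is a minimal sketch with a non-ASCII character. The sample text and variable names are made up for illustration, and it uses the decode() method in place of the unicode() call shown above, on the assumption that both do the same conversion, because decode() also exists in Python 3.

```python
# Sketch only: the sample text is made up for illustration.
unicode_text = u"fran\u00e7ais"             # 8 characters, one of them non-ASCII

utf8_bytes = unicode_text.encode("utf-8")   # Unicode string -> UTF-8 byte string
round_trip = utf8_bytes.decode("utf-8")     # UTF-8 byte string -> Unicode string

# The c-cedilla needs two bytes in UTF-8, so the byte string is one longer.
char_count = len(unicode_text)
byte_count = len(utf8_bytes)
```

Counting characters and counting bytes give different answers as soon as the text leaves ASCII, which is why the encoding has to be named explicitly.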

Working With Unicode Strings

Thankfully, everything in Python is supposed to treat Unicode strings identically to byte strings. However, you need to be careful in your own code when testing to see if an object is a string. Do not do this:

    if isinstance( s, str ): # BAD: Not true for Unicode strings!

Instead, use the generic string base class, basestring:

    if isinstance( s, basestring ): # True for both Unicode and byte strings
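The check above can be wrapped in a small helper. This is only a sketch: is_string and string_types are hypothetical names, and the fallback branch is my assumption about the closest equivalent on Python 3, where basestring does not exist.

```python
try:
    string_types = basestring        # Python 2: common base of str and unicode
except NameError:
    # Assumption: on Python 3 there is no basestring, so accept both
    # text strings (str) and byte strings (bytes) instead.
    string_types = (str, bytes)


def is_string(obj):
    """Return True for both byte strings and Unicode strings."""
    return isinstance(obj, string_types)
```

A call such as is_string( u"hello" ) or is_string( b"hello" ) is then true in either case, while is_string( 42 ) is not.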

Reading UTF-8 Files

You can manually convert strings that you read from files, but there is an easier way:

    import codecs
    fileObj = codecs.open( "someFile", "r", "utf-8" )
    u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file

The codecs module will take care of all the conversions for you. You can also open a file for writing and it will convert the Unicode strings you pass in to write into whatever encoding you have chosen. However, take a look at the note below about the byte-order marker (BOM).
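Writing works the same way as reading. The following sketch writes a Unicode string through codecs.open, reads it back, and then looks at the raw bytes on disk. The temporary file and the sample text are assumptions made only so the example is self-contained.

```python
import codecs
import os
import tempfile

text = u"Tr\u00e8s bien"

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Unicode strings passed to write() are encoded as UTF-8 on the way out.
    out = codecs.open(path, "w", "utf-8")
    out.write(text)
    out.close()

    # read() decodes the UTF-8 bytes and returns a Unicode string.
    inp = codecs.open(path, "r", "utf-8")
    read_back = inp.read()
    inp.close()

    # Opening the same file in binary mode shows the encoded bytes.
    raw = open(path, "rb")
    raw_bytes = raw.read()
    raw.close()
finally:
    os.remove(path)
```

Note that no BOM is written here: the file contains only the encoded text.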

Working with XML and minidom

I use the minidom module for my XML needs mostly because I am familiar with it. Unfortunately, it only handles byte strings so you need to encode your Unicode strings before passing them to minidom functions. For example:

    import xml.dom.minidom
    xmlData = u"<français>Comment ça va ? Très bien ?</français>"
    dom = xml.dom.minidom.parseString( xmlData )

The last line raises an exception: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 5: ordinal not in range(128). To work around this error, encode the Unicode string into the appropriate format before passing it to minidom, like this:

    import xml.dom.minidom
    xmlData = u"<français>Comment ça va ? Très bien ?</français>"
    dom = xml.dom.minidom.parseString( xmlData.encode( "utf-8" ) )

Minidom can handle any format of byte string, such as Latin-1 or UTF-16. However, it will only work reliably if the XML document has an encoding declaration (eg. <?xml version="1.0" encoding="Latin-1"?>). If the encoding declaration is missing, minidom assumes that it is UTF-8. It is a good habit to include an encoding declaration on all your XML documents, in order to guarantee compatibility on all systems.

When you get XML out of minidom by calling dom.toxml() or dom.toprettyxml(), minidom returns a Unicode string. You can also pass in an additional encoding="utf-8" parameter to get an encoded byte string, perfect for writing out to a file.
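Putting the pieces of this section together, a minimal sketch of the full round trip might look like this. The document content and variable names are made up, and an ASCII element name is used so that the only non-ASCII data is the text inside it.

```python
import xml.dom.minidom

# Made-up document with an explicit encoding declaration.
xml_text = u'<?xml version="1.0" encoding="utf-8"?><p>Tr\u00e8s bien</p>'

# Encode to a UTF-8 byte string before handing the data to minidom.
dom = xml.dom.minidom.parseString(xml_text.encode("utf-8"))

# Text inside the parsed document comes back as a Unicode string.
content = dom.documentElement.firstChild.data

# Asking for an encoding returns an encoded byte string, ready for a file.
encoded_output = dom.toxml(encoding="utf-8")
```

The byte string in encoded_output starts with its own encoding declaration, so it can be written to a file in binary mode as-is.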

The Byte-Order Marker (BOM)

UTF-8 files sometimes start with a byte-order marker (BOM) to indicate that they are encoded in UTF-8. This is commonly used on Windows. On Mac OS X, applications (eg. TextEdit) ignore the BOM and remove it if the file is saved again. The W3C HTML Validator warns that older applications may not be able to handle the BOM. Unicode effectively ignores the marker, so it should not matter when reading the file. You may wish to add this to the beginning of your files to determine if they are encoded in ASCII or UTF-8. The codecs module provides the constant for you to do this:

    import codecs
    out = open( "someFile", "wb" )
    out.write( codecs.BOM_UTF8 )
    out.write( unicodeString.encode( "utf-8" ) )
    out.close()

You need to be careful when using the BOM and UTF-8. Frankly, I think this is a bug in Python, but what do I know. Python will decode the value of the BOM into a Unicode character, instead of ignoring it. For example (tested with Python 2.3):

    >>> codecs.BOM_UTF16.decode( "utf16" )
    u''
    >>> codecs.BOM_UTF8.decode( "utf8" )
    u'\ufeff'

For UTF-16, Python decoded the BOM into an empty string, but for UTF-8, it decoded it into a character. Why is there a difference? I think the UTF-8 decoder should do the same thing as the UTF-16 decoder and strip out the BOM. However, it doesn't, so you will probably need to detect it and remove it yourself, like this:

    import codecs

    # Assumes s is a byte string and u is a Unicode string.
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s begins with the BOM: Do something.
        # For example, decode the string as UTF-8
        u = unicode( s, "utf8" )

    if u[0] == unicode( codecs.BOM_UTF8, "utf8" ):
        # The unicode string begins with the BOM: Do something.
        # For example, remove the character.
        u = u[1:]

    # Strip the BOM from the beginning of the Unicode string, if it exists.
    # lstrip returns a new string, so the result must be assigned.
    u = u.lstrip( unicode( codecs.BOM_UTF8, "utf8" ) )
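The detect-and-remove approach can be sketched on a made-up byte string as follows. The second option is not from this article: it relies on the "utf-8-sig" codec, which I believe was added to the standard library in Python 2.5, after the Python 2.3 tested above, so treat it as an alternative that depends on your Python version.

```python
import codecs

# Made-up sample: a UTF-8 byte string that starts with a BOM.
data = codecs.BOM_UTF8 + u"caf\u00e9".encode("utf-8")

# Option 1: detect the BOM bytes and slice them off before decoding.
if data.startswith(codecs.BOM_UTF8):
    stripped = data[len(codecs.BOM_UTF8):]
else:
    stripped = data
manual = stripped.decode("utf-8")

# Option 2 (not from the article): "utf-8-sig" skips a leading BOM, if any.
via_codec = data.decode("utf-8-sig")

# For comparison, plain "utf-8" keeps the BOM as the character U+FEFF.
with_bom = data.decode("utf-8")
```

Both options produce the same four-character string, while the plain decode yields five characters with U+FEFF in front.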

Writing Python Scripts in Unicode

As you may have noticed from the examples on this page, you can actually write Python scripts in UTF-8. Variables must be in ASCII, but you can include Chinese comments, or Korean strings in your source files. In order for this to work correctly, Python needs to know that your script file is not ASCII. You can do this in one of two ways. First, you can place a UTF-8 byte-order marker at the beginning of your file, if your editor supports it. Secondly, you can place the following special comment in the first or second lines of your script:

    # -*- coding: utf-8 -*-

Any ASCII-compatible encoding is permitted. For details, see the Defining Python Source Code Encodings specification.
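To see the effect of the special comment without creating a script file, the sketch below compiles a tiny made-up source held in a byte string. It assumes Python 3's compile(), which accepts byte strings and honours the coding comment in them. Latin-1 is used instead of UTF-8 only to show that the declaration, not a default, decides how the bytes are read.

```python
# A two-line "script" as raw bytes. The byte 0xE9 is e-acute in Latin-1,
# but it would not be valid UTF-8 on its own.
source_bytes = b"# -*- coding: latin-1 -*-\ntext = u'caf\xe9'\n"

# compile() reads the coding comment and decodes the source accordingly.
code = compile(source_bytes, "<declared-encoding-demo>", "exec")

namespace = {}
exec(code, namespace)

decoded_text = namespace["text"]
```

Here decoded_text holds the four characters of "café", because the coding comment told Python to read byte 0xE9 as Latin-1.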

Other Resources

Using Unicode in C/C++ by Evan Jones - A brief tour of how to use Unicode in standard C/C++ applications.
On the Goodness of Unicode by Tim Bray - An essay about why you should support Unicode.
The [...] Minimum Every Software Developer [...] Must Know About Unicode [...] by Joel Spolsky - Another essay about why Unicode is good, and an introduction to how it works.
Characters vs. Bytes by Tim Bray - An introduction to the details of Unicode encoding.
Unicode in Python by Thijs van der Vossen - Another quick and dirty introduction to Python's Unicode support.
Python Unicode Objects by Fredrik Lundh - A collection of tips about Python's Unicode support, like using it in regular expressions.
Unicode for Programmers by Jason Orendorff - A detailed guide to Unicode, geared towards Python, Java, and Windows programmers.
Unicode Home Page - The official web site for the Unicode specifications.