Python Encoding and Decoding Notes (Part 2): Encoding and Decoding in Python

wwjiang_ustc

Article link: https://blog.csdn.net/wwjiang_ustc/article/details/43706077

This part continues the discussion of how to handle encoding and decoding in Python.

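(I) Encoding of Source Code Files

How Python handles the encoding of source files is described in detail in PEP 263, "Defining Python Source Code Encodings" (https://www.python.org/dev/peps/pep-0263/). The key points are excerpted below.

By default, Python 2 assumes that a source file is ASCII-encoded. (Python will default to ASCII as standard encoding if no other encoding hints are given.)
An encoding declaration can be placed on the first or second line of the file to tell Python which encoding the file uses, for example
# coding=<encoding name>
or
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, Python recognizes the declaration by looking for text of the form "coding[:=]\s*([-\w.]+)". (More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)".) In other words, Python only looks at three things: the # sign, the keyword coding, and the encoding name. All other characters are there for readability only.
If a source file contains non-ASCII characters, it must declare its encoding at the top. Without a declaration, Python 2 falls back to ASCII, cannot read any byte above 127, and rejects the file with "SyntaxError: Non-ASCII character ... but no encoding declared". The actual encoding of the file must match the declared encoding, otherwise problems will occur.
The one exception is a file that starts with the UTF-8 BOM (the three bytes \xef\xbb\xbf), which some Windows editors write when saving a file as UTF-8. Python treats such a file as utf8 even if it contains no encoding declaration. If the file has a BOM and also carries a declaration that names anything other than utf-8, Python reports an error. (If a source file uses both the UTF-8 BOM mark signature and a magic encoding comment, the only allowed encoding for the comment is 'utf-8'.)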
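As a rough illustration of the rules above, the sketch below applies the PEP 263 regular expression to the first two lines of a byte string and also handles the UTF-8 BOM. The function name `declared_encoding` and its `default` argument are made up for this example, and the logic is a simplification of what the real tokenizer does (for instance, it only treats `utf-8` and `utf8` as names for UTF-8). It is written for Python 3.

```python
import codecs
import re

# The regular expression quoted from PEP 263, applied to bytes.
CODING_RE = re.compile(br"coding[:=]\s*([-\w.]+)")


def declared_encoding(source_bytes, default="ascii"):
    """Return the encoding implied by a BOM or by a declaration in the first two lines."""
    has_bom = source_bytes.startswith(codecs.BOM_UTF8)
    if has_bom:
        source_bytes = source_bytes[len(codecs.BOM_UTF8):]

    declared = None
    for line in source_bytes.splitlines()[:2]:
        stripped = line.lstrip()
        if not stripped.startswith(b"#"):
            continue  # only comment lines can carry the declaration
        match = CODING_RE.search(stripped)
        if match:
            declared = match.group(1).decode("ascii").lower()
            break

    if has_bom:
        # With a BOM, the only declaration allowed is utf-8.
        if declared not in (None, "utf-8", "utf8"):
            raise SyntaxError("BOM found but declaration says %s" % declared)
        return "utf-8"
    return declared or default


print(declared_encoding(b"#!/usr/bin/python\n# -*- coding: gbk -*-\n"))
print(declared_encoding(codecs.BOM_UTF8 + b"print(1)\n"))
```

In Python 3 the standard library already offers `tokenize.detect_encoding()` for real use; the sketch only exists to make the matching rule visible.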

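Of the following four cases, only case (b) runs correctly when the source file contains non-ASCII characters:

(a) The source file is saved as UTF-8 without a BOM and has no encoding declaration

(b) The source file is saved as UTF-8 with a BOM and has no encoding declaration

(c) The source file is saved as UTF-8 with a BOM, but the encoding declaration says GBK

(d) The source file is saved as GBK, but the encoding declaration says UTF-8 # The Chinese str literals are stored as GBK bytes but parsed as UTF-8, which produces SyntaxError: (unicode error) 'utf8' codec can't decode byte.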
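Case (d) can be reproduced outside the compiler with plain byte strings. The sketch below encodes the sample text as GBK and then tries to decode it as UTF-8, which is what a mismatched declaration amounts to. The text is written with \u escapes so the snippet does not depend on how it is saved, and the variable names are illustrative only.

```python
# The sample text "Python大法好", written with escapes so the snippet is pure ASCII.
# (The u prefix is accepted by Python 2 and by Python 3.3 and later.)
text = u"Python\u5927\u6cd5\u597d"

# The bytes as they would sit on disk in a GBK-encoded file.
gbk_bytes = text.encode("gbk")

# Decoding with the wrong codec fails.
try:
    gbk_bytes.decode("utf-8")
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

# Decoding with the codec that was actually used restores the text.
restored = gbk_bytes.decode("gbk")

print(mismatch_error is not None)  # True
print(restored == text)            # True
```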

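In other words, if you add # -*- coding: utf-8 -*-, check the editor you are using and make sure the file is really saved in that encoding.
A source file must use one encoding from start to finish; mixing several encodings causes a decoding error. (The complete Python source file should use a single encoding. Embedding of differently encoded data is not allowed and will result in a decoding error during compilation of the Python source code.)
Encodings that use two or more bytes for every character, such as UTF-16, are not supported for source files. ASCII-compatible multi-byte encodings such as UTF-8 are fine. (It does not include encodings which use two or more bytes for all characters like e.g. UTF-16.)

These rules are easier to understand with Python's compilation process in mind: read the source file ---> decode it to Unicode using the declared or default encoding ---> re-encode the source as utf-8 ---> tokenize the utf-8 content ---> compile it, creating Unicode objects from the Unicode literals and creating byte strings by re-encoding the utf-8 data into the file's own encoding (Python's tokenizer/compiler combo will need to be updated to work as follows: 1. read the file 2. decode it into Unicode assuming a fixed per-file encoding 3. convert it into a UTF-8 byte string 4. tokenize the UTF-8 content 5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding)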

(II) Python String Types

Python has two kinds of strings, byte strings and Unicode strings (Byte string is an ordered list of bytes - that is integers between 0 and 255 inclusive. Unicode string is an ordered list of Unicode characters, such as letters, numbers, punctuation, tiny snowmen (☃), etc). Both the byte string type (str) and unicode are subclasses of basestring. A unicode object is obtained by decoding a byte str, and a unicode object can be encoded into a byte str. That is:
byte str --> decode --> unicode
unicode --> encode --> byte str

Note: calling len() on the UTF-8 encoded str '汉' returns 3, because the UTF-8 encoding of '汉' is actually '\xE6\xB1\x89'. unicode is the true string type: len(u'汉') = 1.
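The length difference described in the note can be checked in a few lines. This is a small sketch using an escaped literal so that it behaves the same regardless of the file's own encoding; the variable names are illustrative.

```python
text = u"\u6c49"                     # the single character 汉 as a Unicode string
utf8_bytes = text.encode("utf-8")    # unicode --> encode --> byte string

print(len(text))                           # 1: one character
print(len(utf8_bytes))                     # 3: three bytes
print(utf8_bytes == b"\xe6\xb1\x89")       # True
print(utf8_bytes.decode("utf-8") == text)  # True: byte string --> decode --> unicode
```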

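The encode() method turns a unicode string into a byte string in the specified encoding, and decode() turns a byte string in a given encoding into a unicode string. unicode() likewise decodes a byte string in a given encoding into a unicode string. Note that although calling encode() on a str is a mistake, Python does not raise an exception as long as the str contains only ASCII characters; it returns another str with the same content but a different id. Calling decode() on a unicode object behaves similarly.

Examples:

(a) If a string is defined as s='Python大法好', its encoding follows the file: in a utf8 file the string is utf8-encoded, and in a gb2312 file it is gb2312-encoded.

(b) If a string is defined as s=u'Python大法好', it is a unicode string (which can be thought of as Python's internal representation, independent of the encoding of the source file itself). isinstance(s, unicode) tests whether a value is unicode.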

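(III) Python Versions

In Python 2.x, a byte string is a literal such as 'abc', while a Unicode string must be marked with a u prefix, such as u'abc'.
In Python 3.x the situation is reversed: an unprefixed literal such as 'abc' is a Unicode string, and a byte string must be written as b'abc'. The u'abc' form was illegal in Python 3.0 through 3.2 and has been accepted again since Python 3.3.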
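The Python 3 side of this comparison can be seen directly. The sketch below is Python 3 only and uses illustrative variable names; it also shows that Python 3 refuses to mix the two types, where Python 2 would convert between them silently.

```python
text = "abc"    # Unicode string in Python 3
data = b"abc"   # byte string in Python 3

print(type(text).__name__)  # str
print(type(data).__name__)  # bytes
print(u"abc" == text)       # True: the u prefix is accepted again since Python 3.3

# Mixing the two types is an error in Python 3 instead of an implicit decode.
try:
    combined = data + text
except TypeError:
    combined = None

# An explicit decode makes the intent clear.
explicit = data.decode("ascii") + text
print(combined, explicit)   # None abcabc
```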

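(IV) Input/Output Encoding Conversion on Windows

When a Python script runs, strings need to be brought into one consistent form before they are processed. The usual flow is to use decode to convert them to unicode, and then use encode to convert them to whatever other encoding is needed.

Before going further, let's use Python to inspect some of its default encoding settings (everything below was run under Python 2.7.x).

(a) Python's default encoding: sys.getdefaultencoding() # The default is ASCII. Given s = "abc" + u"bcd", Python converts "abc".decode(sys.getdefaultencoding()) and then concatenates the two Unicode strings. That is, when an expression mixes str and unicode objects, Python automatically decodes the byte string to Unicode.

(b) The encoding of the current locale: locale.getdefaultlocale()

(c) The file system encoding: sys.getfilesystemencoding() # On Windows, getfilesystemencoding returns mbcs (the Windows multi-byte character set, also called ANSI; it maps to different encodings on different language versions of Windows, and on Chinese Windows it is the GB family of encodings)

(d) The terminal's output encoding: sys.stdout.encoding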
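The settings listed above can be collected in one place. The sketch below is written for Python 3 and makes one substitution that is not in the original text: it calls `locale.getpreferredencoding(False)` instead of `locale.getdefaultlocale()`, because recent Python 3 releases deprecate the latter. The `settings` dictionary is just an illustrative container, and the values printed depend on the machine.

```python
import locale
import sys

settings = {
    "default": sys.getdefaultencoding(),            # used for implicit conversions
    "locale": locale.getpreferredencoding(False),   # encoding of the current locale
    "filesystem": sys.getfilesystemencoding(),      # used for file names
    "stdout": getattr(sys.stdout, "encoding", None),  # may be None if replaced
}

for name in sorted(settings):
    print("%-10s %s" % (name, settings[name]))
```

On Python 3 the first value is utf-8 rather than the ascii reported by Python 2.7.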

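Having seen Python's default encoding settings, consider how to handle non-ASCII strings in practice:

(1) Input method 1 --- strings contained in the source file

As in the examples of section (II), a Unicode string only needs the 'u' prefix, whereas a byte string has to be decoded according to the encoding of the source file. Encoding a byte string, or decoding a Unicode string, leads to errors like the ones below:

#coding: utf-8
u = u'Python大法好'
print repr(u) #u'Python\u5927\u6cd5\u597d'
s = 'Python大法好'
print repr(s) #'Python\xe5\xa4\xa7\xe6\xb3\x95\xe5\xa5\xbd'
u2 = s.decode("utf-8")
print repr(u2)  #u'Python\u5927\u6cd5\u597d'
s2 = u.decode("utf-8") # wrong: raises UnicodeEncodeError, use encode on unicode
u2 = s.encode("utf-8") # wrong: raises UnicodeDecodeError, use decode on str

In the example above, it is best not to call decode directly on a unicode object, nor encode directly on a byte string. Calling u.decode("utf-8") is equivalent to u.encode(default_encoding).decode("utf-8"), where default_encoding is the default used by Python's unicode implementation, i.e. the value of sys.getdefaultencoding(), which is ascii. So when the unicode string contains Chinese characters, which fall outside the ASCII range, an error is raised. Likewise, calling encode directly on a byte string first decodes the str implicitly, i.e. s.decode(default_encoding).encode("utf-8"); when the byte string contains Chinese and default_encoding is the default ascii, the decoding step fails. As a result, the last two lines above raise UnicodeEncodeError: 'ascii' codec can't encode characters in position... and UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position... respectively. Of course, if the byte string or unicode object stays within the ASCII range, there is no problem. For example, s = "abc"; s.encode("utf-8") works, and returns a byte string whose id differs from that of s.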
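One way to avoid calling the wrong method is to check the type before converting. The helpers below are a minimal sketch, not part of the original article: the names `to_text` and `to_bytes` are hypothetical, and the default of utf-8 is an assumption. They rely on `bytes`, which is the byte string type in Python 3 and an alias of str in Python 2.7.

```python
def to_text(value, encoding="utf-8"):
    """Return text: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return bytes: encode text, pass byte strings through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


sample = u"Python\u5927\u6cd5\u597d"
raw = to_bytes(sample)

print(to_text(raw) == sample)     # True
print(to_bytes(raw) is raw)       # True: already bytes, nothing to do
print(to_text(sample) is sample)  # True: already text, nothing to do
```

Because neither helper ever encodes a byte string or decodes a text string, the implicit ASCII conversion described above is never triggered.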

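Note: str(), repr() and the `` operator are very similar in behaviour. In fact repr() and `` do exactly the same thing: they return the "official" string representation of an object, which in most cases can be evaluated (with the built-in eval()) to obtain the object again.
str() is different: it aims to produce a readable string representation, whose result usually cannot be passed to eval() but is well suited to the print statement. Keep in mind that not every string returned by repr() can be turned back into the original object by eval(). In short, repr() output is friendly to Python and str() output is friendly to the user. Even so, in many cases all three produce exactly the same output. A comparison follows (the comment on each line shows what is printed):

#coding=utf-8
s = 'Hello, world.'
print str(s)  # Hello, world.
print repr(s) # 'Hello, world.'
print str(0.1) # 0.1
print repr(0.1) # 0.1 on Python 2.7; 0.10000000000000001 on Python 2.6 and earlier
string = '我'
print repr(string) # '\xe6\x88\x91'
string = u'我'
print repr(string) # u'\u6211'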
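The relationship between repr() and eval() described above can be checked with a short sketch. The sample values and the `Plain` class are invented for the illustration.

```python
# For many built-in values, eval(repr(x)) gives back an equal object.
values = ["Hello, world.", 0.1, [1, 2, 3], {"a": 1}]
round_trips = [eval(repr(value)) == value for value in values]
print(round_trips)  # [True, True, True, True]


class Plain(object):
    """A class that does not define its own __repr__."""
    pass


# The default repr looks like <...Plain object at 0x...>, which eval() cannot parse.
default_repr = repr(Plain())
print(default_repr.startswith("<"))  # True
```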

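(2) Input method 2 --- strings read from a text file

If the encoding of the input file is known, simply read the byte stream line by line and decode it accordingly. When a file is opened with Python's open(), read() returns a str whose encoding is that of the file itself. (open() reads data from disk byte by byte and yields text of type str; any byte that is not an ASCII character is shown as a hexadecimal escape, which is what we call an encoded string. The file object handles all text as str in this form. This approach sidesteps unpredictable encoding problems and leaves the encoding and decoding to the caller.) In addition, Python's codecs module provides its own open(), which opens a file with a specified encoding; reading from a file opened this way returns unicode. Suppose test1.txt contains "Python大法好" encoded as utf-8 and test2.txt contains "Python大法好" encoded as GBK. The correct ways to read them are:

# -*- coding: utf-8 -*-
import codecs
filehandle1 = open('test1.txt','r')
line1 = filehandle1.readline()
string1 = line1.decode('utf-8') # <type 'unicode'>
filehandle2 = open('test2.txt','r')
line2 = filehandle2.readline()
string2 = line2.decode('gbk') # <type 'unicode'>
filehandle1.close()
filehandle2.close()

filehandle3 = codecs.open('test1.txt', encoding='utf-8')
string3 = filehandle3.read()  # <type 'unicode'>
filehandle4 = codecs.open('test2.txt', encoding='gbk')
string4 = filehandle4.read()  # <type 'unicode'>
filehandle3.close()
filehandle4.close()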
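The same two reading styles can be tried end to end with temporary files. The sketch below is an assumption-laden variant, not the article's code: it uses `io.open`, which exists in both Python 2.7 and Python 3 (where it is the built-in open), instead of `codecs.open`, and it creates its own sample files so that it is self-contained.

```python
import io
import os
import shutil
import tempfile

text = u"Python\u5927\u6cd5\u597d"
tmpdir = tempfile.mkdtemp()
try:
    # Create the two sample files described above.
    utf8_path = os.path.join(tmpdir, "test1.txt")
    gbk_path = os.path.join(tmpdir, "test2.txt")
    with io.open(utf8_path, "wb") as f:
        f.write(text.encode("utf-8"))
    with io.open(gbk_path, "wb") as f:
        f.write(text.encode("gbk"))

    # Style 1: read raw bytes, then decode explicitly.
    with io.open(gbk_path, "rb") as f:
        decoded_manually = f.read().decode("gbk")

    # Style 2: let the file object decode while reading.
    with io.open(utf8_path, "r", encoding="utf-8") as f:
        decoded_on_read = f.read()
finally:
    shutil.rmtree(tmpdir)

print(decoded_manually == text, decoded_on_read == text)  # True True
```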

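If the encoding of the input file is unknown, it has to be detected first. A common choice is chardet (a third-party encoding detection library). A file's encoding can be detected as follows:

# coding=utf-8
import chardet
f = open('test.txt','rb') # read raw bytes
result = chardet.detect(f.read()) #{'confidence': 0.99, 'encoding': 'utf-8'}
f.close()

This reads the whole file in one go. chardet's detection can stop once it has gathered enough data, so for a large file, reading everything wastes some memory.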
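The memory concern can be illustrated without chardet. The sketch below is a cruder, standard-library-only alternative, not what chardet does: it reads the file in chunks and returns the first candidate encoding that decodes every chunk. The function name `guess_encoding`, the candidate list, and the chunk size are all assumptions made for this example. It is written for Python 3.

```python
import codecs
import os
import tempfile


def guess_encoding(path, candidates=("utf-8", "gbk"), chunk_size=4096):
    """Return the first candidate that decodes the whole file, or None."""
    for encoding in candidates:
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            with open(path, "rb") as f:
                # Feed the file chunk by chunk; the decoder keeps partial characters.
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    decoder.decode(chunk)
            decoder.decode(b"", final=True)  # fails if a character is left incomplete
        except UnicodeDecodeError:
            continue
        return encoding
    return None


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(u"Python\u5927\u6cd5\u597d".encode("gbk"))
    guessed = guess_encoding(path, chunk_size=4)  # tiny chunks to exercise the loop
finally:
    os.remove(path)

print(guessed)  # gbk
```

The order of the candidates matters, because a permissive codec such as gbk accepts many byte sequences that were never GBK text. The result is a guess rather than a detection.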

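(3) Output method 1 --- print

On Chinese Windows the console encoding is cp936. For print, Python automatically converts text sent to the console. The target encoding depends on the environment, and on Chinese Windows it is gbk (the getdefaultlocale function of the locale module returns the encoding of the current environment). This leads to an interesting problem. Try this simple example, test.py:

# -*- coding: utf-8 -*-
s = u'Python大法好'
print s

Run it from the console in two ways: python test.py and python test.py > log.txt

You will find that the second one raises an error. When printing to the console, Python automatically encodes to sys.stdout.encoding, but when output goes to a file, Python does not perform this conversion inside the write() call.

In other words, when writing to the console Python tries to encode with the console's code page, whereas output redirected to a file is encoded with the default ASCII codec, so only the 128 ASCII characters can be written.

Note: for a detailed discussion of this problem, see http://manoon.me/?p=1009. There are two ways to make python test.py > log.txt run correctly:

(a) Change print s to print s.encode('utf-8') or print s.encode('gbk'), i.e. encode explicitly before output.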

(b) Call sys.setdefaultencoding in Python to change the default encoding. Calling this method takes a little care, though. Run the following in the Python interpreter:

# -*- coding: utf-8 -*-
import sys
dir(sys)

You will see that the sys module does not actually expose any setdefaultencoding method. The documentation explains that this method is intended for the site module, and once it has been called it is removed from the sys namespace. (This function is only intended to be used by the site module implementation and, where needed, by sitecustomize. Once used by the site module, it is removed from the sys module's namespace.) When the method really is needed, reload can be used to import the module again:

import sys
print 'old encoding value:', sys.getdefaultencoding() #ascii
reload(sys)
sys.setdefaultencoding('utf8')
print 'new encoding value:', sys.getdefaultencoding() #utf8
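As an alternative to changing the interpreter-wide default, option (a) can be wrapped in a small helper that picks the encoding from the stream and falls back to a fixed one when the stream reports none (in Python 2, sys.stdout.encoding is None when output is redirected to a file). The helper name `encode_for_stream`, the utf-8 fallback, and the two fake stream classes are all hypothetical and exist only to make the sketch testable.

```python
import sys


def encode_for_stream(text, stream=None, fallback="utf-8"):
    """Encode text for a stream, using `fallback` when the stream reports no encoding."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, "replace")


class FakeRedirectedStream(object):
    """Stand-in for a redirected stdout that reports no encoding."""
    encoding = None


class FakeGbkConsole(object):
    """Stand-in for a console that reports gbk."""
    encoding = "gbk"


text = u"Python\u5927\u6cd5\u597d"
print(len(encode_for_stream(text, FakeRedirectedStream())))  # 15 (UTF-8 bytes)
print(len(encode_for_stream(text, FakeGbkConsole())))        # 12 (GBK bytes)
```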

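In addition, in some IDEs string output is always garbled or even raises errors. The cause is that the IDE's output console does not convert strings automatically; it is not a problem in the program itself.

For example, running the following code in UliPad:

s=u'Python大法好'
print s

produces: UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-8: ordinal not in range(128). This happens because UliPad's output window writes with the ascii codec (the default encoding is ascii), while the string in the code above is Unicode, so the output fails. Changing the last line to print s.encode('gb2312') prints "Python大法好" correctly.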

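(4) Output method 2 --- text files

When a Python program writes output, unicode strings must be turned into byte strings. When calling write() on a file, a unicode argument should be encoded explicitly with a chosen encoding; if write() receives unicode that has not been encoded, Python encodes it with its default encoding before writing. With a file opened through the codecs module, a unicode argument is written using the encoding given when the file was opened; a str argument is first decoded to unicode with the default encoding and then written in the file's encoding (note that if the str contains Chinese and the default encoding sys.getdefaultencoding() is ascii, this raises a decoding error).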

# -*- coding: utf-8 -*-
#test.py is saved as utf-8 format
import codecs
filehandle1 = open('test1.txt','w') # write utf-8 characters
filehandle2 = open('test2.txt','w') # write gbk characters
u = u'Python大法好'
s = 'Python大法好'
filehandle1.write(u.encode('utf-8'))
filehandle1.write(s)
filehandle2.write(u.encode('gbk'))
filehandle2.write(s.decode('utf-8').encode('gbk'))
filehandle1.close()
filehandle2.close()

filehandle3 = codecs.open('test3.txt', 'w', encoding='utf-8') # write utf-8 characters
filehandle4 = codecs.open('test4.txt', 'w', encoding='gbk') # write gbk characters
filehandle3.write(u)
filehandle3.write(s.decode('utf-8'))
filehandle4.write(u)
filehandle4.write(s.decode('utf-8'))
filehandle3.close()
filehandle4.close()
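The effect of the chosen encoding on the bytes that reach the disk can be checked by comparing file sizes. The sketch below is an illustrative variant of the code above: it uses `io.open` (available in Python 2.7 and Python 3) rather than `codecs.open`, writes to a temporary directory, and uses made-up file names.

```python
import io
import os
import shutil
import tempfile

text = u"Python\u5927\u6cd5\u597d"
tmpdir = tempfile.mkdtemp()
sizes = {}
try:
    for encoding in ("utf-8", "gbk"):
        path = os.path.join(tmpdir, "out_%s.txt" % encoding)
        # The file object encodes the text as it is written.
        with io.open(path, "w", encoding=encoding) as f:
            f.write(text)
        sizes[encoding] = os.path.getsize(path)
finally:
    shutil.rmtree(tmpdir)

# 6 ASCII bytes plus 3 characters at 3 bytes each (UTF-8) or 2 bytes each (GBK).
print(sizes["utf-8"], sizes["gbk"])  # 15 12
```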

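Summary:

1. Decode first, encode last. Decoding first means that whenever a byte stream comes in, it should be decoded to Unicode as early as possible. This prevents problems with len() and with slicing utf-8 byte streams. Encoding last means that text is only encoded into bytes at the point where it has to be output somewhere, which may be a file, a database, a socket and so on. Encode unicode objects only after processing is finished. Encoding last also means not letting Python encode Unicode objects for you, because Python will only use the default ASCII encoding.
2. Use utf-8 by default, because UTF-8 can represent any Unicode character.
3. Use codecs and Unicode objects to simplify processing. The codecs module helps avoid some pitfalls when handling streams such as files or sockets. Without it, you would have to read the file content as a byte stream and then decode that stream into Unicode objects yourself. The codecs module converts byte streams into Unicode objects quickly and saves a lot of trouble.
4. Concatenate values of the same type directly; for values of different types, decode them to unicode first and then concatenate.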
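Point 1 can be made concrete with a small sketch. The function name `shorten` and its parameters are invented for this example; it decodes on the way in, works on characters, and encodes on the way out, and it is compared with slicing the raw bytes directly.

```python
def shorten(raw, limit, in_encoding="utf-8", out_encoding="utf-8"):
    """Decode first, slice by characters, encode only on the way out."""
    text = raw.decode(in_encoding)            # decode early
    return text[:limit].encode(out_encoding)  # encode late


raw = u"Python\u5927\u6cd5\u597d".encode("utf-8")

# Slicing the bytes directly cuts the first Chinese character in half.
try:
    raw[:7].decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False

print(byte_slice_ok)                                       # False
print(shorten(raw, 7) == u"Python\u5927".encode("utf-8"))  # True
```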
