Python Encoding Pitfalls and How to Handle Them

wangshuang1631

Article link: https://blog.csdn.net/wangshuang1631/article/details/71480246

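Python has tormented me a thousand times, yet I still treat it like my first love.
I was writing model scripts in Python, and its encoding handling tripped me up at every step. The first error I ran into was: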

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

Nearly every search result suggests the same fix, which is to set Python's default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

This approach may solve other problems, but it did not solve mine.
Another common Python encoding error looks like this:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

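This is especially common when handling Chinese text in Python, for example when reading and writing files, manipulating strings, or calling print: you run the script and get a screen full of garbled characters. At that point most people start calling encode/decode in various combinations to debug, without clearly thinking through why the garbling happened.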

str and unicode

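In Python 2, str and unicode are both subclasses of basestring, so a single isinstance check can tell whether a value is a string of either kind.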

def is_str(s):
    return isinstance(s, basestring)

Converting between str and unicode, and how they differ

str  -> decode('the_coding_of_str') -> unicode
unicode -> encode('the_coding_you_want') -> str
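As a minimal sketch of the two directions above, the snippet below round-trips a short string through UTF-8. The variable names are illustrative. The u'' and b'' prefixes are used so the same lines behave identically on Python 2.7 and on Python 3, where the two types are named str and bytes instead of unicode and str.

```python
# -*- coding: utf-8 -*-
text = u'中文'                  # text: a sequence of characters
raw = text.encode('utf-8')      # text -> bytes, using the encoding you want
back = raw.decode('utf-8')      # bytes -> text, using the encoding the bytes are in

print(repr(raw))
print(back == text)
```

The decode call has to name the encoding the bytes were actually produced with; decoding UTF-8 bytes with a different codec either fails or yields garbled text.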

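str is a byte string: it is made up of the bytes produced by encoding (encode) a unicode object.
How to declare one and get its length (len returns the number of bytes):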

s = '中文'
s = u'中文'.encode('utf-8')
>>> type('中文')
<type 'str'>

>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6

unicode is the real string type: it is made up of characters.
How to declare one and get its length (len returns the number of characters):

s = u'中文'
s = '中文'.decode('utf-8')
s = unicode('中文', 'utf-8')

>>> type(u'中文')
<type 'unicode'>

>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
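To make the byte-count versus character-count distinction concrete, here is a small sketch that encodes the same two characters with two different codecs. The choice of GBK as the second codec is only for illustration; the point is that the character count stays fixed while the byte count depends on the encoding.

```python
# -*- coding: utf-8 -*-
text = u'中文'

utf8_bytes = text.encode('utf-8')   # 3 bytes per character here
gbk_bytes = text.encode('gbk')      # 2 bytes per character here

print(len(text))        # number of characters
print(len(utf8_bytes))  # number of bytes in UTF-8
print(len(gbk_bytes))   # number of bytes in GBK
```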

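Summary

Figure out whether you are dealing with a str or a unicode, then use the matching method (str.decode or unicode.encode).
The following checks tell you which of the two a value is: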

>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False

>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False
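Building on these isinstance checks, one possible pattern is a pair of small normalizing helpers that convert only when needed. This is a sketch, not part of the original article: to_text, to_bytes, text_type, and binary_type are hypothetical names, and UTF-8 as the default encoding is an assumption. The version switch exists only so the block also runs on Python 3.

```python
# -*- coding: utf-8 -*-
import sys

# Alias the text and byte types for whichever major version is running.
if sys.version_info[0] >= 3:
    text_type, binary_type = str, bytes
else:
    text_type, binary_type = unicode, str  # Python 2 names


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only when it is a byte string."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError('expected text or bytes, got %r' % type(value))


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding only when it is text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, binary_type):
        return value
    raise TypeError('expected text or bytes, got %r' % type(value))


print(to_text(b'\xe4\xb8\xad\xe6\x96\x87') == u'中文')
print(to_bytes(u'中文') == b'\xe4\xb8\xad\xe6\x96\x87')
```

Calling to_text once at the input boundary and to_bytes once at the output boundary keeps the rest of the script working with a single type.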

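Simple rule: do not call encode on a str, and do not call decode on a unicode. In Python 2, encode on a str first implicitly decodes it with the default ascii codec, and decode on a unicode first implicitly encodes it with ascii, which is why each error below names the opposite operation.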

>>> '中文'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
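The two tracebacks above come from an implicit ascii conversion that Python 2 performs before the requested operation. As a rough illustration, the sketch below performs that hidden step explicitly, which reproduces the same two exception types on both Python 2.7 and Python 3. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
raw = b'\xe4\xb8\xad\xe6\x96\x87'   # the UTF-8 bytes of u'中文'
text = u'中文'

try:
    raw.decode('ascii')      # the hidden first step of str.encode() in Python 2
except UnicodeDecodeError as exc:
    decode_error = exc       # keep a reference; Python 3 unbinds exc after the block
    print(exc)

try:
    text.encode('ascii')     # the hidden first step of unicode.decode() in Python 2
except UnicodeEncodeError as exc:
    encode_error = exc
    print(exc)
```

The byte 0xe4 and the two Chinese characters are outside the 0-127 range that ascii covers, so neither conversion can succeed.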

To convert between two encodings, use unicode as the intermediate form

# s is a str encoded in code_A
s.decode('code_A').encode('code_B')
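Filling in concrete codecs for code_A and code_B, a minimal sketch of this conversion might look like the following. The function name transcode is hypothetical, and GBK and UTF-8 are chosen only as an example pair.

```python
# -*- coding: utf-8 -*-
def transcode(data, src_encoding, dst_encoding):
    """Re-encode a byte string by decoding to text first, then encoding again."""
    return data.decode(src_encoding).encode(dst_encoding)


gbk_bytes = u'中文'.encode('gbk')                    # stand-in for data read from a GBK source
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')

print(repr(gbk_bytes))
print(repr(utf8_bytes))
```

The conversion only works if src_encoding matches how the bytes were really produced, so it is worth confirming the source encoding before transcoding.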