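Unicode
Unicode is a standard that assigns a numeric code point to every character; UTF-8 is one of the encodings that implement it by turning those code points into bytes.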
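To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch. It uses the first character from the session below (啊, U+554A); the variable names are only illustrative, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
ch = u'\u554a'  # the character 啊

code_point = ord(ch)                   # the Unicode code point, independent of any encoding
utf8_bytes = ch.encode('utf-8')        # UTF-8 representation of that code point
utf16_bytes = ch.encode('utf-16-be')   # a different encoding of the same code point

print(hex(code_point))    # 0x554a
print(len(utf8_bytes))    # 3
print(len(utf16_bytes))   # 2
```

The code point stays the same; only the byte representation depends on the encoding chosen.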
Encoding in Python 2
In [1]: a = '啊哈哈'
In [2]: a
Out[2]: '\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88'
In [4]: type(a)
Out[4]: str
In [5]: len(a)
Out[5]: 9
In [6]: b = u'姚赫赫'
In [7]: type(b)
Out[7]: unicode
In [8]: len(b)
Out[8]: 3
In [9]: a.decode('utf-8')
Out[9]: u'\u554a\u54c8\u54c8'
In [10]: b
Out[10]: u'\u59da\u8d6b\u8d6b'
In [11]: b.encode('utf-8')
Out[11]: '\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'
In [12]: c = '姚赫赫'
In [13]: c
Out[13]: '\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'
In [14]: import sys
In [15]: sys.getdefaultencoding()
Out[15]: 'ascii'
In [16]: b + c
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-16-c6b7c7e5694f> in <module>()
----> 1 b + c
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
In [19]: reload(sys)  # restores sys.setdefaultencoding, which site.py removes at startup
<module 'sys' (built-in)>
In [20]: sys.setdefaultencoding('utf-8')
In [21]: b + c
Out[21]: u'\u59da\u8d6b\u8d6b\u59da\u8d6b\u8d6b'
In [22]: type(b + c)
Out[22]: unicode
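The session makes b + c work by changing the process-wide default encoding. As an alternative (not what the session does), the byte string can be decoded explicitly before concatenating, so the result does not depend on sys.setdefaultencoding. A minimal sketch, reusing the names b and c from the session and assuming the bytes are UTF-8:

```python
b = u'\u59da\u8d6b\u8d6b'                     # text: unicode in Python 2, str in Python 3
c = b'\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'   # the same text as UTF-8 bytes

# Decode the bytes first, then concatenate text with text.
joined = b + c.decode('utf-8')

print(len(joined))  # 6
```

Because the decode is explicit, this form behaves the same whatever the default encoding is.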
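In Python 2, after a = '啊哈哈', a has type str: it is the encoded byte sequence (UTF-8 in this session), so len(a) is the number of bytes. b has type unicode, which stores text, so len(b) is the number of characters. Mixing the two, as in b + c, makes Python 2 decode the str implicitly with the default encoding (ascii above), which is why the concatenation fails until the default is changed to utf-8.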
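The byte count versus character count can be checked directly. The sketch below uses the same three characters as a in the session; the names text and data are illustrative, and the snippet should run on both Python 2.7 and Python 3.

```python
text = u'\u554a\u54c8\u54c8'    # the three characters of a, as text
data = text.encode('utf-8')     # the same characters as UTF-8 bytes

char_count = len(text)          # counts characters
byte_count = len(data)          # counts bytes

# Each of these characters takes three bytes in UTF-8.
bytes_per_char = [len(ch.encode('utf-8')) for ch in text]

print(char_count, byte_count, bytes_per_char)
```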
Converting between the two
str -> decode('utf-8') -> unicode
unicode -> encode('utf-8') -> str
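The two conversions can be sketched as a round trip. The bytes are the ones shown for a in the session; the variable names are illustrative. The last part shows what happens when the bytes are decoded with the wrong codec, which is the same failure the session hits when it mixes str and unicode.

```python
raw = b'\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88'   # UTF-8 bytes (str in Python 2)

text = raw.decode('utf-8')          # bytes -> text
round_trip = text.encode('utf-8')   # text -> bytes

error = None
try:
    raw.decode('ascii')             # wrong codec: 0xe5 is not an ASCII byte
except UnicodeDecodeError as exc:
    error = exc

print(round_trip == raw)
print(error)
```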
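When writing to a file, a str can be written as is, while a unicode object must be encoded first; otherwise Python 2 encodes it implicitly with the default encoding (ascii unless changed) and fails on non-ASCII text.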
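A minimal sketch of two ways to write text to a file follows. The first encodes by hand, as described above; the second uses io.open, a standard-library alternative that the original note does not mention and that does the encoding on write. The temporary path and variable names are assumptions made for illustration, and the snippet should run on both Python 2.7 and Python 3.

```python
import io
import os
import shutil
import tempfile

text = u'\u59da\u8d6b\u8d6b'
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')   # hypothetical file name

# Option 1: encode explicitly and write the bytes in binary mode.
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Option 2: let io.open encode the text while appending.
with io.open(path, 'a', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, 'rb') as f:
    written = f.read()

shutil.rmtree(tmp_dir)
print(len(written))  # 18
```

Both options put the same UTF-8 bytes in the file; the difference is only where the encode step happens.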