String Encoding in Python

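I never fully understood character encoding in Python. I ran into the problem again today while reading the Python reference manual (《Python参考手册》), so here is a fresh write-up.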

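In Python, a string literal specifies a sequence of characters. It is written by enclosing text in single quotes ('), double quotes ("), or triple quotes (''' or """).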

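In Python 2, string literals correspond to 8-bit characters, that is, byte-oriented data. These strings have one important limitation: they cannot fully support international character sets and Unicode. To work around this, Python 2 uses a separate string type for Unicode data. To write a Unicode string literal, put the prefix "u" before the first quote.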

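In Python 3 the prefix is unnecessary, because every string is already a Unicode string. (Python 3.0 through 3.2 reject the u prefix as a syntax error; Python 3.3 and later accept it again and it has no effect.) If the Python 2 interpreter is run with the -U option, it emulates this behavior: all string literals are treated as Unicode strings and the u prefix can be omitted.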
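To make the Python 3 side concrete, here is a minimal sketch, assuming Python 3.3 or later. The variable names are only for illustration. It shows that a plain literal is already a sequence of Unicode code points and that the optional u prefix changes nothing.

```python
# Python 3.3+ sketch: every str literal is a sequence of Unicode code points.
plain = '代码'
prefixed = u'代码'  # the u prefix is accepted since Python 3.3 and has no effect

# ord() returns the code point of a single character
code_points = [hex(ord(ch)) for ch in plain]

print(type(plain).__name__)  # str
print(plain == prefixed)     # True
print(code_points)           # ['0x4ee3', '0x7801']
```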

Comparing the str type in Python 2 and Python 3

Python 2

type("str")  #<type 'str'>
type(b"str") #<type 'str'>
type(u"str") #<type 'unicode'>

Python 3

type("str")  #<class 'str'>
type(b"str") #<class 'bytes'>
type(u"str") #<class 'str'>

Python 2 uses a separate string type, unicode, for Unicode data.

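To write an already encoded byte string as a literal, put "b" before the first quote. This creates a string made up of single bytes. Byte literals are rarely used in most programs, because the syntax did not appear until Python 2.6, and in that version there is no difference between a byte literal and an ordinary string. In Python 3, however, a byte literal produces the new bytes type, which is distinct from an ordinary string (as the code above shows).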
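The difference between the two types in Python 3 can be sketched as follows. This is only an illustration under the assumption of a default Python 3 interpreter; the names text and data are made up for the example.

```python
# Python 3 sketch: a bytes literal and a str literal are different types.
text = 'abc'
data = b'abc'

print(type(text).__name__, type(data).__name__)  # str bytes
print(text == data)   # False: a str never compares equal to bytes
print(data[0])        # 97: indexing bytes yields an integer
print(text[0])        # a: indexing str yields a one-character str
print(data.decode('ascii') == text)  # True once the bytes are decoded
```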

Encoding in Python 2:

#!/usr/bin/python
#-*- coding:UTF-8 -*-

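s='代码' # Python 2 keeps the literal as a byte string in the source file's encoding; here that is UTF-8: '\xe4\xbb\xa3\xe7\xa0\x81'
u=u'代码' # explicitly a unicode object; it has no encoding and stores each character's code point (its number in the Unicode character set): u'\u4ee3\u7801'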

print len(s) #6
print len(u) #2

print repr(s) #'\xe4\xbb\xa3\xe7\xa0\x81'
print repr(u) #u'\u4ee3\u7801'

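#print s.encode('utf-8') #UnicodeDecodeError: 'ascii' codec can't decode
# s is already a UTF-8 byte string, so encode() first decodes it implicitly with the ASCII codec, which fails on the non-ASCII bytes

print repr(s.decode('utf-8')) #u'\u4ee3\u7801'
# decoding works in Python 2 and returns a unicode object, which has no encoding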

print repr(u.encode('utf-8')) #'\xe4\xbb\xa3\xe7\xa0\x81'
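#print repr(u.decode('utf-8')) #UnicodeEncodeError: 'ascii' codec can't encode characters
# u is already unicode, so decode() first encodes it implicitly with the ASCII codec, which fails on the non-ASCII characters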

b=b'代码' # in Python 2 a b'' literal is an ordinary str, so this is the same UTF-8 byte string as s

print len(b) #6
print repr(b) #'\xe4\xbb\xa3\xe7\xa0\x81'
print repr(b.decode('utf-8')) #u'\u4ee3\u7801'
#print repr(b.encode('utf-8')) #UnicodeDecodeError: 'ascii' codec can't decode byte

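Strictly speaking, a Python 2 str is a byte string (in Python 2, string literals correspond to 8-bit characters or byte-oriented data): it is the sequence of bytes produced by encoding unicode text. Calling len() on the UTF-8 encoded str '代码' returns 6, because the UTF-8 encoded '代码' == '\xe4\xbb\xa3\xe7\xa0\x81'. unicode is the real string type. It is what you get by decoding a byte str with the correct character encoding, and len(u'代码') == 2.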
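The same distinction between characters and encoded bytes can be sketched in Python 3. The GBK line is my own addition for contrast and is not part of the original example; it only illustrates that the byte length depends on the encoding chosen.

```python
# Python 3 sketch: characters versus encoded bytes.
text = '代码'                      # 2 Unicode characters
utf8_bytes = text.encode('utf-8')  # 3 bytes per character in UTF-8
gbk_bytes = text.encode('gbk')     # 2 bytes per character in GBK (added for contrast)

print(len(text))        # 2
print(len(utf8_bytes))  # 6
print(len(gbk_bytes))   # 4

# decoding with the matching codec recovers the original text
print(utf8_bytes.decode('utf-8') == text)  # True
print(gbk_bytes.decode('gbk') == text)     # True
```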

Encoding in Python 3

#!/usr/local/bin/python3
#-*- coding: UTF-8 -*-
s='代码'
print(repr(s)) #'代码'

print(s.encode('utf-8')) #b'\xe4\xbb\xa3\xe7\xa0\x81'

type(s) # <class 'str'>

type(s.encode('utf-8')) #<class 'bytes'>

type(b'str') #<class 'bytes'>

print(s.decode('utf-8')) #AttributeError: 'str' object has no attribute 'decode'
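In Python 3, decoding goes the other way, from bytes back to str. The sketch below is only an illustration with made-up variable names. It shows a successful round trip, the error raised when the wrong codec is used, and that str really has no decode() method.

```python
# Python 3 sketch: only bytes can be decoded, and the codec must match.
data = '代码'.encode('utf-8')

print(data.decode('utf-8') == '代码')  # True: the round trip recovers the text

error_name = None
try:
    data.decode('ascii')  # wrong codec for these bytes
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
print(error_name)         # UnicodeDecodeError

has_decode = hasattr('代码', 'decode')
print(has_decode)         # False: str has no decode() in Python 3
```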