Character Encoding Types in Python

To explain how Python strings are encoded, we first need the basics of character encoding in general: what ASCII is, what Unicode is, and what UTF-8 is.

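Then always keep in mind that in Python everything is an object. The objects that matter for character encoding are str and bytes. A str object stores a sequence of Unicode characters (code points); a bytes object stores a raw sequence of bytes.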

Here is how str and bytes objects are defined:

>>>

>>> str1 = ''

>>> str1

''

>>> str2 = 'abcde'

>>> str2

'abcde'

>>> str3 = '麦新杰'

>>> str3

'麦新杰'

>>> str4 = u'麦新杰de云上小悟' # u prefix (optional in Python 3)

>>> str4

'麦新杰de云上小悟'

>>>

>>> byte1 = b'abcde' # b prefix

>>> byte1

b'abcde'

>>> byte2 = b'麦新杰' # only ASCII characters are allowed

File "<stdin>", line 1

SyntaxError: bytes can only contain ASCII literal characters.

>>>

>>> type(str4) # str object

<class 'str'>

>>>

>>> type(byte1) # bytes object

<class 'bytes'>

>>>

>>> len(str2)

5

>>> len(str3) # each Chinese character counts as one character

3

>>> len(str4)

9
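len() on a str counts characters, while len() on the encoded bytes counts bytes. The following is a minimal sketch of that difference; the variable names are illustrative and do not come from the session above.

```python
# Compare the number of characters in a str with the number of bytes
# in its UTF-8 encoding. Variable names are illustrative.
text = '麦新杰de云上小悟'
data = text.encode('utf-8')

char_count = len(text)   # counts Unicode code points
byte_count = len(data)   # counts encoded bytes

print(char_count)  # 9
print(byte_count)  # 23 (7 Chinese characters * 3 bytes + 2 ASCII bytes)

# Indexing a str yields a one-character str; indexing bytes yields an int.
print(text[0])  # 麦
print(data[0])  # 233, i.e. 0xe9
```

The two counts agree only when every character is in the ASCII range.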

A str object can only be encoded; a bytes object can only be decoded.

Both str.encode and bytes.decode use UTF-8 by default.
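Both points can be checked directly in Python 3. This is a small sketch, with `text` and `data` as illustrative names:

```python
text = 'abc'
data = b'abc'

# In Python 3, str has encode() but no decode(); bytes is the opposite.
print(hasattr(text, 'encode'), hasattr(text, 'decode'))  # True False
print(hasattr(data, 'decode'), hasattr(data, 'encode'))  # True False

# Called without arguments, both directions use UTF-8.
same_encode = '麦'.encode() == '麦'.encode('utf-8')
same_decode = b'\xe9\xba\xa6'.decode() == b'\xe9\xba\xa6'.decode('utf-8')
print(same_encode, same_decode)  # True True
```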

>>> help(str.encode)

Help on method_descriptor:

encode(...)

S.encode(encoding='utf-8', errors='strict') -> bytes

Encode S using the codec registered for encoding. Default encoding

is 'utf-8'. errors may be given to set a different error

handling scheme. Default is 'strict' meaning that encoding errors raise

a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and

'xmlcharrefreplace' as well as any other name registered with

codecs.register_error that can handle UnicodeEncodeErrors.

>>> help(bytes.decode)

Help on method_descriptor:

decode(self, /, encoding='utf-8', errors='strict')

Decode the bytes using the codec registered for encoding.

encoding

The encoding with which to decode the bytes.

errors

The error handling scheme to use for the handling of decoding errors.

The default is 'strict' meaning that decoding errors raise a

UnicodeDecodeError. Other possible values are 'ignore' and 'replace'

as well as any other name registered with codecs.register_error that

can handle UnicodeDecodeErrors.

>>>
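The errors parameter described in the help text is easiest to understand by trying the handlers side by side. The sketch below uses an illustrative sample string and a deliberately truncated byte sequence; neither comes from the original session.

```python
text = 'Python 编码'

# Encoding to ASCII: the two Chinese characters cannot be represented.
ignored = text.encode('ascii', errors='ignore')
replaced = text.encode('ascii', errors='replace')
xml_refs = text.encode('ascii', errors='xmlcharrefreplace')
print(ignored)   # b'Python '
print(replaced)  # b'Python ??'
print(xml_refs)  # b'Python &#32534;&#30721;'

# Decoding: dropping the last byte leaves an incomplete UTF-8 sequence.
broken = '编码'.encode('utf-8')[:-1]
try:
    broken.decode('utf-8')  # errors='strict' is the default
except UnicodeDecodeError as exc:
    print('strict decoding failed:', exc.reason)

lenient = broken.decode('utf-8', errors='replace')
skipped = broken.decode('utf-8', errors='ignore')
print(skipped)  # 编
```

With 'replace', the undecodable tail becomes the U+FFFD replacement character, so data loss stays visible instead of being silently dropped as with 'ignore'.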

Encoding a str yields a bytes object;

decoding a bytes object yields a str object.

>>>

>>> str3

'麦新杰'

>>> str3.encode()

b'\xe9\xba\xa6\xe6\x96\xb0\xe6\x9d\xb0'

>>>

>>> byte1

b'abcde'

>>> byte1.decode()

'abcde'

>>>

Although the bytes are decoded as UTF-8, the resulting str object still holds its characters as Unicode code points, not as UTF-8 bytes.

>>>

>>> ord(str3[0])

40614

>>> chr(40614)

'麦'

>>> bin(ord(str3[0]))

'0b1001111010100110'

>>>

>>> byte3 = str3.encode()

>>> byte3 # bytes outside the ASCII range are shown as \x hexadecimal escapes

b'\xe9\xba\xa6\xe6\x96\xb0\xe6\x9d\xb0'

>>>

>>> byte3.decode()

'麦新杰'

>>>

>>> type(byte3.decode())

<class 'str'>

>>> str3

'麦新杰'

>>>
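The link between the code point 40614 (0x9EA6) and the bytes \xe9\xba\xa6 is the UTF-8 bit layout for three-byte sequences: 1110xxxx 10xxxxxx 10xxxxxx. The sketch below encodes by hand to make that visible. `utf8_encode_3byte` is a hypothetical helper written for illustration, not a Python API, and it covers only the range U+0800 to U+FFFF.

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0800..U+FFFF as three UTF-8 bytes.

    Illustrative helper only; surrogates (U+D800..U+DFFF) are rejected
    because they are not valid in UTF-8.
    """
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError('code point outside the three-byte UTF-8 range')
    byte1 = 0b11100000 | (code_point >> 12)              # 1110xxxx: top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0b111111)  # 10xxxxxx: middle 6 bits
    byte3 = 0b10000000 | (code_point & 0b111111)         # 10xxxxxx: low 6 bits
    return bytes([byte1, byte2, byte3])


code_point = ord('麦')                 # 40614, i.e. 0x9ea6
manual = utf8_encode_3byte(code_point)
print(manual)                          # b'\xe9\xba\xa6'
print(manual == '麦'.encode('utf-8'))  # True
```

In real code, str.encode('utf-8') does this for every code point range; the helper exists only to show where the bits go.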

In everyday code you will generally need only ASCII and UTF-8. Avoid other encodings where you can; using them is asking for trouble.

If a str object contains only characters in the ASCII range, it can be encoded to a bytes object with the ASCII codec:

>>>

>>> str6 = 'abdde123465'

>>>

>>> str6.encode('ascii')

b'abdde123465'

>>>
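Whether a string qualifies can be checked before encoding. The sketch below assumes Python 3.7 or later, where str.isascii() is available; the sample strings are illustrative.

```python
ascii_text = 'abdde123465'
mixed_text = 'abc麦'

print(ascii_text.isascii())  # True
print(mixed_text.isascii())  # False

# ASCII is a subset of UTF-8, so for ASCII-only text both codecs
# produce exactly the same bytes.
as_ascii = ascii_text.encode('ascii')
as_utf8 = ascii_text.encode('utf-8')
print(as_ascii == as_utf8)  # True
```

This is why choosing UTF-8 everywhere is safe for ASCII-only data as well.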

Here are a few more code examples; note the \u escape notation:

>>> '\u4e2d\u6587'

'中文'

>>> 'ABC'.encode('ascii')

b'ABC'

>>> '中文'.encode('utf-8')

b'\xe4\xb8\xad\xe6\x96\x87'

>>> '中文'.encode('ascii')

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
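When the input may contain non-ASCII text, the exception can be caught and inspected. The following is one possible sketch; `to_ascii_bytes` is a hypothetical helper, and falling back to the 'backslashreplace' handler is an assumption made for illustration, not something the original prescribes.

```python
def to_ascii_bytes(text: str) -> bytes:
    """Hypothetical helper: encode as ASCII, escaping what does not fit."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        # exc.start and exc.end delimit the characters that failed.
        print(f'cannot encode {text[exc.start:exc.end]!r} as ASCII')
        return text.encode('ascii', errors='backslashreplace')


print(to_ascii_bytes('ABC'))   # b'ABC'
print(to_ascii_bytes('中文'))  # b'\\u4e2d\\u6587'

# The \u escape in a str literal names the same characters directly.
print('\u4e2d\u6587' == '中文')  # True
```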

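Finally, the built-in functions used in this article:

bin(): converts an int to its binary string representation;

ord(): returns the Unicode code point of a character (a str of length one);

chr(): converts a Unicode code point to a character (a str of length one).

The commonly used type(), help(), and len() need no explanation.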
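The three functions round-trip cleanly, as this short sketch shows. hex() and int(..., 2) are added here for illustration and are not used in the article itself.

```python
ch = '麦'

code_point = ord(ch)     # character -> Unicode code point
binary = bin(code_point) # int -> binary string

print(code_point)       # 40614
print(hex(code_point))  # 0x9ea6
print(binary)           # 0b1001111010100110

# chr() reverses ord(), and int(..., 2) reverses bin().
print(chr(code_point))  # 麦
print(int(binary, 2))   # 40614
```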