Python Encoding

菜的真真实实

Last modified: 2022-12-04 11:25:51


Tags: python, programming language, intellij-idea

First published: 2022-09-20 14:28:57

Article link: https://blog.csdn.net/qq_15098623/article/details/126952581


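Encoding: converting human-readable information (plaintext) into the form a computer stores and transmits (binary, i.e. bytes).
Decoding: converting the bytes stored by the computer back into plaintext.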
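
The two definitions above can be sketched as a round trip. This is a minimal illustration in Python 3 syntax, assuming UTF-8 as the codec; the variable names are only for illustration.

```python
# Encoding turns text (str) into bytes; decoding turns bytes back into text.
text = "虾皮"

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(encoded)          # b'\xe8\x99\xbe\xe7\x9a\xae'
print(decoded == text)  # True
```

The round trip only restores the original text when the same codec is used in both directions.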

Python2
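Python 2's default encoding is ASCII. ASCII is a 7-bit code stored in a single byte, so it can represent only 128 characters and cannot represent other languages such as Chinese. For practical reasons, Python 2 projects therefore usually switch the default encoding to UTF-8.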
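
The ASCII limitation is easy to observe directly. The sketch below uses Python 3 syntax and a hypothetical helper, try_encode, that is not part of the original article; it simply reports whether a codec can represent a piece of text.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

print(try_encode("shrimp", "ascii"))  # b'shrimp'
print(try_encode("虾皮", "ascii"))     # None
print(try_encode("虾皮", "utf-8"))     # b'\xe8\x99\xbe\xe7\x9a\xae'
```

ASCII covers the English text but not the Chinese one, while UTF-8 handles both.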

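Python 2 has two string types, unicode and str. Note that the two cannot be concatenated safely: Python 2 implicitly decodes the str operand as ASCII, so the operation raises UnicodeDecodeError as soon as the str contains non-ASCII bytes.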

A unicode string represents text as Unicode code points, while a str holds raw binary data, i.e. bytes (in Python 2 the str type is the bytes type). Either way, what is finally stored on disk is binary.

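# -*- coding: utf-8 -*-
a = u"虾皮"  # a string literal prefixed with "u" is a unicode string
b = "虾皮"   # a str (byte) string

def main():
    a = u"虾皮"
    b = "虾皮"
    print "type of a: " + str(type(a))
    print "type of b: " + str(type(b))
    print isinstance(b, bytes)  # in Python 2, bytes is an alias of str
    print isinstance(b, str)

main()

Output:

type of a: <type 'unicode'>
type of b: <type 'str'>
True
True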

Encoding and decoding
In Python 2, the encode method encodes a string, converting unicode to str, and the decode method decodes one, converting str to unicode.

# -*- coding: utf-8 -*-
def main():
    a = u"虾皮"
    c = "虾皮"
    b = a.encode(encoding="gbk")     # encode the unicode string as GBK
    d = a.encode(encoding="utf-8")   # encode the unicode string as UTF-8
    e = a.encode(encoding="utf-16")  # encode the unicode string as UTF-16
    print u"plaintext: " + a
    print "unicode code points: " + repr(a) + " type: " + str(type(a))
    print "gbk encoding: " + repr(b) + " type: " + str(type(b))
    print "original str: " + repr(c) + " type: " + str(type(c))
    print "utf-8 encoding: " + repr(d) + " type: " + str(type(d))
    print "utf-16 encoding: " + repr(e) + " type: " + str(type(e))

main()

Output:

plaintext: 虾皮
unicode code points: u'\u867e\u76ae' type: <type 'unicode'>
gbk encoding: '\xcf\xba\xc6\xa4' type: <type 'str'>
original str: '\xe8\x99\xbe\xe7\x9a\xae' type: <type 'str'>
utf-8 encoding: '\xe8\x99\xbe\xe7\x9a\xae' type: <type 'str'>
utf-16 encoding: '\xff\xfe~\x86\xaev' type: <type 'str'>
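
In this example, a is defined as a unicode string and c as a str string. The bytes of c are identical to the UTF-8 result because the source file itself is saved as UTF-8.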

Python3
In Python 3 the default encoding is UTF-8, unlike the ASCII default in Python 2.

As in Python 2, Python 3 has two string types, str and bytes. A Python 3 str is a Unicode string and corresponds to Python 2's unicode type; bytes is a binary string and corresponds to Python 2's str type.
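
Because str and bytes are distinct types in Python 3, they cannot be mixed in one expression. The following sketch, with illustrative variable names, shows the failure and the usual fix of decoding first.

```python
text = "虾皮"                # str (Unicode)
data = text.encode("utf-8")  # bytes

try:
    text + data              # mixing str and bytes raises TypeError
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decode the bytes first, then concatenate two str values.
combined = text + data.decode("utf-8")

print(mixed_ok)       # False
print(len(combined))  # 4
```

Unlike Python 2, there is no implicit ASCII conversion here, so the mistake surfaces immediately instead of only when non-ASCII data appears.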

Unicode strings

python2

a = u"虾皮"

python3

a = "虾皮"

Byte strings (in Python 3 a bytes literal may contain only ASCII characters, so b = b"中文" is not allowed)

python2

b = "xiapi"

python3

b = b"xiapi"
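
The ASCII-only rule for bytes literals can be checked without crashing the script by compiling the literal from a string. In this sketch, is_valid_source is a hypothetical helper introduced only for the demonstration; to get bytes for Chinese text, encode a str instead of writing a bytes literal.

```python
def is_valid_source(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

print(is_valid_source('b"xiapi"'))  # True
print(is_valid_source('b"中文"'))    # False

# Encode a str to obtain the bytes for non-ASCII text.
chinese_bytes = "中文".encode("utf-8")
print(chinese_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
```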

Encoding and decoding
Python 3 likewise uses the encode method to encode a string (str to bytes) and the decode method to decode one (bytes to str).

def main():
    a = "虾皮"
    c = b"xiapi"
    b = a.encode(encoding="gbk")     # encode the str as GBK
    d = a.encode(encoding="utf-8")   # encode the str as UTF-8
    e = a.encode(encoding="utf-16")  # encode the str as UTF-16
    print("plaintext: " + a)
    print("unicode string: " + repr(a) + " type: " + str(type(a)))
    print("gbk encoding: " + repr(b) + " type: " + str(type(b)))
    print("original bytes: " + repr(c) + " type: " + str(type(c)))
    print("utf-8 encoding: " + repr(d) + " type: " + str(type(d)))
    print("utf-16 encoding: " + repr(e) + " type: " + str(type(e)))

main()

Output:

plaintext: 虾皮
unicode string: '虾皮' type: <class 'str'>
gbk encoding: b'\xcf\xba\xc6\xa4' type: <class 'bytes'>
original bytes: b'xiapi' type: <class 'bytes'>
utf-8 encoding: b'\xe8\x99\xbe\xe7\x9a\xae' type: <class 'bytes'>
utf-16 encoding: b'\xff\xfe~\x86\xaev' type: <class 'bytes'>
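
The example above only exercises encode. As a complement, here is a small sketch of decode, including what happens when the wrong codec is used. The variable names are illustrative, and errors="replace" is passed only so the mismatched decode can never raise.

```python
text = "虾皮"
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# Decoding with the codec that produced the bytes restores the text.
print(gbk_bytes.decode("gbk") == text)     # True
print(utf8_bytes.decode("utf-8") == text)  # True

# Decoding with a different codec may silently yield the wrong text (mojibake).
mojibake = gbk_bytes.decode("utf-8", errors="replace")
print(mojibake == text)  # False

# With a strict codec that cannot represent the bytes, decoding raises.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)  # True
```

Bytes carry no record of the codec that produced them, so the encoding has to be known or agreed on separately.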

Golang
Go uses UTF-8 by default: source files are UTF-8 text, and string literals hold UTF-8 bytes.

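In Go, a string value can be converted to []rune to split it into a sequence of characters, which is the human-readable view, or to []byte to split it into a sequence of bytes, which is the machine-readable view.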

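rune is a basic Go type; one rune value represents one Unicode character (code point). A Chinese character typically occupies three bytes when encoded as UTF-8.

rune is an alias for int32, so it occupies 4 bytes and can always hold any Unicode code point, including characters whose UTF-8 encoding takes up to 4 bytes.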

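For example, the string "BGBial 人生" splits with []rune into a sequence of 9 human-readable characters and with []byte into a sequence of 13 machine-readable bytes.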

Note that Go's built-in len function returns the byte length of a string, not its character count.

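If the string contains Chinese characters, count the characters with one of the following instead:

bytes.Count() (with an empty separator it returns the rune count plus 1)
strings.Count() (with an empty substring it returns the rune count plus 1)
converting the string to []rune and calling len on the result
utf8.RuneCountInString()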
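
The byte count versus character count distinction can be cross-checked in Python, the language used for the rest of this article. This is only an analogy, not Go code: byte_count plays the role of Go's len(s), and char_count plays the role of utf8.RuneCountInString(s).

```python
s = "BGBial 人生"

char_count = len(s)                  # number of Unicode code points
byte_count = len(s.encode("utf-8"))  # number of UTF-8 bytes

print(char_count)  # 9
print(byte_count)  # 13
```

The 7 ASCII characters take one byte each and the 2 Chinese characters take three bytes each, which gives 7 + 6 = 13 bytes.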