Python | How to Solve String Encoding Problems

一位代码

Published on 2024-08-20 22:32:24


Article link: https://blog.csdn.net/lhjcsdnyl/article/details/141369748


Python has two common string types: str and bytes.
str represents Unicode text; bytes represents binary data.
Convert between the two with the encode() and decode() methods.

I. The encode() and decode() Methods

(1) The encode() method

encode() encodes a str into bytes. Syntax:

str.encode([encoding="utf-8"][,errors="strict"])

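Parameter	Meaning
str	The string to convert
encoding = "utf-8"	Optional. The character encoding to use; utf-8 and gb2312 are common choices. Defaults to utf-8. When passing only the encoding, you can write: str.encode('utf-8')
errors = "strict"	How encoding errors are handled. strict: raise an exception on a character that cannot be encoded; ignore: drop such characters; replace: substitute "?" for them; xmlcharrefreplace: substitute an XML character reference. Defaults to strict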
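As a small illustration of the errors values in the table above, the sketch below encodes a mixed ASCII and Chinese string to ascii, which cannot represent the Chinese characters. The sample string and variable names are made up for this example.

```python
text = "Python编码"

# strict (the default): an unencodable character raises UnicodeEncodeError
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = type(exc).__name__

# ignore: unencodable characters are silently dropped
ignored = text.encode("ascii", errors="ignore")

# replace: each unencodable character becomes "?"
replaced = text.encode("ascii", errors="replace")

# xmlcharrefreplace: each unencodable character becomes &#NNNN;
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")

print(strict_error)  # UnicodeEncodeError
print(ignored)       # b'Python'
print(replaced)      # b'Python??'
print(xml_refs)      # b'Python&#32534;&#30721;'
```

Note that ignore and replace lose information, so they are best kept for logging or display rather than for data that must round-trip.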

(2) The decode() method

decode() decodes bytes into a str. Syntax:

bytes.decode([encoding="utf-8"][,errors="strict"])

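Parameter	Meaning
bytes	The binary data to convert
encoding = "utf-8"	The character encoding to use when decoding. Defaults to utf-8. When passing only the encoding, you can write: bytes.decode('utf-8')
errors = "strict"	How decoding errors are handled, as for encode(): strict, ignore, or replace. When decoding, replace inserts U+FFFD (the replacement character) rather than "?", and xmlcharrefreplace is not available because it applies only to encoding. Defaults to strict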
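To make the decode-side error handling concrete, here is a minimal sketch that decodes UTF-8 data whose last character has been cut off mid-sequence. The sample data and variable names are illustrative only.

```python
data = "当时".encode("utf-8")   # six bytes, three per character
truncated = data[:-1]            # the second character is now incomplete

# strict (the default): invalid or incomplete input raises UnicodeDecodeError
try:
    truncated.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = type(exc).__name__

# ignore: the undecodable bytes are dropped
ignored = truncated.decode("utf-8", errors="ignore")

# replace: the undecodable bytes become U+FFFD, the replacement character
replaced = truncated.decode("utf-8", errors="replace")

print(strict_error)   # UnicodeDecodeError
print(ignored)        # 当
print(repr(replaced))
```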

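Note: Python 3.x uses UTF-8 by default, both for source files and for encode() and decode(), which largely resolves the garbled Chinese text seen in Python 2.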

II. Common Conversion Scenarios

(1) Encoding a str into bytes

a_str = '当时只道是寻常'
a_stru = a_str.encode('utf-8')
a_struc = a_str.encode('unicode-escape')
print('str encoded to bytes (utf-8):\n', a_stru)
print('str encoded to bytes (unicode-escape):\n', a_struc)
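For reference, the following sketch checks what those encodings produce for the same string and compares them with gb2312, which the parameter table mentions as another common codec. The variable names other than a_str are introduced here for illustration.

```python
a_str = '当时只道是寻常'

utf8_bytes = a_str.encode('utf-8')             # 3 bytes per character here
gb_bytes = a_str.encode('gb2312')              # 2 bytes per character here
escaped = a_str.encode('unicode-escape')       # ASCII-only \uXXXX sequences

print(len(a_str), len(utf8_bytes), len(gb_bytes))  # 7 21 14
print(escaped)

# Decoding with the same codec that was used for encoding restores the text
assert utf8_bytes.decode('utf-8') == a_str
assert gb_bytes.decode('gb2312') == a_str
assert escaped.decode('unicode-escape') == a_str
```

The same text yields different byte sequences under different codecs, which is why the codec used to decode has to match the one used to encode.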


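(2) Decoding bytes into a str

Before decoding, first determine what type the target data actually is,
because not every string that starts with \x or \u is of type bytes.
In Python 3, string literals are Unicode (str) by default; only literals with the b prefix are bytes.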

b_str = '\xe5\xbd\x93'
b_str1 = b'\xe5\xbd\x93'
print('type of b_str:', type(b_str))
print('type of b_str1:', type(b_str1))
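One way to apply the "check the type first" advice is a small helper that decodes only when it is given bytes. This is a sketch under stated assumptions: ensure_text is a hypothetical name, not a standard library function, and it assumes the bytes are UTF-8 unless told otherwise.

```python
def ensure_text(value, encoding='utf-8'):
    """Return value as str, decoding it only if it is bytes.

    Hypothetical helper for illustration; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError('expected str or bytes, got ' + type(value).__name__)


print(ensure_text(b'\xe5\xbd\x93'))  # 当
print(ensure_text('当'))             # 当
```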

Meaning of common Python string prefixes

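Prefix	Meaning
u	u'当时只道是寻常': the u prefix marks the string as Unicode (str). In Python 2 it is placed before strings containing Chinese characters to prevent garbled text caused by encoding problems. In Python 3 all strings are Unicode (str) by default, so it can be omitted.
r	r'当时只道是寻常\n\n': the r prefix marks a raw string, in which \ is not an escape character but an ordinary character. Commonly used for file paths.
b	b'\xe5\xbd\x93': the b prefix marks the literal as bytes. A bytes literal may contain only ASCII characters, so non-ASCII bytes are written as \x escapes; writing Chinese characters directly, as in b'当时只道是寻常', is a SyntaxError. The prefix is needed in Python 3, where strings are Unicode (str) by default. In Python 2, strings are already bytes, so it can be omitted.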
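The prefixes in the table can be checked directly in Python 3. The sketch below uses made-up variable names and only built-in functions.

```python
u_text = u'当时'      # same as a plain literal in Python 3
plain = '当时'
print(type(u_text), u_text == plain)

escaped = '\n'        # one character: a newline
raw = r'\n'           # two characters: a backslash and the letter n
print(len(escaped), len(raw))

as_bytes = b'\xe5\xbd\x93'   # three bytes
as_str = '\xe5\xbd\x93'      # three characters: U+00E5, U+00BD, U+0093
print(type(as_bytes), type(as_str))
print(as_bytes[0], ord(as_str[0]))   # both 229, but the types differ
print(as_bytes == as_str)            # bytes and str never compare equal
```

The last two lines show why a \x-looking value still needs its type checked: the str and the bytes hold the same numbers but are different kinds of object.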

Once the type of the target string is known, it can be decoded as needed.
The common cases are as follows.

1. Example 1

The standard use of decode() is to decode a bytes object.

b_str = b'\xe5\xbd\x93\xe6\x97\xb6\xe5\x8f\xaa\xe9\x81\x93\xe6\x98\xaf\xe5\xaf\xbb\xe5\xb8\xb8'
b_str1 = b'\\u5f53\\u65f6\\u53ea\\u9053\\u662f\\u5bfb\\u5e38'
print('b_str decoded:', b_str.decode('utf-8'))
print('b_str1 decoded:', b_str1.decode('unicode_escape'))


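2. Example 2

Sometimes a str whose content starts with \x or \u needs to be decoded. This is common with text scraped by a web crawler.
Calling decode() directly on a str raises an error, because str has no decode() method in Python 3: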

str1 = '\xe5\xbd\x93'
# Raises AttributeError: 'str' object has no attribute 'decode'
print(str1.decode('utf-8'))

Solution: first encode the str into bytes, then decode it.

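# Each character here holds one byte value of the UTF-8 encoded text
str1 = '\xe5\xbd\x93\xe6\x97\xb6\xe5\x8f\xaa\xe9\x81\x93\xe6\x98\xaf\xe5\xaf\xbb\xe5\xb8\xb8'
# Literal backslash-u escape sequences stored as text
str2 = '\\u5f53\\u65f6\\u53ea\\u9053\\u662f\\u5bfb\\u5e38'
# iso-8859-1 maps code points 0-255 to the same byte values; decode() defaults to utf-8
str1j = str1.encode('iso-8859-1').decode()
# unicode_escape interprets the \uXXXX sequences
str2j = str2.encode().decode('unicode_escape')
print('str1 decoded:', str1j)
print('str2 decoded:', str2j)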
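The two fixes above can be wrapped in small helpers for reuse. This is a sketch, not the author's code: the function names are hypothetical, and the second helper encodes with ascii rather than the default utf-8 on the assumption that the input contains only ASCII escape text, so that unexpected non-ASCII input fails loudly instead of being silently garbled.

```python
def bytes_in_str_to_text(text, encoding='utf-8'):
    """Recover text from a str whose characters are really byte values (0-255).

    Hypothetical helper: iso-8859-1 turns each code point back into the same byte.
    """
    return text.encode('iso-8859-1').decode(encoding)


def unicode_escapes_to_text(text):
    """Interpret literal \\uXXXX sequences in an ASCII-only str.

    Hypothetical helper: raises UnicodeEncodeError if text is not pure ASCII.
    """
    return text.encode('ascii').decode('unicode_escape')


print(bytes_in_str_to_text('\xe5\xbd\x93\xe6\x97\xb6'))   # 当时
print(unicode_escapes_to_text('\\u5f53\\u65f6'))           # 当时
```

If the str contains characters above U+00FF, bytes_in_str_to_text raises UnicodeEncodeError, which is a useful signal that the input was ordinary text rather than disguised bytes.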