Transcoding Chinese text in Python with decode and encode

Latest recommended article published on 2024-06-28 21:30:26

cats_miao


"""
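Python represents text internally as Unicode, so converting between encodings normally uses Unicode as the intermediate form: first decode the string from its source encoding into Unicode, then encode the Unicode result into the target encoding. The method calls below follow Python 2, where an encoded string is a str and decoded text is a unicode object; in Python 3 the same roles are played by bytes and str.

decode converts a string in some other encoding into Unicode. For example, str1.decode('gb2312') converts the gb2312-encoded string str1 into Unicode.

encode converts Unicode into a string in some other encoding. For example, str2.encode('gb2312') converts the Unicode string str2 into gb2312.

So before transcoding, always work out which encoding the string is actually in. Decode it from that encoding into Unicode, and only then encode it into the target encoding.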
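
The decode-then-encode route described above can be sketched as follows. This is only a minimal sketch and assumes Python 3, where encoded data is bytes and text is str, so decode is called on bytes and encode on str. The helper name transcode and the sample text are illustrative and do not come from the original post.

```python
def transcode(data, src_encoding, dst_encoding):
    # Hypothetical helper: always go through Unicode as the intermediate form.
    text = data.decode(src_encoding)      # source encoding -> Unicode
    return text.encode(dst_encoding)      # Unicode -> target encoding


# Sample input: two Chinese characters encoded as GB2312 (2 bytes each).
gb_bytes = "中文".encode("gb2312")

# The same text re-encoded as UTF-8 (3 bytes per character here).
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

print(len(gb_bytes), len(utf8_bytes))     # 4 6
print(utf8_bytes.decode("utf-8"))         # 中文
```

If the source encoding is guessed wrongly, the decode step either raises UnicodeDecodeError or silently produces the wrong characters, which is why the source encoding has to be known first.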



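Conclusion: how to fix the bug that appears when a regular expression contains Chinese
1. Open the file
import codecs
myfile = codecs.open("right.html", "r")
There is no need to specify an encoding here! Without one, read() returns the raw encoded string, which is decoded explicitly in the next step.

2. Decode the content
raw = myfile.read()                             # renamed from str to avoid shadowing the built-in
content = raw.replace("\n", " ")
content = content.decode('utf-8', 'ignore')   # decode the UTF-8 string into unicode, skipping invalid bytes

3. Decode the regular expression
regex3 = regex3.decode('utf-8', 'ignore')    # decode the pattern from UTF-8 into unicode as well

4. Apply the regular expression
import re
p = re.compile(regex3)
results = p.findall(content)
Because the pattern and the content are now both unicode, the match works.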
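
The whole procedure can be tried without a file. This is a sketch under stated assumptions: it targets Python 3, where bytes stands in for the encoded string that read() returns in the steps above. The sample HTML, the pattern, and the names raw and pattern_bytes are made up for illustration. A stray invalid byte is appended to show what the 'ignore' error handler does.

```python
import re

# Assumed sample data standing in for the contents of right.html.
# The trailing b"\xff" is not valid UTF-8 and is dropped by "ignore".
raw = "<p>价格：100元</p>\n<p>价格：250元</p>".encode("utf-8") + b"\xff"

# Decode the content into Unicode, skipping undecodable bytes.
content = raw.decode("utf-8", "ignore").replace("\n", " ")

# Decode the pattern the same way, so pattern and text are both Unicode.
pattern_bytes = r"价格：(\d+)元".encode("utf-8")
regex3 = pattern_bytes.decode("utf-8", "ignore")

p = re.compile(regex3)
results = p.findall(content)
print(results)   # ['100', '250']
```

Note that 'ignore' discards bad bytes silently. It keeps the script running, but it can also hide a wrongly guessed encoding, so strict decoding is the safer default when the input is expected to be clean.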

"""