Fixing UnicodeEncodeError with Python's codecs module

weixin_34293246

Original link: https://my.oschina.net/lCQ3FC3/blog/270577

While using Python to extract the contents of the table tag from an HTML file, I ran into UnicodeEncodeError: 'gbk' codec can't encode character u'\xc7' in position 2: illegal multibyte sequence.

The code that was run:

# -*- coding: utf-8 -*-
# Python 2, BeautifulSoup 3

from BeautifulSoup import BeautifulSoup

html_tags = open('html_tags_su.txt', 'r').read()  # read the raw HTML (a byte string)
soup = BeautifulSoup(html_tags)  # create the soup
table = soup.find('table')  # find the table tag

table_tags = open('table_tags.txt', 'w')
table_tags.write(table.text.encode('gbk'))  # the UnicodeEncodeError is raised here
table_tags.close()

If 'gbk' is replaced with 'gb18030', the error goes away, but the output is garbled.
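
One plausible explanation for both symptoms, sketched below in Python 3, is that the GBK bytes were decoded with the wrong codec before being re-encoded. This is an assumption: the original post does not say which encoding BeautifulSoup guessed, and Latin-1 and the sample string are stand-ins chosen only for illustration.

```python
# Sample text and the wrong codec (Latin-1) are hypothetical stand-ins.
text = "中文表格"
raw = text.encode("gbk")  # the bytes as they would be stored in a GBK file

# Simulate a wrong guess: treat the GBK bytes as Latin-1.
misdecoded = raw.decode("latin-1")

# Re-encoding the mis-decoded text as GBK fails, because GBK has no
# mapping for most accented Latin-1 letters.
try:
    misdecoded.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError as exc:
    gbk_failed = True
    print("gbk:", exc)

# GB18030 can represent every Unicode character, so this call succeeds,
# but the resulting bytes no longer match the original file content.
garbled = misdecoded.encode("gb18030")
print("gb18030 output equals original bytes:", garbled == raw)

# Decoding with the correct codec from the start avoids both problems.
decoded = raw.decode("gbk")
round_trip = decoded.encode("gbk")
```

If this is what happened, switching the output codec only hides the error; the fix has to be made where the file is read.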

After many attempts, I came across the codecs module.

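The codecs module is dedicated to encoding conversion. Once it was used in the original code, the UnicodeEncodeError was resolved. The likely reason is that codecs.open decodes the GBK file to unicode as it is read, so BeautifulSoup no longer has to guess the encoding of a raw byte string.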

The corrected code:

# -*- coding: utf-8 -*-
# Python 2, BeautifulSoup 3

import codecs
from BeautifulSoup import BeautifulSoup

# Decode the file as GBK while reading, so html_tags is a unicode string
html_tags = codecs.open('html_tags_su.txt', 'r', 'GBK').read()
soup = BeautifulSoup(html_tags)  # create the soup
table = soup.find('table')  # find the table tag

table_tags = open('table_tags.txt', 'w')
table_tags.write(table.text.encode('gbk'))  # encode back to GBK for the output file
table_tags.close()
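
The same decode-on-read, encode-on-write pattern can be sketched in Python 3 using only the standard library. This is a minimal sketch, not the author's code: the TableText class is a hypothetical stand-in for BeautifulSoup, extract_table_text is an illustrative helper, and the sample HTML is made up so the example runs on its own.

```python
import codecs
import os
import tempfile
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collects text inside <table>...</table> (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <table> tags currently open
        self.chunks = []  # text fragments found inside a table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_table_text(src_path, dst_path, encoding="gbk"):
    # Decode while reading, so the parser receives text rather than raw bytes.
    with codecs.open(src_path, "r", encoding) as src:
        parser = TableText()
        parser.feed(src.read())
        parser.close()
    text = "".join(parser.chunks)
    # Encode while writing, using the same codec.
    with codecs.open(dst_path, "w", encoding) as dst:
        dst.write(text)
    return text


# Hypothetical sample data so the sketch is self-contained.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "html_tags_su.txt")
dst_path = os.path.join(workdir, "table_tags.txt")
sample_html = "<html><p>忽略</p><table><tr><td>我爱北京</td></tr></table></html>"
with open(src_path, "wb") as f:
    f.write(sample_html.encode("gbk"))

result = extract_table_text(src_path, dst_path)
print(result)
```

In Python 3 the built-in open also accepts an encoding argument, so open(path, encoding="gbk") is the more common spelling there; codecs.open is kept here to mirror the fix above.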

Output of the run:

---------------------------------------------------------------------------------------------------------

The codecs module is mainly used in two ways. Example code for each follows.

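# -*- coding: utf-8 -*-
import codecs

# Look up the gb2312 codec
look = codecs.lookup("gb2312")
# Look up the utf-8 codec
look2 = codecs.lookup("utf-8")

a = "我爱北京"  # in Python 2 this is a byte string (UTF-8, assuming the file is saved as UTF-8)

# Decode with the codec that matches the bytes (utf-8 here).
# b is a tuple (b[0], b[1]): b[0] is the unicode string, b[1] is the number of bytes consumed.
b = look2.decode(a)
# Encode the unicode string as gb2312; c is likewise a (bytes, length) tuple.
c = look.encode(b[0])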
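
For readers on Python 3, the lookup-based approach could look roughly like the sketch below. The variable names are illustrative, and the sample string is the one used above. The point it shows is that the encode and decode functions returned by codecs.lookup both yield a (result, length consumed) tuple.

```python
import codecs

# codecs.lookup returns a CodecInfo object exposing encode and decode functions.
gb2312 = codecs.lookup("gb2312")
utf8 = codecs.lookup("utf-8")

text = "我爱北京"

# encode returns (bytes, number of characters consumed)
utf8_bytes, consumed_chars = utf8.encode(text)

# decode returns (str, number of bytes consumed)
decoded, consumed_bytes = utf8.decode(utf8_bytes)

# Converting between encodings always goes through the decoded text.
gb_bytes, _ = gb2312.encode(decoded)

print(consumed_chars, consumed_bytes, len(gb_bytes))
```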

# -*- coding: utf-8 -*-
import codecs

# codecs.open takes the encoding of the file and decodes to unicode automatically on read
bfile = codecs.open("dddd.txt", 'r', "big5")
# bfile = open("dddd.txt", 'r')

ss = bfile.read()
bfile.close()
# Print the decoded result. With the built-in open (Python 2), ss would be raw big5 bytes
# and would usually print as garbled text.
print ss, type(ss)
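
The difference between the two ways of opening the file can be seen in a small Python 3 sketch. The temporary file and its contents are made up for illustration; only the file name dddd.txt and the big5 codec come from the example above.

```python
import codecs
import os
import tempfile

# Create a hypothetical Big5-encoded file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "dddd.txt")
original = "我愛北京"
with open(path, "wb") as f:
    f.write(original.encode("big5"))

# Opening with the right codec yields decoded text.
with codecs.open(path, "r", "big5") as bfile:
    ss = bfile.read()

# Opening in binary mode yields the raw Big5 bytes, which still need decoding.
with open(path, "rb") as raw_file:
    raw = raw_file.read()

print(ss, type(ss))
print(raw, type(raw))
```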

Reposted from: https://my.oschina.net/lCQ3FC3/blog/270577