The codecs module in Python: solving UnicodeEncodeError with the codecs module

Latest recommended article published on 2023-11-14 11:40:04

weixin_39695323


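Tags: codecs module in Python

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39695323/article/details/111848841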

While using Python to extract the contents of the table tag from an HTML file, I ran into UnicodeEncodeError: 'gbk' codec can't encode character u'\xc7' in position 2: illegal multibyte sequence.

The code that was run:

# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

# The built-in open() returns undecoded bytes in Python 2
html_tags = open('html_tags_su.txt', 'r').read()  # get the html detail
soup = BeautifulSoup(html_tags)  # create a soup
table = soup.find('table')  # find the table tag

table_tags = open('table_tags.txt', 'w')
table_tags.write(table.text.encode('gbk'))  # the UnicodeEncodeError is raised here
table_tags.close()

[Screenshot of the error]

Replacing 'gbk' with 'gb18030' makes the error go away, but the output is garbled.

[Screenshot of the garbled output]
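One plausible explanation for both symptoms, offered as an assumption because the original input file is not available: the GBK bytes were decoded with a single-byte encoding such as Latin-1 before being encoded again. The sketch below reproduces that situation with a sample string chosen purely for illustration. It is written to run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sample text (an assumption for illustration): the four characters of "我爱北京".
text = u"\u6211\u7231\u5317\u4eac"

# The bytes as they would sit in a GBK-encoded file.
raw = text.encode("gbk")

# Simulate a wrong guess: treat every byte as one Latin-1 character.
wrong = raw.decode("latin-1")

# Encoding the mis-decoded text as GBK fails, because GBK has no
# representation for several of these Latin-1 characters.
try:
    wrong.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError:
    gbk_failed = True

# gb18030 covers every Unicode code point, so this call succeeds,
# but the resulting bytes are not the original GBK data.
garbled = wrong.encode("gb18030")
```

Under this assumption, switching to gb18030 only hides the problem: the encode step no longer fails, yet the text being encoded was already wrong.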

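After many attempts, I came across the codecs module.

The codecs module is dedicated to encoding conversion. Reading the file through codecs.open with the file's actual encoding (GBK) hands BeautifulSoup already-decoded Unicode text instead of raw bytes whose encoding it has to guess. With codecs added to the original code, the UnicodeEncodeError went away.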
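As a minimal sketch of that idea, the snippet below writes a small GBK file into a temporary directory, reads it back through codecs.open, and writes it out again. The file names and the sample markup are hypothetical, not taken from the original post. It runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import shutil
import tempfile

# Hypothetical sample markup standing in for the real HTML file.
text = u"<table><tr><td>\u6211\u7231\u5317\u4eac</td></tr></table>"

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "html_sample.txt")   # hypothetical name
dst_path = os.path.join(tmp_dir, "table_sample.txt")  # hypothetical name

# Create a GBK-encoded input file.
with codecs.open(src_path, "w", encoding="gbk") as f:
    f.write(text)

# Reading with the matching encoding yields Unicode text directly.
with codecs.open(src_path, "r", encoding="gbk") as f:
    html = f.read()

# Writing through codecs.open encodes on the way out, so no manual
# .encode('gbk') call is needed.
with codecs.open(dst_path, "w", encoding="gbk") as f:
    f.write(html)

# Inspect the raw bytes that ended up on disk.
with open(dst_path, "rb") as f:
    raw_out = f.read()

shutil.rmtree(tmp_dir)
```

Because the text is decoded with the same encoding it was stored in, encoding it back to GBK succeeds and reproduces the original bytes.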

The corrected code:

# -*- coding: utf-8 -*-
import codecs
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

# codecs.open decodes the GBK file into unicode while reading
html_tags = codecs.open('html_tags_su.txt', 'r', 'GBK').read()  # get the html detail
soup = BeautifulSoup(html_tags)  # create a soup
table = soup.find('table')  # find the table tag

table_tags = open('table_tags.txt', 'w')
table_tags.write(table.text.encode('gbk'))  # now encodes without error
table_tags.close()

[Screenshot of the output after the fix]

---------------------------------------------------------------------------------------------------------

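The codecs module is mainly used in two ways. Example code for each follows: the first looks up codec objects with codecs.lookup, and the second opens a file with codecs.open.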

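# -*- coding: utf-8 -*-
import codecs

# Look up the gb2312 codec
look = codecs.lookup("gb2312")
# Look up the utf-8 codec
look2 = codecs.lookup("utf-8")

# In Python 2 this literal is a UTF-8 byte string, because the source file is saved as UTF-8
a = "我爱北京"

# Decode with the codec that matches the bytes (utf-8 here).
# b is a tuple (b[0], b[1]): b[0] is the decoded unicode string, b[1] is the number of bytes consumed.
b = look2.decode(a)

# Re-encode as gb2312; c[0] holds the gb2312 bytes
c = look.encode(b[0])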
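The tuple returned by a codec's encode and decode functions is easy to overlook. The sketch below checks its shape while converting the same sample text from UTF-8 to GB2312. The variable names are illustrative, and the code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs

gb = codecs.lookup("gb2312")
utf8 = codecs.lookup("utf-8")

# The sample text from the example above, written with escapes so the
# snippet does not depend on the encoding of the source file.
text = u"\u6211\u7231\u5317\u4eac"

# encode() returns (encoded bytes, number of characters consumed).
utf8_bytes, n_chars = utf8.encode(text)

# decode() returns (decoded text, number of bytes consumed).
decoded, n_bytes = utf8.decode(utf8_bytes)

# Converting between encodings means decoding with one codec and
# encoding with the other.
gb_bytes, _ = gb.encode(decoded)
```

Each of the four characters takes three bytes in UTF-8 and two bytes in GB2312, which is why the two byte strings differ in length.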

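# -*- coding: utf-8 -*-
import codecs

# codecs.open lets you name the encoding of the file being opened.
# The data is converted to unicode automatically as it is read.
bfile = codecs.open("dddd.txt", 'r', "big5")
# bfile = open("dddd.txt", 'r')
ss = bfile.read()
bfile.close()

# Print the converted result. Had the file been opened with the built-in open,
# ss would hold the raw Big5 bytes, which would most likely display as garbled text.
print ss, type(ss)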
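To make the difference between the two ways of opening the file concrete, the sketch below reads the same Big5 file once as raw bytes and once through codecs.open. The file name and sample text are hypothetical. It runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import shutil
import tempfile

# Hypothetical sample text: the two characters of "台北".
text = u"\u53f0\u5317"

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "big5_sample.txt")  # hypothetical name

# Create a Big5-encoded file.
with open(path, "wb") as f:
    f.write(text.encode("big5"))

# Binary mode returns the bytes exactly as stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# codecs.open decodes those bytes into Unicode text while reading.
with codecs.open(path, "r", encoding="big5") as f:
    decoded = f.read()

shutil.rmtree(tmp_dir)
```

The raw bytes only turn into readable text once they are decoded with the encoding the file was written in, and codecs.open does that step during the read.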
