Fixing garbled text when scraping GB2312-encoded web pages with a Python 3.x crawler

Latest recommended article published 2021-07-21 10:34:07

m0_37664103

Category: Python Crawler Learning. Tags: python, garbled text

Article link: https://blog.csdn.net/m0_37664103/article/details/104564655

Scraping a GB2312-encoded website with Python 3.x easily produces garbled text. The original code is as follows:

import requests
res = requests.get('http://www.jjwxc.net/onebook.php?novelid=1231454&chapterid=1')
res.encoding = res.apparent_encoding  # use the encoding detected from the response body
novel = res.text
res.close()
with open('test1.txt', mode='a+', encoding='utf-8') as file:
    file.write(novel)  # writes a txt file; the text is not garbled

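The code above produces a txt file containing the page source, and the text is not garbled. However, whether I rename the txt file to .html or write an html file directly with the code below, the page displays as garbled text when opened in a browser.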

import requests
res = requests.get('http://www.jjwxc.net/onebook.php?novelid=1231454&chapterid=1')
res.encoding = res.apparent_encoding  # use the encoding detected from the response body
novel = res.text
res.close()
with open('test1.html', mode='a+', encoding='utf-8') as file:
    file.write(novel)  # writes an html file directly; the page shows garbled text
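
A likely reason the html file looks garbled while the txt file does not: a browser tends to trust the charset declared inside the page itself. If the page's `<meta>` tag declares gb2312 but the file was saved as UTF-8, the bytes get decoded with the wrong codec. The sketch below reproduces this offline. The sample HTML string is hypothetical and is not the actual markup of the page.

```python
# Hypothetical page source; the real page's markup is not reproduced here.
html = '<html><head><meta charset="gb2312"></head><body>小说</body></html>'

# What the script above does: store the text as UTF-8 bytes.
utf8_bytes = html.encode('utf-8')

# What a browser honoring the meta tag would do: decode those bytes as GB2312/GBK.
# errors='replace' keeps the demo from raising on byte pairs GBK cannot map.
as_seen = utf8_bytes.decode('gbk', errors='replace')

# Storing the text in the declared encoding round-trips cleanly.
gb_bytes = html.encode('gb2312')
round_trip = gb_bytes.decode('gb2312')

print('小说' in as_seen)      # the Chinese text is no longer readable
print(round_trip == html)    # the text survives unchanged
```

A txt file carries no such declaration, so a text editor usually detects UTF-8 on its own, which would explain why only the html file is affected.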

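I consulted many resources online, but they only kept the generated txt file from being garbled; the html file was still garbled. After some experimenting, I found that adding the following line makes the html file come out correctly: res8 = novel.encode(res.apparent_encoding, "ignore").decode(res.apparent_encoding, "ignore")
The final working code is as follows: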

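import requests
res = requests.get('http://www.jjwxc.net/onebook.php?novelid=1231454&chapterid=1')
res.encoding = res.apparent_encoding  # use the encoding detected from the response body
novel = res.text
res.close()
# Added line: encode and decode the page source with the detected encoding,
# dropping characters that cannot be encoded, to avoid garbled text
res8 = novel.encode(res.apparent_encoding, "ignore").decode(res.apparent_encoding, "ignore")
# No encoding argument here, so the file is written in the platform's default encoding
with open('test1.html', mode='a+') as file:
    file.write(res8)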
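
Because the final script calls open() without an encoding argument, its result depends on the platform's default encoding. A more portable variant might pass the encoding explicitly. The sketch below assumes the page declares gb2312 in its `<meta>` tag; `save_html` is a hypothetical helper that is not part of the original script, and the sample string stands in for `res.text` so the demo runs offline.

```python
import os
import tempfile


def save_html(text, path, encoding='gb2312'):
    """Write text to path in the given encoding, dropping unencodable characters.

    Hypothetical helper for illustration only.
    """
    # mode='w' overwrites, so re-running does not append a second copy of the page.
    # errors='ignore' plays the same role as "ignore" in the encode/decode round trip.
    with open(path, mode='w', encoding=encoding, errors='ignore') as file:
        file.write(text)


# Stand-in for res.text, including a character GB2312 cannot encode.
sample = '<meta charset="gb2312">晋江\ufffd文学'
demo_path = os.path.join(tempfile.mkdtemp(), 'test1.html')
save_html(sample, demo_path)

# Read the raw bytes back and decode them the way a browser honoring the meta tag would.
with open(demo_path, 'rb') as file:
    raw = file.read()

print(raw.decode('gb2312'))
```

In the scraper itself the call would be `save_html(res.text, 'test1.html', res.apparent_encoding)`, assuming the detected encoding matches the one the page declares.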

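Note that without "ignore" the program raises an error:
UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 65885: illegal multibyte sequence
This happens because some characters cannot be encoded in gb2312. Here the offending character is '\ufffd', the Unicode replacement character, which typically appears where bytes could not be decoded and which has no GB2312 representation. Ignoring such characters avoids the error.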
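
The difference between the default strict error handling and "ignore" can be checked on a small string without any network access. This is a minimal sketch; the sample text is made up for illustration.

```python
# '\ufffd' is the Unicode replacement character; GB2312 has no code for it.
text = '第一章\ufffd正文'

try:
    text.encode('gb2312')  # default errors='strict' raises on '\ufffd'
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print('strict encoding failed:', exc.reason)

# errors='ignore' silently drops whatever cannot be encoded.
cleaned = text.encode('gb2312', 'ignore').decode('gb2312', 'ignore')
print(cleaned)
```

Dropping characters is lossy: every '\ufffd' marks a spot where the original character was already lost during decoding, so the cleaned text may be missing a few characters compared with the real page.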
