Decoding fetched pages in a Python crawler

Latest recommended article published on 2023-10-28 10:37:43

ccc1331


Tags: python web

Original link: https://zhuanlan.zhihu.com/p/25095566?refer=zjying2000


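Introduction

While writing a crawler I ran into a problem decoding the pages I had fetched, and it puzzled me for a long time. I am writing it down here so I have it on hand the next time I hit the same issue.

The problem

I was using Python 3 to scrape a novel. After fetching the page from its URL, I called decode("utf-8") on the response and UTF-8 decoding failed with this message: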

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Yet the head of the page's HTML source shows that the encoding really is declared as UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

My original code:

import urllib.request


def test():
    rootUrl = "xxxxxxxxx"  # URL to crawl
    html = urllib.request.urlopen(rootUrl).read()
    # Fails with UnicodeDecodeError when the body is gzip-compressed
    html = html.decode("utf-8")
    print(html)

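It turned out that the server had gzip-compressed the page to speed up transfer. The byte 0x8b at position 1 in the error message is the giveaway: every gzip stream begins with the two bytes 0x1f 0x8b.
To check whether a page is gzip-compressed, open it in the browser, press F12 to open the developer tools, click Network >> All, select the request, and look under Headers at the Response Headers for this line:

Content-Encoding: gzip

If it is present, the response body is gzip-compressed and calling decode on the raw bytes will not work. (The Accept-Encoding: gzip, deflate line under Request Headers only says what the client is willing to accept, not what the server actually sent.)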
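The failure can be reproduced without any network access. The snippet below is a minimal sketch: the sample page and the helper name looks_gzipped are made up for illustration. It compresses a small UTF-8 page, tries to decode the compressed bytes directly, and checks the leading gzip signature bytes.

```python
import gzip

# Every gzip stream starts with these two signature bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data: bytes) -> bool:
    """Return True if the bytes start with the gzip signature (hypothetical helper)."""
    return data[:2] == GZIP_MAGIC


# Stand-in for a page body; a real crawler would get this from the server.
page = "<html>hello</html>".encode("utf-8")
compressed = gzip.compress(page)

try:
    compressed.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    # Same failure as in the crawler: byte 0x8b at position 1.
    error_message = str(exc)

print(error_message)
print(looks_gzipped(compressed), looks_gzipped(page))
```

Checking the first two bytes is a quick way to confirm the diagnosis when the response headers are not at hand.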
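Here are two approaches I found online:
1. Decompress with the gzip library, then decode.
2. Fetch the page with the requests library.
Method 1 only needs this one line (plus import gzip at the top):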

html = gzip.decompress(html).decode("utf-8")

The complete code:

import gzip
import urllib.request


def test():
    rootUrl = "xxxxxxxxx"  # URL to crawl
    html = urllib.request.urlopen(rootUrl).read()
    # Decompress the gzip body first, then decode the bytes as UTF-8
    html = gzip.decompress(html).decode("utf-8")
    print(html)
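One caveat with method 1: gzip.decompress raises an error if the server happens to send the page uncompressed. A possible way around that, sketched here under the assumption that the server reports compression through the Content-Encoding response header, is to decompress only when needed. The names decode_body and fetch are hypothetical helpers, not part of urllib.

```python
import gzip
import urllib.request
from typing import Optional


def decode_body(raw: bytes, content_encoding: Optional[str], charset: str = "utf-8") -> str:
    """Decompress the body only if it is gzip-compressed, then decode it."""
    encoding = (content_encoding or "").strip().lower()
    # Trust the header, but also fall back to the gzip signature bytes.
    if encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch(url: str) -> str:
    """Fetch a page with urllib and return its text (assumes a UTF-8 page)."""
    with urllib.request.urlopen(url) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

With this in place, fetch(rootUrl) works whether or not the server compresses the response.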

Method 2, using the requests library, which decompresses gzip responses automatically:

import requests


def test():
    rootUrl = "xxxxxxxxx"  # URL to crawl
    response = requests.get(rootUrl)
    # Use the encoding detected from the body instead of the header's guess
    response.encoding = response.apparent_encoding
    html = response.text
    print(html)
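If you would rather stay with urllib, the charset does not have to be hard-coded as UTF-8 either. The sketch below assumes the server declares the charset in its Content-Type response header and reads it with get_content_charset(), which the headers object returned by urlopen inherits from email.message.Message. The helper name decode_with_declared_charset is hypothetical, and the headers object here is built by hand purely for the demonstration.

```python
from email.message import Message


def decode_with_declared_charset(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode using the charset from Content-Type, falling back to a default."""
    charset = headers.get_content_charset() or default
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: decode leniently rather than crash.
        return raw.decode(default, errors="replace")


# Hand-built headers standing in for response.headers from urlopen.
headers = Message()
headers["Content-Type"] = "text/html; charset=gbk"
text = decode_with_declared_charset("小说".encode("gbk"), headers)
print(text)
```

Unlike apparent_encoding, this does not guess from the page content, so it is only as reliable as the header the server sends.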
