Scraping compressed web pages with urllib

Latest recommended article published on 2021-12-18 10:45:07

阿农安贵人

Category: Python; Tags: crawler, python

Article link: https://blog.csdn.net/sfw_123817/article/details/80735711

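Recently, while scraping pages with urllib, I ran into a very strange problem: a page that opens fine in a browser or in Postman came back as garbled bytes when fetched with urllib, and decoding it as either UTF-8 or GBK failed.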
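
A quick check that helps with this kind of symptom is to look at the raw bytes before decoding them. The snippet below is only a minimal sketch: the helper names `looks_like_gzip` and `try_decode` are made up for illustration, and the "response body" is simulated locally instead of being fetched. It relies on the fact that a gzip stream always starts with the two bytes 0x1f 0x8b, a sequence that is not valid UTF-8 text.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the raw bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def try_decode(data: bytes, encodings=("utf-8", "gbk")):
    """Return the first successful decoding, or None if every codec fails."""
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None


# Simulate a compressed response body locally (no network involved).
body = gzip.compress("<html>你好</html>".encode("utf-8"))
print(looks_like_gzip(body))  # the body carries the gzip magic number
print(try_decode(body))       # text decoding is attempted on the raw bytes
```

If `looks_like_gzip` reports a match, the bytes are compressed data, not mis-encoded text, so trying other character sets will not help.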

Original code:

import http.cookiejar
import urllib.request

# url: address of the page to fetch (defined elsewhere)
cookies = http.cookiejar.LWPCookieJar()
handlers = [
    urllib.request.HTTPHandler(),
    urllib.request.HTTPSHandler(),
    urllib.request.HTTPCookieProcessor(cookies),
]
opener = urllib.request.build_opener(*handlers)
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36')]

request = urllib.request.Request(url)
text = opener.open(request).read()  # raw bytes; these are what failed to decode

Troubleshooting process

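First I suspected the web page itself, but both the browser and Postman opened it without trouble.

Next I suspected my code. Switching to the requests module worked, and the page came back correctly: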

response = requests.request('GET', url, headers=headers)

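The catch is that my whole crawler framework is built on urllib, and it works for most sites (nearly all of them, in fact). So why does it fail for just a few? Rewriting the whole codebase for this one case is not an option.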

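Digging further:
Postman showed accept-encoding: gzip, which made me wonder whether the server was sending compressed data. (Strictly speaking, Accept-Encoding is the request header with which a client advertises what it can handle; a compressed response is marked by the Content-Encoding response header.) So I tried decompressing the body and changed the code above to: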

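import http.cookiejar
import urllib.request
import zlib

# url: address of the page to fetch (defined elsewhere)
cookies = http.cookiejar.LWPCookieJar()
handlers = [
    urllib.request.HTTPHandler(),
    urllib.request.HTTPSHandler(),
    urllib.request.HTTPCookieProcessor(cookies),
]
opener = urllib.request.build_opener(*handlers)
opener.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'),
    ('Accept-encoding', 'gzip'),  # tell the server that gzip is acceptable
]

request = urllib.request.Request(url)
text = opener.open(request).read()
# 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer
html = zlib.decompress(text, 16 + zlib.MAX_WBITS)  # still bytes; decode with the page's charset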
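
Since most sites in the crawler return uncompressed pages, it may be safer to decompress only when the response really is compressed. The following is a minimal sketch under that assumption, not part of the original fix: the function name `decode_body` is hypothetical, and it takes the value of the Content-Encoding response header as a plain string so it can be tried without a network connection.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Decompress a response body according to its Content-Encoding value.

    Bodies that are not marked (and do not look) compressed are returned unchanged.
    """
    encoding = (content_encoding or "").strip().lower()
    # Trust the header, but also fall back to the gzip magic number,
    # because some servers compress without labelling the response.
    if encoding in ("gzip", "x-gzip") or raw[:2] == b"\x1f\x8b":
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw


# Local demonstration with a simulated compressed body.
sample = gzip.compress(b"<html>ok</html>")
print(decode_body(sample, "gzip"))
```

With the opener from the code above, this could be used as `response = opener.open(request)` followed by `html = decode_body(response.read(), response.headers.get("Content-Encoding", ""))`, so pages that arrive uncompressed keep working as before.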