Scraping Chinese websites: request returns garbled text starting with \x

Latest recommended article published on 2021-12-02 10:19:43

PaulXerxes

Tags: python

Article link: https://blog.csdn.net/m0_37960566/article/details/105809599

1. Garbled Chinese output when scraping web pages with a Python 3 crawler

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
html = response.read()
print(html)

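The code above runs without errors, but any Chinese text in the output appears as escape sequences such as \xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80. That is because response.read() returns a bytes object.
Python 3 prints the raw bytes rather than a readable string, so the bytes have to be converted.
Use str(object, encoding) to decode the bytes into a string: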

str(response.read(),'utf-8')

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
html = str(response.read(), 'utf-8')
print(html)

This fixes the problem of Chinese text not being displayed.
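The conversion can also be checked without any network access. The sketch below reuses the sample bytes quoted above; the variable names are illustrative only. It shows that str(..., 'utf-8') and bytes.decode('utf-8') give the same result.

```python
# Sample bytes copied from the output quoted above:
# the UTF-8 encoding of three Chinese characters.
raw = b'\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80'

# Printing a bytes object shows the escaped form, not the characters.
print(raw)

# Two equivalent ways to turn bytes into a str.
text_a = str(raw, 'utf-8')
text_b = raw.decode('utf-8')

print(text_a)            # the readable Chinese text
print(text_a == text_b)  # both conversions agree
```

Either spelling works; raw.decode('utf-8') is the form used in the later sections.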

2. Decoding error

If decoding raises an error like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

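Look at the bytes printed by the first print call: they start with b'\x1f\x8b\x08', which means the data is gzip-compressed. That is why decoding fails, so the bytes we receive must be decompressed before they are decoded. Modify the code as follows: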

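from urllib import request
from io import BytesIO
import gzip

response = request.urlopen('http://www.baidu.com')
html = response.read()
# Wrap the compressed bytes in a file-like object so GzipFile can read them
buff = BytesIO(html)
f = gzip.GzipFile(fileobj=buff)
# Decompress first, then decode the result as UTF-8
html = f.read().decode('utf-8')
print(html)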
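To see why the error message points at byte 0x8b in position 1, the failure can be reproduced offline. This is a minimal sketch that compresses a made-up string locally instead of fetching a page; the string and variable names are assumptions for illustration.

```python
import gzip

# Hypothetical page content standing in for a real response body.
original = '百度一下'
compressed = gzip.compress(original.encode('utf-8'))

# A gzip stream starts with the magic bytes 0x1f 0x8b, then 0x08 (deflate).
starts_with_magic = compressed[:3] == b'\x1f\x8b\x08'
print(starts_with_magic)

# Decoding the compressed bytes directly fails on the second byte (0x8b),
# because 0x8b is not a valid first byte of a UTF-8 sequence.
try:
    compressed.decode('utf-8')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)

# Decompressing first and decoding afterwards recovers the text.
restored = gzip.decompress(compressed).decode('utf-8')
print(restored == original)
```

gzip.decompress is used here as a shorter equivalent of the BytesIO plus GzipFile combination when the whole body is already in memory.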

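3. Checking whether the response is gzip-compressed

Some sites do not always return the same kind of data, so check whether the response is compressed first and handle it accordingly.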

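from urllib import request
from io import BytesIO
import gzip

USER_AGENT = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36'

# The original snippet is a function body; fetch_html is an assumed name
def fetch_html(url):
    req = request.Request(url, headers={'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip'})
    response = request.urlopen(req)
    if response.getcode() != 200:
        return None
    htmls = response.read()
    fEncode = response.info().get('Content-Encoding')
    if fEncode == 'gzip':
        # Decompress only when the server reports gzip
        buff = BytesIO(htmls)
        f = gzip.GzipFile(fileobj=buff)
        htmls = f.read()
    # Decode in both cases so the function always returns a str
    return htmls.decode('utf-8')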
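The check can be separated from the network call so it is easy to try on local data. The sketch below is one possible shape, not the only one: decode_body and GZIP_MAGIC are hypothetical names, and it assumes the body is UTF-8 unless another charset is passed. It treats the body as gzip when either the Content-Encoding value says so or the body starts with the gzip magic bytes.

```python
import gzip

# First two bytes of every gzip stream.
GZIP_MAGIC = b'\x1f\x8b'

def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Return text from a response body, decompressing it first if it is gzip.

    raw is the bytes from response.read(); content_encoding is the value of
    the Content-Encoding header, or None when the header is absent.
    """
    header_says_gzip = (content_encoding or '').lower() == 'gzip'
    looks_like_gzip = raw[:2] == GZIP_MAGIC
    if header_says_gzip or looks_like_gzip:
        # Raises an error if the header claims gzip but the body is not gzip.
        raw = gzip.decompress(raw)
    return raw.decode(charset)

# Local stand-ins for an uncompressed and a compressed response body.
plain = '中文'.encode('utf-8')
packed = gzip.compress(plain)

text_plain = decode_body(plain)
text_header = decode_body(packed, 'gzip')
text_sniffed = decode_body(packed)  # no header, detected by magic bytes
print(text_plain == text_header == text_sniffed)
```

With a real response, the call would look like decode_body(response.read(), response.info().get('Content-Encoding')).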