Fixing garbled HTML from requests in a Python crawler (gzip, deflate, br)

十一姐

Article link: https://blog.csdn.net/weixin_43411585/article/details/100083362

1. The problem

import requests

# url and proxy are assumed to be defined elsewhere
headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    'accept-Encoding': "gzip, deflate, br",  # "br" tells the server that Brotli is acceptable
    'accept-Language': "zh-CN,zh;q=0.9",
    'connection': "close",
    'Upgrade-Insecure-Requests': '1',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
}
resp = requests.get(url, headers=headers, proxies=proxy, timeout=20)
resp.encoding = 'utf-8'
print(resp.text)

Changing print(resp.text) as shown below still produces garbled output:

headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    'accept-Encoding': "gzip, deflate, br",
    'accept-Language': "zh-CN,zh;q=0.9",
    'connection': "close",
    'Upgrade-Insecure-Requests': '1',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
}
resp = requests.get(url, headers=headers, proxies=proxy, timeout=20)
resp.encoding = 'utf-8'
# Drop characters that a GBK console cannot display, then print the rest
print(resp.text.encode('gbk', 'ignore').decode('gbk'))

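2. Solution

Remove br from 'accept-Encoding': "gzip, deflate, br", or simply comment out that line. requests decodes gzip and deflate responses automatically, but it can decode br (Brotli) only when a Brotli package such as brotli is installed. Without one, the compressed bytes are passed through undecoded, which is why resp.text looks garbled.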

headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    'accept-Encoding': "gzip, deflate",  # "br" removed, so the server will not reply with Brotli
    'accept-Language': "zh-CN,zh;q=0.9",
    'connection': "close",
    'Upgrade-Insecure-Requests': '1',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
}
resp = requests.get(url, headers=headers, proxies=proxy, timeout=20)
resp.encoding = 'utf-8'
# Drop characters that a GBK console cannot display, then print the rest
print(resp.text.encode('gbk', 'ignore').decode('gbk'))
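
As an alternative to editing the header by hand, the sketch below builds the accept-Encoding value from what the current environment can actually decode. It is only an illustration, not part of the original fix: the helper names supported_encodings and decode_body are hypothetical, and it assumes Brotli support comes from the third-party brotli or brotlicffi package.

```python
import gzip
import importlib.util
import zlib


def supported_encodings():
    """Build an Accept-Encoding value from what this environment can decode."""
    encodings = ["gzip", "deflate"]
    # Brotli needs a third-party package; advertise "br" only if one is importable.
    if importlib.util.find_spec("brotli") or importlib.util.find_spec("brotlicffi"):
        encodings.append("br")
    return ", ".join(encodings)


def decode_body(raw, content_encoding):
    """Decompress raw body bytes according to a Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    if encoding == "br":
        try:
            import brotli  # third-party package (assumption: installed)
        except ImportError:
            import brotlicffi as brotli  # alternative third-party package
        return brotli.decompress(raw)
    return raw  # no known compression: return the bytes unchanged
```

With this helper, the header line becomes 'accept-Encoding': supported_encodings(), so br is only requested when it can be decoded. decode_body is meant for raw, still-compressed bytes; requests already decompresses gzip and deflate bodies itself.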