How to fix garbled (mojibake) text when scraping web pages
r.apparent_encoding analyzes the response body to detect its encoding.

The difference between encoding and apparent_encoding:
- r.encoding: taken from the charset in the response headers; if the header has no charset, requests assumes ISO-8859-1
- r.apparent_encoding: the encoding detected by analyzing the page content itself
- r.apparent_encoding is therefore usually more accurate than r.encoding
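The mismatch can be reproduced offline: bytes encoded in one charset but decoded as ISO-8859-1 turn into mojibake. A minimal sketch (the sample string is a hypothetical stand-in for the page content):

import codecs

# GBK-encoded Chinese bytes decoded with the wrong charset become mojibake
raw = "大学生".encode("gbk")        # what a GBK server actually sends
wrong = raw.decode("iso-8859-1")    # what requests falls back to without a charset header
right = raw.decode("gbk")           # the correct decoding
print(wrong)   # ´óÑ§Éú  (mojibake)
print(right)   # 大学生

Because ISO-8859-1 maps every byte to a code point, decoding never raises an error, so the problem only shows up as unreadable text rather than an exception.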
For example, scraping https://www.dxsbb.com/news/44368.html:
import requests

url = 'https://www.dxsbb.com/news/44368.html'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
    "cookie": "Hm_lvt_0fde2fa52b98e38f3c994a6100b45558=1605139839,1605441318,1605521212,1606205900; Hm_lpvt_0fde2fa52b98e38f3c994a6100b45558=1606205905; ASPSESSIONIDCGQRRTAQ=PCKBDNECHGHNJPBGINBIINKP"
}
r = requests.get(url, headers=headers)
r.encoding = 'utf-8'  # wrong guess: this page is not UTF-8
print(r.text)         # garbled output

content = r.content
# print(content)
print(content.decode('ISO-8859-1'))  # still garbled: ISO-8859-1 is also wrong
Scraped result: the output is garbled.
Solution:
import requests

url = 'https://www.dxsbb.com/news/44368.html'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
    "cookie": "Hm_lvt_0fde2fa52b98e38f3c994a6100b45558=1605139839,1605441318,1605521212,1606205900; Hm_lpvt_0fde2fa52b98e38f3c994a6100b45558=1606205905; ASPSESSIONIDCGQRRTAQ=PCKBDNECHGHNJPBGINBIINKP"
}
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding  # use the encoding detected from the content
print(r.text)                     # now decodes correctly
Scraped result: the page text now displays correctly.
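If text has already been mis-decoded as ISO-8859-1, all is not lost: since ISO-8859-1 decoding is lossless (every byte maps to exactly one code point), the original bytes can be recovered by round-tripping, without re-issuing the request. A sketch with a hypothetical sample string standing in for the garbled r.text:

# Recover mojibake produced by an ISO-8859-1 mis-decode of GBK bytes
raw = "编码示例".encode("gbk")           # bytes as sent by a GBK page
mojibake = raw.decode("iso-8859-1")      # how requests decoded them by default
recovered = mojibake.encode("iso-8859-1").decode("gbk")  # round-trip back
print(recovered)  # 编码示例

This trick only works when the wrong decoding was ISO-8859-1 (or another lossless single-byte charset); setting r.encoding = r.apparent_encoding before reading r.text remains the cleaner fix.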