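Whether a scraped page is encoded as 'gbk' or 'utf-8', the method below is a general fix for garbled text (mojibake).
First convert the text back to bytes, then decode those bytes with the page's actual encoding.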
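The two steps can be tried without any network access. The following is a minimal sketch, not taken from the original post: the helper name fix_mojibake and the sample string are made up for illustration, and it assumes the client wrongly guessed ISO-8859-1 for a page that is really UTF-8.

```python
def fix_mojibake(garbled: str, guessed: str, real: str) -> str:
    """Re-encode with the wrongly guessed charset, then decode with the real one."""
    raw = bytes(garbled, guessed)      # step 1: recover the original bytes
    return raw.decode(real, "ignore")  # step 2: decode with the page's charset


# Simulated response body: UTF-8 bytes, as a server would send them.
body = "经济观察网要闻".encode("utf-8")
# Assumption: the client guessed ISO-8859-1, so the decoded text is garbled.
garbled = body.decode("iso-8859-1")

print(garbled)                                        # unreadable text
print(fix_mojibake(garbled, "iso-8859-1", "utf-8"))   # readable text again
```

The round trip recovers the bytes exactly here because ISO-8859-1 maps every byte value to a character. If the guessed charset cannot represent some bytes, the first decode already loses data and this method cannot restore it.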
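When fetching the page with requests:
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
}
res = requests.get('http://www.eeo.com.cn/yaowen/', headers=headers)
# res.text was decoded with res.encoding (the charset requests guessed).
# Re-encode it with that charset to get the bytes back, then decode them
# with the page's real charset ('utf-8' for this page).
response = bytes(res.text, res.encoding).decode('utf-8', 'ignore')
print(response)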
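When fetching the page with pyquery:
from pyquery import PyQuery as pq
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
}
res = pq('http://www.eeo.com.cn/yaowen/', headers=headers)
# res.html() returns the markup as text; res.encoding is the encoding
# pyquery reports for the parsed document. Re-encode, then decode as 'utf-8'.
response = bytes(res.html(), res.encoding).decode('utf-8', 'ignore')
print(response)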
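The two versions differ only in how the page content is read: the text attribute in requests versus the html() method in pyquery. Note that the page fetched in this example is encoded as 'utf-8'; for a 'gbk' page, pass 'gbk' to decode() instead.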
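When it is not known in advance whether a page is 'utf-8' or 'gbk', one option is to work from the raw bytes and try each charset in turn. This is a hedged sketch rather than the method described above: the helper decode_page and its candidates parameter are hypothetical names, and the approach assumes that bytes which decode strictly as UTF-8 really are UTF-8.

```python
def decode_page(raw: bytes, candidates=("utf-8", "gbk")) -> str:
    """Return the first strict decode that succeeds among the candidate charsets."""
    for charset in candidates:
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and drop bad bytes.
    return raw.decode(candidates[0], "ignore")


sample = "经济观察网要闻"
print(decode_page(sample.encode("utf-8")))
print(decode_page(sample.encode("gbk")))
```

With requests, the raw bytes are available as res.content, so the call would be decode_page(res.content). 'utf-8' is tried first because GBK-encoded Chinese text rarely forms valid UTF-8, while UTF-8 bytes can sometimes decode as GBK without raising an error and produce wrong characters.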