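If scraped Chinese text comes back looking like '\x9d\x9cå\x8f\x8bç\x94', the response was most likely decoded with the wrong encoding. One way to check: after fetching the page with requests.get, compare the response's encoding (the encoding requests uses to decode the text, guessed from the HTTP headers) with its apparent_encoding (the encoding detected from the body bytes). If the two differ, the encoding in use is probably wrong. The code is as follows (define url and headers yourself):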
import requests
response = requests.get(url, headers=headers)
# True if the header-based encoding matches the one detected from the content
print(response.encoding == response.apparent_encoding)
If this prints False, the two encodings disagree, and you need to change the response's encoding after fetching it, as follows:
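import requests
response = requests.get(url, headers=headers)
print(response.encoding == response.apparent_encoding)  # False when they disagree
# Decode with the encoding detected from the content instead
response.encoding = response.apparent_encoding
print(response.encoding == response.apparent_encoding)  # should now print True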
After that, read the text from response.text as usual.
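To see why the garbled text appears, here is a minimal offline sketch that reproduces it with the standard library only. It assumes the page was actually UTF-8 but was decoded as ISO-8859-1; the sample string "朋友" and the helper names same_codec and repair_mojibake are made up for illustration and are not part of requests. Since a plain == comparison of encoding names is case-sensitive, same_codec also shows one possible way to compare names such as 'UTF-8' and 'utf8' by their normalized codec name.

```python
import codecs


def same_codec(name_a, name_b):
    """Return True if both names refer to the same Python codec.

    Hypothetical helper: treats 'UTF-8' and 'utf8' as equal, and
    returns False for missing or unknown encoding names.
    """
    if not name_a or not name_b:
        return False
    try:
        return codecs.lookup(name_a).name == codecs.lookup(name_b).name
    except LookupError:
        return False


def repair_mojibake(text, wrong="iso-8859-1", right="utf-8"):
    """Undo a wrong decode: re-encode with the wrong codec, then decode with the right one."""
    return text.encode(wrong).decode(right)


raw = "朋友".encode("utf-8")         # bytes as a server might send them
garbled = raw.decode("iso-8859-1")   # what a wrong encoding guess produces
print(repr(garbled))                 # garbled text similar to the example above
print(repair_mojibake(garbled))      # the original Chinese text again
print(same_codec("UTF-8", "utf8"))
```

With requests itself, setting response.encoding before reading response.text avoids the wrong decode in the first place, so a repair step like this should only be needed for text that has already been decoded incorrectly.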