How to automatically detect a page's encoding and decode it

会编程的漂亮小姐姐

Category: Python

Article link: https://blog.csdn.net/u014229742/article/details/84848722

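When writing a crawler, you often have to decode a page correctly before its content is readable; otherwise Chinese text shows up as garbled characters. Because different pages use different encodings, it is best to detect each page's encoding automatically and then decode the page with it, so that we get the content we actually want.
The code is as follows: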

from lxml import etree
import chardet
import requests
import urllib.request  # Python 3: a bare "import urllib" does not load urllib.request


# Automatically detect the encoding of the page source
def automatic_detect(url):
    # On Python 2, use urllib.urlopen(url) instead
    content = urllib.request.urlopen(url).read()
    result = chardet.detect(content)
    # chardet may report None when it cannot decide; fall back to UTF-8
    encoding = result['encoding'] or 'utf-8'
    return encoding


url = 'http://www.yongdasbkj.com/'
resp = requests.get(url, verify=False)
html_string = resp.content.decode(automatic_detect(url))
# print(html_string)
html = etree.HTML(html_string)
# Get the title
titles = html.xpath('//title')
for title in titles:
    title = title.xpath('text()')
    print(title)
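
As a complement to statistical detection, many pages also declare their own encoding in a `<meta>` tag. The following is a minimal sketch, not part of the original code, that reads that declaration from the raw bytes and falls back to a short list of candidate encodings. It uses only the standard library; the names `declared_encoding` and `decode_page` and the default candidate list are assumptions made for illustration.

```python
import codecs
import re

# Matches declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE)


def declared_encoding(raw):
    """Return the charset declared in the first 4 KB of the page, or None."""
    match = _META_CHARSET.search(raw[:4096])
    if not match:
        return None
    name = match.group(1).decode('ascii', errors='ignore')
    try:
        codecs.lookup(name)  # reject names Python does not know
    except LookupError:
        return None
    return name


def decode_page(raw, candidates=('utf-8', 'gb18030')):
    """Decode page bytes with the declared charset, then the candidates."""
    declared = declared_encoding(raw)
    order = ([declared] if declared else []) + list(candidates)
    for name in order:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing bytes that cannot be decoded.
    return raw.decode('utf-8', errors='replace')
```

With the script above, this could be called as `decode_page(resp.content)`, which reuses the bytes already downloaded by `requests` rather than fetching the page a second time. A declared charset can be wrong or missing, so combining it with `chardet` as a further fallback is a reasonable design.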