Extracting web page data with Python: a first lesson in web scraping

Latest recommended article published 2024-05-14 06:48:47

排队的萝卜

Category: Python. Tags: error

Article link: https://blog.csdn.net/totosj/article/details/102695817

Building a scraper, hands-on

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://movie.douban.com/subject/1292052/')
print(r.text)

This uses requests_html, which I first installed as instructed with pip install requests_html.

1. After installing, running the script failed with ModuleNotFoundError: No module named 'requests_html'

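I found the fix on Stack Overflow: pip3 install requests_html. The likely cause is that pip belonged to a different Python installation than the python3 interpreter running the script, so the package was installed where the script could not see it. Running python3 -m pip install requests-html ties the install to one specific interpreter and avoids the ambiguity.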
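One way to check for this kind of mismatch is to ask the interpreter itself where it lives and whether it can locate the module. The sketch below uses only the standard library; the helper name is_installed is illustrative and not part of any library.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the current interpreter can locate module_name."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter actually running this script; pip must install into this one.
print(sys.executable)

# True only if requests_html is visible to this interpreter.
print(is_installed("requests_html"))
```

If the second line prints False, installing with the interpreter path printed on the first line, followed by -m pip install requests-html, should target the right environment.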

2. Running it again, the Chinese characters in the output came out garbled

So I searched Baidu again and found the following code.

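First I looked at the page source to see what encoding it declares, and found zh-cmn-Hans. Strictly speaking, zh-cmn-Hans is a language tag (Mandarin Chinese in simplified script), the kind of value that appears in the lang attribute of the html tag, not a character encoding. A page's character encoding is declared separately, in the Content-Type response header or in a meta charset tag.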

Next, use requests (which requests_html builds on) to check which encoding the response is decoded with by default:

url = 'https://movie.douban.com/subject/3075287/'

# Check which encoding requests picks by default
response = session.get(url)
print(response.encoding)

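The output is utf-8, which does not match the zh-cmn-Hans value found in the page source. Since zh-cmn-Hans names a language rather than an encoding, though, the mismatch alone does not show that utf-8 is wrong for this page.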
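To make the difference between a language tag and a character encoding concrete, here is a minimal sketch that reads both from an HTML document using only the standard library. The sample_html string and the HeadInfoParser class are made up for illustration; a real page would be fetched first, and its markup may differ.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect the lang attribute of <html> and the charset of <meta charset=...>."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and self.lang is None:
            self.lang = attrs.get("lang")
        elif tag == "meta" and self.charset is None and "charset" in attrs:
            self.charset = attrs["charset"]


# Hypothetical page source, used only to illustrate the two declarations.
sample_html = (
    '<html lang="zh-cmn-Hans"><head><meta charset="utf-8">'
    "<title>demo</title></head><body></body></html>"
)

parser = HeadInfoParser()
parser.feed(sample_html)
print(parser.lang)     # the language the text is written in
print(parser.charset)  # the encoding the bytes should be decoded with
```

Only the second value is something to assign to response.encoding.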

The fix I found was to override the encoding that requests uses when decoding the response:

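def get_html(url):
    try:
        response = session.get(url)  # fetch inside the function so url is actually used
        # The original set 'zh-cmn-Hans' here, but that is a language tag, not a codec.
        response.encoding = 'utf-8'  # override the encoding used to decode response.text
        print(response.encoding)
        html = response.text
        return html
    except Exception as exc:
        print('Request failed:', exc)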
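Why the encoding name matters can be shown without any network access. The sketch below assumes a made-up sample string. It checks whether a name is a codec Python knows, then decodes the same bytes with a wrong and a right codec. As far as I can tell, requests falls back to a default decode when response.encoding holds an unknown name, which would explain why assigning 'zh-cmn-Hans' can appear to work. Treat that as an observation about requests, not a documented guarantee.

```python
import codecs

sample = "豆瓣电影"            # hypothetical sample text
raw = sample.encode("utf-8")  # the bytes a UTF-8 page would send


def is_known_codec(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


print(is_known_codec("utf-8"))        # a real encoding name
print(is_known_codec("zh-cmn-Hans"))  # a language tag, not an encoding

garbled = raw.decode("latin-1")  # wrong codec: decodes without error but yields mojibake
restored = raw.decode("utf-8")   # right codec: recovers the original text
print(garbled == sample, restored == sample)
```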

The final code looks like this:

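from requests_html import HTMLSession

session = HTMLSession()

url = 'https://movie.douban.com/subject/3075287/'

# Check which encoding requests picks by default
response = session.get(url)
print(response.encoding)


def get_html(url):
    try:
        response = session.get(url)
        # 'zh-cmn-Hans' is a language tag, not a codec, so set a real encoding name
        response.encoding = 'utf-8'
        print(response.encoding)
        html = response.text
        return html
    except Exception as exc:
        print('Request failed:', exc)


# Call the function so the overridden encoding is the one actually used
print(get_html(url))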