5 Hands-on Practice 1: Fetching News Page Source Code with Python

啥都鼓捣的小yao

Last modified 2023-12-01 02:39:27

Column: Python大数据挖掘与分析 (Python Big Data Mining and Analysis). Tags: python, web, crawler, request

First published 2021-08-04 14:43:16

Article link: https://blog.csdn.net/Eric005/article/details/119383287

Fetching news page source code with Python

As a first attempt, we use the Requests library to fetch the source code of a Baidu News results page.

import requests

url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
# First attempt: no headers argument, so Requests sends its default User-Agent
res = requests.get(url).text
print(res)
'''
Output:
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
'''

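This is not the real page source. The site appears to serve its real content only to requests that look like they come from a browser, and a plain requests.get() call identifies itself as python-requests rather than as a browser. The fix is to set the headers parameter of requests.get() so that the request imitates a browser. headers carries information about the visitor, and its User-Agent field tells the site which browser is making the request.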
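
To see the difference a headers argument makes without contacting any site, the sketch below builds two requests and inspects them before they are sent. It assumes the same URL and browser string used in this post; `plain` and `custom` are illustrative names, and preparing requests through a Session is just one convenient way to look at the outgoing headers.

```python
import requests

# Same search URL and browser-like header used elsewhere in this post.
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

session = requests.Session()

# Build the requests without sending them, so their headers can be inspected.
plain = session.prepare_request(requests.Request('GET', url))
custom = session.prepare_request(requests.Request('GET', url, headers=headers))

print(plain.headers['User-Agent'])   # Requests' own default: python-requests/<version>
print(custom.headers['User-Agent'])  # the browser-like string from headers
```

The first line printed is what the site sees when no headers are supplied, which is presumably why the first attempt was answered with a redirect stub.
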
The modified code and its result:

import requests

# Identify the request as coming from a desktop Chrome browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
# Pass the headers so the site treats the request like a browser visit
res = requests.get(url, headers=headers).text
print(res)

[Screenshot: the page source printed by the code above]
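
One way to confirm that a response is real page source rather than the short redirect stub from the first attempt is a small heuristic check. The function name `looks_like_redirect_stub`, the two markers, and the 1000-character threshold are all assumptions made for illustration, based only on the stub shown earlier in this post.

```python
def looks_like_redirect_stub(html: str) -> bool:
    """Heuristic: True if html is a tiny redirect page like the first output."""
    markers = ('location.replace(', 'http-equiv="refresh"')
    # The stub is only a few hundred characters; a real results page is far longer.
    return len(html) < 1000 and any(m in html for m in markers)


# The stub returned by the first attempt, copied from the output shown earlier.
stub = '''<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>'''

print(looks_like_redirect_stub(stub))                              # True
print(looks_like_redirect_stub('<html>' + 'x' * 5000 + '</html>'))  # False
```

In practice one would call it on `res` right after `requests.get(url, headers=headers).text`.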

Here headers is a dictionary. Its first (and only) entry has the key 'User-Agent' and the value 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'.
User-Agent tells the site which browser is making the request.

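How to find it: in Chrome (or another Chromium-based browser), type about:version in the address bar. The "User Agent" entry on that page is the value to use for User-Agent.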

From then on, add headers=headers to every requests.get() call.
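
As an alternative to repeating headers=headers on every call, the headers can be stored once on a requests.Session, which then sends them with each request it makes. This is a sketch rather than the approach used in this post; the helper name `get_page_source` and the 10-second timeout are assumptions.

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# A Session keeps its headers and sends them with every request it makes.
session = requests.Session()
session.headers.update(headers)


def get_page_source(url):
    """Hypothetical helper: fetch url with the session's headers, return the HTML text."""
    res = session.get(url, timeout=10)  # timeout so a stalled request does not hang forever
    return res.text


# Example call (needs network access):
# print(get_page_source('https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'))
```
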
