Python Web Crawler Simple (2): Installing and Using requests and beautifulsoup4

令狐飞侠

Category: python. Tags: python, web crawler

Article link: https://blog.csdn.net/afei8080/article/details/103522375

1 requests

1.1 Introduction to the requests package

requests is an HTTP library built on top of urllib3.

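Main methods of requests:
requests.request() builds and sends a request; it underlies all of the methods below
requests.get() sends a GET request; the main way to fetch an HTML page
requests.head() sends a HEAD request; fetches only the response headers of a page
requests.post() submits a POST request to a URL
requests.put() submits a PUT request to a URL
requests.patch() submits a PATCH (partial update) request to a URL
requests.delete() submits a DELETE request to a URL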
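
As a minimal sketch of how these methods relate: each helper such as requests.get() is shorthand for requests.request() with the method name filled in. The example below builds requests without sending them, so it needs no network access. The example.com URLs and the build_request helper are hypothetical and chosen only for illustration.

```python
import requests


def build_request(method, url, **kwargs):
    """Build a request without sending it (hypothetical helper for inspection)."""
    return requests.Request(method, url, **kwargs).prepare()


# Query parameters are encoded into the URL of a GET request.
get_req = build_request("GET", "http://example.com/search", params={"q": "weather"})
print(get_req.method, get_req.url)

# Form data is encoded into the body of a POST request.
post_req = build_request("POST", "http://example.com/submit", data={"name": "value"})
print(post_req.method, post_req.body)
```

To actually send a request, call requests.get(url) or, equivalently, requests.request("GET", url).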

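Main attributes and methods of the response object r:
r.status_code # HTTP status code of the response
r.raw # raw response body, i.e. the underlying urllib3 response object; pass stream=True to the request and read it with r.raw.read()
r.content # response body as bytes; gzip and deflate compression are decoded automatically
r.text # response body as a string, decoded using the character encoding declared in the response headers
r.headers # server response headers as a dict-like object with case-insensitive keys; r.headers.get() returns None if the key is missing
r.json() # the JSON decoder built into requests
r.raise_for_status() # raises an exception for failed requests (4xx and 5xx status codes)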
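
The attributes above can be read together in one place. The following is a sketch only: describe_response is a hypothetical helper, not part of requests, and it expects a response object such as the one returned by requests.get().

```python
import requests


def describe_response(r):
    """Summarize a requests.Response as a plain dict (hypothetical helper)."""
    try:
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
        ok = True
    except requests.HTTPError:
        ok = False
    return {
        "status": r.status_code,
        "ok": ok,
        # Header lookup is case-insensitive; .get() returns None if missing.
        "content_type": r.headers.get("content-type"),
        "size_bytes": len(r.content),
        "preview": r.text[:80],
    }
```

For example, describe_response(requests.get('http://www.baidu.com')) would summarize a fetched page.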

1.2 Installing requests

Run pip install requests.
Afterwards, open the Python package directory to see the installed package.
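
One quick way to confirm the installation, sketched here, is to import the package and print its version and location:

```python
import requests

# A successful import means the package is installed; print where it lives.
print("requests version:", requests.__version__)
print("installed at:", requests.__file__)
```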

1.3 Basic usage of requests

First, import the requests library:

import requests

Send a request using the GET method:

response1 = requests.get(url='http://www.baidu.com')

Then print the returned status code:

print(response1.status_code)
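
In practice a request can fail or time out, so it is common to wrap the call. The following is a minimal sketch, assuming a 10-second timeout is acceptable; fetch_html is a hypothetical helper name, not part of requests.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # treat 4xx/5xx responses as failures
        # Re-detect the encoding from the body; this often helps with
        # Chinese pages whose headers omit the charset.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as exc:
        print("request failed:", exc)
        return None
```

Calling fetch_html('http://www.baidu.com') returns the page text on success and None on any network or HTTP error.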

We can also set request headers, for example a browser User-Agent:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
response2 = requests.get(url='http://www.baidu.com', headers=headers)

Then print the text of the response:

print(response2.text)
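
When several requests should share the same headers, one option is a requests.Session. The sketch below prepares a request without sending it, only to show which User-Agent would be used; it is an illustration, not the code from the author's repository.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36"
}

# Headers set on a Session are sent with every request made through it.
session = requests.Session()
session.headers.update(headers)

# Prepare (but do not send) a request to confirm which User-Agent would be used.
prepared = session.prepare_request(requests.Request("GET", "http://www.baidu.com"))
print(prepared.headers["User-Agent"])
```

After this setup, session.get(url) sends the custom User-Agent without passing headers each time.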

For the full code, see chapter2-1.py in chapter2 of https://github.com/linghufeixia/python-spider-simple

2 beautifulsoup4

2.1 Introduction to beautifulsoup4

Official Chinese documentation for Beautiful Soup 4:
https://beautifulsoup.readthedocs.io/zh_CN/latest/

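Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".
Its 5 basic elements:
Tag: a tag
Name: the name of a tag: tag.name
Attribute: an attribute of a tag: tag['attribute']
NavigableString: the text content inside a tag: tag.string
Comment: a comment in HTML or XML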
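
The five elements can be seen on a small inline document. This is a sketch: the HTML string is made up for illustration, and the built-in html.parser is used so no extra parser is required.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = '<p class="title" id="intro"><b>Hello</b></p><div><!--a note--></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag: the first <p> element
print(tag.name)              # Name
print(tag["id"], tag.attrs)  # Attribute: one value, then the full dict
text = soup.b.string         # NavigableString: the text inside <b>
print(type(text).__name__, text)
comment = soup.div.string    # Comment: a special kind of NavigableString
print(type(comment).__name__, comment)
```

Note that a multi-valued attribute such as class is returned as a list, so tag["class"] here is a list of class names rather than a single string.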

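The commonly used find_all function:
find_all( name , attrs , recursive , string , **kwargs )
name parameter: finds all tags whose name is name.
attrs parameter: filters on the attributes of a tag.
string parameter: searches the string content of the document.
recursive parameter: by default, calling find_all() on a tag searches all of its descendants. To search only the tag's direct children, pass recursive=False.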
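
Each parameter can be tried on a small inline document. This is a sketch with made-up HTML; the class names only loosely imitate a forecast list.

```python
from bs4 import BeautifulSoup

html = """
<ul id="days">
  <li class="sky on"><h1>Mon</h1><p class="wea">Sunny</p></li>
  <li class="sky"><h1>Tue</h1><p class="wea">Cloudy</p></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("li")                          # name: every <li> tag
by_attrs = soup.find_all("p", attrs={"class": "wea"})  # attrs: filter on attributes
by_string = soup.find_all(string="Sunny")              # string: match text content
# recursive=False looks only at direct children of <ul>, so nested <h1> is skipped.
direct_li = soup.ul.find_all("li", recursive=False)
direct_h1 = soup.ul.find_all("h1", recursive=False)

print(len(by_name), len(by_attrs), by_string, len(direct_li), len(direct_h1))
```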

2.2 Installing beautifulsoup4

Install the XML library lxml:
pip install lxml
Install beautifulsoup4:
pip install beautifulsoup4
pip prints a confirmation message when the installation succeeds.
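
Note that the beautifulsoup4 package is imported under the module name bs4. A small sketch for checking whether both packages are available, using only the standard library:

```python
import importlib.util

# Map each pip package name to the module name it is imported as.
packages = {"beautifulsoup4": "bs4", "lxml": "lxml"}

for pip_name, module_name in packages.items():
    installed = importlib.util.find_spec(module_name) is not None
    print(pip_name, "installed" if installed else "missing")
```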

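2.3 Basic usage of beautifulsoup

First, create a BeautifulSoup instance, bsoup, from the text of response1 (section 1.3):

from bs4 import BeautifulSoup

response_text = response1.text
bsoup = BeautifulSoup(response_text, 'html.parser')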

Then print the first p tag:

p = bsoup.p
print("First p tag:")
print(p)
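
Attribute access such as bsoup.p returns only the first matching tag. The sketch below, on a made-up inline document, contrasts it with find() and find_all():

```python
from bs4 import BeautifulSoup

html = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first = soup.p              # shorthand for soup.find("p")
all_p = soup.find_all("p")  # every <p> tag, in document order
missing = soup.table        # None when no such tag exists

print(first, len(all_p), missing)
```

Because a missing tag yields None, it is worth checking the result before reading attributes such as .text from it.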

For the full code, see chapter2-2.py in chapter2 of https://github.com/linghufeixia/python-spider-simple

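3 Using requests and beautifulsoup4 together

Hands-on project: fetching the weather forecast
1 Use the requests library to fetch images
2 Use the BeautifulSoup library to parse the fetched page content.
3 Use the os library to create folders and list the file names in a folder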
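
The os part of the list above can be sketched as follows. The folder name weather_images and the file name day1.png are hypothetical, and a temporary directory is used so the example does not touch the current folder.

```python
import os
import tempfile

# Work inside a temporary directory rather than the current folder.
base_dir = tempfile.mkdtemp()
out_dir = os.path.join(base_dir, "weather_images")  # hypothetical folder name

os.makedirs(out_dir, exist_ok=True)  # create the folder; no error if it exists

# Write an empty placeholder file, then list the file names in the folder.
with open(os.path.join(out_dir, "day1.png"), "wb") as f:
    f.write(b"")

file_names = os.listdir(out_dir)
print(file_names)
```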

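Step 1: Open http://www.weather.com.cn/weather/101010100.shtml in Chrome.
Step 2: Open the developer tools, find the weather element in the page source, right-click it, and choose "Copy" ➔ "Copy selector" from the context menu to copy its path automatically.
Paste the path into a document; it looks like this:
#\37 d > ul > li.sky.skyid.lv2.on
Simplify the path by dropping the leading #\37 d (the CSS-escaped id 7d):
ul > li.sky.skyid.lv2.on
Then select directly with Beautiful Soup:
datas = bsoup.select('ul > li.sky.skyid.lv2.on')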
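
To see what this selector matches, it can be tried on a small stand-in document. This is a sketch: the HTML below is a simplified assumption about the page structure, and the real markup may differ.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the forecast markup; the real page may differ.
html = """
<div id="7d"><ul>
  <li class="sky skyid lv2 on"><h1>Today</h1><p class="wea">Sunny</p></li>
  <li class="sky skyid lv2"><h1>Tomorrow</h1><p class="wea">Cloudy</p></li>
</ul></div>
"""
bsoup = BeautifulSoup(html, "html.parser")

# Every class in the selector must be present, so only the "on" item matches.
datas = bsoup.select("ul > li.sky.skyid.lv2.on")
print(len(datas), datas[0].h1.text)

# Dropping ".on" and the other classes matches both list items.
all_days = bsoup.select("ul > li.sky")
print(len(all_days))
```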

Step 3: Clean and organize the data

weathers = bsoup.find_all('li', class_='sky skyid lv2 on')

…

date = weather.find('h1').text
print("Date:", end=" ")
print(date)

wea = weather.find('p', class_='wea').text
print("Weather:", end=" ")
print(wea)
Step 4: Run the program to see the results.
For the full code, see chapter2-3.py in chapter2 of https://github.com/alifeidao/python-spider-simple
Note:
The tag for today's weather may change, in which case the code needs to be adjusted accordingly.
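
One way to make the parsing less sensitive to such changes is to match on a single class instead of the full class string. The sketch below assumes that every forecast item carries the class sky, which the source does not confirm; parse_weather and the sample HTML are hypothetical and this is not the code from the repository.

```python
from bs4 import BeautifulSoup


def parse_weather(html):
    """Return a list of {"date", "weather"} dicts from forecast HTML."""
    bsoup = BeautifulSoup(html, "html.parser")
    results = []
    # Match on the single class "sky" instead of the full class string.
    for weather in bsoup.find_all("li", class_="sky"):
        h1 = weather.find("h1")
        wea = weather.find("p", class_="wea")
        if h1 is None or wea is None:
            continue  # skip items that lack the expected tags
        results.append({"date": h1.text.strip(), "weather": wea.text.strip()})
    return results


# Made-up sample; the second item uses a different "lv" class on purpose.
sample = """
<ul>
  <li class="sky skyid lv2 on"><h1>Today</h1><p class="wea">Sunny</p></li>
  <li class="sky skyid lv1"><h1>Tomorrow</h1><p class="wea">Cloudy</p></li>
</ul>
"""
for row in parse_weather(sample):
    print("Date:", row["date"], "Weather:", row["weather"])
```

With a live page, the same function could be applied to the text of a response fetched with requests.get().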
