20210626_Cohort 26_Office Automation_Task5_Introduction to Web Scraping


余柳成荫


Article link: https://blog.csdn.net/yuliuchenyin/article/details/118251421


5. Introduction to Web Scraping


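1 requests

The 7 main methods of the requests library
Method                 Description
requests.request()     Builds and sends a request; the most basic method, which all the methods below are built on
requests.get()         Fetches a page; corresponds to the HTTP GET method
requests.post()        Submits data to a page; corresponds to the HTTP POST method
requests.head()        Fetches only the response headers of a page; corresponds to the HTTP HEAD method
requests.put()         Submits a PUT request to a page; corresponds to the HTTP PUT method
requests.patch()       Submits a partial-modification request to a page; corresponds to the HTTP PATCH method
requests.delete()      Submits a delete request to a page; corresponds to the HTTP DELETE method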

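import requests
# Send an HTTP request (the variable is not named `re`, to avoid shadowing the regex module used later)
response = requests.get("https://www.baidu.com")
# Check the response status code (200 means success)
print(response.status_code)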
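
As a rough sketch of how the helper methods relate to requests.request(), the snippet below builds a GET request without sending it, so the method and final URL can be inspected offline. The `fetch` helper, its 10-second timeout, and the `wd` query parameter are assumptions made for illustration; they do not come from the original tutorial.

```python
import requests


def fetch(url, timeout=10):
    """Hypothetical helper: GET a page and return its text, raising on HTTP errors."""
    # requests.get(url, ...) is a thin wrapper around requests.request("GET", url, ...)
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Build (but do not send) a GET request to see what requests would put on the wire.
prepared = requests.Request(
    "GET", "https://www.baidu.com/s", params={"wd": "python"}
).prepare()
print(prepared.method)  # GET
print(prepared.url)     # https://www.baidu.com/s?wd=python
```

Setting a timeout is a common precaution, since requests otherwise waits indefinitely for a server that never answers.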

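2 The BeautifulSoup library

Beautiful Soup is a library for parsing HTML/XML and extracting data from it.
Among the parsers it can use, lxml is comparatively fast.

The core of extracting information:

Locate the tag precisely
Extract the content from the tag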
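
The second step, extracting content from a tag once it has been located, might look like the following minimal sketch. The one-line HTML snippet is invented for illustration, and "html.parser" is used so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented sample markup, only for demonstration.
html = '<a href="https://arxiv.org" class="link ext">arXiv</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")          # step 1: locate the tag
print(link.name)               # a
print(link.get_text())         # arXiv
print(link["href"])            # https://arxiv.org
print(link["class"])           # ['link', 'ext']  (class is multi-valued, so a list)
print(link.get("title"))       # None  (.get avoids a KeyError for missing attributes)
```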

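find() and find_all()
find(name, attrs, recursive, text) returns a single BeautifulSoup Tag object (the first match), or None if nothing matches
- name: the tag name to search for
- attrs: attribute values to match; pass a dict to match on specific attributes
- recursive: whether to search all descendants; defaults to True
- text: a string to match against the text between <>…</> (newer Beautiful Soup versions call this parameter string)
find_all() takes the same parameters as find() but returns every matching tag, as a list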
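
The parameters listed above can be tried out on a small snippet. This is only a sketch: the HTML is made up, and it uses the newer `string` keyword (available since Beautiful Soup 4.4.0) in place of `text`.

```python
from bs4 import BeautifulSoup

# Invented sample markup with one direct <p> child and one nested <p>.
html = '<div id="box"><p class="note">first</p><div><p>nested</p></div></div>'
soup = BeautifulSoup(html, "html.parser")

box = soup.find("div", attrs={"id": "box"})      # name + attrs
first = box.find("p")                            # first match only (a Tag)
all_p = box.find_all("p")                        # all descendants (recursive=True)
direct_p = box.find_all("p", recursive=False)    # direct children only
by_string = soup.find_all("p", string="nested")  # match on the tag's text
missing = soup.find("table")                     # no match -> None

print(first.get_text())           # first
print(len(all_p), len(direct_p))  # 2 1
print(by_string)                  # [<p>nested</p>]
print(missing)                    # None
```

Because find() returns None when nothing matches, it is worth checking the result before accessing attributes on it.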

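import re
from bs4 import BeautifulSoup

a = '''<h1>Heading 1</h1><h2>Heading 2</h2><h2>Heading 3</h2>'''
# Convert the string into a BeautifulSoup object
soup = BeautifulSoup(a, "html.parser")
# Extract a tag that appears only once
soup.h1
soup.find('h1')
soup.find_all('h1')[0]
# All three expressions above return:

<h1>Heading 1</h1>

print(soup.find_all(['h1','h2']))
print('--------regular expression-------------')
print(soup.find_all(re.compile('^h')))   # re.compile builds a pattern object from a regex string;
                                         # here it matches every tag whose name starts with "h"

[<h1>Heading 1</h1>, <h2>Heading 2</h2>, <h2>Heading 3</h2>]
--------regular expression-------------
[<h1>Heading 1</h1>, <h2>Heading 2</h2>, <h2>Heading 3</h2>]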

Going further, find_all can identify tags by their attributes and attribute values

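a = '''<p id='p1'>Paragraph 1</p><p id='p2'>Paragraph 2</p>
       <p class='p3'>Paragraph 3</p><p class='p3' id='pp'>Paragraph 4</p>'''

soup = BeautifulSoup(a, "html.parser")

# Option 1: pass the attribute name as a keyword argument; this does not work for
# names that are not valid Python identifiers, such as a-b
soup.find_all('p', id = 'p1') # the usual case
soup.find_all('p', class_='p3') # class is a reserved word in Python, so append an underscore

# The most general approach: pass a dict via attrs
soup.find_all('p', attrs={'class':'p3'}) # matches any tag that has this attribute value, even if it has other attributes too
soup.find_all('p', attrs={'class':'p3','id':'pp'}) # match on several attributes at once
soup.find_all('p', attrs={'class':'p3','id':False}) # require that an attribute is absent
soup.find_all('p', attrs={'id':['p1','p2']}) # attribute value is either p1 or p2

Output of the last call:

[<p id="p1">Paragraph 1</p>, <p id="p2">Paragraph 2</p>]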
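
For attribute names containing a hyphen (the a-b case mentioned in the comments above), the keyword form is not valid Python, so the attrs dict is the way to go. A minimal sketch, using an invented `data-id` attribute:

```python
from bs4 import BeautifulSoup

# Invented sample markup; data-id is a hypothetical attribute name.
html = '<p data-id="7">A</p><p data-id="8">B</p><p>C</p>'
soup = BeautifulSoup(html, "html.parser")

# soup.find_all("p", data-id="7") would be a SyntaxError, so pass a dict via attrs.
by_value = soup.find_all("p", attrs={"data-id": "7"})
# True matches any tag that has the attribute, whatever its value.
with_attr = soup.find_all("p", attrs={"data-id": True})

print([p.get_text() for p in by_value])   # ['A']
print([p.get_text() for p in with_attr])  # ['A', 'B']
```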

Getting all the categories on a web page

html_text = requests.get('https://arxiv.org/category_taxonomy').text  # download the page HTML
soup = BeautifulSoup(html_text, 'lxml')  # parse it; lxml is used here for speed (requires the lxml package)

root = soup.find('div', {'id': 'category_taxonomy_list'})  # find the container tag to start from
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # collect every h2, h3, h4 and p tag under root
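
One possible way to turn the resulting `tags` list into records is to walk it in document order and use each tag's name to decide what it represents. The sketch below runs on a small invented HTML sample instead of the live arXiv page; the sample structure (h2 for a group, h4 for a category, p for its description) and the `rows` layout are assumptions, so the real page may need different handling.

```python
from bs4 import BeautifulSoup

# Simplified, invented stand-in for the real page, so the example runs offline.
sample = """
<div id="category_taxonomy_list">
  <h2>Computer Science</h2>
  <h4>cs.AI <span>(Artificial Intelligence)</span></h4>
  <p>Covers all areas of AI.</p>
  <h4>cs.CL <span>(Computation and Language)</span></h4>
  <p>Covers natural language processing.</p>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")
root = soup.find("div", {"id": "category_taxonomy_list"})
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)

rows = []       # one dict per category
group = None    # most recent h2 heading seen
current = None  # category record currently being filled in
for tag in tags:
    text = tag.get_text(" ", strip=True)
    if tag.name == "h2":
        group = text
    elif tag.name == "h4":
        current = {"group": group, "category": text, "description": ""}
        rows.append(current)
    elif tag.name == "p" and current is not None:
        current["description"] = text

for row in rows:
    print(row)
```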

References

Source: Datawhale, cohort 26, Python Office Automation: https://github.com/datawhalechina/team-learning-program/tree/master/OfficeAutomation
Authors: 牧小熊, 刘雯静, 张晓东, 吴争光, 隆军
Forum: http://datawhale.club/t/topic/1574