Web Scraping: bs4

Latest recommended article published on 2023-08-07 15:37:51

zjb5599


Category: Web Scraping

Article link: https://blog.csdn.net/zjb5599/article/details/106320407


1. Introduction to bs4 (BeautifulSoup4)

1.1 Basic concepts

Beautiful Soup is a Python library for extracting data from HTML and XML files.

1.2 Installation

pip install lxml
pip install beautifulsoup4

2. Using bs4

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
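soup = BeautifulSoup(html_doc, 'lxml')  # parse the document with the lxml parser
# print(soup.prettify())                # pretty-print the whole tree
print(soup.title)         # the <title> tag
print(soup.title.name)    # the tag's name
print(soup.title.string)  # the text inside the tag
print(soup.p)             # the first <p> tag
print(soup.a)             # the first <a> tag
# Collect every <a> tag and print its href attribute
r = soup.find_all('a')
for link in r:
    print(link.get('href'))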

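2.2 Object types in bs4

Tag : an HTML or XML tag
NavigableString : a navigable string, the text inside a tag
BeautifulSoup : the object that represents the whole parsed document
Comment : a comment, a special kind of NavigableString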
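
To make the four types concrete, here is a minimal sketch that parses a tiny fragment and checks what each piece is. The fragment is made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' only so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical fragment, chosen so that all four object types show up.
snippet = '<p class="title"><b>The Dormouse\'s story</b><!-- a comment --></p>'
soup = BeautifulSoup(snippet, 'html.parser')

tag = soup.b                  # a Tag
text = soup.b.string          # a NavigableString
comment = soup.p.contents[1]  # a Comment, the second child of <p>

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
print(isinstance(comment, NavigableString))  # True: Comment subclasses NavigableString
```

Because Comment is a subclass of NavigableString, code that collects text from a page may want to check for Comment explicitly so that comments are not treated as visible text.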

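3. Traversing the tree: child nodes

bs4 supports three kinds of operations on the parse tree: traversing, searching, and modifying.

3.1 contents, children, descendants

contents returns a list of a tag's direct children.
children returns an iterator over the same direct children.
descendants returns a generator that walks all descendants recursively (children, grandchildren, and so on).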
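
The difference between the three is easier to see on a small tree. The sketch below uses a made-up two-level fragment and the built-in 'html.parser' (an assumption, so that no extra parser is needed); only tags are listed for descendants, since the generator also yields the text nodes.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment: <div> has two direct children, and <p> has a nested <b>.
soup = BeautifulSoup('<div><p>one <b>two</b></p><span>three</span></div>', 'html.parser')
div = soup.div

direct = div.contents                         # a list of direct children
child_names = [c.name for c in div.children]  # iterate over the same children
# descendants also yields strings, so keep only Tag objects here
descendant_names = [d.name for d in div.descendants if isinstance(d, Tag)]

print(type(direct).__name__, len(direct))  # list 2
print(child_names)                         # ['p', 'span']
print(descendant_names)                    # ['p', 'b', 'span']
```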

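3.2 .string, .strings, .stripped_strings

string returns the text inside a tag when the tag contains exactly one string (directly or through a single nested tag); otherwise it returns None.
strings returns a generator that yields every string inside the tag, which is useful when the tag contains several nested tags.
stripped_strings behaves like strings, but it strips leading and trailing whitespace and skips strings that are only whitespace.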

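4. Traversing the tree: parent nodes

parent and parents
parent returns the direct parent of a node.
parents yields every ancestor of a node, from its direct parent up to the root of the document.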

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
title_tag = soup.title
print(title_tag)
print(title_tag.parent)  # the direct parent, here the <head> tag
a_tag = soup.a
print(a_tag)
print(a_tag.parents) # a generator, e.g. <generator object parents at 0x0000025F937E9678>
for x in a_tag.parents:  # walk from the nearest ancestor up to the document root
    print(x)
    print('----------------')
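
Printing whole ancestors gets noisy quickly, so a compact variant is to print only their names. This is a sketch on a made-up one-link document, parsed with the built-in 'html.parser' (an assumption made to keep it dependency-free).

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with a single link.
html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
a_tag = soup.a

parent_name = a_tag.parent.name                   # the direct parent only
ancestor_names = [p.name for p in a_tag.parents]  # all ancestors, nearest first

print(parent_name)     # p
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry, '[document]', is the name of the BeautifulSoup object itself, which sits at the top of the tree.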

5. Traversing the tree: sibling nodes

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: all following sibling nodes
previous_siblings: all preceding sibling nodes

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Alternative input for trying next_sibling / previous_sibling on a tiny tree:
# html = '<a><b>bbb</b><c>ccc</c></a>'
soup = BeautifulSoup(html_doc,'lxml')

# With the alternative input above (build the soup from html instead of html_doc):
# b_tag = soup.b
# print(b_tag.next_sibling)
# c_tag = soup.c
# print(c_tag.previous_sibling)

# Iterate over every node that follows the first <a> inside the same parent
a_tag = soup.a
for x in a_tag.next_siblings:
    print(x)
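
One thing that tends to surprise people: siblings include text nodes, so the next sibling of the first link is the string between the links, not the second link. The sketch below illustrates this on a shortened, made-up version of the paragraph, parsed with the built-in 'html.parser' (an assumption); find_next_sibling() is the bs4 method that skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened version of the "three sisters" paragraph.
html = ('<p><a id="link1">Elsie</a>,\n'
        '<a id="link2">Lacie</a> and\n'
        '<a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

after_first = first.next_sibling          # the text between the links
next_link = first.find_next_sibling('a')  # skips text nodes, returns the next <a>
later_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(after_first))  # ',\n'
print(next_link['id'])    # link2
print(later_ids)          # ['link2', 'link3']
```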

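6. Searching the tree

1. String filter
2. Regular-expression filter
Compile a pattern with re.compile() and pass it to find() or find_all(); bs4 then returns the tags whose names match the pattern.
3. List filter
4. True filter
5. Function filter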

from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
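# Regular-expression filter: find every tag whose name starts with "t".
# bs4 matches with search(), so the pattern needs ^ to anchor it at the start.
# print(soup.find_all(re.compile('^t')))
# List filter: find all <p> and <a> tags
# print(soup.find_all(['p','a']))
# print(soup.find_all(['title','b']))
# True filter: matches every tag
# print(soup.find_all(True))
# Function filter: keep only the tags that have a class attribute
def fn(tag):
    return tag.has_attr('class')
print(soup.find_all(fn))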
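
For comparison, here is a sketch that runs all five filter kinds against one small document and prints only the names of the matched tags. The document is a cut-down, made-up variant of the example above, the helper names() is hypothetical, and 'html.parser' is used instead of 'lxml' just to keep the sketch dependency-free.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical cut-down document: one title, one paragraph, one link.
html = ('<html><head><title>The Dormouse\'s story</title></head><body>'
        '<p class="title"><b>The Dormouse\'s story</b></p>'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

def names(tags):
    """Hypothetical helper: reduce a result list to tag names."""
    return [t.name for t in tags]

by_string = names(soup.find_all('a'))               # exact tag name
by_regex = names(soup.find_all(re.compile('^t')))   # names starting with "t"
by_list = names(soup.find_all(['title', 'b']))      # any name in the list
by_true = names(soup.find_all(True))                # every tag
by_function = names(soup.find_all(lambda tag: tag.has_attr('class')))

print(by_string)    # ['a']
print(by_regex)     # ['title']
print(by_list)      # ['title', 'b']
print(by_true)      # ['html', 'head', 'title', 'body', 'p', 'b', 'a']
print(by_function)  # ['p', 'a']
```

Without the ^ anchor, re.compile('t') would also match html, because bs4 applies the pattern with search() rather than match().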

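7. Modifying the tree

Change a tag's name and attributes by assigning to them.
Assigning to a tag's string attribute replaces the tag's existing content with the new string.
append() adds content to a tag, much like the .append() method of a Python list.
decompose() removes a tag from the tree and destroys it, which is handy for deleting paragraphs you do not need.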

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# 1. Change a tag's name and attributes
# tag_p = soup.p
# print(tag_p)
# tag_p.name = 'w' # change the name
# tag_p['class'] = 'content' # change an attribute
# print(tag_p)
# 2. Change the string
tag_p = soup.p
# print(tag_p.string)
# tag_p.string = '521 wo ai ni men'
# print(tag_p.string)
# 3. tag.append(): add content to a tag
# print(tag_p)
# tag_p.append('hahaha')
# print(tag_p)
# 4. decompose(): delete a paragraph
result = soup.find(class_ = 'title')
result.decompose()
print(soup)
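
The four operations can also be chained on one small fragment to see their combined effect. This is only a sketch: the fragment and the replacement values are made up, and 'html.parser' is used so that no wrapper html/body tags appear in the output.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph fragment.
soup = BeautifulSoup('<p class="title"><b>old</b></p><p class="story">...</p>',
                     'html.parser')
tag_p = soup.p

tag_p.name = 'div'           # 1. rename the tag
tag_p['class'] = 'content'   #    and change an attribute
tag_p.string = 'new text'    # 2. replace everything inside the tag
tag_p.append(' and more')    # 3. add content at the end of the tag
soup.find(class_='story').decompose()  # 4. delete the second paragraph

result = str(soup)
print(result)  # <div class="content">new text and more</div>
```

Note that assigning to string discards the nested b tag as well, since it replaces all of the tag's existing content.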
