Web scraping: bs4

1. Introduction to bs4 (BeautifulSoup4)

1.1 Basic concepts

Beautiful Soup is a library for extracting data from HTML and XML files, typically the content of web pages.

1.2 Installation

pip install lxml
pip install bs4

2. Using bs4

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
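soup = BeautifulSoup(html_doc, 'lxml')
# print(soup.prettify())
print(soup.title)           # the <title> tag
print(soup.title.name)      # the tag's name: title
print(soup.title.string)    # the text inside the tag
print(soup.p)               # the first <p> tag
print(soup.a)               # the first <a> tag
links = soup.find_all('a')  # every <a> tag in the document
for link in links:
    print(link.get('href'))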

2.2 Object types in bs4

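Tag : an HTML or XML tag
NavigableString : the navigable text inside a tag
BeautifulSoup : the object representing the whole parsed document
Comment : a comment in the markup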
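
As a rough illustration of the four object types, the sketch below parses a tiny made-up fragment and prints the type of each piece. The `snippet` string is hypothetical, and the built-in 'html.parser' is used here only so the example does not depend on lxml; it is an assumption of this sketch, not something the article prescribes.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment used only for illustration.
snippet = '<p class="title"><b>Hello</b><!-- a note --></p>'
soup = BeautifulSoup(snippet, 'html.parser')

tag = soup.b                  # a Tag
text = soup.b.string          # a NavigableString
comment = soup.p.contents[1]  # a Comment, which is a NavigableString subclass

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```

Because Comment subclasses NavigableString, code that collects text should check for Comment explicitly if comments need to be skipped.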

3. Navigating the tree: child nodes

bs4 supports three kinds of operations on the tree: navigating it, searching it, and modifying it.

3.1 .contents, .children and .descendants

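.contents returns a list of the tag's direct children.
.children returns an iterator over the same direct children.
.descendants returns a generator that walks every descendant recursively (children, grandchildren, and so on).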
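
The difference between the three attributes can be seen on a small made-up fragment. This is only a sketch: `snippet` is hypothetical, and 'html.parser' is assumed so that no `<html>`/`<body>` wrapper is added around the fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration.
snippet = '<div><p>one <b>two</b></p><p>three</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')
div = soup.div

direct = div.contents                            # list of direct children
child_names = [c.name for c in div.children]     # iterate the direct children
descendant_count = len(list(div.descendants))    # tags and strings, all levels

print(direct)            # [<p>one <b>two</b></p>, <p>three</p>]
print(child_names)       # ['p', 'p']
print(descendant_count)  # 6
```

The count of 6 includes text nodes: the two `<p>` tags, the `<b>` tag, and the strings 'one ', 'two' and 'three'.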

3.2 .string, .strings and .stripped_strings

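.string returns the text inside a tag. It only works when the tag has a single child; otherwise it returns None.
.strings returns a generator that yields the text of every string inside the tag, so it is used to collect the content of multiple tags.
.stripped_strings is essentially the same as .strings, but it strips surrounding whitespace and skips strings that are whitespace only.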
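
A minimal sketch of the three attributes side by side, using a made-up fragment (`snippet` is hypothetical, and 'html.parser' is assumed to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration.
snippet = '<div><p>  first  </p>\n<p>second</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

single = soup.p.string                   # '  first  '
none_case = soup.div.string              # None: <div> has several children
raw = list(soup.div.strings)             # ['  first  ', '\n', 'second']
clean = list(soup.div.stripped_strings)  # ['first', 'second']

print(repr(single), none_case, raw, clean)
```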

4. Navigating the tree: parent nodes

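.parent and .parents
.parent returns the direct parent node.
.parents iterates over all ancestors, from the direct parent up to the document root.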

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
title_tag = soup.title
print(title_tag)
print(title_tag.parent)  # the <head> tag that contains <title>
a_tag = soup.a
print(a_tag)
print(a_tag.parents)  # a generator object
for x in a_tag.parents:
    print(x)
    print('----------------')
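
Printing every ancestor in full can be hard to read. As a compact alternative, the sketch below prints only the ancestor names. The `snippet` string is hypothetical, and 'html.parser' is an assumption of this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration.
snippet = '<html><body><p><a href="#">x</a></p></body></html>'
soup = BeautifulSoup(snippet, 'html.parser')

a_tag = soup.a
parent_name = a_tag.parent.name
ancestor_names = [p.name for p in a_tag.parents]

print(parent_name)     # p
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry, '[document]', is the name of the BeautifulSoup object itself, which sits at the top of the tree.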

5. Navigating the tree: sibling nodes

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: all following sibling nodes
previous_siblings: all preceding sibling nodes

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# html = '<a><b>bbb</b><c>ccc</c><a>'
soup = BeautifulSoup(html_doc,'lxml')

# # print(soup.prettify())
# b_tag = soup.b
# print(b_tag)
# print(b_tag.next_sibling)
# c_tag = soup.c
# # print(c_tag.next_sibling)
# print(c_tag.previous_sibling)
a_tag = soup.a
# print(a_tag)
for x in a_tag.next_siblings:
    print(x)
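
The single-sibling attributes are easiest to follow on a fragment with no whitespace between tags. The sketch below uses a closed version of the small `<a><b>…</b><c>…</c></a>` fragment from the commented-out line above; 'html.parser' is an assumption of this example.

```python
from bs4 import BeautifulSoup

html = '<a><b>bbb</b><c>ccc</c></a>'
soup = BeautifulSoup(html, 'html.parser')

b_tag = soup.b
c_tag = soup.c

print(b_tag.next_sibling)      # <c>ccc</c>
print(c_tag.previous_sibling)  # <b>bbb</b>
print(c_tag.next_sibling)      # None: <c> is the last child
print(b_tag.previous_sibling)  # None: <b> is the first child
```

In a real document such as html_doc, siblings are often text nodes (for example the ",\n" between two links), so a sibling is not necessarily a tag.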

6. Searching the tree

1. String filter
2. Regular expression filter
Compile a pattern with re.compile() and pass it to find() or find_all() to filter tags by regular expression.
3. List filter
4. True filter
5. Function filter

from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
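# Find all tags whose names start with t: regular expression filter.
# bs4 matches with search(), so anchor the pattern with ^;
# a bare 't' would also match html.
# print(soup.find_all(re.compile('^t')))
# Find the p and a tags: list filter
# print(soup.find_all(['p','a']))
# print(soup.find_all(['title','b']))
# print(soup.find_all(True)) # True filter
# Function filter: keep only the tags that have a class attribute
def fn(tag):
    return tag.has_attr('class')
print(soup.find_all(fn))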
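
To compare the five filters at a glance, the sketch below runs each one against a small made-up document and collects only the matching tag names. `snippet` and the helper `names` are hypothetical, and 'html.parser' is an assumption of this example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document used only for illustration.
snippet = ('<html><head><title>T</title></head>'
           '<body><p class="x"><b>B</b></p><a href="#">A</a></body></html>')
soup = BeautifulSoup(snippet, 'html.parser')

def names(result):
    """Hypothetical helper: reduce a result list to its tag names."""
    return [tag.name for tag in result]

by_string = names(soup.find_all('a'))                # string filter
by_regex = names(soup.find_all(re.compile('^t')))    # names starting with t
by_contains = names(soup.find_all(re.compile('t')))  # names containing t
by_list = names(soup.find_all(['p', 'a']))           # list filter
by_true = names(soup.find_all(True))                 # True filter: every tag
by_func = names(soup.find_all(lambda tag: tag.has_attr('class')))

print(by_string, by_regex, by_contains, by_list, by_true, by_func)
```

Note how the unanchored pattern also returns html, because the regular expression is applied with search() rather than match().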

7. Modifying the tree

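Change a tag's name and attributes.
Assign to .string: the new value replaces the tag's original content.
append(): adds content to a tag, much like Python's list .append() method.
decompose(): removes a tag and everything inside it from the tree, which is useful for deleting paragraphs you do not need.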

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# 1. Change a tag's name and attributes
# tag_p = soup.p
# print(tag_p)
# tag_p.name = 'w' # change the name
# tag_p['class'] = 'content' # change an attribute
# print(tag_p)
# 2. Change .string
tag_p = soup.p
# print(tag_p.string)
# tag_p.string = '521 wo ai ni men'
# print(tag_p.string)
# 3. tag.append(): add content to a tag
# print(tag_p)
# tag_p.append('hahaha')
# print(tag_p)
# 4. decompose(): delete a paragraph
result = soup.find(class_ = 'title')
result.decompose()
print(soup)
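
The four modifications can also be run in sequence on one small fragment. This is only a sketch: `snippet` is hypothetical, the new name and values are arbitrary, and 'html.parser' is assumed so that the output contains no added `<html>`/`<body>` wrapper.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration.
snippet = '<div><p class="title">old</p><p class="story">keep</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

tag_p = soup.p
tag_p.name = 'span'         # 1. rename the tag
tag_p['class'] = 'content'  #    and change its class attribute
tag_p.string = 'new'        # 2. replace the tag's content
tag_p.append(' text')       # 3. add content at the end of the tag
after_edit = str(tag_p)

soup.find(class_='story').decompose()  # 4. delete the remaining <p>
after_delete = str(soup)

print(after_edit)    # <span class="content">new text</span>
print(after_delete)  # <div><span class="content">new text</span></div>
```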