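Python Web Scraping, Data Parsing: the BeautifulSoup Module (Part 2): Searching (continuing from the previous post: parsing sailboat data from sailboatdata.com for the 2023 MCM/ICM spring contest)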

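Contents

IV. Searching with BeautifulSoup (going straight to what we need)

1.find_all(name, attrs, recursive, text, **kwargs)

(1) The name parameter (matches tag names only; attributes and content are not searched)

Rule form 1: string (finds every tag whose name exactly matches the string)

Rule form 2: regular expression

Rule form 3: function

Rule form 4: list of strings

(2) The kwargs parameters: match an attribute with a given value, or test whether an attribute exists

(3) The text parameter: match text

(4) The attrs parameter

(5) The limit parameter: caps the number of results, keeping only the first n

2.find()

3. CSS selectors

(1) Select by tag name directly and get the content

(2) Select by class name, with a leading .

(3) Select by id, with a leading #

(4) Combined selectors

(5) Select by attribute

(6) Select by tag and attribute

(7) Select by child tags

(8) Select by sibling tags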


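IV. Searching with BeautifulSoup (going straight to what we need)

1.find_all(name, attrs, recursive, text, **kwargs)

(1) The name parameter (matches tag names only; attributes and content are not searched)

Finds every matching tag and returns a list whose elements are Tag objects.

Rule form 1: string (finds every tag whose name exactly matches the string)
# bs is the BeautifulSoup object built in the previous post
tags = bs.find_all("a")
for item in tags:
    print(item)

print("-" * 50)
tags = bs.find_all("Skip to content")  # no tag has this name, so nothing is printed
for item in tags:
    print(item)

print("-" * 50)
tags = bs.find_all("tr")
for item in tags:
    # print the text of each child of the row, skipping the newline nodes between cells
    for child in item.contents:
        if str(child.string) != "\n":
            print(child.string, end=" ")
    print()
print(type(item))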

<a class="skip-link screen-reader-text" href="#primary"><!--Skip to content--></a>
<a class="skip-link screen-reader-text" href="#primary">Skip to content</a>
--------------------------------------------------
--------------------------------------------------
Hull Type: Fin w/bulb & spade rudder 
Rigging Type: Fractional Sloop 
LOA: 33.80 ft / 10.30 m 
LWL: 26.90 ft / 8.20 m 
S.A. (reported): 450.00 ft² / 41.81 m² 
Beam: 8.20 ft / 2.50 m 
Displacement: 3,527.00 lb / 1,600 kg 
Ballast: 1,598.00 lb / 725 kg 
Max Draft: 5.90 ft / 1.80 m 
Construction: FG w/core,Composite 
Ballast Type: Lead 
First Built: 1990 
# Built: 350 
Designer: Ron Holland & Rolf Gyhlenius 
<class 'bs4.element.Tag'>
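
The rows printed above are label/value pairs, so they can also be collected into a dictionary. The following is only a minimal sketch: the HTML string is a shortened stand-in for the real page, and the variable name specs is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the spec table on the page (not the full document)
html = """
<table class="table"><tbody>
<tr><td>Hull Type:</td><td>Fin w/bulb &amp; spade rudder</td></tr>
<tr><td>LOA:</td><td>33.80 ft / 10.30 m</td></tr>
<tr><td>First Built:</td><td>1990</td></tr>
</tbody></table>
"""
bs = BeautifulSoup(html, "html.parser")

specs = {}
for row in bs.find_all("tr"):
    cells = row.find_all("td")  # only the <td> tags, so whitespace nodes are skipped
    if len(cells) == 2:
        label = cells[0].get_text(strip=True).rstrip(":")
        specs[label] = cells[1].get_text(strip=True)

print(specs)
```

Searching for td inside each row avoids the explicit "\n" check, because find_all only returns tags.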

Rule form 2: regular expression
import re

tags = bs.find_all(re.compile("t"))
for item in tags:
    print(item)
    print("-"*50)
<html lang="en-US">
<head>
<meta charset="utf-8"/>
......
</meta></meta></link></meta>
--------------------------------------------------
<title>11 METER - sailboatdata</title>
--------------------------------------------------
<meta content="en_US" property="og:locale">
<meta content="article" property="og:type">
<meta content="11 METER - sailboatdata" property="og:title"/>
<meta content="https://sailboatdata.com/sailboat/11-meter/" property="og:url"/>
<meta content="sailboatdata" property="og:site_name"/>
<meta content="https://www.facebook.com/sailboatdata" property="article:publisher"/>
<meta content="2023-05-12T18:11:13+00:00" property="article:modified_time"/>
<meta content="https://sailboatdata.com/wp-content/uploads/2023/03/11_meter_photo.jpg" property="og:image"/>
<meta content="483" property="og:image:width"/>
<meta content="640" property="og:image:height"/>
<meta content="image/jpeg" property="og:image:type"/>
</meta></meta>
--------------------------------------------------
<meta content="article" property="og:type">
<meta content="11 METER - sailboatdata" property="og:title"/>
<meta content="https://sailboatdata.com/sailboat/11-meter/" property="og:url"/>
<meta content="sailboatdata" property="og:site_name"/>
<meta content="https://www.facebook.com/sailboatdata" property="article:publisher"/>
<meta content="2023-05-12T18:11:13+00:00" property="article:modified_time"/>
<meta content="https://sailboatdata.com/wp-content/uploads/2023/03/11_meter_photo.jpg" property="og:image"/>
<meta content="483" property="og:image:width"/>
<meta content="640" property="og:image:height"/>
<meta content="image/jpeg" property="og:image:type"/>
</meta>
--------------------------------------------------
<meta content="11 METER - sailboatdata" property="og:title"/>
--------------------------------------------------
<meta content="https://sailboatdata.com/sailboat/11-meter/" property="og:url"/>
--------------------------------------------------
<meta content="sailboatdata" property="og:site_name"/>
--------------------------------------------------
<meta content="https://www.facebook.com/sailboatdata" property="article:publisher"/>
--------------------------------------------------
<meta content="2023-05-12T18:11:13+00:00" property="article:modified_time"/>
--------------------------------------------------
<meta content="https://sailboatdata.com/wp-content/uploads/2023/03/11_meter_photo.jpg" property="og:image"/>
--------------------------------------------------
<meta content="483" property="og:image:width"/>
--------------------------------------------------
<meta content="640" property="og:image:height"/>
--------------------------------------------------
<meta content="image/jpeg" property="og:image:type"/>
--------------------------------------------------
<table class="table">
<tbody class="table-light">
<!-- start hull section -->
<tr>
.....
<td>Ron Holland &amp; Rolf Gyhlenius</td>
</tr>
<!-- end designer name section -->
</tbody>
</table>
--------------------------------------------------
<tbody class="table-light">
<!-- start hull section -->
<tr>
......
<td>Ron Holland &amp; Rolf Gyhlenius</td>
</tr>
<!-- end designer name section -->
</tbody>
--------------------------------------------------
<tr>
<td>Hull Type:</td>
<td>Fin w/bulb &amp; spade rudder</td>
</tr>
--------------------------------------------------
<td>Hull Type:</td>
--------------------------------------------------
<td>Fin w/bulb &amp; spade rudder</td>
--------------------------------------------------
<tr>
<td>Rigging Type:</td>
<td>Fractional Sloop</td>
</tr>
--------------------------------------------------
<td>Rigging Type:</td>
--------------------------------------------------
<td>Fractional Sloop</td>
--------------------------------------------------
<tr>
<td>LOA:</td>
<td>33.80 ft / 10.30 m</td>
</tr>
--------------------------------------------------
<td>LOA:</td>
--------------------------------------------------
<td>33.80 ft / 10.30 m</td>
--------------------------------------------------
<tr>
<td>LWL:</td>
<td>26.90 ft / 8.20 m</td>
</tr>
--------------------------------------------------
<td>LWL:</td>
--------------------------------------------------
<td>26.90 ft / 8.20 m</td>
--------------------------------------------------
<tr>
<td>S.A. (reported):</td>
<td>450.00 ft² / 41.81 m²</td>
</tr>
--------------------------------------------------
<td>S.A. (reported):</td>
--------------------------------------------------
<td>450.00 ft² / 41.81 m²</td>
--------------------------------------------------
<tr>
<td>Beam:</td>
<td>8.20 ft / 2.50 m</td>
</tr>
--------------------------------------------------
<td>Beam:</td>
--------------------------------------------------
<td>8.20 ft / 2.50 m</td>
--------------------------------------------------
<tr>
<td>Displacement:</td>
<td>3,527.00 lb / 1,600 kg</td>
</tr>
--------------------------------------------------
<td>Displacement:</td>
--------------------------------------------------
<td>3,527.00 lb / 1,600 kg</td>
--------------------------------------------------
<tr>
<td>Ballast:</td>
<td>1,598.00 lb / 725 kg</td>
</tr>
--------------------------------------------------
<td>Ballast:</td>
--------------------------------------------------
<td>1,598.00 lb / 725 kg</td>
--------------------------------------------------
<tr>
<td>Max Draft:</td>
<td>5.90 ft / 1.80 m</td>
</tr>
--------------------------------------------------
<td>Max Draft:</td>
--------------------------------------------------
<td>5.90 ft / 1.80 m</td>
--------------------------------------------------
<tr>
<td>Construction:</td>
<td>FG w/core,Composite</td>
</tr>
--------------------------------------------------
<td>Construction:</td>
--------------------------------------------------
<td>FG w/core,Composite</td>
--------------------------------------------------
<tr>
<td>Ballast Type:</td>
<td>Lead</td>
</tr>
--------------------------------------------------
<td>Ballast Type:</td>
--------------------------------------------------
<td>Lead</td>
--------------------------------------------------
<tr>
<td>First Built:</td>
<td>1990</td>
</tr>
--------------------------------------------------
<td>First Built:</td>
--------------------------------------------------
<td>1990</td>
--------------------------------------------------
<tr>
<td># Built:</td>
<td>350</td>
</tr>
--------------------------------------------------
<td># Built:</td>
--------------------------------------------------
<td>350</td>
--------------------------------------------------
<tr>
<td>Designer:</td>
<td>Ron Holland &amp; Rolf Gyhlenius</td>
</tr>
--------------------------------------------------
<td>Designer:</td>
--------------------------------------------------
<td>Ron Holland &amp; Rolf Gyhlenius</td>
--------------------------------------------------

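The output is shown above. It found every tag whose name contains "t": html, title, the nested meta tags, each individual meta, table, tbody, every tr, and every td inside each tr. The pattern is applied with a search, so the "t" may appear anywhere in the tag name, not only at the start.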
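
To make the "contains" behaviour concrete, here is a small sketch that compares an unanchored pattern with one anchored at the start of the tag name. The HTML string is a made-up miniature page, not the real one.

```python
import re
from bs4 import BeautifulSoup

html = ("<html><head><title>11 METER</title></head>"
        "<body><table><tr><td>LOA:</td></tr></table></body></html>")
bs = BeautifulSoup(html, "html.parser")

# "t" anywhere in the tag name
contains_t = [tag.name for tag in bs.find_all(re.compile("t"))]
# "^t" only matches names that start with t, so <html> drops out
starts_with_t = [tag.name for tag in bs.find_all(re.compile("^t"))]

print(contains_t)
print(starts_with_t)
```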

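Rule form 3: function

Pass in a function, and the search follows the criterion that the function defines:

# nameIsExists is not shown in the post; reconstructed here from the result described below
def nameIsExists(tag):
    return tag.has_attr("name")

tags = bs.find_all(nameIsExists)
for item in tags:
    print(item)
    print("-"*50)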

<meta content="width=device-width, initial-scale=1" name="viewport"/>
--------------------------------------------------
<meta content="app-id=1545197438" name="apple-itunes-app"/>
--------------------------------------------------
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<!-- This site is optimized with the Yoast SEO plugin v20.6 - https://yoast.com/wordpress/plugins/seo/ -->
<title>11 METER - sailboatdata</title>
<link href="https://sailboatdata.com/sailboat/11-meter/" rel="canonical">
<meta content="en_US" property="og:locale">
<meta content="article" property="og:type">
<meta content="11 METER - sailboatdata" property="og:title"/>
<meta content="https://sailboatdata.com/sailboat/11-meter/" property="og:url"/>
<meta content="sailboatdata" property="og:site_name"/>
<meta content="https://www.facebook.com/sailboatdata" property="article:publisher"/>
<meta content="2023-05-12T18:11:13+00:00" property="article:modified_time"/>
<meta content="https://sailboatdata.com/wp-content/uploads/2023/03/11_meter_photo.jpg" property="og:image"/>
<meta content="483" property="og:image:width"/>
<meta content="640" property="og:image:height"/>
<meta content="image/jpeg" property="og:image:type"/>
</meta></meta></link></meta>
--------------------------------------------------
 

This found every tag that has a name attribute.
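
A filter function can combine several conditions. The sketch below keeps only the Open Graph meta tags; the function name is_og_meta and the shortened HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html = """
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="article" property="og:type"/>
<meta content="11 METER - sailboatdata" property="og:title"/>
<title>11 METER - sailboatdata</title>
"""
bs = BeautifulSoup(html, "html.parser")

def is_og_meta(tag):
    """Hypothetical filter: <meta> tags whose property attribute starts with 'og:'."""
    return tag.name == "meta" and tag.get("property", "").startswith("og:")

# find_all calls the function once per tag and keeps the tags for which it returns True
og = {tag["property"]: tag["content"] for tag in bs.find_all(is_og_meta)}
print(og)
```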

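Rule form 4: list of strings

Pass in a list of strings, and find_all returns every tag whose name matches any item in the list.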
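
The post gives no example for this form, so here is a minimal sketch on a made-up miniature page. Results come back in document order, not in the order of the list.

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>11 METER - sailboatdata</title></head>"
        "<body><h1>11 METER</h1><a href='#primary'>Skip to content</a></body></html>")
bs = BeautifulSoup(html, "html.parser")

# A tag is returned if its name equals any string in the list
tags = bs.find_all(["a", "h1", "title"])
names = [tag.name for tag in tags]
print(names)
```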

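(2) The kwargs parameters: match an attribute with a given value, or test whether an attribute exists

Returns a list whose elements are Tag objects.

tags = bs.find_all(class_=True)
for item in tags:
    print(item)
    print("-"*50)

print(bs.find_all(content="app-id=1545197438"))

The first call finds every tag that has a class attribute. Because class is a Python keyword, the keyword argument is spelled class_. Its output is omitted here.

--------------------------------------------------
[<meta content="app-id=1545197438" name="apple-itunes-app"/>]

The second call finds the tag whose content attribute equals the given value.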
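
Keyword filters accept more than True or an exact string. The sketch below, on a shortened stand-in for the page, also shows a single class name matching a multi-class attribute and a regular expression used as an attribute value.

```python
import re
from bs4 import BeautifulSoup

html = """
<a class="skip-link screen-reader-text" href="#primary">Skip to content</a>
<main class="site-main" id="primary"><div class="boats-dimension"><h1>11 METER</h1></div></main>
"""
bs = BeautifulSoup(html, "html.parser")

by_class = bs.find_all(class_="skip-link")     # one class out of several is enough
by_id = bs.find_all(id="primary")              # exact attribute value
by_href = bs.find_all(href=re.compile(r"^#"))  # attribute value matched by a regex

print([tag.name for tag in by_class])
print([tag.name for tag in by_id])
print([tag["href"] for tag in by_href])
```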

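(3) The text parameter: match text

Returns a list whose elements are NavigableString or Comment objects. In Beautiful Soup 4.4.0 and later this argument is named string; text is the older spelling.

tags = bs.find_all("Skip to content")  # finds nothing: the first positional argument matches tag names only
for item in tags:
    print(item)

print("-" * 50)

strings = bs.find_all(text="Skip to content")  # found: text= matches string content
for item in strings:
    print(type(item), item)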

--------------------------------------------------
<class 'bs4.element.Comment'> Skip to content
<class 'bs4.element.NavigableString'> Skip to content

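Of course, used this way it is of little value: what comes back looks the same as what we passed in.

In practice we usually put a list or a regular expression after text=.

With a regular expression, the call returns the complete content of every NavigableString (Comment included) in the whole HTML document that satisfies the pattern.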
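
As a sketch of the list and regular-expression forms, the example below searches a shortened stand-in for the spec table. It uses string=, the newer name of the text= argument (on versions before 4.4.0, write text= instead), and then walks from a matched label to the neighbouring cell.

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>LOA:</td><td>33.80 ft / 10.30 m</td></tr>
<tr><td>Beam:</td><td>8.20 ft / 2.50 m</td></tr>
<tr><td>First Built:</td><td>1990</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

# Regex: every string that contains a decimal number followed by " ft"
lengths = bs.find_all(string=re.compile(r"\d+\.\d+ ft"))
# List: every string that equals one of the items
labels = bs.find_all(string=["LOA:", "Beam:"])

# A matched string still knows where it sits in the tree
loa_value = labels[0].find_parent("td").find_next_sibling("td").get_text()

print([str(s) for s in lengths])
print([str(s) for s in labels])
print(loa_value)
```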

(4) The attrs parameter

print(bs.find_all(name="viewport"))
print(bs.find_all(attrs={"name":"viewport"}))

[]
[<meta content="width=device-width, initial-scale=1" name="viewport"/>]

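The first call finds nothing, because name is itself a parameter of find_all: name="viewport" looks for tags named viewport rather than for a name attribute. Passing the attribute through the attrs dictionary avoids the clash.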
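
The attrs dictionary is also the way to filter on attributes whose names are not valid Python keyword arguments. In the sketch below the data-boat-id attribute is invented for illustration and does not come from the real page.

```python
from bs4 import BeautifulSoup

html = """
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="app-id=1545197438" name="apple-itunes-app"/>
<meta content="article" property="og:type"/>
<div data-boat-id="11-meter"></div>
"""
bs = BeautifulSoup(html, "html.parser")

as_tag_name = bs.find_all(name="viewport")                  # looks for <viewport> tags
named = bs.find_all("meta", attrs={"name": True})           # <meta> tags that have a name attribute
viewport = bs.find_all(attrs={"name": "viewport"})          # name attribute with an exact value
by_data = bs.find_all(attrs={"data-boat-id": "11-meter"})   # hypothetical attribute with a hyphen

print(as_tag_name)
print([tag["name"] for tag in named])
print(viewport)
print([tag.name for tag in by_data])
```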

(5) The limit parameter: caps the number of results, keeping only the first n
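
The post gives no example here, so the following is a minimal sketch on a shortened stand-in for the spec table. The search stops as soon as the requested number of matches has been collected.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>Hull Type:</td><td>Fin w/bulb &amp; spade rudder</td></tr>
<tr><td>Rigging Type:</td><td>Fractional Sloop</td></tr>
<tr><td>LOA:</td><td>33.80 ft / 10.30 m</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

first_two = bs.find_all("tr", limit=2)               # at most two rows, in document order
labels = [row.td.get_text() for row in first_two]    # row.td is the first <td> of the row
print(labels)
```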


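2.find()

Equivalent to find_all with limit=1, except that it returns the single matching element itself (or None when nothing matches) instead of a list.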
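
The difference in return type matters when nothing matches. A small sketch, assuming a one-row stand-in table:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>LOA:</td><td>33.80 ft / 10.30 m</td></tr></table>"
bs = BeautifulSoup(html, "html.parser")

first_td = bs.find("td")                # a Tag
as_list = bs.find_all("td", limit=1)    # a one-element list holding the same Tag
missing = bs.find("video")              # None, because the page has no <video>

# Guard against None before calling Tag methods on the result
missing_text = missing.get_text() if missing is not None else ""

print(first_td.get_text())
print(as_list[0] is first_td)
print(missing)
```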

3. CSS selectors

BeautifulSoup supports a subset of CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.
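
As a quick orientation before the individual cases, the sketch below contrasts select(), which always returns a list, with select_one(), which returns the first match or None. The HTML is a shortened stand-in for the page.

```python
from bs4 import BeautifulSoup

html = """
<main id="primary">
<div class="boats-dimension"><h1>11 METER</h1></div>
<table class="table"><tr><td>LOA:</td><td>33.80 ft / 10.30 m</td></tr></table>
</main>
"""
bs = BeautifulSoup(html, "html.parser")

heading = bs.select_one(".boats-dimension h1")   # first match, or None
cells = bs.select("#primary table.table td")     # always a list
missing = bs.select_one("video")                 # None, no such tag

print(heading.get_text())
print([cell.get_text() for cell in cells])
print(missing)
```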

(1) Select by tag name directly and get the content

print("Select by tag name directly and get the content")
tags = bs.select("a")
for item in tags:
    print(type(item.getText()), item.getText())

Select by tag name directly and get the content
<class 'str'> 
<class 'str'> Skip to content

(2) Select by class name, with a leading .

print("Select by class name, with a leading .")
tags = bs.select(".boats-dimension")
for item in tags:
    print(item)

Select by class name, with a leading .
<div class="boats-dimension">
<h1>11 METER</h1>
</div>

(3) Select by id, with a leading #

print("Select by id, with a leading #")
tags = bs.select("#primary")
for item in tags:
    print(item)

Select by id, with a leading #
<main class="site-main" id="primary">
<div class="container">
<div class="row">
<div class="col-lg-8 sailboat-content-wrapper">
<div class="boats-dimension">
<h1>11 METER</h1>
</div>
<div class="spec">
<div class="spec-bord">
<table class="table">
<tbody class="table-light">
<!-- start hull section -->
<tr>
<td>Hull Type:</td>
<td>Fin w/bulb &amp; spade rudder</td>
</tr>
<!-- end hull section -->
<!-- start rig section -->
<tr>
<td>Rigging Type:</td>
<td>Fractional Sloop</td>
</tr>
<!-- end rig section -->
<!-- start loa section -->
<tr>
<td>LOA:</td>
<td>33.80 ft / 10.30 m</td>
</tr>
<!-- end loa section -->
<!-- start lod section -->
<!-- end lod section -->
<!-- start lwl section -->
<tr>
<td>LWL:</td>
<td>26.90 ft / 8.20 m</td>
</tr>
<!-- end lwl section -->
<!-- start sa section -->
<tr>
<td>S.A. (reported):</td>
<td>450.00 ft² / 41.81 m²</td>
</tr>
<!-- end sa section -->
<!-- start beam section -->
<tr>
<td>Beam:</td>
<td>8.20 ft / 2.50 m</td>
</tr>
<!-- end beam section -->
<!-- start beam wl section -->
<!-- end beam wl section -->
<!-- start disp section -->
<tr>
<td>Displacement:</td>
<td>3,527.00 lb / 1,600 kg</td>
</tr>
<!-- end disp section -->
<!-- start ballast section -->
<tr>
<td>Ballast:</td>
<td>1,598.00 lb / 725 kg</td>
</tr>
<!-- end ballast section -->
<!-- start max draft section -->
<tr>
<td>Max Draft:</td>
<td>5.90 ft / 1.80 m</td>
</tr>
<!-- end max draft section -->
<!-- start min draft section -->
<!-- end min draft section -->
<!-- start hull construction section -->
<tr>
<td>Construction:</td>
<td>FG w/core,Composite</td>
</tr>
<!-- end hull construction section  -->
<!-- start ballast type section -->
<tr>
<td>Ballast Type:</td>
<td>Lead</td>
</tr>
<!-- end ballast type section  -->
<!-- start Bridgedeck Clearance section -->
<!-- end Bridgedeck Clearance section -->
<!-- start first built section -->
<tr>
<td>First Built:</td>
<td>1990</td>
</tr>
<!-- end first built section -->
<!-- start last built section -->
<!-- end last built section -->
<!-- start number built section -->
<tr>
<td># Built:</td>
<td>350</td>
</tr>
<!-- end number built section -->
<!-- start builder name section -->
<!-- end builder name section -->
<!-- start designer name section -->
<tr>
<td>Designer:</td>
<td>Ron Holland &amp; Rolf Gyhlenius</td>
</tr>
<!-- end designer name section -->
</tbody>
</table>
</div>
</div>
<!-- Auxiliary Power/Tanks -->
</div>
</div>
</div>
</main>

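(4) Combined selectors

print("Combined selectors")
tags = bs.select("div .boats-dimension")  # the space means: any .boats-dimension element inside a div
for item in tags:
    print(item)

Combined selectors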
<div class="boats-dimension">
<h1>11 METER</h1>
</div>
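
The space in "div .boats-dimension" is significant. The sketch below contrasts it with "div.boats-dimension" (no space); the extra p element is invented so that the two selectors give different results.

```python
from bs4 import BeautifulSoup

html = """
<div class="col">
<div class="boats-dimension"><h1>11 METER</h1></div>
<p class="boats-dimension">note</p>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

# With a space: any element with the class that is a descendant of a <div>
descendants = bs.select("div .boats-dimension")
# Without a space: a <div> that itself carries the class
same_element = bs.select("div.boats-dimension")

print([tag.name for tag in descendants])
print([tag.name for tag in same_element])
```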

(5) Select by attribute

print("Select by attribute")
tags = bs.select('[href="#primary1"]')
for item in tags:
    print(item)

Select by attribute
<a class="skip-link screen-reader-text" href="#primary1"><!--Skip to content--></a>

(6) Select by tag and attribute

print("Select by tag and attribute")
tags = bs.select('a[href="#primary1"]')
for item in tags:
    print(item)

Select by tag and attribute
<a class="skip-link screen-reader-text" href="#primary1"><!--Skip to content--></a>

(7) Select by child tags

print("Select by child tags")
tags = bs.select('div > div > div > div > div')  # > means direct child: a div nested five levels deep
print(tags)

Select by child tags
[<div class="spec-bord">
<table class="table">
<tbody class="table-light">
<!-- start hull section -->
<tr>
<td>Hull Type:</td>
<td>Fin w/bulb &amp; spade rudder</td>
</tr>
<!-- end hull section -->
<!-- start rig section -->
<tr>
<td>Rigging Type:</td>
<td>Fractional Sloop</td>
</tr>
<!-- end rig section -->
<!-- start loa section -->
<tr>
<td>LOA:</td>
<td>33.80 ft / 10.30 m</td>
</tr>
<!-- end loa section -->
<!-- start lod section -->
<!-- end lod section -->
<!-- start lwl section -->
<tr>
<td>LWL:</td>
<td>26.90 ft / 8.20 m</td>
</tr>
<!-- end lwl section -->
<!-- start sa section -->
<tr>
<td>S.A. (reported):</td>
<td>450.00 ft² / 41.81 m²</td>
</tr>
<!-- end sa section -->
<!-- start beam section -->
<tr>
<td>Beam:</td>
<td>8.20 ft / 2.50 m</td>
</tr>
<!-- end beam section -->
<!-- start beam wl section -->
<!-- end beam wl section -->
<!-- start disp section -->
<tr>
<td>Displacement:</td>
<td>3,527.00 lb / 1,600 kg</td>
</tr>
<!-- end disp section -->
<!-- start ballast section -->
<tr>
<td>Ballast:</td>
<td>1,598.00 lb / 725 kg</td>
</tr>
<!-- end ballast section -->
<!-- start max draft section -->
<tr>
<td>Max Draft:</td>
<td>5.90 ft / 1.80 m</td>
</tr>
<!-- end max draft section -->
<!-- start min draft section -->
<!-- end min draft section -->
<!-- start hull construction section -->
<tr>
<td>Construction:</td>
<td>FG w/core,Composite</td>
</tr>
<!-- end hull construction section  -->
<!-- start ballast type section -->
<tr>
<td>Ballast Type:</td>
<td>Lead</td>
</tr>
<!-- end ballast type section  -->
<!-- start Bridgedeck Clearance section -->
<!-- end Bridgedeck Clearance section -->
<!-- start first built section -->
<tr>
<td>First Built:</td>
<td>1990</td>
</tr>
<!-- end first built section -->
<!-- start last built section -->
<!-- end last built section -->
<!-- start number built section -->
<tr>
<td># Built:</td>
<td>350</td>
</tr>
<!-- end number built section -->
<!-- start builder name section -->
<!-- end builder name section -->
<!-- start designer name section -->
<tr>
<td>Designer:</td>
<td>Ron Holland &amp; Rolf Gyhlenius</td>
</tr>
<!-- end designer name section -->
</tbody>
</table>
</div>]

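(8) Select by sibling tags

print("Select by sibling tags")
tags = bs.select('.spec ~ .boats-dimension')  # finds nothing: ~ only matches siblings that come later
tags = bs.select('.boats-dimension ~ .spec')  # order matters: .spec follows .boats-dimension
print(tags)

Select by sibling tags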
[<div class="spec">
<div class="spec-bord">
<table class="table">
<tbody class="table-light">
<!-- start hull section -->
<tr>
<td>Hull Type:</td>
<td>Fin w/bulb &amp; spade rudder</td>
</tr>
<!-- end hull section -->
<!-- start rig section -->
<tr>
<td>Rigging Type:</td>
<td>Fractional Sloop</td>
</tr>
<!-- end rig section -->
<!-- start loa section -->
<tr>
<td>LOA:</td>
<td>33.80 ft / 10.30 m</td>
</tr>
<!-- end loa section -->
<!-- start lod section -->
<!-- end lod section -->
<!-- start lwl section -->
<tr>
<td>LWL:</td>
<td>26.90 ft / 8.20 m</td>
</tr>
<!-- end lwl section -->
<!-- start sa section -->
<tr>
<td>S.A. (reported):</td>
<td>450.00 ft² / 41.81 m²</td>
</tr>
<!-- end sa section -->
<!-- start beam section -->
<tr>
<td>Beam:</td>
<td>8.20 ft / 2.50 m</td>
</tr>
<!-- end beam section -->
<!-- start beam wl section -->
<!-- end beam wl section -->
<!-- start disp section -->
<tr>
<td>Displacement:</td>
<td>3,527.00 lb / 1,600 kg</td>
</tr>
<!-- end disp section -->
<!-- start ballast section -->
<tr>
<td>Ballast:</td>
<td>1,598.00 lb / 725 kg</td>
</tr>
<!-- end ballast section -->
<!-- start max draft section -->
<tr>
<td>Max Draft:</td>
<td>5.90 ft / 1.80 m</td>
</tr>
<!-- end max draft section -->
<!-- start min draft section -->
<!-- end min draft section -->
<!-- start hull construction section -->
<tr>
<td>Construction:</td>
<td>FG w/core,Composite</td>
</tr>
<!-- end hull construction section  -->
<!-- start ballast type section -->
<tr>
<td>Ballast Type:</td>
<td>Lead</td>
</tr>
<!-- end ballast type section  -->
<!-- start Bridgedeck Clearance section -->
<!-- end Bridgedeck Clearance section -->
<!-- start first built section -->
<tr>
<td>First Built:</td>
<td>1990</td>
</tr>
<!-- end first built section -->
<!-- start last built section -->
<!-- end last built section -->
<!-- start number built section -->
<tr>
<td># Built:</td>
<td>350</td>
</tr>
<!-- end number built section -->
<!-- start builder name section -->
<!-- end builder name section -->
<!-- start designer name section -->
<tr>
<td>Designer:</td>
<td>Ron Holland &amp; Rolf Gyhlenius</td>
</tr>
<!-- end designer name section -->
</tbody>
</table>
</div>
</div>]
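
To illustrate why the order matters, here is a small sketch that also contrasts ~ (any later sibling) with + (the immediately following sibling). The wrapper and the extra div are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="wrapper">
<div class="boats-dimension"><h1>11 METER</h1></div>
<div class="spec">specs</div>
<div class="extra">more</div>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

later = bs.select(".boats-dimension ~ div")       # every later sibling div
adjacent = bs.select(".boats-dimension + div")    # only the next sibling div
reverse = bs.select(".spec ~ .boats-dimension")   # empty: .boats-dimension comes first

print([tag["class"] for tag in later])
print([tag["class"] for tag in adjacent])
print(reverse)
```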
