The find and find_all Methods in the BeautifulSoup Library

Latest recommended article published on 2024-04-30 15:07:49

Coaa.


Tags: python

Original article: https://www.cnblogs.com/keye/p/7868059.html


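When parsing complex HTML pages, it is very important to use these two methods flexibly. This post summarizes how to use them.

They are mainly used to look up either a group of tags or a single tag: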

**find_all method:**

(Finds every place a match occurs, so when there are several matches it returns a list of results.)
.find_all(name, attrs, recursive, text, limit, **kwargs)
① tag.find_all(…)
② soup.find_all(…)
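
The two call forms differ in search scope: called on the soup object, find_all looks through the whole document, while called on a tag it looks only inside that tag. A minimal sketch of that difference, assuming an invented HTML snippet (the `menu` and `footer` ids and the link targets are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to show the difference in scope.
html = """
<div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="footer"><a href="/contact">Contact</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.find_all(...) searches the whole document.
all_links = soup.find_all("a")

# tag.find_all(...) searches only the descendants of that tag.
menu = soup.find("div", id="menu")
menu_links = menu.find_all("a")

print(len(all_links))
print([a["href"] for a in menu_links])
```

Narrowing down to a container tag first and then searching inside it is often easier than writing one very specific query against the whole document.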

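<1> name: the tag name to match. You can pass a single tag name, or a Python list (or set) of several tag names.
<2> attrs: a Python dictionary holding one or more attributes of a tag together with their values.
stock_info = stockinfo.find_all(attrs={'class': 'bets-name'})
<3> recursive: a boolean. If recursive is True, find_all searches all children of the tag, and the children of those children, for tags that match. If recursive is False, find_all only looks at the direct children of the tag it is called on. find_all searches recursively by default (recursive defaults to True), so this parameter usually does not need to be set.
<4> text: matches against the text content of tags rather than their attributes.
<5> …: rarely used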

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_doc, "html.parser")

# Print all tags in the soup object named "title"
print(soup.find_all("title"))


# Print all tags in the soup object named "title" or "a"
print(soup.find_all(["title", "a"]))

# Print all tags whose "class" attribute has the value "sister"
print(soup.find_all(attrs={"class": "sister"}))


# Print all tags whose "id" attribute has the value "link1"
print(soup.find_all(attrs={"id": "link1"}))

# Print all tags whose "class" attribute is "story", "title" or "sister"
print(soup.find_all(attrs={"class": ["story", "title", "sister"]}))

# Print how many text nodes are exactly "The Dormouse's story" (via the text parameter)
print(len(soup.find_all(text="The Dormouse's story")))
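
The recursive and limit parameters are described above but not demonstrated. The sketch below is one way to see their effect; the nested markup is invented for illustration, and only find_all arguments already listed in the signature are used:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly under <div>, one nested deeper.
html = "<div><p>outer</p><section><p>inner</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default (recursive=True): every descendant is searched.
all_p = div.find_all("p")

# recursive=False: only the direct children of div are considered.
direct_p = div.find_all("p", recursive=False)

# limit=1: stop after the first match; the result is still a list.
first_only = div.find_all("p", limit=1)

print([p.get_text() for p in all_p])
print([p.get_text() for p in direct_p])
print(len(first_only))
```

With recursive=False the nested paragraph is skipped because it is a grandchild of the div, not a direct child.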

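find method

(Finds the first place a match occurs.)
Start with the HTML code below. [Image: contents of ecologicalpyramid.html]
The code above is a simple representation of an ecological pyramid. To find the primary producers, primary consumers, or secondary consumers, we can use Beautiful Soup.
The name of the first producer is inside the first ul tag, so the find() method can be used to locate it.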

from bs4 import BeautifulSoup

# The file holds the ecological pyramid markup
with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
producer_entries = soup.find('ul')
# Returns the first ul tag; the return value is a Tag object
print(producer_entries.li.div.string)
# The string of the div tag inside the li tag, i.e. plants
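
Because the original HTML was shown only as an image, the find() examples cannot be run as they stand. The sketch below uses an assumed stand-in for ecologicalpyramid.html, reconstructed loosely from the names and outputs mentioned in this post; the exact structure, the `producers` id, and the `producerlist` and `name` classes are guesses, not the original file:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for ecologicalpyramid.html (structure is assumed).
ecological_pyramid = """
<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist"><div class="name">plants</div></li>
    <li class="producerlist"><div class="name">algae</div></li>
  </ul>
  <ul id="primaryconsumers">
    <li class="primaryconsumerlist"><div class="name">deer</div></li>
  </ul>
</div>
"""
soup = BeautifulSoup(ecological_pyramid, "html.parser")

# find() returns only the first match, as a single Tag.
first_ul = soup.find("ul")
first_name = first_ul.li.div.string

# It behaves like find_all(..., limit=1) with the list unwrapped.
same_tag = soup.find_all("ul", limit=1)[0]

# When nothing matches, find() returns None rather than an empty list.
missing = soup.find("table")

print(first_ul["id"], first_name, missing)
```

Since find() can return None, chained access such as `soup.find('ul').li` raises AttributeError when the tag is absent, so a None check is worth adding when the page structure is not guaranteed.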

find(name, attrs, recursive, text, **kwargs)

# name parameter
from bs4 import BeautifulSoup

with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
producer_entries = soup.find('ul')
print(type(producer_entries))
# Output: <class 'bs4.element.Tag'>

# text parameter
from bs4 import BeautifulSoup

with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
plants_string = soup.find(text='plants')
print(plants_string)
# Output: plants

# Likewise, you can pass a list of strings as the text parameter,
# and find_all() will find every string that appears in the list.
all_texts_in_list = soup.find_all(text=['plants', 'algae'])
print(all_texts_in_list)
# Output: ['plants', 'algae']
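
A text search returns the matched string itself, not the tag that contains it. The sketch below illustrates this on an invented list; it uses the `string` keyword, which recent Beautiful Soup releases (4.4.0 and later) accept as the newer name for `text`, so it assumes such a version is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate text matching.
html = (
    "<ul>"
    "<li><div class='name'>plants</div></li>"
    "<li><div class='name'>algae</div></li>"
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")

# The result is a NavigableString, not a Tag.
match = soup.find(string="plants")
match_type = type(match).__name__

# .parent walks back up to the tag that holds the string.
parent_name = match.parent.name

# A list matches any of its entries.
names = [str(s) for s in soup.find_all(string=["plants", "algae"])]

# A plain string must match the whole text; "plant" alone finds nothing.
partial = soup.find(string="plant")

print(match_type, parent_name, names, partial)
```

For partial matches, a compiled regular expression can be passed instead of a plain string.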

# attrs parameter
from bs4 import BeautifulSoup

with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, 'html.parser')
primary_consumer = soup.find(attrs={'id': 'primaryconsumers'})
print(primary_consumer.li.div.string)
# Output: deer

# Inside an attrs dictionary the key is "class"; class_ is only the keyword-argument form
all_tertiaryconsumers = soup.find_all(attrs={'class': 'tertiaryconsumerslist'})
for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
# Output: lion
#         tiger
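
The same attribute filter can be written either as an attrs dictionary or as a keyword argument. Because `class` is a reserved word in Python, the keyword form is spelled `class_`. A small sketch, assuming an invented fragment that mirrors the class name used above (the `tertiaryconsumers` id and the `name` class are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the names used in this post.
html = """
<ul id="tertiaryconsumers">
  <li class="tertiaryconsumerslist"><div class="name">lion</div></li>
  <li class="tertiaryconsumerslist"><div class="name">tiger</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary form: the key is the real attribute name, "class".
by_attrs = soup.find_all(attrs={"class": "tertiaryconsumerslist"})

# Keyword form: class_ avoids the clash with Python's class keyword.
by_keyword = soup.find_all(class_="tertiaryconsumerslist")

# Other attributes, such as id, can be passed as keywords directly.
container = soup.find(id="tertiaryconsumers")

animals = [li.div.string for li in by_keyword]
print(by_attrs == by_keyword, container.name, animals)
```

The dictionary form remains useful for attribute names that are not valid Python identifiers, such as those containing a hyphen.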

Searching with regular expressions

import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div> 
abc@example.com 
<div>xyz@example.com</div> 
<span>foo@example.com</span> 
"""

soup = BeautifulSoup(email_id_example, 'html.parser')
# Compiled regular expression object; the raw string keeps the backslashes intact
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)
# Output: abc@example.com (with surrounding whitespace, because it is a bare text node)

email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)
# Output: the three text nodes containing abc@example.com, xyz@example.com and foo@example.com
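
A regular expression passed to find_all selects whole text nodes, so the matched address may still carry surrounding whitespace. The sketch below shows one way to pull out just the address, and also applies a pattern to an attribute value. The markup and the mailto link are invented for illustration, and the `string` keyword assumes Beautiful Soup 4.4.0 or later:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: addresses in text nodes plus one in an href.
html = """<br/>
<div>Contact list</div>
abc@example.com
<div>xyz@example.com</div>
<a href="mailto:foo@example.com">write to foo</a>
"""
soup = BeautifulSoup(html, "html.parser")
email_pattern = re.compile(r"\w+@\w+\.\w+")

# find_all returns the matching text nodes; re-apply the pattern
# to each node to keep only the address itself.
emails = [
    email_pattern.search(node).group()
    for node in soup.find_all(string=email_pattern)
]

# A compiled pattern also works as an attribute filter.
mailto_links = soup.find_all("a", href=re.compile(r"^mailto:"))
hrefs = [a["href"] for a in mailto_links]

print(emails)
print(hrefs)
```

The address inside the href is not returned by the text search, since attribute values are not text nodes; it has to be matched through the attribute filter instead.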

Reference blogs:
https://www.cnblogs.com/keye/p/7868059.html
https://blog.csdn.net/bear_n/article/details/52067523
