Python Web Scraping, Part 1: The Road to Web Scraping (3): Searching the Document Tree with BeautifulSoup

SMT深海的鱼

Category: Web scraping. Tags: python

Article link: https://blog.csdn.net/ab19920904/article/details/107320450

3. Beautiful Soup defines many search methods. This section focuses on two of them: `find()` and `find_all()`.

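3.1 Filters

Filters run through the entire search API. A filter can be applied to a tag's name, to a node's attributes, to a string, or to any combination of these.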
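To make the idea of a filter concrete, here is a minimal sketch of the kinds of values `find_all()` accepts as a filter: a string, a regular expression, a list, `True`, and a function. The markup is a trimmed copy of the three-sisters document used in this article, and the helper name `has_class_but_no_id` is only illustrative.

```python
import re
from bs4 import BeautifulSoup

# Trimmed sample markup, used here only so the sketch is self-contained
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# 1. String filter: matches tags with exactly this name
by_string = [tag.name for tag in soup.find_all("b")]

# 2. Regular-expression filter: tag names that start with "b"
by_regex = [tag.name for tag in soup.find_all(re.compile("^b"))]

# 3. List filter: matches a tag whose name is any item in the list
by_list = [tag.name for tag in soup.find_all(["title", "b"])]

# 4. True: matches every tag, but none of the text strings
all_tags = [tag.name for tag in soup.find_all(True)]

# 5. Function filter: called once per tag, keeps the tags it returns True for
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

by_function = [tag.name for tag in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_function, sep="\n")
```

The same kinds of filter can also be used for attribute values and for the `string` argument, which is what "any combination" refers to.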

3.2 find_all() / findAll()

`find_all()` is the usual way to look up tag elements: <>.find_all(name, attrs, recursive, string, **kwargs). It returns a list-like object that holds the matching results.

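• name: a filter on the tag name
• attrs: a filter on tag attribute values; individual attributes can also be named explicitly as keyword arguments
• recursive: whether to search all descendants; defaults to True
• string: a filter on the text found between <>…</>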
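The parameters listed above can be sketched as follows. This is a minimal illustration on a shortened, assumed copy of the sample markup, not the article's full document.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, used only to keep the sketch self-contained
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">Sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# name + attrs: <a> tags whose class is "sister", attributes given as a dict
sisters = soup.find_all("a", attrs={"class": "sister"})
sister_ids = [a["id"] for a in sisters]

# **kwargs: an attribute filter written as a keyword argument
lacie = soup.find_all("a", id="link2")

# recursive=False: only direct children of <html> are examined,
# so <title> (a child of <head>) is not found
direct = soup.html.find_all("title", recursive=False)
nested = soup.html.find_all("title")

# string: match on the text itself, or on tags that contain that text
elsie_text = soup.find_all(string="Elsie")
elsie_tag = soup.find_all("a", string="Elsie")

print(sister_ids, lacie, direct, nested, elsie_text, elsie_tag, sep="\n")
```

Note that `string` on its own returns the matching text nodes, while combining it with a tag name returns the tags whose text matches.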

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Finding tags by name

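from bs4 import BeautifulSoup

# html_doc is assumed to be a string holding the HTML document shown above
soup = BeautifulSoup(html_doc, "html.parser")

# Look up <a> tags by tag name with find_all(); the result is a list-like object
# (findAll is the older camelCase alias of find_all)
print("All <a> tags:", soup.findAll('a'))

print("***********************************************************")

# Pass the tag names as a list to find every <a> tag and <b> tag in one call
print("All <a> and <b> tags:", soup.find_all(['a', 'b']))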

Looping over all matching tags

for t in soup.findAll('a'):  # iterate over every <a> tag, binding each result to t
    print("t is:", t)        # t is a Tag object
    print("type of t:", type(t))
    print("href attribute of the <a> tag:", t.get('href'))  # the URL stored in the href attribute
    print("string inside the <a> tag:", t.string)
    print("************************************************")

Finding specific elements

# Search by explicitly named attributes
print('<a> tags whose href is http://example.com/lacie:', soup.findAll('a', href="http://example.com/lacie"))
print('*******************************************************************************')
print('Tags whose id is link1:', soup.find_all(id='link1'))  # find elements whose id attribute is link1
print('********************************************************************************')
# Find elements whose class attribute is title. Because class is a Python keyword,
# the argument is written with a trailing underscore: class_
print('Tags whose class is title:', soup.find_all(class_='title'))
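The examples so far all use `find_all()`. As a minimal sketch of how `find()`, the other method named at the start of this section, differs from it, the snippet below runs both on an assumed, shortened copy of the sample markup.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, used only to keep the sketch self-contained
html_doc = """<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns only the first match, as a single Tag
first_link = soup.find("a", class_="sister")

# find_all() returns every match in a list-like object
all_links = soup.find_all("a", class_="sister")

# limit=1 stops find_all() after one match, but the result is still list-like
one_link = soup.find_all("a", class_="sister", limit=1)

# When nothing matches, find() returns None and find_all() returns an empty result
missing_one = soup.find("table")
missing_all = soup.find_all("table")

print(first_link["id"], len(all_links), len(one_link), missing_one, len(missing_all))
```

Because `find()` can return `None`, it is worth checking the result before reading attributes from it.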
