Python 3 Web Scraping NetEase Cloud Music (Part 3): BeautifulSoup Usage, Part 2

Latest recommended article published on 2020-10-29 21:22:24

时光机丶

Category: Python web scraping. Tags: html, crawler, python

Article link: https://blog.csdn.net/qq_39293290/article/details/77992244

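The previous article showed how to parse a page's HTML with BeautifulSoup. This one shows how to use the BeautifulSoup module to traverse that HTML and extract the data we want.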

BeautifulSoup traversal methods

1> Nodes and tag names
You can traverse the tree through child nodes, parent nodes, and tag names:

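# soup is the BeautifulSoup object built in the previous article
print(soup.title)  # the <title> tag
print(soup.a)      # the first <a> tag

# Loop over the direct children of the first <li> tag
li_tag = soup.li
for child in li_tag.children:
    print(child)

li_tag.parent  # parent node of the <li> tag (soup.parent itself is None)

# All ancestors of the first <a> tag, up to the document root
link = soup.a
for parent in link.parents:
    print(parent.name)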

Output

<title>网易云音乐</title>
<a hidefocus="true" href="/#">网易云音乐</a>


<span><a data-module="discover" hidefocus="true" href="/#"><em>发现音乐</em><sub class="cor"> </sub></a></span>


h1
div
div
div
body
html
[document]
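
To make the traversal above reproducible without fetching the NetEase page, here is a minimal sketch on a small hand-written document. The HTML string is hypothetical and only loosely modelled on the navigation bar in the output; `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modelled on a navigation bar (not the real page)
html = (
    '<html><body><div id="nav"><ul><li><span>'
    '<a href="/#"><em>Discover</em></a>'
    '</span></li></ul></div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

li_tag = soup.li

# .children yields only the direct children of a tag
child_names = [child.name for child in li_tag.children]

# .parent is the immediate parent; .parents walks up to the document root
parent_name = li_tag.parent.name
ancestor_names = [parent.name for parent in soup.a.parents]

print(child_names)     # ['span']
print(parent_name)     # ul
print(ancestor_names)  # ['span', 'li', 'ul', 'div', 'body', 'html', '[document]']
print(soup.parent)     # None: the BeautifulSoup object has no parent
```

The last ancestor is the BeautifulSoup object itself, whose name is `[document]`, which is why that entry closes the list in the output above.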

Sibling nodes are accessed as follows:

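from bs4 import BeautifulSoup

# Sibling nodes, using the small document from the official docs
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
sibling_soup.b.next_sibling      # the sibling after <b>: <c>text2</c>
sibling_soup.c.previous_sibling  # the sibling before <c>: <b>text1</b>

# All following siblings of the first <a> tag
for sibling in soup.a.next_siblings:
    print(repr(sibling))

# All preceding siblings of the tag with id="link3" (official docs example)
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))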
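
One detail worth seeing in action: siblings include text nodes, so the "next sibling" of a tag is often a string of punctuation or whitespace rather than the next tag. The sketch below uses a shortened, hand-written variant of the official "three sisters" markup (an assumption, not the NetEase page) and filters the siblings down to tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened variant of the official docs example (hypothetical markup)
html = """<p><a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a

# The immediate next sibling is the text between the two tags
raw_next = first.next_sibling

# Keep only Tag objects when iterating over siblings
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
earlier_ids = [
    s["id"]
    for s in soup.find(id="link3").previous_siblings
    if isinstance(s, Tag)
]

print(repr(raw_next))  # ',\n'
print(later_ids)       # ['link2', 'link3']
print(earlier_ids)     # ['link2', 'link1']
```

Note that `previous_siblings` walks backwards, so the nearest sibling comes first.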

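Searching the document tree

The most commonly used methods are find() and find_all(), but there are others: find_parent() and find_parents(), find_next_sibling() and find_next_siblings(), find_all_next() and find_next(), find_all_previous() and find_previous(), and so on.
We will only look at the common ones here; for the rest, consult the official documentation when you need them.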
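
As a quick taste of the less common search methods, here is a minimal sketch. The HTML and the `/song?id=...` links are hypothetical, made up for illustration; the methods themselves are the ones named above.

```python
from bs4 import BeautifulSoup

# Hypothetical song list; the hrefs are invented for this example
html = (
    '<div class="list"><ul>'
    '<li><a href="/song?id=1">First</a></li>'
    '<li><a href="/song?id=2">Second</a></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# find_parent(): nearest ancestor that matches the filter
enclosing_div = first_link.find_parent("div")

# find_next_sibling(): next matching sibling of the enclosing <li>
next_li = first_link.find_parent("li").find_next_sibling("li")

# find_next(): next matching element in document order
next_link = first_link.find_next("a")

# find_previous(): previous matching element in document order
previous_link = next_link.find_previous("a")

print(enclosing_div["class"])   # ['list']
print(next_li.a.get_text())     # Second
print(next_link["href"])        # /song?id=2
print(previous_link["href"])    # /song?id=1
```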

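BeautifulSoup official documentation
find_all()
Searches all descendant tags of the current tag and returns those that match the filter. The return type is bs4.element.ResultSet, a subclass of list.
Full syntax:

find_all( name , attrs , recursive , string , limit , **kwargs )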

The NetEase Cloud Music HTML is too complex, so the examples here come from the official documentation:

soup.find_all("title")
# [<title>The Dormouse's story</title>]
#
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
# 
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
#
import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

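name parameter: finds all tags whose name is name.
attrs parameter: filters on the attributes of a tag.
string parameter: searches the string content of the document.
recursive parameter: when you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.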
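
The parameters above can be tried out with the following sketch. It rebuilds a shortened version of the official "three sisters" document so that it runs on its own; `html.parser` is an assumption, and `limit` is a real find_all() parameter that the descriptions above do not cover.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the example document from the official docs
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("a")                             # name
by_attrs = soup.find_all("a", attrs={"id": "link2"})     # attrs
by_class = soup.find_all("p", class_="title")            # keyword filter
by_string = soup.find_all(string=re.compile("sisters"))  # string
first_two = soup.find_all("a", limit=2)                  # stop after 2 hits

# <title> sits inside <head>, so it is not a direct child of <html>
direct_titles = soup.html.find_all("title", recursive=False)
nested_titles = soup.html.find_all("title")

print(len(by_name))            # 3
print(by_attrs[0].get_text())  # Lacie
print(len(first_two))          # 2
print(direct_titles)           # []
print(len(nested_titles))      # 1
```

Because `class` is a reserved word in Python, the keyword form is spelled `class_`.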

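find()
Works like find_all(), but returns only the first match. For a tag search the return type is bs4.element.Tag; if nothing matches, it returns None.
Full syntax: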

find( name , attrs , recursive , string , **kwargs )

For example:

soup.find('title')
# <title>The Dormouse's story</title>
#
soup.find("head").find("title")
# <title>The Dormouse's story</title>
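
A practical difference between the two methods is what happens when nothing matches: find() gives back None, while find_all() gives back an empty result. The sketch below, on a tiny hypothetical document, shows one way to guard against the None case before calling a method on the result.

```python
from bs4 import BeautifulSoup

# Tiny hypothetical document with a <title> but no <h1>
soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body></body></html>",
    "html.parser",
)

title_tag = soup.find("title")  # a Tag
missing = soup.find("h1")       # None, because there is no <h1>
empty = soup.find_all("h1")     # empty ResultSet

# Guard against None before using the result
heading_text = missing.get_text() if missing is not None else "(no heading)"

print(title_tag.get_text())  # Demo
print(missing)               # None
print(len(empty))            # 0
print(heading_text)          # (no heading)
```

Skipping the check and calling `missing.get_text()` directly would raise an AttributeError, a common failure when a scraped page changes its layout.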

That covers the basic usage of the BeautifulSoup library.
