[Web Scraping Study Notes] BeautifulSoup Usage Analysis (Part 2)

城市里的元

Category: Python. Tags: BeautifulSoup, python, web scraping

Article link: https://blog.csdn.net/sc_lilei/article/details/78558532

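This article covers the main functions of BeautifulSoup. For an introduction to what BeautifulSoup is, see Part 1 (listed below) or look it up yourself; I won't repeat it here. To keep things short, the rest of the article writes BS for BeautifulSoup.

The notes are split into two articles:

BeautifulSoup Usage Analysis (Part 1)

BeautifulSoup Usage Analysis (Part 2) - this article

Part 1 covered the most basic BS functions, which are also rarely used in practice. This part moves on to the commonly used ones.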

1. Import the modules

import re  # needed for the regular-expression examples below
from bs4 import BeautifulSoup

2. Create an HTML string to simulate a web page

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

3. Create the BS object

hello = BeautifulSoup(html, 'html.parser')

hello is a variable name of my own choosing. It holds a BS object, and the next steps extract data by calling that object's methods.

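4. Searching the DOM tree

DOM stands for Document Object Model. Put simply, it is the tree structure of the HTML, and we search that tree for the content we want.

(1) find_all(name, attrs, recursive, string, limit, **kwargs)

This is the same function as the older findAll() spelling. Put simply, it returns the tags below the current tag that satisfy the conditions in the parentheses.

There are quite a few filter parameters. Start with name, which is flexible: it can be a string, a list, a regular expression, True, or a function. First up:

A. The name parameter

name as a string, for example: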

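print(hello.find_all('title'))

Output:

[<title>The Dormouse's story</title>]

This should be easy to follow: it outputs every tag whose tag name equals name. Another example:

for a in hello.find_all('a'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Note: when several tags match, the result is a list, so a for loop shows the result more clearly here. (The order of the attributes in printed tags can vary with the Python and bs4 version.)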

Next:

name as a regular expression, for example:

for a in hello.find_all(re.compile('^b')):
    print(a)

Output:

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<b>The Dormouse's story</b>

One more example:

for a in hello.find_all(re.compile('.*The')):
    print(a)

The output is empty, because no tag name contains "The". In other words, the name parameter is matched against tag names only, never against the text inside a tag.
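The difference between matching tag names and matching text can be checked with a small sketch. It assumes a cut-down version of the sample HTML, and the variable names (soup, names_starting_with_b and so on) are mine rather than the article's.

```python
import re
from bs4 import BeautifulSoup

# A cut-down stand-in for the article's sample document (an assumption).
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class=\"title\"><b>The Dormouse's story</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# A regex passed as name is searched against tag names.
names_starting_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# No tag name contains "The", so this finds nothing.
tags_named_like_the = soup.find_all(re.compile("The"))

# To search the text instead, pass the regex as string=.
strings_containing_the = soup.find_all(string=re.compile("The"))

print(names_starting_with_b)
print(len(tags_named_like_the), len(strings_containing_the))
```

The same pattern gives different results depending on which parameter receives it, which is usually the cause when a regex search unexpectedly comes back empty.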

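OK, moving on. Next:

name as a list, for example:

for a in hello.find_all(['a', 'b']):
    print(a)

Output:

<b>The Dormouse's story</b>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Note: with a list as name, the function returns every tag whose name matches any item in the list. The result is again a list.

name as True, for example:

for a in hello.find_all(True):
    print(a)

This prints the whole HTML content, because True matches every tag: the first match is the html tag, which already contains the entire document, and each nested tag is then printed again in turn. Note that plain text strings are not tags, so they are not returned as results of their own.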
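Printing tag names instead of whole tags makes the list and True filters easier to compare. This is a minimal sketch on a one-line document of my own, not the article's sample.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title"><b>Bold</b> and <a href="#">link</a></p>', "html.parser"
)

# A list matches any tag whose name is one of the items, in document order.
a_or_b = [tag.name for tag in soup.find_all(["a", "b"])]

# True matches every tag; the text nodes "Bold", " and ", "link" are skipped.
all_tags = [tag.name for tag in soup.find_all(True)]

print(a_or_b)    # ['b', 'a']
print(all_tags)  # ['p', 'b', 'a']
```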

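name as a function, for example:

def has_class_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(hello.find_all(has_class_no_id))

Output:

<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>

Note: the real output is a single list and has no leading spaces; I laid it out one tag at a time and added the indentation to make the nesting easier to see. From the code you can tell that the function returns tags that have a class attribute but no id attribute, yet the output seems to contain a tags that do have an id, right? What the function actually matched is the parent p tag itself, and the attributes of its child tags play no part in that. But because they are its children, printing p shows p together with all of its child and descendant tags.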
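To see that only the p tag is matched while the a tag merely comes along inside it, the result can be inspected by name. A minimal sketch on a shortened document of my own:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story">Names: <a class="sister" id="link1">Elsie</a></p>',
    "html.parser",
)

def has_class_no_id(tag):
    # Called once per tag; return True to keep the tag.
    return tag.has_attr("class") and not tag.has_attr("id")

matches = soup.find_all(has_class_no_id)
matched_names = [tag.name for tag in matches]

# The a tag has an id, so it is not a match itself,
# but it is still reachable inside the matched p.
nested_id = matches[0].find("a")["id"]

print(matched_names, nested_id)  # ['p'] link1
```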

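B. Keyword arguments

Put simply, a keyword argument finds tags by attribute value. Which attributes? class, href, id: these are all tag attributes.

OK, examples:

Search by the class attribute:

for a in hello.find_all(class_='sister'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Note: I wrote class_ here. This is a special spelling: class is a reserved keyword in Python, so it cannot be used as an argument name. Note that the result is a list.

Search by the id attribute:

for a in hello.find_all(id='link1'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

Search by the href attribute:

for a in hello.find_all(href='http://example.com/elsie'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

Searching special attributes:

HTML5 introduces data-xxx attributes, such as data-status='1' or data-online='1234' (other attributes can hit the same limitation). They cannot be searched with a keyword argument, because a name containing a hyphen is not a valid Python argument name. They can be searched indirectly, though, with the attrs parameter. I now modify one line of the HTML template (and rebuild hello from it):

<p class="title" data-type="aaa"><b>The Dormouse's story</b></p>

I added a data-type attribute to the p tag, and now look the p tag up by that attribute:

for a in hello.find_all(attrs={'data-type': 'aaa'}):
    print(a)

Output:

<p class="title" data-type="aaa"><b>The Dormouse's story</b></p>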
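The keyword and attrs forms can be put side by side. This is a sketch under the assumption of a two-tag document; the variable names are mine.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title" data-type="aaa"><b>The Dormouse\'s story</b></p>'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Ordinary attributes work as keyword arguments (class needs the underscore).
by_class = soup.find_all(class_="sister")
by_id = soup.find_all(id="link1")

# find_all(data-type="aaa") would be a SyntaxError, so use attrs instead.
by_data = soup.find_all(attrs={"data-type": "aaa"})

# attrs also accepts ordinary attributes, several at once if needed.
by_attrs = soup.find_all(attrs={"class": "sister", "id": "link1"})

print(by_data[0].name, by_attrs[0].name)  # p a
```

Using attrs for everything is a reasonable habit when attribute names come from data rather than from source code, since a dictionary key can be any string.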

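C. The string parameter (text)

We have now searched for tags by tag name and by attribute, which leaves little else inside a tag to search on apart from its string.

OK, example:

for a in hello.find_all(string=['The', ' Elsie ']):
    print(a)
print('---------')
for a in hello.find_all(string=re.compile('Lacie')):
    print(a)

Output:

Elsie
---------
Lacie

Note: string (called text in older versions of bs4) is a flexible parameter like name: it also accepts a string, a list, True, or a regular expression. The first loop uses a list. With a list, find_all matches strictly: each item in the list is compared against the whole string inside a tag and must match it exactly, whitespace included. That is why 'The' matches nothing while ' Elsie ' matches the text inside link1. The second loop uses a regular expression, which gives a fuzzy match that depends on how the pattern is written. The results here are strings (NavigableString objects) rather than tags; they are the leaves of the tree, and each one still has a parent tag.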
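The strict versus fuzzy behaviour is easy to verify on a document where the whitespace is visible. This sketch uses ordinary text with surrounding spaces instead of the article's sample, and the variable names are assumptions of mine.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1"> Elsie </a><a id="link2">Lacie</a></p>', "html.parser"
)

# Exact matching: "Elsie" without the surrounding spaces does not match.
no_spaces = [str(s) for s in soup.find_all(string="Elsie")]
with_spaces = [str(s) for s in soup.find_all(string=["The", " Elsie "])]

# A regex is searched inside each string, so partial matches succeed.
by_regex = [str(s) for s in soup.find_all(string=re.compile("Elsie"))]

# Results are strings, but each one still knows its parent tag.
parent_id = soup.find(string="Lacie").parent["id"]

print(no_spaces, with_spaces, by_regex, parent_id)
```

When the exact whitespace of the page is not known in advance, a regex (or a function that strips before comparing) tends to be the safer choice.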

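D. The limit parameter (maximum number of results)

for a in hello.find_all('a', limit=2):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

This one is easy to follow: at most limit results are returned, so only the first two a tags appear.

E. The recursive parameter

for a in hello.head.find_all('title', recursive=False):
    print(a)
print('----------')
for a in hello.find_all('title', recursive=False):
    print(a)

Output:

<title>The Dormouse's story</title>
----------

Note: the default is recursive=True, which searches all descendants of the current tag. With recursive=False the search is restricted to direct children. In the first loop we look for a tag named 'title' among the direct children of head, and it is found because title is a direct child of head. In the second loop we look for 'title' among the direct children of the whole document, and nothing is found: title is a deeper descendant of the document, whose only direct child tag is html. (If this is unclear, debug it and try a few more examples.)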
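Listing the direct children of the document root makes the recursive=False result less surprising. A minimal sketch, assuming a small document of my own with three p tags so that limit can be shown as well:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>a</p><p>b</p><p>c</p></body></html>",
    "html.parser",
)

# title is a direct child of head, so it is found without recursion.
in_head = soup.head.find_all("title", recursive=False)

# Seen from the document root, title is two levels down: not a direct child.
at_root = soup.find_all("title", recursive=False)

# html is the root's only direct child tag.
root_children = [tag.name for tag in soup.find_all(True, recursive=False)]

# limit stops the search after the given number of matches.
first_two = [p.get_text() for p in soup.find_all("p", limit=2)]

print(len(in_head), len(at_root), root_children, first_two)
```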

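(2) find() is used the same way as find_all(); the only difference is that it returns a single tag rather than a list

print(hello.find('p'))

Output:

<p class="title"><b>The Dormouse's story</b></p>

Note: this function does not collect every match in the DOM tree; it returns only the first match, or None when nothing matches. When would you use it? In a DOM tree there is only one head tag and one body tag, so find_all is a poor fit for them and find is more appropriate.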
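Because find() can return None, code that chains calls on its result usually wants a guard. The sketch below assumes a two-paragraph document of my own; table is simply a tag that happens not to exist in it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>first</p><p>second</p></body>", "html.parser")

first_p = soup.find("p")                      # a single Tag
same_as_limit = soup.find_all("p", limit=1)   # a one-item list
missing = soup.find("table")                  # None, not an empty list

# Guard against None before touching attributes or text.
table_text = missing.get_text() if missing is not None else ""

print(first_p.get_text(), len(same_as_limit), missing)
```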

(3) find_parents() and find_parent()

print(hello.find('a'), '\n----')
a_tag = hello.find('a')
print(a_tag.find_parent('p'))

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
----
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

The first part of the code gets the first a tag; the second part gets that a tag's parent p tag. Here is another example, this time for find_parents():

print(hello.find(string='Lacie'), '\n----')
a_tag = hello.find(string='Lacie')
for a in a_tag.find_parents('a'):
    print(a)

Output:

Lacie
----
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

Note: find_parent() searches upward from the current tag and returns the nearest ancestor that satisfies the condition, which is often the direct parent. With the s, find_parents() walks up the DOM tree through all ancestors of the current tag and returns every match as a list.
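That find_parent() is not limited to the direct parent shows up once the starting point is nested more deeply. A sketch on a document of my own, where the text sits inside b inside a inside p:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p class="story"><a id="link2"><b>Lacie</b></a></p></body>',
    "html.parser",
)
text = soup.find(string="Lacie")

direct_parent = text.parent.name     # the b tag
nearest_p = text.find_parent("p")    # skips b and a, stops at p

# find_parents returns all matching ancestors, nearest first.
ancestors = [t.name for t in text.find_parents(["a", "p"])]

print(direct_parent, nearest_p["class"], ancestors)
```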

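(4) find_next_siblings() and find_next_sibling()

I won't walk through this one in detail. find_next_siblings() searches all siblings that come after the current tag and returns the matching tags as a list; find_next_sibling() searches the same siblings and returns only the first matching tag.

(5) find_previous_siblings() and find_previous_sibling()

find_previous_siblings() searches all siblings that come before the current tag and returns the matching tags as a list; find_previous_sibling() searches the same siblings and returns only the first matching tag, that is, the closest one before the current tag.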
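The four sibling methods can be tried on three links that share one parent. This is a minimal sketch with a document and variable names of my own.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>',
    "html.parser",
)
link2 = soup.find(id="link2")

next_one = link2.find_next_sibling("a")["id"]
previous_one = link2.find_previous_sibling("a")["id"]

# The plural forms return lists, ordered by distance from the starting tag.
after_link1 = [a["id"] for a in soup.find(id="link1").find_next_siblings("a")]
before_link3 = [a["id"] for a in soup.find(id="link3").find_previous_siblings("a")]

print(next_one, previous_one, after_link1, before_link3)
```

Passing a tag name matters here: the text between the links (", " and " and ") also counts as a sibling, so the unfiltered next_sibling attribute would return a string rather than the next a tag.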

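(6) find_all_next() and find_next()

find_all_next() searches everything that comes after the current tag in the document (regardless of parent-child relationships) and returns the matches as a list; find_next() searches the same range and returns only the first match. (These methods can also match strings, the leaves of the tree, not only tags.)

(7) find_all_previous() and find_previous()

find_all_previous() searches everything that comes before the current tag in the document (regardless of parent-child relationships) and returns the matches as a list; find_previous() searches the same range and returns only the first match. (These methods can also match strings, the leaves of the tree, not only tags.)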
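What "regardless of parent-child relationships" means in practice is that these methods follow document order and can leave the current parent. A sketch, assuming a document of my own with two sibling paragraphs:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story"><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
    '<p class="end">...</p>',
    "html.parser",
)
link2 = soup.find(id="link2")

# No p tag follows link2 among its siblings ...
no_sibling_p = link2.find_next_sibling("p")
# ... but find_next keeps going past the end of the parent.
next_p_class = link2.find_next("p")["class"]

# Walking backwards visits link1 and then the enclosing p.
previous_ids = [a["id"] for a in link2.find_all_previous("a")]
previous_p_class = link2.find_previous("p")["class"]

print(no_sibling_p, next_p_class, previous_ids, previous_p_class)
```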

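That more or less covers the find_xxx methods of the BS object. What I want to say now is that there is an easier and more convenient way to pull specific data out of the DOM tree:

5. CSS selectors

CSS selector syntax is simple: to filter tags by class write .classvalue, by tag name write the tag name, and by id write #idvalue. The method to call is select() on the BS object.

Enough talk, on to the examples. First:

(1) Filter by tag name:

for a in hello.select('a'):
    print(a)

Output:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Note: the result is a list, so iterating with for shows it more clearly.

(2) Filter by class value:

for a in hello.select('.sister'):
    print(a)

Output: same as above

(3) Filter by id value:

for a in hello.select('#link1'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>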

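(4) Combined filters:

for a in hello.select('p #link1'):
    print(a, '\n----')
for a in hello.select('body p #link1'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
----
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

In practice, combined filters are what we use most to get at the data we want. The first part of the code selects every tag X below a p tag where X has an id equal to link1; the second part selects every such X below a p tag that is itself below body.

Note: in case the explanation above reads too formally, the short version is that a space between the parts expresses an ancestor-descendant relationship. The part on the left is the ancestor and the part on the right is the descendant, which may be a direct child or nested any number of levels deeper.

So how do we get only the direct children of a given tag?

for a in hello.select('p > a'):
    print(a)
print('\n----')
for a in hello.select('body > p #link1'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

----
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

The first part gets the a tags that are direct children of a p tag. The > sign is not optional: 'p > a' matches only direct children, while 'p a' matches a tags at any depth below p. In this document both forms return the same three tags, because every a tag sits directly inside p. The second part gets the tag with id link1 anywhere below a p tag that is a direct child of body.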
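The two combinators only give different results when a tag is nested more than one level down, which the sample page never does. The sketch below assumes a document of my own in which one link is wrapped in a span.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p><span><a id="link1">Elsie</a></span>'
    '<a id="link2">Lacie</a></p></body>',
    "html.parser",
)

# Space: any a tag below p, however deeply nested.
descendants = [a["id"] for a in soup.select("p a")]

# Greater-than sign: only a tags whose parent is p.
children = [a["id"] for a in soup.select("p > a")]

print(descendants)  # ['link1', 'link2']
print(children)     # ['link2']
```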

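To narrow the filter further and get the tags below a given tag that satisfy a condition:

for a in hello.select('p .sister'):
    print(a)
print('----')
for a in hello.select('p a[class="sister"]'):
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
----
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Both parts produce the same result; only the notation differs. The first gets the tags below p that satisfy the class condition; the second gets the a tags below p that satisfy the class condition. The square brackets carry an attribute condition, which can be on class, id, href and so on. Note that the attribute name inside the brackets is written without quotes.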
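The two notations stop being interchangeable once a tag carries more than one class. The following sketch assumes a document of my own in which the second link has the classes sister and big, and a different host name so that a prefix match on href can be shown too.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister big" href="http://example.org/lacie" id="link2">Lacie</a></p>',
    "html.parser",
)

# .sister matches any tag that has sister among its classes.
by_class = [a["id"] for a in soup.select("p .sister")]

# [class="sister"] compares the whole attribute value, so "sister big" fails.
exact_class = [a["id"] for a in soup.select('p a[class="sister"]')]

# Other attributes work the same way; ^= means "starts with".
example_com = [a["id"] for a in soup.select('a[href^="http://example.com/"]')]

print(by_class, exact_class, example_com)
```

For class conditions the dot form is usually the one to reach for; the bracket form is more useful for attributes such as href or data-xxx.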

(5) Getting values out of a tag (attribute values or text):

print(hello.select('p .sister')[0].get('href'))
print(hello.select('p .sister')[0].get('class'))
print(hello.select('p .sister')[0].get('id'))
print(hello.select('p .sister')[1].get_text())

Output:

http://example.com/elsie
['sister']
link1
Lacie
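When only the first match is needed, select_one() avoids the [0] indexing used above. A minimal sketch on a one-link document of my own; title is just an attribute the tag does not have.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></p>',
    "html.parser",
)
link = soup.select_one("p .sister")   # first match, or None

href = link.get("href")       # a plain string
classes = link.get("class")   # class is multi-valued, so a list
title = link.get("title")     # missing attribute: None instead of KeyError
text = link.get_text()

print(href, classes, title, text)
```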

So, isn't this often handier and more compact than the find methods above? If you found it useful, give it a like!
