beautifulsoup4 Tutorial (3): Traversing and Searching the Document Tree


tyson Lee


Category: Web crawlers

Article link: https://blog.csdn.net/chinaltx/article/details/86748763


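beautifulsoup4 Tutorial (1): Basics and a First Crawler

beautifulsoup4 Tutorial (2): The Four Main Objects in bs4

beautifulsoup4 Tutorial (3): Traversing and Searching the Document Tree

beautifulsoup4 Tutorial (4): CSS Selectors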

4. Traversing the Document Tree

4.1 Direct child nodes

.contents

A tag's .contents attribute returns the tag's direct children as a list, so individual children can be picked out by index.

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be built from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

print(soup.body.contents)
print(soup.body.contents[1])

result:
['\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
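
Because the whitespace between tags shows up in .contents as '\n' strings, it is often handy to keep only the Tag children. The following is a minimal Python 3 sketch of that idea, not part of the original listing; it assumes a small made-up document and the built-in "html.parser" backend instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, much smaller than the article's sample document.
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

children = soup.body.contents
# Keep only real tags; the newline strings between them are dropped.
tags_only = [child for child in children if isinstance(child, Tag)]

print(len(children))                    # 5: three '\n' strings and two <p> tags
print([tag.name for tag in tags_only])  # ['p', 'p']
```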

.children

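A Tag object's .children attribute is an iterator over the same direct children, so printing it shows only an iterator object (the exact repr depends on the Python and bs4 versions). Loop over it to get the nodes.

print(soup.head.children)
for child in soup.body.children:
    print(child)

result:
<listiterator object at 0x00000000039C3080>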
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>

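4.2 All descendant nodes

.descendants

Unlike .children and .contents, which cover only a Tag object's direct children, .descendants walks recursively through all of the tag's descendants and returns them as a generator. The loop below runs over the whole soup, which is why the output starts at <html>.

print(soup.head.descendants)
for child in soup.descendants:
    print(child)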

result:
<generator object descendants at 0x0000000003970E58>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
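
The difference between direct children and all descendants is easier to see on a tiny tree. A hedged Python 3 sketch, assuming made-up markup and the built-in "html.parser" backend rather than the article's lxml setup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: one <p> with a nested <b>, one plain <p>.
soup = BeautifulSoup("<div><p><b>x</b></p><p>y</p></div>", "html.parser")
div = soup.div

# .children stops at the first level below <div>.
direct = [child.name for child in div.children]
# .descendants goes all the way down; strings are filtered out here.
nested = [node.name for node in div.descendants if isinstance(node, Tag)]

print(direct)  # ['p', 'p']
print(nested)  # ['p', 'b', 'p']
```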

4.3 Node content

When the Tag contains no child tags, .string returns its text.

print(soup.title)
print(soup.title.string)

result:
<title>The Dormouse's story</title>
The Dormouse's story

When the Tag contains exactly one child tag, .string returns that child's .string.

print(soup.head)
print(soup.head.string)

result:
<head><title>The Dormouse's story</title></head>
The Dormouse's story

When the Tag contains several child tags

.string no longer works in this case and returns None.

print(soup.body)
print(soup.body.string)

result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
None

Use the .strings or .stripped_strings attribute instead; both return a generator.

print(soup.strings)
for string in soup.strings:
    print(string)

result:
<generator object _all_strings at 0x0000000003170E58>
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

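A Tag object's .stripped_strings attribute yields the same strings with whitespace-only entries dropped and leading and trailing whitespace removed.

print(soup.stripped_strings)
for string in soup.stripped_strings:
    print(string)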
    
result:
<generator object stripped_strings at 0x00000000030D0E58>
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
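
The three cases of .string and the use of .stripped_strings can be reproduced in a few lines. This is a small Python 3 sketch under assumptions of my own (a made-up paragraph and the built-in "html.parser" backend), not the author's code:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> that mixes text and a nested <b> tag.
soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p = soup.p

print(p.string)    # None, because <p> has more than one child
print(p.b.string)  # big, because <b> holds a single string

# Join the stripped pieces to get the paragraph's visible text.
text = " ".join(p.stripped_strings)
print(text)        # Hello big world
```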

4.4 Direct parent node

Parent of a tag

p = soup.p
print(p.parent.name)

result:
body

Parent of a string: the innermost tag that wraps it

content = soup.head.title.string
print(content)
print(content.parent.name)

result:
The Dormouse's story
title

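4.5 All parent nodes

The .parents attribute is also a generator; it yields every ancestor, from the direct parent up to the document itself.

content = soup.head.title.string
print(content)
for parent in content.parents:
    print(parent.name)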
    
result:
The Dormouse's story
title
head
html
[document]
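
Since .parents is a generator, it can be consumed directly, for example to build the path from a node up to the document root. A minimal Python 3 sketch, assuming made-up markup and the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a <b> nested three levels deep.
html = "<html><body><p><b>x</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor of <b>, nearest first.
ancestor_names = [parent.name for parent in soup.b.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry is the BeautifulSoup object itself, whose name is '[document]'.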

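4.6 Sibling nodes

.next_sibling and .previous_sibling return the next and the previous sibling node.

These two attributes often return whitespace or a line break, because beautifulsoup treats the whitespace and line breaks between tags as nodes of their own.

print(soup.p.next_sibling)
print(soup.a.previous_sibling)
print(soup.p.next_sibling.next_sibling)

result: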


Once upon a time there were three little sisters; and their names were

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

4.7 All sibling nodes

.next_siblings and .previous_siblings iterate over all siblings after or before the current node.

for sibling in soup.a.next_siblings:
    print(sibling)

result:
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
and they lived at the bottom of a well.
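
When only tag siblings are of interest, the whitespace nodes can be skipped with find_next_sibling, a bs4 method this article does not cover. A hedged Python 3 sketch on made-up markup, using the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical list with line breaks between the items.
soup = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
first = soup.li

# The immediate sibling is the line break, not the second <li>.
print(repr(first.next_sibling))  # '\n'

# find_next_sibling skips ahead to the next sibling that is a matching tag.
second = first.find_next_sibling("li")
print(second.string)             # b
```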

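4.8 Next and previous elements

.next_element and .previous_element return the node parsed immediately after or before the current one, regardless of nesting level (only nodes on the same level are siblings).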

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
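
The two lines above are consistent with printing a tag and then its .next_element, although the listing that produced them is not shown. The Python 3 sketch below is my own illustration of the difference between .next_element and .next_sibling; the markup and the "html.parser" backend are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

head = soup.head
print(head.next_element)  # <title>T</title>: the first node parsed after <head>
print(head.next_sibling)  # <body><p>x</p></body>: the next node on the same level

# From the last string inside <head>, the next parsed node is <body>.
after_title_text = soup.title.string.next_element
print(after_title_text.name)  # body
```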

4.9 All next and previous elements

.next_elements and .previous_elements iterate forward or backward through the document in parse order.

soup = BeautifulSoup(html,features="lxml")

for element in soup.a.next_elements:
    print(repr(element))
    
result:
' Elsie '
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'Lacie'
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
'\n'
'\n'

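5. Searching the Document Tree

5.1 find_all

Usage: find_all(name,attrs,recursive,text,**kwargs)
Search scope: all tag descendants of the current tag.
Purpose: checks each of those tags against the filter and returns the ones that match.
name parameter: finds every tag whose name is name; text strings are ignored.

Passing a string

print(soup.find_all('a'))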

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing a regular expression

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag)
    
result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<b>The Dormouse's story</b>

Passing a list

print(soup.find_all(["a","b"]))

result:
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing True: matches every Tag

for tag in soup.find_all(True):
    print(tag.name)
    
result:
html
head
title
body
p
b
p
a
a
a
p

Passing a function: build your own filter. The function takes a tag object as its only argument and returns True or False.

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))

result:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>, <p class="story">...</p>]
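
A function can also be used as the filter for a single attribute, in which case it receives the attribute value rather than the tag. The Python 3 sketch below illustrates this on made-up links; the helper name is_https and the "html.parser" backend are my own choices, not part of the article.

```python
from bs4 import BeautifulSoup

# Hypothetical links: one https, one http, one without href.
html = (
    '<a href="https://example.com/a">A</a>'
    '<a href="http://example.com/b">B</a>'
    '<a>C</a>'
)
soup = BeautifulSoup(html, "html.parser")

def is_https(href):
    # Guard against None in case the tag has no href attribute.
    return href is not None and href.startswith("https://")

secure_links = soup.find_all("a", href=is_https)
print([link.get_text() for link in secure_links])  # ['A']
```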

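Keyword arguments

The name parameter searches by tag name, such as a, head, or title.
To search by the value of an attribute instead, pass it as a keyword argument, for example soup.find_all(id='link2').

import re
print(soup.find_all(id='link2'))
print(soup.find_all(href=re.compile("elsie")))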

result:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

If the attribute name is a Python reserved word, append an underscore to it, for example class_="sister".

print(soup.find_all(class_="sister"))

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

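HTML5 data-* attributes cannot be passed as keyword arguments, because a hyphen is not allowed in a Python argument name. Put them in a dictionary and pass it as the attrs parameter instead: soup.find_all(attrs={"data-foo":"value"})

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>',features="lxml")
print(data_soup.find_all(attrs={"data-foo":"value"}))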

result:
[<div data-foo="value">foo!</div>]
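
An attribute such as class can hold several values, and a search on class_ (or on attrs={"class": ...}) matches any one of them. A hedged Python 3 sketch on made-up markup, using the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> carries two CSS classes.
html = '<p class="body strikeout">x</p><p class="body">y</p>'
soup = BeautifulSoup(html, "html.parser")

by_body = soup.find_all("p", class_="body")
by_strikeout = soup.find_all("p", class_="strikeout")
by_attrs = soup.find_all(attrs={"class": "body"})

print(len(by_body))       # 2: both paragraphs have the class "body"
print(len(by_strikeout))  # 1: only the first paragraph
print(len(by_attrs))      # 2: same search written with attrs
```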

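text parameter

Works like the name parameter, but searches the strings in the document instead of the tags. A plain string must match a whole string exactly; regular expressions, lists, and True are accepted as well. In the output below, text="Elsie" finds nothing because the only occurrence is the comment text " Elsie ", with surrounding spaces. Since Beautiful Soup 4.4.0 the same parameter is also available under the name string.

import re
print(soup.a)
print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(text="story"))
print(soup.find_all(text="The Dormouse's story"))
print(soup.find_all(text=re.compile("story")))

result:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
[]
['Lacie', 'Tillie']
[]
["The Dormouse's story", "The Dormouse's story"]
["The Dormouse's story", "The Dormouse's story"]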
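
The exact-match rule can be checked on a small document that contains a comment. This Python 3 sketch is an illustration under my own assumptions: made-up markup, the built-in "html.parser" backend, and the newer string spelling of the parameter (Beautiful Soup 4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: one link holds only a comment, the other plain text.
soup = BeautifulSoup("<a><!-- Elsie --></a><a>Lacie</a>", "html.parser")

# Exact match: the comment text is " Elsie " with spaces, so nothing matches.
exact = soup.find_all(string="Elsie")
# A regular expression matches part of a string, so the comment is found.
by_regex = soup.find_all(string=re.compile("Elsie"))

print(len(exact))
print([repr(s) for s in by_regex])
print(isinstance(by_regex[0], Comment))
```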

limit parameter
The limit parameter caps the number of results returned by a search on name or attrs.

print(soup.find_all("a"))
print("===============")
print(soup.find_all("a",limit=2))

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
===============
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

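recursive parameter
By default, find_all searches all descendants of the current tag. With recursive=False it searches only the tag's direct children.

print(soup.body)
print("===============================")
print(soup.body.find_all("a",recursive=False))
print(soup.body.find_all("a"))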

result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
===============================
[]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In this example every a tag sits inside a p tag, so a search restricted to the direct children of body finds no a tags.
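
The same effect shows up on a smaller tree where one p is a direct child and another is nested deeper. A minimal Python 3 sketch, assuming made-up markup and the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly under <div>, one inside <section>.
html = "<div><p>a</p><section><p>b</p></section></div>"
soup = BeautifulSoup(html, "html.parser")

all_p = soup.div.find_all("p")                      # searches every level
direct_p = soup.div.find_all("p", recursive=False)  # direct children only

print([p.string for p in all_p])     # ['a', 'b']
print([p.string for p in direct_p])  # ['a']
```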

5.2 find

Usage: find(name,attrs,recursive,text,**kwargs)
Difference from find_all: find_all collects every match into a list, while find returns only the first match.
Otherwise the two are used in the same way.

print(soup.body.find_all("a"))
print(soup.body.find("a"))

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
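
One practical difference is what the two methods return when nothing matches: find gives None, while find_all gives an empty list. The Python 3 sketch below shows this on made-up markup with the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single link and no table.
soup = BeautifulSoup('<p><a href="/x">x</a></p>', "html.parser")

missing = soup.find("table")
missing_all = soup.find_all("table")
print(missing)      # None
print(missing_all)  # []

# find("a") returns the same node as the first item of find_all("a", limit=1).
first = soup.find("a")
same = soup.find_all("a", limit=1)[0]
print(first is same)  # True
```

Checking the result of find for None before using it avoids an AttributeError on pages where the tag is absent.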

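5.3 find_parents and find_parent

find_all() and find() search all descendants of the current node (with the default recursive setting).
find_parents and find_parent search the ancestors of the current node instead.
In every other respect the two methods behave like find_all and find described above.

print(soup.body.find_all("a"))
print(soup.body.find("a"))
print(soup.body.find_parents("a"))
print(soup.body.find_parents("html"))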

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
[]
[<html><head><title>The Dormouse's story</title></head>\n<body>\n<p class="title" name="dromouse"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body>\n</html>]
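
Starting from a deeply nested node makes the upward search easier to follow. A hedged Python 3 sketch, assuming made-up nested div elements and the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link wrapped in two nested <div> tags.
html = '<div id="outer"><div id="inner"><a>x</a></div></div>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

nearest = link.find_parent("div")     # closest matching ancestor
all_divs = link.find_parents("div")   # every matching ancestor, nearest first

print(nearest["id"])                  # inner
print([div["id"] for div in all_divs])  # ['inner', 'outer']
```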