BeautifulSoup4 Tutorial (3): Traversing and Searching the Document Tree


4. Traversing the Document Tree

4.1 Direct children
  1. .contents
  • A tag's .contents attribute returns the tag's direct children as a list, so individual children can be retrieved by index.
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html,features="lxml")

print soup.body.contents
print soup.body.contents[1]

result:
[u'\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, u'\n', <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>, u'\n', <p class="story">...</p>, u'\n']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
  2. .children
  • A Tag object's .children attribute is an iterator over the tag's direct children.
print soup.head.children
for child in soup.body.children:
    print child

result:
<listiterator object at 0x00000000039C3080>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
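
The difference between the two attributes can be sketched on a smaller document. The three-paragraph fragment below is a hypothetical example, not the tutorial's page, and it is parsed with Python's built-in html.parser so that lxml is not required. The snippet uses print() calls so that it also runs on Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, parsed with the html.parser backend that ships with Python.
soup = BeautifulSoup("<div><p>one</p><p>two</p><p>three</p></div>", "html.parser")
div = soup.div

# .contents is a plain list, so len() and indexing work directly.
second = div.contents[1]
print(len(div.contents))   # 3
print(second)              # <p>two</p>

# .children is an iterator: loop over it, or wrap it in list() to index it.
names = [child.name for child in div.children]
print(names)               # ['p', 'p', 'p']
```

Because the fragment contains no line breaks between tags, no whitespace nodes appear in either result.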



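4.2 All descendants
  1. The .descendants attribute
  • Unlike .contents and .children, which cover only a Tag's direct children, .descendants recursively walks every descendant of the Tag and yields them through a generator.
print soup.head.descendants
for child in soup.descendants:
    print child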

result:
<generator object descendants at 0x0000000003970E58>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
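
A short way to see the difference in depth is to count both on a small fragment. This is only a sketch: the HTML string is made up for illustration, and html.parser is used so the snippet has no dependency on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <p> that holds a string and a nested <b>.
soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = soup.div

direct = list(div.children)          # only the <p> tag
everything = list(div.descendants)   # <p>, "Hello ", <b>, "world"

print(len(direct))       # 1
print(len(everything))   # 4
for node in everything:
    print(repr(node))
```

Note that strings count as descendants too, which is why the nested text shows up as separate entries.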



4.3 Node content
  1. A Tag that contains no child tags
print soup.title
print soup.title.string

result:
<title>The Dormouse's story</title>
The Dormouse's story
  2. A Tag that contains exactly one child tag
print soup.head
print soup.head.string

result:
<head><title>The Dormouse's story</title></head>
The Dormouse's story
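  3. A Tag that contains several child tags
  • .string no longer works here: it cannot tell which child's text is wanted, so it returns None.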
print soup.body
print soup.body.string

result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
None
  • Use the .strings or .stripped_strings attribute instead; both return a generator.
print soup.strings
for string in soup.strings:
    print string

result:
<generator object _all_strings at 0x0000000003170E58>
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...



A Tag's .stripped_strings attribute yields the same strings with surrounding whitespace removed and whitespace-only strings skipped.

print soup.stripped_strings
for string in soup.stripped_strings:
    print string
    
result:
<generator object stripped_strings at 0x00000000030D0E58>
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
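
The three attributes can be compared side by side on one small tag. The fragment below is a hypothetical example, parsed with html.parser rather than lxml; treat it as a sketch of the behaviour described above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with line breaks between the tags.
soup = BeautifulSoup("<div>\n<p> first </p>\n<p>second</p>\n</div>", "html.parser")

single = soup.p.string      # <p> has exactly one child string
multiple = soup.div.string  # <div> has several children, so .string gives None
raw = list(soup.div.strings)
clean = list(soup.div.stripped_strings)

print(repr(single))   # ' first '
print(multiple)       # None
print(raw)            # ['\n', ' first ', '\n', 'second', '\n']
print(clean)          # ['first', 'second']
```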

4.4 Direct parent
  • The parent of a tag
p = soup.p
print p.parent.name

result:
body
  • The parent of a string: the innermost tag that wraps it
content = soup.head.title.string
print content
print content.parent.name

result:
The Dormouse's story
title
4.5 All parents

The .parents attribute also returns a generator; it walks upward through every ancestor of the node.

content = soup.head.title.string
print content
for parent in content.parents:
    print parent.name
    
result:
The Dormouse's story
title
head
html
[document]
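
One possible use of .parents is to build the path from the document root down to a node. The nested fragment here is invented for illustration and parsed with html.parser; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

# Hypothetical deeply nested fragment.
soup = BeautifulSoup(
    "<html><body><div><p><b>deep</b></p></div></body></html>", "html.parser"
)

text = soup.b.string
# .parents yields the nearest ancestor first and the document object last.
path = [parent.name for parent in text.parents]
breadcrumb = " > ".join(reversed(path))

print(path)         # ['b', 'p', 'div', 'body', 'html', '[document]']
print(breadcrumb)   # [document] > html > body > div > p > b
```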
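4.6 Siblings

The .next_sibling and .previous_sibling attributes return the next and the previous sibling node, respectively.

  • In practice these two attributes often return whitespace or a newline, because BeautifulSoup treats the whitespace and line breaks between tags as nodes too.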
print soup.p.next_sibling
print soup.a.previous_sibling
print soup.p.next_sibling.next_sibling
result:


Once upon a time there were three little sisters; and their names were

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
4.7 All siblings

The .next_siblings and .previous_siblings attributes iterate over all the siblings that follow or precede the current node.

for sibling in soup.a.next_siblings:
    print sibling

result:
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
and they lived at the bottom of a well.
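
When only the sibling tags are wanted, the whitespace strings can be filtered out by type. The following is a minimal sketch under the assumption that skipping every non-Tag node is acceptable; the list markup is a hypothetical example and html.parser is used as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical list with a newline between each pair of <li> tags.
soup = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n<li>c</li>\n</ul>", "html.parser")
first = soup.li

# The immediate sibling is the newline string, not the next <li>.
print(repr(first.next_sibling))   # '\n'

# Keep only real tags while iterating over the following siblings.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print(tag_siblings)               # [<li>b</li>, <li>c</li>]
```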
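4.8 Next and previous elements

The .next_element and .previous_element attributes return the node parsed immediately after or before the current one, regardless of nesting level (only nodes on the same level count as siblings).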

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
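
The contrast with sibling navigation may be easier to see on a two-paragraph fragment. This sketch does not reproduce the code behind the output above; the HTML is a hypothetical example parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two <p> tags on the same level.
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first = soup.p

# Sibling navigation stays on the same level.
print(first.next_sibling)                 # <p>two</p>

# Element navigation follows parse order, so it first steps into the tag.
print(repr(first.next_element))           # 'one'
print(first.next_element.next_element)    # <p>two</p>

# The element parsed just before the first <p> is its enclosing <div>.
print(first.previous_element.name)        # div
```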
4.9 All next and previous elements

The .next_elements and .previous_elements attributes iterate forward or backward through the document in parse order.

soup = BeautifulSoup(html,features="lxml")

for element in soup.a.next_elements:
    print(repr(element))
    
result:
u' Elsie '
u',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
u'Lacie'
u' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'\n'
<p class="story">...</p>
u'...'
u'\n'
u'\n'
u'\n'

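5. Searching the Document Tree

5.1 find_all
  1. Usage: find_all(name, attrs, recursive, text, limit, **kwargs)
  2. Scope: all tag descendants of the current tag.
  3. Purpose: tests each of those descendants against the filter and returns the ones that match.
  4. The name parameter: finds every tag whose name matches name; text strings are skipped automatically.
  • Passing a string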
print soup.find_all('a')

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  • Passing a regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print tag
    
result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<b>The Dormouse's story</b>
  • Passing a list
print soup.find_all(["a","b"])

result:
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  • Passing True: matches every Tag
for tag in soup.find_all(True):
    print tag.name
    
result:
html
head
title
body
p
b
p
a
a
a
p
  • Passing a function: write your own filter. The function takes a tag object as its argument and returns True or False.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print soup.find_all(has_class_but_no_id)

result:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>, <p class="story">...</p>]

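  5. Keyword arguments
  • The name parameter searches by tag name, such as a, head, or title.
  • To search by the value of an attribute instead, pass it as a keyword argument, for example soup.find_all(id='link2').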
import re
print soup.find_all(id='link2')
print soup.find_all(href=re.compile("elsie"))

result:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
  • If the attribute name is a Python reserved word, append an underscore to it, for example class_="sister".
print soup.find_all(class_="sister")

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
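  • HTML5 data-* attributes cannot be passed as keyword arguments, because a name containing a hyphen is not a valid Python identifier. Pass a dictionary through the attrs parameter instead: soup.find_all(attrs={"data-foo":"value"})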
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>',features="lxml")
print data_soup.find_all(attrs={"data-foo":"value"})

result:
[<div data-foo="value">foo!</div>]
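  6. The text parameter
  • Works like the name parameter, but searches the strings in the document rather than the tags. A plain string must match exactly; a regular expression, a list, or True is also accepted. In the example below, text="Elsie" finds nothing because the only occurrence is the comment text " Elsie ", which has surrounding spaces. (Beautiful Soup 4.4.0 and later also accept this parameter under the name string.)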
import re
print soup.a
print soup.find_all(text="Elsie")
print soup.find_all(text=["Tillie", "Elsie", "Lacie"])
print soup.find_all(text="story")
print soup.find_all(text="The Dormouse's story")
print soup.find_all(text=re.compile("story"))

result:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
[]
[u'Lacie', u'Tillie']
[]
[u"The Dormouse's story", u"The Dormouse's story"]
[u"The Dormouse's story", u"The Dormouse's story"]
  7. The limit parameter
    The limit parameter caps the number of results that a name or attrs search returns.
print soup.find_all("a")
print "==============="
print soup.find_all("a",limit=2)

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
===============
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
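  8. The recursive parameter
    By default, find_all searches every descendant of the current tag. Pass recursive=False to search only its direct children.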
print soup.body
print "==============================="
print soup.body.find_all("a",recursive=False)
print soup.body.find_all("a")

result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
===============================
[]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In this example every a tag sits inside a p tag, so a search restricted to the direct children of body finds no a tags.
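
The filter types covered in this section can be tried together on one small fragment. This is a sketch only: the menu markup, its ids, and the data-kind attribute are hypothetical and do not come from the tutorial's page, and html.parser is used so that lxml is not required. It is written for Python 3.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one direct <a>, one <a> nested in a <p>, one <b>.
html = """
<div id="menu">
  <a class="nav" href="/home" id="home">Home</a>
  <p><a class="nav" href="/docs" data-kind="ref">Docs</a></p>
  <b>Bold</b>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
menu = soup.div

results = {
    "string": menu.find_all("a"),
    "regex": menu.find_all(re.compile("^b")),
    "list": menu.find_all(["a", "b"]),
    # Custom filter: has a class attribute but no id attribute.
    "function": menu.find_all(lambda tag: tag.has_attr("class") and not tag.has_attr("id")),
    "keyword": menu.find_all(href=re.compile("docs")),
    "class_": menu.find_all(class_="nav"),
    "attrs": menu.find_all(attrs={"data-kind": "ref"}),
    "limit": menu.find_all("a", limit=1),
    # Only direct children of <div> are tested, so the nested <a> is skipped.
    "recursive": menu.find_all("a", recursive=False),
}

# Reduce every result list to the text of the matched tags.
texts = {label: [tag.get_text() for tag in tags] for label, tags in results.items()}
for label, found in texts.items():
    print(label, found)
```

Reducing each result to its text keeps the printout short; the underlying values are still lists of Tag objects, as in the examples above.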

5.2 find
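  • Usage: find(name, attrs, recursive, text, **kwargs)
  • Difference from find_all: find_all gathers every match into a list, whereas find returns only the first match.
  • Apart from that, the two are used in the same way.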
print soup.body.find_all("a")
print soup.body.find("a")

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
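
One practical consequence of the difference shows up when nothing matches: find returns None, while find_all returns an empty list. The sketch below uses a hypothetical two-link fragment parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links and no table.
soup = BeautifulSoup("<p><a id='x'>first</a><a id='y'>second</a></p>", "html.parser")

first = soup.find("a")                # the first matching tag
same = soup.find_all("a", limit=1)    # a one-element list holding that same tag
missing = soup.find("table")          # no match: None
none_found = soup.find_all("table")   # no match: empty list

print(first["id"])       # x
print(len(same))         # 1
print(missing)           # None
print(len(none_found))   # 0
```

Since find can return None, it is worth checking the result before accessing attributes on it.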
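5.3 find_parents and find_parent
  • find_all() and find() search all descendants of the current node (when recursive is left at its default).
  • find_parents and find_parent search the parents of the current node instead.
  • In every other respect the two methods behave and are used as described above.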
print soup.body.find_all("a")
print soup.body.find("a")
print soup.body.find_parents("a")
print soup.body.find_parents("html")

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
[]
[<html><head><title>The Dormouse's story</title></head>\n<body>\n<p class="title" name="dromouse"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body>\n</html>]
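
A case with more than one matching ancestor shows the order of the results. The nested div markup and its ids are hypothetical, and html.parser is used as the parser; this is a sketch rather than part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a link wrapped in two nested <div> tags.
soup = BeautifulSoup(
    "<div id='outer'><div id='inner'><p><a>link</a></p></div></div>", "html.parser"
)
link = soup.a

nearest = link.find_parent("div")                             # closest enclosing <div>
all_divs = [d["id"] for d in link.find_parents("div")]        # nearest ancestor first
no_table = link.find_parent("table")                          # no such ancestor

print(nearest["id"])   # inner
print(all_divs)        # ['inner', 'outer']
print(no_table)        # None
```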
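5.4 find_next_siblings and find_next_sibling
  • Scope: the siblings that come after the current node.
  • In every other respect they behave and are used exactly as described above.
5.5 find_previous_siblings and find_previous_sibling
  • Scope: the siblings that come before the current node.
  • In every other respect they behave and are used exactly as described above.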
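
The sibling searches in 5.4 and 5.5 can be sketched on a short list. The markup and ids are hypothetical, and html.parser is used instead of lxml. Because these methods filter by tag name, the newline strings between the items are skipped automatically.

```python
from bs4 import BeautifulSoup

# Hypothetical list with newlines between the items.
soup = BeautifulSoup(
    "<ul>\n<li id='a'>A</li>\n<li id='b'>B</li>\n<li id='c'>C</li>\n</ul>",
    "html.parser",
)
first = soup.find(id="a")
middle = soup.find(id="b")
last = soup.find(id="c")

next_one = middle.find_next_sibling("li")["id"]
prev_one = middle.find_previous_sibling("li")["id"]
following = [li["id"] for li in first.find_next_siblings("li")]
# Results run outward from the current node, so the nearest sibling comes first.
preceding = [li["id"] for li in last.find_previous_siblings("li")]

print(next_one)    # c
print(prev_one)    # a
print(following)   # ['b', 'c']
print(preceding)   # ['b', 'a']
```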
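5.6 find_all_next and find_next
  • Scope: the tags and strings that come after the current node.
  • In every other respect they behave and are used exactly as described above.
5.7 find_all_previous and find_previous
  • Scope: the tags and strings that come before the current node.
  • In every other respect they behave and are used exactly as described above.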
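
Unlike the sibling searches, these methods follow parse order and therefore cross nesting levels. The sketch below assumes a hypothetical fragment in which one paragraph sits inside a section tag; it is parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <p> is nested one level deeper than the first.
soup = BeautifulSoup(
    "<div><h2>Title</h2><p>one</p><section><p>two</p></section></div>",
    "html.parser",
)
heading = soup.h2
inner = soup.section.p

next_p = heading.find_next("p")                                      # first <p> after the heading
after = [p.get_text() for p in heading.find_all_next("p")]           # includes the nested <p>
prev_h2 = inner.find_previous("h2")                                  # found outside <section>
before = [t.name for t in inner.find_all_previous(["h2", "p"])]      # nearest match first

print(next_p)               # <p>one</p>
print(after)                # ['one', 'two']
print(prev_h2.get_text())   # Title
print(before)               # ['p', 'h2']
```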