This article uses BeautifulSoup 3. BeautifulSoup 4 has since been released, and its package has been renamed to bs4.
(1) Download and installation
    # Download and install BeautifulSoup
    pip install BeautifulSoup
Alternatively, you can download the installation package and install it manually.
(2) Quick start
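    # BeautifulSoup quick start
    from BeautifulSoup import BeautifulSoup
    # html_doc is assumed to hold the page's HTML source as a string
    soup = BeautifulSoup(html_doc)
    print soup.title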
Result:
    # BeautifulSoup output
    <title>前门大街_百度百科</title>
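For readers on Python 3, the same quick start can be sketched with BeautifulSoup 4 (the bs4 package mentioned at the top). This is only a minimal sketch: it assumes bs4 is installed (`pip install beautifulsoup4`), and the short `html_doc` string is a hypothetical stand-in for a downloaded page, not the page used in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not from the original article).
html_doc = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"

# Naming the parser explicitly keeps the result independent of which
# optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

# print is a function in Python 3.
print(soup.title)  # <title>Example page</title>
```

The only visible differences from the BeautifulSoup 3 version are the import path, the explicit parser argument, and the `print()` call.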
(3) Introduction to BeautifulSoup objects
BeautifulSoup mainly involves three types of objects:
- BeautifulSoup.BeautifulSoup
- BeautifulSoup.Tag
- BeautifulSoup.NavigableString
    # BeautifulSoup example
    from BeautifulSoup import BeautifulSoup
    import urllib2  # for downloading the page; the download step is not shown here
    # html_doc is assumed to hold the page's HTML source as a string
    soup = BeautifulSoup(html_doc)
    print type(soup)
    print type(soup.title)
    print type(soup.title.string)
    print soup.title
    print soup.title.string
Result:
    # BeautifulSoup example output
    <class 'BeautifulSoup.BeautifulSoup'>
    <class 'BeautifulSoup.Tag'>
    <class 'BeautifulSoup.NavigableString'>
    <title>百度一下,你就知道</title>
    百度一下,你就知道
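The example above shows fairly clearly that BeautifulSoup mainly involves three types of objects.
- BeautifulSoup.BeautifulSoup // the BeautifulSoup object (the whole document)
- BeautifulSoup.Tag // a tag object
- BeautifulSoup.NavigableString // a navigable string (text) object
(4) The BeautifulSoup parse tree
1. BeautifulSoup.Tag object methods
Getting a tag object: Tag objects are accessed with dot notation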
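The same three object types exist in BeautifulSoup 4 under the bs4 package. The sketch below is an illustration rather than part of the original example; it assumes bs4 is installed and uses a hypothetical `html_doc` string.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical stand-in for a downloaded page (not from the original article).
html_doc = "<html><head><title>Example page</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup))               # the document object
print(type(soup.title))         # a Tag
print(type(soup.title.string))  # a NavigableString

# In bs4 the document object is itself a kind of Tag,
# and a NavigableString behaves like an ordinary str.
print(isinstance(soup, Tag))
print(isinstance(soup.title.string, str))
```

Because the document object is also a Tag, the navigation attributes described in the next section work on `soup` itself as well as on individual tags.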
    # BeautifulSoup example
    title = soup.title
    print type(title.contents)
    print title.contents
    print title.contents[0]
    # BeautifulSoup example output
    <type 'list'>
    [u'\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053']
    百度一下,你就知道
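contents
Returns the children of the current tag as a list. If the tag holds only a single piece of text and no child tags, string and contents[0] return the same content, as in the example above.
next, parent
next returns the element that follows the current tag in parse order (for a tag that has children, this is its first child); parent returns the tag that encloses the current tag.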
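To make the difference between `contents` and `string` concrete, here is a small sketch using BeautifulSoup 4 (assuming bs4 is installed). The `html_doc` fragment is hypothetical and chosen so that one tag holds only text while the other holds text plus a child tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment (not from the original article).
html_doc = "<div><p>only text</p><p>text and <b>bold</b></p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("p")

# A tag with a single text child: string and contents[0] agree.
print(first.contents)
print(first.string == first.contents[0])  # True

# A tag with mixed children: contents lists every child,
# while string is None because there is no single text child.
print(len(second.contents))  # 2
print(second.string)         # None
```

When a tag may contain child tags, iterating over `contents` is therefore the safer choice; `string` is a shortcut for the single-text-child case.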
    # BeautifulSoup example
    html = soup.html
    print html.next
    print ''
    print html.next.next
    print html.next.next.nextSibling
    # BeautifulSoup example output
    <head><meta http-equiv="content-type" content="text/html;charset=utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=Edge" /><meta content="always" name="referrer" /><meta name="theme-color" content="#2932e1" /><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="icon" sizes="any" mask="mask" href="//www.baidu.com/img/baidu.svg" /><link rel="dns-prefetch" href="//s1.bdstatic.com" /><link rel="dns-prefetch" href="//t1.baidu.com" /><link rel="dns-prefetch" href="//t2.baidu.com" /><link rel="dns-prefetch" href="//t3.baidu.com" /><link rel="dns-prefetch" href="//t10.baidu.com" /><link rel="dns-prefetch" href="//t11.baidu.com" /><link rel="dns-prefetch" href="//t12.baidu.com" /><link rel="dns-prefetch" href="//b1.bdstatic.com" /><title>百度一下,你就知道</title>
    ......
    </head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
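In BeautifulSoup 4 the corresponding attributes are named `next_element` and `parent`. The following sketch (assuming bs4 is installed, with a hypothetical `html_doc` that contains no whitespace between tags) walks the tree the same way as the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not from the original article).
html_doc = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

html = soup.html
# next_element follows parse order, so it descends into the first child.
print(html.next_element.name)               # head
print(html.next_element.next_element.name)  # title
# The element after <title> is its text node.
print(soup.title.next_element)              # Example page
# parent moves back up to the enclosing tag.
print(soup.title.parent.name)               # head
```

Note that the element after a tag in parse order can be a text node rather than another tag, as the third print shows.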
nextSibling, previousSibling
Return the next sibling and the previous sibling of the current tag.
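As an illustration, the sketch below uses the BeautifulSoup 4 spellings `next_sibling` and `previous_sibling` (assuming bs4 is installed). The list markup in `html_doc` is hypothetical and deliberately written without whitespace between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment (not from the original article).
html_doc = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")
first, second, third = soup.find_all("li")

print(second.next_sibling)      # <li>three</li>
print(second.previous_sibling)  # <li>one</li>

# The first and last children have no sibling on one side.
print(first.previous_sibling)   # None
print(third.next_sibling)       # None
```

On real pages the whitespace between tags is itself a text node, so the sibling of a tag is often a string containing a newline rather than the neighbouring tag.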