Python BeautifulSoup library: a usage summary - CSDN blog

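from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the object from a string
soup = BeautifulSoup(html, 'html.parser')

# Create the object from a local file
soup = BeautifulSoup(open('index.html'), 'html.parser')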
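To make the file-based form above concrete, here is a minimal sketch that writes a small page to a temporary file and parses it from an open file handle. The page contents, the temporary `index.html`, and the explicit `html.parser` argument are assumptions made for the example, not part of the original post.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page contents, written to a temporary index.html for the demo.
page = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)

    # An open file handle is accepted in place of a string;
    # the with-block closes the file once parsing is done.
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_file.title.string)  # Demo page
```

Opening the file in a `with` block is a small design choice: it avoids leaving the handle open, which the bare `open('index.html')` call does.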

Print the contents of the soup object as formatted output:

print(soup.prettify())
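A small sketch of what `prettify()` does, using a one-line document invented for the example: the compact markup comes back as an indented string with one tag or text node per line.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document used only for this demo.
compact = "<html><body><p>Hello <b>world</b></p></body></html>"
demo_soup = BeautifulSoup(compact, "html.parser")

# prettify() returns a str; it does not modify the parsed tree.
pretty = demo_soup.prettify()
print(pretty)
```

Because `prettify()` adds whitespace around text nodes, it is best kept for inspection rather than for producing markup you intend to parse again.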

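Specifying the encoding: when the HTML uses an encoding other than UTF-8 or ASCII, such as GB2312, you need to pass the corresponding character encoding so that BeautifulSoup can parse it correctly.

htmlCharset = "GB2312"

# respHtml holds the raw page bytes; in bs4 the argument is spelled from_encoding
soup = BeautifulSoup(respHtml, 'html.parser', from_encoding=htmlCharset)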
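The following is a hedged sketch of the encoding case described above. The page text and the variable names are made up for the example; the bytes are produced locally with `str.encode` to stand in for a downloaded response.

```python
from bs4 import BeautifulSoup

# Hypothetical page, encoded as GB2312 bytes to mimic a downloaded response body.
raw_bytes = "<html><head><title>你好世界</title></head><body></body></html>".encode("gb2312")

# from_encoding only applies to bytes input; it tells bs4 how to decode them.
gb_soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gb2312")

print(gb_soup.title.string)
```

If the markup is already a decoded `str`, there is nothing left to decode and `from_encoding` should be left out.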

from bs4 import BeautifulSoup

import bs4

import re

# The string to parse

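html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title aq">
<b>The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""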

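# In each snippet below, the commented part is the result of running it

# Create a BeautifulSoup object from the HTML string
# (from_encoding applies only to bytes input, so it is omitted for a str)
soup = BeautifulSoup(html_doc, 'html.parser')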

# Print the first title tag
print(soup.title)
# <title>The Dormouse's story</title>

# Print the tag name of the first title tag
print(soup.title.name)
# title

# Print the text content of the first title tag
print(soup.title.string)
# The Dormouse's story

# Print the tag name of the parent of the first title tag
print(soup.title.parent.name)
# head

# Print the first p tag
print(soup.p)
"""
<p class="title aq">
<b>The Dormouse's story</b>
</p>
"""

# Print the class attribute of the first p tag
print(soup.p['class'])
# ['title', 'aq']

# Print the href attribute of the first a tag
print(soup.a['href'])
# http://example.com/elsie

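# Tag attributes can be added, deleted, or modified, in the same way as dictionary entries.

# Change the href attribute of the first a tag to http://www.baidu.com/
# soup.a['href'] = 'http://www.baidu.com/'

# Add a name attribute to the first a tag
# soup.a['name'] = '百度'

# Delete the class attribute of the first a tag
# del soup.a['class']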

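# Print all child nodes of the first p tag
print(soup.p.contents)
# ['\n', <b>The Dormouse's story</b>, '\n']

# Print the first a tag
print(soup.a)
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

# Print all a tags as a list
print(soup.find_all('a'))
"""
[<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
"""

# Print the first tag whose id attribute equals link3 (here, an a tag)
print(soup.find(id="link3"))
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>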

# Get all of the text content

print(soup.get_text())

"""

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie,

Lacie

and

Tillie;

and they lived at the bottom of a well.

...

"""

# Print all attributes of the first a tag

print(soup.a.attrs)

# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

for link in soup.find_all('a'):
    # Get the href attribute of each link
    print(link.get('href'))

"""

http://example.com/elsie

http://example.com/lacie

http://example.com/tillie

"""

# Loop over the child nodes of soup.p and print each one
for child in soup.p.children:
    print("child of soup.p:", child)
"""
child of soup.p: 

child of soup.p: <b>The Dormouse's story</b>
child of soup.p: 

"""

# Regex match: tags whose name contains the letter b
for tag in soup.find_all(re.compile(r"b")):
    print(tag.name)
"""
body
b
"""

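The script above leaves its attribute edits commented out so that later outputs stay unchanged. As a minimal sketch on a throwaway one-tag document (the markup and the variable names are assumptions for the example), the dictionary-style operations look like this when actually run:

```python
from bs4 import BeautifulSoup

# A throwaway one-tag document, so the edits do not disturb any other example.
edit_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = edit_soup.a

link["href"] = "http://www.baidu.com/"  # modify an existing attribute
link["name"] = "baidu"                  # add a new attribute
del link["class"]                       # delete an attribute

print(link.attrs)
print("class" in link.attrs)  # False
```

Reading a missing attribute with `link["class"]` would now raise `KeyError`, just as with a dictionary, while `link.get("class")` returns `None`.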
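4. The four kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: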

Tag

NavigableString

BeautifulSoup

Comment
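To see the four kinds side by side, the sketch below parses a tiny fragment written for this example and prints the class name of one node of each kind. The markup is hypothetical; the classes are imported from `bs4.element`.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment containing a tag, some text, and an HTML comment.
markup = '<p class="title"><b>text</b><!-- a comment --></p>'
kinds_soup = BeautifulSoup(markup, "html.parser")

text_node = kinds_soup.b.string          # the text inside <b>
comment_node = kinds_soup.p.contents[1]  # the comment that follows <b>

print(type(kinds_soup).__name__)    # BeautifulSoup
print(type(kinds_soup.p).__name__)  # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment
```

`Comment` is a subclass of `NavigableString`, so an `isinstance(node, NavigableString)` check is true for comments as well; compare against `Comment` first when the two need to be told apart.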

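(1) Tag

What is a Tag? Put simply, it is one of the individual tags in the HTML, for example:

<title>The Dormouse's story</title>

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

The <title> and <a> tags above, together with the content they enclose, are Tags. Writing soup followed by a tag name retrieves such a tag easily, which is a good deal more convenient than regular expressions. Note, however, that this finds only the first matching tag in the whole document: soup.title returns the title tag, and soup.p returns the first p tag in the document. To get every matching tag, use the find_all function. find_all returns a sequence that you can loop over to pick out what you need.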
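The difference between attribute-style access and `find_all` can be sketched on a made-up three-paragraph fragment (the markup and variable names are assumptions for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three sibling paragraphs.
markup = "<p>first</p><p>second</p><p>third</p>"
para_soup = BeautifulSoup(markup, "html.parser")

first_p = para_soup.p            # attribute access: first match only
all_p = para_soup.find_all("p")  # every match, in document order

print(first_p.get_text())                # first
print([p.get_text() for p in all_p])     # ['first', 'second', 'third']
print(para_soup.find("p") is first_p)    # True
```

`para_soup.p` is shorthand for `para_soup.find("p")`, which is why both return the same node.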

We can check the type of these objects:

print(type(soup.a))
# <class 'bs4.element.Tag'>

A Tag has two important attributes: name and attrs.

name

print(soup.name)
print(soup.head.name)
# [document]
# head

The soup object itself is a special case: its name is [document]. For any other tag inside it, the value is the tag's own name.
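A self-contained sketch of the `name` behaviour, using a minimal document invented for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
name_soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body></body></html>",
    "html.parser",
)

print(name_soup.name)        # [document]
print(name_soup.head.name)   # head
print(name_soup.title.name)  # title
```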

attrs

print(soup.p.attrs)