The BeautifulSoup4 Parser

张行之

Category: Python. Tags: Python, web scraping, Beautiful Soup

Article link: https://blog.csdn.net/qq_33689414/article/details/78585304

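BeautifulSoup4 is an HTML/XML parsing library. Its main job is to parse HTML/XML documents and extract data from them, much like the lxml library.

lxml can traverse just part of a document, whereas BeautifulSoup4 is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its memory overhead is relatively high and its performance relatively low.

BeautifulSoup4 makes parsing HTML fairly simple. Its API is very user-friendly, it supports CSS selectors, and it can use either the HTML parser from the Python standard library or the lxml parser.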

Installing BeautifulSoup4

pip install beautifulsoup4

Using BeautifulSoup4

The examples below use the following HTML document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

1. Parse the document with the BeautifulSoup class to get a BeautifulSoup object, then print it in a standard, indented format.

from bs4 import BeautifulSoup


soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

The output is the same document, re-indented with each tag and each string on its own line.
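To make the effect of prettify() concrete, here is a minimal sketch on a much smaller snippet. The snippet and the variable names are made up for illustration and are not part of the example document above. Because the indentation width may differ between Beautiful Soup versions, the comparison strips it.

```python
from bs4 import BeautifulSoup

# A tiny, made-up snippet (not from the article) to keep the output short.
tiny_soup = BeautifulSoup("<p><b>Hi</b></p>", "html.parser")
pretty = tiny_soup.prettify()
print(pretty)

# Every tag and every string ends up on its own line. Strip the leading
# indentation so the result does not depend on the indent width.
pretty_lines = [line.strip() for line in pretty.splitlines()]
print(pretty_lines)
```

prettify() is meant for reading and debugging; it changes whitespace, so it is usually not what you want when writing a document back out unchanged.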

2. A few simple ways to navigate the structured data

# Get the title tag
print(soup.title)
# <title>The Dormouse's story</title>


# Get the name of the title tag
print(soup.title.name)
# title


# Get the text content of the title tag
print(soup.title.string)
# The Dormouse's story


# Get the parent tag of title
print(soup.title.parent)
# <head><title>The Dormouse's story</title></head>


# Get the name of title's parent tag
print(soup.title.parent.name)
# head


# Get the first p tag
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>


# Get the class attribute of the p tag
print(soup.p['class'])
# ['title']    # a list, because class can hold several values


# Get all a tags
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


# Get the tag with id='link3'
print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


# Get the URL of every a tag
for link in soup.find_all('a'):
    print(link.get('href'))

#   http://example.com/elsie
#   http://example.com/lacie
#   http://example.com/tillie


# Get all the text in the document
print(soup.get_text())

# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
#    Elsie,
#    Lacie and
#    Tillie;
#    and they lived at the bottom of a well.
# ...
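The lookups above all assume the tag or attribute exists. As a hedged sketch of what happens when it does not, the following uses a small made-up snippet (the variable names are illustrative): find() returns None when nothing matches, tag.get() returns None or a default for a missing attribute, and tag['name'] raises KeyError.

```python
from bs4 import BeautifulSoup

snippet = '<p class="title"><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

# find() returns None when nothing matches, so check before dereferencing.
missing_tag = demo.find("table")
print(missing_tag)

# tag.get() returns None (or the supplied default) for an absent attribute.
link = demo.a
href = link.get("href")
no_title = link.get("title")
fallback = link.get("title", "no title")
print(href, no_title, fallback)

# Square-bracket access raises KeyError for an absent attribute.
try:
    link["title"]
except KeyError:
    key_error_raised = True
else:
    key_error_raised = False
print(key_error_raised)
```

In scraping code, where pages often lack an expected element, checking for None before chaining further attribute access avoids AttributeError on real-world input.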

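3. Parsers available to BeautifulSoup

3.1 Python standard library

Usage: BeautifulSoup(html_doc, "html.parser")

Advantages: built into Python, moderate speed, good tolerance of malformed documents

Disadvantages: poor tolerance of malformed documents in Python versions before 2.7.3 (or 3.2.2)

3.2 lxml parser (recommended)

Usage: BeautifulSoup(html_doc, 'lxml')

Advantages: fast (written in C) and tolerant of malformed documents; recommended

3.3 html5lib

Usage: BeautifulSoup(html_doc, "html5lib")

Advantages: the best tolerance of malformed documents; parses the document the way a browser does and produces valid HTML5

Disadvantages: slow; requires installing the external html5lib package (pure Python, no C extension)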
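The parsers differ most visibly on invalid markup. The sketch below is one way to compare them on your own machine; the broken snippet and variable names are assumptions for illustration. lxml and html5lib are optional packages, so the code catches bs4.FeatureNotFound, which Beautiful Soup raises when a requested parser is not installed, and no particular output is claimed for them here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_html = "<a></p>"  # deliberately invalid markup, made up for this test

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # The requested parser library is not installed.
        results[parser] = None

for parser, repaired in results.items():
    print(f"{parser:12} -> {repaired!r}")
```

Since each parser repairs broken markup in its own way, naming the parser explicitly keeps results consistent across machines.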

4. Parsing an HTML file

from bs4 import BeautifulSoup

# Open the file, pass the file object to BeautifulSoup, and parse it with lxml
with open('index.html') as f:
    soup = BeautifulSoup(f, "lxml")
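For a version that runs without an existing index.html, the following sketch first writes a throwaway file and then parses it. The file contents, the explicit utf-8 encoding, and the use of html.parser instead of lxml (to avoid the extra dependency) are all assumptions made for this illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small, made-up HTML file so the example is self-contained.
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Demo page</title></head><body></body></html>")

    # The with-block closes the file handle as soon as parsing is done.
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

page_title = file_soup.title.string
print(page_title)
```

The parsed tree stays usable after the file is closed, because Beautiful Soup reads the whole document while constructing the object.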

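5. Kinds of objects

BeautifulSoup converts a complex HTML document into a complex tree structure in which every node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

5.1 Tag: a Tag object corresponds to a tag in the original XML or HTML document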

print(soup.title)
# <title>The Dormouse's story</title>

print(type(soup.title))
# <class 'bs4.element.Tag'>

A Tag has two important attributes: name and attrs

name: the name of the tag

print(soup.title.name)
# title

attrs: the attributes of the tag

print(soup.p.attrs)
# {'class': ['title']}
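A tag's attributes behave much like a dictionary. As a small sketch on a made-up tag (not the example document), the code below reads a multi-valued attribute and a single-valued one, then adds, changes and deletes attributes.

```python
from bs4 import BeautifulSoup

tag_soup = BeautifulSoup(
    '<a class="sister big" id="link1" href="#">Elsie</a>', "html.parser"
)
link = tag_soup.a

# class may hold several values, so it is returned as a list;
# id holds a single value, so it is returned as a string.
classes = link["class"]
link_id = link["id"]
print(classes, link_id)

# Attributes can be changed, added and deleted like dict entries.
link["id"] = "first-link"
link["title"] = "A sister"
del link["href"]
print(link.attrs)
```

Changes made this way are reflected when the tag is converted back to a string.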

5.2 NavigableString: the text content of a tag

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

5.3 BeautifulSoup: represents the document as a whole. Most of the time it can be treated as a special `Tag`.

print(soup.name)
# [document]

print(type(soup))
# <class 'bs4.BeautifulSoup'>

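5.4 Comment: a special type of NavigableString. When printed, its content does not include the comment delimiters.

# A separate variable name keeps the earlier soup object available for later sections
markup = '<p><!--Hello--></p>'
comment_soup = BeautifulSoup(markup, 'lxml')

print(comment_soup.p.string)
# Hello

print(type(comment_soup.p.string))
# <class 'bs4.element.Comment'>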
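Because a Comment prints like ordinary text, it is easy to mistake one for real content. The sketch below, on a made-up two-paragraph snippet, uses isinstance() to tell them apart and collects every comment in the document by passing a function as the string argument of find_all().

```python
from bs4 import BeautifulSoup, Comment

comment_markup = "<p><!--Hello--></p><p>World</p>"
doc = BeautifulSoup(comment_markup, "html.parser")

first, second = doc.find_all("p")
first_is_comment = isinstance(first.string, Comment)
second_is_comment = isinstance(second.string, Comment)
print(first_is_comment, second_is_comment)

# Collect every comment node in the document.
comments = doc.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
```

A check like this is useful before treating .string as visible page text.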

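6. Traversing the document tree

6.1 Child nodes: the `.contents` and `.children` attributes

A tag's .contents attribute returns its child nodes as a list:

print(soup.p.contents)
# [<b>The Dormouse's story</b>]     # a single child node

print(soup.p.contents[0])
# <b>The Dormouse's story</b>     # the first child; indexing an empty list raises IndexError

A tag's .children attribute returns an iterator over its direct children, so you can loop over them: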

print(type(soup.p.children))
# <class 'list_iterator'>

for child in soup.p.children:
    print(child)    # <b>The Dormouse's story</b>
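Children are not always tags: text between tags shows up as NavigableString children too. The following sketch, on a made-up paragraph, classifies each direct child with isinstance().

```python
from bs4 import BeautifulSoup, NavigableString, Tag

mixed = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

kinds = []
for child in mixed.p.children:
    if isinstance(child, Tag):
        kinds.append(("tag", child.name))
    elif isinstance(child, NavigableString):
        kinds.append(("text", str(child)))
print(kinds)
```

Here the paragraph has three direct children: a text node, the b tag, and another text node.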

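6.2 All descendants: the `.descendants` attribute

The .descendants attribute iterates recursively over all of a tag's descendants (children, grandchildren, and so on). It is used the same way as .children.

for tag in soup.p.descendants:
    print(tag)

# Output: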
# <b>The Dormouse's story</b>
# The Dormouse's story
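The difference between direct children and all descendants is easiest to see by counting them. This is a minimal sketch on a made-up nested snippet; the names are illustrative.

```python
from bs4 import BeautifulSoup

nested = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")

direct = list(nested.div.children)
everything = [str(node) for node in nested.div.descendants]

print(len(direct))
for node in everything:
    print(repr(node))
```

The div has a single direct child (the p tag) but four descendants, visited depth-first in document order: the p tag, its text, the b tag, and the text inside b.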

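7. Searching the document tree

7.1 find_all(name, attrs, recursive, text, limit, **kwargs)

find_all() parameters:

name: matches tags by name (accepts a string, a regular expression, or a list)

attrs: a dict of attribute values the tag must have

recursive: whether to search all descendants rather than only direct children; defaults to True

text: matches on the tag's text content (Beautiful Soup 4.4.0 and later call this argument string; text is kept as an older alias)

limit: the maximum number of results to return

7.1.1 Passing a string as name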

print(soup.find_all('p', attrs = {'class': 'title'}))
# [<p class="title"><b>The Dormouse's story</b></p>]

print(soup.find_all('p', text='...'))
# [<p class="story">...</p>]

print(soup.find_all('a', limit=2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
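The recursive parameter is not exercised by the calls above, so here is a small sketch of its effect on a made-up document. With recursive=False, find_all() only looks at direct children of the tag it is called on.

```python
from bs4 import BeautifulSoup

tree = BeautifulSoup(
    "<html><head><title>T</title></head><body></body></html>", "html.parser"
)

# title is a grandchild of html, so only the recursive search finds it.
deep = tree.html.find_all("title")
shallow = tree.html.find_all("title", recursive=False)
print(deep)
print(shallow)

# head and body are direct children of html, so they are still found.
direct_names = [t.name for t in tree.html.find_all(True, recursive=False)]
print(direct_names)
```

Passing True as the name matches every tag, which is a convenient way to list a tag's direct child tags.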

7.1.2 Passing a regular expression as name

import re

for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

# body
# b
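Regular expressions are accepted for attribute filters as well as for tag names. The sketch below assumes a made-up set of links (one of them on a different domain) and keeps only those whose href matches a pattern.

```python
import re

from bs4 import BeautifulSoup

links_html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="https://example.org/tillie" id="link3">Tillie</a>
"""
links = BeautifulSoup(links_html, "html.parser")

# A compiled pattern given as an attribute filter may match anywhere
# in the attribute value.
com_links = links.find_all("a", href=re.compile(r"example\.com"))
ids = [a["id"] for a in com_links]
print(ids)
```

Anchoring the pattern with ^ or $ restricts the match to the start or end of the attribute value.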

7.1.3 Passing a list as name

for tag in soup.find_all(['body','b']):
    print(tag.name)

# body
# b

7.2 Searching with CSS selectors

7.2.1 By tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

7.2.2 By class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

7.2.3 By id

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

7.2.4 Combined selectors

print(soup.select('#link1,title'))
# [<title>The Dormouse's story</title>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

7.2.5 By attribute

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
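Attribute selectors can also match part of a value rather than the whole of it. As a hedged sketch, the code below rebuilds just the three links from the example document (so it runs on its own) and uses the standard CSS prefix, suffix and substring operators.

```python
from bs4 import BeautifulSoup

sisters_html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
sisters = BeautifulSoup(sisters_html, "html.parser")

# ^= matches a prefix, $= a suffix, *= a substring of the attribute value.
by_prefix = [a["id"] for a in sisters.select('a[href^="http://example.com/"]')]
by_suffix = [a["id"] for a in sisters.select('a[href$="tillie"]')]
by_substring = [a["id"] for a in sisters.select('a[href*="lacie"]')]
print(by_prefix, by_suffix, by_substring)
```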

7.2.6 Getting the text content

for tag in soup.select('a'):
    print(tag.get_text())

# Elsie
# Lacie
# Tillie
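When only the first match is needed, select_one() returns a single tag (or None) instead of a list. The sketch below, on a made-up paragraph, combines it with the child combinator and also shows get_text() with a separator and strip=True, which tidies whitespace between text fragments.

```python
from bs4 import BeautifulSoup

story_html = (
    '<p class="story"> Once upon a time '
    '<a id="link1">Elsie</a> and <a id="link2">Lacie</a> </p>'
)
story = BeautifulSoup(story_html, "html.parser")

# First a tag that is a direct child of p.story.
first_link = story.select_one("p.story > a")
first_text = first_link.get_text()

# No span exists here, so select_one() returns None rather than raising.
missing = story.select_one("p.story > span")

# Strip each text fragment and join the non-empty ones with one space.
clean_text = story.p.get_text(" ", strip=True)
print(first_text, missing, clean_text)
```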

For details, see the official BeautifulSoup4 documentation:

http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
