Python BeautifulSoup Basics Summary

凯耐

Article link: https://blog.csdn.net/weixin_36279318/article/details/79240138

(1) Introduction to BeautifulSoup4

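Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Beautiful Soup can save you hours or even days of work.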

Using BeautifulSoup

Building a BeautifulSoup object typically takes two arguments: the first is the text string to parse, and the second tells BeautifulSoup which parser to use on the HTML.

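The parser turns the HTML into the corresponding objects, while BeautifulSoup takes care of working with the data (adding, deleting, querying, and modifying). "html.parser" is the parser built into Python; "lxml" is a parser written in C that runs faster but has to be installed separately.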
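As a minimal sketch of the parser choice described above: the markup string and the variable names are made up for illustration, and the fallback logic is only one possible way to handle a missing lxml installation.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, not taken from the article.
markup = "<p class='title'><b>Hello</b></p>"

# "html.parser" ships with Python, so it needs no extra installation.
soup = BeautifulSoup(markup, "html.parser")
print(soup.b.string)  # Hello

# "lxml" is optional; bs4 raises FeatureNotFound when it is not installed,
# so this sketch falls back to the built-in parser in that case.
try:
    fast_soup = BeautifulSoup(markup, "lxml")
except FeatureNotFound:
    fast_soup = BeautifulSoup(markup, "html.parser")
print(fast_soup.b.string)  # Hello
```

Both parsers expose the same navigation API, so the rest of the code does not need to know which one was used.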

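BeautifulSoup models HTML with four main data types: Tag, NavigableString, BeautifulSoup, and Comment. Every tag node is a Tag object, a NavigableString object is usually a string wrapped inside a Tag object, and the BeautifulSoup object represents the whole HTML document.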
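The four types can be observed directly. The snippet below is a small sketch with a made-up one-line document; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup containing a tag, a string, and a comment.
soup = BeautifulSoup("<p>text<!--note--></p>", "html.parser")

tag = soup.p               # a Tag
text = tag.contents[0]     # a NavigableString
comment = tag.contents[1]  # a Comment (a special kind of NavigableString)

for node in (soup, tag, text, comment):
    print(type(node).__name__)
# BeautifulSoup
# Tag
# NavigableString
# Comment
```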

NavigableString

To get the content inside a tag, use .string directly; the result is a NavigableString object.
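A short sketch of how .string behaves, using a made-up snippet. One detail worth knowing: .string is None when a tag has more than one child, in which case get_text() is the usual alternative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: one tag with a single child, one with two children.
soup = BeautifulSoup("<p><b>one</b></p><div><i>a</i><i>b</i></div>", "html.parser")

single = soup.b.string    # the only child of <b> is a string
nested = soup.p.string    # <p> has a single child tag, so .string looks inside it
several = soup.div.string # <div> has two children, so there is no single string

print(single, type(single).__name__)  # one NavigableString
print(nested)                         # one
print(several)                        # None
print(soup.div.get_text())            # ab
```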

1. HTML content

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing html_doc with BeautifulSoup yields a BeautifulSoup object, which prettify() can then output with standard indentation.
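The parsing step can be sketched as follows. To keep the snippet self-contained it uses a shortened, hypothetical version of the document rather than the full html_doc.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc, used only to keep this sketch self-contained.
sample = "<html><head><title>The Dormouse's story</title></head><body><p>...</p></body></html>"

soup = BeautifulSoup(sample, "html.parser")

# prettify() returns the document as an indented string.
pretty = soup.prettify()
print(pretty)
```

The later examples in this article assume a soup object built the same way from the full html_doc.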

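Key point 01

Method	Description
BeautifulSoup()	Parses HTML text and returns a BeautifulSoup object.
prettify()	Returns the parsed document as a string with standard indentation.
soup.title	Gets the title tag together with its content.
soup.title.string	Gets only the content of the title tag.
soup.title.parent	Gets the parent node and its content.
soup.p	Gets the first p tag; to get its content, use soup.p.string.
soup.p['class']	Gets the class attribute of the first p tag; returns a list.
find_all()	Returns all tags that match the conditions; iterate over the result to process each tag.
find()	Returns the first tag that matches the conditions.
get_text()	Strips the HTML tags and returns the text inside them.
get('href')	Gets the value of the href attribute (the link).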

2. Ways to get structured data from the HTML:

soup.title
# 1. Get the title tag from the HTML
# <title>The Dormouse's story</title>

soup.title.name
# 2. Get the name of the title tag
# 'title'

soup.title.string
# 3. Get the content of the title tag; .string drops the surrounding tag
# "The Dormouse's story"

soup.title.parent.name
# 4. Get the name of the title tag's parent node
# 'head'

soup.p
# 5. Get the first p tag
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# 6. Get the value of the class attribute of the first p tag (returned as a list)
# ['title']

soup.a
# 7. Get the first a tag
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# 8. Get all a tags
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# 9. Get the first tag that matches a condition
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

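3. Finding the links of all a tags in the document by iterating:

# Iterate over all a tags and print the link of each one
for link in soup.find_all('a'):
    # print(link.get_text())  # would print the link text instead: Elsie, Lacie, Tillie
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie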

4. Getting all the text in the document:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

5. Getting the content of a p tag and of the child tags it contains

# Get all the text under the first tag that matches
print(soup.find('p',{'class':'story'}).get_text())
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.

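Note:

 findAll(tag, attrs, recursive, text, limit, **keywords) (findAll is the older spelling of find_all):
 Parameters:
 tag: a string or a list of strings, e.g. findAll("h1") or findAll(["h1", "h2", "h3"])
 attrs: a dictionary of attributes, used together with tag, e.g. findAll('a', {'id': ['link1', 'link2']})
 recursive: rarely needed; with recursive=False only direct children are searched
 text: matches strings instead of tags, e.g. findAll(text='Tillie'); the length of the result gives the number of matches
 limit: caps the number of results; findAll('a', limit=1) returns a one-element list holding the same tag that find('a') returns
 **keywords: keyword arguments name an attribute and its value; the result is a list.
 findAll(id='link2') is equivalent to findAll(attrs={'id': 'link2'})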
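The parameters above can be tried out with a short sketch. It uses the modern find_all spelling and the string argument (the newer name for text, an assumption that a reasonably recent bs4 release is installed), on a trimmed copy of the example document.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example document, kept short for this sketch.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# tag: a list of names matches any of them, in document order.
by_name = soup.find_all(["a", "p"])
print([t.name for t in by_name])  # ['p', 'a', 'a', 'a']

# attrs: a dictionary; a list value matches any of the listed values.
by_attrs = soup.find_all("a", {"id": ["link1", "link2"]})
print([t["id"] for t in by_attrs])  # ['link1', 'link2']

# recursive=False: only direct children of soup are searched, so no <a> is found.
direct = soup.find_all("a", recursive=False)
print(len(direct))  # 0

# string: matches text nodes instead of tags.
by_string = soup.find_all(string="Tillie")
print(len(by_string))  # 1

# limit=1 returns a one-element list; find() returns the element itself.
limited = soup.find_all("a", limit=1)
print(limited[0] == soup.find("a"))  # True

# Keyword arguments are a shortcut for the attrs dictionary.
by_keyword = soup.find_all(id="link2")
print(by_keyword == soup.find_all(attrs={"id": "link2"}))  # True
```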

6. Matching tags with regular expressions

import re

# 1. Match tags whose name starts with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
    # body
    # b
# 2. Match links whose href fits a pattern
s=soup.findAll('a',href=re.compile(r'^http://example\.com/'))
print(s)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

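Key point 02: Kinds of objects

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.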

1. The most important attributes of a Tag: name and attributes

1. Get the type of a Tag
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="Person">Kaina</b>','html.parser')
tag = soup.b
print(type(tag))
# <class 'bs4.element.Tag'>


2. Get the name of the tag
soup = BeautifulSoup('<b class="Person">Kaina</b>','html.parser')
tag = soup.b
print(tag.name)
# b


3. Change the name of the tag
soup = BeautifulSoup('<b class="Person">Kaina</b>','html.parser')
tag = soup.b
tag.name = "bb"
print(tag)
# <bb class="Person">Kaina</bb>

2. Single-valued attributes

A tag may have any number of attributes. The tag <b class="Person"> has a "class" attribute whose value is "Person". A tag's attributes are read and written the same way as a dictionary.

1. Get, modify, and delete a tag's attribute values
soup = BeautifulSoup('<b class="Person">Kaina</b>','html.parser')
tag = soup.b
######################################################################
# 01 Two ways to get a tag's attribute values
print(tag['class'])  # returns a list, because class can hold several values
print(tag.attrs)  # returns a dictionary
# ['Person']
# {'class': ['Person']}
#######################################################################
# 02 Modify a tag's attribute value
tag['class'] = 'Student'
# <b class="Student">Kaina</b>
#######################################################################
# 03 Delete an attribute together with its value
soup = BeautifulSoup('<b class="Person">Kaina</b>','html.parser')
tag = soup.b
del tag['class']
print(tag)
# <b>Kaina</b>
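Because attributes behave like a dictionary, the usual dictionary idioms apply as well. The following is a small sketch; the id value "k1" is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="Person">Kaina</b>', 'html.parser')
tag = soup.b

# get() returns None (or a default) instead of raising KeyError
# when the attribute is missing.
missing = tag.get('id')
fallback = tag.get('id', 'no-id')
print(missing, fallback)  # None no-id

# Adding a new attribute works like adding a dictionary key.
tag['id'] = 'k1'  # hypothetical value
print('id' in tag.attrs)  # True
print(tag)  # <b class="Person" id="k1">Kaina</b>
```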

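3. Multi-valued attributes

In Beautiful Soup, a multi-valued attribute is returned as a list.

1. Getting the value of the class attribute returns a list.
soup = BeautifulSoup('<b class="Person Student">Kaina</b>','html.parser')
tag = soup.b
print(tag['class'])
# ['Person', 'Student']

If an attribute looks as if it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a string.

id_soup = BeautifulSoup('<p id="my id"></p>','html.parser')
id_soup.p['id']
# 'my id'

When a tag is converted to a string, the values of a multi-valued attribute are merged into one.

soup = BeautifulSoup('<a>My name is<b rel="one">Kaina</b></a>','html.parser')
value=soup.b['rel']
print(value)
# ['one']

# 1. Modify the attribute value
soup.b['rel']=['one','two']
print(soup.a)
# <a>My name is<b rel="one two">Kaina</b></a>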
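When code should not care whether an attribute is multi-valued, get_attribute_list() is one option, since it always returns a list. The sketch below assumes a bs4 release recent enough to provide that method, and the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one multi-valued and one single-valued attribute.
soup = BeautifulSoup('<p class="a b" id="my id"></p>', 'html.parser')
p = soup.p

classes = p['class']                  # multi-valued: a list
raw_id = p['id']                      # single-valued: a plain string
id_list = p.get_attribute_list('id')  # always a list, whatever the attribute

print(classes)  # ['a', 'b']
print(raw_id)   # my id
print(id_list)  # ['my id']
print(p)        # <p class="a b" id="my id"></p>
```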

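Key point 03: Traversing the document tree

Child nodes

A Tag may contain several strings or other Tags; all of these are children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over children.
Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.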

1. Getting a tag together with its content

1. Use the BeautifulSoup object plus the tag name
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>

2. Get a child tag under a tag
soup.body.b
# <b>The Dormouse's story</b>

3. Get all a tags
data=soup.find_all('a')
print(data)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

01. .contents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
head_tag=soup.head
# 1. Print the tag
print(head_tag)
# <head><title>The Dormouse's story</title></head>
# 2. Print the children of the head tag as a list
print(head_tag.contents)
# [<title>The Dormouse's story</title>]
# 3. Print the first child
title_tag=head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>
# 4. Print the children of that child
print(title_tag.contents)
# ["The Dormouse's story"]
########################################################################
# A string has no .contents attribute, because a string has no children:
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
########################################################################
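Since .contents mixes tags and strings, code that walks it often checks the node type first. A minimal sketch with a made-up paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup whose <p> has both string and tag children.
soup = BeautifulSoup("<p>Hi <b>there</b>!</p>", "html.parser")

contents = soup.p.contents
print(len(contents))  # 3

# Only Tag objects have children of their own, so filter before descending.
child_tags = [node for node in contents if isinstance(node, Tag)]
print([t.name for t in child_tags])  # ['b']
print(child_tags[0].contents)        # ['there']
```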

02. .children (working with child nodes)

HTML document

html='''
<table>
<tr>
  <th>方法</th>
  <th>说明</th>
</tr>
<tr>
  <th>姓名</th>
  <th>年龄</th>
</tr>
<tr>
  <td>张三</td>
  <td>18</td>
</tr>
<tr>
  <td>李四</td>
  <td>20</td>
</tr>
</table>
'''

The .children generator of a tag lets you loop over the tag's direct children:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
# 1. .children yields every direct child node: child tags as well as the strings between them
for child in soup.find("table").children:
    # 2. Print each child node
    print(child)
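The loop above also prints the newline strings that sit between the rows. One way to keep only the child tags is an isinstance check, sketched here on a smaller, made-up table:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Smaller stand-in table so the sketch is self-contained.
html = "<table>\n<tr><td>A</td></tr>\n<tr><td>B</td></tr>\n</table>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.table.children)
print(len(children))  # 5: two <tr> tags plus three newline strings

# Keep only the child tags and ignore the whitespace strings.
rows = [child for child in children if isinstance(child, Tag)]
print([row.name for row in rows])  # ['tr', 'tr']
```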

The output may contain many spaces or blank lines; .stripped_strings removes the extra whitespace:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
data=[]
# 1. .stripped_strings yields every string under the tag with surrounding whitespace removed; whitespace-only strings are skipped
for child in soup.find("table").stripped_strings:
    data.append(child)
print(data)

Output:

['方法', '说明', '姓名', '年龄', '张三', '18', '李四', '20']
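The flat list loses the row boundaries. Applying .stripped_strings to each tr separately keeps them, as in this sketch (the table content is an illustrative English stand-in):

```python
from bs4 import BeautifulSoup

# Illustrative table; the cell values are placeholders.
html = """
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Zhang San</td><td>18</td></tr>
<tr><td>Li Si</td><td>20</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One list of cell texts per table row.
rows = [list(tr.stripped_strings) for tr in soup.find_all("tr")]
print(rows)
# [['Name', 'Age'], ['Zhang San', '18'], ['Li Si', '20']]
```

A list of rows in this shape can be passed on to csv.writer or a similar tool.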

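03. Working with parent tags

The .parents attribute iterates over all of an element's ancestors. The example below uses .parents to walk from the tr tag up to the root of the document.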

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.tr
for parent in tag.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# table
# [document]
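The same walk works on a more deeply nested document. A small sketch with made-up markup, collecting the ancestor names into a list:

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup.
html = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents walks upward; the last ancestor is the BeautifulSoup object itself,
# whose name is "[document]".
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)  # ['p', 'body', 'html', '[document]']
```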

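Use the .parent attribute to get an element's parent node. In the "Alice" example document, the <head> tag is the parent of the <title> tag; the code below does the same for the tr tag of the table document:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.tr
# 1. Print the tr tag
print(tag)
# 2. Print the parent of the tr tag
print(tag.parent)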

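04. Sibling nodes

01. .next_sibling: the next sibling of the current node
02. .previous_sibling: the previous sibling of the current node
03. .next_siblings: iterates over the siblings that follow the current node
04. .previous_siblings: iterates over the siblings that precede the current node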

from bs4 import BeautifulSoup
html="<a><b>666</b><c>888</c></a>"
soup=BeautifulSoup(html,'html.parser')
print(soup.b.next_sibling)
print(soup.c.previous_sibling)

# <c>888</c>
# <b>666</b>
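The iterating variants, and a common pitfall with whitespace, can be sketched as follows. The markup is made up for illustration; find_next_sibling() is used here as one way to skip over whitespace strings.

```python
from bs4 import BeautifulSoup

html = "<a><b>666</b><c>888</c><d>999</d></a>"
soup = BeautifulSoup(html, "html.parser")

# .next_siblings / .previous_siblings yield every sibling in that direction.
following = [s.name for s in soup.b.next_siblings]
preceding = [s.name for s in soup.d.previous_siblings]
print(following)  # ['c', 'd']
print(preceding)  # ['c', 'b']

# With whitespace between tags, the nearest sibling is a string, not a tag.
spaced = BeautifulSoup("<p><b>1</b>\n<i>2</i></p>", "html.parser")
print(repr(spaced.b.next_sibling))  # '\n'

# find_next_sibling() with a tag name skips the string and returns the tag.
print(spaced.b.find_next_sibling("i"))  # <i>2</i>
```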

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#parent
