How to Use the BeautifulSoup Library to Parse HTML and XML Documents

Author: python资深爱好者

Published 2024-07-16 20:12:55

Category: python. Tags: beautifulsoup, html, xml

Article link: https://blog.csdn.net/2402_84885073/article/details/140475667

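BeautifulSoup is a Python library for extracting data from HTML and XML files. It builds a parse tree from the document, which you can then traverse and search. The basic steps for parsing HTML and XML documents with BeautifulSoup are as follows.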

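First, install BeautifulSoup together with a parser. lxml is a fast HTML and XML parsing library that is often paired with BeautifulSoup. Other options are html.parser, which ships with Python and needs no installation, and html5lib, which is installed separately.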

	pip install beautifulsoup4 lxml

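In your Python script, import the BeautifulSoup class along with whatever you use to load the HTML or XML document: for example, requests to fetch it over the network, or the built-in open function to read it from a file.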

	from bs4 import BeautifulSoup

	# Load the document from a file
	with open('example.html', 'r', encoding='utf-8') as file:
	    html_doc = file.read()

	# Load the document from the network
	# import requests
	# response = requests.get('http://example.com')
	# html_doc = response.text

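Create a BeautifulSoup object from the HTML or XML document obtained above. You should specify a parser, such as lxml, html.parser, or html5lib. For XML documents, pass 'xml' instead, which uses lxml's XML parser.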

	soup = BeautifulSoup(html_doc, 'lxml')
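As a minimal sketch of this step, the snippet below builds a soup from an inline string rather than a file. The one-line document is made up for illustration, and it uses html.parser only because that parser needs no extra install; 'lxml' can be substituted if it is available.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
html_doc = "<html><body><p>Hello, <b>world</b></p></body></html>"

# 'html.parser' ships with Python; swap in 'lxml' or 'html5lib'
# if those packages are installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.p.b.string)      # world
```

Different parsers can build slightly different trees from malformed markup, so it is worth naming the parser explicitly instead of relying on the default.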

You can now use the BeautifulSoup object to search the HTML/XML document. Elements can be located by tag name, class name, ID, or any other attribute.

	tags = soup.find_all('a')  # find all <a> tags

	tags = soup.find_all(class_='classname')  # note the parameter is class_, because class is a Python keyword

	tag = soup.find(id='unique-id')  # find the tag whose ID is 'unique-id'

	tags = soup.find_all(attrs={"data-custom": "value"})  # find tags that have a specific attribute value
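The four searches above can be tried against a small document. This is only a sketch: the markup, class name, and attribute values are invented for the example, and html.parser is used so that nothing beyond beautifulsoup4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup chosen so that each search below matches something.
html_doc = """
<div id="unique-id">
  <a class="classname" href="/one" data-custom="value">One</a>
  <a href="/two">Two</a>
  <span class="classname">Three</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")                           # by tag name
by_class = soup.find_all(class_="classname")             # by class
by_id = soup.find(id="unique-id")                        # by ID
by_attr = soup.find_all(attrs={"data-custom": "value"})  # by attribute
missing = soup.find(id="no-such-id")                     # no match

print(len(all_links))              # 2
print([t.name for t in by_class])  # ['a', 'span']
print(by_id.name)                  # div
print(len(by_attr))                # 1
print(missing)                     # None
```

Note that find_all returns a list-like collection (empty when nothing matches), while find returns a single tag or None, so the result of find should be checked before it is used.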

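You can read the text inside a tag through its .string or .text attribute, and read a tag's attributes with the .get() method. The two text attributes differ: .string is None when the tag has more than one child, whereas .text concatenates all the text inside the tag.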

	for tag in soup.find_all('a'):  # iterate over every <a> tag
	    print(tag.text)         # print the text inside the tag
	    print(tag.get('href'))  # print the value of the href attribute (None if absent)
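To make the difference between these accessors concrete, here is a small sketch on an invented snippet in which one link has nested markup and the other has no href. The document and the "#" fallback value are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <a> has mixed children, the second has no href.
html_doc = (
    '<p>Visit <a href="http://example.com">the <b>example</b> site</a> '
    "or <a>nowhere</a></p>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("a")

print(first.string)             # None, because <a> has several children
print(first.text)               # the example site
print(first.get("href"))        # http://example.com

print(second.string)            # nowhere
print(second.get("href"))       # None, the attribute is missing
print(second.get("href", "#"))  # '#', an explicit fallback value
```

Using .get() rather than tag['href'] avoids a KeyError on tags that lack the attribute.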

A BeautifulSoup object is a tree, and you can traverse the document with attributes such as .children, .descendants, .parent, .next_sibling, and .previous_sibling.
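A short sketch of these navigation attributes follows, using an invented list. The markup is deliberately written without whitespace between tags, because whitespace would show up as extra text nodes among the children and siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical list with no whitespace between tags, so every child is a tag.
html_doc = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")

ul = soup.ul
child_texts = [str(li.string) for li in ul.children]  # direct children only
descendant_count = len(list(ul.descendants))          # tags and text nodes, all levels

second = ul.find_all("li")[1]

print(child_texts)                     # ['one', 'two', 'three']
print(descendant_count)                # 6 (three <li> tags plus three text nodes)
print(second.parent.name)              # ul
print(second.previous_sibling.string)  # one
print(second.next_sibling.string)      # three
```

.children and .descendants are iterators, which is why the sketch wraps them in a list comprehension or list() before counting or printing.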

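Although BeautifulSoup is mainly used for parsing and extracting data, you can also use it to modify HTML/XML documents: you can add, change, or remove tags and their contents.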
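The following is a minimal sketch of the three kinds of modification mentioned here, applied to an invented fragment. The tag names and attribute values are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment to be edited in place.
html_doc = '<p class="title">Old <b>text</b></p><a href="/old" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# Modify: change an attribute and replace the tag's text.
link = soup.find("a")
link["href"] = "/new"
link.string = "Renamed"

# Add: create a new tag and append it inside <p>.
new_tag = soup.new_tag("span")
new_tag.string = "added"
soup.p.append(new_tag)

# Remove: delete the <b> tag together with its contents.
soup.b.decompose()

print(soup)  # the serialized document after the edits
```

The edits change the tree in memory only; to persist them, write str(soup) back to a file.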

The following complete example ties the earlier steps together: it parses a small HTML document and prints the href attribute of every <a> tag.

	from bs4 import BeautifulSoup

	html_doc = """
	<html><head><title>The Dormouse's story</title></head>
	<body>
	<p class="title"><b>The Dormouse's story</b></p>
	<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
	<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
	<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
	</body>
	</html>
	"""

	soup = BeautifulSoup(html_doc, 'lxml')

	# Find all <a> tags and print their href attributes
	for link in soup.find_all('a'):
	    print(link.get('href'))