4.2. Common HTML Parsing Methods

Latest recommended article published on 2024-02-21 09:11:09

sty3318

Article link: https://blog.csdn.net/sty3318/article/details/136170941

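4.2.1. Common HTML Parsing Methods

The most commonly used HTML parsing methods in Python are the following:

4.2.1.1. Beautiful Soup

Beautiful Soup is a powerful, easy-to-use HTML parsing library for quickly extracting data from HTML documents. It works with several parsers, including html.parser from the Python standard library, lxml (for both HTML and XML), and html5lib, so you can pick whichever suits your needs.

Beautiful Soup provides a concise API for locating and extracting elements and content by tag name, attribute, CSS selector, and so on.

Beautiful Soup is not part of the Python standard library and must be installed separately, for example with pip install beautifulsoup4. Be sure to install the beautifulsoup4 package (Beautiful Soup 4), not an older release.

Example: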

from bs4 import BeautifulSoup
import requests

# Send an HTTP request to fetch the HTML content
response = requests.get('http://www.baidu.com')
html_content = response.content

# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the page title
title = soup.title.string
print(title)

# Locate an element and extract its text
element = soup.find('div', class_='head_wrapper')
if element is not None:
    print(element.text)
else:
    print("Element is None")

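If response.content is replaced with response.text, the output is sometimes garbled. response.text is decoded by requests using an encoding guessed mainly from the HTTP headers, and that guess can be wrong; response.content is the raw bytes, which lets Beautiful Soup detect the encoding itself.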
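
The difference between bytes and wrongly decoded text can be reproduced offline. The following is a minimal sketch, assuming a hypothetical UTF-8 page that declares its own charset; the wrong guess is simulated by decoding the bytes as ISO-8859-1.

```python
from bs4 import BeautifulSoup

# Hypothetical page: UTF-8 bytes that declare their own encoding.
raw = ('<html><head><meta charset="utf-8">'
       '<title>百度一下</title></head></html>').encode('utf-8')

# Passing bytes lets Beautiful Soup work out the encoding itself.
soup_from_bytes = BeautifulSoup(raw, 'html.parser')
print(soup_from_bytes.title.string)

# Simulate a wrong guess: decode the same bytes as ISO-8859-1 first.
wrong_text = raw.decode('iso-8859-1')
soup_from_text = BeautifulSoup(wrong_text, 'html.parser')
print(soup_from_text.title.string)  # garbled characters
```

With requests, an alternative is to set response.encoding explicitly before reading response.text when the correct encoding is known.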

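Besides passing a string or bytes as above, a BeautifulSoup object can also be created from a file.

Suppose the HTML has been saved to index.html; the object is then created as follows:

with open('index.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

To see the contents of the soup object as a formatted (indented) string, call prettify() and print the result:

print(soup.prettify())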
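
To try the file-based approach without an existing index.html, here is a small sketch that writes a hypothetical page to a temporary directory, parses it from the open file handle, and prints the prettified result. The file contents and variable names are illustrative only.

```python
import os
import tempfile
from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'index.html')

    # Write a small hypothetical page to disk.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<html><body><p>Hello</p></body></html>')

    # Parse from the open file handle, naming the parser explicitly.
    with open(path, encoding='utf-8') as f:
        file_soup = BeautifulSoup(f, 'html.parser')

# prettify() returns a string; it does not print anything by itself.
pretty = file_soup.prettify()
print(pretty)
```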

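4.2.1.1.1. Main features and usage of Beautiful Soup

Parser selection: Beautiful Soup can use different parsers for HTML or XML documents, including html.parser from the Python standard library, lxml, and html5lib. If no parser is specified, Beautiful Soup picks the best one installed (preferring lxml, then html5lib, then html.parser) and emits a warning, so it is best to name the parser explicitly as the second argument when creating the BeautifulSoup object.
Object-based document tree: Beautiful Soup parses an HTML or XML document into a tree of objects, so elements, tags, and content can be traversed and manipulated like ordinary Python objects, which makes it easy to extract the information you need.
Extracting tags, attributes, and content: elements can be located by tag name, class name, id, and other attributes. Methods such as find(), find_all(), and select() return the elements that match the given conditions, from which the tag, its attributes, and its text can be read.
CSS selectors: Beautiful Soup supports CSS selectors. Passing a selector string to select() allows more flexible and precise element matching.
Handling imperfect input: Beautiful Soup tolerates malformed markup, such as incomplete HTML documents or unclosed tags, which makes parsing real-world pages more robust.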
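
The features listed above can be sketched in a few lines. The markup below is a made-up fragment in which the <b> and second <p> tags are never closed, to show that html.parser-backed Beautiful Soup still builds a usable tree.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: <b> and the second <p> are never closed.
broken_html = ('<div id="main"><p class="intro">First</p>'
               '<p class="intro">Second <b>bold</div>')
feature_soup = BeautifulSoup(broken_html, 'html.parser')  # explicit parser

main = feature_soup.find('div', id='main')           # first match or None
intros = feature_soup.find_all('p', class_='intro')  # list of all matches
bold = feature_soup.select('p.intro b')              # CSS selector

print(main['id'])
print([p.get_text() for p in intros])
print(bold[0].string)
```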

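In short, Beautiful Soup is a simple yet powerful tool for parsing HTML or XML documents and extracting data from them. Whether you are writing a web crawler, scraping data, or analyzing page content, it is a very practical library.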

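4.2.1.1.2. Beautiful Soup object types

Beautiful Soup has four main object types:

1. Tag

A Tag corresponds to a tag in the HTML or XML document, such as <div> or <a>. Its name is available as tag.name, and its attributes can be read like dictionary entries, for example tag['href'].

Example: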

from bs4 import BeautifulSoup

html = '<p class="content">This is a paragraph.</p>'
soup = BeautifulSoup(html, 'html.parser')
paragraph = soup.p

print(paragraph.name)  # p
print(paragraph['class'])  # ['content']
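
Beyond name and indexing, a Tag exposes its attributes in a few other ways. The sketch below uses a made-up <a> element to show .attrs, the list returned for the multi-valued class attribute, the safer .get() lookup, and in-place modification.

```python
from bs4 import BeautifulSoup

tag_soup = BeautifulSoup('<a class="btn primary" href="/home">Home</a>',
                         'html.parser')
link = tag_soup.a

print(link.attrs)         # all attributes as a dict
print(link['class'])      # class is multi-valued, so a list comes back
print(link.get('title'))  # .get() returns None instead of raising KeyError

# Attributes can be modified like dictionary entries.
link['href'] = '/index'
print(link)
```

Using .get() is usually the safer choice when an attribute may be absent, since link['title'] would raise KeyError here.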

2. NavigableString

A NavigableString represents the text inside a tag. It behaves much like a Python string (it is a subclass of str) and additionally supports navigating the tree.

Example:

from bs4 import BeautifulSoup

html = '<p>This is a paragraph.</p>'
soup = BeautifulSoup(html, 'html.parser')
text = soup.p.string

print(type(text))  # <class 'bs4.element.NavigableString'>
print(text)
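
One common surprise is that .string only works when a tag has exactly one child. A small sketch with illustrative markup:

```python
from bs4 import BeautifulSoup

ns_soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

# <p> has two children (a string and a <b> tag), so .string is None.
print(ns_soup.p.string)

# <b> has exactly one string child, so .string returns it.
bold_text = ns_soup.b.string
print(bold_text.parent.name)  # a NavigableString knows its place in the tree

# get_text() joins all nested strings; str() yields a plain Python string.
print(ns_soup.p.get_text())
plain = str(bold_text)
print(type(plain))
```

Converting with str() is useful when the text should outlive the parse tree, because a NavigableString keeps a reference to its parent.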

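3. BeautifulSoup

The BeautifulSoup object represents the whole document. When parsing, Beautiful Soup turns the document into a tree of Python objects (Tag, NavigableString, Comment, and so on). The BeautifulSoup object itself is not the document's top-level tag; it is a container for the entire document that can mostly be used like a Tag.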

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
  <title>Beautiful Soup Demo</title>
</head>
<body>
  <div id="first-div">
    <h1>This is the first div</h1>
    <p>Some text here</p>
  </div>
  <div id="second-div">
    <h2>This is the second div</h2>
    <p>Some more text here</p>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Print the whole document
print(soup)

# Get the document's <title> tag
title = soup.title
print(title)

# Get the text of the second <div>
second_div = soup.find_all('div')[1]
content = second_div.get_text()

print(content)
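
The container role of the BeautifulSoup object can be checked directly. A short sketch with illustrative markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc_soup = BeautifulSoup('<html><body><p>Hi</p></body></html>', 'html.parser')

print(doc_soup.name)              # special name "[document]"
print(isinstance(doc_soup, Tag))  # BeautifulSoup is a subclass of Tag
print(doc_soup.parent)            # nothing sits above the document
print(doc_soup.html.parent is doc_soup)  # <html> hangs off the container
```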

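4. Comment

A Comment represents a comment in the document, for example <!-- This is a comment --> in HTML. Comment is a special kind of NavigableString. Because a comment is not a tag, searching by tag name will not find it, but it can be found by filtering strings with isinstance.

Example: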

from bs4 import BeautifulSoup, Comment

html_doc = """
<html>
<head>
  <title>Beautiful Soup Demo</title>
</head>
<body>
  <div id="first-div">
    <h1>This is the first div</h1>
    <p>Some text here</p>
  </div>
  <!-- This is a comment -->
  <div id="second-div">
    <h2>This is the second div</h2>
    <p>Some more text here</p>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first comment in the document
comment = soup.find(string=lambda text: isinstance(text, Comment))
print(comment)
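
A frequent follow-up is stripping comments before further processing. The sketch below, using a made-up fragment, collects every Comment with find_all() and removes each one from the tree with extract().

```python
from bs4 import BeautifulSoup, Comment

c_soup = BeautifulSoup(
    '<div><!-- first --><p>Text</p><!-- second --></div>', 'html.parser')

# Collect all comments, not just the first one.
comments = c_soup.find_all(string=lambda s: isinstance(s, Comment))
print([str(c).strip() for c in comments])

# extract() detaches a node from the tree.
for c in comments:
    c.extract()
print(c_soup)
```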

4.2.1.1.3. Traversal

Beautiful Soup converts the HTML into a tree structure, which can then be navigated.
Example:

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
  <title>Beautiful Soup Demo</title>
</head>
<body>
  <div id="first-div">
    <h1>This is the first div</h1>
    <p>Some text here</p>
  </div>
  <!-- This is a comment -->
  <div id="second-div">
    <h2>This is the second div</h2>
    <p>Some more text here</p>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
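# 1. .contents: list of a tag's direct children
print(soup.body.div.h1.contents)

# 2. .children: iterate over direct children (including whitespace text nodes)
for child in soup.body.children:
    print(child)

# 3. .descendants: recursively iterate over all children, grandchildren, ...
for child in soup.body.descendants:
    print(child)

# 4. .strings: iterate over every text node inside the first div
for st in soup.body.div.strings:
    print(repr(st))

# 5. .parent: the enclosing tag (here the first div)
print(soup.body.div.h1.parent)

# 6. .next_sibling: the first hop is the whitespace text node after <h1>,
#    the second hop reaches the <p>
divs = soup.find_all('div')
print(divs[0].h1.next_sibling.next_sibling)

# 7. .previous_sibling: the same idea, walking backwards from <p> to <h1>
print(divs[0].p.previous_sibling.previous_sibling)

# 8. .next_siblings: iterate over everything after the first div in <body>
for e in soup.body.div.next_siblings:
    print(repr(e))

# 9. .next_element: the next node in parse order; output: This is the first div
print(soup.body.div.h1.next_element)

# 10. .contents of the first div, with whitespace made visible by repr()
for node in soup.body.div.contents:
    print(repr(node))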
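
The doubled next_sibling calls in the traversal example are needed because whitespace between tags becomes its own text node. A short sketch, with illustrative markup, of two ways to skip those nodes:

```python
from bs4 import BeautifulSoup

nav_soup = BeautifulSoup('<div>\n<h1>Title</h1>\n<p>Body</p>\n</div>',
                         'html.parser')
heading = nav_soup.h1

print(repr(heading.next_sibling))   # the newline text node
print(heading.find_next_sibling())  # skips text nodes, returns the <p> tag

# stripped_strings yields text with whitespace-only strings dropped.
print(list(nav_soup.div.stripped_strings))
```

find_next_sibling() and find_previous_sibling() are generally more robust than chaining .next_sibling, since they do not depend on how the source happens to be indented.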

Example: limiting the search with arguments

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
  <title>Beautiful Soup Demo</title>
</head>
<body>
  <div id="first-div">
    <h1>This is the first div</h1>
    <p class="content">Some text here</p>
  </div>
  <div id="second-div">
    <h2>This is the second div</h2>
    <p class="content">Some more text here</p>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Use keyword arguments to find <p> elements with class "content"; limit=1 returns at most one match
paragraphs = soup.find_all('p', class_='content', limit=1)

for paragraph in paragraphs:
    print(paragraph.text)
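
Besides class_ and limit, find_all() accepts a few other arguments that narrow a search. The following sketch uses a made-up fragment to show recursive=False (direct children only) and attrs (useful for attribute names such as data-id that are not valid Python keyword names).

```python
from bs4 import BeautifulSoup

html = ('<body><p class="content" data-id="1">A</p>'
        '<div><p class="content">B</p><p>C</p></div></body>')
fa_soup = BeautifulSoup(html, 'html.parser')

all_p = fa_soup.find_all('p')                            # every <p>
direct_p = fa_soup.body.find_all('p', recursive=False)   # direct children only
by_attr = fa_soup.find_all('p', attrs={'data-id': '1'})  # arbitrary attribute
first_two = fa_soup.find_all('p', limit=2)               # stop after two

print([p.text for p in all_p])
print([p.text for p in direct_p])
print([p.text for p in by_attr])
print([p.text for p in first_two])
```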

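4.2.1.1.4. select()

Beautiful Soup's CSS selector support is described below, followed by an example.

Basic syntax:
- Use the select() method to select the elements that match a CSS selector.
- A CSS selector can locate elements by tag name, class name, ID, or other attributes.
Common CSS selectors:
- Tag selector: selects elements by tag name, for example div, p, a.
- Class selector: selects elements by class name, starting with a dot (.), for example .class-name.
- ID selector: selects elements by ID, starting with a hash sign (#), for example #id-name.
- Attribute selector: selects elements by attribute, for example [attribute=value].
- Combined selectors: several selectors can be combined, for example tag.class or tag#id.

Example: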

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
  <title>Beautiful Soup Demo</title>
</head>
<body>
  <div id="first-div">
    <h1>This is the first div</h1>
    <p class="content">Some text here</p>
  </div>
  <div class="second-div">
    <h2>This is the second div</h2>
    <p class="content">Some more text here</p>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Select all <div> elements
divs = soup.select('div')
for div in divs:
    print(div.text)

# Select <p> elements with class="content"
paragraphs = soup.select('p.content')
for p in paragraphs:
    print(p.text)

# Select the <div> element with id="first-div"
first_div = soup.select('div#first-div')
print(first_div[0].text)
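
The attribute selector and combinators from the list above are not exercised by that example, so here is a small additional sketch. The markup and URLs are made up; select_one() returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

html = '''
<div id="nav">
  <a href="https://example.com/a" class="ext">A</a>
  <p><a href="/local">B</a></p>
</div>
'''
sel_soup = BeautifulSoup(html, 'html.parser')

https_links = sel_soup.select('a[href^="https"]')  # href starts with "https"
child_links = sel_soup.select('#nav > a')          # direct children only
nested_links = sel_soup.select('#nav a')           # any depth
first_ext = sel_soup.select_one('a.ext')           # first match
no_match = sel_soup.select_one('a.missing')        # None when nothing matches

print([a.text for a in https_links])
print([a.text for a in child_links])
print([a.text for a in nested_links])
print(first_ext['href'], no_match)
```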

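4.2.1.2. lxml

lxml is a high-performance library for parsing XML and HTML. It offers XPath queries and CSS selectors (the latter through the separate cssselect package) for locating and extracting content. Because it is fast, it is well suited to large documents and complex parsing tasks.

Example: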

from lxml import etree
import requests

# Send an HTTP request to fetch the HTML content
response = requests.get('http://www.baidu.com')
html_content = response.content

print(html_content)

# Parse the HTML with lxml
tree = etree.HTML(html_content)

# Extract the title; xpath() returns a list, which is empty when nothing matches
titles = tree.xpath('//title/text()')
print(titles[0] if titles else "title not found")

# Locate an element and extract its text
contents = tree.xpath('//a[@href="http://map.baidu.com"]/text()')
if contents:
    print(contents[0])
else:
    print("content not found")
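
Since the live page may change, the same XPath ideas can be tried offline. A minimal sketch on made-up markup, showing text() and @attribute extraction and the empty list returned when nothing matches:

```python
from lxml import etree

html = '''
<html><body>
  <div id="links">
    <a href="/a">First</a>
    <a href="/b">Second</a>
  </div>
</body></html>
'''
xp_tree = etree.HTML(html)

link_texts = xp_tree.xpath('//div[@id="links"]/a/text()')  # text nodes
link_hrefs = xp_tree.xpath('//div[@id="links"]/a/@href')   # attribute values
no_spans = xp_tree.xpath('//span/text()')                  # no match

print(link_texts)
print(link_hrefs)
print(no_spans)
```

Because xpath() returns a list, checking it for emptiness before indexing avoids an IndexError on pages that lack the expected element.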

Example: serializing the parsed tree back to a string

from lxml import etree

html_doc = """
<html>
<head>
  <title>Beautiful Soup Demo</title>
</head>
<body>
  <div id="first-div">
    <h1>This is the first div</h1>
    <p>Some text here</p>
  </div>
  <!-- This is a comment -->
  <div id="second-div">
    <h2>This is the second div</h2>
    <p>Some more text here</p>
  </div>
</body>
</html>
"""

html = etree.HTML(html_doc)
result = etree.tostring(html).decode('utf-8')
print(result)
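
etree.tostring() returns bytes by default, which is why the example calls decode(). As a hedged alternative, the sketch below asks for a str directly with encoding='unicode' and indents the output with pretty_print=True; the input fragment is illustrative.

```python
from lxml import etree

ser_tree = etree.HTML('<div><p>Hello</p><p>World</p></div>')

# Default: bytes.
as_bytes = etree.tostring(ser_tree)

# str output, HTML serialization rules, indented.
as_text = etree.tostring(ser_tree, encoding='unicode', method='html',
                         pretty_print=True)
print(as_text)
```

Note that etree.HTML() wraps a fragment in <html> and <body> elements, so they appear in the serialized output even though the input did not contain them.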

4.2.1.3. html.parser

The html.parser module in the Python standard library provides a simple HTML parser for basic parsing tasks. It is less capable than Beautiful Soup or lxml, but it is sufficient for simple jobs.

Example:

from html.parser import HTMLParser
import requests

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

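# Send an HTTP request to fetch the HTML content
response = requests.get('http://www.baidu.com')
html_content = response.text

# Parse the HTML with HTMLParser
parser = MyHTMLParser()
parser.feed(html_content)

Here response.text must not be changed to response.content: HTMLParser.feed() expects a str, so passing bytes raises a TypeError.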
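
HTMLParser is event-driven, so extracting data means keeping state between callbacks. The following is a minimal offline sketch, assuming a hypothetical LinkCollector subclass (not part of the standard library) that records each link's href together with its text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> being read, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == 'a':
            self._current_href = dict(attrs).get('href')
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current_href is not None:
            text = ''.join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


collector = LinkCollector()
collector.feed('<p><a href="/a">First</a> and '
               '<a href="/b">Second <b>link</b></a></p>')
print(collector.links)
```

This manual bookkeeping is the main reason Beautiful Soup or lxml is usually more convenient once the extraction logic grows beyond a few tags.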
