The BeautifulSoup Module

又逢乱世

Last modified: 2024-07-26 18:58:48

First published: 2024-07-26 17:15:21

Article link: https://blog.csdn.net/a1053765496/article/details/140717953

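What is BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. It offers a rich API for navigating, searching, and modifying the parse tree, and it is designed to turn irregular or incomplete HTML and XML into a structured tree that is easy to work with.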

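Key features

Handles messy HTML: BeautifulSoup copes with a lot of non-standard HTML and repairs common syntax errors.
Pythonic API: the interface follows Python conventions, so it is easy to learn and use.
Multiple parsers: it works with Python's built-in html.parser and with third-party parsers such as lxml (faster) and html5lib (more lenient, but slower).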

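Basic concepts

BeautifulSoup object: the container for the whole document; it can be created with any of the supported parsers.
Tag object: an HTML or XML tag in the document, with attributes and child nodes.
NavigableString object: the text inside a tag.
Comment object: an HTML comment.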
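
To make the four object types concrete, here is a minimal sketch. The one-line document and the variable names are made up for illustration, and it uses html.parser so that nothing beyond beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document, used only for illustration.
html = '<p class="note">Hello<!-- hidden remark --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                  # the <p> element
text, remark = tag.contents   # its two children: a text node and a comment

print(type(soup).__name__)    # BeautifulSoup
print(type(tag).__name__)     # Tag
print(type(text).__name__)    # NavigableString
print(type(remark).__name__)  # Comment

# Comment is a subclass of NavigableString, so check for Comment first
# when the two need to be told apart.
print(isinstance(remark, NavigableString))  # True
```

Because a Comment is also a NavigableString, code that walks the tree and collects text usually tests for Comment explicitly if comments should be skipped.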

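Choosing a parser

BeautifulSoup supports several parsers. The main ones are:

html.parser: the parser from Python's standard library. It needs no extra installation, has moderate speed, and is reasonably lenient. If no parser is named, BeautifulSoup picks the best one installed (lxml first, then html5lib, then html.parser), so it is safer to name the parser explicitly.
lxml: very fast, and it also provides an XML parser (selected with 'xml' or 'lxml-xml').
html5lib: builds the tree the same way a web browser does and produces valid HTML5. It is the most lenient parser, but also the slowest.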
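
The difference between parsers shows up on invalid markup. The sketch below feeds the same broken fragment to each parser and skips any parser that is not installed; the results dictionary is just an illustrative name. Only the html.parser result is annotated, since the other outputs depend on what is installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # invalid: a closing </p> with no opening <p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")

# html.parser simply drops the stray </p>:
print(results["html.parser"])  # <a></a>
```

Since the same input can yield different trees, naming the parser explicitly keeps a script's behavior the same from one machine to the next.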

Installing BeautifulSoup

First install BeautifulSoup together with a third-party parser, lxml or html5lib. (If you use html.parser, nothing extra needs to be installed.)

pip install beautifulsoup4 lxml

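Main attributes

soup: the BeautifulSoup object itself, representing the whole HTML or XML document.

soup.tag: the first matching tag; for example, soup.title returns the first <title> tag.

soup.tag.name: gets or sets the tag's name; for example, soup.title.name returns 'title'.

soup.tag.attrs: gets or sets the tag's attributes as a dictionary; for example, soup.p.attrs returns all attributes of the first <p> tag.

soup.tag.string: the tag's text, available when the tag has exactly one NavigableString child (or a single child tag that does); otherwise None.

soup.tag.contents: a list of the tag's direct children.

soup.tag.parent: the tag's parent node.

soup.tag.parents: an iterator over all of the tag's ancestors.

soup.tag.next_sibling: the next node at the same level (often a whitespace text node rather than a tag).

soup.tag.previous_sibling: the previous node at the same level.

soup.tag.next_element: the node parsed immediately after this one, which may be the tag's own first child.

soup.tag.previous_element: the node parsed immediately before this one.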
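
The navigation attributes above can be tried on a small fragment. This is a sketch on a made-up document written without whitespace between tags, so that siblings are tags rather than text nodes; it uses html.parser to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no whitespace between tags.
html = "<div><p id='first'>one</p><p id='second'>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.p
print(first.name)          # p
print(first.attrs)         # {'id': 'first'}
print(first.string)        # one
print(first.parent.name)   # div
print(first.next_sibling)  # <p id="second">two <b>bold</b></p>
print(first.next_element)  # one  (the text node inside the first <p>)

second = first.next_sibling
print(second.string)       # None: the tag has two children, not one
print(second.contents)     # ['two ', <b>bold</b>]
print(second.previous_sibling["id"])     # first
print([p.name for p in second.parents])  # ['div', '[document]']
```

Note the difference between next_sibling and next_element: the first stays on the same level of the tree, while the second follows parse order and descends into the tag.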

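Main methods

soup.find(name, attrs, recursive, text, **kwargs): returns the first tag that matches, or None if nothing matches. In Beautiful Soup 4.4.0 and later the text argument is called string.
soup.find_all(name, attrs, recursive, text, limit, **kwargs): returns a list of all matching tags; limit caps the number of results.
soup.select(selector): returns a list of the tags matched by a CSS selector.
soup.get_text(separator, strip): returns all the text below the node as a single string.
tag.decompose(): removes the tag and its children from the document and destroys them.
tag.replace_with(new_tag): replaces the tag with a new tag.
tag.insert_before(new_tag): inserts a new tag immediately before the tag.
tag.insert_after(new_tag): inserts a new tag immediately after the tag.
tag.clear(): removes all of the tag's children.
soup.prettify(): returns the document as a nicely indented string.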
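
The editing methods in this list are easiest to follow on a tiny document. The sketch below uses made-up markup and html.parser; variable names such as after_edits are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, written on one logical line to keep output compact.
html = (
    '<div id="box">'
    '<span class="ad">buy now</span>'
    '<p>Hello <b>world</b></p>'
    '<i>old</i>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# decompose(): remove the <span> and everything inside it.
soup.find("span", class_="ad").decompose()

# replace_with(): swap <i>old</i> for a new <em> tag.
em = soup.new_tag("em")
em.string = "new"
soup.i.replace_with(em)

# insert_before() / insert_after(): add neighbours around <p>.
h2 = soup.new_tag("h2")
h2.string = "Title"
soup.p.insert_before(h2)
small = soup.new_tag("small")
small.string = "end"
soup.p.insert_after(small)

after_edits = str(soup)
print(after_edits)
# <div id="box"><h2>Title</h2><p>Hello <b>world</b></p><small>end</small><em>new</em></div>

# get_text(): join every text node, stripped, with a separator.
text = soup.get_text(separator=" | ", strip=True)
print(text)  # Title | Hello | world | end | new

# clear(): drop all children of <p> but keep the tag itself.
soup.p.clear()
print(soup.p)  # <p></p>
```

New tags are created with soup.new_tag() so that they belong to the same document before being inserted.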

Using BeautifulSoup

Example: searching for tags

from bs4 import BeautifulSoup

html_data = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Example HTML Page</title>
</head>
<body>
    <header>
        <h1>Welcome to Example HTML Page</h1>
    </header>
    <nav>
        <ul>
            <li><a href="#home">Home</a></li>
            <li><a href="#about">About</a></li>
            <li><a href="#services">Services</a></li>
            <li><a href="#contact">Contact</a></li>
        </ul>
    </nav>
    <section id="home" class="c home">
        <h2>Home</h2>
        <span>
            <p>This is an example of a simple HTML page. You can use it as a template to create your own pages.</p>
        </span>
    </section>
    <section id="about" class="c_about">
        <h2>About</h2>
        <p>This page demonstrates the use of various HTML elements and CSS for styling.</p>
        <ul>
            <li>HTML for structure</li>
            <li>CSS for styling</li>
            <li>JavaScript for interactivity (not included in this example)</li>
        </ul>
    </section>
    <section id="services">
        <h2>Services</h2>
        <p>We offer a range of services to help you build and maintain your website.</p>
        <table>
            <thead>
                <tr>
                    <th>Service</th>
                    <th>Description</th>
                    <th>Price</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>Web Design</td>
                    <td>Creating beautiful and responsive designs</td>
                    <td>$500</td>
                </tr>
                <tr>
                    <td>Web Development</td>
                    <td>Building functional and dynamic websites</td>
                    <td>$1000</td>
                </tr>
                <tr>
                    <td>SEO</td>
                    <td>Optimizing your site for search engines</td>
                    <td>$300</td>
                </tr>
            </tbody>
        </table>
    </section>
    <section id="contact">
        <h2>Contact</h2>
        <form action="#" method="post">
            <label for="name">Name:</label>
            <input type="text" id="name" name="name" required><br><br>
            <label for="email">Email:</label>
            <input type="email" id="email" name="email" required><br><br>
            <label for="message">Message:</label><br>
            <textarea id="message" name="message" rows="4" cols="50" required></textarea><br><br>
            <input type="submit" value="Submit">
        </form>
    </section>
    <footer>
        <p class="pc">&copy; 2024 Example HTML Page. All rights reserved.</p>
    </footer>
</body>
</html>
"""

# Create the BeautifulSoup object (the 'lxml' parser must be installed)
soup = BeautifulSoup(html_data, 'lxml')

# Get the first matching tag
print(soup.title)  # <title>Example HTML Page</title>
print(soup.h1)  # <h1>Welcome to Example HTML Page</h1>

# Get the name of the first matching tag
print(soup.title.name)  # title
print(soup.h1.name)  # h1

# Get the text of the first matching tag
print(soup.title.string)  # Example HTML Page
print(soup.a.string)  # Home

# Get the attributes of the first matching tag
print(soup.section.attrs)  # {'id': 'home', 'class': ['c', 'home']}
print(soup.section.attrs["class"])  # ['c', 'home']
print(soup.section.attrs["class"][0])  # c
print(soup.a.attrs['href'])  # #home

# Get the direct children of the first <section> (tags and whitespace text nodes)
print(soup.section.contents)

# Get the parent of the first <a> tag
print(soup.a.parent)  # <li><a href="#home">Home</a></li>

# Get all ancestors of the first <a> tag
print(soup.a.parents)  # a generator object, not a list
a_parents = soup.a.parents
for parent in a_parents:
    print(parent.name)  # li ul nav body html [document] (one per line)

# Find the first tag with id="contact"
print(soup.find(id="contact"))
# The returned tag can itself be searched or navigated
tag1 = soup.find(id="contact")
print(tag1.h2.string)  # Contact

# Find a tag by name and attributes
print(soup.find("section", {"id": "about"}))

# Find all matching tags
print(soup.find_all('a'))  # [<a href="#home">Home</a>, <a href="#about">About</a>, <a href="#services">Services</a>, <a href="#contact">Contact</a>]
for link in soup.find_all('a'):
    print(link.get('href'), link.text)  # #home Home, #about About, #services Services, #contact Contact (one pair per line)

# Using CSS selectors
# Tag (type) selector
print(soup.select("footer"))  # [<footer> <p class="pc">© 2024 Example HTML Page. All rights reserved.</p> </footer>]
# Class selector
print(soup.select("p.pc"))  # [<p class="pc">© 2024 Example HTML Page. All rights reserved.</p>]
print(soup.select(".c_about"))
# ID selector
print(soup.select("#home"))
# Descendant selector
print(soup.select("section table thead"))


# Modify the document
tag = soup.new_tag("a", href="http://example.com")
tag.string = "Link text"
soup.body.append(tag)  # append the new tag as the last child of <body>
print(soup)  # the output now contains <a href="http://example.com">Link text</a> just before </body>
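
The walkthrough above does not exercise the limit, recursive, and string arguments of find_all() and find(). Here is a small sketch of them on a made-up menu; it also shows a regular expression as an attribute filter and select_one(), which returns only the first CSS match. It uses html.parser so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical menu markup, used only for illustration.
html = """
<ul id="menu">
  <li class="item active"><a href="/home">Home</a></li>
  <li class="item"><a href="/about">About</a></li>
  <li class="item"><a href="https://example.com/ext">External</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("ul")

# limit: stop after the first two matches.
first_two = soup.find_all("li", limit=2)
print(len(first_two))  # 2

# class_: matches a tag if any one of its classes equals the value.
active = soup.find_all("li", class_="active")
print(len(active))  # 1

# string: match on the tag's text.
about = soup.find("a", string="About")
print(about["href"])  # /about

# recursive=False: look at direct children only.
direct_items = menu.find_all("li", recursive=False)
direct_links = menu.find_all("a", recursive=False)
print(len(direct_items), len(direct_links))  # 3 0

# A compiled regular expression can filter on an attribute value.
external = soup.find_all("a", href=re.compile(r"^https?://"))
print([a.get_text() for a in external])  # ['External']

# select_one(): first match of a CSS selector, or None.
home_link = soup.select_one("#menu li.active > a")
print(home_link.get_text())  # Home
```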

Example: extracting link text and URLs from a website

import requests
from bs4 import BeautifulSoup

url = "https://www.baidu.com"
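response = requests.get(url, timeout=10)  # a timeout keeps the request from hanging indefinitely
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.content, 'lxml')

# Print the href and text of every link on the page
for link in soup.find_all('a'):
    print(link.get('href'), link.text)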

Example: extracting link text and URLs from a file

from bs4 import BeautifulSoup

# Read HTML or XML from a file
with open('baidu.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

# The tree is built while the file is open, so the loop can run after it is closed
for link in soup.find_all('a'):
    print(link.get('href'), link.text)
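
In both link examples, link.get('href') returns None for an <a> tag without an href, and many hrefs are relative. One possible refinement, sketched here on a made-up page with a hypothetical base_url, is to keep only tags that have an href and resolve each one with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/docs/"  # hypothetical page address
html = """
<a href="intro.html">Intro</a>
<a href="/about">About</a>
<a href="https://other.example/page">Elsewhere</a>
<a name="top">No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only <a> tags that actually carry an href attribute.
for a in soup.find_all("a", href=True):
    links.append((urljoin(base_url, a["href"]), a.get_text(strip=True)))

for url, text in links:
    print(url, text)
# https://example.com/docs/intro.html Intro
# https://example.com/about About
# https://other.example/page Elsewhere
```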