046-Python Web Scraping - BeautifulSoup

Originally published on 2025-02-02 00:00:00

License: CC 4.0 BY-SA

Tags: #python #web-scraping #beautifulsoup

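BeautifulSoup is a Python library for extracting data from HTML and XML documents. Its simple API makes it quick to parse a page and pull out the data you need. It is typically used to scrape static pages, paired with requests or another HTTP library that fetches the content.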

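1. Installing BeautifulSoup

Before using BeautifulSoup, make sure the following two pieces are available:

BeautifulSoup4: the library itself, which parses HTML and XML.
A parser, either lxml or html.parser: lxml is recommended and must be installed separately; html.parser ships with Python and needs no installation.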

pip install beautifulsoup4
pip install lxml

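2. Quick Start

The basic workflow for extracting page content with BeautifulSoup is:

Fetch the page's HTML with an HTTP library such as requests.
Parse the HTML with BeautifulSoup.
Locate and extract content with the methods BeautifulSoup provides.

Example: extracting the page title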

import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "https://www.python.org"
response = requests.get(url, timeout=10)

# Parse the HTML
soup = BeautifulSoup(response.text, "lxml")

# Extract the title
title = soup.title.string
print("Page title:", title)
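
The snippet above assumes that the page actually has a `<title>` element. A slightly more defensive version might look like the sketch below. The helper name `extract_title` and the inline sample documents are illustrative only, and `html.parser` is used here just so the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the stripped <title> text, or None if the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


# Inline sample documents stand in for response.text
print(extract_title("<html><head><title> Welcome </title></head></html>"))  # Welcome
print(extract_title("<html><body><p>No title here</p></body></html>"))      # None
```

In a real crawler the same helper could be called with `response.text` after `response.raise_for_status()` has confirmed the request succeeded.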

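3. Core BeautifulSoup Features

BeautifulSoup provides a rich API for working with HTML documents.

3.1 Creating a BeautifulSoup Object

BeautifulSoup supports several parsers:

html.parser: built into Python; slower than lxml, with reasonable tolerance for messy markup.
lxml: recommended; fast, but must be installed separately.
html5lib: slow, but the most lenient; it can cope with badly malformed HTML.

Example: parsing HTML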

from bs4 import BeautifulSoup

html = "<html><head><title>Python Web Scraping</title></head><body><h1>Hello, World!</h1></body></html>"
soup = BeautifulSoup(html, "lxml")

print(soup.prettify())  # Pretty-print the HTML
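
Because lxml and html5lib are optional installs, it can help to check which parsers are usable before picking one. The sketch below is one way to do that; the helper name `available_parsers` is hypothetical, and it relies on BeautifulSoup raising `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the parser names that BeautifulSoup can actually use here."""
    found = []
    for name in candidates:
        try:
            BeautifulSoup("<p>test</p>", name)
        except FeatureNotFound:
            continue  # the parser library is not installed
        found.append(name)
    return found


print(available_parsers())
```

The built-in `html.parser` should always appear in the result; the other two appear only if their packages are installed.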

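3.2 Common Search Methods

BeautifulSoup offers several ways to find HTML elements.

Method	Description
`soup.tag_name`	Returns the first tag with that name; for example, `soup.title` returns the `<title>` tag.
`soup.find(name, attrs)`	Returns the first matching element, or `None` if nothing matches.
`soup.find_all(name, attrs)`	Returns all matching elements as a list.
`soup.select(css_selector)`	Finds elements with a CSS selector and returns a list.

Example: finding HTML elements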

from bs4 import BeautifulSoup

html = """
<html>
    <head><title>Python Web Scraping</title></head>
    <body>
        <h1>Heading</h1>
        <p class="content">Paragraph 1</p>
        <p class="content">Paragraph 2</p>
        <a href="https://www.python.org" id="link">Python official site</a>
    </body>
</html>
"""
soup = BeautifulSoup(html, "lxml")

# Get the title
print(soup.title.string)  # Output: Python Web Scraping

# Find the first paragraph
print(soup.find("p").string)  # Output: Paragraph 1

# Find all paragraphs
paragraphs = soup.find_all("p")
for p in paragraphs:
    print(p.string)  # Output: Paragraph 1, Paragraph 2

# Find the link with a CSS selector
link = soup.select_one("#link")
print(link["href"])  # Output: https://www.python.org

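3.3 Extracting Attributes and Text

tag.string: the tag's text when it contains a single string; None if the tag has several children (use tag.get_text() to collect all nested text).
tag.attrs: a dictionary of the tag's attributes.
tag['attr_name']: the value of one attribute.

Example: extracting attributes and text

# Extract text content
print(soup.h1.string)  # Output: Heading

# Extract an attribute
link = soup.find("a")
print(link["href"])  # Output: https://www.python.org

# Get all attributes
print(link.attrs)  # Output: {'href': 'https://www.python.org', 'id': 'link'}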
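
The difference between `.string` and `get_text()`, and between `tag["attr"]` and `tag.get("attr")`, tends to matter once the markup is less tidy than the sample above. A small sketch, using made-up inline HTML and `html.parser` so it runs without extra installs:

```python
from bs4 import BeautifulSoup

html = '<p>Hello <b>world</b>!</p><a class="btn primary" href="/docs">Docs</a>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.string)      # None: <p> has more than one child node
print(p.get_text())  # Hello world!

a = soup.a
print(a["href"])       # /docs
print(a.get("title"))  # None: .get() does not raise for a missing attribute
print(a["class"])      # ['btn', 'primary']: class is a multi-valued attribute
```

Using `a["title"]` here would raise `KeyError`, so `.get()` is usually the safer choice when an attribute may be absent.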

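3.4 Traversing HTML Nodes

BeautifulSoup offers several ways to walk the HTML tree.

Attribute / method	Description
`.contents`	List of child nodes.
`.children`	Iterator over child nodes.
`.parent`	Parent node.
`.find_next_sibling()`	Next sibling element.
`.find_previous_sibling()`	Previous sibling element.

Example: traversing HTML nodes

# Iterate over child nodes (whitespace between tags appears as text nodes)
body = soup.body
for child in body.children:
    print(child)

# Get the parent node
print(soup.h1.parent.name)  # Output: body

# Get a sibling node
print(soup.p.find_next_sibling("p").string)  # Output: Paragraph 2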
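
One detail that often surprises people when traversing: whitespace between tags is itself a node. The sketch below, on made-up inline HTML parsed with `html.parser`, contrasts the `.next_sibling` attribute with the `find_next_sibling()` method and shows one way to keep only element children.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<body>\n<h1>Heading</h1>\n<p>First</p>\n<p>Second</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

h1 = soup.h1
# .next_sibling returns the very next node, here the newline text node
print(isinstance(h1.next_sibling, NavigableString))  # True
# find_next_sibling() skips text nodes and returns the next Tag
print(h1.find_next_sibling().get_text())             # First

# Keep only element children of <body>, dropping whitespace strings
element_children = [c.name for c in soup.body.children if isinstance(c, Tag)]
print(element_children)                              # ['h1', 'p', 'p']
```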

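4. In Practice: Scraping Web Data

Below are some common scraping tasks and how to implement them.

4.1 Scraping News Titles and Links

Example: scraping article titles and links from the CSDN home page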

import requests
from bs4 import BeautifulSoup

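url = "https://www.csdn.net/"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "lxml")

# Find article titles and links
# (the "title" class is illustrative; inspect the live page for the real selector)
articles = soup.find_all("a", class_="title")
for article in articles:
    title = article.get_text(strip=True)  # .string would be None for tags with nested children
    link = article.get("href")
    print(f"Title: {title}, Link: {link}")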
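
Links on real pages are often relative, and some anchors have no `href` at all. The sketch below shows one way to normalize them with `urllib.parse.urljoin`. The markup is hypothetical and stands in for `response.text`; it is not the actual structure of the CSDN home page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.csdn.net/"  # page the HTML was fetched from

# Hypothetical markup standing in for response.text
html = """
<a class="title" href="/article/1"><span>First</span> post</a>
<a class="title" href="https://example.com/article/2">Second post</a>
<a class="title">No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

articles = []
for a in soup.find_all("a", class_="title"):
    href = a.get("href")
    if href is None:
        continue  # skip anchors without a target
    articles.append((a.get_text(" ", strip=True), urljoin(BASE_URL, href)))

for title, link in articles:
    print(f"Title: {title}, Link: {link}")
```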

4.2 Scraping Table Data

Example: scraping the contents of an HTML table

html = """
<table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>Zhang San</td><td>25</td></tr>
    <tr><td>Li Si</td><td>30</td></tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")

# Extract the table data row by row
rows = soup.find_all("tr")
for row in rows:
    cells = row.find_all(["th", "td"])
    print([cell.string for cell in cells])
4.3 Handling Pagination

Some sites spread their data across multiple pages; loop over the pages to collect everything.

Example: scraping paginated data

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/page="
for page in range(1, 6):  # Assume there are 5 pages
    url = f"{base_url}{page}"
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")

    # Scrape the data on the current page
    items = soup.find_all("div", class_="item")
    for item in items:
        print(item.text.strip())
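
The loop above hard-codes the number of pages. When the page count is unknown, one common approach is to keep going until a page comes back with no items. The sketch below illustrates that idea; `crawl_pages`, the `fetch_html` callable, and the fake two-page site are all hypothetical, chosen so the example runs without network access.

```python
from bs4 import BeautifulSoup


def crawl_pages(fetch_html, max_pages=50):
    """Collect item texts page by page until a page yields no items.

    fetch_html is any callable mapping a page number to an HTML string,
    for example a thin wrapper around requests.get(...).text.
    """
    results = []
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch_html(page), "html.parser")
        items = soup.find_all("div", class_="item")
        if not items:
            break  # an empty page is treated as the end of the listing
        results.extend(item.get_text(strip=True) for item in items)
    return results


# Fake two-page site so the sketch runs offline
fake_site = {
    1: '<div class="item">A</div><div class="item">B</div>',
    2: '<div class="item">C</div>',
}
print(crawl_pages(lambda page: fake_site.get(page, "")))  # ['A', 'B', 'C']
```

The `max_pages` cap is a safety net in case a site never returns an empty page.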

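5. Things to Watch Out For

Avoid crawling too fast: throttle your request rate so that a burst of requests does not get your IP banned.
```
import time
time.sleep(2)  # Wait 2 seconds between requests
```
Handle encoding issues: some pages need the correct encoding set explicitly.
```
response.encoding = response.apparent_encoding
```
Dynamic pages: if the data is loaded by JavaScript, BeautifulSoup cannot see it directly; combine it with Selenium or fetch the underlying API instead.
Legality: before crawling, check the target site's robots.txt and terms of service, and make sure your crawler complies with them.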
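
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a small made-up rule set rather than downloading a real file; the user-agent string `my-crawler` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Sample rules standing in for a downloaded robots.txt file
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

print(parser.can_fetch("my-crawler", "https://example.com/articles/1"))  # True
print(parser.can_fetch("my-crawler", "https://example.com/private/x"))   # False
print(parser.crawl_delay("my-crawler"))                                  # 2
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches and parses the real file instead of the inline lines.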

6. Combining BeautifulSoup with pandas

After scraping table data, you can use pandas to save it as CSV or analyze it.

Example: saving table data as CSV

import pandas as pd
from bs4 import BeautifulSoup

html = """
<table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>Zhang San</td><td>25</td></tr>
    <tr><td>Li Si</td><td>30</td></tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")

# Extract the table data
data = []
rows = soup.find_all("tr")
for row in rows[1:]:  # Skip the header row
    cells = row.find_all("td")
    data.append([cell.string for cell in cells])

# Build a DataFrame and save it
df = pd.DataFrame(data, columns=["Name", "Age"])
df.to_csv("data.csv", index=False)
print("Data saved to data.csv")

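7. Summary

BeautifulSoup is a powerful, easy-to-use HTML parsing library that is well suited to scraping static pages and extracting data. Its main strengths are:

A simple API with support for multiple parsers.
Flexible methods for searching and traversing HTML elements.
Easy integration with other libraries such as requests and pandas to form a complete scraping and analysis pipeline.

Once you are comfortable with BeautifulSoup, you can build scrapers quickly and handle most static-page scraping tasks.

The rest of this article extends these basics with more advanced parsing techniques, handling dynamic content, exception handling, and combining BeautifulSoup with other tools for more efficient crawling.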

8. Advanced BeautifulSoup Techniques

8.1 Flexible CSS Selectors

Besides find() and find_all(), the select() method lets you filter elements with CSS selectors, which is handy for more complex queries.

Example: using CSS selectors

html = """
<div class="content">
    <ul id="list">
        <li class="item">Python</li>
        <li class="item">Java</li>
        <li class="item">C++</li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Select all list items
items = soup.select(".content .item")
for item in items:
    print(item.string)  # Output: Python, Java, C++
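
The descendant selector above is only one of the forms `select()` accepts. The sketch below tries a few others (child combinator, compound class, attribute prefix, positional pseudo-class) on a slightly extended, made-up version of the same list; the `featured` class and the link are added purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="content">
    <ul id="list">
        <li class="item">Python</li>
        <li class="item featured">Java</li>
        <li class="item"><a href="https://example.com/cpp">C++</a></li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: <li> elements directly under the list with id="list"
names = [li.get_text(strip=True) for li in soup.select("ul#list > li")]
print(names)  # ['Python', 'Java', 'C++']

# Compound class selector: elements carrying both classes
featured = soup.select_one("li.item.featured").get_text()
print(featured)  # Java

# Attribute prefix selector: links whose href starts with https
links = [a["href"] for a in soup.select('a[href^="https"]')]
print(links)  # ['https://example.com/cpp']

# Positional pseudo-class: the second <li> among its siblings
second = soup.select_one("li:nth-of-type(2)").get_text()
print(second)  # Java
```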

8.2 Matching with Regular Expressions

With the re module, you can pass regular expressions to find_all() and similar methods to match elements.

Example: matching content in a specific format

import re
from bs4 import BeautifulSoup

html = """
<div>
    <p>Python 3.9</p>
    <p>Java 8</p>
    <p>C++ 11</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Match paragraphs whose text contains a digit
paragraphs = soup.find_all("p", string=re.compile(r"\d"))
for p in paragraphs:
    print(p.string)  # Output: Python 3.9, Java 8, C++ 11
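
Regular expressions are not limited to the `string` argument; they can also be matched against attribute values and tag names. A short sketch on made-up markup:

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="/files/report.pdf">Report</a>
<a href="/files/data.csv">Data</a>
<a href="/about">About</a>
<h1>Title</h1><h2>Subtitle</h2><p>Body</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Regex against an attribute value: links ending in .pdf or .csv
downloads = [a["href"] for a in soup.find_all("a", href=re.compile(r"\.(pdf|csv)$"))]
print(downloads)  # ['/files/report.pdf', '/files/data.csv']

# Regex against the tag name: h1 through h6
headings = [h.name for h in soup.find_all(re.compile(r"^h[1-6]$"))]
print(headings)  # ['h1', 'h2']
```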

8.3 Handling Nested Structures

BeautifulSoup can parse deeply nested HTML; chain .find() or .select() calls to drill down level by level.

Example: extracting data from a nested structure

html = """
<div class="outer">
    <div class="inner">
        <span class="text">Hello, World!</span>
    </div>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Find the nested span element
text = soup.find("div", class_="outer").find("span", class_="text").string
print(text)  # Output: Hello, World!

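8.4 Extracting Comment Nodes

HTML comment nodes can be extracted using the Comment class.

Example: extracting comment content

from bs4 import BeautifulSoup
from bs4 import Comment

html = """
<div>
    <!-- This is a comment -->
    <p>This is the body text</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Extract the comment (strip the spaces that surround the comment text)
comment = soup.find(string=lambda text: isinstance(text, Comment))
print(comment.strip())  # Output: This is a comment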
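
The same filter works with `find_all()` to collect every comment, and `extract()` can then remove them from the tree. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup, Comment

html = "<div><!-- first --><p>Body text</p><!-- second --></div>"
soup = BeautifulSoup(html, "html.parser")

# Collect every comment node, stripping the surrounding whitespace
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
comment_texts = [c.strip() for c in comments]
print(comment_texts)  # ['first', 'second']

# Remove the comments from the tree, leaving only the real markup
for c in comments:
    c.extract()
cleaned = str(soup)
print(cleaned)  # <div><p>Body text</p></div>
```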

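9. Handling Dynamic Content with BeautifulSoup

BeautifulSoup only parses static HTML. For dynamically loaded content, such as parts rendered by JavaScript, the options are:

9.1 Fetching API Data

Many dynamic pages load their data from an API, so you can try requesting the API directly instead of parsing the HTML.

Example: fetching JSON data

import requests

url = "https://api.github.com/repos/python/cpython"
response = requests.get(url, timeout=10)
data = response.json()

# Extract repository information
print("Repository:", data["name"])
print("Stars:", data["stargazers_count"])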

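9.2 Loading Dynamic Content with Selenium

For content that depends entirely on JavaScript rendering, you can combine BeautifulSoup with Selenium, which drives a real browser.

Example: getting the HTML of a dynamic page with Selenium

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")

    # Grab the HTML after the browser has rendered it
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")

    # Extract the dynamic content ("dynamic-content" is a placeholder id)
    element = soup.find("div", id="dynamic-content")
    print(element.text if element else "Element not found")
finally:
    driver.quit()  # Always close the browser, even on errors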

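9.3 Crawling Paginated Content

Dynamic pages often load their data page by page; you can simulate clicking the next-page button or request the pagination endpoint directly.

Example: simulated paginated fetching

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/page={}"
for page in range(1, 4):  # Assume there are 3 pages
    url = base_url.format(page)
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")

    # Extract the data on the current page
    items = soup.find_all("div", class_="item")
    for item in items:
        print(item.text.strip())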

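10. Exception Handling and Crawler Optimization

When writing a crawler you may run into network errors, changes in HTML structure, and similar problems. Common handling and optimization techniques follow.

10.1 Exception Handling

Example: handling request exceptions

import requests

url = "https://example.com"
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise for 4xx/5xx HTTP status codes
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

Example: handling parsing problems

from bs4 import BeautifulSoup

html = "<div><p>Paragraph</div>"  # Malformed HTML: <p> is never closed
try:
    soup = BeautifulSoup(html, "lxml")
    print(soup.p.string)   # Output: Paragraph (the parser repairs the missing </p>)
    print(soup.h1.string)  # soup.h1 is None, so this raises AttributeError
except AttributeError as e:
    print(f"Parsing failed: {e}")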
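
In practice, parsers rarely raise on malformed HTML; the more common failure is a lookup that returns `None` because the page structure changed. One way to handle that in a single place is a small helper like the sketch below. The name `safe_text` and its default value are hypothetical choices, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def safe_text(soup, selector, default=""):
    """Return the text of the first element matching selector, or default."""
    element = soup.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)


soup = BeautifulSoup("<div><p class='price'> 42 </p></div>", "html.parser")
print(safe_text(soup, "p.price"))            # 42
print(safe_text(soup, "span.stock", "n/a"))  # n/a
```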

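10.2 Setting Request Headers

Adding request headers makes a request look like it came from an ordinary browser, which helps avoid being blocked by anti-scraping measures.

Example: setting request headers

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get("https://example.com", headers=headers, timeout=10)

10.3 Adding Delays

To avoid triggering anti-scraping measures, add a random delay between requests.

Example: random delay

import time
import random

time.sleep(random.uniform(1, 3))  # Sleep for 1 to 3 seconds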
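
Headers, delays, and timeouts can be bundled into a reusable `requests.Session`, optionally with automatic retries through urllib3's `Retry`. The sketch below is one possible arrangement; the function names, the `my-crawler/0.1` user agent, and the retry settings are all assumptions chosen for illustration. Nothing is fetched until `polite_get` is called.

```python
import random
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent="my-crawler/0.1"):
    """Create a Session with a default User-Agent and automatic retries."""
    retry = Retry(
        total=3,             # at most 3 retries per request
        backoff_factor=1,    # growing pause between retries
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_get(session, url, min_delay=1.0, max_delay=3.0):
    """Sleep for a random interval, then issue a GET with a timeout."""
    time.sleep(random.uniform(min_delay, max_delay))
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response


session = build_session()
print(session.headers["User-Agent"])  # my-crawler/0.1
```

A crawler would then call `polite_get(session, url)` inside its page loop instead of `requests.get(url)`.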

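10.4 Using Proxies

Routing requests through a proxy hides your real IP address and can get around IP-based limits.

Example: setting a proxy

import requests

# Placeholder proxy address; replace it with a proxy you are allowed to use.
# The proxy URL itself normally uses http://, even for HTTPS traffic.
proxies = {
    "http": "http://123.123.123.123:8080",
    "https": "http://123.123.123.123:8080"
}
response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.text)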

11. Combining BeautifulSoup with Databases

Scraped data can be stored in a database for later analysis or use.

11.1 Storing to SQLite

Example: saving scraped data to SQLite

import sqlite3
from bs4 import BeautifulSoup

# Create the SQLite database
conn = sqlite3.connect("data.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS data (title TEXT, link TEXT)")

# Sample HTML
html = """
<div>
    <a href="https://example.com/1">Title 1</a>
    <a href="https://example.com/2">Title 2</a>
</div>
"""
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a")

# Insert the data
for link in links:
    title = link.string
    url = link["href"]
    cursor.execute("INSERT INTO data (title, link) VALUES (?, ?)", (title, url))

conn.commit()
conn.close()
print("Data saved to the SQLite database")
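
Running the script above twice inserts every row twice. One way to make repeated runs safe is a `UNIQUE` constraint on the link column combined with `INSERT OR IGNORE`. The sketch below uses an in-memory database and illustrative sample data so it leaves no file behind; the schema change is a suggestion, not part of the original example.

```python
import sqlite3

from bs4 import BeautifulSoup

html = """
<div>
    <a href="https://example.com/1">Title 1</a>
    <a href="https://example.com/2">Title 2</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
rows = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute("CREATE TABLE IF NOT EXISTS data (title TEXT, link TEXT UNIQUE)")

# Insert the same batch twice; the UNIQUE link column plus OR IGNORE
# means the second pass adds nothing.
for _ in range(2):
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO data (title, link) VALUES (?, ?)", rows
        )

count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(count)  # 2
conn.close()
```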

11.2 Storing to a CSV File

Example: saving scraped data to CSV

import csv
from bs4 import BeautifulSoup

# Sample HTML
html = """
<div>
    <a href="https://example.com/1">Title 1</a>
    <a href="https://example.com/2">Title 2</a>
</div>
"""
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a")

# Save to CSV
with open("data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link"])  # Write the header row
    for link in links:
        writer.writerow([link.string, link["href"]])
print("Data saved to data.csv")