A Beginner-Friendly Introduction to Python Web Scraping: Scraping the Three Hundred Tang Poems with Python

Article link: https://blog.csdn.net/codingexpert404/article/details/143171876

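Web scraping with Python is a practical way to automate the collection of data from web pages, and it works well for pulling specific information from many kinds of sites. This article walks step by step through a simple Python scraper that collects the poems of the Three Hundred Tang Poems (《唐诗三百首》).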

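1. Preparing the tools

First, install Python, then install two libraries commonly used for scraping:

requests: sends HTTP requests and retrieves page content.
BeautifulSoup: parses HTML documents and extracts the information you need.

Run the following command in a terminal to install both libraries: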

pip install requests beautifulsoup4
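
To confirm the installation worked, a short check like the following can help. It is only a sketch: it imports both packages and prints the versions that happen to be installed on your machine.

```python
import bs4
import requests

# Collect the installed versions; an ImportError here means the install failed.
versions = {"requests": requests.__version__, "beautifulsoup4": bs4.__version__}
for name, version in versions.items():
    print(f"{name} {version}")
```

Note that the package is installed as beautifulsoup4 but imported as bs4.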

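2. Analyzing the target site

The goal of this scraper is to collect the poems in the Three Hundred Tang Poems. The target site is 中国诗词网 (gushicionline.com), which hosts the full text of the poems. Inspect the page's HTML structure, for example with the browser's developer tools, to locate the title, content, and author of each poem. This page contains all of the data for the Three Hundred Tang Poems, so we can fetch and parse this single page directly.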
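
Besides reading the HTML by eye, one way to spot the repeated container that wraps each poem is to count how often each tag and class combination occurs. The sketch below does that on a small HTML sample. The markup in SAMPLE_HTML and the helper name count_tag_classes are invented for illustration, and the real page may use different tags and class names.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = (
    '<div class="poem-item"><h2>静夜思</h2><p class="author">李白</p>'
    '<div class="content">床前明月光</div></div>'
    '<div class="poem-item"><h2>春晓</h2><p class="author">孟浩然</p>'
    '<div class="content">春眠不觉晓</div></div>'
)


def count_tag_classes(html):
    """Count (tag, class) pairs so that repeated containers stand out."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(True):  # True matches every tag
        for cls in tag.get("class", []):
            counts[(tag.name, cls)] += 1
    return counts


counts = count_tag_classes(SAMPLE_HTML)
for (name, cls), n in counts.most_common(5):
    print(f"{name}.{cls}: {n}")
```

A pair that appears once per poem is a good candidate for the selector used in the parsing step.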

3. Fetching the page content

We use the requests library to send an HTTP request and retrieve the page's HTML. The code is as follows:

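import requests
from bs4 import BeautifulSoup

# Target URL (the tag page for 唐诗三百首)
url = "https://gushicionline.com/tags/%e5%94%90%e8%af%97%e4%b8%89%e7%99%be%e9%a6%96"

# Send the HTTP request; the timeout keeps the script from hanging indefinitely
response = requests.get(url, timeout=10)
response.encoding = 'utf-8'  # Decode the response body as UTF-8

# Check whether the request succeeded
if response.status_code == 200:
    html_content = response.text
else:
    # Stop here so that later steps never run with html_content undefined
    raise SystemExit(f"Request failed, status code: {response.status_code}")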
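
The same request can also be wrapped in a small reusable function. The version below is a sketch under a few assumptions: the function name fetch_html, the User-Agent string, and the 10-second default timeout are all illustrative choices, not something the target site requires.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on HTTP errors."""
    http = session or requests.Session()
    # The User-Agent value is an assumption; some sites reject the default one.
    headers = {"User-Agent": "tang-poems-tutorial/0.1"}
    response = http.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    response.encoding = "utf-8"  # decode the body as UTF-8
    return response.text
```

Calling fetch_html(url) with the URL above returns the HTML as a string. Accepting an optional session makes the function easy to exercise without network access, since any object with a compatible get method can be passed in.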

4. Parsing the page data

Next, we use BeautifulSoup to parse the HTML document and extract each poem's title, author, and content.

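# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find every poem element; these are usually identified by HTML tag or class name,
# so check the names used here against the actual page source
poems = soup.find_all('div', class_='poem-item')

# Loop over the poems and extract the title, author, and content of each
for poem in poems:
    title = poem.find('h2').text  # Poem title
    author = poem.find('p', class_='author').text  # Author
    content = poem.find('div', class_='content').text  # Poem text
    print(f"Title: {title}\nAuthor: {author}\nContent:\n{content}\n")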
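
The loop above raises AttributeError if any poem lacks one of the expected elements, because find returns None when nothing matches. A more defensive variant might look like the sketch below. The helper names text_or_default and extract_poems and the sample markup are hypothetical, and the class names are the same assumed ones used above.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_name=None, default=""):
    """Return the stripped text of the first match, or default if absent."""
    if class_name:
        node = parent.find(name, class_=class_name)
    else:
        node = parent.find(name)
    # "\n" keeps line breaks between verses that are separated by <br> tags
    return node.get_text("\n", strip=True) if node else default


def extract_poems(html):
    """Collect one dict per poem container found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    poems = []
    for item in soup.find_all("div", class_="poem-item"):
        poems.append({
            "title": text_or_default(item, "h2"),
            "author": text_or_default(item, "p", "author"),
            "content": text_or_default(item, "div", "content"),
        })
    return poems


# Hypothetical markup; the second poem deliberately has no author element.
sample = (
    '<div class="poem-item"><h2>静夜思</h2><p class="author">李白</p>'
    '<div class="content">床前明月光，<br>疑是地上霜。</div></div>'
    '<div class="poem-item"><h2>春晓</h2>'
    '<div class="content">春眠不觉晓</div></div>'
)
records = extract_poems(sample)
for record in records:
    print(record)
```

Returning plain dictionaries also separates parsing from output, so the same records can be printed, filtered, or saved without parsing the page again.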

5. Saving the data

Once you have the poem data, you can save it to a local file for later use.

# Write each poem to a UTF-8 text file, separated by a blank line
with open('tang_poems.txt', 'w', encoding='utf-8') as file:
    for poem in poems:
        title = poem.find('h2').text
        author = poem.find('p', class_='author').text
        content = poem.find('div', class_='content').text
        file.write(f"Title: {title}\nAuthor: {author}\nContent:\n{content}\n\n")
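
A plain text file is easy to read but awkward to load back into a program. If the poems are held as a list of dictionaries, JSON is one possible alternative. The sketch below uses only the standard library; the function names, the file name tang_poems.json, and the sample record are illustrative assumptions.

```python
import json
import os
import tempfile


def save_poems_json(poems, path):
    """Write a list of poem dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese characters readable in the file
        json.dump(poems, f, ensure_ascii=False, indent=2)


def load_poems_json(path):
    """Read the list of poem dicts back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical record standing in for the scraped results
poems = [{"title": "静夜思", "author": "李白", "content": "床前明月光，\n疑是地上霜。"}]

# A temporary directory keeps this demonstration from leaving files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tang_poems.json")
    save_poems_json(poems, path)
    restored = load_poems_json(path)

print(restored == poems)
```

In a real run you would pass a permanent path such as 'tang_poems.json' instead of the temporary one.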

6. Complete code

Putting all the steps together, the final code is:

import requests
from bs4 import BeautifulSoup

# Fetch the tag page; the timeout keeps the script from hanging indefinitely
url = "https://gushicionline.com/tags/%e5%94%90%e8%af%97%e4%b8%89%e7%99%be%e9%a6%96"
response = requests.get(url, timeout=10)
response.encoding = 'utf-8'

if response.status_code == 200:
    # Parse the page and locate each poem container
    soup = BeautifulSoup(response.text, 'html.parser')
    poems = soup.find_all('div', class_='poem-item')

    # Write the title, author, and content of every poem to a text file
    with open('tang_poems.txt', 'w', encoding='utf-8') as file:
        for poem in poems:
            title = poem.find('h2').text
            author = poem.find('p', class_='author').text
            content = poem.find('div', class_='content').text
            file.write(f"Title: {title}\nAuthor: {author}\nContent:\n{content}\n\n")
else:
    print(f"Request failed, status code: {response.status_code}")