A simple crawler example: scraping a novel site with regular expressions

Latest recommended article published on 2024-05-14 01:43:06

Python 学习者

Article link: https://blog.csdn.net/sinat_38682860/article/details/131071601

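Scraping novels from xx阁 with Python takes the requests library and the regular expression module re. The steps are as follows:

1. First, request the novel's page with the `requests` library, for example: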

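import requests

url = 'https://www.biquge.com.cn/book/123456/'
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, timeout=10)
# Stop early on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
# Decode the body as UTF-8 when reading response.text
response.encoding = 'utf-8'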

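Set the encoding explicitly after the request. When the server's Content-Type header declares a text type without a charset, requests falls back to ISO-8859-1, and a UTF-8 page then comes out garbled.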
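
As a rough illustration of why the encoding matters, the sketch below decodes the same UTF-8 bytes twice, once with the wrong codec and once with the right one. The sample string and the helper name `decode_body` are made up for illustration, and no request is sent.

```python
# Hypothetical bytes standing in for a response body; no network access needed
raw_body = '第一章 开始'.encode('utf-8')


def decode_body(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given codec, replacing undecodable bytes."""
    return raw.decode(encoding, errors='replace')


# ISO-8859-1 maps every byte to one character, so the Chinese text is lost
garbled = decode_body(raw_body, 'iso-8859-1')
# UTF-8 is the codec the bytes were written with, so the text round-trips
correct = decode_body(raw_body, 'utf-8')

print(repr(garbled))
print(correct)
```

Setting `response.encoding = 'utf-8'` before reading `response.text` plays the role of the second call here.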

2. Extract the novel's title. The `findall()` method of a compiled `re` pattern works here, for example:

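import re

# The parenthesized group captures the value of the content attribute
title_pattern = re.compile(r'<meta property="og:title" content="(.*?)"/>')
titles = title_pattern.findall(response.text)
if not titles:
    raise ValueError('og:title meta tag not found on the page')
title = titles[0]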

This relies on a capture group in the regular expression: only the text inside the parentheses, the title itself, is returned.
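
To make the role of the capture group and of the non-greedy `.*?` concrete, here is a small sketch run against a made-up HTML snippet. The sample markup and the greedy variant of the pattern are assumptions added for comparison, not something taken from the real page.

```python
import re

# Hypothetical page fragment with two meta tags on one line
sample_html = (
    '<meta property="og:title" content="示例小说"/>'
    '<meta property="og:type" content="novel"/>'
)

# Non-greedy group: stops at the first closing quote followed by "/>"
lazy_pattern = re.compile(r'<meta property="og:title" content="(.*?)"/>')
# Greedy group: runs on to the last closing quote followed by "/>"
greedy_pattern = re.compile(r'<meta property="og:title" content="(.*)"/>')

lazy_title = lazy_pattern.findall(sample_html)[0]
greedy_title = greedy_pattern.findall(sample_html)[0]

print(lazy_title)
print(greedy_title)
```

The non-greedy version returns just the title, while the greedy one swallows the following tag as well, which is why the pattern in this step uses `.*?`.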

3. Extract the chapter list, again with `findall()`, for example:

# Two capture groups, so each match is a (href, chapter title) tuple
chapter_pattern = re.compile(r'<dd><a href="(.*?)">(.*?)</a></dd>')
chapter_list = chapter_pattern.findall(response.text)

This regular expression matches each chapter's link and title on the page.
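
A quick way to see what `findall()` returns for a pattern with two groups is to run it on a small sample. The HTML below is invented for the example; the real page layout may differ.

```python
import re

# Hypothetical chapter list markup, shaped like the pattern expects
sample_html = '''
<dl>
<dd><a href="1.html">第一章</a></dd>
<dd><a href="2.html">第二章</a></dd>
</dl>
'''

chapter_pattern = re.compile(r'<dd><a href="(.*?)">(.*?)</a></dd>')
# With two groups, findall() returns a list of (href, title) tuples
chapter_list = chapter_pattern.findall(sample_html)

for href, name in chapter_list:
    print(href, name)
```

Each tuple can be unpacked directly in a `for` loop, which is what the next step does with `chapter[0]` and `chapter[1]`.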

4. Fetch the content of each chapter: iterate over the chapter list, request each chapter page the same way, and extract the text, for example:

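from urllib.parse import urljoin

# re.S lets "." match newlines, since the chapter text spans several lines
content_pattern = re.compile(r'<div id="content">(.*?)</div>', re.S)
# Open the output file once rather than reopening it for every chapter
with open(title + '.txt', 'a', encoding='utf-8') as f:
    for chapter_href, chapter_title in chapter_list:
        # urljoin resolves relative and site-absolute hrefs against the book URL
        chapter_url = urljoin(url, chapter_href)
        chapter_response = requests.get(chapter_url, timeout=10)
        chapter_response.encoding = 'utf-8'
        contents = content_pattern.findall(chapter_response.text)
        if not contents:
            continue  # skip pages where the content block was not found
        chapter_content = contents[0]
        # Strip leftover entities and tags from the chapter text
        chapter_content = chapter_content.replace('&nbsp;', ' ')
        chapter_content = chapter_content.replace('<br/>', '\n')
        chapter_content = chapter_content.replace('<br />', '\n')
        chapter_content = chapter_content.replace('<p>', '')
        chapter_content = chapter_content.replace('</p>', '')
        f.write(chapter_title + '\n\n' + chapter_content + '\n\n')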
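
The chain of `replace()` calls in step 4 can also be pulled into a small helper. The sketch below is one possible variation, not the original method: it swaps the literal `<br/>` and `<p>` replacements for two regular expressions and uses the standard library's `html.unescape()` for any remaining entities. The function name `clean_content` and the sample strings are assumptions for illustration.

```python
import html
import re

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r'<br\s*/?>', re.I)
# Matches opening and closing <p> tags without attributes
P_TAG = re.compile(r'</?p>', re.I)


def clean_content(raw: str) -> str:
    """Turn a chapter's HTML fragment into plain text."""
    # Replace &nbsp; first so it becomes a regular space, not U+00A0
    text = raw.replace('&nbsp;', ' ')
    text = BR_TAG.sub('\n', text)
    text = P_TAG.sub('', text)
    # Decode whatever entities are left, e.g. &amp; or &lt;
    return html.unescape(text).strip()


print(clean_content('&nbsp;&nbsp;第一段<br/>第二段<br />第三段<p></p>'))
```

Inside the loop, `chapter_content = clean_content(chapter_content)` would then stand in for the five `replace()` lines.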