Python: Scraping the Text of the Novel 斗破苍穹 with Regular Expressions (Implemented with the Requests Library)

Latest recommended article published on 2023-04-27 10:59:15

奋发图强向上好青年



Tags: web scraping, regular expressions, requests

Article link: https://blog.csdn.net/qq_20594391/article/details/90642998


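**Python: Scraping the Text of the Novel 斗破苍穹 (Implemented with the Requests Library)**

The site crawled in this post is 斗破小说网, and the novel I crawled from it is 斗破苍穹. You can adapt the code given here to crawl other novels, since the code is written in much the same way; only the method for 斗破苍穹 is covered here.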

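1. Introduction to the requests library

Requests is an HTTP library written in Python, built on top of urllib3 and released under the Apache 2.0 open-source license. It is more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs. Requests is designed around the idioms of PEP 20, so it is more Pythonic than urllib. Just as importantly, it supports Python 3!
Source code: https://github.com/kennethreitz/requests

Chinese documentation and API reference: http://docs.python-requests.org/zh_CN/latest/index.html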

2. Installation

1. Installing with pip is strongly recommended: pip install requests

2. Installing through PyCharm (the method I used): File -> Default Settings -> Project Interpreter -> search for requests -> Install Package -> OK
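Whichever method you use, it can be worth confirming that the interpreter can actually find the package. The following is a minimal sketch of one way to check, not something from the original post; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once requests has been installed for this interpreter.
print(is_installed("requests"))
```

If this prints False inside PyCharm, the project interpreter is probably not the one the package was installed into.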

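3. Crawling approach

First, open the table-of-contents page of the novel you want to crawl.
[Screenshot: the novel's table-of-contents page]
Looking at the links of all the chapters, you can see that only the ? part of http://www.doupoxs.com/doupocangqiong/?.html changes from one chapter to the next.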
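Because only the number changes, every chapter URL can be generated from a single template. The sketch below illustrates the idea; the names `BASE_URL` and `chapter_url` are chosen for this example and do not come from the site.

```python
# URL template taken from the chapter links; only the number changes.
BASE_URL = "http://www.doupoxs.com/doupocangqiong/{}.html"


def chapter_url(number):
    """Build the URL of one chapter page from its number."""
    return BASE_URL.format(number)


# Build the URLs for chapters 1 to 3 as a small demonstration.
urls = [chapter_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```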
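Right-click the page and choose Inspect, and you can see that the novel's text is all written inside <p>…</p> tags.
[Screenshot: page source showing the paragraph tags]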
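Since the text sits between plain opening and closing tags, a non-greedy regular expression is enough to pull it out. Here is a small offline sketch of that idea; the HTML fragment is made up for illustration and is not taken from the real site.

```python
import re

# Made-up HTML fragment standing in for a real chapter page.
sample_html = """
<h1>Chapter 1</h1>
<p>First paragraph,
split across two lines.</p>
<p>Second paragraph.</p>
"""

# (.*?) is non-greedy, so each match stops at the nearest closing tag.
# re.S lets "." match newlines, so multi-line paragraphs are captured too.
titles = re.findall(r"<h1>(.*?)</h1>", sample_html, re.S)
paragraphs = re.findall(r"<p>(.*?)</p>", sample_html, re.S)

print(titles)
print(paragraphs)
```

Without re.S, the first paragraph would be missed, because "." does not match a newline by default.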

The code is as follows, for reference only.

import re
import requests

# Browser-like User-Agent so the request looks like a normal page visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}

# Output file, opened in append mode; UTF-8 so the Chinese text is written correctly
f = open('E:/Workspace/Spliders/download_file/doupo.txt', 'a+', encoding='utf-8')


def get_info(url):
    res = requests.get(url, headers=headers)

    if res.status_code == 200:
        # Decode the page once and reuse it for both patterns
        html = res.content.decode('utf-8')
        # Chapter title
        contents1 = re.findall('<h1>(.*?)</h1>', html, re.S)
        # Chapter body paragraphs
        contents2 = re.findall('<p>(.*?)</p>', html, re.S)
        # Combine the title and the body
        contents = contents1 + contents2
        for content in contents:
            print(content)
            print("-----------------------------------separator----------------------------------------")
            f.write(content + '\n')

    else:
        # Skip pages that did not return HTTP 200
        pass


if __name__ == "__main__":
    # One URL per chapter; only the number in the path changes
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(number) for number in range(1, 1645)]
    for url in urls:
        get_info(url)

f.close()

This code is only a simple demo for learning web scraping. If you have any questions, feel free to get in touch and discuss!
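As one possible refinement, the download step can be separated from the extract-and-save step, which lets the logic be tried out without touching the network. The sketch below is only an illustration under that assumption: the names `extract_text`, `crawl`, `fetch`, and `fake_pages` are hypothetical, and the demo runs on made-up pages rather than the real site.

```python
import os
import re
import tempfile
import time


def extract_text(html):
    """Return the <h1> title(s) followed by the <p> paragraphs of a page."""
    titles = re.findall(r"<h1>(.*?)</h1>", html, re.S)
    paragraphs = re.findall(r"<p>(.*?)</p>", html, re.S)
    return titles + paragraphs


def crawl(urls, fetch, out_path, delay=0.0):
    """Fetch each URL, extract its text, and append it to out_path.

    `fetch` is any callable that takes a URL and returns the page HTML,
    or None if the page could not be retrieved.
    """
    saved = 0
    # The with-block closes the file even if an exception is raised.
    with open(out_path, "a", encoding="utf-8") as f:
        for url in urls:
            html = fetch(url)
            if html is None:
                continue  # skip pages that failed to download
            for line in extract_text(html):
                f.write(line + "\n")
            saved += 1
            time.sleep(delay)  # pause between pages to go easy on the server
    return saved


# Offline demonstration with a fake fetcher (made-up data, no network access).
fake_pages = {"page1": "<h1>Title</h1><p>Body text.</p>", "page2": None}
out_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
count = crawl(["page1", "page2"], fake_pages.get, out_path)
print(count)
```

For real use, `fetch` could be a small wrapper around requests.get that returns res.content.decode('utf-8') when res.status_code is 200 and None otherwise, with a non-zero `delay` between pages.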

References:
https://www.cnblogs.com/sui776265233/p/9703712.html
https://www.cnblogs.com/mlgjb/p/8012461.html