Scraping a Novel Website with Python 3

wwwcomcn123

Categories: python, novel scraping. Tags: python, requests, novel scraping, python web, flask.

Article link: https://blog.csdn.net/zhwwwcomcn123/article/details/86615893

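Summary (auto-generated by CSDN): This article walks through scraping data from a novel website with the Python 3 requests library: fetching a book's basic information, its chapter list, and the chapter content, and then keeping books up to date with a scheduled job. Finally, it integrates Flask to publish the scraped novels as a simple novel website.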

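I have recently been using the requests library for Python 3. After going through the basic tutorial on the official site, I found it extremely simple and extremely powerful.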

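With the tutorial done, the natural next step was hands-on practice, and this time I picked a novel website as the target. Novel sites are structurally simple, contain both text and images, and offer plenty of data.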

1. The basic request

import bs4
import requests

# verify=False skips TLS certificate verification for the https site
r = requests.get('https://xxx', verify=False)
soup = bs4.BeautifulSoup(r.text, 'lxml')  # parse with the lxml parser
book_name = soup.select('div.book_name')[0].text
book_author = soup.select('div.book_author')[0].text
book_url = soup.select('a.book_url')[0].get('href')
book_img_url = soup.select('a.img')[0].get('src')

That gives us the book's basic information, which can be saved to a database or stored some other way.
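The article does not say which database it uses. As one possible way to persist the four fields, here is a minimal sketch using the standard-library sqlite3 module. The table name `book`, its columns, the helper `save_book`, and the example.com values are all assumptions made for illustration.

```python
import sqlite3


def save_book(conn, name, author, url, img_url):
    """Insert one book row, or refresh it if the URL is already stored."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS book (
            url     TEXT PRIMARY KEY,  -- detail-page URL doubles as the key
            name    TEXT NOT NULL,
            author  TEXT,
            img_url TEXT
        )
        """
    )
    conn.execute(
        "INSERT OR REPLACE INTO book (url, name, author, img_url) "
        "VALUES (?, ?, ?, ?)",
        (url, name, author, img_url),
    )
    conn.commit()


# ':memory:' keeps the demo self-contained; use a file path for real storage.
conn = sqlite3.connect(':memory:')
save_book(conn, 'Demo Book', 'Demo Author',
          'https://example.com/book/1', 'https://example.com/1.jpg')
# Saving the same URL again replaces the row instead of duplicating it.
save_book(conn, 'Demo Book (revised)', 'Demo Author',
          'https://example.com/book/1', 'https://example.com/1.jpg')
rows = conn.execute('SELECT name, author FROM book').fetchall()
```

Keying the table on the detail-page URL means a repeated scrape of the same book overwrites the earlier row, which suits a scraper that revisits pages.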

2. Scraping the book's chapters

The book_url obtained above is the address of the book's detail page.

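r = requests.get(book_url, verify=False)  # skip certificate verification for the https site
soup = bs4.BeautifulSoup(r.text, 'lxml')  # parse with the lxml parser
chapter_list = soup.select('div.chapter_list ul li')
mulu = []  # table of contents: collects every chapter entry
for chapter in chapter_list:
    chapter_name = chapter.select('a')[0].text         # link text is the chapter title
    chapter_url = chapter.select('a')[0].get('href')   # href is the chapter address
    mulu.append((chapter_name, chapter_url))  # stored as a tuple (a dict would work too)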

That collects all of the book's chapter information. Next comes the content itself, which is comparatively simple because it is plain text.
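The article does not say whether the chapter links on the target site are absolute or relative. If they are relative, they need to be resolved against the detail-page URL before they can be requested. A minimal sketch with the standard-library `urljoin`, where the helper name `absolutize` and the example.com URLs are hypothetical:

```python
from urllib.parse import urljoin


def absolutize(book_url, mulu):
    """Resolve each chapter href against the book's detail-page URL."""
    return [(name, urljoin(book_url, href)) for name, href in mulu]


# Hypothetical values shaped like the (chapter_name, chapter_url) tuples above.
book_url = 'https://example.com/book/42/'
mulu = [('Chapter 1', '1.html'),
        ('Chapter 2', '/book/42/2.html'),
        ('Chapter 3', 'https://example.com/book/42/3.html')]
full_mulu = absolutize(book_url, mulu)
```

`urljoin` leaves links that are already absolute untouched, so it is safe to apply to every entry regardless of how the site writes its hrefs.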

3. Scraping chapter content

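r = requests.get(chapter_url, verify=False)  # skip certificate verification for the https site
soup = bs4.BeautifulSoup(r.text, 'lxml')  # parse with the lxml parser
# select() returns a list, so take the first match; .text gives plain text only.
# To keep the original markup, use the element's HTML content instead,
# though you will normally apply your own styling anyway.
chapter_content = soup.select('div.content')[0].text
with open(r'1.txt', 'w', encoding='utf-8') as f:  # in practice, render into your own HTML template
    f.write(chapter_content)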

With that, the scraping itself is essentially done.
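The snippet above writes plain text to a file and notes that an HTML template should take its place. The article does not show its template, so the following is only a sketch of what that step might look like, using the standard library. The `PAGE` markup and the helper `render_chapter` are assumptions.

```python
import html
from string import Template

# Hypothetical minimal page template; a real site would use its own markup.
PAGE = Template(
    '<html><head><meta charset="utf-8"><title>$title</title></head>'
    '<body><h1>$title</h1>$body</body></html>'
)


def render_chapter(title, chapter_content):
    """Escape the plain text and wrap each non-empty line in a <p> tag."""
    paragraphs = [line.strip() for line in chapter_content.splitlines()]
    body = ''.join(
        '<p>{}</p>'.format(html.escape(p)) for p in paragraphs if p
    )
    return PAGE.substitute(title=html.escape(title), body=body)


page = render_chapter('Chapter 1', 'First line.\n\n  Second <line>.  ')
```

Escaping the scraped text before embedding it keeps stray `<` or `&` characters in a chapter from breaking the page. The result can be written out with `open(path, 'w', encoding='utf-8')`.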

Those are all the pieces that need to be scraped. What remains is adding control around them.

4. Scheduled book updates

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
scheduler.start()

@scheduler.scheduled_job('interval', seconds=3)
def print_str():
    """ update data """
    print('...')

# scheduler.shutdown()  # stops the scheduled jobs

The interval above is given in seconds; in normal operation I use hours.
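The body of the update job is left as a placeholder in the article. One plausible core for it is to compare the freshly scraped chapter list with what is already stored and fetch only the difference. The sketch below assumes chapters are identified by URL; the helper `find_new_chapters` and the example.com data are hypothetical.

```python
def find_new_chapters(stored_urls, scraped_mulu):
    """Return scraped (name, url) tuples whose URL is not stored yet.

    Order is preserved so new chapters are fetched in reading order.
    """
    seen = set(stored_urls)
    new = []
    for name, url in scraped_mulu:
        if url not in seen:
            seen.add(url)  # also drops duplicates within one scrape
            new.append((name, url))
    return new


stored = ['https://example.com/book/42/1.html']
scraped = [('Chapter 1', 'https://example.com/book/42/1.html'),
           ('Chapter 2', 'https://example.com/book/42/2.html'),
           ('Chapter 2', 'https://example.com/book/42/2.html'),
           ('Chapter 3', 'https://example.com/book/42/3.html')]
new_chapters = find_new_chapters(stored, scraped)
```

Fetching only the new entries keeps each scheduled run short and avoids requesting chapters the site has already served once.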

5. Publishing the scraped novels by integrating Flask

At this point the scheduled book-update job also has to change.

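from flask import Flask
from flask_apscheduler import APScheduler

app = Flask(__name__)

def update_book():
    """ update data (placeholder body; put the scraping steps here) """
    print('...')

class Config(object):
    # Job definitions picked up by Flask-APScheduler
    JOBS = [
        {
            'id': 'update_book',
            'func': update_book,
            'trigger': 'interval',
            'hours': 1
        }
    ]

    SCHEDULER_API_ENABLED = True

if __name__ == '__main__':
    app.config.from_object(Config())
    scheduler = APScheduler()
    scheduler.init_app(app)
    scheduler.start()

    app.run(host='0.0.0.0')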

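Flask is used above, but another web framework would work as well. With this, the novel site is basically set up, and it should run normally for a while without major problems.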

By this point I had become much more familiar with using python requests.