Scraping a Novel with the Scrapy Framework and Turning It into an Audiobook with edge_tts

Captain_Thomas_L

Published 2024-06-13 23:53:25


Article link: https://blog.csdn.net/captain_thomas_l/article/details/139652316


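Preface:

This project requires familiarity with the Scrapy framework, the edge_tts library, and a MySQL database.

Overall approach: because Scrapy handles requests asynchronously, responses can come back in any order, so the text of each page cannot simply be appended to a txt file as it is crawled. A MySQL database is introduced for this reason. Every page is stored together with a page number, and once the crawl has finished, a separate script reads the content from MySQL in page-number order and passes the novel's text to edge_tts to generate the MP3 audio.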
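To illustrate why the page number matters, the sketch below simulates pages arriving out of order and restores reading order by sorting on the index. The `assemble_novel` helper and the sample data are hypothetical and are not part of the original project.

```python
# Hypothetical helper: restore reading order from pages that arrived in any order.
def assemble_novel(pages):
    """pages: iterable of (index, text) tuples, in arbitrary order."""
    ordered = sorted(pages, key=lambda page: page[0])
    return "".join(text for _, text in ordered)


# Simulated arrival order of asynchronous responses (made-up sample data)
arrived = [(3, "page three. "), (2, "page two. "), (4, "page four. ")]
novel = assemble_novel(arrived)
print(novel)
```

In the real project the sorting is delegated to MySQL through an `order by` clause on the index column.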

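Preface:

I: Crawling the novel with Scrapy and storing it in MySQL

II: Reading the data from MySQL in order, assembling the text, and generating the audiobook

III: BUG: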

I: Crawling the novel with Scrapy and storing it in MySQL

Using a novel from Biquge as the example:

First, create the project:

scrapy startproject jin_dan_shi_heng_xing
cd jin_dan_shi_heng_xing
scrapy genspider bigee bigee.cc

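Apply some basic configuration in settings:

1. Manually add LOG_LEVEL and set the log level to WARNING

2. Do not obey the robots.txt protocol

3. Set the maximum number of concurrent requests to 100

4. Enable the item pipeline

LOG_LEVEL = "WARNING"  # only log messages at WARNING level or above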
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

ITEM_PIPELINES = {
   "jin_dan_shi_heng_xing.pipelines.JinDanShiHengXingPipeline": 300,
}

Put the URL of the novel's table-of-contents page in start_urls:

import scrapy

class BigeeSpider(scrapy.Spider):
    name = "bigee"
    allowed_domains = ["bigee.cc"]
    start_urls = ["https://www.bigee.cc/book/146267/"]

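Override start_requests so that one request is issued per page, with the page number passed to the callback (once start_requests is overridden like this, start_urls is no longer used):

    def start_requests(self):
        for i in range(2, 442):
            url = 'https://www.bigee.cc/book/146267/' + str(i) + '.html'
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs={'index': i})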
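The URL pattern used in start_requests can be checked in isolation before launching a crawl. This is a minimal sketch: `chapter_urls` is a hypothetical helper, while the base URL and page range are the ones used by the spider above.

```python
BASE_URL = "https://www.bigee.cc/book/146267/"


def chapter_urls(first=2, last=441):
    """Yield (index, url) pairs for pages first..last inclusive."""
    for i in range(first, last + 1):
        yield i, f"{BASE_URL}{i}.html"


pairs = list(chapter_urls())
print(len(pairs))   # number of requests the spider will issue
print(pairs[0])
print(pairs[-1])
```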

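Write the response handler parse:

    def parse(self, response, index):
        text_list = response.css('#chaptercontent::text').getall()
        print(text_list)
        # Join every text node except the last two into one string
        # (avoid naming the variable "str", which would shadow the built-in)
        text = ''.join(text_list[:-2])
        yield {
            'index': index,
            'text': text
        }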

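Write the pipeline's process_item method, which stores each page and its number in the MySQL database:

    def process_item(self, item, spider):
        # Requires "import pymysql" at the top of pipelines.py
        db = pymysql.connect(user='root', password='123456', host='localhost', port=3306, db='scrapy_novel')
        try:
            with db.cursor() as cursor:
                sql = "insert into jin_dan(myindex,text) values(%s,%s)"
                cursor.execute(sql, (item['index'], item['text']))
            db.commit()
        finally:
            # Close the connection even if the insert fails
            db.close()
        print('executed')
        return item  # pass the item on to any later pipeline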
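Start the crawler, using the -s option to set JOBDIR so that the crawl state is saved to disk and an interrupted crawl can be resumed: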

scrapy crawl bigee -s JOBDIR=record/spider-1

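The crawl result is shown below. The author used one standalone chapter to ask readers for monthly votes, so although the novel has 437 chapters in total, the database contains one extra row.

II: Reading the data from MySQL in order, assembling the text, and generating the audiobook

Without further ado, here is the code: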
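One quick way to verify a crawl is to compare the stored page numbers against the range that was requested. A minimal sketch, assuming the myindex values have already been read from the database into a list; `check_indices` and the sample data are hypothetical.

```python
from collections import Counter


def check_indices(stored, first, last):
    """Return (missing, duplicated) page numbers for the range first..last."""
    counts = Counter(stored)
    missing = [i for i in range(first, last + 1) if counts[i] == 0]
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicated


# Made-up sample: page 4 was never stored, page 5 was stored twice
missing, duplicated = check_indices([2, 3, 5, 5, 6], first=2, last=6)
print(missing, duplicated)
```

Duplicates are worth checking because running the spider a second time inserts the same page numbers again.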

#!/usr/bin/env python3

"""
Read the crawled pages from MySQL in order and convert each one to an MP3 with edge_tts.
"""
import asyncio

import edge_tts
import pymysql

VOICE = "zh-CN-YunxiNeural"
OUTPUT_DIR = "d:/"


async def amain(text, output_file) -> None:
    """Synthesize text with VOICE and save the audio to output_file."""
    communicate = edge_tts.Communicate(text, VOICE)
    await communicate.save(output_file)


async def fetch_and_process(cursor):
    """Fetch all rows from the cursor and generate one MP3 per row."""
    all_d = cursor.fetchall()
    # Number the output files from 1, in the order the rows were returned
    for i, (text,) in enumerate(all_d, start=1):
        print(text[:100])
        output_file = f"{OUTPUT_DIR}{i}金丹是恒星，你管这叫修仙？.mp3"
        await amain(text, output_file)


def main():
    db = pymysql.connect(user='root', password='123456', host='localhost', port=3306, db='scrapy_novel')
    try:
        # Sorting by myindex restores the reading order
        sql = 'select text from jin_dan order by myindex asc'
        cursor = db.cursor()
        cursor.execute(sql)
        # asyncio.run creates the event loop and closes it when done
        asyncio.run(fetch_and_process(cursor))
    except Exception as e:
        print(f"Error: {e}")
    finally:
        db.close()


if __name__ == "__main__":
    main()
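File names that begin with 1, 2, 10 and so on are sorted as text by many file managers and players, which puts file 10 before file 2. One possible variation is to zero-pad the number. `output_path` is a hypothetical helper, and the width of 3 is an assumption that fits a few hundred files.

```python
def output_path(output_dir, number, title, width=3):
    """Build an MP3 path whose zero-padded number sorts correctly as text."""
    return f"{output_dir}{number:0{width}d}{title}.mp3"


names = [output_path("d:/", n, "金丹是恒星，你管这叫修仙？") for n in (1, 2, 10)]
print(names)
print(sorted(names) == names)
```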

The results of running the script are shown below:

1. Console