Preface:
This project uses the Scrapy framework, the edge_tts library, and a MySQL database.
Overall approach: because Scrapy issues requests asynchronously, responses come back out of order, so each chapter's text cannot simply be written to a txt file as it arrives. Instead, a MySQL database is introduced: every chapter is stored together with a chapter index, and once the crawl finishes, a separate script reads the content back from MySQL in index order, concatenates it into strings, and passes the novel's text to edge_tts to generate MP3 files.
Part 1: Crawling the novel with Scrapy and storing it in MySQL
Taking a novel from Biquge (bigee.cc) as the example:
First, create the project:
scrapy startproject jin_dan_shi_heng_xing
cd jin_dan_shi_heng_xing
scrapy genspider bigee bigee.cc
Make a few basic changes in settings.py:
1. Manually add LOG_LEVEL to set the log level to WARNING
2. Do not obey robots.txt
3. Raise the maximum number of concurrent requests to 100
4. Enable the item pipeline
LOG_LEVEL = "WARNING"  # suppress anything below WARNING
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
ITEM_PIPELINES = {
    "jin_dan_shi_heng_xing.pipelines.JinDanShiHengXingPipeline": 300,
}
Store the table-of-contents page URL in start_urls:
class BigeeSpider(scrapy.Spider):
    name = "bigee"
    allowed_domains = ["bigee.cc"]
    start_urls = ["https://www.bigee.cc/book/146267/"]
Override start_requests to yield one request per chapter page:
def start_requests(self):
    for i in range(2, 442):
        url = f'https://www.bigee.cc/book/146267/{i}.html'
        # pass the chapter number along to parse via cb_kwargs
        yield scrapy.Request(url=url, callback=self.parse, cb_kwargs={'index': i})
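Note that once start_requests is defined, Scrapy no longer uses start_urls, so the table-of-contents URL above is never actually fetched and the chapter range is hard-coded. If you would rather discover the chapter links from the TOC page itself, a sketch along these lines could work (the '.listmain dd a' selector is an unverified assumption about the page's markup):

def start_requests(self):
    # fetch the table-of-contents page first
    yield scrapy.Request(url=self.start_urls[0], callback=self.parse_toc)

def parse_toc(self, response):
    # hypothetical selector for the chapter list
    links = response.css('.listmain dd a::attr(href)').getall()
    for i, href in enumerate(links, start=1):
        yield response.follow(href, callback=self.parse, cb_kwargs={'index': i})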
Write the parse callback that handles each response:
def parse(self, response, index):
    # all text nodes inside the chapter body
    text_list = response.css('#chaptercontent::text').extract()
    print(text_list)  # debug: inspect the extracted nodes
    # drop the last two text nodes, which are not part of the chapter
    chapter = ''.join(text_list[:-2])
    yield {
        'index': index,
        'text': chapter,
    }
Write the pipeline's process_item method to store each chapter's text and index in the MySQL database:
import pymysql

class JinDanShiHengXingPipeline:
    def process_item(self, item, spider):
        # note: this opens a new connection for every single item;
        # see the reusable-connection sketch below
        db = pymysql.connect(user='root', password='123456', host='localhost',
                             port=3306, db='scrapy_novel')
        cursor = db.cursor()
        sql = "insert into jin_dan(myindex, text) values(%s, %s)"
        cursor.execute(sql, (item['index'], item['text']))
        db.commit()
        db.close()
        print('inserted')
        return item
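Opening a fresh connection for every item works, but it is wasteful. A minimal sketch of a pipeline that opens one connection in open_spider, creates the table if it does not exist, and closes the connection in close_spider (the credentials and the column types are assumptions carried over from the code above):

import pymysql

class JinDanShiHengXingPipeline:
    def open_spider(self, spider):
        # one connection reused for the whole crawl
        self.db = pymysql.connect(user='root', password='123456', host='localhost',
                                  port=3306, db='scrapy_novel')
        self.cursor = self.db.cursor()
        # assumed schema: chapter index plus chapter text
        self.cursor.execute(
            "create table if not exists jin_dan("
            "myindex int primary key, text longtext)"
        )

    def process_item(self, item, spider):
        self.cursor.execute("insert into jin_dan(myindex, text) values(%s, %s)",
                            (item['index'], item['text']))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()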
Start the spider, passing -s JOBDIR so Scrapy persists the crawl state; if the run is interrupted, re-running the same command resumes where it left off:
scrapy crawl bigee -s JOBDIR=record/spider-1
The crawl results are shown below. The author begged for monthly-ticket votes once, and that plea took up a chapter of its own, so even though the book has 437 chapters in total, the database ends up with one extra row.
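Before moving on to synthesis, it can be worth sanity-checking the crawl by counting rows and looking for gaps in the index sequence (a small sketch against the same table):

import pymysql

db = pymysql.connect(user='root', password='123456', host='localhost',
                     port=3306, db='scrapy_novel')
cursor = db.cursor()
cursor.execute('select count(*), min(myindex), max(myindex) from jin_dan')
count, lo, hi = cursor.fetchone()
# if count != hi - lo + 1, some chapters failed and should be re-crawled
print(f'{count} rows, indices {lo}..{hi}')
db.close()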
Part 2: Reading the data back from MySQL in order and generating the audiobook
Enough talk; here's the code:
#!/usr/bin/env python3
"""
Read the chapters from MySQL in index order and synthesize
each one to an MP3 file with edge_tts.
"""
import asyncio

import edge_tts
import pymysql

VOICE = "zh-CN-YunxiNeural"
OUTPUT_DIR = "d:/"


async def amain(text, output_file) -> None:
    """Synthesize one chapter and save it as an MP3 file."""
    communicate = edge_tts.Communicate(text, VOICE)
    await communicate.save(output_file)


async def fetch_and_process(cursor):
    """Fetch all rows from the cursor and synthesize each chapter."""
    all_d = cursor.fetchall()
    for i, (text,) in enumerate(all_d, start=1):
        print(text[:100])  # preview the first 100 characters
        # fullwidth ?, since Windows forbids '?' in file names
        output_file = f"{OUTPUT_DIR}{i}金丹是恒星,你管这叫修仙?.mp3"
        await amain(text, output_file)


def main():
    db = pymysql.connect(user='root', password='123456', host='localhost',
                         port=3306, db='scrapy_novel')
    try:
        sql = 'select text from jin_dan order by myindex asc'
        cursor = db.cursor()
        cursor.execute(sql)
        # asyncio.run creates and closes the event loop for us
        asyncio.run(fetch_and_process(cursor))
    except Exception as e:
        print(f"Error: {e}")
    finally:
        db.close()


if __name__ == "__main__":
    main()
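Synthesizing a whole book takes a long time, so it may help to smoke-test edge_tts on a short string first (a minimal sketch reusing the same voice; the output path is arbitrary):

import asyncio
import edge_tts

async def smoke_test():
    # synthesize one short sentence to confirm the setup works
    communicate = edge_tts.Communicate("测试一下语音合成。", "zh-CN-YunxiNeural")
    await communicate.save("d:/test.mp3")

asyncio.run(smoke_test())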
Screenshots of a run:
1. The console
2. The D: drive directory
Part 3: Bugs:
1. During development, the console mysteriously inserted two line breaks, as shown below, leaving two blank lines:
I assumed some stray character had crept into the text and spent ages hunting for it. In the end it turned out the console was soft-wrapping the output on its own, presumably to make reading more comfortable: after printing the text twice, the second printout had no extra blank lines at the same position. That console's meddling cost me a lot of time.