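Today's exercise crawls the content of every chapter of Journey to the West (西游记) asynchronously. The approach has two steps: 1. synchronously fetch the title and ID (cid) of every chapter; 2. then, using each chapter's ID, asynchronously fetch that chapter's content.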
#https://dushu.baidu.com/api/pc/getCatalog?data={%22book_id%22:%224306063500%22}
#{title: "第一回 灵根育孕源流出 心性修持大道生", price_status: "0", cid: "11348571"}
#https://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4306063500","cid":"4306063500|11348571","need_bookinfo":1}
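import asyncio
import json

import aiofiles
import aiohttp
import requests

"""
1. Synchronous step: call getCatalog to get the cid and title of every chapter.
2. Asynchronous step: call getChapterContent to download the text of every chapter.
"""


async def aiodownload(session, title, b_id, cid):
    # The API expects the JSON object as a string in the `data` query parameter.
    data = json.dumps({
        "book_id": b_id,
        "cid": f"{b_id}|{cid}",
        "need_bookinfo": 1,
    })
    url = f"https://dushu.baidu.com/api/pc/getChapterContent?data={data}"
    try:
        async with session.get(url) as resp:
            dic = await resp.json()
        async with aiofiles.open(title, mode="a", encoding="utf-8") as f:
            await f.write(dic['data']['novel']['content'])  # write out the chapter text
    except (aiohttp.ClientError, asyncio.TimeoutError, OSError, KeyError) as e:
        # Calling aiodownload(...) again here without `await` would only create
        # a coroutine object that never runs, so report the failure instead.
        print(f"Failed to download {title}: {e!r}")


async def getCatalog(url, b_id):
    # Synchronous request: it runs once, before any download task starts.
    resp = requests.get(url)
    dic = resp.json()
    # Share one session across all downloads instead of opening one per chapter.
    async with aiohttp.ClientSession() as session:
        tasks = []
        for item in dic['data']['novel']['items']:
            title = item['title']
            cid = item['cid']
            # Prepare one asynchronous task per chapter.
            tasks.append(aiodownload(session, title, b_id, cid))
        # asyncio.wait() no longer accepts bare coroutines (Python 3.11+),
        # so gather them instead.
        await asyncio.gather(*tasks)


if __name__ == '__main__':
    b_id = "4306063500"
    url = 'https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"' + b_id + '"}'
    asyncio.run(getCatalog(url, b_id))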
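The download function above drops the output of json.dumps straight into the URL, so the query string contains raw quotes, braces and a `|`. A minimal sketch of building the same URL with explicit percent-encoding is shown here. The helper name build_chapter_url is made up for illustration, and whether the server needs the encoded form (as opposed to letting the HTTP library encode it) is an assumption I have not verified.

```python
import json
from urllib.parse import quote


def build_chapter_url(book_id: str, cid: str) -> str:
    """Hypothetical helper: build a getChapterContent URL with an encoded `data` value."""
    payload = {
        "book_id": book_id,
        "cid": f"{book_id}|{cid}",
        "need_bookinfo": 1,
    }
    # Compact separators keep the JSON free of spaces before encoding.
    data = json.dumps(payload, separators=(",", ":"))
    # quote() percent-encodes the quotes, braces, colons, commas and the "|".
    return "https://dushu.baidu.com/api/pc/getChapterContent?data=" + quote(data)


url = build_chapter_url("4306063500", "11348571")
print(url)
```

The encoded value decodes back to the same JSON with urllib.parse.unquote, so the request carries the same data as before.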
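What I learned:
1. How to write content to a file asynchronously, using import aiofiles:
    async with aiofiles.open(title,mode="a",encoding="utf-8")as f:
        await f.write(dic['data']['novel']['content'])#write out the novel text
2. How to convert a JSON-style dict into a string: data=json.dumps(data)
3. I ran into this error: ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host. From what I read, it is caused by sending requests too frequently. (The references below show how to fix it, but I cannot apply the fix yet: their code is synchronous, and I do not know how to adapt it to asynchronous code. I may come back and update this once I have figured it out.)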
https://blog.csdn.net/illegalname/article/details/77164521
https://blog.csdn.net/qq_40910788/article/details/84844464
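For the asynchronous case, one approach that may help with "requests too frequent" errors is to cap how many requests run at once with asyncio.Semaphore and to retry failed attempts after a growing delay. The sketch below uses only the standard library. fetch_with_retry, flaky_fetch and demo are hypothetical names, and flaky_fetch merely simulates the WinError 10054 failure instead of calling the real API. The limit of 5 and the delays are arbitrary guesses, not values tuned against this site.

```python
import asyncio


async def fetch_with_retry(fetch, *args, semaphore, retries=3, base_delay=1.0):
    """Run fetch(*args) under a concurrency limit, retrying on connection errors."""
    for attempt in range(1, retries + 1):
        try:
            # At most `semaphore`'s initial value of fetches run at the same time.
            async with semaphore:
                return await fetch(*args)
        except (ConnectionError, asyncio.TimeoutError):
            if attempt == retries:
                raise  # give up after the last attempt
            # Back off: base_delay, then 2x, then 4x, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def demo():
    calls = {"n": 0}

    async def flaky_fetch(cid):
        # Stand-in for a real download: fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionResetError("simulated WinError 10054")
        return f"chapter {cid}"

    semaphore = asyncio.Semaphore(5)  # allow at most 5 concurrent requests
    text = await fetch_with_retry(
        flaky_fetch, "11348571", semaphore=semaphore, base_delay=0.01
    )
    return text, calls["n"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the crawler, the same wrapper could probably be placed around the chapter download, with one shared semaphore passed to every task. With aiohttp, the except clause would likely also need aiohttp.ClientError, since aiohttp wraps most network failures in its own exception types.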