Python Crawler: Collecting Jin Yong's Novels

Latest recommended article published 2024-05-23 21:34:38

Robert19831218

Tags: python, programming languages

Article link: https://blog.csdn.net/Robert19831218/article/details/138699807

~~~A good day starts with a successful crawl~~~

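I recently started working with ChatGPT and need a batch of text to train a GPT-2 model. My first idea was to crawl Jin Yong's novels from 金庸武侠网 (https://www.jinyongwuxia.com), and to keep essay-style text as a further data source for later.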

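The couplet 【飞雪连天射白鹿，笑书神侠倚碧鸳】 is built from the first characters of fourteen of the titles; adding 《越女剑》 (Sword of the Yue Maiden) makes 15 books in total. It is a good excuse to revisit the classics of my teenage years.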

1. Open the site and inspect it. All the books sit in a tags under the element with class "pop-books clearfix".

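Printing the links shows that each h2 tag normally carries one href whose text is the book title, but above it there is an extra link whose text looks empty. On closer inspection that text is a single newline, so the loop needs the filter href.text != '\n'.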

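import requests
from bs4 import BeautifulSoup

# headers is not shown in the original post; a browser-like User-Agent is assumed here
headers = {'User-Agent': 'Mozilla/5.0'}

base_url = 'https://www.jinyongwuxia.com'
resp = requests.get(base_url, headers=headers)
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, 'html.parser')
jinyong = soup.find('div', class_="pop-books clearfix")
hrefs = jinyong.find_all('a')
for href in hrefs:
    # skip the extra a tag whose text is only a newline
    if href.text != '\n':
        book_url = base_url + href['href']
        book_title = href.text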
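
The filtering step above can be tried offline on a small HTML snippet. What follows is a minimal sketch, not the site's real markup: the sample HTML, the function name `extract_books`, and the use of `strip()` (which also skips other whitespace-only link texts, a slight generalization of `!= '\n'`) are assumptions made for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the post.
SAMPLE_HTML = """
<div class="pop-books clearfix">
  <a href="/fei/">
</a>
  <h2><a href="/fei/">飞狐外传</a></h2>
  <a href="/xue/">
</a>
  <h2><a href="/xue/">雪山飞狐</a></h2>
</div>
"""

def extract_books(html, base_url):
    """Return (title, url) pairs for links whose text is not blank."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', class_="pop-books clearfix")
    if container is None:
        # the expected container is missing, so there is nothing to collect
        return []
    books = []
    for a in container.find_all('a'):
        title = a.get_text().strip()
        if title and a.get('href'):
            # urljoin copes with both relative and absolute href values
            books.append((title, urljoin(base_url, a['href'])))
    return books

books = extract_books(SAMPLE_HTML, 'https://www.jinyongwuxia.com')
for title, url in books:
    print(title, url)
```

Wrapping the logic in a function that takes the HTML as a string makes it possible to check the filter without sending a request to the site.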

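2. Next, go into each book. Taking 《飞狐外传》 (The Young Flying Fox) as an example, the chapter links sit under <div class="xsbox clearfix">:

Then collect the link and name of every chapter: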

# this fragment runs inside the loop above, once per book
book = requests.get(url=book_url, headers=headers)
book.encoding = 'utf-8'
book_soup = BeautifulSoup(book.text, 'html.parser')
book_text = book_soup.find('div', class_="xsbox clearfix")
chapters = book_text.find_all('a')
print(f"downloading {href.text}, pls wait...")

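3. Follow each chapter link. The content is all inside the div with class m-post.

The code below fetches the novel text through the link and prints a friendly message so the crawl progress is easy to follow: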

# chap_url is the full chapter link, i.e. base_url + chap['href'] for a chap in chapters
chapter_resp = requests.get(chap_url, headers=headers)
chapter_resp.encoding = chapter_resp.apparent_encoding
text_soup = BeautifulSoup(chapter_resp.text, 'html.parser')
novel = text_soup.find('div', class_="m-post")
novel_text = novel.find_all('p')
print(f"downloading {chap_url}, pls wait...")
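
Note that `find` returns `None` when no matching div exists, in which case the following `find_all` call raises `AttributeError`. A minimal sketch of a guarded version, run against a made-up page, might look like this; the function name `extract_paragraphs` and the sample HTML are assumptions, not part of the original script.

```python
from bs4 import BeautifulSoup

def extract_paragraphs(html):
    """Return the text of every <p> inside the m-post div, or [] if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    post = soup.find('div', class_="m-post")
    if post is None:
        # page layout differs from what we expect: report no content
        return []
    return [p.get_text() for p in post.find_all('p')]

# Hypothetical page bodies used only for this demonstration.
SAMPLE_PAGE = '<div class="m-post"><p>第一段</p><p>第二段</p></div>'
OTHER_PAGE = '<div class="other"><p>not novel text</p></div>'

print(extract_paragraphs(SAMPLE_PAGE))
print(extract_paragraphs(OTHER_PAGE))
```

Returning an empty list lets the caller decide whether a page without content should be skipped or logged, instead of stopping the whole crawl.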

Finally, loop over the contents of novel_text for each chapter and append them to that book's file:

for content in novel_text:
    # print(f"will write to {book_title}.txt")
    # the with block closes the file on exit, so no explicit f.close() is needed
    with open(f"D:/个人/xxx/金庸小说/{book_title}.txt", 'a', encoding='utf-8') as f:
        f.write(content.text)
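
The loop above reopens the file for every paragraph. One possible variation, sketched below, opens the file once per chapter and writes each paragraph on its own line. The helper name `append_paragraphs`, the added newline separator, and the temporary demo directory are assumptions for illustration; the original script writes the paragraphs back to back.

```python
import os
import tempfile

def append_paragraphs(path, paragraphs):
    """Append paragraphs to path, one per line, opening the file only once."""
    with open(path, 'a', encoding='utf-8') as f:
        for text in paragraphs:
            f.write(text + '\n')

# Demo in a temporary directory instead of the real output folder.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')
append_paragraphs(demo_path, ['first paragraph', 'second paragraph'])
append_paragraphs(demo_path, ['third paragraph'])

with open(demo_path, encoding='utf-8') as f:
    demo_content = f.read()
print(demo_content)
```

Because the file is opened in append mode, calling the helper once per page keeps the pages of a chapter in the order they were fetched.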

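This looked fine, but a problem showed up during the download: a chapter is not a single page but is split across several pages, possibly ten or more, as the page source shows.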

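Because the structure was not what I expected, I took a shortcut: I added exception handling and resumed the download only for books that had not been saved yet: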

# requires `import os`; the download code for the book goes inside this branch
if not os.path.exists(f"D:/个人/xxx/金庸小说/{book_title}.txt"):
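
The existence check can be isolated in a small helper and tried in a temporary folder. This is only a sketch: the name `needs_download`, the use of `pathlib`, and the demo directory are assumptions and do not appear in the original script.

```python
import tempfile
from pathlib import Path

def needs_download(out_dir, book_title):
    """Return True when no .txt file for this book exists in out_dir yet."""
    return not (Path(out_dir) / f"{book_title}.txt").exists()

# Demo: pretend one book was already saved by an earlier run.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "飞狐外传.txt").write_text("...", encoding="utf-8")

print(needs_download(demo_dir, "飞狐外传"))
print(needs_download(demo_dir, "雪山飞狐"))
```

A check like this only looks at whether the file exists, so a book whose download was interrupted halfway would also be skipped on the next run.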

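I had other tasks yesterday, so I left it at that. Today I improved it:

The idea is to split a chapter URL such as https://www.jinyongwuxia.com/fei/393.html. After chapter number 393 come 393_2, 393_3 and so on, incrementing by one each time. Once the new URL is built, check whether status_code is 200: if so, keep crawling, otherwise move on to the next chapter. Every page is appended to the novel's .txt file, which gives a full-text crawl: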

for chap in chapters:
    chap_url = base_url + chap['href']
    chap_split_url = chap_url.split(".h")
    print(chap_split_url)
    count = 1
    text_chapter = requests.get(chap_url, headers=headers)
    text_chapter.encoding = text_chapter.apparent_encoding
    while text_chapter.status_code == 200:  # check the status code
        print(f"downloading from {chap_url}")
        text_soup = BeautifulSoup(text_chapter.text, 'html.parser')
        novel = text_soup.find('div', class_="m-post")
        novel_text = novel.find_all('p')
        print(f"downloading {chap_url}, pls wait...")
        for content in novel_text:
            print(f"will write to {book_title}.txt")
            # the with block closes the file on exit, so no explicit f.close() is needed
            with open(f"D:/个人/xxx/金庸小说/{book_title}.txt", 'a', encoding='utf-8') as f:
                f.write(content.text)
        count += 1  # move on to the next page number
        chap_url = chap_split_url[0] + "_" + str(count) + ".html"  # build the URL of the next page
        text_chapter = requests.get(chap_url, headers=headers)
        print(text_chapter)
        text_chapter.encoding = text_chapter.apparent_encoding
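
The pagination logic can also be separated from the network calls so that it is easy to check. The sketch below is one possible way to do that, under stated assumptions: `page_url`, `crawl_chapter`, the `fetch` callable returning `(status_code, text)`, the `max_pages` safety limit, and the fake three-page chapter are all hypothetical and stand in for `requests.get` and the real site.

```python
def page_url(chap_url, page):
    """Return the URL of the given 1-based page of a chapter."""
    if page == 1:
        return chap_url
    # split on the last dot: '.../393.html' -> ('.../393', '.', 'html')
    stem, dot, ext = chap_url.rpartition('.')
    return f"{stem}_{page}{dot}{ext}"

def crawl_chapter(chap_url, fetch, max_pages=100):
    """Collect page bodies until fetch reports a status other than 200.

    fetch is a hypothetical callable that returns (status_code, text).
    max_pages is an assumed safety limit in case every URL answers 200.
    """
    pages = []
    for page in range(1, max_pages + 1):
        status, text = fetch(page_url(chap_url, page))
        if status != 200:
            break
        pages.append(text)
    return pages

# Fake fetcher standing in for requests.get: this chapter has three pages.
FAKE_SITE = {
    'https://www.jinyongwuxia.com/fei/393.html': 'page 1',
    'https://www.jinyongwuxia.com/fei/393_2.html': 'page 2',
    'https://www.jinyongwuxia.com/fei/393_3.html': 'page 3',
}

def fake_fetch(url):
    if url in FAKE_SITE:
        return 200, FAKE_SITE[url]
    return 404, ''

chapter_pages = crawl_chapter('https://www.jinyongwuxia.com/fei/393.html', fake_fetch)
print(chapter_pages)
```

In a real run, `fetch` could wrap `requests.get` and return `resp.status_code` together with `resp.text`. The page limit matters only if the server answers 200 for pages that do not exist, in which case a loop that relies on the status code alone would never stop.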