A couple of days ago I finished downloading 《我的前半生》 (My First Half of Life). I wanted to download more, but not one at a time, since there are 239 titles in total, so I kept refining the program to handle them in batch. This time I'll go straight to the source code, so the next time I need something similar I won't have to rewrite it. Improving efficiency is something every programmer is keen on; after all, handing human chores over to machines is what we do. For me, Python has genuinely proven to be an effective language for getting things done faster in daily life and at work.
Enough preamble; here is the source code:
```python
'''
Batch-download all of Yi Shu's (亦舒) novels from 星月文学网
(https://www.xingyueboke.com/yishu/).
Crawler pipeline: requests -> bs4 -> txt
Python version: 3.7
OS: Windows 10
'''
import requests
import time
import sys
import os
import queue
from bs4 import BeautifulSoup

# A queue holds the URLs waiting to be fetched
q = queue.Queue()

# Fetch a page and return its HTML
def get_content(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }
        r = requests.get(url=url, headers=headers)
        r.encoding = 'utf-8'
        return r.text
    except Exception:
        s = sys.exc_info()
        print("Error '%s' happened on line %d" % (s[1], s[2].tb_lineno))
        return " ERROR "

# Parse a chapter page: save the chapter, then enqueue the next chapter's URL
def parse_content(content, story_name, story_path):
    soup = BeautifulSoup(content, 'html.parser')
    chapter = soup.find(name='h1', class_="post-title").text
    content = soup.find(name='div', id="nr1").text
    save(chapter, content, story_name)
    try:
        next1 = soup.find(name='nav', class_="mb2").find(name='ul').find_all('li')[1].find(name="a").get("href")
        # If the "next" link exists and doesn't point back to the story's index, enqueue it
        if next1 != story_path:
            q.put(base_url + next1)
    except Exception:
        print("Download finished")

# Append a chapter to the story's txt file
def save(chapter, content, story_name):
    filename = "./亦舒小说全集/" + story_name + ".txt"      # folder name: "Yi Shu complete novels"
    os.makedirs(os.path.dirname(filename), exist_ok=True)  # make sure the folder exists
    with open(filename, "a+", encoding='utf-8') as f:      # 'with' closes the file; the original called f.close without ()
        f.write(chapter + '\n')
        f.write("".join(content.split()) + '\n')

# Main program
def main():
    start_time = time.time()
    q.put(base_url)
    # Keep working while the queue is non-empty
    while not q.empty():
        content = get_content(q.get())
        soup = BeautifulSoup(content, 'html.parser')
        storyurl_list = soup.find_all(name='li', class_="hot-book")
        storyname_list = soup.find_all(name='h2', class_="pop-tit")
        story_count = len(storyurl_list)
        print("Yi Shu's complete works: %d titles in total" % story_count)
        for i in range(story_count):  # the original used range(0, story_count-1), which skipped the last title
            story_name = storyname_list[i].text
            story_name = story_name.replace("《", "").replace("》", "")
            print("Downloading: %s" % story_name)
            current_url = storyurl_list[i].find(name='a').get("href")
            story_path = current_url.split('/')[-1]
            q.put(current_url)
            # Fetch the story's index page and enqueue its first chapter
            while not q.empty():
                content = get_content(q.get())
                soup = BeautifulSoup(content, 'html.parser')
                first_url = soup.find(name='div', class_="book-list").find(name='ul').find_all('li')[0].find(name="a").get("href")
                q.put(first_url)
                # Walk the chapter chain until the queue runs dry
                while not q.empty():
                    content = get_content(q.get())
                    parse_content(content, story_name, story_path)
    end_time = time.time()
    project_time = end_time - start_time
    print('Elapsed time:', project_time)

# Site entry URL
base_url = 'https://www.xingyueboke.com/yishu'

if __name__ == '__main__':
    main()
```
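A note on the structure: a single shared queue `q` drives all three stages of the crawl, which are the top-level listing page, each novel's index page, and the chapter chain. `parse_content` enqueues each chapter's "next" link, so the innermost loop keeps running until it reaches a story's last chapter, at which point the next link points back to the story's own index page (the `next1 != story_path` check) and nothing new is enqueued.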
After running the code, the output looks like this:
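Based on the script's print statements and the 239-title count mentioned above, the console output should look roughly like this (one line per title as it downloads; titles elided):

```
Yi Shu's complete works: 239 titles in total
Downloading: 我的前半生
Downloading: ...
```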
You can see the novels downloading one after another. One issue remains: if the network or the connection drops, the script won't reconnect and resume on its own, so you have to run it again. Of course, you can change the start index of the loop to pick which title to resume from, as sketched below. If you enjoyed this article, give it a "再看" (Wow).
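As a rough idea for handling dropped connections, here is a minimal retry sketch. It is not part of the original script: the function name `get_content_with_retry` and the `retries`, `backoff`, and `timeout` values are my own hypothetical choices. Swapping this in for `get_content` would let transient network failures be retried instead of aborting the run; for resuming manually, the loop in `main` can likewise be changed to `for i in range(start_index, story_count):`.

```python
# Minimal retry sketch (hypothetical; not in the original script).
# Retries the fetch a few times with a fixed pause before giving up.
import time
import requests

def get_content_with_retry(url, retries=3, backoff=5):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    for attempt in range(1, retries + 1):
        try:
            r = requests.get(url, headers=headers, timeout=30)
            r.encoding = 'utf-8'
            return r.text
        except requests.RequestException as e:
            print("Attempt %d/%d failed: %s" % (attempt, retries, e))
            time.sleep(backoff)  # wait a bit before retrying
    return " ERROR "
```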