A couple of days ago I finished downloading 《我的前半生》 (My First Half of Life). I wanted to download more, but not one at a time, since there are 239 titles in total, so I kept refining the program to handle them in batch. This time I'll go straight to the source code, so the next time I need something similar I won't have to rewrite it. Improving efficiency is something every programmer is keen on; after all, handing human chores over to machines is what we do. For me, Python has genuinely proven to be an effective language for getting things done faster in daily life and at work.
Enough preamble; here is the source code:
```python
'''
Batch-download all of Yi Shu's (亦舒) novels from 星月文学网
(https://www.xingyueboke.com/yishu/).
Crawler pipeline: requests -> bs4 -> txt
Python version: 3.7
OS: Windows 10
'''
import requests
import time
import sys
import os
import queue
from bs4 import BeautifulSoup

# A queue holds the URLs waiting to be fetched
q = queue.Queue()

# Fetch a page and return its HTML
def get_content(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }
        r = requests.get(url=url, headers=headers)
        r.encoding = 'utf-8'
        return r.text
    except Exception:
        s = sys.exc_info()
        print("Error '%s' happened on line %d" % (s[1], s[2].tb_lineno))
        return " ERROR "

# Parse a chapter page: save the chapter, then enqueue the next chapter's URL
def parse_content(content, story_name, story_path):
    soup = BeautifulSoup(content, 'html.parser')
    chapter = soup.find(name='h1', class_="post-title").text
    content = soup.find(name='div', id="nr1").text
    save(chapter, content, story_name)
    try:
        next1 = soup.find(name='nav', class_="mb2").find(name='ul').find_all('li')[1].find(name="a").get("href")
        # If the "next" link exists and doesn't point back to the story's index, enqueue it
        if next1 != story_path:
            q.put(base_url + next1)
    except Exception:
        print("Download finished")

# Append a chapter to the story's txt file
def save(chapter, content, story_name):
    filename = "./亦舒小说全集/" + story_name + ".txt"      # folder name: "Yi Shu complete novels"
    os.makedirs(os.path.dirname(filename), exist_ok=True)  # make sure the folder exists
    with open(filename, "a+", encoding='utf-8') as f:      # 'with' closes the file; the original called f.close without ()
        f.write(chapter + '\n')
        f.write("".join(content.split()) + '\n')

# Main program
def main():
    start_time = time.time()
    q.put(base_url)
    # Keep working while the queue is non-empty
    while not q.empty():
        content = get_content(q.get())
        soup = BeautifulSoup(content, 'html.parser')
        storyurl_list = soup.find_all(name='li', class_="hot-book")
        storyname_list = soup.find_all(name='h2', class_="pop-tit")
        story_count = len(storyurl_list)
        print("Yi Shu's complete works: %d titles in total" % story_count)
        for i in range(story_count):  # the original used range(0, story_count-1), which skipped the last title
            story_name = storyname_list[i].text
            story_name = story_name.replace("《", "").replace("》", "")
            print("Downloading: %s" % story_name)
            current_url = storyurl_list[i].find(name='a').get("href")
            story_path = current_url.split('/')[-1]
            q.put(current_url)
            # Fetch the story's index page and enqueue its first chapter
            while not q.empty():
                content = get_content(q.get())
                soup = BeautifulSoup(content, 'html.parser')
                first_url = soup.find(name='div', class_="book-list").find(name='ul').find_all('li')[0].find(name="a").get("href")
                q.put(first_url)
                # Walk the chapter chain until the queue runs dry
                while not q.empty():
                    content = get_content(q.get())
                    parse_content(content, story_name, story_path)
    end_time = time.time()
    project_time = end_time - start_time
    print('Elapsed time:', project_time)

# Site entry URL
base_url = 'https://www.xingyueboke.com/yishu'

if __name__ == '__main__':
    main()
```
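A note on the structure: a single shared queue `q` drives all three stages of the crawl, which are the top-level listing page, each novel's index page, and the chapter chain. `parse_content` enqueues each chapter's "next" link, so the innermost loop keeps running until it reaches a story's last chapter, at which point the next link points back to the story's own index page (the `next1 != story_path` check) and nothing new is enqueued.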
After running the code, the output looks like this:
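Based on the script's print statements and the 239-title count mentioned above, the console output should look roughly like this (one line per title as it downloads; titles elided):

```
Yi Shu's complete works: 239 titles in total
Downloading: 我的前半生
Downloading: ...
```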
You can see the novels downloading one after another. One issue remains: if the network or the connection drops, the script won't reconnect and resume on its own, so you have to run it again. Of course, you can change the start index of the loop to pick which title to resume from, as sketched below. If you enjoyed this article, give it a "再看" (Wow).
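As a rough idea for handling dropped connections, here is a minimal retry sketch. It is not part of the original script: the function name `get_content_with_retry` and the `retries`, `backoff`, and `timeout` values are my own hypothetical choices. Swapping this in for `get_content` would let transient network failures be retried instead of aborting the run; for resuming manually, the loop in `main` can likewise be changed to `for i in range(start_index, story_count):`.

```python
# Minimal retry sketch (hypothetical; not in the original script).
# Retries the fetch a few times with a fixed pause before giving up.
import time
import requests

def get_content_with_retry(url, retries=3, backoff=5):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    for attempt in range(1, retries + 1):
        try:
            r = requests.get(url, headers=headers, timeout=30)
            r.encoding = 'utf-8'
            return r.text
        except requests.RequestException as e:
            print("Attempt %d/%d failed: %s" % (attempt, retries, e))
            time.sleep(backoff)  # wait a bit before retrying
    return " ERROR "
```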