Scraping a novel with Python: batch-downloading chapters (step by step, suitable for beginners)

weixin_39945679

Published 2020-11-20 23:30:27


1. Downloading a single chapter of a novel

Let's start by opening one chapter of a novel on the Shuquge (书趣阁) site.

Then we request the page data:

import requests

response = requests.get('http://www.shuquge.com/txt/63542/9645082.html')
# Detect the page encoding automatically to avoid garbled text
response.encoding = response.apparent_encoding
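To see why the encoding line matters, here is a small offline sketch. The sample string and the choice of GBK are assumptions for illustration; the article does not say which encoding the site actually uses. When a server sends a text page without declaring a charset, requests falls back to ISO-8859-1, and decoding Chinese bytes that way produces garbage, which is what setting `response.encoding` from `response.apparent_encoding` is meant to avoid.

```python
# Offline illustration of a wrong-vs-right decode; no network request is made.
# The sample text and the GBK encoding are assumptions for this demo.
sample = "第一章 小说"
raw = sample.encode("gbk")       # bytes as a GBK-encoded page would send them

wrong = raw.decode("latin-1")    # decoding with the wrong codec gives mojibake
right = raw.decode("gbk")        # decoding with the matching codec restores the text

print(repr(wrong))
print(right)
```

The same bytes yield readable text only when the codec matches, so guessing the encoding from the content is a reasonable default for a beginner script.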

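Parse the data with the parsel library:

There are generally three ways to parse the data: regular expressions, XPath expressions, and CSS selectors. Here we use CSS selectors.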
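For comparison, the regular-expression approach mentioned above might look like the sketch below. It uses only the standard library, and the inline HTML is a made-up stand-in for a real chapter page, not the site's actual markup.

```python
import re

# A tiny inline page standing in for a real chapter; the markup is an assumption.
html = (
    '<div class="content"><h1>Chapter 1</h1>'
    '<div id="content">First line.<br/>Second line.</div></div>'
)

# Regex approach: capture whatever sits between <h1> and </h1>.
match = re.search(r"<h1>(.*?)</h1>", html, flags=re.S)
title = match.group(1).strip() if match else None

print(title)
```

A pattern like this breaks as soon as the tag gains attributes or nested markup, which is one reason a selector-based approach, as used in the parsel code that follows, tends to be easier to maintain.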

import parsel

# Turn the response text into a Selector object
sel = parsel.Selector(response.text)
# ::text extracts the text of the matched element
title = sel.css('.content h1::text').get()  # alternatively: #wrapper>div.book.reader>div.content>h1
content = sel.css('#content::text').getall()  # alternatively: .content div.showtxt

Here, ::text is the text extractor, and the selector passed to sel.css() can be obtained as follows:

Open the browser's developer tools, find the chapter title in the Inspector, then right-click --> Copy --> CSS Selector.

After that, we can save the chapter to a .txt file:

# Save the chapter text
with open(title+'.txt', mode='w', encoding='utf-8') as f:
    f.write(title+'\n')
    for i in content:
        f.write(i.strip()+'\n')

Here, .strip() removes the leading and trailing whitespace of each line (it does not touch spaces inside the line).
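A quick sketch of what `.strip()` does and does not remove. The sample line is made up; the full-width spaces (`\u3000`) are an assumption about how such pages indent paragraphs.

```python
# Made-up line with full-width indentation, inner double spaces and a line ending.
line = "\u3000\u3000  Some text with  inner  spaces.\r\n"

stripped = line.strip()               # removes leading/trailing whitespace only
collapsed = " ".join(line.split())    # also collapses runs of inner whitespace

print(repr(stripped))
print(repr(collapsed))
```

If inner spacing also needs cleaning, the `split()`/`join()` idiom is one option; for novel text, plain `.strip()` is usually enough.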

2. Downloading all chapters of the novel

First, wrap the single-chapter code above in a function:

def download_one_chapter(url):
    # Request the chapter page and fix its encoding
    response = requests.get(url)
    response.encoding = response.apparent_encoding
    # Extract the chapter title and the text lines
    sel = parsel.Selector(response.text)
    title = sel.css('.content h1::text').get()
    content = sel.css('#content::text').getall()
    # Write the chapter to a file named after its title
    with open(title+'.txt', mode='w', encoding='utf-8') as f:
        f.write(title+'\n')
        for i in content:
            f.write(i.strip()+'\n')
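One thing the function above does not guard against is a title that is missing (`.get()` returns `None` when nothing matches) or that contains characters not allowed in file names. The sketch below shows one possible way to handle that. `safe_filename` and `save_chapter` are hypothetical helpers, not part of the original article, and unlike the original loop this version skips lines that are empty after stripping.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title, fallback="untitled"):
    """Hypothetical helper: make a chapter title usable as a file name."""
    if not title:                      # covers None and empty strings
        return fallback
    # Replace characters that are not allowed in file names on common systems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or fallback


def save_chapter(title, lines, directory):
    """Hypothetical helper: write one chapter into `directory`, return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / (safe_filename(title) + ".txt")
    with open(path, mode="w", encoding="utf-8") as f:
        f.write((title or "") + "\n")
        for line in lines:
            text = line.strip()
            if text:                   # skip lines that are only whitespace
                f.write(text + "\n")
    return path


# Usage with made-up data, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
saved = save_chapter("Chapter 1: What/Why?", ["  first  ", "\n", "second"], demo_dir)
print(saved.name)
```

Keeping the file-writing step separate from the request also makes it possible to try the saving logic without touching the network.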

Then go back to the novel's table-of-contents page and, in the same way, use the Inspector to find the trailing part of each chapter's URL (the last few digits):

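# Request the table-of-contents page and collect every chapter's address
url = 'http://www.shuquge.com/txt/5809/index.html'
response = requests.get(url)
response.encoding = response.apparent_encoding
sel = parsel.Selector(response.text)
index = sel.css('.listmain dd a::attr(href)').getall()
# Start from the 13th link; the article does not say why the first 12 are
# skipped (presumably they repeat the latest chapters at the top of the list)
for i in index[12:]:
    download_one_chapter('http://www.shuquge.com/txt/5809/'+i)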

Here, index holds those trailing URL parts, and the selector passed to sel.css() is obtained the same way as before. ::attr(href) extracts the value of the href attribute.
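Concatenating the base path and the href works as long as every href is a bare file name. As a hedged alternative, the standard library's `urljoin` resolves each href against the table-of-contents URL and also copes with absolute paths. The href values below are made up for illustration; they are not taken from the real page.

```python
from urllib.parse import urljoin

index_url = "http://www.shuquge.com/txt/5809/index.html"

# Made-up href values of the kind the table-of-contents page might return.
hrefs = ["1001.html", "1002.html", "/txt/5809/1003.html"]

# Resolve every href against the page it was found on.
chapter_urls = [urljoin(index_url, href) for href in hrefs]
for chapter_url in chapter_urls:
    print(chapter_url)
```

Each resulting URL could then be passed to `download_one_chapter` in place of the manually concatenated string.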

Original link: https://blog.csdn.net/qq_36758914/article/details/105147425
