Tutorial: scraping a novel site with a Python crawler (full code included) #experts, please go easy on me

Latest recommended article published 2024-08-07 09:00:00

伦杰周

Category: Scraping Novels; tags: python, web scraping

Article link: https://blog.csdn.net/weixin_51214928/article/details/111855175

1. Three libraries do most of the work

import parsel
import requests
import os

2. Parse the web pages with parsel

3. Save each novel to disk with the os library
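
A minimal sketch of the saving step, assuming one folder per book and one .txt file per chapter. The helper names `safe_name` and `save_chapter` are hypothetical and not part of any library. Replacing special characters is an extra precaution assumed here, since a title scraped from a page may contain characters that are not allowed in file names.

```python
import os
import re
import tempfile

def safe_name(name, default='untitled'):
    """Replace characters that are invalid in file names on common systems."""
    if not name:
        return default
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    return cleaned or default

def save_chapter(root, book, chapter, text):
    """Write one chapter to root/book/chapter.txt and return the file path."""
    folder = os.path.join(root, safe_name(book))
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, safe_name(chapter) + '.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path

# Usage: write into a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(tmp, 'Book 123', 'Chapter 1: What?', 'Hello\n')
    saved_name = os.path.basename(saved)
    with open(saved, encoding='utf-8') as f:
        content = f.read()
    print(saved_name, repr(content))
```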

4. The link

Novel category listing:
https://www.17k.com/all/book/2_24_0______1.html
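
The listing URL ends in `_1.html`. Assuming that trailing number is the page index (an assumption based only on the shape of the URL), the URLs of the other pages can be sketched as follows. Note that `str.replace` returns a new string rather than modifying the original, so its result has to be kept. The function name `listing_url` is hypothetical.

```python
BASE = 'https://www.17k.com/all/book/2_24_0______1.html'

def listing_url(page):
    """Return the listing URL for a page, assuming the trailing number is the page index."""
    # Strings are immutable: replace() returns a new string and leaves BASE untouched
    return BASE.replace('_1.html', '_%d.html' % page)

urls = [listing_url(p) for p in range(1, 4)]
for u in urls:
    print(u)
```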

Without further ado, here is the code

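import requests
import parsel
import os
# Spoof the User-Agent (placeholder value; use a real browser string if the site rejects it)
headers = {
    'User-Agent': 'asfasf'
}
# Download one chapter page and save it as a text file
def xiaoshuo(url1):
    # Send the request and receive the response object
    response = requests.get(url=url1, headers=headers)
    # Decode the body as UTF-8 before reading the text
    response.encoding = 'utf-8'
    html = response.text
    nei = parsel.Selector(html)  # parse the page
    title = nei.css('h1::text').get()  # chapter title, used as the file name
    # Used as the folder name. You could pick the novel's title here; for now I use the book number.
    title1 = nei.css('span.c9::text').get()
    print('Downloading:', title)
    paragraphs = nei.css('p::text').getall()  # chapter text
    print('Chapter content fetched')
    text = ''
    for line in paragraphs:
        text += line + '\n'
    # Persist the data to disk
    # Wrap the write in try so one bad chapter doesn't stop the crawl
    try:
        os.makedirs(title1, exist_ok=True)  # create the folder
        with open(os.path.join(title1, title + '.txt'), 'w', encoding='utf-8') as f:
            f.write(text)
        print('Written successfully')
        print('\n')
    except (OSError, TypeError) as e:
        # TypeError: title or title1 was None; OSError: the path could not be written
        print('Skipped', url1, e)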

# Crawl all the text
if __name__ == '__main__':
    for page in range(0, 500):
        # str.replace returns a new string, so build each page URL directly
        url = 'https://www.17k.com/all/book/2_24_0______%d.html' % page
        response = requests.get(url=url, headers=headers)
        html = response.text
        jiexi = parsel.Selector(html)  # parse the listing page
        # Links to each novel on this listing page
        # The hrefs come back incomplete, so a prefix is added below
        huodelianjie = jiexi.css('dl>dt>a::attr(href)').getall()
        print('Novel links fetched')
        https = 'https:'
        https1 = 'https://www.17k.com'
        for book in huodelianjie:
            a = https + book
            response2 = requests.get(a, headers=headers)
            response2.encoding = 'utf-8'
            nei = parsel.Selector(response2.text)
            # Link to the book's table of contents; also incomplete
            mulu = nei.css('dt.read>a::attr(href)').getall()
            for toc in mulu:
                b = https1 + toc
                # Parse the novel's table-of-contents page
                response3 = requests.get(b, headers=headers)
                response3.encoding = 'utf-8'
                t = parsel.Selector(response3.text)  # parse the page
                # Chapter links; these are incomplete too
                zhangjie = t.css('dl.Volume>dd>a::attr(href)').getall()
                for chapter in zhangjie:
                    c = https1 + chapter
                    xiaoshuo(c)  # download and save this chapter
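
The code above completes the scraped links by concatenating fixed prefixes (`'https:'` and `'https://www.17k.com'`). One alternative, sketched here with the standard library's `urllib.parse.urljoin`, resolves each href against the URL of the page it came from, so links missing only the scheme and links missing the host are handled the same way. The example hrefs are made up for illustration.

```python
from urllib.parse import urljoin

page_url = 'https://www.17k.com/all/book/2_24_0______1.html'

# Made-up hrefs showing the two incomplete forms, plus one that is already absolute
hrefs = [
    '//www.17k.com/book/123.html',           # scheme missing
    '/list/123.html',                        # scheme and host missing
    'https://www.17k.com/chapter/1/2.html',  # already complete
]

full_links = [urljoin(page_url, h) for h in hrefs]
for link in full_links:
    print(link)
```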

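I have only just started learning web scraping with Python and am not yet fluent with functions and third-party libraries, so the code may be clumsier than it needs to be. Treat it as a learning reference. If you have suggestions for simplifying it, I would be glad to hear them. Thanks.
I have been at this for only three months, so please point out anything I have misunderstood.
My approach to scraping the novels:
1. Get the link to each novel
2. Then get the links to each novel's chapters
3. Finally, scrape the chapter text
That is roughly the approach.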
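
The three steps above can be sketched as nested loops, with the network and parsing details passed in as functions. This is only an outline under stated assumptions: `crawl`, `get_books`, `get_chapters`, and `download` are hypothetical names, and the fake site in the usage example is invented so the flow can run without touching the network.

```python
import time

def crawl(listing_urls, get_books, get_chapters, download, delay=0.0):
    """Walk listing pages -> books -> chapters and download each chapter.

    get_books(url) and get_chapters(url) return lists of URLs;
    download(url) saves one chapter. Returns the number of chapters downloaded.
    """
    count = 0
    for listing in listing_urls:
        for book in get_books(listing):          # step 1: links to each novel
            for chapter in get_chapters(book):   # step 2: links to each chapter
                try:
                    download(chapter)            # step 3: fetch and save the text
                    count += 1
                except Exception as exc:
                    # Report and continue so one bad chapter doesn't stop the crawl
                    print('skipped', chapter, exc)
                time.sleep(delay)                # optional pause between requests
    return count

# Usage with an invented in-memory "site" instead of real HTTP requests
fake_site = {
    'list/1': ['book/a', 'book/b'],
    'book/a': ['a/1', 'a/2'],
    'book/b': ['b/1'],
}
downloaded = []
total = crawl(['list/1'], fake_site.get, fake_site.get, downloaded.append)
print(total, downloaded)
```

Keeping the loop separate from the fetching code makes it possible to try the flow on fake data first and plug in the real request functions afterwards.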
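I wrote this post to take stock of what I can do so far, and I hope the experts will point out which parts of the code need improvement.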