Writing a Novel Scraper

Latest recommended article published 2024-06-24 18:45:00

记笔记专用

Category: python. Tags: novel scraper, python, html, fun

Article link: https://blog.csdn.net/int_i_dont_now/article/details/94634535

import re

import requests
from lxml import etree


def catalog_page():
    r = requests.get('https://b.faloo.com/f/611691.html')  # an arbitrary novel, picked as an example
    html = etree.HTML(r.text)
    # Chapter titles and chapter links, located by inspecting the page's HTML
    chapter_names = html.xpath('//tbody/tr/td[@class="td_0"]/a[@href]/text()')
    chapter_urls = html.xpath('//tbody/tr/td[@class="td_0"]/a/@href')
    for i in range(len(chapter_names)):
        # Strip characters that are not valid in file names
        chapter_names[i] = re.sub(r"[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+", "", chapter_names[i])
    return zip(chapter_names, chapter_urls)  # pair each chapter name with its link


def content_page(chapter):  # takes one (name, link) pair from catalog_page()
    r2 = requests.get('https:' + chapter[1])  # the links lack the "https:" scheme
    html2 = etree.HTML(r2.text)
    content = html2.xpath('//div[@id="content"]/text()')  # chapter body text
    contented = "\n".join(content)  # one paragraph per line
    # The target directory must already exist
    with open('C://test/novel/' + chapter[0] + '.txt', 'w', encoding='utf-8') as f:
        f.write(contented)


for each in catalog_page():
    content_page(each)

In the page's HTML, each chapter's name and link are stored in td elements under tbody, so extract those first.

These are the chapter names and chapter links. Note that the links do not include the https: scheme.
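Links that start with // are scheme-relative URLs. Prepending 'https:' by hand works, as the script above does. A slightly more general option is urljoin from the standard library, which resolves any href against the URL of the page it came from. The following is a minimal sketch. The helper name absolute_url and the chapter path are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

CATALOG_URL = "https://b.faloo.com/f/611691.html"  # catalog page used in the article


def absolute_url(href, base=CATALOG_URL):
    """Resolve an href found on the catalog page against the page's own URL."""
    return urljoin(base, href)


# Hypothetical href, shaped like a scheme-relative link ("//host/path")
print(absolute_url("//b.faloo.com/p/611691/1.html"))
```

The same call also handles hrefs that are already absolute or that are relative to the site root, so the scraper does not need a special case for each form.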

Use XPath to extract the titles and links

chapter_names = html.xpath('//tbody/tr/td[@class="td_0"]/a[@href]/text()')  # every td that has content carries class="td_0"; text() extracts the chapter name
chapter_urls = html.xpath('//tbody/tr/td[@class="td_0"]/a/@href')  # @href extracts the value of the href attribute, i.e. the chapter link
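The two queries above return two separate lists that are later zipped together, which assumes every link has exactly one text node. One way to keep each title and link paired is to select the a elements once and read both values from the same element. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree on a small, well-formed, made-up snippet so that it runs without extra dependencies. It is not the article's lxml code, and a real page would still need lxml's forgiving HTML parser. The sample markup, URLs, and the function name extract_chapters are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the catalog table
SAMPLE = """
<table><tbody>
  <tr>
    <td class="td_0"><a href="//example.invalid/p/1.html">Chapter 1</a></td>
    <td class="td_0"><a href="//example.invalid/p/2.html">Chapter 2</a></td>
  </tr>
  <tr><td class="td_1"></td></tr>
</tbody></table>
"""


def extract_chapters(markup):
    """Return (title, href) pairs, reading both values from the same <a> element."""
    root = ET.fromstring(markup.strip())
    anchors = root.findall(".//tbody/tr/td[@class='td_0']/a")
    return [(a.text, a.get("href")) for a in anchors if a.get("href")]


chapters = extract_chapters(SAMPLE)
print(chapters)
```

With lxml the same approach would select the a elements with html.xpath(...) and then read .text and .get('href') from each one.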

File names must be valid

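# Strip characters that are not valid in file names. Windows does not allow '/' in a file name,
# and one chapter title contains it: "第五章 不服来战，父子局【1/4】"
chapter_names[i] = re.sub(r"[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+", "", chapter_names[i])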
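The pattern above removes a broad set of punctuation, including characters that are harmless in file names. A narrower alternative, sketched here as an assumption rather than as the article's approach, replaces only the characters Windows reserves in file names and leaves the rest of the title readable. The name safe_filename and the underscore replacement are illustrative choices.

```python
import re

# Characters that Windows reserves and does not allow in file names
WINDOWS_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement="_"):
    """Replace characters Windows forbids in file names; keep everything else."""
    return WINDOWS_FORBIDDEN.sub(replacement, title).strip()


print(safe_filename("第五章 不服来战，父子局【1/4】"))
```

Replacing instead of deleting keeps "1/4" distinguishable from "14" in the resulting file name.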

content = html2.xpath('//div[@id="content"]/text()')  # chapter body text

The result consists entirely of text nodes.
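Text nodes taken straight from a page often carry leading indentation or are whitespace-only, so it can help to tidy them before joining. A minimal sketch follows. The sample nodes are invented to resemble the output of a text() query and are not copied from the site, and join_paragraphs is a hypothetical helper name.

```python
def join_paragraphs(text_nodes):
    """Strip surrounding whitespace from each text node and drop empty ones."""
    cleaned = (node.strip() for node in text_nodes)
    return "\n".join(line for line in cleaned if line)


# Hypothetical text nodes, shaped like the result of the text() query
sample_nodes = [
    "\r\n\u3000\u3000First paragraph.",
    "\r\n",
    "\u3000\u3000Second paragraph.\r\n",
]
print(join_paragraphs(sample_nodes))
```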

Finally, write the text to a file.
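Opening a file for writing fails if its folder does not exist yet. The sketch below shows one way to write a chapter while creating the folder on demand, using pathlib. The function name save_chapter, the folder name, and the chapter text are illustrative assumptions, and the demo writes into a temporary directory rather than the path used in the article.

```python
import tempfile
from pathlib import Path


def save_chapter(directory, name, text):
    """Write one chapter to <directory>/<name>.txt, creating the directory if needed."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    return path


# Demo in a throwaway directory; the folder and chapter names are made up
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(Path(tmp) / "novel", "chapter_001", "First paragraph.\nSecond paragraph.")
    print(saved.read_text(encoding="utf-8"))
```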
I don't read novels; this was purely for fun.
