Scraping a Biquge novel with Python and saving it locally, one file per chapter

Latest recommended article published on 2024-06-24 18:45:00

熊言森林

Category: Crawlers. Tags: python, web crawler, Biquge

Article link: https://blog.csdn.net/qq_43304640/article/details/100598786

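I was bored in my dorm and wanted to write a small program. I had just seen an ad for Biquge, so I decided to scrape a novel from it to read. The script below is pieced together from scraping tutorials found online.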

It uses BeautifulSoup to parse the HTML fetched with requests. Documentation: http://beautifulsoup.readthedocs.io/zh_CN/latest/
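As a minimal offline sketch of the parsing step, the snippet below runs BeautifulSoup on a hand-written HTML fragment instead of a live page. The fragment, the sample hrefs, and the `parse_index` name are assumptions made for illustration. They are modeled on the `div#list > dl > dt/dd` layout that the script relies on, and the real site's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, shaped like the structure the script assumes
SAMPLE_HTML = """
<html><body>
<div id="list">
  <dl>
    <dt>Sample Novel</dt>
    <dd><a href="/book/0/1.html">Chapter 1</a></dd>
    <dd><a href="/book/0/2.html">Chapter 2</a></dd>
  </dl>
</div>
</body></html>
"""


def parse_index(html):
    """Return (title, [(chapter_name, href), ...]) from an index page."""
    soup = BeautifulSoup(html, "html.parser")
    list_div = soup.find("div", id="list")
    if list_div is None:
        raise ValueError("no <div id='list'> found")
    title = list_div.dl.dt.get_text(strip=True)
    # Skip any <dd> that has no link inside it
    chapters = [
        (dd.a.get_text(strip=True), dd.a["href"])
        for dd in list_div.dl.find_all("dd")
        if dd.a is not None
    ]
    return title, chapters


title, chapters = parse_index(SAMPLE_HTML)
print(title)
for name, href in chapters:
    print(name, href)
```

Keeping the parsing in a function that takes a string makes it possible to check the selectors against a saved copy of a page before sending any requests.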

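# -*- coding:utf-8 -*-
import os

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Index page of the novel to scrape. Change this URL for each run, and
    # make sure the local root directory below already exists.
    target = "https://www.biqubao.com/book/17570/"
    # Local root directory for the scraped text
    save_path = 'D:/pythontest/read'
    # Root URL of the Biquge site
    index_path = 'https://www.biqubao.com'

    req = requests.get(url=target)
    # Print the encoding requests picked by default. It did not match the
    # site's response, so override it with the gbk encoding the site uses.
    print(req.encoding)
    req.encoding = 'gbk'
    # Parse the HTML
    soup = BeautifulSoup(req.text, "html.parser")
    # Search the whole document (not only the first <div>) for div#list
    list_tag = soup.find_all('div', id="list")
    print('list_tag:', list_tag)
    # Novel title
    story_title = list_tag[0].dl.dt.get_text(strip=True)
    # Create a folder named after the novel if it does not exist yet
    dir_path = os.path.join(save_path, story_title)
    if not os.path.exists(dir_path):
        os.mkdir(dir_path)
    # Loop over every chapter, collecting its name and URL
    for dd_tag in list_tag[0].dl.find_all('dd'):
        # Skip entries without a link
        if dd_tag.a is None:
            continue
        # Chapter name
        chapter_name = dd_tag.a.get_text(strip=True)
        # Chapter URL
        chapter_url = index_path + dd_tag.a.get('href')
        # Fetch the chapter page and extract its body text
        chapter_req = requests.get(url=chapter_url)
        chapter_req.encoding = 'gbk'
        chapter_soup = BeautifulSoup(chapter_req.text, "html.parser")
        # The tag that holds the chapter text
        content_tag = chapter_soup.find(id="content")
        # Get the text and replace non-breaking spaces with newlines
        content_text = content_tag.text.replace('\xa0', '\n')
        # Write the chapter to a .txt file named after the chapter
        chapter_file = os.path.join(dir_path, chapter_name + '.txt')
        with open(chapter_file, 'w', encoding='utf-8') as f:
            f.write('Source URL: ' + chapter_url + '\n')
            f.write(content_text)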
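Two steps in the script are fragile in practice: decoding the page with the right encoding, and using a chapter title directly as a file name. The sketch below illustrates both using only the standard library. The helper names `decode_page` and `safe_filename` and the sample strings are hypothetical and are not part of the original script. The set of replaced characters assumes Windows file-name rules.

```python
import re


def decode_page(raw, declared="gbk"):
    """Decode raw response bytes, replacing undecodable bytes instead of failing."""
    return raw.decode(declared, errors="replace")


def safe_filename(name, replacement="_"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, name).strip(" .")
    return cleaned or "untitled"


# Made-up chapter title, encoded the way a GBK site would send it
raw = "第一章 开始".encode("gbk")

# Decoding with the wrong codec yields mojibake rather than an error
print(raw.decode("latin-1") == "第一章 开始")  # False
print(decode_page(raw) == "第一章 开始")       # True

# A title containing characters that would break open() on Windows
print(safe_filename("第1章: 什么?"))
```

In the script, `safe_filename(chapter_name)` could be applied just before the file path is built. Decoding `chapter_req.content` by hand is an alternative to assigning `chapter_req.encoding` and reading `chapter_req.text`.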
