[Python Crawler] Scraping Novels from Biquge (Partial)

ZoomToday

Category: Python Learning. Tags: python, novel crawler, css, parsel

Article link: https://blog.csdn.net/qq_36477513/article/details/104834629

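The site I chose to scrape is Biquge (笔趣阁). It has no anti-scraping measures, so it is relatively easy to crawl. The same approach works for other novel sites, or for lottery and stock data; the stock market these past few days has truly been one for the history books. The source code is available on my gitee or as a CSDN download.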

Downloading a single chapter

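# Download a single chapter
import requests
import parsel

def download_one_chapter(target_url='http://www.shuquge.com/txt/8659/25441893.html'):
    """Download one chapter and save it to a .txt file named after its title."""
    # Fetch the page source
    response = requests.get(target_url)
    response.encoding = response.apparent_encoding
    html = response.text

    # Extract the chapter title and body text from the page source
    sel = parsel.Selector(html)
    title = sel.css('.content h1::text').extract_first()
    contents = sel.css('#content::text').extract()

    # Clean the data: strip whitespace from both ends of each fragment
    contents1 = [content.strip() for content in contents]
    # print(contents1)
    text = '\n'.join(contents1)
    # print(text)

    # Save the chapter; write() only accepts strings,
    # and the with block closes the file automatically
    with open(title + '.txt', mode='w', encoding='utf-8') as file:
        file.write(title + '\n')
        file.write(text)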

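Notes:

target_url is the URL being requested; response is the object the server returns, which holds the response content.

The page source obtained from response can come out garbled. Assigning response.apparent_encoding to response.encoding lets requests guess the encoding from the content itself, which fixes the problem.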
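To see why the encoding matters, here is a small offline sketch. It assumes, purely for illustration, that the page is served as GBK bytes (the original post does not say which encoding the site uses) and that the client falls back to ISO-8859-1, which is what requests does for text responses that declare no charset.

```python
# Offline illustration of garbled text; no network request is made.
original = '第一章 开始'

# Assumption: the server sends the page as GBK-encoded bytes.
raw_bytes = original.encode('gbk')

# Decoding with the wrong codec produces garbled characters.
garbled = raw_bytes.decode('iso-8859-1')

# Decoding with the matching codec recovers the original text.
recovered = raw_bytes.decode('gbk')

print(garbled != original)    # True
print(recovered == original)  # True
```

Setting response.encoding before reading response.text plays the role of choosing the matching codec here.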

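Data can be parsed out of a page with regular expressions (mostly for string matching), json (dictionaries), xpath (path extraction), or CSS selectors. Here CSS selectors are used to extract the title and the body text. In the browser's page inspector you can right-click an element and copy its CSS selector path.

On the list of selectors returned by css(), extract() is an alias of getall(); in an IDE you can Ctrl+left-click the function to view its definition.

The CSS selector returns contents as a list, while write() only accepts a string. So a list comprehension strips the whitespace from both ends of every item to produce contents1, and join() then turns that list into a single string.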
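The cleaning step can be tried on its own without touching the network. The sketch below is a variation rather than the author's exact code: besides stripping each fragment it also drops fragments that end up empty, which the original list comprehension does not do. The helper name clean_fragments and the sample data are made up for illustration.

```python
def clean_fragments(fragments):
    """Strip each text fragment and drop the ones that end up empty."""
    stripped = (fragment.strip() for fragment in fragments)
    return '\n'.join(fragment for fragment in stripped if fragment)


# Hypothetical fragments, shaped like what a '::text' selector might return.
sample = ['\r\n    First paragraph.  ', '\n', '    Second paragraph.\r\n']
print(clean_fragments(sample))
```

Dropping the empty fragments avoids blank lines in the saved file; whether that is desirable depends on how the page separates paragraphs.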

Getting each chapter's URL and downloading multiple chapters

# Get the link of every chapter from the book's table-of-contents page
def get_chapters_links(target_url):
    """Return the relative link of every chapter listed on the table-of-contents page."""
    # e.g. target_url = 'http://www.shuquge.com/txt/8659/index.html'
    response = requests.get(target_url)
    response.encoding = response.apparent_encoding
    html = response.text

    # Extract the links with a CSS selector
    sel = parsel.Selector(html)
    links = sel.css('dd a::attr(href)').extract()
    for link in links:
        print('http://www.shuquge.com/txt/8659/' + link)
    return links

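The CSS selector picks the href attribute of every a tag under a dd tag. Each href holds one chapter's id, and concatenating it with the book's base URL gives that chapter's full URL.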
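As an alternative to concatenating strings by hand, the standard library's urllib.parse.urljoin can resolve each relative link against the table-of-contents URL, so the base path does not have to be hard-coded. This is a minimal sketch; the helper name build_chapter_urls is hypothetical, and the sample link reuses the chapter id from the single-chapter example.

```python
from urllib.parse import urljoin


def build_chapter_urls(index_url, links):
    """Resolve each relative chapter link against the table-of-contents URL."""
    return [urljoin(index_url, link) for link in links]


index_url = 'http://www.shuquge.com/txt/8659/index.html'
links = ['25441893.html']
print(build_chapter_urls(index_url, links))
# ['http://www.shuquge.com/txt/8659/25441893.html']
```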

Downloading a whole book

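def get_one_book(book_url):
    """Download every chapter listed on the book's table-of-contents page."""
    links = get_chapters_links(book_url)
    for link in links:
        # Note: the base URL is hard-coded for book 8659
        # print('http://www.shuquge.com/txt/8659/' + link)
        download_one_chapter('http://www.shuquge.com/txt/8659/' + link)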

This combines the two functions above: it first collects each chapter's URL, then downloads the chapters one by one.
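When looping over many chapters, a single failed request would otherwise abort the whole book. The sketch below is one possible variation, not the author's code: it pauses between requests and collects the URLs that failed. The function name download_chapters, the delay parameter, and the idea of passing the per-chapter downloader in as an argument are all assumptions made for illustration.

```python
import time


def download_chapters(chapter_urls, download, delay=1.0):
    """Call download(url) for each URL, pausing between requests.

    Returns the list of URLs whose download raised an exception,
    so the caller can retry them later.
    """
    failed = []
    for index, url in enumerate(chapter_urls):
        if index > 0:
            time.sleep(delay)  # be polite to the server between requests
        try:
            download(url)
        except Exception as error:
            print(f'failed: {url} ({error})')
            failed.append(url)
    return failed
```

With the functions from this article it could be called as download_chapters(chapter_urls, download_one_chapter), where chapter_urls is the list of full chapter URLs.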

Finally

A possible next step is to fetch the catalogue of every novel on the site and download the whole site's novels.
