Quickly Crawling a Novel Site's Bookshelf Rankings with Multithreading

Latest recommended article published on 2024-03-28 21:42:46

超级小刀-技术

Category: Data Crawlers (数据爬虫)
Tags: python, pycharm

Article link: https://blog.csdn.net/wzhibin/article/details/134833157

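This post uses a Python crawler with multithreading to download several novels quickly from a novel site; the target site is Biquge (笔趣阁). The main goal is to try out several Python scraping techniques and to use a thread pool with multiple worker threads to fetch the content of many novels in bulk. The downloaded text is saved as plain .txt files, which can be read locally in any novel reader.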

All of the crawler code has been uploaded to a network drive. Follow the official account 【站在前沿】 and reply "biqu" to get the download link.

I. What the running code looks like (see screenshot)

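II. Requirements and basic approach

1. Data source analysis

Starting from the detail (chapter) page and working back to the table-of-contents page, inspect the network traffic of each page in turn to find where the data comes from.

2. Implementation steps

Implement the download of the detail page first and work out which parameters it needs. Then implement the code for the page before it, work out the parameters that page needs, and so on. Part of the code follows.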

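import concurrent.futures
import logging
import os.path
import pprint
import re

import parsel

# Helper module used throughout this post (its source is not shown here).
# It provides getResText, postResJson and checkFileName.
import until

# Browser-like request headers sent with every request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}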

def getChapter(bookid):
    """Fetch a book's index page; return (bookName, {chapterUrl: volumeName})."""
    url = f'http://www.1biqug.net/book/{bookid}/'
    html = until.getResText(url, headers)
    # Regex alternative:
    # bookName = re.findall('<h1>(.*?)</h1>', html)
    # listUrl = re.findall('<dd><a style=".*?" href="(.*?)">.*?</a></dd>', html)
    # XPath approach. Only the first volume name is read; the second volume name
    # on the page is extracted incorrectly and is ignored for now.
    sel = parsel.Selector(html)
    bookName = sel.xpath('//*[@id="info"]/h1/text()').get()
    volumeName = sel.xpath('//div[@class="list"]/dl/dt/text()').get() or ''
    listUrl = sel.xpath('//div[@class="list"]/dl/dd/a/@href').getall()
    # Keep only the text after the closing 》 of the book title, if there is one.
    volumeName = volumeName.split('》', 1)[-1]
    # Every chapter URL found on this page is filed under that single volume name.
    chapter = {u: volumeName for u in listUrl}
    return bookName, chapter


def getOtherChapter(bookid):
    """Fetch the chapter list from the site's action.php endpoint; return {chapterUrl: volumeName}."""
    postUrl = 'http://www.1biqug.net/action.php'
    data = {
        'action': 'clist',
        'bookid': str(bookid),
    }
    chapter = {}
    postJson = until.postResJson(postUrl, headers, data)
    # pprint.pprint(postJson)
    for v in postJson['columnlist']:
        volumeName = v['columnname']
        for c in v['chapterlist']:
            # The page id in the URL is chapterid + bookid * 11, so both must be ints here.
            chapterUrl = f"/book/{bookid}/{c['chapterid'] + bookid * 11}.html"
            chapter[chapterUrl] = volumeName
    return chapter


def getContent(chapterUrl):
    """Download one chapter page; return (title, content)."""
    chapterUrl = 'http://www.1biqug.net' + chapterUrl
    resText = until.getResText(chapterUrl, headers)
    # Option 1: regular expressions
    # title = re.findall('<h1>(.*?)</h1>', resText)[0]
    # content = re.findall('<div id="content">(.*?)<span id="contenttips"', resText, re.S)[0].replace('<br/><br/>', '')
    selector = parsel.Selector(resText)
    # Option 2: CSS selectors
    # title = selector.css('.bookname h1::text').get()
    # content = '\n'.join(selector.css('#content::text').getall())
    # Option 3 (used here): XPath
    title = selector.xpath('//*[@class="bookname"]/h1/text()').get()
    content = '\n'.join(selector.xpath('//*[@id="content"]/text()').getall())
    return title, content


def save(bookName, volumeName, title, content):
    """Append one chapter to book/<bookName>/<volumeName>/<title>.txt."""
    savePath = os.path.join('book', until.checkFileName(bookName), until.checkFileName(volumeName))
    # exist_ok=True avoids a race when several threads create the same folder.
    os.makedirs(savePath, exist_ok=True)
    filePath = os.path.join(savePath, until.checkFileName(title) + '.txt')
    with open(filePath, mode='a', encoding='utf-8') as f:
        f.write(title)
        f.write('\n')  # keep the title on its own line
        f.write(content)
        f.write('\n')
    logging.info('File %s saved', filePath)


def run(bookName, volumeName, chapterUrl):
    # print(chapterUrl)
    title, content = getContent(chapterUrl)
    save(bookName, volumeName, title, content)
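
The excerpt above imports concurrent.futures but does not show the part that drives the thread pool. The following is a minimal sketch of how that step could look, not the author's actual code. The name `crawl_book`, its parameters, and the default of 8 workers are assumptions made for illustration. The worker is passed in as an argument so that the block runs on its own.

```python
import concurrent.futures
import logging


def crawl_book(book_name, chapters, worker, max_workers=8):
    """Call worker(book_name, volume_name, chapter_url) for every chapter in a thread pool.

    `chapters` maps chapter URL -> volume name, the same shape getChapter returns.
    `crawl_book` is a hypothetical helper, not part of the original post.
    Returns the list of chapter URLs whose worker raised an exception.
    """
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which chapter each future belongs to so failures can be reported.
        futures = {
            pool.submit(worker, book_name, volume_name, chapter_url): chapter_url
            for chapter_url, volume_name in chapters.items()
        }
        for future in concurrent.futures.as_completed(futures):
            chapter_url = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
            except Exception:
                logging.exception('chapter %s failed', chapter_url)
                failed.append(chapter_url)
    return failed
```

With the functions from this post, a call might look like `bookName, chapter = getChapter(4831)` followed by `crawl_book(bookName, chapter, run)`. Note that the `logging.info` call in `save` stays silent unless logging is configured, for example with `logging.basicConfig(level=logging.INFO)`.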

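
The code in this post calls `until.checkFileName` before using a book, volume, or chapter title as a folder or file name, but the helper itself is not shown. As a rough idea of what such a helper needs to do, here is a hypothetical stand-in named `check_file_name`. The set of replaced characters, the `_` replacement, and the `untitled` fallback are all assumptions, and the author's real implementation may differ.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def check_file_name(name, replacement='_'):
    """Return `name` with characters that are invalid in Windows file names replaced.

    Hypothetical stand-in for until.checkFileName, which the post does not show.
    """
    cleaned = _INVALID_CHARS.sub(replacement, name)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.strip().rstrip('.')
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or 'untitled'
```

Chapter titles scraped from a page often contain characters such as `:` or `?`, so sanitizing them before building the save path keeps `open` from failing partway through a long crawl.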