Crawling novels from 88读书网 with the Scrapy framework and saving them as local files

Using the Scrapy framework to crawl a novel from 88读书网 chapter by chapter.

Link:

88读书网

Source code

Tools

Python 3.7

PyCharm

Scrapy framework

Tutorial
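Before the code below will run, the project skeleton has to exist. Assuming you are creating it from scratch, the standard Scrapy CLI can generate a project named dushu and a spider named book (a sketch of the setup commands):

scrapy startproject dushu
cd dushu
scrapy genspider book www.x88dushu.com

genspider writes a stub at dushu/spiders/book.py; the spider code below replaces that stub.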

spider (book.py):

# -*- coding: utf-8 -*-
import scrapy
from dushu.items import DushuItem


class BookSpider(scrapy.Spider):
    name = 'book'
    # allowed_domains = ['x88dushu.com']
    start_urls = ['https://www.x88dushu.com/xiaoshuo/111/111516/']

    def parse(self, response):
        if response.url == self.start_urls[0]:
            # Table-of-contents page: follow every chapter link.
            self.logger.info('Visiting novel table of contents: ' + response.url)
            li_list = response.css('div.mulu ul li a')
            for li in li_list:
                # Each selector in li_list is already an <a> element,
                # so read its href attribute directly.
                link = li.css('::attr(href)').extract_first()
                # urljoin handles both relative and absolute hrefs.
                yield scrapy.Request(response.urljoin(link))
        else:
            # Chapter page: extract the title and the body text.
            self.logger.info('Visiting chapter page: ' + response.url)
            novel = response.css('div.novel')
            item = DushuItem()
            item['chapterName'] = novel.css('h1::text').extract_first()
            item['text'] = novel.css('div.yd_text2::text').extract()
            yield item
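The CSS selectors used above (div.mulu ul li a for the table of contents, div.novel h1 and div.yd_text2 for the chapter pages) are specific to this site's markup and will break if the layout changes. Scrapy's interactive shell is a handy way to verify them before a full crawl (a quick check, using the same start URL):

scrapy shell 'https://www.x88dushu.com/xiaoshuo/111/111516/'
>>> response.css('div.mulu ul li a::attr(href)').extract_first()
>>> response.css('div.mulu ul li a::text').extract_first()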

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DushuItem(scrapy.Item):
    # Chapter title
    chapterName = scrapy.Field()
    # Chapter body text (a list of text fragments)
    text = scrapy.Field()
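A scrapy.Item behaves like a dict, so the fields declared above can be set and read with subscript syntax, and an item can be converted to a plain dict for logging or export. A quick sketch with made-up values:

>>> from dushu.items import DushuItem
>>> item = DushuItem(chapterName='Chapter 1', text=['line one', 'line two'])
>>> dict(item)
{'chapterName': 'Chapter 1', 'text': ['line one', 'line two']}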

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os


class DushuPipeline(object):

    def open_spider(self, spider):
        # Make sure the output directory exists before any chapter is written.
        os.makedirs('mulu', exist_ok=True)

    def process_item(self, item, spider):
        # One .txt file per chapter, named after the chapter title.
        path = os.path.join('mulu', item['chapterName'] + '.txt')
        with open(path, 'w', encoding='utf-8') as f:
            for text in item['text']:
                f.write(text + '\n')
        return item
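One caveat: the chapter title becomes a file name, so characters that are illegal in file names (such as / or, on Windows, \:*?"<>|) would make open() fail. A small sanitizing helper could be added to the pipeline; sanitize_filename below is a hypothetical helper, not part of the original code:

import re

def sanitize_filename(name):
    # Replace characters that are invalid in Windows/Unix file names.
    return re.sub(r'[\\/:*?"<>|]', '_', name)

# In process_item, use the sanitized title:
# path = os.path.join('mulu', sanitize_filename(item['chapterName']) + '.txt')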

settings.py:

BOT_NAME = 'dushu'

SPIDER_MODULES = ['dushu.spiders']
NEWSPIDER_MODULE = 'dushu.spiders'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
}
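The settings above are the minimum needed to run. Since the spider requests every chapter of the novel, it can be polite to throttle the crawl; these are standard Scrapy settings, shown with illustrative values:

# Wait between requests so the site is not hammered (seconds).
DOWNLOAD_DELAY = 0.5
# Identify the client; some sites block the default Scrapy user agent.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'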

Running the crawler:

The URL of the novel to crawl:

start_urls = ['https://www.x88dushu.com/xiaoshuo/111/111516/']

Run from the command line:

scrapy crawl book
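The command must be run from the project root (the directory containing scrapy.cfg). If you also want all items in a single structured file in addition to the per-chapter .txt files, Scrapy's standard -o feed-export flag can write them out directly:

scrapy crawl book -o chapters.json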

Result:

When the crawl finishes, each chapter is saved as its own .txt file under the mulu/ directory, named after the chapter title.
