Crawling novels from 88读书网 with the Scrapy framework and saving them as local files

Using the Scrapy framework to crawl a novel from 88读书网 chapter by chapter.

Link:

88读书网

Source code

Tools

Python 3.7

PyCharm

Scrapy framework

Tutorial
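Before the code below will run, the project skeleton has to exist. Assuming you are creating it from scratch, the standard Scrapy CLI can generate a project named dushu and a spider named book (a sketch of the setup commands):

scrapy startproject dushu
cd dushu
scrapy genspider book www.x88dushu.com

genspider writes a stub at dushu/spiders/book.py; the spider code below replaces that stub.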

spider (book.py):

# -*- coding: utf-8 -*-
import scrapy
from dushu.items import DushuItem


class BookSpider(scrapy.Spider):
    name = 'book'
    # allowed_domains = ['x88dushu.com']
    start_urls = ['https://www.x88dushu.com/xiaoshuo/111/111516/']

    def parse(self, response):
        if response.url == self.start_urls[0]:
            # Table-of-contents page: follow every chapter link.
            self.logger.info('Visiting novel table of contents: ' + response.url)
            li_list = response.css('div.mulu ul li a')
            for li in li_list:
                # Each selector in li_list is already an <a> element,
                # so read its href attribute directly.
                link = li.css('::attr(href)').extract_first()
                # urljoin handles both relative and absolute hrefs.
                yield scrapy.Request(response.urljoin(link))
        else:
            # Chapter page: extract the title and the body text.
            self.logger.info('Visiting chapter page: ' + response.url)
            novel = response.css('div.novel')
            item = DushuItem()
            item['chapterName'] = novel.css('h1::text').extract_first()
            item['text'] = novel.css('div.yd_text2::text').extract()
            yield item
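The CSS selectors used above (div.mulu ul li a for the table of contents, div.novel h1 and div.yd_text2 for the chapter pages) are specific to this site's markup and will break if the layout changes. Scrapy's interactive shell is a handy way to verify them before a full crawl (a quick check, using the same start URL):

scrapy shell 'https://www.x88dushu.com/xiaoshuo/111/111516/'
>>> response.css('div.mulu ul li a::attr(href)').extract_first()
>>> response.css('div.mulu ul li a::text').extract_first()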

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DushuItem(scrapy.Item):
    # Chapter title
    chapterName = scrapy.Field()
    # Chapter body text (a list of text fragments)
    text = scrapy.Field()
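A scrapy.Item behaves like a dict, so the fields declared above can be set and read with subscript syntax, and an item can be converted to a plain dict for logging or export. A quick sketch with made-up values:

>>> from dushu.items import DushuItem
>>> item = DushuItem(chapterName='Chapter 1', text=['line one', 'line two'])
>>> dict(item)
{'chapterName': 'Chapter 1', 'text': ['line one', 'line two']}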

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os


class DushuPipeline(object):

    def open_spider(self, spider):
        # Make sure the output directory exists before any chapter is written.
        os.makedirs('mulu', exist_ok=True)

    def process_item(self, item, spider):
        # One .txt file per chapter, named after the chapter title.
        path = os.path.join('mulu', item['chapterName'] + '.txt')
        with open(path, 'w', encoding='utf-8') as f:
            for text in item['text']:
                f.write(text + '\n')
        return item
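One caveat: the chapter title becomes a file name, so characters that are illegal in file names (such as / or, on Windows, \:*?"<>|) would make open() fail. A small sanitizing helper could be added to the pipeline; sanitize_filename below is a hypothetical helper, not part of the original code:

import re

def sanitize_filename(name):
    # Replace characters that are invalid in Windows/Unix file names.
    return re.sub(r'[\\/:*?"<>|]', '_', name)

# In process_item, use the sanitized title:
# path = os.path.join('mulu', sanitize_filename(item['chapterName']) + '.txt')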

settings.py:

BOT_NAME = 'dushu'

SPIDER_MODULES = ['dushu.spiders']
NEWSPIDER_MODULE = 'dushu.spiders'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
}
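The settings above are the minimum needed to run. Since the spider requests every chapter of the novel, it can be polite to throttle the crawl; these are standard Scrapy settings, shown with illustrative values:

# Wait between requests so the site is not hammered (seconds).
DOWNLOAD_DELAY = 0.5
# Identify the client; some sites block the default Scrapy user agent.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'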

Running the crawler:

The URL of the novel to crawl:

start_urls = ['https://www.x88dushu.com/xiaoshuo/111/111516/']

Run from the command line:

scrapy crawl book
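The command must be run from the project root (the directory containing scrapy.cfg). If you also want all items in a single structured file in addition to the per-chapter .txt files, Scrapy's standard -o feed-export flag can write them out directly:

scrapy crawl book -o chapters.json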

Result:

When the crawl finishes, each chapter is saved as its own .txt file under the mulu/ directory, named after the chapter title.
