Python Crawler: Storing Scraped Data in Multiple Ways with Scrapy

Article link: https://blog.csdn.net/W_O_Z/article/details/132610035

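Straight to the point. The requirement is to store the scraped data in two ways. To keep the demo simple, one way prints each record to the terminal and the other writes it to a txt file.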

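Note: this example is based on CrawlSpider. For background on the Scrapy concepts used below, see my earlier post "Python爬虫之明星框架scrapy的基础使用" (Basic Usage of Scrapy, the Star Framework for Python Crawlers).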

Here is the code.

Crawl.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from CrawlDemo.items import CrawldemoItem


class DoubanCrawlSpider(CrawlSpider):
    name = "Crawl"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://movie.douban.com/top250"]

    # Link extractor: extracts the links that match the given rule (allow="regex")
    link = LinkExtractor(allow=r"start=\d+&filter=")

    rules = (
        Rule(link, callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        li_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li')

        for li in li_list:
            name = li.xpath('./div/div[2]/div[1]/a/span[1]/text()').extract_first()

            item = CrawldemoItem()
            item['title'] = name

            yield item
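
To make the `allow` rule concrete, here is a minimal sketch using only the standard `re` module. It assumes the link extractor applies the pattern as a regex search against each candidate URL; the sample URLs are illustrative, not taken from a real crawl.

```python
import re

# Same pattern as the LinkExtractor above
pattern = re.compile(r"start=\d+&filter=")

# Illustrative URLs (assumed for this sketch)
candidate_urls = [
    "https://movie.douban.com/top250?start=25&filter=",
    "https://movie.douban.com/top250?start=225&filter=",
    "https://movie.douban.com/subject/0000000/",
]

# Only the pagination links contain "start=<digits>&filter="
followed = [url for url in candidate_urls if pattern.search(url)]

for url in followed:
    print(url)
```

Only the two pagination URLs match, so under this assumption those are the pages that would be passed to `parse_item`.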

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawldemoItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    # pass

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CrawldemoPipeline:
    # Handles item objects: receives each item submitted by the spider and is called once per item
    def process_item(self, item, spider):
        title = item['title']

        print(title)
        print("=" * 30)

        return item


class DemoPipeline:
    # File object
    fp = None

    # Hook that Scrapy calls only once, when the spider starts
    def open_spider(self, spider):
        self.fp = open('./film.txt', 'w', encoding='utf-8')

    # Handles item objects: receives each item submitted by the spider and is called once per item
    def process_item(self, item, spider):
        title = item['title']

        self.fp.write(title + '\n')

        return item

    # Hook that Scrapy calls only once, when the spider finishes
    def close_spider(self, spider):
        print("Finished writing the file...")
        self.fp.close()
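
Why does each `process_item` end with `return item`? The returned item is what gets handed to the next pipeline. The sketch below imitates that hand-off without Scrapy. `run_pipelines`, `PrintPipeline` and `FilePipeline` are hypothetical stand-ins written for this illustration, and the real engine does more (for example, it handles dropped items).

```python
import io


class PrintPipeline:
    """Stand-in for CrawldemoPipeline: prints each title."""

    def process_item(self, item, spider):
        print(item["title"])
        return item  # hand the item on to the next pipeline


class FilePipeline:
    """Stand-in for DemoPipeline: writes each title to a file-like object."""

    def open_spider(self, spider):
        self.fp = io.StringIO()  # in-memory substitute for film.txt

    def process_item(self, item, spider):
        self.fp.write(item["title"] + "\n")
        return item

    def close_spider(self, spider):
        self.saved = self.fp.getvalue()  # keep the text before closing
        self.fp.close()


def run_pipelines(items, pipelines, spider=None):
    """Hypothetical driver: open once, process every item in order, close once."""
    for pipeline in pipelines:
        if hasattr(pipeline, "open_spider"):
            pipeline.open_spider(spider)
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)
    for pipeline in pipelines:
        if hasattr(pipeline, "close_spider"):
            pipeline.close_spider(spider)


file_pipeline = FilePipeline()
run_pipelines(
    [{"title": "Movie A"}, {"title": "Movie B"}],
    [PrintPipeline(), file_pipeline],
)
```

If `PrintPipeline.process_item` returned nothing, `FilePipeline` would receive `None` in this sketch and fail on `item["title"]`, which is why both pipelines in pipelines.py return the item.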

settings.py

The routine changes to the configuration file are not shown here.
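
For both storage methods to run, both pipeline classes have to be registered in `ITEM_PIPELINES`. A minimal sketch follows, assuming the project package is named `CrawlDemo` (inferred from the import in Crawl.py). The priority values 300 and 301 are arbitrary choices for this example; Scrapy runs pipelines with lower numbers first.

```python
# settings.py (sketch): register both pipelines; lower numbers run first
ITEM_PIPELINES = {
    "CrawlDemo.pipelines.CrawldemoPipeline": 300,  # prints each title to the terminal
    "CrawlDemo.pipelines.DemoPipeline": 301,       # writes each title to film.txt
}
```

With these values each item is printed first and then written to the file.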