I. Parsing Operations

1. Parsing data with XPath
- `response.xpath()`: xpath returns a list, but every element of that list is a Selector object (see the sketch after this list)
- `extract()`: pulls out the string stored in a Selector object's data attribute; calling extract() on the list extracts the data string from every Selector object in it
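A minimal, self-contained sketch of the two calls above, built on a standalone scrapy Selector and a made-up HTML snippet (the markup and variable names are illustrative, not taken from the project):

```python
# Illustrative only: build a Selector from a hypothetical HTML snippet to show
# what xpath() and extract() return.
from scrapy import Selector

html = "<div class='content'><span>line one</span><span>line two</span></div>"
sel = Selector(text=html)

spans = sel.xpath("//div[@class='content']/span/text()")
print(type(spans[0]))      # a Selector object
print(spans[0].extract())  # 'line one' -- the string stored in its data attribute
print(spans.extract())     # ['line one', 'line two'] -- extract() applied to every element
```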
II. Persistent Storage

1. Via terminal command

Can only store the return value of the parse method into a local text file.
- Use the command `scrapy crawl <spider name> -o ./<file name>.csv` to save the output locally.
```python
import scrapy


class QiubaiproSpider(scrapy.Spider):
    name = 'qiubaiPro'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        all_dic = []
        for joke in joke_list:
            # the author name is the text node of the h2 tag
            author = joke.xpath("./div/a[2]/h2/text()")[0].extract()
            print(author)
            # the content may span several text nodes, so join them into one string
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_dic.append(dic)
        # this returned list is what `scrapy crawl ... -o` serializes to the file
        return all_dic
```
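With the spider above, running `scrapy crawl qiubaiPro -o ./qiubai.csv` writes the returned list of dicts to a local CSV file with author and content columns.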
2. Via item pipelines

- Coding workflow:
- Parse the data
- Define the relevant fields in the Item class
```python
import scrapy


class ScrapyworkItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()
```
- Wrap the parsed data in an object of the item type
```python
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    ...

    def parse(self, response):
        ...
        item = ScrapyworkItem()
        item['author'] = author
        item['content'] = content
```
- Submit the item object to the pipeline for persistent storage
```python
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    ...

    def parse(self, response):
        ...
        item = ScrapyworkItem()
        item['author'] = author
        item['content'] = content
        # submit the item to the pipeline
        yield item
```
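For completeness, a sketch of what the elided parse method looks like once the parsing code from the first example is combined with the item submission; yielding inside the loop hands one item per entry to the pipeline:

```python
# Sketch only: the earlier parse() logic combined with item submission,
# yielding one item per entry instead of returning a list of dicts.
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    name = 'qiubaiPro'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        for joke in joke_list:
            author = joke.xpath("./div/a[2]/h2/text()")[0].extract()
            content = ''.join(joke.xpath(".//div[@class='content']/span/text()").extract())

            item = ScrapyworkItem()
            item['author'] = author
            item['content'] = content
            # each yielded item is passed to every enabled pipeline in turn
            yield item
```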
- In the pipeline class's process_item method, persist the data carried by the item object it receives
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyworkPipeline:
    # called once when the spider starts: open the output file
    def open_spider(self, spider):
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')
        print('spider started')

    # called for every item the spider yields
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item

    # called once when the spider finishes: close the file
    def close_spider(self, spider):
        self.fp.close()
        print('spider finished')
```
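The template imports ItemAdapter, but the code above never uses it. Below is a sketch of the same pipeline rewritten with ItemAdapter, which gives dict-like access whether the spider yields Item objects or plain dicts; the file name and field names match the example above:

```python
from itemadapter import ItemAdapter


class ScrapyworkPipeline:
    def open_spider(self, spider):
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ItemAdapter wraps the item so the same code works for Item objects and dicts
        adapter = ItemAdapter(item)
        self.fp.write(adapter['author'] + ':' + adapter['content'] + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
```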
- Enable the pipeline in the settings file
```python
ITEM_PIPELINES = {
    # 300 is the priority: the smaller the number, the higher the priority
    'scrapyWork.pipelines.ScrapyworkPipeline': 300,
}
```
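If the project later registers more than one pipeline, these priority numbers decide the order each item flows through them. The second class name below is hypothetical, purely to illustrate the ordering:

```python
# Sketch only: 'AnotherPipeline' is a hypothetical second pipeline class.
ITEM_PIPELINES = {
    'scrapyWork.pipelines.ScrapyworkPipeline': 300,  # lower number: runs first
    'scrapyWork.pipelines.AnotherPipeline': 400,     # runs second, receiving the item
                                                     # returned by the first pipeline
}
```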