Web Scraping - Scrapy Data Parsing and Persistent Storage


I. Parsing Operations

1. Parsing data with xpath

  • response.xpath() : returns a list (a SelectorList), and every element in that list is a Selector object

  • extract() : pulls out the string stored in a Selector object's data attribute; calling extract() on the list extracts the data string from every Selector object in it (see the short sketch after this list)
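
A minimal sketch of the difference, using a standalone Selector on a small HTML snippet rather than a spider response:

from scrapy.selector import Selector

html = '<div><span>hello</span><span>world</span></div>'
sel_list = Selector(text=html).xpath('//span/text()')

print(sel_list[0])            # a Selector object; the string sits in its data attribute
print(sel_list[0].extract())  # 'hello'  - the string from a single Selector
print(sel_list.extract())     # ['hello', 'world']  - the strings from every Selector in the list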

II. Persistent Storage

1. Via terminal command

This approach can only persist the return value of the parse method to a local file.

  • Command: scrapy crawl <spider name> -o ./<filename>.csv saves the output to a local file
import scrapy


class QiubaiproSpider(scrapy.Spider):
    name = 'qiubaiPro'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # each post on the page is wrapped in a div whose class contains 'article block'
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        all_dic = []
        for joke in joke_list:
            # xpath() returns a SelectorList; extract() pulls the string out of each Selector
            author = joke.xpath("./div/a[2]/h2/text()")[0].extract()
            print(author)
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_dic.append(dic)
        # the returned list is what the -o option serializes to a file
        return all_dic
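
With the spider above, the command from the bullet becomes the following; the feed format is inferred from the file extension, so .json or .xml would work as well:

scrapy crawl qiubaiPro -o ./qiubai.csv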

2. Via item pipelines

  • Workflow:
    • Parse the data
    • Define the corresponding fields in the item class
import scrapy

class ScrapyworkItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()
- Wrap the parsed data in an item object
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    ...

    def parse(self, response):
        ...

            item = ScrapyworkItem()
            item['author'] = author
            item['content'] = content
- Submit the item object to the pipeline for persistent storage
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    ...

    def parse(self, response):
        ...

            item = ScrapyworkItem()
            item['author'] = author
            item['content'] = content

            # submit the item to the pipeline
            yield item
- In the pipeline class's process_item method, persist the data carried by each item object it receives
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyworkPipeline:

    # open_spider runs once when the spider starts: open the output file here
    def open_spider(self, spider):
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')
        print('spider started')

    # called once for every item object the spider yields
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item

    # close_spider runs once when the spider finishes: close the file here
    def close_spider(self, spider):
        self.fp.close()
        print('spider finished')
- Enable the pipeline in the settings.py configuration file
ITEM_PIPELINES = {
    # 300 is the priority: the lower the number, the earlier the pipeline runs
    'scrapyWork.pipelines.ScrapyworkPipeline': 300,
}
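
When several pipelines are registered, every item passes through all of them in ascending priority order, which is why process_item ends with return item. A minimal sketch, assuming a hypothetical second pipeline class JsonWritePipeline added to the same pipelines.py:

import json

# hypothetical second pipeline: writes each item as one JSON line
class JsonWritePipeline:

    def open_spider(self, spider):
        self.fp = open('./qiubai.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # the item has already been through ScrapyworkPipeline (300 runs before 400)
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

# settings.py: the lower the number, the earlier the pipeline runs
ITEM_PIPELINES = {
   'scrapyWork.pipelines.ScrapyworkPipeline': 300,
   'scrapyWork.pipelines.JsonWritePipeline': 400,
}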