Using pipelines and items in the Scrapy framework to extract and save data

weixin_46837101

Category: Crawler series. Tags: python

Article link: https://blog.csdn.net/weixin_46837101/article/details/106953986

1. Extracting items with Scrapy

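To extract data from web pages, Scrapy uses a mechanism called selectors, which is based on XPath and CSS expressions.

Selectors provide the following basic methods:

Method	Description
extract()	Returns all of the selected data as a list of unicode strings
extract_first()	Returns the first selected unicode string, or None if nothing matched
re()	Returns a list of unicode strings extracted by applying the regular expression passed as its argument
xpath()	Returns a list of selectors representing the nodes selected by the XPath expression passed as its argument
css()	Returns a list of selectors representing the nodes selected by the CSS expression passed as its argument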

2. Scrapy Shell

To try out a selector quickly and see what it returns, use the Scrapy Shell:
scrapy shell "http://www.163.com"

Note: on Windows the URL must be enclosed in double quotes.

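3. Writing the output to a file: ways to save the data

3.1 Native Python

# movie_name and movie_core are lists of strings extracted earlier
with open("movie.txt", "wb") as f:
    for n, c in zip(movie_name, movie_core):
        line = n + ":" + c + "\n"
        f.write(line.encode())  # the file is opened in binary mode, so encode to bytes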

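3.2 Scrapy's built-in export

Scrapy has four main built-in export formats: JSON, JSON lines, CSV, and XML.

JSON is the most commonly used one. To export the results as JSON, run the following command in the console:

scrapy crawl dmoz -o douban.json -t json

-o is followed by the output file name and -t by the export format. -t can be omitted, in which case the format is inferred from the file extension:

scrapy crawl qidian -o qidian.json
scrapy crawl qidian -o qidian.csv
scrapy crawl qidian -o qidian.xml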

3.3 Unicode escape sequences in JSON files saved by Scrapy

Add the following setting to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'
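
The escapes appear because JSON serialization falls back to ASCII-only output when no encoding is configured. The standard-library sketch below reproduces the same effect with json.dumps. It is an analogy under that assumption, not Scrapy's exporter code, and the sample dict is made up.

```python
import json

item = {"name": "霸王别姬", "score": "9.5"}  # sample data for illustration

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters readable, which is the effect
# that FEED_EXPORT_ENCODING = 'utf-8' has on the exported JSON file.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data; only the written form differs.
assert json.loads(escaped) == json.loads(readable)
```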

3.4 Garbled characters in CSV files saved by Scrapy

Add the following setting to settings.py:

FEED_EXPORT_ENCODING = 'gb18030'
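
Since the JSON and CSV cases call for different values of the same global setting, a per-file configuration may be more convenient. The fragment below is a sketch that assumes Scrapy 2.1 or later, where the FEEDS setting is available; the file names are illustrative.

```python
# settings.py (sketch): one entry per output file, each with its own
# format and encoding. The file names here are illustrative.
FEEDS = {
    "qidian.json": {"format": "json", "encoding": "utf-8"},
    "qidian.csv": {"format": "csv", "encoding": "gb18030"},
}
```

With this in place, scrapy crawl qidian writes both files without any -o option.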

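4. Printing and saving extracted data in Scrapy, using pipelines and items

To extract data from an ordinary HTML site, inspect the page source to work out the XPath expressions. Inspection shows that the data sits inside a ul tag, with each entry in an li element.
The following code shows how the different kinds of data are extracted: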


# -*- coding: utf-8 -*-
import scrapy


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    allowed_domains = ['qidian.com']
    start_urls = ['https://www.qidian.com/rank/yuepiao?chn=21']

    def parse(self, response):
        # Book titles and author names, extracted as two parallel lists
        names = response.xpath('//h4/a/text()').extract()
        authors = response.xpath('//p[@class="author"]/a[1]/text()').extract()

        # print(names, ':', authors)
        # Pair each title with its author and return one dict per book
        books = []
        for name, author in zip(names, authors):
            books.append({"name": name, "author": author})
        return books

Example of returned content (this sample uses the movie_name and movie_core keys from section 3.1, so it appears to come from a movie-ranking spider rather than from the Qidian spider above):

{'movie_name': ['肖申克的救赎', '霸王别姬', '这个杀手不太冷', '阿甘正传', '美丽人生', '千与千寻', '泰坦尼克号', '辛德勒的名单', '盗梦空间', '机器人总动员', '海上钢琴师', '三傻大闹宝莱坞', '忠犬八公的故事', '放牛班的春天', '大话西游之大圣娶亲', '教父', '龙猫', '楚门的世界', '乱世佳人', '熔炉', '触不可及', '天堂电影院', '当幸福来敲门', '无间道', '星际穿越'], 'movie_core': ['9.6', '9.5', '9.4', '9.4', '9.5', '9.2', '9.2', '9.4', '9.3', '9.3', '9.2', '9.1', '9.2', '9.2', '9.2', '9.2', '9.1', '9.1', '9.2', '9.2', '9.1', '9.1', '8.9', '9.0', '9.1']}

4.1 Printing and saving the extracted data with pipelines

# -*- coding: utf-8 -*-
import scrapy

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/films?showType=3']

    def parse(self, response):
        names = response.xpath('//div[@class="channel-detail movie-item-title"]/a/text()').extract()

        # scores_div = response.xpath('//div[@class="channel-detail channel-detail-orange"]')
        # scores = []
        # for score in scores_div:
        #     scores.append(score.xpath('string(.)').extract_first())

        # Shorthand for the loop above
        scores = [score.xpath('string(.)').extract_first() for score in
                  response.xpath('//div[@class="channel-detail channel-detail-orange"]')]

        # Console output, method 1: print directly
        # for name, score in zip(names, scores):
        #     print(name, ':', score)

        # Console output, method 2: use yield to send each result to the pipeline
        # Each result must be a dict or an Item
        # Here a dict is yielded

        for name, score in zip(names, scores):
            yield {'name': name, 'score': score}
       
       
# pipelines.py: print each item
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class SpiderPipeline:
    def process_item(self, item, spider):
        print(item)
        # Return the item so that any later pipeline still receives it
        return item

Note:
        By default, items handled by pipelines are printed among the log output.
        The pipeline has to be enabled in settings.py:

ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,
}
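
The numbers in ITEM_PIPELINES set the order: pipelines with lower values run first, and each one receives whatever the previous process_item returned. The sketch below imitates that hand-off in plain Python to show why process_item should return the item. The names run_pipelines, PrintPipeline, and UpperCasePipeline are hypothetical, and this is not Scrapy's actual engine code.

```python
class PrintPipeline:
    def process_item(self, item, spider):
        print("got:", item)
        return item  # hand the item on to the next pipeline


class UpperCasePipeline:
    def process_item(self, item, spider):
        item["name"] = item["name"].upper()
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through the pipelines in ascending priority order."""
    for pipeline, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
        item = pipeline.process_item(item, spider)
    return item


# Same shape as ITEM_PIPELINES: pipeline mapped to its priority.
pipelines = {UpperCasePipeline(): 400, PrintPipeline(): 300}
result = run_pipelines({"name": "inception", "score": "9.3"}, pipelines)
print(result)
```

If PrintPipeline returned nothing, UpperCasePipeline would receive None and fail when indexing it.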

# Saving to a file
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Approach 1
import json

class SpiderPipeline:
    def process_item(self, item, spider):
        # Mode 'a' appends, but the file is reopened for every item, which adds overhead
        with open('movie.txt', 'a', encoding='utf-8') as f:
            # item is a dict here; json.dumps converts it to a string
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
        print(item)
        return item
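Approach 2: open the file once when the spider starts and close it when the spider finishes
import json

class SpiderPipeline:
    def open_spider(self, spider):
        # Called once when the spider is opened
        self.filename = open('movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item is a dict here; json.dumps converts it to a string
        self.filename.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed
        self.filename.close()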
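
To see the three hooks of the second approach work together without running a crawl, they can be called by hand in the order Scrapy calls them. This is only a sketch: JsonLinesPipeline is a hypothetical variant that takes the file path as a constructor argument so the demo can write to a temporary directory, and the items are made-up sample data.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    def __init__(self, path="movie.txt"):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(item, ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), "movie.txt")
pipeline = JsonLinesPipeline(path)

# Same call order as in a crawl: open once, process each item, close once.
pipeline.open_spider(spider=None)
for item in [{"name": "教父", "score": "9.2"}, {"name": "龙猫", "score": "9.1"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```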

4.2 Printing and saving the extracted data with items

Wrapping the extracted content in an Item
A Scrapy spider extracts data from web pages. Scrapy uses the Item class to produce the output objects that hold the scraped data. An Item object behaves like a custom Python dict: field values can be read and written with standard dict syntax.

4.2.1 Definition
# -*- coding: utf-8 -*-
import scrapy
from spider.items import MovieItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/films?showType=3']

    def parse(self, response):
        names = response.xpath('//div[@class="channel-detail movie-item-title"]/a/text()').extract()

        # scores_div = response.xpath('//div[@class="channel-detail channel-detail-orange"]')
        # scores = []
        # for score in scores_div:
        #     scores.append(score.xpath('string(.)').extract_first())

        scores = [score.xpath('string(.)').extract_first() for score in
                  response.xpath('//div[@class="channel-detail channel-detail-orange"]')]

        for name, score in zip(names, scores):
            # Create a new item for each movie so that earlier items are not overwritten
            item = MovieItem()
            item['names'] = name
            item['scores'] = score
            yield item
 
# items.py: item definition
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    names = scrapy.Field()
    scores = scrapy.Field()
    
    
    
Saving the data
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Approach 1
import json

class SpiderPipeline:
    def process_item(self, item, spider):
        # Mode 'a' appends, but the file is reopened for every item, which adds overhead
        with open('movie.txt', 'a', encoding='utf-8') as f:
            # An Item cannot be passed to json.dumps directly; doing so raises
            # TypeError: Object of type MovieItem is not JSON serializable
            # so convert it to a plain dict first
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        # print(item)
        return item

Approach 2
import json

class SpiderPipeline:
    def open_spider(self, spider):
        # Called once when the spider is opened
        self.filename = open('movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # An Item cannot be passed to json.dumps directly; doing so raises
        # TypeError: Object of type MovieItem is not JSON serializable
        # so convert it to a plain dict first
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed
        self.filename.close()
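
Both approaches write one JSON object per line, so the saved file can be loaded back line by line. The sketch below uses only the standard library; load_items is a hypothetical helper, and the demo first writes a small sample file to a temporary directory in the same format.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a file with one JSON object per line into a list of dicts."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Write a small sample file in the same format the pipelines produce.
path = os.path.join(tempfile.mkdtemp(), "movie.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"names": "教父", "scores": "9.2"}, ensure_ascii=False) + "\n")
    f.write(json.dumps({"names": "龙猫", "scores": "9.1"}, ensure_ascii=False) + "\n")

items = load_items(path)
print(items)
```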
