About the Scrapy pipelines file (Part 1)

上官轩言

Tags: web crawler

Article link: https://blog.csdn.net/stitches_fly/article/details/139333212

At the request of the CSDN assistant, here is the part on pipelines!

For the spider itself, see the author's other article, "Scrapy爬虫基础讲解及案例" (CSDN blog).

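1.1 Define the data operations in pipelines.py

Define a pipeline class
Override the pipeline class's process_item method
After processing an item, process_item must return it so the engine can pass it on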

class MyspiderPipeline:
    def process_item(self, item, spider):
        print('itcast', item)
        # After the pipeline has handled the item, return it to the engine
        return item
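
Because a pipeline is an ordinary Python class, the process_item contract can be checked without starting a crawl. The following is only a minimal sketch, assuming the item is a plain dict. The sample item fields (`name`, `title`) and the `None` passed in place of the spider are placeholders for illustration, not values from the original project.

```python
class MyspiderPipeline:
    def process_item(self, item, spider):
        print('itcast', item)
        # Return the item so it continues to the next stage
        return item


# Hypothetical sample item; in a real project the spider would yield it
sample_item = {"name": "example", "title": "lecturer"}

pipeline = MyspiderPipeline()
# This pipeline never uses the spider argument, so None stands in for it
result = pipeline.process_item(sample_item, spider=None)
```

In this sketch `result` is the same object that went in. If process_item had no return statement, `result` would be None, and that is what the next stage would receive.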

1.2 Configure and enable the pipeline

✔ Run the spider before the pipeline is enabled

✔ Enable the pipeline in settings.py

ITEM_PIPELINES = {
    "myspider.pipelines.MyspiderPipeline": 300,   # package.module.pipeline class
    "myspider.pipelines.MyspiderPipeline1": 299,  # the lower the value, the earlier it runs
}

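Enabled item pipelines: this line in the startup log lists the pipelines that have been activated

More than one pipeline can be enabled at the same time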
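
The effect of the priority numbers can be illustrated without Scrapy. The sketch below is a simplified simulation, not Scrapy's actual implementation: the body of `MyspiderPipeline1`, the `run_pipelines` helper, and the `seen_by` field are all assumptions made for this example, and the dictionary is keyed by class rather than by import path to keep it self-contained.

```python
class MyspiderPipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("MyspiderPipeline")
        return item


class MyspiderPipeline1:
    def process_item(self, item, spider):
        item["seen_by"].append("MyspiderPipeline1")
        return item


# Same priorities as in the settings, keyed by class for this sketch
ITEM_PIPELINES = {
    MyspiderPipeline: 300,
    MyspiderPipeline1: 299,
}


def run_pipelines(item, pipelines):
    """Pass the item through each pipeline, lowest priority value first."""
    for cls, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
        item = cls().process_item(item, spider=None)
    return item


result = run_pipelines({"seen_by": []}, ITEM_PIPELINES)
print(result["seen_by"])  # ['MyspiderPipeline1', 'MyspiderPipeline']
```

The pipeline with value 299 sees the item first, and whatever it returns is what the pipeline with value 300 receives.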

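✔ Run the spider after the pipeline is enabled (the log output is long and the VS Code terminal cannot hold all of it, so use a terminal that ships with Windows: Command Prompt or PowerShell)

✔ Check the printed output

Red box: output printed automatically by the log
Yellow box: output printed by the pipeline itself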

1.3 Write the output to a file

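from itemadapter import ItemAdapter
import json

class MyspiderPipeline:
    def open_spider(self, spider):
        # Open (create) itcast.json for writing when the spider starts
        self.file = open('itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print('itcast', item)
        # Convert the item to a dict (this also handles Item objects) and serialize it;
        # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
        json_data = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + ',\n'
        # Write the data to the file
        self.file.write(json_data)
        # After the pipeline has handled the item, return it to the engine
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes (more reliable than __del__)
        self.file.close()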
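
One thing to be aware of: joining objects with ',\n' produces a file that cannot be parsed as a single JSON document. A possible alternative is to write one JSON object per line (JSON Lines) and read the file back line by line. The following is only a sketch under that assumption: the class name `JsonLinesPipeline`, the `itcast.jsonl` file name, the `demo` helper, and the sample items are hypothetical, and the items are assumed to be plain dicts so the example runs without Scrapy installed.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant that writes one JSON object per line."""

    def __init__(self, path="itcast.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One complete JSON object per line, with no trailing comma
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()


def demo():
    """Drive the pipeline by hand and read the file back."""
    path = os.path.join(tempfile.mkdtemp(), "itcast.jsonl")
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    for item in [{"name": "张三"}, {"name": "Li"}]:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


rows = demo()
```

Each line parses independently with `json.loads`, so a partly written file is still readable up to its last complete line.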