5. Using Scrapy Pipelines: NetEase Recruitment

IT瘾君

Category: python. Tags: web crawler, xpath, python

Article link: https://blog.csdn.net/u012441595/article/details/121653283

Python Programming Quick Start (continuously updated…)

Python Web Crawling: From Beginner to Expert

The Scrapy Crawling Framework

A closer look at how to use Scrapy pipelines

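1. Commonly used pipeline methods:

process_item(self, item, spider):
Required in every pipeline class
Implements the processing of the item data
Must return the item; otherwise the pipelines that run afterwards receive None
open_spider(self, spider): runs exactly once, when the spider starts
close_spider(self, spider): runs exactly once, when the spider closes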
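
The call order of these three methods can be tried out without Scrapy installed. In the minimal sketch below, LoggingPipeline, FakeSpider and run_pipeline are hypothetical names introduced for illustration; run_pipeline only imitates what the Scrapy engine does and is not part of Scrapy.

```python
class LoggingPipeline:
    """Records which hook is called, and in what order."""

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources here.
        self.calls = ['open_spider']

    def process_item(self, item, spider):
        # Called once per scraped item: process it and hand it on.
        self.calls.append('process_item')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources here.
        self.calls.append('close_spider')


class FakeSpider:
    # Hypothetical stand-in for a scrapy.Spider; only `name` is used.
    name = 'job'


def run_pipeline(pipeline, spider, items):
    """Imitates the engine: open once, process each item, close once."""
    pipeline.open_spider(spider)
    results = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return results


pipeline = LoggingPipeline()
results = run_pipeline(pipeline, FakeSpider(), [{'name': 'a'}, {'name': 'b'}])
print(pipeline.calls)
```

With two items, process_item appears twice in the recorded calls, bracketed by a single open_spider and a single close_spider.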

2. Modifying the pipeline file (job_simple.py)

Continue building out the wangyi crawler by extending the code in pipelines.py:

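import json


class WangyiPipeline:
    """Writes items scraped by the 'job' spider to wangyi.json."""

    def open_spider(self, spider):
        # Runs once when the spider starts: open the output file.
        if spider.name == 'job':
            # utf-8 is set explicitly because ensure_ascii=False writes raw Chinese text.
            self.file = open('wangyi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'job':
            item = dict(item)

            # One JSON object per line, each followed by a comma.
            str_data = json.dumps(item, ensure_ascii=False) + ',\n'

            self.file.write(str_data)

        # Always return the item so that later pipelines receive it.
        return item

    def close_spider(self, spider):
        # Runs once when the spider closes: release the file handle.
        if spider.name == 'job':
            self.file.close()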


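class WangyiSimplePipeline:
    """Writes items scraped by the 'job_simple' spider to wangyisimple.json."""

    def open_spider(self, spider):
        # Runs once when the spider starts: open the output file.
        if spider.name == 'job_simple':
            # utf-8 is set explicitly because ensure_ascii=False writes raw Chinese text.
            self.file = open('wangyisimple.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'job_simple':
            item = dict(item)

            # One JSON object per line, each followed by a comma.
            str_data = json.dumps(item, ensure_ascii=False) + ',\n'

            self.file.write(str_data)

        # Always return the item so that later pipelines receive it.
        return item

    def close_spider(self, spider):
        # Runs once when the spider closes: release the file handle.
        if spider.name == 'job_simple':
            self.file.close()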
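
Because every line written by the two pipelines above ends with a comma, the resulting file is not a valid JSON document as a whole. One possible alternative, assuming that one JSON object per line (the JSON Lines layout) is acceptable, is sketched below. JsonLinesPipeline, FakeSpider and the output file name are hypothetical and not part of the original project.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant: writes one JSON object per line, no trailing comma."""

    def __init__(self, path='wangyi.jsonl'):
        # A default is needed because Scrapy creates pipelines without arguments.
        self.path = path

    def open_spider(self, spider):
        # utf-8 keeps non-ASCII text intact when ensure_ascii=False is used.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()


class FakeSpider:
    # Hypothetical stand-in for a scrapy.Spider.
    name = 'job'


# Drive the pipeline by hand, writing into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'wangyi.jsonl')
pipeline = JsonLinesPipeline(path)
spider = FakeSpider()
pipeline.open_spider(spider)
pipeline.process_item({'name': '开发工程师'}, spider)
pipeline.process_item({'name': 'tester'}, spider)
pipeline.close_spider(spider)

# Reading back: every line parses on its own with json.loads.
with open(path, encoding='utf-8') as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

The trade-off is that the file must be read line by line rather than with a single json.load call.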

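Enable the pipelines in settings.py (the module path and class names below are examples; substitute the ones from your own project):

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastFilePipeline': 400,   # 400 is the priority value
    'myspider.pipelines.ItcastMongoPipeline': 500,  # the smaller the value, the earlier the pipeline runs
}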
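
To make the effect of the numbers concrete, the sketch below sorts such a mapping by its values. It only illustrates the ordering rule (items pass from lower-valued to higher-valued pipelines) and is not how Scrapy is implemented internally; execution_order is a name introduced here.

```python
# Same entries as in the settings.py example, deliberately listed out of order.
ITEM_PIPELINES = {
    'myspider.pipelines.ItcastMongoPipeline': 500,
    'myspider.pipelines.ItcastFilePipeline': 400,
}

# Items pass through the pipelines in ascending order of the assigned value,
# regardless of the order in which the dictionary entries are written.
execution_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(execution_order)
```

The file pipeline (400) therefore sees each item before the Mongo pipeline (500). By convention these values are chosen from the range 0 to 1000.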

3. Adding a pipeline that saves records to MongoDB

Version: pymongo 4.0

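from pymongo import MongoClient


class MongoPipeline(object):
    """Saves every item to the 'wangyi' collection of the 'itcast' database."""

    def open_spider(self, spider):
        # Connect once when the spider starts.
        self.client = MongoClient('mongodb://127.0.0.1:27017')
        self.col = self.client['itcast']['wangyi']

    def process_item(self, item, spider):
        # Convert the item to a plain dict before inserting it.
        data = dict(item)
        self.col.insert_one(data)

        return item

    def close_spider(self, spider):
        # Close the connection once when the spider finishes.
        self.client.close()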
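
The connection string and database name above are hard-coded. A common alternative is to read them from settings.py through the from_crawler class method, which Scrapy uses to build the pipeline when the method is defined. In the sketch below, the class name ConfigurableMongoPipeline, the setting names MONGO_URI and MONGO_DATABASE, and the fake_crawler stand-in are all assumptions made for illustration; they do not come from the original project.

```python
from types import SimpleNamespace


class ConfigurableMongoPipeline:
    """Hypothetical variant that takes its connection details from settings."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the values defined in settings.py.
        # MONGO_URI and MONGO_DATABASE are assumed setting names.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://127.0.0.1:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'itcast'),
        )

    def open_spider(self, spider):
        # Imported here so the class can be defined without pymongo installed.
        from pymongo import MongoClient
        self.client = MongoClient(self.mongo_uri)
        self.col = self.client[self.mongo_db]['wangyi']

    def process_item(self, item, spider):
        self.col.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()


# Hypothetical stand-in for Scrapy's crawler object; a dict provides .get().
fake_crawler = SimpleNamespace(settings={'MONGO_URI': 'mongodb://localhost:27017'})
pipeline = ConfigurableMongoPipeline.from_crawler(fake_crawler)
print(pipeline.mongo_uri, pipeline.mongo_db)
```

Only the construction step is exercised here; open_spider still needs pymongo and a running MongoDB server.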

Don't forget to start the MongoDB service first:

sudo service mongodb start

Then check the stored records in MongoDB by opening the shell with the mongo command.

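Think about it: settings.py lets you enable several pipelines. Why would you need more than one?
Different pipelines can handle data from different spiders, told apart by the spider.name attribute.
Different pipelines can apply different processing steps to one or more spiders, for example one pipeline for data cleaning and another for saving the data.
A single pipeline class can also handle data from several spiders, again distinguished by the spider.name attribute.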
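
The division of labour described above, with one pipeline cleaning and another saving, can be sketched as follows. CleanPipeline, CollectPipeline, FakeSpider and run are hypothetical names, and run only imitates Scrapy passing each item through the enabled pipelines in priority order.

```python
class CleanPipeline:
    """Strips surrounding whitespace from string fields, for the 'job' spider only."""

    def process_item(self, item, spider):
        if spider.name == 'job':
            item = {
                key: value.strip() if isinstance(value, str) else value
                for key, value in dict(item).items()
            }
        return item


class CollectPipeline:
    """Stands in for a saving pipeline; keeps the items in memory."""

    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        self.saved.append(item)
        return item


class FakeSpider:
    # Hypothetical stand-in for a scrapy.Spider; only `name` is used.
    def __init__(self, name):
        self.name = name


def run(pipelines, item, spider):
    """Passes one item through each pipeline in turn."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


clean, collect = CleanPipeline(), CollectPipeline()
run([clean, collect], {'name': '  engineer \n'}, FakeSpider('job'))
run([clean, collect], {'name': '  engineer \n'}, FakeSpider('job_simple'))
print(collect.saved)
```

The item from the job spider is saved in cleaned form, while the item from job_simple skips the cleaning step and is saved unchanged, because CleanPipeline checks spider.name.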