Working Around Scrapy 2.8's open_spider Not Supporting Async Startup

Latest recommended article published on 2024-06-09 22:15:24

ViniJack

Tags: scrapy python

Article link: https://blog.csdn.net/ViniJack/article/details/132249139


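Problem:

In Scrapy 2.8 I defined open_spider in my pipeline as an async method. Using await inside it to wait for a callback result raises an error, while leaving out the await produces no error message.
On closer inspection, however, the async open_spider prints nothing at all: its body is never executed. For example: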

async def open_spider(self, spider):
    print('Into the async open spider.....')

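Analysis:

Scrapy 2.5 and later support asynchronous (async def) code in many places, but some hooks can still only be started synchronously. If such a hook is written with async def, it is effectively skipped, as if it were not in Scrapy's startup call list. The open_spider method of a pipeline is one example. Suppose I want to connect to a database or start some other asynchronous operation at startup, and I define the method like this: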

async def open_spider(self, spider):

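Because Scrapy 2.8 does not support starting open_spider asynchronously, the framework appears to call it like an ordinary synchronous method. Calling an async def function only creates a coroutine object, and nothing awaits it, so the code you wrote inside open_spider never runs. The result is the same as if the default, empty open_spider had been used in its place.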
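The behavior described here can be reproduced without Scrapy. The following is a minimal sketch in plain Python, not Scrapy's actual internals: DemoPipeline and call_like_sync_framework are hypothetical names, and the assumption is that the framework invokes open_spider as an ordinary synchronous call without awaiting the result.

```python
import asyncio
import inspect


class DemoPipeline:
    """Stand-in for an item pipeline (hypothetical, not a Scrapy class)."""

    def __init__(self):
        self.opened = False

    async def open_spider(self, spider):
        print("Into the async open spider.....")
        self.opened = True


def call_like_sync_framework(pipeline, spider):
    """Call open_spider the way a synchronous caller would: no await."""
    return pipeline.open_spider(spider)


pipeline = DemoPipeline()

# A plain call only builds a coroutine object; the body has not run yet.
result = call_like_sync_framework(pipeline, spider=None)
is_coroutine = inspect.iscoroutine(result)
opened_without_await = pipeline.opened
result.close()  # discard the coroutine explicitly to avoid a RuntimeWarning

# Only when something drives the coroutine does the body execute.
asyncio.run(pipeline.open_spider(None))
opened_after_await = pipeline.opened
```

In this sketch the print statement fires only during the asyncio.run call, which matches the symptom above: no error, but also no output, until something actually awaits the coroutine.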

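Solution:

Since Scrapy 2.8 does not trigger my async function on its own, I have to trigger it myself.
Because Scrapy's components run at different points in the crawl, I handle this deferred initialization only in the pipeline: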

```python
import aio_pika
from scrapy import signals


class RabbitMQPipeline:
    def __init__(self):
        self.connection = None
        self.channel = None

    async def open_spider(self, spider):
        # Initialize the connection and channel.
        # Runs as a spider_opened signal handler (see from_crawler).
        if not self.connection or self.connection.is_closed:
            self.connection = await aio_pika.connect_robust("amqp://guest:guest@localhost/")
        self.channel = await self.connection.channel()
        # declare_queue takes the queue name as `name`, not `queue_name`
        await self.channel.declare_queue('your_queue_name')

    async def close_spider(self, spider):
        # Close the channel and connection
        if self.channel and not self.channel.is_closed:
            await self.channel.close()
        if self.connection and not self.connection.is_closed:
            await self.connection.close()

    async def process_item(self, item, spider):
        message = f"Your item data: {item['data']}"
        await self.channel.default_exchange.publish(
            aio_pika.Message(body=message.encode()),
            routing_key='your_queue_name'
        )
        return item

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        # Register the async methods as signal handlers
        crawler.signals.connect(pipeline.open_spider, signal=signals.spider_opened)
        crawler.signals.connect(pipeline.close_spider, signal=signals.spider_closed)
        return pipeline

    async def start_requests(self, spider):
        # Manual entry point that awaits the async open_spider.
        # Note: Scrapy does not call start_requests on item pipelines, so the
        # signal connections in from_crawler are what trigger open_spider.
        await self.open_spider(spider)
        # Return an empty request list; the real requests are generated elsewhere
        return []
```
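aio_pika is built on asyncio, so the pipeline above can only await it if Scrapy runs on the asyncio reactor. A possible settings.py fragment is sketched below. It assumes a project package named myproject (a placeholder) with the pipeline in myproject/pipelines.py; the priority value 300 is arbitrary.

```python
# settings.py (fragment)

# Run Twisted on top of asyncio so that asyncio-based libraries such as
# aio_pika can be awaited inside Scrapy coroutines.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Enable the pipeline. "myproject" is a placeholder package name and
# 300 is an arbitrary priority.
ITEM_PIPELINES = {
    "myproject.pipelines.RabbitMQPipeline": 300,
}
```

Projects generated by recent versions of scrapy startproject may already set TWISTED_REACTOR, in which case only the ITEM_PIPELINES entry needs to be added.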