Scrapyd: A Wrapper for Remote Management of Scrapy Spiders

  • Uses Scrapyd to remotely manage URL consumer spiders: starting them, checking their status, stopping them, and counting running jobs (the raw Scrapyd endpoints involved are sketched right after this list)
  • Wraps the Scrapyd JSON API in a single Python class
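
The class below drives three Scrapyd JSON API endpoints: listjobs.json for job status, schedule.json to start a spider run, and cancel.json to stop one. Calling them directly with requests looks roughly like the following sketch; the host, project, and spider names are the same placeholder values used later in the class.

import requests

scrapyd_url = 'http://192.168.11.1:6800'  # placeholder Scrapyd host, same as in the example at the bottom

# Job status: pending / running / finished jobs of a project
jobs = requests.get(f'{scrapyd_url}/listjobs.json', params={'project': 'yuqing_spider'}).json()

# Start one run of a spider; the response carries the new jobid
job = requests.post(f'{scrapyd_url}/schedule.json',
                    data={'project': 'yuqing_spider', 'spider': 'consumer_spider'}).json()

# Cancel a running job by its jobid
requests.post(f'{scrapyd_url}/cancel.json',
              data={'project': 'yuqing_spider', 'job': job['jobid']})

The wrapper itself:
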
import requests
import json


class Scrapyd:
    def __init__(self, server_list):
        self.server_list, _ = self.update_server(server_list)

    # Probe each host and keep only the Scrapyd servers that respond
    def update_server(self, server_list):
        stop_list = []
        self.server_list = []
        for v in server_list:
            try:
                r = requests.get(f'http://{v}', timeout=1)
                r.raise_for_status()  # raise on a 4xx/5xx response
            except requests.RequestException:
                stop_list.append(v)
            else:
                self.server_list.append(v)
        print(f"scrapyd running: {self.server_list}, stop: {stop_list}")
        return self.server_list, stop_list

    # List of reachable Scrapyd servers
    def server(self):
        return self.server_list

    # Job details per server: node name plus running and pending jobs
    def spider_list(self, project='yuqing_spider'):
        result_dict = dict()
        for v in self.server_list:
            res = requests.get(f'http://{v}/listjobs.json?project={project}')
            if res.status_code == 200:
                res_dict = res.json()
                result_dict[v] = {
                    'node_name': res_dict['node_name'],
                    'running': res_dict['running'],
                    'pending': res_dict['pending'],
                }
        return result_dict

    # Map each running job_id to the server it runs on
    def spider2server(self, project='yuqing_spider'):
        jobs_dict = dict()
        for k, v in self.spider_list(project=project).items():
            jobs_dict.update({job['id']: k for job in v['running']})
        return jobs_dict

    # Map each server to the list of its running job_ids
    def server2spider(self, project='yuqing_spider'):
        jobs_dict = dict()
        for k, v in self.spider_list(project=project).items():
            jobs_dict[k] = [x['id'] for x in v['running']]
        return jobs_dict

    # Count running spiders; `server` is a substring filter (the default ':' matches every host:port entry)
    def spider_count(self, server=':'):
        obj = self.spider_list()
        count = 0
        for k, v in obj.items():
            if server in k:
                count += len(v['running'])
        return count

    # Node names reported by each server
    def spider_node_name(self):
        obj = self.spider_list()
        node_name_list = []
        for v in obj.values():
            node_name_list.append(v['node_name'])
        return node_name_list

    # Start `count` spider runs, distributed round-robin across the servers
    def spider_start(self, count=1, project='yuqing_spider', spider='consumer_spider'):
        result_dict = dict()
        post_data = {'project': project, 'spider': spider}
        for n in range(count):
            server = self.server_list[n % len(self.server_list)]
            res = requests.post(f'http://{server}/schedule.json', data=post_data)
            if res.status_code == 200:
                result_dict[res.json()['jobid']] = server  # jobid -> server it was scheduled on
        return result_dict

    # Cancel running jobs; an empty job_id_list cancels every running job of the project
    def spider_close(self, job_id_list=None, project='yuqing_spider'):
        ok_list, lose_list = [], []
        jobs_dict = self.spider2server(project=project)
        if not job_id_list:
            job_id_list = list(jobs_dict.keys())
        for job_id in job_id_list:
            post_data = {'project': project, 'job': job_id}
            if jobs_dict.get(job_id):
                res = requests.post(f'http://{jobs_dict[job_id]}/cancel.json', data=post_data)
                if res.status_code == 200:
                    ok_list.append(job_id)
            else:
                lose_list.append(job_id)  # job_id not found on any server
        return ok_list, lose_list


if __name__ == '__main__':
    scrapyd = Scrapyd(['192.168.11.1:6800'])
    scrapyd.update_server(['192.168.11.1:6800', '192.168.11.19:6800'])
    print(scrapyd.spider2server())
    # print(scrapyd.spider_count())
    # print(scrapyd.spider_start())
    # print(scrapyd.spider_count())
    # print(scrapyd.spider_close([]))
    # print(scrapyd.spider_list())
    # print(scrapyd.spider_count())
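
Below is a small sketch, not from the original post, of how the wrapper might be driven to keep a fixed number of consumer spiders running; the target of 4 and the host list are arbitrary examples.

def keep_consumers_at(scrapyd, target=4, project='yuqing_spider', spider='consumer_spider'):
    # Hypothetical helper: scale the number of running consumer spiders to `target`
    running = scrapyd.spider_count()
    if running < target:
        # start the missing runs; spider_start spreads them round-robin over the servers
        scrapyd.spider_start(count=target - running, project=project, spider=spider)
    elif running > target:
        # cancel the surplus, picking arbitrary running job_ids
        surplus_ids = list(scrapyd.spider2server(project=project).keys())[:running - target]
        scrapyd.spider_close(surplus_ids, project=project)

# Example (placeholder hosts): re-probe the servers first, then rebalance
# scrapyd = Scrapyd(['192.168.11.1:6800', '192.168.11.19:6800'])
# keep_consumers_at(scrapyd, target=4)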



For the Scrapy URL consumer spider itself, see:
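
As a rough idea of what the consumer_spider scheduled above can look like, here is a minimal sketch that assumes URLs are fed through a Redis list; the Redis host, list key, and parsing logic are placeholders, not taken from the original code.

import redis
import scrapy


class ConsumerSpider(scrapy.Spider):
    # Hypothetical consumer spider: pops URLs from a Redis list until the list is empty
    name = 'consumer_spider'

    def start_requests(self):
        r = redis.Redis(host='192.168.11.1', port=6379, db=0)  # assumed queue location
        while True:
            url = r.lpop('yuqing:url_queue')  # assumed list key holding the URLs to crawl
            if url is None:
                break
            yield scrapy.Request(url.decode(), callback=self.parse)

    def parse(self, response):
        # placeholder parsing: record the URL and page title of each consumed page
        yield {'url': response.url, 'title': response.css('title::text').get()}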
