Managing spiders with scrapyd

Official documentation: http://scrapyd.readthedocs.org/en/latest/api.html

Install scrapyd

pip3 install scrapyd

If you get this error:

pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.

then install the OpenSSL headers and rebuild Python so the ssl module is available:

yum install openssl
yum install openssl-devel
# rebuild Python from its source directory so the ssl module gets compiled in
cd python3.6.1
make && make install
  
By default the scrapyd script ends up under python3/bin.

Create the directory /etc/scrapyd
and inside it create the file scrapyd.conf with the following content:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
# keep scrapyd bound to localhost; exposing it to the public internet is considered unsafe
bind_address = 127.0.0.1
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

See the scrapyd documentation linked above for what each option means.

Go to /usr/local/python3/bin
and run python3 scrapyd to start it, or start it by full path:

/usr/local/python3/bin/scrapyd

curl 127.0.0.1:6800
If this returns without an error, the installation succeeded.
Run it in the background:

nohup python3 scrapyd 2>&1 &

Check: curl 127.0.0.1:6800
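
If you would rather have scrapyd's output in a known log file instead of nohup.out, a variant of the launch above (the log path here is only an example, not from the original setup) is:

nohup /usr/local/python3/bin/scrapyd > /var/log/scrapyd.log 2>&1 &
curl http://127.0.0.1:6800/daemonstatus.json

daemonstatus.json is one of the services configured above and returns a small JSON object with "status": "ok" while the daemon is healthy.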

Configure the scrapy.cfg file in the root directory of your spider project:
set the project name, e.g. project = projectA (the default is fine)
under [deploy], add url = http://127.0.0.1:6800/ (the address must match the one in scrapyd.conf)
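
For reference, a minimal scrapy.cfg along those lines might look like this (the settings module name projectA.settings is an assumption based on the project name):

[settings]
default = projectA.settings

[deploy]
url = http://127.0.0.1:6800/
project = projectA

With an unnamed [deploy] section like this, the deploy target is called default, which is the target name used with scrapyd-deploy below.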

Install scrapyd-client

pip3 install scrapyd-client

Copy the scrapyd-deploy file from python3/bin into the root directory of the spider project.

Run the deployment from the project root:

python3 scrapyd-deploy default -p projectA

Returns: {"node_name": "host", "status": "ok", "project": "projectA", "version": "1593221298", "spiders": 4}

List the Scrapy projects deployed on the current host:

curl http://127.0.0.1:6800/listprojects.json

Returns: {"node_name": "host", "status": "ok", "projects": ["projectA"]}
The projects field lists the deployed projects.

Run a spider

curl http://127.0.0.1:6800/schedule.json -d project=projectA -d spider=<spider name>

Returns: {"node_name": "host", "status": "ok", "jobid": "d10b9786b81511ea8c8c00163c98e1c1"}

Stop a spider

curl http://127.0.0.1:6800/cancel.json -d project=spider_demo -d job=097cd29aa6ef11e8ab735254003796ff

List running jobs (and job history):

curl "http://127.0.0.1:6800/listjobs.json?project=projectA"

Remove a deployed project

curl http://127.0.0.1:6800/delproject.json -d project=projectA
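
All of these endpoints can also be driven from Python with the requests library, which is convenient when jobs are scheduled from another service. A minimal sketch, assuming the project projectA and a placeholder spider name:

import requests

SCRAPYD = "http://127.0.0.1:6800"  # same address/port as in scrapyd.conf

# schedule a run (equivalent to the curl call to schedule.json above)
resp = requests.post(SCRAPYD + "/schedule.json",
                     data={"project": "projectA", "spider": "your_spider_name"})
job_id = resp.json()["jobid"]
print("started job:", job_id)

# list pending/running/finished jobs for the project
print(requests.get(SCRAPYD + "/listjobs.json", params={"project": "projectA"}).json())

# cancel the job again if needed
requests.post(SCRAPYD + "/cancel.json", data={"project": "projectA", "job": job_id})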

Problem running scrapy: -bash: scrapy: command not found
It could still be imported from Python, so the fix was to add the python3.5 bin directory to PATH:

export PATH=$PATH:/usr/local/python3.5/bin/ 

Problem

2017-10-17 17:58:05 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "g:\python-2-7\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "g:\python-2-7\lib\site-packages\scrapy\crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "g:\python-2-7\lib\site-packages\scrapy\crawler.py", line 76, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "g:\python-2-7\lib\site-packages\scrapy\crawler.py", line 99, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "g:\python-2-7\lib\site-packages\scrapy\spiders\__init__.py", line 54, in from_crawler
    spider = cls(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument '_job'

Solution:

  • Run scrapy to check whether it is installed at all.

If starting the spider through scrapyd-client fails, try running it with scrapy directly.
Go to the spider project's root directory
and run:

scrapy crawl updateChapterSpider

and watch whether it starts correctly, for example:
/usr/local/python3/bin/scrapy crawl testSpider
/usr/local/python3/bin/scrapy crawl updateChapterSpider
/usr/local/python3/bin/scrapy crawl updateBookSpider

  • If the same error still occurs after the checks above,
    remove the custom __init__ method from the spider (see the sketch after this list).

  • At this point all of my problems were solved.
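
For reference, instead of deleting __init__ completely, the spider's __init__ can accept and forward arbitrary keyword arguments, because scrapyd passes an extra _job argument when it starts a spider, which is exactly what the traceback above complains about. A minimal sketch using one of the spider names from this project (the class name is illustrative):

import scrapy

class UpdateChapterSpider(scrapy.Spider):
    name = "updateChapterSpider"

    def __init__(self, *args, **kwargs):
        # scrapyd passes extra keyword arguments such as _job when it
        # launches the spider; forwarding **kwargs to the base class
        # avoids the "unexpected keyword argument '_job'" TypeError
        super(UpdateChapterSpider, self).__init__(*args, **kwargs)
        # custom initialisation can go here

    def parse(self, response):
        pass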
