Deploying Spiders with Scrapyd

These steps were carried out on Ubuntu 16.04 LTS.

1. Installation

Use scrapyd-deploy, which ships with scrapyd-client, to deploy a Scrapy project to a Scrapyd server.

# Install scrapyd
pip install scrapyd
# Install scrapyd-client
# for python2.x
pip install git+https://github.com/scrapy/scrapyd-client
# for python3.6
pip install scrapyd-client

2. Usage

a. Configure scrapy.cfg
[settings]
default = njupt.settings

[deploy:server-njupt]
url = http://localhost:6800/
project = njupt
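
To double-check which deploy targets a scrapy.cfg defines, the file can be read with Python's standard configparser. This is only a minimal sketch: the helper name deploy_targets and the inline SAMPLE_CFG string are illustrative assumptions, not part of scrapyd-client.

```python
import configparser

# Same content as the scrapy.cfg above, inlined so the sketch is self-contained.
SAMPLE_CFG = """
[settings]
default = njupt.settings

[deploy:server-njupt]
url = http://localhost:6800/
project = njupt
"""


def deploy_targets(cfg_text):
    """Return {target_name: {option: value}} for every [deploy:<name>] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section.startswith("deploy:"):
            name = section.split(":", 1)[1]
            targets[name] = dict(parser[section])
    return targets


targets = deploy_targets(SAMPLE_CFG)
print(targets)
```

The target name after "deploy:" (here server-njupt) is the name passed to scrapyd-deploy later on.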
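b. Configure scrapyd

See the Scrapyd documentation for the available configuration options.
Configuration files are loaded in the following order, with later files taking priority: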
/etc/scrapyd/scrapyd.conf
/etc/scrapyd/conf.d/*
scrapyd.conf
~/.scrapyd.conf
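
The effect of the load order can be illustrated with configparser, which also lets later files override earlier ones. This is a sketch of the override behaviour only, not Scrapyd's actual loader; the temporary file names and the helper load_in_order are made up for the demonstration.

```python
import configparser
import os
import tempfile


def load_in_order(paths):
    """Read config files in the given order; values from later files win."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # files that do not exist are skipped silently
    return parser


with tempfile.TemporaryDirectory() as tmp:
    system_conf = os.path.join(tmp, "system.conf")  # stands in for /etc/scrapyd/scrapyd.conf
    user_conf = os.path.join(tmp, "user.conf")      # stands in for ~/.scrapyd.conf
    with open(system_conf, "w") as f:
        f.write("[scrapyd]\nhttp_port = 6800\nmax_proc = 0\n")
    with open(user_conf, "w") as f:
        f.write("[scrapyd]\nhttp_port = 6900\n")

    merged = load_in_order([system_conf, os.path.join(tmp, "missing.conf"), user_conf])
    http_port = merged.get("scrapyd", "http_port")  # overridden by the later file
    max_proc = merged.get("scrapyd", "max_proc")    # only set in the earlier file

print(http_port, max_proc)
```

In practice this means a setting in ~/.scrapyd.conf replaces the same setting from /etc/scrapyd/scrapyd.conf, while settings that appear in only one file are kept.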

example:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 127.0.0.1
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
c. Start scrapyd
scrapyd
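
Once the daemon is running, a quick way to confirm it is reachable is to query daemonstatus.json, one of the services listed in the example config. A minimal sketch, assuming the default address http://localhost:6800/ and a JSON response with a "status" field; the helper names are illustrative.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def status_url(base_url):
    """Build the daemonstatus.json URL for a Scrapyd base URL."""
    return urljoin(base_url, "daemonstatus.json")


def is_healthy(payload):
    """Return True if the JSON payload reports status 'ok'."""
    return json.loads(payload).get("status") == "ok"


def check(base_url="http://localhost:6800/", timeout=5):
    """Query a running Scrapyd instance (needs network access to the server)."""
    with urlopen(status_url(base_url), timeout=timeout) as resp:
        return is_healthy(resp.read().decode("utf-8"))
```

Calling check() should return True when scrapyd is up; a connection error usually means the daemon is not running or is bound to a different address or port.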
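d. Deploy
# Run from the root directory of the Scrapy project
scrapyd-deploy server-njupt -p njupt
# Specify a version; the default is the current timestamp
scrapyd-deploy server-njupt -p njupt --version 1.0

For the remaining scrapyd-deploy options, see its help output.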

e. Run a spider job
curl http://localhost:6800/schedule.json -d project=njupt -d spider=njupt

Use scrapyd-client spiders -p njupt to list the spiders in the project njupt.
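
The same schedule.json call can be made from Python when jobs need to be triggered from a script. This sketch uses only the standard library and mirrors the curl command above; the function names are illustrative, and the assumption is that the response is a JSON document.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider):
    """Build the POST request that the curl command above sends."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(urljoin(base_url, "schedule.json"), data=data, method="POST")


def schedule(base_url, project, spider, timeout=10):
    """Send the request to a running Scrapyd server and return the parsed JSON."""
    req = build_schedule_request(base_url, project, spider)
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_schedule_request("http://localhost:6800/", "njupt", "njupt")
print(req.full_url, req.data)
```

With scrapyd running, schedule("http://localhost:6800/", "njupt", "njupt") sends the request and returns the server's reply as a dict.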

3. Security

User authentication can be added by putting a reverse proxy in front of scrapyd. Taking nginx as an example, configure it as follows:

server {
       listen 6801;
       location / {
            proxy_pass            http://127.0.0.1:6800/;
            auth_basic            "Restricted";
            auth_basic_user_file  /etc/nginx/htpasswd/user.htpasswd;
        }
}

Set the username and password in /etc/nginx/htpasswd/user.htpasswd (assume both are test), then update scrapy.cfg as follows:

[settings]
default = njupt.settings

[deploy:server-njupt]
url = http://localhost:6801/
project = njupt
username = test
password = test
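
Scripts that call the API through the proxy need to send the same credentials. A minimal sketch of building an HTTP Basic Authorization header with the standard library, assuming the nginx proxy above on port 6801 and the test/test account; with_basic_auth is an illustrative helper, not part of Scrapyd.

```python
import base64
from urllib.request import Request


def with_basic_auth(url, username, password):
    """Return a Request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req = Request(url)
    req.add_header("Authorization", "Basic " + token)
    return req


req = with_basic_auth("http://localhost:6801/daemonstatus.json", "test", "test")
print(req.get_header("Authorization"))
```

Basic auth only base64-encodes the credentials, so outside of localhost it is worth serving the proxy over HTTPS.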

4. API

See the API section of the official Scrapyd documentation.
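
As one example of consuming the API, the response of listjobs.json (listed in the services section of the example config) can be reduced to per-state job counts. This is a sketch under the assumption that the response contains "pending", "running" and "finished" lists; the sample payload and job ids below are made up for illustration.

```python
import json


def summarize_jobs(payload):
    """Count jobs per state in a listjobs.json response body."""
    doc = json.loads(payload)
    return {state: len(doc.get(state, [])) for state in ("pending", "running", "finished")}


# Hypothetical response body, shaped like a listjobs.json reply.
sample = json.dumps({
    "status": "ok",
    "pending": [{"id": "job-a", "spider": "njupt"}],
    "running": [],
    "finished": [{"id": "job-b", "spider": "njupt"}, {"id": "job-c", "spider": "njupt"}],
})

summary = summarize_jobs(sample)
print(summary)
```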
