Notes on a complete production deployment of Crawlab, a crawler management and scheduling platform

Preface

When the business is small, crawler scripts can simply be run by hand on a single local machine. But once the volume grows, tasks need to complete on schedule without manual intervention, and there are thousands of crawler tasks to manage, a crawler management and scheduling platform becomes necessary.

Our production environment currently runs SpiderKeeper to manage crawler tasks. SpiderKeeper's main weakness is that once the task count grows, tasks stop running on schedule and scheduled jobs easily become blocked. To stop getting up in the middle of the night to manually rerun blocked crawler tasks, replacing SpiderKeeper became urgent.

Comparison of the main crawler management platforms (quoted from the crawlab GitHub README):

Crawlab (Golang + Vue)
Pros: Not limited to Scrapy; works with all programming languages and frameworks. Beautiful UI. Naturally supports distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, and more.
Cons: No spider versioning yet.

ScrapydWeb (Python Flask + Vue)
Pros: Beautiful UI with a built-in Scrapy log parser, stats and graphs for task execution; supports node management, cron jobs, mail notifications, and mobile. A full-featured spider management platform.
Cons: Supports only Scrapy spiders. Limited performance because of the Python Flask backend.

Gerapy (Python Django + Vue)
Pros: Built by web crawler guru Germey Cui. Simple to install and deploy. Beautiful UI. Supports node management, code editing, configurable crawl rules, and more.
Cons: Again, supports only Scrapy spiders. Many bugs reported by users against v1.0; improvements expected in v2.0.

SpiderKeeper (Python Flask)
Pros: Open-source Scrapyhub. Concise and simple UI. Supports cron jobs.
Cons: Perhaps too simplified: no pagination, no node management, and only Scrapy spiders are supported.

Preparation

Environment

Docker v20.10.5

Rancher v1.6

CentOS 7.5 (six production machines)

The Python version inside the crawlab image is 3.8, while our development environment uses 3.6. The mismatch causes dependency conflicts that prevent some projects' requirements from installing, so the Python version inside the crawlab image has to be changed.
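
Before rebuilding, the stock image's interpreter can be confirmed up front; a minimal check, assuming tikazyq/crawlab is the upstream image name:

# Assumption: tikazyq/crawlab is the upstream crawlab image; this prints the
# bundled interpreter (expected to report 3.8, which is what forces the
# rebuild below)
docker run --rm tikazyq/crawlab:latest python --version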

  • Pull the crawlab source code
qzxqbdeMacBook-Pro:~ qzxqb$ mkdir crawlab-dev
qzxqbdeMacBook-Pro:~ qzxqb$ cd crawlab-dev/
qzxqbdeMacBook-Pro:crawlab-dev qzxqb$ git clone git@github.com:crawlab-team/crawlab.git
Cloning into 'crawlab'...
remote: Enumerating objects: 16817, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 16817 (delta 27), reused 48 (delta 23), pack-reused 16753
Receiving objects: 100% (16817/16817), 20.43 MiB | 4.58 MiB/s, done.
Resolving deltas: 100% (10339/10339), done.
qzxqbdeMacBook-Pro:crawlab-dev qzxqb$
  • Modify crawlab's Dockerfile
FROM golang:latest AS backend-build

WORKDIR /go/src/app
COPY ./backend .

ENV GO111MODULE on
ENV GOPROXY https://goproxy.io

RUN go install -v ./...

FROM node:latest AS frontend-build

ADD ./frontend /app
WORKDIR /app

# install frontend
#RUN npm config set unsafe-perm true
#RUN npm install -g yarn && yarn install

RUN yarn install && yarn run build:prod

# runtime image
FROM ubuntu:latest

# set as non-interactive
ENV DEBIAN_FRONTEND noninteractive

# set CRAWLAB_IS_DOCKER
ENV CRAWLAB_IS_DOCKER Y

# install packages
RUN chmod 777 /tmp \
	&& apt-get update \
	&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate nginx wget dumb-init cloc software-properties-common \
	&& add-apt-repository ppa:deadsnakes/ppa \
	&& apt update \
	&& apt install python3.6 -y \
	&& apt-get install -y python3-pip \
	&& ln -s /usr/bin/pip3 /usr/local/bin/pip \
	&& ln -s /usr/bin/python3.6 /usr/local/bin/python


# install python packages needed by the spiders
RUN pip install scrapy pymongo bs4 requests crawlab-sdk scrapy-splash

# add files
COPY ./backend/conf /app/backend/conf
COPY ./backend/data /app/backend/data
COPY ./backend/scripts /app/backend/scripts
COPY ./backend/template /app/backend/template
COPY ./nginx /app/nginx
COPY ./docker_init.sh /app/docker_init.sh

# copy backend files
RUN mkdir -p /opt/bin
COPY --from=backend-build /go/bin/crawlab /opt/bin
RUN cp /opt/bin/crawlab /usr/local/bin/crawlab-server

# copy frontend files
COPY --from=frontend-build /app/dist /app/dist

# copy nginx config files
COPY ./nginx/crawlab.conf /etc/nginx/conf.d

# working directory
WORKDIR /app/backend

# timezone environment
ENV TZ Asia/Shanghai

# language environment
ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8

# frontend port
EXPOSE 8080

# backend port
EXPOSE 8000

# start backend
CMD ["/bin/bash", "/app/docker_init.sh"]
  • Build the image locally
[root@q3 crawlab]# git pull
Already up-to-date.
[root@q3 crawlab]# docker build -f Dockerfile -t 172.16.5.3:9443/crawlab:0.0.5 .
Sending build context to Docker daemon  4.114MB
Step 1/33 : FROM golang:latest AS backend-build
 ---> d5dc529b0ee7
Step 2/33 : WORKDIR /go/src/app
 ---> Using cache
 ---> 8fe2c89e54b2
Step 3/33 : COPY ./backend .
 ---> Using cache
 ---> a33c69edcd1d
Step 4/33 : ENV GO111MODULE on
 ---> Using cache
 ---> 4f7051d24941
Step 5/33 : ENV GOPROXY https://goproxy.io
 ---> Using cache
 ---> 26e8a4182851
Step 6/33 : RUN go install -v ./...
 ---> Using cache
 ---> 0195fd4a9abf
Step 7/33 : FROM node:latest AS frontend-build
 ---> d2850632b602
Step 8/33 : ADD ./frontend /app
 ---> Using cache
 ---> c2c6ae08093f
Step 9/33 : WORKDIR /app
 ---> Using cache
 ---> dfcb1f88e0d9
Step 10/33 : RUN yarn install && yarn run build:prod
 ---> Using cache
 ---> 367f5be0fa54
Step 11/33 : FROM ubuntu:latest
 ---> 26b77e58432b
Step 12/33 : ENV DEBIAN_FRONTEND noninteractive
 ---> Using cache
 ---> ee6ceef0ca40
Step 13/33 : ENV CRAWLAB_IS_DOCKER Y
 ---> Using cache
 ---> f157499cdaf9
Step 14/33 : RUN chmod 777 /tmp 	&& apt-get update 	&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate nginx wget dumb-init cloc software-properties-common 	&& add-apt-repository ppa:deadsnakes/ppa 	&& apt update 	&& apt install python3.6 -y 	&& apt-get install  -y python3-pip 	&& ln -s /usr/bin/pip3 /usr/local/bin/pip 	&& ln -s /usr/bin/python3.6 /usr/local/bin/python
 ---> Using cache
 ---> efe0aa203727
Step 15/33 : RUN pip install scrapy pymongo bs4 requests crawlab-sdk scrapy-splash
 ---> Using cache
 ---> 3fb1f60a0583
Step 16/33 : COPY ./backend/conf /app/backend/conf
 ---> Using cache
 ---> 7f197f9ae3c2
Step 17/33 : COPY ./backend/data /app/backend/data
 ---> Using cache
 ---> 1729928a3757
Step 18/33 : COPY ./backend/scripts /app/backend/scripts
 ---> Using cache
 ---> 804267afb368
Step 19/33 : COPY ./backend/template /app/backend/template
 ---> Using cache
 ---> e321fe6b9ed6
Step 20/33 : COPY ./nginx /app/nginx
 ---> Using cache
 ---> 4b8312fdfc8c
Step 21/33 : COPY ./docker_init.sh /app/docker_init.sh
 ---> Using cache
 ---> c3f2b9e5b6ae
Step 22/33 : RUN mkdir -p /opt/bin
 ---> Using cache
 ---> f0e6aed3e853
Step 23/33 : COPY --from=backend-build /go/bin/crawlab /opt/bin
 ---> Using cache
 ---> d9bac6e1dde0
Step 24/33 : RUN cp /opt/bin/crawlab /usr/local/bin/crawlab-server
 ---> Using cache
 ---> 08e9ab14ec2d
Step 25/33 : COPY --from=frontend-build /app/dist /app/dist
 ---> Using cache
 ---> 612b14dcb3cc
Step 26/33 : COPY ./nginx/crawlab.conf /etc/nginx/conf.d
 ---> Using cache
 ---> 011ce6e404ca
Step 27/33 : WORKDIR /app/backend
 ---> Using cache
 ---> 113d7da234d8
Step 28/33 : ENV TZ Asia/Shanghai
 ---> Using cache
 ---> b20825d59a9e
Step 29/33 : ENV LC_ALL C.UTF-8
 ---> Using cache
 ---> 474ef571d1ca
Step 30/33 : ENV LANG C.UTF-8
 ---> Using cache
 ---> 1900f0a897c3
Step 31/33 : EXPOSE 8080
 ---> Using cache
 ---> 59f7b4a5fc4d
Step 32/33 : EXPOSE 8000
 ---> Using cache
 ---> 3ac0679cf524
Step 33/33 : CMD ["/bin/bash", "/app/docker_init.sh"]
 ---> Using cache
 ---> 05c886bd310f
Successfully built 05c886bd310f
Successfully tagged 172.16.5.3:9443/crawlab:0.0.5
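
Before pushing, a quick sanity check that the rebuilt image really ships the expected toolchain:

# python was symlinked directly to python3.6 in the Dockerfile, so this
# should report 3.6.x; pip --version additionally shows which interpreter
# pip is bound to
docker run --rm 172.16.5.3:9443/crawlab:0.0.5 python --version
docker run --rm 172.16.5.3:9443/crawlab:0.0.5 pip --version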
  • Push to the private image registry
[root@q3 crawlab]# docker push 172.16.5.3:9443/crawlab:0.0.5
The push refers to repository [172.16.5.3:9443/crawlab]
7bdb99ab55de: Layer already exists
b6d882c09c38: Layer already exists
60e763dd93e1: Layer already exists
ffcac982fd08: Layer already exists
7c184307f92f: Layer already exists
24d9956dd633: Layer already exists
e524f1f19abf: Layer already exists
8a7cb040f5f9: Layer already exists
0efdd438158d: Layer already exists
92dbdae35638: Layer already exists
6a975a1793dd: Layer already exists
6460401a902d: Layer already exists
adebce06afa0: Layer already exists
346be19f13b0: Layer already exists
935f303ebf75: Layer already exists
0e64bafdc7ee: Layer already exists
0.0.5: digest: sha256:49da0419217365b442717852c97e0fde1846ae9c43a498905eb4bca6341f67cd size: 3664
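
One prerequisite worth noting: assuming the registry at 172.16.5.3:9443 serves a self-signed certificate (or plain HTTP), every Docker host that pushes to or pulls from it needs the address whitelisted, otherwise the operation fails with a TLS/x509 error. A sketch of that daemon config:

# Whitelist the private registry on each Docker host. Note this overwrites
# an existing /etc/docker/daemon.json; merge by hand if one is present.
cat <<'EOF' > /etc/docker/daemon.json
{
  "insecure-registries": ["172.16.5.3:9443"]
}
EOF
systemctl restart docker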

Installation and deployment

  • Master node installation

  • In Rancher, add a crawlab application and add the service
  • Expose port 8080
  • Add service links to the redis and mongo services
  • Add the environment variables
  • Add a volume to persist the logs
  • Pick the server to schedule the service on (a docker run equivalent of this setup is sketched below)
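
Expressed as a plain docker run rather than a Rancher 1.6 service, the master configuration looks roughly like this. The environment variable names are crawlab's documented ones; the mongo/redis hostnames and the log paths are assumptions for illustration, not the exact production values:

# Sketch of the master service as a docker run (Rancher service equivalent)
docker run -d --name crawlab-master \
  -p 8080:8080 \
  -e CRAWLAB_SERVER_MASTER=Y \
  -e CRAWLAB_MONGO_HOST=mongo \
  -e CRAWLAB_REDIS_ADDRESS=redis \
  -v /data/crawlab/logs:/var/logs/crawlab \
  172.16.5.3:9443/crawlab:0.0.5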

Worker node installation

To install a worker node, the only change is setting the environment variable CRAWLAB_SERVER_MASTER to N, as in the sketch below.
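
A sketch under the same assumptions as the master example above:

# Worker variant: same image and connection settings; only
# CRAWLAB_SERVER_MASTER changes to N
docker run -d --name crawlab-worker \
  -e CRAWLAB_SERVER_MASTER=N \
  -e CRAWLAB_MONGO_HOST=mongo \
  -e CRAWLAB_REDIS_ADDRESS=redis \
  172.16.5.3:9443/crawlab:0.0.5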

With that, one master node and five worker nodes are installed.
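
A quick way to confirm the deployment is alive is to hit the nginx-served frontend on the exposed port (the hostname placeholder below is hypothetical):

# Hypothetical smoke test: the frontend should answer on port 8080
curl -I http://<master-host>:8080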

Migrating tasks

--- To be continued

 
