Telecom Group Government/Enterprise Project — Crawler Component
1 Technologies used in this project: Scrapy, Scrapyd, scrapyd-client, Docker, docker-compose
2 The requirement is to crawl bidding and tendering announcements from every provincial-level site and provincial capital nationwide. I previously crawled Zhejiang's provincial bidding site with a locally-run Scrapy project.
This time there are many more target sites, crawls must run on a schedule, and the architecture needs to be reusable and extensible. So on top of Scrapy, I use Docker and Scrapyd to implement it.
3 Implementation: each team member has their own Scrapyd deployment target; spiders are written locally and packaged/deployed via scrapyd-client. Since the project currently has only one host, I use docker-compose on it to start multiple Scrapyd containers, each distinguished by a different port number; each Scrapyd instance is reached externally via the host's IP plus that instance's port.
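Deployment with scrapyd-client works through a [deploy:...] section in the Scrapy project's scrapy.cfg. A minimal sketch, assuming a project named bidding and targeting the first Scrapyd container (the project and target names are illustrative, not from the source):

```ini
# scrapy.cfg — at the root of the Scrapy project
[settings]
default = bidding.settings

[deploy:node6800]
url = http://<host-ip>:6800/
project = bidding
```

Each member then packages and uploads their spiders with `scrapyd-deploy node6800 -p bidding`, which builds an egg and POSTs it to that instance's addversion.json endpoint.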
The directory layout of the whole setup is:
drwxr-x--- 2 puaiuc users 57 Mar 10 14:17 6800
drwxr-x--- 2 puaiuc users 57 Mar 10 14:33 6801
drwxr-x--- 2 puaiuc users 57 Mar 10 14:34 6802
drwxr-x--- 2 puaiuc users 57 Mar 10 14:35 6803
drwxr-x--- 2 puaiuc users 57 Mar 10 14:35 6804
drwxr-x--- 2 puaiuc users 23 Mar 10 15:46 base_scrapy
-rw-r----- 1 puaiuc users 1062 Mar 10 14:38 docker-compose.yml
-rw-r----- 1 puaiuc users 413 Mar 10 14:11 docker-compose.yml.bak1
The Dockerfile for base_scrapy is as follows:
FROM python:3.6
LABEL maintainer="yyqq188@foxmail.com"
# pip ships with the python:3.6 image, so installing python-pip via apt is unnecessary
RUN apt-get update && apt-get upgrade -y \
    && pip install scrapy scrapyd pymongo mysqlclient
Build the Scrapyd base image (e.g. `docker build -t yyqq188/base_scrapy .`) and push it to Docker Hub.
Directories 6800-6804 correspond to the different Scrapyd instances, one port each. The structure of the 6800 directory is shown below; the other directories have the same structure.
-rw-r----- 1 puaiuc users 134 Mar 10 14:17 Dockerfile
-rw-r----- 1 puaiuc users 897 Mar 10 12:35 scrapyd.conf
-rwxr-x--x 1 puaiuc users 32 Mar 10 13:35 start.sh
Taking 6800 as an example, its Dockerfile is:
FROM yyqq188/base_scrapy
LABEL maintainer="yyqq188@foxmail.com"
# COPY source paths are relative to the build context; $PWD is not expanded by Docker
COPY scrapyd.conf /etc/scrapyd/scrapyd.conf
COPY start.sh /start.sh
The contents of scrapyd.conf are below; the copies in the other directories are the same. The settings to focus on are bind_address (0.0.0.0 so the service accepts connections from outside the container) and http_port:
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
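With each Scrapyd instance exposed on its own port, kicking off a crawl is a plain HTTP POST to the schedule.json endpoint listed above. A minimal Python sketch — the host IP, project name, and spider name are placeholders, not from the source:

```python
from urllib.parse import urlencode

HOST = "127.0.0.1"  # replace with the real host IP

def schedule(port, project, spider):
    """Build the schedule.json URL and form body for one Scrapyd instance."""
    url = f"http://{HOST}:{port}/schedule.json"
    body = urlencode({"project": project, "spider": spider}).encode()
    return url, body

url, body = schedule(6800, "bidding", "zhejiang_spider")
# Against a live deployment, POST it with urllib.request.urlopen(url, body);
# Scrapyd replies with a JSON object containing a jobid.
```

The same function works for any of the instances simply by passing 6801, 6802, and so on, which is the point of distinguishing Scrapyd instances by port.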
The contents of start.sh are:
#!/bin/bash
# exec replaces the shell so scrapyd receives container stop signals directly
exec scrapyd > /dev/null
The contents of docker-compose.yml are:
version: "3"
services:
  scrapyd-6800:
    build: ./6800
    ports:
      - 6800:6800
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6801:
    build: ./6801
    ports:
      - 6801:6801
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6802:
    build: ./6802
    ports:
      - 6802:6802
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6803:
    build: ./6803
    ports:
      - 6803:6803
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6804:
    build: ./6804
    ports:
      - 6804:6804
    links:
      - mysql-docker
    command: bash /start.sh
  mysql-docker:
    image: "mysql:5.6"
    environment:
      MYSQL_ROOT_PASSWORD: abc
    volumes:
      - /data/mysql_data:/var/lib/mysql
    ports:
      - "3306:3306"
To add more containers later, just copy a node directory like 6800, adjust its port settings, and add the corresponding service to docker-compose.yml.
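Concretely, a sixth node would mean copying 6800 to a new 6805 directory, changing http_port in its scrapyd.conf to 6805, and appending a service like this to docker-compose.yml (a sketch following the pattern of the existing services):

```yaml
  scrapyd-6805:
    build: ./6805
    ports:
      - 6805:6805
    links:
      - mysql-docker
    command: bash /start.sh
```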
Detail: the file name in /etc/scrapyd/scrapyd.conf must stay exactly scrapyd.conf — Scrapyd looks for that default name at startup, and if you rename the file it will still try to read the default name. So keep the file name fixed; only the contents vary between instances.
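As for the scheduled crawling required by the project, Scrapyd itself only runs jobs on demand; one simple option is a host-side cron entry that periodically POSTs to schedule.json. A sketch — the schedule, host IP, project and spider names are all illustrative:

```crontab
# /etc/crontab — kick off one spider at 02:00 every day (illustrative)
0 2 * * * root curl -s http://127.0.0.1:6800/schedule.json -d project=bidding -d spider=zhejiang_spider
```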
The next step is to evolve the deployment to Docker Swarm and then Kubernetes.