To run scrapy crawlers inside containers, build a base Python 2 image that ships with the scrapy environment.
The steps below are performed on a CentOS 7 system:
- Install docker: see https://docs.docker.com/install/linux/docker-ce/centos/
- Start docker: systemctl enable docker && systemctl start docker
- Add a China-based registry mirror to speed up image pulls: add {"registry-mirrors": ["http://hub-mirror.c.163.com", "https://docker.mirrors.ustc.edu.cn"]} to /etc/docker/daemon.json and restart docker
- Create a working directory for building the target image: mkdir -p /data/scrapy
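The daemon.json edit above is easy to get wrong: docker refuses to start on malformed JSON, and copy-pasted curly quotes are a common culprit. A minimal sketch that writes the config to a scratch file and validates it before it would be moved into place (python3 is used here purely as a JSON checker; paths are illustrative):

```shell
# Write the registry-mirror config to a scratch file and check that it
# parses as JSON; on a real host this content goes to /etc/docker/daemon.json,
# followed by `systemctl restart docker`.
DAEMON_JSON=$(mktemp)
cat > "$DAEMON_JSON" <<'EOF'
{
  "registry-mirrors": [
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn"
  ]
}
EOF
# json.tool exits non-zero on malformed input (e.g. curly quotes)
python3 -m json.tool < "$DAEMON_JSON" > /dev/null && echo "daemon.json OK"
```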
- Add the Python modules to install to a requirements.txt file, with the following contents:
pymysql
lxml
scrapy
- Create the Dockerfile that builds the image, with the following contents:
FROM python:2.7.14-alpine
ENV TZ "Asia/Shanghai"
COPY requirements.txt /
RUN echo "$TZ" > /etc/timezone \
&& ln -sf /usr/share/zoneinfo/${TZ} /etc/localtime \
&& echo "https://mirror.tuna.tsinghua.edu.cn/alpine/v3.4/main" > /etc/apk/repositories \
&& apk add -U alpine-sdk \
libxml2-dev \
libxslt-dev \
build-base \
python-dev \
openssl-dev \
libffi-dev \
zlib-dev \
&& pip install --upgrade pip \
&& pip install --default-timeout=100 --no-cache-dir -i https://pypi.mirrors.ustc.edu.cn/simple -r requirements.txt
CMD ["scrapy", "--help"]
7. Build the target image
docker build -t py_scrapy .
Once the image is built successfully, scrapy programs can run in this environment.
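Before wiring the image into scripts, a quick smoke test is worth running. A hedged sketch (the `py_scrapy` tag comes from the build step above; the check degrades gracefully on machines where docker or the image is not present):

```shell
# Run scrapy inside the new image to confirm the toolchain works;
# skip cleanly if docker or the py_scrapy image is unavailable on this host.
if command -v docker >/dev/null 2>&1 && docker image inspect py_scrapy:latest >/dev/null 2>&1
then
SMOKE=$(docker run --rm py_scrapy:latest scrapy version)
else
SMOKE="py_scrapy image not available; skipping smoke test"
fi
echo "$SMOKE"
```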
8. Create start and stop scripts
start.sh
#!/bin/sh
CONTAINER_NAME='scrapy_item'
if docker inspect "$CONTAINER_NAME" >/dev/null 2>&1
then
echo "$CONTAINER_NAME already exists!"
exit 0
fi
BASE_DIR=$(readlink -f "$(dirname "$0")")
DATA_DIR="/data/docker/$CONTAINER_NAME"
sudo mkdir -p "$DATA_DIR"/logs
OWNER=$(id --user):$(id --group)
sudo chown -R $OWNER "$DATA_DIR"
docker run \
--detach \
--network bridge \
--name $CONTAINER_NAME \
--volume /etc/localtime:/etc/localtime:ro \
--volume "$BASE_DIR/app":/sft/app \
--volume "$DATA_DIR/logs":/sft/logs \
--restart always \
-w /sft/app \
py_scrapy:latest \
python run.py
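start.sh mounts an `app/` directory that sits next to the script itself into the container at /sft/app; the `readlink -f "$(dirname "$0")"` line is what makes that mount path independent of the caller's working directory. A small self-contained check of the idiom (temporary paths, no docker needed):

```shell
# A throwaway script that reports its own directory via the same idiom
# start.sh uses for BASE_DIR, then invoked from an unrelated cwd.
WORKDIR=$(mktemp -d)
mkdir -p "$WORKDIR/deploy/app"
cat > "$WORKDIR/deploy/where.sh" <<'EOF'
#!/bin/sh
BASE_DIR=$(readlink -f "$(dirname "$0")")
echo "$BASE_DIR/app"
EOF
chmod +x "$WORKDIR/deploy/where.sh"
cd /        # run from an unrelated directory
"$WORKDIR/deploy/where.sh"   # resolves to .../deploy/app regardless of cwd
```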
stop.sh
#!/bin/sh
CONTAINER_NAME='scrapy_item'
if ! docker stop "$CONTAINER_NAME" >/dev/null 2>&1
then
docker kill "$CONTAINER_NAME" >/dev/null 2>&1
fi
if ! docker inspect "$CONTAINER_NAME" >/dev/null 2>&1 || docker rm "$CONTAINER_NAME" >/dev/null 2>&1
then
echo "$CONTAINER_NAME stopped"
else
echo "$CONTAINER_NAME failed to stop"
fi
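The last `if` in stop.sh is easy to misread: the condition means "the container no longer exists, or it was removed successfully", and `||` short-circuits, so `docker rm` only runs when the container is still there. The control flow can be demonstrated with stand-in functions (hypothetical `container_exists`/`remove_container`, standing in for `docker inspect`/`docker rm`):

```shell
# Stand-ins: here the container is already gone, so `! container_exists`
# succeeds and remove_container is never invoked (short-circuit).
container_exists() { return 1; }           # docker inspect: not found
remove_container() { echo "rm called"; }   # docker rm
if ! container_exists || remove_container
then
echo "scrapy_item stopped"
else
echo "scrapy_item failed to stop"
fi
```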
9. Liveness-check script (restarts the crawler when the log file stops growing)
#!/bin/sh
LOG_FILE=/data/docker/taoke/other/taoketop/logs/top.log
SHELL_DIR=/data/deploy/taoke/other
SIZE_SAVE_FILE="$SHELL_DIR/taoke_top_log.size"  # absolute path, so the script also works from cron
if [ ! -e "$LOG_FILE" ]
then
echo "$LOG_FILE does not exist"
exit 1
fi
if [ ! -d "$SHELL_DIR" ]
then
echo "$SHELL_DIR does not exist"
exit 1
fi
newsize=$(wc -c < "$LOG_FILE")
# old size record file exists
if [ -e "$SIZE_SAVE_FILE" ]
then
oldsize=$(cat "$SIZE_SAVE_FILE")
if [ "$newsize" -eq "$oldsize" ]
then
echo "will restart crawl taoke top"
$SHELL_DIR/top_stop.sh
sleep 1
$SHELL_DIR/top_start.sh
else
echo $newsize > $SIZE_SAVE_FILE
fi
else
echo $newsize > $SIZE_SAVE_FILE
fi
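The watchdog's core logic is simply "compare the log size now with the size recorded on the previous run; if it is unchanged, the crawler is assumed hung and gets restarted". A self-contained sketch of that comparison using a temp file in place of the real log (illustrative paths, no docker required):

```shell
# Simulate two watchdog passes over a log whose size stops changing.
LOG=$(mktemp)
SIZE_FILE=$(mktemp)
echo "crawl output line 1" >> "$LOG"
wc -c < "$LOG" > "$SIZE_FILE"        # pass 1: record the current size
# ...no new log lines arrive before pass 2...
newsize=$(wc -c < "$LOG")
oldsize=$(cat "$SIZE_FILE")
if [ "$newsize" -eq "$oldsize" ]
then
echo "log stalled: would restart the crawler"
else
echo "$newsize" > "$SIZE_FILE"       # still growing: just update the record
fi
```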