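Case 1
Case 1 is the basic case: using a shell script to automatically restart an interrupted job.
Scenario
For unknown reasons, model training is interrupted roughly once every 50,000 iterations. The goal is to detect each interruption promptly and automatically resume training.
Implementation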
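#!/bin/sh
# Keyword that identifies the training processes in the ps output.
SERVICE_NAME="rp/bin"
START_CMD="python tools/train.py --config experiments/ctc_standard/gpu/trainer.yaml --num-gpus 8 --resume"
LOG_FILE="restart.log"
pwd
while true
do
    # Count the processes whose command line contains the keyword.
    procnum=$(ps -ef | grep "$SERVICE_NAME" | grep -v grep | wc -l)
    if [ "$procnum" -eq 0 ]
    then
        echo "start service...................."
        # Record the restart time in the log file.
        echo "$(date '+%Y-%m-%d %H:%M:%S') $SERVICE_NAME" >> "$LOG_FILE"
        # Runs in the foreground; the loop continues once training exits.
        ${START_CMD}
    fi
    sleep 10m
done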
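In practice, save the script as resume.sh and run it in the background so that it keeps running after the terminal is closed:
nohup bash ./resume.sh &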
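Because ${START_CMD} runs in the foreground, one possible variant is to loop on its exit status instead of polling ps. The sketch below is only an illustration: it assumes the training command exits with a non-zero status when it is interrupted, and the names restart_until_success and RETRY_DELAY are hypothetical, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it exits with status 0,
# giving up after max_attempts tries. Every start is logged to LOG_FILE.
LOG_FILE="${LOG_FILE:-restart.log}"

restart_until_success() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]
    do
        echo "$(date '+%Y-%m-%d %H:%M:%S') attempt $attempt: $*" >> "$LOG_FILE"
        # Run in the foreground; a zero exit status is treated as a clean finish.
        if "$@"
        then
            return 0
        fi
        attempt=$((attempt + 1))
        # RETRY_DELAY (seconds) is an assumed knob; it defaults to 10.
        sleep "${RETRY_DELAY:-10}"
    done
    return 1
}
```

It could be called as restart_until_success 100 followed by the training command and its arguments. Capping the number of attempts avoids restarting forever when the failure is not transient.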
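Case 2
Scenario
Sometimes the job has been interrupted but its processes have not disappeared; they linger in a state such as S (sleeping). In that case the check
$procnum -eq 0
can no longer tell whether the job has stopped. The script has to look more closely at both the number and the state of the related processes. The process state codes reported by ps are: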
D uninterruptible sleep (usually IO)
I Idle kernel thread
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped by job control signal
t stopped by debugger during the tracing
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct (“zombie”) process, terminated but not reaped by its parent
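To look up these codes for a particular process, a small helper along the following lines may be convenient. It is a sketch that assumes a procps-style ps accepting -o state= (the trailing = suppresses the header line); proc_state is a hypothetical name introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the one-letter state code of a PID,
# or nothing if the PID does not exist.
proc_state() {
    ps -o state= -p "$1" 2>/dev/null | tr -d '[:space:]'
}

# Example use (commented out so that sourcing this file has no side effects):
# for pid in $(pgrep -f rp/bin); do echo "$pid $(proc_state "$pid")"; done
```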
Implementation
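#!/bin/sh
START_CMD="python tools/train.py --config experiments/ctc_standard/gpu/trainer.yaml --num-gpus 8 --resume"
while true
do
    # PIDs of every process whose command line contains "rp/bin".
    All_Process_Pid=$(pgrep -f rp/bin)
    #echo ${All_Process_Pid}
    # flag becomes 1 as soon as one matching process is in state R (running).
    flag=0
    for i in ${All_Process_Pid}
    do
        if [ "$(ps -q "$i" -o state --no-headers)" = "R" ]
        then
            flag=1
        fi
    done
    if [ "$flag" -eq 0 ]
    then
        echo "flag is 0"
        # No process is running: kill the leftovers, then resume training.
        pkill -9 -f rp/bin
        ${START_CMD}
    else
        echo "flag is 1"
    fi
    sleep 1m
done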
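A single ps reading is only a snapshot, and a healthy process may briefly be in state S when it happens to be sampled. One way to reduce false alarms, sketched here under that assumption, is to sample several times and treat the job as stalled only if state R is never seen. The name any_running and its samples and interval parameters are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any process whose command line matches
# the keyword is seen in state R during `samples` checks taken
# `interval` seconds apart; fail otherwise.
any_running() {
    keyword=$1
    samples=${2:-5}
    interval=${3:-1}
    n=0
    while [ "$n" -lt "$samples" ]
    do
        for pid in $(pgrep -f "$keyword")
        do
            # -o state= prints the one-letter state without a header.
            state=$(ps -o state= -p "$pid" 2>/dev/null | tr -d '[:space:]')
            if [ "$state" = "R" ]
            then
                return 0
            fi
        done
        n=$((n + 1))
        sleep "$interval"
    done
    return 1
}
```

In the loop above, the per-PID check could then be replaced by something like: if any_running rp/bin 5 1; then flag=1; else flag=0; fi.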
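Follow-up notes
A few small points are worth highlighting, since they will probably come in handy later:
- pgrep -f rp/bin: lists the PIDs of all processes whose command line contains the keyword "rp/bin".
- pkill -9 -f rp/bin: kills, with signal 9 (SIGKILL), all processes whose command line contains the keyword "rp/bin".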
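Since signal 9 gives a process no chance to clean up, a gentler variant of the pkill step is to send SIGTERM first and fall back to SIGKILL only for whatever survives. The sketch below is one way to do that; stop_matching and its grace parameter are hypothetical names, and whether the training processes react to SIGTERM is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: stop every process whose command line matches the
# keyword, sending SIGTERM first and SIGKILL only to the survivors.
stop_matching() {
    keyword=$1
    grace=${2:-5}
    # Nothing to do if no process matches.
    pgrep -f "$keyword" >/dev/null || return 0
    pkill -TERM -f "$keyword"
    # Give the processes `grace` seconds to exit on their own.
    sleep "$grace"
    # pgrep exits with status 0 only while matching processes remain.
    if pgrep -f "$keyword" >/dev/null
    then
        pkill -KILL -f "$keyword"
    fi
    return 0
}
```

Note that -f matches against the full command line, so the keyword should not also appear in the command line of the monitoring script itself.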