Elegant: Automatically Restarting Interrupted Jobs with a Shell Script

学弟

Last modified 2024-03-31 19:55:14

Category: Elegant. Tags: linux, bash, operations, pgrep, pkill

First published 2022-10-12 16:23:08

This is an original article by the author and may not be reposted without the author's permission.

Article link: https://blog.csdn.net/u011345885/article/details/127282870

Case 1
- Scenario
- Implementation
Case 2
- Scenario
- Implementation
Follow-up

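Case 1

Case 1 is the basic case of using a shell script to restart an interrupted job automatically.

Scenario

For unknown reasons, model training is interrupted once every 50,000 iterations. The goal is to detect each interruption promptly and to resume training automatically.

Implementation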

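#!/bin/sh

# Pattern used to find the training processes in the ps output.
SERVICE_NAME="rp/bin"
# Command that resumes training.
START_CMD="python tools/train.py --config experiments/ctc_standard/gpu/trainer.yaml --num-gpus 8 --resume"
LOG_FILE="restart.log"

pwd

while true
do
    # Count matching processes; "grep -v grep" drops the grep command itself.
    procnum=$(ps -ef | grep "$SERVICE_NAME" | grep -v grep | wc -l)

    if [ "$procnum" -eq 0 ]
    then
        echo "start service...................."
        # Record the restart time and the service name.
        echo "$(date '+%Y-%m-%d %H:%M:%S') $SERVICE_NAME" >> "$LOG_FILE"
        # Runs in the foreground: the loop continues only after training exits.
        ${START_CMD}
    fi

    sleep 10m
done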

In practice, save the script as resume.sh and run it with:

nohup bash ./resume.sh
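Because the start command runs in the foreground, an alternative to polling with ps is to loop on the command's exit status. The following is a minimal sketch, not the author's method: `run_until_success` is a hypothetical helper, and it assumes that the training command exits with a non-zero status when it is interrupted.

```shell
# Hypothetical helper: rerun a command until it exits with status 0,
# or give up after max_attempts failed runs.
# Usage: run_until_success MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
run_until_success() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        # Log each attempt with a timestamp on stderr.
        echo "$(date '+%Y-%m-%d %H:%M:%S') attempt ${attempt}: $*" >&2
        if "$@"; then
            return 0            # the command finished normally
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1            # too many failures, stop retrying
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example call (assumption: train.py returns non-zero when interrupted):
# run_until_success 20 60 python tools/train.py \
#     --config experiments/ctc_standard/gpu/trainer.yaml --num-gpus 8 --resume
```

This variant restarts right after the command exits instead of waiting for the next poll, but it cannot detect a job that hangs without exiting.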

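Case 2

Scenario

Sometimes the job is interrupted but its processes do not disappear. They linger in a state such as S (sleeping), and in that case the check

$procnum -eq 0

can no longer tell whether the job has stopped. Both the number and the state of the related processes have to be examined more closely. The process state codes reported by ps are: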

D uninterruptible sleep (usually IO)
I Idle kernel thread
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped by job control signal
t stopped by debugger during the tracing
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct (“zombie”) process, terminated but not reaped by its parent
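The rule used in this case is that a job counts as alive only if at least one of its processes is in state R. As a rough sketch, that rule can be written as a small function over state codes; `has_running_state` is a hypothetical name, and the states in the example are made up for illustration.

```shell
# Hypothetical helper: succeed (exit status 0) if any of the given
# one-letter process state codes is "R" (running or runnable).
has_running_state() {
    local state
    for state in "$@"; do
        if [ "$state" = "R" ]; then
            return 0
        fi
    done
    return 1
}

# Example with made-up states for three processes of one job:
if has_running_state S S D; then
    echo "job is alive"
else
    echo "job looks stalled"    # this branch runs: none of the states is R
fi
```

In a real script the arguments would come from ps, for example from `ps -q PID -o state --no-headers` for each PID of the job.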

Implementation

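START_CMD="python tools/train.py --config experiments/ctc_standard/gpu/trainer.yaml --num-gpus 8 --resume"

while true
do
    # PIDs of all processes whose full command line matches rp/bin.
    All_Process_Pid=$(pgrep -f rp/bin)
    #echo ${All_Process_Pid}
    # flag becomes 1 as soon as one process is found in state R (running).
    flag=0
    for i in ${All_Process_Pid}
    do
        if [ "$(ps -q "$i" -o state --no-headers)" = "R" ]
        then
            flag=1
        fi
    done

    if [ "$flag" -eq 0 ]
    then
        echo "flag is 0"
        # No process is running: kill the leftovers, then resume training.
        pkill -9 -f rp/bin
        ${START_CMD}
    else
        echo "flag is 1"
    fi

    sleep 1m
done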
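A single sample of the process states can catch a healthy job at a moment when all of its processes happen to be sleeping, for example while waiting on IO. One possible refinement, sketched here under that assumption, is to kill and restart only when no process was seen in state R across several consecutive samples. `sample_states` and `is_stalled` are hypothetical helpers, and the sampler assumes the procps versions of pgrep and ps.

```shell
# Hypothetical sampler: print the state code of every process whose
# command line matches the pattern, one per line.
sample_states() {
    local pid
    for pid in $(pgrep -f "$1"); do
        ps -q "$pid" -o state --no-headers
    done
}

# Hypothetical check: succeed only if no process was seen in state R
# in any of SAMPLES consecutive samples taken INTERVAL seconds apart.
# Usage: is_stalled PATTERN SAMPLES INTERVAL
is_stalled() {
    local pattern=$1 samples=$2 interval=$3
    local i=0
    while [ "$i" -lt "$samples" ]; do
        if sample_states "$pattern" | grep -q '^R'; then
            return 1            # a running process was seen: not stalled
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 0
}

# Example call (not executed here):
# if is_stalled rp/bin 5 10; then
#     pkill -9 -f rp/bin
#     ${START_CMD}
# fi
```

As in the loop above, a job with no matching processes at all is also reported as stalled, so the restart still happens when every process has exited.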