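The day before yesterday, I found that a program on my robot kept exiting abnormally while running, and each time I had to restart either the program or the robot.
How to solve this? My plan: 1. Find the root cause and fix it. 2. If it cannot be fixed at the root, guard against it. Fixing plus guarding together keep the program robust enough.
Finding the root cause means finding the bug in the program. I read the source code but could not pin it down, and since the robot is not with me, I cannot reproduce the failure, so that has to wait. The immediate priority is the guard. My robot runs Ubuntu 16.04 and the program is built on ROS. To make sure a ROS node never stays down after it exits, the most practical approach in my view is a daemon (watchdog) process. Below is mine, written as a shell script.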
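The approach:
1. Run a while loop that keeps checking the state of each ROS node process and restarts the node if it has exited.
2. Handle three cases:
   a. The process has exited: restart it.
   b. Several processes of the same node are running: kill them all, then start one new node process.
   c. A node process is hung (a "zombie" that has been stopped for a long time): kill it.
Here is the code: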
total.sh
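#!/bin/bash
# The "function" keyword is a bash feature, so this script must run under bash, not sh.
# $1: process name to watch, $2: launch id passed to run_total.sh
function process_monitor
{
    # Count the processes whose command line contains the name
    NUM=$(ps aux | grep -- "$1" | grep -v grep | wc -l)
    # echo "process num: $NUM"
    if [ "${NUM}" -lt 1 ]; then
        # No matching process: start a new one
        echo "$1 does not exist, launching..."
        gnome-terminal -x bash -c "bash ./run_total.sh $2; exec bash;"
    elif [ "${NUM}" -gt 2 ]; then
        # More than 2 matching processes: kill them all and start a fresh one.
        # The threshold is 2, presumably because roslaunch and the node it starts can both match the name.
        echo "more than 2 $1 processes, killing them all"
        kill -9 $(ps -ef | grep -- "$1" | grep -v grep | awk '{print $2}')
        gnome-terminal -x bash -c "bash ./run_total.sh $2; exec bash;"
    fi
    # Hung processes: look for state T (stopped) in the STAT column only,
    # instead of matching the letter T anywhere in the ps line
    NUM_STAT=$(ps -eo stat=,args= | grep -- "$1" | grep -v grep | awk '$1 ~ /^T/' | wc -l)
    if [ "${NUM_STAT}" -gt 0 ]; then
        # Kill every matching process; the next loop pass restarts the node
        kill -9 $(ps -ef | grep -- "$1" | grep -v grep | awk '{print $2}')
    fi
}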
while true
do
    # Launch ids: 1 = websocket (rosbridge_server), 2 = slamware_ros_sdk_server_node,
    # 3 = move_action_server, 4 = speed_control_service
    process_monitor rosbridge_server "1"
    process_monitor slamware_ros_sdk_server_node "2"
    process_monitor move_action_server "3"
    process_monitor speed_control_service "4"
    sleep 0.1
done
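The three cases handled by process_monitor can also be separated from the ps and kill calls, which makes the thresholds easier to check on a machine without the robot. The following is only a sketch under assumptions: the helper names decide_action and stopped_pids are hypothetical and do not appear in total.sh, and the thresholds simply mirror the ones used in the script above.

```shell
#!/bin/bash
# Sketch only: decide_action and stopped_pids are hypothetical helpers,
# not part of total.sh.

# Decide what to do from the number of matching processes.
# Thresholds mirror total.sh: fewer than 1 -> start, more than 2 -> restart.
decide_action() {
    local count="$1"
    if [ "$count" -lt 1 ]; then
        echo "start"
    elif [ "$count" -gt 2 ]; then
        echo "restart"
    else
        echo "ok"
    fi
}

# Read "PID STAT COMMAND..." lines on stdin (the layout printed by
# `ps -eo pid=,stat=,args=`) and print the PIDs of stopped processes
# (STAT starting with T) whose command contains the given name.
stopped_pids() {
    awk -v name="$1" '{
        cmd = $0
        # Strip the PID and STAT columns so only the command is searched.
        sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]*/, "", cmd)
        if ($2 ~ /^T/ && index(cmd, name) > 0) print $1
    }'
}
```

On the robot, the second helper would be fed real data, for example `ps -eo pid=,stat=,args= | stopped_pids move_action_server`. Checking the STAT column by position avoids treating every ps line that happens to contain a capital T as a stopped process.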
run_total.sh
#!/bin/bash
# "source" is a bash builtin, so run this script with bash rather than sh.
# $1: launch id (1-4) selecting which launch file to run
source /opt/ros/kinetic/setup.bash
source ~/catkin_robot/devel/setup.bash
echo "$1"
if [ "$1" -eq 1 ]; then
    roslaunch rosbridge_server rosbridge_websocket.launch
elif [ "$1" -eq 2 ]; then
    roslaunch slamware_ros_sdk slamware_ros_sdk_server_node.launch
elif [ "$1" -eq 3 ]; then
    roslaunch slamware_ros_sdk move_action_server.launch
elif [ "$1" -eq 4 ]; then
    roslaunch slamware_ros_sdk speed_control_service.launch
else
    echo "no launch file found..."
fi
exit 0
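The id-to-launch-file mapping in run_total.sh could also be written as a single case statement that only prints the command, so the mapping can be checked without ROS installed. This is a minimal sketch, assuming the same four ids as above; launch_cmd_for is a hypothetical helper name, not something from the original scripts.

```shell
#!/bin/bash
# Sketch only: launch_cmd_for is a hypothetical helper, not part of run_total.sh.

# Print the roslaunch command for a launch id; return 1 for an unknown id.
launch_cmd_for() {
    case "$1" in
        1) echo "roslaunch rosbridge_server rosbridge_websocket.launch" ;;
        2) echo "roslaunch slamware_ros_sdk slamware_ros_sdk_server_node.launch" ;;
        3) echo "roslaunch slamware_ros_sdk move_action_server.launch" ;;
        4) echo "roslaunch slamware_ros_sdk speed_control_service.launch" ;;
        *) echo "no launch file found for id '$1'" >&2
           return 1 ;;
    esac
}
```

A caller could then run `cmd=$(launch_cmd_for "$1") && exec $cmd` after sourcing the two setup.bash files. Returning a non-zero status for an unknown id lets the caller tell a bad argument apart from a normal exit.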
For lack of time, I will stop here. Comments are welcome.