Application Exit Codes in LSF
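Exit code | Description |
0 | The application ran without errors and ended normally. |
1 ~ 125 | Application-defined exit code. Check the application's manual for its meaning; some applications return a nonzero exit code even on a normal finish. |
126 | The command was found, but the user does not have permission to execute it. |
127 | The command to execute was not found. |
> 128 | The job was terminated by a signal. The signal number is the exit code minus 128; look up its meaning on the operating system in question. For example, for exit code 130, 130 - 128 = 2, and on Linux signal 2 is SIGINT, the interrupt signal. |
255 | The job exited with -1. Only the low 8 bits of the exit status are kept, so -1 is reported as 255. |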
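The table above can be turned into a small lookup. The following is a minimal sketch in POSIX shell; `describe_exit_code` is a hypothetical helper name, not part of LSF, and the wording of each message is illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: map an exit code to the categories in the table above.
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -le 125 ]; then
        echo "application-defined exit code $code"
    elif [ "$code" -eq 126 ]; then
        echo "permission denied: command found but not executable"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    elif [ "$code" -eq 255 ]; then
        echo "exit status 255 (typically the application exited with -1)"
    elif [ "$code" -gt 128 ]; then
        # Signal number = exit code - 128, e.g. 130 - 128 = 2 (SIGINT on Linux).
        echo "terminated by signal $((code - 128))"
    else
        echo "exit code $code is not covered by the table"
    fi
}

for c in 0 3 126 127 130 255; do
    printf '%s: %s\n' "$c" "$(describe_exit_code "$c")"
done
```

The check for 255 comes before the generic "greater than 128" branch, because the table treats 255 as a separate case.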
Example 1: exit code 255
Write a C program that exits with -1 (cat /tmp/calibre.c):
#include <stdio.h>

int main(void)
{
    printf("Hello world.\n");
    return -1;  /* reported to the parent process as 255 */
}
Compile it and run it from the command line; the exit code is 255:
[lsfadmin@master tmp]$ gcc calibre.c -o calibre
[lsfadmin@master tmp]$ chmod +x calibre
[lsfadmin@master tmp]$ ./calibre
Hello world.
[lsfadmin@master tmp]$ echo $?
255
Submit the same program to LSF:
[lsfadmin@master ~]$ bsub -I calibre
Job <1349> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on shugb>>
Hello world.
[lsfadmin@master ~]$ bjobs -UF 1349
Job <1349>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <interactive>, Interactive mode, Command <calibre>, Share group charged </lsfadmin>
Sat May 21 21:48:27: Submitted from host <master>, CWD <$HOME>;
Sat May 21 21:48:27: Started 1 Task(s) on Host(s) <shugb>, Allocated 1 Slot(s) on Host(s) <shugb>;
Sat May 21 21:48:32: Exited with exit code 255. The CPU time used is 0.0 seconds.
Sat May 21 21:48:32: Completed <exit>.
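The 255 in this example comes from the fact that only the low 8 bits of a process's exit status reach the parent. As a rough illustration, the arithmetic can be sketched in shell; `reported_status` is a hypothetical helper introduced here only for the demonstration.

```shell
# Hypothetical helper: the status a parent process sees for a given return value.
reported_status() {
    echo $(( $1 & 255 ))   # keep only the low 8 bits
}

reported_status -1    # prints 255
reported_status 256   # prints 0
reported_status 3     # prints 3
```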
Example 2: exit code 127, command not found
Submit a command that does not exist to LSF:
[lsfadmin@master configdir]$ bsub -I pt_shell
Job <1346> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on shugb>>
/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/myjs.sh: line 17: pt_shell: command not found
[lsfadmin@master configdir]$ bjobs -UF 1346
Job <1346>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <interactive>, Interactive mode, Command <pt_shell>, Share group charged </lsfadmin>
Sat May 21 21:38:06: Submitted from host <master>, CWD </opt/ibm/lsfsuite/lsf/conf/lsbatch/lsf-demo/configdir>;
Sat May 21 21:38:06: Started 1 Task(s) on Host(s) <shugb>, Allocated 1 Slot(s) on Host(s) <shugb>;
Sat May 21 21:38:11: Exited with exit code 127. The CPU time used is 0.0 seconds.
Sat May 21 21:38:11: Completed <exit>.
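Exit code 127 is produced by the shell that launches the command, so it can be reproduced without LSF. A minimal sketch, assuming that `no_such_command_for_demo` (a made-up name) does not exist anywhere in PATH:

```shell
# "no_such_command_for_demo" is a made-up name, assumed not to exist in PATH.
status=0
sh -c 'no_such_command_for_demo' 2>/dev/null || status=$?
echo "exit status: $status"   # prints: exit status: 127
```

The `|| status=$?` form records the failure status without aborting a script that runs under `set -e`.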
Example 3: exit code 126, permission denied
As user lsfadmin, create a program and set its permissions so that only the owner can access it.
[lsfadmin@master /]$ ls -l /tmp/gen
-rwx------ 1 lsfadmin lsfadmin 34 Jul 4 18:06 /tmp/gen
[lsfadmin@master /]$ bsub -Ip -m master /tmp/gen
Job <208> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on master>>
Hello World!
Switch to user shugb and submit the same command to LSF:
[shugb@master ~]$ bsub -Ip -m master /tmp/gen
Job <209> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on master>>
/home/shugb/.lsbatch/1656929425.209: line 8: /tmp/gen: Permission denied
[shugb@master ~]$ bjobs -UF 209
Job <209>, User <shugb>, Project <default>, Status <EXIT>, Queue <interactive>, Interactive pseudo-terminal mode, Command </tmp/gen>, Share group charged </shugb>
Mon Jul 4 18:10:25: Submitted from host <master>, CWD <$HOME>, Specified Hosts <master>;
Mon Jul 4 18:10:25: Started 1 Task(s) on Host(s) <master>, Allocated 1 Slot(s) on Host(s) <master>;
Mon Jul 4 18:10:31: Exited with exit code 126. The CPU time used is 0.0 seconds.
Mon Jul 4 18:10:31: Completed <exit>.
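The same status can be reproduced locally. The sketch below does not switch users as the example does; instead it removes the execute bit from a temporary script, which triggers the same "Permission denied" path in the shell. The file name comes from `mktemp` and is purely illustrative.

```shell
# Create a small script without execute permission, then try to run it.
tmpfile=$(mktemp)
printf '#!/bin/sh\necho "Hello World!"\n' > "$tmpfile"
chmod 600 "$tmpfile"          # readable and writable by the owner, not executable

status=0
"$tmpfile" 2>/dev/null || status=$?
echo "without execute permission: $status"   # prints 126

rm -f "$tmpfile"
```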
Example 4: exit code 130, program interrupted
Submit a job to LSF:
[shugb@master ~]$ bsub -m cmp1 sleep 1000
Job <210> is submitted to default queue <normal>.
On the execution host, interrupt the program by sending it signal 2 (SIGINT):
[root@cmp1 log]# ps -elf|grep sleep
0 S shuguan+ 97555 97553 0 80 0 - 27015 hrtime 18:17 ? 00:00:00 sleep 1000
0 S root 97572 1399 0 80 0 - 27014 hrtime 18:17 ? 00:00:00 sleep 60
0 S root 97578 94237 0 80 0 - 28204 pipe_w 18:17 pts/0 00:00:00 grep --color=auto sleep
[root@cmp1 log]# kill -2 97555
[root@cmp1 log]#
Check the job's exit code:
[shugb@master ~]$ bjobs -UF 210
Job <210>, User <shugb>, Project <default>, Status <EXIT>, Queue <normal>, Command <sleep 1000>, Share group charged </shugb>
Mon Jul 4 18:17:19: Submitted from host <master>, CWD <$HOME>, Specified Hosts <cmp1>;
Mon Jul 4 18:17:19: Started 1 Task(s) on Host(s) <cmp1>, Allocated 1 Slot(s) on Host(s) <cmp1>, Execution Home </home/shugb>, Execution CWD </home/shugb>;
Mon Jul 4 18:17:56: Exited with exit code 130. The CPU time used is 0.1 seconds.
Mon Jul 4 18:17:56: Completed <exit>.
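The relationship between the signal and the exit code can also be observed outside LSF. The following sketch starts a child shell that sends SIGINT to itself and then decodes the resulting status; `signal_of_status` is a hypothetical helper, and the result assumes SIGINT has its default disposition in the environment where the snippet runs.

```shell
# Hypothetical helper: signal number encoded in an exit status above 128.
signal_of_status() {
    echo $(( $1 - 128 ))
}

# Start a child shell that sends SIGINT (signal 2) to itself.
status=0
sh -c 'kill -INT $$' || status=$?
echo "exit status: $status"   # 130 when SIGINT has its default disposition

if [ "$status" -gt 128 ]; then
    sig=$(signal_of_status "$status")
    echo "signal: $sig ($(kill -l "$sig"))"
fi
```

`kill -l` with a signal number prints the signal's name, which saves looking it up by hand.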
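Example 5: job terminated by a user or an administrator through an LSF command
If a job is terminated by its owner or by an administrator through an LSF command, the job information shows, in addition to the exit code, a termination reason such as TERM_OWNER or TERM_ADMIN. By default bkill sends SIGINT first, which is why the exit code is 130 here as well.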
[shugb@master ~]$ bsub -m cmp1 sleep 1000
Job <211> is submitted to default queue <normal>.
[shugb@master ~]$ bkill 211
Job <211> is being terminated
[shugb@master ~]$ bjobs -UF 211
Job <211>, User <shugb>, Project <default>, Status <EXIT>, Queue <normal>, Command <sleep 1000>, Share group charged </shugb>
Mon Jul 4 18:24:12: Submitted from host <master>, CWD <$HOME>, Specified Hosts <cmp1>;
Mon Jul 4 18:24:13: Started 1 Task(s) on Host(s) <cmp1>, Allocated 1 Slot(s) on Host(s) <cmp1>, Execution Home </home/shugb>, Execution CWD </home/shugb>;
Mon Jul 4 18:24:22: Exited with exit code 130. The CPU time used is 0.0 seconds.
Mon Jul 4 18:24:22: Completed <exit>; TERM_OWNER: job killed by owner.
Submit a job as user shugb, have the administrator lsfadmin terminate it with the LSF command bkill, and then check the job information.
[shugb@master ~]$ bsub -m cmp1 sleep 1000
Job <212> is submitted to default queue <normal>.
[shugb@master ~]$ bjobs -UF 212
Job <212>, User <shugb>, Project <default>, Status <EXIT>, Queue <normal>, Command <sleep 1000>, Share group charged </shugb>
Mon Jul 4 18:27:42: Submitted from host <master>, CWD <$HOME>, Specified Hosts <cmp1>;
Mon Jul 4 18:27:43: Started 1 Task(s) on Host(s) <cmp1>, Allocated 1 Slot(s) on Host(s) <cmp1>, Execution Home </home/shugb>, Execution CWD </home/shugb>;
Mon Jul 4 18:28:30: Exited with exit code 130. The CPU time used is 0.1 seconds.
Mon Jul 4 18:28:30: Completed <exit>; TERM_ADMIN: job killed by root or an administrator.
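Because the exit code alone cannot distinguish a manual interrupt from a bkill, it can help to pull both the exit code and the TERM_ reason out of the job record together. The following is a rough sketch, assuming the `bjobs -UF` text has the same layout as the transcripts above; `summarize_exit` is a hypothetical helper and reads the text from standard input.

```shell
# Hypothetical helper: summarize exit code and termination reason from
# `bjobs -UF` text supplied on standard input.
summarize_exit() {
    awk '
        match($0, /Exited with exit code [0-9]+/) {
            # Skip the 22 characters of "Exited with exit code ".
            code = substr($0, RSTART + 22, RLENGTH - 22)
        }
        match($0, /TERM_[A-Z_]+/) {
            reason = substr($0, RSTART, RLENGTH)
        }
        END {
            if (code == "")   code = "unknown"
            if (reason == "") reason = "none"
            printf "exit_code=%s reason=%s\n", code, reason
        }
    '
}

# Sample input copied from the job record shown above.
summarize_exit <<'EOF'
Mon Jul 4 18:24:22: Exited with exit code 130. The CPU time used is 0.0 seconds.
Mon Jul 4 18:24:22: Completed <exit>; TERM_OWNER: job killed by owner.
EOF
# prints: exit_code=130 reason=TERM_OWNER
```

On a live cluster the input would presumably come from a pipe such as `bjobs -UF 211 | summarize_exit`. A job interrupted by hand, as in Example 4, has no TERM_ keyword in its record and yields `reason=none`.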