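The Crux of the Problem
The Linux system function pid_t waitpid(pid_t pid, int *stat_val, int options)
To tell whether a process exited normally and whether an error occurred while it ran, the second argument of this function must be non-NULL; it receives the status information of the child process that terminated. So how do we get the exit status code out of it?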
WIFEXITED(stat_val)
WEXITSTATUS(stat_val)
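These two macros can be used together to obtain the status code of a process that exited normally. This is the heart of the problem: the code obtained this way is not the value the child passed to exit() as written, but only its low 8 bits, read as an unsigned number from 0 to 255. A negative value therefore comes back in its 8-bit two's complement form, so please keep this in mind.
Take the example below: a child process hits an error and exits normally with status code -1, but the status code obtained with the method above is 255.
Reference: https://linux.die.net/man/3/waitpid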
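The behaviour described above can be reproduced with a few lines of C. This is only a minimal sketch, not code from the program discussed in this post: the helper name exit_code_seen_by_parent is made up for illustration, and it assumes a POSIX system with fork() and waitpid().

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with `code`, then return
 * the exit code as the parent sees it through waitpid().
 * Returns -1 if fork() or waitpid() fails, -2 if the child did not
 * exit normally (for example, it was killed by a signal).
 */
int exit_code_seen_by_parent(int code)
{
    int stat_val = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(code);                /* child: terminate immediately */

    if (waitpid(child, &stat_val, 0) != child)
        return -1;
    if (!WIFEXITED(stat_val))
        return -2;
    return WEXITSTATUS(stat_val);   /* only the low 8 bits of `code` */
}
```

Calling exit_code_seen_by_parent(-1) yields 255 rather than -1, while small non-negative codes such as 0 or 3 come back unchanged.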
Problem Scenario
With the code below, when one of the child processes fails, the program ends up printing:
Task executes failure!
All the tasks were successfully completed!
Surprising, isn't it?
#include <stdio.h>
#include <stdlib.h>

/*
 * DESC:
 * Entry point of subprocess
 */
int main()
{
    // business logic details
    ...
    ...
    // on any error, report it and exit
    if (...)
    {
        printf("Task executes failure!\n");
        exit(-1);
    }
    return 0;
}
/*
 * DESC:
 * In the main process, two subprocesses are spawned and each of them
 * handles its own task (on failure, its return value is -1). While the
 * subprocesses run, the main process waits until all of them have exited,
 * so the exit status of the main process depends on the two subprocesses.
 */
errinfo *spawnTasksProcesses()
{
    errinfo *ei = NULL;
    long process_num = 2;
    long *pid = NULL;
    long return_val = -1;
    int exit_status = -1;
    bool has_error = false;
    int i = 0;

    // Spawn two subprocesses
    pid = (long *)calloc(process_num, sizeof(long));
    for (i = 0; i < process_num; i++)
    {
        pid[i] = mockFun_spawnProcess(...);
    }

    // Poll until both subprocesses have exited
    while (true)
    {
        return_val = waitpid(-1, &exit_status, WNOHANG);

        // No child has changed state yet, keep waiting
        if (0 == return_val)
            continue;

        // A child has exited; record whether it reported an error
        if (0 < return_val)
        {
            if (-1 == exit_status)
                has_error = true;
        }

        // Set the process ID to -1 once that process has exited
        for (i = 0; i < process_num; i++)
        {
            if (pid[i] == return_val)
            {
                pid[i] = -1;
                break;
            }
        }

        // Check whether all processes have exited
        for (i = 0; i < process_num; i++)
        {
            if (-1 != pid[i])
                break;
        }

        // Leave the loop only when every subprocess has exited
        if (i == process_num)
            break;
    }

    free(pid);

    if (has_error)
        ei = error_constructor(ERR_NUM, ERR_LEVEL, "Error occurs when doing task.\n");
    return ei;
}
/*
 * DESC:
 * Invoke the function spawnTasksProcesses above
 */
void spawnTasksProcesses_invoker()
{
    if (NULL == spawnTasksProcesses())
        printf("All the tasks were successfully completed!\n");
    else
        printf("Not all tasks were successfully completed: Error occurs, please check!\n");
}
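Troubleshooting and Solution
The investigation went as follows:
The program printed "All the tasks were successfully completed!" -->
so the variable 'has_error' in the callee spawnTasksProcesses() must have been false -->
which narrowed the problem down to lines 54 to 58 of the code (the block that tests 'exit_status' after waitpid() returns a positive value) -->
debugging showed that 'exit_status' was 65280 when the failed process exited -->
What a strange number!!!
After digging through various references and carefully reading https://linux.die.net/man/3/waitpid, I confirmed that the value 65280 is not the return value set in the child process (which only ever returns -1 on error): waitpid() stores an encoded status word there, not the bare exit code. One step away from the truth.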
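To get the real return value, I turned to the macros
WIFEXITED(stat_val)
WEXITSTATUS(stat_val)
and changed the condition on line 56 of the original program to "WIFEXITED(exit_status) && WEXITSTATUS(exit_status) == -1".
Lines 54 to 58 of the original program became: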
if (0 < return_val)
{
    if (WIFEXITED(exit_status) && WEXITSTATUS(exit_status) == -1)
        has_error = true;
}
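I recompiled the program, replaced the binary, and ran it again. The result was still not what I expected.
Debugging once more, I found that the decoded exit status, WEXITSTATUS(exit_status), was now 255.
Honestly, I was completely stumped for a moment...
After a round of silent cursing, something came back to me... I hurried to reread what the official documentation says about the macro WEXITSTATUS(stat_val):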
WEXITSTATUS(status)
returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should only be employed if WIFEXITED returned true.
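Two's complement: isn't the 8-bit two's complement representation of -1 exactly 255? Since WEXITSTATUS only ever yields a value from 0 to 255, comparing it with -1 can never succeed.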
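The arithmetic can be checked directly. The sketch below uses two made-up helper names; the second one additionally assumes the Linux convention of keeping the exit code of a normally exited child in bits 8 to 15 of the raw status, which is an implementation detail and the reason portable code should go through the macros instead.

```c
/* What an exit value looks like once only its low 8 bits are kept. */
int low_8_bits(int exit_value)
{
    return exit_value & 0xFF;
}

/*
 * Assumption (Linux-specific layout, not guaranteed by POSIX): for a
 * normal exit the raw status holds the 8-bit exit code shifted left
 * by 8 bits, with the low bits left at zero.
 */
int linux_raw_status_for(int exit_value)
{
    return low_8_bits(exit_value) << 8;
}
```

Under that assumption, low_8_bits(-1) is 255 and linux_raw_status_for(-1) is 65280, which matches both numbers seen in the debugger.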
I updated the code block once more, as follows:
if (0 < return_val)
{
    if (WIFEXITED(exit_status) && WEXITSTATUS(exit_status) != 0)
        has_error = true;
}
Recompiled again, replaced the binary, and ran it. Problem solved!
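For comparison, here is one way the whole wait loop could be written with the lesson applied. It is a sketch under assumptions, not the fix used in this post: the names spawn_child_exiting_with and any_child_failed are hypothetical, it blocks in waitpid() instead of polling with WNOHANG, and it additionally treats a child killed by a signal as a failure, a case the original condition does not cover.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for mockFun_spawnProcess: the child exits at once. */
pid_t spawn_child_exiting_with(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);
    return pid;
}

/*
 * Hypothetical helper: block until every child of the calling process has
 * terminated. Returns true if any child exited with a non-zero code, was
 * killed by a signal, or if waitpid() failed unexpectedly.
 */
bool any_child_failed(void)
{
    bool has_error = false;
    int status = 0;

    for (;;) {
        pid_t pid = waitpid(-1, &status, 0);    /* blocks, no busy loop */
        if (pid < 0) {
            if (errno == EINTR)
                continue;                       /* interrupted, retry */
            if (errno != ECHILD)
                has_error = true;               /* unexpected failure */
            break;                              /* ECHILD: no children left */
        }
        if (WIFEXITED(status)) {
            if (WEXITSTATUS(status) != 0)
                has_error = true;               /* e.g. exit(-1) seen as 255 */
        } else if (WIFSIGNALED(status)) {
            has_error = true;                   /* terminated by a signal */
        }
    }
    return has_error;
}
```

Spawning one child with spawn_child_exiting_with(0) and another with spawn_child_exiting_with(-1) and then calling any_child_failed() reports a failure; with only successful children it does not.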
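Conclusion
Read the official documentation carefully, carefully, and then carefully once more.
Especially for system functions you are not familiar with.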