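Recently, while refactoring a NAS storage program, I found that it was always killed by the OOM killer after running for a little over a day. This puzzled me at first: to avoid memory fragmentation, the program does no dynamic allocation of its own, so where could a leak come from? Kernel messages and top monitoring both confirmed that memory really was leaking, so I had no choice but to check the threads one by one. When I reached the NFS service check thread, something was off: valgrind reported a memory leak in the NFS service check module. That surprised me even more, because the module's code was extracted from the showmount source.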
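My first suspicion was that I had made a mistake when extracting the code, so I compared my code against showmount's (nfs-utils 1.0.9) once more. Apart from skipping clnttcp_create and creating the RPC client directly with clntudp_create, the execution flow is the same as showmount's.
This is the code I use (the lines marked with // are the ones added to fix the memory leak):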
WIS_BOOL NSIA_DISK_CheckNFSService(WIS_CONST WIS_U32 u32DiskIp,
                                   WIS_S8 * WIS_CONST ps8ShareDir)
{
    enum clnt_stat enClntStat;
    struct sockaddr_in stServerAddr;
    struct timeval stPertryTimeout;
    struct timeval stTotalTimeout;
    CLIENT *pstClient = WIS_NULL;
    exports stExportList;
    // exports stExportList2;
    WIS_S32 s32Sock;

    /* init rpc handle, use udp only */
    stServerAddr.sin_addr.s_addr = u32DiskIp;
    stServerAddr.sin_family = AF_INET;
    stServerAddr.sin_port = 0;
    s32Sock = RPC_ANYSOCK;
    stPertryTimeout.tv_sec = 1;
    stPertryTimeout.tv_usec = 0;
    if ( (pstClient = clntudp_create(&stServerAddr, MOUNTPROG, MOUNTVERS, stPertryTimeout, &s32Sock)) == WIS_NULL )
        return WIS_FALSE;
    pstClient->cl_auth = authunix_create_default();

    stTotalTimeout.tv_sec = 2;
    stTotalTimeout.tv_usec = 0;
    memset(&stExportList, '\0', sizeof(stExportList));
    enClntStat = clnt_call(pstClient, MOUNTPROC_EXPORT, (xdrproc_t) xdr_void, WIS_NULL,
                           (xdrproc_t) xdr_exports, (caddr_t) &stExportList, stTotalTimeout);
    // stExportList2 = stExportList;
    if (enClntStat == RPC_SUCCESS)
    {
        while (stExportList)
        {
            if ( strcmp(ps8ShareDir, stExportList->ex_dir) == 0 )
            {
                // xdr_free((xdrproc_t) xdr_exports, (caddr_t) &stExportList2);
                // auth_destroy(pstClient->cl_auth);
                // clnt_destroy(pstClient);
                return WIS_TRUE;
            }
            stExportList = stExportList->ex_next;
        }
    }
    // xdr_free((xdrproc_t) xdr_exports, (caddr_t) &stExportList2);
    // auth_destroy(pstClient->cl_auth);
    // clnt_destroy(pstClient);
    return WIS_FALSE;
}
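The second variable stExportList2 matters because the loop advances stExportList itself: once the walk has moved on, the original head pointer is gone, and a free routine handed the advanced pointer would miss every node before it. Below is a minimal, self-contained sketch of that pattern. It involves no RPC at all; struct node, push, free_list and contains_and_free are hypothetical names standing in for the exports list and xdr_free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an export list node; not the real RPC type. */
struct node {
    char *dir;
    struct node *next;
};

/* Prepend a node holding a private copy of dir. Returns the new head,
 * or NULL on allocation failure (the old list is left untouched). */
static struct node *push(struct node *head, const char *dir)
{
    struct node *n = malloc(sizeof(*n));
    if (n == NULL)
        return NULL;
    n->dir = malloc(strlen(dir) + 1);
    if (n->dir == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->dir, dir);
    n->next = head;
    return n;
}

/* Free every node reachable from head; returns how many were freed. */
static int free_list(struct node *head)
{
    int freed = 0;
    while (head != NULL) {
        struct node *next = head->next;
        free(head->dir);
        free(head);
        head = next;
        freed++;
    }
    return freed;
}

/* Search the list for dir, then release the whole list.
 * The walk uses a separate cursor, so head still points at the first
 * node when free_list runs. The number of freed nodes goes to *freed. */
static int contains_and_free(struct node *head, const char *dir, int *freed)
{
    int found = 0;
    assert(freed != NULL);
    for (struct node *cur = head; cur != NULL; cur = cur->next) {
        if (strcmp(cur->dir, dir) == 0) {
            found = 1;
            break;
        }
    }
    *freed = free_list(head);
    return found;
}
```

Even when the match is the last node of a three-node list, all three nodes are released, because the release starts from the saved head rather than from wherever the search stopped.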
So I ran valgrind directly on the showmount program:
valgrind --tool=memcheck --leak-check=full --leak-resolution=high --show-reachable=yes --track-origins=yes /usr/sbin/showmount -e 172.18.13.245
The result, to my surprise:
==4142== HEAP SUMMARY:
==4142== in use at exit: 8,737 bytes in 11 blocks
==4142== total heap usage: 15 allocs, 4 frees, 9,721 bytes allocated
.........
==4142== LEAK SUMMARY:
==4142== definitely lost: 24 bytes in 2 blocks
==4142== indirectly lost: 8,697 bytes in 8 blocks
==4142== possibly lost: 0 bytes in 0 blocks
==4142== still reachable: 16 bytes in 1 blocks
==4142== suppressed: 0 bytes in 0 blocks
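The two summaries above can be cross-checked against each other: the number of blocks still in use at exit should equal allocations minus frees, and the five leak categories should add up to the "in use at exit" line. The sketch below hard-codes the figures from this output and checks both relations; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from one valgrind HEAP SUMMARY / LEAK SUMMARY pair. */
struct leak_summary {
    long allocs, frees;
    long in_use_bytes, in_use_blocks;
    /* order: definitely, indirectly, possibly, still reachable, suppressed */
    long bytes[5];
    long blocks[5];
};

/* Returns 1 when the category totals match "in use at exit" and
 * allocs - frees matches the number of blocks in use, else 0. */
static int summary_is_consistent(const struct leak_summary *s)
{
    long bytes = 0, blocks = 0;
    assert(s != NULL);
    for (int i = 0; i < 5; i++) {
        bytes += s->bytes[i];
        blocks += s->blocks[i];
    }
    return bytes == s->in_use_bytes
        && blocks == s->in_use_blocks
        && s->allocs - s->frees == s->in_use_blocks;
}

/* The showmount run shown above. */
static struct leak_summary showmount_summary(void)
{
    struct leak_summary s = {
        .allocs = 15, .frees = 4,
        .in_use_bytes = 8737, .in_use_blocks = 11,
        .bytes  = { 24, 8697, 0, 16, 0 },
        .blocks = {  2,    8, 0,  1, 0 },
    };
    return s;
}
```

Here 24 + 8,697 + 16 = 8,737 bytes and 2 + 8 + 1 = 11 blocks, which is also 15 allocations minus 4 frees, so the summary is internally consistent.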
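showmount made 15 allocations but only 4 frees, leaving a little over 8 KB unreleased. The detailed valgrind output shows which functions allocated the memory that was never freed: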
==4142== 8 bytes in 1 blocks are indirectly lost in loss record 1 of 11
==4142== at 0x4804EC2: calloc (vg_replace_malloc.c:418)
==4142== by 0x4911102: xdr_reference (in /lib/libc-2.5.so)
==4142== by 0x491105F: xdr_pointer (in /lib/libc-2.5.so)
==4142== by 0x109D05: ??? (in /usr/sbin/showmount)
==4142== by 0x109E7E: ??? (in /usr/sbin/showmount)
==4142== by 0x49110A7: xdr_reference (in /lib/libc-2.5.so)
==4142== by 0x491105F: xdr_pointer (in /lib/libc-2.5.so)
==4142== by 0x109CB5: xdr_exports (in /usr/sbin/showmount) // memory allocated inside xdr_exports, called from clnt_call
==4142== by 0x490A596: clnttcp_call (in /lib/libc-2.5.so)
==4142== by 0x1090CA: main (in /usr/sbin/showmount)
==4142==
==4142== 9 bytes in 1 blocks are indirectly lost in loss record 2 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x490FE61: xdr_string (in /lib/libc-2.5.so)
==4142== by 0x109E2B: ??? (in /usr/sbin/showmount)
==4142== by 0x109E5D: ??? (in /usr/sbin/showmount)
==4142== by 0x49110A7: xdr_reference (in /lib/libc-2.5.so)
==4142== by 0x491105F: xdr_pointer (in /lib/libc-2.5.so)
==4142== by 0x109CB5: xdr_exports (in /usr/sbin/showmount) // memory allocated inside xdr_exports, called from clnt_call
==4142== by 0x490A596: clnttcp_call (in /lib/libc-2.5.so)
==4142== by 0x1090CA: main (in /usr/sbin/showmount)
==4142==
==4142== 16 bytes in 1 blocks are still reachable in loss record 3 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x490FE61: xdr_string (in /lib/libc-2.5.so)
==4142== by 0x109D9B: ??? (in /usr/sbin/showmount)
==4142== by 0x109DCD: ??? (in /usr/sbin/showmount)
==4142== by 0x49110A7: xdr_reference (in /lib/libc-2.5.so)
==4142== by 0x491105F: xdr_pointer (in /lib/libc-2.5.so)
==4142== by 0x109D05: ??? (in /usr/sbin/showmount)
==4142== by 0x109E7E: ??? (in /usr/sbin/showmount)
==4142== by 0x49110A7: xdr_reference (in /lib/libc-2.5.so)
==4142== by 0x491105F: xdr_pointer (in /lib/libc-2.5.so)
==4142== by 0x109CB5: xdr_exports (in /usr/sbin/showmount) // memory allocated inside xdr_exports, called from clnt_call
==4142== by 0x490A596: clnttcp_call (in /lib/libc-2.5.so)
==4142==
==4142== 29 (12 direct, 17 indirect) bytes in 1 blocks are definitely lost in loss record 4 of 11
==4142== at 0x4804EC2: calloc (vg_replace_malloc.c:418)
==4142== by 0x4911102: xdr_reference (in /lib/libc-2.5.so)
==4142== by 0x491105F: xdr_pointer (in /lib/libc-2.5.so)
==4142== by 0x109CB5: xdr_exports (in /usr/sbin/showmount) // memory allocated inside xdr_exports, called from clnt_call
==4142== by 0x490A596: clnttcp_call (in /lib/libc-2.5.so)
==4142== by 0x1090CA: main (in /usr/sbin/showmount)
==4142==
==4142== 36 bytes in 1 blocks are indirectly lost in loss record 5 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x4908A50: authunix_create (in /lib/libc-2.5.so)
==4142== by 0x49085FC: authunix_create_default (in /lib/libc-2.5.so) // memory allocated inside authunix_create_default
==4142== by 0x109052: main (in /usr/sbin/showmount)
==4142==
==4142== 40 bytes in 1 blocks are indirectly lost in loss record 6 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x4908952: authunix_create (in /lib/libc-2.5.so)
==4142== by 0x49085FC: authunix_create_default (in /lib/libc-2.5.so) // memory allocated inside authunix_create_default
==4142== by 0x109052: main (in /usr/sbin/showmount)
==4142==
==4142== 68 bytes in 1 blocks are indirectly lost in loss record 7 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x4910692: xdrrec_create (in /lib/libc-2.5.so)
==4142== by 0x490A20C: clnttcp_create (in /lib/libc-2.5.so) // memory allocated inside clnttcp_create; showmount did not call clntudp_create in this run, but that function presumably allocates memory too
==4142== by 0x109043: main (in /usr/sbin/showmount)
==4142==
==4142== 100 bytes in 1 blocks are indirectly lost in loss record 8 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x490A051: clnttcp_create (in /lib/libc-2.5.so) // memory allocated inside clnttcp_create
==4142== by 0x109043: main (in /usr/sbin/showmount)
==4142==
==4142== 432 bytes in 1 blocks are indirectly lost in loss record 9 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x4908964: authunix_create (in /lib/libc-2.5.so)
==4142== by 0x49085FC: authunix_create_default (in /lib/libc-2.5.so) // memory allocated inside authunix_create_default
==4142== by 0x109052: main (in /usr/sbin/showmount)
==4142==
==4142== 8,004 bytes in 1 blocks are indirectly lost in loss record 10 of 11
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x49106D7: xdrrec_create (in /lib/libc-2.5.so)
==4142== by 0x490A20C: clnttcp_create (in /lib/libc-2.5.so) // memory allocated inside clnttcp_create
==4142== by 0x109043: main (in /usr/sbin/showmount)
==4142==
==4142== 8,692 (12 direct, 8,680 indirect) bytes in 1 blocks are definitely lost
==4142== at 0x4805B83: malloc (vg_replace_malloc.c:195)
==4142== by 0x490A042: clnttcp_create (in /lib/libc-2.5.so) // memory allocated inside clnttcp_create
==4142== by 0x109043: main (in /usr/sbin/showmount)
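The problem centers on three functions: clnttcp_create (or clntudp_create), authunix_create_default, and xdr_exports. They create, respectively, the RPC client, the RPC client's authentication information, and the list of directories exported by the NFS server. In other words, the resources these three functions create must be released before exiting.
After reading the man pages for the clnt* and xdr_* families of functions, I found the functions that release these resources (the // lines in the code): xdr_free, auth_destroy, and clnt_destroy.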
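Because the check function has two return points, the three release calls end up written out twice. One common way to keep them in a single place is a shared cleanup label. The sketch below shows only that shape and is not the code from this post: the fake_* helpers and struct fake_client are hypothetical stand-ins for the client handle, its auth data, and the decoded reply, and a counter of live allocations lets the balance be checked without valgrind.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int g_live;  /* number of outstanding fake allocations */

/* Counting wrappers around calloc/free, used only by this sketch. */
static void *fake_alloc(size_t n)
{
    void *p = calloc(1, n);
    if (p != NULL)
        g_live++;
    return p;
}

static void fake_free(void *p)
{
    if (p != NULL) {
        g_live--;
        free(p);
    }
}

/* Hypothetical stand-in for CLIENT with its cl_auth member. */
struct fake_client {
    void *auth;
};

/* Returns 1 if wanted equals exported, 0 otherwise (including on
 * allocation failure). Every path leaves through the out label, which
 * releases whatever was acquired: reply first, then auth, then client. */
static int check_share(const char *exported, const char *wanted)
{
    int found = 0;
    struct fake_client *clnt = NULL;
    char *reply = NULL;

    assert(exported != NULL && wanted != NULL);

    clnt = fake_alloc(sizeof(*clnt));          /* role of clntudp_create */
    if (clnt == NULL)
        goto out;
    clnt->auth = fake_alloc(16);               /* role of authunix_create_default */
    if (clnt->auth == NULL)
        goto out;
    reply = fake_alloc(strlen(exported) + 1);  /* role of the decoded reply */
    if (reply == NULL)
        goto out;
    strcpy(reply, exported);

    found = (strcmp(reply, wanted) == 0);

out:
    fake_free(reply);                          /* role of xdr_free */
    if (clnt != NULL) {
        fake_free(clnt->auth);                 /* role of auth_destroy */
        fake_free(clnt);                       /* role of clnt_destroy */
    }
    return found;
}

/* Number of fake allocations not yet released. */
static int live_allocations(void)
{
    return g_live;
}
```

Whether the share is found or not, live_allocations() is back at zero after each call, which is the same property valgrind confirms for the real function once the three release calls are in place.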
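I could not be bothered to install the build dependencies for nfs-utils, so I did not rebuild showmount. After the changes, though, valgrind reports no memory leaks in my extracted code.
nfs-utils has since been upgraded to 1.2.6, and showmount's code now calls clnt_destroy, but still not auth_destroy or xdr_free. CentOS cannot install the 1.2.6 showmount through yum, and I was lazy again and did not build it, so I do not know whether the leak is still there.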
Test environment: CentOS 5, kernel 2.6.18-308.13.1.el5, showmount 1.0.9, valgrind 3.5.0
If there are any mistakes, corrections are welcome.