Resolving a pthread_join segmentation fault when using Tair

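Recently a program of mine that accesses Tair kept crashing. Tracing and analysis showed the cause to be the following.
     tair_client_impl::retrieve_server_addr calls:
            thread.start(this, reinterpret_cast<void *>(heart_type));
            response_thread.start(this, reinterpret_cast<void *>(response_type));
    Thread creation failed at this point, but the failure was not handled. Later, tair_client_impl::close calls:
             thread.join();
             response_thread.join();
    Because the thread had never been created, the join triggered a segmentation fault.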
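
One way to make this kind of failure harmless is to remember whether thread creation succeeded and to skip the join otherwise. The following is only a minimal sketch under that assumption: GuardedThread and noop are hypothetical names for illustration, not part of tbsys or Tair.

```cpp
#include <pthread.h>
#include <cstdio>
#include <cstring>

// Hypothetical wrapper (not the tbsys::CThread API): it records whether
// pthread_create succeeded so that join() is safe to call unconditionally.
class GuardedThread {
public:
    GuardedThread() : tid_(), started_(false) {}

    // Returns true only if the thread was actually created.
    bool start(void *(*entry)(void *), void *arg) {
        if (started_) return false;  // refuse to overwrite a live thread id
        int ret = pthread_create(&tid_, NULL, entry, arg);
        if (ret != 0) {
            // pthread_create returns the error number; it does not set errno.
            std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(ret));
            return false;
        }
        started_ = true;
        return true;
    }

    // Joins only a thread that was started; returns false otherwise.
    bool join() {
        if (!started_) return false;
        int ret = pthread_join(tid_, NULL);
        started_ = false;
        return ret == 0;
    }

private:
    pthread_t tid_;
    bool started_;
};

// Trivial thread body used for demonstration.
static void *noop(void *) { return NULL; }
```

With such a guard, calling join() on a thread that was never started simply returns false instead of passing an invalid thread id to pthread_join.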

The detailed analysis and fix are as follows:
1. Inspect the core dump with gdb:
        The stack obtained from the core dump:

#0 0x0000003a14c07fc3 in pthread_join () from /lib64/libpthread.so.0
#1 0x00000000004abe6f in join(this=0x7f1df3ffe130) at /home/guojun8/lib/lib/include/tbsys/thread.h:51
#2 tair::tair_client_impl::close(this=0x7f1df3ffe130) at tair_client_api_impl.cpp:247
#3 0x00000000004b07a7 in tair::tair_client_impl::~tair_client_impl(this=0x7f1df3ffe130, __in_chrg=<value optimized out>) at tair_client_api_impl.cpp:83
#4 0x00000000004a58f0 in tair::new_tair_client(master_addr=<value optimized out>, slave_addr=<value optimized out>, group_name=<value optimized out>)
    at tair_client_api.cpp:584
#5 0x00000000004a5b43 in tair::tair_client_api::startup(this=0x7f1dd4001170, master_addr=0x7f1dd40010d8 "127.0.0.1:5198",
    slave_addr=0x7f1dd4001108 "127.0.0.1:5198", group_name=<value optimized out>) at tair_client_api.cpp:72
#6 0x0000000000447126 in imagestorage::Tair_Handler::Connect(this=0x7f1dd4000f90) at imagestorage/tair_handler.cc:10
#7 0x00000000004502cc in imagestorage::ImageHandler::FetchImage(this=0x1e8cb90, image_name=0x7f1dd4000908 "h00731dcfb73d42acc95f5a54e6088df117",
    image_norm_name=0x1e8e9a0 "\270\347\350\001", image_buffer=0x7f1de19fc010 "", image_size=0x7f1df3ffe894, schema="plaza", err_msg="")
    at imagestorage/image_handler.cc:213
....


 

2. Examine the failing frame in gdb:

    (gdb) f 2
    #2 0x00000000004b3c72 in tair::tair_client_impl::close(this=0x7fa2bf4f3120) at tair_client_api_impl.cpp:248
    warning: Source file is more recent than executable.
    248 response_thread.join();
    (gdb) p response_thread
    $1 = {tid = 140336135386880, pid = 0, runnable = 0x7fa2bf4f3128, args = 0x1}        <== note that pid = 0
    (gdb)
   Looking at the source:

    static void *hook(void *arg) {
        CThread *thread = (CThread*) arg;
        // This assignment runs only inside the new thread. Had the thread
        // started, pid could not still be 0, so thread creation is suspected
        // to have failed.
        thread->pid = gettid();

        if (thread->getRunnable()) {
            thread->getRunnable()->run(thread, thread->getArgs());
        }

        return (void*)NULL;
    }
3. Add logging:
    ret_thread = thread.start(this, reinterpret_cast<void*>(heart_type));
    if (!ret_thread) {
      TBSYS_LOG(ERROR, "create thread failed.");
    }
    ret_thread = response_thread.start(this, reinterpret_cast<void*>(response_type));
    if (!ret_thread) {
      TBSYS_LOG(ERROR, "create response_thread failed.");
    }
 Rerunning the program produced the log output below, which confirms that thread creation failed.

 

    [2013-10-29 18:07:21.531977] WARN parse_invalidate_server (tair_client_api_impl.cpp:3449) [140336971073280] no invalid server info found.
    [2013-10-29 18:07:21.532869] ERROR retrieve_server_addr (tair_client_api_impl.cpp:3434) [140336971073280] create response_thread failed.
    [2013-10-29 18:07:21.532915] INFO transport.cpp:394 [140336976336640] ADDIOC, SOCK: 24, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa270802ea0
    [2013-10-29 18:07:21.532941] INFO transport.cpp:394 [140337076029184] ADDIOC, SOCK: 25, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa2a0803c50
4. Get the failure reason from pthread_create:

    int ret = pthread_create(&tid, NULL, CThread::hook, this);
    // pthread_create returns the error number directly rather than setting errno.
    if (ret != 0)
        printf("pthread_create failed, ret = %s\n", strerror(ret));
    assert(ret == 0);
    return 0 == ret;
The resulting output was:
     pthread_create failed, ret = Resource temporarily unavailable
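
Since pthread_create reports failure through its return value, the message can be made more actionable by naming the error code next to the strerror text. A small sketch, assuming a hypothetical helper named describe_pthread_create_error that is not part of tbsys or Tair:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turns the return value of pthread_create into a
// message, adding a hint for EAGAIN, the code seen in this investigation.
std::string describe_pthread_create_error(int ret) {
    // strerror is not guaranteed to be thread-safe; that is acceptable for
    // a sketch but worth replacing in heavily threaded code.
    std::string msg = std::strerror(ret);
    if (ret == EAGAIN) {
        msg += " (EAGAIN: thread or process limit reached, or system"
               " resources exhausted; check `ulimit -u`)";
    }
    return msg;
}
```

Logging this string in place of the bare strerror output points directly at the resource limit as a likely culprit.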

5. Solution:
Looking up the error gives:
       EAGAIN not enough system resources to create  a  process  for   the  new
              thread.

       EAGAIN more than PTHREAD_THREADS_MAX threads are already active.

asm/errno.h:14: #define         EAGAIN          11      /* Try again */

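I suspected that the current user had exceeded its process limit (on Linux, each thread counts toward this limit):
    [sre@WDDS-DEV-016 ~]$ ulimit -u
    1024
I raised the default value in /etc/security/limits.d/90-nproc.conf to 10240; for details see the article "ulimit限制之nproc问题" (on the nproc ulimit restriction).
After the change, the limit reads 10240:

    [sre@WDDS-DEV-016 ~]$ ulimit -u
    10240
With the per-user process limit raised, the problem was resolved.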
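
The same limit can also be checked from inside the program, which is handy for logging it at startup. The sketch below assumes Linux, where `ulimit -u` reports the soft RLIMIT_NPROC value; read_nproc_soft_limit and print_nproc_limit are hypothetical helper names.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Reads the soft RLIMIT_NPROC of the calling process.
// Returns false if getrlimit fails.
bool read_nproc_soft_limit(rlim_t *out) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) != 0) return false;
    *out = rl.rlim_cur;
    return true;
}

// Prints the limit in a form comparable to the output of `ulimit -u`.
void print_nproc_limit() {
    rlim_t soft;
    if (!read_nproc_soft_limit(&soft)) {
        std::perror("getrlimit");
        return;
    }
    if (soft == RLIM_INFINITY) {
        std::printf("nproc soft limit: unlimited\n");
    } else {
        std::printf("nproc soft limit: %llu\n", (unsigned long long)soft);
    }
}
```

Calling print_nproc_limit() before the client starts its threads makes a too-low limit visible in the logs before pthread_create begins to fail.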
