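guard is itself a multi-threaded program, so after every change I check that the number of threads is still correct.
After adding zk registration, the following unexpected threads showed up in a running guard.
Quick note on how to inspect threads: start gdb, attach to the process, then run thread apply all bt.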
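The thread-count check can also be done from inside the process. The following is a minimal sketch, assuming Linux, where /proc/self/task holds one entry per thread. CountThreads is a hypothetical helper written for this note, not something that exists in guard.

```cpp
#include <dirent.h>
#include <cstring>

// Hypothetical helper: counts the threads of the calling process by
// listing /proc/self/task (Linux only). Returns -1 if the directory
// cannot be opened.
int CountThreads() {
  DIR* dir = opendir("/proc/self/task");
  if (dir == nullptr) return -1;
  int count = 0;
  while (struct dirent* entry = readdir(dir)) {
    // Skip "." and ".."; every other entry is named after a thread id.
    if (std::strcmp(entry->d_name, ".") == 0 ||
        std::strcmp(entry->d_name, "..") == 0) {
      continue;
    }
    ++count;
  }
  closedir(dir);
  return count;
}
```

Logging this value at startup and again after the service is fully running gives a quick signal that a change added threads, without having to attach gdb each time.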
(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x000000000066048b in do_completion ()
#2 0x00007f86771969ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#3 0x00007f8676ef316d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()
(gdb) bt
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
#1 0x00000000005418fc in base::CondVar::WaitWithTimeout (this=0x1b564a0, mu=0x1b56470, millis=10000) at ./base/mutex.h:205
#2 0x0000000000540f4a in util::YRFSManager::RecoverConnection (this=0x1b56460) at util/yrfs/yrfs_manager.cc:387
#3 0x000000000054334c in base::_MemberResultCallback_0_0<true, void, util::YRFSManager>::Run (this=0x1b4a9a0) at ./base/callback_spec.h:119
#4 0x000000000068f71f in base::ThreadPool::Worker (p=0x1b7e200) at base/thread_pool.cc:38
#5 0x000000000068fb25 in base::WorkerThread::ThreadBody (my_thread=0x1b4aa40) at ./base/thread_pool.h:208
#6 0x00007f86771969ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#7 0x00007f8676ef316d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#8 0x0000000000000000 in ?? ()
(gdb) bt
#0 0x00007f8676ee69d3 in *__GI___poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=3332) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x0000000000660695 in do_io ()
#2 0x00007f86771969ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#3 0x00007f8676ef316d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()
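I found this thread online:
http://osdir.com/ml/java-hadoop-zookeeper-devel/2010-03/msg00233.html
Its title alone helped a lot with this problem:
The C Client cannot exit properly in some situation
My first suspicion was that the zk client was not exiting properly.
Looking at the code again, I found the following: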
main() {
  scoped_ptr<util::YRNSManager> yrns;  // <== declared before the if clause
  if (!FLAGS_guard_yrns_path.empty()) {
    yrns.reset(new util::YRNSManager());
    std::string path = StringPrintf("%s/%d",
                                    FLAGS_guard_yrns_path.c_str(),
                                    FLAGS_guard_shard_id);
    if (monitor_server->RegisterYRNS(
            yrns.get(), path,
            FLAGS_guard_replica_id)) {
      LOG(INFO) << "Monitor:" << path
                << " id:" << FLAGS_guard_replica_id
                << " at port " << monitor_server->ServerPort();
    } else {
      LOG(ERROR) << "Failed to register monitor at port:"
                 << monitor_server->ServerPort() << " to YRNS";
      exit(1);
    }
    bool ret = yrns->Register(path,
                              FLAGS_guard_replica_id,
                              util::YRNSManager::SERVICE_RPC,
                              FLAGS_local_thrift_server_port);
    CHECK(ret) << "Failed to register rpc monitor with:" << path;
  } else {
    LOG(ERROR) << "ZK path empty";
  }
  /* The program enters its long-running state here */
  task_scheduler.Start();
  output_processor.Start();
  thrift_server->Join();
  task_scheduler.Join();
  output_processor.Join();
  monitor_server->Join();
}
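The cause of this behavior:
the object held by the yrns pointer is never destroyed while main() is blocked in the Join() calls, so indirectly the zk client never exits.
After confirming with the zk owner: the zk client must not exit and needs to stay alive for the long term, so these 2 threads (presumably do_completion and do_io in the backtraces above) are normal.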
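The lifetime point can be illustrated with a small sketch. It assumes std::unique_ptr as a stand-in for scoped_ptr, and FakeClient is a hypothetical class that only counts live instances; it is not the real YRNSManager or zk client.

```cpp
#include <memory>

// Hypothetical stand-in for a long-lived client such as the zk client.
// It only tracks how many instances are currently alive.
struct FakeClient {
  static int live;
  FakeClient() { ++live; }
  ~FakeClient() { --live; }
};
int FakeClient::live = 0;

// Pointer declared inside the if block: the client is destroyed as soon
// as the block ends.
int LiveAfterInnerScope(bool enabled) {
  if (enabled) {
    std::unique_ptr<FakeClient> client(new FakeClient());
  }
  return FakeClient::live;  // the client is already gone here
}

// Pointer declared before the if block, as yrns is in main(): the client
// survives the block and lives until the enclosing function returns.
int LiveAfterOuterScope(bool enabled) {
  std::unique_ptr<FakeClient> client;
  if (enabled) {
    client.reset(new FakeClient());
  }
  return FakeClient::live;  // the client is still alive here
}
```

With enabled set to true, the first function sees no live client after the block and the second sees one. In main() the enclosing function does not return while the Join() calls block, which matches the long-lived zk client threads observed in gdb.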