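After recently changing the intu backend to connect to Searchhub and Suggestion asynchronously, both paths started dumping core.
For sh, the only suspect is an uninitialized loop variable, for(int idx; i<max; i++). Production is now OK, so this was most likely the cause.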
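As an illustration of the suspected sh bug, here is a minimal sketch of the corrected loop shape. The function name `sum_first` and its arguments are hypothetical and do not come from the real code; only the uninitialized-counter pattern is taken from the note above. Reading an uninitialized local `int` is undefined behavior in C++, so the loop may start from an arbitrary index and read out of bounds, which is consistent with a core dump.

```cpp
#include <cassert>

// Hypothetical helper: sums the first `max` elements of `data`.
// The buggy form declared the counter without a value (`int idx;`),
// so the very first comparison read an indeterminate value.
int sum_first(const int* data, int max) {
    assert(max >= 0);
    int total = 0;
    for (int idx = 0; idx < max; idx++) {  // counter explicitly starts at 0
        total += data[idx];
    }
    return total;
}
```

Compiling with GCC's `-Wall` may flag this kind of mistake, although the warning is not guaranteed in every case.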
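For sugg, the current suspicion is that the suggestion path and the backend data sources share variables. The problematic spots are:
(1) Sugg still uses ori_stat as its state variable, and once suggestion processing completes, it is set back to O_WAIT so that the later sh or cache stage can be handled. This opens a race window: Sugg has finished, but the timeout handler is also about to fire, so the sh connection that was just borrowed successfully gets treated as timed out.
(2) Sugg still uses hdle as its connection variable, so a successful suggestion receive also opens a race window: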
process(worker);
worker->hdle->pool->returnSket(worker->hdle);
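process(worker) goes on to borrow an sh or cache connection, and the borrowed connection is also stored in hdle. Because this is asynchronous, the return call may end up returning the connection that was just borrowed successfully (it is then considered available again, which breaks things).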
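The hazard in (2) can be reproduced deterministically with a small sketch. Everything here is a hypothetical stand-in: `Pool`, `Handle`, and `Worker` are stripped down to the fields involved, and `process` takes the "next" connection as a parameter so that the overwrite, which is asynchronous in the real code, happens synchronously. The sketch also shows one generic guard, capturing the pointer in a local before the shared slot can change; it is an illustration and not necessarily the fix adopted in this note.

```cpp
#include <cassert>
#include <vector>

struct Handle;

// Hypothetical stand-in for the connection pool; it only records what was returned.
struct Pool {
    std::vector<Handle*> returned;
    void returnSket(Handle* h) { returned.push_back(h); }
};

struct Handle {
    Pool* pool;
};

struct Worker {
    Handle* hdle;  // shared slot, reused by the next backend stage
};

// Stand-in for process(): simulates the next stage borrowing a new
// connection and storing it in the same hdle slot.
void process(Worker* worker, Handle* next) {
    assert(worker != nullptr && next != nullptr);
    worker->hdle = next;
}

// Order from the note: worker->hdle is re-read after process().
void finish_racy(Worker* worker, Handle* next) {
    process(worker, next);
    worker->hdle->pool->returnSket(worker->hdle);  // returns `next`, not the sugg connection
}

// Guarded order: remember the sugg connection before the slot can be overwritten.
void finish_captured(Worker* worker, Handle* next) {
    Handle* sugg_conn = worker->hdle;
    process(worker, next);
    sugg_conn->pool->returnSket(sugg_conn);
}
```

With `finish_racy` the pool receives the freshly borrowed connection while it is still in use; with `finish_captured` it receives the finished suggestion connection.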
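These are the two problems identified so far in the suggestion async path.
As for the fix, the best principle is to share no variables at all.
Even then, problem 2 still causes trouble on a timeout or an error retry: after a connection error, the connection borrowed for the retry is also returned almost immediately.
The fix is as follows: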
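The "share nothing" principle could look roughly like the sketch below. The names `Stage`, `sugg`, `sh`, and `finish_stage` are hypothetical, and the `Stat` values merely mirror the ones that appear in this note; the real Worker layout is not shown here, so treat this as an assumption about how the state might be split.

```cpp
#include <cassert>

enum Stat { O_WAIT, O_GET_SUCC, O_GET_FAIL, O_TO };

struct Handle { int fd; };  // hypothetical connection handle

// Everything one backend stage needs, kept apart from the other stages.
struct Stage {
    Stat stat = O_WAIT;
    Handle* hdle = nullptr;
};

struct Worker {
    Stage sugg;  // touched only by the suggestion collector
    Stage sh;    // touched only by the Searchhub collector
};

// Marks one stage as finished and hands back its connection so the caller
// can return exactly that connection to the pool. The slot is cleared, so
// later code cannot pick up the finished connection from it by accident.
Handle* finish_stage(Stage& stage, bool ok) {
    assert(stage.stat == O_WAIT);
    stage.stat = ok ? O_GET_SUCC : O_GET_FAIL;
    Handle* done = stage.hdle;
    stage.hdle = nullptr;
    return done;
}
```

Finishing the suggestion stage leaves the sh stage's state and connection untouched, so a timeout check on one stage cannot misread the other stage's progress.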
int sugg_collector::process(Worker* worker) {
    unregisterWorker(worker->id);
    _INFO("[reqid=%08X] process worker, valid=%d", worker->id, worker->valid);
    if (worker->valid) {
        if (worker->hdle->rtype == CHUNK) {
            // ************ part of the code omitted here ************
        }
        else if (worker->hdle->rtype == BODYLEN) {
            _INFO("go bodylen");
            worker->sugg_res = std::string(worker->hdle->recv_buf + worker->hdle->header.length(), worker->hdle->body_len);
            _INFO("sugg res = %s", worker->sugg_res.c_str());
        }
    }
    // return submitWorker(worker);  // submitWorker is commented out here
    return 0;  // the function returns int, so it still needs a return value
}
And change the fire code to:
if (should_process) {
    process(worker);
    worker->hdle->pool->returnSket(worker->hdle);
    submitWorker(worker);
}
The notify_timeout code can simply be changed to:
if (is_to) {
    //worker->entity.is_res_to = 1;
    int drop_ret = receiver->dropSket(worker->hdle);
    _INFO("[reqid=%08X] sugg to triggered, drop_ret=%d", id, drop_ret);
    //process(worker);  // process does nothing at all here, so just submit the worker directly
    submitWorker(worker);
}
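Notes for the future:
1. Understand the code more deeply and be clear about exactly what granularity each lock covers. Lately, changes made by copying existing patterns keep introducing bugs.
2. Different processing flows should not share variables. Sharing goes beyond what the existing locks protect and triggers all sorts of race conditions.
3. Keep accumulating knowledge, and do not stay confined to simple business code.
4. Finally, hopefully these two problems are the cause; fingers crossed that tomorrow's release goes smoothly.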
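On lock granularity, one common pattern is to make the state transition itself the only thing the lock protects and to let every later decision depend on the local result rather than on a fresh read of the shared field. The sketch below assumes `std::mutex` in place of the worker's own lock()/unlock(), and `try_claim` is a hypothetical helper, not a function from the real code.

```cpp
#include <cassert>
#include <mutex>

enum Stat { O_WAIT, O_GET_SUCC, O_GET_FAIL, O_TO };

struct Worker {
    std::mutex mu;
    Stat stat = O_WAIT;
};

// Moves the worker out of O_WAIT exactly once. Returns true for the single
// caller that performed the transition; every later caller gets false.
bool try_claim(Worker& w, Stat next) {
    assert(next != O_WAIT);
    std::lock_guard<std::mutex> guard(w.mu);
    if (w.stat != O_WAIT) {
        return false;
    }
    w.stat = next;
    return true;
}
```

If the receive path and the timeout path both call `try_claim`, only one of them wins, and only the winner should go on to process, return the connection, and submit the worker. Note that the fire() listing in this document reads ori_stat again after unlock(); branching on the value decided inside the lock would avoid that unprotected read.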
***************** Problematic code *********************
int sugg_collector::fire(unsigned int worker_id, int type, void* arg) {
    _INFO("[reqid=%08X] sugg arrive", worker_id);
    Worker* worker = NULL;
    worker = findWorker(worker_id);
    if (worker != NULL) {
        bool should_process = false;
        worker->lock();
        if (worker->ori_stat == O_WAIT) {
            timer->unregister_timer(worker->ori_timer_id, worker->id);
            should_process = true;
            if (type < 0) {
                _INFO("[reqid=%08X] sugg err", worker_id);
                worker->ori_stat = O_GET_FAIL;
            }
            else {
                worker->ori_stat = O_GET_SUCC;
            }
        }
        worker->unlock();
        if (worker->ori_stat == O_GET_FAIL) {
            worker->valid = false;
        }
        if (should_process) {
            process(worker);
            worker->hdle->pool->returnSket(worker->hdle);
        }
    }
    else {
        _INFO("[reqid=%08X] sugg al to", worker_id);
    }
    return 0;
}
int sugg_collector::notify_timeout(unsigned int id, const timeval & now, const unsigned int time_out_ms) {
    _INFO("[reqid=%08X] sugg to", id);
    timeval cur;
    gettimeofday(&cur, NULL);
    Worker* worker = NULL;
    worker = findWorker(id);
    if (worker != NULL) {
        _INFO("cost=%ld", cur.tv_sec * 1000 + cur.tv_usec / 1000 - worker->ori_timestamp.tv_sec * 1000 - worker->ori_timestamp.tv_usec / 1000);
        bool is_to = false;
        worker->lock();
        if (worker->ori_stat == O_WAIT) {
            //worker->reason = reason_recv_to;
            worker->valid = false;
            is_to = true;
            worker->ori_stat = O_TO;
            worker->entity.sugg_to++;
        }
        worker->unlock();
        if (is_to) {
            //worker->entity.is_res_to = 1;
            int drop_ret = receiver->dropSket(worker->hdle);
            _INFO("[reqid=%08X] sugg to triggered, drop_ret=%d", id, drop_ret);
            process(worker);
        }
    }
    else {
        _INFO("[reqid=%08X] sugg arr", id);
    }
    return 0;
}