Figure: overall call flow of the service behind the 504 alerts
2022/07/22 14:20:00
It happened on a bright and sunny afternoon. The department announced a big customer needing a POC - the XX test. The competitor had offered them 1,000 QPS, so we had to match it. As the technical owner for XX (too worn out to grandstand any more), I stated very clearly that the <bottleneck service>, a second-level sub-service, can currently sustain at most 300 QPS, and it is shared: day-to-day traffic already takes about 100 QPS, leaving only 200 QPS of headroom. After a round of discussion we scaled out with more machines; based on the load-test report we ended up at roughly 500 QPS overall, and after negotiating with the customer we agreed to cap their traffic at 300 QPS.
2022/07/22 15:20:00
The QPS rate-limit ticket was approved and customer traffic started flowing in. Overall QPS climbed and a bandwidth alert fired; we immediately asked the ops team to expand bandwidth from 500M to 1000M, after which overall RT stabilized. We watched the monitors for 30 minutes and posted the QPS/RT numbers in the group chat. Was it really over, though?
2022/07/22 17:05:00
The gateway threw a large number of 504 alerts, which lasted about 7 seconds and then recovered. Time to go through the monitoring:
- The RT of one of the furthest-downstream <second-level sub-services> was extremely high, with a large number of timeouts - notified the sub-service owner to investigate the cause - END
- The <first-level sub-service> is supposed to degrade gracefully when calling the <second-level sub-service> - why didn't that fallback kick in?
The code first:
// Submit all model tasks to the shared executor
for (AbstractModelTask task : tasks) {
    futures.add(MODEL_TASK_EXECUTOR_POOL.submit(task));
}
// One shared time budget for the whole batch: every get() waits only for what is left of it
long timeout = modelLbUtil.getModelExecuteTimeout();
long now = System.currentTimeMillis();
for (Future<?> future : futures) {
    if (future != null) {
        long remain = timeout - (System.currentTimeMillis() - now);
        try {
            if (remain < 0) {
                // Budget exhausted: stop waiting on the remaining futures
                LOG.warn("futureGet remain out p={} s={}", partnerCode, seqId);
                break;
            }
            future.get(remain, TimeUnit.MILLISECONDS);
        } catch (XXXX e) {
            //XXXXXXXXXXXXXX
        } catch (Exception e) {
            LOG.warn("future get error, msg={} p={} s={}", e.getMessage(), partnerCode, seqId);
        } finally {
            // Degradation: give up on this task, interrupting it if it is still running
            future.cancel(true);
        }
    }
}
Then, putting this together with the logs:
the analysis showed that at 2022/07/22 17:08:06 the thread-pool future had already failed, yet the code only reached the "futureGet remain out" timeout branch at 2022/07/22 17:08:12. What was the thread doing during those six seconds?
- executing future.cancel(true);
- the get() of the next foreach iteration;
- long remain = timeout - (System.currentTimeMillis() - now);
Taking these three actions one by one:
- the third: plain arithmetic on the system clock (cannot possibly take long by itself)
- the second: stepping the foreach over the list, with no thread contention (cannot possibly take long by itself)
- the first: cancel(true) - cancelling the future, interrupting the thread running it and ending the task (cannot possibly take long by itself)
- yet the logs show it only executed somewhere between 2022/07/22 17:08:12.342 and 2022/07/22 17:08:12.495. That is bizarre: logically it should have finished back at 2022/07/22 17:08:06. It just didn't add up (a timing-instrumentation sketch for narrowing this down follows below).
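To pin down which of the three steps actually swallowed the time, one option - purely a hypothetical sketch, not something that was in the production code; the pool size, task durations, and the class name FutureWaitProbe are all made up - is to reproduce the same wait loop with a monotonic timestamp around each step:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical repro of the wait loop with per-step timing
public class FutureWaitProbe {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            final long sleepMs = 100L * i;                 // stand-in for an AbstractModelTask
            futures.add(pool.submit(() -> { Thread.sleep(sleepMs); return null; }));
        }
        long timeout = 500;                                // shared budget, like getModelExecuteTimeout()
        long now = System.currentTimeMillis();
        long prevEnd = System.nanoTime();
        for (Future<?> future : futures) {
            long iterStart = System.nanoTime();            // cost of stepping the foreach
            long remain = timeout - (System.currentTimeMillis() - now);
            long afterRemain = System.nanoTime();          // cost of the remain arithmetic
            try {
                if (remain < 0) {
                    System.out.println("futureGet remain out");
                    break;
                }
                future.get(remain, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                System.out.println("future get error: " + e);
            } finally {
                long beforeCancel = System.nanoTime();
                future.cancel(true);                       // cost of cancel(true)
                long afterCancel = System.nanoTime();
                System.out.printf("iterGap=%dus remainCalc=%dus waitOrError=%dus cancel=%dus%n",
                        (iterStart - prevEnd) / 1_000, (afterRemain - iterStart) / 1_000,
                        (beforeCancel - afterRemain) / 1_000, (afterCancel - beforeCancel) / 1_000);
                prevEnd = afterCancel;
            }
        }
        pool.shutdownNow();
    }
}
If the thread really does freeze somewhere, the pause shows up as one outsized delta on exactly one of these four measurements.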
Keep digging into this cancel call:
//========================= the interface
/**
* Attempts to cancel execution of this task. This attempt will
* fail if the task has already completed, has already been cancelled,
* or could not be cancelled for some other reason. If successful,
* and this task has not started when {@code cancel} is called,
* this task should never run. If the task has already started,
* then the {@code mayInterruptIfRunning} parameter determines
* whether the thread executing this task should be interrupted in
* an attempt to stop the task.
*
* <p>After this method returns, subsequent calls to {@link #isDone} will
* always return {@code true}. Subsequent calls to {@link #isCancelled}
* will always return {@code true} if this method returned {@code true}.
*
* @param mayInterruptIfRunning {@code true} if the thread executing this
* task should be interrupted; otherwise, in-progress tasks are allowed
* to complete
* @return {@code false} if the task could not be cancelled,
* typically because it has already completed normally;
* {@code true} otherwise
*/
boolean cancel(boolean mayInterruptIfRunning);
//========================= abstract implementation
protected void cancel() {
canceled = true;
}
//========================= concrete implementation
@Override
protected void cancel() {
super.cancel();
if (completed)
setLocalResult(getEmptyResult());
}
@Override
protected final Node<P_OUT> getEmptyResult() {
return Nodes.emptyNode(op.getOutputShape());
}
@Override
protected void setLocalResult(R localResult) {
if (isRoot()) {
if (localResult != null)
sharedResult.compareAndSet(null, localResult);
}
else
super.setLocalResult(localResult);
}
The official API doc says it does not block. Stepping inside the method, the only potentially costly operation is compareAndSet() - a CAS - and even that cannot block for six whole seconds. What on earth is going on?!! (A quick sanity check on the cost of cancel(true) is sketched below.)
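As that sanity check - again just a hypothetical standalone snippet, not from the incident code (the class name CancelCostCheck and the 60s sleep are made up) - you can time cancel(true) against a task that is busy running:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sanity check: how long does Future.cancel(true) take on a running task?
public class CancelCostCheck {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> future = pool.submit(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        Thread.sleep(100);                        // give the task time to start running
        long start = System.nanoTime();
        boolean cancelled = future.cancel(true);  // flips the task state and interrupts the worker
        long costUs = (System.nanoTime() - start) / 1_000;
        System.out.println("cancelled=" + cancelled + ", cost=" + costUs + "us");
        pool.shutdownNow();
    }
}
On an otherwise healthy JVM this prints a cost of a few microseconds, which matches the documentation: whatever stalled for six seconds, it was not the cancel logic itself.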
========================= divider ============================
Interim conclusion: all we could say at this point was that the thread had been blocked for no apparent reason.
Thinking back over past incidents: I had run into 504s before. That time the cause was that Tomcat's initial connection count was too low; once QPS ramped up, connections were being created over and over, burning resources and blocking some threads. Could it be the same thing here??
Hunting for evidence:
A suspicious clue turned up: right around the 504 incident the thread pool showed very obvious fluctuation, even though the pool is configured with coreThreadNum = 300 and maxThreadNum = 600.
It was odd that this should happen at all - could it be related to worker threads being destroyed and recreated?? (An in-process way to watch for this is sketched after the jstack script below.)
This was the moment to run Jstack pid > stackLog.stack, but once the window is missed, it's gone.
So, a jstack script to record how the number of Model_Task_Executor_Pool worker threads changes - and to observe whether the worker threads change whenever the monitoring curve fluctuates.
The result, however, was not encouraging: there was no correlation at all.
#!/bin/bash
if [ $# -eq 0 ];then
echo "please enter java pid"
exit -1
fi
pid=$1
# change this to the jstack path of your own JDK installation
jstack_cmd="/usr/bin/jstack"
#line=`top -H -o %CPU -b -n 1 -p $pid | sed '1,/^$/d' | grep -v $pid | awk 'NR==2'`
#line=`top -H -b -n 1 -p $pid | sed '1,/^$/d' | sed '1d;/^$/d' | grep -v $pid | sort -nrk9 | head -1`
#echo "$line" | awk '{print "tid: "$1," cpu: %"$9}'
#tid_0x=`printf "%0x" $(echo "$line" | awk '{print $1}')`
#$jstack_cmd $pid | grep $tid_0x -A20 | sed -n '1,/^$/p'
for i in {1..2000};
do
    # take a thread dump and count the Model_Task_Executor_Pool_ worker threads
    $jstack_cmd $pid > stack_thread.log
    echo "sampling every 5s, Model_Task_Executor_Pool_ worker thread count =========>"
    line=`grep 'Model_Task_Executor_Pool' stack_thread.log | awk '{print $8}' | uniq -c`
    echo "$line"
    line2=`grep 'Model_Task_Executor_Pool' stack_thread.log | awk '{print $4}' | uniq -c`
    echo "total: $line2"
    echo "=================>"
    sleep 5s
done
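As an aside - purely a hypothetical alternative, assuming the executor is a plain java.util.concurrent.ThreadPoolExecutor that application code can reach (the 300/600 configuration below just mirrors the numbers above, and PoolSampler is a made-up name) - the same worker-thread statistics can be sampled from inside the JVM instead of via jstack:
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical in-process counterpart of the jstack script: sample the pool's own counters
public class PoolSampler {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                300, 600, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(1000),
                r -> new Thread(r, "Model_Task_Executor_Pool_" + System.nanoTime()));
        ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();
        // Every 5s, print how many workers exist, how many are busy, and the queue depth
        sampler.scheduleAtFixedRate(() -> System.out.printf(
                        "poolSize=%d active=%d largest=%d completed=%d queued=%d%n",
                        pool.getPoolSize(), pool.getActiveCount(), pool.getLargestPoolSize(),
                        pool.getCompletedTaskCount(), pool.getQueue().size()),
                0, 5, TimeUnit.SECONDS);
        // ... submit the real workload to `pool` here ...
    }
}
A sudden dip in poolSize while the load stays steady would be the signature of workers being destroyed and recreated.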
The root cause is in the follow-up article below.