Figure: overall call flow of the service behind the 504 alerts
2022/07/22 14:20:00
It happened on a bright and sunny afternoon. The department announced a big customer needing a POC - the XX test. The competitor had offered them 1,000 QPS, so we had to match it. As the technical owner for XX (too worn out to grandstand any more), I stated very clearly that the <bottleneck service>, a second-level sub-service, can currently sustain at most 300 QPS, and it is shared: day-to-day traffic already takes about 100 QPS, leaving only 200 QPS of headroom. After a round of discussion we scaled out with more machines; based on the load-test report we ended up at roughly 500 QPS overall, and after negotiating with the customer we agreed to cap their traffic at 300 QPS.
2022/07/22 15:20:00
The QPS rate-limit ticket was approved and customer traffic started flowing in. Overall QPS climbed and a bandwidth alert fired; we immediately asked the ops team to expand bandwidth from 500M to 1000M, after which overall RT stabilized. We watched the monitors for 30 minutes and posted the QPS/RT numbers in the group chat. Was it really over, though?
2022/07/22 17:05:00
The gateway threw a large number of 504 alerts, which lasted about 7 seconds and then recovered. Time to go through the monitoring:
- The RT of one of the furthest-downstream <second-level sub-services> was extremely high, with a large number of timeouts - notified the sub-service owner to investigate the cause - END
- The <first-level sub-service> is supposed to degrade gracefully when calling the <second-level sub-service> - why didn't that fallback kick in?
The code first:
// Submit all model tasks to the shared executor
for (AbstractModelTask task : tasks) {
    futures.add(MODEL_TASK_EXECUTOR_POOL.submit(task));
}
// One shared time budget for the whole batch: every get() waits only for what is left of it
long timeout = modelLbUtil.getModelExecuteTimeout();
long now = System.currentTimeMillis();
for (Future<?> future : futures) {
    if (future != null) {
        long remain = timeout - (System.currentTimeMillis() - now);
        try {
            if (remain < 0) {
                // Budget exhausted: stop waiting on the remaining futures
                LOG.warn("futureGet remain out p={} s={}", partnerCode, seqId);
                break;
            }
            future.get(remain, TimeUnit.MILLISECONDS);
        } catch (XXXX e) {
            //XXXXXXXXXXXXXX
        } catch (Exception e) {
            LOG.warn("future get error, msg={} p={} s={}", e.getMessage(), partnerCode, seqId);
        } finally {
            // Degradation: give up on this task, interrupting it if it is still running
            future.cancel(true);
        }
    }
}
Then, putting this together with the logs:
the analysis showed that at 2022/07/22 17:08:06 the thread-pool future had already failed, yet the code only reached the "futureGet remain out" timeout branch at 2022/07/22 17:08:12. What was the thread doing during those six seconds?
- executing future.cancel(true);
- the get() of the next foreach iteration;
- long remain = timeout - (System.currentTimeMillis() - now);
Taking these three actions one by one:
- the third: plain arithmetic on the system clock (cannot possibly take long by itself)
- the second: stepping the foreach over the list, with no thread contention (cannot possibly take long by itself)
- the first: cancel(true) - cancelling the future, interrupting the thread running it and ending the task (cannot possibly take long by itself)
- yet the logs show it only executed somewhere between 2022/07/22 17:08:12.342 and 2022/07/22 17:08:12.495. That is bizarre: logically it should have finished back at 2022/07/22 17:08:06. It just didn't add up (a timing-instrumentation sketch for narrowing this down follows below).
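To pin down which of the three steps actually swallowed the time, one option - purely a hypothetical sketch, not something that was in the production code; the pool size, task durations, and the class name FutureWaitProbe are all made up - is to reproduce the same wait loop with a monotonic timestamp around each step:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical repro of the wait loop with per-step timing
public class FutureWaitProbe {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            final long sleepMs = 100L * i;                 // stand-in for an AbstractModelTask
            futures.add(pool.submit(() -> { Thread.sleep(sleepMs); return null; }));
        }
        long timeout = 500;                                // shared budget, like getModelExecuteTimeout()
        long now = System.currentTimeMillis();
        long prevEnd = System.nanoTime();
        for (Future<?> future : futures) {
            long iterStart = System.nanoTime();            // cost of stepping the foreach
            long remain = timeout - (System.currentTimeMillis() - now);
            long afterRemain = System.nanoTime();          // cost of the remain arithmetic
            try {
                if (remain < 0) {
                    System.out.println("futureGet remain out");
                    break;
                }
                future.get(remain, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                System.out.println("future get error: " + e);
            } finally {
                long beforeCancel = System.nanoTime();
                future.cancel(true);                       // cost of cancel(true)
                long afterCancel = System.nanoTime();
                System.out.printf("iterGap=%dus remainCalc=%dus waitOrError=%dus cancel=%dus%n",
                        (iterStart - prevEnd) / 1_000, (afterRemain - iterStart) / 1_000,
                        (beforeCancel - afterRemain) / 1_000, (afterCancel - beforeCancel) / 1_000);
                prevEnd = afterCancel;
            }
        }
        pool.shutdownNow();
    }
}
If the thread really does freeze somewhere, the pause shows up as one outsized delta on exactly one of these four measurements.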
Keep digging into this cancel call:
//========================= the interface
/**
* Attempts to cancel execution of this task. This attempt will
* fail if the task has already completed, has already been cancelled,
* or could not be cancelled for some other reason. If successful,
* and this task has not started when {@code cancel} is called,
* this task should never run. If the task has already started,
* then the {@code mayInterruptIfRunning} parameter determines
* whether the thread executing this task should be interrupted in
* an attempt to stop the task.
*
* <p>After this method returns, subsequent calls to {@link #isDone} will
* always return {@code true}. Subsequent calls to {@link #isCancelled}
* will always return {@code true} if this method returned {@code true}.
*
* @param mayInterruptIfRunning {@code true} if the thread executing this
* task should be interrupted; otherwise, in-progress tasks are allowed
* to complete
* @return {@code false} if the task could not be cancelled,
* typically because it has already completed normally;
* {@code true} otherwise
*/
boolean cancel(boolean mayInterruptIfRunning);
//========================= abstract implementation
protected void cancel() {
canceled = true;
}
//========================= concrete implementation
@Override
protected void cancel() {
super.cancel();
if (completed)
setLocalResult(getEmptyResult());
}
@Override
protected final Node<P_OUT> getEmptyResult() {
return Nodes.emptyNode(op.getOutputShape());
}
@Override
protected void setLocalResult(R localResult) {
if (isRoot()) {
if (localResult != null)
sharedResult.compareAndSet(null, localResult);
}
else
super.setLocalResult(localResult);
}
The official API doc says it does not block. Stepping inside the method, the only potentially costly operation is compareAndSet() - a CAS - and even that cannot block for six whole seconds. What on earth is going on?!! (A quick sanity check on the cost of cancel(true) is sketched below.)
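As that sanity check - again just a hypothetical standalone snippet, not from the incident code (the class name CancelCostCheck and the 60s sleep are made up) - you can time cancel(true) against a task that is busy running:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sanity check: how long does Future.cancel(true) take on a running task?
public class CancelCostCheck {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> future = pool.submit(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        Thread.sleep(100);                        // give the task time to start running
        long start = System.nanoTime();
        boolean cancelled = future.cancel(true);  // flips the task state and interrupts the worker
        long costUs = (System.nanoTime() - start) / 1_000;
        System.out.println("cancelled=" + cancelled + ", cost=" + costUs + "us");
        pool.shutdownNow();
    }
}
On an otherwise healthy JVM this prints a cost of a few microseconds, which matches the documentation: whatever stalled for six seconds, it was not the cancel logic itself.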
========================= divider ============================
Interim conclusion: all we could say at this point was that the thread had been blocked for no apparent reason.
Thinking back over past incidents: I had run into 504s before. That time the cause was that Tomcat's initial connection count was too low; once QPS ramped up, connections were being created over and over, burning resources and blocking some threads. Could it be the same thing here??
Hunting for evidence:
A suspicious clue turned up: right around the 504 incident the thread pool showed very obvious fluctuation, even though the pool is configured with coreThreadNum = 300 and maxThreadNum = 600.
It was odd that this should happen at all - could it be related to worker threads being destroyed and recreated?? (An in-process way to watch for this is sketched after the jstack script below.)
This was the moment to run Jstack pid > stackLog.stack, but once the window is missed, it's gone.
So, a jstack script to record how the number of Model_Task_Executor_Pool worker threads changes - and to observe whether the worker threads change whenever the monitoring curve fluctuates.
The result, however, was not encouraging: there was no correlation at all.
#!/bin/bash
if [ $# -eq 0 ];then
echo "please enter java pid"
exit -1
fi
pid=$1
# change this to the jstack path of your own JDK installation
jstack_cmd="/usr/bin/jstack"
#line=`top -H -o %CPU -b -n 1 -p $pid | sed '1,/^$/d' | grep -v $pid | awk 'NR==2'`
#line=`top -H -b -n 1 -p $pid | sed '1,/^$/d' | sed '1d;/^$/d' | grep -v $pid | sort -nrk9 | head -1`
#echo "$line" | awk '{print "tid: "$1," cpu: %"$9}'
#tid_0x=`printf "%0x" $(echo "$line" | awk '{print $1}')`
#$jstack_cmd $pid | grep $tid_0x -A20 | sed -n '1,/^$/p'
for i in {1..2000};
do
    # take a thread dump and count the Model_Task_Executor_Pool_ worker threads
    $jstack_cmd $pid > stack_thread.log
    echo "sampling every 5s, Model_Task_Executor_Pool_ worker thread count =========>"
    line=`grep 'Model_Task_Executor_Pool' stack_thread.log | awk '{print $8}' | uniq -c`
    echo "$line"
    line2=`grep 'Model_Task_Executor_Pool' stack_thread.log | awk '{print $4}' | uniq -c`
    echo "total: $line2"
    echo "=================>"
    sleep 5s
done
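As an aside - purely a hypothetical alternative, assuming the executor is a plain java.util.concurrent.ThreadPoolExecutor that application code can reach (the 300/600 configuration below just mirrors the numbers above, and PoolSampler is a made-up name) - the same worker-thread statistics can be sampled from inside the JVM instead of via jstack:
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical in-process counterpart of the jstack script: sample the pool's own counters
public class PoolSampler {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                300, 600, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(1000),
                r -> new Thread(r, "Model_Task_Executor_Pool_" + System.nanoTime()));
        ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();
        // Every 5s, print how many workers exist, how many are busy, and the queue depth
        sampler.scheduleAtFixedRate(() -> System.out.printf(
                        "poolSize=%d active=%d largest=%d completed=%d queued=%d%n",
                        pool.getPoolSize(), pool.getActiveCount(), pool.getLargestPoolSize(),
                        pool.getCompletedTaskCount(), pool.getQueue().size()),
                0, 5, TimeUnit.SECONDS);
        // ... submit the real workload to `pool` here ...
    }
}
A sudden dip in poolSize while the load stays steady would be the signature of workers being destroyed and recreated.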
The root cause is in the follow-up article below.