SkyWalking Study Series: Source Code Analysis of Alarm Notification

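Preface

  In previous articles we saw that SkyWalking can collect trace data. But how are we notified when something in a trace goes wrong? This article walks through SkyWalking's alarm flow in the source code, using response-time timeouts as the example.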

How ServiceDispatcher Is Generated

  1. When OALRuntime#start runs, it parses the oal/core.oal configuration file and dynamically generates ServiceDispatcher
    public void start(ClassLoader currentClassLoader) throws ModuleStartException, OALCompileException {
        // Read the oal/core.oal configuration file
        Reader read = ResourceUtils.read(oalDefine.getConfigFile());

        // Parse the oal/core.oal configuration file
        ScriptParser scriptParser = ScriptParser.createFromFile(read, oalDefine.getSourcePackage());
        OALScripts oalScripts = scriptParser.parse();

        // Generate the corresponding Dispatcher classes
        this.generateClassAtRuntime(oalScripts);
    }

  2. OALRuntime#generateClassAtRuntime generates ServiceDispatcher
    private void generateClassAtRuntime(OALScripts oalScripts) throws OALCompileException {
        List<AnalysisResult> metricsStmts = oalScripts.getMetricsStmts();
        /**
         * 1. Take the sourceName from the "from" attribute of each metricsStmt ---> Service
         * 2. Map the sourceName to a DispatcherContext: OALRuntime$allDispatcherContext$allContext
         * 3. Put the metricsStmts that belong to the sourceName into the metrics attribute of its DispatcherContext
         */
        metricsStmts.forEach(this::buildDispatcherContext);

        for (AnalysisResult metricsStmt : metricsStmts) {
            /**
             * generateMetricsClass generates the Metrics class (e.g. ServiceRespTimeMetrics),
             * which extends LongAvgMetrics and implements the WithMetadata interface
             * --> methods are added to ServiceRespTimeMetrics (the hashCode method comes from metrics/hashCode.ftl)
             */
            metricsClasses.add(generateMetricsClass(metricsStmt));
            generateMetricsBuilderClass(metricsStmt);
        }

        for (Map.Entry<String, DispatcherContext> entry : allDispatcherContext.getAllContext().entrySet()) {
            // generateDispatcherClass generates the Dispatcher class
            dispatcherClasses.add(generateDispatcherClass(entry.getKey(), entry.getValue()));
        }

        oalScripts.getDisableCollection().getAllDisableSources().forEach(disable -> {
            DisableRegister.INSTANCE.add(disable);
        });
    }
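
  To make the grouping step concrete, here is a minimal sketch of how OAL statements could be turned into one context per source. It is an illustration only: the class OalGroupingSketch, its MetricStmt type and the regular expression are hypothetical, and the real ScriptParser is grammar based rather than regex based. The sample OAL lines are written in the style of oal/core.oal.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for OAL parsing. NOT the real ScriptParser.
class OalGroupingSketch {

    // One parsed statement, e.g. service_resp_time = from(Service.latency).longAvg();
    static final class MetricStmt {
        final String metricsName; // service_resp_time
        final String sourceName;  // Service
        final String attribute;   // latency, or * for "the whole source"
        final String function;    // longAvg

        MetricStmt(String metricsName, String sourceName, String attribute, String function) {
            this.metricsName = metricsName;
            this.sourceName = sourceName;
            this.attribute = attribute;
            this.function = function;
        }
    }

    // Hypothetical pattern that only covers the simple "name = from(Source.attr).func(...);" shape.
    private static final Pattern STATEMENT = Pattern.compile(
        "^\\s*(\\w+)\\s*=\\s*from\\((\\w+)\\.(\\w+|\\*)\\)\\.(\\w+)\\(.*\\)\\s*;\\s*$");

    static MetricStmt parse(String line) {
        Matcher m = STATEMENT.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported OAL line: " + line);
        }
        return new MetricStmt(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    // Same idea as buildDispatcherContext: collect the statements of each source in one place,
    // so that one dispatcher per source can be generated afterwards.
    static Map<String, List<MetricStmt>> groupBySource(List<String> lines) {
        Map<String, List<MetricStmt>> contexts = new LinkedHashMap<>();
        for (String line : lines) {
            MetricStmt stmt = parse(line);
            contexts.computeIfAbsent(stmt.sourceName, k -> new ArrayList<>()).add(stmt);
        }
        return contexts;
    }

    public static void main(String[] args) {
        Map<String, List<MetricStmt>> contexts = groupBySource(List.of(
            "service_resp_time = from(Service.latency).longAvg();",
            "service_cpm = from(Service.*).cpm();",
            "endpoint_resp_time = from(Endpoint.latency).longAvg();"));
        contexts.forEach((source, stmts) ->
            System.out.println(source + "Dispatcher handles " + stmts.size() + " metric(s)"));
    }
}
```

  The point of the grouping is that the number of generated dispatchers follows the number of sources, not the number of metrics: every metric defined on Service ends up as one method of the same ServiceDispatcher.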
    

OALRuntime#generateDispatcherClass generates ServiceDispatcher

    private Class generateDispatcherClass(String scopeName,
                                          DispatcherContext dispatcherContext) throws OALCompileException {
        // Generate the ServiceDispatcher class dynamically with javassist
        String className = dispatcherClassName(scopeName, false);
        CtClass dispatcherClass = classPool.makeClass(dispatcherClassName(scopeName, true));

        // Add one method per Service-scope metric to ServiceDispatcher (method template: dispatcher/doMetrics.ftl),
        // e.g. the doServiceRespTime method shown below
        for (AnalysisResult dispatcherContextMetric : dispatcherContext.getMetrics()) {
            StringWriter methodEntity = new StringWriter();
            configuration.getTemplate("dispatcher/doMetrics.ftl").process(dispatcherContextMetric, methodEntity);
            dispatcherClass.addMethod(CtNewMethod.make(methodEntity.toString(), dispatcherClass));
        }

        // Add the dispatch method to ServiceDispatcher (it calls all of the methods above; method template: dispatcher/dispatch.ftl)
        StringWriter methodEntity = new StringWriter();
        configuration.getTemplate("dispatcher/dispatch.ftl").process(dispatcherContext, methodEntity);
        dispatcherClass.addMethod(CtNewMethod.make(methodEntity.toString(), dispatcherClass));

        // ... (exception handling and returning the generated class are omitted here)
    }
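
  The template step can be pictured with plain string substitution. The sketch below is an assumption-laden illustration: the class DispatcherTemplateSketch, the ${...} placeholder names and the template text are all hypothetical, and it only produces method source text. In OALRuntime the templates are FreeMarker files and the resulting text is compiled into the class with javassist, which this sketch does not do.

```java
import java.util.List;

// Renders method source text from a tiny hand-written template (hypothetical, not the .ftl files).
class DispatcherTemplateSketch {

    // Illustrative template for one "doXxx" method; type names are shortened for readability.
    private static final String DO_METRICS_TEMPLATE =
        "private void do${name}(${source} source) {\n"
      + "    ${name}Metrics metrics = new ${name}Metrics();\n"
      + "    metrics.setTimeBucket(source.getTimeBucket());\n"
      + "    metrics.setEntityId(source.getEntityId());\n"
      + "    metrics.combine(${args});\n"
      + "    MetricsStreamProcessor.getInstance().in(metrics);\n"
      + "}\n";

    // service_resp_time -> ServiceRespTime
    static String toCamelCase(String snake) {
        StringBuilder sb = new StringBuilder();
        for (String part : snake.split("_")) {
            if (part.isEmpty()) {
                continue;
            }
            sb.append(Character.toUpperCase(part.charAt(0))).append(part.substring(1));
        }
        return sb.toString();
    }

    // One method per metric of the source.
    static String renderDoMethod(String metricsName, String sourceName, String combineArgs) {
        return DO_METRICS_TEMPLATE
            .replace("${name}", toCamelCase(metricsName))
            .replace("${source}", sourceName)
            .replace("${args}", combineArgs);
    }

    // One dispatch method per source, calling every doXxx method in turn.
    static String renderDispatch(String sourceName, List<String> metricsNames) {
        StringBuilder body = new StringBuilder();
        body.append("public void dispatch(ISource source) {\n");
        body.append("    ").append(sourceName).append(" _source = (")
            .append(sourceName).append(") source;\n");
        for (String metricsName : metricsNames) {
            body.append("    do").append(toCamelCase(metricsName)).append("(_source);\n");
        }
        return body.append("}\n").toString();
    }

    public static void main(String[] args) {
        System.out.println(renderDoMethod(
            "service_resp_time", "Service", "(long)(source.getLatency()), (long)(1)"));
        System.out.println(renderDispatch("Service", List.of("service_resp_time", "service_sla")));
    }
}
```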
  1. Add the Service-scope metrics methods to ServiceDispatcher (taking ServiceRespTime as an example)
    private void doServiceRespTime(org.apache.skywalking.oap.server.core.source.Service source) {
    	org.apache.skywalking.oap.server.core.source.oal.rt.metrics.ServiceRespTimeMetrics metrics = 
    						new org.apache.skywalking.oap.server.core.source.oal.rt.metrics.ServiceRespTimeMetrics();
    	metrics.setTimeBucket(source.getTimeBucket());
        metrics.setEntityId(source.getEntityId());
    	metrics.combine( (long)(source.getLatency()), (long)(1));
    	org.apache.skywalking.oap.server.core.analysis.worker.MetricsStreamProcessor.getInstance().in(metrics);
    }
    
  2. Add the dispatch method to ServiceDispatcher
    public void dispatch(org.apache.skywalking.oap.server.core.source.ISource source) {
    	org.apache.skywalking.oap.server.core.source.Service _source = (org.apache.skywalking.oap.server.core.source.Service)source;
        doServiceRespTime(_source);
        doServiceSla(_source);
        doServiceCpm(_source);
        doServicePercentile(_source);
        doServiceApdex(_source);
        doServiceMqConsumeCount(_source);
        doServiceMqConsumeLatency(_source);
    }
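
  The call metrics.combine(latency, 1) in doServiceRespTime feeds a long-average metric. The sketch below shows what such an accumulator might look like. It is a simplified model, not the source of LongAvgMetrics: the class name LongAvgSketch and its merge method are hypothetical, and it assumes that metrics created for the same entity and time bucket are merged later in the pipeline.

```java
// Simplified model of a long-average metric: keep a sum and a count, divide on demand.
class LongAvgSketch {
    private long summation;
    private long count;
    private long value;

    // Called once per source, e.g. combine(latency, 1).
    void combine(long summation, long count) {
        this.summation += summation;
        this.count += count;
    }

    // Hypothetical merge of two partial results for the same entity and time bucket.
    void merge(LongAvgSketch other) {
        combine(other.summation, other.count);
    }

    // Integer average; left at 0 while nothing has been combined.
    void calculate() {
        if (count > 0) {
            value = summation / count;
        }
    }

    long getValue() {
        return value;
    }

    public static void main(String[] args) {
        // Each dispatched source produces its own small metrics object...
        long[] latencies = {100, 200, 600};
        LongAvgSketch merged = new LongAvgSketch();
        for (long latency : latencies) {
            LongAvgSketch single = new LongAvgSketch();
            single.combine(latency, 1);
            // ...and the partial results are merged before the average is calculated.
            merged.merge(single);
        }
        merged.calculate();
        System.out.println("average latency = " + merged.getValue());
    }
}
```

  Keeping the sum and the count separate until the end is what makes the partial results mergeable; averaging averages would give the wrong answer whenever the partial counts differ.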
    

How ServiceDispatcher Produces ServiceRespTimeMetrics Data

  1. TraceSegmentReportServiceHandler receives the reported trace data (SegmentObject). The core flow in the source code is:
    TraceSegmentReportServiceHandler#collect#onNext --> SegmentParserServiceImpl#send --> TraceAnalyzer#doAnalysis --> RPCAnalysisListener#parseEntry --> RPCAnalysisListener#build --> SourceReceiverImpl#receive --> dispatcherManager#forward --> XXXDispatcher#dispatch

  2. notifyEntryListener calls RPCAnalysisListener#parseEntry, which fills the upstream and downstream link information (including the latency) into sourceBuilder and adds it to callingInTraffic

    public void parseEntry(SpanObject span, SegmentObject segmentObject) {
    	RPCTrafficSourceBuilder sourceBuilder = new RPCTrafficSourceBuilder(namingControl);
        sourceBuilder.setSourceServiceName(Const.USER_SERVICE_NAME);
        sourceBuilder.setSourceServiceInstanceName(Const.USER_INSTANCE_NAME);
        sourceBuilder.setSourceEndpointName(Const.USER_ENDPOINT_NAME);
        sourceBuilder.setSourceLayer(Layer.UNDEFINED);
        sourceBuilder.setDestServiceInstanceName(segmentObject.getServiceInstance());
        sourceBuilder.setDestServiceName(segmentObject.getService());
        sourceBuilder.setDestLayer(identifyServiceLayer(span.getSpanLayer()));
        sourceBuilder.setDestEndpointName(span.getOperationName());
        sourceBuilder.setDetectPoint(DetectPoint.SERVER);
        sourceBuilder.setComponentId(span.getComponentId());
    	
        // Computes the latency (endTime - startTime) and the time bucket (start time at minute precision, e.g. 202207232324)
        setPublicAttrs(sourceBuilder, span);
        callingInTraffic.add(sourceBuilder);
    }
    
    
  3. The RPCAnalysisListener#build snippet below shows that Source objects of the types Service, ServiceInstance, ServiceRelation and ServiceInstanceRelation are built and submitted to sourceReceiver. The DispatcherManager wrapped underneath selects the matching SourceDispatcher according to the Source type and processes the Source further through its dispatch method

    public void build() {
        callingInTraffic.forEach(callingIn -> {
            callingIn.prepare();
            sourceReceiver.receive(callingIn.toService());
            sourceReceiver.receive(callingIn.toServiceInstance());
            sourceReceiver.receive(callingIn.toServiceRelation());
            sourceReceiver.receive(callingIn.toServiceInstanceRelation());
            // ...
        });
        // ...
    }

  4. How SourceDispatcher classes are generated and how they map to Sources was analyzed in the earlier section on how ServiceDispatcher is generated; this is ServiceDispatcher#dispatch

    public void dispatch(org.apache.skywalking.oap.server.core.source.ISource source) {
    	org.apache.skywalking.oap.server.core.source.Service _source = (org.apache.skywalking.oap.server.core.source.Service)source;
        doServiceRespTime(_source);
        doServiceSla(_source);
        doServiceCpm(_source);
        doServicePercentile(_source);
        doServiceApdex(_source);
        doServiceMqConsumeCount(_source);
        doServiceMqConsumeLatency(_source);
    }
    
  5. ServiceDispatcher#doServiceRespTime creates a ServiceRespTimeMetrics and then hands it to MetricsStreamProcessor for metrics aggregation

    private void doServiceRespTime(org.apache.skywalking.oap.server.core.source.Service source) {
    	org.apache.skywalking.oap.server.core.source.oal.rt.metrics.ServiceRespTimeMetrics metrics = 
    						new org.apache.skywalking.oap.server.core.source.oal.rt.metrics.ServiceRespTimeMetrics();
    	metrics.setTimeBucket(source.getTimeBucket());
        metrics.setEntityId(source.getEntityId());
    	metrics.combine( (long)(source.getLatency()), (long)(1));
    	org.apache.skywalking.oap.server.core.analysis.worker.MetricsStreamProcessor.getInstance().in(metrics);
    }
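
  Two ideas from the steps above can be sketched together: how the latency and the minute time bucket of a source are derived from a span's start and end time, and how a source is routed to the dispatchers registered for its scope. Everything here is a hypothetical model: SourceRoutingSketch, its nested types and the scope id are made up for illustration, and the explicit ZoneId parameter is an assumption added to keep the result deterministic.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "build a source, then forward it to the dispatchers of its scope".
class SourceRoutingSketch {
    interface Source { int scope(); }
    interface Dispatcher { void dispatch(Source source); }

    static final int SERVICE_SCOPE = 1; // illustrative scope id

    static final class ServiceSource implements Source {
        final long timeBucket; // start time at minute precision, e.g. 202207232324
        final int latency;     // endTime - startTime, in milliseconds

        ServiceSource(long startMillis, long endMillis, ZoneId zone) {
            this.timeBucket = minuteTimeBucket(startMillis, zone);
            this.latency = (int) (endMillis - startMillis);
        }

        @Override
        public int scope() {
            return SERVICE_SCOPE;
        }
    }

    // Encodes year, month, day, hour and minute as the digits yyyyMMddHHmm.
    static long minuteTimeBucket(long epochMillis, ZoneId zone) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(zone);
        return t.getYear() * 100_000_000L
            + t.getMonthValue() * 1_000_000L
            + t.getDayOfMonth() * 10_000L
            + t.getHour() * 100L
            + t.getMinute();
    }

    static final class Manager {
        private final Map<Integer, List<Dispatcher>> dispatchersByScope = new HashMap<>();

        void register(int scope, Dispatcher dispatcher) {
            dispatchersByScope.computeIfAbsent(scope, k -> new ArrayList<>()).add(dispatcher);
        }

        // Returns how many dispatchers received the source (0 if none is registered for its scope).
        int forward(Source source) {
            List<Dispatcher> dispatchers = dispatchersByScope.get(source.scope());
            if (dispatchers == null) {
                return 0;
            }
            for (Dispatcher dispatcher : dispatchers) {
                dispatcher.dispatch(source);
            }
            return dispatchers.size();
        }
    }

    public static void main(String[] args) {
        Manager manager = new Manager();
        manager.register(SERVICE_SCOPE, source -> {
            ServiceSource s = (ServiceSource) source;
            System.out.println("bucket=" + s.timeBucket + " latency=" + s.latency + "ms");
        });
        long start = System.currentTimeMillis();
        manager.forward(new ServiceSource(start, start + 1500, ZoneId.systemDefault()));
    }
}
```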
    

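Alarm Handling for ServiceRespTimeMetrics Data

  1. NotifyHandler receives the Metrics, builds the MetaInAlarm information according to the scope, looks up all RunningRules registered for the metrics name, and calls in() on each of them to add the metrics to its Window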
    public void notify(Metrics metrics) {
        WithMetadata withMetadata = (WithMetadata) metrics;
        MetricsMetaInfo meta = withMetadata.getMeta();
        int scope = meta.getScope();
        // ...
        // Build the MetaInAlarm information according to the scope
        MetaInAlarm metaInAlarm;
        if (DefaultScopeDefine.inServiceCatalog(scope)) {
            final String serviceId = meta.getId();
            final IDManager.ServiceID.ServiceIDDefinition serviceIDDefinition = IDManager.ServiceID.analysisId(
                serviceId);
            ServiceMetaInAlarm serviceMetaInAlarm = new ServiceMetaInAlarm();
            serviceMetaInAlarm.setMetricsName(meta.getMetricsName());
            serviceMetaInAlarm.setId(serviceId);
            serviceMetaInAlarm.setName(serviceIDDefinition.getName());
            metaInAlarm = serviceMetaInAlarm;
        }
        // ...
        // Get all RunningRules registered for this metrics name
    	List<RunningRule> runningRules = core.findRunningRule(meta.getMetricsName());
        if (runningRules == null) {
            return;
        }
        runningRules.forEach(rule -> rule.in(metaInAlarm, metrics));
    }     
    
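  2. RunningRule#in(MetaInAlarm meta, Metrics metrics) adds the metrics to the window (only the most recent N (period) buckets are kept --> after an alarm fires, the rule stays silent for 10 checks to avoid repeated alarms)
  3. AlarmCore starts a scheduled task that runs a check every 10s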
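    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
        // Alarm messages collected in this round
        List<AlarmMessage> alarmMessageList = new ArrayList<>();
        LocalDateTime checkTime = LocalDateTime.now();
        // Minutes elapsed since the last check
        int minutes = Minutes.minutesBetween(lastExecuteTime, checkTime).getMinutes();
        alarmRulesWatcher.getRunningContext().values().forEach(ruleList -> ruleList.forEach(runningRule -> {
            // At least one minute has passed since the last check
            if (minutes > 0) {
                // Slide the window forward
                runningRule.moveTo(checkTime);
                // Only check after the 15th second of the minute
                if (checkTime.getSecondOfMinute() > 15) {
                    // Evaluate the rule and decide whether to raise an alarm. The actual condition is:
                    // the service latency exceeds the threshold AND the number of matches reaches the count threshold
                    alarmMessageList.addAll(runningRule.check());
                }
            }
        }));
        if (!alarmMessageList.isEmpty()) {
            // Alarm notification (9 notification types are supported natively).
            // filteredMessages is derived from alarmMessageList; that step is omitted in this excerpt.
            allCallbacks.forEach(callback -> callback.doAlarm(filteredMessages));
        }
    }, 10, 10, TimeUnit.SECONDS);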
    
  4. When the response time exceeds the threshold, the alarm notification is sent through DingTalk
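
  The window logic described in the steps above (keep the last N one-minute buckets, alarm when enough buckets exceed the threshold, then stay silent for a while) can be modelled in a few lines. This is a deliberately simplified sketch and not the code of RunningRule: the class AlarmWindowSketch, its constructor parameters and the exact silence handling are assumptions made for illustration.

```java
import java.util.Arrays;

// Simplified model of a sliding alarm window with a silence period.
class AlarmWindowSketch {
    private final Long[] buckets;     // one value per minute, newest last; null = no data
    private final long threshold;     // a bucket "matches" when its value exceeds this
    private final int countThreshold; // alarm when at least this many buckets match
    private final int silencePeriod;  // number of checks to stay quiet after an alarm
    private int silenceCountdown;

    AlarmWindowSketch(int period, long threshold, int countThreshold, int silencePeriod) {
        this.buckets = new Long[period];
        this.threshold = threshold;
        this.countThreshold = countThreshold;
        this.silencePeriod = silencePeriod;
    }

    // Slide the window: drop the oldest buckets and open empty ones for the new minutes.
    void moveForward(int minutes) {
        int shift = Math.min(minutes, buckets.length);
        if (shift <= 0) {
            return;
        }
        System.arraycopy(buckets, shift, buckets, 0, buckets.length - shift);
        Arrays.fill(buckets, buckets.length - shift, buckets.length, null);
    }

    // Record the metric value of the current (newest) minute.
    void in(long value) {
        buckets[buckets.length - 1] = value;
    }

    // Returns true when an alarm should be sent in this round.
    boolean check() {
        if (silenceCountdown > 0) {
            silenceCountdown--;
            return false;
        }
        int matched = 0;
        for (Long value : buckets) {
            if (value != null && value > threshold) {
                matched++;
            }
        }
        if (matched >= countThreshold) {
            silenceCountdown = silencePeriod;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Keep 3 minutes; alarm when 2 of them have an average latency above 1000 ms.
        AlarmWindowSketch window = new AlarmWindowSketch(3, 1000, 2, 2);
        long[] perMinuteLatency = {1500, 1200, 1300};
        for (long latency : perMinuteLatency) {
            window.moveForward(1);
            window.in(latency);
            System.out.println("latency=" + latency + " alarm=" + window.check());
        }
    }
}
```

  Requiring several matching buckets instead of one keeps a single slow minute from paging anyone, and the silence countdown keeps a sustained problem from producing a notification on every check.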

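Conclusion

  This article first described how ServiceDispatcher is generated, then analyzed how ServiceDispatcher produces ServiceRespTimeMetrics data, and finally showed how alarms are raised by checking that data, covering the whole path from end to end. To keep the main thread clear, some details were left out, for example how Metrics are aggregated; those are left for a later article.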
