An Analysis of the Kafka Metrics Module

Background

Metrics is the monitoring module Kafka uses internally. It is built from the following main parts:

  1. Measurable
  2. Stat
  3. Sensor
  4. Metric

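Class structure

Let's first look at the inheritance hierarchy and structure of these classes to get a rough overall picture.

  • Measurable
    [Figure: class diagram]
  • Stat
    [Figure: class diagram]
  • Sensor
    [Figure: class diagram]
  • Metric
    [Figure: class diagram]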

Interface analysis

  1. Measurable

Measurable is the most basic interface for a metric type. The monitored value is read through its measure() method.

public interface Measurable extends MetricValueProvider<Double> {
    double measure(MetricConfig config, long now);
}
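
  2. Stat

The Stat interface represents a metric type that has to be computed statistically, such as an average, a maximum, or a minimum. Its record() method records a value and updates the metric.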

public interface Stat {
    public void record(MetricConfig config, double value, long timeMs);
}

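MeasurableStat extends both the Measurable and Stat interfaces and adds no new methods. The CompoundStat interface represents a combination of several Stats.
SampledStat is an important abstract class that represents a sampled metric value. Apart from Total, the MeasurableStat implementations rely on it. A SampledStat can hold several Samples and measures a value across them. Each Sample records its time window and its event count, and when SampledStat computes the final result it uses these two values to decide whether that Sample's data should still be used. SampledStat implements the record() and measure() methods of MeasurableStat. record() uses the time window and the event count to pick the appropriate Sample to record into.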

public void record(MetricConfig config, double value, long timeMs) {
    // Get the sample that is current at this time
    Sample sample = current(timeMs);
    // If the current sample has finished sampling, advance to the next one
    if (sample.isComplete(timeMs, config))
        sample = advance(config, timeMs);
    // Update the sample
    update(sample, config, value, timeMs);
    // Increment the sample's event count
    sample.eventCount += 1;
}

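measure() first resets any expired samples and then calls combine() to compute the result. combine() is abstract, so each subclass provides its own implementation.

public double measure(MetricConfig config, long now) {
    // Reset any samples that have expired
    purgeObsoleteSamples(config, now);
    return combine(this.samples, config, now);
}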
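
The sample rotation described above can be sketched without any Kafka dependency. The class below is a simplified model, not Kafka's SampledStat: the names WindowedSum, numSamples, windowMs and maxEvents are made up for illustration, the combine step is hard-wired to a plain sum, and the expiry rule (a sample is reset once it is older than numSamples times windowMs) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of a sampled statistic; not Kafka's actual class. */
class WindowedSum {
    /** One sample: a running value plus its window start time and event count. */
    static final class Sample {
        double value;
        long eventCount;
        long windowStartMs;

        Sample(long nowMs) {
            reset(nowMs);
        }

        void reset(long nowMs) {
            value = 0.0;
            eventCount = 0;
            windowStartMs = nowMs;
        }
    }

    private final int numSamples;  // number of samples kept (assumed parameter)
    private final long windowMs;   // time span of one sample (assumed parameter)
    private final long maxEvents;  // event limit of one sample (assumed parameter)
    private final List<Sample> samples = new ArrayList<>();
    private int current = 0;

    WindowedSum(int numSamples, long windowMs, long maxEvents) {
        this.numSamples = numSamples;
        this.windowMs = windowMs;
        this.maxEvents = maxEvents;
    }

    /** Mirrors record(): pick the current sample, rotate if it is complete, then update it. */
    void record(double value, long nowMs) {
        if (samples.isEmpty())
            samples.add(new Sample(nowMs));
        Sample sample = samples.get(current);
        boolean complete = nowMs - sample.windowStartMs >= windowMs
                || sample.eventCount >= maxEvents;
        if (complete)
            sample = advance(nowMs);
        sample.value += value;   // the "update" step; a sum in this sketch
        sample.eventCount += 1;
    }

    /** Mirrors measure(): reset expired samples, then combine what is left. */
    double measure(long nowMs) {
        long expireAgeMs = numSamples * windowMs;
        double total = 0.0;
        for (Sample s : samples) {
            if (nowMs - s.windowStartMs >= expireAgeMs)
                s.reset(nowMs);
            total += s.value;
        }
        return total;
    }

    /** Moves to the next slot, reusing (and resetting) an old sample if one is there. */
    private Sample advance(long nowMs) {
        current = (current + 1) % numSamples;
        if (current >= samples.size()) {
            samples.add(new Sample(nowMs));
        } else {
            samples.get(current).reset(nowMs);
        }
        return samples.get(current);
    }

    public static void main(String[] args) {
        WindowedSum stat = new WindowedSum(2, 1000L, Long.MAX_VALUE);
        stat.record(1.0, 0L);
        stat.record(2.0, 500L);
        stat.record(4.0, 1000L);  // the first window is over, so this goes into a second sample
        System.out.println(stat.measure(1500L)); // 7.0: both samples are still live
        System.out.println(stat.measure(2000L)); // 4.0: the first sample has expired
    }
}
```

Note that measure() mutates state here, just as the original resets expired samples before combining.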
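
  3. Sensor

In practice a single operation often needs to be measured from several angles, for example both the maximum and the average request size. Kafka handles this by wrapping several related metric objects in one Sensor.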
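
As a rough illustration of that idea, the sketch below feeds one recorded value to several stats at once and then passes it on to parent sensors. MiniSensor, MiniStat, RunningAvg and RunningMax are hypothetical names invented for this example; the stats here are plain running aggregates without time windows, so this is not how Kafka's Sensor is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical miniature of the sensor idea; not Kafka's Sensor class. */
class MiniSensor {
    /** A stat that can absorb values and report a result. */
    interface MiniStat {
        void record(double value);
        double measure();
    }

    static final class RunningAvg implements MiniStat {
        private double sum;
        private long count;
        public void record(double value) { sum += value; count++; }
        public double measure() { return count == 0 ? 0.0 : sum / count; }
    }

    static final class RunningMax implements MiniStat {
        private double max = Double.NEGATIVE_INFINITY;
        public void record(double value) { max = Math.max(max, value); }
        public double measure() { return max; }
    }

    private final Map<String, MiniStat> stats = new LinkedHashMap<>();
    private final List<MiniSensor> parents;

    MiniSensor(MiniSensor... parents) {
        this.parents = new ArrayList<>(Arrays.asList(parents));
    }

    /** Attaches one more stat under the given metric name. */
    synchronized MiniSensor add(String metricName, MiniStat stat) {
        stats.put(metricName, stat);
        return this;
    }

    /** One call updates every attached stat, then propagates to the parents. */
    void record(double value) {
        synchronized (this) {
            for (MiniStat stat : stats.values())
                stat.record(value);
        }
        for (MiniSensor parent : parents)
            parent.record(value);
    }

    synchronized double value(String metricName) {
        return stats.get(metricName).measure();
    }

    public static void main(String[] args) {
        MiniSensor all = new MiniSensor().add("size-avg", new RunningAvg());
        MiniSensor request = new MiniSensor(all)
                .add("size-avg", new RunningAvg())
                .add("size-max", new RunningMax());
        for (double size : new double[] {100.0, 300.0, 200.0})
            request.record(size);
        System.out.println(request.value("size-avg")); // 200.0
        System.out.println(request.value("size-max")); // 300.0
        System.out.println(all.value("size-avg"));     // 200.0, received via the child
    }
}
```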

  4. Metric

The Metrics class manages Sensor objects and KafkaMetric objects in one place.
public class Metrics implements Closeable {
    // Default configuration
    private final MetricConfig config;
    // KafkaMetric objects registered with this Metrics instance
    private final ConcurrentMap<MetricName, KafkaMetric> metrics;
    // Sensors registered with this Metrics instance
    private final ConcurrentMap<String, Sensor> sensors;
    // The child Sensors of each Sensor
    private final ConcurrentMap<Sensor, List<Sensor>> childrenSensors;
    private final List<MetricsReporter> reporters;
    private final Time time;
    private final ScheduledThreadPoolExecutor metricsScheduler;
    private static final Logger log = LoggerFactory.getLogger(Metrics.class);

    // Look up the sensor in the sensors map; if it does not exist, create it and record its place in the hierarchy in childrenSensors
    public synchronized Sensor sensor(String name, MetricConfig config, long inactiveSensorExpirationTimeSeconds, Sensor.RecordingLevel recordingLevel, Sensor... parents) {
        // Look up the sensor by name
        Sensor s = getSensor(name);
        if (s == null) {
            // Not found: create a new sensor
            s = new Sensor(this, name, parents, config == null ? this.config : config, time, inactiveSensorExpirationTimeSeconds, recordingLevel);
            this.sensors.put(name, s);
            if (parents != null) {
                // Record the sensor hierarchy in childrenSensors
                for (Sensor parent : parents) {
                    List<Sensor> children = childrenSensors.get(parent);
                    if (children == null) {
                        children = new ArrayList<>();
                        childrenSensors.put(parent, children);
                    }
                    children.add(s);
                }
            }
            log.debug("Added sensor with name {}", name);
        }
        return s;
    }
}
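
The get-or-create pattern in sensor() can be tried out in isolation. The following is a minimal sketch with hypothetical names (SensorRegistrySketch, Node, childrenOf) and a bare name-only node in place of a real Sensor. It shows one consequence of the logic above: a second call with an existing name returns the existing object and its parents argument is ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the sensor registry; not Kafka's Metrics class. */
class SensorRegistrySketch {
    /** A bare placeholder for a sensor: only its name is kept. */
    static final class Node {
        final String name;
        Node(String name) { this.name = name; }
    }

    private final Map<String, Node> sensors = new ConcurrentHashMap<>();
    private final Map<Node, List<Node>> children = new ConcurrentHashMap<>();

    /** Returns the node registered under name, creating and linking it if absent. */
    synchronized Node sensor(String name, Node... parents) {
        Node existing = sensors.get(name);
        if (existing != null)
            return existing;  // parents are ignored for an already registered name
        Node created = new Node(name);
        sensors.put(name, created);
        for (Node parent : parents)
            children.computeIfAbsent(parent, p -> new ArrayList<>()).add(created);
        return created;
    }

    /** Children registered under the given parent, or an empty list. */
    synchronized List<Node> childrenOf(Node parent) {
        List<Node> found = children.get(parent);
        return found == null ? Collections.<Node>emptyList() : new ArrayList<>(found);
    }

    public static void main(String[] args) {
        SensorRegistrySketch registry = new SensorRegistrySketch();
        Node parent = registry.sensor("all-requests");
        Node child = registry.sensor("produce-requests", parent);
        System.out.println(registry.sensor("produce-requests") == child); // true
        System.out.println(registry.childrenOf(parent).size());           // 1
    }
}
```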

Usage

Producer, Consumer, and Broker all use Metrics. The Producer serves as the example below.
The Producer's constructor initializes Metrics.

MetricConfig metricConfig = new MetricConfig().samples(config.getInt(ProducerConfig.METRICS_NUM_SAMPLES_CONFIG))
                    .timeWindow(config.getLong(ProducerConfig.METRICS_SAMPLE_WINDOW_MS_CONFIG), TimeUnit.MILLISECONDS)
                    .recordLevel(Sensor.RecordingLevel.forName(config.getString(ProducerConfig.METRICS_RECORDING_LEVEL_CONFIG)))
                    .tags(metricTags);
List<MetricsReporter> reporters = config.getConfiguredInstances(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG, MetricsReporter.class);
reporters.add(new JmxReporter(JMX_PREFIX));
this.metrics = new Metrics(metricConfig, reporters, time);

One thing the Producer uses Metrics for is measuring the "produce-throttle-time" metrics.

public static Sensor throttleTimeSensor(SenderMetricsRegistry metrics) {
        Sensor produceThrottleTimeSensor = metrics.sensor("produce-throttle-time");
        produceThrottleTimeSensor.add(metrics.produceThrottleTimeAvg, new Avg());
        produceThrottleTimeSensor.add(metrics.produceThrottleTimeMax, new Max());
        return produceThrottleTimeSensor;
    }

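As shown above, a sensor named "produce-throttle-time" is first registered with metrics. Two metrics are then added to this sensor: produceThrottleTimeAvg (the average) and produceThrottleTimeMax (the maximum), measured by an Avg instance and a Max instance respectively.
What triggers the recording of these metrics? It happens after the client receives the response to a produce request, as follows: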

throttleTimeSensor.record(responseBody.get(CommonFields.THROTTLE_TIME_MS), now);

This call ends up in the following record method:

public void record(double value, long timeMs, boolean checkQuotas) {
        if (shouldRecord()) {
            this.lastRecordTime = timeMs;
            // Thread safety: the stats are updated under the sensor's lock
            synchronized (this) {
                // Iterate over all stats; here these are the Avg and Max added above
                for (Stat stat : this.stats)
                    stat.record(config, value, timeMs);
                if (checkQuotas)
                    checkQuotas(timeMs);
            }
            for (Sensor parent : parents)
                parent.record(value, timeMs, checkQuotas);
        }
    }

Avg and Max both inherit the record() method of SampledStat.

public void record(MetricConfig config, double value, long timeMs) {
        Sample sample = current(timeMs);
        if (sample.isComplete(timeMs, config))
            sample = advance(config, timeMs);
        // update() is implemented separately by each subclass
        update(sample, config, value, timeMs);
        sample.eventCount += 1;
    }

// Avg
    @Override
    protected void update(Sample sample, MetricConfig config, double value, long now) {
        // Simply accumulate the sum
        sample.value += value;
    }

// Max
    @Override
    protected void update(Sample sample, MetricConfig config, double value, long now) {
        // Keep the larger of the stored value and the new value
        sample.value = Math.max(sample.value, value);
    }

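Finally, the values of these two metrics are computed when JmxReporter reads them. The computation itself lives in the combine() method that each subclass of SampledStat implements, and the resulting values show up in JMX.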

@Override
    public double measure(MetricConfig config, long now) {
        purgeObsoleteSamples(config, now);
        // measure() calls combine()
        return combine(this.samples, config, now);
    }

// Avg
@Override
    public double combine(List<Sample> samples, MetricConfig config, long now) {
        double total = 0.0;
        long count = 0;
        for (Sample s : samples) {
            total += s.value;
            count += s.eventCount;
        }
        return count == 0 ? 0 : total / count;
    }

// Max
@Override
    public double combine(List<Sample> samples, MetricConfig config, long now) {
        double max = Double.NEGATIVE_INFINITY;
        for (Sample sample : samples)
            max = Math.max(max, sample.value);
        return max;
    }
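
A small worked example may make the two combine() variants easier to compare. The sketch below re-creates the update and combine arithmetic on a bare Sample holder. CombineSketch, avgSample and maxSample are hypothetical helpers, and the starting values (0 for Avg, negative infinity for Max) are assumptions of this sketch. It shows that the average is weighted by event count across samples rather than being a mean of per-sample means.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical re-creation of the Avg/Max arithmetic shown above. */
class CombineSketch {
    static final class Sample {
        double value;
        long eventCount;
        Sample(double initialValue) { this.value = initialValue; }
    }

    /** Builds a sample the way Avg would: sum the values and count the events. */
    static Sample avgSample(double... values) {
        Sample s = new Sample(0.0);
        for (double v : values) {
            s.value += v;
            s.eventCount += 1;
        }
        return s;
    }

    /** Builds a sample the way Max would: keep the largest value seen. */
    static Sample maxSample(double... values) {
        Sample s = new Sample(Double.NEGATIVE_INFINITY);
        for (double v : values) {
            s.value = Math.max(s.value, v);
            s.eventCount += 1;
        }
        return s;
    }

    /** Total of all sums divided by total number of events, or 0 if nothing was recorded. */
    static double combineAvg(List<Sample> samples) {
        double total = 0.0;
        long count = 0;
        for (Sample s : samples) {
            total += s.value;
            count += s.eventCount;
        }
        return count == 0 ? 0 : total / count;
    }

    /** Largest per-sample maximum; negative infinity when there are no samples. */
    static double combineMax(List<Sample> samples) {
        double max = Double.NEGATIVE_INFINITY;
        for (Sample s : samples)
            max = Math.max(max, s.value);
        return max;
    }

    public static void main(String[] args) {
        // Two windows: the first saw 10 and 20, the second saw 60.
        List<Sample> avg = Arrays.asList(avgSample(10.0, 20.0), avgSample(60.0));
        List<Sample> max = Arrays.asList(maxSample(10.0, 20.0), maxSample(60.0));
        System.out.println(combineAvg(avg)); // 30.0 = (30 + 60) / 3, not (15 + 60) / 2
        System.out.println(combineMax(max)); // 60.0
    }
}
```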

Metrics built into the clients

Producer
public SenderMetricsRegistry(Metrics metrics) {
        this.metrics = metrics;
        this.tags = this.metrics.config().tags().keySet();
        this.allTemplates = new ArrayList<MetricNameTemplate>();
        
        /***** Client level *****/
        
        this.batchSizeAvg = createMetricName("batch-size-avg",
                "The average number of bytes sent per partition per-request.");
        this.batchSizeMax = createMetricName("batch-size-max",
                "The max number of bytes sent per partition per-request.");
        this.compressionRateAvg = createMetricName("compression-rate-avg",
                "The average compression rate of record batches.");
        this.recordQueueTimeAvg = createMetricName("record-queue-time-avg",
                "The average time in ms record batches spent in the send buffer.");
        this.recordQueueTimeMax = createMetricName("record-queue-time-max",
                "The maximum time in ms record batches spent in the send buffer.");
        this.requestLatencyAvg = createMetricName("request-latency-avg", 
                "The average request latency in ms");
        this.requestLatencyMax = createMetricName("request-latency-max", 
                "The maximum request latency in ms");
        this.recordSendRate = createMetricName("record-send-rate", 
                "The average number of records sent per second.");
        this.recordSendTotal = createMetricName("record-send-total", 
                "The total number of records sent.");
        this.recordsPerRequestAvg = createMetricName("records-per-request-avg",
                "The average number of records per request.");
        this.recordRetryRate = createMetricName("record-retry-rate",
                "The average per-second number of retried record sends");
        this.recordRetryTotal = createMetricName("record-retry-total", 
                "The total number of retried record sends");
        this.recordErrorRate = createMetricName("record-error-rate",
                "The average per-second number of record sends that resulted in errors");
        this.recordErrorTotal = createMetricName("record-error-total",
                "The total number of record sends that resulted in errors");
        this.recordSizeMax = createMetricName("record-size-max", 
                "The maximum record size");
        this.recordSizeAvg = createMetricName("record-size-avg", 
                "The average record size");
        this.requestsInFlight = createMetricName("requests-in-flight",
                "The current number of in-flight requests awaiting a response.");
        this.metadataAge = createMetricName("metadata-age",
                "The age in seconds of the current producer metadata being used.");
        this.batchSplitRate = createMetricName("batch-split-rate", 
                "The average number of batch splits per second");
        this.batchSplitTotal = createMetricName("batch-split-total", 
                "The total number of batch splits");

        this.produceThrottleTimeAvg = createMetricName("produce-throttle-time-avg",
                "The average time in ms a request was throttled by a broker");
        this.produceThrottleTimeMax = createMetricName("produce-throttle-time-max",
                "The maximum time in ms a request was throttled by a broker");

        /***** Topic level *****/
        this.topicTags = new HashSet<String>(tags);
        this.topicTags.add("topic");

        // We can't create the MetricName up front for these, because we don't know the topic name yet.
        this.topicRecordSendRate = createTopicTemplate("record-send-rate",
                "The average number of records sent per second for a topic.");
        this.topicRecordSendTotal = createTopicTemplate("record-send-total",
                "The total number of records sent for a topic.");
        this.topicByteRate = createTopicTemplate("byte-rate",
                "The average number of bytes sent per second for a topic.");
        this.topicByteTotal = createTopicTemplate("byte-total", 
                "The total number of bytes sent for a topic.");
        this.topicCompressionRate = createTopicTemplate("compression-rate",
                "The average compression rate of record batches for a topic.");
        this.topicRecordRetryRate = createTopicTemplate("record-retry-rate",
                "The average per-second number of retried record sends for a topic");
        this.topicRecordRetryTotal = createTopicTemplate("record-retry-total",
                "The total number of retried record sends for a topic");
        this.topicRecordErrorRate = createTopicTemplate("record-error-rate",
                "The average per-second number of record sends that resulted in errors for a topic");
        this.topicRecordErrorTotal = createTopicTemplate("record-error-total",
                "The total number of record sends that resulted in errors for a topic");

    }
Consumer
public FetcherMetricsRegistry(Set<String> tags, String metricGrpPrefix) {
        
        /***** Client level *****/
        String groupName = metricGrpPrefix + "-fetch-manager-metrics";
                
        this.fetchSizeAvg = new MetricNameTemplate("fetch-size-avg", groupName, 
                "The average number of bytes fetched per request", tags);

        this.fetchSizeMax = new MetricNameTemplate("fetch-size-max", groupName, 
                "The maximum number of bytes fetched per request", tags);
        this.bytesConsumedRate = new MetricNameTemplate("bytes-consumed-rate", groupName, 
                "The average number of bytes consumed per second", tags);
        this.bytesConsumedTotal = new MetricNameTemplate("bytes-consumed-total", groupName,
                "The total number of bytes consumed", tags);

        this.recordsPerRequestAvg = new MetricNameTemplate("records-per-request-avg", groupName, 
                "The average number of records in each request", tags);
        this.recordsConsumedRate = new MetricNameTemplate("records-consumed-rate", groupName, 
                "The average number of records consumed per second", tags);
        this.recordsConsumedTotal = new MetricNameTemplate("records-consumed-total", groupName,
                "The total number of records consumed", tags);

        this.fetchLatencyAvg = new MetricNameTemplate("fetch-latency-avg", groupName, 
                "The average time taken for a fetch request.", tags);
        this.fetchLatencyMax = new MetricNameTemplate("fetch-latency-max", groupName, 
                "The max time taken for any fetch request.", tags);
        this.fetchRequestRate = new MetricNameTemplate("fetch-rate", groupName, 
                "The number of fetch requests per second.", tags);
        this.fetchRequestTotal = new MetricNameTemplate("fetch-total", groupName,
                "The total number of fetch requests.", tags);

        this.recordsLagMax = new MetricNameTemplate("records-lag-max", groupName, 
                "The maximum lag in terms of number of records for any partition in this window", tags);

        this.fetchThrottleTimeAvg = new MetricNameTemplate("fetch-throttle-time-avg", groupName, 
                "The average throttle time in ms", tags);
        this.fetchThrottleTimeMax = new MetricNameTemplate("fetch-throttle-time-max", groupName, 
                "The maximum throttle time in ms", tags);

        /***** Topic level *****/
        Set<String> topicTags = new HashSet<>(tags);
        topicTags.add("topic");

        this.topicFetchSizeAvg = new MetricNameTemplate("fetch-size-avg", groupName, 
                "The average number of bytes fetched per request for a topic", topicTags);
        this.topicFetchSizeMax = new MetricNameTemplate("fetch-size-max", groupName, 
                "The maximum number of bytes fetched per request for a topic", topicTags);
        this.topicBytesConsumedRate = new MetricNameTemplate("bytes-consumed-rate", groupName, 
                "The average number of bytes consumed per second for a topic", topicTags);
        this.topicBytesConsumedTotal = new MetricNameTemplate("bytes-consumed-total", groupName,
                "The total number of bytes consumed for a topic", topicTags);

        this.topicRecordsPerRequestAvg = new MetricNameTemplate("records-per-request-avg", groupName, 
                "The average number of records in each request for a topic", topicTags);
        this.topicRecordsConsumedRate = new MetricNameTemplate("records-consumed-rate", groupName, 
                "The average number of records consumed per second for a topic", topicTags);
        this.topicRecordsConsumedTotal = new MetricNameTemplate("records-consumed-total", groupName,
                "The total number of records consumed for a topic", topicTags);
        
        /***** Partition level *****/
        this.partitionRecordsLag = new MetricNameTemplate("{topic}-{partition}.records-lag", groupName, 
                "The latest lag of the partition", tags);
        this.partitionRecordsLagMax = new MetricNameTemplate("{topic}-{partition}.records-lag-max", groupName, 
                "The max lag of the partition", tags);
        this.partitionRecordsLagAvg = new MetricNameTemplate("{topic}-{partition}.records-lag-avg", groupName, 
                "The average lag of the partition", tags);
        
    
    }

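How to view the metrics

Out of the box, the clients currently expose their metrics only through JMX.
Add the following JVM startup options: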

-ea -Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=9996  

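Start the client and connect to its process with JConsole.
Screenshots:
Producer
[Figure: JConsole screenshot of the Producer metrics]
Consumer
[Figure: JConsole screenshot of the Consumer metrics]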
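
Besides JConsole, MBeans can also be read from code with the standard javax.management API. The sketch below only talks to the platform MBean server of the JVM it runs in, and uses the built-in java.lang:type=Runtime MBean so that it works anywhere. JmxReadSketch, find and read are hypothetical names, and the pattern kafka.producer:type=producer-metrics,* is an assumption about how the producer's MBeans are named; it matches nothing unless a producer is running in the same JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper for reading MBean attributes inside the current JVM. */
class JmxReadSketch {
    private static final MBeanServer SERVER = ManagementFactory.getPlatformMBeanServer();

    /** Returns the names of all registered MBeans matching the given name or pattern. */
    static Set<ObjectName> find(String pattern) {
        try {
            return SERVER.queryNames(new ObjectName(pattern), null);
        } catch (JMException e) {
            throw new IllegalArgumentException("Bad ObjectName: " + pattern, e);
        }
    }

    /** Reads a single attribute of one MBean. */
    static Object read(ObjectName name, String attribute) {
        try {
            return SERVER.getAttribute(name, attribute);
        } catch (JMException e) {
            throw new IllegalStateException("Cannot read " + attribute + " of " + name, e);
        }
    }

    public static void main(String[] args) {
        // A built-in MBean that every JVM registers.
        for (ObjectName name : find("java.lang:type=Runtime"))
            System.out.println(name + " Uptime=" + read(name, "Uptime"));

        // Assumed naming pattern for producer metrics; empty without a producer in this JVM.
        System.out.println(find("kafka.producer:type=producer-metrics,*").size());
    }
}
```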
