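Introduction to Flink
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
Official site: https://flink.apache.org/
Source code: https://github.com/apache/flink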
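Flink Features
- Stream processing features
  (1) Supports high-throughput, low-latency, high-performance stream processing
  (2) Supports window operations based on event time
  (3) Supports exactly-once semantics for stateful computations
  (4) Supports highly flexible window operations: time-based, count-based, session-based, and data-driven windows
  (5) Supports a continuous streaming model with backpressure
  (6) Supports fault tolerance based on lightweight distributed snapshots
  (7) The runtime supports both batch-on-streaming processing and streaming processing
  (8) Flink implements its own memory management inside the JVM
  (9) Supports iterative computation
  (10) Supports automatic program optimization: avoids expensive operations such as shuffles and sorts in certain cases, and caches intermediate results when necessary
- API support
  (1) For streaming applications, the DataStream API
  (2) For batch applications, the DataSet API (Java and Scala supported)
- Library support
  Machine learning (FlinkML), graph analysis (Gelly), relational data processing (Table), and complex event processing (CEP)
- Integration support
  Supports Flink on YARN, HDFS, input data from Kafka, Apache HBase, Hadoop programs, Tachyon, ElasticSearch, RabbitMQ, Apache Storm, S3, and XtreemFS.
- Deploy applications anywhere
  Apache Flink is a distributed system and needs compute resources to execute applications. Flink integrates with all common cluster resource managers, such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also be set up to run as a standalone cluster.
- Run applications at any scale
  Flink is designed to run stateful streaming applications at any scale. Applications can be parallelized into thousands of tasks that are distributed across a cluster and executed concurrently, so an application can use a virtually unlimited amount of CPU, main memory, disk, and network IO.
  Flink also makes it easy to maintain very large application state. Its asynchronous and incremental checkpointing algorithm keeps the impact on processing latency minimal while guaranteeing exactly-once state consistency.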
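Storm, Spark, and Flink Compared
Throughput
Spark computes at micro-batch granularity and is well optimized in many respects, so its throughput is the highest of the three. Note, however, that stateful computations (for example the updateStateByKey operator) maintain state through additional RDDs, which adds considerable overhead and noticeably reduces throughput.
Storm's fault-tolerance mechanism acks every record, so its overhead has a large impact on throughput; the drop can reach 70%. Storm Trident is implemented on micro-batches and has medium throughput.
Flink's fault-tolerance mechanism is comparatively lightweight and has little effect on throughput. Together with optimizations in its execution graph and scheduling, this lets Flink reach very high throughput.
The Storm and Flink comparison chart published on the Flink website shows that Storm's throughput drops sharply once its ack-based fault tolerance is enabled, whereas Flink's throughput changes little between checkpointing enabled and disabled, which indicates that Flink's fault-tolerance mechanism is indeed cheap.
Latency
Spark is implemented on micro-batches, which raises throughput at the cost of latency. Spark's latency is typically on the order of seconds.
Storm is a native streaming implementation and easily reaches latencies of tens of milliseconds, the lowest among these frameworks. Storm Trident is implemented on micro-batches and has higher latency.
Flink is also a native streaming implementation and can likewise reach latencies on the order of a hundred milliseconds.
The latency benchmark against Storm published on the Flink website shows Storm reaching an average latency under 5 ms and Flink an average latency under 30 ms. Both process 99% of records within 55 ms, which is excellent in either case.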
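Monitoring Approach
Cluster monitoring
Process presence monitoring
Flink runs two kinds of processes: the JobManager (StandaloneSessionClusterEntrypoint) and the TaskManager. A script can check whether each of them is present.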
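A minimal sketch of such a check in Python. It assumes the JDK's `jps -l` tool is on the PATH and that the TaskManager's main class name contains `TaskManagerRunner`; that class name is an assumption and differs between Flink versions, so adjust it to your installation. Parsing is kept separate from running the command so it can be exercised without a cluster.

```python
import subprocess

# Main-class name fragments to look for in `jps -l` output.
# StandaloneSessionClusterEntrypoint comes from the text above;
# TaskManagerRunner is an assumption and varies across Flink versions.
EXPECTED = {
    "JobManager": "StandaloneSessionClusterEntrypoint",
    "TaskManager": "TaskManagerRunner",
}

def find_missing(jps_output, expected=EXPECTED):
    """Return the roles whose main class does not appear in the jps output."""
    names = []
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)  # "<pid> <main class>"
        if len(parts) == 2:
            names.append(parts[1])
    return sorted(role for role, cls in expected.items()
                  if not any(cls in name for name in names))

def check_local_processes():
    """Run `jps -l` on this host and list the missing Flink roles."""
    result = subprocess.run(["jps", "-l"], capture_output=True,
                            text=True, check=True)
    return find_missing(result.stdout)
```

A cron job or monitoring agent could call `check_local_processes()` on each node and alert when the returned list is not empty.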
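Cluster process performance monitoring
Flink officially provides a Prometheus monitoring integration. Add the following configuration to flink/conf/flink-conf.yaml:
# Use the PrometheusReporter class to expose monitoring data
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
# Port on which monitoring data is exposed; defaults to 9249, and a port range may be given
metrics.reporter.prom.port: 9249-9250
Add the following to the Prometheus YAML scrape configuration: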
# List the Flink process hosts and their metrics ports as targets; add labels as shown
- targets: ['30.0.0.20:9249','30.0.0.21:9249']
  labels:
    clusterName: 'FlinkCluster001'
    job: 'flink'
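Before pointing Prometheus at a node, it can help to confirm that the reporter is actually serving data. The sketch below is one hedged way to do that in Python: it fetches the endpoint and parses the Prometheus text format into a dictionary. The `/metrics` path, the example metric names, and the assumption that samples carry no trailing timestamp are not taken from the text above and should be verified against your setup.

```python
from urllib.request import urlopen

def parse_metrics(text):
    """Parse Prometheus text exposition into {series: float}.

    Comment lines (# HELP / # TYPE) are skipped. Assumes each sample line
    is "<name>{<labels>} <value>" without a trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples

def scrape(host, port=9249):
    """Fetch and parse the metrics exposed by one Flink process."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, `scrape("30.0.0.20")` should return a non-empty dictionary when the reporter is configured correctly on that host.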
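Job monitoring
Job presence monitoring
Flink jobs are either batch jobs or streaming jobs. The bin/flink list command shows the jobs that are currently running.
- Streaming jobs stay in the running state indefinitely. A script can call bin/flink list to see the running jobs and check that the expected job is present.
- Batch jobs exit when they finish, after which bin/flink list no longer shows them. This can be handled at the scheduling level: invoke batch jobs from a scheduler program instead of crontab. The CliFrontend process that submits the batch job to Flink then remains alive, so the presence of the batch job can be monitored by monitoring the CliFrontend process.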
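The streaming case can be sketched as follows. The line format matched by the regular expression (date, time, 32-character hexadecimal job ID, job name, status in parentheses) is an assumption about what `bin/flink list` prints and may differ between Flink versions; the function names are illustrative.

```python
import re
import subprocess

# Assumed format of a job line printed by `bin/flink list`:
#   24.03.2020 16:33:42 : <32-hex job id> : <job name> (RUNNING)
JOB_LINE = re.compile(
    r"^\S+ \S+ : (?P<id>[0-9a-f]{32}) : (?P<name>.+) \((?P<status>[A-Z_]+)\)$"
)

def running_jobs(list_output):
    """Return the names of all jobs reported as RUNNING."""
    names = set()
    for line in list_output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match and match.group("status") == "RUNNING":
            names.add(match.group("name"))
    return names

def missing_jobs(expected_names, flink_bin="bin/flink"):
    """Run `flink list` and return expected jobs that are not running."""
    result = subprocess.run([flink_bin, "list"], capture_output=True,
                            text=True, check=True)
    return sorted(set(expected_names) - running_jobs(result.stdout))
```

A monitoring script could call `missing_jobs(["Streaming WordCount"])` periodically and raise an alert when the result is not empty.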
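Business monitoring
As a data processing engine, Flink does its work through data input and output, so input and output data volumes can be monitored in light of what each job actually does.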
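Metrics Overview
Metric types
Flink supports four metric types: Counters, Gauges, Histograms, and Meters.
Counter
A Counter is used to count something.
Gauge
A Gauge provides a value of any type on demand.
Histogram
A Histogram measures the distribution of long values.
Meter
A Meter measures average throughput.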
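The difference between the four types is easiest to see side by side. The following is a toy model in plain Python, not Flink's API: the class and method names are invented for illustration and only mimic what each type records.

```python
class Counter:
    """Counts occurrences; the value only changes when inc() is called."""
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Gauge:
    """Returns whatever the supplied function yields at read time."""
    def __init__(self, supplier):
        self._supplier = supplier

    def value(self):
        return self._supplier()


class Histogram:
    """Collects integer samples and summarizes their distribution."""
    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(int(value))

    def summary(self):
        vals = sorted(self._values)
        return {"count": len(vals), "min": vals[0], "max": vals[-1],
                "mean": sum(vals) / len(vals)}


class Meter:
    """Tracks events and reports an average rate per second."""
    def __init__(self):
        self.events = 0

    def mark(self, n=1):
        self.events += n

    def rate(self, elapsed_seconds):
        return self.events / elapsed_seconds
```

In this model a Counter would suit something like records processed, a Gauge a current queue length, a Histogram per-record latencies, and a Meter records per second.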
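Metric scope
When a metric is reported, it is tagged with an identifier and a set of key-value pairs.
The identifier has three components: the user-defined name given when the metric is registered, an optional user-defined scope, and a system-provided scope. For example, if A.B is the system-provided scope, C.D the user-defined scope, and E the user-defined name, the metric's identifier is A.B.C.D.E.
The delimiter used in the identifier can be changed with the metrics.scope.delimiter option in conf/flink-conf.yaml; the default is "." (a period).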
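The way the three components combine can be sketched in a few lines. This is a simplified illustration, not Flink's implementation; the function name and the list-based representation of scopes are assumptions made for the example.

```python
def metric_identifier(system_scope, user_scope, name, delimiter="."):
    """Join system scope, optional user scope, and metric name.

    system_scope and user_scope are lists of scope components;
    user_scope may be empty because it is optional.
    """
    parts = list(system_scope) + list(user_scope) + [name]
    return delimiter.join(parts)


# The example from the text: system scope A.B, user scope C.D, name E.
identifier = metric_identifier(["A", "B"], ["C", "D"], "E")
```

With the default delimiter this yields the identifier from the example above; passing `delimiter="_"` shows the effect of changing metrics.scope.delimiter.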
Metrics list
CPU
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
Job-/TaskManager | Status.JVM.CPU | Load | The recent CPU usage of the JVM. | Gauge |
Job-/TaskManager | Status.JVM.CPU | Time | The CPU time used by the JVM. | Gauge |
Memory
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
Job-/TaskManager | Status.JVM.Memory | Heap.Used | The amount of heap memory currently used (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | Heap.Committed | The amount of heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | Heap.Max | The maximum amount of heap memory that can be used for memory management (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | NonHeap.Used | The amount of non-heap memory currently used (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | NonHeap.Committed | The amount of non-heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | NonHeap.Max | The maximum amount of non-heap memory that can be used for memory management (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | Direct.Count | The number of buffers in the direct buffer pool. | Gauge |
Job-/TaskManager | Status.JVM.Memory | Direct.MemoryUsed | The amount of memory used by the JVM for the direct buffer pool (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | Direct.TotalCapacity | The total capacity of all buffers in the direct buffer pool (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | Mapped.Count | The number of buffers in the mapped buffer pool. | Gauge |
Job-/TaskManager | Status.JVM.Memory | Mapped.MemoryUsed | The amount of memory used by the JVM for the mapped buffer pool (in bytes). | Gauge |
Job-/TaskManager | Status.JVM.Memory | Mapped.TotalCapacity | The total capacity of all buffers in the mapped buffer pool (in bytes). | Gauge |
Notes:
- For the difference between used, committed, and max heap, see: https://www.baeldung.com/java-heap-used-committed-max
- For an introduction to the direct buffer pool and the mapped buffer pool, see: https://stackoverflow.com/questions/15657837/what-is-mapped-buffer-pool-direct-buffer-pool-and-how-to-increase-their-size
Threads
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
Job-/TaskManager | Status.JVM.Threads | Count | The total number of live threads. | Gauge |
Garbage collection
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Count | The total number of collections that have occurred. | Gauge |
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Time | The total time spent performing garbage collection. | Gauge |
Class loading (ClassLoader)
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
Job-/TaskManager | Status.JVM.ClassLoader | ClassesLoaded | The total number of classes loaded since the start of the JVM. | Gauge |
Job-/TaskManager | Status.JVM.ClassLoader | ClassesUnloaded | The total number of classes unloaded since the start of the JVM. | Gauge |
Network
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
TaskManager | Status.Network | AvailableMemorySegments | The number of unused memory segments. | Gauge |
TaskManager | Status.Network | TotalMemorySegments | The number of allocated memory segments. | Gauge |
Task | buffers | inputQueueLength | The number of queued input buffers. (ignores LocalInputChannels which are using blocking subpartitions) | Gauge |
Task | buffers | outputQueueLength | The number of queued output buffers. | Gauge |
Task | buffers | inPoolUsage | An estimate of the input buffers usage. (ignores LocalInputChannels) | Gauge |
Task | buffers | inputFloatingBuffersUsage | An estimate of the floating input buffers usage, dedicated for credit-based mode. (ignores LocalInputChannels) | Gauge |
Task | buffers | inputExclusiveBuffersUsage | An estimate of the exclusive input buffers usage, dedicated for credit-based mode. (ignores LocalInputChannels) | Gauge |
Task | buffers | outPoolUsage | An estimate of the output buffers usage. | Gauge |
Task | Network.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | totalQueueLen | Total number of queued buffers in all input/output channels. | Gauge |
Task | Network.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | minQueueLen | Minimum number of queued buffers in all input/output channels. | Gauge |
Task | Network.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | maxQueueLen | Maximum number of queued buffers in all input/output channels. | Gauge |
Task | Network.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | avgQueueLen | Average number of queued buffers in all input/output channels. | Gauge |
Note:
For Flink's memory management mechanism, see: https://blog.csdn.net/lvwenyuan_1/article/details/103404591
Default shuffle service
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
TaskManager | Status.Shuffle.Netty | AvailableMemorySegments | The number of unused memory segments. | Gauge |
TaskManager | Status.Shuffle.Netty | TotalMemorySegments | The number of allocated memory segments. | Gauge |
Task | Shuffle.Netty.Input.Buffers | inputQueueLength | The number of queued input buffers. | Gauge |
Task | Shuffle.Netty.Input.Buffers | inPoolUsage | An estimate of the input buffers usage. | Gauge |
Task | Shuffle.Netty.Output.Buffers | outputQueueLength | The number of queued output buffers. | Gauge |
Task | Shuffle.Netty.Output.Buffers | outPoolUsage | An estimate of the output buffers usage. | Gauge |
Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | totalQueueLen | Total number of queued buffers in all input/output channels. | Gauge |
Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | minQueueLen | Minimum number of queued buffers in all input/output channels. | Gauge |
Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | maxQueueLen | Maximum number of queued buffers in all input/output channels. | Gauge |
Task | Shuffle.Netty.<Input\|Output>.<gate\|partition> (only available if taskmanager.net.detailed-metrics config option is set) | avgQueueLen | Average number of queued buffers in all input/output channels. | Gauge |
Task | Shuffle.Netty.Input | numBytesInLocal | The total number of bytes this task has read from a local source. | Counter |
Task | Shuffle.Netty.Input | numBytesInLocalPerSecond | The number of bytes this task reads from a local source per second. | Meter |
Task | Shuffle.Netty.Input | numBytesInRemote | The total number of bytes this task has read from a remote source. | Counter |
Task | Shuffle.Netty.Input | numBytesInRemotePerSecond | The number of bytes this task reads from a remote source per second. | Meter |
Task | Shuffle.Netty.Input | numBuffersInLocal | The total number of network buffers this task has read from a local source. | Counter |
Task | Shuffle.Netty.Input | numBuffersInLocalPerSecond | The number of network buffers this task reads from a local source per second. | Meter |
Task | Shuffle.Netty.Input | numBuffersInRemote | The total number of network buffers this task has read from a remote source. | Counter |
Task | Shuffle.Netty.Input | numBuffersInRemotePerSecond | The number of network buffers this task reads from a remote source per second. | Meter |
Note:
For the definitions of Job, Task, and Subtask, see: https://stackoverflow.com/questions/53610342/difference-between-job-task-and-subtask-in-flink
Cluster
Scope | Metrics | Description | Type |
---|---|---|---|
JobManager | numRegisteredTaskManagers | The number of registered taskmanagers. | Gauge |
JobManager | numRunningJobs | The number of running jobs. | Gauge |
JobManager | taskSlotsAvailable | The number of available task slots. | Gauge |
JobManager | taskSlotsTotal | The total number of task slots. | Gauge |
Availability
Scope | Metrics | Description | Type |
---|---|---|---|
Job (only available on JobManager) | restartingTime | The time it took to restart the job, or how long the current restart has been in progress (in milliseconds). | Gauge |
Job (only available on JobManager) | uptime | The time that the job has been running without interruption. Returns -1 for completed jobs (in milliseconds). | Gauge |
Job (only available on JobManager) | downtime | For jobs currently in a failing/recovering situation, the time elapsed during this outage. Returns 0 for running jobs and -1 for completed jobs (in milliseconds). | Gauge |
Job (only available on JobManager) | fullRestarts | The total number of full restarts since this job was submitted. Attention: Since 1.9.2, this metric also includes fine-grained restarts. | Gauge |
Checkpointing
Scope | Metrics | Description | Type |
---|---|---|---|
Job (only available on JobManager) | lastCheckpointDuration | The time it took to complete the last checkpoint (in milliseconds). | Gauge |
Job (only available on JobManager) | lastCheckpointSize | The total size of the last checkpoint (in bytes). | Gauge |
Job (only available on JobManager) | lastCheckpointExternalPath | The path where the last external checkpoint was stored. | Gauge |
Job (only available on JobManager) | lastCheckpointRestoreTimestamp | Timestamp when the last checkpoint was restored at the coordinator (in milliseconds). | Gauge |
Job (only available on JobManager) | lastCheckpointAlignmentBuffered | The number of buffered bytes during alignment over all subtasks for the last checkpoint (in bytes). | Gauge |
Job (only available on JobManager) | numberOfInProgressCheckpoints | The number of in progress checkpoints. | Gauge |
Job (only available on JobManager) | numberOfCompletedCheckpoints | The number of successfully completed checkpoints. | Gauge |
Job (only available on JobManager) | numberOfFailedCheckpoints | The number of failed checkpoints. | Gauge |
Job (only available on JobManager) | totalNumberOfCheckpoints | The number of total checkpoints (in progress, completed, failed). | Gauge |
Task | checkpointAlignmentTime | The time that the last barrier alignment took to complete, or how long the current alignment has taken so far (in nanoseconds). | Gauge |
RocksDB
IO
Scope | Metrics | Description | Type |
---|---|---|---|
Job (only available on TaskManager) | <source_id>.<source_subtask_index>.<operator_id>.<operator_subtask_index>.latency | The latency distributions from a given source subtask to an operator subtask (in milliseconds). | Histogram |
Task | numBytesInLocal | Attention: deprecated, use Default shuffle service metrics. | Counter |
Task | numBytesInLocalPerSecond | Attention: deprecated, use Default shuffle service metrics. | Meter |
Task | numBytesInRemote | Attention: deprecated, use Default shuffle service metrics. | Counter |
Task | numBytesInRemotePerSecond | Attention: deprecated, use Default shuffle service metrics. | Meter |
Task | numBuffersInLocal | Attention: deprecated, use Default shuffle service metrics. | Counter |
Task | numBuffersInLocalPerSecond | Attention: deprecated, use Default shuffle service metrics. | Meter |
Task | numBuffersInRemote | Attention: deprecated, use Default shuffle service metrics. | Counter |
Task | numBuffersInRemotePerSecond | Attention: deprecated, use Default shuffle service metrics. | Meter |
Task | numBytesOut | The total number of bytes this task has emitted. | Counter |
Task | numBytesOutPerSecond | The number of bytes this task emits per second. | Meter |
Task | numBuffersOut | The total number of network buffers this task has emitted. | Counter |
Task | numBuffersOutPerSecond | The number of network buffers this task emits per second. | Meter |
Task/Operator | numRecordsIn | The total number of records this operator/task has received. | Counter |
Task/Operator | numRecordsInPerSecond | The number of records this operator/task receives per second. | Meter |
Task/Operator | numRecordsOut | The total number of records this operator/task has emitted. | Counter |
Task/Operator | numRecordsOutPerSecond | The number of records this operator/task sends per second. | Meter |
Task/Operator | numLateRecordsDropped | The number of records this operator/task has dropped due to arriving late. | Counter |
Task/Operator | currentInputWatermark | The last watermark this operator/task has received (in milliseconds). Note: For operators/tasks with 2 inputs this is the minimum of the last received watermarks. | Gauge |
Operator | currentInput1Watermark | The last watermark this operator has received in its first input (in milliseconds). Note: Only for operators with 2 inputs. | Gauge |
Operator | currentInput2Watermark | The last watermark this operator has received in its second input (in milliseconds). Note: Only for operators with 2 inputs. | Gauge |
Operator | currentOutputWatermark | The last watermark this operator has emitted (in milliseconds). | Gauge |
Operator | numSplitsProcessed | The total number of InputSplits this data source has processed (if the operator is a data source). | Gauge |
Connectors
Kafka Connectors
Scope | Metrics | User Variables | Description | Type |
---|---|---|---|---|
Operator | commitsSucceeded | n/a | The total number of successful offset commits to Kafka, if offset committing is turned on and checkpointing is enabled. | Counter |
Operator | commitsFailed | n/a | The total number of offset commit failures to Kafka, if offset committing is turned on and checkpointing is enabled. Note that committing offsets back to Kafka is only a means to expose consumer progress, so a commit failure does not affect the integrity of Flink's checkpointed partition offsets. | Counter |
Operator | committedOffsets | topic, partition | The last successfully committed offsets to Kafka, for each partition. A particular partition's metric can be specified by topic name and partition id. | Gauge |
Operator | currentOffsets | topic, partition | The consumer's current read offset, for each partition. A particular partition's metric can be specified by topic name and partition id. | Gauge |
Kinesis Connectors
Scope | Metrics | User Variables | Description | Type |
---|---|---|---|---|
Operator | millisBehindLatest | stream, shardId | The number of milliseconds the consumer is behind the head of the stream, indicating how far behind current time the consumer is, for each Kinesis shard. A particular shard's metric can be specified by stream name and shard id. A value of 0 indicates record processing is caught up, and there are no new records to process at this moment. A value of -1 indicates that there is no reported value for the metric, yet. | Gauge |
Operator | sleepTimeMillis | stream, shardId | The number of milliseconds the consumer spends sleeping before fetching records from Kinesis. A particular shard's metric can be specified by stream name and shard id. | Gauge |
Operator | maxNumberOfRecordsPerFetch | stream, shardId | The maximum number of records requested by the consumer in a single getRecords call to Kinesis. If ConsumerConfigConstants.SHARD_USE_ADAPTIVE_READS is set to true, this value is adaptively calculated to maximize the 2 Mbps read limits from Kinesis. | Gauge |
Operator | numberOfAggregatedRecordsPerFetch | stream, shardId | The number of aggregated Kinesis records fetched by the consumer in a single getRecords call to Kinesis. | Gauge |
Operator | numberOfDeaggregatedRecordsPerFetch | stream, shardId | The number of deaggregated Kinesis records fetched by the consumer in a single getRecords call to Kinesis. | Gauge |
Operator | averageRecordSizeBytes | stream, shardId | The average size of a Kinesis record in bytes, fetched by the consumer in a single getRecords call. | Gauge |
Operator | runLoopTimeNanos | stream, shardId | The actual time taken, in nanoseconds, by the consumer in the run loop. | Gauge |
Operator | loopFrequencyHz | stream, shardId | The number of calls to getRecords in one second. | Gauge |
Operator | bytesRequestedPerFetch | stream, shardId | The bytes requested (2 Mbps / loopFrequencyHz) in a single call to getRecords. | Gauge |
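The relationship in the bytesRequestedPerFetch row is simple arithmetic. A small sketch, assuming the "2 Mbps" limit in the table means 2 MiB per second per shard; that reading is an assumption and should be checked against the Kinesis limits that apply to you. The function and constant names are illustrative.

```python
# Assumed interpretation of the "2 Mbps" read limit: 2 MiB per second per shard.
SHARD_READ_LIMIT_BYTES_PER_SEC = 2 * 1024 * 1024

def bytes_requested_per_fetch(loop_frequency_hz,
                              limit=SHARD_READ_LIMIT_BYTES_PER_SEC):
    """Bytes to request per getRecords call so the per-second limit is met.

    loop_frequency_hz is the number of getRecords calls per second
    (the loopFrequencyHz metric).
    """
    if loop_frequency_hz <= 0:
        raise ValueError("loop_frequency_hz must be positive")
    return limit / loop_frequency_hz
```

Under this assumption, a consumer that calls getRecords four times per second would request a quarter of the per-second limit on each call.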
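System resources
System resource metrics are disabled by default and are not collected.
Monitoring Checkpoints
Flink's web interface provides a tab for monitoring a job's checkpoints; the data remains available after the job has terminated. Four tabs display checkpoint information: Overview, History, Summary, and Configuration. Each is described in turn below.
Overview
The Overview tab lists the following data. If the JobManager process is lost, this data is lost as well.
- Checkpoint Counts
  - Triggered: the total number of checkpoints triggered since the job started.
  - In Progress: the number of checkpoints currently in progress.
  - Completed: the total number of checkpoints completed successfully since the job started.
  - Failed: the total number of checkpoints that have failed since the job started.
  - Restored: the number of restore operations since the job started. This also reflects how many times the job has been restarted since it was submitted. Note that an initial submission from a savepoint also counts as one restore, and that the count is reset if the JobManager is lost.
- Latest Completed Checkpoint: the most recent successfully completed checkpoint. Click it to see detailed data at subtask level.
- Latest Failed Checkpoint: the most recent failed checkpoint. Click it to see detailed data at subtask level.
- Latest Savepoint: the most recent externally triggered savepoint. Click it to see detailed data at subtask level.
- Latest Restore: there are two kinds of restore operation.
  - Restore from Checkpoint: restored from a regular, periodic checkpoint.
  - Restore from Savepoint: restored from a savepoint.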
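The same counters can also be read programmatically rather than from the web page. A minimal sketch, assuming the JobManager's REST endpoint `/jobs/<jobid>/checkpoints` returns a `counts` object with the keys `total`, `in_progress`, `completed`, `failed`, and `restored`, and that the REST interface listens on port 8081; both are assumptions to verify against the Flink version in use. The mapping to the Overview labels is a pure function so it can be tried on a sample payload.

```python
import json
from urllib.request import urlopen

# Assumed mapping from REST `counts` keys to the Overview labels.
LABELS = {
    "total": "Triggered",
    "in_progress": "In Progress",
    "completed": "Completed",
    "failed": "Failed",
    "restored": "Restored",
}

def overview_counts(payload):
    """Map a checkpoint statistics payload to the Overview labels."""
    counts = payload.get("counts", {})
    return {label: counts.get(key, 0) for key, label in LABELS.items()}

def fetch_overview(job_id, base_url="http://localhost:8081"):
    """Fetch checkpoint statistics for one job from the JobManager."""
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints", timeout=5) as resp:
        return overview_counts(json.load(resp))
```

A monitoring script could alert when the "Failed" value keeps growing between two polls, or when "Completed" stops increasing for a streaming job.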