User Flink Job Monitoring Feature: Design

老哥哥-老刘

Tags: flink, big data

Article link: https://blog.csdn.net/qq_18379607/article/details/135206326

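Design Document: User Flink Job Monitoring

Goal

Design and implement a feature that lets users query Prometheus for the metrics of their Flink job Pods and visualizes those metrics in a Vue front end using the ECharts library.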

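Technology Choices

Back end: Java (Spring Boot) or another back-end framework.
Front end: Vue.js as the framework, with ECharts for visualization.
Data fetching: Axios or another HTTP library to fetch data from the back-end API.
Deployment: deploy the back-end application to a server and make sure it can reach Prometheus.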

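Connecting Prometheus and Flink

Flink can actively push key metrics while a job is running to the Pushgateway in the Prometheus ecosystem, which holds them until Prometheus scrapes them. According to the official documentation, this requires configuration like the following in Flink's flink-conf.yaml (key names vary between Flink versions, so check the documentation for the version in use):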

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: localhost
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: myJob
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.groupingKey: k1=v1;k2=v2
metrics.reporter.promgateway.interval: 60 SECONDS

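Note that Flink currently runs in Kubernetes application mode, so every node hosting a JobManager or TaskManager Pod must be able to reach the Pushgateway host in order to push data. The localhost value above is only a placeholder; in a cluster it should be a Pushgateway address that the Pods can resolve.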

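Advantages of Developing with Prometheus and ECharts

1. Simpler development

Lower development difficulty: compared with Grafana, front-end development with ECharts is relatively easy. ECharts is an intuitive, easy-to-use JavaScript charting library that supports many chart types, including line, bar, and pie charts. Developers can build and adjust charts quickly to meet specific requirements.

2. Security control

Custom security: a custom front end gives tighter control over security. Custom authentication and authorization can ensure that only authenticated users reach the system and that their access is restricted. This makes it easier to adapt to specific security requirements.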

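3. User isolation and custom views

User isolation: combining Prometheus with ECharts makes user isolation easier. Custom logic can ensure that each user can access and query only their own data, which suits multi-tenant environments.
Custom views: with ECharts, the look and interaction of charts can be customized freely. Monitoring dashboards and visualizations can be built to match user requirements without being limited to Grafana's default views, so different users' needs can be met more closely.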
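The user-isolation idea could be enforced in the back end before any query is built. The following is a minimal sketch, assuming a hypothetical `TenantNamespaceGuard` class and a user-to-namespace mapping loaded from the platform's own user store; neither is part of the original design.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical guard: maps each user to the namespaces (tenants) they may query. */
final class TenantNamespaceGuard {

    // Assumption: this mapping is loaded from the platform's own user/tenant store.
    private final Map<String, Set<String>> namespacesByUser;

    TenantNamespaceGuard(Map<String, Set<String>> namespacesByUser) {
        this.namespacesByUser = Map.copyOf(namespacesByUser);
    }

    /** Returns true only if the user has been granted the namespace. */
    boolean isAllowed(String userId, String namespace) {
        if (userId == null || namespace == null) {
            return false;
        }
        return namespacesByUser.getOrDefault(userId, Set.of()).contains(namespace);
    }

    /** Rejects the request before any PromQL is built. */
    void requireAllowed(String userId, String namespace) {
        if (!isAllowed(userId, namespace)) {
            throw new SecurityException(
                "User " + userId + " may not query namespace " + namespace);
        }
    }
}
```

Calling `requireAllowed` at the top of the API handler keeps the check in one place, so a user who edits the request parameters still cannot read another tenant's series.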

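4. Lower customization cost

Lower cost of secondary development: with ECharts on the front end, specific customization requirements can be met without extensive adjustments to the Grafana interface. This lowers the cost of secondary development and allows faster responses to changing user needs.

5. System integration

Tighter system integration: with Prometheus and ECharts, the monitoring system can be integrated more closely with the StreamPark system and its services. Monitoring data can be combined with data from other applications to support further analysis and insight.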

Metrics Available After Integrating Prometheus with Flink

Flink's native Prometheus reporter collects a set of metrics and passes them to Prometheus for recording. The officially provided metrics are listed in the tables below. These values are generally instantaneous readings (Gauges) and need to be stored in Prometheus's time-series database.

METRIC	UNIT	DESCRIPTION
`flink_jobmanager_Status_JVM_CPU_Load`	Percentage	JobManager - recent CPU usage of the JVM; reported as not working as expected, for reasons that are unclear (workarounds are discussed under "How can I see the percentage CPU usage of jobmanager or taskmanagers of a Stream pipeline")
`flink_jobmanager_Status_JVM_CPU_Time`	Nanoseconds	JobManager - CPU Time used by the JVM
`flink_jobmanager_Status_JVM_Memory_Heap_Used`	Bytes	JobManager - amount of heap memory currently used
`flink_jobmanager_Status_JVM_Memory_Heap_Committed`	Bytes	JobManager - amount of heap memory guaranteed to be available to the JVM
`flink_jobmanager_Status_JVM_Memory_Heap_Max`	Bytes	JobManager - maximum amount of heap memory that can be used for memory management
`flink_jobmanager_Status_JVM_Memory_NonHeap_Used`	Bytes	JobManager - amount of non-heap memory currently used
`flink_jobmanager_Status_JVM_Memory_NonHeap_Committed`	Bytes	JobManager - amount of non-heap memory guaranteed to be available to the JVM
`flink_jobmanager_Status_JVM_Memory_NonHeap_Max`	Bytes	JobManager - maximum amount of non-heap memory that can be used for memory management
`flink_jobmanager_Status_JVM_Memory_Direct_Count`	Count	JobManager - number of buffers in the direct buffer pool
`flink_jobmanager_Status_JVM_Memory_Direct_MemoryUsed`	Bytes	JobManager - amount of memory used by the JVM for the direct buffer pool
`flink_jobmanager_Status_JVM_Memory_Direct_TotalCapacity`	Bytes	JobManager - total capacity of all buffers in the direct buffer pool
`flink_jobmanager_Status_JVM_Memory_Mapped_Count`	Count	JobManager - number of buffers in the mapped buffer pool
`flink_jobmanager_Status_JVM_Memory_Mapped_MemoryUsed`	Bytes	JobManager - amount of memory used by the JVM for the mapped buffer pool
`flink_jobmanager_Status_JVM_Memory_Mapped_TotalCapacity`	Bytes	JobManager - total capacity of all buffers in the mapped buffer pool
`flink_taskmanager_Status_JVM_CPU_Load`	Percentage	TaskManager - recent CPU usage of the JVM; reported as not working as expected, for reasons that are unclear (workarounds are discussed under "How can I see the percentage CPU usage of jobmanager or taskmanagers of a Stream pipeline")
`flink_taskmanager_Status_JVM_CPU_Time`	Nanoseconds	TaskManager - CPU Time used by the JVM
`flink_taskmanager_Status_JVM_Memory_Heap_Used`	Bytes	TaskManager - amount of heap memory currently used
`flink_taskmanager_Status_JVM_Memory_Heap_Committed`	Bytes	TaskManager - amount of heap memory guaranteed to be available to the JVM
`flink_taskmanager_Status_JVM_Memory_Heap_Max`	Bytes	TaskManager - maximum amount of heap memory that can be used for memory management
`flink_taskmanager_Status_JVM_Memory_NonHeap_Used`	Bytes	TaskManager - amount of non-heap memory currently used
`flink_taskmanager_Status_JVM_Memory_NonHeap_Committed`	Bytes	TaskManager - amount of non-heap memory guaranteed to be available to the JVM
`flink_taskmanager_Status_JVM_Memory_NonHeap_Max`	Bytes	TaskManager - maximum amount of non-heap memory that can be used for memory management
`flink_taskmanager_Status_JVM_Memory_Direct_Count`	Count	TaskManager - number of buffers in the direct buffer pool
`flink_taskmanager_Status_JVM_Memory_Direct_MemoryUsed`	Bytes	TaskManager - amount of memory used by the JVM for the direct buffer pool
`flink_taskmanager_Status_JVM_Memory_Direct_TotalCapacity`	Bytes	TaskManager - total capacity of all buffers in the direct buffer pool
`flink_taskmanager_Status_JVM_Memory_Mapped_Count`	Count	TaskManager - number of buffers in the mapped buffer pool
`flink_taskmanager_Status_JVM_Memory_Mapped_MemoryUsed`	Bytes	TaskManager - amount of memory used by the JVM for the mapped buffer pool
`flink_taskmanager_Status_JVM_Memory_Mapped_TotalCapacity`	Bytes	TaskManager - total capacity of all buffers in the mapped buffer pool
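Individual gauges from the table above are often more useful when combined. As an illustration only, the sketch below builds a PromQL expression for heap utilization by dividing `Heap_Used` by `Heap_Max`; the class and method names are hypothetical, and any label filters the deployment needs are left out.

```java
/** Builds a PromQL expression for JVM heap utilization from the metrics in the table above. */
final class FlinkHeapQueries {

    /** role must be "jobmanager" or "taskmanager", matching the metric name prefixes. */
    static String heapUtilization(String role) {
        if (!"jobmanager".equals(role) && !"taskmanager".equals(role)) {
            throw new IllegalArgumentException("Unknown Flink role: " + role);
        }
        String prefix = "flink_" + role + "_Status_JVM_Memory_Heap_";
        // Used / Max gives the fraction of the maximum heap in use for each reporting JVM.
        return prefix + "Used / " + prefix + "Max";
    }
}
```

Because both metrics come from the same reporter with the same labels, PromQL can pair them one-to-one without extra matching clauses.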

Flink Cluster Metrics

METRIC	DESCRIPTION
`flink_jobmanager_numRegisteredTaskManagers`	Total Number of Registered Task Managers
`flink_jobmanager_numRunningJobs`	Total Number of Running Jobs
`flink_jobmanager_taskSlotsTotal`	Total Number of Task Slots Allocated
`flink_jobmanager_taskSlotsAvailable`	Total Number of Task Slots Available

Flink I/O Metrics

METRIC	DESCRIPTION
`flink_taskmanager_job_task_currentLowWatermark`	Task - currentLowWatermark: the lowest watermark this task has received
`flink_taskmanager_job_task_numBytesInLocal`	Task - numBytesInLocal: the total number of bytes this task has read from a local source
`flink_taskmanager_job_task_numBytesInLocalPerSecond`	Task - numBytesInLocalPerSecond: the number of bytes this task reads from a local source per second
`flink_taskmanager_job_task_numBytesInRemote`	Task - numBytesInRemote: the total number of bytes this task has read from a remote source
`flink_taskmanager_job_task_numBytesInRemotePerSecond`	Task - numBytesInRemotePerSecond: the number of bytes this task reads from a remote source per second
`flink_taskmanager_job_task_numBytesOut`	Task - numBytesOut: the total number of bytes this task has emitted
`flink_taskmanager_job_task_numBytesOutPerSecond`	Task - numBytesOutPerSecond: the number of bytes this task emits per second
`flink_taskmanager_job_task_numRecordsIn`	Task/Operator - numRecordsIn: the total number of records this operator/task has received
`flink_taskmanager_job_task_numRecordsInPerSecond`	Task/Operator - numRecordsInPerSecond: the number of records this operator/task receives per second
`flink_taskmanager_job_task_numRecordsOut`	Task/Operator - numRecordsOut: the total number of records this operator/task has emitted
`flink_taskmanager_job_task_numRecordsOutPerSecond`	Task/Operator - numRecordsOutPerSecond: the number of records this operator/task sends per second
`flink_taskmanager_job_task_operator_latency`	Operator - latency: the latency distributions from all incoming sources

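Key Steps

Step 1: Prometheus Query Endpoint

1.1 Creating the Back-End API Endpoint

First, create an API endpoint in the back-end application that receives requests from the front end and runs the Prometheus query. It can be implemented in Java (with a framework such as Spring Boot) or in another back-end language or framework. A simplified example: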

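import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api")
public class PrometheusController {

    // PrometheusQueryService is the application's own service that talks to Prometheus.
    @Autowired
    private PrometheusQueryService prometheusQueryService;

    @GetMapping("/getFlinkMetrics")
    public ResponseEntity<?> getFlinkMetrics(
        @RequestParam String namespace,
        @RequestParam String podName
    ) {
        // Build the PromQL query. The parameters are concatenated as-is here;
        // validate or escape them before using this in production.
        String promQLQuery = "container_cpu_usage_seconds_total" +
                             "{namespace=\"" + namespace + "\", pod=\"" + podName + "\"}";

        // Execute the query
        String result = prometheusQueryService.executeQuery(promQLQuery);

        // Process the result and return it
        // ...
        return ResponseEntity.ok(result);
    }
}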

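1.2 Building the Prometheus Query

The example above builds a PromQL query for the CPU usage of a given namespace and Pod name. Other queries can be built as needed for metrics such as memory usage and network traffic. Some example queries follow. The CPU and network metrics are cumulative counters, so a usage or throughput rate is obtained by wrapping the selector in rate() with a time window such as [5m].

CPU usage (cumulative CPU seconds):

container_cpu_usage_seconds_total{namespace="your-namespace", pod="your-pod-name"}

Memory usage (bytes):

container_memory_usage_bytes{namespace="your-namespace", pod="your-pod-name"}

Network bytes received (cumulative):

container_network_receive_bytes_total{namespace="your-namespace", pod="your-pod-name"}

These queries are only illustrations. Which metrics to collect, and how to write the PromQL for them, should be confirmed with the product team and with people familiar with Prometheus.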
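Since the namespace and Pod name come from user input, the query text is worth assembling in one place that escapes label values. The sketch below is one possible shape, with hypothetical class and method names; it also shows a counter wrapped in rate() to obtain per-second CPU usage. The window argument is assumed to be a trusted constant such as "5m" and is not validated here.

```java
/** Sketch of a PromQL builder that escapes label values and wraps a counter in rate(). */
final class PodQueries {

    /** Escapes backslashes, double quotes and newlines for a double-quoted PromQL string. */
    static String escapeLabelValue(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds metric{namespace="...", pod="..."} with escaped label values. */
    static String selector(String metric, String namespace, String pod) {
        return metric + "{namespace=\"" + escapeLabelValue(namespace)
            + "\", pod=\"" + escapeLabelValue(pod) + "\"}";
    }

    /** Per-second CPU usage in cores, summed over the Pod's containers. */
    static String cpuCores(String namespace, String pod, String window) {
        return "sum(rate(" + selector("container_cpu_usage_seconds_total", namespace, pod)
            + "[" + window + "]))";
    }
}
```

For example, `PodQueries.cpuCores("your-namespace", "your-pod-name", "5m")` produces a query whose result can be plotted directly as CPU cores in use.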

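1.3 Executing the Prometheus Query

In the API endpoint, run the PromQL query against Prometheus's HTTP API, either through a query client library or with a plain HTTP library. Note that the official Prometheus Java client is meant for exposing metrics and does not provide a query API, so the first snippet is pseudocode for a hypothetical query wrapper.

Using a query client library (pseudocode):

Query query = new Query(promQLQuery, Instant.now());
QueryResponse response = prometheus.query(query);
// Process the response and return the data to the front end

Using an HTTP request:

// Send a GET request to the Prometheus server with an HTTP library
// Process the response and return the data to the front end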
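The HTTP option can be sketched with the JDK's built-in `java.net.http.HttpClient` (Java 11 or later) and Prometheus's `/api/v1/query_range` endpoint. The class name, base URL, and timeouts below are assumptions for illustration; the method returns the raw JSON body and leaves parsing to the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.Instant;

final class PrometheusHttpQuery {

    /** Builds the URI for a range query; baseUrl is e.g. "http://prometheus:9090" (assumed). */
    static URI rangeQueryUri(String baseUrl, String promQl,
                             Instant start, Instant end, Duration step) {
        String query = URLEncoder.encode(promQl, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/v1/query_range"
            + "?query=" + query
            + "&start=" + start.getEpochSecond()
            + "&end=" + end.getEpochSecond()
            + "&step=" + step.getSeconds() + "s");
    }

    /** Sends the GET request and returns the raw JSON body. */
    static String execute(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
            .timeout(Duration.ofSeconds(10))
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            // Prometheus reports query errors with a non-200 status and a JSON error body.
            throw new IOException("Prometheus returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A range query is used rather than an instant query because the charts need a series of points over time; the step controls how many points come back per series.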

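1.4 Processing and Returning the Result

Finally, process the result of the Prometheus query and convert it into a data format suited to the front end. The time-series data can be extracted from the query result, converted to JSON, and returned to the Vue front-end application. The structure below follows the shape of a Prometheus range-query (matrix) response, in which each data point is a timestamp paired with a sample value that Prometheus returns as a string.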

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "container": "your-container-id",
          "pod": "your-pod-name",
          "namespace": "your-namespace"
        },
        "values": [
          [timestamp1, value1],
          [timestamp2, value2],
          [timestamp3, value3],
          // more data points...
        ]
      },
      // more time series...
    ]
  }
}
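To chart such a response, each series' values array has to become a list of points on a time axis. The sketch below assumes the JSON has already been parsed into Java lists by whatever JSON library the project uses, and it uses a hypothetical `Point` record (Java 16 or later); ECharts time axes accept timestamps in milliseconds, so the Unix seconds are converted.

```java
import java.util.ArrayList;
import java.util.List;

final class EChartsSeriesMapper {

    /** One chart point: timestamp in epoch milliseconds plus the numeric sample value. */
    record Point(long epochMillis, double value) {}

    /**
     * Converts the "values" array of one series, already parsed from JSON, where each
     * entry is [unixSeconds (number), sampleValue (string)].
     */
    static List<Point> toPoints(List<List<Object>> values) {
        List<Point> points = new ArrayList<>(values.size());
        for (List<Object> pair : values) {
            double seconds = ((Number) pair.get(0)).doubleValue();
            // parseDouble accepts "NaN"; "+Inf" and "-Inf" would need special handling.
            double value = Double.parseDouble((String) pair.get(1));
            points.add(new Point(Math.round(seconds * 1000), value));
        }
        return points;
    }
}
```

Doing the conversion in the back end keeps the front end free of Prometheus-specific parsing, so the Vue component only has to hand the points to a line series.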

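Prometheus returns the data as discrete sample points. Prometheus's own graph view plots such data as a line chart over time, and the front end can do the same, rendering each series as a time-series curve.

When processing the result, handle error cases properly, such as a failed query or no data being available, so that the front end receives useful feedback.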
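One way to make these cases explicit is a small response envelope that distinguishes success, no data, and failure. The `MetricsReply` class below is purely hypothetical; it assumes the back end has already read the `status` and `error` fields of the Prometheus response and extracted the series list.

```java
import java.util.List;

/** Hypothetical response envelope the back end could return to the Vue client. */
final class MetricsReply {
    final String state;     // "ok", "empty" or "error"
    final String message;   // human-readable hint for the UI, empty when state is "ok"
    final List<?> series;   // chart series, empty unless state is "ok"

    private MetricsReply(String state, String message, List<?> series) {
        this.state = state;
        this.message = message;
        this.series = series;
    }

    /** Maps the Prometheus status, error text and extracted series onto one of three states. */
    static MetricsReply from(String prometheusStatus, String prometheusError, List<?> series) {
        if (!"success".equals(prometheusStatus)) {
            String reason = prometheusError == null ? "Prometheus query failed" : prometheusError;
            return new MetricsReply("error", reason, List.of());
        }
        if (series == null || series.isEmpty()) {
            return new MetricsReply("empty", "No data for the selected job and time range", List.of());
        }
        return new MetricsReply("ok", "", series);
    }
}
```

The front end can then switch on `state` to show a chart, an empty-state hint, or an error message.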

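In short, the key to this step is a back-end API endpoint that runs Prometheus queries and converts the results into a format the front end can use. The endpoint should also include security and error-handling mechanisms to keep the system stable and secure.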

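Step 2: Front-End Rendering

As the author is a back-end developer, this part does not include detailed code; the front-end example code is left for a front-end developer to fill in.

2.1 Creating the Vue.js Front-End Application

Create the Vue.js front-end application with Vue CLI or another project template.

2.2 Integrating ECharts

Install the ECharts library and make sure the Vue.js application can use it.

2.3 Fetching Data from the Back End with Axios

In the Vue.js application, use Axios or another HTTP library to send a GET request to the back-end API and retrieve the Prometheus query results. The data format was described above, and the charts can be designed around the structure of the returned data.

2.4 Rendering the Data

Create a Vue component that displays the metrics of a Flink job.
Use ECharts to map the data fetched from the back end onto charts so that the metrics are visualized.
Front-end developers can build the chart rendering and interaction on top of the data format and API provided.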

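Step 3: User Interface

3.1 Creating the User Interface

Create a web page with input fields (such as text boxes or drop-down menus) and a submit button, where the user enters information about the Flink job.
The page can also include areas for charts and data tables, which are populated once the data has been fetched.

3.2 Entering Job Information

Add input fields to the page so that the user can enter the following:
- Namespace (tenant name): the Kubernetes namespace in which the Flink job runs.
- Job name: the name of the Flink job the user wants to monitor.

3.3 Submitting the Query

Add a submit button that the user clicks after filling in the job information.
On click, the front end calls the back-end API, passing the namespace and job name so that the back end can run the Prometheus query.

3.4 Displaying Query Results in Real Time

The page can display the query results in real time, including charts and data tables.
When the back-end API returns the query result, the front end shows it in the corresponding areas.

3.5 Providing User Feedback

Show appropriate feedback on the page, such as a loading indicator or an error message, to tell the user the state of the query.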
