Monitoring Whether a Process Is Alive with process-exporter, a Prometheus Exporter

1. Background

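process-exporter is a Prometheus exporter that reports whether a given process exists, which makes it a convenient way to monitor service components. In this post I use process-exporter to monitor whether a maxwell instance is alive.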

2. How It Works

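Once the application service is deployed, process-exporter matches its processes against the configured rules and exposes the results as metrics that Prometheus scrapes periodically.
Prometheus then evaluates alerting rules against those metrics to produce the actual alerts.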

3. Installation

Package to download:
process-exporter-0.5.0.linux-amd64.tar.gz

Download the package:
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.5.0/process-exporter-0.5.0.linux-amd64.tar.gz

Extract it:
tar -xvf process-exporter-0.5.0.linux-amd64.tar.gz -C /data/bigdata_devops/

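Create a symlink (absolute paths are used so that the link lands in /data/bigdata_devops/ regardless of the current directory):
ln -s /data/bigdata_devops/process-exporter-0.5.0.linux-amd64 /data/bigdata_devops/process-exporter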

4. Usage in Practice

4.1 process-exporter configuration

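Create a configuration file.
Template variables available in the name field:
{{.Comm}} the basename of the original executable (the second field of /proc/<pid>/stat)
{{.ExeBase}} the basename of the executable
{{.ExeFull}} the fully qualified path of the executable
{{.Matches}} a map holding all matches produced by applying the cmdline regular expressions to the process command line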

4.2 Sample configuration

vim process-cfg-maxwell.yaml

process_names:
  - name: "{{.Matches}}"
    cmdline:
    - '/data/maxwell/company_custom_config/db_2_kafka.properties'
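
With name set to "{{.Matches}}", the groupname label in the metrics output further down has the form map[:<matched text>]. That shape is consistent with Go's default formatting of a map whose empty-string key holds the whole regex match. The Python sketch below only emulates that rendering to make the label easier to read; it is not process-exporter's code. The function name render_matches and the sample command line are hypothetical, and patterns with unnamed capture groups are deliberately not handled.

```python
import re


def render_matches(cmdline_regexes, command_line):
    """Emulate how a Matches map would be rendered as a group name.

    Returns None when any regex fails to match (all regexes must match).
    Simplification: only the whole match and named groups are recorded.
    """
    matches = {}
    for pattern in cmdline_regexes:
        m = re.search(pattern, command_line)
        if m is None:
            return None
        matches[""] = m.group(0)                  # whole match under the empty key
        matches.update(m.groupdict(default=""))   # named groups, if any
    # Go prints maps as map[k1:v1 k2:v2] with keys in sorted order.
    body = " ".join(f"{key}:{value}" for key, value in sorted(matches.items()))
    return f"map[{body}]"


# Hypothetical maxwell command line, for illustration only.
cmd = "java -jar maxwell.jar --config=/data/maxwell/company_custom_config/db_2_kafka.properties"
pattern = "/data/maxwell/company_custom_config/db_2_kafka.properties"
print(render_matches([pattern], cmd))
# map[:/data/maxwell/company_custom_config/db_2_kafka.properties]
```

Because the rendered map becomes the label value, any alert or query has to match the full map[...] string, not just the path.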

5. Service High Availability

vim bigdata_prometheus_healthcheck.service

[Unit]
Description=Bigdata maxwell_healthcheck
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/data/bigdata_devops/process-exporter/process-exporter -config.path /data/bigdata_devops/process-exporter/process-cfg-maxwell.yaml
WorkingDirectory=/data/bigdata_devops/process-exporter/
StandardOutput=inherit
StandardError=inherit
Restart=always
RestartSec=20

6. Metrics Collected by process-exporter

Dump the collected metrics to a file:

cat print_metric.sh 

#!/bin/bash
curl http://localhost:9256/metrics > prometheus-metrics.txt

The collected metrics look like this:

cat prometheus-metrics.txt 
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 6.5913e-05
go_gc_duration_seconds{quantile="0.25"} 9.774e-05
go_gc_duration_seconds{quantile="0.5"} 0.000102863
go_gc_duration_seconds{quantile="0.75"} 0.000121537
go_gc_duration_seconds{quantile="1"} 0.010470896
go_gc_duration_seconds_sum 0.013293763
go_gc_duration_seconds_count 27
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 9
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.167208e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.908604e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.452171e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 255873
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 446464
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.167208e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.620864e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.506752e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 5664
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 2.883584e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.127616e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6152862289964602e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 12065
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 261537
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 27776
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 32768
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 45144
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 98304
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.220717e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.212416e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.212416e+06
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.0590456e+07
# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} NaN
http_request_duration_microseconds{handler="prometheus",quantile="0.9"} NaN
http_request_duration_microseconds{handler="prometheus",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="prometheus"} 10322.643
http_request_duration_microseconds_count{handler="prometheus"} 1
# HELP http_request_size_bytes The HTTP request sizes in bytes.
# TYPE http_request_size_bytes summary
http_request_size_bytes{handler="prometheus",quantile="0.5"} NaN
http_request_size_bytes{handler="prometheus",quantile="0.9"} NaN
http_request_size_bytes{handler="prometheus",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="prometheus"} 66
http_request_size_bytes_count{handler="prometheus"} 1
# HELP http_requests_total Total number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="prometheus",method="get"} 1
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes summary
http_response_size_bytes{handler="prometheus",quantile="0.5"} NaN
http_response_size_bytes{handler="prometheus",quantile="0.9"} NaN
http_response_size_bytes{handler="prometheus",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="prometheus"} 14052
http_response_size_bytes_count{handler="prometheus"} 1
# HELP namedprocess_namegroup_context_switches_total Context switches
# TYPE namedprocess_namegroup_context_switches_total counter
namedprocess_namegroup_context_switches_total{ctxswitchtype="nonvoluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]"} 2
namedprocess_namegroup_context_switches_total{ctxswitchtype="voluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]"} 77779
# HELP namedprocess_namegroup_cpu_seconds_total Cpu user usage in seconds
# TYPE namedprocess_namegroup_cpu_seconds_total counter
namedprocess_namegroup_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="system"} 0.5899999999999963
namedprocess_namegroup_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="user"} 1.960000000000008
# HELP namedprocess_namegroup_major_page_faults_total Major page faults
# TYPE namedprocess_namegroup_major_page_faults_total counter
namedprocess_namegroup_major_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1
# HELP namedprocess_namegroup_memory_bytes number of bytes of memory in use
# TYPE namedprocess_namegroup_memory_bytes gauge
namedprocess_namegroup_memory_bytes{groupname="map[:/data/maxwell/db_2_kafka.properties]",memtype="resident"} 2.6741587968e+10
namedprocess_namegroup_memory_bytes{groupname="map[:/data/maxwell/db_2_kafka.properties]",memtype="swapped"} 0
namedprocess_namegroup_memory_bytes{groupname="map[:/data/maxwell/db_2_kafka.properties]",memtype="virtual"} 3.2450228224e+10
# HELP namedprocess_namegroup_minor_page_faults_total Minor page faults
# TYPE namedprocess_namegroup_minor_page_faults_total counter
namedprocess_namegroup_minor_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 2174
# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1
# HELP namedprocess_namegroup_num_threads Number of threads
# TYPE namedprocess_namegroup_num_threads gauge
namedprocess_namegroup_num_threads{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 46
# HELP namedprocess_namegroup_oldest_start_time_seconds start time in seconds since 1970/01/01 of oldest process in group
# TYPE namedprocess_namegroup_oldest_start_time_seconds gauge
namedprocess_namegroup_oldest_start_time_seconds{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1.615283711e+09
# HELP namedprocess_namegroup_open_filedesc number of open file descriptors for this group
# TYPE namedprocess_namegroup_open_filedesc gauge
namedprocess_namegroup_open_filedesc{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 121
# HELP namedprocess_namegroup_read_bytes_total number of bytes read by this group
# TYPE namedprocess_namegroup_read_bytes_total counter
namedprocess_namegroup_read_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 122880
# HELP namedprocess_namegroup_states Number of processes in states Running, Sleeping, Waiting, Zombie, or Other
# TYPE namedprocess_namegroup_states gauge
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Other"} 0
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Running"} 0
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Sleeping"} 47
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Waiting"} 0
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Zombie"} 0
# HELP namedprocess_namegroup_thread_context_switches_total Context switches for these threads
# TYPE namedprocess_namegroup_thread_context_switches_total counter
namedprocess_namegroup_thread_context_switches_total{ctxswitchtype="nonvoluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 2
namedprocess_namegroup_thread_context_switches_total{ctxswitchtype="voluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 77779
# HELP namedprocess_namegroup_thread_count Number of threads in this group with same threadname
# TYPE namedprocess_namegroup_thread_count gauge
namedprocess_namegroup_thread_count{groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 46
# HELP namedprocess_namegroup_thread_cpu_seconds_total Cpu user/system usage in seconds
# TYPE namedprocess_namegroup_thread_cpu_seconds_total counter
namedprocess_namegroup_thread_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="system",threadname="java"} 0.58
namedprocess_namegroup_thread_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="user",threadname="java"} 1.92
# HELP namedprocess_namegroup_thread_io_bytes_total number of bytes read/written by these threads
# TYPE namedprocess_namegroup_thread_io_bytes_total counter
namedprocess_namegroup_thread_io_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",iomode="read",threadname="java"} 122880
namedprocess_namegroup_thread_io_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",iomode="write",threadname="java"} 1.200128e+06
# HELP namedprocess_namegroup_thread_major_page_faults_total Major page faults for these threads
# TYPE namedprocess_namegroup_thread_major_page_faults_total counter
namedprocess_namegroup_thread_major_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 1
# HELP namedprocess_namegroup_thread_minor_page_faults_total Minor page faults for these threads
# TYPE namedprocess_namegroup_thread_minor_page_faults_total counter
namedprocess_namegroup_thread_minor_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 2174
# HELP namedprocess_namegroup_threads_wchan Number of threads in this group waiting on each wchan
# TYPE namedprocess_namegroup_threads_wchan gauge
namedprocess_namegroup_threads_wchan{groupname="map[:/data/maxwell/db_2_kafka.properties]",wchan="ep_poll"} 1
namedprocess_namegroup_threads_wchan{groupname="map[:/data/maxwell/db_2_kafka.properties]",wchan="futex_wait_queue_me"} 45
namedprocess_namegroup_threads_wchan{groupname="map[:/data/maxwell/db_2_kafka.properties]",wchan="sk_wait_data"} 1
# HELP namedprocess_namegroup_worst_fd_ratio the worst (closest to 1) ratio between open fds and max fds among all procs in this group
# TYPE namedprocess_namegroup_worst_fd_ratio gauge
namedprocess_namegroup_worst_fd_ratio{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 0.029541015625
# HELP namedprocess_namegroup_write_bytes_total number of bytes written by this group
# TYPE namedprocess_namegroup_write_bytes_total counter
namedprocess_namegroup_write_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1.200128e+06
# HELP namedprocess_scrape_errors general scrape errors: no proc metrics collected during a cycle
# TYPE namedprocess_scrape_errors counter
namedprocess_scrape_errors 0
# HELP namedprocess_scrape_partial_errors incremented each time a tracked proc's metrics collection fails partially, e.g. unreadable I/O stats
# TYPE namedprocess_scrape_partial_errors counter
namedprocess_scrape_partial_errors 213
# HELP namedprocess_scrape_procread_errors incremented each time a proc's metrics collection fails
# TYPE namedprocess_scrape_procread_errors counter
namedprocess_scrape_procread_errors 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.19
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 6.242304e+06
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.61528438298e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.084321792e+09
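
For the "is the process alive" question, the sample that matters in the dump above is namedprocess_namegroup_num_procs. As a rough sketch of the check an alerting rule would express, the Python below parses exposition text like prometheus-metrics.txt and treats a group as alive only when that sample exists and is at least 1. The function names parse_samples and group_is_alive are made up for this example, and the parser is intentionally minimal (it skips samples that carry a timestamp).

```python
import re

# name{labels} value   (labels optional; timestamped samples are skipped)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        yield m.group("name"), labels, float(m.group("value"))


def group_is_alive(text, groupname):
    """True only if the group's num_procs sample exists and is >= 1."""
    for name, labels, value in parse_samples(text):
        if (name == "namedprocess_namegroup_num_procs"
                and labels.get("groupname") == groupname):
            return value >= 1
    return False  # a missing sample is treated as "not alive"


sample = '''
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1
'''
print(group_is_alive(sample, "map[:/data/maxwell/db_2_kafka.properties]"))  # True
```

Treating a missing sample as "not alive" is a design choice made here: if the exporter has never seen a matching process, there may be no num_procs series for that group at all, so a check on the value alone would stay silent.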