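1. Background
process-exporter is a Prometheus exporter that reports whether given processes are running, which makes it a convenient way to monitor service components. In this post I use process-exporter to check whether a maxwell instance is alive.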
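2. How It Works
Once the application is deployed, process-exporter matches its processes against the configured rules and exposes the result as metrics that Prometheus scrapes periodically.
Prometheus then evaluates alerting rules against those metrics to decide when to fire an alert.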
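To make the flow above concrete, here is a minimal sketch that writes a scrape job and an alerting rule for Prometheus. The job name, file names, alert name, the 1m window, and the groupname regex are all assumptions chosen for illustration; only port 9256 and the `namedprocess_namegroup_num_procs` metric come from this post. The `absent(...)` branch is there on the assumption that the series may be missing entirely, for example if the exporter restarts while maxwell is down.

```shell
#!/bin/bash
# Sketch: generate Prometheus snippets for process-exporter.
# Assumed (illustrative) names: job "process-exporter", alert "MaxwellProcessDown",
# and the two output file names below.

write_prometheus_snippets() {
    out_dir="$1"   # directory to write the snippet files into
    target="$2"    # host:port of process-exporter, e.g. localhost:9256

    # Scrape job: merge this entry under scrape_configs in prometheus.yml.
    cat > "$out_dir/scrape-process-exporter.yml" <<EOF
- job_name: process-exporter
  static_configs:
    - targets: ['$target']
EOF

    # Alert rule: fire when no matching process is counted, or when the
    # series does not exist at all.
    cat > "$out_dir/maxwell-rules.yml" <<'EOF'
groups:
  - name: maxwell
    rules:
      - alert: MaxwellProcessDown
        expr: >
          sum(namedprocess_namegroup_num_procs{groupname=~".*db_2_kafka.properties.*"}) == 0
          or absent(namedprocess_namegroup_num_procs{groupname=~".*db_2_kafka.properties.*"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "maxwell process is not running"
EOF
}

# Usage: write_prometheus_snippets /tmp localhost:9256
```

The rule file would then be referenced from `rule_files` in prometheus.yml; adjust the regex to whatever group name your own configuration produces.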
3. Installation
Package to download:
process-exporter-0.5.0.linux-amd64.tar.gz
Download:
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.5.0/process-exporter-0.5.0.linux-amd64.tar.gz
Extract:
tar -xvf process-exporter-0.5.0.linux-amd64.tar.gz -C /data/bigdata_devops/
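Create a symlink (run from the extraction directory so the relative paths resolve):
cd /data/bigdata_devops/ && ln -s process-exporter-0.5.0.linux-amd64 process-exporter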
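4. Usage in Practice
4.1 process-exporter Configuration
Create a configuration file.
Available template variables:
{{.Comm}} contains the basename of the original executable, i.e. the second field of /proc/<pid>/stat
{{.ExeBase}} contains the basename of the executable
{{.ExeFull}} contains the fully qualified path of the executable
{{.Matches}} is a map containing all matches produced by applying the cmdline regular expressions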
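To get a feel for what these variables would contain for a given process, the following sketch reads the same information from /proc. It assumes that `{{.ExeFull}}` and `{{.ExeBase}}` are derived from the first command-line argument; that is my assumption about the exporter's behaviour, so treat the output as an approximation and check it against the group names the exporter actually reports. The function name and the optional second parameter are hypothetical helpers.

```shell
#!/bin/bash
# Sketch: approximate {{.Comm}}, {{.ExeBase}} and {{.ExeFull}} for one PID.
# Assumption: ExeFull is argv[0] and ExeBase is its basename.

preview_template_vars() {
    pid="$1"
    proc_root="${2:-/proc}"   # overridable so the function can be tried on test data

    # /proc/<pid>/comm holds the same name as the second field of /proc/<pid>/stat.
    comm=$(cat "$proc_root/$pid/comm")

    # /proc/<pid>/cmdline is NUL-separated; the first entry is argv[0].
    exe_full=$(tr '\0' '\n' < "$proc_root/$pid/cmdline" | head -n 1)
    exe_base=$(basename "$exe_full")

    printf 'Comm=%s\nExeBase=%s\nExeFull=%s\n' "$comm" "$exe_base" "$exe_full"
}

# Usage: preview_template_vars <pid-of-maxwell>
```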
4.2 Sample Configuration
vim process-cfg-maxwell.yaml
process_names:
  - name: "{{.Matches}}"
    cmdline:
    - '/data/maxwell/company_custom_config/db_2_kafka.properties'
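With `{{.Matches}}` as the name, the group label is the printed form of the whole match map, which is why it shows up as `map[:/data/...]` in the metrics. As far as I understand, the whole match is stored under an empty key when the regular expression has no named capture group. A possible variant, sketched below, names the capture group and references it directly. The group name `Cfg`, the `maxwell:` prefix, and the helper function are illustrative choices, not part of the original setup.

```shell
#!/bin/bash
# Sketch: write an alternative config that uses a named capture group.
# "Cfg" and the "maxwell:" prefix are hypothetical names.

write_maxwell_config() {
    cfg_file="$1"
    cat > "$cfg_file" <<'EOF'
process_names:
  # (?P<Cfg>...) is a named capture group; {{.Matches.Cfg}} expands to the
  # text it captured, giving a plain group name instead of "map[...]".
  - name: "maxwell:{{.Matches.Cfg}}"
    cmdline:
    - '(?P<Cfg>/data/maxwell/company_custom_config/db_2_kafka\.properties)'
EOF
}

# Usage: write_maxwell_config process-cfg-maxwell.yaml
```

If you switch to a variant like this, any alert expression that filters on `groupname` has to be updated to the new label value.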
5. Service High Availability
vim bigdata_prometheus_healthcheck.service
[Unit]
Description=Bigdata maxwell_healthcheck
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/data/bigdata_devops/process-exporter/process-exporter -config.path /data/bigdata_devops/process-exporter/process-cfg-maxwell.yaml
WorkingDirectory=/data/bigdata_devops/process-exporter/
StandardOutput=inherit
StandardError=inherit
Restart=always
RestartSec=20
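With `Restart=always` and `RestartSec=20`, systemd restarts the exporter 20 seconds after it exits. A minimal sketch for installing and starting the unit follows. It assumes the unit file sits in the current directory and is installed to /etc/systemd/system; the `run` and `install_unit` helpers and the `DRY_RUN` switch are hypothetical conveniences for previewing the commands. Note that `systemctl enable` only takes effect for units that have an [Install] section, which the unit above does not define, so this sketch only starts the service.

```shell
#!/bin/bash
# Sketch: install and start the exporter unit.
# Set DRY_RUN=1 to print the commands instead of executing them.
UNIT=bigdata_prometheus_healthcheck.service

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_unit() {
    run cp "$UNIT" /etc/systemd/system/   # assumed install location
    run systemctl daemon-reload           # make systemd pick up the new unit file
    run systemctl start "$UNIT"
    run systemctl is-active "$UNIT"       # prints the unit's current state
}

# Usage (as root): install_unit
```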
6. Metrics Collected by process-exporter
Print the collected metrics:
cat print_metric.sh
#!/bin/bash
curl http://localhost:9256/metrics > prometheus-metrics.txt
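For a quick liveness check without Prometheus, the same endpoint can be filtered on the command line. The sketch below sums `namedprocess_namegroup_num_procs` over every group whose label text contains a given substring; the function name `count_procs` is a hypothetical helper, and it reads the metrics text from standard input.

```shell
#!/bin/bash
# Sketch: count monitored processes from process-exporter metrics text on stdin.

count_procs() {
    pattern="$1"   # substring to look for in the sample line, e.g. a config file name
    awk -v pat="$pattern" '
        # Only sample lines of the num_procs metric (comment lines start with "#").
        index($0, "namedprocess_namegroup_num_procs{") == 1 && index($0, pat) {
            total += $NF   # the sample value is the last field
        }
        END { print total + 0 }
    '
}

# Usage:
#   curl -s http://localhost:9256/metrics | count_procs db_2_kafka.properties
```

A result of 0 means no matching process was counted, either because it is not running or because no group matches the substring.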
The collected metrics are as follows:
cat prometheus-metrics.txt
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 6.5913e-05
go_gc_duration_seconds{quantile="0.25"} 9.774e-05
go_gc_duration_seconds{quantile="0.5"} 0.000102863
go_gc_duration_seconds{quantile="0.75"} 0.000121537
go_gc_duration_seconds{quantile="1"} 0.010470896
go_gc_duration_seconds_sum 0.013293763
go_gc_duration_seconds_count 27
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 9
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.167208e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.908604e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.452171e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 255873
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 446464
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.167208e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.620864e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.506752e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 5664
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 2.883584e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.127616e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6152862289964602e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 12065
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 261537
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 27776
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 32768
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 45144
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 98304
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.220717e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.212416e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.212416e+06
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.0590456e+07
# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} NaN
http_request_duration_microseconds{handler="prometheus",quantile="0.9"} NaN
http_request_duration_microseconds{handler="prometheus",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="prometheus"} 10322.643
http_request_duration_microseconds_count{handler="prometheus"} 1
# HELP http_request_size_bytes The HTTP request sizes in bytes.
# TYPE http_request_size_bytes summary
http_request_size_bytes{handler="prometheus",quantile="0.5"} NaN
http_request_size_bytes{handler="prometheus",quantile="0.9"} NaN
http_request_size_bytes{handler="prometheus",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="prometheus"} 66
http_request_size_bytes_count{handler="prometheus"} 1
# HELP http_requests_total Total number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="prometheus",method="get"} 1
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes summary
http_response_size_bytes{handler="prometheus",quantile="0.5"} NaN
http_response_size_bytes{handler="prometheus",quantile="0.9"} NaN
http_response_size_bytes{handler="prometheus",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="prometheus"} 14052
http_response_size_bytes_count{handler="prometheus"} 1
# HELP namedprocess_namegroup_context_switches_total Context switches
# TYPE namedprocess_namegroup_context_switches_total counter
namedprocess_namegroup_context_switches_total{ctxswitchtype="nonvoluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]"} 2
namedprocess_namegroup_context_switches_total{ctxswitchtype="voluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]"} 77779
# HELP namedprocess_namegroup_cpu_seconds_total Cpu user usage in seconds
# TYPE namedprocess_namegroup_cpu_seconds_total counter
namedprocess_namegroup_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="system"} 0.5899999999999963
namedprocess_namegroup_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="user"} 1.960000000000008
# HELP namedprocess_namegroup_major_page_faults_total Major page faults
# TYPE namedprocess_namegroup_major_page_faults_total counter
namedprocess_namegroup_major_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1
# HELP namedprocess_namegroup_memory_bytes number of bytes of memory in use
# TYPE namedprocess_namegroup_memory_bytes gauge
namedprocess_namegroup_memory_bytes{groupname="map[:/data/maxwell/db_2_kafka.properties]",memtype="resident"} 2.6741587968e+10
namedprocess_namegroup_memory_bytes{groupname="map[:/data/maxwell/db_2_kafka.properties]",memtype="swapped"} 0
namedprocess_namegroup_memory_bytes{groupname="map[:/data/maxwell/db_2_kafka.properties]",memtype="virtual"} 3.2450228224e+10
# HELP namedprocess_namegroup_minor_page_faults_total Minor page faults
# TYPE namedprocess_namegroup_minor_page_faults_total counter
namedprocess_namegroup_minor_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 2174
# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1
# HELP namedprocess_namegroup_num_threads Number of threads
# TYPE namedprocess_namegroup_num_threads gauge
namedprocess_namegroup_num_threads{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 46
# HELP namedprocess_namegroup_oldest_start_time_seconds start time in seconds since 1970/01/01 of oldest process in group
# TYPE namedprocess_namegroup_oldest_start_time_seconds gauge
namedprocess_namegroup_oldest_start_time_seconds{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1.615283711e+09
# HELP namedprocess_namegroup_open_filedesc number of open file descriptors for this group
# TYPE namedprocess_namegroup_open_filedesc gauge
namedprocess_namegroup_open_filedesc{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 121
# HELP namedprocess_namegroup_read_bytes_total number of bytes read by this group
# TYPE namedprocess_namegroup_read_bytes_total counter
namedprocess_namegroup_read_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 122880
# HELP namedprocess_namegroup_states Number of processes in states Running, Sleeping, Waiting, Zombie, or Other
# TYPE namedprocess_namegroup_states gauge
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Other"} 0
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Running"} 0
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Sleeping"} 47
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Waiting"} 0
namedprocess_namegroup_states{groupname="map[:/data/maxwell/db_2_kafka.properties]",state="Zombie"} 0
# HELP namedprocess_namegroup_thread_context_switches_total Context switches for these threads
# TYPE namedprocess_namegroup_thread_context_switches_total counter
namedprocess_namegroup_thread_context_switches_total{ctxswitchtype="nonvoluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 2
namedprocess_namegroup_thread_context_switches_total{ctxswitchtype="voluntary",groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 77779
# HELP namedprocess_namegroup_thread_count Number of threads in this group with same threadname
# TYPE namedprocess_namegroup_thread_count gauge
namedprocess_namegroup_thread_count{groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 46
# HELP namedprocess_namegroup_thread_cpu_seconds_total Cpu user/system usage in seconds
# TYPE namedprocess_namegroup_thread_cpu_seconds_total counter
namedprocess_namegroup_thread_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="system",threadname="java"} 0.58
namedprocess_namegroup_thread_cpu_seconds_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",mode="user",threadname="java"} 1.92
# HELP namedprocess_namegroup_thread_io_bytes_total number of bytes read/written by these threads
# TYPE namedprocess_namegroup_thread_io_bytes_total counter
namedprocess_namegroup_thread_io_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",iomode="read",threadname="java"} 122880
namedprocess_namegroup_thread_io_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",iomode="write",threadname="java"} 1.200128e+06
# HELP namedprocess_namegroup_thread_major_page_faults_total Major page faults for these threads
# TYPE namedprocess_namegroup_thread_major_page_faults_total counter
namedprocess_namegroup_thread_major_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 1
# HELP namedprocess_namegroup_thread_minor_page_faults_total Minor page faults for these threads
# TYPE namedprocess_namegroup_thread_minor_page_faults_total counter
namedprocess_namegroup_thread_minor_page_faults_total{groupname="map[:/data/maxwell/db_2_kafka.properties]",threadname="java"} 2174
# HELP namedprocess_namegroup_threads_wchan Number of threads in this group waiting on each wchan
# TYPE namedprocess_namegroup_threads_wchan gauge
namedprocess_namegroup_threads_wchan{groupname="map[:/data/maxwell/db_2_kafka.properties]",wchan="ep_poll"} 1
namedprocess_namegroup_threads_wchan{groupname="map[:/data/maxwell/db_2_kafka.properties]",wchan="futex_wait_queue_me"} 45
namedprocess_namegroup_threads_wchan{groupname="map[:/data/maxwell/db_2_kafka.properties]",wchan="sk_wait_data"} 1
# HELP namedprocess_namegroup_worst_fd_ratio the worst (closest to 1) ratio between open fds and max fds among all procs in this group
# TYPE namedprocess_namegroup_worst_fd_ratio gauge
namedprocess_namegroup_worst_fd_ratio{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 0.029541015625
# HELP namedprocess_namegroup_write_bytes_total number of bytes written by this group
# TYPE namedprocess_namegroup_write_bytes_total counter
namedprocess_namegroup_write_bytes_total{groupname="map[:/data/maxwell/db_2_kafka.properties]"} 1.200128e+06
# HELP namedprocess_scrape_errors general scrape errors: no proc metrics collected during a cycle
# TYPE namedprocess_scrape_errors counter
namedprocess_scrape_errors 0
# HELP namedprocess_scrape_partial_errors incremented each time a tracked proc's metrics collection fails partially, e.g. unreadable I/O stats
# TYPE namedprocess_scrape_partial_errors counter
namedprocess_scrape_partial_errors 213
# HELP namedprocess_scrape_procread_errors incremented each time a proc's metrics collection fails
# TYPE namedprocess_scrape_procread_errors counter
namedprocess_scrape_procread_errors 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.19
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 6.242304e+06
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.61528438298e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.084321792e+09
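Values in the dump are printed in scientific notation, which is hard to read for memory sizes. As a small convenience, the sketch below converts the resident memory of each monitored group to MiB; `resident_mib` is a hypothetical helper name, and it reads the metrics text from standard input.

```shell
#!/bin/bash
# Sketch: print resident memory per monitored group in MiB.

resident_mib() {
    awk '
        # Sample lines of the memory metric with memtype="resident".
        index($0, "namedprocess_namegroup_memory_bytes{") == 1 &&
        index($0, "memtype=\"resident\"") {
            printf "%.1f\n", $NF / 1048576   # bytes to MiB
        }
    '
}

# Usage: resident_mib < prometheus-metrics.txt
```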