Prometheus
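- Prometheus Server: scrapes metrics, stores them as time series data, and provides a query interface
- Client Library: client libraries for instrumenting application code
- Push Gateway: short-term storage for metrics, mainly used by short-lived jobs
- Exporters: collect monitoring data from existing third-party services and expose it as metrics
- Web UI: a simple web console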
System environment
CentOS Linux release 7.6.1810 (Core)
Extract
# Download: https://prometheus.io/download/
tar xf prometheus-2.34.0.linux-amd64.tar.gz
mv prometheus-2.34.0.linux-amd64 prometheus-2.34.0
cd prometheus-2.34.0 && mkdir data
Edit the configuration file
#====== prometheus.yml ======#
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
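Add as a systemd service
# The unit below expects the install directory at /home/wxw/prometheus;
# adjust the paths, or symlink prometheus-2.34.0 to that location.
sudo tee /usr/lib/systemd/system/prometheus.service > /dev/null << 'EOF'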
[Unit]
Description=The Prometheus 2 monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target
[Service]
User=wxw
ExecStart=/home/wxw/prometheus/prometheus \
--storage.tsdb.path=/home/wxw/prometheus/data \
--config.file=/home/wxw/prometheus/prometheus.yml
Restart=on-failure
StartLimitInterval=1
RestartSec=3
[Install]
WantedBy=multi-user.target
EOF
Back up the configuration file
cp prometheus.yml prometheus.yml.bak
Synchronize the time
Prometheus depends on accurate time; clock drift can lead to unexpected query results.
sudo yum -y install ntp
sudo ntpdate time1.aliyun.com
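One way to sanity-check the clock after syncing is to compare the local epoch time with a reference timestamp. The following is a minimal sketch rather than part of the original setup: the helper names `clock_drift` and `drift_ok`, and the default 5-second threshold, are assumptions chosen for illustration.

```shell
# Hypothetical helper: absolute difference, in whole seconds,
# between two epoch timestamps (fractional parts are dropped).
clock_drift() {
    local a b d
    a=${1%.*}
    b=${2%.*}
    d=$(( a - b ))
    echo "${d#-}"   # strip a leading minus sign to get the absolute value
}

# Hypothetical helper: succeed if the drift is within the threshold.
# $3 is the allowed drift in seconds (default 5, an assumed value).
drift_ok() {
    local drift
    drift=$(clock_drift "$1" "$2")
    [ "$drift" -le "${3:-5}" ]
}
```

A possible use is `drift_ok "$(date +%s)" "$reference_epoch" || echo "clock drift too large"`, where `reference_epoch` comes from a host that is known to be in sync.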
Start
sudo systemctl daemon-reload
sudo systemctl start prometheus
# Alternatively, run it in the background without systemd:
# nohup ./prometheus --config.file=prometheus.yml &
Access the Web UI (open port 9090 in the firewall first)
sudo firewall-cmd --zone=public --permanent --add-port=9090/tcp
sudo firewall-cmd --reload
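Once the port is open, the server can also be checked from the command line. Prometheus exposes the management endpoints `/-/healthy` and `/-/ready`; the sketch below wraps them in two helpers. The function names `prom_endpoint` and `prom_check` are hypothetical, and `curl` is assumed to be installed.

```shell
# Hypothetical helper: build the URL of a Prometheus management endpoint.
# $1 = host:port, $2 = endpoint name (healthy or ready)
prom_endpoint() {
    printf 'http://%s/-/%s\n' "$1" "$2"
}

# Hypothetical helper: succeed if the endpoint answers without an HTTP error.
# curl -f makes HTTP errors (status >= 400) produce a non-zero exit status.
prom_check() {
    curl -fsS -o /dev/null "$(prom_endpoint "$1" "$2")"
}
```

For example, `prom_check localhost:9090 ready && echo "Prometheus is ready"` prints the message only when the server reports that it is ready to serve traffic.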
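node_exporter for host monitoring
tar xf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64 node_exporter-1.3.1
cd node_exporter-1.3.1
# The unit below expects the binary under /home/wxw/node_exporter;
# adjust the path, or symlink node_exporter-1.3.1 to that location.
sudo vim /usr/lib/systemd/system/node_exporter.service
# Add the system service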
[Unit]
Description=node_exporter
[Service]
User=wxw
ExecStart=/home/wxw/node_exporter/node_exporter \
--web.disable-exporter-metrics \
--log.level=error
[Install]
WantedBy=multi-user.target
# Start
sudo systemctl daemon-reload
sudo systemctl start node_exporter
# nohup ./node_exporter --web.listen-address=":9100" --web.disable-exporter-metrics &
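To confirm that node_exporter is serving data, its `/metrics` endpoint can be fetched and inspected. The sketch below pulls a single value out of the text exposition format. `metric_value` is a hypothetical helper and only handles metrics without labels, such as `node_load1`.

```shell
# Hypothetical helper: read exposition-format text on stdin and print the
# value of the first unlabelled metric whose name equals $1.
# Comment lines (# HELP / # TYPE) never match because their first field is "#".
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Assuming the exporter listens on the default port, `curl -s http://192.168.3.201:9100/metrics | metric_value node_load1` would print the current 1-minute load average.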
Configure the scrape job on the server
vim prometheus.yml
  - job_name: 'host-monitor'
    scrape_interval: 10s
    static_configs:
      - targets: ['192.168.3.201:9100']
        labels:
          instance: node1
Restart Prometheus
sudo systemctl restart prometheus
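After the restart, the new job can be verified through the HTTP query API by asking for the `up` metric of that job. This is a text-level sketch under stated assumptions: `query_value` and `job_is_up` are hypothetical helpers, the query is expected to return a single series, and a real JSON parser such as `jq` would be more robust than `grep` and `sed`.

```shell
# Hypothetical helper: extract the sample value from an instant-query
# JSON response on stdin (first series only).
query_value() {
    grep -o '"value":\[[^]]*]' | head -n 1 | sed 's/.*,"\([^"]*\)"]$/\1/'
}

# Hypothetical helper: succeed if the given job reports up == 1.
# $1 = Prometheus host:port, $2 = job name
job_is_up() {
    local value
    value=$(curl -fsS -G "http://$1/api/v1/query" \
        --data-urlencode "query=up{job=\"$2\"}" | query_value)
    [ "$value" = "1" ]
}
```

For example, `job_is_up localhost:9090 host-monitor && echo "node1 is being scraped"`.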
Grafana
Configure the yum repository
sudo tee /etc/yum.repos.d/grafana.repo > /dev/null << EOF
[grafana]
name=grafana
baseurl=https://mirrors.aliyun.com/grafana/yum/rpm
repo_gpgcheck=0
enabled=1
gpgcheck=0
EOF
$ sudo yum makecache
$ sudo yum install grafana
Start
sudo systemctl enable --now grafana-server
# nohup ./grafana-server &
Alertmanager
Install
tar xf alertmanager-0.24.0.linux-amd64.tar.gz
mv alertmanager-0.24.0.linux-amd64 alertmanager-0.24.0
Email alerts
Back up the original file
cp alertmanager.yml alertmanager.yml.bak
Configure alertmanager.yml
#====== alertmanager.yml ======#
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'wangxiwen95@163.com'
  smtp_auth_username: 'wangxiwen95@163.com'
  smtp_auth_password: 'HEWFSVKTHHBUIAZR'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'wangxiwen95@163.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Check the configuration file
$ ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
- global config
- route
- 1 inhibit rules
- 1 receivers
- 0 templates
Start
mkdir ./data
sudo vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager System
Documentation=alertmanager System
[Service]
User=wxw
ExecStart=/home/wxw/alertmanager/alertmanager \
--config.file=/home/wxw/alertmanager/alertmanager.yml \
--storage.path=/home/wxw/alertmanager/data
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl start alertmanager
# ./alertmanager --config.file=alertmanager.yml &
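With Alertmanager running, the notification path can be exercised without waiting for a real outage by posting a test alert to its v2 API (`/api/v2/alerts`). The sketch below builds a minimal payload and sends it. `alert_payload` and `send_test_alert` are hypothetical helpers, and the values are inserted into the JSON without escaping, so they should not contain quotes or backslashes.

```shell
# Hypothetical helper: print a minimal alert payload for the Alertmanager v2 API.
# $1 = alertname, $2 = severity, $3 = instance
alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"test alert from %s"}}]\n' \
        "$1" "$2" "$3" "$3"
}

# Hypothetical helper: post the payload to the Alertmanager at $1 (host:port).
# $2..$4 are passed through to alert_payload.
send_test_alert() {
    alert_payload "$2" "$3" "$4" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
            --data-binary @- "http://$1/api/v2/alerts"
}
```

For example, `send_test_alert localhost:9093 test warning node1` should lead to an email once `group_wait` has elapsed, assuming the SMTP settings above are valid.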
Configure prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Alert rule files
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "rules/*.yml"
Write the rules
cd prometheus && mkdir rules && cd rules
#====== host_monitor.yml ======#
groups:
  - name: node-up
    rules:
      - alert: node-up
        expr: up == 0
        for: 15s
        labels:
          severity: 1
          team: node
        annotations:
          summary: "{{ $labels.instance }}: instance has been down for more than 15 seconds"
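The `for` clause delays firing: an alert first becomes pending when the expression turns true, and fires only once it has stayed true for the whole `for` duration. With `for: 15s` and the 15-second `evaluation_interval` configured earlier, the expression typically has to be true on at least two consecutive evaluations. The sketch below is a simplified model of that behaviour, not Prometheus's actual implementation; `alert_state` is a hypothetical helper.

```shell
# Simplified model (an assumption for illustration): print the alert state.
# $1 = seconds the expression has been continuously true (-1 if currently false)
# $2 = the rule's `for` duration in seconds
alert_state() {
    if [ "$1" -lt 0 ]; then
        echo inactive
    elif [ "$1" -lt "$2" ]; then
        echo pending
    else
        echo firing
    fi
}
```

For the rule above, `alert_state 0 15` prints `pending` (the first evaluation that sees the target down) and `alert_state 15 15` prints `firing` (the next evaluation, 15 seconds later).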
Check the configuration file
$ ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 1 rule files found
SUCCESS: prometheus.yml is valid prometheus config file syntax
Checking rules/host_monitor.yml
SUCCESS: 1 rules found
Improve the alert template
Create a new template file
cd alertmanager && vim email.tmpl
{{ define "email.to.html" }}
{{ if gt (len .Alerts.Firing) 0 }}{{ range .Alerts.Firing }}
@Alert
Alerting program: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt }} <br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}{{ range .Alerts.Resolved }}
@Resolved:
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Resolved at: {{ .EndsAt }} <br>
{{ end }}
{{ end }}
{{ end }}
Update the configuration to use the template
#====== alertmanager.yml ======#
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'wangxiwen95@163.com'
  smtp_auth_username: 'wangxiwen95@163.com'
  smtp_auth_password: 'HEWFSVKTHHBUIAZR'
  smtp_require_tls: false
# Load the template
templates:
  - '/home/wxw/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'wangxiwen95@163.com'
        html: '{{ template "email.to.html" . }}' # Send using the template
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Check the configuration file and restart
$ ./amtool check-config alertmanager.yml
$ sudo systemctl restart alertmanager
WeCom (Enterprise WeChat) alerts
Edit alertmanager.yml
#====== alertmanager.yml ======#
global:
  resolve_timeout: 5m
# Load the template
templates:
  - '/home/wxw/alertmanager/wechat.tmpl'
# WeCom alerts
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat'
receivers:
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'wwa274450828cc9189'
        # to_party: '1'
        to_user: 'WangXiWen'
        agent_id: '1000002'
        api_secret: 'yTaolZ_bwq0sRc6YeSD_qEcGM4RFh8O12DnphNjy26Y'
        send_resolved: true
        message: '{{ template "wechat.tmpl" . }}'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Edit the template
vim wechat.tmpl
{