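System environment and version information
Item | Version | Spec / host |
System environment | Ubuntu 20.04.6 LTS | 48 cores, 65 GB RAM, 256 GB system disk, 4 TB data disk |
Zabbix-server | 6.0.26 | 192.168.2.20 |
Grafana | 10.0.3 | 192.168.2.20 |
Zabbix-agent | 6.0.26/6.0.28 | minor version differences do not matter |
Zabbix-agent2 | 6.0.27 | minor version differences do not matter |
Nginx | 1.18.0 | 192.168.2.20 |
MySQL | 8.0.36 | 192.168.2.20 |
PHP | 7.4.3 | 192.168.2.20 |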
I. Installing zabbix-server
1. Install the Zabbix repository
wget https://repo.zabbix.com/zabbix/6.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.0-4+ubuntu20.04_all.deb
dpkg -i zabbix-release_6.0-4+ubuntu20.04_all.deb
apt update
2. Install Zabbix server, web frontend, and agent
apt install zabbix-server-mysql zabbix-frontend-php zabbix-nginx-conf zabbix-sql-scripts zabbix-agent
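3. Create the initial database
Log in to MySQL as root and run:
create database zabbix character set utf8mb4 collate utf8mb4_bin;
create user zabbix@localhost identified by 'password';
grant all privileges on zabbix.* to zabbix@localhost;
set global log_bin_trust_function_creators = 1;
Exit MySQL and import the initial schema and data (you will be prompted for the zabbix user's password):
zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | mysql --default-character-set=utf8mb4 -uzabbix -p zabbix
Log in to the database again and turn the option back off:
set global log_bin_trust_function_creators = 0;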
4. Configure zabbix_server.conf
File: /etc/zabbix/zabbix_server.conf
LogFile=/var/log/zabbix/zabbix_server.log
PidFile=/run/zabbix/zabbix_server.pid
5. Start zabbix-server
systemctl restart zabbix-server zabbix-agent nginx php7.4-fpm
systemctl enable zabbix-server zabbix-agent nginx php7.4-fpm
II. Installing Grafana
1. Add the Grafana repository
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
2. Install Grafana
sudo apt-get update
sudo apt-get install grafana=10.0.3
3. Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
4. Install the Zabbix plugin
grafana-cli plugins install alexanderzobnin-zabbix-app
sudo systemctl restart grafana-server
III. Installing zabbix-agent
1. Installation
1.1 CentOS 7.9
wget http://mirror.centos.org/centos/7/os/x86_64/Packages/pcre2-10.23-2.el7.x86_64.rpm
rpm -ivh pcre2-10.23-2.el7.x86_64.rpm
wget https://mirrors.tuna.tsinghua.edu.cn/zabbix/zabbix/6.0/rhel/7/x86_64/zabbix-agent-6.0.26-release1.el7.x86_64.rpm
rpm -ivh zabbix-agent-6.0.26-release1.el7.x86_64.rpm
1.2 Ubuntu 18.04 / 20.04 / 22.04
apt-get -y install libpcre2-8-0   # or: apt -y install libpcre2-8-0
The release package below is the Ubuntu 18.04 build; on 20.04 or 22.04, download the file whose name matches your release.
wget https://repo.zabbix.com/zabbix/6.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.0-4+ubuntu18.04_all.deb
dpkg -i zabbix-release_6.0-4+ubuntu18.04_all.deb
1.3 Debian 10 / Debian 11 / Debian 12
wget https://repo.zabbix.com/zabbix/6.0/debian/pool/main/z/zabbix-release/zabbix-release_6.0-1+debian10_all.deb
dpkg -i zabbix-release_6.0-1+debian10_all.deb
2. Configuration
File: /etc/zabbix/zabbix_agentd.conf
PidFile=/var/run/zabbix/zabbix_agentd.pid
LogFile=/var/log/zabbix-agent/zabbix_agentd.log
LogFileSize=0
Server=192.168.2.20
ServerActive=192.168.2.20
Include=/etc/zabbix/zabbix_agentd.conf.d/*.conf
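Before restarting the agent, it can help to confirm that the edited file still contains the settings listed above. The following is a minimal sketch, not part of the original procedure: `parse_agent_conf` and `missing_keys` are hypothetical helper names, and the list of "required" keys is an assumption based only on the settings shown in this section.

```python
def parse_agent_conf(text):
    """Parse 'Key=Value' lines, skipping blank lines and '#' comments.

    Keys such as Include may repeat, so every key maps to a list of values.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings.setdefault(key.strip(), []).append(value.strip())
    return settings


def missing_keys(settings, required=("Server", "ServerActive", "LogFile", "PidFile")):
    """Return the required keys (an assumed list) that are absent."""
    return [key for key in required if key not in settings]


# Sample text copied from the settings shown above.
sample = """\
PidFile=/var/run/zabbix/zabbix_agentd.pid
LogFile=/var/log/zabbix-agent/zabbix_agentd.log
LogFileSize=0
Server=192.168.2.20
ServerActive=192.168.2.20
Include=/etc/zabbix/zabbix_agentd.conf.d/*.conf
"""
conf = parse_agent_conf(sample)
print(conf["Server"])
print(missing_keys(conf))
```

To check a real host, read /etc/zabbix/zabbix_agentd.conf and pass its contents to `parse_agent_conf` instead of the sample string.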
3. Start zabbix-agent
systemctl restart zabbix-agent
systemctl status zabbix-agent
systemctl enable zabbix-agent
4. Open the firewall
4.1 CentOS 7.9
firewall-cmd --permanent --add-rich-rule 'rule family=ipv4 source address=192.168.2.20/24 port port=10050 protocol=tcp accept'
firewall-cmd --reload
4.2 Ubuntu 18.04 / 20.04 / 22.04 or Debian 10/11/12
ufw allow proto tcp from 192.168.2.20/32 to any port 10050
ufw status
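Once the rule is in place, the Zabbix server should be able to open a TCP connection to port 10050 on the agent host. A small sketch of such a check, run from the server side, might look like this; `port_open` is a hypothetical helper, and it only tests TCP reachability, not whether the agent answers Zabbix requests.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (192.168.2.50 is a made-up agent address):
# print(port_open("192.168.2.50", 10050))
```

A result of False with the agent running usually points at the firewall rule or at the Server= setting rather than at the agent itself, though this sketch cannot tell those cases apart.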
IV. Installing zabbix-agent2
1. Installation
wget https://repo.zabbix.com/zabbix/6.2/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.2-4%2Bubuntu22.04_all.deb
sudo dpkg -i zabbix-release_6.2-4+ubuntu22.04_all.deb
2. Configuration changes
2.1 Docker plugin configuration
File: /etc/zabbix/zabbix_agent2.d/plugins.d/docker.conf
Plugins.Docker.Endpoint=unix:///var/run/docker.sock
2.2 zabbix_agent2 configuration
File: /etc/zabbix/zabbix_agent2.conf
PidFile=/var/run/zabbix/zabbix_agent2.pid
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=0
Server=192.168.2.20
ServerActive=192.168.2.20
Hostname=cr10
Include=/etc/zabbix/zabbix_agent2.d/*.conf
PluginSocket=/run/zabbix/agent.plugin.sock
ControlSocket=/run/zabbix/agent.sock
Include=./zabbix_agent2.d/plugins.d/*.conf
UnsafeUserParameters=1
UserParameter=GPU.ID,/etc/zabbix/gpu_id.sh
UserParameter=GPU.Utilization[*],nvidia-smi -i $1 --query-gpu=utilization.$2 --format=csv,noheader,nounits
UserParameter=GPU.Memory[*],nvidia-smi -i $1 --query-gpu=memory.$2 --format=csv,noheader,nounits
UserParameter=GPU.Fan[*],nvidia-smi -i $1 --query-gpu=fan.speed --format=csv,noheader,nounits
UserParameter=GPU.Temp[*],nvidia-smi -i $1 --query-gpu=temperature.gpu --format=csv,noheader,nounits
UserParameter=GPU.Pid[*],nvidia-smi -i $1 --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader,nounits | awk -v OFS="," '{print "PID:"$(NF-2), "Process_name:"$(NF-1), "Used GPU Memory (MiB):"$NF}'
UserParameter=GPU.Idel.Utilization[*],/etc/zabbix/gpu_idel.sh $1
UserParameter=GPU.Process.count[*],nvidia-smi -i $1 --query-compute-apps=pid --format=csv,noheader | wc -l
UserParameter=GPU.Process,nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader,nounits |wc -l
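The `[*]` entries above are flexible user parameters: the values passed in the item key replace `$1`, `$2`, and so on in the command. The sketch below imitates that substitution so a key such as `GPU.Utilization[0,gpu]` can be previewed as a command line. It is an illustration only: `expand_user_parameter` is a hypothetical name, `$0` is not handled, and treating a missing parameter as an empty string is an assumption of this sketch.

```python
import re


def expand_user_parameter(command, params):
    """Substitute $1..$9 in command with the item key parameters.

    '$$N' is kept as a literal '$N' so tools such as awk still see it.
    References like $NF or $(NF-2) are left untouched.
    """
    def replace(match):
        if match.group(1):  # '$$N' escape: drop one dollar sign
            return "$" + match.group(2)
        index = int(match.group(2)) - 1
        return params[index] if index < len(params) else ""

    return re.sub(r"(\$)?\$([1-9])", replace, command)


command = "nvidia-smi -i $1 --query-gpu=utilization.$2 --format=csv,noheader,nounits"
print(expand_user_parameter(command, ["0", "gpu"]))
# nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader,nounits
```

This also shows why the awk expressions in the GPU.Pid entry are safe as written: `$NF` and `$(NF-2)` are not numbered references, so they reach awk unchanged.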
2.3 gpu_id script (/etc/zabbix/gpu_id.sh)
GPUS=(`nvidia-smi -L | awk -F ' |:' '{print $2}'`)
printf "\"{#GPU_ID}\":\"${GPUS[$i]}\"}"
if [ $i -lt $[$LENGTH-1] ];then
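The shell fragments above build low-level discovery output with one `{#GPU_ID}` entry per GPU. As a cross-check of the expected shape, here is a Python sketch of the same idea. It is not the original script: `gpu_discovery_json` is a hypothetical name, the sample text only imitates `nvidia-smi -L` output, and wrapping the list in a "data" key is an assumption about the format the template expects.

```python
import json
import re


def gpu_discovery_json(nvidia_smi_list_output):
    """Build discovery JSON from `nvidia-smi -L` style output.

    Each line is expected to start with 'GPU <index>:'.
    """
    gpu_ids = re.findall(r"^GPU (\d+):", nvidia_smi_list_output, flags=re.MULTILINE)
    return json.dumps({"data": [{"{#GPU_ID}": gpu_id} for gpu_id in gpu_ids]})


# Made-up sample in the style of `nvidia-smi -L`.
sample = (
    "GPU 0: Example GPU (UUID: GPU-00000000)\n"
    "GPU 1: Example GPU (UUID: GPU-11111111)\n"
)
print(gpu_discovery_json(sample))
# {"data": [{"{#GPU_ID}": "0"}, {"{#GPU_ID}": "1"}]}
```

Building the structure first and serializing it with `json.dumps` avoids the manual comma handling that the shell version needs between entries.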
2.4 gpu_idel script (/etc/zabbix/gpu_idel.sh)
#!/bin/bash
gpu_util=$(nvidia-smi -i "$1" --query-gpu=utilization.gpu --format=csv,noheader,nounits)
gpu_idle_util=$((100 - gpu_util))
echo "$gpu_idle_util"
2.5 Add the zabbix user to the docker group
usermod -aG docker zabbix
2.6 Start
systemctl restart zabbix-agent2
systemctl status zabbix-agent2
systemctl enable zabbix-agent2
3. Open the firewall
ufw allow proto tcp from 192.168.2.20/32 to any port 10050
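V. Zabbix UI operations
1. Adding hosts through automatic discovery
1.1 Create a host group
1.2 Create a discovery rule
1.3 Create a discovery action
In the Zabbix UI: "Configuration" > "Actions" > "Discovery actions" > "Create action"
Conditions: Discovery rule equals CR; Host IP equals 192.168.2.1-254
Operations: "Add host", "Add to host groups", "Link to templates", "Enable host", "Set host inventory mode"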
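The host IP condition uses the range 192.168.2.1-254, so only addresses in that span are added by the action. To check quickly whether a given machine would match, a sketch like the following can be used. `in_discovery_range` is a hypothetical helper that understands only the single `a.b.c.X-Y` form used here, not the full range syntax Zabbix accepts.

```python
import ipaddress


def in_discovery_range(address, range_spec="192.168.2.1-254"):
    """Check whether address falls inside a range written as 'a.b.c.X-Y'."""
    start_text, last_octet = range_spec.split("-")
    start = ipaddress.IPv4Address(start_text)
    prefix = start_text.rsplit(".", 1)[0]
    end = ipaddress.IPv4Address(f"{prefix}.{last_octet}")
    return start <= ipaddress.IPv4Address(address) <= end


print(in_discovery_range("192.168.2.20"))   # True
print(in_discovery_range("192.168.3.20"))   # False
```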
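2. Adding custom items
2.1 Create a template
2.2 Create a discovery rule
In the Zabbix UI: "Configuration" > "Templates" > select "GPU" > click "Discovery rules"
2.3 Create an item prototype
Key: GPU.Utilization[{#GPU_ID},gpu]
2.4 Create a trigger prototype
Name: GPU {#GPU_ID} Mem Utilization too High
Expression: min(/GPU/GPU.Utilization[{#GPU_ID},memory],5m)>=90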
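The trigger expression fires when the lowest memory utilization value seen in the last 5 minutes is at least 90, meaning the value stayed high for the whole window rather than spiking once. The sketch below restates that logic in Python for clarity. `trigger_fires` is a hypothetical function, and returning False when the window has no data is a choice made for this sketch, not a description of how Zabbix treats missing data.

```python
def trigger_fires(samples, now, window_seconds=300, threshold=90):
    """Return True when the minimum value in the last window is >= threshold.

    samples is a list of (unix_timestamp, value) pairs.
    """
    recent = [value for ts, value in samples if now - window_seconds < ts <= now]
    if not recent:
        return False  # sketch choice: no data in the window means no alert
    return min(recent) >= threshold


# Three readings one minute apart, all at or above 90.
print(trigger_fires([(0, 95), (60, 92), (120, 97)], now=120))  # True
# A single dip below 90 inside the window keeps the trigger quiet.
print(trigger_fires([(0, 95), (60, 50), (120, 97)], now=120))  # False
```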
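VI. WeChat robot alerts
1. Add a WeChat robot
Log in to WeChat, select any group chat, open the group settings, click "Add robot", then choose "Create a new robot"
2. Zabbix configuration
2.1 Add a media type
Script parameters: {ALERT.SUBJECT} {ALERT.MESSAGE}
2.2 Add a user group
2.3 Add a user
2.4 Add an action
In the Zabbix UI: "Configuration" > "Actions" > "Trigger actions" > "Create action"
Conditions: Value of tag GPU equals Idel; Trigger severity is greater than or equals Warning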
Operations:
Send to user groups: wx_alter
Subject: Fault {TRIGGER.STATUS}: {HOSTNAME1} "{TRIGGER.NAME}"
Alert time: {EVENT.DATE} {EVENT.TIME}
Recovery operations:
Send to user groups: wx_alter
Subject: Recovered {TRIGGER.STATUS}: {HOSTNAME1} "{TRIGGER.NAME}"
Alert time: {EVENT.DATE} {EVENT.TIME}
Recovery time: {EVENT.RECOVERY.DATE} {EVENT.RECOVERY.TIME}
2.5 WeChat alert script
File: /etc/zabbix/alertscripts/wx_alter.py
webhook_url = ''
if 'PROBLEM' in alert_subject:
"content": "# <font color=\"warning\">**%s**</font> \n"% (alert_subject) +
"content": "# <font color=\"info\">**%s**</font> \n"% (alert_subject) +
headers = {'Content-Type': 'application/json'}
response = requests.post(webhook_url, headers=headers, data=json.dumps(data))
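The fragments above pick a font color from the subject and post a JSON body to the robot webhook. A self-contained sketch of the message-building step, without the network call, could look like the following. `build_payload` is a hypothetical name, and the outer `msgtype`/`markdown` structure is an assumption about the robot's message format, since the original script shows only the `content` lines. The subject and message arrive as the two script parameters configured in the media type.

```python
import json


def build_payload(alert_subject, alert_message):
    """Build a markdown message body for the robot webhook.

    Subjects containing 'PROBLEM' use the 'warning' color, all others 'info'.
    """
    color = "warning" if "PROBLEM" in alert_subject else "info"
    content = '# <font color="%s">**%s**</font> \n' % (color, alert_subject) + alert_message
    return {"msgtype": "markdown", "markdown": {"content": content}}


# Made-up subject and message for illustration.
payload = build_payload('Fault PROBLEM: cr10 "GPU 0 Mem Utilization too High"',
                        "Alert time: 2024.01.01 12:00:00")
print(json.dumps(payload, ensure_ascii=False))
```

Because {TRIGGER.STATUS} expands to PROBLEM or OK, the color check depends on that macro being present in the subject template rather than on the surrounding wording. The resulting dictionary is what would be serialized with `json.dumps` and sent by `requests.post` to `webhook_url`.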