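This series records the steps the author followed to install and configure Nagios monitoring. All steps have been tested; corrections are welcome.
About Nagios:
Nagios is a free, open-source network monitoring tool. It can monitor the state of Windows, Linux, and Unix hosts, network devices such as switches and routers, printers, and more. When a system or service enters an abnormal state, it sends an email or SMS alert so that operations staff are notified immediately; once the state recovers, it sends a recovery notification by email or SMS.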
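This article explains in detail how to install the Nagios plugins and NRPE on a monitored Ubuntu 12.04 server and how to add that machine to the Nagios monitoring server.
Configuration of the Nagios monitoring server itself is covered in the previous article, "Installing and Configuring Nagios on Ubuntu".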
I. Preparation
1. Update the Ubuntu system
sudo apt-get update
sudo apt-get upgrade
2. Install the basic dependencies:
sudo apt-get install build-essential
sudo apt-get install libssl0.9.8 libssl-dev openssl  # openssl appears to be installed already
sudo apt-get install libgd2-noxpm libgd2-noxpm-dev
sudo apt-get install apache2  # prevents a "Connection refused" error from check_http
(Once the Nagios plugins are installed, you can test HTTP with: /usr/local/nagios/libexec/check_http -H 127.0.0.1
Failing result: Connection refused
HTTP CRITICAL - Unable to open TCP socket
After starting Apache with: service apache2 start
check again. Expected result: HTTP OK: HTTP/1.1 200 OK - 452 bytes in 0.001 second response time |time=0.001221s;;;0.000000 size=452B;;;0 )
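Nagios plugins such as check_http report their result through the process exit status as well as the printed text: 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN. The following is a minimal sketch for reading that status in a script; the helper name plugin_state is made up for illustration and is not part of Nagios.

```shell
# Map a Nagios plugin exit status to its state name.
# plugin_state is a hypothetical helper, not something shipped with Nagios.
plugin_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;   # 3 is UNKNOWN; this sketch also maps any other value to UNKNOWN
    esac
}

# Example use (requires the plugins to be installed):
#   /usr/local/nagios/libexec/check_http -H 127.0.0.1
#   plugin_state $?
```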
3. Download the required packages into /usr/local/src
wget http://prdownloads.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.15.tar.gz
wget http://prdownloads.sourceforge.net/sourceforge/nagios/nrpe-2.12.tar.gz
4. Add the nagios user and group
groupadd nagios
useradd -g nagios -s /usr/sbin/nologin nagios
5. Install the Nagios plugins on the monitored machine
tar zxvf nagios-plugins-1.4.15.tar.gz
cd nagios-plugins-1.4.15
./configure --with-nagios-user=nagios --with-nagios-group=nagios
make
make install
Change the owner and group of the nagios directory
chown -R nagios:nagios /usr/local/nagios/
6. Install NRPE on the monitored machine
tar zxvf nrpe-2.12.tar.gz
cd nrpe-2.12
./configure
Error: checking for SSL libraries... configure: error: Cannot find ssl libraries
Fix: create a symbolic link /usr/lib/libssl.so => /usr/lib/x86_64-linux-gnu/libssl.so:
ln -s /usr/lib/x86_64-linux-gnu/libssl.so /usr/lib/libssl.so
The path /usr/lib/x86_64-linux-gnu/libssl.so may be different on your system; you can look it up with the command whereis ssl. On 32-bit Ubuntu it may be /usr/lib/i386-linux-gnu/libssl.so
Then run again:
./configure
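If the location of libssl.so is unclear, it can also be searched for directly. This is a small sketch, assuming the library lives somewhere under /usr/lib; find_libssl is a hypothetical helper name, and the ln command in the comment is the same one used above.

```shell
# Print the first libssl.so found under the given directory (default: /usr/lib).
# find_libssl is a hypothetical helper for illustration.
find_libssl() {
    find "${1:-/usr/lib}" -name 'libssl.so' 2>/dev/null | head -n 1
}

# Example use (run as root):
#   lib=$(find_libssl)
#   [ -n "$lib" ] && ln -s "$lib" /usr/lib/libssl.so
```

The function prints nothing when no match is found, so check for an empty result before creating the link.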
Compile and install:
make all
make install-plugin
make install-daemon
make install-daemon-config
Change the owner and group of the nagios directory
chown -R nagios:nagios /usr/local/nagios/
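7. Edit the NRPE configuration file so that the monitoring server can reach NRPE on the monitored machine. By default, the NRPE configuration only allows the local machine to access the NRPE daemon.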
vi /usr/local/nagios/etc/nrpe.cfg
# The default is 127.0.0.1 (local access only); separate multiple IPs with commas
allowed_hosts=127.0.0.1,192.168.0.102
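The same change can be scripted instead of made by hand in vi. The sketch below rewrites the allowed_hosts line in place; set_allowed_hosts is a hypothetical helper name, and it assumes the file already contains an uncommented allowed_hosts= line, as the default nrpe.cfg does.

```shell
# Replace the allowed_hosts line in an NRPE config file.
# set_allowed_hosts is a hypothetical helper; $2 is a comma-separated IP list.
set_allowed_hosts() {
    cfg=$1
    hosts=$2
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file, then copy it back so the
    # original file keeps its owner and permissions.
    sed "s/^allowed_hosts=.*/allowed_hosts=$hosts/" "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    rm -f "$tmp"
}

# Example use (run as root; NRPE must be restarted afterwards):
#   set_allowed_hosts /usr/local/nagios/etc/nrpe.cfg 127.0.0.1,192.168.0.102
```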
Configure the commands (some may already be present):
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
command[check_http]=/usr/local/nagios/libexec/check_http -H 127.0.0.1 -w 5 -c 10
command[check_ping]=/usr/local/nagios/libexec/check_ping -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5
command[check_ssh]=/usr/local/nagios/libexec/check_ssh -4 127.0.0.1
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 30% -c 10%
(Note: make sure each command path is correct. If it is wrong, the web page may report "NRPE: Unable to read output", shown in yellow.)
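One way to catch a wrong command path before it shows up in the web page is to check that every plugin referenced in nrpe.cfg exists and is executable. This is a rough sketch; missing_plugins is a hypothetical helper name, and it only looks at uncommented command[...] lines whose plugin path contains no spaces.

```shell
# List NRPE commands whose plugin file is missing or not executable.
# Output: one "command_name plugin_path" pair per problem entry.
# missing_plugins is a hypothetical helper for illustration.
missing_plugins() {
    sed -n 's/^command\[\([^]]*\)\]=\([^ ]*\).*/\1 \2/p' "$1" |
    while read -r name path; do
        [ -x "$path" ] || echo "$name $path"
    done
}

# Example use:
#   missing_plugins /usr/local/nagios/etc/nrpe.cfg
```

No output means every configured plugin path points at an executable file.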
8. Verify NRPE:
Restart NRPE:
killall nrpe
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
Check from the monitored machine: /usr/local/nagios/libexec/check_nrpe -H localhost
Check from the monitoring server: /usr/local/nagios/libexec/check_nrpe -H <monitored-machine-IP>
On success, the NRPE version is returned: NRPE v2.12
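The check above can be wrapped in a small test that looks for the version banner. This is only a sketch: nrpe_ok is a hypothetical helper name, and it assumes a healthy daemon answers with text beginning "NRPE v", as in the output shown above.

```shell
# Succeed if the given check_nrpe output looks like a version banner.
# nrpe_ok is a hypothetical helper for illustration.
nrpe_ok() {
    case "$1" in
        "NRPE v"*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Example use (requires check_nrpe to be installed):
#   out=$(/usr/local/nagios/libexec/check_nrpe -H localhost)
#   nrpe_ok "$out" && echo "NRPE is answering: $out"
```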
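9. Add the items to monitor on the monitored machine to the Nagios configuration on the monitoring server
Create the monitored machine's configuration file, linuxmachine1.cfg, based on the standard localhost.cfg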
mkdir -p /usr/local/nagios/etc/machines
cp /usr/local/nagios/etc/objects/localhost.cfg /usr/local/nagios/etc/machines/linuxmachine1.cfg
vi /usr/local/nagios/etc/machines/linuxmachine1.cfg
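Edit it to read as follows, changing the host-specific values (host name, alias, address, hostgroup, and check commands):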
# Define a host for the machine
define host{
use linux-server ; Name of host template to use
host_name linux-machine1
alias linux-machine1
address 192.168.0.103
}
# Define a hostgroup for Linux machines
define hostgroup{
hostgroup_name linux-machines-group1 ; The name of the hostgroup
alias Linux Machines Group1 ; Long name of the group
members linux-machine1 ; Comma separated list of hosts that belong to this group
}
# SERVICE DEFINITIONS
# Define a service to "ping" the target machine
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description PING
check_command check_nrpe!check_ping
}
# Define a service to check the disk space of the root partition
# Warning if < 20% free, critical if
# < 10% free space on partition.
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description Root Partition
check_command check_nrpe!check_disk
}
# Define a service to check the number of currently logged in users
# Warning if > 5 users, critical
# if > 10 users (thresholds are set in nrpe.cfg on the monitored machine).
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description Current Users
check_command check_nrpe!check_users
}
# Define a service to check the number of currently running procs
# Warning if > 150 processes, critical if
# > 200 processes (thresholds are set in nrpe.cfg on the monitored machine).
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description Total Processes
check_command check_nrpe!check_procs
}
# Define a service to check the load on the machine.
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description Current Load
check_command check_nrpe!check_load
}
# Define a service to check the swap usage on the machine.
# Critical if less than 10% of swap is free, warning if less than 30% is free
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description Swap Usage
check_command check_nrpe!check_swap
}
# Define a service to check SSH on the machine.
# Disable notifications for this service by default, as not all users may have SSH enabled.
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description SSH
check_command check_nrpe!check_ssh
notifications_enabled 0
}
# Define a service to check HTTP on the machine.
# Disable notifications for this service by default, as not all users may have HTTP enabled.
define service{
use generic-service ; Name of service template to use
host_name linux-machine1
service_description HTTP
check_command check_nrpe!check_http
notifications_enabled 0
}
Save and exit, then add the file's path to the Nagios configuration file /usr/local/nagios/etc/nagios.cfg
vi /usr/local/nagios/etc/nagios.cfg
Add: cfg_file=/usr/local/nagios/etc/machines/linuxmachine1.cfg
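The service definitions above all use check_nrpe!..., which only works if a check_nrpe command is defined on the monitoring server. The previous article may already have added it; if not, the sketch below appends the commonly used definition to a command file such as /usr/local/nagios/etc/objects/commands.cfg. The helper name ensure_check_nrpe is hypothetical, and the target file is an assumption about your layout.

```shell
# Append a check_nrpe command definition to a Nagios config file unless one
# is already present. ensure_check_nrpe is a hypothetical helper.
ensure_check_nrpe() {
    cfg=$1
    if grep -q 'command_name[[:space:]]*check_nrpe' "$cfg" 2>/dev/null; then
        return 0    # already defined, nothing to do
    fi
    # Quoted heredoc: $USER1$, $HOSTADDRESS$ and $ARG1$ are Nagios macros
    # and must reach the file unexpanded.
    cat >> "$cfg" <<'EOF'

define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
EOF
}

# Example use (run as root on the monitoring server):
#   ensure_check_nrpe /usr/local/nagios/etc/objects/commands.cfg
```

Here $ARG1$ receives the text after the "!" in each check_command, for example check_load, which NRPE then looks up in nrpe.cfg on the monitored machine.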
Add the contact who will receive notifications for linux-machines-group1
vi /usr/local/nagios/etc/objects/contacts.cfg
Change the nagiosadmin definition to:
define contact{
contact_name nagiosuser1 ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user
email xxx@163.com ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
Change the contactgroup as follows:
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagiosuser1
}
10. Configuration is complete; verify that it contains no errors
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
If there are no errors, restart Nagios
killall nagios
/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Check the running status: /usr/local/nagios/bin/nagiostats
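To avoid stopping Nagios when the new configuration is broken, the verify and restart commands can be chained so the restart only happens after a clean check. This is a sketch built from the commands used in this step; restart_if_valid and the NAGIOS_BIN / NAGIOS_CFG variables are hypothetical names introduced for illustration.

```shell
# Paths used in this article; override them if your install differs.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}

# Restart Nagios only when the configuration check passes.
# restart_if_valid is a hypothetical helper.
restart_if_valid() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        killall nagios 2>/dev/null      # ignore "no process found"
        "$NAGIOS_BIN" -d "$NAGIOS_CFG"
    else
        echo "configuration check failed; Nagios was not restarted" >&2
        return 1
    fi
}

# Example use (run as root):
#   restart_if_valid
```

When the check fails, run the -v command by hand to see the full error report.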
11. Restart apache2 and check the web page
service apache2 restart
Open http://<nagios-server-ip>/nagios, log in with the user name nagiosuser1 and its password, and check the page.