Install Monitoring Systems
We install ganglia in /opt/services instead of the normal location. This separates it from the OS install and lets us reinstall or upgrade the OS without interfering with installed services such as ganglia. We also keep /opt/services on a separate disk, which lets us replace the entire OS disk without touching installed services. Finally, we try to separate the local configuration of a service (ganglia-local) from the service itself (ganglia). This allows us to upgrade the service easily without having to reconfigure it to work the way the previous version did.
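The versioned-directory layout described above can be sketched as a small helper that repoints the unversioned symlink (ganglia) at a versioned install directory (ganglia-3.2.0) and leaves ganglia-local alone. The function name point_service_link and its arguments are illustrative assumptions, not something that exists in our installation.

```shell
# point_service_link BASE NAME VERSION
# Hypothetical helper: make BASE/NAME a symlink to BASE/NAME-VERSION.
# BASE/NAME-local (the local configuration) is never touched.
point_service_link() {
    base=$1
    name=$2
    version=$3
    target="$name-$version"

    # Refuse to create a dangling link.
    if [ ! -d "$base/$target" ]; then
        echo "missing directory: $base/$target" >&2
        return 1
    fi

    # -n replaces an existing symlink instead of descending into it.
    ( cd "$base" && ln -sfn "$target" "$name" )
}
```

For example, `point_service_link /opt/services ganglia 3.2.0` would leave /opt/services/ganglia pointing at ganglia-3.2.0. Upgrading would mean installing the new version next to the old one and calling the helper again with the new version number.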
Ganglia
Compiling Ganglia
-
Make an area for ganglia to live
-
mkdir -p /opt/services/ganglia-3.2.0
-
Ganglia requires libconfuse. Download libconfuse from
http://www.nongnu.org/confuse/
-
cd /tmp
tar xfvz /home/src/ganglia/src/confuse-2.7.tar.gz
cd confuse-2.7
./configure --prefix=/opt/services/ganglia-3.2.0 --enable-shared
make
make install
-
Ganglia also requires rrdtool for head nodes. Download rrdtool from
http://www.rrdtool.org/
-
cd /tmp
tar xfvz /home/src/ganglia/src/rrdtool-1.4.6.tar.gz
cd rrdtool-1.4.6
./configure --prefix=/opt/services/ganglia-3.2.0 --enable-shared
make
make install
-
Download ganglia from
http://ganglia.sourceforge.net/
-
cd /tmp
tar xfvz /home/src/ganglia/src/ganglia-3.2.0.tar.gz
cd ganglia-3.2.0
LDFLAGS="-L/opt/services/ganglia-3.2.0/lib" ./configure --prefix=/opt/services/ganglia-3.2.0 --with-libconfuse=/opt/services/ganglia-3.2.0 --with-gmetad
make
make install
(cd /opt/services ; ln -s ganglia-3.2.0 ganglia)
mkdir -p /opt/services/ganglia-local/bin
mkdir -p /opt/services/ganglia-local/etc/conf.d
mkdir -p /opt/services/ganglia-local/init.d
mkdir -p /opt/services/ganglia-local/lib64/ganglia/python_modules
On a 32-bit system, create lib instead of lib64.
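The lib versus lib64 choice can be made automatically. The following is a minimal sketch, assuming x86_64 is the only 64-bit architecture in use; the function name libdir_for_arch is hypothetical, and other 64-bit architectures would need to be added to the case pattern.

```shell
# libdir_for_arch [ARCH]
# Hypothetical helper: print "lib64" for x86_64 and "lib" otherwise.
# ARCH defaults to the output of `uname -m`.
libdir_for_arch() {
    case "${1:-$(uname -m)}" in
        x86_64) echo lib64 ;;
        *)      echo lib ;;
    esac
}
```

It could then be used as `mkdir -p /opt/services/ganglia-local/$(libdir_for_arch)/ganglia/python_modules`.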
-
Create client configuration file
-
cp gmond/gmond.conf /opt/services/ganglia-local/etc
cp /opt/services/ganglia/etc/conf.d/modpython.conf /opt/services/ganglia-local/etc/conf.d
Edit /opt/services/ganglia-local/etc/gmond.conf and at least set the name in the cluster block. You may also want to change the ports used in the three channel blocks.
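Setting the cluster name can also be scripted. This is a sketch only, assuming GNU sed (for -i) and a gmond.conf whose cluster block starts with "cluster {" at the beginning of a line and ends with "}" at the beginning of a line, as in the stock file. The function name set_cluster_name is hypothetical, and the new name must not contain "/" or "&".

```shell
# set_cluster_name CONF_FILE NEW_NAME
# Hypothetical helper: rewrite the name = "..." line inside the
# cluster { ... } block of a gmond.conf, leaving other blocks alone.
set_cluster_name() {
    conf_file=$1
    new_name=$2
    sed -i -e '/^cluster[[:space:]]*{/,/^}/ s/^\([[:space:]]*name[[:space:]]*=[[:space:]]*\)"[^"]*"/\1"'"$new_name"'"/' "$conf_file"
}
```

For example: `set_cluster_name /opt/services/ganglia-local/etc/gmond.conf Cluster`.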
-
Create client startup script
-
cp gmond/gmond.init /opt/services/ganglia-local/init.d/nrao-gmond
edit /opt/services/ganglia-local/init.d/nrao-gmond
-
Create server configuration file
-
cp gmetad/gmetad.conf /opt/services/ganglia-local/etc
Edit /opt/services/ganglia-local/etc/gmetad.conf and at least set the data_source to the name you set in gmond.conf,
e.g. data_source "Cluster" node1.example.edu:8649
Edit /opt/services/ganglia-local/etc/conf.d/modpython.conf and change params and include to reference ganglia-local,
e.g. params = "/opt/services/ganglia-local/lib64/ganglia/python_modules"
e.g. include('/opt/services/ganglia-local/etc/conf.d/*.pyconf')
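Because the data_source name in gmetad.conf is meant to match the cluster name in gmond.conf, a quick consistency check may be useful. The sketch below assumes the stock file layout (a "cluster {" block in gmond.conf, uncommented data_source lines in gmetad.conf); all three function names are hypothetical.

```shell
# Print the name set inside the cluster { ... } block of a gmond.conf.
gmond_cluster_name() {
    sed -n '/^cluster[[:space:]]*{/,/^}/ s/^[[:space:]]*name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Print the quoted name of every uncommented data_source line in a gmetad.conf.
gmetad_source_names() {
    sed -n 's/^[[:space:]]*data_source[[:space:]][[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# names_match GMOND_CONF GMETAD_CONF
# Succeeds if some data_source name equals the gmond cluster name.
names_match() {
    name=$(gmond_cluster_name "$1")
    [ -n "$name" ] && gmetad_source_names "$2" | grep -qxF -- "$name"
}
```

For example: `names_match /opt/services/ganglia-local/etc/gmond.conf /opt/services/ganglia-local/etc/gmetad.conf || echo "names differ"`.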
-
Create server startup script
-
cp gmetad/gmetad.init /opt/services/ganglia-local/init.d/nrao-gmetad
edit /opt/services/ganglia-local/init.d/nrao-gmetad
Make tarball to install on clients
cd /opt/services
tar cfvz ganglia_nrao_`uname -i`-3.2.0.tgz ganglia*
cp ganglia_nrao_`uname -i`-3.2.0.tgz /home/src/ganglia
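The tarball step can be wrapped in a function so that the archive is written directly to its destination and any older .tgz files lying in /opt/services are not swept up by the ganglia* glob. This is a sketch under those assumptions; make_ganglia_tarball is a hypothetical name, and the destination directory must be given as an absolute path.

```shell
# make_ganglia_tarball BASE_DIR DEST_DIR VERSION [ARCH]
# Hypothetical helper: archive BASE_DIR/ganglia* into
# DEST_DIR/ganglia_nrao_ARCH-VERSION.tgz and print the archive path.
# ARCH defaults to `uname -i`, as in the commands above.
make_ganglia_tarball() {
    base_dir=$1
    dest_dir=$2
    version=$3
    arch=${4:-$(uname -i)}
    name="ganglia_nrao_${arch}-${version}.tgz"

    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$base_dir" && tar -czf "$dest_dir/$name" --exclude='*.tgz' ganglia* ) || return 1
    echo "$dest_dir/$name"
}
```

For example: `make_ganglia_tarball /opt/services /home/src/ganglia 3.2.0`.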
Install Ganglia Client
cd /opt/services ; tar xfvz /home/src/ganglia/ganglia_nrao_`uname -i`-3.2.0.tgz
ln -s /opt/services/ganglia-local/init.d/nrao-gmond /etc/init.d
chkconfig --add nrao-gmond
/etc/init.d/nrao-gmond start
Install Infiniband (optional)
Download the InfiniBand network performance script from http://ganglia.info/gmetric/ and save it as
/opt/services/ganglia-local/bin/infin.py
and create a startup script to run it.
-
Edit infin.py and set the following
-
GMETRIC = '/opt/services/ganglia/bin/gmetric'
GMOND_CONF="/opt/services/ganglia-local/etc/gmond.conf"
Because we install ganglia in a non-standard location we had to edit infin.py to include the GMOND_CONF. I will attach our version to this page.
/opt/services/ganglia-local/bin/infin.py
ln -s /opt/services/ganglia-local/init.d/nrao-infin /etc/init.d
chkconfig --add nrao-infin
/etc/init.d/nrao-infin start
Install Disk Metrics (optional)
Download diskstats.py from https://github.com/ganglia/gmond_python_modules/pull/1/files and save it as
/opt/services/ganglia-local/lib64/ganglia/python_modules/diskstats.py
Download disk_gmetric.sh from http://ben.hartshorne.net/ganglia/
Save it as
/opt/services/ganglia-local/bin/disk_gmetric.sh
Then write a
/etc/init.d/nrao-disk_gmetric
which runs
/opt/services/ganglia-local/bin/disk_gmetric.sh
every 30 seconds
ln -s /opt/services/ganglia-local/init.d/nrao-disk_gmetric /etc/init.d
chkconfig --level 345 nrao-disk_gmetric on
/etc/init.d/nrao-disk_gmetric start
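The nrao-disk_gmetric script itself is not shown here. As a rough sketch of the loop such a script might start in the background, the hypothetical function run_periodically below runs a command, sleeps, and repeats. The max-runs argument exists only so the loop can be exercised in a test; 0 means run forever.

```shell
# run_periodically INTERVAL MAX_RUNS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND every INTERVAL seconds.
# MAX_RUNS=0 loops forever; any other value stops after that many runs.
run_periodically() {
    interval=$1
    max_runs=$2
    shift 2
    runs=0
    while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
        # Keep looping even if one run fails, but report it.
        "$@" || echo "warning: '$*' exited with status $?" >&2
        runs=$((runs + 1))
        sleep "$interval"
    done
}
```

The start action of an init script could then do something like `run_periodically 30 0 /opt/services/ganglia-local/bin/disk_gmetric.sh &` and record `$!` in a pid file so that the stop action can kill it.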
Installing Ganglia Server
-
Install the tarball made in the previous section
-
cd /opt/services
tar xfvz /home/src/ganglia/ganglia_nrao_`uname -i`-3.2.0.tgz
-
edit gmetad.conf and set the following
-
rrd_rootdir "/opt/services/ganglia-local/var/rrds"
Then make that directory
-
mkdir -p /opt/services/ganglia-local/var/rrds
chown nobody /opt/services/ganglia-local/var/rrds
-
Install the Apache web server (which is a task left up to the reader) and configure a virtual host for ganglia. Then
-
mkdir /opt/services/ganglia/www
cp -R ganglia-3.2.0/web/* /opt/services/ganglia/www
edit /opt/services/ganglia/www/conf.php and modify the following
-
$gmetad_root = "/opt/services/ganglia-local/var";
define("RRDTOOL", "/opt/services/ganglia/bin/rrdtool");
$time_ranges = array(
    'halfhour'=>1800,
    'hour'=>3600,
    '2hour'=>7200,
    '4hour'=>14400,
    '8hour'=>28800,
    'day'=>86400,
    'week'=>604800,
    'month'=>2419200,
    'year'=>31449600
);
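The values in $time_ranges are plain second counts. The sketch below reproduces them with shell arithmetic, which makes it visible that 'month' is four weeks (28 days) and 'year' is 52 weeks (364 days). The function name time_range_seconds is hypothetical and is not part of the ganglia web front end.

```shell
# time_range_seconds NAME
# Hypothetical helper: print the number of seconds for a range name
# used in conf.php, derived from minutes, hours, days and weeks.
time_range_seconds() {
    minute=60
    hour=$((60 * minute))
    day=$((24 * hour))
    week=$((7 * day))
    case "$1" in
        halfhour) echo $((30 * minute)) ;;
        hour)     echo "$hour" ;;
        2hour)    echo $((2 * hour)) ;;
        4hour)    echo $((4 * hour)) ;;
        8hour)    echo $((8 * hour)) ;;
        day)      echo "$day" ;;
        week)     echo "$week" ;;
        month)    echo $((4 * week)) ;;    # 28 days
        year)     echo $((52 * week)) ;;   # 364 days
        *)        echo "unknown range: $1" >&2; return 1 ;;
    esac
}
```

Adding another range to conf.php follows the same pattern: multiply out the seconds and add a new key to the array.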
-
Finally
-
mkdir -p /opt/services/ganglia-local/var/dwoo/
chown apache /opt/services/ganglia-local/var/dwoo/
Nagios
Installing a Client
This is only necessary if you need to monitor something that can only be checked locally on the client (such as a 3ware card or disk usage):
echo "nagios:x:1103:1103:nagios:/var/log/nagios:/bin/sh" >> /etc/passwd
echo 'nagios:!!:15280::::::' >> /etc/shadow
echo "nagios:x:1103:" >> /etc/group
echo "nagios ALL = NOPASSWD: /opt/services/nagios-local/plugins/check_3ware.sh" >> /etc/sudoers
sed -i -e 's/^Defaults.*requiretty/#Defaults requiretty/' /etc/sudoers
cd /opt/services ; tar xfvz /home/src/nagios/client/nagios-1.4.15-x86_64-nrao.tgz
edit
/opt/services/nagios-local/plugins/check_3ware.sh
and set TWCLI to the full path of the tw_cli program e.g. /opt/services/3ware/CLI/tw_cli
ln -s /opt/services/nrpe-local/init.d/nrao-nrpe /etc/init.d
chkconfig --add nrao-nrpe
/etc/init.d/nrao-nrpe start
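Appending to /etc/passwd, /etc/shadow, /etc/group and /etc/sudoers with >> adds a duplicate entry each time the client install is repeated. One way to guard against that is a small helper that appends a line only if the file does not already contain it. The function name append_once is hypothetical.

```shell
# append_once LINE FILE
# Hypothetical helper: append LINE to FILE unless an identical line
# is already present (exact, whole-line match).
append_once() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example: `append_once "nagios:x:1103:" /etc/group`.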
Nagios Front End (website) Administration
-
http://nagios.aoc.nrao.edu/
-
Login: admin
password: the admin passwd
Acknowledge Nagios Alerts
- Click on Tactical Overview in the left-side menu. Any alerts or issues will show up as red boxes.
- Click on the red box to see the details of the alert. If the problem is a service (e.g. http), the service will be highlighted.
- Click on the problem; you will get a list of options on the left.
- Click on the icon of the man shoveling, labeled Acknowledge Problem.
- Fill in the dialog box. This will prevent any further messages from being sent about this problem.
- Once the problem has been cleared, the system or service will automatically go back to its normal state.
Administrating our Nagios Server
Nagios lives on the server hugin in /opt/services/nagios. All configurations specific to aoc/nrao are contained in /opt/services/nagios/etc/nrao. Before any kind of monitoring, including services, can be done on a system you first must define the host.
Adding a host definition
Edit the file /opt/services/nagios/etc/nrao/nraohosts.cfg
and add a new entry like this
define host{
        use             linux-server
        host_name       hugin
        alias           nagios
        address         10.64.1.32
        }
If you want to specify a group for a server, add it to the appropriate hostgroup definition found near the bottom of the nraohosts.cfg file.
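When several hosts need to be added at once, the host entry can be generated rather than typed. The sketch below prints a stanza in the same shape as the hugin example, using the linux-server template. The function name nagios_host_def is hypothetical, and the output still has to be appended to nraohosts.cfg by hand or by redirection.

```shell
# nagios_host_def HOST_NAME ALIAS ADDRESS
# Hypothetical helper: print a Nagios host definition that uses the
# linux-server template, in the same layout as the existing entries.
nagios_host_def() {
    cat <<EOF
define host{
        use             linux-server
        host_name       $1
        alias           $2
        address         $3
        }
EOF
}
```

For example: `nagios_host_def hugin nagios 10.64.1.32 >> /opt/services/nagios/etc/nrao/nraohosts.cfg`.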
Adding monitoring for new service via nagios
If the service, like http, is already being monitored on existing servers and you just need to monitor it on a new server, then add the name to the existing definition in the nraoservices.cfg file:
define service{
        use                     generic-service         ;For monitor http services
        host_name               hugin, vivaldi, penn, gila, acorn, magnolia, occam, smrti, whatever
        service_description     http
        check_command           check_http
        }
If this is a new service, you may first need to define the nagios command that you will be using. A list of prebuilt commands can be found in
/opt/services/nagios/libexec
To define a new command, edit the file nraocommands.cfg and add something like this:
define command{
        command_name    check_cups
        command_line    $USER1$/check_http -H $HOSTADDRESS$ -p 631
        }
In this example we are using the check_http plugin to check the status of cups (port 631). Once the command is defined, you can add an entry for it in nraoservices.cfg.
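Before wiring a new command into nraoservices.cfg it can help to see what the command line looks like once Nagios has substituted its macros, so that the same command can be run by hand. The sketch below does that substitution for $USER1$ and $HOSTADDRESS$ only. The function name expand_nagios_macros is hypothetical, and passing /opt/services/nagios/libexec as the value of $USER1$ is an assumption based on where the prebuilt plugins live.

```shell
# expand_nagios_macros COMMAND_LINE USER1_DIR HOST_ADDRESS
# Hypothetical helper: replace $USER1$ and $HOSTADDRESS$ in a Nagios
# command_line with concrete values. The values must not contain "|" or "&".
expand_nagios_macros() {
    command_line=$1
    user1=$2
    host_address=$3
    printf '%s\n' "$command_line" |
        sed -e 's|\$USER1\$|'"$user1"'|g' \
            -e 's|\$HOSTADDRESS\$|'"$host_address"'|g'
}
```

For example, `expand_nagios_macros '$USER1$/check_http -H $HOSTADDRESS$ -p 631' /opt/services/nagios/libexec 10.64.1.32` prints the check_cups command as it would be run against that address.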
Restarting/Reloading nagios definitions
Once additions are made, the nagios configs need to be reloaded.
/etc/init.d/nrao-nagios reload
Nagios will check for configuration errors and will not reload if problems exist.
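To see the configuration errors themselves before attempting a reload, Nagios can be run with -v against its main configuration file. The sketch below runs that check and only reloads if it passes. The function name verify_and_reload is hypothetical, and the default paths /opt/services/nagios/bin/nagios and /opt/services/nagios/etc/nagios.cfg are assumptions about our layout that can be overridden through the variables shown.

```shell
# verify_and_reload
# Hypothetical helper: run the Nagios pre-flight check (nagios -v) and
# reload only if it succeeds. Paths below are assumed defaults.
verify_and_reload() {
    nagios_bin=${NAGIOS_BIN:-/opt/services/nagios/bin/nagios}
    nagios_cfg=${NAGIOS_CFG:-/opt/services/nagios/etc/nagios.cfg}
    reload_cmd=${RELOAD_CMD:-/etc/init.d/nrao-nagios reload}

    if "$nagios_bin" -v "$nagios_cfg" >/dev/null; then
        # Intentionally unquoted so the command and its argument split.
        $reload_cmd
    else
        echo "configuration check failed; not reloading" >&2
        return 1
    fi
}
```

When the check fails, running `/opt/services/nagios/bin/nagios -v /opt/services/nagios/etc/nagios.cfg` by hand (same path assumptions) prints the errors that need fixing.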