Monitoring the performance of hosts in your cluster is a rather tedious task. I've seen various programs that gather system usage information with handmade methods, such as filtering the output of commands or reading fields from files in the '/proc' directory. Recently I had to write a program to retrieve information such as CPU load, disk usage and memory usage from a bunch of hosts, and I found SNMP to be a good candidate for the task.
The Simple Network Management Protocol (SNMP) is an application layer protocol that facilitates the exchange of management information between network devices. It is part of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite. SNMP enables network administrators to manage network performance, find and solve network problems, and plan for network growth.
In the following sections I'll introduce what I've found about SNMP, including installation, configuration of the agent, and how to use the manager to pull information from the agent.
1. Installation
On my Ubuntu box a single command installed everything:
$ sudo aptitude install snmp snmpd
On other platforms this should be easy, too.
2. Configuration
Edit the configuration file of the SNMP agent:
$ sudo vi /etc/snmp/snmpd.conf
Change the first step of the access control section from:
# sec.name source community
com2sec paranoid default public
#com2sec readonly default public
#com2sec readwrite default private
to:
# sec.name source community
com2sec paranoid localhost public
com2sec readonly 10.0.0.0/16 public
#com2sec readwrite default private
This allows hosts on the local network 10.0.0.0/16 to access this box's information in read-only mode.
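If you are unsure which addresses a source range such as 10.0.0.0/16 covers, the check can be sketched in plain shell arithmetic. This is only an illustration: ip_to_int and in_network are hypothetical helper names of mine, not part of net-snmp, and the sketch assumes well-formed IPv4 addresses without leading zeros.

```shell
#!/bin/sh
# Hypothetical helpers (not part of net-snmp) for checking a "source" range.

# ip_to_int ADDRESS: convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_network ADDRESS NETWORK PREFIX_LENGTH
# Succeeds when ADDRESS and NETWORK share the first PREFIX_LENGTH bits.
in_network() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}
```

For example, `in_network 10.0.12.4 10.0.0.0 16` succeeds, while `in_network 10.1.0.1 10.0.0.0 16` fails because the second octet differs.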
Then restart the service:
$ sudo /etc/init.d/snmpd restart
and you can test it with:
$ snmpwalk -v 2c -c public localhost system
SNMPv2-MIB::sysDescr.0 = STRING: Linux linkup 2.6.22-14-generic #1 SMP Sun Oct 14 23:05:12 GMT 2007 i686
SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.10
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1098742) 3:03:07.42
SNMPv2-MIB::sysContact.0 = STRING: Root <root@localhost> (configure /etc/snmp/snmpd.local.conf)
...
However, we're not finished yet. If you issue the following command, which looks identical to the previous one, you may be surprised by the result (substitute your own IP address for 10.0.12.4):
$ snmpwalk -v 2c -c public 10.0.12.4 system
Timeout: No Response from 10.0.12.4
This happens because we didn't specify an IP address to listen on. By default, snmpd listens on the loopback interface. To change this, append the following line to the configuration file:
agentaddress 10.0.12.4:161
Restart the service, and the problem is solved.
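When an agent times out like this, a quick look at which addresses it is configured to listen on can save some guessing. Here is a minimal sketch; listen_addresses is a hypothetical helper name of mine, and I am assuming the directive may be spelled in any letter case (for example agentAddress), so the match ignores case.

```shell
#!/bin/sh
# Hypothetical helper: print the value of every agentaddress directive found
# in an snmpd.conf-style file read from standard input.
# Commented-out lines are skipped because their first field starts with "#".
listen_addresses() {
    awk 'tolower($1) == "agentaddress" { print $2 }'
}
```

Run it as `listen_addresses < /etc/snmp/snmpd.conf`. If it prints nothing, no listening address has been set explicitly in that file.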
3. Reading information
Once the agent is set up on all your nodes, you'll want to gather information from them. The 'snmp' package contains a bunch of tools for sending SNMP requests to the agent and receiving the responses. For example:
'snmpwalk' retrieves a subtree of management values using SNMP GETNEXT requests. This is useful for exploring an agent and debugging your scripts.
'snmpget' communicates with a network entity using SNMP GET requests. Use it to retrieve the specific values you need.
All of these commands need an OID argument. Search for 'SNMP MIB' with your favourite search engine to learn more about OIDs.
To retrieve the CPU load, memory usage and disk usage from hosts, you need to find the corresponding OIDs. I found that my Linux box returns nothing for the CPU load OID (.1.3.6.1.2.1.25.3.3.1.2), so I have to get it from the ucdavis MIB (UCD-SNMP-MIB) instead. Also note that hrStorageTable (.1.3.6.1.2.1.25.2.3) provides both memory and disk usage. Issue the following command and check which indexes correspond to your physical and virtual memory:
$ snmpwalk -v 2c -c public 10.0.12.4 .1.3.6.1.2.1.25.2.3.1.3
HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Memory Buffers
HOST-RESOURCES-MIB::hrStorageDescr.2 = STRING: Real Memory
HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Swap Space
HOST-RESOURCES-MIB::hrStorageDescr.4 = STRING: /
HOST-RESOURCES-MIB::hrStorageDescr.5 = STRING: /sys
HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: /boot
HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: /home
HOST-RESOURCES-MIB::hrStorageDescr.8 = STRING: /media/sda1
HOST-RESOURCES-MIB::hrStorageDescr.9 = STRING: /media/sda5
HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: /media/sda6
HOST-RESOURCES-MIB::hrStorageDescr.11 = STRING: /sys/kernel/security
The result indicates that index 2 is my physical memory and index 3 is my swap space.
The following example script retrieves the CPU load, memory usage and disk usage from hosts:
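Reading the indexes off by eye gets old quickly on many hosts, so the lookup can be scripted. This is a minimal sketch, assuming the walk output has exactly the "NAME.INDEX = STRING: DESCRIPTION" shape shown above; storage_index is a hypothetical helper name of mine, not a net-snmp tool.

```shell
#!/bin/sh
# Hypothetical helper: read hrStorageDescr walk output on standard input and
# print the index of every entry whose description equals $1 exactly.
storage_index() {
    awk -v want="$1" '
        {
            # Locate the separator between the OID name and the description.
            sep = index($0, " = STRING: ")
            if (sep == 0) next
            name  = substr($0, 1, sep - 1)
            descr = substr($0, sep + 11)   # 11 = length of " = STRING: "
            if (descr == want) {
                # The index is the last dot-separated component of the name.
                n = split(name, parts, ".")
                print parts[n]
            }
        }'
}
```

For example, `snmpwalk -v 2c -c public 10.0.12.4 .1.3.6.1.2.1.25.2.3.1.3 | storage_index "Real Memory"` would print 2 for the output above. The match is exact, so "/" does not also match "/home".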
#!/bin/sh
# Usage: pass the hosts to query as arguments. Each host needs a settings
# file with the same name in the directory containing this script.
work_dir=$(dirname "$0")
cpu_load_mib=.1.3.6.1.2.1.25.3.3.1.2         # hrProcessorLoad
storage_unit_mib=.1.3.6.1.2.1.25.2.3.1.4     # hrStorageAllocationUnits
storage_size_mib=.1.3.6.1.2.1.25.2.3.1.5     # hrStorageSize
storage_used_mib=.1.3.6.1.2.1.25.2.3.1.6     # hrStorageUsed
cpu_load_ucd_mib=.1.3.6.1.4.1.2021.10.1.5.1  # laLoadInt.1 (1-minute load average x 100)
for host; do
    # Load the per-host settings
    . "$work_dir/$host"
    echo "cpu usage"
    cpu_load=0
    if [ "$platform" = 'win32' ]; then
        # Average the load over the listed processors
        cpu_count=0
        for i in $cpu_indexes; do
            load=$(snmpget -v $snmp_version -c $snmp_community $host \
                $cpu_load_mib.$i | cut -d ' ' -f 4)
            cpu_count=$(($cpu_count+1))
            cpu_load=$(($cpu_load+$load))
        done
        cpu_load=$(($cpu_load/$cpu_count))
    else
        cpu_load=$(snmpget -v $snmp_version -c $snmp_community $host \
            $cpu_load_ucd_mib | cut -d ' ' -f 4)
    fi
    echo $cpu_load
    echo "memory usage"
    mem_size=0
    mem_used=0
    for i in $memory_indexes; do
        unit=$(snmpget -v $snmp_version -c $snmp_community $host \
            $storage_unit_mib.$i | cut -d ' ' -f 4)
        size=$(snmpget -v $snmp_version -c $snmp_community $host \
            $storage_size_mib.$i | cut -d ' ' -f 4)
        used=$(snmpget -v $snmp_version -c $snmp_community $host \
            $storage_used_mib.$i | cut -d ' ' -f 4)
        # Totals are in KB; multiply first so units below 1024 bytes
        # are not truncated to zero by the integer division
        mem_size=$(($mem_size+$unit*$size/1024))
        mem_used=$(($mem_used+$unit*$used/1024))
    done
    echo $mem_size $mem_used
    echo "disk usage"
    disk_size=0
    disk_used=0
    for i in $disk_indexes; do
        unit=$(snmpget -v $snmp_version -c $snmp_community $host \
            $storage_unit_mib.$i | cut -d ' ' -f 4)
        size=$(snmpget -v $snmp_version -c $snmp_community $host \
            $storage_size_mib.$i | cut -d ' ' -f 4)
        used=$(snmpget -v $snmp_version -c $snmp_community $host \
            $storage_used_mib.$i | cut -d ' ' -f 4)
        disk_size=$(($disk_size+$unit*$size/1024))
        disk_used=$(($disk_used+$unit*$used/1024))
    done
    echo $disk_size $disk_used
done
Here are two sample configuration files for the script above:
$ cat 10.0.15.141
platform=win32
snmp_version=2c
snmp_community=public
cpu_indexes='2 3 4 5'
memory_indexes='8'
disk_indexes='2 3 4'
$ cat 10.0.15.99
platform=linux
snmp_version=2c
snmp_community=public
cpu_indexes=''
memory_indexes='2'
disk_indexes='4'
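The size and used totals are easier to compare across hosts as percentages. Here is a minimal sketch of the arithmetic, assuming integer values as returned by snmpget; snmp_value, to_kb and usage_percent are hypothetical helper names of mine, and the sample line in the comment is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the names are mine, not part of net-snmp.

# snmp_value: extract the value from one line of snmpget output on stdin,
# e.g. "HOST-RESOURCES-MIB::hrStorageSize.2 = INTEGER: 1000" gives 1000.
snmp_value() {
    cut -d ' ' -f 4
}

# to_kb UNIT COUNT: convert COUNT allocation units of UNIT bytes to KB.
# Multiplying before dividing keeps units smaller than 1024 bytes from
# being truncated to zero by integer division.
to_kb() {
    echo $(( $1 * $2 / 1024 ))
}

# usage_percent SIZE USED: integer percentage of SIZE that is in use.
usage_percent() {
    if [ "$1" -gt 0 ]; then
        echo $(( $2 * 100 / $1 ))
    else
        echo 0   # avoid dividing by zero for empty storage entries
    fi
}
```

For example, `usage_percent 2000 500` prints 25, and `to_kb 512 4` prints 2 where dividing the unit by 1024 first would have given 0.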
4. Other tools
tkmib is useful for inspecting the MIB tree. On Ubuntu you can install it with aptitude.
MRTG monitors SNMP network devices and draws pretty pictures showing how much traffic has passed through each interface.
Cacti is often regarded as the best graphing front-end for SNMP networks and as a replacement for MRTG.