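This is a complete walkthrough of installing ganglia and configuring it to monitor hadoop, together with a record of the problems encountered during installation and how they were solved. The ganglia version is 3.6 and the hadoop version is cdh5.
Preparation
Everything in this section must be installed on every node, both the gmetad node and the gmond nodes.
1. Install the dependencies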
- yum -y install apr-devel apr-util check-devel cairo-devel pango-devel libxml2-devel rpm-build glib2-devel dbus-devel freetype-devel fontconfig-devel gcc-c++ expat-devel python-devel libXrender-devel
- yum -y install php-gd.x86_64
- yum -y install rrdtool.x86_64
- yum -y install rrdtool-devel.x86_64
- wget http://jaist.dl.sourceforge.net/project/expat/expat/2.1.0/expat-2.1.0.tar.gz
- tar -xzf expat-2.1.0.tar.gz && cd expat-2.1.0
- ./configure --prefix=/usr/local/expat
- make
- make install
- mkdir /usr/local/expat/lib64
- cp -a /usr/local/expat/lib/* /usr/local/expat/lib64/
- wget http://ftp.twaren.net/Unix/NonGNU//confuse/confuse-2.7.tar.gz
- tar -xzf confuse-2.7.tar.gz && cd confuse-2.7
- ./configure CFLAGS=-fPIC --disable-nls --prefix=/usr/local/confuse
- make
- make install
- mkdir -p /usr/local/confuse/lib64
- cp -a -f /usr/local/confuse/lib/* /usr/local/confuse/lib64/
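The lib-to-lib64 copy is done twice above, once for expat and once for confuse, presumably so that a 64-bit build that looks under lib64 finds the libraries. A minimal sketch of that step as a reusable function follows; the name mirror_lib64 is made up for illustration, and GNU cp is assumed.

```shell
# mirror_lib64 PREFIX: copy everything under PREFIX/lib into PREFIX/lib64,
# mirroring the manual mkdir + cp steps. (Hypothetical helper, not part of ganglia.)
mirror_lib64() {
    prefix="$1"
    # Refuse to continue if the library was not installed under this prefix.
    [ -d "$prefix/lib" ] || { echo "missing $prefix/lib" >&2; return 1; }
    mkdir -p "$prefix/lib64"
    # "lib/." copies the directory contents, preserving links and attributes.
    cp -a -f "$prefix/lib/." "$prefix/lib64/"
}

# Usage, after "make install" of each library:
#   mirror_lib64 /usr/local/expat
#   mirror_lib64 /usr/local/confuse
```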
Installing gmetad
1. Install gmetad
- ./configure --with-gmetad --enable-gexec --with-libconfuse=/usr/local/confuse --with-libexpat=/usr/local/expat --prefix=/usr/local/ganglia --sysconfdir=/etc/ganglia --with-libpcre=no
- make
- make install
- mkdir -p /var/lib/ganglia/rrds
- mkdir -p /var/lib/ganglia/dwoo
- chown -R root:root /var/lib/ganglia
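- Add data_source entries to gmetad.conf (under /etc/ganglia, given the --sysconfdir above) as needed. Each data_source corresponds to one monitored cluster.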
- data_source "optest" host:port
- cp -f gmetad/gmetad.init /etc/init.d/gmetad
- cp -f /usr/local/ganglia/sbin/gmetad /usr/sbin/gmetad
- service gmetad start
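A data_source line names a cluster and then lists one or more host:port addresses of gmond nodes in that cluster. The sketch below only formats such a line; the function name and the hosts node01 and node02 are hypothetical, and port 8649 is taken from the gmond configuration used in this post.

```shell
# data_source_line NAME HOST:PORT [HOST:PORT ...]
# Prints one data_source line for gmetad.conf. (Hypothetical helper.)
data_source_line() {
    name="$1"; shift
    [ "$#" -ge 1 ] || { echo "need at least one host:port" >&2; return 1; }
    printf 'data_source "%s"' "$name"
    for addr in "$@"; do
        printf ' %s' "$addr"
    done
    printf '\n'
}

# Example with made-up hosts:
data_source_line optest node01:8649 node02:8649
# prints: data_source "optest" node01:8649 node02:8649
```

Listing more than one address for the same cluster gives gmetad an alternative node to poll if the first one is unreachable.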
Installing gmond
1. Install gmond:
- ./configure --enable-gexec --with-libconfuse=/usr/local/confuse --with-libexpat=/usr/local/expat --prefix=/usr/local/ganglia --sysconfdir=/etc/ganglia --with-libpcre=no
- make
- make install
- cp -f gmond/gmond.init /etc/init.d/gmond
- cp -f /usr/local/ganglia/sbin/gmond /usr/sbin/gmond
- gmond --default_config > /etc/ganglia/gmond.conf
- Configure /etc/ganglia/gmond.conf. The lines marked "# changed" are the ones that need editing. Then copy the finished conf file to every node.
globals {
  daemonize = yes
  setuid = yes
  user = root    # changed
  ......
cluster {
  name = "test"    # changed
  owner = "root"    # changed
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  #bind_hostname = yes    # changed (commented out)
  #mcast_join = 239.2.11.71    # changed (commented out)
  host = your.host    # changed
  port = 8649    # changed
  ttl = 1
}
udp_recv_channel {
  #mcast_join = 239.2.11.71    # changed (commented out)
  port = 8649    # changed
  #bind = 239.2.11.71    # changed (commented out)
  #retry_bind = true    # changed (commented out)
  # buffer = 10485760
}
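Copying the finished gmond.conf to every node can be scripted. The following is a sketch only: push_conf and the DRY_RUN switch are invented for illustration, the node name slave02 is a placeholder, and it assumes root SSH access to each node.

```shell
# push_conf FILE HOST...: copy FILE to the same path on each host via scp.
# With DRY_RUN=1 the scp commands are only printed, not executed.
# (Hypothetical helper; assumes root login over SSH is allowed.)
push_conf() {
    conf="$1"; shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp $conf root@$host:$conf"
        else
            scp "$conf" "root@$host:$conf" || return 1
        fi
    done
}

# Placeholder node names; replace with the real cluster hosts.
DRY_RUN=1 push_conf /etc/ganglia/gmond.conf slave01 slave02
```

Running it once with DRY_RUN=1 shows exactly which files would be overwritten before anything is copied.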
4. Start the service:
- service gmond start
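Once gmond is running, one way to check that nodes are reporting is to read the XML that gmond serves over TCP and count the hosts in it. This sketch assumes the tcp_accept_channel from the default config is still listening on port 8649 and that nc is installed; count_hosts is a made-up helper name.

```shell
# count_hosts: read gmond XML on stdin and print the number of <HOST ...> elements.
# (Hypothetical helper; the trailing space in the pattern avoids matching <HOSTS.)
count_hosts() {
    grep -o '<HOST ' | wc -l | tr -d ' '
}

# Usage on a node running gmond (assumes nc is available):
#   nc localhost 8649 | count_hosts
```

With the unicast setup in this post, the node that the other gmond instances send to should report every host in the cluster, while the sending nodes may only list themselves.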
Configuring ganglia web
1. Start Apache:
- service httpd start
3. Change the user:
- vi Makefile
- apache_user = apache
- make install
- service httpd restart
- service gmetad restart
Configuring ganglia to monitor hadoop
1. Configure hadoop's hadoop-metrics2.properties file and distribute it to the hadoop cluster.
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.period=10
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
#namenode
namenode.sink.ganglia.servers=gmetad.server:8649
#resourcemanager
resourcemanager.sink.ganglia.servers=gmetad.server:8649
maptask.sink.ganglia.servers=gmetad.server:8649
reducetask.sink.ganglia.servers=gmetad.server:8649
#datanode
datanode.sink.ganglia.servers=gmetad.server:8649
#nodemanager
nodemanager.sink.ganglia.servers=gmetad.server:8649
#maptask.sink.ganglia.servers=gmetad.server:8649
#reducetask.sink.ganglia.servers=gmetad.server:8649
2. Restart the hadoop cluster.
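A missing colon between host and port in a servers entry is easy to overlook and leaves that daemon without a usable sink address. Before distributing the file, a quick check along these lines may help. It is a sketch: bad_servers is a made-up name, and it assumes each servers value is a comma-separated list of host:port pairs.

```shell
# bad_servers FILE: print every uncommented "*.servers=" line whose value
# is not of the form host:port[,host:port...]. Empty output means no problem found.
# (Hypothetical helper for checking hadoop-metrics2.properties.)
bad_servers() {
    grep -E '^[^#]*\.servers=' "$1" |
        grep -Ev '=[^:,=]+:[0-9]+(,[^:,=]+:[0-9]+)*[[:space:]]*$'
}

# Usage (the path depends on the hadoop installation):
#   bad_servers hadoop-metrics2.properties
```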
Troubleshooting
1. After gmetad starts, "service gmetad status" reports "gmetad dead but subsys locked".
Fix:
chown -R nobody:nobody /var/lib/ganglia
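To confirm whether ownership is the problem before changing it, the owner of the data directory can be compared with the user gmetad runs as. The sketch below assumes that user is nobody, as the fix above implies, and relies on GNU stat; the function names are made up.

```shell
# owner_uid DIR: print the numeric uid that owns DIR (GNU stat).
owner_uid() {
    stat -c '%u' "$1"
}

# owned_by DIR USER: succeed if DIR is owned by USER.
# (Hypothetical helpers; USER must exist on the system.)
owned_by() {
    [ "$(owner_uid "$1")" = "$(id -u "$2")" ]
}

# Usage:
#   owned_by /var/lib/ganglia nobody || chown -R nobody:nobody /var/lib/ganglia
```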
2. Graphs do not display
Fix:
The likely cause is that php-gd is not installed; install it with: yum install php-gd.x86_64. Alternatively, the rrdtool path in conf_default.php may be wrong.
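3. As soon as hadoop metrics are configured and started, all monitoring data stops working.
Fix:
ganglia works by aggregating data from the gmond on each node into a main gmond, and gmetad collects the data from that gmond. So the following approaches help:
(1) On one node, change debug_level in gmond.conf, restart gmond, and read the debug output to troubleshoot.
(2) Check the gmond and gmetad log output in /var/log/messages and fix the reported errors one by one.
In this case I fixed each of the errors reported in messages, after which the hadoop ganglia monitoring worked normally.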
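Since /var/log/messages mixes output from every service, filtering it down to the two ganglia daemons makes the errors easier to work through. A minimal sketch, with a made-up function name:

```shell
# ganglia_log_lines FILE: print the lines of a syslog file that mention gmond or gmetad.
# (Hypothetical helper.)
ganglia_log_lines() {
    grep -E '(gmond|gmetad)' "$1"
}

# Usage: show the most recent ganglia-related entries.
#   ganglia_log_lines /var/log/messages | tail -n 50
```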
4. [PYTHON] Can't open the python module path /etc/ganglia/lib64/ganglia/python_modules. Module python_module
Fix:
mkdir -p /usr/local/ganglia/lib64/ganglia/python_modules
5. RRD_create: creating '/var/lib/ganglia/rrds/test/slave01/rpc.RetryCache/NameNodeRetryCache.CacheCleared.rrd': No such file or directory
Fix:
Manually create the missing directory (/rpc.RetryCache) and change its owner to nobody.
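The directory to create can be derived from the .rrd path in the error message. The following is a sketch with a made-up function name; it assumes a group with the same name as the owner exists, as with nobody:nobody above.

```shell
# ensure_rrd_parent RRD_FILE [OWNER]: create the directory that should contain
# RRD_FILE and, if OWNER is given, chown that directory to OWNER:OWNER.
# (Hypothetical helper; chown normally requires root.)
ensure_rrd_parent() {
    dir=$(dirname "$1")
    mkdir -p "$dir" || return 1
    if [ -n "${2:-}" ]; then
        chown "$2:$2" "$dir"
    fi
}

# Usage, with the path copied from the RRD_create error:
#   ensure_rrd_parent '/var/lib/ganglia/rrds/test/slave01/rpc.RetryCache/NameNodeRetryCache.CacheCleared.rrd' nobody
```

Only the final directory is chowned here; if mkdir -p also had to create intermediate directories, their ownership may need the same change.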
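6. Viewing ganglia's runtime logs
Fix:
ganglia currently appears to have no log files of its own. There are two options: enable gmond's debug mode and read the output, or check the output in /var/log/messages.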