Distributed Nagios with Mod-Gearman [Repost]

Kewei-Yu

Nagios has more than one distributed monitoring solution. For NSCA and DNX, I would like to quote Assaf to illustrate their shortcomings.

I have used two of this option :DNX and NSCA , and in the end , due to
network restrictions had to use the NSCA to monitor the remote sites.
DNX is a good option if all your workers and target hosts are on the
same LAN - as the DNX module does not have a "host affinity" logic which
means any host will check any other host , and if it can't reach it -
nagios will get the critical error alert for it .
NSCA gives you the ability to monitor a remote site "locally" with a
nagios on the site LAN and report it back to the main central server
with a more fine grained testing , but a bit more overhead in management.

About the DNX issue - you may want to check the mod_gearman
http://labs.consol.de/lang/de/nagios/mod-gearman/
as it is better designed to a distributed nagios deployment .

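As the quote shows, NSCA is tedious to maintain: every time you change the monitoring configuration on a distributed server, you have to make the matching change on the central server as well.
DNX's problem is network restrictions. It works well only when its workers can reach every monitored machine without restriction.

Note: Merlin and Opsview are two other distributed monitoring products. Both are built on top of Nagios Core.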
http://www.op5.org/community/plugin-inventory/op5-projects/merlin
http://www.opsview.com

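Next, let's look at Mod-Gearman.

1. What Is Mod-Gearman

Mod-Gearman is a simple way to distribute Nagios checks. It makes Nagios more flexible and can reduce the load on a single Nagios host. It consists of three parts:
 * A NEB module that resides in the Nagios core and adds servicechecks, hostchecks and eventhandlers to Gearman queues.
 * One or more worker clients that execute the checks. A worker can be configured to run checks only for specific hostgroups or servicegroups.
 * At least one running Gearman Job Server.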

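2. How It Works

Once the broker module is loaded, it intercepts all servicecheck, hostcheck and eventhandler events. Eventhandlers are sent to a generic eventhandler queue. Hostchecks for hosts in a configured hostgroup are sent to a separate hostgroup queue; all other hostchecks go to the generic host queue. Servicechecks are matched against servicegroups first, then against hostgroups; any servicecheck that matches neither goes to the generic service queue. The NEB module starts a single result worker that watches the check_results queue, where all results come back.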
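
The routing rules above can be sketched in a few lines of Python. This is an illustrative model only, not Mod-Gearman's actual code: the function name `route_check` and its parameters are hypothetical, and local groups (checks kept out of gearman entirely) are left out for brevity.

```python
def route_check(kind, host_groups=(), service_groups=(),
                cfg_hostgroups=(), cfg_servicegroups=()):
    """Return the queue name for a check (hypothetical helper).

    kind              -- "host", "service" or "eventhandler"
    host_groups       -- hostgroups the checked host belongs to
    service_groups    -- servicegroups the checked service belongs to
    cfg_hostgroups    -- groups listed in the hostgroups= option
    cfg_servicegroups -- groups listed in the servicegroups= option
    """
    if kind == "eventhandler":
        return "eventhandler"
    if kind == "service":
        # Servicechecks try the configured servicegroups first.
        for group in cfg_servicegroups:
            if group in service_groups:
                return f"servicegroup_{group}"
    # Hostchecks, and servicechecks without a servicegroup match,
    # try the configured hostgroups next.
    for group in cfg_hostgroups:
        if group in host_groups:
            return f"hostgroup_{group}"
    return kind  # fall back to the generic "host" or "service" queue


print(route_check("service", host_groups={"japan"}, cfg_hostgroups=["japan"]))
```

A service on a host in the `japan` hostgroup lands in `hostgroup_japan` unless one of its servicegroups is configured, in which case the servicegroup queue wins.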

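The workflow looks like this:

Nagios wants to run a service check.
The check is intercepted by Mod-Gearman.
Mod-Gearman puts a job into the service queue.
A worker picks up the job, executes it and returns the result to the check_results queue.
Mod-Gearman picks up the result job and puts the result onto the check result list.
The Nagios reaper reads all checks from the result list and updates the host and service status.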
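
To make the flow concrete, here is a toy in-memory simulation. It assumes plain Python deques in place of a real Gearman job server; the function names are made up for this sketch, and the last two workflow steps (result list and reaper) are merged into one function.

```python
from collections import deque

# Two in-memory stand-ins for Gearman queues.
queues = {"service": deque(), "check_results": deque()}
status = {}  # (host, service) -> (return code, plugin output)


def intercept(host, service, command):
    """NEB module side: turn a scheduled check into a job on the service queue."""
    queues["service"].append({"host": host, "service": service, "command": command})


def worker_step(run_plugin):
    """Worker side: take one job, execute it, push the result to check_results."""
    job = queues["service"].popleft()
    code, output = run_plugin(job["command"])
    queues["check_results"].append({**job, "code": code, "output": output})


def reap():
    """Result side: drain check_results and update the status table."""
    while queues["check_results"]:
        result = queues["check_results"].popleft()
        status[(result["host"], result["service"])] = (result["code"], result["output"])


intercept("web01", "HTTP", "check_http -H web01")
worker_step(lambda command: (0, "HTTP OK"))  # stand-in for really running the plugin
reap()
print(status)
```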

You can assign hostgroups or servicegroups to specific workers. The example below uses a separate hostgroup for Japan and a separate servicegroup for jmx4perl:
+-----------------------+------------------+--------------+--------------+
| Queue Name            | Worker Available | Jobs Waiting | Jobs Running |
+-----------------------+------------------+--------------+--------------+
| check_results         | 1                | 0            | 0            |
| host                  | 50               | 0            | 1            |
| service               | 50               | 0            | 13           |
| servicegroup_jmx4perl | 3                | 0            | 3            |
| hostgroup_japan       | 3                | 1            | 3            |
| eventhandler          | 50               | 0            | 0            |
+-----------------------+------------------+--------------+--------------+

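A worker can serve one or more queues. So you could start one worker that handles only the hostgroup_japan queue, one worker that handles only the jmx4perl checks, and one worker that handles all other queues. You can also run several workers on each queue to spread the load.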

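3. Common Scenarios

Load Balancing

Distributed Monitoring

To run checks in a different network segment, assign hostgroups (or servicegroups) to a worker. Such a worker does not serve the generic host and service queues; it only runs checks for the specified groups.

Distributed Monitoring with Load Balancing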
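
As a rough illustration of which queues a worker ends up serving, the sketch below maps worker switches to queue names. The helper `worker_queues` is hypothetical; the queue naming (`hostgroup_<name>`, `servicegroup_<name>`) follows the tables in this article.

```python
def worker_queues(hosts=False, services=False, events=False,
                  hostgroups=(), servicegroups=()):
    """Return the queue names a worker would serve (hypothetical helper)."""
    queues = []
    if hosts:
        queues.append("host")
    if services:
        queues.append("service")
    if events:
        queues.append("eventhandler")
    queues += [f"hostgroup_{name}" for name in hostgroups]
    queues += [f"servicegroup_{name}" for name in servicegroups]
    return queues


# A worker placed in the remote segment serves only its own group.
print(worker_queues(hostgroups=["japan"]))
# A general-purpose worker next to the Nagios server.
print(worker_queues(hosts=True, services=True, events=True))
```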

4. Installation

OMD
Mod-Gearman has been included in OMD since version 0.48. Installing OMD is the easiest way to start using Mod-Gearman.
OMD[test]:~$ omd config set DISTRIBUTED_MONITORING mod-gearman

http://omdistro.org/

From Source
Building from source requires the following software to be installed first:
 gcc/g++
 autoconf/automake/autoheader
 libtool
 libgearman(>=0.14)

Build and install with:
#> ./configure
#> make
#> make install

Copy mod_gearman.o to your Nagios installation directory and add a broker entry to nagios.cfg:
broker_module=...../mod_gearman.o server=localhost:4730 eventhandler=yes services=yes hosts=yes

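The last step is to start one or more workers, either via the init script or with the command below. Workers should use the same configuration file as the NEB module.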
./mod_gearman_worker --server=localhost:4730 --services --hosts

Note: make sure your gearmand job server is running. You can start it with its init script or with the following command:
/usr/sbin/gearmand -t 10 -j 0

Patch Nagios
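Nagios 3.2.2 already includes everything Mod-Gearman needs.
Versions before Nagios 3.2.2 cannot use distributed eventhandlers. To use an older Nagios version, apply the corresponding patch to the Nagios source, recompile, and use the newly built nagios binary. If you only plan to distribute host/service checks, no patch is needed.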

5. Configuration

Nagios Core

Add a broker entry like the following to nagios.cfg:
broker_module=/usr/local/share/nagios/mod_gearman.o keyfile=/usr/local/share/nagios/secret.txt server=localhost eventhandler=yes hosts=yes services=yes
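
When reviewing a nagios.cfg by script, it can help to split such a line into the module path and its options. The helper below is a small sketch for that purpose only; `parse_broker_line` is a made-up name and it assumes options are space-separated `key=value` pairs, as in the line above.

```python
def parse_broker_line(line):
    """Split a broker_module line into (module path, options dict)."""
    name, _, value = line.partition("=")
    if name.strip() != "broker_module":
        raise ValueError("not a broker_module line")
    path, *args = value.split()
    options = {}
    for arg in args:
        key, _, val = arg.partition("=")
        options[key] = val
    return path, options


line = ("broker_module=/usr/local/share/nagios/mod_gearman.o "
        "keyfile=/usr/local/share/nagios/secret.txt server=localhost "
        "eventhandler=yes hosts=yes services=yes")
path, options = parse_broker_line(line)
print(path)
print(options)
```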

Common Options
Options shared by the NEB module and the worker.

config
Read the configuration from this file.
config=/etc/nagios3/mod_gm_worker.conf

debug
Verbosity of the module's output.
 0 - errors only
 1 - debug messages
 2 - trace messages
Default is 0.
debug=1

server
Sets the address of your gearman job server. Multiple servers can be listed; Mod-Gearman uses the first one that is available.
server=localhost:4730,remote_host:4730
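
The "first available server" behaviour can be approximated as follows. This is a sketch, not how Mod-Gearman itself is implemented: the function names are invented, and it assumes that an entry without a port means 4730, the standard Gearman port.

```python
import socket

DEFAULT_PORT = 4730  # assumption: used when an entry has no explicit port


def parse_servers(value):
    """Turn 'host:port,host:port' into a list of (host, port) tuples."""
    servers = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if sep:
            servers.append((host, int(port)))
        else:
            servers.append((item, DEFAULT_PORT))
    return servers


def first_available(servers, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection, else None."""
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue
    return None


print(parse_servers("localhost:4730,remote_host:4730"))
```

`first_available(parse_servers(...))` only proves that a TCP port is open; it does not verify that a gearmand is actually answering on it.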

eventhandler
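Enable this option if the module should distribute the execution of eventhandlers.
eventhandler=yes

services
Enable this option if the module should distribute the execution of service checks.
services=yes

hosts
Enable this option if the module should distribute the execution of host checks.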
hosts=yes

hostgroups
A list of hostgroups; each of these hostgroups gets its own queue.
hostgroups=name1,name2,name3

servicegroups
A list of servicegroups; each of these servicegroups gets its own queue.
servicegroups=name1,name2,name3

encryption
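Encryption is enabled by default. When encryption is used, you must set either a shared password with key=password or a shared password file with keyfile=passwd.file.
encryption=yes

key
The shared password used to encrypt data packets. It should be at least 8 characters and at most 32.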
key=secret
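
A configuration lint for the length rule stated above might look like this. The helper `check_key` is hypothetical and only encodes the 8-to-32-character guideline from this article; it says nothing about how Mod-Gearman treats keys outside that range.

```python
def check_key(key, min_len=8, max_len=32):
    """Return a warning string if the key length is outside the guideline, else None."""
    if len(key) < min_len:
        return f"key is shorter than {min_len} characters"
    if len(key) > max_len:
        return f"key is longer than {max_len} characters"
    return None


for candidate in ("secret", "a-much-better-shared-secret"):
    print(repr(candidate), "->", check_key(candidate))
```

Note that a six-character value such as `secret` is flagged as too short by this check.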

keyfile
The file that holds the shared password.
keyfile=/path/to/secret.file

Server Options
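The following options apply only to the NEB module.

localhostgroups
A list of hostgroups whose checks are not executed through gearman.
localhostgroups=name1,name2,name3

localservicegroups
A list of servicegroups whose checks are not executed through gearman.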
localservicegroups=name1,name2,name3

result_workers
Number of processes that handle returned results. One is usually enough.
result_workers=3

perfdata
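Defines whether the module should pass performance data to gearman. Note that gearman does not process performance data itself; the data is only written to a queue. You can use pnp4nagios to process it.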
perfdata=yes

perfdata_mode
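In overwrite mode, the performance data of each host or service is kept as a single job on the perfdata queue. In append mode, performance data accumulates for as long as memory allows. Setting this to overwrite prevents the perfdata queue from growing too large.
 1 - overwrite
 2 - append
Default is 1.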
perfdata_mode=1
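
The difference between the two modes can be modelled with a small in-memory class. This is only a conceptual sketch; `PerfdataQueue` is a made-up name and does not reflect how the job server stores jobs internally.

```python
OVERWRITE, APPEND = 1, 2  # same numbering as perfdata_mode


class PerfdataQueue:
    """Toy model of the perfdata queue in overwrite and append mode."""

    def __init__(self, mode=OVERWRITE):
        self.mode = mode
        self.slots = {}    # overwrite: one pending job per (host, service)
        self.backlog = []  # append: every sample is kept until consumed

    def put(self, host, service, perfdata):
        if self.mode == OVERWRITE:
            self.slots[(host, service)] = perfdata  # replaces the pending job
        else:
            self.backlog.append((host, service, perfdata))

    def __len__(self):
        return len(self.slots) if self.mode == OVERWRITE else len(self.backlog)


for mode in (OVERWRITE, APPEND):
    queue = PerfdataQueue(mode)
    for sample in ("time=0.1s", "time=0.2s", "time=0.3s"):
        queue.put("web01", "HTTP", sample)
    print(mode, len(queue))
```

With nothing consuming the queue, overwrite mode stays at one pending job per host or service, while append mode grows with every check.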

result_queue
Sets the result queue. A dedicated result queue is required when several Nagios instances put their jobs onto the same gearman queues. Default: check_results
result_queue=check_results_nagios1

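Worker Options
Additional options that apply to the worker.

identifier
Identifier of the worker. It is used for the worker_identifier queue for status requests. You may need to set an identifier if you run multiple workers on the same host. Defaults to the current hostname.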
identifier=hostname_test

pidfile
Path to the pidfile.
pidfile=/path/to/pid.file

logfile
Path to the logfile.
logfile=/path/to/log.file

min-worker
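Minimum number of worker processes that should be running at any time. Default: 1
min-worker=1

max-worker
Maximum number of workers that can run at any time. Setting max-worker equal to min-worker disables dynamic adjustment of the worker count. When set to 1, all service checks on this worker are executed one after another. Default: 20
max-worker=20

spawn-rate
Number of new workers spawned per second while jobs are waiting. Default: 1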
spawn-rate=1
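
How min-worker, max-worker and spawn-rate interact can be pictured with a one-step model evaluated once per second. This is a deliberately simplified assumption for illustration, not the worker's real scheduling code, and it ignores workers that exit on their own.

```python
def next_worker_count(current, jobs_waiting, min_worker=1, max_worker=20, spawn_rate=1):
    """Return the worker count after one second in a simplified scaling model."""
    target = current
    if jobs_waiting > 0:
        target = current + spawn_rate  # grow only while jobs are waiting
    # Never drop below min-worker or exceed max-worker.
    return max(min_worker, min(max_worker, target))


count = 1
for second in range(5):
    count = next_worker_count(count, jobs_waiting=10, max_worker=4, spawn_rate=2)
    print(second, count)
```

In this model, setting max-worker equal to min-worker pins the count, which matches the note above about disabling dynamic adjustment.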

idle-timeout
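Number of seconds an idle worker waits without jobs before it exits. Set to 0 to disable this behaviour. Default: 10
idle-timeout=30

max-jobs
Number of jobs a worker handles before it exits. Use this to control how quickly the number of workers drops after a period of high load. Set to 0 to disable. Default: 1000
max-jobs=500

fork_on_exec
Use this option to disable the extra fork when executing plugins. Disabling it reduces the load on the worker host. Default: yes
fork_on_exec=no

dup_server
Addresses of gearman job servers to which duplicated result data will be sent. Multiple addresses can be specified.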
dup_server=logserver:4730,logserver2:4730

6. Queue Names

To inspect the queues on your gearman job server, use the tools/queue_top.pl utility. It displays the queue status and refreshes it every second:
+-----------------------+--------+-------+-------+---------+
| Name                  | Worker | Avail | Queue | Running |
+-----------------------+--------+-------+-------+---------+
| check_results         | 1      | 1     | 0     | 0       |
| host                  | 3      | 3     | 0     | 0       |
| service               | 3      | 3     | 0     | 0       |
| eventhandler          | 3      | 3     | 0     | 0       |
| servicegroup_jmx4perl | 3      | 3     | 0     | 0       |
| hostgroup_japan       | 3      | 3     | 0     | 0       |
+-----------------------+--------+-------+-------+---------+
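
Similar numbers can be read directly from gearmand through its administrative text protocol: sending `status` returns one tab-separated line per queue (name, total jobs, running jobs, available workers), terminated by a line containing a single dot. The sketch below relies on that protocol; the function names are made up, and treating "waiting" as total minus running is an assumption of this example.

```python
import socket


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send 'status' to gearmand and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # The reply ends with a line that contains only a dot.
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode()


def parse_status(text):
    """Parse the status reply into {queue: {workers, waiting, running}}."""
    rows = {}
    for line in text.splitlines():
        if line == ".":
            break
        if not line.strip():
            continue
        name, total, running, workers = line.split("\t")
        rows[name] = {
            "workers": int(workers),
            "waiting": int(total) - int(running),  # assumption: total includes running
            "running": int(running),
        }
    return rows


sample = "host\t1\t1\t50\nservice\t13\t13\t50\nhostgroup_japan\t4\t3\t3\n.\n"
print(parse_status(sample))
```

On a live system, `parse_status(fetch_status())` would replace the hard-coded sample.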

check_results
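This queue is watched by the NEB module, which collects the results from the workers. You do not need extra workers for this queue. The number of result workers can be set up to a maximum of 256, but one is usually enough. A single worker can process several thousand results per second.

host
This is the generic queue for host checks, used if you enabled host checks with hosts=yes. Before a host goes into this queue, it is checked against the local groups and the configured hostgroups. If nothing matches, this queue is used.

service
This is the generic queue for service checks, used if you enabled service checks with services=yes. Before a service goes into this queue, it is checked against the local hostgroups and servicegroups. If nothing matches, this queue is used.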

hostgroup_<name>
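Each hostgroup defined via the hostgroups=... option gets its own queue. Make sure there is at least one worker for every hostgroup. Start a worker for a hostgroup queue with the --hostgroups=... parameter. Note that a service check whose host matches the hostgroup also ends up in this hostgroup queue.

servicegroup_<name>
This queue is created for every servicegroup defined via the servicegroups=... option.

eventhandler
This is the generic queue for all eventhandlers. If you enabled eventhandlers, make sure a worker serves this queue. Start the worker with the --events option to have it work on this queue.

perfdata
This is the generic queue for all performance data. It is created and used if you enabled it with --perfdata=yes. Performance data is not processed by the gearman workers themselves; you need PNP4Nagios for that.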

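7. Performance
This module lets you spread the load across many workers. The main limit on throughput is the number of jobs a single Nagios instance can put onto the Gearman job server. Keep the Gearman job server close to the Nagios box, or ideally run both on the same machine.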

8. How To

How to Monitor the Job Server and Workers
./check_gearman -H <job server hostname> -q worker_<worker hostname> -t 10 -s check
check_gearman OK - localhost has 10 worker and is working on 1 jobs |worker=10 running=1 total_jobs_done=1508

To monitor the job server itself:
./check_gearman -H localhost -t 20
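
The sample check_gearman output shown earlier follows the usual plugin layout: status text, a `|`, then `label=value` performance data. If you want those numbers in a script, a parser along these lines works for that sample; `parse_plugin_output` is a hypothetical helper and only handles plain integer values without units or thresholds.

```python
def parse_plugin_output(line):
    """Split plugin output into (status text, {label: integer value})."""
    text, _, perfdata = line.partition("|")
    metrics = {}
    for token in perfdata.split():
        label, _, value = token.partition("=")
        metrics[label] = int(value)
    return text.strip(), metrics


sample = ("check_gearman OK - localhost has 10 worker and is working on 1 jobs "
          "|worker=10 running=1 total_jobs_done=1508")
text, metrics = parse_plugin_output(sample)
print(text)
print(metrics)
```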

How to Submit Passive Checks
You can use send_gearman to submit active and passive checks to a gearman job server.
./send_gearman --server=<job server> --encryption=no --host="<hostname>" --service="<service>" --message="message"
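
When submitting results from a script, building the argument list explicitly avoids shell quoting problems with host names or messages. The sketch below uses only the options shown in the command above; the helper name, the `./send_gearman` path and the sample values are assumptions for illustration.

```python
import subprocess


def build_send_gearman_cmd(server, host, service, message,
                           encryption=False, binary="./send_gearman"):
    """Return the send_gearman argument list for one check result."""
    return [
        binary,
        f"--server={server}",
        f"--encryption={'yes' if encryption else 'no'}",
        f"--host={host}",
        f"--service={service}",
        f"--message={message}",
    ]


def submit(*args, **kwargs):
    """Run send_gearman; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_send_gearman_cmd(*args, **kwargs), check=True)


print(build_send_gearman_cmd("localhost:4730", "web01", "HTTP", "HTTP OK"))
```

Because the arguments are passed as a list, a message containing spaces or quotes reaches send_gearman unchanged.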

9. About Notifications
Notifications cannot be distributed.

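10. Hints
 * Make sure every queue has at least one worker.
 * Add log checks for your gearmand server and mod_gearman workers.
 * Make sure all gearman checks are in local groups. Gearman's own checks should not be monitored through gearman.
 * Checks that write directly to the Nagios command file (such as check_mk) should run on a local worker or be placed in a local servicegroup.
 * The closer the gearmand server is to Nagios, the better the performance.
 * If some checks cannot run in parallel, start a separate worker with --max-worker=1 so that those checks run one after another.
 * Make sure all workers have the nagios-plugins installed under the same path. Otherwise the workers cannot find them.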

Reposted from: http://blog.chinaunix.net/uid-261392-id-2138986.html
Original: http://labs.consol.de/nagios/mod-gearman/#_hints
