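Anyone who uses open-falcon will probably end up digging into its alerting pipeline, because alerting is the core function of a monitoring system and the whole point of monitoring. Wanting to understand how the alerts work is a natural curiosity for every user: not understanding it feels like a thorn that has to be pulled out. I suspect this is less a dedication to knowledge than a mild case of compulsiveness. Below is how I went about analyzing the way open-falcon handles alarm messages: preparing the environment, analyzing the flow, processing the messages, and optimizing the processing.
System environment: Ubuntu 15.04 (64-bit), open-falcon source code, redis, mysql, golang, gcc, etc.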
1. Setting up the development environment
1.1 Install the C toolchain
sudo apt-get install build-essential
1.2 Install Go
Download go1.4.2.linux-amd64.tar.gz (I used a free copy from CSDN), change into the download directory, and unpack it:
sudo tar -zxvf go1.4.2.linux-amd64.tar.gz -C /usr/local/
Add the environment variables to /etc/profile. Run sudo vi /etc/profile and append the following to the end of the file:
export GOROOT=/usr/local/go
export GOBIN=$GOROOT/bin
export PATH=$PATH:$GOBIN
export GOPATH=$HOME/goproj
Reload the environment variables:
source /etc/profile
Check the Go version:
go version
1.3 Install redis and mysql
sudo apt-get install mysql-server mysql-client libmysqlclient*
wget http://download.redis.io/releases/redis-3.0.5.tar.gz
tar zxvf redis-3.0.5.tar.gz
cd redis-3.0.5/
sudo apt-get install tcl
make
sudo make install
1.4 Build open-falcon from source
mkdir $HOME/goproj
cd $HOME/goproj
mkdir -p src/github.com
cd src/github.com
git clone --recursive https://github.com/XiaoMi/open-falcon.git
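The alarm module is built here as an example. For the other modules, see the official documentation; I will probably cover them on this blog as well.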
cd open-falcon/alarm/
sudo chmod 777 /usr/local/go/bin/
go get ./...
./control build
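2. Analyzing the alarm messages
To analyze alarm messages, you first have to generate some. In the web UI, add a template and add an alarm rule to it, for example: alert when free memory is below 100%. A rule like this is guaranteed to fire. Note that when adding the rule you must also set the alarm receiver group and add the relevant users to that group. Then bind the template to a host group and add the monitored machines to that host group. Finally, sit back and wait for the alarm.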
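To make the rule concrete, here is a minimal Go sketch of how a condition such as "free memory percentage <= 100 for the last 3 points" could be evaluated. The `all(#3)` function and the `<=` operator are the ones that show up in the strategy of the alarm event logged later in this post. The `compare` and `allLastN` helpers and the sample values are hypothetical illustrations, not code from the open-falcon source.

```go
package main

import "fmt"

// compare applies a strategy operator to a sampled value and a threshold.
// Only the operators needed for this illustration are handled.
func compare(left float64, op string, right float64) bool {
	switch op {
	case "<":
		return left < right
	case "<=":
		return left <= right
	case ">":
		return left > right
	case ">=":
		return left >= right
	}
	return false
}

// allLastN reports whether the most recent n points all satisfy the
// condition. It returns false when fewer than n points are available.
// Hypothetical helper, not open-falcon's implementation.
func allLastN(points []float64, n int, op string, right float64) bool {
	if n <= 0 || len(points) < n {
		return false
	}
	for _, v := range points[len(points)-n:] {
		if !compare(v, op, right) {
			return false
		}
	}
	return true
}

func main() {
	// Made-up free-memory percentages. Every value is <= 100,
	// so a "<= 100" rule fires as soon as 3 points exist.
	samples := []float64{38.6, 35.1, 33.3}
	fmt.Println(allLastN(samples, 3, "<=", 100))
	fmt.Println(allLastN(samples, 3, ">", 90))
}
```

Because a percentage can never exceed 100, the first check is always true once three points have been collected, which is why this rule is a convenient way to force an alarm for testing.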
2.1 Searching redis
Connect to redis with redis-cli and check whether any alarm messages are stored:
KEYS *
It prints the following:
1) "session:obj:fe557589a85711e58528000c29bd7b56"
2) "t:uids:4"
3) "foo"
4) "team:obj:5"
5) "team:id:alarm"
6) "user:obj:6"
7) "team:id:alarm_info"
8) "user:obj:7"
9) "user:obj:8"
10) "user:id:admin"
11) "user:obj:11"
12) "t:uids:6"
13) "t:uids:5"
14) "user:obj:1"
15) "user:obj:10"
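Seeing team:id:alarm and team:id:alarm_info, I assumed these held the generated alarm messages. But querying them with GET does not return any alarm message, so team:id:alarm and team:id:alarm_info are not the alarm keys. Now what? It looks as if the alarm messages cannot be found. Time to check what the official documentation says.
2.2 Reading the open-falcon documentation
Alarm messages are generated by the judge module. Every alarm it generates is written to redis, split by priority level. So why is there no alarm in redis? Now look at the alarm module: every time an alarm is generated it is promptly delivered to the user, and the complete alarm message is also visible in the UI, yet none of it can be found in redis. The only option left is to read the redis-related source code of these two modules.
2.3 Reading the open-falcon source
The judge module writes alarm messages to redis with LPUSH (insert one or more elements at the head of a list), so each message goes into a redis list that acts as a queue and waits there for another process to fetch it. That is the first real clue: if the messages have already been dequeued, redis has nothing left to return. The alarm module reads them with BRPOP (remove and return the last element of a list, blocking until one is available), which takes the message off the queue and deletes it from redis.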
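The producer/consumer behavior described above can be sketched with a toy in-memory list in Go. This is only a model of the LPUSH and RPOP semantics, with no Redis connection involved; the `listQueue` type and the event names are made up for illustration.

```go
package main

import "fmt"

// listQueue is a toy in-memory stand-in for a single Redis list.
// Index 0 is the head (left); the last index is the tail (right).
type listQueue struct {
	items []string
}

// lpush inserts a value at the head, like Redis LPUSH with one value.
func (q *listQueue) lpush(v string) {
	q.items = append([]string{v}, q.items...)
}

// rpop removes and returns the tail element, like Redis RPOP.
// The second result is false when the list is empty.
func (q *listQueue) rpop() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	last := q.items[len(q.items)-1]
	q.items = q.items[:len(q.items)-1]
	return last, true
}

func main() {
	q := &listQueue{}

	// Producer side (the role judge plays).
	q.lpush("event-1")
	q.lpush("event-2")

	// Consumer side (the role alarm plays): the oldest event comes out first.
	for {
		v, ok := q.rpop()
		if !ok {
			break
		}
		fmt.Println("consumed:", v)
	}
	fmt.Println("remaining:", len(q.items))
}
```

Pushing on the left and popping on the right gives first-in, first-out order. In real Redis, a list key is removed once its last element is popped, so a consumer that keeps up with the producer leaves nothing for KEYS to show.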
2.4 Patching the source to log alarm messages
In judge, log the alarm message that is written to redis:
log.Printf("redis key is %v", redisKey)
log.Printf("redis value is %v", string(bs))
In alarm, log the alarm message that is read from redis:
log.Printf("redis key is %v", redisKey)
log.Printf("redis value is %v", string(bs))
Rebuild and restart both modules, then wait for the next alarm to be generated.
2.5 Checking the logs and redis
The alarm message that judge writes to redis:
2015/12/22 14:12:48 judge.go:82: redis key is event:p0
2015/12/22 14:12:48 judge.go:83: redis value is {"id":"s_7_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":7,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c=","rightValue":100,"maxStep":3,"priority":0,"note":"memfree alarm test","tpl":{"id":2,"name":"memery","parentId":0,"actionId":1,"creator":"admin"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":33.335713095833036,"currentStep":3,"eventTime":1450764720,"pushedTags":{}}
The alarm message that alarm reads from redis:
2015/12/22 14:04:13 reader.go:65: the redis key is: [event:p0 event:p1 event:p2 event:p3 event:p4 event:p5 0]
2015/12/22 14:04:13 reader.go:66: the redis value is: [event:p0 {"id":"s_7_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":7,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c=","rightValue":100,"maxStep":3,"priority":0,"note":"memfree alarm test","tpl":{"id":2,"name":"memery","parentId":0,"actionId":1,"creator":"admin"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":38.62473525142191,"currentStep":1,"eventTime":1450764120,"pushedTags":{}}]
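At this point the alarm message still cannot be found by querying redis, because alarm consumes it immediately. To be able to query it, stop the alarm module so that the messages stay in the redis queue.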
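Since each queue entry is a JSON document, it can be decoded into a struct for further processing. The Go sketch below mirrors the field names visible in the logged event. The `Event`, `Strategy`, and `Template` types and the `parseEvent` helper are illustrative names, not open-falcon's own types, and the `event:p<priority>` key format is an assumption inferred from the log (priority 0 was written to event:p0).

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Template, Strategy and Event mirror the JSON fields visible in the
// logged alarm event. Illustrative types, not open-falcon's own.
type Template struct {
	ID       int    `json:"id"`
	Name     string `json:"name"`
	ActionID int    `json:"actionId"`
	Creator  string `json:"creator"`
}

type Strategy struct {
	ID         int      `json:"id"`
	Metric     string   `json:"metric"`
	Func       string   `json:"func"`
	Operator   string   `json:"operator"`
	RightValue float64  `json:"rightValue"`
	MaxStep    int      `json:"maxStep"`
	Priority   int      `json:"priority"`
	Note       string   `json:"note"`
	Tpl        Template `json:"tpl"`
}

type Event struct {
	ID          string   `json:"id"`
	Strategy    Strategy `json:"strategy"`
	Status      string   `json:"status"`
	Endpoint    string   `json:"endpoint"`
	LeftValue   float64  `json:"leftValue"`
	CurrentStep int      `json:"currentStep"`
	EventTime   int64    `json:"eventTime"`
}

// parseEvent decodes one queue entry into an Event.
func parseEvent(raw string) (Event, error) {
	var ev Event
	err := json.Unmarshal([]byte(raw), &ev)
	return ev, err
}

func main() {
	// Shortened copy of the event that judge logged.
	raw := `{"id":"s_7_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":7,"metric":"mem.memfree.percent","func":"all(#3)","operator":"\u003c=","rightValue":100,"maxStep":3,"priority":0,"note":"memfree alarm test","tpl":{"id":2,"name":"memery","actionId":1,"creator":"admin"}},"status":"PROBLEM","endpoint":"bogon","leftValue":33.335713095833036,"currentStep":3,"eventTime":1450764720}`

	ev, err := parseEvent(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s on %s: %s %s %v (current value %.2f)\n",
		ev.Status, ev.Endpoint, ev.Strategy.Metric,
		ev.Strategy.Operator, ev.Strategy.RightValue, ev.LeftValue)
	// Assumed key format, inferred from the log output.
	fmt.Printf("queue key: event:p%d\n", ev.Strategy.Priority)
	fmt.Println("event time (UTC):", time.Unix(ev.EventTime, 0).UTC().Format(time.RFC3339))
}
```

Note that the JSON escape \u003c decodes to the character <, so the operator field comes out as <=. The eventTime field is a Unix timestamp in seconds.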
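3. Processing the alarm messages
3.1 Stopping the alarm module
Once the alarm module is stopped, the alarm messages can be read from redis.
Change into the alarm directory and run: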
./control stop
After judge has recorded an alarm message, query it with redis-cli:
127.0.0.1:6379> KEYS *
1) "event:p0"
127.0.0.1:6379> TYPE event:p0
list
127.0.0.1:6379> lpop event:p0
"{\"id\":\"s_7_9e899684e61cce209c14444cfb4e33bc\",\"strategy\":{\"id\":7,\"metric\":\"mem.memfree.percent\",\"tags\":{},\"func\":\"all(#3)\",\"operator\":\"\\u003c=\",\"rightValue\":100,\"maxStep\":3,\"priority\":0,\"note\":\"memfree alarm test\",\"tpl\":{\"id\":2,\"name\":\"memery\",\"parentId\":0,\"actionId\":1,\"creator\":\"admin\"}},\"expression\":null,\"status\":\"PROBLEM\",\"endpoint\":\"bogon\",\"leftValue\":33.335713095833036,\"currentStep\":3,\"eventTime\":1450764720,\"pushedTags\":{}}"
The alarm message can also be fetched from a C program.
3.2 Fetching alarm messages in C
Connect to redis:
redisContext* conn = redisConnect("127.0.0.1",6379);
Fetch the alarm message:
redisReply* reply = redisCommand(conn,"BRPOP event:p0 0 ");
or
redisReply* reply = redisCommand(conn,"RPOP event:p0");
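BRPOP and RPOP are the redis commands that pop an element from the tail of a list. BRPOP blocks, and a timeout of 0 means block indefinitely; RPOP does not block and returns immediately.
To interpret the result, note that redisCommand returns a redisReply, a struct defined as follows: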
/* This is the reply object returned by redisCommand() */
typedef struct redisReply {
int type; /* REDIS_REPLY_* */
long long integer; /* The integer when type is REDIS_REPLY_INTEGER */
int len; /* Length of string */
char *str; /* Used for both REDIS_REPLY_ERROR and REDIS_REPLY_STRING */
size_t elements; /* number of elements, for REDIS_REPLY_ARRAY */
struct redisReply **element; /* elements vector for REDIS_REPLY_ARRAY */
} redisReply;
Here type indicates the kind of reply, and is one of:
#define REDIS_REPLY_STRING 1
#define REDIS_REPLY_ARRAY 2
#define REDIS_REPLY_INTEGER 3
#define REDIS_REPLY_NIL 4
#define REDIS_REPLY_STATUS 5
#define REDIS_REPLY_ERROR 6
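BRPOP returns REDIS_REPLY_ARRAY (2), an array with two elements: the key that was popped from and the popped value. RPOP instead returns the value directly as REDIS_REPLY_STRING in reply->str, or REDIS_REPLY_NIL when the list is empty. The array reply from BRPOP is handled like this: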
size_t i;
for (i = 0; i < reply->elements; ++i) {
    redisReply* childReply = reply->element[i];
    if (childReply->type == REDIS_REPLY_STRING)
        printf("The value is %s.\n", childReply->str);
}
freeReplyObject(reply); /* release the reply once it has been processed */
Running it prints:
The value is event:p0.
The value is {"id":"s_3_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":3,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c","rightValue":100,"maxStep":20,"priority":0,"note":"内存使用量太大","tpl":{"id":3,"name":"local","parentId":0,"actionId":2,"creator":"root"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":27.076937721615383,"currentStep":1,"eventTime":1450851060,"pushedTags":{}}.
Note: the hiredis client library must be installed to access redis from C.
3.3 Saving alarm messages in C
Write the result to a file:
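FILE* fd = fopen("./alarm_info_log", "a+"); /* "a+" appends and creates the file if it does not exist */
fwrite(childReply->str, 1, childReply->len, fd);
fclose(fd); /* close the file after the last write */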
After a successful write, check the contents of alarm_info_log: cat alarm_info_log
event:p0{"id":"s_3_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":3,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c","rightValue":100,"maxStep":20,"priority":0,"note":"内存使用量太大","tpl":{"id":3,"name":"local","parentId":0,"actionId":2,"creator":"root"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":27.076937721615383,"currentStep":1,"eventTime":1450851060,"pushedTags":{}}liang@bogon:~/redis/proc
Note: make sure the process has read and write permission on the alarm_info_log file.
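At this point we have fully retrieved the information that open-falcon stores in redis when an alarm fires. Knowing how to obtain it, you can go on to build further processing on top of it.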