1. Technology Overview

- Nginx: essentially a web server. In this pipeline it acts as a reverse proxy, dispatching user requests; in a clustered environment it can also provide load balancing.
- Spawn-fcgi: exposes our server through a CGI gateway interface so it can quickly serve external requests. Communication uses the FastCGI protocol, a CGI-derived protocol that sits behind the HTTP server; keeping the backend behind the proxy avoids exposing it directly to external attacks, whereas a plain HTTP endpoint is comparatively less secure.
- Thrift RPC: running the thrift compiler quickly generates client and server scaffolding. Because RPC is cross-language, the two ends can be written in different languages as long as both follow the agreed RPC protocol and interface definition. In production, implementing the server in C++ or Java can greatly improve performance under concurrent requests.
- Flume: acts as a channel that picks up the log files produced by the log server; events flow through a source and a channel and finally reach our custom HBaseSink. A custom sink is needed to turn the unstructured log lines into structured data stored in separate columns of a table; the stock HBase sink only lets you name the columns, it does not parse the payload into them.
- HBase: stores the structured user-behavior logs; later, Hive can be used to run statistics and analysis over the data in HBase.
- Glog: Google's open-source logging framework. It writes log files quickly and lets you configure the output directory, the maximum size of a single file, the rotation period, and so on; it plays a role similar to log4j in Java.
- ab: a benchmarking tool that simulates many concurrent user requests and reports, among other things, the average response time across all requests.
2. Task Breakdown
Environment notes:
Python version: 2.7
Java version: 1.7.0_80
Linux version: CentOS 7
2.1 [Task 1]: Install Thrift and get a single-machine Thrift demo working (Python)
1. Thrift version
thrift-0.9.3
2. Download the source package
wget http://apache.fayea.com/thrift/0.9.3/thrift-0.9.3.tar.gz
If wget is missing, install it first:
yum -y install wget
3. Install Thrift
3.1 Install dependencies
The Thrift core is written in C++ and uses the Boost library. Run the following to install the build dependencies:
yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel
3.2 Unpack
tar xzvf thrift-0.9.3.tar.gz
3.3 Configure language support
Enter the unpacked directory and configure which language bindings to build:
./configure --with-cpp --with-boost --with-python --without-csharp --with-java --without-erlang --without-perl --with-php --without-php_extension --without-ruby --without-haskell --without-go
Running configure may fail with:
configure: error: "Error: libcrypto required."
Fix: install openssl and openssl-devel:
yum -y install openssl openssl-devel
3.4 Build
> Run make (g++ must be installed before compiling): make
> Then run: make install
3.5 Verify the installation
> Run the thrift command: thrift or thrift -help
> Check where thrift was installed: which thrift
4. Define the data interface between server and client
Write the schema from which the thrift compiler will generate the client and server code:
cat RecSys.thrift
service RecSys {
    string rec_data(1:string data)
}
5. Generate code (Python) from the schema
Run:
thrift --gen py RecSys.thrift
If this step fails, install whatever the error message asks for, for example:
pip install thrift==0.9.3
Once generation succeeds, write the client code:
cat client.py
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
# Append the generated directory so the RecSys package can be imported
sys.path.append("gen-py")

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from RecSys import RecSys
# from demo.ttypes import *

try:
    # Make socket; the IP and port must match the server
    transport = TSocket.TSocket('localhost', 9900)
    # Buffering is critical. Raw sockets are very slow
    # The transport layer must match the server's setting
    transport = TTransport.TBufferedTransport(transport)
    # Wrap in a protocol
    # The protocol must also match the server, otherwise the two ends cannot communicate
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = RecSys.Client(protocol)
    # Connect!
    transport.open()
    # Call server services
    rst = client.rec_data("are you ok!")
    print rst
    # Close the transport
    transport.close()
except Thrift.TException, ex:
    print "%s" % (ex.message)
The server code:
cat server.py
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
sys.path.append('gen-py')

from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer
from RecSys import RecSys
from RecSys.ttypes import *

class RecSysHandler(RecSys.Iface):
    def rec_data(self, a):
        print "Receive: %s" % (a)
        return "ok"

if __name__ == "__main__":
    # Instantiate the handler
    handler = RecSysHandler()
    # Set up the processor
    processor = RecSys.Processor(handler)
    # Listening address and port
    transport = TSocket.TServerSocket('localhost', port=9900)
    # Transport layer
    tfactory = TTransport.TBufferedTransportFactory()
    # Transport protocol
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()
    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    print 'Starting the server...'
    server.serve()
    print 'done'
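The handler's logic can be sanity-checked before any sockets are involved. A minimal sketch, re-declaring the handler without the generated `RecSys.Iface` base so it runs standalone (the base class is assumed only in the real server):

```python
# Standalone sketch of the handler from server.py above, without the
# generated RecSys.Iface base, so it runs with no thrift install.
class RecSysHandler(object):
    def rec_data(self, a):
        # The real server logs the payload and returns an acknowledgement.
        print("Receive: %s" % a)
        return "ok"

handler = RecSysHandler()
assert handler.rec_data("are you ok!") == "ok"
```

The real server.py differs only in inheriting from the generated RecSys.Iface and being wired into a TServer.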
6. Start and test
Start the server first, then the client:
1. python server.py
2. python client.py
If startup complains about missing modules, install the matching version with pip, for example:
pip install thrift==0.9.3
6.1 Download pip
wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate
6.2 Install pip
# tar -xzvf pip-1.5.4.tar.gz
# cd pip-1.5.4
# python setup.py install
6.3 Install a package with pip
# pip install SomePackage
[...]
Successfully installed SomePackage
7. Access test
Once both sides are up, run a quick end-to-end call.
With the steps above, a minimal C/S architecture (Python) is complete.
2.2 [Task 2]: Get a single-machine Thrift demo working (C++)
For C++, the thrift compiler generates only the server skeleton for us. C++ is worth the extra effort because it performs better under concurrent load and is the most common choice in production.
1. Run the thrift compiler to generate the server code:
thrift --gen cpp RecSys.thrift
2. Before compiling, install this package, or the build may fail (skip if it is already installed):
yum install boost-devel-static
3. Copy the thrift headers into one place for easier management:
cp -raf thrift/ /usr/local/include/
4. Compile
Compile the server:
The thrift compiler created a gen-cpp directory; enter it and compile:
g++ -g -Wall -I./ -I/usr/local/include/thrift RecSys.cpp RecSys_constants.cpp RecSys_types.cpp RecSys_server.skeleton.cpp -L/usr/local/lib/*.so -lthrift -o server
The client code has to be written by hand; an example:
cat client.cpp
cat client.cpp
#include "RecSys.h"
#include <iostream>
#include <string>
#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>

using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
using namespace std;
using std::string;
using boost::shared_ptr;

int main(int argc, char **argv) {
    boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
    boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
    boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
    transport->open();

    RecSysClient client(protocol);
    string send_data = "are you ok?";
    string receive_data;
    client.rec_data(receive_data, send_data);
    cout << "receive server data: " << receive_data << endl;
    transport->close();
    return 0;
}
Compile the client:
g++ -g -Wall -I./ -I/usr/local/include/thrift RecSys.cpp client.cpp -L/usr/local/lib/*.so -lthrift -o client
5. Run
Run the server:
./server
(1) You may hit this error:
./server: error while loading shared libraries: libthrift-0.9.3.so: cannot open shared object file: No such file or directory
Fix: add /usr/local/lib to ld.so.conf:
(py27tf) [root@singler cgi_demo]# vim /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/local/lib
Then reload the linker cache: ldconfig
(2) On a second run you may see:
[root@master gen-cpp]# ./server
Thrift: Tue May 15 12:49:33 2018 TServerSocket::listen() BIND 9090
terminate called after throwing an instance of 'apache::thrift::transport::TTransportException'
what(): Could not bind: Transport endpoint is not connected
Aborted
Fix: find the server process that is still running:
ps aux
and kill it:
kill -9 <pid>
Run the client:
./client
Note:
After modifying the client or server code, delete the old binaries and recompile:
rm -rf server client
2.3 [Task 3]: Thrift across languages
Thrift RPC is cross-language: as long as the client and server follow the same RPC protocol and interface definition, the two ends can be written in different languages.
client: Python
server: C++
1. Run the thrift compiler to generate the server code:
thrift --gen cpp RecSys.thrift
This creates the gen-cpp directory; the code in it covers only the server side.
2. Modify the server skeleton slightly to make testing easier:
void rec_data(std::string& _return, const std::string& data) {
    // Your implementation goes here
    printf("==============\n");
    std::cout << "receive client data: " << data << std::endl;

    std::string ack = "i am ok !!!";
    _return = ack;
}
3. The client still uses the Python code from task 1:
Only the port number needs to be changed to match the server.
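As a sketch of that one-line change: the generated C++ skeleton listens on 9090 by default, while the Python server above used 9900, so only the endpoint passed to TSocket.TSocket changes (the host and port values here are the ones used in this walkthrough):

```python
# The only difference from the task-1 client is the port the TSocket targets.
def rpc_endpoint(server_impl):
    # The Python server from task 1 listened on 9900;
    # the generated C++ skeleton defaults to 9090.
    ports = {"python": 9900, "cpp": 9090}
    return ("localhost", ports[server_impl])

# In client.py:  transport = TSocket.TSocket(*rpc_endpoint("cpp"))
assert rpc_endpoint("cpp") == ("localhost", 9090)
assert rpc_endpoint("python") == ("localhost", 9900)
```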
4. Run
Run the server:
./server
If the port is already in use, find the old process and kill it:
ps aux
kill -9 <pid>
Run the client:
python client.py
2.4 [Task 4]: Set up the nginx server
1. nginx version
nginx-1.14.0
2. Download:
wget http://nginx.org/download/nginx-1.14.0.tar.gz
3. Unpack:
tar xvzf nginx-1.14.0.tar.gz
4. Configure the install path:
After unpacking, enter nginx-1.14.0 and configure the install prefix:
./configure --prefix=/usr/local/nginx
5. Dependencies
If configuration fails, try installing these packages:
yum -y install pcre-devel zlib-devel
6. Build and install
Still in the source directory, run:
make
make install
7. Start nginx
/usr/local/nginx/sbin/nginx
8. Access test
Open the IP of the nginx machine in a browser; the nginx welcome page means the installation succeeded.
9. Check the listening port
netstat -antup | grep -w 80
ps aux | grep nginx
10. Kill the nginx processes
killall -9 nginx
A minimal CentOS 7 install does not ship the killall command; it comes from:
yum install psmisc
2.5 [Task 5]: Build a standalone server with spawn-fcgi
Through the CGI gateway interface our server can be exposed to external users for requests. Think of it as a proxy: the application can be reached, and sent requests, from machines other than the one it runs on.
1. Download spawn-fcgi
wget http://download.lighttpd.net/spawn-fcgi/releases-1.6.x/spawn-fcgi-1.6.4.tar.gz
2. Unpack
tar xzvf spawn-fcgi-1.6.4.tar.gz
3. The usual three steps: configure, compile, install
./configure
make
make install
4. Copy the binary next to the other service binaries:
cp src/spawn-fcgi /usr/local/nginx/sbin/
5. Install fcgi
fcgi is the library that CGI applications link against.
5.1 Download
wget ftp://ftp.ru.debian.org/gentoo-distfiles/distfiles/fcgi-2.4.1-SNAP-0910052249.tar.gz
5.2 Unpack
tar xzvf fcgi-2.4.1-SNAP-0910052249.tar.gz
5.3 Patch one header
In the unpacked directory, locate and edit fcgio.h:
]# find . -name fcgio.h
./include/fcgio.h
]# vim ./include/fcgio.h
Below #include <iostream>, add the standard I/O header:
#include <cstdio>
5.4 Configure, compile, install
./configure
make
make install
6. Create an fcgi demo:
(py27tf) [root@singler cgi_demo]# vim test.c
The code:
#include <stdio.h>
#include <stdlib.h>
#include <fcgi_stdio.h>

int main() {
    int count = 0;
    while (FCGI_Accept() >= 0) {
        printf("Content-type: text/html\r\n"
               "\r\n"
               ""
               "Hello Badou EveryBody!!!"
               "Request number %d running on host %s "
               "Process ID: %d\n ", ++count, getenv("SERVER_NAME"), getpid());
    }
    return 0;
}
7. Compile the code:
gcc -g -o test test.c -lfcgi
8. If the linker cannot find the library, add /usr/local/lib to ld.so.conf:
(py27tf) [root@singler cgi_demo]# vim /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/local/lib
Then reload the linker cache: ldconfig
9. Run the binary to test it:
./test
10. Once that works, start spawn-fcgi as the gateway:
/usr/local/nginx/sbin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /root/thrift_test/cgi_demo/test
Options:
-f path of the application binary to launch
-p port the launched application (a resident process) serves on
11. Check that the port is up
netstat -antup | grep 8088
12. Configure the nginx reverse proxy,
so that user requests arriving at nginx are forwarded to the server exposed through CGI.
The reverse proxy is configured in nginx.conf:
location / {
    root html;
    index index.html index.htm;
}

location ~ /recsys$ {
    fastcgi_pass 127.0.0.1:8088;
    include fastcgi_params;
}
13. Start nginx
/usr/local/nginx/sbin/nginx
14. Test (replace the IP with your own):
http://192.168.87.100/recsys
The test code above can be upgraded further, beyond a read-only demo:
as written it cannot receive or parse parameters, so it is not very extensible and needs hardening, for example to handle requests like:
http://192.168.87.100/recsys?itemid=111&userid=012&action=click&ip=10.11.11.10
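The parsing such an upgrade needs can be sketched as follows (shown in Python for brevity; in the C program the same raw string would come from FCGX_GetParam("QUERY_STRING", ...)):

```python
# Split a raw query string (everything after '?') into key/value pairs.
def parse_query_string(qs):
    params = {}
    for pair in qs.split('&'):
        if '=' in pair:
            key, value = pair.split('=', 1)
            params[key] = value
    return params

qs = "itemid=111&userid=012&action=click&ip=10.11.11.10"
assert parse_query_string(qs) == {
    "itemid": "111", "userid": "012", "action": "click", "ip": "10.11.11.10"}
```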
2.6 [Task 6]: Combine Thrift RPC and spawn-fcgi into a log server
1. The C++ client code:
cat client.cpp
#include "RecSys.h"
#include <iostream>
#include <string>
#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>
#include <fcgi_stdio.h>
#include <fcgiapp.h>

using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
using namespace std;
using std::string;
using boost::shared_ptr;

inline void send_response(
        FCGX_Request& request, const std::string& resp_str) {
    FCGX_FPrintF(request.out, "Content-type: text/html;charset=utf-8\r\n\r\n");
    FCGX_FPrintF(request.out, "%s", resp_str.c_str());
    FCGX_Finish_r(&request);
}

int main(int argc, char **argv) {
    // step 1. init fcgi
    FCGX_Init();
    FCGX_Request request;
    FCGX_InitRequest(&request, 0, 0);

    // step 2. connect server rpc
    boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
    boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
    boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
    transport->open();
    RecSysClient client(protocol);

    while (FCGX_Accept_r(&request) >= 0) {
        // http page -> client
        std::string send_data = FCGX_GetParam("QUERY_STRING", request.envp);
        string receive_data;
        // client -> server
        // server -> client
        client.rec_data(receive_data, send_data);
        cout << "receive http params: " << send_data << std::endl;
        cout << "receive server data: " << receive_data << endl;
        // client -> http page
        send_response(request, receive_data);
    }

    transport->close();
    return 0;
}
Note:
Pay attention to the header include paths here; this is exactly where I tripped up.
Compile command:
g++ -g -Wall -I./ -I/usr/local/include RecSys.cpp client.cpp -L/usr/local/lib/*.so -lthrift -lfcgi -o client
Note that in this command the include path is set to:
/usr/local/include
and the includes in client.cpp are adjusted to match, or the headers will not be found:
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TBufferTransports.h>
#include <thrift/protocol/TBinaryProtocol.h>
#include <fcgi_stdio.h>
#include <fcgiapp.h>
If you are unsure where a header lives, search the whole filesystem:
find / -name <header file name>
Example: find / -name fcgiapp.h
where:
/ search from the filesystem root
-name match by file name
2. Collect the build commands in a Makefile
The Makefile becomes:
(py27tf) [root@singler gen-cpp]# cat Makefile
G++ = g++
CFLAGS = -g -Wall
INCLUDES = -I./ -I/usr/local/include/thrift
LIBS = -L/usr/local/lib/*.so -lthrift -lfcgi
OBJECTS = RecSys.cpp RecSys_constants.cpp RecSys_types.cpp RecSys_server.skeleton.cpp
CLI_OBJECTS = RecSys.cpp client.cpp

server: $(OBJECTS)
	$(G++) $(CFLAGS) $(INCLUDES) $(OBJECTS) $(LIBS) -o server

client: $(CLI_OBJECTS)
	$(G++) $(CFLAGS) $(INCLUDES) $(CLI_OBJECTS) $(LIBS) -o client

.PHONY: clean
clean:
	rm -rf server client
3. Run
Run the server:
./server
Check that the supporting services are up:
nginx: netstat -antup | grep nginx
cgi gateway: netstat -antup | grep 8088
4. Start the cgi service
/usr/local/nginx/sbin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /root/thrift_test/thrift_demo/gen-cpp/client
5. Start nginx
Set up the reverse proxy as in task 5 (not repeated here).
/usr/local/nginx/sbin/nginx
Check whether nginx started successfully:
netstat -antup | grep nginx
6. Browser test
http://192.168.175.20/recsys?item=111&userid=123
If the page returns data and the server prints the log line, everything works.
2.7 [Task 7]: Load test with ab to simulate log traffic
ab is a very commonly used benchmarking tool.
1. Install it with yum (from any directory):
yum -y install httpd-tools
2. Verify the installation
ab -V
3. Run ab against the task-6 service
ab \
  -c 20 \
  -n 5000 \
  'http://192.168.87.100/recsys?itemids=111,222,333,444&userid=012&action=click&ip=10.11.11.10'
Options:
-c number of requests to issue concurrently (default: one at a time)
-n total number of requests
Quote the request URL; otherwise the shell treats everything after the & as a background job.
The results are shown in the figure below.
2.8 [Task 8]: Write logs with glog (Google's logging module)
glog plays a role similar to log4j in Java.
1. Download glog:
git clone https://github.com/google/glog.git
2. Build glog:
./autogen.sh && ./configure && make && make install
Afterwards the libglog* libraries appear under /usr/local/lib.
3. Extend the server code:
3.1 Include the header:
#include <glog/logging.h>
3.2 Initialize glog at the start of the main flow:
// Directory where the log files are written
FLAGS_log_dir = "/root/thrift_test/thrift_demo/gen-cpp/logs";
google::InitGoogleLogging(argv[0]);
3.3 Log output statements:
LOG(INFO) << data;
LOG(ERROR) << data;
LOG(WARNING) << data;
LOG(FATAL) << data;
Note that logging at FATAL level terminates the program. To watch logs stream continuously, use the INFO or WARNING level.
3.4 Run
Compile and run the server:
g++ -g -Wall -I./ -I/usr/local/include RecSys.cpp RecSys_constants.cpp RecSys_types.cpp RecSys_server.skeleton.cpp -L/usr/local/lib/*.so -lthrift -lglog -o server
./server
Start the client through spawn-fcgi:
/usr/local/bin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /usr/local/src/thrift_test/gen-cpp/client
Start nginx:
/usr/local/nginx/sbin/nginx
Note:
Bring things up in order (server -> client -> nginx) and check each one.
4. Run the load test to generate logs
ab -c 2 -n 50 'http://192.168.175.20/recsys?userid=110&itemid=333&type=show&ip=10.11.1.2'
2.9 [Task 9]: Wire up the real-time stream with Flume
Two Flume agents are combined: one lives on the log server, the other feeds HBase.
log server + flume --> flume + hbase
Start the agent on the log server:
./bin/flume-ng agent \
  -c conf \
  -f conf/flume-client.properties \
  -n a1 \
  -Dflume.root.logger=INFO,console
Start the agent on the HBase machine:
./bin/flume-ng agent \
  -c conf \
  -f conf/flume-server.properties \
  -n a1 \
  -Dflume.root.logger=INFO,console
A log line (the log data) looks like this:
I0513 14:49:56.568998 33273 RecSys_server.skeleton.cpp:32] userid=110&itemid=333&type=show&ip=192.11.1.200
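The structure of such a line is: a glog prefix (severity letter plus date, time, thread id, and file:line ending in ']'), then the message, which here is a '&'-separated list of key=value pairs. A small Python sketch of the extraction, splitting on the "] " delimiter rather than counting space-separated tokens (glog may pad the thread id with extra spaces):

```python
# Extract the key=value payload from one glog line.
def parse_glog_line(line):
    # Everything after the "file:line] " prefix is the logged message.
    message = line.split('] ', 1)[1]
    return dict(pair.split('=', 1) for pair in message.split('&'))

line = ("I0513 14:49:56.568998 33273 RecSys_server.skeleton.cpp:32] "
        "userid=110&itemid=333&type=show&ip=192.11.1.200")
assert parse_glog_line(line) == {
    "userid": "110", "itemid": "333", "type": "show", "ip": "192.11.1.200"}
```

This is the same extraction the custom HBase serializer in task 10 performs in Java.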
To keep things simple in practice, first test with a single Flume agent tailing the INFO-level log.
The Flume config file (make sure it points at the correct log file path):
[root@master conf]# vi log_exec_console.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /usr/local/tmp/logs/gen-cpp/server.INFO
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume-ng with the following command:
flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/log_exec_console.conf --name a1 -Dflume.root.logger=INFO,console
The console output looks like this:
2018-05-16 23:12:12,102 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 30 35 31 36 20 32 33 3A 31 32 3A 30 36 2E 30 W0516 23:12:06.0 }
2018-05-16 23:12:12,103 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 49 30 35 31 36 20 32 33 3A 31 32 3A 30 36 2E 30 I0516 23:12:06.0 }
2018-05-16 23:12:12,106 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 30 35 31 36 20 32 33 3A 31 32 3A 30 36 2E 30 W0516 23:12:06.0 }
... (one such Event line is printed for every tailed log line) ...
2.10 [Task 10]: Connect to HBase and land the logs
Here the log server and HBase are simulated as living on different machines: a Flume client collects the log files and ships them to a Flume server sitting on the HBase machine.
HBase-side versions:
1. hadoop: hadoop-1.2.1
2. java: jdk1.7.0_80
3. hbase: hbase-0.98.24-hadoop1
Flume version:
4. flume: apache-flume-1.6.0-bin
Note: the versions are called out explicitly because a mismatch here is a pit I fell into and spent several days climbing out of.
1. Create the table in HBase
Run these HBase shell commands:
create 'user_action_table','action_log'
put 'user_action_table', '111', 'action_log:userid', '2002xue'
put 'user_action_table', '111', 'action_log:itemid', '12345'
put 'user_action_table', '111', 'action_log:type', 'click'
put 'user_action_table', '111', 'action_log:ip', '11.10.5.27'
scan 'user_action_table'
truncate 'user_action_table'
describe 'user_action_table'
2. The Flume server on the HBase machine
Its configuration file:
[root@master conf]# cat flume-server.properties
#agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# other node, slave to master
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.175.10
a1.sources.r1.port = 52020
# set sink to hdfs
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hbase
a1.sinks.k1.table = user_action_table
a1.sinks.k1.columnFamily = action_log
#a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.serializer = com.badou.hbase.FlumeHbaseEventSerializer
a1.sinks.k1.serializer.columns = userid,itemid,type,ip
a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1
The custom serializer (our own take on RegexHbaseEventSerializer) is implemented as follows:
package com.badou.hbase;

import java.nio.charset.Charset;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

import org.apache.commons.lang.RandomStringUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.HbaseEventSerializer;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;

import com.google.common.base.Charsets;
import com.google.common.collect.Lists;

public class FlumeHbaseEventSerializer implements HbaseEventSerializer {
    // Config vars
    /** Regular expression used to parse groups from event data. */
    public static final String REGEX_CONFIG = "regex";
    public static final String REGEX_DEFAULT = " ";
    /** Whether to ignore case when performing regex matches. */
    public static final String IGNORE_CASE_CONFIG = "regexIgnoreCase";
    public static final boolean INGORE_CASE_DEFAULT = false;
    /** Comma separated list of column names to place matching groups in. */
    public static final String COL_NAME_CONFIG = "colNames";
    public static final String COLUMN_NAME_DEFAULT = "ip";
    /** Index of the row key in matched regex groups */
    public static final String ROW_KEY_INDEX_CONFIG = "rowKeyIndex";
    /** Placeholder in colNames for row key */
    public static final String ROW_KEY_NAME = "ROW_KEY";
    /** Whether to deposit event headers into corresponding column qualifiers */
    public static final String DEPOSIT_HEADERS_CONFIG = "depositHeaders";
    public static final boolean DEPOSIT_HEADERS_DEFAULT = false;
    /** What charset to use when serializing into HBase's byte arrays */
    public static final String CHARSET_CONFIG = "charset";
    public static final String CHARSET_DEFAULT = "UTF-8";

    /*
     * This is a nonce used in HBase row-keys, such that the same row-key never
     * gets written more than once from within this JVM.
     */
    protected static final AtomicInteger nonce = new AtomicInteger(0);
    protected static String randomKey = RandomStringUtils.randomAlphanumeric(10);

    protected byte[] cf;
    private byte[] payload;
    private List<byte[]> colNames = Lists.newArrayList();
    private boolean regexIgnoreCase;
    private Charset charset;

    @Override
    public void configure(Context context) {
        String regex = context.getString(REGEX_CONFIG, REGEX_DEFAULT);
        regexIgnoreCase = context.getBoolean(IGNORE_CASE_CONFIG, INGORE_CASE_DEFAULT);
        context.getBoolean(DEPOSIT_HEADERS_CONFIG, DEPOSIT_HEADERS_DEFAULT);
        Pattern.compile(regex, Pattern.DOTALL + (regexIgnoreCase ? Pattern.CASE_INSENSITIVE : 0));
        charset = Charset.forName(context.getString(CHARSET_CONFIG, CHARSET_DEFAULT));

        // Column list from the sink configuration
        // (a1.sinks.k1.serializer.columns = userid,itemid,type,ip)
        String cols = context.getString("columns");
        String colNameStr;
        if (cols != null && !"".equals(cols)) {
            colNameStr = cols;
        } else {
            colNameStr = context.getString(COL_NAME_CONFIG, COLUMN_NAME_DEFAULT);
        }
        String[] columnNames = colNameStr.split(",");
        for (String s : columnNames) {
            colNames.add(s.getBytes(charset));
        }
    }

    @Override
    public void configure(ComponentConfiguration conf) {}

    @Override
    public void initialize(Event event, byte[] columnFamily) {
        event.getHeaders();
        this.payload = event.getBody();
        this.cf = columnFamily;
    }

    protected byte[] getRowKey(Calendar cal) {
        String str = new String(payload, charset);
        String tmp = str.replace("\"", "");
        String[] arr = tmp.split(" ");
        String log_data = arr[5];
        String[] param_arr = log_data.split("&");
        String ip_str = param_arr[3];
        // String dataStr = arr[3].replace("[", "");
        // String rowKey = getDate2Str(dataStr) + "-" + clientIp + "-" + nonce.getAndIncrement();
        String rowKey = ip_str + "-" + nonce.getAndIncrement();
        return rowKey.getBytes(charset);
    }

    protected byte[] getRowKey() {
        return getRowKey(Calendar.getInstance());
    }

    @Override
    public List<Row> getActions() throws FlumeException {
        List<Row> actions = Lists.newArrayList();
        byte[] rowKey;

        String body = new String(payload, charset);
        System.out.println("body===>" + body);
        String tmp = body.replace("\"", "");
        System.out.println("tmp===>" + tmp);
        // String[] arr = tmp.split(REGEX_DEFAULT);
        String[] arr = tmp.split(" ");

        // The glog message (userid=...&itemid=...&type=...&ip=...) is the
        // sixth whitespace-separated token of the line
        String log_data = arr[5];
        String[] param_arr = log_data.split("&");
        String userid = param_arr[0].split("=")[1];
        String itemid = param_arr[1].split("=")[1];
        String type = param_arr[2].split("=")[1];
        String ip_str = param_arr[3].split("=")[1];
        System.out.println(userid);
        System.out.println(itemid);
        System.out.println(type);
        System.out.println(ip_str);

        try {
            rowKey = getRowKey();
            Put put = new Put(rowKey);
            put.add(cf, colNames.get(0), userid.getBytes(Charsets.UTF_8));
            put.add(cf, colNames.get(1), itemid.getBytes(Charsets.UTF_8));
            put.add(cf, colNames.get(2), type.getBytes(Charsets.UTF_8));
            put.add(cf, colNames.get(3), ip_str.getBytes(Charsets.UTF_8));
            actions.add(put);
        } catch (Exception e) {
            throw new FlumeException("Could not get row key!", e);
        }
        return actions;
    }

    @Override
    public List<Increment> getIncrements() {
        return Lists.newArrayList();
    }

    @Override
    public void close() {}

    public static String getDate2Str(String dataStr) {
        SimpleDateFormat formatter = null;
        SimpleDateFormat format = null;
        Date date = null;
        try {
            formatter = new SimpleDateFormat("dd/MMM/yyyy:hh:mm:ss", Locale.ENGLISH);
            date = formatter.parse(dataStr);
            format = new SimpleDateFormat("yyyy-MM-dd-HH:mm:ss");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return format.format(date);
    }
}
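The row-key scheme in getRowKey() above can be summarized with a short sketch: the client IP plus a per-process counter (the AtomicInteger nonce), so repeated events from the same IP never collide within one JVM:

```python
# Sketch of the serializer's row-key scheme: ip + "-" + incrementing nonce.
import itertools

nonce = itertools.count()  # stands in for the Java AtomicInteger

def make_row_key(ip_str):
    return "%s-%d" % (ip_str, next(nonce))

assert make_row_key("10.1.8.27") == "10.1.8.27-0"
assert make_row_key("10.1.8.27") == "10.1.8.27-1"
```

Note that the counter resets when the JVM restarts, so keys can repeat across runs; the commented-out date-based variant in getRowKey() addresses that.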
Package the serializer class into a jar and place it in Flume's lib directory:
/usr/local/src/apache-flume-1.6.0-bin/lib
This step also threw exceptions for me (mainly array index out of bounds); printing detailed debug output is what finally exposed the cause.
One more point: the Flume and HBase jars the Java code is compiled against must match the versions deployed on the servers, and so must the Java version used for compilation.
When the Flume and HBase versions do not match, you may see an exception like:
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)V
For details, see: https://cloud.tencent.com/developer/article/1025430
Start the Flume server with:
flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume-server.properties --name a1 -Dflume.root.logger=INFO,console
3. The Flume client sits on a different virtual machine:
Its configuration file:
[root@master conf]# cat flume_client.conf
#a1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
#a1.sources.r1.command = tail -F /tmp/1.log
a1.sources.r1.command = tail -F /usr/local/tmp/logs/gen-cpp/server.INFO
# set sink1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.175.10
a1.sinks.k1.port = 52020
a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1
Start the Flume client:
flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume_client.conf --name a1 -Dflume.root.logger=INFO,console
4. Start the server
[root@master gen-cpp]# pwd
/usr/local/src/thrift_test/gen-cpp
[root@master gen-cpp]# ./server
5. Start the client through spawn-fcgi
/usr/local/bin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /usr/local/src/thrift_test/gen-cpp/client
6. Start nginx
/usr/local/nginx/sbin/nginx
7. Browser test
http://192.168.175.20/recsys?userid=111&itemid=123&type=click&ip=10.1.8.27
8. Check HBase
scan 'user_action_table'