Flume Log Collection
Introduction to Apache Flume
- Flume streams log data from many kinds of sources into Hadoop or other destinations
- A reliable, available, and efficient distributed data-collection service
- Flume has a simple, flexible architecture based on streaming data flows, with support for fault tolerance, failover, and recovery
- Donated to Apache by Cloudera in 2009; now a top-level Apache project
Flume Architecture
- Client: where the data is produced, such as a web server
- Event: a single unit of data transferred through an Agent; for log data, one event typically corresponds to one line
- Agent: an independent JVM process
- Flume is deployed as one or more Agents
- An Agent contains three components:
- Source
- Channel
- Sink
- How it works
At its core, Flume collects data from a data source (Source) and delivers it to a specified destination (Sink). To make delivery reliable, events are first buffered in a Channel before being sent to the Sink; only after the data has actually reached the Sink's destination does Flume delete its buffered copy.
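The buffer-then-delete handoff can be sketched as a toy model (hypothetical class and method names, not Flume's real API; the channel is modeled as a bounded in-memory queue, and an event is removed only after the sink has accepted it):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of the Source -> Channel -> Sink handoff (not Flume's actual API).
public class MiniAgent {
    private final Queue<String> channel = new ArrayDeque<>(); // buffers events
    private final List<String> delivered = new ArrayList<>(); // stands in for the sink's destination

    // Source side: the event always goes into the channel first.
    public void sourcePut(String event) {
        channel.add(event);
    }

    // Sink side: the buffered copy is deleted only after delivery succeeds.
    public void sinkDrain() {
        while (!channel.isEmpty()) {
            String event = channel.peek(); // read without removing
            delivered.add(event);          // "deliver" to the destination
            channel.poll();                // only now drop it from the buffer
        }
    }

    public List<String> delivered() { return delivered; }
    public int buffered() { return channel.size(); }
}
```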
Flume Installation and Deployment
- Installation steps
# Unpack
[root@jzy1 opt]# tar -zxf flume-ng-1.6.0-cdh5.14.2.tar.gz
# Move and rename
[root@jzy1 opt]# mv apache-flume-1.6.0-cdh5.14.2-bin/ soft/flume160
# Copy the config template and edit it
[root@jzy1 conf]# cp flume-env.sh.template flume-env.sh
[root@jzy1 conf]# vi flume-env.sh
# Add the JDK path
export JAVA_HOME=/opt/soft/jdk180
# Configure environment variables: vi /etc/profile and append
export FLUME_HOME=/opt/soft/flume160
export PATH=$PATH:$FLUME_HOME/bin
# Reload the profile, then verify the installation
[root@jzy1 ~]# source /etc/profile
[root@jzy1 ~]# flume-ng version
Basic Flume Agent Configuration
agent.sources = s1
agent.channels = c1
agent.sinks = sk1
# Source: netcat listening on localhost:5678, feeding channel c1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 5678
agent.sources.s1.channels = c1
# Sink: logger mode, reading from channel c1
agent.sinks.sk1.type = logger
agent.sinks.sk1.channel = c1
# Channel: in-memory buffer
agent.channels.c1.type = memory
# Start the agent and log to the console
flume-ng agent --name agent -f h0.conf -Dflume.root.logger=INFO,console
Worked Examples
netcat source
- Create a new configuration file with the following content
[root@jzy1 flumeconf]# vi conf_0804.properties
# Component aliases
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.56.21
a1.sources.r1.port = 6666
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events taken from the source or given to the sink per transaction
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Install the nc utility
yum install nmap-ncat.x86_64 -y
- Run the agent, logging to the console
flume-ng agent -n a1 -c conf -f /opt/flumeconf/conf_0804.properties -Dflume.root.logger=INFO,console
- Start nc and send some lines
nc 192.168.56.21 6666
- The lines sent through nc appear in full in the agent's console output
Example: monitoring a directory with the Spooling Directory source
- Example: read the customer.csv file placed in /opt/datas and output it through the logger sink
a2.channels=c2
a2.sources=s2
a2.sinks=k2
a2.sources.s2.type=spooldir
a2.sources.s2.spoolDir=/opt/datas
a2.channels.c2.type=memory
a2.channels.c2.capacity=10000
a2.channels.c2.transactionCapacity=1000
a2.sinks.k2.type=logger
a2.sinks.k2.channel=c2
a2.sources.s2.channels=c2
# Run
flume-ng agent -n a2 -c conf -f /opt/flumeconf/conf_0805_readfile.properties -Dflume.root.logger=INFO,console
- After a file has been ingested, a suffix is appended to its name. The directory is monitored continuously, so any new file of the same format dropped into datas is picked up automatically (events.csv was likewise renamed after ingestion). To have such a file ingested again, stop the agent and remove the appended suffix before restarting.
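The rename step can be illustrated with plain java.nio (the spooling-directory source appends .COMPLETED by default; this sketch only mimics that rename, it is not Flume code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Mimics how the spooling-directory source marks a consumed file:
// it renames foo.csv to foo.csv.COMPLETED so the file is not re-read.
public class CompletedSuffixDemo {
    public static Path markCompleted(Path file) throws IOException {
        Path done = file.resolveSibling(file.getFileName() + ".COMPLETED");
        return Files.move(file, done); // rename in place, same directory
    }
}
```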
exec source
Not demonstrated in detail here; the configuration is as follows. (Note that the exec source does not track its read position, so events can be lost or duplicated if the agent restarts mid-stream.)
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Set the source type to exec
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/datas/exectest.txt
# Connect the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# Specify the sink
a1.sinks.sk1.type = logger
# Connect the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
http source
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
a1.sources.s1.type = http
a1.sources.s1.port = 5140
# Connect the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# Specify the sink
a1.sinks.sk1.type = logger
# Connect the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
Test it:
curl -XPOST localhost:5140 -d '[{"headers":{"h1":"v1","h2":"v2"},"body":"hello flume"}]'
taildir source
The taildir source records the read position of each tailed file in a position file, so tailing resumes where it left off after a restart.
The configuration is as follows
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1 f2
# Filegroup f1: a single file; filegroup f2: all files matching a regex
a1.sources.s1.filegroups.f1 = /opt/datas/tail_1/example.log
a1.sources.s1.filegroups.f2 = /opt/datas/tail_2/.*log.*
# Where to record read positions
a1.sources.s1.positionFile = /opt/datas/tail_position/taildir_position.json
# Static headers attached per filegroup
a1.sources.s1.headers.f1.headerKey1 = value1
a1.sources.s1.headers.f2.headerKey1 = value2
a1.sources.s1.headers.f2.headerKey2 = value3
a1.sources.s1.fileHeader = true
# Connect the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# Specify the sink
a1.sinks.sk1.type = logger
# Connect the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
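For reference, the position file is a JSON array tracking each tailed file's inode, byte offset, and path; the concrete values below are made-up examples:

```json
[{"inode":2496001,"pos":1024,"file":"/opt/datas/tail_1/example.log"},
 {"inode":2496002,"pos":512,"file":"/opt/datas/tail_2/app.log.1"}]
```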
hdfs sink example: monitor a directory and upload to HDFS
a2.channels=c2
a2.sources=s2
a2.sinks=k2
a2.sources.s2.type=spooldir
a2.sources.s2.spoolDir=/opt/datas
a2.sources.s2.fileHeader=false
a2.channels.c2.type=memory
a2.channels.c2.capacity=10000
a2.channels.c2.transactionCapacity=1000
a2.sinks.k2.type=hdfs
a2.sinks.k2.hdfs.path=hdfs://jzy1:9000/data/customers
# Roll a new HDFS file after 5000 events or 600000 bytes, whichever comes first
a2.sinks.k2.hdfs.rollCount=5000
a2.sinks.k2.hdfs.rollSize=600000
# Number of events written to HDFS per flush
a2.sinks.k2.hdfs.batchSize=500
a2.sinks.k2.channel=c2
a2.sources.s2.channels=c2
# Run with this configuration file
flume-ng agent -n a2 -c conf -f /opt/flumeconf/conf_0805_readfile.properties
# Create an external Hive table mapped onto the HDFS path
hive> create external table xxx(id string,fname string,lname string,email string,gender string,address string,lan string,job string, ct string,cr string)
> row format delimited fields terminated by ','
> stored as sequencefile
> location '/data/customers'
> tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.064 seconds
# A query confirms the data was uploaded successfully
hive> select * from xxx limit 3;
OK
1 Spencer Raffeorty sraffeorty0@dropbox.com Male 9274 Lyons Court China KhmerSafety Technician III jcb
2 Cherye Poynor cpoynor1@51.la Female 1377 Anzinger Avenue China Czech Research Nursinstapayment
3 Natasha Abendroth nabendroth2@scribd.com Female 2913 Evergreen Lane China Yiddish Budget/Accounting Analyst IV visa
Time taken: 0.049 seconds, Fetched: 3 row(s)
Flume interceptors: match-and-filter
Use a regular expression to drop unwanted events.
The configuration is as follows
a3.channels=c3
a3.sources=s3
a3.sinks=k3
a3.sources.s3.type=spooldir
a3.sources.s3.spoolDir=/opt/datas
a3.sources.s3.interceptors=userid_filter
a3.sources.s3.interceptors.userid_filter.type=regex_filter
a3.sources.s3.interceptors.userid_filter.regex=userid.*
a3.sources.s3.interceptors.userid_filter.excludeEvents=true
a3.channels.c3.type=memory
a3.sinks.k3.type=logger
a3.sources.s3.channels=c3
a3.sinks.k3.channel=c3
# Start command (worth writing out a few times so you don't forget it)
flume-ng agent -n a3 -c conf -f /opt/flumeconf/conf_0805_interceptor.properties -Dflume.root.logger=INFO,console
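Which events get dropped can be checked with plain java.util.regex (regex_filter with excludeEvents=true discards events whose body matches the pattern; this sketch assumes find()-style matching and is an approximation, not Flume's internal code):

```java
import java.util.regex.Pattern;

// Approximates regex_filter with excludeEvents=true:
// an event is kept only when its body does NOT match the pattern.
public class RegexFilterDemo {
    private static final Pattern USERID = Pattern.compile("userid.*");

    public static boolean keep(String body) {
        return !USERID.matcher(body).find(); // matching events are excluded
    }
}
```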
Using a custom interceptor
The file content is as follows:
张三,男,20
李四,女,18
王五,男,28
Requirement: convert the gender field to a numeric code, e.g. 男 (male) -> 1, 女 (female) -> 2, unknown -> 0
- Write a custom interceptor
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class CustomInterceptor implements Interceptor {
    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        String line = new String(body); // e.g. 张三,男,20
        String[] sps = line.split(",");
        // Map the gender field (index 1) to its numeric code
        switch (sps[1]) {
            case "男":
                sps[1] = "1";
                break;
            case "女":
                sps[1] = "2";
                break;
            default:
                sps[1] = "0";
        }
        String newStr = sps[0] + "," + sps[1] + "," + sps[2];
        event.setBody(newStr.getBytes());
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new CustomInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
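The mapping rule itself can be sanity-checked in isolation (a standalone copy of the switch logic, not the interceptor class):

```java
// Standalone copy of the interceptor's gender-mapping rule: 男->1, 女->2, anything else->0.
public class GenderMapDemo {
    public static String mapGender(String g) {
        switch (g) {
            case "男": return "1";
            case "女": return "2";
            default:   return "0";
        }
    }
}
```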
- Package the finished custom interceptor and upload the jar into Flume's lib directory
- Write the configuration file
a4.channels=c4
a4.sources=s4
a4.sinks=k4
a4.sources.s4.type=spooldir
a4.sources.s4.spoolDir=/opt/datas
a4.sources.s4.interceptors=myintec
a4.sources.s4.interceptors.myintec.type=com.jstd.myinterceptors.CustomInterceptor$Builder
a4.channels.c4.type=memory
a4.sinks.k4.type=logger
a4.sources.s4.channels=c4
a4.sinks.k4.channel=c4
# Run with this configuration file
flume-ng agent -n a4 -c conf -f /opt/flumeconf/conf_0806_custconf.properties -Dflume.root.logger=INFO,console