Flume Interceptors
Interceptors sit between a source and its channel: before the data an agent reads is handed on, the interceptor can add headers to each event, which later stages can use for routing and filtering.
1) Timestamp Interceptor
Intercepts events from the source and adds a timestamp header, e.g. header: {timestamp=12425367272}
Configuring the interceptor:
# interceptor alias
a1.sources.r1.interceptors = i1
# interceptor type
a1.sources.r1.interceptors.i1.type = timestamp
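The timestamp header holds epoch time in milliseconds. A quick way to sanity-check such a value with GNU date (the sample value below is the one from the console run later in this article; output shown in UTC):

```shell
# timestamp headers are epoch milliseconds; divide by 1000 and
# feed GNU date to make the value readable
ts_millis=1572778457556
date -u -d "@$((ts_millis / 1000))" '+%Y-%m-%d %H:%M:%S'
# → 2019-11-03 10:54:17
```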
1. Printing to the console
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_time.conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: print to the console
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_time.conf --name a1 -Dflume.root.logger=INFO,console
One of the resulting events:
Event: { headers:{timestamp=1572778457556} body: 61 61 61 61 aaaa }
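The logger sink prints the body as hex bytes followed by a text preview; 0x61 is ASCII 'a', which is easy to confirm in a shell:

```shell
# reconstruct the body bytes 61 61 61 61 from the event above
printf '\x61\x61\x61\x61\n'
# → aaaa
```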
2. HDFS sink, using the timestamp interceptor
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume_data/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = event-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it (pointing --conf-file at the file this configuration was saved in):
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/<this-config-file> --name a1 -Dflume.root.logger=INFO,console
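The %Y-%m-%d/%H escapes in hdfs.path are resolved from each event's timestamp header, which is why the timestamp interceptor must be attached. The expansion can be previewed with GNU date; the header value is the sample from the console run above, shown in UTC:

```shell
# how /user/flume_data/%Y-%m-%d/%H resolves for timestamp=1572778457556
date -u -d @1572778457 '+/user/flume_data/%Y-%m-%d/%H'
# → /user/flume_data/2019-11-03/10
```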
2) Host Interceptor
Intercepts events from the source and adds the host name or IP of the agent's host as a header.
1. Printing to the console
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_host_conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: print to the console
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_host_conf --name a1 -Dflume.root.logger=INFO,console
One of the resulting events:
Event: { headers:{host=192.168.2.101} body: 61 61 61 61 aaaa }
By default the interceptor records the IP. Add the following line to record the host name instead:
a1.sources.r1.interceptors.i1.useIP = false
which yields events like:
Event: { headers:{host=hadoop01} body: 32 30 31 39 2D 31 31 2D 30 33 20 30 31 3A 31 39 2019-11-03 01:19 }
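Roughly speaking, useIP=true records what the local IP lookup returns and useIP=false records the host name, i.e. the same values these commands print:

```shell
hostname                           # host name, used when useIP = false
hostname -i 2>/dev/null || true    # resolved IP, used when useIP = true (flag support varies by OS)
```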
2. Two interceptors chained, with an HDFS sink
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/host_ts_conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach both interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.useIP = false
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume_data/%{host}/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = event-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/host_ts_conf --name a1 -Dflume.root.logger=INFO,console
Resulting file: /user/flume_data/hadoop01/2019-11-03/03/event.1572781570881.log
3) Static Interceptor
The static interceptor attaches a fixed, user-defined header to every event:
header {K, V}, where both the key K and the value V are chosen by you.
It requires type = static plus a key and a value.
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/statis_conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = filename
a1.sources.r1.interceptors.i1.value = zoo
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: print to the console
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/statis_conf --name a1 -Dflume.root.logger=INFO,console
One of the resulting events:
19/11/03 04:32:29 INFO sink.LoggerSink: Event: { headers:{filename=zoo} body: 32 30 31 39 2D 31 31 2D 30 33 20 30 31 3A 31 39 2019-11-03 01:19 }
4) A Combined Example
Task:
There are three log producers A, B and C. A and B are deployed on node hadoop01, C on node hadoop02:
access.log: user activity log, produced by A
nginx.log: nginx server log, produced by B
web.log: web application log, produced by C
Requirement: collect access.log, nginx.log and web.log from A, B and C, aggregate them on the hadoop03 machine, and write them to HDFS
with the following directory layout:
/source/logs/access/<date>/**
/source/logs/nginx/<date>/**
/source/logs/web/20160101/**
Analysis:
A on hadoop01: access.log, at /home/hadoop/apps/apache-flume-1.8.0-bin/conf/access.log
B on hadoop01: nginx.log, at /home/hadoop/apps/apache-flume-1.8.0-bin/conf/nginx.log
C on hadoop02: web.log, at /home/hadoop/apps/apache-flume-1.8.0-bin/conf/web.log
The requirements call for a fan-in deployment: agents in parallel, chained into one collector. Each producer is watched by its own agent, and the three agents monitoring A, B and C use avro sinks. Because the files must land in separate HDFS directories, each of these agents attaches a static interceptor whose key is uniformly filename and whose value is the name of the file it tails. The collector agent on hadoop03 uses an avro source; its bind address and port must match the avro sinks of the A, B and C agents. Finally, the collector defines two more interceptors: a timestamp interceptor and a static interceptor with key type and value log.
(Figure: multi-agent fan-in deployment diagram; original image unavailable)
Implementation:
Agent configuration monitoring machine A:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/apps/apache-flume-1.8.0-bin/conf/access.log
# attach the static interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = filename
a1.sources.r1.interceptors.i1.value = access
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: avro, to the collector on hadoop03
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 45551
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent configuration monitoring machine B:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r2
# channel alias
a1.channels = c2
# sink alias
a1.sinks = k2
# source configuration: where the data comes from
a1.sources.r2.type = exec
a1.sources.r2.command = tail -100 /home/hadoop/apps/apache-flume-1.8.0-bin/conf/nginx.log
# attach the static interceptor
a1.sources.r2.interceptors = i1
a1.sources.r2.interceptors.i1.type = static
a1.sources.r2.interceptors.i1.key = filename
a1.sources.r2.interceptors.i1.value = nginx
# channel configuration: in-memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# sink configuration: avro, to the collector on hadoop03
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 45551
# bind the source and the sink to the channel
a1.sources.r2.channels = c2
a1.sinks.k2.channel = c2
Agent configuration monitoring machine C:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/apps/apache-flume-1.8.0-bin/conf/web.log
# attach the static interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = filename
a1.sources.r1.interceptors.i1.value = web
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: avro, to the collector on hadoop03
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 45551
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Collector configuration on hadoop03, aggregating A, B and C:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: avro
a1.sources.r1.type = avro
# bind address and port here must match the upstream avro sinks
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 45551
# attach the interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = type
a1.sources.r1.interceptors.i2.value = log
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume_data/%{type}/%{filename}/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = event-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
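An event tailed from access.log arrives at the collector with headers like {filename=access} from its leaf agent, plus {type=log} and a timestamp added by the collector. Assuming the collector's path uses both headers, e.g. /user/flume_data/%{type}/%{filename}/%Y-%m-%d/%H, the expansion can be previewed with GNU date (hypothetical timestamp, UTC shown):

```shell
# how the path escapes resolve for type=log, filename=access,
# timestamp=1572781570881
type=log; filename=access
date -u -d @1572781570 "+/user/flume_data/$type/$filename/%Y-%m-%d/%H"
# → /user/flume_data/log/access/2019-11-03/11
```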
Start them back to front, collector first:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/collector-conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/access-conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/nginx-conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/web-conf --name a1 -Dflume.root.logger=INFO,console
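For an end-to-end test, the tailed files need some content. A small helper that seeds the three logs with one timestamped line each (LOG_DIR is a stand-in for illustration; the configurations above tail files under apache-flume-1.8.0-bin/conf/):

```shell
# append one timestamped sample line to each of the three logs
LOG_DIR="${LOG_DIR:-/tmp/flume_demo}"
mkdir -p "$LOG_DIR"
for f in access nginx web; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') sample $f event" >> "$LOG_DIR/$f.log"
done
ls "$LOG_DIR"
```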