I. Flume's Built-in Interceptors
Example 1:
Implementation:
Use a regular expression to filter out matching messages, in this case messages that start with user.
One source feeds two channels, and two sinks deliver the data from those channels to Kafka and HDFS respectively.
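The resulting flow:
trainSource ─┬─> kafkaChannel ─> kafkaSink ─> Kafka (topic: train)
             └─> hdfsChannel ─> hdfsSink ─> HDFS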
Configuration file:
Define the three components:
# Declare the agent's three component groups: sources, channels, sinks
train.sources=trainSource
train.channels=kafkaChannel hdfsChannel
train.sinks=kafkaSink hdfsSink
source:
# The source watches a directory for new files
train.sources.trainSource.type=spooldir
# Directory Flume watches; files placed here are collected by Flume
train.sources.trainSource.spoolDir=/root/software/flumlogfile/train
# Deserializer that parses file data into Events; a deserializer must implement the EventDeserializer.Builder interface. LINE turns each line of text into one Event
train.sources.trainSource.deserializer=LINE
# Maximum number of characters per Event; a line longer than this is truncated
train.sources.trainSource.deserializer.maxLineLength=32000
# Regex for the file names to be read
train.sources.trainSource.includePattern=train_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
# Attach an interceptor named head_filter to the source
train.sources.trainSource.interceptors=head_filter
# Set head_filter's type to regex_filter, the built-in regex filtering interceptor
train.sources.trainSource.interceptors.head_filter.type=regex_filter
# Regex matched against the Event body; ^user matches messages that start with user
train.sources.trainSource.interceptors.head_filter.regex=^user
# true drops the Events that match the regex; false keeps only the Events that match
train.sources.trainSource.interceptors.head_filter.excludeEvents=true
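The excludeEvents semantics are easy to misread, so here is a minimal Java sketch of the logic this interceptor applies (an illustration only, not Flume's actual implementation; the sample CSV lines are made up):

import java.util.regex.Pattern;

// Minimal sketch of regex_filter's semantics, not Flume's real code.
public class RegexFilterSketch {
    private static final Pattern REGEX = Pattern.compile("^user");
    private static final boolean EXCLUDE_EVENTS = true;

    // Returns null to signal "drop this Event", mirroring interceptor semantics.
    static String filter(String body) {
        boolean matches = REGEX.matcher(body).find();
        // excludeEvents=true drops matches; excludeEvents=false drops non-matches.
        return (matches == EXCLUDE_EVENTS) ? null : body;
    }

    public static void main(String[] args) {
        System.out.println(filter("user_id,station_id,time")); // null: dropped
        System.out.println(filter("10001,A001,2021-05-25"));   // kept
    }
}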
channel:
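# kafkaChannel is a file channel: Events are persisted to disk and survive an agent restart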
train.channels.kafkaChannel.type=file
train.channels.kafkaChannel.checkpointDir=/root/software/flumlogfile/checkpoint/train
train.channels.kafkaChannel.dataDirs=/root/software/flumlogfile/data/train
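# hdfsChannel is a memory channel: higher throughput, but Events are lost if the agent dies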
train.channels.hdfsChannel.type=memory
train.channels.hdfsChannel.capacity=64000
train.channels.hdfsChannel.transactionCapacity=16000
sink:
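# kafkaSink drains kafkaChannel and publishes each Event to the Kafka topic train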
train.sinks.kafkaSink.type=org.apache.flume.sink.kafka.KafkaSink
train.sinks.kafkaSink.batchSize=640
train.sinks.kafkaSink.brokerList=192.168.21.2:9092
train.sinks.kafkaSink.topic=train
train.sinks.hdfsSink.type=hdfs
train.sinks.hdfsSink.hdfs.fileType=DataStream
# Fixed prefix for the files Flume creates in the HDFS directory
train.sinks.hdfsSink.hdfs.filePrefix=train
# Suffix for those files (e.g. .avro; note the "." is not added automatically and must be configured explicitly)
train.sinks.hdfsSink.hdfs.fileSuffix=.csv
# HDFS directory path
train.sinks.hdfsSink.hdfs.path=hdfs://192.168.21.2:9000/kb11file/train/%Y-%m-%d
train.sinks.hdfsSink.hdfs.useLocalTimeStamp=true
# Number of Events written to HDFS per batch (default: 100)
train.sinks.hdfsSink.hdfs.batchSize=640
# Roll to a new file after this many Events (0 = never roll by Event count; default: 10)
train.sinks.hdfsSink.hdfs.rollCount=0
# Roll to a new file once it reaches this size in bytes (0 = never roll by size; default: 1024 bytes = 1 KB)
train.sinks.hdfsSink.hdfs.rollSize=6400000
# Roll to a new file every 30 seconds (0 = write all Events into a single file; default: 30)
train.sinks.hdfsSink.hdfs.rollInterval=30
Bind the source and sinks to the channels:
train.sources.trainSource.channels=kafkaChannel hdfsChannel
train.sinks.hdfsSink.channel=hdfsChannel
train.sinks.kafkaSink.channel=kafkaChannel
Run Flume:
[root@hadoop2 flume]# ./bin/flume-ng agent --name train --conf ./conf/ --conf-file ./conf/kb11Job/train-flume-kafka-hdfs.conf -Dflume.root.logger=INFO,console
Start a Kafka console consumer:
[root@hadoop2 ~]# kafka-console-consumer.sh --topic train --bootstrap-server 192.168.21.2:9092 --from-beginning
II. Custom Interceptors
A concept: fan-out flows
Flume supports fan-out flow configurations, in which one source connects to multiple channels. A source has two fan-out modes: replicating and multiplexing.
1) Replicating: every Event from the source is sent to all channels connected to it. If no selector is specified, the replicating selector is the default.
2) Multiplexing: an Event is sent to only a subset of the channels, according to routing rules you specify.
The multiplexing selector takes an extra set of properties for routing the flow: a mapping from the values of an Event header attribute to channels. The selector reads the configured attribute from each Event's headers; when the value matches a configured rule, the Event is sent to the channels that rule names. When no rule matches, the Event goes to the default channels. See the configuration below:
# List the Agent's sources, sinks, and channels; note there is 1 source, 2 channels, and 2 sinks
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# Channels connected to Source1; two channels are configured here
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>
# Connect each sink to its own channel
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>
# Source1's selector is the replicating selector (as noted above, this is also the default when no selector is configured explicitly)
<Agent>.sources.<Source1>.selector.type = replicating
# The full multiplexing selector configuration looks like this; the selector type is multiplexing
<Agent>.sources.<Source1>.selector.type = multiplexing
# If <someHeader> is, say, abc, the selector reads the abc attribute of each Event's headers as the routing key
<Agent>.sources.<Source1>.selector.header = <someHeader>
# If <Value1> is, say, 3, Events whose abc header equals 3 are sent to Channel1
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
# Likewise, Events whose abc header equals <Value2> are sent to both Channel1 and Channel2
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
# Likewise, Events whose abc header equals <Value3> are sent to Channel2
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
# Events whose abc header matches none of the values above go to the default, Channel2
<Agent>.sources.<Source1>.selector.default = <Channel2>
1. Create a Maven project
Add the dependency:
<!-- https://mvnrepository.com/artifact/org.apache.flume/flume-ng-core -->
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.6.0</version>
</dependency>
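The version should match the Flume installation on the server; since Flume already ships flume-ng-core in its own lib directory, marking the dependency as provided scope is also reasonable.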
2. Write the Java class:
Description:
For each Event (headers + body):
* if the body starts with hello, tag the Event's headers with hello
* if the body starts with hi, tag the Event's headers with hi
* otherwise, tag them with other
package Demo052502;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class InterceptorDemo implements Interceptor {
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        addHeaderEvents = new ArrayList<Event>();
    }

    // Tag a single Event according to how its body starts
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String bodyStr = new String(event.getBody());
        if (bodyStr.startsWith("hello")) {
            headers.put("type", "hello");
        } else if (bodyStr.startsWith("hi")) {
            headers.put("type", "hi");
        } else {
            headers.put("type", "other");
        }
        return event;
    }

    // Batch version: apply the single-Event logic to every Event in the list
    @Override
    public List<Event> intercept(List<Event> list) {
        addHeaderEvents.clear();
        for (Event event : list) {
            addHeaderEvents.add(intercept(event));
        }
        return addHeaderEvents;
    }

    @Override
    public void close() {
        addHeaderEvents = null;
    }

    // Flume instantiates the interceptor through this Builder; the config
    // references it as Demo052502.InterceptorDemo$Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
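Before packaging, the class can be sanity-checked locally with a throwaway main method (a minimal sketch; EventBuilder ships with flume-ng-core, and InterceptorDemoCheck is just an illustrative name):

package Demo052502;

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

// Throwaway local check; not part of the deployed jar.
public class InterceptorDemoCheck {
    public static void main(String[] args) {
        InterceptorDemo interceptor = new InterceptorDemo();
        interceptor.initialize();

        Event event = EventBuilder.withBody("hello world", StandardCharsets.UTF_8);
        interceptor.intercept(event);
        System.out.println(event.getHeaders()); // {type=hello}

        event = EventBuilder.withBody("something else", StandardCharsets.UTF_8);
        interceptor.intercept(event);
        System.out.println(event.getHeaders()); // {type=other}

        interceptor.close();
    }
}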
3. Package the class into a jar (e.g. with mvn package)
When finished, copy the jar into the flume/lib directory on the Linux machine so it is on the agent's classpath
4. Write the agent configuration
interceptor.sources=interceptorSource
interceptor.channels=hellochannel hichannel otherchannel
interceptor.sinks=hellosink hisink othersink
# Read from a spooling directory
interceptor.sources.interceptorSource.type=spooldir
interceptor.sources.interceptorSource.spoolDir=/root/software/flumlogfile/interceptor
interceptor.sources.interceptorSource.deserializer=LINE
interceptor.sources.interceptorSource.deserializer.maxLineLength=32000
interceptor.sources.interceptorSource.includePattern=interceptor_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
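# Register the custom interceptor; the type is the fully qualified name of its Builder (note the $Builder inner-class reference)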
interceptor.sources.interceptorSource.interceptors=interceptor1
interceptor.sources.interceptorSource.interceptors.interceptor1.type=Demo052502.InterceptorDemo$Builder
# The selector inspects the type header of each Event:
#   hello -> hellochannel
#   hi    -> hichannel
#   anything else -> otherchannel
interceptor.sources.interceptorSource.selector.type=multiplexing
interceptor.sources.interceptorSource.selector.mapping.hello=hellochannel
interceptor.sources.interceptorSource.selector.mapping.hi=hichannel
interceptor.sources.interceptorSource.selector.mapping.other=otherchannel
interceptor.sources.interceptorSource.selector.header=type
# Define the three channels
interceptor.channels.hellochannel.type=memory
interceptor.channels.hellochannel.capacity=1000
interceptor.channels.hellochannel.transactionCapacity=100
interceptor.channels.hichannel.type=memory
interceptor.channels.hichannel.capacity=1000
interceptor.channels.hichannel.transactionCapacity=100
interceptor.channels.otherchannel.type=memory
interceptor.channels.otherchannel.capacity=1000
interceptor.channels.otherchannel.transactionCapacity=100
# Define the three sinks
interceptor.sinks.hellosink.type=hdfs
interceptor.sinks.hellosink.hdfs.fileType=DataStream
interceptor.sinks.hellosink.hdfs.filePrefix=hello
interceptor.sinks.hellosink.hdfs.fileSuffix=.csv
interceptor.sinks.hellosink.hdfs.path=hdfs://192.168.21.2:9000/kb11file/hello2/%Y-%m-%d
interceptor.sinks.hellosink.hdfs.useLocalTimeStamp=true
interceptor.sinks.hellosink.hdfs.batchSize=640
interceptor.sinks.hellosink.hdfs.rollCount=0
interceptor.sinks.hellosink.hdfs.rollSize=6400000
interceptor.sinks.hellosink.hdfs.rollInterval=3
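# hisink drains hichannel and publishes to the Kafka topic hi2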
interceptor.sinks.hisink.type=org.apache.flume.sink.kafka.KafkaSink
interceptor.sinks.hisink.batchSize=640
interceptor.sinks.hisink.brokerList=192.168.21.2:9092
interceptor.sinks.hisink.topic=hi2
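# othersink logs everything else (the logger sink prints Events to the agent log, handy for verification)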
interceptor.sinks.othersink.type=logger
# Bind the source and sinks to the channels
interceptor.sources.interceptorSource.channels=hellochannel hichannel otherchannel
interceptor.sinks.hisink.channel=hichannel
interceptor.sinks.hellosink.channel=hellochannel
interceptor.sinks.othersink.channel=otherchannel
5. Start a Kafka console consumer to watch the hi2 topic:
[root@hadoop2 ~]# kafka-console-consumer.sh --topic hi2 --bootstrap-server 192.168.21.2:9092 --from-beginning