I. Flume's Built-in Interceptors
Example 1:
Implementation:
Use a regular expression to filter out matching messages, in this case messages that start with user.
One source feeds two channels, and two sinks deliver the data from those channels to Kafka and HDFS respectively.
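The resulting flow:
trainSource ─┬─> kafkaChannel ─> kafkaSink ─> Kafka (topic: train)
             └─> hdfsChannel ─> hdfsSink ─> HDFS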
Configuration file:
Define the three components:
# Declare the agent's three component groups: sources, channels, sinks
train.sources=trainSource
train.channels=kafkaChannel hdfsChannel
train.sinks=kafkaSink hdfsSink
source:
# The source watches a directory for new files
train.sources.trainSource.type=spooldir
# Directory Flume watches; files placed here are collected by Flume
train.sources.trainSource.spoolDir=/root/software/flumlogfile/train
# Deserializer that parses file data into Events; a deserializer must implement the EventDeserializer.Builder interface. LINE turns each line of text into one Event
train.sources.trainSource.deserializer=LINE
# Maximum number of characters per Event; a line longer than this is truncated
train.sources.trainSource.deserializer.maxLineLength=32000
# Regex for the file names to be read
train.sources.trainSource.includePattern=train_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
# Attach an interceptor named head_filter to the source
train.sources.trainSource.interceptors=head_filter
# Set head_filter's type to regex_filter, the built-in regex filtering interceptor
train.sources.trainSource.interceptors.head_filter.type=regex_filter
# Regex matched against the Event body; ^user matches messages that start with user
train.sources.trainSource.interceptors.head_filter.regex=^user
# true drops the Events that match the regex; false keeps only the Events that match
train.sources.trainSource.interceptors.head_filter.excludeEvents=true
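The excludeEvents semantics are easy to misread, so here is a minimal Java sketch of the logic this interceptor applies (an illustration only, not Flume's actual implementation; the sample CSV lines are made up):

import java.util.regex.Pattern;

// Minimal sketch of regex_filter's semantics, not Flume's real code.
public class RegexFilterSketch {
    private static final Pattern REGEX = Pattern.compile("^user");
    private static final boolean EXCLUDE_EVENTS = true;

    // Returns null to signal "drop this Event", mirroring interceptor semantics.
    static String filter(String body) {
        boolean matches = REGEX.matcher(body).find();
        // excludeEvents=true drops matches; excludeEvents=false drops non-matches.
        return (matches == EXCLUDE_EVENTS) ? null : body;
    }

    public static void main(String[] args) {
        System.out.println(filter("user_id,station_id,time")); // null: dropped
        System.out.println(filter("10001,A001,2021-05-25"));   // kept
    }
}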
channel:
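# kafkaChannel is a file channel: Events are persisted to disk and survive an agent restart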
train.channels.kafkaChannel.type=file
train.channels.kafkaChannel.checkpointDir=/root/software/flumlogfile/checkpoint/train
train.channels.kafkaChannel.dataDirs=/root/software/flumlogfile/data/train
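# hdfsChannel is a memory channel: higher throughput, but Events are lost if the agent dies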
train.channels.hdfsChannel.type=memory
train.channels.hdfsChannel.capacity=64000
train.channels.hdfsChannel.transactionCapacity=16000
sink:
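# kafkaSink drains kafkaChannel and publishes each Event to the Kafka topic train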
train.sinks.kafkaSink.type=org.apache.flume.sink.kafka.KafkaSink
train.sinks.kafkaSink.batchSize=640
train.sinks.kafkaSink.brokerList=192.168.21.2:9092
train.sinks.kafkaSink.topic=train
train.sinks.hdfsSink.type=hdfs
train.sinks.hdfsSink.hdfs.fileType=DataStream
# Fixed prefix for the files Flume creates in the HDFS directory
train.sinks.hdfsSink.hdfs.filePrefix=train
# Suffix for those files (e.g. .avro; note the "." is not added automatically and must be configured explicitly)
train.sinks.hdfsSink.hdfs.fileSuffix=.csv
# HDFS directory path
train.sinks.hdfsSink.hdfs.path=hdfs://192.168.21.2:9000/kb11file/train/%Y-%m-%d
train.sinks.hdfsSink.hdfs.useLocalTimeStamp=true
# Number of Events written to HDFS per batch (default: 100)
train.sinks.hdfsSink.hdfs.batchSize=640
# Roll to a new file after this many Events (0 = never roll by Event count; default: 10)
train.sinks.hdfsSink.hdfs.rollCount=0
# Roll to a new file once it reaches this size in bytes (0 = never roll by size; default: 1024 bytes = 1 KB)
train.sinks.hdfsSink.hdfs.rollSize=6400000
# Roll to a new file every 30 seconds (0 = write all Events into a single file; default: 30)
train.sinks.hdfsSink.hdfs.rollInterval=30
Bind the source and sinks to the channels:
train.sources.trainSource.channels=kafkaChannel hdfsChannel
train.sinks.hdfsSink.channel=hdfsChannel
train.sinks.kafkaSink.channel=kafkaChannel
Run Flume:
[root@hadoop2 flume]# ./bin/flume-ng agent --name train --conf ./conf/ --conf-file ./conf/kb11Job/train-flume-kafka-hdfs.conf -Dflume.root.logger=INFO,console
Start a Kafka console consumer:
[root@hadoop2 ~]# kafka-console-consumer.sh --topic train --bootstrap-server 192.168.21.2:9092 --from-beginning
II. Custom Interceptors
A concept: fan-out flows
Flume supports fan-out flow configurations, in which one source connects to multiple channels. A source has two fan-out modes: replicating and multiplexing.
1) Replicating: every Event from the source is sent to all channels connected to it. If no selector is specified, the replicating selector is the default.
2) Multiplexing: an Event is sent to only a subset of the channels, according to routing rules you specify.
The multiplexing selector takes an extra set of properties for routing the flow: a mapping from the values of an Event header attribute to channels. The selector reads the configured attribute from each Event's headers; when the value matches a configured rule, the Event is sent to the channels that rule names. When no rule matches, the Event goes to the default channels. See the configuration below:
# List the Agent's sources, sinks, and channels; note there is 1 source, 2 channels, and 2 sinks
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# Channels connected to Source1; two channels are configured here
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>
# Connect each sink to its own channel
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>
# Source1's selector is the replicating selector (as noted above, this is also the default when no selector is configured explicitly)
<Agent>.sources.<Source1>.selector.type = replicating
# The full multiplexing selector configuration looks like this; the selector type is multiplexing
<Agent>.sources.<Source1>.selector.type = multiplexing
# If <someHeader> is, say, abc, the selector reads the abc attribute of each Event's headers as the routing key
<Agent>.sources.<Source1>.selector.header = <someHeader>
# If <Value1> is, say, 3, Events whose abc header equals 3 are sent to Channel1
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
# Likewise, Events whose abc header equals <Value2> are sent to both Channel1 and Channel2
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
# Likewise, Events whose abc header equals <Value3> are sent to Channel2
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
# Events whose abc header matches none of the values above go to the default, Channel2
<Agent>.sources.<Source1>.selector.default = <Channel2>
1. Create a Maven project
Add the dependency:
<!-- https://mvnrepository.com/artifact/org.apache.flume/flume-ng-core -->
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.6.0</version>
</dependency>
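The version should match the Flume installation on the server; since Flume already ships flume-ng-core in its own lib directory, marking the dependency as provided scope is also reasonable.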
2. Write the Java class:
Description:
For each Event (headers + body):
* if the body starts with hello, tag the Event's headers with hello
* if the body starts with hi, tag the Event's headers with hi
* otherwise, tag them with other
package Demo052502;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class InterceptorDemo implements Interceptor {
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        addHeaderEvents = new ArrayList<Event>();
    }

    // Tag a single Event according to how its body starts
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String bodyStr = new String(event.getBody());
        if (bodyStr.startsWith("hello")) {
            headers.put("type", "hello");
        } else if (bodyStr.startsWith("hi")) {
            headers.put("type", "hi");
        } else {
            headers.put("type", "other");
        }
        return event;
    }

    // Batch version: apply the single-Event logic to every Event in the list
    @Override
    public List<Event> intercept(List<Event> list) {
        addHeaderEvents.clear();
        for (Event event : list) {
            addHeaderEvents.add(intercept(event));
        }
        return addHeaderEvents;
    }

    @Override
    public void close() {
        addHeaderEvents = null;
    }

    // Flume instantiates the interceptor through this Builder; the config
    // references it as Demo052502.InterceptorDemo$Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
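Before packaging, the class can be sanity-checked locally with a throwaway main method (a minimal sketch; EventBuilder ships with flume-ng-core, and InterceptorDemoCheck is just an illustrative name):

package Demo052502;

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

// Throwaway local check; not part of the deployed jar.
public class InterceptorDemoCheck {
    public static void main(String[] args) {
        InterceptorDemo interceptor = new InterceptorDemo();
        interceptor.initialize();

        Event event = EventBuilder.withBody("hello world", StandardCharsets.UTF_8);
        interceptor.intercept(event);
        System.out.println(event.getHeaders()); // {type=hello}

        event = EventBuilder.withBody("something else", StandardCharsets.UTF_8);
        interceptor.intercept(event);
        System.out.println(event.getHeaders()); // {type=other}

        interceptor.close();
    }
}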
3. Package the class into a jar (e.g. with mvn package)
When finished, copy the jar into the flume/lib directory on the Linux machine so it is on the agent's classpath
4. Write the agent configuration
interceptor.sources=interceptorSource
interceptor.channels=hellochannel hichannel otherchannel
interceptor.sinks=hellosink hisink othersink
# Read from a spooling directory
interceptor.sources.interceptorSource.type=spooldir
interceptor.sources.interceptorSource.spoolDir=/root/software/flumlogfile/interceptor
interceptor.sources.interceptorSource.deserializer=LINE
interceptor.sources.interceptorSource.deserializer.maxLineLength=32000
interceptor.sources.interceptorSource.includePattern=interceptor_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
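# Register the custom interceptor; the type is the fully qualified name of its Builder (note the $Builder inner-class reference)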
interceptor.sources.interceptorSource.interceptors=interceptor1
interceptor.sources.interceptorSource.interceptors.interceptor1.type=Demo052502.InterceptorDemo$Builder
# The selector inspects the type header of each Event:
#   hello -> hellochannel
#   hi    -> hichannel
#   anything else -> otherchannel
interceptor.sources.interceptorSource.selector.type=multiplexing
interceptor.sources.interceptorSource.selector.mapping.hello=hellochannel
interceptor.sources.interceptorSource.selector.mapping.hi=hichannel
interceptor.sources.interceptorSource.selector.mapping.other=otherchannel
interceptor.sources.interceptorSource.selector.header=type
# Define the three channels
interceptor.channels.hellochannel.type=memory
interceptor.channels.hellochannel.capacity=1000
interceptor.channels.hellochannel.transactionCapacity=100
interceptor.channels.hichannel.type=memory
interceptor.channels.hichannel.capacity=1000
interceptor.channels.hichannel.transactionCapacity=100
interceptor.channels.otherchannel.type=memory
interceptor.channels.otherchannel.capacity=1000
interceptor.channels.otherchannel.transactionCapacity=100
# Define the three sinks
interceptor.sinks.hellosink.type=hdfs
interceptor.sinks.hellosink.hdfs.fileType=DataStream
interceptor.sinks.hellosink.hdfs.filePrefix=hello
interceptor.sinks.hellosink.hdfs.fileSuffix=.csv
interceptor.sinks.hellosink.hdfs.path=hdfs://192.168.21.2:9000/kb11file/hello2/%Y-%m-%d
interceptor.sinks.hellosink.hdfs.useLocalTimeStamp=true
interceptor.sinks.hellosink.hdfs.batchSize=640
interceptor.sinks.hellosink.hdfs.rollCount=0
interceptor.sinks.hellosink.hdfs.rollSize=6400000
interceptor.sinks.hellosink.hdfs.rollInterval=3
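# hisink drains hichannel and publishes to the Kafka topic hi2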
interceptor.sinks.hisink.type=org.apache.flume.sink.kafka.KafkaSink
interceptor.sinks.hisink.batchSize=640
interceptor.sinks.hisink.brokerList=192.168.21.2:9092
interceptor.sinks.hisink.topic=hi2
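# othersink logs everything else (the logger sink prints Events to the agent log, handy for verification)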
interceptor.sinks.othersink.type=logger
# Bind the source and sinks to the channels
interceptor.sources.interceptorSource.channels=hellochannel hichannel otherchannel
interceptor.sinks.hisink.channel=hichannel
interceptor.sinks.hellosink.channel=hellochannel
interceptor.sinks.othersink.channel=otherchannel
5. Start a Kafka console consumer to watch the hi2 topic:
[root@hadoop2 ~]# kafka-console-consumer.sh --topic hi2 --bootstrap-server 192.168.21.2:9092 --from-beginning