Flume fundamentals: detailed configuration of sources, channels, and sinks


Flume configuration reference: https://www.cnblogs.com/zyde/p/8946069.html

https://blog.csdn.net/qq_43371556/article/details/103100193#_netcat_284

I. Sources
============================================

1. Avro source

The Avro source listens on a specified port and collects the event stream sent by external Avro clients. It is the building block for multi-hop, fan-out, and fan-in flows, and it can also receive events sent with the Avro client that ships with Flume. An Avro source needs the IP address to bind to and the port to listen on.

        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        # type: avro source
        a1.sources.r1.type = avro
        # IP to bind to; 0.0.0.0 means all local interfaces
        a1.sources.r1.bind = 0.0.0.0
        # port to listen on
        a1.sources.r1.port = 4444

        # configure the sink
        a1.sinks.k1.type = logger

        # configure the channel
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
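A quick way to verify the Avro source (a sketch; the config file name avro.conf and the test file path are assumptions, adjust them to your setup): start the agent, then push a local file to the listening port with the Avro client that ships with Flume. Each line of the file becomes one event printed by the logger sink.

        # start the agent defined above (assuming it is saved as avro.conf)
        flume-ng agent -n a1 -c conf -f /soft/flume/avro.conf -Dflume.root.logger=INFO,console

        # send a local file to the Avro source
        flume-ng avro-client -H localhost -p 4444 -F /home/centos/readme.txt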

    2. Sequence (seq) source: mostly used for testing
        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = seq
        # total number of events to send
        a1.sources.r1.totalEvents = 1000

           
    3. Stress source: mostly used for load testing
        # configure the source
        a1.sources.r1.type = org.apache.flume.source.StressSource
        # size of a single event, in bytes
        a1.sources.r1.size = 10240
        # total number of events
        a1.sources.r1.maxTotalEvents = 1000000

 
    4. Spooling directory (spooldir) source: watches a directory for newly created files and sends the contents of each new file as events.

spoolDir (the directory to watch) is required and has no default. This source does not watch subdirectories, i.e. it cannot monitor recursively; if you need that, you have to implement it yourself (http://blog.csdn.net/yangbutao/article/details/8835563 has a recursive implementation).

        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = spooldir
        # directory to watch
        a1.sources.r1.spoolDir = /home/centos/spooldir

        # suffix appended to a file once it has been fully consumed
        a1.sources.r1.fileSuffix = .COMPLETED

        # configure the sink
        a1.sinks.k1.type = logger

        # configure the channel
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
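To try the spooldir source (a sketch; the sample file name access.log is arbitrary), drop a file into the watched directory; once it has been consumed it is renamed with the .COMPLETED suffix configured above:

        mkdir -p /home/centos/spooldir
        cp access.log /home/centos/spooldir/
        # after the source has read it, the file shows up as access.log.COMPLETED
        ls /home/centos/spooldir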

  5. Exec source    // produces events by running a Linux command; the typical use is tail -F (follow a file and emit each line appended to it)
            // delivery is not guaranteed, so data may be lost

        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = exec
        # the Linux command to run
        a1.sources.r1.command = tail -F /home/centos/readme.txt

        # configure the sink
        a1.sinks.k1.type = logger

        # configure the channel
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
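To test the exec source (a sketch; exec.conf is an assumed file name for the config above), start the agent and append a line to the tailed file from another shell; the line should appear in the logger sink output:

        flume-ng agent -n a1 -c conf -f /soft/flume/exec.conf -Dflume.root.logger=INFO,console
        # in another shell:
        echo "hello exec source" >> /home/centos/readme.txt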


    6. Taildir source        // tails files under a directory; the file set is given as a regex; keeps a position file, so it can recover where it left off after a crash

        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = TAILDIR
        # file groups; more than one group can be defined
        a1.sources.r1.filegroups = f1
        # directory and file pattern for the group, given as a regex; only files can be matched
        a1.sources.r1.filegroups.f1 = /home/centos/taildir/.*

        # location of the position file (the default is shown below)
        # a1.sources.r1.positionFile = ~/.flume/taildir_position.json

        # configure the sink
        a1.sinks.k1.type = logger

        # configure the channel
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
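The position file is what gives Taildir its recovery ability: it records how far each tailed file has been read. Its content is a JSON array, roughly like the sketch below (the inode and pos values here are made up):

        [{"inode": 524293, "pos": 128, "file": "/home/centos/taildir/app.log"}]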


II. Channels (buffers)

====================================


    1. Memory channel: in-memory buffer

        a1.channels.c1.type = memory
        # maximum number of events held in the channel
        a1.channels.c1.capacity = 1000
        # maximum number of events per transaction, both when the channel takes
        # events from the source and when it hands events to the sink
        a1.channels.c1.transactionCapacity = 100

    2. File channel:   // if several file channels keep their checkpoint and data files in the default locations and run at the same time, the files conflict and the other channels crash, so give each channel its own checkpointDir and dataDirs

        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = netcat
        a1.sources.r1.bind = localhost
        a1.sources.r1.port = 8888

        # configure the sink
        a1.sinks.k1.type = logger

        # configure the channel
        a1.channels.c1.type = file
        a1.channels.c1.checkpointDir = /home/centos/flume/checkpoint
        a1.channels.c1.dataDirs = /home/centos/flume/data

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1

    Memory channel: fast, but events are lost if the machine loses power.

    File channel:   slower, but events survive even a power failure.

 

III. Sinks
============================================

1. Logger sink: writes the collected events to Flume's own log.

agent1.sinks = k1
agent1.sinks.k1.type = logger

2. File roll sink (file_roll)    // often used for simple data collection

        # configure the sink
        a1.sinks.k1.type = file_roll
        # target directory
        a1.sinks.k1.sink.directory = /home/centos/file
        # roll interval; default 30 s, 0 disables rolling so everything goes into a single file
        a1.sinks.k1.sink.rollInterval = 0
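It is safest to create the target directory before starting the agent (sketch):

        mkdir -p /home/centos/file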
3. HDFS sink
HDFS sink: writes the collected events into files on HDFS. New files can be rolled by time, by file size, or by the number of events collected.
        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = netcat
        a1.sources.r1.bind = localhost
        a1.sources.r1.port = 8888

        # configure the sink
        a1.sinks.k1.type = hdfs
        # target directory
        a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
        # file name prefix
        a1.sinks.k1.hdfs.filePrefix = events-
        # roll interval, in seconds
        a1.sinks.k1.hdfs.rollInterval = 0
        # file size that triggers a roll, in bytes
        a1.sinks.k1.hdfs.rollSize = 1024
        # use the local time instead of the event's timestamp header
        a1.sinks.k1.hdfs.useLocalTimeStamp = true
        # output file type; the default is SequenceFile
        # DataStream: plain text, no compression codec allowed
        # CompressedStream: compressed text, a codec must be set
        a1.sinks.k1.hdfs.fileType = DataStream

        # the path can also use finer-grained time escapes together with rounding
        # (this overrides the path set above)
        a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
        a1.sinks.k1.hdfs.round = true
        a1.sinks.k1.hdfs.roundValue = 10
        a1.sinks.k1.hdfs.roundUnit = minute
        # an event timestamped 2015-10-16 17:38:59 is then written under
        # /flume/events/15-10-16/1730/00
        # because the timestamp is rounded down to the nearest 10 minutes,
        # so a new directory is created every 10 minutes

        # configure the channel
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
        

4. Avro sink

Avro sink: forwards the received events to a specified host and port, so the next agent in a cascading (multi-hop) flow can pick them up with an Avro source. The destination hostname/IP and port must be configured.

agent1.sinks = k1
agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = c2
# destination hostname or IP
agent1.sinks.k1.hostname = hadoop03
# destination port
agent1.sinks.k1.port = 16666
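The Avro sink is normally paired with an Avro source on the next hop. A minimal sketch of the downstream agent on hadoop03 (the agent name a2 and its channel/sink layout are assumptions) could look like:

a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Avro source listening on the port the upstream Avro sink points at
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 16666

a2.sinks.k1.type = logger
a2.channels.c1.type = memory
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1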

 

5. Kafka sink

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = jsonTopic
a1.sinks.k1.kafka.bootstrap.servers = hdp001:9092,hdp002:9092,hdp003:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
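To check that events actually reach Kafka (a sketch; assumes the standard Kafka CLI scripts are available on one of the brokers), consume the topic from the command line:

kafka-console-consumer.sh --bootstrap-server hdp001:9092 --topic jsonTopic --from-beginning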

6. Hive sink

// Hive service help: hive --service help
// 1. Start the Hive metastore service: hive --service metastore   (metastore address: thrift://localhost:9083)
// 2. Copy the HCatalog jars into Hive's lib directory: cp hive-hcatalog* /soft/hive/lib   (they live under /soft/hive/hcatalog/share/hcatalog)
// 3. Create a transactional Hive table; first set these session options:
    SET hive.support.concurrency=true;
    SET hive.enforce.bucketing=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.compactor.initiator.on=true;
    SET hive.compactor.worker.threads=1;

// then create the bucketed ORC table:
    create table myhive.weblogs(id int, name string, age int)
    clustered by(id) into 2 buckets
    row format delimited
    fields terminated by '\t'
    stored as orc
    tblproperties('transactional'='true');

      

        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = netcat
        a1.sources.r1.bind = localhost
        a1.sources.r1.port = 8888

        # configure the sink
        a1.sinks.k1.type = hive
        a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
        a1.sinks.k1.hive.database = myhive
        a1.sinks.k1.hive.table = weblogs
        a1.sinks.k1.useLocalTimeStamp = true
        # input format: DELIMITED or JSON
        # DELIMITED: plain delimited text
        # JSON: JSON records
        a1.sinks.k1.serializer = DELIMITED
        # input field delimiter, written in double quotes
        a1.sinks.k1.serializer.delimiter = ","
        # output (serde) field separator, written in single quotes
        a1.sinks.k1.serializer.serdeSeparator = '\t'
        # field names, comma-separated, no spaces allowed
        a1.sinks.k1.serializer.fieldnames = id,name,age

        # configure the channel
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
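With this configuration, input arriving at the netcat source has to match the DELIMITED serializer: comma-separated id,name,age values. A quick test (sketch):

        nc localhost 8888
        1,tom,20
        2,jerry,21
        # then verify in Hive: select * from myhive.weblogs;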

7. HBase sink
HBase is a database that can store the collected events.
You specify the table name and column family to write to, and the agent then inserts the collected events into the table one by one.

        # name the agent's components
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1

        # configure the source
        a1.sources.r1.type = netcat
        a1.sources.r1.bind = localhost
        a1.sources.r1.port = 8888

        # configure the sink
        a1.sinks.k1.type = hbase
        # tableName: the HBase table to write to; required
        a1.sinks.k1.table = flume_hbase
        # column family in that table; this sink supports only one column family; required
        a1.sinks.k1.columnFamily = f1
        # RegexHbaseEventSerializer lets you specify the rowKey and column names manually
        a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
        # column values are extracted with the regex below
        # rowKeyIndex picks which column becomes the rowKey (0-based)
        a1.sinks.k1.serializer.colNames = ROW_KEY,name,age
        a1.sinks.k1.serializer.regex = (.*),(.*),(.*)
        a1.sinks.k1.serializer.rowKeyIndex = 0

        # configure the channel
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        # bind source->channel and sink->channel
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
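The target table and column family must already exist, and each input line has to match the (.*),(.*),(.*) regex. A minimal sketch:

        # create the table in the HBase shell first
        hbase shell
        create 'flume_hbase', 'f1'

        # then feed matching lines through the netcat source
        nc localhost 8888
        r1,tom,20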

IV. Interceptors

====================================
   

An interceptor in Flume works on the source side: as the source reads events and sends them on toward the sink, the interceptor can add useful information to the event headers or filter the event content, performing a first round of data cleansing. This is very useful in real business scenarios. Flume-ng 1.7 ships with the following interceptors:

  • Timestamp Interceptor;
  • Host Interceptor;
  • Static Interceptor;
  • UUID Interceptor;
  • Morphline Interceptor;
  • Search and Replace Interceptor;
  • Regex Filtering Interceptor;
  • Regex Extractor Interceptor;

Several interceptors can be attached to one source; they are applied one after another in the order listed. For example:

a1.sources.r1.interceptors=i1 i2  
a1.sources.r1.interceptors.i1.type=regex_filter  
a1.sources.r1.interceptors.i1.regex=\\{.*\\}  
a1.sources.r1.interceptors.i2.type=timestamp

 

Timestamp Interceptor    // adds a timestamp to the event header

The timestamp interceptor puts the current time in milliseconds into the event header under the key timestamp. It is not used all that often. One case where it matters is the HDFS sink, where the output path is built from the event timestamp, e.g.
hdfs.path = hdfs://cdh5/tmp/dap/%Y%m%d
hdfs.filePrefix = log_%Y%m%d_%H
so events are written into files according to their timestamps.
The same effect can also be achieved without the interceptor (by setting useLocalTimeStamp = true on the sink).

a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/hadoop/hui/hehe.txt
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
 
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel=c1
a1.sinks.k1.hdfs.path=hdfs://h71:9000/hui/%y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = log_%Y%m%d_%H
#fileType and writeFormat must be set as below so that the data lands in HDFS as plain text
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
#while data is being written, the file is rolled every 10 seconds and its .tmp suffix is removed; the default is 30 seconds
a1.sinks.k1.hdfs.rollInterval=10
#the line above rolls by time; the commented line below would roll by file size instead
#a1.sinks.k1.hdfs.rollSize=1024
 
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the Flume agent:
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c conf/ -f conf/timestamp.conf -n a1 -Dflume.root.logger=INFO,console
Generate some data:
[hadoop@h71 hui]$ echo "hello world" >> hehe.txt

Check the result:
[hadoop@h71 hui]$ hadoop fs -lsr /hui
drwxr-xr-x   - hadoop supergroup          0 2017-03-18 02:41 /hui/17-03-18
drwxr-xr-x   - hadoop supergroup          0 2017-03-18 02:41 /hui/17-03-18/02
-rw-r--r--   2 hadoop supergroup         12 2017-03-18 02:41 /hui/17-03-18/02/log_20170318_02.1489776083025.tmp
Ten seconds later (the 1489776083025 suffix in the file name is a timestamp):
[hadoop@h71 hui]$ hadoop fs -lsr /hui
drwxr-xr-x   - hadoop supergroup          0 2017-03-18 02:41 /hui/17-03-18
drwxr-xr-x   - hadoop supergroup          0 2017-03-18 02:41 /hui/17-03-18/02
-rw-r--r--   2 hadoop supergroup         12 2017-03-18 02:41 /hui/17-03-18/02/log_20170318_02.1489776083025

Host Interceptor
The host interceptor adds the hostname or IP address of the machine running the Flume agent to the event header, under the key host (the key name can be customized).

[hadoop@h71 conf]$ vi host.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/hadoop/hui/hehe.txt
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
#true (the default) records the IP address 192.168.8.71; false records the hostname h71
a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = agentHost
 
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://h71:9000/hui/%y%m%d
a1.sinks.k1.hdfs.filePrefix = qiang_%{agentHost}
#append the suffix .log to the generated files
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 10
 
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the Flume agent:
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c conf/ -f conf/host.conf -n a1 -Dflume.root.logger=INFO,console
This fails with: Caused by: java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null
Fix: add the line a1.sinks.k1.hdfs.useLocalTimeStamp = true to host.conf.


Generate some data:
[hadoop@h71 hui]$ echo "hello world" >> hehe.txt

Check the result:
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ hadoop fs -lsr /hui
drwxr-xr-x   - hadoop supergroup          0 2017-03-18 03:36 /hui/170318
-rw-r--r--   2 hadoop supergroup          2 2017-03-18 03:36 /hui/170318/qiang_h71.1489779401946.log

Note: the Timestamp Interceptor and Host Interceptor experiments can be temperamental. The first run worked fine, but when I repeated it the agent printed "SINK, name: k1 started" and then simply hung there, with no error, and never recovered.

Static Interceptor
The static interceptor adds a fixed key/value pair to every event header.

[hadoop@h71 conf]$ vi static.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/hadoop/hui/hehe.txt
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = static_key
a1.sources.r1.interceptors.i1.value = static_value
 
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://h71:9000/hui/
a1.sinks.k1.hdfs.filePrefix = qiang_%{static_key}
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
 
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Check the result:
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ hadoop fs -lsr /hui
drwxr-xr-x   - hadoop supergroup          0 2017-03-18 03:36 /hui/
-rw-r--r--   2 hadoop supergroup          2 2017-03-18 03:36 /hui/qiang_static_value.1489779401946


UUID Interceptor
The UUID interceptor generates a UUID string in every event header, for example b5755073-77a9-43c1-8fad-b7a586fc1b97. The UUID can then be read and used by the sink.

[hadoop@h71 conf]$ vi uuid.conf

a1.sources = r1  
a1.sinks = k1  
a1.channels = c1  
 
a1.sources.r1.type = exec  
a1.sources.r1.channels = c1  
a1.sources.r1.command = tail -F /home/hadoop/hui/hehe.txt
a1.sources.r1.interceptors = i1
#the type cannot simply be written as uuid; the full class name is required, otherwise the class is not found
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
#if a UUID header already exists, keep it
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.prefix = UUID_
 
a1.sinks.k1.type = logger  
 
a1.channels.c1.type = memory  
a1.channels.c1.capacity = 1000  
a1.channels.c1.transactionCapacity = 100  
 
a1.sources.r1.channels = c1  
a1.sinks.k1.channel = c1  

After starting the Flume agent you should see:
Event: { headers:{id=UUID_1cb50ac7-fef0-4385-99da-45530cb50271} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world }


Morphline Interceptor
To be looked into later.

Search and Replace Interceptor
This interceptor replaces whatever the regular expression matches in the event body with the given replacement string.

[hadoop@h71 conf]$ vi search.conf

a1.sources = r1  
a1.sinks = k1  
a1.channels = c1  
 
a1.sources.r1.type = exec  
a1.sources.r1.channels = c1  
a1.sources.r1.command = tail -F /home/hadoop/hui/hehe.txt
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = search_replace
a1.sources.r1.interceptors.i1.searchPattern = [0-9]+
a1.sources.r1.interceptors.i1.replaceString = xiaoqiang
a1.sources.r1.interceptors.i1.charset = UTF-8
 
a1.sinks.k1.type = logger  
 
a1.channels.c1.type = memory  
a1.channels.c1.capacity = 1000  
a1.channels.c1.transactionCapacity = 100  
 
a1.sources.r1.channels = c1  
a1.sinks.k1.channel = c1

Start the Flume agent:
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c conf/ -f conf/search.conf -n a1 -Dflume.root.logger=INFO,console
Generate some data:
[hadoop@h71 hui]$ echo "message 1" >> hehe.txt
[hadoop@h71 hui]$ echo "message 23" >> hehe.txt

The console shows:
Event: { headers:{} body: 6D 65 73 73 61 67 65 20 78 69 61 6F 71 69 61 6E message xiaoqian }
Event: { headers:{} body: 6D 65 73 73 61 67 65 20 78 69 61 6F 71 69 61 6E message xiaoqian }

Regex Filtering Interceptor
This interceptor filters events by matching the event body against a regular expression.

[hadoop@h71 conf]$ vi filter.conf

a1.sources = r1  
a1.sinks = k1  
a1.channels = c1  
 
a1.sources.r1.type = exec  
a1.sources.r1.channels = c1  
a1.sources.r1.command = tail -F /home/hadoop/hui/hehe.txt
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^lxw1234.*
#with excludeEvents = false, events that do not start with lxw1234 are dropped; with excludeEvents = true, events that do start with lxw1234 are dropped
a1.sources.r1.interceptors.i1.excludeEvents = false
 
a1.sinks.k1.type = logger  
 
a1.channels.c1.type = memory  
a1.channels.c1.capacity = 1000  
a1.channels.c1.transactionCapacity = 100  
 
a1.sources.r1.channels = c1  
a1.sinks.k1.channel = c1

The original events are:
[hadoop@h71 hui]$ echo "message 1" >> hehe.txt 
[hadoop@h71 hui]$ echo "lxw1234 message 3" >> hehe.txt 
[hadoop@h71 hui]$ echo "message 2" >> hehe.txt 
[hadoop@h71 hui]$ echo "lxw1234 message 4" >> hehe.txt 


The events after filtering are:
Event: { headers:{} body: 6C 78 77 31 32 33 34 20 6D 65 73 73 61 67 65 20 lxw1234 message  }
Event: { headers:{} body: 6C 78 77 31 32 33 34 20 6D 65 73 73 61 67 65 20 lxw1234 message  }

Regex Extractor Interceptor
This interceptor extracts parts of the event body with a regular expression and adds them to the event header.

[hadoop@h71 conf]$ vi extractor.conf

a1.sources = r1  
a1.sinks = k1  
a1.channels = c1  
 
a1.sources.r1.type = exec  
a1.sources.r1.channels = c1  
a1.sources.r1.command = tail -F /home/hadoop/hui/hehe.txt
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = cookieid is (.*?) and ip is (.*)
a1.sources.r1.interceptors.i1.serializers = s1 s2
a1.sources.r1.interceptors.i1.serializers.s1.name = cookieid
a1.sources.r1.interceptors.i1.serializers.s2.name = ip
 
a1.sinks.k1.type = logger  
 
a1.channels.c1.type = memory  
a1.channels.c1.capacity = 1000  
a1.channels.c1.transactionCapacity = 100  
 
a1.sources.r1.channels = c1  
a1.sinks.k1.channel = c1


Notes:
1. Remove the two lines a1.sources.r1.interceptors.i1.serializers.s1.type = default and ...s2.type = default found in the original blog post, otherwise you get:

Caused by: java.lang.ClassNotFoundException: default

2. Change the regular expression from cookieid is (.*?) and ip is (.*?) to cookieid is (.*?) and ip is (.*); otherwise the IP is not matched and stays empty in the event header.

This configuration extracts cookieid and ip from the event body and adds them to the event header.

The original events are:
[hadoop@h71 hui]$ echo "cookieid is c_1 and ip is 127.0.0.1" >> hehe.txt 
[hadoop@h71 hui]$ echo "cookieid is c_2 and ip is 127.0.0.2" >> hehe.txt 
[hadoop@h71 hui]$ echo "cookieid is c_3 and ip is 127.0.0.3" >> hehe.txt

The resulting event headers are:
Event: { headers:{cookieid=c_1, ip=127.0.0.1} body: 63 6F 6F 6B 69 65 69 64 20 69 73 20 63 5F 31 20 cookieid is c_1  }
Event: { headers:{cookieid=c_2, ip=127.0.0.2} body: 63 6F 6F 6B 69 65 69 64 20 69 73 20 63 5F 32 20 cookieid is c_2  }
Event: { headers:{cookieid=c_3, ip=127.0.0.3} body: 63 6F 6F 6B 69 65 69 64 20 69 73 20 63 5F 33 20 cookieid is c_3  }

Flume interceptors combined with sinks cover many business requirements,
for example generating target directories and file names from the time and host,
or working with the Kafka sink to write into multiple partitions.
 

    Interceptor chain: each source can be configured with several interceptors; the expected logger output is sketched after the config below.
 

  # name the agent's components
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 8888
    # three interceptors
    a1.sources.r1.interceptors = i1 i2 i3
    a1.sources.r1.interceptors.i1.type = timestamp
    a1.sources.r1.interceptors.i2.type = host
    a1.sources.r1.interceptors.i3.type = static
    a1.sources.r1.interceptors.i3.key = location
    a1.sources.r1.interceptors.i3.value = NEW_YORK
    # configure the sink
    a1.sinks.k1.type = logger
    # configure the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # bind source->channel and sink->channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
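After sending a line through the netcat source, the logger sink should print an event whose header carries the contributions of all three interceptors, roughly like the sketch below (the timestamp and host values are illustrative):

    Event: { headers:{timestamp=1489776083025, host=192.168.8.71, location=NEW_YORK} body: 68 65 6C 6C 6F    hello }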

 

Channel selector
====================================


    A channel selector is a source-side component: it decides which channel(s) each event is sent to, comparable to partitioning.

    When one source is wired to several channels, the default is replication: a copy of every event is sent to each channel.

        1. Replicating channel selector    // the default selector: the source sends a copy of every event to every channel
                    // setup: 1 source, 3 channels, 3 sinks
                    //    netcat   memory    file_roll
    
        # name the agent's components
        a1.sources = r1
        a1.sinks = k1 k2 k3
        a1.channels = c1 c2 c3

        # configure the source
        a1.sources.r1.type = netcat
        a1.sources.r1.bind = localhost
        a1.sources.r1.port = 8888
        a1.sources.r1.selector.type = replicating

        # configure the channels
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        a1.channels.c2.type = memory
        a1.channels.c2.capacity = 1000
        a1.channels.c2.transactionCapacity = 100

        a1.channels.c3.type = memory
        a1.channels.c3.capacity = 1000
        a1.channels.c3.transactionCapacity = 100

        
        # configure the sinks
        a1.sinks.k1.type = file_roll
        a1.sinks.k1.sink.directory = /home/centos/file1
        a1.sinks.k1.sink.rollInterval = 0

        a1.sinks.k2.type = file_roll
        a1.sinks.k2.sink.directory = /home/centos/file2
        a1.sinks.k2.sink.rollInterval = 0

        a1.sinks.k3.type = file_roll
        a1.sinks.k3.sink.directory = /home/centos/file3
        a1.sinks.k3.sink.rollInterval = 0

        # bind source->channels and sinks->channels
        a1.sources.r1.channels = c1 c2 c3
        a1.sinks.k1.channel = c1
        a1.sinks.k2.channel = c2
        a1.sinks.k3.channel = c3
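Since the selector replicates, any line typed into the netcat source should end up in all three target directories (sketch; assumes the directories exist and the agent is running):

        nc localhost 8888
        hello replicating

        # every one of file1, file2 and file3 should now contain the line
        cat /home/centos/file1/* /home/centos/file2/* /home/centos/file3/*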
    
    2. Multiplexing channel selector    // an Avro source is used here so files can be sent together with headers
      
        # name the agent's components
        a1.sources = r1
        a1.sinks = k1 k2 k3
        a1.channels = c1 c2 c3

        # configure the source
        a1.sources.r1.type = avro
        a1.sources.r1.bind = 0.0.0.0
        a1.sources.r1.port = 4444
        # configure the channel selector
        a1.sources.r1.selector.type = multiplexing
        a1.sources.r1.selector.header = country
        a1.sources.r1.selector.mapping.CN = c1
        a1.sources.r1.selector.mapping.US = c2
        a1.sources.r1.selector.default = c3

        # configure the channels
        a1.channels.c1.type = memory
        a1.channels.c1.capacity = 1000
        a1.channels.c1.transactionCapacity = 100

        a1.channels.c2.type = memory
        a1.channels.c2.capacity = 1000
        a1.channels.c2.transactionCapacity = 100

        a1.channels.c3.type = memory
        a1.channels.c3.capacity = 1000
        a1.channels.c3.transactionCapacity = 100

        
        # configure the sinks
        a1.sinks.k1.type = file_roll
        a1.sinks.k1.sink.directory = /home/centos/file1
        a1.sinks.k1.sink.rollInterval = 0

        a1.sinks.k2.type = file_roll
        a1.sinks.k2.sink.directory = /home/centos/file2
        a1.sinks.k2.sink.rollInterval = 0

        a1.sinks.k3.type = file_roll
        a1.sinks.k3.sink.directory = /home/centos/file3
        a1.sinks.k3.sink.rollInterval = 0

        # bind source->channels and sinks->channels
        a1.sources.r1.channels = c1 c2 c3
        a1.sinks.k1.channel = c1
        a1.sinks.k2.channel = c2
        a1.sinks.k3.channel = c3


        1. Create the directories file1, file2, and file3:
            mkdir file1 file2 file3

        2. Create a directory country and put the header files and the data file into it:
            Header files CN.txt, US.txt, OTHER.txt:
                CN.txt ===> country CN
                US.txt ===> country US
                OTHER.txt ===> country OTHER

            Data file 1.txt:
                1.txt ====> helloworld

        3. Start Flume:
            flume-ng agent -n a1 -f /soft/flume/selector_multi.conf

        4. Run the Avro client:
            flume-ng avro-client -H localhost -p 4444 -R ~/country/US.txt -F ~/country/1.txt    ===> check file2
            flume-ng avro-client -H localhost -p 4444 -R ~/country/CN.txt -F ~/country/1.txt    ===> check file1
            flume-ng avro-client -H localhost -p 4444 -R ~/country/OTHER.txt -F ~/country/1.txt    ===> check file3


        
 
