Overview
Flume Definition
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and moving massive amounts of log data, originally provided by Cloudera. Flume is built on a streaming (log-stream) architecture and is flexible and simple.
It offers tunable reliability mechanisms plus many failover and recovery mechanisms, giving it strong fault tolerance. This architecture can be used for real-time, online analysis of streaming log data.
Flume lets you plug custom data senders into a logging system to collect data; it can also apply simple processing to the data and write it to a variety of (customizable) data receivers. Flume currently exists in two lines: the 0.9.x releases, collectively called Flume OG, and the 1.x releases, collectively called Flume NG. Flume NG went through a major refactoring and differs substantially from Flume OG, so take care to distinguish them.
Its main job is to read data from a server's local disk in real time and write that data into HDFS.
Advantages of Flume
1. It can collect data at high speed, and the collected data can be stored on HDFS in the file format and compression codec you want.
2. Its transaction mechanism guarantees that no data is lost during collection.
3. Some Sources guarantee that if Flume crashes, a restart resumes collection from the last recorded position, achieving true zero data loss.
Flume Architecture
Agent: the smallest unit of log collection; Flume performs log collection by assembling a number of Agents.
An Agent contains a Source, a Channel, and a Sink.
1. Source (collects data from the source side): Flume ships with a wide variety of Sources and also supports custom Sources.
2. Channel (temporarily stores the aggregated data): the most common are the memory channel and the file channel (the most widely used in production). In production, channel usage must be monitored so that a failed sink does not overflow the channel.
3. Sink (moves data to the destination): e.g. HDFS, Kafka, a database, or a custom sink.
Single Agent:
Agents in series:
Agents in parallel (the most common setup in production):
Multi-sink Agent (also very common):
You can picture Flume as blood vessels: they carry blood to the heart, and the heart then pumps it out to every organ.
Installing Flume
- Install JDK 1.8+ and configure the environment variables
- Install Flume
[root@CentOSA ~]# tar -zxf apache-flume-1.9.0-bin.tar.gz -C /usr/
[root@CentOSA ~]# cd /usr/apache-flume-1.9.0-bin/
[root@CentOSA apache-flume-1.9.0-bin]# ./bin/flume-ng version
Flume 1.9.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: d4fcab4f501d41597bc616921329a4339f73585e
Compiled by fszabo on Mon Dec 17 20:45:25 CET 2018
From source with checksum 35db629a3bda49d23e9b3690c80737f9
Agent Configuration
# Declare the components
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# Configure the components
<Agent>.sources.<Source>.<someProperty> = <someValue>
<Agent>.channels.<Channel>.<someProperty> = <someValue>
<Agent>.sinks.<Sink>.<someProperty> = <someValue>
# Wire the components together
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
<Agent>.sinks.<Sink>.channel = <Channel1>
<Agent>, <Source>, and <Sink> are component names; consult the documentation for the components that are available:
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Test
[root@CentOSA apache-flume-1.9.0-bin]# vi conf/demo01.properties
[root@CentOSA ~]# yum install -y telnet # install the telnet client; it is used below to send test data to the netcat source (r1)
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = CentOSA
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent
[root@CentOSA apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --conf-file conf/demo01.properties --name a1 -Dflume.root.logger=INFO,console
Test
[root@CentOSA ~]# telnet CentOSA 44444
Trying 192.168.40.129...
Connected to CentOSA.
Escape character is '^]'.
hello world
OK
ni hao
OK
Component Overview
Sources
- Avro Source: starts an Avro server internally to accept requests from Avro clients and stores the received data in the Channel.
Avro is a data serialization system designed for applications that exchange large volumes of data. Its main features: it supports binary serialization, so large amounts of data can be processed conveniently and quickly, and it is friendly to dynamic languages, providing mechanisms that let them work with Avro data easily.
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = avro
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[root@train apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avro.properties -Dflume.root.logger=INFO,console
[root@train apache-flume-1.9.0-bin]# ./bin/flume-ng avro-client --host train --port 44444 --filename /root/data/t_employee
- Exec Source: collects the output that a command writes to the console.
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/data/t_employee
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[root@train apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/exec.properties -Dflume.root.logger=INFO,console
echo 'hello world' >> data/t_employee
- Spooling Directory Source: collects new text files added to a static directory. After a file has been collected its suffix is changed, but the source file is not deleted; if you only ever want to collect once, you can change this source's default behavior.
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/spooldir
a1.sources.r1.fileHeader = true
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[root@train apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/spooldir.properties -Dflume.root.logger=INFO,console
Note: only files newly added to the watched directory are collected; a renamed file counts as new and is collected again, but changing a file's content without renaming it does not trigger collection.
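A quick way to observe this (paths are the ones used above): drop a file into the spool directory and list it again after collection; by default the source renames collected files with a .COMPLETED suffix.
cp /root/data/t_employee /root/spooldir/
ls /root/spooldir
t_employee.COMPLETED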
- Taildir Source: monitors text files for appended lines in real time and records the read offset of every collected file, so the next run resumes where the last one stopped, enabling incremental collection.
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = g1 g2
a1.sources.r1.filegroups.g1 = /root/taildir/.*\.xml$
a1.sources.r1.filegroups.g2 = /root/taildir/.*\.properties$
a1.sources.r1.headers.g1.type = xml
a1.sources.r1.headers.g2.type = properties
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[root@train apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/taildir.properties -Dflume.root.logger=INFO,console
[root@train ~]# ls .flume/
jdbc-channel taildir_position.json
[root@train ~]# cat .flume/taildir_position.json
[{"inode":34827659,"pos":12,"file":"/root/taildir/tail01.xml"}]
taildir_position.json records the read positions; this is what makes incremental collection possible.
- Kafka Source: an Apache Kafka consumer that reads messages from Kafka topics. It currently supports Kafka broker version 0.10.1.0 or higher.
Property | Default | Description |
---|---|---|
channels | – | The channel(s) the source writes to |
type | – | Must be set to org.apache.flume.source.kafka.KafkaSource |
kafka.bootstrap.servers | – | Comma-separated list of brokers in the Kafka cluster used by the source |
batchSize | 1000 | Maximum number of messages written to the Channel in one batch |
batchDurationMillis | 1000 | Maximum time (in ms) before a batch is written to the channel; the batch is written as soon as either the size or the time limit is reached |
kafka.topics | – | Comma-separated list of topics the Kafka consumer reads messages from |
kafka.consumer.group.id | flume | Unique ID of the consumer group; setting the same ID in multiple sources or agents means they belong to the same consumer group |
An example of subscribing to topics via a comma-separated topic list.
1. Configuration file
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 100
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = train:9092
a1.sources.r1.kafka.topics = test1
a1.sources.r1.kafka.consumer.group.id = group1
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Note: a1.sources.r1.batchSize must be smaller than a1.channels.c1.transactionCapacity; otherwise an error occurs when the batch is committed to the channel.
2. Run Flume
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/kafkaSource.properties -Dflume.root.logger=INFO,console
3. Start a Kafka producer
./bin/kafka-console-producer.sh --broker-list train:9092 --topic test1
Sinks
- Logger Sink: usually used for testing/debugging.
- File Roll Sink: writes the collected data to local files.
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/file_roll
a1.sinks.k1.sink.rollInterval = 0
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/fileSink.properties
- HDFS Sink: writes data to the HDFS file system.
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume-hdfs/%y-%m-%d
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hdfsSink.properties
telnet train 44444
- Kafka Sink: writes data to a Kafka topic.
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = train:9092
a1.sinks.k1.topic = topic01
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/kafkaSink.properties
./bin/kafka-console-consumer.sh --bootstrap-server train:9092 --topic topic01 --group custom.g.id
telnet train 44444
- Avro Sink: sends data to an Avro Source.
1. Configuration file (both agents below can live in one file, e.g. conf/avroSink.properties, and are selected with --name)
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = avro
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Declare the components
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Configure the components
a2.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a2.sources.r1.batchSize = 100
a2.sources.r1.batchDurationMillis = 2000
a2.sources.r1.kafka.bootstrap.servers = train:9092
a2.sources.r1.kafka.topics = test1
a2.sources.r1.kafka.consumer.group.id = custom.g.id
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = train
a2.sinks.k1.port = 44444
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Wire the components together
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Agent a2: a Kafka Source reads the data and an Avro Sink forwards it to a1's Avro Source.
Agent a1: an Avro Source receives the data and writes it out as log output.
2. Because the Avro Sink connects to a specific IP and port when it sends data, a1 must be started first.
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avroSink.properties -Dflume.root.logger=INFO,console
3. Start a2
./bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/avroSink.properties
4. Start a Kafka producer to test
./bin/kafka-console-producer.sh --broker-list train:9092 --topic test1
If everything works, the data shows up in a1's log output.
Starting a2 first produces an error.
Channels
- Memory Channel: fast; Source data is written straight to memory. It is not safe, and data can be lost.
Note: transactionCapacity must be smaller than capacity.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
- JDBC Channel: events are stored in a persistent store backed by a database. The JDBC channel currently supports embedded Derby. It is a durable channel, ideal for flows where recoverability matters, and worth using when the data really must not be lost.
a1.channels.c1.type = jdbc
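A minimal runnable sketch around the jdbc channel (the netcat source, logger sink, and the file name conf/jdbcChannel.properties are assumptions):
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
# durable channel backed by embedded Derby
a1.channels.c1.type = jdbc
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1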
- Kafka Channel: writes the data collected by the Source to an external Kafka cluster.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = train:9092
a1.channels.c1.kafka.topic = test1
a1.channels.c1.kafka.consumer.group.id = flume-consumer
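The lines above configure only the channel; the launch command and telnet test below imply a complete agent around it. A sketch of conf/kafkaChannel.properties consistent with those commands (the netcat source and logger sink are assumptions):
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
# Kafka-backed channel: events are staged in the test1 topic
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = train:9092
a1.channels.c1.kafka.topic = test1
a1.channels.c1.kafka.consumer.group.id = flume-consumer
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1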
Start Flume
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/kafkaChannel.properties -Dflume.root.logger=INFO,console
Subscribe to the Kafka topic
./bin/kafka-console-consumer.sh --bootstrap-server train:9092 --topic test1 --group custom.g.id
Test
telnet train 44444
- File Channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
a1.channels.c1.dataDirs = /root/flume/data
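A minimal runnable sketch around the file channel (the netcat source, logger sink, and the file name conf/fileChannel.properties are assumptions):
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
# durable channel: checkpoints and event data are kept on local disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
a1.channels.c1.dataDirs = /root/flume/data
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1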
Advanced Components
Interceptors
Interceptors act on the Source component, intercepting or decorating the Events the Source produces. Flume ships with many built-in interceptors:
- Timestamp Interceptor: decorator; adds timestamp information to the Event header.
- Host Interceptor: decorator; adds host information to the Event header.
- Static Interceptor: decorator; adds a custom key and value to the Event header.
- Remove Header Interceptor: decorator; removes the specified key from the Event header.
- UUID Interceptor: decorator; adds a random, unique UUID string to the Event header.
- Search and Replace Interceptor: decorator; searches the Event body and replaces the matching content.
- Regex Filtering Interceptor: filter; keeps or drops events whose content matches a regular expression.
- Regex Extractor Interceptor: decorator; searches the Event body and adds the matching content to the Event header.
Test 1: Timestamp, Host, Static, Remove, UUID, Search and Replace
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source configuration: collect the data
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
# Add the interceptors
a1.sources.r1.interceptors = i1 i2 i3 i4 i5 i6
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
# custom key and value
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = hello
a1.sources.r1.interceptors.i3.value = world
a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i4.headerName = uuid
a1.sources.r1.interceptors.i5.type = remove_header
a1.sources.r1.interceptors.i5.withName = hello
a1.sources.r1.interceptors.i6.type = search_replace
a1.sources.r1.interceptors.i6.searchPattern = ^tangc
a1.sources.r1.interceptors.i6.replaceString = yes
# Sink configuration: deliver the data
a1.sinks.k1.type = logger
# Channel: buffer the events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
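To try it, start the agent and send a line over telnet (the file name conf/interceptors01.properties is an assumption). Each logged event should carry timestamp, host, and uuid headers (the static hello=world header is added by i3 and then removed by i5), and a body starting with "tangc" has that prefix replaced by "yes":
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/interceptors01.properties -Dflume.root.logger=INFO,console
telnet train 44444
tangc hello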
Test 2: Regex Filtering, Regex Extractor
# Declare the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source configuration: collect the data
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
# Add the interceptors
a1.sources.r1.interceptors = i1 i2
# Extract a leading INFO or ERROR from the Event body and add it to the Event header
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(INFO|ERROR)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = loglevel
# Filter
a1.sources.r1.interceptors.i2.type = regex_filter
# keep events containing 'tang'
a1.sources.r1.interceptors.i2.regex = .*tang.*
# false keeps matching events; true excludes them
a1.sources.r1.interceptors.i2.excludeEvents = false
# Sink configuration: deliver the data
a1.sinks.k1.type = logger
# Channel: buffer the events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
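Start the agent the same way (conf/interceptors02.properties is an assumed name) and type test lines; the first line below is kept (it contains "tang") and gains a loglevel=INFO header, while the second is dropped by the regex_filter:
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/interceptors02.properties -Dflume.root.logger=INFO,console
telnet train 44444
INFO tang logged in
hello world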
Channel Selectors
When one Source feeds multiple Channels, the channel selector decides how the Source's data is routed to them. If no selector is specified, the Source data is broadcast to all Channels by default (the replicating selector).
Replicating
# Declare the components
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
# sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/file_roll_1
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/file_roll_2
a1.sinks.k2.sink.rollInterval = 0
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = jdbc
# Wire the components together
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
Error
Cause: with HIVE_HOME set in the environment, the derby jar in Hive's lib conflicts with the derby jar in Flume's lib.
Fix: 1. If HIVE_HOME is configured, remove the derby jar from either Hive's lib or Flume's lib (deleting it from one side is enough).
2. By default, Flume uses the replicating (broadcast) channel selector.
Test: start Flume, send test data via telnet, and check the data in the files written by the sinks, as sketched below.
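A sketch of the test commands (the file name conf/replicating.properties is an assumption); every line typed into telnet should appear both under /root/file_roll_1 (via the memory channel) and /root/file_roll_2 (via the jdbc channel):
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/replicating.properties -Dflume.root.logger=INFO,console
telnet train 44444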
Multiplexing: classifies the data and writes different classes to different channels.
Example: write data containing INFO to c1 and data containing ERROR to c2.
# Declare the components
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Channel selector: multiplexing (route events by header value)
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = level
a1.sources.r1.selector.mapping.INFO = c1
a1.sources.r1.selector.mapping.ERROR = c2
a1.sources.r1.selector.default = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(INFO|ERROR)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = level
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/file_roll_1
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/file_roll_2
a1.sinks.k2.sink.rollInterval = 0
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = jdbc
# Wire the components together
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
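A launch-and-test sketch (the file name conf/multiplexing.properties is an assumption). Lines starting with INFO should end up in /root/file_roll_1, lines starting with ERROR in /root/file_roll_2, and anything else falls back to c1:
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/multiplexing.properties -Dflume.root.logger=INFO,console
telnet train 44444
INFO user login ok
ERROR disk full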
Sink Processors
Flume uses a Sink Group to wrap several Sink instances into one logical Sink; internally, Sink Processors provide failover and load balancing for the group.
Load balancing Sink Processor: provides the ability to balance the load across multiple sinks.
# Declare the components
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/file_roll_1
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.sink.batchSize = 1
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/file_roll_2
a1.sinks.k2.sink.rollInterval = 0
a1.sinks.k2.sink.batchSize = 1
# Configure the Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
To see the load-balancing effect, sink.batchSize and transactionCapacity must both be set to 1, as in the config above.
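A launch sketch (conf/loadbalance.properties is an assumed file name); with round_robin selection, consecutive telnet lines should alternate between /root/file_roll_1 and /root/file_roll_2:
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/loadbalance.properties -Dflume.root.logger=INFO,console
telnet train 44444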
Failover Sink Processor: maintains a prioritized list of sinks, guaranteeing that as long as one sink is available, events are processed (delivered).
# Declare the components
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
# Configure the components
a1.sources.r1.type = netcat
a1.sources.r1.bind = train
a1.sources.r1.port = 44444
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/file_roll_1
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.sink.batchSize = 1
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/file_roll_2
a1.sinks.k2.sink.rollInterval = 0
a1.sinks.k2.sink.batchSize = 1
#Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 20
a1.sinkgroups.g1.processor.priority.k2 = 10
# maximum backoff period (ms) for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
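A launch sketch (conf/failover.properties is an assumed file name); all events should go to k1 (priority 20) while it is healthy, failing over to k2 only when k1 fails:
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/failover.properties -Dflume.root.logger=INFO,console
telnet train 44444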
API Integration
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.9.0</version>
</dependency>
Standalone
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class FlumeRpcClientTest {
    private RpcClient client;

    @Before
    public void before() {
        // connect to the Avro Source listening on 10.15.0.34:44444
        client = RpcClientFactory.getDefaultInstance("10.15.0.34", 44444);
    }

    @Test
    public void testAvro() throws EventDeliveryException {
        // build an event from a plain-text body and send it to the agent
        Event event = EventBuilder.withBody("1 zhangsan true 28".getBytes());
        client.append(event);
    }

    @After
    public void after() {
        client.close();
    }
}
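Beyond a single host, the Flume SDK also offers a load-balancing RPC client that spreads appends across several Avro Sources, which pairs naturally with the sink-processor setups above. A sketch following the Flume developer guide; the second host address is an assumption:
import java.util.Properties;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class LoadBalancingClientDemo {
    public static void main(String[] args) throws EventDeliveryException {
        // logical host names mapped to Avro Source addresses
        Properties props = new Properties();
        props.put("client.type", "default_loadbalance");
        props.put("hosts", "h1 h2");
        props.put("hosts.h1", "10.15.0.34:44444");
        props.put("hosts.h2", "10.15.0.35:44444"); // assumed second agent
        props.put("host-selector", "round_robin"); // or "random"
        RpcClient client = RpcClientFactory.getInstance(props);
        try {
            // successive appends are spread across h1 and h2
            for (int i = 0; i < 10; i++) {
                Event event = EventBuilder.withBody(("event " + i).getBytes());
                client.append(event);
            }
        } finally {
            client.close();
        }
    }
}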