About Flume
Flume is designed to bulk-load massive volumes of event-based data into Hadoop. A typical use is collecting log files from a fleet of web servers and saving the log events to HDFS for later analysis.
Flume's basic building blocks are the source, the channel, and the sink. Using Flume means running a Flume agent: a long-lived JVM process hosting the sources, sinks, and the channels that connect them. A source produces events and hands them to a channel; the channel buffers the events until they are forwarded to a sink.
Installation
Download Flume and configure the environment variables (e.g. set FLUME_HOME and add its bin directory to PATH).
Example
- Listen on a network port
- Print the received messages to the console
#example.conf: A single-node Flume configuration
#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Describe the sink
a1.sinks.k1.type = logger
#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the Flume agent
flume-ng agent --conf $FLUME_HOME/conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
Send an event to Flume
telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello flume
OK
Flume logs the received event:
2018-12-30 22:04:42,039 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:166)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
2018-12-30 22:04:57,046 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65 0D hello flume. }
Flume can also read environment variables in its configuration file; this is enabled by starting the agent with propertiesImplementation = org.apache.flume.node.EnvVarResolverProperties.
a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${NC_PORT}
a1.sources.r1.channels = c1
Check that the referenced environment variable is set.
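For example, assuming the agent should listen on port 44444, the variable referenced as ${NC_PORT} can be set and checked in the shell that will launch the agent:

```shell
# Set the port referenced as ${NC_PORT} in the agent configuration
export NC_PORT=44444
# Confirm the variable is visible to the launching shell
echo "$NC_PORT"
```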
Start the Flume agent
flume-ng agent --conf $FLUME_HOME/conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties
Transactions and reliability
Flume uses two separate transactions: one from source to channel and one from channel to sink. The source writes each batch of events to the channel inside a transaction; if any event in the batch fails, nothing from that batch is written to the channel, the partially received data is discarded, and the upstream is expected to resend it. Transactions apply in the same way to delivery from channel to sink: on failure the transaction rolls back and all events remain in the channel awaiting redelivery.
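The two-transaction delivery described above can be sketched in pseudocode (a simplified view loosely modeled on Flume's Transaction API, not exact calls):

```
# source -> channel: one transaction per batch
tx = channel.getTransaction()
tx.begin()
try:
    for event in batch:
        channel.put(event)
    tx.commit()        # the whole batch becomes visible to the sink
except DeliveryFailure:
    tx.rollback()      # nothing is written; upstream must resend
finally:
    tx.close()

# channel -> sink follows the same pattern; on rollback the
# events stay in the channel and are redelivered later
```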
The memory channel used in the example offers high throughput but no durability. Flume ships several channel types, including memory, jdbc, kafka, and file.
at-least-once: every event produced by a source is guaranteed to reach the sink at least once; because of failures and retries, the same event may arrive more than once. Duplicates are left to downstream jobs, such as MapReduce or Hive, to remove.
Where possible, Flume processes events in transaction-sized batches to improve efficiency.
Interceptors
A Flume interceptor runs as events pass from source to channel; it can read or set attributes in the event header, or preprocess the event body.
For example, with HDFS as the sink, Flume events are typically partitioned by time, which is configured through hdfs.path:
a1.sinks.k1.hdfs.path = /mine/flume/events/%y/%m/%d
Which partition an event is written to is decided by the timestamp header. Events carry no timestamp header by default; an interceptor can add one:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
In general, with multiple tiers of Flume agents it is advisable to use the timestamp generated by the agent running the HDFS sink:
a1.sinks.k1.hdfs.useLocalTimeStamp = true
To avoid Flume producing large numbers of small files on HDFS, increase rollSize (in bytes); since rolling is also triggered by rollInterval and rollCount, set those to 0 when file size alone should drive the roll:
a1.sinks.k1.hdfs.rollSize = 100000000
Interceptors shipped with Flume:
- Timestamp Interceptor
- Host Interceptor
- Static Interceptor
- Remove Header Interceptor
- UUID Interceptor
- Morphline Interceptor
- Search and Replace Interceptor
- Regex Filtering Interceptor
- Regex Extractor Interceptor
- Custom interceptor implementations
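A custom interceptor is wired in through the fully-qualified class name of its Builder; com.example.MyInterceptor below is a hypothetical class implementing org.apache.flume.interceptor.Interceptor:

```properties
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.example.MyInterceptor$Builder
```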
Fan out
Fan out means a single source delivers events to multiple channels, and hence to multiple sinks. The following configuration delivers each event to both an HDFS sink and a logger sink:
#Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /mine/flume/events/%y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100000000
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k2.type = logger
#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
With this configuration, every event received by r1 is replicated to c1 and c2 and emitted by both sinks.
A plain fan out replicates events to every channel. Sometimes different events need to go to different channels; this is done by configuring a multiplexing selector on the source, which matches a particular header value to a channel:
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
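The multiplexing selector also supports optional channels: a write to an optional channel is attempted, but a failure there does not fail the event. For example, to additionally copy CZ events to c3 on a best-effort basis:

```properties
a1.sources.r1.selector.optional.CZ = c3
```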
Tiered agents
A tiered Flume topology can be used to aggregate events: the first tier collects events from the original sources (such as web server logs), and the second tier aggregates the first tier's events and writes them to HDFS.
In the following example, agent1 listens on a network port and sends events to Kafka, and agent2 consumes from Kafka and logs the events to the console, forming a two-tier Flume setup.
agent1:
#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = test1
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
agent2:
#Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
#Describe/configure the source
a2.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a2.sources.r1.batchSize = 5000
a2.sources.r1.batchDurationMillis = 2000
a2.sources.r1.kafka.bootstrap.servers = localhost:9092
a2.sources.r1.kafka.topics = test1
a2.sources.r1.kafka.consumer.group.id = custom.g.id
#Describe the sink
a2.sinks.k1.type = logger
#Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Sink groups
Flume allows several sinks to be treated as a single sink, providing failover or load balancing:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
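The configuration above load-balances across k1 and k2, picking a sink at random. For failover instead, the failover processor sends all events to the highest-priority live sink; the property names below follow the Flume user guide:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
```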