Official documentation: http://flume.apache.org/FlumeUserGuide.html
Flume core concepts
Client: produces the data; runs in a separate thread.
Agent: a Java process running on a log-collecting node (a server node); it contains the three core components source, channel, and sink.
Source: collects data from the Client and can handle log data of many types and formats. The data is wrapped in events and handed to the channel.
Channel: buffers the collected data, holding the events passed over by the source component.
Sink: takes data out of the channel and sends it to the destination. Destinations include HDFS, logger, Avro, Thrift, IPC, file, HBase, Solr, and custom sinks.
Event: an event wraps the data being transported and is Flume's basic unit of data transfer; it is events that flow through the whole pipeline, so transactional guarantees are made at the event level. For a text file an event is usually one line (a single record), and the event is also the basic unit of a transaction. An event flows from source to channel to sink; its body is a byte array, and it may carry headers (metadata). An event is the smallest complete unit of data, arriving from an external data source and bound for an external destination. A complete event consists of event headers and an event body (for a text file, the single-line record); the body is the log record Flume collected. A minimal agent configuration illustrating these components follows.
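As a sketch of the pipeline (the standard single-node quickstart, with illustrative names): a netcat source turns every line received on a port into one event, a memory channel buffers it, and a logger sink prints it.

# example.conf: single-node agent (netcat source -> memory channel -> logger sink)
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# every line received on the port becomes one event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# buffer events in memory
a1.channels.c1.type = memory
# print each event (headers + body) to the log
a1.sinks.k1.type = logger
# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1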
Partitioning: Flume event data is usually partitioned by time, configured through agent1.sinks.sink1.hdfs.path using subdirectories containing time-format escape sequences. For example, agent1.sinks.sink1.hdfs.path = /tmp/flume/year=%Y/month=%m/day=%d partitions at day granularity. The partition a Flume event is written to is determined by the timestamp in the event header; by default events carry no timestamp header, so one is added with an interceptor.
(The complete list of format escape sequences is given in the HDFS sink section of the Flume User Guide.)
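As a sketch (paths are illustrative), a day-partitioned HDFS sink, with event timestamps rounded down to the hour so that events written close together land in the same subdirectory:

# day-partitioned HDFS sink (illustrative)
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume/year=%Y/month=%m/day=%d
# round event timestamps down (supported units: second, minute, hour)
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 1
agent1.sinks.sink1.hdfs.roundUnit = hour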
Interceptors: an interceptor is a component that modifies or drops events in the event stream. Interceptors are attached to a source and process events before they are handed to the channel.
# Add a timestamp interceptor to source1: every event the source produces gets a timestamp header
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
If there are multiple tiers of Flume agents, the time an event was created and the time it is written can differ; in that case set agent1.sinks.sink1.hdfs.useLocalTimeStamp = true so the HDFS sink uses the local time of the agent writing the file.
Transactions: Flume uses two separate transactions, one for event delivery from source to channel and one for delivery from channel to sink.
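Transaction sizing is a channel property; as a sketch with illustrative values, a memory channel with explicit capacity settings:

# memory channel with explicit transaction sizing (illustrative values)
agent1.channels.channel1.type = memory
# maximum number of events the channel can hold
agent1.channels.channel1.capacity = 10000
# maximum number of events per source put / sink take transaction
agent1.channels.channel1.transactionCapacity = 1000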
Reliability: Flume delivers each event from source to sink at least once (at-least-once semantics), so the same event may arrive more than once; duplicates can be produced on either the source side or the sink side.
Fan out: delivering events from one source to multiple channels, and hence to multiple sinks; see the sketch below.
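A minimal fan-out sketch (names illustrative): listing two channels on the source replicates every event into both; replicating is the default channel selector, and selector.type = multiplexing would route by header value instead.

agent1.sources = source1
agent1.channels = channel1 channel2
agent1.sinks = sink1 sink2
# two channels on the source fan every event out to both
agent1.sources.source1.channels = channel1 channel2
# replicating is the default; shown explicitly for clarity
agent1.sources.source1.selector.type = replicating
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink2.channel = channel2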
A two-tier Flume agent configuration using a spooling directory source and an HDFS sink:
# First tier agent
# Name the agent's components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Wire the components together
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
# Describe the source
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir
# Describe the sink
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = localhost
agent1.sinks.sink1.port = 10000
# Describe the channel
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir=/tmp/agent1/file-channel/checkpoint
agent1.channels.channel1.dataDirs=/tmp/agent1/file-channel/data
# Second tier agent
agent2.sources = source2
agent2.sinks = sink2
agent2.channels = channel2
agent2.sources.source2.channels = channel2
agent2.sinks.sink2.channel = channel2
agent2.sources.source2.type = avro
agent2.sources.source2.bind = localhost
agent2.sources.source2.port = 10000
agent2.sinks.sink2.type = hdfs
# Configure the HDFS directory and file name format
agent2.sinks.sink2.hdfs.path = /tmp/flume
agent2.sinks.sink2.hdfs.filePrefix = events
agent2.sinks.sink2.hdfs.fileSuffix = .log
agent2.sinks.sink2.hdfs.fileType = DataStream
agent2.channels.channel2.type = file
agent2.channels.channel2.checkpointDir=/tmp/agent2/file-channel/checkpoint
agent2.channels.channel2.dataDirs=/tmp/agent2/file-channel/data
Startup commands:
The two agents must be run separately.
Start agent1 first:
% flume-ng agent --conf-file <path to the agent1 configuration file> --name agent1 --conf $FLUME_HOME/conf -Dflume.root.logger=INFO,console
Then start agent2:
% flume-ng agent --conf-file <path to the agent2 configuration file> --name agent2 --conf $FLUME_HOME/conf -Dflume.root.logger=INFO,console
(--conf points to the directory holding Flume's general configuration.)
Using a sink group for load balancing or failover (the example below shows load balancing; a failover variant is sketched after it):
# First tier agent
agent1.sources = source1
# Separate multiple sinks with spaces
agent1.sinks = sink1a sink1b
# Name the sink group
agent1.sinkgroups = sinkgroup1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1a.channel = channel1
agent1.sinks.sink1b.channel = channel1
agent1.sinkgroups.sinkgroup1.sinks = sink1a sink1b
# Set the processor type: load_balance distributes events across the sinks in turn (the selection strategy can be changed via processor.selector)
agent1.sinkgroups.sinkgroup1.processor.type = load_balance
agent1.sinkgroups.sinkgroup1.processor.backoff = true
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir
agent1.sinks.sink1a.type = avro
agent1.sinks.sink1a.hostname = localhost
agent1.sinks.sink1a.port = 10000
agent1.sinks.sink1b.type = avro
agent1.sinks.sink1b.hostname = localhost
agent1.sinks.sink1b.port = 10001
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir=/tmp/agent1/file-channel/checkpoint
agent1.channels.channel1.dataDirs=/tmp/agent1/file-channel/data
# Second tier agent
agent2a.sources = source2a
agent2a.sinks = sink2a
agent2a.channels = channel2a
agent2a.sources.source2a.channels = channel2a
agent2a.sinks.sink2a.channel = channel2a
agent2a.sources.source2a.type = avro
agent2a.sources.source2a.bind = localhost
agent2a.sources.source2a.port = 10000
agent2a.sinks.sink2a.type = hdfs
agent2a.sinks.sink2a.hdfs.path = /tmp/flume
agent2a.sinks.sink2a.hdfs.filePrefix = events-a
agent2a.sinks.sink2a.hdfs.fileSuffix = .log
agent2a.sinks.sink2a.hdfs.fileType = DataStream
agent2a.channels.channel2a.type = file
agent2a.channels.channel2a.checkpointDir=/tmp/agent2a/file-channel/checkpoint
agent2a.channels.channel2a.dataDirs=/tmp/agent2a/file-channel/data
# Second tier agent (running on a different port number)
agent2b.sources = source2b
agent2b.sinks = sink2b
agent2b.channels = channel2b
agent2b.sources.source2b.channels = channel2b
agent2b.sinks.sink2b.channel = channel2b
agent2b.sources.source2b.type = avro
agent2b.sources.source2b.bind = localhost
agent2b.sources.source2b.port = 10001
agent2b.sinks.sink2b.type = hdfs
agent2b.sinks.sink2b.hdfs.path = /tmp/flume
agent2b.sinks.sink2b.hdfs.filePrefix = events-b
agent2b.sinks.sink2b.hdfs.fileSuffix = .log
agent2b.sinks.sink2b.hdfs.fileType = DataStream
agent2b.channels.channel2b.type = file
agent2b.channels.channel2b.checkpointDir=/tmp/agent2b/file-channel/checkpoint
agent2b.channels.channel2b.dataDirs=/tmp/agent2b/file-channel/data
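For failover instead of load balancing, the same sink group can use the failover processor; a sketch (priorities illustrative) where all events go to the highest-priority live sink and fall back to the other only on failure:

agent1.sinkgroups.sinkgroup1.processor.type = failover
# higher priority wins; sink1b takes over only if sink1a fails
agent1.sinkgroups.sinkgroup1.processor.priority.sink1a = 10
agent1.sinkgroups.sinkgroup1.processor.priority.sink1b = 5
# maximum back-off (in ms) for a failed sink before it is retried
agent1.sinkgroups.sinkgroup1.processor.maxpenalty = 10000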
Using Kafka as the sink:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /tmp/spoolDir
# the channel also needs a type; a memory channel is assumed here
a1.channels.c1.type = memory
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
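To verify delivery (assuming a local Kafka broker and the standard Kafka CLI tools on the path), consume the topic:
% kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic --from-beginning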