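Overview
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
Flume is not limited to log data aggregation. Because data sources are customizable, Flume can transport large volumes of event data, including but not limited to network traffic data, social-media-generated data, email messages, and data from almost any other source.
System Requirements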
- Java Runtime Environment - Java 1.8 or later
- Memory - Sufficient memory for configurations used by sources, channels or sinks
- Disk Space - Sufficient disk space for configurations used by channels or sinks
- Directory Permissions - Read/Write permissions for directories used by agent
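Data Flow Model
Every agent must have three components: a source, a channel, and a sink. The source ingests data, the channel is the pipe that carries events from source to sink, and the sink delivers events to their destination.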
- Setting multi-agent flow
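For data to flow across multiple agents, the sink of the previous agent and the source of the current agent must both be of type avro, with the sink pointing to the hostname (or IP address) and port of the source.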
- Consolidation
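A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.
Two-tier Flume: this can be achieved by configuring a number of first-tier agents with an avro sink, all pointing to the avro source of a single agent (thrift sources/sinks/clients can also be used in this scenario). The source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink that writes to the final destination.
- Multiplexing the flow
Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.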
Setup
Setting up an agent
- Flume agent configuration is stored in a local configuration file.
- Configurations for one or more agents can be specified in the same configuration file.
- The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.
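Steps
- Configuring individual components
Give each component in the flow (source, sink or channel) a name, a type, and the set of properties specific to that type and instance.
- Wiring the pieces together
Configure how the sources and sinks connect to the channels.
- Starting an agent
Start an agent with the shell script flume-ng, located in the bin directory of the Flume distribution. Specify the agent name, the config directory, and the config file on the command line:
./bin/flume-ng agent -n $agent_name -c $config_dir -f $config_dir/$config_file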
config_file
# General layout of a configuration file
# 1. Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# 2. Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# 3. Describe the sink
a1.sinks.k1.type = logger
# 4. Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# 5. Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
logging for debug
The following example enables both configuration logging and raw data logging, and also sets the Log4j log level to DEBUG with console output:
./flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true
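-Dorg.apache.flume.log.printconfig=true
Enables configuration-related logging. It can be passed on the command line or set in the JAVA_OPTS variable in flume-env.sh.
-Dorg.apache.flume.log.rawdata=true
Enables raw data logging.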
Zookeeper based Configuration
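Flume can configure agents through Zookeeper. This is an experimental feature. The configuration file must be uploaded to Zookeeper, where it is stored as Zookeeper node data.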
config_file in detail
- Defining the flow
# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>
# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>
Note: a source can be bound to multiple channels, but each sink can be bound to only one channel.
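The binding rule can be checked mechanically before starting an agent. The following is a minimal sketch, not part of Flume: `parse_properties` and `check_wiring` are hypothetical helpers written for this note, and they only understand the simple `key = value` lines used in these examples.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of wiring problems for one agent (empty if none found)."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    # A source may list several channels, but each must be declared.
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source} has no channel")
        for ch in bound:
            if ch not in channels:
                problems.append(f"source {source} uses undeclared channel {ch}")
    # A sink must name exactly one declared channel.
    for sink in props.get(f"{agent}.sinks", "").split():
        bound = props.get(f"{agent}.sinks.{sink}.channel", "").split()
        if len(bound) != 1:
            problems.append(f"sink {sink} must name exactly one channel")
        elif bound[0] not in channels:
            problems.append(f"sink {sink} uses undeclared channel {bound[0]}")
    return problems


EXAMPLE = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

if __name__ == "__main__":
    print(check_wiring(parse_properties(EXAMPLE), "a1"))
```

An empty list means the sketch found nothing wrong; it does not validate component types or their properties.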
- Configuring individual components
After defining the flow, you need to set properties of each source, sink and channel.
# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>
# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>
# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>
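Every Flume component needs its type property set so that Flume knows what kind of component to create. Each source, channel, and sink type has its own set of required properties; set them as described in the documentation for that type.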
- Adding multiple flows in an agent
A single agent can contain several independent flows. List multiple sources, channels, and sinks in config_file, then wire them into separate flows:
# list the sources, sinks and channels in the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# flow #1 configuration
<Agent>.sources.<Source1>.channels = <Channel1>
<Agent>.sinks.<Sink1>.channel = <Channel1>
# flow #2 configuration
<Agent>.sources.<Source2>.channels = <Channel2>
<Agent>.sinks.<Sink2>.channel = <Channel2>
- Configuring a multi agent flow
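To set up a flow that spans multiple agents, the sink of the upstream agent and the source of the downstream agent must be of the same type, avro or thrift, so that events can be passed from one agent to the next.
- config_file of the upstream agent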
# list sources, sinks and channels in the agent
<Agent1>.sources = avro-AppSrv-source
<Agent1>.sinks = avro-forward-sink
<Agent1>.channels = file-channel
# define the flow
<Agent1>.sources.avro-AppSrv-source.channels = file-channel
<Agent1>.sinks.avro-forward-sink.channel = file-channel
# avro sink properties
<Agent1>.sinks.avro-forward-sink.type = avro
<Agent1>.sinks.avro-forward-sink.hostname = 10.1.1.100
<Agent1>.sinks.avro-forward-sink.port = 10000
# configure other pieces
#...
- config_file of the downstream agent
# list sources, sinks and channels in the agent
<Agent2>.sources = avro-collection-source
<Agent2>.sinks = hdfs-sink
<Agent2>.channels = mem-channel
# define the flow
<Agent2>.sources.avro-collection-source.channels = mem-channel
<Agent2>.sinks.hdfs-sink.channel = mem-channel
# avro source properties
<Agent2>.sources.avro-collection-source.type = avro
<Agent2>.sources.avro-collection-source.bind = 10.1.1.100
<Agent2>.sources.avro-collection-source.port = 10000
# configure other pieces
#...
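With this, the two agents are connected: the avro sink of the first agent forwards events to the avro source of the second.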
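A mismatch between the sink's hostname/port and the source's bind/port is an easy mistake when the two files are edited separately. The sketch below shows one way to compare them; `hop_matches` is a hypothetical helper, and the dictionaries hold the relevant properties from the two files above with the agent prefix stripped.

```python
def hop_matches(upstream, sink, downstream, source):
    """True if the upstream sink points at the downstream source.

    Both components must share the same RPC type (avro or thrift),
    use the same port, and agree on the address.
    """
    sink_type = upstream.get(f"sinks.{sink}.type")
    source_type = downstream.get(f"sources.{source}.type")
    if sink_type not in ("avro", "thrift") or sink_type != source_type:
        return False
    port = upstream.get(f"sinks.{sink}.port")
    if port is None or port != downstream.get(f"sources.{source}.port"):
        return False
    bind = downstream.get(f"sources.{source}.bind")
    if bind is None:
        return False
    # A source bound to 0.0.0.0 listens on every interface; this sketch
    # cannot verify that the sink's hostname resolves to that machine.
    return bind == "0.0.0.0" or bind == upstream.get(f"sinks.{sink}.hostname")


AGENT1 = {
    "sinks.avro-forward-sink.type": "avro",
    "sinks.avro-forward-sink.hostname": "10.1.1.100",
    "sinks.avro-forward-sink.port": "10000",
}
AGENT2 = {
    "sources.avro-collection-source.type": "avro",
    "sources.avro-collection-source.bind": "10.1.1.100",
    "sources.avro-collection-source.port": "10000",
}

if __name__ == "__main__":
    print(hop_matches(AGENT1, "avro-forward-sink",
                      AGENT2, "avro-collection-source"))
```

The comparison is purely textual, so a hostname on one side and an IP address on the other would be reported as a mismatch even if they refer to the same machine.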
- Fan out flow
Flume supports fanning out the flow from a single source to multiple channels and their sinks. There are two modes: replicating (the default) and multiplexing.
- replicating mode
# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>
# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>
<Agent>.sources.<Source1>.selector.type = replicating
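- multiplexing mode
Multiplexing needs a further set of properties to split the flow across channels and sinks. The selector checks the event header named by selector.header: if its value matches one of the configured mappings, the event is sent to all channels mapped to that value; if nothing matches, the event is sent to the channels configured as the default: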
# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
<Agent>.sources.<Source1>.selector.default = <Channel2>
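The mapping allows overlapping the channels for each value.
In other words, multiplexing mode can include replication for individual values. For example, with a header named state, a value of 1 could go to channel1, 2 to channel2, and 3 to both channel1 and channel2, with every other value going to the default channel.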
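The routing decision itself is small enough to sketch in a few lines. This is an illustration of the behaviour described above, not Flume's implementation: `select_channels` is a hypothetical function, and the choice of channel2 as the default is an assumption made for the example.

```python
def select_channels(headers, header_name, mapping, default):
    """Return the channels an event is routed to.

    headers:     the event's headers, e.g. {"state": "3"}
    header_name: the header the selector inspects (selector.header)
    mapping:     header value -> list of channels (selector.mapping.*)
    default:     channels used when no mapping matches (selector.default)
    """
    value = headers.get(header_name)
    return list(mapping.get(value, default))


# The state example: 1 -> channel1, 2 -> channel2, 3 -> both.
MAPPING = {
    "1": ["channel1"],
    "2": ["channel2"],
    "3": ["channel1", "channel2"],
}
DEFAULT = ["channel2"]  # assumed default for this sketch

if __name__ == "__main__":
    for state in ("1", "2", "3", "7"):
        print(state, select_channels({"state": state}, "state", MAPPING, DEFAULT))
```

An event without the state header at all falls through to the default in the same way as an unmapped value.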
Source and sink batch sizes and channel transaction capacities
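Sources and sinks can have a batch size parameter that determines the maximum number of events they process in one batch. Each batch is handled inside a channel transaction, and a transaction has an upper limit called the transaction capacity. The batch size must therefore not be larger than the channel's transaction capacity.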
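A quick sanity check of these numbers might look like the sketch below. `oversized_batches` is a hypothetical helper, the batch sizes are made-up values, and treating a batch size equal to the transaction capacity as acceptable is an assumption of this sketch.

```python
def oversized_batches(transaction_capacity, batch_sizes):
    """Return the components whose batch cannot fit in one transaction.

    transaction_capacity: the channel's transactionCapacity
    batch_sizes:          component name -> configured batch size
    """
    return sorted(
        name for name, size in batch_sizes.items()
        if size > transaction_capacity
    )


if __name__ == "__main__":
    # transactionCapacity = 100, as in the memory channel example;
    # the batch sizes for r1 and k1 are invented for illustration.
    print(oversized_batches(100, {"r1": 100, "k1": 500}))
```

Any component the function reports needs either a smaller batch size or a channel with a larger transaction capacity.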
For the configuration properties of each specific source, channel, and sink, refer to the official documentation.