Flume basic architecture diagram:
An Agent contains a Source, a Channel, and a Sink. A Sink can connect to HDFS, JMS, or the Source of another Agent.
Flume terminology
-
Flume Event
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.
Question: since Flume processes data in units of Events, how is Event parsing actually done? Investigate this through the Spark-Flume integration example, then come back and update this note.
-
Agent
A Flume agent is a (JVM) process that hosts the components (Source, Channel, Sink) through which events flow from an external source to the next destination (hop).
-
Source
The Source of a Flume Agent receives data sent by an external data source (e.g. the Web Server in the example above). Note that the data sent by the external source must be in a format that the Agent's Source is configured to accept. For example, an Avro Source accepts only Avro-formatted data: an external source can send data to the Agent's Source via an Avro client, and an Avro Source can also receive Avro Event data sent by the Avro Sink of another Agent. (A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink.)
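As a sketch, an Avro Source listening on a port might be configured like this (the agent name a1, bind address, and port are assumptions):

```properties
a1.sources = r1
a1.sources.r1.type = avro
# accept Avro connections on all interfaces, port 4141
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# deliver received events to channel c1
a1.sources.r1.channels = c1
```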
-
Channel
The Agent's Channel is a buffer for the data received by the Source: it holds the data until it is consumed (written out to a Sink). When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it's consumed by a Flume sink. The file channel is one example – it is backed by the local filesystem.
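For example, a file channel backed by the local filesystem might be configured as follows (the directory paths are assumptions):

```properties
a1.channels = c1
a1.channels.c1.type = file
# where the channel persists its checkpoint and event data on local disk
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/flume/data
```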
-
Sink
The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow.
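A sketch of an HDFS sink draining channel c1 (the NameNode address and HDFS path are assumptions):

```properties
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# write plain events, bucketed by day; use the agent's local time for %Y-%m-%d
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```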
-
Asynchrony of Source and Sink
The source and sink within the given agent run asynchronously with the events staged in the channel.
An Agent can define multiple Channels and Sinks
As the figure below shows, data received by one Source can be written to several Channels, with a different Sink attached to each Channel. The same log record can thus be written to different storage destinations (e.g. HDFS, JMS, MySQL) or serve as the external data source of another Agent, flowing on to the next hop. This forms a pipelined data-processing flow.
Agents can be combined into a streaming system: the output of Agent foo becomes the input of Agent bar, like a pipe. And since one Agent can have multiple Channels and Sinks, the same log record can flow to several destinations at once, e.g. HDFS, JMS, or another Agent.
How do you configure an Agent with multiple sources and sinks? As shown below, the names are separated by spaces, not commas:
# configuration with 2 channels and 2 sinks; here we define two sinks, one for Kafka and one for HDFS
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
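The wiring for the lines above can be sketched like this (the replicating selector is Flume's default, shown explicitly here) so that the same events fan out to both channels:

```properties
# copy every event from r1 into both channels
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating
# each sink drains its own channel
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```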
Multiple Agents can send their output to a single Agent, which produces a consolidation effect.
Two Agents can be combined to form multi-hop processing
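A sketch of the hop between two agents: foo's Avro Sink sends to bar's Avro Source (the agent names, hostname, and port are assumptions):

```properties
# agent foo: forward events from its channel to agent bar over Avro
foo.sinks = avroSink
foo.sinks.avroSink.type = avro
foo.sinks.avroSink.hostname = bar-host
foo.sinks.avroSink.port = 4545
foo.sinks.avroSink.channel = c1

# agent bar: receive the events sent by foo
bar.sources = avroSrc
bar.sources.avroSrc.type = avro
bar.sources.avroSrc.bind = 0.0.0.0
bar.sources.avroSrc.port = 4545
bar.sources.avroSrc.channels = c1
```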
Flume installation
1. Download Flume
wget http://mirror.bit.edu.cn/apache/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
2. Extract and configure Flume
tar xzvf apache-flume-1.5.2-bin.tar.gz
# in the conf directory
cp flume-env.sh.template flume-env.sh
# in flume-env.sh, add the JAVA_HOME environment variable:
export JAVA_HOME=/home/hadoop/software/jdk1.7.0_67
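With Flume unpacked and JAVA_HOME set, an agent can be started from the installation directory. A sketch, assuming a configuration file conf/example.conf that defines an agent named a1:

```shell
# start agent a1 in the foreground, logging to the console
bin/flume-ng agent \
  --conf conf \
  --conf-file conf/example.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console
```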