reference: http://flume.apache.org/FlumeUserGuide.html
1. What problem does Flume solve?
The official documentation describes it as follows:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
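In short, Flume collects, aggregates, and moves log data from many different sources into a centralized data store. One important feature is the wide range of data sources it supports, and the range of destinations it can deliver to is just as broad. At a high level, Flume's workflow can be represented by the figure below: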
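2. How does Flume work?
Section 1 showed that data flows from a source, through Flume, to some destination. So how does the Flume black box actually work?
The key point is that Flume moves data as events. An Event is the basic unit of data that Flume transports. If the term is unfamiliar, think of it as a data structure or a JavaBean that carries the data to be transported together with some basic attributes of that data. Inside Flume there are three essential components: Source, Channel, and Sink. The Source collects the Event data and hands it to the Channel, the Channel passes it on to the Sink, and the Sink delivers it to the final destination, as shown in the figure below: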
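The figure above introduces an unfamiliar term: Agent. What is an Agent? The official explanation is: A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop). In other words, an Agent is the JVM process that hosts the Source, Channel, and Sink components through which events flow from an external source to the next destination. It can also be thought of as a Flume node.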
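To make the Source, Channel, Sink wiring concrete, here is a minimal sketch of a single-agent configuration file, modeled on the single-node example in the official user guide. The names a1, r1, c1, and k1 are arbitrary labels, and the netcat Source and logger Sink are chosen only because they are easy to try out, not because this article prescribes them.

```properties
# Declare the components hosted by agent "a1" (all names are arbitrary labels)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens on a TCP port and turns each received line into an Event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers Events in memory between the Source and the Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: writes each Event to the agent's own log
a1.sinks.k1.type = logger

# Wiring: a Source may write to several Channels, a Sink reads from exactly one
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Assuming the file is saved as example.conf, the agent can be started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`. The value passed to `--name` must match the prefix used in the file.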
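3. Flume's data flow models
A data flow model describes the path the data takes. Section 2 showed the simplest model in Flume; here are a few others:
3.1 One Agent chained to another Agent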
In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.
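(To pass data across multiple Agents, the Sink of the upstream Agent and the Source of the downstream Agent must both be of type avro, with the Sink pointing to the hostname (or IP address) and port of the Source.)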
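As a sketch of what this looks like in practice, the fragment below shows only the two ends of the hop. The agent names a1 and a2, the channel name c1, the port 4545, and the hostname collector.example.com are placeholders chosen for illustration; each agent would still need its own complete set of sources, channels, and sinks.

```properties
# Upstream agent "a1": its avro Sink points at the host and port of a2's Source
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# Downstream agent "a2": its avro Source listens on that same port
a2.sources = r1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
```

The two agents usually run on different machines, each with its own configuration file; they are shown together here only to make the matching hostname and port easy to see.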
3.2 Multiple Agents feeding one Agent
This can be achieved in Flume by configuring a number of first tier agents with an avro sink, all pointing to an avro source of single agent (Again you could use the thrift sources/sinks/clients in such a scenario). This source on the second tier agent consolidates the received events into a single channel which is consumed by a sink to its final destination.
(Several first-tier Agents all send to a single second-tier Agent, which consolidates their events into one Channel.)
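A rough sketch of this consolidation pattern follows. The agent names web1, web2, and collector, the hostname, the port, and the HDFS path are all assumptions made for the example. The first-tier fragments show only the Sink; their Sources and Channels are omitted.

```properties
# First tier: every web-server agent uses an avro Sink aimed at the same collector
web1.sinks = k1
web1.sinks.k1.type = avro
web1.sinks.k1.hostname = collector.example.com
web1.sinks.k1.port = 4545
web1.sinks.k1.channel = c1

web2.sinks = k1
web2.sinks.k1.type = avro
web2.sinks.k1.hostname = collector.example.com
web2.sinks.k1.port = 4545
web2.sinks.k1.channel = c1

# Second tier: one avro Source receives events from all first-tier agents,
# puts them into a single Channel, and one Sink drains that Channel
collector.sources = r1
collector.channels = c1
collector.sinks = k1

collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1

collector.channels.c1.type = memory

collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/
collector.sinks.k1.channel = c1
```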
3.3 Multiple Channels and multiple Sinks in one Agent
The above example shows a source from agent “foo” fanning out the flow to three different channels. This fan out can be replicating or multiplexing. In case of replicating flow, each event is sent to all three channels. For the multiplexing case, an event is delivered to a subset of available channels when an event’s attribute matches a preconfigured value. For example, if an event attribute called “txnType” is set to “customer”, then it should go to channel1 and channel3, if it’s “vendor” then it should go to channel2, otherwise channel3. The mapping can be set in the agent’s configuration file.
(One Agent has multiple Channels and multiple Sinks, so the data ends up flowing to multiple destinations.)
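The txnType example from the quoted passage could be expressed with a multiplexing channel selector roughly as follows. The agent name agent_foo and the channel names come from the quote; the source name src1 is a placeholder, and the channel and sink definitions are left out.

```properties
agent_foo.sources = src1
agent_foo.channels = channel1 channel2 channel3

# The Source is attached to all three Channels
agent_foo.sources.src1.channels = channel1 channel2 channel3

# Route each Event by the value of its "txnType" header
agent_foo.sources.src1.selector.type = multiplexing
agent_foo.sources.src1.selector.header = txnType
agent_foo.sources.src1.selector.mapping.customer = channel1 channel3
agent_foo.sources.src1.selector.mapping.vendor = channel2
# Events with any other value (or no such header) go to channel3
agent_foo.sources.src1.selector.default = channel3
```

Replacing the selector lines with `agent_foo.sources.src1.selector.type = replicating`, which is also the default when no selector is configured, sends every Event to all three Channels instead.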
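As these models show, Flume's data flow is quite flexible, and in production you can pick whichever model fits your business scenario.
4. Configuration
Flume configuration covers three components: Source, Channel, and Sink.
4.1 Source configuration
A Source defines where the data comes from. Flume officially supports many source types; use the configuration reference for whichever type you choose. The same applies to Channels and Sinks.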
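As one illustration of a Source definition, the sketch below uses an exec Source that follows a log file with `tail -F`. The agent, source, and channel names are placeholders, and the log path is hypothetical. Note that the official guide warns that the exec Source gives no delivery guarantee if the agent or the command fails, so treat this as an easy starting point rather than a recommendation.

```properties
a1.sources = r1

# exec Source: runs a command and turns each line of its output into an Event
a1.sources.r1.type = exec
# Hypothetical path; point this at the log file you actually want to follow
a1.sources.r1.command = tail -F /path/to/tomcat/logs/catalina.out
a1.sources.r1.channels = c1
```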
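4.2 Channel configuration
The configuration options for each Channel type are listed in the Channel section of the official user guide.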
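Two commonly used Channel types are sketched below for comparison. The names and directory paths are placeholders, and the capacity values are examples rather than tuning advice.

```properties
a1.channels = c1 c2

# memory Channel: fast, but buffered Events are lost if the agent process dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# file Channel: slower, but Events are persisted to disk and survive a restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data
```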
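4.3 Sink configuration
The configuration options for each Sink type are listed in the Sink section of the official user guide.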
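As an example of a Sink definition, here is a sketch of an HDFS Sink. The names, the output path, and the roll interval are assumptions chosen for illustration.

```properties
a1.sinks = k1

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Time escapes in the path bucket the output by day
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
# Write plain, uncompressed data instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Use the agent's local time for the escapes instead of a timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Without `hdfs.useLocalTimeStamp = true`, the time escapes in `hdfs.path` require every Event to carry a timestamp header, which is typically added by a timestamp interceptor on the Source.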
5. Summary
Talk is cheap, show me the demo. The next step is to build a Flume setup that collects Tomcat's catalina.out and stores it in HDFS.