flume学习

最新推荐文章于 2024-05-28 17:45:14 发布

强瞿望羲

最新推荐文章于 2024-05-28 17:45:14 发布

阅读量408

点赞数

文章标签： flume

文章目录

#总览
按照官网的解释，flume呢，在我看来就是一个对于日志文件的管理软件。其有以下特点：
1、 数据流
2、 高可靠性
3、 支持在线操作
4、 灵活可变
5、 分布式
flume可以获取不同应用、软件产生的日志文件，通过数据流的形式进行传输，最后存储在一个中心仓库，例如数据库。因此，flume获取的数据类型是可定制的，包括但不限于网络交换、邮件等。

flume的环境配置要求很低，除去硬件，只要求一个1.8jre。
#架构
总的来讲，flume的工作架构可以看作三个部分，来源、去向以及中间最主要的处理。在这之中，中间的处理叫做agent，简单地说：代理。一个agent的去向，可以是另一个agent的来源。

来源，顾名思义，就是获取数据的地方，我们称之为外部数据源，flume通过自身数据源对外部数据进行抓取，然后将数据保存在一个或多个本地硬盘的存储位置，等待后续调用。在之前会对flume自身的数据源进行配置，保证与外部数据源的数据类型相匹配，当然，相同的数据类型，不只是简单的int、string或者array，包括但不限于各类协议。也就是说获取http交互信息的flume只能获取http信息而不能获取mysql的日志信息。

另外，flume异步工作，也就是说一个agent可以同时接受多个数据源传递的信息，各类信息各安其分，在处理大量数据时避免了拥堵。

flume允许数据流经过多个agent，类似交通线路一样，对于数据经过的路线，经过多少个agent，没有强制规定。如果我们要从A的电脑上将mysql的日志传递给C，A可以先将信息传给B，B再传给C。

对于每一个agent，其中保存的信息必须等到该agent将信息传递给下一个agent或者仓库后，才被允许删除。对于每一个agent而言，flume是分布式的点对点管理模式，就跟区块链一样，保证一个agent的信息必须保存在其他agent中，相邻节点互相进行确认。

此外，agent也有类似word的即时存储功能，可以将数据存储在内存中，以备不时之需。
#安装&配置
首先进入官网下载安装包.bin，flume是基于java语言的。

解压后进入conf目录下，新建一个.conf文件进行手动配置。注意，在同一目录下，还有官方的一个实例，官网也有配置的实例代码：

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

以上分别配置了agent的名称为a1，sources的名称及其类型、来源和端口号；本地存储的位置为内存和大小；sinks和sources与本地存储channel的关联。

除此之外，当然还有其他的可选环境配置，具体可以参照官网

注意:官网上的配置代码只适用于Linux，因为Linux和Windows命令的差异，如果是安装在windows的朋友，必须对命令进行修改。
原始命令：
flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
修改后的命令：
flume-ng.cmd agent -conf 安装目录/conf -conf-file 安装目录/conf/example.conf -name a1 -property flume.root.logger=INFO,console

输入cmd后
输入命令行后

打开另外一个cmd
输入telnet localhost 44444
在telnet窗口输入一些字符串
键入信息
此时，在原来的cmd窗口，也会返回相应的日志信息
返回信息

官方文档还提供了一个查看原始数据和debug的代码，我是安装在windows上的，也就没有调试，感兴趣的可以在linux上试一下。

bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true

####扩展性

Flume有极强的可扩展性，可以关联上Apache的Zookeeper，管理不同agent的conf。

Flume当然也提供很多第三方的插件，在Linux中只需要将jar包添加进环境路径FLUME-CLASSPATH中。此外，Flume提供了一个专用路径$FLUME_HOME/plugins.d，对第三方插件进行统一管理。每当启动agent的时，flume-ng都会在该路径下查找插件，并将其添加进java路径。

在这之下，每个插件都有三个子目录，lib：装有插件本身的jar包；libext：装有插件所依赖的jar包；native：装有本地的库，例如.so文件。

以下是两个插件的例子：

plugins.d/
plugins.d/custom-source-1/
plugins.d/custom-source-1/lib/my-source.jar
plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
plugins.d/custom-source-2/
plugins.d/custom-source-2/lib/custom.jar
plugins.d/custom-source-2/native/gettext.so

####数据获取
就像之前的telnet发送数据，flume可以接受很多种不同的数据，包括但不限于http协议和rpc协议。

例如，使用以下命令可以接受avro客户端的rpc协议的数据：

bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

除了Avro之外，还可以接受Thrift、Syslog和Netcat等。

####连接多个Agent
为了连接多个Agent，必须保证前一个的sink和后一个的source的数据类型相一致，同时，配置sink指向source的ip地址和端口号。

多个agent

以下为一个一般模型：
多个Agent

在这个模型中，一个服务器对应一个Agent，多个Agent又通过一个Agent将信息传递给一个HDFS。这之中，Agent4起到了加固的作用，增强了系统稳定性。

除了多个Agent，还可以在一个Agent内建立多个channel，对不同的信息进行分类：

####数据流配置

最开始可以对数据流有一个基本的配置：

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

以下为一个实例：

agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

#####如果需要，可以为每个Agent自定义一些配置属性：

# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>

# properties for channels
<Agent>.channel.<Channel>.<someProperty> = <someValue>

# properties for sinks
<Agent>.sources.<Sink>.<someProperty> = <someValue

以下为一个实例：

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

# set channel for sources, sinks

# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata

#...

#####在一个Agent中设置多个路径

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

以下为一个实例：

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

#####如果要关联两个不同数据类型的Agent，需要分别配置。
config1：

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

# configure other pieces
#...

config2：

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000

# configure other pieces
#...

这样，我们可以通过Agent1从web获取数据，并通过Agent2存储。
#####扇出数据
配置选择扇出数据的方式为replicate还是multipex，默认为replicate：

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating

为每个数据流规定其所走channel，以及默认channel：

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...

<Agent>.sources.<Source1>.selector.default = <Channel2>

以下配置的agent-foo，具有两条路径：

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

对于获取的数据，首先检查头部行是否为State，接着检查值，如果为“CA”，传入mem-channel-1，如果为“AZ”，传入file-channel-2，如果为"NY"，那么同时传入两个channel，如果都不符合，传入默认的mem-channel-1。
#####option选项
option为数据提供了最后一层保险，只有当没有其他指定的channel或者默认channel，或者数据传输失败在所有channel进行重传之后，数据才会走option channel，并且不会有安全检测，即不会保障数据的传输成功。

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

####Source
以下是对一些Source源的配置。
#####Avro Source

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

此外，还可以配置ipFilterRules：

<’allow’ or deny>:<’ip’ or ‘name’ for computer name>:<pattern> or allow/deny:ip/name:pattern

example: ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:*

顾名思义，在allow之后配置可以连接Agent的ip和主机名，而在deny之后配置无法访问的ip。

#####Thrift Source

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

此外，Thrift还可以配置使用安全模式启动。

#####Exec Source

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

这种类型的Source通常通过Unix命令启动，并且将从进程获取的信息打印到标准输出上。默认情况下，无法打印stderr，该设置可以在conf中的property属性上修改。

需要注意的是，Exec Source无法保障数据的安全传输，当本地存储空间装满之后，无法告知进程停止产生或者传输数据，就会产生拥塞，导致数据丢包。所以官方推荐使用Spooling Directory Source或者Taildir Source。

此外，还可以关联Linux的Bash或者Windows的Powershell，增强功能：

a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done

剩下还有很多种Source，我就不一一列举了，后面的Sink、Channel同样如此，感兴趣的朋友可以去看看官方文档。
####Sink
#####HDFS Sink
Hadoop Distributed File System (HDFS)，支持文本或者序列文件，当然也包括这两种格式的压缩包。在发送文件的过程中，采用流水化的形式，也可以通过时间戳或位置进行分类。使用HDFS的前提是必须安装Hadoop，以便引入jar包，应当保持版本号的一致性。

在HDFS的路径中，应当包含格式化的序列，例如：

名称	描述
%{host}	Substitute value of event header named “host”. Arbitrary header names are supported.
%t	Unix time in milliseconds
%a	locale’s short weekday name (Mon, Tue, …)
%A	locale’s full weekday name (Monday, Tuesday, …)
%b	locale’s short month name (Jan, Feb, …)
%B	locale’s long month name (January, February, …)
%c	locale’s date and time (Thu Mar 3 23:05:25 2005)
%d	day of month (01)
%e	day of month without padding (1)
%D	date; same as %m/%d/%y
%H	hour (00…23)
%I	hour (01…12)
%j	day of year (001…366)
%k	hour ( 0…23)
%m	month (01…12)
%n	month without padding (1…12)
%M	minute (00…59)
%p	locale’s equivalent of am or pm
%s	seconds since 1970-01-01 00:00:00 UTC
%S	second (00…59)
%y	last two digits of year (00…99)
%Y	year (2010)
%z	+hhmm numeric timezone (for example, -0400)
%[localhost]	Substitute the hostname of the host where the agent is running
%[IP]	Substitute the IP address of the host where the agent is running
%[FQDN]	Substitute the canonical hostname of the host where the agent is running

需要注意的是，最后三个：localhost、IP、FQDN都会受到网络环境的影响。同时，序列信息必须包含时间戳，使用TimestampInterceptor可以自动添加时间戳。如果不想使用，也可以在conf中将hdfs.useLocalTimeStamp设置为true。

使用中的文件会产生一个.tmp的临时文件，可以将已完成的文件排除。

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

在这个设置中，roundvalue=10，也就是时间戳是10min，例如12:58:05收到数据，将其记录为1250/00。

####Chnnel
######Memory Channel
这种channel实际上就是内存，因为内存的速度是很快的，可以提高吞吐率。

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

####Channel Selectors
#####Replicating Channel Selector
这个是配置Agent的扇出数据的方式，详情请看之前的***扇出数据***。需要注意的是，默认的方式就是replicate。

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

#####Custom Channel Selector
除了repllicate和multiplex，还可以自定义自己的channel selector接口，但是必须要将路径和依赖路径添加进启动Flume时的启动命令里面。

a1.sources = r1
a1.channels = c1
a1.sources.r1.selector.type = org.example.MyChannelSelector

####Sink Processors
sink processor主要是将多个sink集中起来，平衡其性能，并将数据统一传递给一个实体，同时也可以在sink间的数据传输发生错误时，进行故障转移。

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance

#####Defaul Sink Processor
默认的processor只能为一个sink提供服务，也就是最简单的Source-channel-sink。

#####Failover Sink Processor
在Failover中，会给每个sink设置一个唯一的优先级数，越大优先级越高，优先级更高的就可以先传输数据。如果没有设置优先级，那么就会按照在conf里的顺序进行数据传输。如果一个sink传输数据失败了，就让下一个更高优先级的sink传输数据，而它自身就会被扔到冷却池，当成功传输数据后，再回到活动池。

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

用maxpenalty设置failover的时间。
####Event Serializers
#####Body Text Serializer
file_roll和hdfs的sink都支持eventserializer，即串行化。

这个可以将event的主体拦截下来，忽略头部，不作改动地将其输出。

a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
a1.sinks.k1.sink.serializer = text
a1.sinks.k1.sink.serializer.appendNewline = false

####Interceptors
拦截器可以修改或者丢弃传送数据的事件。它通过java的org.apache.flume.interceptor.Interceptor这个接口实现。用户可以自定义设置其修改或丢弃的事件。拦截器可以在conf配置文件中设置，多个拦截器通过空白符进行分隔，它们的顺序也遵循于此。所以它们返回的list也是按照这样的顺序相邻。如果一个拦截器想要丢掉一个事件，那么可以不在返回的list中记录这个事件，所以如果返回一个空list，就是丢弃了所有事件。

使用拦截器，需要在启动命令的type上进行修改。

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1

在这个例子中，最先返回的就是HostInterceptor，然后是TimestampInterceptor。
#####TimeStamp Interceptor
顾名思义，就是在事件的头部添加一个时间戳。

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels =  c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

####flume.called.from.service
这是一个Flume的属性。每30s就会检查一次conf文件。如果出现了一个新的conf文件，就会检查这个flume.called.from.service属性。如果设置了该属性，那么会接受更改，否则就会马上结束更改。但是如果这个新文件不是第一次出现，并且这个时间段（30s）内也没有其他conf的修改，那么也会接受这个更改。
#Log4J Appender
附加一个log4j给Agent的source。在这之前必须将flume-ng-sdk添加到路径中。

#...
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = example.com
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

默认情况下，每个时间都会被转变为字符串。

如果这个事件是org.apache.avro.generic.GenericRecord, org.apache.avro.specific.SpecificRecord，或者给属性AvroReflectionEnabled设置为true，那么这个事件就可以用Avro serialization进行串行化。

因为给每个事件都进行串行化实在不够高效，所以仅仅只提供一个概要的URL地址。如果AvroSchemaUrl属性没有进行指定，那么它将会被包含在Flume头部中。

#...
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = example.com
log4j.appender.flume.Port = 41414
log4j.appender.flume.AvroReflectionEnabled = true
log4j.appender.flume.AvroSchemaUrl = hdfs://namenode/path/to/schema.avsc

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

#Load Balancing Log4J Appender
这是之前的增强版，提供了轮转调度、随机计划和时间片补偿等算法，以对事件进行平衡。
以下是默认设置：

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

以下是随机平衡配置：

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431
log4j.appender.out2.Selector = RANDOM

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

以下是时间片补偿：

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431 localhost:25432
log4j.appender.out2.Selector = ROUND_ROBIN
log4j.appender.out2.MaxBackoff = 30000

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

#Security
HDFS sink, HBase sink, Thrift source, Thrift sink 和 Kite Dataset sink都提供kerbero安全认证。
#Monitor
Flume也提供监听功能，但是此功能目前并不完善，还需要修改。
使用JSON Monitor，Flume可以通过web网页的形式报告监听信息。事先必须设置好端口号。按照规定格式：

{
"typeName1.componentName1" : {"metric1" : "metricValue1", "metric2" : "metricValue2"},
"typeName2.componentName2" : {"metric3" : "metricValue3", "metric4" : "metricValue4"}
}

这里是实例：

{
"CHANNEL.fileChannel":{"EventPutSuccessCount":"468085",
                      "Type":"CHANNEL",
                      "StopTime":"0",
                      "EventPutAttemptCount":"468086",
                      "ChannelSize":"233428",
                      "StartTime":"1344882233070",
                      "EventTakeSuccessCount":"458200",
                      "ChannelCapacity":"600000",
                      "EventTakeAttemptCount":"458288"},
"CHANNEL.memChannel":{"EventPutSuccessCount":"22948908",
                   "Type":"CHANNEL",
                   "StopTime":"0",
                   "EventPutAttemptCount":"22948908",
                   "ChannelSize":"5",
                   "StartTime":"1344882209413",
                   "EventTakeSuccessCount":"22948900",
                   "ChannelCapacity":"100",
                   "EventTakeAttemptCount":"22948908"}
}

在Linux中，启动JSON的监听报告需要执行以下启动命令：

bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

之后，我们就可以在http://:/metrics这个网页查看监听报告。
#TOOL
Flume还提供了多种工具。
File Channel Integrity Tool，可以验证文件完整性。通过命令$bin/flume-ng tool --conf ./conf FCINTEGRITYTOOL -l ./datadir启动。
Event Validator Tool，事件验证器，它会通过特定方式对事件进行验证，并且移除未通过验证的事件。通过命令$bin/flume-ng tool --conf ./conf FCINTEGRITYTOOL -l ./datadir -e org.apache.flume.MyEventValidator -DmaxSize 2000启动。数据目录datadir通过逗号进行分隔。

#总结
Flume有很强大的功能，但是在使用时，也得考虑到是否适合目前的项目；针对其强大的健壮性，如何配置拓扑结构；对于机器硬件应能的分配等。

这是我简单的学习笔记，记录了我在学习官方文档时获取的信息，希望有用。

强瞿望羲

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
flume学习

总览按照官网的解释，flume呢，在我看来就是一个对于日志文件的管理软件。其有以下特点： 1、数据流 2、高可靠性 3、支持在线操作 4、灵活可变 5、分布式 flume可以获取不同应用、软件产生的日志文件，通过数据流的形式进行传输，最后存储在一个中心仓库，例如数据库。因此，flume获取的数据类型是可定制的，包括但不限于网络交换、邮件等。flume的环境配...
复制链接

扫一扫