Reposted from: http://www.cnblogs.com/smartloli/p/4468708.html
1. Overview
Today I'm adding a post about Flume that was left out of the earlier series on building a highly available Hadoop platform. This post covers the following topics:
- A brief introduction to Flume NG
- Setting up and running a single-node Flume NG
- Setting up a highly available Flume NG cluster
- Failover testing
- Screenshots
Let's get started.
2. A Brief Introduction to Flume NG
Flume NG is a distributed, highly available, and reliable system that collects, moves, and stores massive amounts of data from different sources into a central data store. It is lightweight and simple to configure, suits all kinds of log-collection scenarios, supports failover and load balancing, and ships with a very rich set of components. Flume NG uses a three-tier architecture: an Agent tier, a Collector tier, and a Store tier, each of which can be scaled out horizontally. An Agent is composed of a Source, a Channel, and a Sink, whose responsibilities are:
- Source: consumes (collects) data from the data source and writes it into the Channel
- Channel: a temporary staging store that holds all the data delivered by the Source
- Sink: reads from the Channel and deletes the data from the Channel once it has been read successfully
The figure below shows the Flume NG architecture:
As the figure illustrates, logs produced by an external system (a web server) are collected by the Source component of a Flume Agent, handed to the Channel component for temporary staging, and finally passed to the Sink component, which writes the data directly into the HDFS file system.
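To make the Source/Channel/Sink wiring concrete, here is a minimal single-agent sketch (not from the original post; the agent name a1, the netcat port, and the logger sink are illustrative placeholders):
# a1: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source: read lines of text from a TCP port (netcat source)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# Channel: in-memory staging store
a1.channels.c1.type = memory
# Sink: log events to the console (useful for testing)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1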
3. Single-Node Flume NG Setup and Operation
Now that we are familiar with the Flume NG architecture, let's first set up a single Flume node that collects data into the HDFS cluster. Since resources are limited, we install Flume directly on the highly available Hadoop cluster built earlier.
The scenario is as follows: set up one Flume NG instance on the NNA node and collect local logs into the HDFS cluster.
3.1 Prerequisite Software
Before setting up Flume NG, we need to prepare the required software. Download it from:
- Flume (download link)
The JDK was already configured when the Hadoop cluster was installed, so it is not repeated here; if you still need to configure it, see the earlier post "Configuring a Highly Available Hadoop Platform".
3.2 Installation and Configuration
- Installation
First, unpack the Flume tarball with the following command:
[hadoop@nna ~]$ tar -zxvf apache-flume-1.5.2-bin.tar.gz
- Configuration
The environment variable configuration is as follows:
export FLUME_HOME=/home/hadoop/flume-1.5.2
export PATH=$PATH:$FLUME_HOME/bin
flume-conf.properties
#agent1 name
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1

#Spooling Directory
#set source1
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/home/hadoop/dir/logdfs
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader = false
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp

#set sink1
agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=/home/hdfs/flume/logdfs
agent1.sinks.sink1.hdfs.fileType=DataStream
agent1.sinks.sink1.hdfs.writeFormat=TEXT
agent1.sinks.sink1.hdfs.rollInterval=1
agent1.sinks.sink1.channel=channel1
agent1.sinks.sink1.hdfs.filePrefix=%Y-%m-%d

#set channel1
agent1.channels.channel1.type=file
agent1.channels.channel1.checkpointDir=/home/hadoop/dir/logdfstmp/point
agent1.channels.channel1.dataDirs=/home/hadoop/dir/logdfstmp
flume-env.sh
JAVA_HOME=/usr/java/jdk1.7
Note: if any directory referenced in the configuration does not exist, create it in advance.
3.3 Startup
The startup command is as follows:
flume-ng agent -n agent1 -c conf -f flume-conf.properties -Dflume.root.logger=DEBUG,console
Note: agent1 in the command is the agent name defined in the configuration file (agent1 in our config); flume-conf.properties is the configuration file, and its exact path must be supplied.
3.4 Preview of the Results
After a successful upload, the files in the local spooling directory are marked as completed, as shown in the screenshot below:
4. Highly Available Flume NG Setup
With the single-node Flume NG in place, let's now build a highly available Flume NG cluster. The architecture is shown below:
As the figure shows, Flume supports multiple storage backends; only HDFS and Kafka are listed here (Kafka could, for example, keep the most recent week of logs and feed a real-time log stream to a Storm system).
4.1 Node Allocation
The Flume Agents and Collectors are distributed as shown in the following table:
Name | Host | Role |
---|---|---|
Agent1 | 10.211.55.14 | Web Server |
Agent2 | 10.211.55.15 | Web Server |
Agent3 | 10.211.55.16 | Web Server |
Collector1 | 10.211.55.18 | AgentMstr1 |
Collector2 | 10.211.55.19 | AgentMstr2 |
As shown above, data from Agent1, Agent2, and Agent3 flows into both Collector1 and Collector2. Flume NG provides a built-in failover mechanism that switches over and recovers automatically. In this setup, three log-producing servers sit in different machine rooms, and all of their logs must be collected into a single cluster for storage. Next, we configure the Flume NG cluster.
4.2 Configuration
The basic configuration was already completed in the single-node setup; we only need to add two new configuration files, flume-client.properties and flume-server.properties, with the following contents:
- flume-client.properties
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2

#set group
agent1.sinkgroups = g1

#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/dir/logdfs/test.log

agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = nna
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = nns
agent1.sinks.k2.port = 52020

#set sink group
agent1.sinkgroups.g1.sinks = k1 k2

#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Note: the sinks point to the IP (hostname) and port of each Collector.
- flume-server.properties
#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# other node, nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = nna
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = NNA
a1.sources.r1.channels = c1

#set sink to hdfs
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/home/hdfs/flume/logdfs
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=TEXT
a1.sinks.k1.hdfs.rollInterval=1
a1.sinks.k1.channel=c1
a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d
Note: on the other Collector node, change the bind address; for example, on the NNS node change the bound host from nna to nns.
Flume interceptors form a chain: multiple interceptors can be attached to one source and are applied in order (see the sketch after this list). Commonly used interceptors include:
(1) Timestamp Interceptor: adds a header with the key timestamp whose value is the current timestamp. Very useful when the sink is HDFS.
(2) Host Interceptor: adds a header with the key host whose value is the hostname or IP of the current machine.
(3) Static Interceptor: adds a custom key/value pair to the event header.
(4) Regex Filtering Interceptor: filters out, or keeps only, events matching a regular expression.
(5) Regex Extractor Interceptor: adds headers with specified keys whose values are the parts matched by a regular expression.
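A sketch of chaining several interceptors on one source (property names follow the Flume interceptor reference further below; the header key/value pair is an illustrative placeholder):
agent1.sources.r1.interceptors = i1 i2 i3
# i1: add a timestamp header (useful for HDFS sinks)
agent1.sources.r1.interceptors.i1.type = timestamp
# i2: add the host under a custom header key
agent1.sources.r1.interceptors.i2.type = host
agent1.sources.r1.interceptors.i2.hostHeader = hostname
# i3: add a static key/value to every event
agent1.sources.r1.interceptors.i3.type = static
agent1.sources.r1.interceptors.i3.key = datacenter
agent1.sources.r1.interceptors.i3.value = dc1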
4.3 Startup
On the Agent nodes, start Flume with:
flume-ng agent -n agent1 -c conf -f flume-client.properties -Dflume.root.logger=DEBUG,console
Note: agent1 in the command is the agent name defined in the configuration file (agent1 in our config); flume-client.properties is the configuration file, and its exact path must be supplied.
On the Collector nodes, start Flume with:
flume-ng agent -n a1 -c conf -f flume-server.properties -Dflume.root.logger=DEBUG,console
Note: a1 in the command is the agent name defined in the configuration file (a1 in our config); flume-server.properties is the configuration file, and its exact path must be supplied.
5. Failover Test
Let's now test the high availability (failover) of the Flume NG cluster. The scenario: we upload a file on the Agent1 node. Because Collector1 is configured with a higher priority than Collector2, Collector1 collects the data first and uploads it to the storage system. We then kill Collector1, at which point Collector2 takes over log collection and upload. After that, we manually restore the Flume service on Collector1 and upload another file on Agent1; Collector1 resumes collection with the higher priority. The screenshots below show the process:
- Collector1 uploads first
- Preview of the uploaded log contents in the HDFS cluster
- Collector1 goes down and Collector2 takes over the upload
- After the Collector1 service is restarted, Collector1 regains upload priority
6. Screenshots
Below are screenshots from the HDFS file system:
- Files in the HDFS file system
- Contents of an uploaded file
7. Summary
A few things deserve attention when configuring a highly available Flume NG cluster. On the Agents, bind the IP (or hostname) and port of both Collector1 and Collector2. On each Collector node, edit that node's configuration file so that the bind IP (or hostname) is the node's own IP (or hostname). Finally, when starting Flume, specify the agent name defined in the configuration file and the path to the configuration file, otherwise startup will fail.
8. Closing Remarks
That's all for this post. If you run into problems while studying this material, feel free to discuss them in the group or send me an email, and I will do my best to answer. Good luck!
Configuration reference:
For more flexibility, the failover Flume client can be configured with these properties:
For more flexibility, the load-balancing Flume client can be configured with these properties:
Fan out flow
Flume supports fanning out the flow from one source to multiple channels.
There are two modes: replicating and multiplexing. In the replicating flow, the event is sent to all the configured channels.
In case of multiplexing, the event is sent to only a subset of qualifying channels.
replicating:
multiplexing:
The selector checks for a header called “State”. If the value is “CA” then it's sent to mem-channel-1, if it's “AZ” then it goes to file-channel-2, or if it's “NY” then both. If the “State” header is not set or doesn't match any of the three, then it goes to mem-channel-1, which is designated as ‘default’.
To specify optional channels for a header, the config parameter ‘optional’ is used.
The selector will attempt to write to the required channels first and will fail the transaction if even one of these channels fails to consume the events. The transaction is reattempted on all of the channels. Once all required channels have consumed the events, then the selector will attempt to write to the optional channels. A failure by any of the optional channels to consume the event is simply ignored and not retried.
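Configuration sketches for the two fan-out modes described above (channel names and the “State” mapping mirror the description; treat them as placeholders):
# Replicating selector: every event goes to all channels; c3 is optional
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.selector.optional = c3

# Multiplexing selector (a separate agent config): route by the "State" header
a1.sources = r1
a1.channels = mem-channel-1 file-channel-2
a1.sources.r1.channels = mem-channel-1 file-channel-2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = State
a1.sources.r1.selector.mapping.CA = mem-channel-1
a1.sources.r1.selector.mapping.AZ = file-channel-2
a1.sources.r1.selector.mapping.NY = mem-channel-1 file-channel-2
a1.sources.r1.selector.default = mem-channel-1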
Avro Source
Listens on Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be avro |
bind | – | hostname or IP address to listen on |
port | – | Port # to bind to |
threads | – | Maximum number of worker threads to spawn |
selector.type | ||
selector.* | ||
interceptors | – | Space-separated list of interceptors |
interceptors.* | ||
compression-type | none | This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource |
ssl | false | Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”. |
keystore | – | This is the path to a Java keystore file. Required for SSL. |
keystore-password | – | The password for the Java keystore. Required for SSL. |
keystore-type | JKS | The type of the Java keystore. This can be “JKS” or “PKCS12”. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
ipFilter | false | Set this to true to enable ipFiltering for netty |
ipFilterRules | – | Define N netty ipFilter pattern rules with this config. |
Thrift Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be thrift |
bind | – | hostname or IP address to listen on |
port | – | Port # to bind to |
threads | – | Maximum number of worker threads to spawn |
selector.type | ||
selector.* | ||
interceptors | – | Space separated list of interceptors |
interceptors.* | ||
ssl | false | Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”. |
keystore | – | This is the path to a Java keystore file. Required for SSL. |
keystore-password | – | The password for the Java keystore. Required for SSL. |
keystore-type | JKS | The type of the Java keystore. This can be “JKS” or “PKCS12”. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
kerberos | false | Set to true to enable kerberos authentication. In kerberos mode, agent-principal and agent-keytab are required for successful authentication. The Thrift source in secure mode, will accept connections only from Thrift clients that have kerberos enabled and are successfully authenticated to the kerberos KDC. |
agent-principal | – | The kerberos principal used by the Thrift Source to authenticate to the kerberos KDC. |
agent-keytab | —- | The keytab location used by the Thrift Source in combination with the agent-principal to authenticate to the kerberos KDC. |
Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results, whereas date will probably not - the former two commands produce streams of data, whereas the latter produces a single event and exits.
Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be exec |
command | – | The command to execute |
shell | – | A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc. |
restartThrottle | 10000 | Amount of time (in millis) to wait before attempting a restart |
restart | false | Whether the executed cmd should be restarted if it dies |
logStdErr | false | Whether the command’s stderr should be logged |
batchSize | 20 | The max number of lines to read and send to the channel at a time |
batchTimeout | 3000 | Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for agent named a1:
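A minimal sketch of an exec source (the tailed log path is a placeholder; the commented-out variant shows the ‘shell’ option discussed next):
a1.sources = r1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
# Variant using a shell so that wildcards/pipes work:
# a1.sources.r1.shell = /bin/sh -c
# a1.sources.r1.command = for f in /var/log/app/*.log; do cat "$f"; done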
The ‘shell’ config is used to invoke the ‘command’ through a command shell
In the absence of the ‘shell’ config, the ‘command’ will be invoked directly. Common values for ‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.
JMS Source
Spooling Directory Source
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be spooldir. |
spoolDir | – | The directory from which to read files from. |
fileSuffix | .COMPLETED | Suffix to append to completely ingested files |
deletePolicy | never | When to delete completed files: never or immediate |
fileHeader | false | Whether to add a header storing the absolute path filename. |
fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
basenameHeader | false | Whether to add a header storing the basename of the file. |
basenameHeaderKey | basename | Header Key to use when appending basename of file to event header. |
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip) |
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir. |
consumeOrder | oldest | In which order files in the spooling directory will be consumed: oldest, youngest or random. In case of oldest and youngest, the last modified time of the files will be used to compare the files. In case of a tie, the file with the smallest lexicographical order will be consumed first. In case of random, any file will be picked randomly. When using oldest and youngest, the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory. |
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, upto the value specified by this parameter. |
batchSize | 100 | Granularity at which to batch transfer to the channel |
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text. |
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the “replacement character” char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence. |
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder. |
deserializer.* | Varies per event deserializer. | |
bufferMaxLines | – | (Obsolete) This option is now ignored. |
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead. |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for an agent named agent-1:
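A sketch for agent-1 (the spool directory path is a placeholder):
agent-1.sources = src-1
agent-1.channels = ch-1
agent-1.channels.ch-1.type = file
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true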
Kafka Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be org.apache.flume.source.kafka.KafkaSource |
zookeeperConnect | – | URI of ZooKeeper used by Kafka cluster |
groupId | flume | Unique identifier of the consumer group. Setting the same id in multiple sources or agents indicates that they are part of the same consumer group |
topic | – | Kafka topic we'll read messages from. At this time, only a single topic is supported. |
batchSize | 1000 | Maximum number of messages written to Channel in one batch |
batchDurationMillis | 1000 | Maximum time (in ms) before a batch will be written to the Channel. The batch is written whenever the size limit or the time limit is reached, whichever comes first. |
backoffSleepIncrement | 1000 | Initial and incremental wait time that is triggered when a Kafka Topic appears to be empty. Wait period will reduce aggressive pinging of an empty Kafka Topic. One second is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
maxBackoffSleep | 5000 | Maximum wait time that is triggered when a Kafka Topic appears to be empty. Five seconds is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
Other Kafka Consumer Properties | – | These properties are used to configure the Kafka Consumer. Any consumer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.. For example: kafka.consumer.timeout.ms. Check the Kafka documentation (https://kafka.apache.org/08/configuration.html#consumerconfigs) for details |
Example for agent named tier1:
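A sketch for tier1 (ZooKeeper address and topic name are placeholders):
tier1.sources = source1
tier1.channels = channel1
tier1.channels.channel1.type = memory
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = test1
tier1.sources.source1.groupId = flume
# Pass-through consumer property, prefixed with kafka.
tier1.sources.source1.kafka.consumer.timeout.ms = 100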
NetCat Source
A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port].
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be netcat |
bind | – | Host name or IP address to bind to |
port | – | Port # to bind to |
max-line-length | 512 | Max line length per event body (in bytes) |
ack-every-event | true | Respond with an “OK” for every event received |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for agent named a1:
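A sketch of a netcat source (bind address and port are placeholders):
a1.sources = r1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1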
Sequence Generator Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be seq |
selector.type | replicating | replicating or multiplexing |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* | ||
batchSize | 1 |
Example for agent named a1:
Syslog TCP Source
The original, tried-and-true syslog TCP source.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be syslogtcp |
host | – | Host name or IP address to bind to |
port | – | Port # to bind to |
eventSize | 2500 | Maximum size of a single event line, in bytes |
keepFields | none | Setting this to ‘all’ will preserve the Priority, Timestamp and Hostname in the body of the event. A space-separated list of fields to include is allowed as well. Currently, the following fields can be included: priority, version, timestamp, hostname. The values ‘true’ and ‘false’ have been deprecated in favor of ‘all’ and ‘none’. |
selector.type | replicating | replicating or multiplexing |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
For example, a syslog TCP source for agent named a1:
Multiport Syslog TCP Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be multiport_syslogtcp |
host | – | Host name or IP address to bind to. |
ports | – | Space-separated list (one or more) of ports to bind to. |
eventSize | 2500 | Maximum size of a single event line, in bytes. |
keepFields | none | Setting this to ‘all’ will preserve the Priority, Timestamp and Hostname in the body of the event. A space-separated list of fields to include is allowed as well. Currently, the following fields can be included: priority, version, timestamp, hostname. The values ‘true’ and ‘false’ have been deprecated in favor of ‘all’ and ‘none’. |
portHeader | – | If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port. |
charset.default | UTF-8 | Default character set used while parsing syslog events into strings. |
charset.port.<port> | – | Character set is configurable on a per-port basis. |
batchSize | 100 | Maximum number of events to attempt to process per request loop. Using the default is usually fine. |
readBufferSize | 1024 | Size of the internal Mina read buffer. Provided for performance tuning. Using the default is usually fine. |
numProcessors | (auto-detected) | Number of processors available on the system for use while processing messages. Default is to auto-detect # of CPUs using the Java Runtime API. Mina will spawn 2 request-processing threads per detected CPU, which is often reasonable. |
selector.type | replicating | replicating, multiplexing, or custom |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors. |
interceptors.* |
For example, a multiport syslog TCP source for agent named a1:
Syslog UDP Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be syslogudp |
host | – | Host name or IP address to bind to |
port | – | Port # to bind to |
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event. |
selector.type | replicating | replicating or multiplexing |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
For example, a syslog UDP source for agent named a1:
HTTP Source
A source which accepts Flume Events by HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into flume events by a pluggable “handler” which must implement the HTTPSourceHandler interface. This handler takes a HttpServletRequest and returns a list of flume events.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be http |
port | – | The port the source should bind to. |
bind | 0.0.0.0 | The hostname or IP address to listen on |
handler | org.apache.flume.source.http.JSONHandler | The FQCN of the handler class. |
handler.* | – | Config parameters for the handler |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* | ||
enableSSL | false | Set the property to true to enable SSL. HTTP Source does not support SSLv3. |
excludeProtocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded. |
keystore | – | Location of the keystore including the keystore file name |
keystorePassword | – | Keystore password |
For example, a http source for agent named a1:
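A sketch of an HTTP source (port is a placeholder; JSONHandler is the documented default, shown explicitly for clarity):
a1.sources = r1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler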
The following properties apply to the BlobHandler, an alternative handler for the HTTP source that buffers an entire request body (e.g. a PDF or JPG) into a single event:
Property Name | Default | Description |
---|---|---|
handler | – | The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobHandler |
handler.maxBlobLength | 100000000 | The maximum number of bytes to read and buffer for a given request |
Custom Source
A custom source’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom source is its FQCN.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be your FQCN |
selector.type | replicating or multiplexing | |
selector.* | replicating | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for agent named a1:
Flume Sinks
HDFS Sink
The following are the escape sequences supported:
Alias | Description |
---|---|
%{host} | Substitute value of event header named “host”. Arbitrary header names are supported. |
%t | Unix time in milliseconds |
%a | locale’s short weekday name (Mon, Tue, ...) |
%A | locale’s full weekday name (Monday, Tuesday, ...) |
%b | locale’s short month name (Jan, Feb, ...) |
%B | locale’s long month name (January, February, ...) |
%c | locale’s date and time (Thu Mar 3 23:05:25 2005) |
%d | day of month (01) |
%e | day of month without padding (1) |
%D | date; same as %m/%d/%y |
%H | hour (00..23) |
%I | hour (01..12) |
%j | day of year (001..366) |
%k | hour ( 0..23) |
%m | month (01..12) |
%n | month without padding (1..12) |
%M | minute (00..59) |
%p | locale’s equivalent of am or pm |
%s | seconds since 1970-01-01 00:00:00 UTC |
%S | second (00..59) |
%y | last two digits of year (00..99) |
%Y | year (2010) |
%z | +hhmm numeric timezone (for example, -0400) |
note
For all of the time related escape sequences, a header with the key “timestamp” must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the TimestampInterceptor.
Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hdfs |
hdfs.path | – | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
hdfs.fileSuffix | – | Suffix to append to file (eg .avro - NOTE: period is not automatically added) |
hdfs.inUsePrefix | – | Prefix that is used for temporal files that flume actively writes into |
hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into |
hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval) |
hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0: never roll based on file size) |
hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events) |
hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files) |
hdfs.batchSize | 100 | number of events written to file before it is flushed to HDFS |
hdfs.codeC | – | Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy |
hdfs.fileType | SequenceFile | File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream will not compress the output file; do not set codeC. (2) CompressedStream requires hdfs.codeC to be set to an available codec. |
hdfs.maxOpenFiles | 5000 | Allow only this number of open files. If this number is exceeded, the oldest file is closed. |
hdfs.minBlockReplicas | – | Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath. |
hdfs.writeFormat | – | Format for sequence file records. One of “Text” or “Writable” (the default). |
hdfs.callTimeout | 10000 | Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring. |
hdfs.threadsPoolSize | 10 | Number of threads per HDFS sink for HDFS IO ops (open, write, etc.) |
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling |
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS |
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS |
hdfs.proxyUser | ||
hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) |
hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time. |
hdfs.roundUnit | second | The unit of the round down value - second, minute or hour. |
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
hdfs.closeTries | 0 | Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart. |
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension. |
serializer | TEXT | Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface. |
serializer.* |
Example for agent named a1:
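A sketch of an HDFS sink (the HDFS path is a placeholder; the time escapes require a timestamp header, e.g. from the TimestampInterceptor, or hdfs.useLocalTimeStamp = true):
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileType = DataStream
# Roll by size only (128 MB), never by time or event count
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Round the directory timestamp down to 10-minute buckets
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute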
Hive Sink
This sink is provided as a preview feature and not recommended for use in production.
Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hive |
hive.metastore | – | Hive metastore URI (eg thrift://a.b.com:9083 ) |
hive.database | – | Hive database name |
hive.table | – | Hive table name |
hive.partition | – | Comma separated list of partition values identifying the partition to write to. May contain escape sequences. E.g: If the table is partitioned by (continent: string, country: string, time: string) then ‘Asia,India,2014-02-26-01-21’ will indicate continent=Asia, country=India, time=2014-02-26-01-21 |
hive.txnsPerBatchAsk | 100 | Hive grants a batch of transactions instead of single transactions to streaming clients like Flume. This setting configures the number of desired transactions per Transaction Batch. Data from all transactions in a single batch end up in a single file. Flume will write a maximum of batchSize events in each transaction in the batch. This setting in conjunction with batchSize provides control over the size of each file. Note that eventually Hive will transparently compact these files into larger files. |
heartBeatInterval | 240 | (In seconds) Interval between consecutive heartbeats sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable heartbeats. |
autoCreatePartitions | true | Flume will automatically create the necessary Hive partitions to stream to |
batchSize | 15000 | Max number of events written to Hive in a single Hive transaction |
maxOpenConnections | 500 | Allow only this number of open connections. If this number is exceeded, the least recently used connection is closed. |
callTimeout | 10000 | (In milliseconds) Timeout for Hive & HDFS I/O operations, such as openTxn, write, commit, abort. |
serializer | – | Serializer is responsible for parsing out fields from the event and mapping them to columns in the Hive table. Choice of serializer depends upon the format of the data in the event. Supported serializers: DELIMITED and JSON |
roundUnit | minute | The unit of the round down value - second, minute or hour. |
roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hive.roundUnit), less than current time |
timeZone | Local Time | Name of the timezone that should be used for resolving the escape sequences in partition, e.g. America/Los_Angeles. |
useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
Logger Sink
Logs event at INFO level. Typically useful for testing/debugging purpose. Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be logger |
maxBytesToLog | 16 | Maximum number of bytes of the Event body to log |
Example for agent named a1:
Avro Sink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be avro. |
hostname | – | The hostname or IP address to bind to. |
port | – | The port # to listen on. |
batch-size | 100 | Number of events to batch together for sending. |
connect-timeout | 20000 | Amount of time (ms) to allow for the first (handshake) request. |
request-timeout | 20000 | Amount of time (ms) to allow for requests after the first. |
reset-connection-interval | none | Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent. |
compression-type | none | This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource |
compression-level | 6 | The level of compression to compress event. 0 = no compression and 1-9 is compression. The higher the number the more compression |
ssl | false | Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a “truststore”, “truststore-password”, “truststore-type”, and specify whether to “trust-all-certs”. |
trust-all-certs | false | If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and “listen in” on the encrypted connection. |
truststore | – | The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source’s SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically “jssecacerts” or “cacerts” in the Oracle JRE) will be used. |
truststore-password | – | The password for the specified truststore. |
truststore-type | JKS | The type of the Java truststore. This can be “JKS” or other supported Java truststore type. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
maxIoWorkers | 2 * the number of available processors in the machine | The maximum number of I/O worker threads. This is configured on the NettyAvroRpcClient NioClientSocketChannelFactory. |
Example for agent named a1:
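A sketch of an Avro sink pointing at a next-hop Avro source (hostname and port are placeholders):
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545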
Thrift Sink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be thrift. |
hostname | – | The hostname or IP address to bind to. |
port | – | The port # to listen on. |
batch-size | 100 | Number of events to batch together for sending. |
connect-timeout | 20000 | Amount of time (ms) to allow for the first (handshake) request. |
request-timeout | 20000 | Amount of time (ms) to allow for requests after the first. |
connection-reset-interval | none | Amount of time (s) before the connection to the next hop is reset. This will force the Thrift Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent. |
ssl | false | Set to true to enable SSL for this ThriftSink. When configuring SSL, you can optionally set a “truststore”, “truststore-password” and “truststore-type” |
truststore | – | The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Thrift Source’s SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically “jssecacerts” or “cacerts” in the Oracle JRE) will be used. |
truststore-password | – | The password for the specified truststore. |
truststore-type | JKS | The type of the Java truststore. This can be “JKS” or other supported Java truststore type. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude |
kerberos | false | Set to true to enable kerberos authentication. In kerberos mode, client-principal, client-keytab and server-principal are required for successful authentication and communication to a kerberos enabled Thrift Source. |
client-principal | —- | The kerberos principal used by the Thrift Sink to authenticate to the kerberos KDC. |
client-keytab | —- | The keytab location used by the Thrift Sink in combination with the client-principal to authenticate to the kerberos KDC. |
server-principal | – | The kerberos principal of the Thrift Source to which the Thrift Sink is configured to connect to. |
Example for agent named a1:
IRC Sink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be irc |
hostname | – | The hostname or IP address to connect to |
port | 6667 | The port number of remote host to connect |
nick | – | Nick name |
user | – | User name |
password | – | User password |
chan | – | channel |
name | ||
splitlines | – | (boolean) |
splitchars | n | line separator (if you were to enter the default value into the config file, then you would need to escape the backslash, like this: “\n”) |
Example for agent named a1:
File Roll Sink
Stores events on the local filesystem.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be file_roll. |
sink.directory | – | The directory where files will be stored |
sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface. |
batchSize | 100 |
Example for agent named a1:
Null Sink
Discards all events it receives from the channel. Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be null. |
batchSize | 100 |
Example for agent named a1:
HBaseSinks
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hbase |
table | – | The name of the table in Hbase to write to. |
columnFamily | – | The column family in Hbase to write to. |
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml |
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml |
batchSize | 100 | Number of events to be written per txn. |
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells. |
serializer | org.apache.flume.sink.hbase.SimpleHbaseEventSerializer | Default increment column = “iCol”, payload column = “pCol”. |
serializer.* | – | Properties to be passed to the serializer. |
kerberosPrincipal | – | Kerberos user principal for accessing secure HBase |
kerberosKeytab | – | Kerberos keytab for accessing secure HBase |
Example for agent named a1:
AsyncHBaseSink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be asynchbase |
table | – | The name of the table in Hbase to write to. |
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml |
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml |
columnFamily | – | The column family in Hbase to write to. |
batchSize | 100 | Number of events to be written per txn. |
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells. |
timeout | 60000 | The length of time (in milliseconds) the sink waits for acks from hbase for all events in a transaction. |
serializer | org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer | |
serializer.* | – | Properties to be passed to the serializer. |
MorphlineSolrSink
This sink extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be org.apache.flume.sink.solr.morphline.MorphlineSolrSink |
morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example:/etc/flume-ng/conf/morphline.conf |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |
batchSize | 1000 | The maximum number of events to take per flume transaction. |
batchDurationMillis | 1000 | The maximum duration per flume transaction (ms). The transaction commits after this duration or when batchSize is exceeded, whichever comes first. |
handlerClass | org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl | The FQCN of a class implementing org.apache.flume.sink.solr.morphline.MorphlineHandler |
isProductionMode | false | This flag should be enabled for mission critical, large-scale online production systems that need to make progress without downtime when unrecoverable exceptions occur. Corrupt or malformed parser input data, parser bugs, and errors related to unknown Solr schema fields produce unrecoverable exceptions. |
recoverableExceptionClasses | org.apache.solr.client.solrj.SolrServerException | Comma separated list of recoverable exceptions that tend to be transient, in which case the corresponding task can be retried. Examples include network connection errors, timeouts, etc. When the production mode flag is set to true, the recoverable exceptions configured using this parameter will not be ignored and hence will lead to retries. |
isIgnoringRecoverableExceptions | false | This flag should be enabled, if an unrecoverable exception is accidentally misclassified as recoverable. This enables the sink to make progress and avoid retrying an event forever. |
Example for agent named a1:
ElasticSearchSink
This sink writes data to an elasticsearch cluster. By default, events will be written so that the Kibana graphical interface can display them - just as if logstash wrote them. The elasticsearch and lucene-core jars required for your environment must be placed in the lib directory of the Apache Flume installation.
Events are serialized for elasticsearch by the ElasticSearchLogStashEventSerializer by default. This behaviour can be overridden with the serializer parameter.
The type is the FQCN: org.apache.flume.sink.elasticsearch.ElasticSearchSink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be org.apache.flume.sink.elasticsearch.ElasticSearchSink |
hostNames | – | Comma separated list of hostname:port, if the port is not present the default port ‘9300’ will be used |
indexName | flume | The name of the index which the date will be appended to. Example ‘flume’ -> ‘flume-yyyy-MM-dd’ Arbitrary header substitution is supported, eg. %{header} replaces with value of named event header |
indexType | logs | The type to index the document to, defaults to ‘log’ Arbitrary header substitution is supported, eg. %{header} replaces with value of named event header |
clusterName | elasticsearch | Name of the ElasticSearch cluster to connect to |
batchSize | 100 | Number of events to be written per txn. |
ttl | – | TTL in days, when set will cause the expired documents to be deleted automatically, if not set documents will never be automatically deleted. TTL is accepted both in the earlier form of integer only e.g. a1.sinks.k1.ttl = 5 and also with a qualifier ms (millisecond), s (second), m (minute), h (hour), d (day) and w (week). Example a1.sinks.k1.ttl = 5d will set TTL to 5 days. Follow http://www.elasticsearch.org/guide/reference/mapping/ttl-field/ for more information. |
serializer | org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer | The ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use. Implementations of either class are accepted but ElasticSearchIndexRequestBuilderFactory is preferred. |
serializer.* | – | Properties to be passed to the serializer. |
Example for agent named a1:
Kafka Sink
This is a Flume Sink implementation that can publish data to a Kafka topic.
Property Name | Default | Description |
---|---|---|
type | – | Must be set to org.apache.flume.sink.kafka.KafkaSink |
brokerList | – | List of brokers Kafka-Sink will connect to, to get the list of topic partitions. This can be a partial list of brokers, but we recommend at least two for HA. The format is a comma separated list of hostname:port |
topic | default-flume-topic | The topic in Kafka to which the messages will be published. If this parameter is configured, messages will be published to this topic. If the event header contains a “topic” field, the event will be published to that topic overriding the topic configured here. |
batchSize | 100 | How many messages to process in one batch. Larger batches improve throughput while adding latency. |
requiredAcks | 1 | How many replicas must acknowledge a message before it's considered successfully written. Accepted values are 0 (never wait for acknowledgement), 1 (wait for leader only), -1 (wait for all replicas). Set this to -1 to avoid data loss in some cases of leader failure. |
Other Kafka Producer Properties | – | These properties are used to configure the Kafka Producer. Any producer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.. For example: kafka.producer.type |
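A hedged sketch of a Kafka sink configuration (broker list and topic name are placeholders):
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.channel = c1
a1.sinks.k1.brokerList = kafka-1:9092,kafka-2:9092
a1.sinks.k1.topic = flume-logs
a1.sinks.k1.batchSize = 100
a1.sinks.k1.requiredAcks = 1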
Custom Sink
A custom sink is your own implementation of the Sink interface. A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom sink is its FQCN.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be your FQCN |
Example for agent named a1:
Flume Channels
Memory Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be memory |
capacity | 100 | The maximum number of events stored in the channel |
transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction |
keep-alive | 3 | Timeout in seconds for adding or removing an event |
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB. |
Example for agent named a1:
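A minimal memory channel sketch (capacity values are illustrative):
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000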
JDBC Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be jdbc |
db.type | DERBY | Database vendor, needs to be DERBY. |
driver.class | org.apache.derby.jdbc.EmbeddedDriver | Class for vendor’s JDBC driver |
driver.url | (constructed from other properties) | JDBC connection URL |
db.username | “sa” | User id for db connection |
db.password | – | password for db connection |
connection.properties.file | – | JDBC Connection property file path |
create.schema | true | If true, then creates db schema if not there |
create.index | true | Create indexes to speed up lookups |
create.foreignkey | true | |
transaction.isolation | “READ_COMMITTED” | Isolation level for db session READ_UNCOMMITTED, READ_COMMITTED, SERIALIZABLE, REPEATABLE_READ |
maximum.connections | 10 | Max connections allowed to db |
maximum.capacity | 0 (unlimited) | Max number of events in the channel |
sysprop.* | – | DB Vendor specific properties |
sysprop.user.home | – | Home path to store embedded Derby database |
Example for agent named a1:
Kafka Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be org.apache.flume.channel.kafka.KafkaChannel |
brokerList | – | List of brokers in the Kafka cluster used by the channel. This can be a partial list of brokers, but we recommend at least two for HA. The format is a comma separated list of hostname:port |
zookeeperConnect | – | URI of ZooKeeper used by the Kafka cluster. The format is a comma separated list of hostname:port. If chroot is used, it is added once at the end. For example: zookeeper-1:2181,zookeeper-2:2182,zookeeper-3:2181/kafka |
topic | flume-channel | Kafka topic which the channel will use |
groupId | flume | Consumer group ID the channel uses to register with Kafka. Multiple channels must use the same topic and group to ensure that when one agent fails another can get the data. Note that having non-channel consumers with the same ID can lead to data loss. |
parseAsFlumeEvent | true | Expecting Avro datums with FlumeEvent schema in the channel. This should be true if a Flume source is writing to the channel, and false if other producers are writing into the topic that the channel is using. Flume source messages to Kafka can be parsed outside of Flume by using org.apache.flume.source.avro.AvroFlumeEvent provided by the flume-ng-sdk artifact |
readSmallestOffset | false | When set to true, the channel will read all data in the topic, starting from the oldest event; when false, it will read only events written after the channel started. When “parseAsFlumeEvent” is true, this will be false: the Flume source will start prior to the sinks, and this guarantees that events sent by the source before the sinks start will not be lost. |
Other Kafka Properties | – | These properties are used to configure the Kafka Producer and Consumer used by the channel. Any property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.. For example: kafka.producer.type |
Example for agent named a1:
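A sketch of a Kafka channel (broker and ZooKeeper addresses and the topic name are placeholders):
a1.channels = channel1
a1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.channel1.brokerList = kafka-1:9092,kafka-2:9092,kafka-3:9092
a1.channels.channel1.zookeeperConnect = zk-1:2181,zk-2:2181,zk-3:2181
a1.channels.channel1.topic = channel1
a1.channels.channel1.groupId = flume
a1.channels.channel1.parseAsFlumeEvent = true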
File Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be file. |
checkpointDir | ~/.flume/file-channel/checkpoint | The directory where checkpoint file will be stored |
useDualCheckpoints | false | Backup the checkpoint. If this is set to true, backupCheckpointDir must be set |
backupCheckpointDir | – | The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory |
dataDirs | ~/.flume/file-channel/data | Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance |
transactionCapacity | 10000 | The maximum size of transaction supported by the channel |
checkpointInterval | 30000 | Amount of time (in millis) between checkpoints |
maxFileSize | 2146435071 | Max size (in bytes) of a single log file |
minimumRequiredSpace | 524288000 | Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value |
capacity | 1000000 | Maximum capacity of the channel |
keep-alive | 3 | Amount of time (in sec) to wait for a put operation |
use-log-replay-v1 | false | Expert: Use old replay logic |
use-fast-replay | false | Expert: Replay without using queue |
checkpointOnClose | true | Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay. |
encryption.activeKey | – | Key name used to encrypt new data |
encryption.cipherProvider | – | Cipher provider type, supported types: AESCTRNOPADDING |
encryption.keyProvider | – | Key provider type, supported types: JCEKSFILE |
encryption.keyProvider.keyStoreFile | – | Path to the keystore file |
encryption.keyProvider.keyStorePasswordFile | – | Path to the keystore password file |
encryption.keyProvider.keys | – | List of all keys (e.g. history of the activeKey setting) |
encryption.keyProvider.keys.*.passwordFile | – | Path to the optional key password file |
Example for agent named a1:
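A minimal file channel sketch (checkpoint and data directories are placeholders; ideally they live on separate disks):
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data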
Encryption
Generating a key with a password separate from the key store password:
Generating a key with the password the same as the key store password:
Let's say you have aged key-0 out and new files should be encrypted with key-1:
The same scenario as above, however key-0 has its own password:
Spillable Memory Channel
The events are stored in an in-memory queue and on disk. The in-memory queue serves as the primary store and the disk as overflow.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be SPILLABLEMEMORY |
memoryCapacity | 10000 | Maximum number of events stored in memory queue. To disable use of in-memory queue, set this to zero. |
overflowCapacity | 100000000 | Maximum number of events stored in overflow disk (i.e File channel). To disable use of overflow, set this to zero. |
overflowTimeout | 3 | The number of seconds to wait before enabling disk overflow when memory fills up. |
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
byteCapacity | see description | Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB. |
avgEventSize | 500 | Estimated average size of events, in bytes, going into the channel |
<file channel properties> | see file channel | Any file channel property with the exception of ‘keep-alive’ and ‘capacity’ can be used. The keep-alive of file channel is managed by Spillable Memory Channel. Use ‘overflowCapacity’ to set the File channel’s capacity. |
In-memory queue is considered full if either memoryCapacity or byteCapacity limit is reached.
Example for agent named a1:
To disable the use of the in-memory queue and function like a file channel:
To disable the use of overflow disk and function purely as a in-memory channel:
Custom Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be a FQCN |
Example for agent named a1:
Flume Channel Selectors
Replicating Channel Selector (default)
Property Name | Default | Description |
---|---|---|
selector.type | replicating | The component type name, needs to be replicating |
selector.optional | – | Set of channels to be marked as optional |
Example for agent named a1 and it’s source called r1:
Multiplexing Channel Selector
Property Name | Default | Description |
---|---|---|
selector.type | replicating | The component type name, needs to be multiplexing |
selector.header | flume.selector.header | |
selector.default | – | |
selector.mapping.* | – |
Example for agent named a1 and it’s source called r1:
Custom Channel Selector
A custom channel selector's class and its dependencies must be included in the agent's classpath when starting the Flume agent; its type is the selector's FQCN.
Flume Sink Processors
Sink groups allow users to group multiple sinks into one entity; the sink processor determines how events are routed to the sinks in the group.
Property Name | Default | Description |
---|---|---|
sinks | – | Space-separated list of sinks that are participating in the group |
processor.type | default | The component type name, needs to be default, failover or load_balance |
Example for agent named a1:
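For instance, a sink group wired to a load-balancing processor might look like this, where g1, k1 and k2 are placeholder names:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance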
Default Sink Processor
Default sink processor accepts only a single sink. The user is not forced to create a processor (sink group) for a single sink.
Failover Sink Processor
The sinks have a priority associated with them; the larger the number, the higher the priority. If a sink fails while sending an event, the next sink with the highest priority is tried for sending events.
Property Name | Default | Description |
---|---|---|
sinks | – | Space-separated list of sinks that are participating in the group |
processor.type | default | The component type name, needs to be failover |
processor.priority.<sinkName> | – | Priority value. <sinkName> must be one of the sink instances associated with the current sink group. A higher priority value sink gets activated earlier. A larger absolute value indicates higher priority. |
processor.maxpenalty | 30000 | The maximum backoff period for the failed Sink (in millis) |
Example for agent named a1:
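A failover sketch, assuming two placeholder sinks k1 and k2; k2 has the higher priority, so it is tried first and k1 takes over if k2 fails:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Cap the backoff period for a failed sink at 10 seconds
a1.sinkgroups.g1.processor.maxpenalty = 10000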
Load balancing Sink Processor
Load balancing sink processor provides the ability to load-balance flow over multiple sinks, distributing events via round_robin or random selection mechanisms.
Property Name | Default | Description |
---|---|---|
processor.sinks | – | Space-separated list of sinks that are participating in the group |
processor.type | default | The component type name, needs to be load_balance |
processor.backoff | false | Should failed sinks be backed off exponentially. |
processor.selector | round_robin | Selection mechanism. Must be either round_robin, random or FQCN of custom class that inherits from AbstractSinkSelector |
processor.selector.maxTimeOut | 30000 | Used by backoff selectors to limit exponential backoff (in milliseconds) |
Example for agent named a1:
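A load-balancing sketch with placeholder sinks k1 and k2, using random selection and exponential backoff for failed sinks:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random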
Custom Sink Processor
Custom sink processors are not supported at the moment.
Flume Interceptors
Flume has the capability to modify/drop events in-flight. Interceptors are classes that implement the org.apache.flume.interceptor.Interceptor interface, and they are invoked in the order in which they are listed on the source. In the example below, events are passed to the HostInterceptor first and the events returned by the HostInterceptor are then passed along to the TimestampInterceptor. You can specify either the fully qualified class name (FQCN) or the alias timestamp. If you have multiple collectors writing to the same HDFS path, then you could also use the HostInterceptor.
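A minimal sketch of such a chain, using the host and timestamp aliases; the agent, source and channel names are placeholders:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
# i1 runs first and adds the host header, then i2 adds the timestamp header
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i2.type = timestamp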
Timestamp Interceptor
This interceptor inserts a header with key timestamp whose value is the relevant timestamp. This interceptor can preserve an existing timestamp if it is already present in the configuration.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, has to be timestamp or the FQCN |
preserveExisting | false | If the timestamp already exists, should it be preserved - true or false |
Example for agent named a1:
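A sketch with placeholder names, here attached to a simple seq source:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
# Adds a timestamp header with the event's processing time in milliseconds
a1.sources.r1.interceptors.i1.type = timestamp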
Host Interceptor
It inserts a header with key host or a configured key whose value is the hostname or IP address of the host, based on configuration.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, has to be host |
preserveExisting | false | If the host header already exists, should it be preserved - true or false |
useIP | true | Use the IP Address if true, else use hostname. |
hostHeader | host | The header key to be used. |
Example for agent named a1:
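A sketch with placeholder names that stores the hostname rather than the IP under a hostname header:
a1.sources = r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = hostname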
Static Interceptor
Static interceptor allows user to append a static header with static value to all events. The current implementation does not allow specifying multiple headers at one time.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, has to be static |
preserveExisting | true | If configured header already exists, should it be preserved - true or false |
key | key | Name of header that should be created |
value | value | Static value that should be created |
Example for agent named a1:
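A sketch where the header key datacenter and value NEW_YORK are placeholder choices:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
# Every event gets the header datacenter=NEW_YORK
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK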
UUID Interceptor
This interceptor sets a universally unique identifier on all events that are intercepted.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder |
headerName | id | The name of the Flume header to modify |
preserveExisting | true | If the UUID header already exists, should it be preserved - true or false |
prefix | “” | The prefix string constant to prepend to each generated UUID |
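An illustrative sketch; the header name eventId and the prefix flume- are arbitrary placeholder values:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i1.headerName = eventId
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.prefix = flume-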
Morphline Interceptor
This interceptor filters the events through a morphline configuration file that defines a chain of transformation commands that pipe records from one command to another.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder |
morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |
Sample flume.conf file:
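A sketch in which the source name avroSrc, the morphline file path and the id morphline1 are placeholders:
a1.sources.avroSrc.interceptors = morphlineinterceptor
a1.sources.avroSrc.interceptors.morphlineinterceptor.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile = /etc/flume-ng/conf/morphline.conf
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1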
Search and Replace Interceptor
This interceptor provides simple string-based search-and-replace functionality based on Java regular expressions.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be search_replace |
searchPattern | – | The pattern to search for and replace. |
replaceString | – | The replacement string. |
charset | UTF-8 | The charset of the event body. Assumed by default to be UTF-8. |
Example configuration:
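For instance, a configuration along these lines strips leading alphanumeric characters from each event body (the source name avroSrc is a placeholder):
a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Remove leading alphanumeric characters in an event body
a1.sources.avroSrc.interceptors.search-replace.searchPattern = ^[A-Za-z0-9_]+
a1.sources.avroSrc.interceptors.search-replace.replaceString =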
Another example:
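A second sketch using capture groups to reorder words in the body (names and text are placeholders):
a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Use grouping operators to reorder words on a line
a1.sources.avroSrc.interceptors.search-replace.searchPattern = The quick brown ([a-z]+) jumped over the lazy ([a-z]+)
a1.sources.avroSrc.interceptors.search-replace.replaceString = The hungry $2 ate the careless $1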
Regex Filtering Interceptor
This interceptor filters events selectively by interpreting the event body as text and matching the text against a configured regular expression, which can be used to either include or exclude events.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be regex_filter |
regex | ”.*” | Regular expression for matching against events |
excludeEvents | false | If true, regex determines events to exclude, otherwise regex determines events to include. |
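An illustrative sketch that drops events whose body starts with DEBUG; the pattern is a placeholder:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^DEBUG
# excludeEvents=true means matching events are dropped
a1.sources.r1.interceptors.i1.excludeEvents = true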
Regex Extractor Interceptor
This interceptor extracts regex match groups using a specified regular expression and appends the match groups as headers on the event, using pluggable serializers to format the match groups before they are added.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be regex_extractor |
regex | – | Regular expression for matching against events |
serializers | – | Space-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer, org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer |
serializers.<s1>.type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer |
serializers.<s1>.name | – | |
serializers.* | – | Serializer-specific properties |
Example 1:
If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used
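A configuration along these lines, with placeholder names, would produce the headers described next:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
# Each capture group is serialized by the default pass-through serializer under the given header name
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three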
The extracted event will contain the same body but the following headers will have been added: one=>1, two=>2, three=>3
Example 2:
If the Flume event body contained 2012-10-18 18:47:57,614 some log line and the following configuration was used
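A sketch with placeholder names that parses the leading date into a millisecond timestamp header via the Millis serializer:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
# Pattern used to parse the captured group into epoch milliseconds
a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm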
The extracted event will contain the same body but the following header will have been added: timestamp=>1350611220000