Reposted from: http://www.cnblogs.com/smartloli/p/4468708.html
1. Overview
Today I'm adding a post about Flume that was left out of the earlier series on building a highly available Hadoop platform. This post covers the following topics:
- A brief introduction to Flume NG
- Setting up and running a single-node Flume NG
- Setting up a highly available Flume NG cluster
- Failover testing
- Screenshots
Let's get started.
2. A Brief Introduction to Flume NG
Flume NG is a distributed, highly available, and reliable system that collects, moves, and stores massive amounts of data from different sources into a central data store. It is lightweight and simple to configure, suits all kinds of log-collection scenarios, supports failover and load balancing, and ships with a very rich set of components. Flume NG uses a three-tier architecture: an Agent tier, a Collector tier, and a Store tier, each of which can be scaled out horizontally. An Agent is composed of a Source, a Channel, and a Sink, whose responsibilities are:
- Source: consumes (collects) data from the data source and writes it into the Channel
- Channel: a temporary staging store that holds all the data delivered by the Source
- Sink: reads from the Channel and deletes the data from the Channel once it has been read successfully
The figure below shows the Flume NG architecture:
As the figure illustrates, logs produced by an external system (a web server) are collected by the Source component of a Flume Agent, handed to the Channel component for temporary staging, and finally passed to the Sink component, which writes the data directly into the HDFS file system.
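To make the Source/Channel/Sink wiring concrete, here is a minimal single-agent sketch (not from the original post; the agent name a1, the netcat port, and the logger sink are illustrative placeholders):
# a1: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source: read lines of text from a TCP port (netcat source)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# Channel: in-memory staging store
a1.channels.c1.type = memory
# Sink: log events to the console (useful for testing)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1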
3. Single-Node Flume NG Setup and Operation
Now that we are familiar with the Flume NG architecture, let's first set up a single Flume node that collects data into the HDFS cluster. Since resources are limited, we install Flume directly on the highly available Hadoop cluster built earlier.
The scenario is as follows: set up one Flume NG instance on the NNA node and collect local logs into the HDFS cluster.
3.1 Prerequisite Software
Before setting up Flume NG, we need to prepare the required software. Download it from:
- Flume (download link)
The JDK was already configured when the Hadoop cluster was installed, so it is not repeated here; if you still need to configure it, see the earlier post "Configuring a Highly Available Hadoop Platform".
3.2 Installation and Configuration
- Installation
First, unpack the Flume tarball with the following command:
[hadoop@nna ~]$ tar -zxvf apache-flume-1.5.2-bin.tar.gz
- Configuration
The environment variable configuration is as follows:
export FLUME_HOME=/home/hadoop/flume-1.5.2
export PATH=$PATH:$FLUME_HOME/bin
flume-conf.properties
#agent1 name
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1

#Spooling Directory
#set source1
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/home/hadoop/dir/logdfs
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader = false
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp

#set sink1
agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=/home/hdfs/flume/logdfs
agent1.sinks.sink1.hdfs.fileType=DataStream
agent1.sinks.sink1.hdfs.writeFormat=TEXT
agent1.sinks.sink1.hdfs.rollInterval=1
agent1.sinks.sink1.channel=channel1
agent1.sinks.sink1.hdfs.filePrefix=%Y-%m-%d

#set channel1
agent1.channels.channel1.type=file
agent1.channels.channel1.checkpointDir=/home/hadoop/dir/logdfstmp/point
agent1.channels.channel1.dataDirs=/home/hadoop/dir/logdfstmp
flume-env.sh
JAVA_HOME=/usr/java/jdk1.7
Note: if any directory referenced in the configuration does not exist, create it in advance.
3.3 Startup
The startup command is as follows:
flume-ng agent -n agent1 -c conf -f flume-conf.properties -Dflume.root.logger=DEBUG,console
Note: agent1 in the command is the agent name defined in the configuration file (agent1 in our config); flume-conf.properties is the configuration file, and its exact path must be supplied.
3.4 Preview of the Results
After a successful upload, the files in the local spooling directory are marked as completed, as shown in the screenshot below:
4. Highly Available Flume NG Setup
With the single-node Flume NG in place, let's now build a highly available Flume NG cluster. The architecture is shown below:
As the figure shows, Flume supports multiple storage backends; only HDFS and Kafka are listed here (Kafka could, for example, keep the most recent week of logs and feed a real-time log stream to a Storm system).
4.1 Node Allocation
The Flume Agents and Collectors are distributed as shown in the following table:
Name | Host | Role |
---|---|---|
Agent1 | 10.211.55.14 | Web Server |
Agent2 | 10.211.55.15 | Web Server |
Agent3 | 10.211.55.16 | Web Server |
Collector1 | 10.211.55.18 | AgentMstr1 |
Collector2 | 10.211.55.19 | AgentMstr2 |
As shown above, data from Agent1, Agent2, and Agent3 flows into both Collector1 and Collector2. Flume NG provides a built-in failover mechanism that switches over and recovers automatically. In this setup, three log-producing servers sit in different machine rooms, and all of their logs must be collected into a single cluster for storage. Next, we configure the Flume NG cluster.
4.2 Configuration
The basic configuration was already completed in the single-node setup; we only need to add two new configuration files, flume-client.properties and flume-server.properties, with the following contents:
- flume-client.properties
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2

#set group
agent1.sinkgroups = g1

#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/dir/logdfs/test.log

agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = nna
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = nns
agent1.sinks.k2.port = 52020

#set sink group
agent1.sinkgroups.g1.sinks = k1 k2

#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Note: the sinks point to the IP (hostname) and port of each Collector.
- flume-server.properties
#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# other node, nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = nna
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = NNA
a1.sources.r1.channels = c1

#set sink to hdfs
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/home/hdfs/flume/logdfs
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=TEXT
a1.sinks.k1.hdfs.rollInterval=1
a1.sinks.k1.channel=c1
a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d
Note: on the other Collector node, change the bind address; for example, on the NNS node change the bound host from nna to nns.
Flume interceptors form a chain: multiple interceptors can be attached to one source and are applied in order (see the sketch after this list). Commonly used interceptors include:
(1) Timestamp Interceptor: adds a header with the key timestamp whose value is the current timestamp. Very useful when the sink is HDFS.
(2) Host Interceptor: adds a header with the key host whose value is the hostname or IP of the current machine.
(3) Static Interceptor: adds a custom key/value pair to the event header.
(4) Regex Filtering Interceptor: filters out, or keeps only, events matching a regular expression.
(5) Regex Extractor Interceptor: adds headers with specified keys whose values are the parts matched by a regular expression.
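A sketch of chaining several interceptors on one source (property names follow the Flume interceptor reference further below; the header key/value pair is an illustrative placeholder):
agent1.sources.r1.interceptors = i1 i2 i3
# i1: add a timestamp header (useful for HDFS sinks)
agent1.sources.r1.interceptors.i1.type = timestamp
# i2: add the host under a custom header key
agent1.sources.r1.interceptors.i2.type = host
agent1.sources.r1.interceptors.i2.hostHeader = hostname
# i3: add a static key/value to every event
agent1.sources.r1.interceptors.i3.type = static
agent1.sources.r1.interceptors.i3.key = datacenter
agent1.sources.r1.interceptors.i3.value = dc1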
4.3 Startup
On the Agent nodes, start Flume with:
flume-ng agent -n agent1 -c conf -f flume-client.properties -Dflume.root.logger=DEBUG,console
Note: agent1 in the command is the agent name defined in the configuration file (agent1 in our config); flume-client.properties is the configuration file, and its exact path must be supplied.
On the Collector nodes, start Flume with:
flume-ng agent -n a1 -c conf -f flume-server.properties -Dflume.root.logger=DEBUG,console
Note: a1 in the command is the agent name defined in the configuration file (a1 in our config); flume-server.properties is the configuration file, and its exact path must be supplied.
5. Failover Test
Let's now test the high availability (failover) of the Flume NG cluster. The scenario: we upload a file on the Agent1 node. Because Collector1 is configured with a higher priority than Collector2, Collector1 collects the data first and uploads it to the storage system. We then kill Collector1, at which point Collector2 takes over log collection and upload. After that, we manually restore the Flume service on Collector1 and upload another file on Agent1; Collector1 resumes collection with the higher priority. The screenshots below show the process:
- Collector1 uploads first
- Preview of the uploaded log contents in the HDFS cluster
- Collector1 goes down and Collector2 takes over the upload
- After the Collector1 service is restarted, Collector1 regains upload priority
6. Screenshots
Below are screenshots from the HDFS file system:
- Files in the HDFS file system
- Contents of an uploaded file
7. Summary
A few things deserve attention when configuring a highly available Flume NG cluster. On the Agents, bind the IP (or hostname) and port of both Collector1 and Collector2. On each Collector node, edit that node's configuration file so that the bind IP (or hostname) is the node's own IP (or hostname). Finally, when starting Flume, specify the agent name defined in the configuration file and the path to the configuration file, otherwise startup will fail.
8. Closing Remarks
That's all for this post. If you run into problems while studying this material, feel free to discuss them in the group or send me an email, and I will do my best to answer. Good luck!
Configuration reference:
For more flexibility, the failover Flume client can be configured with these properties:
For more flexibility, the load-balancing Flume client can be configured with these properties:
Fan out flow
Flume supports fanning out the flow from one source to multiple channels.
There are two modes: replicating and multiplexing. In the replicating flow, the event is sent to all the configured channels.
In case of multiplexing, the event is sent to only a subset of qualifying channels.
replicating:
multiplexing:
The selector checks for a header called “State”. If the value is “CA” then it's sent to mem-channel-1, if it's “AZ” then it goes to file-channel-2, or if it's “NY” then both. If the “State” header is not set or doesn't match any of the three, then it goes to mem-channel-1, which is designated as ‘default’.
To specify optional channels for a header, the config parameter ‘optional’ is used.
The selector will attempt to write to the required channels first and will fail the transaction if even one of these channels fails to consume the events. The transaction is reattempted on all of the channels. Once all required channels have consumed the events, then the selector will attempt to write to the optional channels. A failure by any of the optional channels to consume the event is simply ignored and not retried.
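Configuration sketches for the two fan-out modes described above (channel names and the “State” mapping mirror the description; treat them as placeholders):
# Replicating selector: every event goes to all channels; c3 is optional
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.selector.optional = c3

# Multiplexing selector (a separate agent config): route by the "State" header
a1.sources = r1
a1.channels = mem-channel-1 file-channel-2
a1.sources.r1.channels = mem-channel-1 file-channel-2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = State
a1.sources.r1.selector.mapping.CA = mem-channel-1
a1.sources.r1.selector.mapping.AZ = file-channel-2
a1.sources.r1.selector.mapping.NY = mem-channel-1 file-channel-2
a1.sources.r1.selector.default = mem-channel-1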
Avro Source
Listens on Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be avro |
bind | – | hostname or IP address to listen on |
port | – | Port # to bind to |
threads | – | Maximum number of worker threads to spawn |
selector.type | ||
selector.* | ||
interceptors | – | Space-separated list of interceptors |
interceptors.* | ||
compression-type | none | This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource |
ssl | false | Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”. |
keystore | – | This is the path to a Java keystore file. Required for SSL. |
keystore-password | – | The password for the Java keystore. Required for SSL. |
keystore-type | JKS | The type of the Java keystore. This can be “JKS” or “PKCS12”. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
ipFilter | false | Set this to true to enable ipFiltering for netty |
ipFilterRules | – | Define N netty ipFilter pattern rules with this config. |
Thrift Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be thrift |
bind | – | hostname or IP address to listen on |
port | – | Port # to bind to |
threads | – | Maximum number of worker threads to spawn |
selector.type | ||
selector.* | ||
interceptors | – | Space separated list of interceptors |
interceptors.* | ||
ssl | false | Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”. |
keystore | – | This is the path to a Java keystore file. Required for SSL. |
keystore-password | – | The password for the Java keystore. Required for SSL. |
keystore-type | JKS | The type of the Java keystore. This can be “JKS” or “PKCS12”. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
kerberos | false | Set to true to enable kerberos authentication. In kerberos mode, agent-principal and agent-keytab are required for successful authentication. The Thrift source in secure mode, will accept connections only from Thrift clients that have kerberos enabled and are successfully authenticated to the kerberos KDC. |
agent-principal | – | The kerberos principal used by the Thrift Source to authenticate to the kerberos KDC. |
agent-keytab | —- | The keytab location used by the Thrift Source in combination with the agent-principal to authenticate to the kerberos KDC. |
Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results, whereas date will probably not - the former two commands produce streams of data, whereas the latter produces a single event and exits.
Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be exec |
command | – | The command to execute |
shell | – | A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc. |
restartThrottle | 10000 | Amount of time (in millis) to wait before attempting a restart |
restart | false | Whether the executed cmd should be restarted if it dies |
logStdErr | false | Whether the command’s stderr should be logged |
batchSize | 20 | The max number of lines to read and send to the channel at a time |
batchTimeout | 3000 | Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for agent named a1:
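A minimal sketch of an exec source (the tailed log path is a placeholder; the commented-out variant shows the ‘shell’ option discussed next):
a1.sources = r1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
# Variant using a shell so that wildcards/pipes work:
# a1.sources.r1.shell = /bin/sh -c
# a1.sources.r1.command = for f in /var/log/app/*.log; do cat "$f"; done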
The ‘shell’ config is used to invoke the ‘command’ through a command shell
In the absence of the ‘shell’ config, the ‘command’ will be invoked directly. Common values for ‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.
JMS Source
Spooling Directory Source
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be spooldir. |
spoolDir | – | The directory from which to read files from. |
fileSuffix | .COMPLETED | Suffix to append to completely ingested files |
deletePolicy | never | When to delete completed files: never or immediate |
fileHeader | false | Whether to add a header storing the absolute path filename. |
fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
basenameHeader | false | Whether to add a header storing the basename of the file. |
basenameHeaderKey | basename | Header Key to use when appending basename of file to event header. |
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip) |
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir. |
consumeOrder | oldest | In which order files in the spooling directory will be consumed: oldest, youngest or random. In case of oldest and youngest, the last modified time of the files will be used to compare the files. In case of a tie, the file with the smallest lexicographical order will be consumed first. In case of random, any file will be picked randomly. When using oldest and youngest, the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory. |
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, upto the value specified by this parameter. |
batchSize | 100 | Granularity at which to batch transfer to the channel |
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text. |
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the “replacement character” char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence. |
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder. |
deserializer.* | Varies per event deserializer. | |
bufferMaxLines | – | (Obsolete) This option is now ignored. |
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead. |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for an agent named agent-1:
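A sketch for agent-1 (the spool directory path is a placeholder):
agent-1.sources = src-1
agent-1.channels = ch-1
agent-1.channels.ch-1.type = file
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true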
Kafka Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be org.apache.flume.source.kafka.KafkaSource |
zookeeperConnect | – | URI of ZooKeeper used by Kafka cluster |
groupId | flume | Unique identifier of the consumer group. Setting the same id in multiple sources or agents indicates that they are part of the same consumer group |
topic | – | Kafka topic we'll read messages from. At this time, only a single topic is supported. |
batchSize | 1000 | Maximum number of messages written to Channel in one batch |
batchDurationMillis | 1000 | Maximum time (in ms) before a batch will be written to the Channel. The batch is written whenever the size limit or the time limit is reached, whichever comes first. |
backoffSleepIncrement | 1000 | Initial and incremental wait time that is triggered when a Kafka Topic appears to be empty. Wait period will reduce aggressive pinging of an empty Kafka Topic. One second is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
maxBackoffSleep | 5000 | Maximum wait time that is triggered when a Kafka Topic appears to be empty. Five seconds is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
Other Kafka Consumer Properties | – | These properties are used to configure the Kafka Consumer. Any consumer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.. For example: kafka.consumer.timeout.ms. Check the Kafka documentation (https://kafka.apache.org/08/configuration.html#consumerconfigs) for details |
Example for agent named tier1:
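A sketch for tier1 (ZooKeeper address and topic name are placeholders):
tier1.sources = source1
tier1.channels = channel1
tier1.channels.channel1.type = memory
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = test1
tier1.sources.source1.groupId = flume
# Pass-through consumer property, prefixed with kafka.
tier1.sources.source1.kafka.consumer.timeout.ms = 100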
NetCat Source
A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port].
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be netcat |
bind | – | Host name or IP address to bind to |
port | – | Port # to bind to |
max-line-length | 512 | Max line length per event body (in bytes) |
ack-every-event | true | Respond with an “OK” for every event received |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for agent named a1:
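A sketch of a netcat source (bind address and port are placeholders):
a1.sources = r1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1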
Sequence Generator Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be seq |
selector.type | replicating | replicating or multiplexing |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* | ||
batchSize | 1 |
Example for agent named a1:
Syslog TCP Source
The original, tried-and-true syslog TCP source.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be syslogtcp |
host | – | Host name or IP address to bind to |
port | – | Port # to bind to |
eventSize | 2500 | Maximum size of a single event line, in bytes |
keepFields | none | Setting this to ‘all’ will preserve the Priority, Timestamp and Hostname in the body of the event. A space-separated list of fields to include is allowed as well. Currently, the following fields can be included: priority, version, timestamp, hostname. The values ‘true’ and ‘false’ have been deprecated in favor of ‘all’ and ‘none’. |
selector.type | replicating | replicating or multiplexing |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
For example, a syslog TCP source for agent named a1:
Multiport Syslog TCP Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be multiport_syslogtcp |
host | – | Host name or IP address to bind to. |
ports | – | Space-separated list (one or more) of ports to bind to. |
eventSize | 2500 | Maximum size of a single event line, in bytes. |
keepFields | none | Setting this to ‘all’ will preserve the Priority, Timestamp and Hostname in the body of the event. A space-separated list of fields to include is allowed as well. Currently, the following fields can be included: priority, version, timestamp, hostname. The values ‘true’ and ‘false’ have been deprecated in favor of ‘all’ and ‘none’. |
portHeader | – | If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port. |
charset.default | UTF-8 | Default character set used while parsing syslog events into strings. |
charset.port.<port> | – | Character set is configurable on a per-port basis. |
batchSize | 100 | Maximum number of events to attempt to process per request loop. Using the default is usually fine. |
readBufferSize | 1024 | Size of the internal Mina read buffer. Provided for performance tuning. Using the default is usually fine. |
numProcessors | (auto-detected) | Number of processors available on the system for use while processing messages. Default is to auto-detect # of CPUs using the Java Runtime API. Mina will spawn 2 request-processing threads per detected CPU, which is often reasonable. |
selector.type | replicating | replicating, multiplexing, or custom |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors. |
interceptors.* |
For example, a multiport syslog TCP source for agent named a1:
Syslog UDP Source
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be syslogudp |
host | – | Host name or IP address to bind to |
port | – | Port # to bind to |
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event. |
selector.type | replicating | replicating or multiplexing |
selector.* | – | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
For example, a syslog UDP source for agent named a1:
HTTP Source
A source which accepts Flume Events by HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into flume events by a pluggable “handler” which must implement the HTTPSourceHandler interface. This handler takes a HttpServletRequest and returns a list of flume events.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be http |
port | – | The port the source should bind to. |
bind | 0.0.0.0 | The hostname or IP address to listen on |
handler | org.apache.flume.source.http.JSONHandler | The FQCN of the handler class. |
handler.* | – | Config parameters for the handler |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* | ||
enableSSL | false | Set the property to true to enable SSL. HTTP Source does not support SSLv3. |
excludeProtocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded. |
keystore | – | Location of the keystore including the keystore file name |
keystorePassword | – | Keystore password |
For example, a http source for agent named a1:
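A sketch of an HTTP source (port is a placeholder; JSONHandler is the documented default, shown explicitly for clarity):
a1.sources = r1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler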
The following properties apply to the BlobHandler, an alternative handler for the HTTP source that buffers an entire request body (e.g. a PDF or JPG) into a single event:
Property Name | Default | Description |
---|---|---|
handler | – | The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobHandler |
handler.maxBlobLength | 100000000 | The maximum number of bytes to read and buffer for a given request |
Custom Source
A custom source’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom source is its FQCN.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be your FQCN |
selector.type | replicating or multiplexing | |
selector.* | replicating | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example for agent named a1:
Flume Sinks
HDFS Sink
The following are the escape sequences supported:
Alias | Description |
---|---|
%{host} | Substitute value of event header named “host”. Arbitrary header names are supported. |
%t | Unix time in milliseconds |
%a | locale’s short weekday name (Mon, Tue, ...) |
%A | locale’s full weekday name (Monday, Tuesday, ...) |
%b | locale’s short month name (Jan, Feb, ...) |
%B | locale’s long month name (January, February, ...) |
%c | locale’s date and time (Thu Mar 3 23:05:25 2005) |
%d | day of month (01) |
%e | day of month without padding (1) |
%D | date; same as %m/%d/%y |
%H | hour (00..23) |
%I | hour (01..12) |
%j | day of year (001..366) |
%k | hour ( 0..23) |
%m | month (01..12) |
%n | month without padding (1..12) |
%M | minute (00..59) |
%p | locale’s equivalent of am or pm |
%s | seconds since 1970-01-01 00:00:00 UTC |
%S | second (00..59) |
%y | last two digits of year (00..99) |
%Y | year (2010) |
%z | +hhmm numeric timezone (for example, -0400) |
note
For all of the time related escape sequences, a header with the key “timestamp” must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the TimestampInterceptor.
Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hdfs |
hdfs.path | – | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
hdfs.fileSuffix | – | Suffix to append to file (eg .avro - NOTE: period is not automatically added) |
hdfs.inUsePrefix | – | Prefix that is used for temporal files that flume actively writes into |
hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into |
hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval) |
hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0: never roll based on file size) |
hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events) |
hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files) |
hdfs.batchSize | 100 | number of events written to file before it is flushed to HDFS |
hdfs.codeC | – | Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy |
hdfs.fileType | SequenceFile | File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream will not compress the output file; do not set codeC. (2) CompressedStream requires hdfs.codeC to be set to an available codec. |
hdfs.maxOpenFiles | 5000 | Allow only this number of open files. If this number is exceeded, the oldest file is closed. |
hdfs.minBlockReplicas | – | Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath. |
hdfs.writeFormat | – | Format for sequence file records. One of “Text” or “Writable” (the default). |
hdfs.callTimeout | 10000 | Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring. |
hdfs.threadsPoolSize | 10 | Number of threads per HDFS sink for HDFS IO ops (open, write, etc.) |
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling |
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS |
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS |
hdfs.proxyUser | ||
hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) |
hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time. |
hdfs.roundUnit | second | The unit of the round down value - second, minute or hour. |
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
hdfs.closeTries | 0 | Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart. |
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension. |
serializer | TEXT | Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface. |
serializer.* |
Example for agent named a1:
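A sketch of an HDFS sink (the HDFS path is a placeholder; the time escapes require a timestamp header, e.g. from the TimestampInterceptor, or hdfs.useLocalTimeStamp = true):
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileType = DataStream
# Roll by size only (128 MB), never by time or event count
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Round the directory timestamp down to 10-minute buckets
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute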
Hive Sink
This sink is provided as a preview feature and not recommended for use in production.
Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hive |
hive.metastore | – | Hive metastore URI (eg thrift://a.b.com:9083 ) |
hive.database | – | Hive database name |
hive.table | – | Hive table name |
hive.partition | – | Comma separated list of partition values identifying the partition to write to. May contain escape sequences. E.g: If the table is partitioned by (continent: string, country: string, time: string) then ‘Asia,India,2014-02-26-01-21’ will indicate continent=Asia, country=India, time=2014-02-26-01-21 |
hive.txnsPerBatchAsk | 100 | Hive grants a batch of transactions instead of single transactions to streaming clients like Flume. This setting configures the number of desired transactions per Transaction Batch. Data from all transactions in a single batch end up in a single file. Flume will write a maximum of batchSize events in each transaction in the batch. This setting in conjunction with batchSize provides control over the size of each file. Note that eventually Hive will transparently compact these files into larger files. |
heartBeatInterval | 240 | (In seconds) Interval between consecutive heartbeats sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable heartbeats. |
autoCreatePartitions | true | Flume will automatically create the necessary Hive partitions to stream to |
batchSize | 15000 | Max number of events written to Hive in a single Hive transaction |
maxOpenConnections | 500 | Allow only this number of open connections. If this number is exceeded, the least recently used connection is closed. |
callTimeout | 10000 | (In milliseconds) Timeout for Hive & HDFS I/O operations, such as openTxn, write, commit, abort. |
serializer | – | Serializer is responsible for parsing out fields from the event and mapping them to columns in the Hive table. Choice of serializer depends upon the format of the data in the event. Supported serializers: DELIMITED and JSON |
roundUnit | minute | The unit of the round down value - second, minute or hour. |
roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hive.roundUnit), less than current time |
timeZone | Local Time | Name of the timezone that should be used for resolving the escape sequences in partition, e.g. America/Los_Angeles. |
useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
Logger Sink
Logs event at INFO level. Typically useful for testing/debugging purpose. Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be logger |
maxBytesToLog | 16 | Maximum number of bytes of the Event body to log |
Example for agent named a1:
Avro Sink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be avro. |
hostname | – | The hostname or IP address to bind to. |
port | – | The port # to listen on. |
batch-size | 100 | Number of events to batch together for sending. |
connect-timeout | 20000 | Amount of time (ms) to allow for the first (handshake) request. |
request-timeout | 20000 | Amount of time (ms) to allow for requests after the first. |
reset-connection-interval | none | Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent. |
compression-type | none | This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource |
compression-level | 6 | The level of compression to compress event. 0 = no compression and 1-9 is compression. The higher the number the more compression |
ssl | false | Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a “truststore”, “truststore-password”, “truststore-type”, and specify whether to “trust-all-certs”. |
trust-all-certs | false | If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and “listen in” on the encrypted connection. |
truststore | – | The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source’s SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically “jssecacerts” or “cacerts” in the Oracle JRE) will be used. |
truststore-password | – | The password for the specified truststore. |
truststore-type | JKS | The type of the Java truststore. This can be “JKS” or other supported Java truststore type. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
maxIoWorkers | 2 * the number of available processors in the machine | The maximum number of I/O worker threads. This is configured on the NettyAvroRpcClient NioClientSocketChannelFactory. |
Example for agent named a1:
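A sketch of an Avro sink pointing at a next-hop Avro source (hostname and port are placeholders):
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545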
Thrift Sink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be thrift. |
hostname | – | The hostname or IP address to bind to. |
port | – | The port # to listen on. |
batch-size | 100 | Number of events to batch together for sending. |
connect-timeout | 20000 | Amount of time (ms) to allow for the first (handshake) request. |
request-timeout | 20000 | Amount of time (ms) to allow for requests after the first. |
connection-reset-interval | none | Amount of time (s) before the connection to the next hop is reset. This will force the Thrift Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent. |
ssl | false | Set to true to enable SSL for this ThriftSink. When configuring SSL, you can optionally set a “truststore”, “truststore-password” and “truststore-type” |
truststore | – | The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Thrift Source’s SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically “jssecacerts” or “cacerts” in the Oracle JRE) will be used. |
truststore-password | – | The password for the specified truststore. |
truststore-type | JKS | The type of the Java truststore. This can be “JKS” or other supported Java truststore type. |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude |
kerberos | false | Set to true to enable kerberos authentication. In kerberos mode, client-principal, client-keytab and server-principal are required for successful authentication and communication to a kerberos enabled Thrift Source. |
client-principal | —- | The kerberos principal used by the Thrift Sink to authenticate to the kerberos KDC. |
client-keytab | —- | The keytab location used by the Thrift Sink in combination with the client-principal to authenticate to the kerberos KDC. |
server-principal | – | The kerberos principal of the Thrift Source to which the Thrift Sink is configured to connect to. |
Example for agent named a1:
IRC Sink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be irc |
hostname | – | The hostname or IP address to connect to |
port | 6667 | The port number of remote host to connect |
nick | – | Nick name |
user | – | User name |
password | – | User password |
chan | – | channel |
name | ||
splitlines | – | (boolean) |
splitchars | n | line separator (if you were to enter the default value into the config file, then you would need to escape the backslash, like this: “\n”) |
Example for agent named a1:
File Roll Sink
Stores events on the local filesystem.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be file_roll. |
sink.directory | – | The directory where files will be stored |
sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface. |
batchSize | 100 |
Example for agent named a1:
Null Sink
Discards all events it receives from the channel. Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be null. |
batchSize | 100 |
Example for agent named a1:
HBaseSinks
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hbase |
table | – | The name of the table in Hbase to write to. |
columnFamily | – | The column family in Hbase to write to. |
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml |
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml |
batchSize | 100 | Number of events to be written per txn. |
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells. |
serializer | org.apache.flume.sink.hbase.SimpleHbaseEventSerializer | Default increment column = “iCol”, payload column = “pCol”. |
serializer.* | – | Properties to be passed to the serializer. |
kerberosPrincipal | – | Kerberos user principal for accessing secure HBase |
kerberosKeytab | – | Kerberos keytab for accessing secure HBase |
Example for agent named a1:
AsyncHBaseSink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be asynchbase |
table | – | The name of the table in Hbase to write to. |
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml |
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml |
columnFamily | – | The column family in Hbase to write to. |
batchSize | 100 | Number of events to be written per txn. |
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells. |
timeout | 60000 | The length of time (in milliseconds) the sink waits for acks from hbase for all events in a transaction. |
serializer | org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer | |
serializer.* | – | Properties to be passed to the serializer. |
MorphlineSolrSink
This sink extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be org.apache.flume.sink.solr.morphline.MorphlineSolrSink |
morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example:/etc/flume-ng/conf/morphline.conf |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |
batchSize | 1000 | The maximum number of events to take per flume transaction. |
batchDurationMillis | 1000 | The maximum duration per flume transaction (ms). The transaction commits after this duration or when batchSize is exceeded, whichever comes first. |
handlerClass | org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl | The FQCN of a class implementing org.apache.flume.sink.solr.morphline.MorphlineHandler |
isProductionMode | false | This flag should be enabled for mission critical, large-scale online production systems that need to make progress without downtime when unrecoverable exceptions occur. Corrupt or malformed parser input data, parser bugs, and errors related to unknown Solr schema fields produce unrecoverable exceptions. |
recoverableExceptionClasses | org.apache.solr.client.solrj.SolrServerException | Comma separated list of recoverable exceptions that tend to be transient, in which case the corresponding task can be retried. Examples include network connection errors, timeouts, etc. When the production mode flag is set to true, the recoverable exceptions configured using this parameter will not be ignored and hence will lead to retries. |
isIgnoringRecoverableExceptions | false | This flag should be enabled, if an unrecoverable exception is accidentally misclassified as recoverable. This enables the sink to make progress and avoid retrying an event forever. |
Example for agent named a1:
ElasticSearchSink
This sink writes data to an elasticsearch cluster. By default, events will be written so that the Kibana graphical interface can display them - just as if logstash wrote them. The elasticsearch and lucene-core jars required for your environment must be placed in the lib directory of the Apache Flume installation.
Events are serialized for elasticsearch by the ElasticSearchLogStashEventSerializer by default. This behaviour can be overridden with the serializer parameter.
The type is the FQCN: org.apache.flume.sink.elasticsearch.ElasticSearchSink
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be org.apache.flume.sink.elasticsearch.ElasticSearchSink |
hostNames | – | Comma separated list of hostname:port, if the port is not present the default port ‘9300’ will be used |
indexName | flume | The name of the index which the date will be appended to. Example ‘flume’ -> ‘flume-yyyy-MM-dd’ Arbitrary header substitution is supported, eg. %{header} replaces with value of named event header |
indexType | logs | The type to index the document to, defaults to ‘log’ Arbitrary header substitution is supported, eg. %{header} replaces with value of named event header |
clusterName | elasticsearch | Name of the ElasticSearch cluster to connect to |
batchSize | 100 | Number of events to be written per txn. |
ttl | – | TTL in days, when set will cause the expired documents to be deleted automatically, if not set documents will never be automatically deleted. TTL is accepted both in the earlier form of integer only e.g. a1.sinks.k1.ttl = 5 and also with a qualifier ms (millisecond), s (second), m (minute), h (hour), d (day) and w (week). Example a1.sinks.k1.ttl = 5d will set TTL to 5 days. Follow http://www.elasticsearch.org/guide/reference/mapping/ttl-field/ for more information. |
serializer | org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer | The ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use. Implementations of either class are accepted but ElasticSearchIndexRequestBuilderFactory is preferred. |
serializer.* | – | Properties to be passed to the serializer. |
Example for agent named a1:
Kafka Sink
This is a Flume Sink implementation that can publish data to a Kafka topic.
Property Name | Default | Description |
---|---|---|
type | – | Must be set to org.apache.flume.sink.kafka.KafkaSink |
brokerList | – | List of brokers Kafka-Sink will connect to, to get the list of topic partitions. This can be a partial list of brokers, but we recommend at least two for HA. The format is a comma separated list of hostname:port |
topic | default-flume-topic | The topic in Kafka to which the messages will be published. If this parameter is configured, messages will be published to this topic. If the event header contains a “topic” field, the event will be published to that topic overriding the topic configured here. |
batchSize | 100 | How many messages to process in one batch. Larger batches improve throughput while adding latency. |
requiredAcks | 1 | How many replicas must acknowledge a message before it's considered successfully written. Accepted values are 0 (never wait for acknowledgement), 1 (wait for leader only), -1 (wait for all replicas). Set this to -1 to avoid data loss in some cases of leader failure. |
Other Kafka Producer Properties | – | These properties are used to configure the Kafka Producer. Any producer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.. For example: kafka.producer.type |
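A hedged sketch of a Kafka sink configuration (broker list and topic name are placeholders):
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.channel = c1
a1.sinks.k1.brokerList = kafka-1:9092,kafka-2:9092
a1.sinks.k1.topic = flume-logs
a1.sinks.k1.batchSize = 100
a1.sinks.k1.requiredAcks = 1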
Custom Sink
A custom sink is your own implementation of the Sink interface. A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom sink is its FQCN.
Property Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be your FQCN |
Example for agent named a1:
Flume Channels
Memory Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be memory |
capacity | 100 | The maximum number of events stored in the channel |
transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction |
keep-alive | 3 | Timeout in seconds for adding or removing an event |
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB. |
Example for agent named a1:
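A minimal memory channel sketch (capacity values are illustrative):
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000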
JDBC Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be jdbc |
db.type | DERBY | Database vendor, needs to be DERBY. |
driver.class | org.apache.derby.jdbc.EmbeddedDriver | Class for vendor’s JDBC driver |
driver.url | (constructed from other properties) | JDBC connection URL |
db.username | “sa” | User id for db connection |
db.password | – | password for db connection |
connection.properties.file | – | JDBC Connection property file path |
create.schema | true | If true, then creates db schema if not there |
create.index | true | Create indexes to speed up lookups |
create.foreignkey | true | |
transaction.isolation | “READ_COMMITTED” | Isolation level for db session READ_UNCOMMITTED, READ_COMMITTED, SERIALIZABLE, REPEATABLE_READ |
maximum.connections | 10 | Max connections allowed to db |
maximum.capacity | 0 (unlimited) | Max number of events in the channel |
sysprop.* | – | DB Vendor specific properties |
sysprop.user.home | – | Home path to store embedded Derby database |
Example for agent named a1:
Kafka Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be org.apache.flume.channel.kafka.KafkaChannel |
brokerList | – | List of brokers in the Kafka cluster used by the channel. This can be a partial list of brokers, but we recommend at least two for HA. The format is a comma separated list of hostname:port |
zookeeperConnect | – | URI of ZooKeeper used by the Kafka cluster. The format is a comma separated list of hostname:port. If chroot is used, it is added once at the end. For example: zookeeper-1:2181,zookeeper-2:2182,zookeeper-3:2181/kafka |
topic | flume-channel | Kafka topic which the channel will use |
groupId | flume | Consumer group ID the channel uses to register with Kafka. Multiple channels must use the same topic and group to ensure that when one agent fails another can get the data. Note that having non-channel consumers with the same ID can lead to data loss. |
parseAsFlumeEvent | true | Expecting Avro datums with FlumeEvent schema in the channel. This should be true if a Flume source is writing to the channel, and false if other producers are writing into the topic that the channel is using. Flume source messages to Kafka can be parsed outside of Flume by using org.apache.flume.source.avro.AvroFlumeEvent provided by the flume-ng-sdk artifact |
readSmallestOffset | false | When set to true, the channel will read all data in the topic, starting from the oldest event; when false, it will read only events written after the channel started. When “parseAsFlumeEvent” is true, this will be false: the Flume source will start prior to the sinks, and this guarantees that events sent by the source before the sinks start will not be lost. |
Other Kafka Properties | – | These properties are used to configure the Kafka Producer and Consumer used by the channel. Any property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.. For example: kafka.producer.type |
Example for agent named a1:
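A sketch of a Kafka channel (broker and ZooKeeper addresses and the topic name are placeholders):
a1.channels = channel1
a1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.channel1.brokerList = kafka-1:9092,kafka-2:9092,kafka-3:9092
a1.channels.channel1.zookeeperConnect = zk-1:2181,zk-2:2181,zk-3:2181
a1.channels.channel1.topic = channel1
a1.channels.channel1.groupId = flume
a1.channels.channel1.parseAsFlumeEvent = true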
File Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be file. |
checkpointDir | ~/.flume/file-channel/checkpoint | The directory where checkpoint file will be stored |
useDualCheckpoints | false | Backup the checkpoint. If this is set to true, backupCheckpointDir must be set |
backupCheckpointDir | – | The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory |
dataDirs | ~/.flume/file-channel/data | Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance |
transactionCapacity | 10000 | The maximum size of transaction supported by the channel |
checkpointInterval | 30000 | Amount of time (in millis) between checkpoints |
maxFileSize | 2146435071 | Max size (in bytes) of a single log file |
minimumRequiredSpace | 524288000 | Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value |
capacity | 1000000 | Maximum capacity of the channel |
keep-alive | 3 | Amount of time (in sec) to wait for a put operation |
use-log-replay-v1 | false | Expert: Use old replay logic |
use-fast-replay | false | Expert: Replay without using queue |
checkpointOnClose | true | Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay. |
encryption.activeKey | – | Key name used to encrypt new data |
encryption.cipherProvider | – | Cipher provider type, supported types: AESCTRNOPADDING |
encryption.keyProvider | – | Key provider type, supported types: JCEKSFILE |
encryption.keyProvider.keyStoreFile | – | Path to the keystore file |
encryption.keyProvider.keyStorePasswordFile | – | Path to the keystore password file |
encryption.keyProvider.keys | – | List of all keys (e.g. history of the activeKey setting) |
encryption.keyProvider.keys.*.passwordFile | – | Path to the optional key password file |
Example for agent named a1:
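A minimal file channel sketch (checkpoint and data directories are placeholders; ideally they live on separate disks):
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data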
Encryption
Generating a key with a password separate from the key store password:
Generating a key with the password the same as the key store password:
Let's say you have aged key-0 out and new files should be encrypted with key-1:
The same scenario as above, however key-0 has its own password:
Spillable Memory Channel
The events are stored in an in-memory queue and on disk. The in-memory queue serves as the primary store and the disk as overflow.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be SPILLABLEMEMORY |
memoryCapacity | 10000 | Maximum number of events stored in memory queue. To disable use of in-memory queue, set this to zero. |
overflowCapacity | 100000000 | Maximum number of events stored in overflow disk (i.e File channel). To disable use of overflow, set this to zero. |
overflowTimeout | 3 | The number of seconds to wait before enabling disk overflow when memory fills up. |
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
byteCapacity | see description | Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB. |
avgEventSize | 500 | Estimated average size of events, in bytes, going into the channel |
<file channel properties> | see file channel | Any file channel property with the exception of ‘keep-alive’ and ‘capacity’ can be used. The keep-alive of file channel is managed by Spillable Memory Channel. Use ‘overflowCapacity’ to set the File channel’s capacity. |
In-memory queue is considered full if either memoryCapacity or byteCapacity limit is reached.
Example for agent named a1:
To disable the use of the in-memory queue and function like a file channel:
To disable the use of overflow disk and function purely as a in-memory channel:
Custom Channel
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be a FQCN |
Example for agent named a1:
Flume Channel Selectors
Replicating Channel Selector (default)
Property Name | Default | Description |
---|---|---|
selector.type | replicating | The component type name, needs to be replicating |
selector.optional | – | Set of channels to be marked as optional |
Example for agent named a1 and it’s source called r1:
Multiplexing Channel Selector
Property Name | Default | Description |
---|---|---|
selector.type | replicating | The component type name, needs to be multiplexing |
selector.header | flume.selector.header | |
selector.default | – | |
selector.mapping.* | – |
Example for agent named a1 and it’s source called r1:
Custom Channel Selector
A custom channel selector's class and its dependencies must be included in the agent's classpath when starting the Flume agent; its type is the selector's FQCN.
Flume Sink Processors
Sink groups allow users to group multiple sinks into one entity; the sink processor determines how events are routed to the sinks in the group.
Property Name | Default | Description |
---|---|---|
sinks | – | Space-separated list of sinks that are participating in the group |
processor.type | default | The component type name, needs to be default, failover or load_balance |
Example for agent named a1:
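For instance, a sink group wired to a load-balancing processor might look like this, where g1, k1 and k2 are placeholder names:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance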
Default Sink Processor
Default sink processor accepts only a single sink. The user is not forced to create a processor (sink group) for a single sink.
Failover Sink Processor
The sinks have a priority associated with them; the larger the number, the higher the priority. If a sink fails while sending an event, the next sink with the highest priority is tried for sending events.
Property Name | Default | Description |
---|---|---|
sinks | – | Space-separated list of sinks that are participating in the group |
processor.type | default | The component type name, needs to be failover |
processor.priority.<sinkName> | – | Priority value. <sinkName> must be one of the sink instances associated with the current sink group. A higher priority value sink gets activated earlier. A larger absolute value indicates higher priority. |
processor.maxpenalty | 30000 | The maximum backoff period for the failed Sink (in millis) |
Example for agent named a1:
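A failover sketch, assuming two placeholder sinks k1 and k2; k2 has the higher priority, so it is tried first and k1 takes over if k2 fails:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Cap the backoff period for a failed sink at 10 seconds
a1.sinkgroups.g1.processor.maxpenalty = 10000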
Load balancing Sink Processor
Load balancing sink processor provides the ability to load-balance flow over multiple sinks, distributing events via round_robin or random selection mechanisms.
Property Name | Default | Description |
---|---|---|
processor.sinks | – | Space-separated list of sinks that are participating in the group |
processor.type | default | The component type name, needs to be load_balance |
processor.backoff | false | Should failed sinks be backed off exponentially. |
processor.selector | round_robin | Selection mechanism. Must be either round_robin, random or FQCN of custom class that inherits from AbstractSinkSelector |
processor.selector.maxTimeOut | 30000 | Used by backoff selectors to limit exponential backoff (in milliseconds) |
Example for agent named a1:
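A load-balancing sketch with placeholder sinks k1 and k2, using random selection and exponential backoff for failed sinks:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random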
Custom Sink Processor
Custom sink processors are not supported at the moment.
Flume Interceptors
Flume has the capability to modify/drop events in-flight. Interceptors are classes that implement the org.apache.flume.interceptor.Interceptor interface, and they are invoked in the order in which they are listed on the source. In the example below, events are passed to the HostInterceptor first and the events returned by the HostInterceptor are then passed along to the TimestampInterceptor. You can specify either the fully qualified class name (FQCN) or the alias timestamp. If you have multiple collectors writing to the same HDFS path, then you could also use the HostInterceptor.
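A minimal sketch of such a chain, using the host and timestamp aliases; the agent, source and channel names are placeholders:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
# i1 runs first and adds the host header, then i2 adds the timestamp header
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i2.type = timestamp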
Timestamp Interceptor
This interceptor inserts a header with key timestamp whose value is the relevant timestamp. This interceptor can preserve an existing timestamp if it is already present in the configuration.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, has to be timestamp or the FQCN |
preserveExisting | false | If the timestamp already exists, should it be preserved - true or false |
Example for agent named a1:
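A sketch with placeholder names, here attached to a simple seq source:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
# Adds a timestamp header with the event's processing time in milliseconds
a1.sources.r1.interceptors.i1.type = timestamp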
Host Interceptor
It inserts a header with key host or a configured key whose value is the hostname or IP address of the host, based on configuration.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, has to be host |
preserveExisting | false | If the host header already exists, should it be preserved - true or false |
useIP | true | Use the IP Address if true, else use hostname. |
hostHeader | host | The header key to be used. |
Example for agent named a1:
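A sketch with placeholder names that stores the hostname rather than the IP under a hostname header:
a1.sources = r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = hostname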
Static Interceptor
Static interceptor allows user to append a static header with static value to all events. The current implementation does not allow specifying multiple headers at one time.
Property Name | Default | Description |
---|---|---|
type | – | The component type name, has to be static |
preserveExisting | true | If configured header already exists, should it be preserved - true or false |
key | key | Name of header that should be created |
value | value | Static value that should be created |
Example for agent named a1:
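A sketch where the header key datacenter and value NEW_YORK are placeholder choices:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
# Every event gets the header datacenter=NEW_YORK
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK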
UUID Interceptor
This interceptor sets a universally unique identifier on all events that are intercepted.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder |
headerName | id | The name of the Flume header to modify |
preserveExisting | true | If the UUID header already exists, should it be preserved - true or false |
prefix | “” | The prefix string constant to prepend to each generated UUID |
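An illustrative sketch; the header name eventId and the prefix flume- are arbitrary placeholder values:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i1.headerName = eventId
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.prefix = flume-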
Morphline Interceptor
This interceptor filters the events through a morphline configuration file that defines a chain of transformation commands that pipe records from one command to another.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder |
morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |
Sample flume.conf file:
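A sketch in which the source name avroSrc, the morphline file path and the id morphline1 are placeholders:
a1.sources.avroSrc.interceptors = morphlineinterceptor
a1.sources.avroSrc.interceptors.morphlineinterceptor.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile = /etc/flume-ng/conf/morphline.conf
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1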
Search and Replace Interceptor
This interceptor provides simple string-based search-and-replace functionality based on Java regular expressions.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be search_replace |
searchPattern | – | The pattern to search for and replace. |
replaceString | – | The replacement string. |
charset | UTF-8 | The charset of the event body. Assumed by default to be UTF-8. |
Example configuration:
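For instance, a configuration along these lines strips leading alphanumeric characters from each event body (the source name avroSrc is a placeholder):
a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Remove leading alphanumeric characters in an event body
a1.sources.avroSrc.interceptors.search-replace.searchPattern = ^[A-Za-z0-9_]+
a1.sources.avroSrc.interceptors.search-replace.replaceString =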
Another example:
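A second sketch using capture groups to reorder words in the body (names and text are placeholders):
a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Use grouping operators to reorder words on a line
a1.sources.avroSrc.interceptors.search-replace.searchPattern = The quick brown ([a-z]+) jumped over the lazy ([a-z]+)
a1.sources.avroSrc.interceptors.search-replace.replaceString = The hungry $2 ate the careless $1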
Regex Filtering Interceptor
This interceptor filters events selectively by interpreting the event body as text and matching the text against a configured regular expression, which can be used to either include or exclude events.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be regex_filter |
regex | ”.*” | Regular expression for matching against events |
excludeEvents | false | If true, regex determines events to exclude, otherwise regex determines events to include. |
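An illustrative sketch that drops events whose body starts with DEBUG; the pattern is a placeholder:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^DEBUG
# excludeEvents=true means matching events are dropped
a1.sources.r1.interceptors.i1.excludeEvents = true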
Regex Extractor Interceptor
This interceptor extracts regex match groups using a specified regular expression and appends the match groups as headers on the event, using pluggable serializers to format the match groups before they are added.
Property Name | Default | Description |
---|---|---|
type | – | The component type name has to be regex_extractor |
regex | – | Regular expression for matching against events |
serializers | – | Space-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer, org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer |
serializers.<s1>.type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer |
serializers.<s1>.name | – | |
serializers.* | – | Serializer-specific properties |
Example 1:
If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used
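A configuration along these lines, with placeholder names, would produce the headers described next:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
# Each capture group is serialized by the default pass-through serializer under the given header name
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three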
The extracted event will contain the same body but the following headers will have been added: one=>1, two=>2, three=>3
Example 2:
If the Flume event body contained 2012-10-18 18:47:57,614 some log line and the following configuration was used
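A sketch with placeholder names that parses the leading date into a millisecond timestamp header via the Millis serializer:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
# Pattern used to parse the captured group into epoch milliseconds
a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm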
The extracted event will contain the same body but the following header will have been added: timestamp=>1350611220000