Flume User Guide - Part 2

Flume Source

Avro Source

Listens on an Avro port and receives events from external Avro clients. When the previous agent's sink is of type avro, this can be used to build multi-tier agent topologies. Required properties are in bold.

Property Name Default Description
channels  
type The component type name, needs to be avro
bind hostname or IP address to listen on
port Port # to bind to
threads Maximum number of worker threads to spawn
selector.type    
selector.*    
interceptors Space-separated list of interceptors
interceptors.*    
compression-type none This can be “none” or “deflate”. The compression-type must match the compression-type of the matching AvroSink
ssl false Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”.
keystore This is the path to a Java keystore file. Required for SSL.
keystore-password The password for the Java keystore. Required for SSL.
keystore-type JKS The type of the Java keystore. This can be “JKS” or “PKCS12”.
exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
ipFilter false Set this to true to enable ipFiltering for netty
ipFilterRules Define N netty ipFilter pattern rules with this config.
Simple example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
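To illustrate the SSL properties listed in the table above, the example can be extended as sketched below; the keystore path and password are placeholders, not values from this guide:

```properties
a1.sources.r1.ssl = true
# hypothetical keystore location and password -- substitute your own
a1.sources.r1.keystore = /path/to/keystore.jks
a1.sources.r1.keystore-password = changeit
a1.sources.r1.keystore-type = JKS
```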
ipFilterRules example

ipFilterRules can define N netty-style ipFilter rules, comma-separated. Each rule must match the format below:

<'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern>, i.e. allow/deny:ip/name:pattern

Example: ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:*

Explanation:

"allow:name:localhost,deny:ip:*" allows clients on localhost and denies clients from all other IPs.

"deny:name:localhost,allow:ip:*" denies clients on localhost and allows clients from all other IPs.
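Putting this together with the earlier Avro source example, enabling IP filtering could be sketched as:

```properties
a1.sources.r1.ipFilter = true
a1.sources.r1.ipFilterRules = allow:ip:127.*,allow:name:localhost,deny:ip:*
```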

Thrift Source

The same notes as for the Avro source above apply. The Thrift source can be started in secure mode by enabling Kerberos authentication; agent-principal and agent-keytab are the properties used to configure it. Required properties are in bold.


Simple example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
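As a sketch of the Kerberos setup mentioned above, the secure-mode properties might be configured as follows; the principal and keytab path are placeholders for your own values:

```properties
a1.sources.r1.kerberos = true
# hypothetical principal and keytab -- substitute your own
a1.sources.r1.agent-principal = flume/_HOST@EXAMPLE.COM
a1.sources.r1.agent-keytab = /etc/flume/flume.keytab
```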
Exec Source

The exec source runs the given Unix command at startup and expects it to continuously produce data on standard output (stderr is discarded unless logStdErr is set to true). If the process exits for any reason, the source exits as well and produces no further data. This means commands such as cat [named pipe] or tail -F [file] give the desired continuous stream, while date does not: the first two produce an unbounded stream of data, whereas the latter emits a single record and exits.

Property Name Default Description
channels  
type The component type name, needs to be exec
command The command to execute
shell A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle 10000 Amount of time (in millis) to wait before attempting a restart
restart false Whether the executed cmd should be restarted if it dies
logStdErr false Whether the command’s stderr should be logged
batchSize 20 The max number of lines to read and send to the channel at a time
batchTimeout 3000 Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream
selector.type replicating replicating or multiplexing
selector.*   Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*    
Warning: omitted.

Note: with tail -F [file], the -F flag is preferable because it keeps working after the file rolls over (e.g., when a new log file is created each day, it picks up reading from the new file).

Simple example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
The shell property selects the shell used to invoke the command (e.g., bash or PowerShell); as in ordinary shell scripting, a different shell can be chosen to take advantage of its particular features.

Common values are '/bin/sh -c', '/bin/ksh -c', 'cmd /c', 'powershell -Command', etc.

a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
JMS Source

The JMS source reads messages from a queue or topic. As a JMS application it should work with any JMS provider, but so far it has only been tested with ActiveMQ. The JMS source offers the configurable properties listed below. The vendor-provided JMS jars can be added to the Flume classpath in one of three ways:

the plugins.d directory (recommended), -classpath on the command line, or the FLUME_CLASSPATH variable in flume-env.sh. Required properties are in bold.

Property Name Default Description
channels  
type The component type name, needs to be jms
initialContextFactory Initial Context Factory, e.g: org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory The JNDI name the connection factory should appear as
providerURL The JMS provider URL
destinationName Destination name
destinationType Destination type (queue or topic)
messageSelector Message selector to use when creating the consumer
userName Username for the destination/provider
passwordFile File containing the password for the destination/provider
batchSize 100 Number of messages to consume in one batch
converter.type DEFAULT Class to use to convert messages to flume events. See below.
converter.* Converter properties.
converter.charset UTF-8 Default converter only. Charset to use when converting JMS TextMessages to byte arrays.
Message converters

The JMS source supports pluggable converters, though the default converter is sufficient in most cases. JMS message properties are added to the headers of the resulting Flume event.

ByteMessage

The bytes are copied into the event body (at most 2 GB per message).

TextMessage

The text is converted to a byte array and copied into the event body. UTF-8 is the default charset; it is configurable.

ObjectMessage

The object is serialized through a ByteArrayOutputStream and the resulting bytes are copied into the event body.

Simple example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE
Spooling Directory Source

This source ingests data from files dropped into a "spooling" directory on disk. It watches the directory for new files and creates events as new files arrive. Once a file has been fully read into the channel, the follow-up handling is configurable: the file can be renamed or deleted.

Remaining details omitted.
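Although the remaining options are omitted, a minimal configuration for this source in the same style as the other examples might look like this; the spool directory path below is a placeholder:

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
# hypothetical spool directory -- substitute your own
a1.sources.r1.spoolDir = /var/log/flume-spool
a1.sources.r1.fileHeader = true
```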

Twitter 1% firehose Source (experimental/unstable)

Kafka Source

The Kafka source is a Kafka consumer that reads messages from a topic. If you run multiple Kafka sources, configure them with the same consumer group ID so that they form one consumer group.

Property Name Default Description
channels  
type The component type name, needs to be org.apache.flume.source.kafka.KafkaSource
zookeeperConnect URI of ZooKeeper used by Kafka cluster
groupId flume Unique identified of consumer group. Setting the same id in multiple sources or agents indicates that they are part of the same consumer group
topic Kafka topic we’ll read messages from. At this time, this is a single topic only.
batchSize 1000 Maximum number of messages written to Channel in one batch
batchDurationMillis 1000 Maximum time (in ms) before a batch will be written to Channel The batch will be written whenever the first of size and time will be reached.
backoffSleepIncrement 1000 Initial and incremental wait time that is triggered when a Kafka Topic appears to be empty. Wait period will reduce aggressive pinging of an empty Kafka Topic. One second is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors.
maxBackoffSleep 5000 Maximum wait time that is triggered when a Kafka Topic appears to be empty. Five seconds is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors.
Other Kafka Consumer Properties These properties are used to configure the Kafka Consumer. Any consumer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.. For example: kafka.consumer.timeout.ms Check Kafka documentation <https://kafka.apache.org/08/configuration.html#consumerconfigs> for details
Note: the Kafka source overrides two Kafka consumer parameters. auto.commit.enable is set to false by the source, which commits every batch itself. For better performance you can set this to true, but that can lead to data loss on the consumer side. consumer.timeout.ms is set to 10, so when polling Kafka for new data the source waits at most 10 ms. Raising this value lowers CPU utilization (the source polls Kafka in a less tight loop) but also adds latency to writing batches into the channel (the source waits longer for data to arrive).

Simple example

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = test1
tier1.sources.source1.groupId = flume
tier1.sources.source1.kafka.consumer.timeout.ms = 100
Netcat Source

A netcat-like source that listens on a given port and turns each line of text it receives into an event, similar to nc -k -l [host] [port]. In other words, it opens the specified port and listens for data; the expected input is newline-separated lines of text, and each line is turned into a Flume event and sent through the connected channel.

Property Name Default Description
channels  
type The component type name, needs to be netcat
bind Host name or IP address to bind to
port Port # to bind to
max-line-length 512 Max line length per event body (in bytes)
ack-every-event true Respond with an “OK” for every event received
selector.type replicating replicating or multiplexing
selector.*   Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*    

Simple example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
Sequence Generator Source

A counter that starts at 0, increments by 1, and continuously generates events. Mainly used for testing.

Property Name Default Description
channels  
type The component type name, needs to be seq
selector.type replicating replicating or multiplexing
selector.*   Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*    
batchSize 1  

Simple example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.channels = c1
Syslog Source

The UDP source treats an entire received message as a single event, while the TCP source creates a new event for each line of text (delimited by “\n”).
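No example is given for this source above; a minimal TCP syslog configuration, following the pattern of the other examples (the port is chosen arbitrarily), might look like:

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.channels = c1
```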

HTTP Source

Receives Flume events via HTTP POST or GET; GET should only be used for experimentation. HTTP requests are converted into Flume events by a pluggable "handler" that implements the HTTPSourceHandler interface: the handler takes an HttpServletRequest and returns a list of Flume events. All events parsed from one HTTP request are written to the channel in a single transaction, which improves efficiency on channels such as the file channel. If the handler throws an exception, the source returns HTTP status 400. If the channel is full, or the source is otherwise unable to append events to the channel, the source returns HTTP status 503.

All events in one POST request are treated as one batch and written to the channel in one transaction.

Property Name Default Description
type   The component type name, needs to be http
port The port the source should bind to.
bind 0.0.0.0 The hostname or IP address to listen on
handler org.apache.flume.source.http.JSONHandler The FQCN of the handler class.
handler.* Config parameters for the handler
selector.type replicating replicating or multiplexing
selector.*   Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*    
enableSSL false Set the property true, to enable SSL. HTTP Source does not support SSLv3.
excludeProtocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded.
keystore   Location of the keystore including keystore file name
keystorePassword Keystore password

Simple example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props
JSONHandler

The posted data must be a JSON array, even when it carries only a single event. A sample payload:

[{
  "headers" : {
             "timestamp" : "434324343",
             "host" : "random_host.example.com"
             },
  "body" : "random_body"
  },
  {
  "headers" : {
             "namenode" : "namenode.example.com",
             "datanode" : "random_datanode.example.com"
             },
  "body" : "really_random_body"
  }]

To test:

curl -X POST -H 'Content-Type: application/json; charset=UTF-8' -d '[{ "headers" : { "timestamp" : "434324343", "host" : "random_host.example.com" }, "body" : "random_body123456789" }, { "headers" : { "namenode" : "namenode.example.com", "datanode" : "random_datanode.example.com" }, "body" : "really_random_body" }]' http://localhost:44444/

The agent console then shows:

2016-05-26 17:29:12,494 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{timestamp=434324343, host=random_host.example.com} body: 72 61 6E 64 6F 6D 5F 62 6F 64 79 31 32 33 34 35 random_body12345 }
2016-05-26 17:29:12,494 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{namenode=namenode.example.com, datanode=random_datanode.example.com} body: 72 65 61 6C 6C 79 5F 72 61 6E 64 6F 6D 5F 62 6F really_random_bo }

The printed body is shorter than what was sent because LoggerSink limits how many bytes of the body it logs; the default is 16, and it is configurable.


BlobHandler

Handles binary payloads such as PDFs or JPGs. It is limited by available memory, because each payload is buffered entirely in RAM.
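A sketch of wiring BlobHandler into the HTTP source; the FQCN below is the one shipped in Flume's morphline-solr-sink module, which must be on the agent's classpath:

```properties
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.handler = org.apache.flume.sink.solr.morphline.BlobHandler
```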


Stress Source

Used for stress testing; omitted for now.

Legacy Source

For compatibility with older Flume versions; omitted.

Custom Source

Implement the Source interface yourself. The jar containing your source and its dependency jars must be on the agent's classpath when the agent starts.

Property Name Default Description
channels  
type The component type name, needs to be your FQCN
selector.type replicating replicating or multiplexing
selector.*   Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*    

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.example.MySource
a1.sources.r1.channels = c1

Scribe Source


