Flume: Common Sources, Channels, and Sinks

Flume Core Concepts


Event

  1. An Event is the smallest unit of data that flows through a Flume agent. An Event (represented by the Event interface) flows from a source to a channel and then to a sink.
  2. An Event consists of a payload (a byte array) and an optional set of headers (string attributes).
  3. A Flume agent is a JVM process that controls how Events move from an external source to an external destination.

Source

  1. The purpose of a Source is to receive data from external clients and store it into the configured Channels.
  2. It hands the received data, in Flume's Event format, to one or more channels.
  3. Flume supports multiple ways of receiving data, such as Avro and Thrift.
  4. A source must be associated with at least one channel. Different types of Source:
    Sources that integrate with external systems: Syslog, Netcat, spooling directory
    Sources that generate events themselves: Exec
    IPC sources for agent-to-agent communication: Avro, Thrift

Channel

  1. A channel is a transient store: it buffers the Event-format data received from a source until it is consumed by sinks, acting as a bridge between source and sink.
  2. A channel is transactional, which guarantees consistency between sending and receiving, and it can be connected to any number of sources and sinks.
  3. Supported types include:
    Memory Channel: volatile
    File Channel: based on WAL (Write-Ahead Logging)
    JDBC Channel: based on an embedded database
  4. A channel can work with any number of sources and sinks, but each event in the channel is delivered only once: if sink1 takes an event, sink2 does not take the same one, and vice versa. In the end, the events taken by sink1 plus those taken by sink2 add up to the data in the channel (a wiring sketch illustrating this follows the Sink list below).

Sink

  1. A sink stores the data into a centralized store such as HBase or HDFS.
  2. It consumes data (events) from channels and delivers it to its destination, which may be the next agent in a chain or a terminal store such as HDFS or HBase.
  3. Different types of sink:
    Terminal sinks that store events in their final destination, e.g. HDFS, HBase
    Sinks that simply consume events, e.g. the Null Sink
    IPC sinks for agent-to-agent communication: Avro
    A sink must be attached to exactly one channel
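
To make the wiring concrete, here is a minimal sketch of a single-agent configuration with one netcat source, one memory channel, and two logger sinks draining the same channel, so each event is taken by exactly one of the two sinks. The agent name a1, the component names, and the port are hypothetical.

a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# Netcat source listening on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# In-memory channel acting as the bridge between source and sinks
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Two logger sinks attached to the same channel; each event goes to only one of them
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c1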

Flume Sources

Avro Source

Listens on Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are marked with an asterisk (*).

In short: it listens on an Avro port and receives events from external Avro client streams; you can have it listen on a designated port of the server.

Property Name (* = required) | Default | Description
channels* | – |
type* | – | The component type name, needs to be avro
bind* | – | hostname or IP address to listen on
port* | – | Port # to bind to
threads | – | Maximum number of worker threads to spawn
selector.type | – |
selector.* | – |
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
compression-type | none | This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource
ssl | false | Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”.
keystore | – | This is the path to a Java keystore file. Required for SSL.
keystore-password | – | The password for the Java keystore. Required for SSL.
keystore-type | JKS | The type of the Java keystore. This can be “JKS” or “PKCS12”.
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
ipFilter | false | Set this to true to enable ipFiltering for netty
ipFilterRules | – | Define N netty ipFilter pattern rules with this config.

Example:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
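
When used in the tiered topology described above, the previous-hop agent points an Avro sink at this source's bind address and port. A minimal sketch of that upstream side, assuming a hypothetical agent a2 whose channel c1 feeds the collector host running the source above:

a2.sinks = k1
a2.sinks.k1.type = avro
a2.sinks.k1.channel = c1
# hostname/port must match the bind/port of the downstream Avro source
a2.sinks.k1.hostname = collector-host
a2.sinks.k1.port = 4141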

Exec Source

Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results whereas date will probably not - the former two commands produce streams of data whereas the latter produces a single event and exits.

Used for executing a Linux command.

Property Name (* = required) | Default | Description
channels* | – |
type* | – | The component type name, needs to be exec
command* | – | The command to execute
shell | – | A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle | 10000 | Amount of time (in millis) to wait before attempting a restart
restart | false | Whether the executed cmd should be restarted if it dies
logStdErr | false | Whether the command’s stderr should be logged
batchSize | 20 | The max number of lines to read and send to the channel at a time
batchTimeout | 3000 | Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
# Run the command through a shell invocation
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
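
If the tailed command can die (for example when the process producing the log goes away), the restart and logging properties from the table above help. A sketch with illustrative values, not recommendations:

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
# Restart the command if it exits, waiting 10 s between attempts, and log its stderr
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000
a1.sources.r1.logStdErr = true
# Ship up to 100 lines per batch, or whatever has accumulated after 3 s
a1.sources.r1.batchSize = 100
a1.sources.r1.batchTimeout = 3000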

Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

In short: drop the files to be ingested into the configured spooling directory and the source ingests them as events, renaming each file when it is fully consumed (or, optionally, deleting it).

Property Name (* = required) | Default | Description
channels* | – |
type* | – | The component type name, needs to be spooldir.
spoolDir* | – | The directory from which to read files from.
fileSuffix | .COMPLETED | Suffix to append to completely ingested files
deletePolicy | never | When to delete completed files: never or immediate
fileHeader | false | Whether to add a header storing the absolute path filename.
fileHeaderKey | file | Header key to use when appending absolute path filename to event header.
basenameHeader | false | Whether to add a header storing the basename of the file.
basenameHeaderKey | basename | Header key to use when appending basename of file to event header.
includePattern | ^.*$ | Regular expression specifying which files to include. It can be used together with ignorePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip). It can be used together with includePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
consumeOrder | oldest | The order in which files in the spooling directory will be consumed: oldest, youngest or random. For oldest and youngest, the last modified time of the files is used to compare them; in case of a tie, the file with the smallest lexicographical order is consumed first. For random, any file is picked randomly. When using oldest or youngest, the whole directory is scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep arriving in the spooling directory.
pollDelay | 500 | Delay (in milliseconds) used when polling for new files.
recursiveDirectorySearch | false | Whether to monitor sub directories for new files to read.
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
batchSize | 100 | Granularity at which to batch transfer to the channel
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text.
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the “replacement character” char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence.
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
deserializer.* | – | Varies per event deserializer.
bufferMaxLines | – | (Obsolete) This option is now ignored.
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example:

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
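
A slightly fuller sketch that also uses some of the optional properties from the table above; the directory, the include pattern, and the ordering choices are only examples:

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
# Delete fully ingested files instead of renaming them with .COMPLETED
a1.sources.src-1.deletePolicy = immediate
# Only pick up .log files, including those in sub-directories
a1.sources.src-1.includePattern = ^.*[.]log$
a1.sources.src-1.recursiveDirectorySearch = true
# Consume the oldest files first; keep tracking metadata under the spool dir
a1.sources.src-1.consumeOrder = oldest
a1.sources.src-1.trackerDir = .flumespool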

Flume Channels

Channels are the repositories where the events are staged on an agent. Sources add the events and Sinks remove them.

Memory Channel

The events are stored in an in-memory queue with configurable max size. It’s ideal for flows that need higher throughput and are prepared to lose the staged data in the event of an agent failure. Required properties are marked with an asterisk (*).

Property Name (* = required) | Default | Description
type* | – | The component type name, needs to be memory
capacity | 100 | The maximum number of events stored in the channel
transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction
keep-alive | 3 | Timeout in seconds for adding or removing an event
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.

Example:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
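
The core concepts section also mentions the File Channel, which trades some throughput for durability by persisting events in a write-ahead log on disk. A minimal sketch, with hypothetical directory paths:

a1.channels = c1
a1.channels.c1.type = file
# Where the channel stores its checkpoint and its data (WAL) files
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data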

Flume Sinks

HDFS Sink

Example:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
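
In the example above, the escape sequences in hdfs.path are resolved from the event's timestamp header, and hdfs.round / hdfs.roundValue / hdfs.roundUnit round that timestamp down to the nearest 10 minutes when building the directory. In practice the file-rolling behaviour is usually tuned as well; a sketch with illustrative values:

a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Roll the current file every 10 minutes or at ~128 MB, whichever comes first;
# 0 disables rolling by event count
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Resolve %y-%m-%d etc. from the agent's local time instead of a timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true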

Logger Sink

Logs event at INFO level. Typically useful for testing/debugging purpose. Required properties are marked with an asterisk (*). This sink is the only exception which doesn’t require the extra configuration explained in the Logging raw data section.

Mainly used for testing: the events collected by Flume are printed to the console, in combination with the startup option -Dflume.root.logger=INFO,console (see the startup command sketch after the example below).

Property Name (* = required) | Default | Description
channel* | – |
type* | – | The component type name, needs to be logger
maxBytesToLog | 16 | Maximum number of bytes of the Event body to log

Example:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
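
To see the logged events on the console, start the agent with the flume-ng script and the logger option mentioned above; something like the following, assuming the configuration above is saved as example.conf:

bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console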

Kafka Sink

Property Name (* = required) | Default | Description
type* | – | Must be set to org.apache.flume.sink.kafka.KafkaSink
kafka.bootstrap.servers* | – | List of brokers Kafka-Sink will connect to, to get the list of topic partitions. This can be a partial list of brokers, but we recommend at least two for HA. The format is a comma-separated list of hostname:port
kafka.topic | default-flume-topic | The topic in Kafka to which the messages will be published. If this parameter is configured, messages will be published to this topic. If the event header contains a “topic” field, the event will be published to that topic, overriding the topic configured here.
flumeBatchSize | 100 | How many messages to process in one batch. Larger batches improve throughput while adding latency.
kafka.producer.acks | 1 | How many replicas must acknowledge a message before it is considered successfully written. Accepted values are 0 (never wait for acknowledgement), 1 (wait for leader only), -1 (wait for all replicas). Set this to -1 to avoid data loss in some cases of leader failure.
useFlumeEventFormat | false | By default events are put as bytes onto the Kafka topic directly from the event body. Set to true to store events as the Flume Avro binary format. Used in conjunction with the same property on the KafkaSource or with the parseAsFlumeEvent property on the Kafka Channel, this will preserve any Flume headers for the producing side.
defaultPartitionId | – | Specifies a Kafka partition ID (integer) for all events in this channel to be sent to, unless overridden by partitionIdHeader. By default, if this property is not set, events will be distributed by the Kafka Producer’s partitioner - including by key if specified (or by a partitioner specified by kafka.partitioner.class).
partitionIdHeader | – | When set, the sink will take the value of the field named using the value of this property from the event header and send the message to the specified partition of the topic. If the value represents an invalid partition, an EventDeliveryException will be thrown. If the header value is present then this setting overrides defaultPartitionId.
kafka.producer.security.protocol | PLAINTEXT | Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See below for additional info on secure setup.
more producer security props | – | If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on the producer.
Other Kafka Producer Properties | – | These properties are used to configure the Kafka Producer. Any producer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.producer. For example: kafka.producer.linger.ms
Note: the Kafka Sink uses the topic and key properties from the FlumeEvent headers to send events to Kafka. If topic exists in the headers, the event will be sent to that specific topic, overriding the topic configured for the sink. If key exists in the headers, the key will be used by Kafka to partition the data between the topic partitions. Events with the same key will be sent to the same partition. If the key is null, events will be sent to random partitions.
The Kafka sink also provides defaults for key.serializer (org.apache.kafka.common.serialization.StringSerializer) and value.serializer (org.apache.kafka.common.serialization.ByteArraySerializer). Modification of these parameters is not recommended.

Deprecated Properties

Property Name | Default | Description
brokerList | – | Use kafka.bootstrap.servers
topic | default-flume-topic | Use kafka.topic
batchSize | 100 | Use kafka.flumeBatchSize
requiredAcks | 1 | Use kafka.producer.acks

An example configuration of a Kafka sink is given below. Properties starting with the prefix kafka.producer are passed to the Kafka producer. The properties that are passed when creating the Kafka producer are not limited to the properties given in this example. It is also possible to include your custom properties here and access them inside the preprocessor through the Flume Context object passed in as a method argument.

Official example:

a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
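
As the note above explains, a topic header on an event overrides kafka.topic. One way to set such a header is a static interceptor on the source that feeds this sink; a sketch, where the source name r1, the interceptor name i1, and the topic are hypothetical:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
# Add a "topic" header to every event, so the Kafka sink publishes to other-topic
a1.sources.r1.interceptors.i1.key = topic
a1.sources.r1.interceptors.i1.value = other-topic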

User Guide

http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html

Official Website

http://flume.apache.org/index.html
