Flume Components in Detail: Sources
Flume ships with many concrete source, sink, and interceptor implementations; for the full reference see the official documentation: http://flume.apache.org/FlumeUserGuide.html
1. netcat source
1) How it works
Starts a socket server listening on a port;
data received on the port is converted into events and written to the channel.
2) Configuration file
a1.sources = s1
a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1
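The lines above define only the source; a runnable agent also needs a channel and a sink. A minimal end-to-end sketch, assuming a memory channel c1 and a logger sink k1 (these names are illustrative, not fixed by the original):

```
a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

With the agent running, each line typed into e.g. nc localhost 44444 should appear as one event in the logger output.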
2. exec source
1) How it works
Runs a user-specified Linux shell command;
collects the command's standard output as the data, converting it into events that are written to the channel.
2) Parameter details
Property Name | Default | Description
channels | – | Channel(s) this source writes to
type | – | The component type name: exec
command | – | The command to run, e.g. tail -F /path/file
shell | – | The shell used to invoke the command
restartThrottle | 10000 | Time (ms) to wait before restarting the command after it dies
restart | false | Whether to restart the command after it dies
logStdErr | false | Whether to also collect the command's stderr output
batchSize | 20 | Max number of events per batch committed to the channel
batchTimeout | 3000 | Max time (ms) to wait before pushing an incomplete batch downstream
selector.type | replicating | Channel selector: replicating or multiplexing
selector.* | | Selector-specific parameters
interceptors | – | Space-separated list of interceptors
interceptors.* | | Interceptor-specific parameters
3) Configuration file
a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/weblog/access.log
a1.sources.s1.batchSize = 100
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
4) Start-up test
1. Prepare a log file
2. Write a script that keeps appending data to the log file
for i in {1..10000}; do echo ${i}--------------------------- >> access.log ; sleep 0.5; done
3. Create a custom Flume configuration file
4. Start the Flume agent
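For step 4, the agent can be launched the same way as in the later sections; the config file name myconf/exec-mem-logger.conf below is a hypothetical placeholder for the file created in step 3:

```
bin/flume-ng agent -n a1 -c conf -f myconf/exec-mem-logger.conf -Dflume.root.logger=DEBUG,console
```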
Note: deliberate crash testing shows that the exec source does not record the offset of the data it had collected before going down, so a restart may lose data!
3. spooldir source
1) How it works
Watches a specified directory; whenever new, not-yet-collected files appear there, their contents are collected, converted into events, and written to the channel.
Note: files in the spooling directory must be immutable and must never share a name; otherwise the source will fail loudly!
2) Parameter details
Property Name | Default | Description
channels | – | Channel(s) this source writes to
type | – | The component type name, needs to be spooldir
spoolDir | – | The directory from which to read files
fileSuffix | .COMPLETED | Suffix appended to fully-ingested files
deletePolicy | never | Whether to delete completed files: never or immediate
fileHeader | false | Whether to add a header storing the absolute path of the file
fileHeaderKey | file | Header key for the absolute path above
basenameHeader | false | Whether to add a header storing the file name
basenameHeaderKey | basename | Header key for the file name above
includePattern | ^.*$ | Regex specifying which file names to collect
ignorePattern | ^$ | Regex specifying which file names to exclude; a file matching both includePattern and ignorePattern is not collected
trackerDir | .flumespool | Directory holding tracking metadata; absolute, or relative to spoolDir
trackingPolicy | rename | Tracking policy, "rename" or "tracker_dir"; only takes effect when deletePolicy=never. "rename": completed files are renamed using fileSuffix. "tracker_dir": an empty file of the same name is created in trackerDir
consumeOrder | oldest | Order in which files are consumed: oldest, youngest, or random. oldest and youngest may cost some efficiency, since every file in the directory must be scanned to find the oldest/youngest one
pollDelay | 500 | Delay (in milliseconds) used when polling for new files
recursiveDirectorySearch | false | Whether to monitor subdirectories for new files to read
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source starts at a low backoff and increases it exponentially each time the channel throws a ChannelException, up to this value
batchSize | 100 | Number of events transferred to the channel per batch
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text
decodeErrorPolicy | FAIL | What to do on a non-decodable character in the input file. FAIL: throw an exception and fail to parse the file. REPLACE: substitute the "replacement character", typically Unicode U+FFFD. IGNORE: drop the unparseable character sequence
deserializer | LINE | Deserializer used to parse the file into events; defaults to one event per line. The class specified must implement EventDeserializer.Builder
deserializer.* | | Varies per event deserializer
bufferMaxLines | – | (Obsolete) This option is now ignored
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead
selector.type | replicating | replicating or multiplexing
selector.* | | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | | Interceptor-specific parameters
3) Configuration file:
a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /root/weblog
a1.sources.s1.batchSize = 200
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
4) Start-up test:
bin/flume-ng agent -n a1 -c conf -f myconf/spooldir-mem-logger.conf -Dflume.root.logger=DEBUG,console
Note: unlike the exec source, the spooldir source itself is reliable; it records how far collection had progressed before a crash.
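Because spooled files must be immutable and uniquely named, a safe feeding pattern is to finish writing each file outside the spool directory and then move it in atomically. A minimal sketch, assuming /tmp/spool_demo as the spoolDir (paths and names here are illustrative, not from the original):

```shell
# Write the file somewhere else first, then mv it into the spool directory;
# mv within one filesystem is atomic, so Flume never sees a half-written file.
SPOOL=/tmp/spool_demo
mkdir -p "$SPOOL"
TMP=$(mktemp)
printf 'event line 1\nevent line 2\n' > "$TMP"
mv "$TMP" "$SPOOL/web_$(date +%s%N).log"   # timestamp keeps names unique
ls "$SPOOL"
```

Appending to a file after it has landed in the spool directory violates the immutability rule and makes the source fail.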
4. avro source
The avro source receives data by listening on a network port; the data it accepts must have been serialized with the Avro serialization framework.
Avro is a cross-language serialization framework.
Background: what are serialization and serialization frameworks?
Serialization: turning a data block with complex structure (an object) into a flat (linear) binary sequence.
Serialization framework: ready-made software that converts objects into binary sequences according to a fixed strategy.
For example, the JDK has ObjectOutputStream,
and Hadoop has Writable;
Avro is a cross-platform serialization framework.
1) How it works
Starts a network service listening on a port and collects the Avro-serialized data streams received on it.
The source has an Avro deserializer built in, so it can correctly deserialize the incoming binary stream, wrap it in an event, and write it to the channel.
2) Parameter details
Property Name | Default | Description
channels | – | Channel(s) this source writes to
type | – | The component type name: avro
bind | – | Hostname or IP address to listen on
port | – | Port to bind to
threads | – | Maximum number of worker threads
selector.type | replicating | replicating or multiplexing
selector.* | | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | | Interceptor-specific parameters
compression-type | none | Compression type; must match the compression of the incoming data: none or deflate
ssl | false | Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see SSL/TLS support section).
keystore | – | Path to a Java keystore file. If not specified here, the global keystore will be used (if defined, otherwise configuration error).
keystore-password | – | Password for the Java keystore. If not specified here, the global keystore password will be used (if defined, otherwise configuration error).
keystore-type | JKS | Type of the Java keystore, "JKS" or "PKCS12". If not specified here, the global keystore type will be used (if defined, otherwise the default is JKS).
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded in addition to the protocols specified.
include-protocols | – | Space-separated list of SSL/TLS protocols to include. The enabled protocols are the included protocols minus the excluded protocols. If included-protocols is empty, every supported protocol is included.
exclude-cipher-suites | – | Space-separated list of cipher suites to exclude.
include-cipher-suites | – | Space-separated list of cipher suites to include. The enabled cipher suites are the included ones minus the excluded ones. If included-cipher-suites is empty, every supported cipher suite is included.
ipFilter | false | Set this to true to enable ipFiltering for netty
ipFilterRules | – | Define N netty ipFilter pattern rules with this config.
3) Configuration file
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
4) Start-up test
Start the agent:
bin/flume-ng agent -c ./conf -f ./myconf/avro-mem-logger.conf -n a1 -Dflume.root.logger=DEBUG,console
Use a client to send Avro-serialized data to the running source:
bin/flume-ng avro-client --host c703 --port 4141
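Besides reading from stdin, avro-client can send the contents of a file, one event per line; the file path below is a hypothetical example:

```
bin/flume-ng avro-client --host c703 --port 4141 --filename /root/test.log
```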
5) Agent chaining with avro source and avro sink
(1) Requirements
(2) Configuration files
Upstream configuration file: vi exec-m-avro.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/logs/access.log
a1.sources.r1.batchSize = 100
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = h3
a1.sinks.k1.port = 4455
Downstream configuration file: vi avro-m-log.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4455
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger
(3) Start-up test
Start the downstream agent first: bin/flume-ng agent -n a1 -c conf/ -f avro-m-log.conf -Dflume.root.logger=INFO,console
Then start the upstream agent: bin/flume-ng agent -n a1 -c conf/ -f exec-m-avro.conf
Then run a script on h1 to generate data:
while true; do echo "hello" >> /tmp/logs/access.log; sleep 0.1; done
5. taildir source
1) How it works
Watches a set of files under specified directories; whenever new lines are appended to one of these files, they are tailed.
It records the position tailed in each file in a specified position file, in JSON format (which can be edited by hand if needed, to make the source start reading from any chosen position).
It never modifies the files it has finished collecting (no renaming, no deleting, etc.).
The taildir source updates the recorded offset only after the data has been successfully written to the channel. This mechanism guarantees that no data is missed (lost), but it may produce duplicates!
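The position file mentioned above is a JSON array with one entry per tailed file; a sketch of its typical shape (the inode and pos values here are made up):

```
[{"inode": 532163, "pos": 8342, "file": "/root/weblog/access.log"}]
```

Editing pos by hand before a restart makes the source resume from that byte offset.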
2) Parameter details
Property Name | Default | Description
channels | – | Channel(s) this source writes to
type | – | The component type name: TAILDIR
filegroups | – | Space-separated list of file group names, each representing a set of files, e.g. g1 g2
filegroups.<filegroupName> | – | Absolute path of the file group; the file name part may use regular expressions
positionFile | ~/.flume/taildir_position.json | Path of the file in which tail positions are recorded
headers.<filegroupName>.<headerKey> | – | Header value which is set with the header key. Multiple headers can be specified for one file group.
byteOffsetHeader | false | Whether to add the byte offset of a tailed line to a header called 'byteoffset'.
skipToEnd | false | Whether to skip to EOF for files not recorded in the position file.
idleTimeout | 120000 | Time (ms) after which an inactive file is closed. If new lines are later appended to a closed file, this source detects that and reopens the file
writePosInterval | 3000 | Interval (ms) between writes of the latest position to the position file
batchSize | 100 | Max number of events per batch committed to the channel
maxBatchCount | Long.MAX_VALUE | Max number of consecutive batches read from the same file (if one file is being written at high speed, raise this so the source can concentrate on that file)
backoffSleepIncrement | 1000 | The increment for time delay before reattempting to poll for new data, when the last attempt did not find any new data.
maxBackoffSleep | 5000 | The max time delay between each reattempt to poll for new data, when the last attempt did not find any new data.
cachePatternMatching | true | Listing directories and applying the filename regex pattern may be time consuming for directories containing thousands of files. Caching the list of matching files can improve performance. The order in which files are consumed will also be cached. Requires that the file system keeps track of modification times with at least a 1-second granularity.
fileHeader | false | Whether to add a header storing the absolute path filename.
fileHeaderKey | file | Header key to use when appending the absolute path filename to the event header.
3) Configuration file
a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /root/flumedata/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /root/weblog/access.log
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
4) Start-up test
bin/flume-ng agent -n a1 -c conf/ -f myconf/taildir-mem-logger.conf -Dflume.root.logger=DEBUG,console
Deliberate crash testing shows that this source really is quite reliable:
it does not lose data, though in extreme cases it may produce duplicates!
6. kafka source
1) How it works
The kafka source uses a Kafka consumer to connect to Kafka, reads messages, converts them into events, and writes them to the channel.
2) Parameter details
Property Name | Default | Description
channels | – | Channel(s) this source writes to
type | – | The component type name: org.apache.flume.source.kafka.KafkaSource
kafka.bootstrap.servers | – | Comma-separated list of Kafka brokers
kafka.consumer.group.id | flume | Kafka consumer group id
kafka.topics | – | Comma-separated list of Kafka topics to read from
kafka.topics.regex | – | Regex defining the set of topics to subscribe to; takes precedence over kafka.topics
batchSize | 1000 | Max number of messages per batch written to the channel
batchDurationMillis | 1000 | Max time (ms) before a batch is written to the channel
backoffSleepIncrement | 1000 | Initial and incremental wait time triggered when a Kafka topic appears empty
maxBackoffSleep | 5000 | Maximum wait time triggered when a Kafka topic appears empty
useFlumeEventFormat | false | By default, events are taken from the Kafka topic directly as bytes into the event body. Set to true to read events in the Flume Avro binary format. Combined with the same property on the Kafka sink, or the parseAsFlumeEvent property on the Kafka channel, this preserves any Flume headers set on the producing side.
setTopicHeader | true | Whether to store the topic name in a header
topicHeader | topic | Header key used when setTopicHeader is enabled; the value is the topic name
kafka.consumer.security.protocol | PLAINTEXT | Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See below for additional info on secure setup.
more consumer security props | | If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on the consumer.
Other Kafka Consumer Properties | – | This source allows any Kafka consumer property to be configured directly by prefixing it with kafka.consumer., e.g. kafka.consumer.auto.offset.reset
3) Configuration file
a1.sources = s1
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.channels = c1
a1.sources.s1.batchSize = 100
a1.sources.s1.batchDurationMillis = 2000
a1.sources.s1.kafka.bootstrap.servers = c701:9092,c702:9092,c703:9092
a1.sources.s1.kafka.topics = TAOGE
a1.sources.s1.kafka.consumer.group.id = g1
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
4) Start-up test
(1) First, prepare the topic in Kafka
# List the topics currently in the Kafka cluster:
bin/kafka-topics.sh --list --zookeeper c701:2181
# Create a new topic
bin/kafka-topics.sh --create --topic TAOGE --partitions 3 --replication-factor 2 --zookeeper c701:2181
# Show the topic's details
bin/kafka-topics.sh --describe --topic TAOGE --zookeeper c701:2181
# Console producer: write data into the topic
bin/kafka-console-producer.sh --broker-list c701:9092,c702:9092,c703:9092 --topic TAOGE
(2) Start the Flume agent to collect the data from Kafka
bin/flume-ng agent -n a1 -c conf/ -f myconf/kfk-mem-logger.conf -Dflume.root.logger=INFO,console
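To verify that the agent is actually consuming, the offsets and lag of its consumer group can be inspected with the standard Kafka tool (group id g1 matches the config above; the broker address is an assumption):

```
bin/kafka-consumer-groups.sh --bootstrap-server c701:9092 --describe --group g1
```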
Note:
The batch size the source writes to the channel must be <= the channel's transaction capacity.
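As an illustration of this constraint, the kafka source config above satisfies it with the following pairing (values are only illustrative):

```
# the channel must be able to hold one full source batch in a single transaction
a1.sources.s1.batchSize = 100
a1.channels.c1.transactionCapacity = 100
```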