TAILDIR Source in Flume 1.7.0
Introduction to Taildir Source
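Taildir Source watches the specified files and tails them in near real time as soon as new lines are appended to any of them. If a line is still being written, the source waits for the write to complete and then retries reading the file.
Taildir Source is reliable: it does not lose data even when the tailed files are rotated and new data is produced, because it periodically writes the last read position of each file to a position file in JSON format. If Flume is stopped or goes down, it resumes reading each file from the recorded position after a restart.
In other scenarios, Taildir Source can also start reading each file from an arbitrary position by using a given position file. When there is no position file at the specified path, it starts tailing each file from its first line by default.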
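To make the role of the position file concrete, here is a minimal Python sketch that parses position-file content and looks up the offset to resume from. It assumes each entry carries the fields `inode`, `pos`, and `file`; the sample values and the helper names `load_positions` and `resume_offset` are hypothetical and only illustrate the idea, they are not Flume's own code.

```python
import json

# Sample content in the assumed shape: one JSON object per tailed file.
# The inode, pos and file values are made up for illustration.
SAMPLE_POSITION_JSON = """
[
  {"inode": 2496272, "pos": 12, "file": "/var/log/test1/example.log"},
  {"inode": 2496275, "pos": 0,  "file": "/var/log/test2/app.log.1"}
]
"""

def load_positions(text):
    """Return {inode: (path, byte_offset)} parsed from position-file JSON."""
    entries = json.loads(text)
    return {e["inode"]: (e["file"], e["pos"]) for e in entries}

def resume_offset(positions, inode):
    """Byte offset to resume from; 0 (start of file) if the file is unknown."""
    return positions.get(inode, (None, 0))[1]

positions = load_positions(SAMPLE_POSITION_JSON)
print(resume_offset(positions, 2496272))  # recorded file: resume at byte 12
print(resume_offset(positions, 999))      # unrecorded file: start at byte 0
```

Editing the `pos` value of an entry before starting the agent is what lets the source begin reading from an arbitrary position.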
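Files are consumed in order of modification time, with the least recently modified file consumed first. Taildir Source does not rename, delete, or otherwise modify the files it tails. It currently does not support tailing binary files; it reads text files line by line.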
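The consumption order and the line-by-line reading described above can be sketched in Python as follows. This is a simplified illustration, not Flume's implementation: the function names `order_by_mtime` and `read_complete_lines` are hypothetical, and holding back a trailing line without a newline is an assumption that models "waiting for the write to complete".

```python
import os
import tempfile

def order_by_mtime(paths):
    """Return paths sorted so the oldest modification time comes first."""
    return sorted(paths, key=os.path.getmtime)

def read_complete_lines(path, offset):
    """Read whole lines starting at a byte offset.

    Returns (lines, next_offset). A trailing fragment without a newline is
    left unread so that it can be picked up once the writer finishes it.
    """
    lines = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            start = f.tell()
            raw = f.readline()
            if not raw.endswith(b"\n"):
                # End of file or a partially written line: stop here.
                return lines, start
            lines.append(raw.rstrip(b"\r\n").decode("utf-8"))

# Small demo on temporary files with explicitly set modification times.
with tempfile.TemporaryDirectory() as d:
    older = os.path.join(d, "a.log")
    newer = os.path.join(d, "b.log")
    with open(older, "wb") as f:
        f.write(b"first\nsecond\npart")
    with open(newer, "wb") as f:
        f.write(b"x\n")
    os.utime(older, (1000, 1000))
    os.utime(newer, (2000, 2000))
    ordered = [os.path.basename(p) for p in order_by_mtime([newer, older])]
    lines, next_offset = read_complete_lines(older, 0)

print(ordered, lines, next_offset)
```

The returned `next_offset` is the value a tailer would persist as the last read position for that file.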
Common Taildir Source configuration properties
Properties in bold are required and must be set in xxxx.conf.
Property Name | Default | Description |
---|---|---|
**channels** | – | The channels this source is attached to. |
**type** | – | The component type; must be `TAILDIR`. |
**filegroups** | – | Space-separated list of file groups. Each file group represents a set of files to be tailed. |
**filegroups.&lt;filegroupName&gt;** | – | Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only. |
positionFile | ~/.flume/taildir_position.json | File in JSON format that records the inode, the absolute path, and the last read position of each tailed file. |
headers.&lt;filegroupName&gt;.&lt;headerKey&gt; | – | Header value which is set with the header key. Multiple headers can be specified for one file group. |
byteOffsetHeader | false | Whether to add the byte offset of a tailed line to a header called ‘byteoffset’. |
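skipToEnd | false | Whether to skip to the end of the file for files that have no entry in the position file. |
idleTimeout | 120000 | Time (ms) after which an inactive file is closed. If new lines are appended to a closed file, Taildir Source automatically reopens it. |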
writePosInterval | 3000 | Interval time (ms) to write the last position of each file on the position file. |
batchSize | 100 | Max number of lines to read and send to the channel at a time. Using the default is usually fine. |
backoffSleepIncrement | 1000 | The increment for time delay before reattempting to poll for new data, when the last attempt did not find any new data. |
maxBackoffSleep | 5000 | The max time delay between each reattempt to poll for new data, when the last attempt did not find any new data. |
cachePatternMatching | true | Listing directories and applying the filename regex pattern may be time consuming for directories containing thousands of files. Caching the list of matching files can improve performance. The order in which files are consumed will also be cached. Requires that the file system keeps track of modification times with at least a 1-second granularity. |
fileHeader | false | Whether to add a header storing the absolute path filename. |
fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
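The interplay of `backoffSleepIncrement` and `maxBackoffSleep` is easier to see with numbers. The sketch below assumes the delay grows by one increment for each consecutive poll that found no new data and is capped at the maximum; the function name `backoff_delays` is hypothetical, and the exact schedule inside Flume may differ.

```python
def backoff_delays(empty_polls, increment_ms=1000, max_ms=5000):
    """Delay (ms) before each retry after consecutive polls with no new data.

    Assumes linear growth by increment_ms per empty poll, capped at max_ms.
    Defaults mirror backoffSleepIncrement and maxBackoffSleep in the table.
    """
    return [min(n * increment_ms, max_ms) for n in range(1, empty_polls + 1)]

print(backoff_delays(7))
```

Under these assumptions the default settings reach the 5-second ceiling on the fifth consecutive empty poll, so an idle source polls at most once every 5 seconds.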
Usage example: Taildir Source & logger Sink
# 1. Start the agent
nohup bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties &
# 2. Configure taildir_to_console.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /Users/shufang/program_files/flume-1.7.0/test.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
# Configure the sink
a1.sinks.k1.type = logger
# Configure the channel, which buffers the data received from the source
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# 3. Start collecting
bin/flume-ng agent \
  --conf conf \
  --conf-file jobs/taildir_to_console.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console
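Since a filegroup path allows a regular expression only in the filename part, it can help to check which files a pattern such as `/var/log/test2/.*log.*` would pick up before starting the agent. The following Python sketch approximates that rule: the directory must match literally and the regex must match the whole filename. The function `matches_filegroup` is hypothetical, and Python's `re` syntax is only an approximation of the Java regex engine that Flume uses.

```python
import posixpath
import re

def matches_filegroup(pattern_path, candidate_path):
    """True if candidate_path is in the pattern's directory and its
    filename fully matches the regex in the pattern's last component."""
    parent, name_regex = posixpath.split(pattern_path)
    cand_parent, cand_name = posixpath.split(candidate_path)
    return parent == cand_parent and re.fullmatch(name_regex, cand_name) is not None

F2 = "/var/log/test2/.*log.*"  # filegroup f2 from the configuration above
for path in (
    "/var/log/test2/app.log.1",    # same directory, name contains "log"
    "/var/log/test2/data.txt",     # same directory, name lacks "log"
    "/var/log/test2/sub/app.log",  # different directory
):
    print(path, matches_filegroup(F2, path))
```

Note that a literal path such as `/var/log/test1/example.log` is also treated as a regex in this sketch, so the `.` matches any character; escaping it as `example\.log` restricts the match to that exact name.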