Flume: Resumable Log Collection with the Taildir Source
A typical Flume log-collection setup can ingest the same data twice: if the Flume agent dies and is restarted, it re-reads logs that have already been collected.
Solution: use the Taildir source for resumable collection. It records the last read position of each tailed file in a position file, so after a restart the agent resumes from the recorded offset instead of collecting everything again.
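For illustration, the position file is a JSON array with one record per tailed file, holding the file's inode, the byte offset read so far, and its path (the numeric values below are hypothetical):

[{"inode":2496001,"pos":1024,"file":"/home/hadoop/data/interceptor_data/access.log"},
 {"inode":2496002,"pos":2048,"file":"/home/hadoop/data/interceptor_data/nginx.log"},
 {"inode":2496003,"pos":512,"file":"/home/hadoop/data/interceptor_data/web.log"}]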
#*********** set agent *************
a1.sources=r1
a1.channels=c1
a1.sinks=k1
#********* set sources **********
# The source type must be TAILDIR (upper case); the lowercase alias "taildir" is not registered
a1.sources.r1.type=TAILDIR
# JSON file that records the last read offset of each tailed file (enables resuming after restart)
a1.sources.r1.positionFile=/home/hadoop/data/taildir/taildir_position.json
# Three file groups, one per log file to tail
a1.sources.r1.filegroups=f1 f2 f3
a1.sources.r1.filegroups.f1=/home/hadoop/data/interceptor_data/access.log
a1.sources.r1.filegroups.f2=/home/hadoop/data/interceptor_data/nginx.log
a1.sources.r1.filegroups.f3=/home/hadoop/data/interceptor_data/web.log
# Attach a header named "headerKey" to each file group's events; the HDFS sink path uses it via %{headerKey}
a1.sources.r1.headers.f1.headerKey=access
a1.sources.r1.headers.f2.headerKey=nginx
a1.sources.r1.headers.f3.headerKey=web
a1.sources.r1.fileHeader=true
a1.sources.r1.channels=c1
#************ set channels **************
# Memory channel: fast, but events still buffered in it are lost if the agent crashes
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100
#************* set sinks *************
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel=c1
# %{headerKey} is replaced with the per-file-group header value (access/nginx/web),
# and %Y-%m-%d with the event timestamp (supplied by useLocalTimeStamp below)
a1.sinks.k1.hdfs.path=/flume_taildir/%{headerKey}/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix=event_data
a1.sinks.k1.hdfs.fileSuffix=.log
# Round event timestamps down to 1-minute buckets for the time escapes in the path
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=1
a1.sinks.k1.hdfs.roundUnit=minute
# Roll to a new HDFS file every 20 seconds, or after 50 bytes, or after 10 events,
# whichever comes first (deliberately small values for testing)
a1.sinks.k1.hdfs.rollInterval=20
a1.sinks.k1.hdfs.rollSize=50
a1.sinks.k1.hdfs.rollCount=10
a1.sinks.k1.hdfs.batchSize=100
# DataStream writes events as plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.threadsPoolSize=10
a1.sinks.k1.hdfs.callTimeout=10000
a1.sinks.k1.hdfs.writeFormat=Text
# Use the agent's local time for the %Y-%m-%d escape (no timestamp interceptor needed)
a1.sinks.k1.hdfs.useLocalTimeStamp=true
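To try this out, save the configuration to a file (the name taildir_hdfs.conf below is just an example) and start the agent with the flume-ng launcher:

flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file taildir_hdfs.conf -Dflume.root.logger=INFO,console

If the agent is killed and restarted, the offsets stored in taildir_position.json let the Taildir source continue from where it stopped instead of re-reading the log files from the beginning.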