Requirements:
- Different services produce different log files, e.g. server/test_a_20181217.log and server/test_b_20181217.log; the logs are continuously appended to.
- Flume should collect each log into its matching HDFS directory, i.e.:
server/test_a_20181217.log ——> /user/hive/logs/ymd=20181217/test_a/xxxx.txt
server/test_b_20181217.log ——> /user/hive/logs/ymd=20181217/test_b/xxxx.txt
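With the configuration below, a day of collection should therefore produce a layout like the following (the file names are illustrative; the HDFS sink names files as filePrefix.<counter> plus the configured suffix):
/user/hive/logs/ymd=20181217/test_a/log.1545033601234.txt
/user/hive/logs/ymd=20181217/test_b/log.1545033605678.txt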
Solution:
- Flume turns each log line into an event with a header map and a body.
- The source puts an identifying value (and the file name) into the header; the HDFS sink reads these header variables to build the output directory.
- Implementation: Taildir Source with filegroups.
- Architecture: a tail agent on each log machine forwards events over Avro to a collector agent on the CDH cluster (through a failover sink group); the collector writes to HDFS. A sketch of the event format follows.
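Printed through a logger sink (a debugging aid, not part of the pipeline below), an event produced by this setup would look roughly like:
Event: { headers:{headerKey1=test_a, file=/server/test_a_20181217.log} body: ...one log line... }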
Flume configuration on the log machines:
# The agent is named "a1"
a1.sources = r
a1.sinks = k-1 k-2 k-3
a1.channels = c
# *** Log collection ***
# Source configuration
a1.sources.r.type = TAILDIR
# Position file where Taildir stores its read offsets
a1.sources.r.positionFile = /home/apache-flume-1.8.0-bin/logs/test.json
# File groups: several directories/patterns can be monitored at once
a1.sources.r.filegroups = f1 f2
# test_a
a1.sources.r.filegroups.f1 = /server/test_a_.*log
# Static header attached to every event from f1, consumed by the HDFS sink as %{headerKey1}.
# Note: Flume reads this file as Java properties, so a comment must not sit on the value line.
a1.sources.r.headers.f1.headerKey1 = test_a
# test_b
a1.sources.r.filegroups.f2 = /server/test_b_.*log
a1.sources.r.headers.f2.headerKey1 = test_b
# Also put the absolute file path into the event header
a1.sources.r.fileHeader = true
a1.sources.r.fileHeaderKey = file
# Sink group (failover processor: events go to the highest-priority sink that is alive)
a1.sinkgroups = g
a1.sinkgroups.g.sinks = k-1 k-2 k-3
a1.sinkgroups.g.processor.type = failover
a1.sinkgroups.g.processor.priority.k-1 = 10
a1.sinkgroups.g.processor.priority.k-2 = 5
a1.sinkgroups.g.processor.priority.k-3 = 1
a1.sinkgroups.g.processor.maxpenalty = 10000
# Sink configuration (failover, priority from high to low).
# All three sinks point at the same collector here; in a real deployment each would target a different host.
a1.sinks.k-1.type = avro
a1.sinks.k-1.hostname = 192.168.0.1
a1.sinks.k-1.port = 41401
a1.sinks.k-2.type = avro
a1.sinks.k-2.hostname = 192.168.0.1
a1.sinks.k-2.port = 41401
a1.sinks.k-3.type = avro
a1.sinks.k-3.hostname = 192.168.0.1
a1.sinks.k-3.port = 41401
# Channel configuration
a1.channels.c.type = memory
a1.channels.c.capacity = 1000
a1.channels.c.transactionCapacity = 100
# Bind the source and the sinks to the channel
a1.sources.r.channels = c
a1.sinks.k-1.channel = c
a1.sinks.k-2.channel = c
a1.sinks.k-3.channel = c
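The tail agent can then be started with the standard flume-ng launcher; the config file name below is an assumption:
flume-ng agent -n a1 \
  -c /home/apache-flume-1.8.0-bin/conf \
  -f /home/apache-flume-1.8.0-bin/conf/taildir-avro.conf \
  -Dflume.root.logger=INFO,console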
Flume configuration on CDH:
# Source configuration
a1.sources.r.type = avro
a1.sources.r.bind = 0.0.0.0
a1.sources.r.port = 41401
# Sink configuration
a1.sinks.k.type = hdfs
a1.sinks.k.hdfs.fileType = DataStream
# Use the local clock for %Y%m%d, since the forwarded events carry no timestamp header
a1.sinks.k.hdfs.useLocalTimeStamp = true
# %{headerKey1} pulls the routing value set by the Taildir source; /logs added to match the target layout above
a1.sinks.k.hdfs.path = /user/hive/logs/ymd=%Y%m%d/%{headerKey1}
a1.sinks.k.hdfs.filePrefix = log
# fileSuffix names the finished file; inUseSuffix would only mark files while they are still being written
a1.sinks.k.hdfs.fileSuffix = .txt
a1.sinks.k.hdfs.writeFormat = Text
# Close an output file after one hour without new writes
a1.sinks.k.hdfs.idleTimeout = 3600
a1.sinks.k.hdfs.batchSize = 10
# Disable rolling by size/time/count; files are closed only via idleTimeout
a1.sinks.k.hdfs.rollSize = 0
a1.sinks.k.hdfs.rollInterval = 0
a1.sinks.k.hdfs.rollCount = 0
# Avoids premature rolls when the HDFS replication pipeline shrinks
a1.sinks.k.hdfs.minBlockReplicas = 1
a1.sinks.k.hdfs.round = true
a1.sinks.k.hdfs.roundValue = 1
# Channel configuration
a1.channels.c.type = memory
a1.channels.c.capacity = 1000
a1.channels.c.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r.channels = c
a1.sinks.k.channel = c
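Start the collector before the tail agents, otherwise the failover group on the log machines will keep cycling through its sinks until one connects. A minimal launch sketch with an assumed config path (on CDH the agent is usually managed through Cloudera Manager instead):
flume-ng agent -n a1 \
  -c /etc/flume-ng/conf \
  -f /etc/flume-ng/conf/avro-hdfs.conf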
Appendix
- Hands-on with filters; recommended reading on filtering: http://www.cnblogs.com/zlslch/p/7244211.html
- Collecting with the Spooling Directory Source, which however does not support files that are still being written to: https://blog.csdn.net/xiao_jun_0820/article/details/41576999