Requirements:
- Different services produce different log files, e.g. server/test_a_20181217.log and server/test_b_20181217.log; the logs are continuously appended to.
- Flume should collect each log into its matching HDFS directory, i.e.:
server/test_a_20181217.log ——> /user/hive/logs/ymd=20181217/test_a/xxxx.txt
server/test_b_20181217.log ——> /user/hive/logs/ymd=20181217/test_b/xxxx.txt
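With the configuration below, a day of collection should therefore produce a layout like the following (the file names are illustrative; the HDFS sink names files as filePrefix.<counter> plus the configured suffix):
/user/hive/logs/ymd=20181217/test_a/log.1545033601234.txt
/user/hive/logs/ymd=20181217/test_b/log.1545033605678.txt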
Solution:
- Flume turns each log line into an event with a header map and a body.
- The source puts an identifying value (and the file name) into the header; the HDFS sink reads these header variables to build the output directory.
- Implementation: Taildir Source with filegroups.
- Architecture: a tail agent on each log machine forwards events over Avro to a collector agent on the CDH cluster (through a failover sink group); the collector writes to HDFS. A sketch of the event format follows.
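Printed through a logger sink (a debugging aid, not part of the pipeline below), an event produced by this setup would look roughly like:
Event: { headers:{headerKey1=test_a, file=/server/test_a_20181217.log} body: ...one log line... }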
Flume configuration on the log machines:
# The agent is named "a1"
a1.sources = r
a1.sinks = k-1 k-2 k-3
a1.channels = c
# *** Log collection ***
# Source configuration
a1.sources.r.type = TAILDIR
# Position file where Taildir stores its read offsets
a1.sources.r.positionFile = /home/apache-flume-1.8.0-bin/logs/test.json
# File groups: several directories/patterns can be monitored at once
a1.sources.r.filegroups = f1 f2
# test_a
a1.sources.r.filegroups.f1 = /server/test_a_.*log
# Static header attached to every event from f1, consumed by the HDFS sink as %{headerKey1}.
# Note: Flume reads this file as Java properties, so a comment must not sit on the value line.
a1.sources.r.headers.f1.headerKey1 = test_a
# test_b
a1.sources.r.filegroups.f2 = /server/test_b_.*log
a1.sources.r.headers.f2.headerKey1 = test_b
# Also put the absolute file path into the event header
a1.sources.r.fileHeader = true
a1.sources.r.fileHeaderKey = file
# Sink group (failover processor: events go to the highest-priority sink that is alive)
a1.sinkgroups = g
a1.sinkgroups.g.sinks = k-1 k-2 k-3
a1.sinkgroups.g.processor.type = failover
a1.sinkgroups.g.processor.priority.k-1 = 10
a1.sinkgroups.g.processor.priority.k-2 = 5
a1.sinkgroups.g.processor.priority.k-3 = 1
a1.sinkgroups.g.processor.maxpenalty = 10000
# Sink configuration (failover, priority from high to low).
# All three sinks point at the same collector here; in a real deployment each would target a different host.
a1.sinks.k-1.type = avro
a1.sinks.k-1.hostname = 192.168.0.1
a1.sinks.k-1.port = 41401
a1.sinks.k-2.type = avro
a1.sinks.k-2.hostname = 192.168.0.1
a1.sinks.k-2.port = 41401
a1.sinks.k-3.type = avro
a1.sinks.k-3.hostname = 192.168.0.1
a1.sinks.k-3.port = 41401
# Channel configuration
a1.channels.c.type = memory
a1.channels.c.capacity = 1000
a1.channels.c.transactionCapacity = 100
# Bind the source and the sinks to the channel
a1.sources.r.channels = c
a1.sinks.k-1.channel = c
a1.sinks.k-2.channel = c
a1.sinks.k-3.channel = c
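The tail agent can then be started with the standard flume-ng launcher; the config file name below is an assumption:
flume-ng agent -n a1 \
  -c /home/apache-flume-1.8.0-bin/conf \
  -f /home/apache-flume-1.8.0-bin/conf/taildir-avro.conf \
  -Dflume.root.logger=INFO,console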
Flume configuration on CDH:
# Source configuration
a1.sources.r.type = avro
a1.sources.r.bind = 0.0.0.0
a1.sources.r.port = 41401
# Sink configuration
a1.sinks.k.type = hdfs
a1.sinks.k.hdfs.fileType = DataStream
# Use the local clock for %Y%m%d, since the forwarded events carry no timestamp header
a1.sinks.k.hdfs.useLocalTimeStamp = true
# %{headerKey1} pulls the routing value set by the Taildir source; /logs added to match the target layout above
a1.sinks.k.hdfs.path = /user/hive/logs/ymd=%Y%m%d/%{headerKey1}
a1.sinks.k.hdfs.filePrefix = log
# fileSuffix names the finished file; inUseSuffix would only mark files while they are still being written
a1.sinks.k.hdfs.fileSuffix = .txt
a1.sinks.k.hdfs.writeFormat = Text
# Close an output file after one hour without new writes
a1.sinks.k.hdfs.idleTimeout = 3600
a1.sinks.k.hdfs.batchSize = 10
# Disable rolling by size/time/count; files are closed only via idleTimeout
a1.sinks.k.hdfs.rollSize = 0
a1.sinks.k.hdfs.rollInterval = 0
a1.sinks.k.hdfs.rollCount = 0
# Avoids premature rolls when the HDFS replication pipeline shrinks
a1.sinks.k.hdfs.minBlockReplicas = 1
a1.sinks.k.hdfs.round = true
a1.sinks.k.hdfs.roundValue = 1
# Channel configuration
a1.channels.c.type = memory
a1.channels.c.capacity = 1000
a1.channels.c.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r.channels = c
a1.sinks.k.channel = c
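Start the collector before the tail agents, otherwise the failover group on the log machines will keep cycling through its sinks until one connects. A minimal launch sketch with an assumed config path (on CDH the agent is usually managed through Cloudera Manager instead):
flume-ng agent -n a1 \
  -c /etc/flume-ng/conf \
  -f /etc/flume-ng/conf/avro-hdfs.conf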
Appendix
- Hands-on with filters; recommended reading on filtering: http://www.cnblogs.com/zlslch/p/7244211.html
- Collecting with the Spooling Directory Source, which however does not support files that are still being written to: https://blog.csdn.net/xiao_jun_0820/article/details/41576999