1. The Flume version used here is 1.8.
For choosing among Flume's sources, channels, and sinks, see the official user guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Source/sink selection depends on whether you need T+1 or real-time ingestion, and whether you are monitoring a directory or tailing a file with tail -F.
For a detailed comparison of the pros and cons, see this blog post: https://blog.csdn.net/u013384984/article/details/79436078
I originally chose spooldir to monitor a directory. Its drawback is that once a new file lands in the monitored directory it can be read, but it must no longer be written to, so near-real-time ingestion is not possible.
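For comparison, a minimal spooldir source sketch (the directory and property values below are hypothetical, not from the original config); spooldir requires that files dropped into the directory are complete and immutable, which is exactly why it cannot be near real time:

```properties
# Hypothetical spooldir source: files placed in spoolDir must be fully
# written and never modified afterwards.
read.sources.r1.type = spooldir
read.sources.r1.spoolDir = /data/logs/spool
# Rename (rather than delete) a file once it has been fully ingested
read.sources.r1.deletePolicy = never
read.sources.r1.fileSuffix = .COMPLETED
```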
I then switched to taildir for real-time reads, but take care to avoid producing many small files on HDFS; this comes down to how the roll parameters are configured. Since we roll the files generated on HDFS by size, the other roll settings must be set to 0.
taildir_logtohdfs.properties
# Agent name is "read"; declare its source, sink, and channel
read.sources = r1
read.sinks = k1
read.channels = c1
# Source configuration
# Source type
read.sources.r1.type = TAILDIR
# Position file (metadata); allows resuming from the last read offset after a restart
read.sources.r1.positionFile = /usr/local/flume-1.8.0/conf/taildir_position.json
# Monitored files (file groups)
read.sources.r1.filegroups = f1
read.sources.r1.filegroups.f1 = /data/logs/read/.*log
read.sources.r1.fileHeader = true
# Sink configuration
read.sinks.k1.type = hdfs
read.sinks.k1.hdfs.useLocalTimeStamp = true
# Best to include the hour in the path so data can be loaded into the warehouse hourly (H+1); otherwise the load can fail
read.sinks.k1.hdfs.path = hdfs://192.168.1.154:8022/user/read/data/%Y%m%d/%H
read.sinks.k1.hdfs.filePrefix = read_access
read.sinks.k1.hdfs.fileSuffix = .log
read.sinks.k1.hdfs.rollSize = 134217728
read.sinks.k1.hdfs.rollCount = 0
read.sinks.k1.hdfs.rollInterval = 0
read.sinks.k1.hdfs.writeFormat = Text
read.sinks.k1.hdfs.fileType = DataStream
read.sinks.k1.hdfs.idleTimeout = 120
read.sinks.k1.hdfs.minBlockReplicas = 1
# Channel configuration
read.channels.c1.type = memory
read.channels.c1.capacity = 100000
read.channels.c1.transactionCapacity = 100000
# Wire the source and sink to the channel
read.sources.r1.channels = c1
read.sinks.k1.channel = c1
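For reference, the position file maintained by the TAILDIR source is a JSON array recording each tailed file's inode, byte offset, and path; this is what makes resuming after a restart possible (the values below are illustrative, not real):

```json
[{"inode": 2496272, "pos": 1024, "file": "/data/logs/read/access.log"}]
```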
Startup command:
nohup bin/flume-ng agent -n read -c conf -f /usr/local/flume-1.8.0/conf/taildir_logtohdfs.properties &
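Once the agent is running and files are landing under the hourly directories, the H+1 load mentioned in the sink comment could look like the following Hive statement (a sketch only: the table name, partition columns, and date values are hypothetical, assuming a table partitioned by day and hour to match the %Y%m%d/%H path layout):

```sql
-- Hypothetical hourly load matching the %Y%m%d/%H directory layout
LOAD DATA INPATH 'hdfs://192.168.1.154:8022/user/read/data/20190101/09'
INTO TABLE read_access_log PARTITION (dt = '20190101', hr = '09');
```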