1 Requirements
The business system uses log4j to generate logs. The log content keeps growing, and the data appended to the log file needs to be collected into HDFS in real time.
2 Requirements Analysis
Based on the requirements, first define the following three key elements:
Collection source (Source) -- monitor file content updates: exec "tail -F file"
Sink target (Sink) -- the HDFS file system: hdfs sink
Channel between Source and Sink -- either a file channel or a memory channel will do
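As noted above, either channel type works. For reference, a file-channel version of the channel section (using the agent and component names from the configuration below) might look like the following sketch; the checkpoint and data directories are placeholder paths, while the property names come from the Flume user guide:

```
ag1.channels.channel1.type = file
# Placeholder directories -- choose locations on a disk with enough space
ag1.channels.channel1.checkpointDir = /opt/module/flume-1.7.0/checkpoint
ag1.channels.channel1.dataDirs = /opt/module/flume-1.7.0/data
```

The memory channel is faster but loses buffered events if the agent dies; the file channel persists events to disk at some throughput cost.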
Note: the spooldir source only reads static files, and there is no source component that directly reads lines as they are appended to a file. There is, however, an exec source that reads the output of a Linux command, so the command "tail -F file" can be used to read lines as they are appended to a file.
Difference between tail -f and tail -F:
tail -f is equivalent to --follow=descriptor: it follows the file descriptor, so tracking stops once the file is renamed or deleted.
tail -F is equivalent to --follow=name --retry: it follows the file name and keeps retrying, so if the file is deleted or renamed and a file with the same name is later recreated, tracking continues.
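The distinction can be verified with a small experiment (a sketch using temporary files under /tmp):

```shell
# Start following a file by NAME with retry (-F), simulate a log
# rotation, then confirm lines from both the old and the new file
# were captured. With plain -f, "line2" would be missed.
echo "line1" > /tmp/demo.log
tail -F /tmp/demo.log > /tmp/tailout.txt 2>/dev/null &
TAIL_PID=$!
sleep 1
mv /tmp/demo.log /tmp/demo.log.1   # rotate: the old descriptor now points to demo.log.1
echo "line2" > /tmp/demo.log       # recreate the file under the same name
sleep 2                            # give tail's retry loop time to reopen the file
kill $TAIL_PID
cat /tmp/tailout.txt
```

Both lines appear in the captured output, which is exactly why the Flume exec source uses -F rather than -f.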
3 Implementation Steps
3.1 Create and configure tail-hdfs.conf
For the configuration properties, refer to the official documentation: http://flume.apache.org/FlumeUserGuide.html
[caimh@master-node job]$ vim tail-hdfs.conf
# Define the names of the three components
ag1.sources = source1
ag1.sinks = sink1
ag1.channels = channel1
# Configure the source
ag1.sources.source1.type = exec
ag1.sources.source1.command = tail -F /opt/module/flume-1.7.0/logs/access.log
# Configure the sink
ag1.sinks.sink1.type = hdfs
ag1.sinks.sink1.hdfs.path = hdfs://master-node:9000/access_log/%y-%m-%d/%H-%M
# Prefix for uploaded files
ag1.sinks.sink1.hdfs.filePrefix = app_log
# Suffix for uploaded files
ag1.sinks.sink1.hdfs.fileSuffix = .log
# Number of events to accumulate before flushing to HDFS
ag1.sinks.sink1.hdfs.batchSize = 100
ag1.sinks.sink1.hdfs.fileType = DataStream
ag1.sinks.sink1.hdfs.writeFormat = Text
## roll: rules controlling when to roll over to a new file
## Roll by file size (bytes)
ag1.sinks.sink1.hdfs.rollSize = 512000
## Roll by number of events
ag1.sinks.sink1.hdfs.rollCount = 1000000
## Roll by time interval, i.e. how long before a new file is started (seconds)
ag1.sinks.sink1.hdfs.rollInterval = 60
## Rules controlling how target directories are generated
ag1.sinks.sink1.hdfs.round = true
## How many time units per new directory
ag1.sinks.sink1.hdfs.roundValue = 10
ag1.sinks.sink1.hdfs.roundUnit = minute
# Whether to use the local timestamp
ag1.sinks.sink1.hdfs.useLocalTimeStamp = true
# Configure the channel
ag1.channels.channel1.type = memory
## Maximum number of events held in the channel
ag1.channels.channel1.capacity = 500000
## Buffer capacity needed for Flume transaction control: 600 events
ag1.channels.channel1.transactionCapacity = 600
# Bind the source and the sink to the channel
ag1.sources.source1.channels = channel1
ag1.sinks.sink1.channel = channel1
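Two sanity checks on the numbers above, sketched in shell (the values are copied from the configuration; the 14:37 timestamp is just an illustrative example):

```shell
# The channel must be able to hold at least one sink batch per
# transaction: transactionCapacity >= batchSize, and the overall
# capacity must be >= transactionCapacity.
BATCH_SIZE=100       # ag1.sinks.sink1.hdfs.batchSize
TXN_CAPACITY=600     # ag1.channels.channel1.transactionCapacity
CAPACITY=500000      # ag1.channels.channel1.capacity
[ "$TXN_CAPACITY" -ge "$BATCH_SIZE" ] && [ "$CAPACITY" -ge "$TXN_CAPACITY" ] \
  && echo "channel sizing OK"

# With round=true, roundValue=10, roundUnit=minute, the %H-%M part of
# hdfs.path is rounded DOWN to the nearest 10 minutes, so an event at
# 14:37 lands under a .../14-30 directory:
MIN=37
echo "14-$(( MIN / 10 * 10 ))"
```

If transactionCapacity were smaller than batchSize, the sink's take transaction could never complete, so keeping these inequalities is worth checking whenever the numbers change.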
3.2 Start the Flume agent
[caimh@master-node flume-1.7.0]$ bin/flume-ng agent -c conf/ -n ag1 -f job/tail-hdfs.conf
3.3 Dynamically append content under the previously created /opt/module/flume-1.7.0/logs directory (simulating a live log with a script)
[caimh@master-node logs]$ while true
> do
> echo `date`>>access.log
> sleep 0.1
> done
3.4 View the data on HDFS (e.g. with hdfs dfs -ls -R /access_log)
4 Problem
[caimh@master-node flume-1.7.0]$ bin/flume-ng agent -c conf/ -n ag1 -f job/tail-hdfs.conf -Dflume.root.logger=INFO,console
Info: Sourcing environment configuration script /opt/module/flume-1.7.0/conf/flume-env.sh
Info: Including Hadoop libraries found via (/opt/module/hadoop-2.7.4/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /opt/module/jdk1.8.0_211/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp '/opt/module/flume-1.7.0/conf:/opt/module/flume-1.7.0/lib/*:/opt/module/hadoop-2.7.4/etc/hadoop:/opt/module/hadoop-2.7.4/share/hadoop/common/lib/*:/opt/module/hadoop-2.7.4/share/hadoop/common/*:/opt/module/hadoop-2.7.4/share/hadoop/hdfs:/opt/module/hadoop-2.7.4/share/hadoop/hdfs/lib/*:/opt/module/hadoop-2.7.4/share/hadoop/hdfs/*:/opt/module/hadoop-2.7.4/share/hadoop/yarn/lib/*:/opt/module/hadoop-2.7.4/share/hadoop/yarn/*:/opt/module/hadoop-2.7.4/share/hadoop/mapreduce/lib/*:/opt/module/hadoop-2.7.4/share/hadoop/mapreduce/*:/opt/module/hadoop-2.7.4/contrib/capacity-scheduler/*.jar:/lib/*' -Djava.library.path=:/opt/module/hadoop-2.7.4/lib/native org.apache.flume.node.Application -n ag1 -f job/tail-hdfs.conf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/flume-1.7.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
A jar conflict has occurred between two SLF4J bindings:
file:/opt/module/flume-1.7.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class
file:/opt/module/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class
Solution: delete one of the two jars (here, the older 1.6.1 binding shipped with Flume, keeping Hadoop's 1.7.10 binding):
[caimh@master-node flume-1.7.0]$ rm -rf lib/slf4j-log4j12-1.6.1.jar