What is Flume
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and moving large volumes of log data. It has a simple, flexible architecture based on streaming data flows, with robust reliability, fault tolerance, and failover and recovery mechanisms.
How Flume works (an analogy from a blog post)
Think of a water tank. Water flows in at one end and out at the other; you can attach different kinds of pipes to the inlet and to the outlet, and there can be several inlets and several outlets. In Flume terms, the water is an Event, an inlet is a Source, an outlet is a Sink, and the tank itself is a Channel. A Source plus a Channel plus a Sink make up an Agent, and if needed, multiple Agents can be chained together.
-source: collects data and hands it to the channel
-channel: the pipe that connects the source and the sink, buffering events in between
-sink: drains events from the channel and delivers them to the destination
Component start-up order
Flume starts its components in the order channels → sinks → sources, and shuts them down in the reverse order: sources → sinks → channels.
Environment setup
1. Extract the tarball
tar -zxf flume-ng-1.5.0-cdh5.3.6.tar.gz -C /opt/cdh-5.3.6/
2. Edit the configuration files
flume-env.sh: set JAVA_HOME
Put the HDFS configuration files into Flume's conf directory:
copy core-site.xml and hdfs-site.xml there
3. Copy the four jar files required for HDFS access into Flume's lib directory
Examples
Example 1
《《《《《《source: Hive log, channel: memory, sink: console (logger)》》》》》》
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
#define the channel
a1.channels.c1.type = memory
# define the sink
a1.sinks.k1.type = logger
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
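The notes above never show how to start an agent, so here is a sketch of the usual flume-ng invocation for this example, run from the Flume install directory (the properties file name a1-hive-mem-logger.properties is made up for illustration):

```shell
# start agent a1 with the config above; print events to the console
bin/flume-ng agent \
  --conf conf \
  --name a1 \
  --conf-file conf/a1-hive-mem-logger.properties \
  -Dflume.root.logger=INFO,console
```

The --name argument must match the agent prefix used in the properties file (a1 here); the same invocation works for the later examples with the conf file swapped out.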
Example 2
《《《《《《source: Hive log, channel: file, sink: console (logger)》》》》》》
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
#define the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/datas/flume-ch/check
a1.channels.c1.dataDirs = /opt/datas/flume-ch/data
# define the sink
a1.sinks.k1.type = logger
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Example 3
《《《《《《source: exec, channel: memory, sink: HDFS》》》》》》
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/hdfs
a1.sinks.k1.hdfs.fileType = DataStream
## note 1: the core-site.xml/hdfs-site.xml copied into conf let the sink locate HDFS
## note 2: if the target directory does not exist, it is created automatically
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
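Once an agent with this config is running, the result can be inspected with the HDFS CLI (the path comes from the config above; with no filePrefix set, files get the default FlumeData name prefix):

```shell
# list the files written by the HDFS sink
hdfs dfs -ls /flume/hdfs
```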
Example 4
《《《《《《rolling files by size》》》》》》
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/size
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollCount = 0
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
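Setting rollInterval and rollCount to 0 disables the time-based and event-count triggers, so files roll purely on size; rollSize is measured in bytes:

```shell
# rollSize = 10240 bytes, i.e. each HDFS file is closed at roughly 10 KB
echo "$((10240 / 1024)) KB"   # prints "10 KB"
```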
Example 5
《《《《《《time-based partitioning》》》》》》
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = hive-log
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
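The %y-%m-%d/%H-%M escapes in hdfs.path are strftime-style patterns, so the directory an event would land in can be previewed with date:

```shell
# preview the HDFS directory for an event arriving right now,
# e.g. /flume/events/15-07-21/10-30
date "+/flume/events/%y-%m-%d/%H-%M"
```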
Example 6
《《《《《《time-based partitioning (useLocalTimeStamp misconfigured)》》》》》》
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.fileType = DataStream
# note: the value below is misspelled, so it is not parsed as true
a1.sinks.k1.hdfs.useLocalTimeStamp = tru
a1.sinks.k1.hdfs.filePrefix = hive-log
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
If a1.sinks.k1.hdfs.useLocalTimeStamp = true is not set effectively (as above), no timestamp is available to resolve the %y-%m-%d/%H-%M escapes, and the agent fails with an error.
Example 7
《《《《《《monitoring a directory》》》》》》
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/datas/flume-ch/spdir
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/spdir
a1.sinks.k1.hdfs.fileType = DataStream
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
This source works by polling the directory, not by interrupt/notification. If no files appear under the watched directory, the HDFS directory is not created either; once files do show up, the HDFS directory gets created as well.
By default every file dropped into the directory is uploaded, temporary files included. To exclude them, look up the source's ignorePattern option and configure it:
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# define the source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/datas/flume-ch/spdir
a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/spdir
a1.sinks.k1.hdfs.fileType = DataStream
# wire the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
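ignorePattern is a regular expression matched against file names; which names the pattern above catches can be sanity-checked with grep (whose ERE syntax is close enough here):

```shell
# names matching the pattern are skipped by the spooldir source
printf '%s\n' app.log app.log.tmp data.csv data.csv.tmp \
  | grep -E '[^ ]*\.tmp$'
# prints:
# app.log.tmp
# data.csv.tmp
```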
Example 8
《《《《《《monitoring a directory and its files》》》》》》
What if you need to watch a directory for new files and also keep reading data as it is appended to those files?
exec: follows appends to a single file
spooling dir: picks up new files in a directory, but does not follow appends
a taildir-style source is needed to do both; it is not bundled with this Flume version and has to be compiled separately
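For reference, Flume 1.7 and later ship a built-in TAILDIR source that covers exactly this case. A minimal sketch, reusing the directory from the examples above (the positionFile path is made up):

```properties
# follow appends to every file under the spooled directory
a1.sources.s1.type = TAILDIR
a1.sources.s1.positionFile = /opt/datas/flume-ch/taildir_position.json
a1.sources.s1.filegroups = f1
a1.sources.s1.filegroups.f1 = /opt/datas/flume-ch/spdir/.*
```

The positionFile records how far each file has been read, so the source can resume after a restart without re-ingesting data.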