1. Start with a simple getting-started example, based on the official user guide: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html
The Flume configuration file, simple.conf:
# define the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# define the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop000
a1.sources.r1.port = 44444
# define the channel
a1.channels.c1.type = memory
# define the sink
a1.sinks.k1.type = logger
# wire the components together
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
Start the agent with:
[hadoop@hadoop000 bin]$ flume-ng agent \
> --name a1 \
> --conf-file /home/hadoop/script/flume/simple.conf \
> --conf $FLUME_HOME/conf \
> -Dflume.root.logger=INFO,console
The startup log shows that the three components start in the order: channel -> sink -> source.
Use telnet to send data to Flume:
[hadoop@hadoop000 ~]$ telnet hadoop000 44444
Trying 192.168.31.100...
Connected to hadoop000.
Escape character is '^]'.
hello flume
OK
Flume receives the data and the logger sink prints it:
2019-09-19 23:50:59,149 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65 0D hello flume. }
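The logger sink prints the event body as hex bytes. As a quick check, decoding the hex from the log line above recovers exactly the text telnet sent, plus a trailing carriage return (0x0D) that telnet appends before the newline:

```python
# Hex bytes copied from the logger sink output above.
hex_body = "68 65 6C 6C 6F 20 66 6C 75 6D 65 0D"

# Parse each two-digit hex token into a byte and decode as ASCII.
raw = bytes(int(b, 16) for b in hex_body.split())
print(repr(raw.decode("ascii")))  # 'hello flume\r'
```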
If an event is too long it gets truncated: by default the logger sink prints only the first 16 bytes of the event body.
To exit telnet, press Ctrl + ] and then type quit.
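The telnet session above can also be reproduced programmatically. This is a minimal sketch of the netcat source's line protocol (one event per line, acknowledged with "OK"); since no live Flume agent is assumed here, a stand-in server thread plays the role of the source on a throwaway local port instead of hadoop000:44444:

```python
import socket
import threading

def fake_netcat_source(server_sock):
    """Stand-in for Flume's netcat source: reads one line, replies OK.
    (The real source behaves the same way: one event per line, 'OK' ack.)"""
    conn, _ = server_sock.accept()
    with conn:
        line = conn.makefile().readline()
        conn.sendall(b"OK\n")
    return line

# Throwaway local port so the sketch runs anywhere (hypothetical stand-in
# for hadoop000:44444 from the config above).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

received = []
t = threading.Thread(target=lambda: received.append(fake_netcat_source(server)))
t.start()

# Client side: the same exchange `telnet hadoop000 44444` does by hand.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello flume\n")
    ack = client.makefile().readline().strip()

t.join()
print(received[0].strip())  # hello flume
print(ack)                  # OK
```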
2. Monitor a specified file in real time and write newly appended content to HDFS
source: exec
channel: memory
sink: logger -> HDFS
The exec source runs a given command and uses the command's output as its data source. The most common command is tail -F <file>: whenever the application appends to the log file, the source picks up the newest content. Note that when using tail, the file must already contain enough data for any output to appear (tail emits only the file's last 10 lines by default).
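The tail -F pattern described above takes only two lines of source configuration. A minimal sketch, where the agent/source names and the file path are placeholders rather than values from this post:

```properties
# exec source: run a command and consume its stdout as events
a1.sources.r1.type = exec
# -F keeps following the file even across rotation/recreation
a1.sources.r1.command = tail -F /home/hadoop/data/data.log
```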
The Flume configuration file exec-memory-hdfs.conf:
#define agent
exec-hdfs-agent.sources = exec-source
exec-hdfs-agent