1. netcat source with a logger sink
1.1 Configuration file
# example.conf: a single-node Flume configuration
# Name the components of Agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure source r1 of Agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure sink k1 of Agent a1
a1.sinks.k1.type = logger
# Configure channel c1 of Agent a1; the channel buffers Event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
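This configuration file defines an Agent named a1. a1 has a source that listens for data on local port 44444, a channel that buffers the data, and a sink that writes Event data to the console. The file names each component and sets its type and other properties. A single configuration file may define several Agents, so when starting Flume you pass the name of the Agent to run.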
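The naming scheme can be made concrete with a small sketch that parses the properties above and reports how the components of one Agent are wired together. The helper names parse_properties and describe_agent are hypothetical and this is not how Flume itself loads its configuration; it only illustrates the `<agent>.<kind>.<name>.<property>` key layout.

```python
EXAMPLE_CONF = """\
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def describe_agent(props, agent):
    """Return the channels of one agent and which channel each source/sink uses."""
    def names(kind):
        return props.get(f"{agent}.{kind}", "").split()

    return {
        "channels": names("channels"),
        # A source may feed several channels, so its value is a list.
        "sources": {s: props.get(f"{agent}.sources.{s}.channels", "").split()
                    for s in names("sources")},
        # A sink drains exactly one channel.
        "sinks": {k: props.get(f"{agent}.sinks.{k}.channel") for k in names("sinks")},
    }


wiring = describe_agent(parse_properties(EXAMPLE_CONF), "a1")
print(wiring)
```

Note the asymmetry in the keys: a source uses the plural `channels`, while a sink uses the singular `channel`.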
1.2 Start the agent
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-netcat.conf --name a1 -Dflume.root.logger=INFO,console
1.3 Connect to the port
[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
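For scripted tests, the telnet session can be replaced by a few lines of socket code. The sketch below is only an illustration: send_lines is a hypothetical helper, and it assumes the source answers every newline-terminated line with one reply line, as the `OK` responses in the telnet session suggest.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each line followed by a newline and collect one reply line per event."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("r", encoding="utf-8", newline="\n") as reader:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
                # Wait for the acknowledgement of this line before sending the next.
                replies.append(reader.readline().strip())
    return replies
```

With the agent from 1.2 running, `send_lines("localhost", 44444, ["hello", "word"])` should return one acknowledgement per line. Unlike telnet, this client sends a bare line feed, so the Event body would not carry a trailing carriage return.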
1.4 Test
[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
OK
word
OK
dzw
OK
ttt
OK
haddop^H
OK
spark
OK
flume
OK
Flume's terminal prints each received Event as a log line.
2021-01-19 16:05:27,669 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }
2021-01-19 16:05:29,842 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 77 6F 72 64 0D word. }
2021-01-19 16:05:38,846 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 64 7A 77 0D dzw. }
2021-01-19 16:14:24,955 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 74 74 74 0D ttt. }
2021-01-19 16:19:43,018 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 64 64 6F 70 08 0D haddop.. }
2021-01-19 16:19:52,022 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 61 72 6B 0D spark. }
2021-01-19 16:19:53,289 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 66 6C 75 6D 65 0D flume. }
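The hex column in these log lines is the raw Event body. Decoding it shows why every body ends in a dot: telnet terminates each line with a carriage return and a line feed, the line feed ends the Event, and the remaining carriage return (0D) is printed as `.` because it is not a printable character. A minimal sketch, with decode_body as a hypothetical helper name:

```python
def decode_body(hex_dump):
    """Turn a space-separated hex dump such as '68 65 6C 6C 6F 0D' into bytes."""
    return bytes.fromhex(hex_dump)


samples = {
    "hello": "68 65 6C 6C 6F 0D",
    "haddop plus backspace": "68 61 64 64 6F 70 08 0D",
}
for label, dump in samples.items():
    body = decode_body(dump)
    print(label, repr(body), "ends with CR:", body.endswith(b"\r"))
```

The second sample also explains the two dots after `haddop`: the backspace typed in the telnet session (shown there as ^H) was sent as byte 08 and became part of the body.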
2. netcat source with a logger sink, filtering out digit-only input
2.1 Configuration file
# Name the components of Agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure source r1 of Agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Regex filter interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type =regex_filter
a1.sources.r1.interceptors.i1.regex =^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents =true
# Configure sink k1 of Agent a1
a1.sinks.k1.type = logger
# Configure channel c1 of Agent a1; the channel buffers Event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
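Compared with section 1, this configuration adds the regex interceptor block. With excludeEvents = true, an Event whose body matches ^[0-9]*$ (digits only) is dropped; everything else passes through.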
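The effect of the rule can be previewed offline with a rough stand-in for the interceptor. This sketch uses Python's re module rather than the Java regex engine that Flume runs on, and it strips the trailing carriage return that telnet leaves in the body before matching; both are simplifying assumptions, and keep_event is a hypothetical helper name.

```python
import re

DIGITS_ONLY = re.compile(r"^[0-9]*$")


def keep_event(body, pattern=DIGITS_ONLY, exclude_events=True):
    """Mimic the filter: with exclude_events=True, matching events are dropped."""
    # Assumption: ignore the trailing CR that telnet leaves in the body.
    text = body.decode("utf-8").rstrip("\r")
    matched = pattern.search(text) is not None
    return not matched if exclude_events else matched


inputs = [b"liuyichang\r", b"1234\r", b"hand\r", b"1199\r", b"4dididi\r", b"12wd34\r"]
kept = [body for body in inputs if keep_event(body)]
print(kept)
```

Because the pattern is anchored at both ends, only bodies made up entirely of digits match. A body that merely contains digits, such as 12wd34, is kept.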
2.2 Start the agent and connect
Same as in section 1.
2.3 Test
[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
liuyichang
OK
1234
OK
hand
OK
1199
OK
hahahaah
OK
1
OK
2
OK
3
OK
4dididi
OK
12wd34
OK
Connection closed by foreign host.
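Check the output. The digit-only lines (1234, 1199, 1, 2, 3) are missing, while lines that mix digits and letters (4dididi, 12wd34) still pass: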
2021-01-19 17:29:16,832 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 6C 69 75 79 69 63 68 61 6E 67 0D liuyichang. }
2021-01-19 17:29:31,836 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 6E 64 0D hand. }
2021-01-19 17:30:49,868 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 68 61 68 61 61 68 0D hahahaah. }
2021-01-19 17:30:53,870 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 34 64 69 64 69 64 69 0D 4dididi. }
2021-01-19 17:31:09,362 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 31 32 77 64 33 34 0D 12wd34. }
3. netcat source with an HDFS sink
3.1 Configuration file
# Name the components of Agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure source r1 of Agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
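# Configure sink k1 of Agent a1
#a1.sinks.k1.type = logger
a1.sinks.k1.type=hdfs
# HDFS path to write to
a1.sinks.k1.hdfs.path=hdfs:/flume
# Prefix of the generated file names
a1.sinks.k1.hdfs.filePrefix=events
# Round the event timestamp down, here to a multiple of 10 minutes. Rounding only
# affects time escape sequences in hdfs.path, and this path contains none.
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
# Unit of the rounding value: minute
a1.sinks.k1.hdfs.roundUnit = minute
# Intended to start a new file every 60 seconds. Note: the HDFS sink property for
# this is hdfs.rollInterval (default 30 s); hdfs.roundInterval is not a documented
# property, so this line appears to have no effect.
a1.sinks.k1.hdfs.roundInterval = 60
a1.sinks.k1.hdfs.fileType = DataStream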
# Configure channel c1 of Agent a1; the channel buffers Event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3.2 Start the agent and connect
Start the agent
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
Connect
telnet localhost 44444
3.3 Test
3.3.1 Check HDFS
[root@master ~]# hadoop fs -ls /
Found 10 items
-rw-r--r-- 2 root supergroup 1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x - root supergroup 0 2020-12-13 17:41 /data
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /dzw
drwxr-xr-x - root supergroup 0 2020-12-14 18:06 /hadoop
drwxr-xr-x - root supergroup 0 2020-12-29 17:59 /mr_wc
drwxr-xr-x - root supergroup 0 2020-12-29 17:57 /output
drwxr-xr-x - root supergroup 0 2020-12-21 15:34 /prodata
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /test
drwx-wx-wx - root supergroup 0 2020-12-14 21:43 /tmp
drwxr-xr-x - root supergroup 0 2020-12-25 11:40 /user
At this point there is no flume directory.
3.3.2 Input test
[root@master apache-flume-1.6.0-bin]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
qwq
OK
qqdeqd
OK
stupid
OK
liuyichang
OK
100086
OK
sichuan
OK
China
OK
panda
OK
3.3.3 Check the HDFS output files
[root@slave1 ~]# hadoop fs -ls /
Found 11 items
-rw-r--r-- 2 root supergroup 1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x - root supergroup 0 2020-12-13 17:41 /data
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /dzw
drwxr-xr-x - root supergroup 0 2021-01-20 16:26 /flume
drwxr-xr-x - root supergroup 0 2020-12-14 18:06 /hadoop
drwxr-xr-x - root supergroup 0 2020-12-29 17:59 /mr_wc
drwxr-xr-x - root supergroup 0 2020-12-29 17:57 /output
drwxr-xr-x - root supergroup 0 2020-12-21 15:34 /prodata
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /test
drwx-wx-wx - root supergroup 0 2020-12-14 21:43 /tmp
drwxr-xr-x - root supergroup 0 2020-12-25 11:40 /user
Flume has now created the /flume directory in HDFS automatically.
[root@slave1 ~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r-- 2 root supergroup 13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r-- 2 root supergroup 13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 2 items
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 12 2021-01-20 16:27 /flume/events.1611131231774.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r-- 2 root supergroup 14 2021-01-20 16:27 /flume/events.1611131262116.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r-- 2 root supergroup 14 2021-01-20 16:28 /flume/events.1611131262116
[root@slave1 ~]# hadoop fs -ls /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
[root@slave1 ~]# hadoop fs -cat /flume/events.1611131189758
qwq
qqdeqd
stupid
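The input can be read back from the files under /flume.
Note: why the .tmp files appear
While the HDFS sink is still writing to a file, the file carries a .tmp suffix. When the roll interval elapses, the file is closed and renamed without the suffix, and later Events go into a new .tmp file. The configuration above was meant to roll once a minute, but it sets hdfs.roundInterval rather than hdfs.rollInterval, so the interval actually in effect appears to be the sink's default of 30 seconds.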
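The numeric suffixes in the file names give a rough check on the roll interval. The sketch below reads them as milliseconds since the epoch, which is an assumption, although it agrees with the modification times in the listing. The UTC+8 offset is also assumed from those timestamps.

```python
from datetime import datetime, timedelta, timezone

# File-name suffixes copied from the hadoop fs -ls /flume listing.
SUFFIXES = [1611131189758, 1611131231774, 1611131262116]
UTC8 = timezone(timedelta(hours=8))  # assumption: the cluster clock is UTC+8


def to_local(ms):
    """Interpret a suffix as epoch milliseconds and convert it to local time."""
    return datetime.fromtimestamp(ms / 1000, tz=UTC8)


def gaps_seconds(suffixes):
    """Seconds between the creation of consecutive files."""
    return [round((b - a) / 1000, 1) for a, b in zip(suffixes, suffixes[1:])]


for ms in SUFFIXES:
    print(to_local(ms).strftime("%Y-%m-%d %H:%M:%S"))
print(gaps_seconds(SUFFIXES))
```

A new file can only be created after the previous one has been rolled, so a gap shorter than 60 seconds between two files would be hard to explain with a one-minute interval, whereas gaps of 30 seconds or more fit a 30-second interval.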
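How can Flume be configured to avoid producing too many small files?
a. Limit the size of each file, in bytes (here 200 MB = 200 * 1024 * 1024)
a1.sinks.k1.hdfs.rollSize = 209715200
b. Limit the number of Events stored in each file
a1.sinks.k1.hdfs.rollCount = 10000
These limits only reduce the number of files if time-based rolling is relaxed as well, for example with a larger hdfs.rollInterval (0 disables it).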
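How the three roll settings interact can be sketched with a simplified model: a file is rolled as soon as any enabled threshold is reached, and a value of 0 disables that threshold. The class below is an illustration only, not the sink's actual implementation. Its defaults mirror the documented defaults of the HDFS sink, and setting the interval to 0 in the tuned policy is an illustrative choice rather than something taken from the configuration above.

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    roll_interval_s: int = 30    # hdfs.rollInterval, 0 disables
    roll_size_bytes: int = 1024  # hdfs.rollSize, 0 disables
    roll_count: int = 10         # hdfs.rollCount, 0 disables

    def should_roll(self, age_s, size_bytes, events):
        """True as soon as any enabled threshold is reached."""
        return (
            (self.roll_interval_s > 0 and age_s >= self.roll_interval_s)
            or (self.roll_size_bytes > 0 and size_bytes >= self.roll_size_bytes)
            or (self.roll_count > 0 and events >= self.roll_count)
        )


defaults = RollPolicy()
tuned = RollPolicy(roll_interval_s=0, roll_size_bytes=200 * 1024 * 1024, roll_count=10000)

# A 21-byte file holding 10 events, 5 seconds old.
print(defaults.should_roll(age_s=5, size_bytes=21, events=10))
# The same content after an hour under the tuned policy.
print(tuned.should_roll(age_s=3600, size_bytes=21, events=10))
```

Under the defaults, a handful of short lines is already enough to close a file, which is why small files pile up quickly unless all three thresholds are raised together.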
4. HTTP source with a logger sink
4.1 Configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source
a1.sources.r1.type=org.apache.flume.source.http.HTTPSource
a1.sources.r1.bind=master
a1.sources.r1.port=50020
# Configure the sink
a1.sinks.k1.type=logger
# Configure the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
4.2 Start the agent
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-http.conf --name a1 -Dflume.root.logger=INFO,console
4.3 Send an HTTP request
[root@master ~]# curl -X POST -d '[{"headers" : {"timestamp" : "434324343","host" : "random_host.example.com"},"body" : "random_body"},{"headers" : {"namenode" : "namenode.example.com","datanode" : "random_datanode.example.com"},"body" : "liuyichang"}]' master:50020
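The request body is a JSON array in which every element is one Event with a `headers` object and a `body` string. The same request could be built programmatically, roughly as sketched below. build_payload and post_events are hypothetical helper names, the URL in the usage note assumes the host and port from 4.1, and the JSON Content-Type header is a choice made here rather than something the curl command sends.

```python
import json
import urllib.request


def build_payload(events):
    """Serialize (headers, body) pairs into the JSON array used in the curl call."""
    return json.dumps([{"headers": h, "body": b} for h, b in events]).encode("utf-8")


def post_events(url, events, timeout=5.0):
    """POST the events to the HTTP source and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(events),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


EVENTS = [
    ({"timestamp": "434324343", "host": "random_host.example.com"}, "random_body"),
    ({"namenode": "namenode.example.com", "datanode": "random_datanode.example.com"},
     "liuyichang"),
]
```

With the agent from 4.2 running, `post_events("http://master:50020", EVENTS)` would send both Events in a single request.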
4.4 Check the result
2021-01-20 17:20:26,958 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)]
Event: { headers:{namenode=namenode.example.com, datanode=random_datanode.example.com}
body: 6C 69 75 79 69 63 68 61 6E 67 liuyichang }