Big Data Development: Flume in Practice

1. netcat source with a logger sink

1.1 Configuration file
# flume-netcat.conf: a single-node Flume configuration

# Name the components of agent a1
a1.sources = r1    
a1.sinks = k1      
a1.channels = c1   

# Configure the source r1 of agent a1
a1.sources.r1.type = netcat       
a1.sources.r1.bind = localhost    
a1.sources.r1.port = 44444        

# Configure the sink k1 of agent a1
a1.sinks.k1.type = logger         

# Configure the channel c1 of agent a1; the channel buffers Event data
a1.channels.c1.type = memory                
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1       
a1.sinks.k1.channel = c1

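This configuration file defines an agent named a1. a1 has a source that listens for data on port 44444 of the local machine, a channel that buffers events, and a sink that logs event data to the console. The file names each component and then sets its type and other properties. A single configuration file may define several agents, so the agent name is passed on the command line when Flume starts to select which one to run.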

1.2 Start the agent
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-netcat.conf --name a1 -Dflume.root.logger=INFO,console
1.3 Connect to the port
[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
1.4 Test
[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
OK
word
OK
dzw
OK
ttt
OK
haddop^H
OK
spark
OK
flume
OK
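
The interactive telnet session above can also be scripted. The following is a minimal sketch, assuming the agent from section 1.2 is listening on the given host and port and that the netcat source acknowledges each line with one reply line, as the session shows. The names frame and send_lines are illustrative and not part of Flume.

```python
import socket


def frame(line):
    """Encode one event as UTF-8 text terminated by a single line feed."""
    return (line.rstrip("\r\n") + "\n").encode("utf-8")


def send_lines(host, port, lines, timeout=5.0):
    """Send each line to the netcat source and collect one reply per line."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        reader = conn.makefile("r", encoding="utf-8", newline="\n")
        for line in lines:
            conn.sendall(frame(line))
            # The session above shows one acknowledgement line per event.
            replies.append(reader.readline().strip())
    return replies
```

With the agent running, send_lines("localhost", 44444, ["hello", "flume"]) would send two events. Because this client terminates lines with a line feed only, the event bodies should not carry the trailing carriage return that telnet adds.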

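The Flume terminal prints each received Event as a log line. The body is shown as hex bytes followed by its text form; the trailing 0D is the carriage return sent by telnet, and non-printable bytes appear as "." in the text column.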

2021-01-19 16:05:27,669 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
2021-01-19 16:05:29,842 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 77 6F 72 64 0D                                  word. }
2021-01-19 16:05:38,846 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 64 7A 77 0D                                     dzw. }
2021-01-19 16:14:24,955 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 74 74 74 0D                                     ttt. }
2021-01-19 16:19:43,018 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 64 64 6F 70 08 0D                         haddop.. }
2021-01-19 16:19:52,022 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 61 72 6B 0D                               spark. }
2021-01-19 16:19:53,289 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 66 6C 75 6D 65 0D                               flume. }
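
To read the hex column in these log lines, it can help to decode it by hand. The sketch below is illustrative only: decode_body and printable are hypothetical helper names, and the rule for the text column (printable ASCII kept, everything else shown as ".") is an assumption based on the output above.

```python
def decode_body(hex_dump):
    """Turn a space-separated hex dump from the logger sink into raw bytes."""
    return bytes.fromhex(hex_dump)


def printable(raw):
    """Mimic the text column: bytes outside printable ASCII are shown as '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


body = decode_body("68 61 64 64 6F 70 08 0D")
print(body)             # raw bytes: the typed text, a backspace (08), a carriage return (0D)
print(printable(body))  # the same bytes rendered the way the log shows them
```

This shows why the input haddop^H is logged as "haddop..": the backspace typed in telnet and the carriage return are both part of the event body.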

2. netcat source with a logger sink, dropping digit-only events

2.1 Configuration file
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source r1 of agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

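# Attach a regex filter interceptor to the source;
# with excludeEvents = true, events whose body matches the regex are dropped
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents = true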

# Configure the sink k1 of agent a1
a1.sinks.k1.type = logger

# Configure the channel c1 of agent a1; the channel buffers Event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

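Compared with section 1, only the regex interceptor block is new. Events whose body consists solely of digits match ^[0-9]*$ and are dropped; every other event passes through unchanged.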

2.2 Start the agent and connect

Same as in section 1.

2.3 Test
[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
liuyichang
OK
1234
OK
hand
OK
1199
OK
hahahaah
OK
1
OK
2
OK
3
OK
4dididi
OK
12wd34
OK
Connection closed by foreign host.

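Check the output. The digit-only lines (1234, 1199, 1, 2, 3) are dropped, while 4dididi and 12wd34 pass because they also contain letters: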

2021-01-19 17:29:16,832 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 6C 69 75 79 69 63 68 61 6E 67 0D                liuyichang. }
2021-01-19 17:29:31,836 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 6E 64 0D                                  hand. }
2021-01-19 17:30:49,868 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 68 61 68 61 61 68 0D                      hahahaah. }
2021-01-19 17:30:53,870 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 34 64 69 64 69 64 69 0D                         4dididi. }
2021-01-19 17:31:09,362 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 31 32 77 64 33 34 0D                            12wd34. }
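
The filtering result above can be reproduced offline. This is a rough sketch under stated assumptions, not Flume's implementation: it assumes the interceptor searches the body for the pattern, and it strips the trailing carriage return first because Java's "$" also matches just before a final line terminator, whereas Python's only does so before "\n". The name is_dropped is hypothetical.

```python
import re

# Same pattern as a1.sources.r1.interceptors.i1.regex
DIGITS_ONLY = re.compile(r"^[0-9]*$")


def is_dropped(body):
    """Return True if a filter with excludeEvents = true would drop this body."""
    # telnet leaves a trailing "\r" in the body; remove it so that Python's
    # "$" behaves like Java's for these inputs (an assumption of this sketch).
    text = body.decode("utf-8").rstrip("\r\n")
    return DIGITS_ONLY.search(text) is not None


inputs = [b"liuyichang\r", b"1234\r", b"hand\r", b"4dididi\r", b"12wd34\r", b"\r"]
kept = [b for b in inputs if not is_dropped(b)]
print(kept)
```

Note that the pattern uses "*", so an empty line also matches and would be dropped; ^[0-9]+$ would restrict the filter to lines with at least one digit.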

3. netcat source with an HDFS sink

3.1 Configuration file
# Name the components of agent a1
a1.sources = r1    
a1.sinks = k1      
a1.channels = c1   
# Configure the source r1 of agent a1
a1.sources.r1.type = netcat       
a1.sources.r1.bind = localhost    
a1.sources.r1.port = 44444        
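# Configure the sink k1 of agent a1
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
# HDFS target path
a1.sinks.k1.hdfs.path = hdfs:/flume
# Prefix of the generated file names
a1.sinks.k1.hdfs.filePrefix = events
# Round the event timestamp down (to a multiple of 10 minutes here);
# this only affects time escape sequences used in hdfs.path
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Roll to a new file every 60 seconds (hdfs.rollInterval is in seconds)
a1.sinks.k1.hdfs.rollInterval = 60
# Write plain event bodies instead of the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
# Configure the channel c1 of agent a1; the channel buffers Event data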
a1.channels.c1.type = memory                
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1       
a1.sinks.k1.channel = c1
3.2 Start the agent and connect

Start the agent:

./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,console

Connect to the port:

telnet localhost 44444
3.3 Test
3.3.1 Check HDFS before sending data
[root@master ~]# hadoop fs -ls / 
Found 10 items
-rw-r--r--   2 root supergroup       1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x   - root supergroup          0 2020-12-13 17:41 /data
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /dzw
drwxr-xr-x   - root supergroup          0 2020-12-14 18:06 /hadoop
drwxr-xr-x   - root supergroup          0 2020-12-29 17:59 /mr_wc
drwxr-xr-x   - root supergroup          0 2020-12-29 17:57 /output
drwxr-xr-x   - root supergroup          0 2020-12-21 15:34 /prodata
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /test
drwx-wx-wx   - root supergroup          0 2020-12-14 21:43 /tmp
drwxr-xr-x   - root supergroup          0 2020-12-25 11:40 /user

At this point there is no /flume directory.

3.3.2 Send test input
[root@master apache-flume-1.6.0-bin]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
qwq
OK
qqdeqd
OK
stupid
OK
liuyichang
OK
100086
OK
sichuan
OK
China
OK
panda
OK
3.3.3 Check the HDFS output files
[root@slave1 ~]# hadoop fs -ls /
Found 11 items
-rw-r--r--   2 root supergroup       1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x   - root supergroup          0 2020-12-13 17:41 /data
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /dzw
drwxr-xr-x   - root supergroup          0 2021-01-20 16:26 /flume
drwxr-xr-x   - root supergroup          0 2020-12-14 18:06 /hadoop
drwxr-xr-x   - root supergroup          0 2020-12-29 17:59 /mr_wc
drwxr-xr-x   - root supergroup          0 2020-12-29 17:57 /output
drwxr-xr-x   - root supergroup          0 2020-12-21 15:34 /prodata
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /test
drwx-wx-wx   - root supergroup          0 2020-12-14 21:43 /tmp
drwxr-xr-x   - root supergroup          0 2020-12-25 11:40 /user

Flume has now automatically created the /flume directory in HDFS.

[root@slave1 ~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r--   2 root supergroup         13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r--   2 root supergroup         13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 2 items
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r--   2 root supergroup         12 2021-01-20 16:27 /flume/events.1611131231774.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r--   2 root supergroup         29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r--   2 root supergroup         14 2021-01-20 16:27 /flume/events.1611131262116.tmp
[root@slave1 ~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r--   2 root supergroup         29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r--   2 root supergroup         14 2021-01-20 16:28 /flume/events.1611131262116
[root@slave1 ~]# hadoop fs -ls /flume/events.1611131189758   
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
[root@slave1 ~]# hadoop fs -cat /flume/events.1611131189758
qwq
qqdeqd
stupid

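The text that was typed in can be read back from the files under /flume.
Note: why .tmp files appear
The HDFS sink writes incoming events to an in-use file that carries a .tmp suffix. When the file is rolled (after hdfs.rollInterval seconds, hdfs.rollSize bytes, or hdfs.rollCount events, whichever comes first), the sink closes it, renames it without the suffix, and writes later events to a new .tmp file.

How can Flume be configured to avoid producing too many small files?
a. Roll by file size, given in bytes (200 MB = 200 * 1024 * 1024)
a1.sinks.k1.hdfs.rollSize = 209715200
b. Roll by the number of events stored per file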
a1.sinks.k1.hdfs.rollCount = 10000
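
As far as I know, Flume reads property values literally, so a size such as 200 * 1024 * 1024 has to be written out as a plain byte count. The sketch below does that arithmetic and renders the two properties; mib_to_bytes and roll_settings are hypothetical helper names, and the a1.sinks.k1 prefix is taken from the configuration in section 3.1.

```python
def mib_to_bytes(mib):
    """Convert mebibytes to the plain byte count expected by hdfs.rollSize."""
    return mib * 1024 * 1024


def roll_settings(size_mib, max_events, prefix="a1.sinks.k1.hdfs"):
    """Render rollSize and rollCount lines for a Flume properties file."""
    return "\n".join([
        f"{prefix}.rollSize = {mib_to_bytes(size_mib)}",
        f"{prefix}.rollCount = {max_events}",
    ])


print(roll_settings(200, 10000))
```

Setting any of the roll properties to 0 disables that trigger, which is a common way to let a single criterion, such as file size, decide when files are rolled.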

4. HTTP source with a logger sink

4.1 Configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type=org.apache.flume.source.http.HTTPSource
a1.sources.r1.bind=master
a1.sources.r1.port=50020

# Configure the sink
a1.sinks.k1.type=logger

# Configure the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
4.2 Start the agent
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-http.conf --name a1 -Dflume.root.logger=INFO,console
4.3 Send an HTTP test request
[root@master ~]# curl -X POST -d '[{"headers" : {"timestamp" : "434324343","host" : "random_host.example.com"},"body" : "random_body"
},{"headers" : {"namenode" : "namenode.example.com","datanode" : "random_datanode.example.com"},"body" : "liuyichang"}]' master:50020
4.4 Check the result
2021-01-20 17:20:26,958 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] 
Event: { headers:{namenode=namenode.example.com, datanode=random_datanode.example.com} 
body: 6C 69 75 79 69 63 68 61 6E 67                   liuyichang }
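
The curl request in section 4.3 posts a JSON array in which each element has a headers map and a body string. A minimal sketch of building and sending the same payload from a script follows; build_payload and post_events are hypothetical names, the URL is whatever host and port the HTTP source is bound to, and the Content-Type header is my own choice rather than something the original request sets.

```python
import json
import urllib.request


def build_payload(events):
    """Serialize (headers, body) pairs as a JSON array of event objects."""
    return json.dumps(
        [{"headers": headers, "body": body} for headers, body in events]
    ).encode("utf-8")


def post_events(url, events, timeout=5.0):
    """POST the events to the HTTP source and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(events),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# The same two events as in the curl example
events = [
    ({"timestamp": "434324343", "host": "random_host.example.com"}, "random_body"),
    ({"namenode": "namenode.example.com", "datanode": "random_datanode.example.com"}, "liuyichang"),
]
print(build_payload(events).decode("utf-8"))
```

With the agent from section 4.2 running, post_events("http://master:50020", events) would submit both events in one request.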