1. Flume usage
The most important habit with Flume is consulting the official user guide:
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Even the fanciest configurations come straight from the official docs, so read them often.
2. Flume log consolidation
2.1 Official documentation (see the Consolidation section of the user guide)
2.2 Example
In this example Flume collects log-file data and TCP data, consolidates the two streams, and prints them to the console.
A typical real-world application: multiple data sources (e.g., PC and web clients) are collected, consolidated, and sunk to HDFS.
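For reference, a minimal sketch of what agent3's sink could look like if it delivered to HDFS instead of the console; the NameNode address and directory layout here are illustrative assumptions, not part of this example:
a3.sinks.k1.type = hdfs
a3.sinks.k1.channel = c1
# assumed path; substitute your own NameNode host/port and directory
a3.sinks.k1.hdfs.path = hdfs://hadoop03:8020/flume/events/%Y%m%d
# write plain text rather than the default SequenceFile
a3.sinks.k1.hdfs.fileType = DataStream
# resolve the %Y%m%d escapes with the agent's local clock
a3.sinks.k1.hdfs.useLocalTimeStamp = true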
2.2.1 Flume collects log and TCP data, consolidates it, and outputs to the console
log files ----> agent1
/opt/logs           [avro] ----> agent3 ----> console output
TCP       ----> agent2
(syslogtcp)
Key point: Avro serves as the intermediate connection layer between the agents.
2.2.2 agent1 configuration
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# spooldir source: watches /opt/logs for new, complete files
a1.sources.s1.type = spooldir
a1.sources.s1.channels = c1
a1.sources.s1.spoolDir = /opt/logs
a1.sources.s1.fileHeader = true
# file channel: persists events on disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /flume/flumeFlu/checkPoint/agent1
a1.channels.c1.dataDirs = /flume/flumeFlu/data/agent1
# avro sink: forwards events to agent3
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 4141
2.2.3 agent2 configuration
a2.sources = s1
a2.channels = c1
a2.sinks = k1
# syslogtcp source: listens for syslog messages over TCP
a2.sources.s1.type = syslogtcp
a2.sources.s1.port = 5140
a2.sources.s1.host = hadoop03
a2.sources.s1.channels = c1
# memory channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 10000
a2.channels.c1.byteCapacityBufferPercentage = 20
a2.channels.c1.byteCapacity = 800000
# avro sink: forwards events to agent3
a2.sinks.k1.type = avro
a2.sinks.k1.channel = c1
a2.sinks.k1.hostname = hadoop03
a2.sinks.k1.port = 4141
2.2.4 agent3 configuration
#agent3
a3.sources = s1
a3.channels = c1
a3.sinks = k1
# avro source: receives the streams from agent1 and agent2
a3.sources.s1.type = avro
a3.sources.s1.channels = c1
a3.sources.s1.bind = hadoop03
a3.sources.s1.port = 4141
# memory channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 10000
a3.channels.c1.transactionCapacity = 10000
a3.channels.c1.byteCapacityBufferPercentage = 20
a3.channels.c1.byteCapacity = 800000
# logger sink: prints events to the console/log
a3.sinks.k1.type = logger
a3.sinks.k1.channel = c1
2.2.5 Testing
(1) Start Flume
// Start the Flume agents (start order: a3, a2, a1)
vim start-agent-consolidation.sh
#!/bin/bash
nowtime=`date '+%Y%m%d_%H%M%S'`
function agent(){
echo "执行agent$1"
nohup /flume/flume/bin/flume-ng agent \
--conf /flume/flume/conf \
--name a$1 \
--conf-file /flume/flumeFlu/conf/agent$1.properties \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=961$1 \
> /flume/flumeFlu/logs/agent$1_$nowtime.log &
}
# invoke the function
agent 3 && \
agent 2 && \
agent 1
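Because each agent is started with -Dflume.monitoring.type=http, you can sanity-check that all three are up by querying their metrics endpoints; a quick sketch, using the ports 9611-9613 assigned by the script:
curl http://hadoop03:9613/metrics   # agent3: channel/source/sink counters as JSON
curl http://hadoop03:9612/metrics   # agent2
curl http://hadoop03:9611/metrics   # agent1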
(2) Send data to agent2 over TCP; Flume receives it through the listening syslogtcp source
nc hadoop03 5140
aa
bb
(3) Feed data to agent1 (the spooldir source only picks up complete files, so build the file elsewhere first, then copy it in)
echo "cc" > 1.txt
echo "dd" >> 1.txt
echo "ee" >> 1.txt
cp 1.txt /opt/logs
(4) In agent3's console/log output you can see the events (the logger sink prints each body as hex bytes plus an ASCII rendering):
11 Aug 2020 13:43:45,568 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:95) - Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 61 61 aa }
11 Aug 2020 13:43:45,569 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:95) - Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 62 62 bb }
11 Aug 2020 13:52:56,579 INFO [pool-5-thread-1] (org.apache.flume.client.avro.ReliableSpoolingFileEventReader.rollCurrentFile:497) - Preparing to move file /opt/logs/3.txt to /opt/logs/3.txt.COMPLETED
11 Aug 2020 13:52:57,587 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:95) - Event: { headers:{file=/opt/logs/3.txt} body: 63 63 cc }
11 Aug 2020 13:52:57,587 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:95) - Event: { headers:{file=/opt/logs/3.txt} body: 64 64 dd }
11 Aug 2020 13:52:57,588 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:95) - Event: { headers:{file=/opt/logs/3.txt} body: 65 65 ee }
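As the ReliableSpoolingFileEventReader line shows, the spooldir source renames each fully-processed file with a .COMPLETED suffix. Assuming the 1.txt from step (3), you can confirm this on disk:
ls /opt/logs
# 1.txt.COMPLETED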
3. Flume sink failover
3.1 Official documentation (see the Failover Sink Processor section of the user guide)
3.2 Example
3.2.1 Flume log collection with a failover-sink architecture
                    sink1 (k1, priority 10) --> agent3 --> output (higher priority)
log --> agent1
/opt/failover.log   sink2 (k2, priority 1)  --> agent2 (standby)
Key point: one of the sinks is a standby. In the configuration below, k1 (feeding agent3 on port 52020) has a higher priority than k2 (feeding agent2 on port 52021), so as long as agent3 stays up, data is delivered through agent3.
3.2.2 agent1 configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# source
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/failover.log
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink
#set sink1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 52020
#set sink2
a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 52021
#set sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
#set failover
a1.sinkgroups.g1.processor.type = failover
# sink priority: the larger the value, the higher the priority; events go to the highest-priority sink that is alive
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
# maximum backoff period (ms) for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
3.2.3 agent2 configuration
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# Source
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = hadoop03
a2.sources.r1.port = 52021
# Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Sink
a2.sinks.k1.channel = c1
a2.sinks.k1.type = logger
3.2.4 agent3 configuration
a3.sources = r1
a3.channels = c1
a3.sinks = k1
# Source
a3.sources.r1.type = avro
a3.sources.r1.channels = c1
a3.sources.r1.bind = hadoop03
a3.sources.r1.port = 52020
# Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Sink
a3.sinks.k1.channel = c1
a3.sinks.k1.type = logger
3.2.5 Testing
(1) Start Flume
// Start the Flume agents
vim start-failoverSink.sh
#!/bin/bash
nowtime=`date '+%Y%m%d_%H%M%S'`
function agent(){
echo "执行agent -> failover$1.properties"
/flume/flume/bin/flume-ng agent \
--conf /flume/flume/conf \
--name a$1 \
--conf-file /flume/flumeFlu/conf/failover$1.properties \
-Dflume.root.logger=INFO,console
}
# invoke the function
agent $1
// start order: 3 -> 2 -> 1
start-failoverSink.sh 3
start-failoverSink.sh 2
start-failoverSink.sh 1
(2) Write log entries
[root@hadoop03 opt]# echo "1" >> failover.log
[root@hadoop03 opt]# echo "2" >> failover.log
[root@hadoop03 opt]# echo "3" >> failover.log
[root@hadoop03 opt]# echo "4" >> failover.log
(3) For now every event is output by agent3 (the higher-priority sink)
2020-08-11 16:15:16,906 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 31 1 }
2020-08-11 16:15:31,914 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 32 2 }
2020-08-11 16:15:31,914 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 33 3 }
2020-08-11 16:15:34,733 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 34 4 }
(4) Simulate agent3 going down (kill its process)
Write more log entries
[root@hadoop03 opt]# echo "5" >> failover.log
[root@hadoop03 opt]# echo "6" >> failover.log
Now agent2 outputs the events
2020-08-11 16:17:09,852 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 35 5 }
2020-08-11 16:17:09,853 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 36 6 }
This shows that agent1's sink group provides failover: one downstream sink can die without affecting the pipeline.
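To check fail-back, a sketch assuming the same start script: restart agent3 and keep writing. Once k1's backoff penalty expires (capped by maxpenalty = 10000 ms), events should be delivered through agent3 again.
start-failoverSink.sh 3            # bring agent3 back up
echo "7" >> /opt/failover.log      # after the backoff expires, this shows up on agent3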
4. Flume sink load balancing
4.1 Official documentation (see the Load balancing Sink Processor section of the user guide)
See the architecture diagram in the official docs.
4.2 Example
4.2.1 Flume log collection with a load-balancing sink architecture
                    sink1 (k1, port 52020) --> agent3
log --> agent1 --> output
/opt/balance.log    sink2 (k2, port 52021) --> agent2
4.2.2 agent1 configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# source
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/balance.log
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink
#set sink1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 52020
#set sink2
a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 52021
#set sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
#set balance
# use the load-balancing sink processor
a1.sinkgroups.g1.processor.type = load_balance
# sink selection mechanism: round_robin or random
a1.sinkgroups.g1.processor.selector = round_robin
# these two work together: with backoff = true, a failed sink is blacklisted with an exponentially growing timeout capped at maxTimeOut (ms) before it is retried (see the official docs)
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector.maxTimeOut = 30000
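If strict alternation is not required, the selector can instead be set to random; a one-line variant of the same group:
a1.sinkgroups.g1.processor.selector = random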
4.2.3 agent2 configuration
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# Source
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = hadoop03
a2.sources.r1.port = 52021
# Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Sink
a2.sinks.k1.channel = c1
a2.sinks.k1.type = logger
4.2.4 agent3 configuration
a3.sources = r1
a3.channels = c1
a3.sinks = k1
# Source
a3.sources.r1.type = avro
a3.sources.r1.channels = c1
a3.sources.r1.bind = hadoop03
a3.sources.r1.port = 52020
# Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Sink
a3.sinks.k1.channel = c1
a3.sinks.k1.type = logger
4.2.5 Testing
(1) Start Flume
// Start the Flume agents
vim start-loadBalanceSink.sh
#!/bin/bash
nowtime=`date '+%Y%m%d_%H%M%S'`
function agent(){
echo "执行agent -> loadBanance$1.properties"
/flume/flume/bin/flume-ng agent \
--conf /flume/flume/conf \
--name a$1 \
--conf-file /flume/flumeFlu/conf/loadBalance$1.properties \
-Dflume.root.logger=INFO,console
}
# invoke the function
agent $1
// start order: 3 -> 2 -> 1
start-loadBalanceSink.sh 3
start-loadBalanceSink.sh 2
start-loadBalanceSink.sh 1
(2) Write log entries
[root@hadoop03 opt]# echo "1" >> balance.log
[root@hadoop03 opt]# echo "2" >> balance.log
[root@hadoop03 opt]# echo "3" >> balance.log
[root@hadoop03 opt]# echo "4" >> balance.log
(3) Output from sink1 (round_robin alternates events between the two sinks)
2020-08-11 15:46:41,117 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 31 1 }
2020-08-11 15:46:53,004 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 33 3 }
(4) Output from sink2
2020-08-11 15:46:51,724 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 32 2 }
2020-08-11 15:46:57,727 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 34 4 }
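To observe the backoff behavior, kill one of the downstream agents and keep writing: with backoff = true the dead sink is blacklisted, so every event goes to the survivor until the blacklist entry expires. A sketch (the agents run in the foreground here, so Ctrl+C is enough to stop one):
echo "5" >> balance.log   # now handled only by the remaining sink
echo "6" >> balance.log   # still the remaining sink while the other is blacklisted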