Flume最典型的7种用法_flume使用-CSDN博客

本文链接：https://blog.csdn.net/weixin_43833543/article/details/103406261

案例一：接收telent数据

案例：使用网络telent命令向一台机器发送一些网络数据，然后通过flume采集网络端口数据

在这里插入图片描述

配置文件

	vim   /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf/netcat-logger.conf

# 定义这个agent中各组件的名字
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# 描述和配置source组件：r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.52.120
a1.sources.r1.port = 44444

# 描述和配置sink组件：k1
a1.sinks.k1.type = logger

# 描述和配置channel组件，此处使用是内存缓存的方式
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 描述和配置source  channel   sink之间的连接关系
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Channel参数解释：
capacity：默认该通道中最大的可以存储的event数量
trasactionCapacity：每次最大可以从source中拿到或者送到sink中的event数量

启动配置文件：

bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1  -Dflume.root.logger=INFO,console

在node02机器上面安装telnet客户端，用于模拟数据的发送

yum -y install telnet
telnet node03 44444 # 使用telnet模拟数据发送

在这里插入图片描述

案例二：采集目录到HDFS

结构示意图：
在这里插入图片描述

采集需求：某服务器的某特定目录下，会不断产生新的文件，每当有新文件出现，就需要把文件采集到HDFS中去
根据需求，首先定义以下3大要素

数据源组件，即source ——监控文件目录 : spooldir
spooldir特性：
1、监视一个目录，只要目录中出现新文件，就会采集文件中的内容
2、采集完成的文件，会被agent自动添加一个后缀：COMPLETED
3、所监视的目录中不允许重复出现相同文件名的文件
下沉组件，即sink——HDFS文件系统 : hdfs sink
通道组件，即channel——可用file channel 也可以用内存channel

配置文件编写：

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
mkdir -p /export/servers/dirfile
vim spooldir.conf

# Name the components on this agent
a1.sources=r1
a1.channels=c1
a1.sinks=k1
# Describe/configure the source
##注意：不能往监控目中重复丢同名文件
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/export/dir
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://node01:8020/spooldir/
# Describe the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
# Bind the source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

启动flume

bin/flume-ng agent -c ./conf -f ./conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console

将不同的文件上传到下面目录里面去，注意文件不能重名

 cd /export/dir

案例三采集文件到HDFS

采集需求：比如业务系统使用log4j生成的日志，日志内容不断增加，需要把追加到日志文件中的数据实时采集到hdfs

根据需求，首先定义以下3大要素

采集源，即source——监控文件内容更新 : exec ‘tail -F file’
下沉目标，即sink——HDFS文件系统 : hdfs sink
Source和sink之间的传递通道——channel，可用file channel 也可以用内存channel

node03开发配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf 
vim tail-file.conf

配置文件内容

a1.sources=r1
a1.channels=c1
a1.sinks=k1
# Describe/configure tail -F source1
a1.sources.r1.type=exec
a1.sources.r1.command =tail -F /export/taillogs/access_log
 
# Describe sink1
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://node01:8020/spooldir/

# Use a channel which buffers events in memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

# Bind the source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

启动flume

bin/flume-ng agent -c conf -f conf/tail-file.conf -n agent1 -Dflume.root.logger=INFO,console

开发shell脚本定时追加文件内容

mkdir -p /export/shells/
cd  /export/shells/
vim tail-file.sh

#!/bin/bash
while true
do
 date >> /export/servers/taillogs/access_log;
  sleep 0.5;
done

创建文件夹
mkdir -p /export/servers/taillogs
启动脚本
sh /export/shells/tail-file.sh

案例四：两个agent级联

在这里插入图片描述

需求分析：
第一个agent负责收集文件当中的数据，通过网络发送到第二个agent当中去，第二个agent负责接收第一个agent发送的数据，并将数据保存到hdfs上面去

node02配置flume配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim tail-avro-avro-logger.conf

##################
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/taillogs/access_log

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

##sink端的avro是一个数据发送者
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.52.120
a1.sinks.k1.port = 4141

#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

node02开发定脚本文件往写入数据

直接将node03下面的脚本和数据拷贝到node02即可，node03机器上执行以下命令

cd  /export
scp -r shells/ taillogs/ node02:$PWD

node03开发flume配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim avro-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

##source中的avro组件是一个接收者服务
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.52.120
a1.sources.r1.port = 4141

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node01:8020/avro

# Bind the source and sink to the channel 
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

顺序启动

node03机器启动flume进程

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin
bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1  -Dflume.root.logger=INFO,console

node02机器启动flume进程

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/
bin/flume-ng agent -c conf -f conf/tail-avro-avro-logger.conf -n a1  -Dflume.root.logger=INFO,console

node02机器启shell脚本生成文件

mkdir /export/taillogs/
cd /export/servers/shells
sh tail-file.sh

案例五：故障转移（高可用）

图例：
在这里插入图片描述
角色分配：

名称	HOST	角色
Agent1	node01	Web Server
Collector1	node02	AgentMstr1
Collector2	node03	AgentMstr1

图中所示，Agent1数据分别流入到Collector1和Collector2，Flume NG本身提供了Failover机制，可以自动切换和恢复。在上图中，有3个产生日志服务器分布在不同的机房，要把所有的日志都收集到一个集群中存储。下面我们开发配置Flume NG集群

node01机器配置agent的配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
agent.conf

#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#
##set gruop
agent1.sinkgroups = g1
##set sink group
agent1.sinkgroups.g1.sinks = k1 k2

#
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /export/taillogs/access_log

#
##set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
## set sink1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
#
## set sink2
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
#
##set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 2
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
#
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
agent1.sinks.k2.channel = c1

maxpenalty故障转移的默认时间

node02机器修改配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
collector.conf

#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

## other node,nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port = 52020

##set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#
##set sink to hdfs
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path= hdfs://node01:8020/flume/failover/


a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

node03机器修改配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
collector.conf

#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

## other node,nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = node03
a1.sources.r1.port = 52020

##set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#
##set sink to hdfs
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path= hdfs://node01:8020/flume/failover/


a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

顺序启动命令

node03机器上面启动flume

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin bin/flume-ng agent
-n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

node02机器上面启动flume

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin bin/flume-ng agent
-n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

node01机器上面启动flume

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin bin/flume-ng agent
-n agent1 -c conf -f conf/agent.conf -Dflume.root.logger=DEBUG,console

node01机器启动文件产生脚本

cd /export/shells
sh tail-file.sh

FAILOVER测试

下面我们来测试下Flume NG集群的高可用（故障转移）。场景如下：我们在Agent1节点上传文件，由于我们配置Collector1的权重比Collector2大，所以 Collector1优先采集并上传到存储系统。然后我们kill掉Collector1，此时有Collector2负责日志的采集上传工作，之后，我们手动恢复Collector1节点的Flume服务，再次在Agent1上次文件，发现Collector1恢复优先级别的采集工作。具体截图如下所示：

Collector1优先上传

在这里插入图片描述

HDFS集群中上传的log内容预览

在这里插入图片描述

Collector1宕机，Collector2获取优先上传权限在这里插入图片描述
重启Collector1服务，Collector1重新获得优先上传的权限

案例六：负载均衡load balancer

负载均衡是用于解决一台机器(一个进程)无法解决所有请求而产生的一种算法。Load balancing Sink Processor 能够实现 load balance 功能，如下图Agent1 是一个路由节点，负责将 Channel 暂存的 Event 均衡到对应的多个 Sink组件上，而每个 Sink 组件分别连接到一个独立的 Agent 上，示例配置，如下所示：

在这里插入图片描述

在此处我们通过三台机器来进行模拟flume的负载均衡
三台机器规划如下：
node01：采集数据，发送到node02和node03机器上去
node02：接收node01的部分数据
node03：接收node01的部分数据

node01服务器配置：

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
load_banlancer_client.conf

#agent name
a1.channels = c1
a1.sources = r1
a1.sinks = k1 k2

#set gruop
a1.sinkgroups = g1
#set sink group
a1.sinkgroups.g1.sinks = k1 k2

#set sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/taillogs/access_log

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# set sink1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 52021

# set sink2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = node03
a1.sinks.k2.port = 52021


#set failover
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut=10000


a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

node02服务器的flume配置

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
load_banlancer_server.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port = 52021

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Describe the sink
a1.sinks.k1.type = logger

# Bind the source and sink to the channel 
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

node03服务器flume配置

node03服务器配置

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
load_banlancer_server.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node03
a1.sources.r1.port = 52021

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Describe the sink
a1.sinks.k1.type = logger

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

启动flume服务

启动node03的flume服务

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin
bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=DEBUG,console

启动node02的flume服务

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin bin/flume-ng agent
-n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=DEBUG,console

启动node01的flume服务

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin bin/flume-ng agent
-n a1 -c conf -f conf/load_banlancer_client.conf -Dflume.root.logger=DEBUG,console

ode01服务器运行脚本产生数据

cd /export/shells
sh tail-file.sh

案例七：flume过滤器

案例场景：
A、B两台日志服务机器实时生产日志主要类型为access.log、nginx.log、web.log
现在要求：
把A、B 机器中的access.log、nginx.log、web.log 采集汇总到C机器上然后统一收集到hdfs中。
但是在hdfs中要求的目录为：
/source/logs/access/20180101/**
/source/logs/nginx/20180101/**
/source/logs/web/20180101/**

在这里插入图片描述

数据流程处理分析
在这里插入图片描述

实现：
服务器A对应的IP为 192.168.52.100
服务器B对应的IP为 192.168.52.110
服务器C对应的IP为 192.168.52.120

node01与node02服务器开发flume的配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
exec_source_avro_sink.conf

# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/taillogs/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
##  static拦截器的功能就是往采集到的数据的header中插入自己定## 义的key-value对
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /export/taillogs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx

a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /export/taillogs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node03
a1.sinks.k1.port = 41414

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1

注：
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
static拦截器的功能就是往采集到的数据的header中插入自己定## 义的key-value对
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

在node03上面开发flume配置文件

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf vim
avro_source_hdfs_sink.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1
#定义source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.52.120
a1.sources.r1.port =41414

#添加时间拦截器
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder


#定义channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

#定义sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.52.100:8020/source/logs/%{type}/%Y%m%d
 
#组装source、channel、sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

在node01与node02上面开发shell脚本，模拟数据生成

cd /export/servers/shells
vim server.sh

#!/bin/bash
while true
do  
 date >> /export/servers/taillogs/access.log; 
 date >> /export/servers/taillogs/web.log;
 date >> /export/servers/taillogs/nginx.log;
  sleep 0.5;
done

顺序启动服务
node03启动flume实现数据收集

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin

bin/flume-ng agent -c conf -f conf/avro_source_hdfs_sink.conf -name a1 -Dflume.root.logger=DEBUG,console

node01与node02启动flume实现数据监控

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin


bin/flume-ng agent -c conf -f conf/exec_source_avro_sink.conf -name a1 -Dflume.root.logger=DEBUG,console

node01与node02启动生成文件脚本

cd /export/shells
sh server.sh

在这里插入图片描述

Flume最典型的7种用法

案例一： 接收telent数据

案例二 ：采集目录到HDFS

案例三 采集文件到HDFS

案例四：两个agent级联

案例五：故障转移（高可用）

案例六：负载均衡load balancer

案例七：flume过滤器

案例一：接收telent数据

案例二：采集目录到HDFS

案例三采集文件到HDFS