Flume

Latest recommended article published on 2022-03-31 20:35:21

Doflying223

Tags: flume

Article link: https://blog.csdn.net/weixin_56614846/article/details/118939656

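1. Flume definition

Flume, originally provided by Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It is built on a streaming architecture.
Flume supports customizable data senders for collecting many kinds of data, as well as a variety of data receivers for final storage. Ordinary collection requirements can be met with simple Flume configuration, and Flume is also easy to extend for special scenarios. It therefore suits most day-to-day collection scenarios.
Flume has two major version lines: 0.9.x and 1.x.

Flume 0.9.x releases are collectively called Flume OG (original generation).
Flume 1.x releases are collectively called Flume NG (next generation).

Flume NG refactored the core components, core configuration, and code architecture, so it differs substantially from Flume OG; take care to distinguish the two. The change also coincided with Flume moving under the Apache umbrella, when Cloudera Flume was renamed Apache Flume.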

2. Flume core architecture (key topic)

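Agent
An independent Flume process made up of Source, Channel, and Sink components. (An agent runs Flume inside a JVM. Typically one agent runs per machine, and a single agent can contain multiple sources and sinks.)
Source
The data collection component; it passes what it collects to a Channel. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
Sink
A Sink continuously polls the Channel for events, removes them in batches, and writes each batch to a storage or indexing system or sends it on to another Flume agent.
Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.
Channel
A Channel is a buffer that sits between the Source and the Sink, which lets the two run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Flume ships with two commonly used channels: Memory Channel and File Channel.

Differences between Memory Channel and File Channel:
Memory Channel is an in-memory queue, so data is lost if the process dies or the machine crashes or restarts.
File Channel writes every event to disk, so data survives a process shutdown or machine crash.

Event
The basic unit of data transfer in Flume; data travels from source to destination in the form of Events.
An Event consists of a Header and a Body. The Header holds the event's attributes as key-value pairs, and the Body holds the data itself as a byte array.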
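
To make the header/body split concrete, here is a minimal sketch in plain Java. SimpleEvent and its fromLine helper are hypothetical names used only for illustration; this is not Flume's own Event interface, just a model of the same shape (string key-value headers plus a byte-array body).

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Flume event: string headers plus a byte[] body.
final class SimpleEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    SimpleEvent(byte[] body) {
        this.body = body.clone();
    }

    // Headers are mutable so that later stages can add attributes.
    Map<String, String> getHeaders() {
        return headers;
    }

    // Return a copy so callers cannot modify the stored payload.
    byte[] getBody() {
        return body.clone();
    }

    // Encode one text line as the body and record a "timestamp" header.
    static SimpleEvent fromLine(String line, long timestampMillis) {
        SimpleEvent event = new SimpleEvent(line.getBytes(StandardCharsets.UTF_8));
        event.getHeaders().put("timestamp", Long.toString(timestampMillis));
        return event;
    }
}
```

The "timestamp" header is chosen here because the HDFS sink examples later in this article rely on a header with that key.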

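3. Flume getting-started examples

Official example: monitoring port data

Requirement: use Flume to listen on a port, collect the data sent to that port, and print it to the console.
Analysis: a client connects to the port with telnet and types data; on the listening node, the Flume Source reads the data → Channel → Sink → output to the terminal.

Preparation:

(1) Install the telnet client (used below to send data to the netcat source)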

sudo yum -y install telnet.*

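Netcat is a simple Unix utility that reads and writes data over TCP and UDP. It is designed as a reliable back-end tool that other programs can drive, and it is also used for network testing and as a hacking tool. With it you can easily set up almost any kind of connection. Flume's netcat source listens on a port in a similar way, and telnet serves as the client in this example.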

(2) Check whether the port is already in use

sudo netstat -tnlp | grep 4444

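(3) Create the Flume agent configuration file flume-netcat-logger.conf
Create a flume-job folder under the tmp directory and change into it:

mkdir -p ~/tmp/flume-job && cd ~/tmp/flume-job

(3.1) In the flume-job folder, create the Flume agent configuration file flume-netcat-logger.conf.

vim flume-netcat-logger.conf

(3.2) Add the following content to flume-netcat-logger.conf.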

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 4444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(4) Start Flume first so that it listens on the port
First form:

flume-ng agent --conf /home/offcn/apps/flume-1.9.0/conf --name a1 --conf-file flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Second form:

flume-ng agent -c /home/offcn/apps/flume-1.9.0/conf -n a1 -f flume-netcat-logger.conf -Dflume.root.logger=INFO,console

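Parameter	Explanation
--conf / -c	Directory that holds Flume's configuration files
--name / -n	Name of the agent, here a1
--conf-file / -f	The collection configuration read for this run, here flume-netcat-logger.conf in the flume-job folder
-Dflume.root.logger=INFO,console	-D overrides the flume.root.logger property at run time, setting the log level to INFO and sending log output to the console. Common log levels are debug, info, warn, and error.

(5) Use telnet to send data to port 4444 on the local machine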

telnet localhost 4444
aaa
OK
bbb
OK
ccc
OK

Finally, watch the received data in the Flume console

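Monitoring a single file in real time

Requirement: monitor a log file in real time and upload its content to HDFS.
Analysis: data is appended to the log file; Flume tails the file → Channel → Sink → written to HDFS.
Implementation:
(1) Create the file flume-file-hdfs.conf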

vim flume-file-hdfs.conf

(2) Add the following content

# Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source, channel, and sink
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/tmp/flume-job/xxx.log


a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000


a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node-1:8020/flume-datas/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = logs-
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Whether to round the timestamp used in the directory path down
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this value
a1.sinks.k1.hdfs.roundValue = 10
# Unit used for rounding down
a1.sinks.k1.hdfs.roundUnit = minute

# How often to roll to a new file (seconds)
a1.sinks.k1.hdfs.rollInterval = 60
# File size that triggers a roll (bytes)
a1.sinks.k1.hdfs.rollSize = 134217728
# Do not roll based on the number of Events
a1.sinks.k1.hdfs.rollCount = 0


a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

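For all time-related escape sequences, the Event Header must contain a key named "timestamp", unless hdfs.useLocalTimeStamp is set to true, in which case the sink uses the local time instead. Another way to add the header automatically is the TimestampInterceptor. This example sets a1.sinks.k1.hdfs.useLocalTimeStamp = true.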

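Parameter	Default	Description
hdfs.rollInterval	30	Roll to a new file once the current file has been written to for this long (0 disables time-based rolling). Unit: seconds
hdfs.rollSize	1024	Roll to a new file once the current file reaches this size (0 disables size-based rolling). Unit: bytes
hdfs.rollCount	10	Roll to a new file once this many Events have been written to the current file (0 disables count-based rolling)
hdfs.round	false	Whether to round the timestamp down (if true, this affects all time-based escape sequences except %t)
hdfs.roundValue	1	Round down to the highest multiple of this value that does not exceed the current time (the unit is set by hdfs.roundUnit). Example: with a current timestamp of 18:32:01 and hdfs.roundUnit = minute, roundValue=5 gives 18:30, roundValue=7 gives 18:28, and roundValue=10 gives 18:30
hdfs.roundUnit	second	Unit for rounding down; one of second, minute, hour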
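
The roundValue example in the table can be checked with a few lines of Java. This is a sketch of the rule as the table describes it (reduce the minute field to the nearest lower multiple of roundValue and zero the seconds), not Flume's own code; RoundDownSketch and roundDownMinutes are hypothetical names.

```java
import java.time.LocalTime;

// Sketch of minute-based round-down as described in the table above.
final class RoundDownSketch {
    // Reduce the minute field to the nearest lower multiple of roundValue;
    // seconds and nanoseconds are dropped.
    static LocalTime roundDownMinutes(LocalTime time, int roundValue) {
        int minute = time.getMinute();
        return LocalTime.of(time.getHour(), minute - (minute % roundValue));
    }

    public static void main(String[] args) {
        LocalTime now = LocalTime.of(18, 32, 1);
        for (int roundValue : new int[] {5, 7, 10}) {
            System.out.println("roundValue=" + roundValue + " -> "
                    + roundDownMinutes(now, roundValue));
        }
    }
}
```

For 18:32:01 this yields 18:30, 18:28, and 18:30 for roundValue 5, 7, and 10, matching the table. Note that 7 gives 18:28 because the rounding is applied to the minute-of-hour value (32 minus 32 mod 7).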

(3) Run Flume

flume-ng agent -c /home/hadoop/apps/flume-1.9.0/conf -n a1 -f flume-file-hdfs.conf -Dflume.root.logger=INFO,console

(4) Error at run time
2099-09-09 09:09:00,821 ERROR hdfs.HDFSEventSink: process failed
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
        at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
        at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:221)
        at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:572)
        at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:412)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
        at java.lang.Thread.run(Thread.java:748)

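Fix:
The NoSuchMethodError comes from a Guava version conflict. Remove guava-11.0.2.jar from Flume's lib folder (here it is renamed rather than deleted) for compatibility with Hadoop 3.2.1.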

cd apps/flume-1.9.0/lib/
mv guava-11.0.2.jar guava-11.0.2.jar.bak

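Monitoring multiple new files in a directory in real time
Requirement: use Flume to watch an entire directory and upload its files to HDFS.
Analysis: files are added to the directory; the Source watches the directory and reads the new files → Channel → Sink → written to HDFS.
Implementation
(1) Create the configuration file flume-dir-hdfs.conf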

vim flume-dir-hdfs.conf

(2) Add the following content

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/offcn/tmp/upload/
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.fileHeader = true
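# Ignore all files ending in .tmp; they are not uploaded
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node-1:8020/flume-datas/upload/%Y%m%d/%H
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-
# Whether to round the timestamp down (controls how often a new directory is created)
a1.sinks.k1.hdfs.round = true
# Number of time units per directory
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type (compressed output is also supported)
a1.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file (seconds)
a1.sinks.k1.hdfs.rollInterval = 60
# Roll size for each file, roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on the number of Events
a1.sinks.k1.hdfs.rollCount = 0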

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Start the agent that monitors the directory

flume-ng agent -c /home/hadoop/apps/flume-1.9.0/conf -n a1 -f flume-dir-hdfs.conf -Dflume.root.logger=INFO,console

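Note: when using the Spooling Directory Source, do not create files in the monitored directory and then keep modifying them. Files that have been fully ingested are renamed with the .COMPLETED suffix. The monitored directory is scanned for file changes every 500 milliseconds.

(4) Add files to the upload folder (the folder must exist before the agent starts)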

mkdir upload

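(5) Check the data on HDFS

Exec Source is suited to monitoring a single file that is appended to in real time, but it cannot resume from where it left off. Spooldir Source is suited to synchronizing new files, but not to tailing log files that are still being appended to. Taildir Source is suited to tailing multiple files that are appended to in real time, and it can resume from its last read position.

Requirement: use Flume to monitor files in a directory that are appended to in real time, and upload them to HDFS.
Implementation steps
(1) Create the configuration file flume-taildir-hdfs.conf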

vim flume-taildir-hdfs.conf

(2) Add the following content

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/offcn/tmp/flume-job/tail_dir.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/offcn/tmp/taildir/.*file.*
a1.sources.r1.filegroups.f2 = /home/offcn/tmp/taildir/.*log.*

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node-1:8020/flume-datas/taildir/%Y%m%d/%H
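# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-
# Whether to round the timestamp down (controls how often a new directory is created)
a1.sinks.k1.hdfs.round = true
# Number of time units per directory
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type (compressed output is also supported)
a1.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file (seconds)
a1.sinks.k1.hdfs.rollInterval = 60
# Roll size for each file, roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on the number of Events
a1.sinks.k1.hdfs.rollCount = 0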

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Start the agent that monitors the directory

flume-ng agent -c /home/hadoop/apps/flume-1.9.0/conf -n a1 -f flume-taildir-hdfs.conf -Dflume.root.logger=INFO,console

(4) Append content to files in the taildir folder

mkdir taildir
echo hello >> taildir/file1.txt
echo hello >> taildir/file2.txt

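(5) Check the data on HDFS

About Taildir:
Taildir Source maintains a position file in JSON format and periodically updates it with the latest read position of each file, which is how it can resume from where it left off.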

[
	{"inode":206840033,"pos":18,"file":"/home/hadoop/tmp/taildir/file1.txt"},
	{"inode":206840036,"pos":18,"file":"/home/hadoop/tmp/taildir/file2.txt"}
]

Note: on Linux, the area that stores a file's metadata is called an inode. Each inode has a number, and Unix/Linux systems use inode numbers, not file names, to identify files internally.
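
The resume behaviour can be illustrated with a small Java sketch. It is only a model under stated assumptions: PositionTracker is a hypothetical class, in-memory strings stand in for files, characters stand in for bytes, files are assumed to only grow, and the real source persists its positions to the JSON file instead of a map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of resumable tailing: remember per inode how much has
// already been consumed, and return only what was appended since then.
final class PositionTracker {
    // inode number -> offset of the first unread character
    private final Map<Long, Long> positions = new HashMap<>();

    // Return the unread suffix of the content and advance the saved position.
    String readNew(long inode, String fileContent) {
        long pos = positions.getOrDefault(inode, 0L);
        String unread = fileContent.substring((int) pos);
        positions.put(inode, (long) fileContent.length());
        return unread;
    }

    // Saved position for an inode, or 0 if the file has not been seen yet.
    long position(long inode) {
        return positions.getOrDefault(inode, 0L);
    }
}
```

Keying the map by inode rather than by file name mirrors the "inode" field in the position file shown above: a renamed file keeps its inode, so its saved position still applies.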

4. Flume execution flow

Key components:
1) ChannelSelector
A ChannelSelector decides which Channel an Event is sent to. There are two types: Replicating and Multiplexing.
The Replicating selector sends the same Event to every Channel, while the Multiplexing selector routes different Events to different Channels according to configured rules.
2) SinkProcessor
There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
DefaultSinkProcessor works with a single Sink.
LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group.
LoadBalancingSinkProcessor provides load balancing.
FailoverSinkProcessor provides failover.
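
The difference between the two selector types can be sketched in plain Java. SelectorSketch and its methods are hypothetical names, channels are represented only by their names, and the multiplexing rule is assumed to be a lookup on one header value with a default channel as fallback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of channel selection; channels are identified by name.
final class SelectorSketch {
    // Replicating: every configured channel receives the event.
    static List<String> replicating(List<String> allChannels) {
        return new ArrayList<>(allChannels);
    }

    // Multiplexing: the value of one header picks the channel; a missing or
    // unmapped value falls back to the default channel.
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, String> mapping, String defaultChannel) {
        String value = headers.get(headerName);
        String target = (value == null || !mapping.containsKey(value))
                ? defaultChannel
                : mapping.get(value);
        List<String> selected = new ArrayList<>();
        selected.add(target);
        return selected;
    }
}
```

With a mapping of "error" to c1 and a default of c2, an event whose "level" header is "error" goes to c1 and everything else goes to c2, whereas the replicating variant always returns both channels.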

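5. Flume transactions

To explore transactions, we analyze how Flume moves data using three components: SpoolingDirectorySource, MemoryChannel, and HdfsSink. The details of transaction handling differ with other components. MemoryChannel is sufficient for most scenarios. FileChannel is slower, although it provides log-based data recovery; as long as the agent process and the machine stay up, MemoryChannel does not lose data.

Flume's transactions protect the reliability of user data in two main ways:
Within a node, the Source writes data to the Channel in batches. If any part of a batch fails, nothing from that batch is written to the Channel; the part already received is discarded, and the previous node is relied on to resend the data.
When data is sent to the next node (usually as a batch) and the receiving node hits a problem, such as a network failure, the batch is rolled back. This can cause data to be resent.

(1) Programming model: ChannelProcessor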

for (Channel reqChannel : reqChannelQueue.keySet()) {
  Transaction tx = reqChannel.getTransaction();
  Preconditions.checkNotNull(tx, "Transaction object must not be null");
  try {
    // Begin the transaction
    tx.begin();

    List<Event> batch = reqChannelQueue.get(reqChannel);

    for (Event event : batch) {
      // Add the event to the channel's buffer
      reqChannel.put(event);
    }
    // Commit the transaction
    tx.commit();
  } catch (Throwable t) {
    // Roll back the transaction
    tx.rollback();
    if (t instanceof Error) {
      LOG.error("Error while writing to required channel: " + reqChannel, t);
      throw (Error) t;
    } else if (t instanceof ChannelException) {
      throw (ChannelException) t;
    } else {
      throw new ChannelException("Unable to put batch on required " +
          "channel: " + reqChannel, t);
    }
  } finally {
    if (tx != null) {
      // Close the transaction
      tx.close();
    }
  }
}

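(2) The Put transaction in Flume

Put transaction flow:

When the transaction starts, the doPut method is called to place a batch of data into putList. The size of the batch is set by the batch size parameter.
Before the data in putList is moved to the Channel, Flume checks whether the Channel has room for it. If it does not, none of it is added and the only option is to roll back with doRollback.
The size of putList is set by the Channel's transaction capacity parameter. (The Channel's other parameter, capacity, is the capacity of the Channel itself.)
Once the data is safely in putList, doCommit is called to move the Events from putList into the Channel, and putList is cleared after they have all been moved.
Note:
Problems tend to occur while doCommit is moving data into the Channel. For example, if the Sink takes data slowly while the Source puts data quickly, data backs up in the Channel. If the data in putList does not fit, doRollback is called, which does two things:
1. Clears putList
2. Throws a ChannelException
The Source catches the exception, collects the same batch of data again, and starts a new transaction. This is how a transaction is rolled back.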
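
The put flow described above can be modelled in a few lines of Java. This is a single-threaded sketch, not Flume's MemoryChannel: PutTransactionSketch is a hypothetical class, events are plain strings, and an IllegalStateException stands in for ChannelException.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-threaded model of a memory channel's put transaction.
final class PutTransactionSketch {
    private final Deque<String> channel = new ArrayDeque<>();
    private final List<String> putList = new ArrayList<>();
    private final int capacity;            // max events held by the channel
    private final int transactionCapacity; // max events staged per transaction

    PutTransactionSketch(int capacity, int transactionCapacity) {
        this.capacity = capacity;
        this.transactionCapacity = transactionCapacity;
    }

    // doPut: stage an event in putList; the channel itself is not touched yet.
    void doPut(String event) {
        if (putList.size() >= transactionCapacity) {
            throw new IllegalStateException("putList is full");
        }
        putList.add(event);
    }

    // doCommit: move the whole batch into the channel, or none of it.
    void doCommit() {
        if (channel.size() + putList.size() > capacity) {
            throw new IllegalStateException("not enough space in channel");
        }
        channel.addAll(putList);
        putList.clear();
    }

    // doRollback: drop the staged batch; the source is expected to resend it.
    void doRollback() {
        putList.clear();
    }

    int channelSize() { return channel.size(); }

    int stagedSize() { return putList.size(); }
}
```

With capacity 3 and transaction capacity 2, a first batch of two events commits, while a second batch of two fails the space check and leaves the channel unchanged once it is rolled back.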

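(3) The Take transaction in Flume

The take transaction likewise has a takeList. The HDFS sink has a batch size parameter that determines how many events the Sink takes from the Channel at a time, so this batch size must not exceed the size of takeList. The size of takeList is set by transaction capacity, which again is a Channel parameter.

Take transaction flow:
1. After the transaction starts, doTake moves events from the Channel into takeList.
2. If the downstream component is an HDFS Sink, each event moved into takeList is also written to the I/O buffer for HDFS (data is written to the buffer first and then flushed to HDFS).
3. Once takeList holds batch size events, doCommit is called, which does two things:

For an HDFS sink, it calls flush on the I/O stream to write the buffered data to HDFS.
It clears takeList.

Note: the flush to HDFS is where problems tend to occur. The flush may time out because of network problems, so the transfer fails. doRollback is then called; because takeList still holds a backup of the data, the events in takeList are returned to the Channel unchanged, which completes the rollback.
However, if the failure happens after half of the data has already been flushed to HDFS, doRollback still has to be called. There is no partial rollback: the whole of takeList is returned to the Channel, and reading and writing then continue. The next transaction can therefore produce duplicate data.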
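
The duplicate-data case can be reproduced with a small Java model. Everything here is hypothetical: TakeTransactionSketch is not a Flume class, a list named destination stands in for HDFS, and a failure is simulated by throwing after a given number of successful writes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a take transaction feeding a sink that can fail mid-batch.
final class TakeTransactionSketch {
    final Deque<String> channel = new ArrayDeque<>();
    final List<String> takeList = new ArrayList<>();
    final List<String> destination = new ArrayList<>(); // stands in for HDFS

    // doTake: move one event from the channel into takeList.
    String doTake() {
        String event = channel.poll();
        if (event != null) {
            takeList.add(event);
        }
        return event;
    }

    // doCommit: the batch was delivered, so the backup copy can be dropped.
    void doCommit() {
        takeList.clear();
    }

    // doRollback: return the whole batch to the head of the channel, in order.
    void doRollback() {
        for (int i = takeList.size() - 1; i >= 0; i--) {
            channel.addFirst(takeList.get(i));
        }
        takeList.clear();
    }

    // Deliver up to batchSize events; simulate a failure after failAfter
    // successful writes (pass -1 for no failure).
    void deliverBatch(int batchSize, int failAfter) {
        try {
            for (int i = 0; i < batchSize; i++) {
                String event = doTake();
                if (event == null) {
                    break;
                }
                if (i == failAfter) {
                    throw new IllegalStateException("simulated flush failure");
                }
                destination.add(event);
            }
            doCommit();
        } catch (IllegalStateException e) {
            doRollback();
        }
    }
}
```

If the channel holds a, b, c and the first attempt fails after two writes, the rollback returns all three events to the channel even though a and b already reached the destination. The retry then writes a, b, c again, so the destination ends up with a, b, a, b, c.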

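6. Flume architecture in depth

(1) Simple chaining

This mode connects multiple Flume agents in sequence, from the first source through to the storage system that the final sink delivers to. Chaining too many agents is not recommended: it slows down transfer, and if any agent in the chain goes down, the whole pipeline is affected.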

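(2) Replication and multiplexing

Flume supports sending an event flow to one or more destinations. In this mode the same data can be replicated to multiple channels, or different data can be routed to different channels, and each sink can deliver to a different destination.

Hands-on example:

Requirement: Flume-1 monitors a file for changes and passes the changes to both Flume-2 and Flume-3. Flume-2 writes the data to HDFS, and Flume-3 writes it to the local file system.

Steps

mkdir group1

Create the configuration file. It needs one source that monitors the file, plus two channels and two sinks that feed Flume-2 and Flume-3 respectively.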

vim flume-file-flume.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/tmp/flume-job/xxx.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# An avro sink acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node-1
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = node-1
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

2.2 Configure a source that receives the output of the upstream Flume agent, with a sink that writes to HDFS.

vim flume-flume-hdfs.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# An avro source acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = node-1
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://node-1:8020/flume2/%Y%m%d/%H
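# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to round the timestamp down (controls how often a new directory is created)
a2.sinks.k1.hdfs.round = true
# Number of time units per directory
a2.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type (compressed output is also supported)
a2.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file (seconds)
a2.sinks.k1.hdfs.rollInterval = 600
# Roll size for each file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on the number of Events
a2.sinks.k1.hdfs.rollCount = 0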

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

2.3 Configure a source that receives the output of the upstream Flume agent, with a sink that writes to a local directory.

vim flume-flume-dir.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = node-1
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /home/hadoop/tmp/flume-job/group1

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

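Tip: the local output directory must already exist; if it does not, it will not be created.

Run the configurations: start the corresponding Flume processes one by one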

flume-ng agent --conf /home/hadoop/apps/flume-1.9.0/conf --name a3 --conf-file flume-flume-dir.conf -Dflume.root.logger=INFO,console
flume-ng agent --conf /home/hadoop/apps/flume-1.9.0/conf --name a2 --conf-file flume-flume-hdfs.conf -Dflume.root.logger=INFO,console
flume-ng agent --conf /home/hadoop/apps/flume-1.9.0/conf --name a1 --conf-file flume-file-flume.conf -Dflume.root.logger=INFO,console

Check the results

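(3) Load balancing and failover

Requirement
Flume1 monitors a port, and the sinks in its sink group connect to Flume2 and Flume3 respectively. A FailoverSinkProcessor provides failover.

Implementation steps

Create the folder

mkdir group2

Create and fill in the configuration files
Configuration 1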

vim flume-netcat-flume.conf

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node-1
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = node-1
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

Configuration 2

vim flume-flume-console1.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = node-1
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Configuration 3

vim flume-flume-console2.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = node-1
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Run the configurations

flume-ng agent --conf /home/offcn/apps/flume-1.9.0/conf --name a3 --conf-file flume-flume-console2.conf -Dflume.root.logger=INFO,console
flume-ng agent --conf /home/offcn/apps/flume-1.9.0/conf --name a2 --conf-file flume-flume-console1.conf -Dflume.root.logger=INFO,console

flume-ng agent --conf /home/offcn/apps/flume-1.9.0/conf --name a1 --conf-file flume-netcat-flume.conf -Dflume.root.logger=INFO,console

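Check the results
Note: use jps -ml to list the Flume processes.

telnet localhost 44444

Look at the console output of Flume2 and Flume3: at any given moment only one of them prints the events. Because k2 has the higher priority (10 versus 5), Flume3 receives them first.
Kill Flume2 and watch the Flume3 console: output continues.
If Flume3 is killed instead, the Flume2 console takes over and prints the events normally.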
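
The behaviour observed here follows from priority-based selection, which can be sketched in Java. FailoverSketch, Sink, and pickActive are hypothetical names; the sketch only models "use the live sink with the highest priority" and leaves out details such as the maxpenalty back-off set in the configuration above.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of priority-based failover between sinks.
final class FailoverSketch {
    static final class Sink {
        final String name;
        final int priority;
        boolean alive = true;

        Sink(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    // Pick the live sink with the highest priority, if any sink is alive.
    static Optional<Sink> pickActive(List<Sink> sinks) {
        return sinks.stream()
                .filter((Sink s) -> s.alive)
                .max(Comparator.comparingInt((Sink s) -> s.priority));
    }
}
```

With k1 at priority 5 and k2 at priority 10, pickActive returns k2 while both are alive and falls back to k1 once k2 is marked as not alive, which corresponds to Flume2 taking over after Flume3 is killed.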

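(4) Aggregation

Requirement:
Flume-1 on node-1 monitors the file /home/hadoop/tmp/flume-job/xxx.log,
Flume-2 on node-1 monitors the data stream on a port,
Flume-1 and Flume-2 send their data to Flume-3 on node-1, and Flume-3 prints the final data to the console.

Implementation steps:

Create the folder
Create the configuration files
Configuration 1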

vim flume1.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/tmp/flume-job/xxx.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node-1
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Configuration 2

vim flume2.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = localhost
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = node-1
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Configuration 3

vim flume3.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = node-1
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

Write the configurations above
Run the configurations

flume-ng agent --conf /home/offcn/apps/flume-1.9.0/conf --name a3 --conf-file flume3.conf -Dflume.root.logger=INFO,console
flume-ng agent --conf /home/offcn/apps/flume-1.9.0/conf --name a2 --conf-file flume2.conf -Dflume.root.logger=INFO,console
flume-ng agent --conf /home/offcn/apps/flume-1.9.0/conf --name a1 --conf-file flume1.conf -Dflume.root.logger=INFO,console