尚硅谷Flume

小新学java

已于 2023-06-10 21:13:33 修改

阅读量246

点赞数

分类专栏： hadoop生态圈文章标签： flume

于 2023-06-08 22:21:00 首次发布

本文链接：https://blog.csdn.net/m0_63961750/article/details/131114533

版权

hadoop生态圈专栏收录该内容

5 篇文章 2 订阅

订阅专栏

2.1.1 配置好 flume-netcat-logger.conf 文件

2.2 实时监控单个追加文件

2.2.1 配置好 flume-file-hdfs.conf 文件 (exec不能断点续传)

2.3 实时监控目录下多个新文件

2.3.1 配置好 flume-dir-hdfs.conf 文件（spooldir不能监控动态变化的文件）

2.4 实时监控目录下的多个追加文件(Taildir既能断点续传又能动态监控)

2.4.1 配置好 flume-taildir-hdfs.conf 文件

4.1 你是如何实现 Flume 数据传输的监控的

4.2 Flume 的 Source，Sink，Channel 的作用？你们 Source 是什么类型？

4.3 Flume 的 Channel Selectors

4.4 Flume 参数调优

4.5 Flume 的事务机制

4.6 Flume 采集数据会丢失吗?

一、Flume 概述

Flume是Cloudera提供的一个高可用的，高可靠的，分布式的海量日志采集、聚合和传输的系统，Flume支持在日志系统中定制各类数据发送方，用于收集数据；同时，Flume提供对数据进行简单处理，并写到各种数据接收方（可定制）的能力。

1.1 Flume 定义

Flume 是 Cloudera 提供的一个高可用的，高可靠的，分布式的海量日志采集、聚合和传输的系统。Flume 基于流式架构，灵活简单。

1.2 Flume 基础架构

1.2.1 Agent（代理）

Agent 是一个 JVM 进程，它以事件的形式将数据从源头送至目的。
Agent 主要有 3 个部分组成，Source、Channel、Sink。一个独立的Flume进程，包含组件Source、Channel、Sink。每台机器运行一个Agent，但是在一个Agent中可以包含多个Source、Channel和Sink。

1.2.2 Source

Source 是负责接收数据到 Flume Agent 的组件。Source 组件可以处理各种类型、各种

格式的日志数据，包括 avro 、thrift、 exec 、jms、 spooling directory 、 netcat 、 taildir 、

sequence generator、syslog、http、legacy。

1.2.3 Sink

Sink 不断地轮询 Channel 中的事件且批量地移除它们，并将这些事件批量写入到存储或索引系统、或者被发送到另一个 Flume Agent。
Sink 组件目的地包括 hdfs、logger、avro、thrift、ipc、file、HBase、solr、自定义。

1.2.4 Channel

Channel 是位于 Source 和 Sink 之间的缓冲区。因此，Channel 允许 Source 和 Sink 运作在不同的速率上。Channel 是线程安全的，可以同时处理几个 Source 的写入操作和几个Sink 的读取操作。
Flume 自带两种 Channel：Memory Channel 和 File Channel。
Memory Channel 是内存中的队列。Memory Channel 在不需要关心数据丢失的情景下适用。如果需要关心数据丢失，那么 Memory Channel 就不应该使用，因为程序死亡、机器宕机或者重启都会导致数据丢失。
File Channel 将所有事件写到磁盘。因此在程序关闭或机器宕机的情况下不会丢失数据。

1.2.5 Event（事件）

传输单元，Flume 数据传输的基本单元，以 Event 的形式将数据从源头送至目的地。包括数据头和数据体通常约4kB。
Event 由 Header 和 Body 两部分组成，Header 用来存放该 event 的一些属性，为 K-V 结构，Body 用来存放该条数据，形式为字节数组。

二、Flume 入门

2.1 监控端口数据官方案例

1）案例需求：

使用 Flume 监听一个端口，收集该端口数据，并打印到控制台。

2）需求分析：

演示：

2.1.1 配置好 flume-netcat-logger.conf 文件

1）先开启 flume 监听端口

[hsw@hadoop102 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

参数说明：

--conf/-c：表示配置文件存储在 conf/目录

--name/-n：表示给 agent 起名为 a1

--conf-file/-f：flume 本次启动读取的配置文件是在 job 文件夹下的 flume-telnet.conf

文件。

-Dflume.root.logger=INFO,console ：-D 表示 flume 运行时动态修改 flume.root.logger

参数属性值，并将控制台日志打印级别设置为 INFO 级别。日志级别包括:log、info、warn、

error。

2）使用 netcat 工具向本机的 44444 端口发送内容

[hsw@hadoop102 ~]$ nc localhost 44444
hello
OK
java
OK

3）在 Flume 监听页面观察接收数据情况

2.2 实时监控单个追加文件

1）案例需求：实时监控 Hive 日志，并上传到 HDFS 中

2）需求分析 ：

2.2.1 配置好 flume-file-hdfs.conf 文件 (exec不能断点续传)

1) 运行 Flume

[hsw@hadoop102 flume]$ bin/flume-ng agent -n a2 -c conf/ -f job/flume-file-hdfs.conf

2) 开启 Hadoop 和 Hive 并操作 Hive 产生日志

[hsw@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[hsw@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
[hsw@hadoop102 hive]$ bin/hive
hive (default)>

3) 在 HDFS 上查看文件。

2.3 实时监控目录下多个新文件

1）案例需求：使用 Flume 监听整个目录的文件，并上传至 HDFS

2）需求分析：

2.3.1 配置好 flume-dir-hdfs.conf 文件（spooldir不能监控动态变化的文件）

1）启动监控文件夹命令

[hsw@hadoop102 flume]$ bin/flume-ng agent -n a3 -c conf/ -f job/flume-dir-hdfs.conf

2）向 upload 文件夹中添加文件

在/opt/module/flume 目录下创建 upload 文件夹

[hsw@hadoop102 flume]$ mkdir upload

向 upload 文件夹中添加文件

[hsw@hadoop102 upload]$ touch 1.txt
[hsw@hadoop102 upload]$ touch 1.tmp
[hsw@hadoop102 upload]$ touch 1.log

3) 查看 HDFS 上的数据

2.4 实时监控目录下的多个追加文件(Taildir既能断点续传又能动态监控)

Exec source 适用于监控一个实时追加的文件，不能实现断点续传；Spooldir Source

适合用于同步新文件，但不适合对实时追加日志的文件进行监听并同步；而 Taildir Source

适合用于监听多个实时追加的文件，并且能够实现断点续传。

1）案例需求:使用 Flume 监听整个目录的实时追加文件，并上传至 HDFS

2）需求分析:

2.4.1 配置好 flume-taildir-hdfs.conf 文件

1）启动监控文件夹命令

[hsw@hadoop102 flume]$ bin/flume-ng agent -n a3 -c conf/ -f job/flume-taildir-hdfs.conf

2）向 files 文件夹中追加内容

在/opt/module/flume 目录下创建 files 文件夹

[hsw@hadoop102 flume]$ mkdir files

向 upload 文件夹中添加文件

[hsw@hadoop102 files]$ echo hello >> file1.txt
[hsw@hadoop102 files]$ echo hello world >> file2.txt

4）查看 HDFS 上的数据

Taildir 说明：

Taildir Source 维护了一个 json 格式的 position File，其会定期的往 position File

中更新每个文件读取到的最新的位置，因此能够实现断点续传。

Position File 的格式如下：

{"inode":55420315,"pos":18,"file":"/opt/module/flume/files/file1.txt"}
{"inode":401283,"pos":6,"file":"/opt/module/flume/files2/log.txt"}

注： Linux 中储存文件元数据的区域就叫做 inode ，每个 inode 都有一个号码，操作系统

用 inode 号码来识别不同的文件， Unix/Linux 系统内部不使用文件名，而使用 inode 号码来

识别文件。

三、Flume 进阶

3.1 Flume 事物

3.2 Flume Agent 内部原理

重要组件：

1）ChannelSelector

ChannelSelector 的作用就是选出 Event 将要被发往哪个 Channel。其共有两种类型，分别是 Replicating （复制）和 Multiplexing （多路复用）。

ReplicatingSelector 会将同一个 Event 发往所有的 Channel，Multiplexing 会根据相应的原则，将不同的 Event 发往不同的 Channel。

2）SinkProcessor

SinkProcessor 共有三种类型

DefaultSinkProcessor
LoadBalancingSinkProcessor
FailoverSinkProcessor

其中：

DefaultSinkProcessor 对应的是单个的 Sink
LoadBalancingSinkProcessor 和 FailoverSinkProcessor 对应的是 Sink Group
LoadBalancingSinkProcessor 可以实现负载均衡的功能，FailoverSinkProcessor 可以错误恢复的功能。

3.3 Flume 拓扑结构

3.3.1 简单串联

这种模式是将多个 flume 顺序连接起来了，从最初的 source 开始到最终 sink 传送的

目的存储系统。此模式不建议桥接过多的 flume 数量， flume 数量过多不仅会影响传输速

率，而且一旦传输过程中某个节点 flume 宕机，会影响整个传输系统。

3.3.2 复制和多路复用

Flume 支持将事件流向一个或者多个目的地。这种模式可以将相同数据复制到多个

channel 中，或者将不同数据分发到不同的 channel 中，sink 可以选择传送到不同的目的

地。

3.3.3 负载均衡和故障转移

Flume支持使用将多个sink逻辑上分到一个sink组，sink组配合不同的SinkProcessor

可以实现负载均衡和错误恢复的功能。

3.3.4 聚合

这种模式是我们最常见的，也非常实用，日常 web 应用通常分布在上百个服务器，大者甚至上千个、上万个服务器。产生的日志，处理起来也非常麻烦。用 flume 的这种组合方式能很好的解决这一问题，每台服务器部署一个 flume 采集日志，传送到一个集中收集日志的flume，再由此 flume 上传到 hdfs、hive、hbase 等，进行日志分析。

3.4 Flume 企业开发案例

3.4.1 复制和多路复用

1）案例需求

使用 Flume-1 监控文件变动，Flume-1 将变动内容传递给 Flume-2，Flume-2 负责存储

到 HDFS。同时 Flume-1 将变动内容传递给 Flume-3，Flume-3 负责输出到 Local

FileSystem。

2）需求分析：(单数据源多出口案例)

3） 实现步骤：

（1）准备工作

在/opt/module/flume/job 目录下创建 group1 文件夹

[hsw@hadoop102 job]$ cd group1/

在/opt/module/datas/目录下创建 flume3 文件夹

[hsw@hadoop102 datas]$ mkdir flume3

（2）创建 flume-file-flume.conf

配置 1 个接收日志文件的 source 和两个 channel、两个 sink，分别输送给 flume-flume-hdfs 和 flume-flume-dir。

配置好 flume-file-flume.conf 文件

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# 将数据流复制给所有 channel
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# sink 端的 avro 是一个数据发送者
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

配置好 flume-flume-hdfs.conf 文件 配置上级 Flume 输出的 Source，输出是到 HDFS 的 Sink

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# source 端的 avro 是一个数据接收服务
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/flume2/%Y%m%d/%H

#上传文件的前缀
a2.sinks.k1.hdfs.filePrefix = flume2-

#是否按照时间滚动文件夹
a2.sinks.k1.hdfs.round = true

#多少时间单位创建一个新的文件夹
a2.sinks.k1.hdfs.roundValue = 1

#重新定义时间单位
a2.sinks.k1.hdfs.roundUnit = hour

#是否使用本地时间戳
a2.sinks.k1.hdfs.useLocalTimeStamp = true

#积攒多少个 Event 才 flush 到 HDFS 一次
a2.sinks.k1.hdfs.batchSize = 100

#设置文件类型，可支持压缩
a2.sinks.k1.hdfs.fileType = DataStream

#多久生成一个新的文件
a2.sinks.k1.hdfs.rollInterval = 30

#设置每个文件的滚动大小大概是 128M
a2.sinks.k1.hdfs.rollSize = 134217700

#文件的滚动与 Event 数量无关
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

配置好 flume-flume-dir.conf 文件 配置上级 Flume 输出的 Source，输出是到本地目录的 Sink

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

提示：输出的本地目录必须是已经存在的目录，如果该目录不存在，并不会创建新的目录。

（3）执行配置文件

分别启动对应的 flume 进程：flume-flume-dir，flume-flume-hdfs，flume-file-flume

[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf

（4）启动 Hadoop 和 Hive

[hsw@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[hsw@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
[hsw@hadoop102 hive]$ bin/hive
hive (default)>

（5）检查 HDFS 上数据和检查/opt/module/datas/flume3 目录中数据

3.4.2 负载均衡和故障转移

1）案例需求

使用 Flume1 监控一个端口，其 sink 组中的 sink 分别对接 Flume2 和 Flume3，采用

FailoverSinkProcessor，实现故障转移的功能。

2）需求分析 （故障转移案例）

3）实现步骤

（1）准备工作

在/opt/module/flume/job 目录下创建 group2 文件夹

[hsw@hadoop102 job]$ cd group2/

（2）创建 flume-netcat-flume.conf

配置 1 个 netcat source 和 1 个 channel、1 个 sink group（2 个 sink），分别输送给 flume-flume-console1 和 flume-flume-console2。

配置好 flume-netcat-flume.conf 文件

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

若是负载均衡：只需要改

配置好flume-flume-console1.conf 文件配置上级 Flume 输出的 Source，输出是到本地控制台

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

配置好 flume-flume-console2.conf 文件配置上级 Flume 输出的 Source，输出是到本地控制台

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

（3）执行配置文件

[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

（4）使用 netcat 工具向本机的 44444 端口发送内容

$ nc localhost 44444

（5）查看 Flume2 及 Flume3 的控制台打印日志

（6）将 Flume2 kill，观察 Flume3 的控制台打印情况。当优先级高的发生故障时，优先级低的才会去拉去数据。

注：使用 jps -ml 查看 Flume 进程。

3.4.3 聚合

1）案例需求：

hadoop102 上的 Flume-1 监控文件/opt/module/group.log，

hadoop103 上的 Flume-2 监控某一个端口的数据流，

Flume-1 与 Flume-2 将数据发送给 hadoop104 上的 Flume-3，Flume-3 将最终数据打印

到控制台。

2）需求分析（多数据源汇总案例）

3）实现步骤

（1）准备工作

分发 Flume

[hsw@hadoop102 module]$ xsync flume

在 hadoop102、hadoop103 以及 hadoop104 的/opt/module/flume/job 目录下创建一个

group3 文件夹。

[hsw@hadoop102 job]$ mkdir group3
[hsw@hadoop103 job]$ mkdir group3
[hsw@hadoop104 job]$ mkdir group3

（2）创建 flume1-logger-flume.conf

配置 Source 用于监控 hive.log 文件，配置 Sink 输出数据到下一级 Flume

配置好 flume1-logger-flume.conf 文件

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

配置好 flume2-netcat-flume.conf 文件配置 Source 监控端口 44444 数据流，配置 Sink 数据到下一级 Flume

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

配置好 flume3-flume-logger.conf 文件配置 source 用于接收 flume1 与 flume2 发送过来的数据流，最终合并后 sink 到控制台

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141

# Describe the sink
# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

（3）执行配置文件

分别开启对应配置文件：flume3-flume-logger.conf，flume2-netcat-flume.conf， flume1-logger-flume.conf

[hsw@hadoop104 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume1-logger-flume.conf
[hsw@hadoop103 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume2-netcat-flume.conf

（4）在 hadoop103 上向/opt/module 目录下的 group.log 追加内容

[hsw@hadoop103 module]$ echo 'hello' > group.log

（5）在 hadoop102 上向 44444 端口发送数据

[hsw@hadoop102 flume]$ telnet hadoop102 44444

（6）检查 hadoop104 上数据

3.5 自定义 Interceptor

1）案例需求

使用 Flume 采集服务器本地日志，需要按照日志类型的不同，将不同种类的日志发往不同的分析系统。

2）需求分析（ Interceptor和Multiplexing ChannelSelector案例）

在实际的开发中，一台服务器产生的日志类型可能有很多种，不同类型的日志可能需要发送到不同的分析系统。此时会用到 Flume 拓扑结构中的 Multiplexing 结构，Multiplexing的原理是，根据 event 中 Header 的某个 key 的值，将不同的 event 发送到不同的 Channel 中，所以我们需要自定义一个 Interceptor，为不同类型的 event 的 Header 中的 key 赋予不同的值。

在该案例中，我们以端口数据模拟日志，以是否包含”atguigu”模拟不同类型的日志，我们需要自定义 interceptor 区分数据中是否包含”atguigu”，将其分别发往不同的分析系统（Channel）。

3）实现步骤

（1）创建一个 maven 项目，并引入以下依赖。

<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>

（2）定义 TypeInterceptor 类并实现 Interceptor 接口。

public class TypeInterceptor implements Interceptor {
    //声明一个集合用于存放拦截器处理后的事件
    private List<Event> addHeaderEvens = new ArrayList<>();


    @Override
    public void initialize() {

    }

    //单个事件处理方法
    @Override
    public Event intercept(Event event) {
        //1. 获取header&body
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());

        //2. 根据body中是否包含"atguigu"添加不同的头信息
        if(body.contains("atguigu")){
            headers.put("type","atguigu");
        }else{
            headers.put("type","other");
        }
        return event;
    }

    //多个实际处理方法
    @Override
    public List<Event> intercept(List<Event> events) {
        //1. 清空集合
        addHeaderEvens.clear();

        //2. 遍历events
        for(Event event:events){
            addHeaderEvens.add(intercept(event));
        }

        //3.返回数据
        return addHeaderEvens;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

（3）编辑 flume 配置文件

为 hadoop102 上的 Flume1 配置 1 个 netcat source，1 个 sink group（2 个 avro sink），并配置相应的 TypeInterceptor 和 interceptor。

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.atguigu = c1
a1.sources.r1.selector.mapping.other = c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

为 hadoop103 上的 Flume4 配置一个 avro source 和一个 logger sink

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

为 hadoop104 上的 Flume3 配置一个 avro source 和一个 logger sink

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

（4）分别在 hadoop102，hadoop103，hadoop104 上启动 flume 进程，注意先后顺序。

（5）在 hadoop102 使用 netcat 向 localhost:44444 发送字母和数字。

（6）观察 hadoop103 和 hadoop104 打印的日志。发现有atguigu的只会再hadoop103上打印，其余的都是再hadoop104上打印。

3.6 Flume 数据流监控

3.6.1 Ganglia 的安装与部署

Ganglia 由 gmond、gmetad 和 gweb 三部分组成。

gmond（Ganglia Monitoring Daemon）是一种轻量级服务，安装在每台需要收集指标数据的节点主机上。使用 gmond，你可以很容易收集很多系统指标数据，如 CPU、内存、磁盘、网络和活跃进程的数据等。
gmetad（Ganglia Meta Daemon）整合所有信息，并将其以 RRD 格式存储至磁盘的服务。
gweb（Ganglia Web）Ganglia 可视化工具，gweb 是一种利用浏览器显示 gmetad 所存储数据的 PHP 前端。在 Web 界面中以图表方式展现集群的运行状态下收集的多种不同指标数据。

1）安装 ganglia

（1）规划

hadoop102: web gmetad gmod

hadoop103: gmod

hadoop104: gmod

（2）在 102 103 104 分别安装 epel-release

[hsw@hadoop102 flume]$ sudo yum -y install epel-release

（3）在 102 安装

[hsw@hadoop102 flume]$ sudo yum -y install ganglia-gmetad 
[hsw@hadoop102 flume]$ sudo yum -y install ganglia-web
[hsw@hadoop102 flume]$ sudo yum -y install ganglia-gmond

（4）在 103 和 104 安装

[hsw@hadoop102 flume]$ sudo yum -y install ganglia-gmond

2）在 102 修改配置文件/etc/httpd/conf.d/ganglia.conf

[hsw@hadoop102 flume]$ sudo vim /etc/httpd/conf.d/ganglia.conf

3）在 102 修改配置文件/etc/ganglia/gmetad.conf

[hsw@hadoop102 flume]$ sudo vim /etc/ganglia/gmetad.conf

修改为： data_source "my cluster" hadoop102

4）在 102 103 104 修改配置文件/etc/ganglia/gmond.conf

[hsw@hadoop102 flume]$ sudo vim /etc/ganglia/gmond.conf

cluster {
 name = "my cluster" #改这里
 owner = "unspecified"
 latlong = "unspecified"
 url = "unspecified"
}
udp_send_channel {
 #bind_hostname = yes # Highly recommended, soon to be default.
 # This option tells gmond to use a source address
 # that resolves to the machine's hostname. Without
 # this, the metrics may appear to come from any
 # interface and the DNS names associated with
 # those IPs will be used to create the RRDs.
 # mcast_join = 239.2.11.71
 # 数据发送给 hadoop102
 host = hadoop102   #改这里
 port = 8649
 ttl = 1
}

udp_recv_channel {
 # mcast_join = 239.2.11.71
 port = 8649
# 接收来自任意连接的数据
 bind = 0.0.0.0   #改这里
 retry_bind = true
 # Size of the UDP buffer. If you are handling lots of metrics you really
 # should bump it up to e.g. 10MB or even higher.
 # buffer = 10485760
}

5）在 102 修改配置文件/etc/selinux/config

[hsw@hadoop102 flume]$ sudo vim /etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted

尖叫提示： selinux 生效需要重启，如果此时不想重启，可以临时生效之：

[hsw@hadoop102 flume]$ sudo setenforce 0

6）启动 ganglia

（1）在 102 103 104 启动

[hsw@hadoop102 flume]$ sudo systemctl start gmond

（2）在 102 启动

[hsw@hadoop102 flume]$ sudo systemctl start httpd
[hsw@hadoop102 flume]$ sudo systemctl start gmetad

7）打开网页浏览 ganglia 页面

http://hadoop102/ganglia

尖叫提示：如果完成以上操作依然出现权限不足错误，请修改/var/lib/ganglia 目录

的权限：

[hsw@hadoop102 flume]$ sudo chmod -R 777 /var/lib/ganglia

3.6.2 操作 Flume 测试监控

1）启动 Flume 任务

[hsw@hadoop102 flume]$ bin/flume-ng agent \
-c conf/ \
-n a1 \
-f job/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649

2）发送数据观察 ganglia 监测图

[hsw@hadoop102 flume]$ nc localhost 44444

四、企业真实面试题

4.1 你是如何实现 Flume 数据传输的监控的

使用第三方框架 Ganglia 实时监控 Flume。

4.2 Flume 的 Source，Sink，Channel 的作用？你们 Source 是什么类型？

1）作用
（1）Source 组件是专门用来收集数据的，可以处理各种类型、各种格式的日志数据,包括 avro、thrift、exec、jms、spooling directory、netcat、sequence generator、syslog、http、legacy
（2）Channel 组件对采集到的数据进行缓存，可以存放在 Memory 或 File 中。
（3）Sink 组件是用于把数据发送到目的地的组件，目的地包括 Hdfs、Logger、avro、thrift、ipc、file、Hbase、solr、自定义。
2）我公司采用的 Source 类型为：
（1）监控后台日志：exec
（2）监控后台产生日志的端口：netcat

4.3 Flume 的 Channel Selectors

4.4 Flume 参数调优

1）Source

增加 Source 个（使用 Tair Dir Source 时可增加 FileGroups 个数）可以增大 Source的读取数据的能力。例如：当某一个目录产生的文件过多时需要将这个文件目录拆分成多个文件目录，同时配置好多个 Source 以保证 Source 有足够的能力获取到新产生的数据。

batchSize 参数决定 Source 一次批量运输到 Channel 的 event 条数，适当调大这个参数可以提高 Source 搬运 Event 到 Channel 时的性能。

2）Channel

type 选择 memory 时 Channel 的性能最好，但是如果 Flume 进程意外挂掉可能会丢失数据。type 选择 file 时 Channel 的容错性更好，但是性能上会比 memory channel 差。

使用 file Channel 时 dataDirs 配置多个不同盘下的目录可以提高性能。

Capacity 参数决定 Channel 可容纳最大的 event 条数。transactionCapacity 参数决定每次 Source 往 channel 里面写的最大 event 条数和每次 Sink 从 channel 里面读的最大event 条数。 transactionCapacity 需要大于 Source 和 Sink 的 batchSize 参数。

3）Sink

增加 Sink 的个数可以增加 Sink 消费 event 的能力。Sink 也不是越多越好够用就行，过多的 Sink 会占用系统资源，造成系统资源不必要的浪费。

batchSize 参数决定 Sink 一次批量从 Channel 读取的 event 条数，适当调大这个参数可以提高 Sink 从 Channel 搬出 event 的性能。

4.5 Flume 的事务机制

Flume 的事务机制（类似数据库的事务机制）：Flume 使用两个独立的事务分别负责从 Soucrce 到 Channel，以及从 Channel 到 Sink 的事件传递。

比如 spooling directory source 为文件的每一行创建一个事件，一旦事务中所有的事件全部传递到 Channel 且提交成功，那么 Soucrce 就将该文件标记为完成。

同理，事务以类似的方式处理从 Channel 到 Sink 的传递过程，如果因为某种原因使得事件无法记录，那么事务将会回滚。且所有的事件都会保持到 Channel 中，等待重新传递。

4.6 Flume 采集数据会丢失吗?

根据 Flume 的架构原理，Flume 是不可能丢失数据的，其内部有完善的事务机制，Source 到 Channel 是事务性的，Channel 到 Sink 是事务性的，因此这两个环节不会出现数据的丢失，唯一可能丢失数据的情况是 Channel 采用 memoryChannel，agent 宕机导致数据丢失，或者 Channel 存储数据已满，导致 Source 不再写入，未写入的数据丢失。

Flume 不会丢失数据，但是有可能造成数据的重复，例如数据已经成功由 Sink 发出，但是没有接收到响应，Sink 会再次发送数据，此时可能会导致数据的重复。