Flume

马本不想再等了

Tags: Big Data

Article link: https://blog.csdn.net/qq_42180284/article/details/103944955

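I. Flume Overview

1.1 What Is Flume

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting large volumes of log data. It is built on a streaming architecture and is flexible and simple.

Flume's main use is to read data from a server's local disk in real time and write it to HDFS.

1.2 Flume Architecture

[Figure: Flume basic architecture]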

Agent

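An Agent is a JVM process that moves data from a source to a destination in the form of events.

An Agent has three main components: Source, Channel, and Sink.
Source

The Source is the component that receives data into the Flume Agent.

Source types include: avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and taildir.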
Sink

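The Sink continuously polls the Channel for events and removes them in batches, writing each batch to a storage or indexing system or sending it on to another Flume Agent.

Sink types include: hdfs, logger, avro, thrift, irc, file_roll, HBase, solr, and custom sinks.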
Channel

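The Channel is a buffer that sits between the Source and the Sink. It is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

The Memory Channel is an in-memory queue. It suits scenarios where losing data is acceptable.

The File Channel writes all events to disk, so no data is lost if the process exits or the machine goes down.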
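
The buffering role of a channel can be sketched in plain Java. This is an illustration only, not Flume's MemoryChannel: the class `ChannelSketch` and its methods are hypothetical, events are reduced to strings, and transactions are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a channel: a bounded, thread-safe queue of events.
class ChannelSketch {
    private final BlockingQueue<String> queue;

    ChannelSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Source side: returns false instead of blocking when the channel is full.
    boolean put(String event) {
        return queue.offer(event);
    }

    // Sink side: removes up to batchSize events in one call.
    List<String> takeBatch(int batchSize) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, batchSize);
        return batch;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        ChannelSketch channel = new ChannelSketch(2);
        System.out.println(channel.put("a"));        // accepted
        System.out.println(channel.put("b"));        // accepted
        System.out.println(channel.put("c"));        // rejected, the channel is full
        System.out.println(channel.takeBatch(10));   // the sink drains what is there
    }
}
```

Because the queue lives in memory, everything in it disappears when the JVM exits, which is the trade-off the Memory Channel makes.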
Event

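The Event is the basic unit of data transfer in Flume; data travels from source to destination as Events. An Event consists of a Header and a Body. The Header holds the event's attributes as key-value pairs, and the Body holds the data itself as a byte array.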
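
A minimal sketch of that shape in plain Java follows. `SimpleEvent` is a hypothetical stand-in written for this illustration; it is not Flume's own `org.apache.flume.Event` interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an event: key-value headers plus a byte-array body.
class SimpleEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    SimpleEvent(byte[] body) {
        this.body = body;
    }

    Map<String, String> getHeaders() {
        return headers;
    }

    byte[] getBody() {
        return body;
    }

    public static void main(String[] args) {
        SimpleEvent event = new SimpleEvent("hello".getBytes(StandardCharsets.UTF_8));
        event.getHeaders().put("type", "letter");
        System.out.println(event.getHeaders() + " "
                + new String(event.getBody(), StandardCharsets.UTF_8));
    }
}
```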

II. Flume Quick Start

2.1 Installing and Deploying Flume

Upload apache-flume-1.7.0-bin.tar.gz to the /opt/software directory on the Linux host.

Extract apache-flume-1.7.0-bin.tar.gz into /opt/module/:

tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/module/

Rename apache-flume-1.7.0-bin to flume:

mv apache-flume-1.7.0-bin flume

In flume/conf, rename flume-env.sh.template to flume-env.sh, then edit flume-env.sh:

mv flume-env.sh.template flume-env.sh
vi flume-env.sh
# Set JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144

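2.2 Getting-Started Examples

① Monitoring data on a port

netcat-logger.conf

Requirement: use Flume to listen on a port, collect the data sent to it, and print that data to the console.

Install the netcat tool:

sudo yum install -y nc

Check whether port 44444 is already in use:

sudo netstat -tunlp | grep 44444

Create the Flume Agent configuration file netcat-logger.conf in a jobs directory and add the configuration below:

mkdir jobs
vim jobs/netcat-logger.conf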

# Name the source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Describe the sink
a1.sinks.k1.type = logger

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume listening on the port:

bin/flume-ng agent -c conf/ -n a1 -f jobs/netcat-logger.conf -Dflume.root.logger=INFO,console

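Parameters:

-c: the directory that holds Flume's own configuration files, here conf/.

-n: the name of the agent to run, here a1. It must match the agent name used in the configuration file.

-f: the agent configuration file to read for this run, here jobs/netcat-logger.conf.

-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at run time, here setting the log level to INFO and sending the output to the console. Log levels include debug, info, warn, and error.

On hadoop103, use netcat to send data to port 44444 on hadoop102: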

nc hadoop102 44444
hello
maben

② Monitoring a single file in real time

exec-hdfs.conf

Requirement: use Flume to collect the content appended to a file and write it to HDFS.

Preparation: the following Hadoop-related jars

commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar

Copy these jars into /opt/module/flume/lib.

Create exec-hdfs.conf:

vim exec-hdfs.conf
# Add the following content
# Name the source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/datas/data.log

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

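# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H%M
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the timestamp down, so that directories roll over by time
a1.sinks.k1.hdfs.round = true
# Number of time units per new directory
a1.sinks.k1.hdfs.roundValue = 5
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output, and compressed types are also supported
a1.sinks.k1.hdfs.fileType = DataStream
# Interval, in seconds, after which a new file is started
a1.sinks.k1.hdfs.rollInterval = 60
# File size, in bytes, that triggers a roll (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 disables rolling by event count
a1.sinks.k1.hdfs.rollCount = 0

# Bind the source and the sink to the channel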
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
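
With round = true, roundValue = 5, and roundUnit = minute, the timestamp substituted into hdfs.path is rounded down to a five-minute boundary, so a new %H%M directory appears every five minutes. The sketch below reproduces that arithmetic in plain Java. It is an illustration under that assumption, not Flume's implementation, and `HdfsBucketPathSketch` and `bucketPath` are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: shows which directory an event timestamp maps to.
class HdfsBucketPathSketch {

    // Rounds the minute down to a multiple of roundValueMinutes, then
    // formats the result the way %Y%m%d/%H%M would.
    static String bucketPath(String prefix, LocalDateTime timestamp, int roundValueMinutes) {
        int flooredMinute = (timestamp.getMinute() / roundValueMinutes) * roundValueMinutes;
        LocalDateTime rounded = timestamp.withMinute(flooredMinute).withSecond(0).withNano(0);
        return prefix + "/" + rounded.format(DateTimeFormatter.ofPattern("yyyyMMdd/HHmm"));
    }

    public static void main(String[] args) {
        LocalDateTime eventTime = LocalDateTime.of(2020, 1, 12, 10, 37, 42);
        // 10:37:42 falls into the 10:35 bucket
        System.out.println(bucketPath("/flume", eventTime, 5)); // prints /flume/20200112/1035
    }
}
```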

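Run Flume. The agent name passed to -n must be a1, the name used in exec-hdfs.conf:

bin/flume-ng agent -c conf/ -n a1 -f jobs/exec-hdfs.conf

Use data-producer.jar to generate simulated log data:

java -jar data-producer.jar ./data.log

Check the collected files in HDFS.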

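③ Monitoring a directory for new files

spool-hdfs.conf

Requirement: use Flume to watch a whole directory and upload its files to HDFS.

Create the configuration file spool-hdfs.conf:

vim spool-hdfs.conf
# Name the source, channel, and sink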
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/module/datas/files
# Ignore files whose names end in .tmp
a1.sources.r1.ignorePattern = \\S*\\.tmp

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

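# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H%M
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the timestamp down, so that directories roll over by time
a1.sinks.k1.hdfs.round = true
# Number of time units per new directory
a1.sinks.k1.hdfs.roundValue = 5
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output, and compressed types are also supported
a1.sinks.k1.hdfs.fileType = DataStream
# Interval, in seconds, after which a new file is started
a1.sinks.k1.hdfs.rollInterval = 60
# File size, in bytes, that triggers a roll (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 disables rolling by event count
a1.sinks.k1.hdfs.rollCount = 0

# Bind the source and the sink to the channel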
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
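
Agent configuration files follow the Java properties format, where `#` starts a comment only at the beginning of a line; text after a value on the same line becomes part of the value. The sketch below shows this with `java.util.Properties` alone. The class `InlineCommentSketch` and the key `pattern` are made up for the illustration, and any further processing Flume applies to the loaded values is not modeled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: loads properties text and returns one value.
class InlineCommentSketch {

    static String valueOf(String fileContent, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // In the file, the regex is written with doubled backslashes.
        String content = "# a comment on its own line is skipped\n"
                + "pattern = \\\\S*\\\\.tmp # this text stays in the value\n";
        System.out.println("[" + valueOf(content, "pattern") + "]");
    }
}
```

Keeping every comment on its own line avoids the problem, and it also shows why the pattern is written with doubled backslashes: the properties loader turns each `\\` into a single `\` before the regex is compiled.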

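Start Flume. The agent name passed to -n must be a1, the name used in spool-hdfs.conf:

bin/flume-ng agent -c conf/ -n a1 -f jobs/spool-hdfs.conf

Note: when using the Spooling Directory Source, do not create a file in the monitored directory and then keep modifying it. Files that have been fully uploaded are renamed with a .COMPLETED suffix.

Add files to the files directory:

mkdir files
# Files ending in .tmp are ignored
mv test.tmp files/
mv test1.txt files/

Check the collected data in HDFS.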

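④ Monitoring multiple files in real time

taildir-hdfs.conf

The Exec Source suits monitoring a single file that is appended to in real time, but it cannot guarantee that no data is lost. The Spooldir Source guarantees no data loss and can resume from where it left off, but its latency is high, so it cannot monitor in real time.

The Taildir Source can resume from where it left off, guarantees no data loss, and monitors in real time.

Requirement: use Flume to watch the files in a directory that are appended to in real time and upload them to HDFS.

Create the configuration file taildir-hdfs.conf:

# Name the source, channel, and sink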
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# Monitor files whose names start with "file"
a1.sources.r1.filegroups.f1 = /opt/module/datas/tailfiles/file.*
a1.sources.r1.positionFile = /opt/module/datas/tailfiles/taildir_position.json

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

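# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H%M
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the timestamp down, so that directories roll over by time
a1.sinks.k1.hdfs.round = true
# Number of time units per new directory
a1.sinks.k1.hdfs.roundValue = 5
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output, and compressed types are also supported
a1.sinks.k1.hdfs.fileType = DataStream
# Interval, in seconds, after which a new file is started
a1.sinks.k1.hdfs.rollInterval = 60
# File size, in bytes, that triggers a roll (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 disables rolling by event count
a1.sinks.k1.hdfs.rollCount = 0

# Bind the source and the sink to the channel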
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:

bin/flume-ng agent -c conf/ -n a1 -f jobs/taildir-hdfs.conf

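Append content to files in the tailfiles directory:

mkdir tailfiles
echo hello >> tailfiles/file1.txt
echo hello >> tailfiles/file2.txt

Check the data in HDFS.

Note: the Taildir Source maintains a position file in JSON format and periodically records in it the latest read position for each file, which is how it resumes from where it left off.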
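
The idea of resuming from a saved offset can be sketched in plain Java. This is not the Taildir Source's code: `TailPositionSketch` is a hypothetical class that keeps the offset in a field, where the real source persists it in the JSON position file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical reader that continues from a previously saved byte offset.
class TailPositionSketch {
    private long position;

    TailPositionSketch(long savedPosition) {
        this.position = savedPosition;
    }

    long getPosition() {
        return position;
    }

    // Returns only the bytes appended since the saved position, then advances it.
    String readNew(Path file) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            if (length <= position) {
                return "";
            }
            byte[] buffer = new byte[(int) (length - position)];
            raf.seek(position);
            raf.readFully(buffer);
            position = length;
            return new String(buffer, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Path tempFile() {
        try {
            return Files.createTempFile("tail", ".log");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void append(Path file, String text) {
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = tempFile();
        append(file, "hello\n");
        TailPositionSketch first = new TailPositionSketch(0);
        System.out.print(first.readNew(file));

        // Simulate a restart: a new reader starts from the saved offset.
        append(file, "world\n");
        TailPositionSketch resumed = new TailPositionSketch(first.getPosition());
        System.out.print(resumed.readNew(file));
    }
}
```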

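2.3 Steps for Writing a Flume Agent Configuration File

① Name the source, channel, and sink

# a1 is the agent name, which is also the value passed to -n (--name) when starting Flume.
# sources/channels/sinks are all plural, so more than one of each can be configured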
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

② Configure the sink group

# Failover Sink Processor configuration
# g1 is the group name
a1.sinkgroups = g1
# Sinks that belong to group g1
a1.sinkgroups.g1.sinks = k1 k2
# Sink processor type (one of default, load_balance, failover)
a1.sinkgroups.g1.processor.type = failover
# Sink priorities in failover mode; a larger value means a higher priority
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Upper limit, in milliseconds, on the backoff period for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Load balancing Sink Processor configuration
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

③ Configure the source

# netcat source: listen on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

# avro source configuration
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4141

# exec source: follow the content of a single file in real time
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/datas/data.log
# When the command is more complex (for example, it contains a pipe or grep), the shell setting is needed for it to take effect
a1.sources.r1.shell = /bin/bash -c

# spooldir source: watch a directory for new files
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/module/datas/files
# Ignore files whose names end in .tmp
a1.sources.r1.ignorePattern = \\S*\\.tmp

# TAILDIR source: follow the files in a directory in real time
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# Monitor files whose names start with "file"
a1.sources.r1.filegroups.f1 = /opt/module/datas/tailfiles/file.*
a1.sources.r1.positionFile = /opt/module/datas/tailfiles/taildir_position.json

④ Configure the channel

# memory channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# file channel configuration
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume/data

⑤ Configure the sink

# logger sink configuration
a1.sinks.k1.type = logger

# avro sink configuration
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

# file roll sink configuration
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/log/flume

# hbase sink configuration
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer

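# hdfs sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H%M
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the timestamp down, so that directories roll over by time
a1.sinks.k1.hdfs.round = true
# Number of time units per new directory
a1.sinks.k1.hdfs.roundValue = 5
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output, and compressed types are also supported
a1.sinks.k1.hdfs.fileType = DataStream
# Interval, in seconds, after which a new file is started
a1.sinks.k1.hdfs.rollInterval = 60
# File size, in bytes, that triggers a roll (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 disables rolling by event count
a1.sinks.k1.hdfs.rollCount = 0

⑥ Bind the source and the sink to the channel

# Source r1 writes to channel c1
a1.sources.r1.channels = c1
# Sink k1 reads from channel c1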
a1.sinks.k1.channel = c1

III. Advanced Flume

3.1 Flume Transactions

[Figure: Flume transactions]

3.2 Inside a Flume Agent

[Figure: Flume Agent internals]

ChannelSelector

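The ChannelSelector decides which Channel each Event is sent to. There are two types: Replicating and Multiplexing. The Replicating selector sends the same Event to every Channel, while the Multiplexing selector sends different Events to different Channels according to a configured rule.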
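
The two routing rules can be sketched in plain Java. This is a simplified illustration rather than Flume's selector classes: `SelectorSketch` and its methods are hypothetical, channels are represented by their names, and the multiplexing rule is assumed to be a lookup of one header value in a mapping with a default.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two channel-selection rules.
class SelectorSketch {

    // Replicating: every event goes to every channel.
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // Multiplexing: the value of one header picks the channels; events whose
    // header is missing or unmapped go to the default channels.
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaultChannels) {
        String value = headers.get(headerName);
        if (value == null || !mapping.containsKey(value)) {
            return defaultChannels;
        }
        return mapping.get(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("letter", Collections.singletonList("c1"));
        mapping.put("number", Collections.singletonList("c2"));

        Map<String, String> headers = new HashMap<>();
        headers.put("type", "number");

        System.out.println(replicating(Arrays.asList("c1", "c2")));
        System.out.println(multiplexing(headers, "type", mapping, Collections.<String>emptyList()));
    }
}
```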

SinkProcessor

There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor. DefaultSinkProcessor works with a single Sink, while LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group. LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failover.

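Detailed diagram

[Figure: detailed diagram of the Agent internals]

3.3 Flume Topologies

Simple chaining

[Figure: simple chaining]

Replicating and multiplexing

Flume supports sending an event flow to one or more destinations. In this mode the same data can be replicated to several channels, or different data can be distributed to different channels, and each sink can deliver to a different destination.

[Figure: replicating and multiplexing]

Load balancing and failover

Flume supports grouping several sinks logically into a sink group. Combined with a SinkProcessor (LoadBalancingSinkProcessor or FailoverSinkProcessor), a sink group provides load balancing or failover.

Note: the first-tier Flume needs only one channel, and sinkgroups must be declared explicitly.

[Figure: load balancing and failover]

Consolidation

Each server runs a Flume agent that collects logs and sends them to a central log-collecting Flume agent, which uploads them to HDFS, Hive, HBase, and so on for log analysis.

[Figure: consolidation]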

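3.4 Development Examples

Replicating and multiplexing

Requirement: Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it in HDFS. Flume-1 also passes the new content to Flume-3, which writes it to the local file system.

Analysis:

[Figure: analysis diagram]

Steps:

Create a group1 directory under /opt/module/flume/jobs
```
mkdir group1
```
Create a flume3 directory under /opt/module/datas/
```
mkdir flume3
```

Create flume-file-flume.conf

vim flume-file-flume.conf

Configure one source that reads the log file, two channels, and two sinks, which feed avro-hdfs and avro- respectively.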

# Name the source, channels, and sinks
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sinks
# The avro sink acts as the data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4142

# Describe the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 1000

# Bind the source and the sinks to the channels
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Create flume-flume-hdfs.conf

vim flume-flume-hdfs.conf

Configure a source that receives the upstream Flume's output, with a sink that writes to HDFS.

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# The avro source is a service that receives data
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume2/%Y%m%d/%H
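# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to round the timestamp down, so that directories roll over by time
a2.sinks.k1.hdfs.round = true
# Number of time units per new directory
a2.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output, and compressed types are also supported
a2.sinks.k1.hdfs.fileType = DataStream
# Interval, in seconds, after which a new file is started
a2.sinks.k1.hdfs.rollInterval = 600
# File size, in bytes, that triggers a roll (just under 128 MB)
a2.sinks.k1.hdfs.rollSize = 134217700
# 0 disables rolling by event count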
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Create flume-flume-dir.conf

vim flume-flume-dir.conf

Configure a source that receives the upstream Flume's output, with a sink that writes to a local directory.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/datas/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 10000
a3.channels.c2.transactionCapacity = 1000

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Note: the local output directory must already exist (flume3 was created earlier). If it does not exist, Flume will not create it.

Start the three Flume agents:

bin/flume-ng agent -n a3 -c conf/ -f jobs/group1/flume-flume-dir.conf 
bin/flume-ng agent -n a2 -c conf/ -f jobs/group1/flume-flume-hdfs.conf
bin/flume-ng agent -n a1 -c conf/ -f jobs/group1/flume-file-flume.conf

Start hive to generate log output, then check the results in the local datas/flume3 directory and under /flume2 in HDFS
```
# Start hive
hive
```

Summary: when pairing an avro sink with an avro source, set the sink's hostname to the host of the downstream avro source (that is, the server that receives the data), and set the bind value of the receiving agent's avro source to that agent's own hostname.

For example:

# The avro sink acts as the data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4142

# The avro source is a service that receives data
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4141

# The avro source is a service that receives data
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4142

Load balancing and failover

Requirement: Flume1 monitors a port, and the sinks in its sink group connect to Flume2 and Flume3. A FailoverSinkProcessor provides failover.

Analysis:

[Figure: analysis diagram]

Steps:

Create a group2 directory under /opt/module/flume/jobs
```
mkdir group2
```

Create flume-netcat-flume.conf

vim flume-netcat-flume.conf

Configure one netcat source, one channel, and one sink group (two sinks), which feed flume-flume-console1 and flume-flume-console2 respectively.

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

# Describe the sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Upper limit, in milliseconds, on the backoff period for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

a1.sinks.k2.channel = c1


Create flume-flume-console1.conf

```shell
vim flume-flume-console1.conf
```

Configure a source that receives the upstream Flume's output, with a sink that prints to the local console.

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Create flume-flume-console2.conf

vim flume-flume-console2.conf

Configure a source that receives the upstream Flume's output, with a sink that prints to the local console.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Start flume-netcat-flume on hadoop102, flume-flume-console1 on hadoop103, and flume-flume-console2 on hadoop104.

bin/flume-ng agent -c conf/ -n a3 -f jobs/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a2 -f jobs/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f jobs/group2/flume-netcat-flume.conf

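On hadoop103, use netcat to send data to port 44444 on hadoop102:

nc hadoop102 44444
maben
maben

Check the log output on the flume2 and flume3 consoles on hadoop103 and hadoop104.

Use jps -ml to find the Flume agent that is printing the log and stop it, then send more messages to hadoop102. The other agent now prints them.

Summary: with this configuration flume3 has a higher priority than flume2, so flume3 handles the events first. When flume3 on hadoop104 goes down, flume2 on hadoop103 takes over. Once flume3 is restarted, it becomes the preferred sink again from the next batch onward and resumes work.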
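
The selection rule described in this summary can be sketched in plain Java: among the sinks that are not currently marked as failed, pick the one with the highest priority. This is a simplified illustration, not Flume's FailoverSinkProcessor; `FailoverSketch` and `pickSink` are hypothetical names, and the penalty timing controlled by maxpenalty is left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of priority-based failover between sinks.
class FailoverSketch {

    // Returns the live sink with the largest priority, or null if none is live.
    static String pickSink(Map<String, Integer> priorities, Set<String> failedSinks) {
        String best = null;
        for (Map.Entry<String, Integer> entry : priorities.entrySet()) {
            if (failedSinks.contains(entry.getKey())) {
                continue;
            }
            if (best == null || entry.getValue() > priorities.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new HashMap<>();
        priorities.put("k1", 5);
        priorities.put("k2", 10);

        Set<String> failed = new HashSet<>();
        System.out.println(pickSink(priorities, failed)); // k2 has the higher priority

        failed.add("k2");                                  // the agent behind k2 goes down
        System.out.println(pickSink(priorities, failed)); // k1 takes over

        failed.remove("k2");                               // k2 recovers
        System.out.println(pickSink(priorities, failed)); // k2 is preferred again
    }
}
```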

Consolidation

Requirement: Flume-1 on hadoop102 monitors the file /opt/module/group.log, and Flume-2 on hadoop103 monitors the data stream on a port. Both send their data to Flume-3 on hadoop104, which prints the final data to the console.

Analysis:
[Figure: analysis diagram]

Steps:

Distribute Flume, and create a group3 directory under /opt/module/flume/jobs on hadoop102, hadoop103, and hadoop104.

xsync flume-1.7.0-bin
xsync group3

Create flume1-logger-flume.conf

vim flume1-logger-flume.conf

Configure a source that monitors the group.log file and a sink that sends the data to the next-level Flume.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Create flume2-netcat-flume.conf

vim flume2-netcat-flume.conf

Configure a source that monitors the data stream on port 44444 and a sink that sends the data to the next-level Flume

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Create flume3-flume-logger.conf

vim flume3-flume-logger.conf

Configure a source that receives the data streams sent by flume1 and flume2, with a sink that prints the merged stream to the console.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 10000
a3.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

Start the agents with their configuration files in this order: flume3-flume-logger.conf, flume2-netcat-flume.conf, flume1-logger-flume.conf.

bin/flume-ng agent -n a3 -c conf/ -f jobs/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a2 -c conf/ -f jobs/group3/flume2-netcat-flume.conf

bin/flume-ng agent -n a1 -c conf/ -f jobs/group3/flume1-logger-flume.conf


Append content to /opt/module/group.log on hadoop102 and observe what flume3 collects

```shell
echo "chengfan" >> group.log
```

Send a message to port 44444 on hadoop103 and observe what flume3 collects

nc hadoop103 44444
maben

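3.5 Custom Interceptor

In the Multiplexing topology, Flume routes each event to a Channel according to the value of a key in the event's Header, so we need a custom Interceptor that assigns different values to that key for different types of event.

Requirement: simulate logs with data sent to a port, using single digits and single letters to represent different log types. A custom interceptor must tell digits from letters and send them to different analysis systems (Channels).

Analysis:

[Figure: analysis diagram]

Steps:

Define the CustomInterceptor class

Create a Maven project and add the following dependency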

<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.7.0</version>
</dependency>

Write the CustomInterceptor class, implementing the Interceptor interface

package com.atguigu.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;

public class CustomInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
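        byte[] body = event.getBody();
        // Leave the headers untouched when the body is empty
        if (body == null || body.length == 0) {
            return event;
        }
        // Tag the event by its first byte: lowercase letter or digit
        if (body[0] <= 'z' && body[0] >= 'a') {
            event.getHeaders().put("type", "letter");
        } else if (body[0] >= '0' && body[0] <= '9') {
            event.getHeaders().put("type", "number");
        }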
        return event;
    }

    // Process a batch of events
    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new CustomInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
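
The header rule itself can be tried without Flume on the classpath. The sketch below restates it as a plain function; `TypeRuleSketch` and `classify` are hypothetical names used only for this illustration, and a null return stands for "no type header is set".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical restatement of the interceptor's rule as a pure function.
class TypeRuleSketch {

    // Returns "letter" for a lowercase first byte, "number" for a digit,
    // and null when the body is empty or starts with anything else.
    static String classify(byte[] body) {
        if (body == null || body.length == 0) {
            return null;
        }
        byte first = body[0];
        if (first >= 'a' && first <= 'z') {
            return "letter";
        }
        if (first >= '0' && first <= '9') {
            return "number";
        }
        return null;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"m", "7", ""}) {
            System.out.println("[" + line + "] -> "
                    + classify(line.getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

Inputs that receive no type value, such as uppercase letters or empty lines, match neither mapping of a multiplexing selector, so it is worth deciding where such events should go.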

Package it as Flume-CustomInterceptor.jar and upload it to Flume's lib directory

Edit the Flume configuration files

For flume1 on hadoop102, create flume1-netcat-flume.conf with one netcat source, two channels, and two avro sinks, and configure the ChannelSelector and the interceptor.

mkdir multiplexing
cd multiplexing
vim flume1-netcat-flume.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

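# Configure the interceptor; the class name must include the full package of CustomInterceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.CustomInterceptor$Builder

# Configure the ChannelSelector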
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

For flume2 on hadoop103, create flume2-flume-logger.conf with one avro source and one logger sink.

mkdir multiplexing
cd multiplexing
vim flume2-flume-logger.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 10000

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

For flume3 on hadoop104, create flume3-flume-logger.conf with one avro source and one logger sink.

mkdir multiplexing
cd multiplexing
vim flume3-flume-logger.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4242

# Describe the sink
a3.sinks.k1.type = logger

# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 10000
a3.channels.c1.transactionCapacity = 10000

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

Start the Flume processes on hadoop102, hadoop103, and hadoop104, paying attention to the start order.

[maben996@hadoop104 flume-1.7.0]$ bin/flume-ng agent -n a3 -c conf/ -f jobs/multiplexing/flume3-flume-logger.conf -Dflume.root.logger=info,console
[maben996@hadoop103 flume-1.7.0]$ bin/flume-ng agent -n a2 -c conf/ -f jobs/multiplexing/flume2-flume-logger.conf -Dflume.root.logger=info,console
[maben996@hadoop102 flume-1.7.0]$ bin/flume-ng agent -n a1 -c conf/ -f jobs/multiplexing/flume1-netcat-flume.conf

Send data to port 44444 on hadoop102 to test the result
```
nc hadoop102 44444
m
d
s
f
a
1
2
2
3
4
5
```
Only the letters appear in the output of flume2 on hadoop103

Only the digits appear in the output of flume3 on hadoop104

IV. Monitoring Flume with Ganglia

Aside: monitoring Flume with JSON Reporting

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

Open http://<hostname>:34545/metrics to view Flume's metrics in JSON format.

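4.1 Installing and Deploying Ganglia

Ganglia consists of three parts: gmond, gmetad, and gweb.

gmond (Ganglia Monitoring Daemon) is a lightweight service installed on every node from which metrics are collected. With gmond you can easily collect many system metrics, such as CPU, memory, disk, network, and active-process data.

gmetad (Ganglia Meta Daemon) is the service that aggregates all of this information and stores it on disk in RRD format.

gweb (Ganglia Web) is Ganglia's visualization tool, a PHP front end that displays the data stored by gmetad in a browser. Its web interface charts the various metrics collected while the cluster is running.

Install the httpd service and php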
```
sudo yum -y install httpd php
```

Install the other dependencies

sudo yum -y install rrdtool perl-rrdtool rrdtool-devel
sudo yum -y install apr-devel

Install ganglia

sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo yum -y install ganglia-gmetad 
sudo yum -y install ganglia-web
sudo yum -y install ganglia-gmond

Edit the configuration file /etc/httpd/conf.d/ganglia.conf

sudo vim /etc/httpd/conf.d/ganglia.conf

Change the configuration as follows

# Ganglia monitoring system php web frontend
Alias /ganglia /usr/share/ganglia
<Location /ganglia>
  Order deny,allow
  Deny from all
  Allow from all
  # Allow from 127.0.0.1
  # Allow from ::1
  # Allow from .example.com
</Location>

Edit the configuration file /etc/ganglia/gmetad.conf

sudo vim /etc/ganglia/gmetad.conf

Change it to

data_source "hadoop102" 192.168.12.102

Edit the configuration file /etc/ganglia/gmond.conf

sudo vim /etc/ganglia/gmond.conf

Change it to

cluster {
  name = "hadoop102"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71
  host = 192.168.12.102
  port = 8649
  ttl = 1
}
udp_recv_channel {
  # mcast_join = 239.2.11.71
  port = 8649
  bind = 192.168.12.102
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

Edit the configuration file /etc/selinux/config

sudo vim /etc/selinux/config

Change it to

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted

Note: the SELinux change takes effect only after a reboot. To avoid rebooting right away, apply it temporarily:

sudo setenforce 0

Start ganglia

sudo service httpd start
sudo service gmetad start
sudo service gmond start

Open the ganglia page in a browser
```
http://192.168.12.102/ganglia
```
Note: if a permission error still appears after the steps above, change the permissions on the /var/lib/ganglia directory
```
sudo chmod -R 777 /var/lib/ganglia
```

4.2 Running Flume to Test Monitoring

Edit flume-env.sh in the /opt/module/flume/conf directory

JAVA_OPTS="-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=192.168.12.102:8649
-Xms100m
-Xmx200m"

Start the Flume job

bin/flume-ng agent -n a1 -c conf/ -f jobs/netcat-logger.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=192.168.12.102:8649

Send data and watch the Ganglia charts

nc hadoop102 44444
maben
maben
chengfan

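Channel metrics:

Field (chart name)	Meaning
EventPutAttemptCount	Total number of events the source has attempted to write to the channel
EventPutSuccessCount	Total number of events successfully written to the channel and committed
EventTakeAttemptCount	Total number of times the sink has attempted to take an event from the channel. Not every attempt returns an event, because the channel may be empty when the sink polls it.
EventTakeSuccessCount	Total number of events the sink has successfully taken
StartTime	Time at which the channel started (milliseconds)
StopTime	Time at which the channel stopped (milliseconds)
ChannelSize	Number of events currently in the channel
ChannelFillPercentage	Percentage of the channel's capacity in use
ChannelCapacity	Capacity of the channel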
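
The last three fields are related: the fill percentage is the current size as a share of the capacity. A minimal sketch of that arithmetic follows, assuming this is how the value is derived; `ChannelMetricsSketch` and `fillPercentage` are hypothetical names, not part of Flume.

```java
// Hypothetical helper relating ChannelSize, ChannelCapacity, and ChannelFillPercentage.
class ChannelMetricsSketch {

    // Share of the channel's capacity currently in use, as a percentage.
    static double fillPercentage(long channelSize, long channelCapacity) {
        return 100.0 * channelSize / channelCapacity;
    }

    public static void main(String[] args) {
        // 2500 events buffered in a channel with capacity 10000
        System.out.println(fillPercentage(2500, 10000)); // prints 25.0
    }
}
```

A fill percentage that keeps climbing suggests the sink is draining the channel more slowly than the source is filling it.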

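V. Advanced Flume: A Custom MySQLSource

5.1 About Custom Sources

The Source is the component that receives data into the Flume Agent. Source components handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

Requirement: monitor MySQL in real time and move data from MySQL to HDFS or another storage framework, which requires implementing our own MySQLSource.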

5.2 Structure of the Custom MySQLSource

[Figure: structure of the custom MySQLSource]

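5.3 Steps for Writing the Custom MySQLSource

According to the official documentation, a custom MySqlSource must extend the AbstractSource class and implement the Configurable and PollableSource interfaces.

Implement the following methods:

getBackOffSleepIncrement() // not used for now

getMaxBackOffSleepInterval() // not used for now

configure(Context context) // initialize from the context

process() // fetch the data, wrap it in Events, and write them to the Channel; this method is called in a loop. Fetching from MySQL involves fairly complex logic, so a dedicated class, SQLSourceHelper, handles the interaction with MySQL.

stop() // release the related resources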
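
The two back-off methods matter once process() can report that it found nothing to do. The sketch below shows one way a polling loop could use them: the pause grows with each consecutive back-off and is capped at a maximum. It is written as an assumption about how such a runner behaves rather than as a copy of Flume's code, and `PollingBackoffSketch`, `Status`, and `replay` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of pacing a polling loop with a capped, growing back-off.
class PollingBackoffSketch {
    enum Status { READY, BACKOFF }

    // Pause after the n-th consecutive BACKOFF: n * increment, capped at maxInterval.
    static long sleepMillis(int consecutiveBackoffs, long increment, long maxInterval) {
        return Math.min(consecutiveBackoffs * increment, maxInterval);
    }

    // Replays a sequence of process() results and records the pause after each call.
    static List<Long> replay(List<Status> results, long increment, long maxInterval) {
        List<Long> sleeps = new ArrayList<>();
        int consecutive = 0;
        for (Status status : results) {
            if (status == Status.BACKOFF) {
                consecutive++;
                sleeps.add(sleepMillis(consecutive, increment, maxInterval));
            } else {
                consecutive = 0;   // new data arrived, so poll again at once
                sleeps.add(0L);
            }
        }
        return sleeps;
    }

    public static void main(String[] args) {
        List<Status> results = Arrays.asList(
                Status.BACKOFF, Status.BACKOFF, Status.READY, Status.BACKOFF);
        System.out.println(replay(results, 1000, 1500)); // prints [1000, 1500, 0, 1000]
    }
}
```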

5.4 Implementation

1. Create a Maven project and add the pom dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.27</version>
    </dependency>
</dependencies>

2. Add the configuration files

Add jdbc.properties and log4j.properties to the classpath

dbDriver=com.mysql.jdbc.Driver
dbUrl=jdbc:mysql://hadoop102:3306/mysqlsource?useUnicode=true&characterEncoding=utf-8
dbUser=root
dbPassword=00000

#--------console-----------
log4j.rootLogger=info,myconsole,myfile
log4j.appender.myconsole=org.apache.log4j.ConsoleAppender
log4j.appender.myconsole.layout=org.apache.log4j.SimpleLayout
#log4j.appender.myconsole.layout.ConversionPattern =%d [%t] %-5p [%c] - %m%n

#log4j.rootLogger=error,myfile
log4j.appender.myfile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.myfile.File=/tmp/flume.log
log4j.appender.myfile.layout=org.apache.log4j.PatternLayout
log4j.appender.myfile.layout.ConversionPattern =%d [%t] %-5p [%c] - %m%n

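3. SQLSourceHelper

1) Properties:

Property	Description (default value in parentheses)
runQueryDelay	Interval between queries (10000)
batchSize	Buffer size (100)
startFrom	id at which the query starts (0)
currentIndex	Current id for the query; the metadata table is read before each query
recordSixe	Number of rows returned by the query
table	Name of the monitored table
columnsToSelect	Columns to select (*)
customQuery	Query supplied by the user
query	Query statement
defaultCharsetResultSet	Character encoding (UTF-8)

2) Methods:

Method	Description
SQLSourceHelper(Context context)	Constructor; initializes the properties and obtains a JDBC connection
InitConnection(String url, String user, String pw)	Obtains a JDBC connection
checkMandatoryProperties()	Checks that the required properties are set (more checks can be added in real projects)
buildQuery()	Builds the SQL statement for the current state; returns a String
executeQuery()	Runs the SQL query; returns List<List>
getAllRows(List<List> queryResult)	Converts the query result to Strings for later processing
updateOffset2DB(int size)	Writes the offset to the metadata table after each query
execSql(String sql)	Executes a SQL statement
getStatusDBIndex(int startFrom)	Reads the offset from the metadata table
queryOne(String sql)	Runs the SQL statement that reads the offset from the metadata table
close()	Releases resources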


package com.atguigu.source;

import org.apache.flume.Context;
import org.apache.flume.conf.ConfigurationException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.sql.*;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class SQLSourceHelper {

    private static final Logger LOG = LoggerFactory.getLogger(SQLSourceHelper.class);

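    private int runQueryDelay,   // interval between two queries, in milliseconds
            startFrom,            // starting id
            currentIndex,         // current id
            recordSixe = 0,       // row count used to compute the stored offset
            maxRow;               // maximum number of rows per query

    private String table,         // table to read from
            columnsToSelect,      // columns to select, as supplied by the user
            customQuery,          // query supplied by the user
            query,                // the query that is built
            defaultCharsetResultSet; // character set

    // Context, used to read the agent configuration
    private Context context;

    // Default values for the fields above; they can be overridden in the Flume job configuration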
    private static final int DEFAULT_QUERY_DELAY = 10000;
    private static final int DEFAULT_START_VALUE = 0;
    private static final int DEFAULT_MAX_ROWS = 2000;
    private static final String DEFAULT_COLUMNS_SELECT = "*";
    private static final String DEFAULT_CHARSET_RESULTSET = "UTF-8";

    private static Connection conn = null;
    private static PreparedStatement ps = null;
    private static String connectionURL, connectionUserName, connectionPassword;

    // Load the static resources (JDBC settings and driver)
    static {

        Properties p = new Properties();

        try {
            p.load(SQLSourceHelper.class.getClassLoader().getResourceAsStream("jdbc.properties"));
            connectionURL = p.getProperty("dbUrl");
            connectionUserName = p.getProperty("dbUser");
            connectionPassword = p.getProperty("dbPassword");
            Class.forName(p.getProperty("dbDriver"));

        } catch (IOException | ClassNotFoundException e) {
            LOG.error(e.toString());
        }
    }

    // Obtain the JDBC connection
    private static Connection InitConnection(String url, String user, String pw) {
        try {
            Connection conn = DriverManager.getConnection(url, user, pw);
            if (conn == null)
                throw new SQLException();
            return conn;
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return null;
    }

    // Constructor
    SQLSourceHelper(Context context) throws ParseException {

        // Keep the context
        this.context = context;

        // Parameters with defaults: read them from the Flume job configuration, falling back to the default
        this.columnsToSelect = context.getString("columns.to.select", DEFAULT_COLUMNS_SELECT);
        this.runQueryDelay = context.getInteger("run.query.delay", DEFAULT_QUERY_DELAY);
        this.startFrom = context.getInteger("start.from", DEFAULT_START_VALUE);
        this.defaultCharsetResultSet = context.getString("default.charset.resultset", DEFAULT_CHARSET_RESULTSET);

        // Parameters without defaults: read them from the Flume job configuration
        this.table = context.getString("table");
        this.customQuery = context.getString("custom.query");
        connectionURL = context.getString("connection.url");
        connectionUserName = context.getString("connection.user");
        connectionPassword = context.getString("connection.password");

        // Validate the configuration before connecting: a parameter that has
        // no default and was not set raises an exception
        checkMandatoryProperties();

        conn = InitConnection(connectionURL, connectionUserName, connectionPassword);

        // Read the current id
        currentIndex = getStatusDBIndex(startFrom);

        // Build the query
        query = buildQuery();
    }

    // Validate the configuration (table and the database connection parameters)
    private void checkMandatoryProperties() {

        if (table == null) {
            throw new ConfigurationException("property table not set");
        }

        if (connectionURL == null) {
            throw new ConfigurationException("connection.url property not set");
        }

        if (connectionUserName == null) {
            throw new ConfigurationException("connection.user property not set");
        }

        if (connectionPassword == null) {
            throw new ConfigurationException("connection.password property not set");
        }
    }

    // Build the SQL statement
    private String buildQuery() {

        // Reload the current offset (last processed id) from the metadata table
        currentIndex = getStatusDBIndex(startFrom);
        LOG.info("current index: {}", currentIndex);

        String sql;
        if (customQuery == null) {
            sql = "SELECT " + columnsToSelect + " FROM " + table;
        } else {
            sql = customQuery;
        }

        // Use id as the offset
        if (!sql.toLowerCase().contains("where")) {
            return sql + " where id>" + currentIndex;
        } else {
            // A query that already has a WHERE clause is assumed to end with the
            // previous offset; strip those trailing digits, however many there are,
            // and append the new offset. Cutting a fixed number of characters
            // breaks as soon as the offset gains a digit (9 -> 10).
            return sql.replaceAll("\\d+\\s*$", "") + currentIndex;
        }
    }

    // Execute the query
    List<List<Object>> executeQuery() {

        // The SQL is rebuilt on every call because the offset id changes. It is kept
        // in a local variable so that the user-supplied customQuery is not overwritten.
        String sql = buildQuery();

        // Holds the result rows
        List<List<Object>> results = new ArrayList<>();

        // try-with-resources closes the statement and the result set after each query
        try (PreparedStatement statement = conn.prepareStatement(sql);
             ResultSet result = statement.executeQuery()) {

            int columnCount = result.getMetaData().getColumnCount();

            while (result.next()) {

                // Holds one row (several columns)
                List<Object> row = new ArrayList<>();

                for (int i = 1; i <= columnCount; i++) {
                    row.add(result.getObject(i));
                }

                results.add(row);
            }

            LOG.info("execSql:" + sql + "\nresultSize:" + results.size());

            return results;
        } catch (SQLException e) {
            LOG.error(e.toString());

            // Reconnect
            conn = InitConnection(connectionURL, connectionUserName, connectionPassword);
        }

        // null signals that the query failed
        return null;
    }

    // Convert the result set to strings: each row is a list, and each of these lists becomes one comma-separated string
    List<String> getAllRows(List<List<Object>> queryResult) {

        List<String> allRows = new ArrayList<>();

        if (queryResult == null || queryResult.isEmpty())
            return allRows;

        StringBuilder row = new StringBuilder();

        for (List<Object> rawRow : queryResult) {

            Object value = null;

            for (Object aRawRow : rawRow) {

                value = aRawRow;

                if (value == null) {
                    row.append(",");
                } else {
                    row.append(aRawRow.toString()).append(",");
                }
            }

            allRows.add(row.toString());
            row = new StringBuilder();
        }

        return allRows;
    }

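    // Update the offset in the metadata table; called after each batch of results.
    // The offset of every query has to be recorded so that an interrupted job can resume. id is used as the offset.
    void updateOffset2DB(int size) {
        // New offset = offset used by the last query + number of rows just read.
        // This assumes consecutive ids. A plain running count would start again
        // from 0 after a restart and make the source re-read old rows.
        recordSixe = currentIndex + size;

        // source_tab is the key: insert if it does not exist, update if it does (one record per source table)
        String sql = "insert into flume_meta(source_tab,currentIndex) VALUES('"
                + this.table
                + "','" + recordSixe
                + "') on DUPLICATE key update source_tab=values(source_tab),currentIndex=values(currentIndex)";

        LOG.info("updateStatus Sql:" + sql);

        execSql(sql);
    }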

    // Execute a SQL statement
    private void execSql(String sql) {

        try {
            ps = conn.prepareStatement(sql);

            LOG.info("exec::" + sql);

            ps.execute();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    // Read the offset (current id)
    private Integer getStatusDBIndex(int startFrom) {

        // Look up the current id in the flume_meta table
        String dbIndex = queryOne("select currentIndex from flume_meta where source_tab='" + table + "'");

        if (dbIndex != null) {
            return Integer.parseInt(dbIndex);
        }

        // No record means this is the first query or nothing has been stored yet; return the configured start value
        return startFrom;
    }

    // Run a query that returns a single value (the current id)
    private String queryOne(String sql) {

        ResultSet result = null;

        try {
            ps = conn.prepareStatement(sql);
            result = ps.executeQuery();

            while (result.next()) {
                return result.getString(1);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }

        return null;
    }

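    // Release the JDBC resources
    void close() {

        try {
            // Either field may still be null if the connection was never established
            if (ps != null) {
                ps.close();
            }
            if (conn != null) {
                conn.close();
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }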

    int getCurrentIndex() {
        return currentIndex;
    }

    void setCurrentIndex(int newValue) {
        currentIndex = newValue;
    }

    int getRunQueryDelay() {
        return runQueryDelay;
    }

    String getQuery() {
        return query;
    }

    String getConnectionURL() {
        return connectionURL;
    }

    private boolean isCustomQuerySet() {
        return (customQuery != null);
    }

    Context getContext() {
        return context;
    }

    public String getConnectionUserName() {
        return connectionUserName;
    }

    public String getConnectionPassword() {
        return connectionPassword;
    }

    String getDefaultCharsetResultSet() {
        return defaultCharsetResultSet;
    }
}
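The id-as-offset idea behind buildQuery can be tried out without a database. The following is a minimal standalone sketch, not the helper class itself: OffsetQuerySketch and withOffset are hypothetical names, and it assumes that a query which already has a WHERE clause ends with the previous offset value.

```java
// Hypothetical standalone helper illustrating the id-as-offset idea
class OffsetQuerySketch {

    // Returns baseQuery restricted to rows whose id is greater than offset.
    // Assumption: a query that already has a WHERE clause ends with the previous offset value.
    static String withOffset(String baseQuery, int offset) {
        if (!baseQuery.toLowerCase().contains("where")) {
            return baseQuery + " where id>" + offset;
        }
        // Strip the old trailing digits, however many there are, then append the new offset
        return baseQuery.replaceAll("\\d+\\s*$", "") + offset;
    }

    public static void main(String[] args) {
        // First query: no WHERE clause yet
        System.out.println(withOffset("SELECT * FROM student", 0));
        // Later query: the offset grows from one digit to two
        System.out.println(withOffset("SELECT * FROM student where id>9", 10));
    }
}
```

The second call is the interesting one: removing a fixed number of characters from the end of the query would leave a malformed condition once the offset gains a digit, whereas stripping all trailing digits does not depend on the offset's length.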

4. SQLSource (the custom MySQL source)

package com.atguigu.source;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.text.ParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class SQLSource extends AbstractSource implements Configurable, PollableSource {

    // Logger
    private static final Logger LOG = LoggerFactory.getLogger(SQLSource.class);

    // Helper that talks to MySQL
    private SQLSourceHelper sqlSourceHelper;

    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }

    @Override
    public void configure(Context context) {

        try {
            // Initialize the helper from the agent configuration
            sqlSourceHelper = new SQLSourceHelper(context);
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    @Override
    public Status process() throws EventDeliveryException {

        try {
            // Query the source table
            List<List<Object>> result = sqlSourceHelper.executeQuery();

            // Holds the events
            List<Event> events = new ArrayList<>();

            // Holds the event headers
            HashMap<String, String> header = new HashMap<>();

            // If rows were returned, wrap each of them in an event.
            // executeQuery returns null when the query fails, so check for that first.
            if (result != null && !result.isEmpty()) {

                List<String> allRows = sqlSourceHelper.getAllRows(result);

                Event event = null;

                for (String row : allRows) {
                    event = new SimpleEvent();
                    event.setBody(row.getBytes());
                    event.setHeaders(header);
                    events.add(event);
                }

                // Write the events to the channel
                this.getChannelProcessor().processEventBatch(events);

                // Update the offset in the metadata table
                sqlSourceHelper.updateOffset2DB(result.size());
            }

            // Wait before the next query
            Thread.sleep(sqlSourceHelper.getRunQueryDelay());

            return Status.READY;
        } catch (InterruptedException e) {
            LOG.error("Error processing row", e);

            // Restore the interrupt flag so the caller can see the interruption
            Thread.currentThread().interrupt();

            return Status.BACKOFF;
        }
    }

    @Override
    public synchronized void stop() {

        LOG.info("Stopping sql source {} ...", getName());

        try {
            // Release the resources
            sqlSourceHelper.close();
        } finally {
            super.stop();
        }
    }
}

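5.5 Testing

1. Prepare the JARs

Copy the MySQL driver JAR into Flume's lib directory (the Flume directory was renamed to flume during installation)

cp /opt/software/mysql-libs/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar /opt/module/flume/lib/

Package the project and put its JAR into Flume's lib directory as well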

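2. Prepare the configuration file

Create the configuration file in the jobs directory and open it

vim jobs/mysql.conf

Add the following content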

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.atguigu.source.SQLSource  
a1.sources.r1.connection.url = jdbc:mysql://192.168.12.102:3306/mysqlsource
a1.sources.r1.connection.user = root  
a1.sources.r1.connection.password = 000000  
a1.sources.r1.table = student  
a1.sources.r1.columns.to.select = *  
#a1.sources.r1.incremental.column.name = id  
#a1.sources.r1.incremental.value = 0 
a1.sources.r1.run.query.delay=5000

# Describe the sink
a1.sinks.k1.type = logger

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Prepare the MySQL tables

Create the mysqlsource database
```
CREATE DATABASE mysqlsource;
```

In the mysqlsource database, create the data table student and the metadata table flume_meta

CREATE TABLE `student` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `flume_meta` (
`source_tab` varchar(255) NOT NULL,
`currentIndex` varchar(255) NOT NULL,
PRIMARY KEY (`source_tab`)
);

Insert rows into the data table

insert into student(id,name) values(1,'zhangsan');
insert into student(id,name) values(2,'lisi');
insert into student(id,name) values(3,'wangwu');
insert into student(id,name) values(4,'zhaoliu');

4. Run the test and check the result

Start Flume

bin/flume-ng agent -c conf/ -n a1 -f jobs/mysql.conf -Dflume.root.logger=INFO,console

The test failed!!!

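6. Supplementary Knowledge (Key Points)

6.1 How can Flume data transfer be monitored?

Use the third-party framework Ganglia to monitor Flume in real time.

6.2 What do Flume's Source, Sink, and Channel do? Which Source types are commonly used?

The Source component collects data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

The Channel component buffers the collected data, either in memory (Memory Channel) or on disk (File Channel).

The Sink component sends the data to its destination. Destinations include HDFS, Logger, avro, thrift, ipc, file, HBase, solr, and custom ones.

Commonly used: the taildir Source to monitor back-end log files, or the netcat Source to monitor the port on which the back end emits logs.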

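6.3 Flume's Channel Selectors

Channel Selectors let logs from different projects travel through different Channels to different Sinks. Two types can be configured on a Source: the Replicating Channel Selector (the default) and the Multiplexing Channel Selector.

The difference between the two: Replicating sends every event from the Source to all Channels, while Multiplexing chooses which Channels each event goes to.

A Multiplexing Channel Selector is used together with an interceptor, typically a custom one that sets a value in the Event's Header; the selector then routes the Event according to that Header.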
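The difference between the two strategies can be shown with a small simulation. This is only a sketch in plain Java, not Flume's selector classes: SelectorSketch, the header name "project", and the channel names c1 and c2 are all made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the two selector strategies; channels are plain strings
class SelectorSketch {

    // Replicating: every event goes to every channel
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // Multiplexing: the value of one header picks the channels;
    // values without a mapping fall back to the default channels
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaultChannels) {
        String value = headers.get(headerName);
        return mapping.getOrDefault(value, defaultChannels);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("order", Collections.singletonList("c1"));
        mapping.put("click", Collections.singletonList("c2"));

        // An interceptor would normally be the one setting this header
        Map<String, String> headers = new HashMap<>();
        headers.put("project", "click");

        System.out.println(replicating(Arrays.asList("c1", "c2")));   // prints [c1, c2]
        System.out.println(multiplexing(headers, "project", mapping,
                Collections.singletonList("c1")));                     // prints [c2]
    }
}
```

An event whose header is missing or has an unmapped value ends up in the default channels, which is why the interceptor that sets the header matters.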

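6.4 Flume Parameter Tuning

Source:

① Increasing the number of Sources (or, with the Taildir Source, the number of file groups) increases the capacity to read data.

② The batchSize parameter determines how many events a Source sends to the Channel in one batch. Raising it moderately improves the throughput from Source to Channel.

Channel:

① With type memory the Channel performs best, but data may be lost if the Flume process dies unexpectedly. With type file the Channel is more fault tolerant, but slower than the memory channel.

② With the file Channel, pointing dataDirs at directories on several different disks improves performance.

③ The capacity parameter determines the maximum number of events the Channel can hold. The transactionCapacity parameter determines the maximum number of events a Source writes to the Channel, and a Sink reads from the Channel, in one transaction. transactionCapacity must be no smaller than the batchSize of the Source and the Sink.

Sink:

① Increasing the number of Sinks increases the capacity to consume events. More is not always better, though: extra Sinks beyond what is needed take up system resources and waste them.

② The batchSize parameter determines how many events a Sink reads from the Channel in one batch. Raising it moderately improves the throughput from Channel to Sink.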

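6.5 Flume's Transaction Mechanism

Flume's transaction mechanism is similar to that of a database. Flume uses two independent transactions: one for delivering events from the Source to the Channel (the Put transaction) and one for delivering events from the Channel to the Sink (the Take transaction). If for some reason the events cannot be recorded, the transaction is rolled back.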
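The two transactions can be pictured with a small in-memory model. This is a deliberately simplified sketch and not Flume's channel implementation: the class ChannelTxSketch and its method names are invented here, and it assumes a Put commit fails as a whole when the channel lacks room.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory model of a channel with Put and Take transactions
class ChannelTxSketch {
    private final Deque<String> queue = new ArrayDeque<>();    // events committed to the channel
    private final List<String> putList = new ArrayList<>();    // events staged by the source
    private final Deque<String> takeList = new ArrayDeque<>(); // events handed to the sink, not yet confirmed
    private final int capacity;

    ChannelTxSketch(int capacity) { this.capacity = capacity; }

    void put(String event) { putList.add(event); }

    // Put commit: either all staged events enter the channel or none do
    boolean commitPut() {
        if (queue.size() + putList.size() > capacity) {
            putList.clear();   // rollback: the source has to send the batch again
            return false;
        }
        queue.addAll(putList);
        putList.clear();
        return true;
    }

    String take() {
        String event = queue.pollFirst();
        if (event != null) takeList.addLast(event);
        return event;
    }

    // Take commit: the sink confirmed delivery, so the events can be forgotten
    void commitTake() { takeList.clear(); }

    // Take rollback: return unconfirmed events to the head of the channel in their original order
    void rollbackTake() {
        while (!takeList.isEmpty()) queue.addFirst(takeList.pollLast());
    }

    int size() { return queue.size(); }

    public static void main(String[] args) {
        ChannelTxSketch channel = new ChannelTxSketch(2);
        channel.put("e1");
        channel.put("e2");
        System.out.println(channel.commitPut());   // true
        channel.put("e3");
        System.out.println(channel.commitPut());   // false, the channel is full
        channel.take();
        channel.take();
        channel.rollbackTake();                    // the sink failed to deliver
        System.out.println(channel.size());        // 2
        System.out.println(channel.take());        // e1
    }
}
```

After the Take rollback both events are back in the channel in their original order, so the sink will read e1 again on the next attempt.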

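6.6 Can Flume lose data while collecting it?

Not when the Channel stores its events in files (File Channel), because data transfer is also controlled by transactions. With a Memory Channel, events can still be lost if the Flume process dies, as noted in 6.4.

Data may be duplicated, however. In the Take transaction, the doTake method adds events to the takeList while they are also being written out. If the transaction is rolled back, what has already been written out is not deleted, so the same events are written again and end up duplicated.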

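Workaround: attach a unique identifier within the transaction, so that downstream processing can use these identifiers to remove the duplicates.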
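A downstream deduplication step along these lines could look like the sketch below. It is only an illustration under stated assumptions: every event is modelled as an {id, body} pair, the id is assumed to have been attached upstream, and DedupSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical downstream step: drop events whose unique id has already been seen
class DedupSketch {

    // Each event is modelled as {id, body}; the id is assumed to be attached upstream
    static List<String[]> dedup(List<String[]> events) {
        Set<String> seen = new HashSet<>();
        List<String[]> unique = new ArrayList<>();
        for (String[] event : events) {
            // Set.add returns false when the id is already present
            if (seen.add(event[0])) {
                unique.add(event);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
                new String[]{"id-1", "zhangsan"},
                new String[]{"id-2", "lisi"},
                new String[]{"id-1", "zhangsan"});   // written again after a rollback
        System.out.println(dedup(events).size());     // 2
    }
}
```

The first occurrence of each id is kept and the order is preserved. For a long-running stream the set of seen ids would have to be bounded or persisted, which this sketch leaves out.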
