Big Data: Advanced Flume (Chapter 3)

小坏讲微服务

Article link: https://blog.csdn.net/qq_42082701/article/details/120729852

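Advanced Flume

I. Flume Transactions
II. Flume Agent Internals
- 1. ChannelSelector
- 2. SinkProcessor
III. Flume Topologies
IV. Flume Enterprise Development Case
- 1. Replication and Multiplexing
V. Load Balancing and Failover
VI. Aggregation
VII. Custom Interceptor
VIII. Custom Source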

I. Flume Transactions

[Figure: Flume transactions]

II. Flume Agent Internals

[Figure: Flume Agent internals]

1. ChannelSelector

The ChannelSelector decides which Channel an Event is sent to. There are two types: Replicating and Multiplexing.

A Replicating selector sends the same Event to every Channel. A Multiplexing selector routes different Events to different Channels according to configured rules.
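The difference between the two selector types can be sketched in plain Java. This is only an illustration under stated assumptions: SelectorSketch, its method names, and the defaultChannel fallback parameter are hypothetical names made up for this example, not Flume classes, and channels are represented by their names only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; these are NOT Flume classes.
final class SelectorSketch {

    // Replicating: every event is handed to every configured channel.
    static List<String> replicating(List<String> allChannels) {
        return new ArrayList<>(allChannels);
    }

    // Multiplexing: read one header value and look it up in a mapping.
    // defaultChannel is a hypothetical fallback for unmapped or missing values.
    static List<String> multiplexing(Map<String, String> headers,
                                     String headerKey,
                                     Map<String, String> mapping,
                                     String defaultChannel) {
        String value = headers.get(headerKey);
        String channel = mapping.getOrDefault(value, defaultChannel);
        List<String> result = new ArrayList<>();
        result.add(channel);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("first", "c1");
        mapping.put("second", "c2");

        Map<String, String> headers = new HashMap<>();
        headers.put("type", "second");

        List<String> channels = new ArrayList<>();
        channels.add("c1");
        channels.add("c2");

        System.out.println("replicating  -> " + replicating(channels));
        System.out.println("multiplexing -> " + multiplexing(headers, "type", mapping, "c1"));
    }
}
```

The multiplexing branch depends on something having set the header beforehand, which is why this selector is usually paired with an interceptor.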

2. SinkProcessor

There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.

DefaultSinkProcessor works with a single Sink, while LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group. LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failover (error recovery).
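As a rough picture of what failover means for a Sink Group, the sketch below picks the live sink with the largest priority value. It is a simplification written for this article: FailoverSketch and pickSink are hypothetical names, sinks are represented by their names, and the penalty and retry timing of the real processor is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; this is NOT Flume's FailoverSinkProcessor.
final class FailoverSketch {

    // Returns the live sink with the largest priority, or null when no sink is live.
    static String pickSink(Map<String, Integer> priorities, Set<String> liveSinks) {
        String best = null;
        int bestPriority = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : priorities.entrySet()) {
            boolean live = liveSinks.contains(entry.getKey());
            if (live && (best == null || entry.getValue() > bestPriority)) {
                best = entry.getKey();
                bestPriority = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new HashMap<>();
        priorities.put("k1", 5);
        priorities.put("k2", 10);

        Set<String> live = new HashSet<>();
        live.add("k1");
        live.add("k2");
        System.out.println("both sinks live: " + pickSink(priorities, live));

        live.remove("k2");
        System.out.println("k2 down:         " + pickSink(priorities, live));
    }
}
```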

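III. Flume Topologies

1. Simple Chaining

This pattern chains several Flume agents in sequence, from the first source to the storage system that the final sink delivers to. Chaining too many agents is not recommended: it slows down transfer, and if any agent in the chain goes down, the whole pipeline is affected.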

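2. Replication and Multiplexing

Flume can send an event stream to one or more destinations. In this pattern the same data can be replicated to several channels, or different data can be routed to different channels, and each sink can deliver to a different destination.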

3. Load Balancing and Failover

Flume supports grouping several sinks logically into a sink group. Combined with different SinkProcessors, a sink group can provide load balancing or failover.

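4. Aggregation

This is the most common pattern and a very practical one. Everyday web applications are usually spread across hundreds of servers, and large ones across thousands or even tens of thousands, so the logs they produce are cumbersome to handle. This combination of Flume agents solves the problem well: each server runs a Flume agent that collects its logs and sends them to a central log-collecting Flume agent, which then uploads them to HDFS, Hive, HBase, and so on for log analysis.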

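IV. Flume Enterprise Development Case

1. Replication and Multiplexing

1) Requirements

Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it in HDFS. At the same time, Flume-1 passes the new content to Flume-3, which writes it to the local file system.

2) Requirements analysis

[Figure: single data source with multiple outlets (selector)]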

3) Implementation steps

(1) Preparation

Create a group1 folder under /opt/module/flume/job

[atguigu@hadoop102 job]$ mkdir group1
[atguigu@hadoop102 job]$ cd group1/

Create a flume3 folder under /opt/module/datas/

[atguigu@hadoop102 datas]$ mkdir flume3

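(2) Create flume-file-flume.conf

Configure one source that receives the log file, two channels, and two sinks, which feed flume-flume-hdfs and flume-flume-dir respectively.

Edit the configuration file

[atguigu@hadoop102 group1]$ vim flume-file-flume.conf

Add the following content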

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data stream to all channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log 
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# On the sink side, avro acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

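(3) Create flume-flume-hdfs.conf

Configure a source that receives the output of the upstream Flume, and a sink that writes to HDFS.

Edit the configuration file

[atguigu@hadoop102 group1]$ vim flume-flume-hdfs.conf

Add the following content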

# Name the components on this agent
a2.sources = r1
a2.sinks = k1 
a2.channels = c1

# Describe/configure the source
# On the source side, avro acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink 
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9820/flume2/%Y%m%d/%H

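# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders by time (round the timestamp down)
a2.sinks.k1.hdfs.round = true
# How many time units pass before a new folder is created
a2.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) to roll to a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll size of each file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling does not depend on the number of Events
a2.sinks.k1.hdfs.rollCount = 0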

# Describe the channel 
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
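The three roll settings above (rollInterval, rollSize, rollCount) combine as "roll when any enabled trigger fires", with 0 disabling a trigger. The sketch below shows that decision in plain Java. RollSketch and shouldRoll are hypothetical names for this example, and the real sink handles the time trigger with a timer rather than a check like this, so treat it as a simplified model.

```java
// Plain-Java illustration only; this is NOT the HDFS sink's code.
final class RollSketch {

    // A threshold of 0 disables the corresponding trigger.
    static boolean shouldRoll(long secondsOpen, long bytesWritten, long eventsWritten,
                              long rollInterval, long rollSize, long rollCount) {
        boolean byTime = rollInterval > 0 && secondsOpen >= rollInterval;
        boolean bySize = rollSize > 0 && bytesWritten >= rollSize;
        boolean byCount = rollCount > 0 && eventsWritten >= rollCount;
        return byTime || bySize || byCount;
    }

    public static void main(String[] args) {
        // Thresholds taken from the configuration above.
        long rollInterval = 30;
        long rollSize = 134217700L;
        long rollCount = 0;

        System.out.println("10 s, 1 KB, 1,000,000 events: "
                + shouldRoll(10, 1024, 1_000_000, rollInterval, rollSize, rollCount));
        System.out.println("30 s, 1 KB, 10 events:        "
                + shouldRoll(30, 1024, 10, rollInterval, rollSize, rollCount));
    }
}
```

With rollCount = 0, the first call does not roll no matter how many events have been written, which is what the last comment in the configuration means.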

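(4) Create flume-flume-dir.conf

Configure a source that receives the output of the upstream Flume, and a sink that writes to a local directory.

Edit the configuration file

[atguigu@hadoop102 group1]$ vim flume-flume-dir.conf

Add the following content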

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source 
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/datas/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Note: the local output directory must already exist. If it does not, Flume will not create it.
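Because the directory has to exist before the agent starts, a small pre-flight check can help. The sketch below creates the directory if it is missing, using only the Java standard library. EnsureDirSketch and ensureDirectory are hypothetical names, and the default path is the flume3 folder from the preparation step.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Pre-flight helper for illustration; not part of Flume.
final class EnsureDirSketch {

    // Creates the directory (and any missing parents); does nothing if it already exists.
    static Path ensureDirectory(String dir) {
        try {
            return Files.createDirectories(Paths.get(dir));
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot create " + dir, e);
        }
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : "/opt/module/datas/flume3";
        System.out.println("Output directory ready: " + ensureDirectory(dir));
    }
}
```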

(5) Run the configuration files

Start the corresponding Flume processes in this order: flume-flume-dir, flume-flume-hdfs, flume-file-flume.

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf

(6) Start Hadoop and Hive

[atguigu@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[atguigu@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh
[atguigu@hadoop102 hive]$ bin/hive
hive (default)>

(7) Check the data on HDFS

(8) Check the data in the /opt/module/datas/flume3 directory

[atguigu@hadoop102 flume3]$ ll
总用量 8
-rw-rw-r--. 1 atguigu atguigu 5942 5 月	22 00:09 1526918887550-3

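V. Load Balancing and Failover

1. Failover

1) Requirements

Flume1 monitors a port. The sinks in its sink group connect to Flume2 and Flume3 respectively, and a FailoverSinkProcessor provides failover.

2) Requirements analysis

[Figure: failover case]

3) Implementation steps

(1) Preparation

Create a group2 folder under /opt/module/flume/job
[atguigu@hadoop102 job]$ mkdir group2
[atguigu@hadoop102 job]$ cd group2/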

(2) Create flume-netcat-flume.conf

Configure one netcat source, one channel, and one sink group (two sinks), which feed flume-flume-console1 and flume-flume-console2 respectively.

Edit the configuration file

[atguigu@hadoop102 group2]$ vim flume-netcat-flume.conf

Add the following content

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinkgroups.g1.processor.type = failover 
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

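(3) Create flume-flume-console1.conf

Configure a source that receives the output of the upstream Flume, and a sink that prints to the local console.

Edit the configuration file

[atguigu@hadoop102 group2]$ vim flume-flume-console1.conf

Add the following content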

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

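(4) Create flume-flume-console2.conf

Configure a source that receives the output of the upstream Flume, and a sink that prints to the local console.

Edit the configuration file

[atguigu@hadoop102 group2]$ vim flume-flume-console2.conf

Add the following content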

# Name the components on this agent
a3.sources = r1
a3.sinks = k1 
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

(5) Run the configuration files

Start the agents in this order: flume-flume-console2, flume-flume-console1, flume-netcat-flume.

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

(6) Use netcat to send content to port 44444 on the local machine

$ nc localhost 44444

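(7) Watch the console logs of Flume2 and Flume3

(8) Kill the agent that is currently receiving events and watch the console of the other one. With the priorities configured above, k2 (priority 10) outranks k1 (priority 5), so events go first to the agent listening on port 4142 (a3).

Note: use jps -ml to find the Flume processes.

Summary: the failover processor sends events to the available sink with the highest priority. The lower-priority agent receives events only while the higher-priority one is down. Once the higher-priority agent is back up, traffic switches back to it.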

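2. Load Balancing

Load balancing only requires copying the failover configuration and modifying it. Starting and testing work the same way.

1) Reference

Configuration documentation

2) Copy the failover configuration and modify it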

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Load-balancing configuration
a1.sinkgroups.g1.processor.type = load_balance

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
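The configuration above does not set a selection policy, so the load_balance processor falls back to its default, which to my knowledge is round robin (the Flume User Guide also lists a random option). Round-robin selection can be sketched as follows. RoundRobinSketch is a hypothetical class for illustration, not Flume's implementation, and it ignores failed sinks and backoff.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT Flume's LoadBalancingSinkProcessor.
final class RoundRobinSketch {
    private final List<String> sinks;
    private int next = 0;

    RoundRobinSketch(List<String> sinks) {
        this.sinks = new ArrayList<>(sinks);
    }

    // Returns the sinks in turn: first, second, ..., then back to the first.
    String pick() {
        String sink = sinks.get(next);
        next = (next + 1) % sinks.size();
        return sink;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("k1");
        names.add("k2");
        RoundRobinSketch selector = new RoundRobinSketch(names);
        for (int i = 0; i < 4; i++) {
            System.out.println("batch " + i + " -> " + selector.pick());
        }
    }
}
```

In the test setup this means both consoles (ports 4141 and 4142) should receive data in alternation rather than one staying idle.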

3) Start and test

[hadoop@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console

[hadoop@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console

[hadoop@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume2.conf

VI. Aggregation

1) Requirements

Flume-1 on hadoop102 monitors the file /opt/module/group.log.

Flume-2 on hadoop103 monitors the data stream on a port.

Flume-1 and Flume-2 send their data to Flume-3 on hadoop104, and Flume-3 prints the final data to the console.

2) Requirements analysis

[Figure: multi-source aggregation case]

3) Implementation steps

(1) Preparation

Distribute Flume
[atguigu@hadoop102 module]$ xsync flume

Create a group3 folder under /opt/module/flume/job on hadoop102, hadoop103, and hadoop104.

[atguigu@hadoop102 job]$ mkdir group3
[atguigu@hadoop103 job]$ mkdir group3
[atguigu@hadoop104 job]$ mkdir group3

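(2) Create flume1-logger-flume.conf

Configure a source that monitors the group.log file, and a sink that sends data to the next-level Flume.
Edit the configuration file on hadoop102

[atguigu@hadoop102 group3]$ vim flume1-logger-flume.conf

Add the following content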

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

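(3) Create flume2-netcat-flume.conf

Configure a source that monitors the data stream on port 44444, and a sink that sends data to the next-level Flume.

Edit the configuration file on hadoop103

[atguigu@hadoop103 group3]$ vim flume2-netcat-flume.conf

Add the following content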

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

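(4) Create flume3-flume-logger.conf

Configure a source that receives the data streams sent by flume1 and flume2, and a sink that prints the merged stream to the console.

Edit the configuration file on hadoop104
[atguigu@hadoop104 group3]$ touch flume3-flume-logger.conf
[atguigu@hadoop104 group3]$ vim flume3-flume-logger.conf

Add the following content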

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

(5) Run the configuration files

Start the agents in this order: flume3-flume-logger.conf, flume2-netcat-flume.conf, flume1-logger-flume.conf.

Start command on hadoop104

bin/flume-ng agent -c conf/ -n a3 -f job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console

Start command on hadoop103

bin/flume-ng agent -c conf/ -n a2 -f job/group3/flume2-netcat-flume.conf

Start command on hadoop102

bin/flume-ng agent -c conf/ -n a1 -f job/group3/flume1-logger-flume.conf

(6) On hadoop102, append content to group.log in the /opt/module directory

[atguigu@hadoop102 module]$ echo 'hello' >> group.log

(7) Send data to port 44444 on hadoop103

[atguigu@hadoop102 flume]$ nc hadoop103 44444

(8) Check the data on hadoop104

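VII. Custom Interceptor

1) Requirements

Use Flume to collect local server logs and send different types of logs to different analysis systems.

2) Requirements analysis

In real development, one server may produce many types of logs, and different types may need to go to different analysis systems. This calls for the Multiplexing structure among the Flume topologies. Multiplexing routes events to different Channels according to the value of a particular key in each event's Header, so a custom Interceptor is needed to give that key a different value for each type of event.

In this case we simulate logs with port data, and simulate different log types by whether the data contains "atguigu". We need a custom interceptor that checks whether the data contains "atguigu" and sends it to the corresponding analysis system (Channel).

[Figure: Interceptor and Multiplexing ChannelSelector case]

3) Implementation steps

(1) Create a Maven project and add the following dependency.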

<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>

(2) Define the TypeInterceptor class and implement the Interceptor interface.

package com.atguigu.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {

    // Collection that holds the events after the header has been added
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        // Initialize the collection that holds the events
        addHeaderEvents = new ArrayList<>();
    }

    // Intercept a single event
    @Override
    public Event intercept(Event event) {

        // 1. Get the headers of the event
        Map<String, String> headers = event.getHeaders();

        // 2. Get the body of the event
        String body = new String(event.getBody());

        // 3. Decide which header value to add based on whether the body contains "atguigu"
        if (body.contains("atguigu")) {
            // 4. Add the header
            headers.put("type", "first");
        } else {
            // 4. Add the header
            headers.put("type", "second");
        }

        return event;
    }

    // Intercept a batch of events
    @Override
    public List<Event> intercept(List<Event> events) {

        // 1. Clear the collection
        addHeaderEvents.clear();

        // 2. Iterate over the events
        for (Event event : events) {
            // 3. Add the header to each event
            addHeaderEvents.add(intercept(event));
        }

        // 4. Return the result
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

(3) Edit the Flume configuration files

For Flume1 on hadoop102, configure one netcat source and two avro sinks (one per channel), together with the corresponding ChannelSelector and interceptor.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.first = c1
a1.sources.r1.selector.mapping.second = c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

For Flume2 on hadoop103, configure one avro source and one logger sink.

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

For Flume3 on hadoop104, configure one avro source and one logger sink.

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

(4) Start the Flume processes on hadoop102, hadoop103, and hadoop104, paying attention to the order.

(5) On hadoop102, use netcat to send letters and numbers to localhost:44444.

(6) Watch the logs printed on hadoop103 and hadoop104.

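VIII. Custom Source

1) Introduction

A Source is the component that receives data into a Flume Agent. Source components can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy. The official distribution already provides many source types, but sometimes they do not meet the needs of real development, and then we have to write a custom source.

The official documentation also describes the interface for custom sources:

https://flume.apache.org/FlumeDeveloperGuide.html#source According to the official description, a custom MySource must extend the AbstractSource class and implement the Configurable and PollableSource interfaces.

Implement the following methods:
getBackOffSleepIncrement() // backoff step

getMaxBackOffSleepInterval() // maximum backoff time
configure(Context context) // initialize the context (read the configuration file contents)
process() // fetch data, wrap it in an event, and write it to the channel; this method is called in a loop.

Use case: reading data from MySQL or from other file systems.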
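To give a feel for how the two backoff methods might be used, the sketch below computes a sleep time that grows by the backoff step for each consecutive BACKOFF result and is capped at the maximum. This formula reflects my understanding of how the runner that drives a PollableSource behaves, so treat it as an assumption to verify against your Flume version. BackoffSketch and sleepMillis are hypothetical names.

```java
// Plain-Java illustration only; the formula is an assumption, not quoted from Flume.
final class BackoffSketch {

    // Sleep time after the given number of consecutive BACKOFF results.
    static long sleepMillis(int consecutiveBackoffs, long increment, long maxInterval) {
        return Math.min(consecutiveBackoffs * increment, maxInterval);
    }

    public static void main(String[] args) {
        long increment = 1000;   // example value for getBackOffSleepIncrement()
        long maxInterval = 5000; // example value for getMaxBackOffSleepInterval()
        for (int n = 1; n <= 6; n++) {
            System.out.println("after " + n + " consecutive backoffs: sleep "
                    + sleepMillis(n, increment, maxInterval) + " ms");
        }
    }
}
```

Under this model, a source that returns 0 from both methods never sleeps after a BACKOFF result and is polled again immediately.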

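2) Requirements

Use Flume to receive data, add a prefix to each record, and print it to the console. The prefix can be set in the Flume configuration file.

[Figure: custom Source requirements]

3) Analysis

[Figure: custom Source requirements analysis]

4) Coding

(1) Add the pom dependency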

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
</dependencies>

(2) Write the code

package com.atguigu;

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    // Fields that will be read from the configuration file
    private Long delay;
    private String field;

    // Initialize the configuration
    @Override
    public void configure(Context context) {
        delay = context.getLong("delay");
        field = context.getString("field", "Hello!");
    }

    @Override
    public Status process() throws EventDeliveryException {

        try {
            // Create the event headers
            HashMap<String, String> headerMap = new HashMap<>();
            // Create the event
            SimpleEvent event = new SimpleEvent();
            // Fill in and send the event in a loop
            for (int i = 0; i < 5; i++) {
                // Set the event headers
                event.setHeaders(headerMap);
                // Set the event body
                event.setBody((field + i).getBytes());
                // Write the event to the channel
                getChannelProcessor().processEvent(event);
                Thread.sleep(delay);
            }
        } catch (Exception e) {
            e.printStackTrace();
            return Status.BACKOFF;
        }
        return Status.READY;
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }
}

5) Testing

(1) Packaging

Package the finished code and put the jar in Flume's lib directory (under /opt/module/flume).

(2) Configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.atguigu.MySource
a1.sources.r1.delay = 1000
#a1.sources.r1.field = atguigu

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Start the job

[atguigu@hadoop102 flume]$ pwd
/opt/module/flume
[atguigu@hadoop102 flume]$ bin/flume-ng agent -c conf/ -f job/mysource.conf -n a1 -Dflume.root.logger=INFO,console