1 Basics
1.1 Overview
Flume is a distributed, highly reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
1.2 Architecture
1.2.1 Agent
An Agent is a JVM process that delivers data from a source to a destination in the form of events.
An Agent consists of three main components: Source, Channel, and Sink.
Inside the Agent:
- ChannelSelector: a ChannelSelector decides which Channel(s) an Event is sent to. There are two types: Replicating and Multiplexing. The ReplicatingSelector sends every Event to all Channels, while the Multiplexing selector routes different Events to different Channels according to configurable rules.
- SinkProcessor: there are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor. DefaultSinkProcessor works with a single Sink; LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group. LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failure recovery (high availability / failover).
1.2.2 Source
A Source is the component that receives data into the Flume Agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy.
- avro: receives data from other Flume agents
- exec: runs a shell command; can be used to collect local data
- netcat: listens on a port and collects the data arriving on it
1.2.3 Channel
A Channel is a buffer between the Source and the Sink, allowing them to operate at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Flume ships with two Channels: Memory Channel and File Channel.
- Memory Channel: an in-memory queue. It is suitable when losing data is acceptable. If data loss matters, Memory Channel should not be used, because a process crash, machine failure, or restart will lose the data.
- File Channel: writes all events to disk, so data is not lost when the process shuts down or the machine crashes.
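As a sketch, a File Channel could be configured like this (`checkpointDir` and `dataDirs` are the standard File Channel properties; the directory paths are placeholders):

```properties
a1.channels = c1
a1.channels.c1.type = file
# Directory where the channel keeps its checkpoint (placeholder path)
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint
# Comma-separated directories where event data is stored (placeholder path)
a1.channels.c1.dataDirs = /tmp/flume/data
```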
1.2.4 Sink
A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.
Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.
1.2.5 Event
The Event is Flume's basic unit of data transfer; data travels from source to destination as Events. An Event has two parts: a Header and a Body. The Header is a key-value structure holding attributes of the event; the Body holds the payload itself as a byte array.
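As a sketch of the data model only (plain JDK code, not Flume's real `Event` class), the header/body structure described above looks like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model of a Flume Event: a K-V header map plus a byte-array body.
// Illustrates the data model; the class name and fields are hypothetical.
public class EventModel {
    final Map<String, String> headers = new HashMap<>(); // event attributes (K-V)
    byte[] body;                                         // payload bytes

    EventModel(String payload) {
        this.body = payload.getBytes(StandardCharsets.UTF_8);
    }

    String bodyAsString() {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        EventModel e = new EventModel("hello flume");
        e.headers.put("type", "a"); // headers are used for routing (see multiplexing)
        System.out.println(e.headers + " | " + e.bodyAsString());
    }
}
```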
1.3 Transactions
Flume guards delivery between components with channel transactions. On the Source side, a put transaction stages incoming events (doPut), then either commits them into the Channel (doCommit) or discards them (doRollback). On the Sink side, a take transaction stages events removed from the Channel (doTake), then either confirms the removal once the Sink has written them out (doCommit) or returns them to the Channel (doRollback).
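Flume's channel transaction semantics (put on the Source side, take on the Sink side) can be sketched with a toy single-threaded in-memory channel. This is a simplified model for illustration, not Flume's real API; real channels are thread-safe and bounded:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy single-threaded model of Flume's channel transaction semantics.
public class ToyChannel {
    private final Deque<String> queue = new ArrayDeque<>();    // committed events
    private final Deque<String> putList = new ArrayDeque<>();  // staged puts
    private final Deque<String> takeList = new ArrayDeque<>(); // staged takes

    // Source side: stage an event (doPut)
    void put(String event) { putList.addLast(event); }

    // Sink side: stage removal of an event (doTake); null when the channel is empty
    String take() {
        String e = queue.pollFirst();
        if (e != null) takeList.addLast(e);
        return e;
    }

    // doCommit: staged puts become visible; staged takes are forgotten
    void commit() {
        while (!putList.isEmpty()) queue.addLast(putList.pollFirst());
        takeList.clear();
    }

    // doRollback: staged puts are dropped; staged takes go back to the channel head
    void rollback() {
        putList.clear();
        while (!takeList.isEmpty()) queue.addFirst(takeList.pollLast());
    }

    int size() { return queue.size(); }
}
```

The key property: events put but not committed are invisible to the sink, and events taken but rolled back return to the channel, so a failed sink write never loses data.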
1.4 Topologies
1.4.1 Chained agents
In this mode, multiple Flume agents are connected in sequence, from the first source to the storage system fed by the final sink. Chaining too many agents is not recommended: it lowers throughput, and if any agent in the chain goes down, the whole pipeline is affected.
1.4.2 Replicating and multiplexing
Flume can fan an event flow out to one or more destinations. In this mode, the same data can be replicated to multiple channels, or different data can be routed to different channels, and each sink can deliver to a different destination.
1.4.3 Load balancing and failover
Flume can logically group multiple sinks into a sink group. Combined with the appropriate SinkProcessor, a sink group provides load balancing or failure recovery.
1.4.4 Aggregation
This is the most common and most practical pattern. Web applications typically run on hundreds of servers, sometimes thousands or tens of thousands, and the logs they produce are cumbersome to process. This Flume topology solves the problem well: each server runs a Flume agent that collects its logs and sends them to a central aggregating Flume agent, which then uploads the data to HDFS, Hive, HBase, etc. for log analysis.
2 Download and Installation
Prerequisite: a Java runtime environment, Java 1.8 or later.
- Download from the official website
- Unpack the archive
- Add the environment variables
export FLUME_HOME=/usr/local/big_data/apache-flume-1.9.0-bin
export PATH=$PATH:$FLUME_HOME/bin
- Delete guava-11.0.2.jar from the lib directory; its guava version conflicts with Hadoop's.
rm $FLUME_HOME/lib/guava-11.0.2.jar
3 Examples
3.1 Official example
Use Flume to listen on a port, collect the data arriving on it, and print that data to the console.
# Install netcat, a lightweight networking tool
yum install nc -y
# Check whether port 44444 is already in use
netstat -nlp | grep 44444
# Create a job directory under the Flume root
mkdir job
cd job
vim flume-netcat-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Start Flume
bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console
# Start netcat; data typed here is received by Flume
nc localhost 44444
3.2 Replicating and multiplexing
Flume-1 monitors changes to a file and forwards the new content to Flume-2 and Flume-3. Flume-2 stores it in HDFS, while Flume-3 writes it to the local file system.
- Flume-1: flume-file-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F $HIVE_HOME/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# An avro sink is a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- Flume-2: flume-flume-hdfs.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# An avro source is a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://mycluster/flume2/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders based on time
a2.sinks.k1.hdfs.round = true
# How many time units before a new folder is created
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rolling folders
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll the file at roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Do not roll files based on Event count
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Flume-3: flume-flume-dir.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
# The output directory must already exist; it is not created automatically
a3.sinks.k1.sink.directory = /tmp/data/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
# Start the agents (downstream avro servers a3 and a2 before the upstream a1)
flume-ng agent -c $FLUME_HOME/conf/ -n a3 -f $FLUME_HOME/job/group1/flume-flume-dir.conf
flume-ng agent -c $FLUME_HOME/conf/ -n a2 -f $FLUME_HOME/job/group1/flume-flume-hdfs.conf
flume-ng agent -c $FLUME_HOME/conf/ -n a1 -f $FLUME_HOME/job/group1/flume-file-flume.conf
3.3 Load balancing and failover
Flume-1 monitors a port; the sinks in its sink group connect to Flume-2 and Flume-3. Using FailoverSinkProcessor, this implements failover.
- Flume-1: flume-netcat-flume.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Load balancing (alternative)
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.backoff = true
# The selector defaults to round_robin; random picks sinks at random
# a1.sinkgroups.g1.processor.selector = random
# Failover
a1.sinkgroups.g1.processor.type = failover
# Priorities: the sink with the higher priority is used first
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
- Flume-2: flume-flume-console1.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Flume-3: flume-flume-console2.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
# Start the agents (downstream first)
flume-ng agent -c $FLUME_HOME/conf/ -n a3 \
-f $FLUME_HOME/job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
flume-ng agent -c $FLUME_HOME/conf/ -n a2 \
-f $FLUME_HOME/job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
flume-ng agent -c $FLUME_HOME/conf/ -n a1 -f $FLUME_HOME/job/group2/flume-netcat-flume.conf
3.4 Aggregation
Flume-1 on hadoop102 monitors the file /tmp/data/group.log, and Flume-2 on hadoop103 monitors the data stream on a port. Both send their data to Flume-3 on hadoop104, which prints the final data to the console.
- Flume-1: flume1-logger-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/data/group.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Flume-2: flume2-netcat-flume.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Flume-3: flume3-flume-logger.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
# Start the agents (downstream first)
flume-ng agent -c $FLUME_HOME/conf/ -n a3 \
-f $FLUME_HOME/job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
flume-ng agent -c $FLUME_HOME/conf/ -n a2 \
-f $FLUME_HOME/job/group3/flume2-netcat-flume.conf
flume-ng agent -c $FLUME_HOME/conf/ -n a1 \
-f $FLUME_HOME/job/group3/flume1-logger-flume.conf
# Test
echo 'hello' >> /tmp/data/group.log
nc hadoop103 44444
3.5 Custom interceptor
Flume-1 on hadoop102 monitors port 44444 of hadoop102. Events whose data contains 'a' are sent to Flume-2 on hadoop103, which prints them to the console; events whose data contains 'null' are filtered out; all other events are sent to Flume-3 on hadoop104 and printed to the console.
3.5.1 Code
- Add the dependency
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
        </dependency>
    </dependencies>
</project>
- Write the interceptor
package com.guoli.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Custom interceptor.
 * 1. Implement the Interceptor interface.
 * 2. Provide a Builder as a static inner class.
 *
 * @author guoli
 * @date 2022-02-26 19:02
 */
public class TypeInterceptor implements Interceptor {

    /** Collection holding the events processed by this interceptor */
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        // Initialize the collection of processed events
        addHeaderEvents = new ArrayList<>();
    }

    /**
     * Process a single event.
     *
     * @param event the incoming event
     * @return the processed event, or null to drop it
     */
    @Override
    public Event intercept(Event event) {
        // 1. Get the event's header
        Map<String, String> headers = event.getHeaders();
        // 2. Get the event's body
        String body = new String(event.getBody());
        // 3. Drop the event if the body contains "null"
        if (body.contains("null")) {
            return null;
        }
        // 4. Add a header entry when the body contains "a"
        if (body.contains("a")) {
            headers.put("type", "a");
        }
        return event;
    }

    /**
     * Process a batch of events.
     *
     * @param events the incoming events
     * @return the processed events
     */
    @Override
    public List<Event> intercept(List<Event> events) {
        // 1. Clear the collection
        addHeaderEvents.clear();
        // 2. Process each event; dropped events come back as null and are skipped
        for (Event event : events) {
            Event intercepted = intercept(event);
            if (intercepted != null) {
                addHeaderEvents.add(intercepted);
            }
        }
        // 3. Return the result
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    /** Builder: used by Flume to construct the interceptor */
    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
- Build the jar and put it in the $FLUME_HOME/lib directory
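Taken together, the interceptor plus a multiplexing selector route each event by its body. As a standalone sketch of that decision (plain JDK code, no Flume classes; the channel names c1/c2 are illustrative):

```java
// Sketch of the routing logic: drop bodies containing "null",
// send bodies containing "a" to channel "c1", everything else to "c2".
// Illustration only; not Flume's Interceptor/ChannelSelector API.
public class RoutingSketch {
    /** @return the channel name, or null when the event is dropped */
    static String route(String body) {
        if (body.contains("null")) {
            return null;   // interceptor drops the event
        }
        if (body.contains("a")) {
            return "c1";   // header type=a -> mapped channel
        }
        return "c2";       // default channel
    }

    public static void main(String[] args) {
        System.out.println(route("alpha"));  // c1
        System.out.println(route("hello"));  // c2
        System.out.println(route("null x")); // null (dropped)
    }
}
```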
3.5.2 Configuration
- flume-1: flume1-netcat-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Interceptor configuration
# Declare the interceptors (there can be more than one)
a1.sources.r1.interceptors = i1
# Fully qualified class name of interceptor i1's Builder
a1.sources.r1.interceptors.i1.type = com.guoli.flume.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
# Header key the selector routes on
a1.sources.r1.selector.header = type
# Events whose header value is "a" go to channel c1
a1.sources.r1.selector.mapping.a = c1
# Default channel for all other events
a1.sources.r1.selector.default = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- flume-2: flume2-flume-logger.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
- flume-3: flume3-flume-logger.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
3.5.3 Test
# Start on hadoop104
flume-ng agent -n a1 -c $FLUME_HOME/conf \
-f $FLUME_HOME/job/group4/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
# Start on hadoop103
flume-ng agent -n a1 -c $FLUME_HOME/conf \
-f $FLUME_HOME/job/group4/flume2-flume-logger.conf -Dflume.root.logger=INFO,console
# Start on hadoop102
flume-ng agent -n a1 -c $FLUME_HOME/conf -f $FLUME_HOME/job/group4/flume1-netcat-flume.conf
# On hadoop102, send test data
nc localhost 44444
3.6 Custom source
A custom source that emits the values 0-4, each with a prefix configured in the configuration file, printed to the console.
3.6.1 Code
- Add the dependency
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
        </dependency>
    </dependencies>
</project>
- Write the custom source
package com.guoli.flume.source;

import org.apache.flume.Context;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;
import java.util.Map;

/**
 * Custom source.
 * Prepends a prefix to everything it emits.
 *
 * @author guoli
 * @date 2022-02-27 12:35
 */
public class MySource extends AbstractSource implements Configurable, PollableSource {

    /** Delay between writes from the source into the channel */
    private long delay;

    /** Prefix prepended to every event body */
    private String prefix;

    @Override
    public Status process() {
        try {
            // Header map for the events
            Map<String, String> headerMap = new HashMap<>();
            // Build and emit the events in a loop
            for (int i = 0; i < 5; i++) {
                SimpleEvent event = new SimpleEvent();
                // Set the event's header
                event.setHeaders(headerMap);
                // Set the event's body
                event.setBody((prefix + i).getBytes());
                // Write the event to the channel
                getChannelProcessor().processEvent(event);
                Thread.sleep(delay);
            }
        } catch (Exception e) {
            e.printStackTrace();
            return Status.BACKOFF;
        }
        return Status.READY;
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }

    /**
     * Read settings from the configuration file.
     *
     * @param context configuration context
     */
    @Override
    public void configure(Context context) {
        delay = context.getLong("delay");
        // The key must match the property name used in the agent config ("prefix")
        prefix = context.getString("prefix", "Hello!");
    }
}
- Build the jar and put it in the $FLUME_HOME/lib directory
3.6.2 Configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = com.guoli.flume.source.MySource
a1.sources.r1.delay = 1000
a1.sources.r1.prefix = hello
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3.6.3 Test
flume-ng agent -n a1 -c $FLUME_HOME/conf/ -f $FLUME_HOME/job/flume-my-source.conf \
-Dflume.root.logger=INFO,console
3.7 Custom sink
Flume receives data, and the sink adds a prefix and a suffix to every record before printing it to the console. The prefix and suffix are set in the Flume job configuration file.
3.7.1 Code
- Add the dependency
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
        </dependency>
    </dependencies>
</project>
- Write the custom sink
package com.guoli.flume.sink;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Custom sink.
 * Adds a prefix and a suffix to each event body.
 *
 * @author guoli
 * @date 2022-02-27 14:21
 */
public class MySink extends AbstractSink implements Configurable {

    private static final Logger LOG = LoggerFactory.getLogger(MySink.class);

    /** Prefix */
    private String prefix;

    /** Suffix */
    private String suffix;

    @Override
    public Status process() {
        // Return status
        Status status;
        // Channel this sink is bound to
        Channel channel = getChannel();
        // Get a transaction
        Transaction transaction = channel.getTransaction();
        Event event;
        // Start the transaction
        transaction.begin();
        try {
            // Read from the channel until an event arrives
            do {
                event = channel.take();
            } while (event == null);
            // Process the event (log it with prefix and suffix)
            LOG.info(prefix + new String(event.getBody()) + suffix);
            // Commit the transaction
            transaction.commit();
            status = Status.READY;
        } catch (Exception e) {
            // On error, roll the transaction back
            transaction.rollback();
            status = Status.BACKOFF;
        } finally {
            // Close the transaction
            transaction.close();
        }
        return status;
    }

    @Override
    public void configure(Context context) {
        // Read from the configuration file, with a default value
        prefix = context.getString("prefix", "hello:");
        // Read from the configuration file, no default
        suffix = context.getString("suffix");
    }
}
- Build the jar and put it in the $FLUME_HOME/lib directory
3.7.2 Configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = com.guoli.flume.sink.MySink
a1.sinks.k1.prefix = hello:
a1.sinks.k1.suffix = :world
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3.7.3 Test
flume-ng agent -n a1 -c $FLUME_HOME/conf/ -f $FLUME_HOME/job/flume-my-sink.conf \
-Dflume.root.logger=INFO,console