Flume Study Notes

1 Basics

1.1 Overview

Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

1.2 Architecture

(Figure: Agent component diagram)

1.2.1 Agent

An Agent is a JVM process that moves data, in the form of events, from a source to a destination.

An Agent consists of three main components: Source, Channel, and Sink.

How an Agent works internally:


  • ChannelSelector:

    A ChannelSelector decides which Channel(s) an Event will be sent to. There are two types: Replicating and Multiplexing.

    The ReplicatingSelector sends every Event to all configured Channels, while the Multiplexing selector routes different Events to different Channels according to configurable rules.

  • SinkProcessor:

    There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.

    DefaultSinkProcessor drives a single Sink; LoadBalancingSinkProcessor and FailoverSinkProcessor operate on a Sink Group. LoadBalancingSinkProcessor spreads events across the group's sinks, and FailoverSinkProcessor provides failure recovery (high availability / failover). A configuration sketch follows this list.
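
Both are wired in through the agent configuration. A minimal sketch (the agent and component names a1, r1, c1, c2, k1, k2 are assumptions for illustration; the same properties appear in the examples in part 3):

# Multiplexing selector: route events by the value of the "type" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.a = c1
a1.sources.r1.selector.default = c2

# Failover sink processor: the highest-priority healthy sink is used
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5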

1.2.2 Source

A Source is the component that receives data into the Flume Agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy. A few common ones, with a taildir sketch after the list:

  • avro: receives data sent by another Flume agent
  • exec: runs a shell command; useful for collecting local data
  • netcat: listens on a port for incoming data
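
For example, a minimal taildir source sketch (the position-file path and log-file pattern are assumptions):

a1.sources.r1.type = TAILDIR
# Records how far each file has been read, so tailing resumes after a restart
a1.sources.r1.positionFile = /tmp/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /tmp/logs/.*\.log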

1.2.3 Channel

A Channel is the buffer between a Source and a Sink, which allows them to operate at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

Flume ships with two Channels: Memory Channel and File Channel.

  • Memory Channel: an in-memory queue. It is suitable only when data loss does not matter: a process crash, machine failure, or restart loses whatever is still in the queue.
  • File Channel: writes all events to disk, so data survives a process exit or machine failure (a configuration sketch follows).
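
A minimal File Channel sketch (the checkpoint and data directories are assumptions):

a1.channels.c1.type = file
# Where the channel keeps its checkpoint
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint
# Comma-separated list of directories for the event data files
a1.channels.c1.dataDirs = /tmp/flume/data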

1.2.4 Sink

A Sink continuously polls the Channel for events, removes them in batches, and writes each batch to a storage or indexing system, or forwards it to another Flume Agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

1.2.5 Event

The Event is Flume's basic unit of data transfer; data travels from source to destination as Events. An Event has two parts: the Header stores the event's attributes as key-value pairs, and the Body stores the payload as a byte array.
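
To see the Header/Body split concretely: the http source's default JSONHandler accepts a JSON array of events, each with a headers map and a body string, so an event can be injected like this (assumes an agent with an http source listening on localhost:5140):

curl -X POST -H 'Content-Type: application/json' \
  -d '[{"headers": {"type": "a"}, "body": "hello flume"}]' \
  http://localhost:5140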

1.3 Transactions

Flume moves events between components inside channel transactions. A Source writes to a Channel within a put transaction: doPut stages events in a buffer, doCommit writes them into the Channel, and doRollback discards the staged events on failure. A Sink reads within a take transaction: doTake stages events out of the Channel, doCommit removes them once the downstream write succeeds, and doRollback returns them to the Channel.

1.4 Topologies

1.4.1 Simple chain


In this mode several Flume agents are chained in sequence, from the first source to the storage system behind the final sink. Avoid chaining too many agents: every extra hop lowers throughput, and if any agent in the chain goes down, the whole pipeline stops.

1.4.2 Replication and multiplexing


Flume can fan an event stream out to one or more destinations. In this mode the same data can be replicated into multiple channels, or different data can be routed to different channels, with each sink delivering to its own destination.

1.4.3 Load balancing and failover


Flume can group several sinks into one logical sink group; combined with the appropriate SinkProcessor, the group provides load balancing or failure recovery.

1.4.4 Aggregation


This is the most common and most practical pattern. Web applications are usually spread over hundreds, sometimes thousands or tens of thousands, of servers, and the logs they produce are painful to handle one by one. In this layout, each server runs a Flume agent that collects its logs and sends them to a central aggregating agent, which uploads them to HDFS, Hive, HBase, and so on for analysis.

2 Download and Installation

Prerequisite: a Java runtime environment, Java 1.8 or later.

  1. Download from the official site

  2. Unpack the archive

  3. Add the environment variables

    export FLUME_HOME=/usr/local/big_data/apache-flume-1.9.0-bin
    export PATH=$PATH:$FLUME_HOME/bin
    
  4. Delete guava-11.0.2.jar from the lib directory, because it conflicts with the guava version shipped with Hadoop.

    rm $FLUME_HOME/lib/guava-11.0.2.jar
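
  5. Verify the installation (make sure the PATH change from step 3 is active in the current shell):

    flume-ng version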
    

3 Examples

3.1 Official example

Use Flume to listen on a port, collect the data arriving on that port, and print it to the console.

# Install netcat, a lightweight networking tool
yum install nc -y
# Check whether the port is already in use
netstat -nlp | grep 44444
# Create a job directory under the Flume root
mkdir job
cd job
vim flume-netcat-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Start Flume
bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console

# Start netcat; anything typed here is received by Flume
nc localhost 44444
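
After typing a line such as hello into the netcat session, the Flume console should log the event, roughly like the following (exact spacing and hex padding vary by version):

Event: { headers:{} body: 68 65 6C 6C 6F       hello }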

3.2 Replication and multiplexing

Flume-1 monitors a file for changes and passes the new content to both Flume-2 and Flume-3. Flume-2 stores it in HDFS, while Flume-3 writes it to the local filesystem.

  • Flume-1: flume-file-flume.conf

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1 k2
    a1.channels = c1 c2
    
    # Replicate the data stream to all channels
    a1.sources.r1.selector.type = replicating
    # Describe/configure the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F $HIVE_HOME/logs/hive.log
    a1.sources.r1.shell = /bin/bash -c
    
    # Describe the sink
    # An avro sink is a data sender
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = hadoop102
    a1.sinks.k1.port = 4141
    a1.sinks.k2.type = avro
    a1.sinks.k2.hostname = hadoop102
    a1.sinks.k2.port = 4142
    
    # Describe the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    a1.channels.c2.type = memory
    a1.channels.c2.capacity = 1000
    a1.channels.c2.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1 c2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c2
    
  • Flume-2: flume-flume-hdfs.conf

    # Name the components on this agent
    a2.sources = r1
    a2.sinks = k1
    a2.channels = c1
    
    # Describe/configure the source
    # An avro source is a data-receiving service
    a2.sources.r1.type = avro
    a2.sources.r1.bind = hadoop102
    a2.sources.r1.port = 4141
    
    # Describe the sink
    a2.sinks.k1.type = hdfs
    a2.sinks.k1.hdfs.path = hdfs://mycluster/flume2/%Y%m%d/%H
    # Prefix for files uploaded to HDFS
    a2.sinks.k1.hdfs.filePrefix = flume2-
    # Whether to roll directories based on rounded-down time
    a2.sinks.k1.hdfs.round = true
    # Number of time units per new directory
    a2.sinks.k1.hdfs.roundValue = 1
    # Time unit used for rounding
    a2.sinks.k1.hdfs.roundUnit = hour
    # Use the local timestamp instead of one from the event headers
    a2.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a2.sinks.k1.hdfs.batchSize = 100
    # File type; compressed formats are also supported
    a2.sinks.k1.hdfs.fileType = DataStream
    # Seconds before rolling a new file
    a2.sinks.k1.hdfs.rollInterval = 30
    # Roll size per file, just under 128 MB
    a2.sinks.k1.hdfs.rollSize = 134217700
    # Do not roll files based on event count
    a2.sinks.k1.hdfs.rollCount = 0
    
    # Describe the channel
    a2.channels.c1.type = memory
    a2.channels.c1.capacity = 1000
    a2.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a2.sources.r1.channels = c1
    a2.sinks.k1.channel = c1
    
  • Flume-3:flume-flume-dir.conf

    # Name the components on this agent
    a3.sources = r1
    a3.sinks = k1
    a3.channels = c2
    
    # Describe/configure the source
    a3.sources.r1.type = avro
    a3.sources.r1.bind = hadoop102
    a3.sources.r1.port = 4142
    
    # Describe the sink
    a3.sinks.k1.type = file_roll
    # The local output directory must already exist; Flume will not create it
    a3.sinks.k1.sink.directory = /tmp/data/flume3
    
    # Describe the channel
    a3.channels.c2.type = memory
    a3.channels.c2.capacity = 1000
    a3.channels.c2.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a3.sources.r1.channels = c2
    a3.sinks.k1.channel = c2
    
# Start the agents (downstream agents first)
flume-ng agent -c $FLUME_HOME/conf/ -n a3 -f $FLUME_HOME/job/group1/flume-flume-dir.conf
flume-ng agent -c $FLUME_HOME/conf/ -n a2 -f $FLUME_HOME/job/group1/flume-flume-hdfs.conf
flume-ng agent -c $FLUME_HOME/conf/ -n a1 -f $FLUME_HOME/job/group1/flume-file-flume.conf

3.3 Load balancing and failover

Flume1 monitors a port; the sinks in its sink group feed Flume2 and Flume3. A FailoverSinkProcessor provides the failover behavior.

  • Flume1:flume-netcat-flume.conf

    # Name the components on this agent
    a1.sources = r1
    a1.channels = c1
    a1.sinkgroups = g1
    a1.sinks = k1 k2
    
    # Describe/configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    
    # Load balancing (alternative to failover)
    # a1.sinkgroups.g1.processor.type = load_balance
    # a1.sinkgroups.g1.processor.backoff = true
    # Default selector is round_robin; random picks a sink at random
    # a1.sinkgroups.g1.processor.selector = random
    
    # Failover
    a1.sinkgroups.g1.processor.type = failover
    # Priorities (the highest-priority healthy sink is used)
    a1.sinkgroups.g1.processor.priority.k1 = 5
    a1.sinkgroups.g1.processor.priority.k2 = 10
    a1.sinkgroups.g1.processor.maxpenalty = 10000
    
    # Describe the sink
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = hadoop102
    a1.sinks.k1.port = 4141
    a1.sinks.k2.type = avro
    a1.sinks.k2.hostname = hadoop102
    a1.sinks.k2.port = 4142
    
    # Describe the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c1
    
  • Flume2:flume-flume-console1.conf

    # Name the components on this agent
    a2.sources = r1
    a2.sinks = k1
    a2.channels = c1
    
    # Describe/configure the source
    a2.sources.r1.type = avro
    a2.sources.r1.bind = hadoop102
    a2.sources.r1.port = 4141
    
    # Describe the sink
    a2.sinks.k1.type = logger
    
    # Describe the channel
    a2.channels.c1.type = memory
    a2.channels.c1.capacity = 1000
    a2.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a2.sources.r1.channels = c1
    a2.sinks.k1.channel = c1
    
  • Flume3:flume-flume-console2.conf

    # Name the components on this agent
    a3.sources = r1
    a3.sinks = k1
    a3.channels = c2
    
    # Describe/configure the source
    a3.sources.r1.type = avro
    a3.sources.r1.bind = hadoop102
    a3.sources.r1.port = 4142
    
    # Describe the sink
    a3.sinks.k1.type = logger
    
    # Describe the channel
    a3.channels.c2.type = memory
    a3.channels.c2.capacity = 1000
    a3.channels.c2.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a3.sources.r1.channels = c2
    a3.sinks.k1.channel = c2
    
# Start the agents (downstream agents first)
flume-ng agent -c $FLUME_HOME/conf/ -n a3 \
-f $FLUME_HOME/job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console

flume-ng agent -c $FLUME_HOME/conf/ -n a2 \
-f $FLUME_HOME/job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console

flume-ng agent -c $FLUME_HOME/conf/ -n a1 -f $FLUME_HOME/job/group2/flume-netcat-flume.conf
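
With the priorities above, k2 (priority 10, feeding the agent on port 4142, i.e. a3) is preferred, so data sent to the netcat source first appears on the a3 console. One way to watch the failover (killing the JVM is just a crude way to simulate a failure):

nc localhost 44444    # type a line; it shows up on a3's console
jps -ml | grep flume  # find the PID of the a3 agent
kill -9 <a3-pid>      # simulate a crash of the preferred sink's target
# further lines typed into the nc session should now appear on a2's console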

3.4 Aggregation

Flume-1 on hadoop102 monitors the file /tmp/data/group.log, and Flume-2 on hadoop103 monitors a port. Both send their data to Flume-3 on hadoop104, which prints the combined stream to the console.

  • Flume-1:flume1-logger-flume.conf

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    # Describe/configure the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /tmp/data/group.log
    a1.sources.r1.shell = /bin/bash -c
    
    # Describe the sink
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = hadoop104
    a1.sinks.k1.port = 4141
    
    # Describe the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    
  • Flume-2:flume2-netcat-flume.conf

    # Name the components on this agent
    a2.sources = r1
    a2.sinks = k1
    a2.channels = c1
    
    # Describe/configure the source
    a2.sources.r1.type = netcat
    a2.sources.r1.bind = hadoop103
    a2.sources.r1.port = 44444
    
    # Describe the sink
    a2.sinks.k1.type = avro
    a2.sinks.k1.hostname = hadoop104
    a2.sinks.k1.port = 4141
    
    # Use a channel which buffers events in memory
    a2.channels.c1.type = memory
    a2.channels.c1.capacity = 1000
    a2.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a2.sources.r1.channels = c1
    a2.sinks.k1.channel = c1
    
  • Flume-3:flume3-flume-logger.conf

    # Name the components on this agent
    a3.sources = r1
    a3.sinks = k1
    a3.channels = c1
    
    # Describe/configure the source
    a3.sources.r1.type = avro
    a3.sources.r1.bind = hadoop104
    a3.sources.r1.port = 4141
    
    # Describe the sink
    a3.sinks.k1.type = logger
    
    # Describe the channel
    a3.channels.c1.type = memory
    a3.channels.c1.capacity = 1000
    a3.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a3.sources.r1.channels = c1
    a3.sinks.k1.channel = c1
    
# Start the agents (downstream agent first)
flume-ng agent -c $FLUME_HOME/conf/ -n a3 \
-f $FLUME_HOME/job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console

flume-ng agent -c $FLUME_HOME/conf/ -n a2 \
-f $FLUME_HOME/job/group3/flume2-netcat-flume.conf

flume-ng agent -c $FLUME_HOME/conf/ -n a1 \
-f $FLUME_HOME/job/group3/flume1-logger-flume.conf


# Test
echo 'hello' >> /tmp/data/group.log
nc hadoop103 44444

3.5 Custom interceptor

Flume-1 on hadoop102 monitors port 44444. If an event's data contains 'a', it is sent to Flume-2 on hadoop103, which prints it to the console; if the data contains 'null', the event is filtered out; all other events go to Flume-3 on hadoop104, which prints them to the console.

3.5.1 Code

  1. Add the dependency

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <dependencies>
            <dependency>
                <groupId>org.apache.flume</groupId>
                <artifactId>flume-ng-core</artifactId>
                <version>1.9.0</version>
            </dependency>
        </dependencies>
    </project>
    
  2. Write the interceptor

    package com.guoli.flume.interceptor;
    
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;
    
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    
    /**
     * Custom interceptor.
     * 1. Implement the Interceptor interface.
     * 2. Provide a static Builder inner class.
     *
     * @author guoli
     * @date 2022-02-26 19:02
     */
    public class TypeInterceptor implements Interceptor {
        /**
         * Collection that holds the events processed by this interceptor
         */
        private List<Event> addHeaderEvents;
    
        /**
         * Initialization
         */
        @Override
        public void initialize() {
            // Create the collection that holds processed events
            addHeaderEvents = new ArrayList<>();
        }
    
        /**
         * Process a single event.
         *
         * @param event the incoming event
         * @return the processed event, or null if it was filtered out
         */
        @Override
        public Event intercept(Event event) {
            // 1. Get the event's header map
            Map<String, String> headers = event.getHeaders();
            // 2. Get the event body as a string
            String body = new String(event.getBody());
            // 3. Drop events whose body contains "null"
            if (body.contains("null")) {
                return null;
            }
            // 4. Tag events whose body contains "a" with a header entry
            if (body.contains("a")) {
                headers.put("type", "a");
            }
            return event;
        }
    
        /**
         * Process a batch of events.
         *
         * @param events the incoming batch
         * @return the processed batch
         */
        @Override
        public List<Event> intercept(List<Event> events) {
            // 1. Clear the result collection
            addHeaderEvents.clear();
            // 2. Run every event through the single-event method
            for (Event event : events) {
                Event intercepted = intercept(event);
                // 3. Skip events that were filtered out (intercept returned null)
                if (intercepted != null) {
                    addHeaderEvents.add(intercepted);
                }
            }
            // 4. Return the surviving events
            return addHeaderEvents;
        }
    
        @Override
        public void close() {
    
        }
    
        /**
         * Builder inner class
         */
        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() {
                return new TypeInterceptor();
            }
    
            @Override
            public void configure(Context context) {
    
            }
        }
    }
    
  3. Package the jar and put it in the $FLUME_HOME/lib directory.

3.5.2 Configuration

  • flume-1: flume1-netcat-flume.conf

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1 k2
    a1.channels = c1 c2
    
    # Describe/configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    # Interceptor configuration
    # Declare the interceptors (there can be more than one)
    a1.sources.r1.interceptors = i1
    # Fully qualified class name of interceptor i1's Builder
    a1.sources.r1.interceptors.i1.type = com.guoli.flume.interceptor.TypeInterceptor$Builder
    a1.sources.r1.selector.type = multiplexing
    # Header key the selector routes on (set by the interceptor)
    a1.sources.r1.selector.header = type
    # Events whose 'type' header is 'a' are bound to channel c1
    a1.sources.r1.selector.mapping.a = c1
    # Default channel for everything else
    a1.sources.r1.selector.default = c2
    
    # Describe the sink
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = hadoop103
    a1.sinks.k1.port = 4141
    a1.sinks.k2.type = avro
    a1.sinks.k2.hostname = hadoop104
    a1.sinks.k2.port = 4242
    
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    # Use a channel which buffers events in memory
    a1.channels.c2.type = memory
    a1.channels.c2.capacity = 1000
    a1.channels.c2.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1 c2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c2
    
  • flume-2: flume2-flume-logger.conf

    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    a1.sources.r1.type = avro
    a1.sources.r1.bind = hadoop103
    a1.sources.r1.port = 4141
    
    a1.sinks.k1.type = logger
    
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    
    a1.sinks.k1.channel = c1
    a1.sources.r1.channels = c1
    
  • flume-3: flume3-flume-logger.conf

    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    a1.sources.r1.type = avro
    a1.sources.r1.bind = hadoop104
    a1.sources.r1.port = 4242
    
    a1.sinks.k1.type = logger
    
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    
    a1.sinks.k1.channel = c1
    a1.sources.r1.channels = c1
    

3.5.3 Test

# Start on hadoop104
flume-ng agent -n a1 -c $FLUME_HOME/conf \
-f $FLUME_HOME/job/group4/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
# Start on hadoop103
flume-ng agent -n a1 -c $FLUME_HOME/conf \
-f $FLUME_HOME/job/group4/flume2-flume-logger.conf -Dflume.root.logger=INFO,console
# Start on hadoop102
flume-ng agent -n a1 -c $FLUME_HOME/conf -f $FLUME_HOME/job/group4/flume1-netcat-flume.conf

# On hadoop102, send data
nc localhost 44444
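
# Expected routing for sample inputs, per the interceptor and selector above:
# "apple"  contains 'a'    -> header type=a -> channel c1 -> flume-2 console on hadoop103
# "hello"  contains no 'a' -> default       -> channel c2 -> flume-3 console on hadoop104
# "null"   is dropped by the interceptor and appears on neither console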

3.6 Custom source

Official documentation

A custom source that generates the numbers 0-4, prepends a prefix taken from the job configuration, and prints the result to the console.

3.6.1 Code

  1. Add the dependency

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <dependencies>
            <dependency>
                <groupId>org.apache.flume</groupId>
                <artifactId>flume-ng-core</artifactId>
                <version>1.9.0</version>
            </dependency>
        </dependencies>
    </project>
    
  2. Write the custom source

    package com.guoli.flume.source;
    
    import org.apache.flume.Context;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.SimpleEvent;
    import org.apache.flume.source.AbstractSource;
    
    import java.util.HashMap;
    import java.util.Map;
    
    /**
     * Custom source.
     * Prepends a prefix to every payload.
     *
     * @author guoli
     * @date 2022-02-27 12:35
     */
    public class MySource extends AbstractSource implements Configurable, PollableSource {
        /**
         * Wait time between writes from the source into the channel
         */
        private long delay;
    
        /**
         * Prefix
         */
        private String prefix;
    
        @Override
        public Status process() {
            try {
                // Create the event header map
                Map<String, String> headerMap = new HashMap<>();
                // Create the event
                SimpleEvent event = new SimpleEvent();
                // Build and send the events in a loop
                for (int i = 0; i < 5; i++) {
                    // Set the event headers
                    event.setHeaders(headerMap);
                    // Set the event body
                    event.setBody((prefix + i).getBytes());
                    // Hand the event to the channel processor
                    getChannelProcessor().processEvent(event);
                    Thread.sleep(delay);
                }
                }
            } catch (Exception e) {
                e.printStackTrace();
                return Status.BACKOFF;
            }
            return Status.READY;
        }
    
        @Override
        public long getBackOffSleepIncrement() {
            return 0;
        }
    
        @Override
        public long getMaxBackOffSleepInterval() {
            return 0;
        }
    
        /**
         * Read settings from the job configuration.
         *
         * @param context configuration context
         */
        @Override
        public void configure(Context context) {
            // Default the delay to 1000 ms if the property is absent
            delay = context.getLong("delay", 1000L);
            // The key must match the property used in the job file (a1.sources.r1.prefix)
            prefix = context.getString("prefix", "Hello!");
        }
    }
    
  3. Package the jar and put it in the $FLUME_HOME/lib directory.

3.6.2 Configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.guoli.flume.source.MySource
a1.sources.r1.delay = 1000
a1.sources.r1.prefix = hello

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3.6.3 Test

flume-ng agent -n a1 -c $FLUME_HOME/conf/ -f $FLUME_HOME/job/flume-my-source.conf \
-Dflume.root.logger=INFO,console
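
If the source is working, the logger sink prints the five generated events once per delay interval, roughly like the following (spacing varies; hex is the UTF-8 of the body):

Event: { headers:{} body: 68 65 6C 6C 6F 30       hello0 }
Event: { headers:{} body: 68 65 6C 6C 6F 31       hello1 }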

3.7 Custom sink

Receive data with Flume, add a prefix and a suffix to each record in the Sink, and print it to the console. The prefix and suffix are set in the Flume job configuration.

3.7.1 Code

  1. Add the dependency

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <dependencies>
            <dependency>
                <groupId>org.apache.flume</groupId>
                <artifactId>flume-ng-core</artifactId>
                <version>1.9.0</version>
            </dependency>
        </dependencies>
    </project>
    
  2. Write the custom sink

    package com.guoli.flume.sink;
    
    import org.apache.flume.Channel;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.Transaction;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.sink.AbstractSink;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    
    /**
     * Custom sink.
     * Adds a prefix and a suffix to each record.
     *
     * @author guoli
     * @date 2022-02-27 14:21
     */
    public class MySink extends AbstractSink implements Configurable {
    
        private static final Logger LOG = LoggerFactory.getLogger(MySink.class);
    
        /**
         * Prefix
         */
        private String prefix;
    
        /**
         * Suffix
         */
        private String suffix;
    
        @Override
        public Status process() {
            // Status reported back to the sink runner
            Status status;
            // The Channel this sink is bound to
            Channel channel = getChannel();
            // Get a transaction from the channel
            Transaction transaction = channel.getTransaction();
            // The event to process
            Event event;
            // Begin the transaction
            transaction.begin();
            // Poll the channel until an event arrives
            do {
                event = channel.take();
            } while (event == null);
            try {
                // Process (print) the event
                LOG.info(prefix + new String(event.getBody()) + suffix);
                // Commit the transaction
                transaction.commit();
                status = Status.READY;
            } catch (Exception e) {
                // On any failure, roll the event back into the channel
                transaction.rollback();
                status = Status.BACKOFF;
            } finally {
                // Close the transaction
                transaction.close();
            }
            return status;
        }
    
        @Override
        public void configure(Context context) {
            // Read from the job configuration, with a default
            prefix = context.getString("prefix", "hello:");
            // Read from the job configuration, no default
            suffix = context.getString("suffix");
        }
    }
    
  3. Package the jar and put it in the $FLUME_HOME/lib directory.

3.7.2 Configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = com.guoli.flume.sink.MySink
a1.sinks.k1.prefix = hello:
a1.sinks.k1.suffix = :world

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3.7.3 Test

flume-ng agent -n a1 -c $FLUME_HOME/conf/ -f $FLUME_HOME/job/flume-my-sink.conf \
-Dflume.root.logger=INFO,console
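
The source is a netcat listener, so open a second terminal and send a few lines; each should be logged with the configured prefix and suffix (input hi should print as hello:hi:world):

nc localhost 44444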
