Big Data Notes: Flume (Part 2)

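Property	Description
type	Must be hdfs
hdfs.path	Path on HDFS where the data is stored
hdfs.rollInterval	Interval, in seconds, between file rolls
hdfs.fileType	File storage type: DataStream (plain text), SequenceFile (sequence file), or CompressedStream (compressed)

③ Example

Create the configuration file hdfssink.conf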

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

# Configure the HDFS Sink
# The type must be hdfs
a1.sinks.k1.type = hdfs
# Path on HDFS where the data is stored
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flumedata
# File storage type
a1.sinks.k1.hdfs.fileType = DataStream
# File roll interval, in seconds
a1.sinks.k1.hdfs.rollInterval = 3600

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

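Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f hdfssink.conf -Dflume.root.logger=INFO,console

Open a new window and send some data to the source on hadoop01 port 8090.

A file is then created under the configured HDFS path.

View the file to confirm that the data was written.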

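2. Logger Sink

① Overview

Logger Sink prints the data collected by Flume to the console.

To keep large events from flooding the screen, at most 16 bytes of the body are printed; anything beyond that is not shown.

Logger Sink does not handle Chinese text well when printing.

② Configuration properties

Property	Description
type	Must be logger
maxBytesToLog	Number of body bytes to print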
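The truncation described above can be pictured with a small JDK-only sketch. It is not Flume's implementation; the class name BodyPreview and the method preview are hypothetical and only illustrate the idea of showing at most maxBytesToLog bytes of the body.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Flume: keeps only the first maxBytes bytes of a body.
final class BodyPreview {
    // Returns the leading maxBytes bytes of the body decoded as UTF-8.
    static String preview(byte[] body, int maxBytes) {
        int n = Math.min(body.length, maxBytes);
        return new String(Arrays.copyOf(body, n), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] body = "hello flume, this is a long line".getBytes(StandardCharsets.UTF_8);
        // Only the first 16 bytes are kept
        System.out.println(preview(body, 16));
    }
}
```

Because the cut is made on bytes rather than characters, a multi-byte character at the boundary can be split, which is one reason a larger maxBytesToLog is worth considering for non-ASCII data.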

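3. File Roll Sink

① Overview

File Roll Sink writes data to the local disk.

Like HDFS Sink, File Roll Sink rolls files at a fixed interval when writing to disk. The interval is likewise 30 s by default, so it also tends to leave a large number of small files on disk.

② Configuration properties

Property	Description
type	Must be file_roll
sink.directory	Directory where the data is stored
sink.rollInterval	Interval, in seconds, between file rolls

③ Example

Create the configuration file filerollsink.conf with the following content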

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

# Configure the File Roll Sink
# The type must be file_roll
a1.sinks.k1.type = file_roll
# Directory on disk where the data is stored
a1.sinks.k1.sink.directory = /home/flumedata
# File roll interval, in seconds
a1.sinks.k1.sink.rollInterval = 3600

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f filerollsink.conf -Dflume.root.logger=INFO,console

In another window, run

nc hadoop01 8090

Then type some data, such as hello

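4. Null Sink

① Overview

Null Sink discards all the data it receives.

② Configuration properties

Property	Description
type	Must be null

③ Example

Create the configuration file nullsink.conf with the following content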

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

# Configure the Null Sink
# The type must be null
a1.sinks.k1.type = null

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f nullsink.conf -Dflume.root.logger=INFO,console

In another window, run

nc hadoop01 8090

Then type some data, such as hello

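5. AVRO Sink

① Overview

AVRO Sink serializes data with AVRO and writes it to a specified port on a specified node.

Combined with AVRO Source, AVRO Sink makes multi-hop, fan-in, and fan-out flows possible.

② Configuration properties

Property	Description
type	Must be avro
hostname	Hostname or IP address of the destination host
port	Receiving port on the destination host

③ Multi-hop flow

i. First node: hadoop01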

vim duoji.conf

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

# Configure the AVRO Sink for the multi-hop flow
# The type must be avro
a1.sinks.k1.type = avro
# Hostname or IP address of the next hop
a1.sinks.k1.hostname = hadoop02
# Port of the next hop
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

ii. Second node: hadoop02

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

# Configure the AVRO Sink for the multi-hop flow
# The type must be avro
a1.sinks.k1.type = avro
# Hostname or IP address of the next hop
a1.sinks.k1.hostname = hadoop03
# Port of the next hop
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

iii. Third node: hadoop03

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

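iv. Start Flume. Always start the receiving side first: here, start hadoop03 first, then hadoop02, and finally hadoop01

../bin/flume-ng agent -n a1 -c ../conf -f duoji.conf -Dflume.root.logger=INFO,console

In a second terminal window on hadoop01, send some test data to port 8090

hadoop03 then receives the data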

④ Fan-in flow

i. First and second nodes

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

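# Configure the AVRO Sink for the fan-in flow
# The type must be avro
a1.sinks.k1.type = avro
# Hostname or IP address of the aggregating node
a1.sinks.k1.hostname = hadoop03
# Port of the aggregating node
a1.sinks.k1.port = 8090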

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

ii. Third node

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

iii. Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f shanru.conf -Dflume.root.logger=INFO,console

The logger output shows that Chinese text is not displayed properly

⑤ Fan-out flow

i. First node

a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.channels.c2.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 8090

a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

ii. Second and third nodes

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

iii. Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f shanchu.conf -Dflume.root.logger=INFO,console

Both hadoop02 and hadoop03 receive the data

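II. Custom Sink

1. Overview

Define a class that implements the Sink interface. Because the sink needs to read configuration properties, it also has to implement the Configurable interface.

Unlike a custom Source, a custom Sink has to deal with transactions.

2. Transactions

① After a Source collects data, it uses the doPut operation to place the data in the PutList queue (essentially a blocking queue)

② The PutList then tries to push the data into the Channel. If the data reaches the Channel successfully, doCommit is executed; otherwise doRollback is executed

③ Once the Channel holds data, it uses the doTake operation to move the data into the TakeList

④ The TakeList hands the data to the Sink. If the Sink writes it out successfully, doCommit is executed; otherwise doRollback is executed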
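The take side of steps ③ and ④ can be pictured with a toy model. This is a minimal sketch using only the JDK, not Flume's real Channel or Transaction classes; ToyChannel and its method names are hypothetical. The point it shows is that events taken inside a transaction are only held aside until commit, and are returned to the channel in their original order on rollback.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: one queue plays the Channel, one list plays the TakeList.
final class ToyChannel {
    private final Deque<String> queue = new ArrayDeque<>();
    private final List<String> takeList = new ArrayList<>();

    void put(String event) {
        queue.addLast(event);
    }

    // doTake: move one event from the channel into the pending take list.
    String take() {
        String e = queue.pollFirst();
        if (e != null) {
            takeList.add(e);
        }
        return e;
    }

    // doCommit: the sink wrote everything out, so the pending events are dropped.
    void commit() {
        takeList.clear();
    }

    // doRollback: the sink failed, so pending events return to the head of the queue in order.
    void rollback() {
        for (int i = takeList.size() - 1; i >= 0; i--) {
            queue.addFirst(takeList.get(i));
        }
        takeList.clear();
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        ToyChannel c = new ToyChannel();
        c.put("a");
        c.put("b");
        c.take();
        c.take();
        c.rollback();                 // both events are back in the channel
        System.out.println(c.size());
        c.take();
        c.commit();                   // "a" is gone for good
        System.out.println(c.size());
    }
}
```

A custom Sink follows the same shape: take inside a transaction, commit once the write succeeded, roll back on any failure.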

3. Steps for writing a custom Sink

① Create a Maven project and add the required POM dependencies

<dependencies>
    <!-- Flume core package -->
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
    <!-- Flume SDK -->
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-sdk</artifactId>
        <version>1.9.0</version>
    </dependency>
    <!-- Flume configuration package -->
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-configuration</artifactId>
        <version>1.9.0</version>
    </dependency>
</dependencies>

② Define a class that extends AbstractSink and implements the Sink and Configurable interfaces, overriding the configure, start, process, and stop methods

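package org.example.flume.sink;

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

import java.io.FileNotFoundException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Mimics File Roll Sink: writes events to the local disk
public class AuthSink extends AbstractSink implements Sink, Configurable {
    // Events taken per transaction; kept within the default transactionCapacity of 100
    private static final int BATCH_SIZE = 100;

    private String path;
    private PrintStream ps;

    // Read the user-specified properties
    @Override
    public void configure(Context context) {
        // Get the storage path
        path = context.getString("path");
        // Check that the user actually set this property
        if (path == null)
            throw new IllegalArgumentException("The path property is required");
    }

    // Start the Sink
    @Override
    public synchronized void start() {
        try {
            // Open the stream used to write data to disk
            ps = new PrintStream(path + "/" + System.currentTimeMillis());
        } catch (FileNotFoundException e) {
            // Fail fast: without the stream, process() could never write anything
            throw new IllegalStateException("Cannot create an output file under " + path, e);
        }
        super.start();
    }

    // The processing logic goes in this method
    @Override
    public Status process() throws EventDeliveryException {
        // Get the Channel bound to this Sink
        Channel c = this.getChannel();
        // Get a transaction
        Transaction t = c.getTransaction();
        // Begin the transaction
        t.begin();
        try {
            Event e;
            int count = 0;
            // Take events until the batch is full or the Channel is empty
            while (count < BATCH_SIZE && (e = c.take()) != null) {
                // Write out the headers
                Map<String, String> headers = e.getHeaders();
                for (Map.Entry<String, String> h : headers.entrySet()) {
                    ps.println("\t" + h.getKey() + ":" + h.getValue());
                }
                // Write out the body
                byte[] body = e.getBody();
                ps.println("body");
                ps.println("\t" + new String(body, StandardCharsets.UTF_8));
                count++;
            }
            // Push buffered output to disk before confirming the batch
            ps.flush();
            // The loop finished normally, so the data was written: commit the transaction
            t.commit();
            // Back off when the Channel was empty so the runner does not spin
            return count > 0 ? Status.READY : Status.BACKOFF;
        } catch (Exception ex) {
            // Something failed while writing: roll back the transaction
            t.rollback();
            return Status.BACKOFF;
        } finally {
            // The transaction must be closed whether it succeeded or not
            t.close();
        }
    }

    @Override
    public synchronized void stop() {
        if (ps != null)
            ps.close();
        super.stop();
    }
}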

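③ When the class is finished, package it as a jar and copy it into the lib directory of the Flume installation

④ Write the configuration file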

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

# Configure the custom Sink
# The type must be the fully qualified class name
a1.sinks.k1.type = org.example.flume.sink.AuthSink
# Directory where the files are stored
a1.sinks.k1.path = /home/flumedata

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

⑤ Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f authsink.conf -Dflume.root.logger=INFO,console

Open another window, send some test data, and check the result

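III. Channel

1. Memory Channel

① Overview

Memory Channel stores data temporarily in an in-memory queue.

If not specified, the queue capacity defaults to 100, meaning at most 100 events can be held at the same time. Once the queue is full, later events are blocked. In practice this value is usually raised to between 100,000 and 300,000, and 500,000 can be considered for larger data volumes.

A Channel can receive data from the Source in batches and can also send data to the Sink in batches. By default each batch holds 100 events. In practice this value is usually raised to between 1,000 and 3,000; with a Channel capacity of 500,000, the batch size is typically set to 5,000.

Because Memory Channel keeps data in memory, it is not reliable, but reads and writes are fast. It therefore suits scenarios that need speed and can tolerate data loss.

② Configuration properties

Property	Description
type	Must be memory
capacity	Capacity of the queue
transactionCapacity	Batch size (events per transaction)

③ Example

Create a configuration file with the following content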

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

# Configure the Memory Channel
# The type must be memory
a1.channels.c1.type = memory
# Capacity of the Channel
a1.channels.c1.capacity = 100000
# Batch size of the Channel
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f memorychannel.conf -Dflume.root.logger=INFO,console

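2. File Channel

① Overview

File Channel stores data temporarily on the local disk.

File Channel does not lose data, but reads and writes are slow. It suits scenarios that need reliability more than speed.

If not specified, File Channel stores its data under ~/.flume/file-channel/data by default.

To keep File Channel from using too much disk space, at most 1,000,000 events may be stored on disk by default.

② Configuration properties

Property	Description
type	Must be file
dataDirs	Location on disk used for temporary storage

③ Example

Create a configuration file with the following content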

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

# Configure the File Channel
# The type must be file
a1.channels.c1.type = file
# Location on disk where the data is stored
a1.channels.c1.dataDirs = /home/flumedata

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f filechannel.conf -Dflume.root.logger=INFO,console

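3. JDBC Channel

JDBC Channel stores data temporarily in a database. In theory its read and write speed is slightly higher than that of File Channel but lower than that of Memory Channel.

So far JDBC Channel only supports the Derby database. Because of Derby's characteristics (tiny, so it stores little data; single connection, so only one user can operate on it at a time), Derby is rarely used in practice, and JDBC Channel is therefore almost never used in production.

4. Spillable Memory Channel

Spillable Memory Channel first tries to store data in memory. Once the in-memory queue is full, the Channel does not block; instead it spills the data to disk for temporary storage.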

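So far this Channel is still experimental and is not recommended for use. It has been in the experimental stage since around 2014 and has still not been promoted to a stable feature.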

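IV. Selector

1. Overview

A Selector is a subcomponent of the Source and decides which Channel each event is dispatched to.

A Selector offers two modes:

replicating: copies each event and sends a copy to every Channel

multiplexing: routing. The value of a specified field in the headers decides which Channel the event is sent to

If not specified, replicating mode is used by default.

2. Configuration properties

Property	Description
selector.type	Either replicating or multiplexing
selector.header	For multiplexing, the header field to inspect
selector.mapping.*	For multiplexing, the value of the inspected field that maps to a Channel
selector.default	For multiplexing, the Channel that receives events matching none of the mapped values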
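The multiplexing rule amounts to a header lookup with a fallback. The following JDK-only sketch is not Flume code; ToyMultiplexer and route are hypothetical names, and the header name kind with the values music and video are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a multiplexing selector: header value -> channel name, with a default.
final class ToyMultiplexer {
    private final String header;
    private final Map<String, String> mapping;
    private final String defaultChannel;

    ToyMultiplexer(String header, Map<String, String> mapping, String defaultChannel) {
        this.header = header;
        this.mapping = new HashMap<>(mapping);
        this.defaultChannel = defaultChannel;
    }

    // Returns the channel for an event with the given headers.
    String route(Map<String, String> headers) {
        String value = headers.get(header);
        // A missing header or an unmapped value both fall through to the default channel
        return mapping.getOrDefault(value, defaultChannel);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("music", "c1");
        mapping.put("video", "c2");
        ToyMultiplexer selector = new ToyMultiplexer("kind", mapping, "c2");

        Map<String, String> headers = new HashMap<>();
        headers.put("kind", "music");
        System.out.println(selector.route(headers));
    }
}
```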

3. Example

Create the configuration file selector.conf

a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.s1.type = http
a1.sources.s1.port = 8090
# Selector type
a1.sources.s1.selector.type = multiplexing
# Header field to inspect
a1.sources.s1.selector.header = kind
# Field values and the Channels they map to
a1.sources.s1.selector.mapping.music = c1
a1.sources.s1.selector.mapping.video = c2
# Default Channel
a1.sources.s1.selector.default = c2

a1.channels.c1.type = memory

a1.channels.c2.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 8090

a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

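Start Flume

On hadoop02 and hadoop03, start the agents with the fan-out configuration file:

cd /home/software/apache-flume-1.9.0-bin/data/
../bin/flume-ng agent -n a1 -c ../conf -f shanchu.conf -Dflume.root.logger=INFO,console

On hadoop01, start the agent with the selector configuration file:

../bin/flume-ng agent -n a1 -c ../conf -f selector.conf -Dflume.root.logger=INFO,console

Open another window on hadoop01 and run:

curl -X POST -d '[{"headers":{"kind":"video"},"body":"video server"}]' http://hadoop01:8090

hadoop03 receives this event, because kind=video maps to c2.

Then run:

curl -X POST -d '[{"headers":{"kind":"music"},"body":"video server"}]' http://hadoop01:8090

This time hadoop02 receives the event, because kind=music maps to c1.

This is how a Selector sends each event to a specific Channel based on a header value.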

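V. Processor

1. Failover Sink Processor

① Overview

Failover Sink Processor binds several Sinks into one group, and each Sink in the group must be given a priority.

As long as a higher-priority Sink is alive, data is not sent to a lower-priority Sink.

② Configuration properties

Property	Description
sinks	The Sinks to bind into the group
processor.type	Must be failover
processor.priority.<sinkName>	Priority of the Sink; a larger value means a higher priority
processor.maxpenalty	Maximum back-off period for a failed Sink, in milliseconds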
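The selection rule behind failover can be sketched in a few lines. This is a simplified JDK-only model, not Flume's FailoverSinkProcessor; ToyFailover and pick are hypothetical names, and the back-off timing controlled by maxpenalty is left out on purpose.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model: pick the live sink with the highest priority.
final class ToyFailover {
    // Returns the name of the highest-priority sink that is not marked failed, or null if none is left.
    static String pick(Map<String, Integer> priorities, Set<String> failed) {
        String best = null;
        int bestPriority = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : priorities.entrySet()) {
            if (failed.contains(e.getKey())) {
                continue; // skip sinks that are currently down
            }
            if (best == null || e.getValue() > bestPriority) {
                best = e.getKey();
                bestPriority = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new HashMap<>();
        priorities.put("k1", 7);
        priorities.put("k2", 2);
        System.out.println(pick(priorities, Collections.<String>emptySet()));
        System.out.println(pick(priorities, Collections.singleton("k1")));
    }
}
```

When the failed sink recovers it is simply removed from the failed set, so the next call prefers it again.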

③ Create the configuration file failover.conf

a1.sources = s1
a1.channels = c1
a1.sinks = k1 k2

# Name the sink group
a1.sinkgroups = g1
# Bind the Sinks to the sink group
a1.sinkgroups.g1.sinks = k1 k2
# Processor type of the sink group
a1.sinkgroups.g1.processor.type = failover
# Priority of each Sink (larger value = higher priority)
a1.sinkgroups.g1.processor.priority.k1 = 7
a1.sinkgroups.g1.processor.priority.k2 = 2
# Maximum back-off period for a failed Sink, in milliseconds
a1.sinkgroups.g1.processor.maxpenalty = 10000

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 8090

# Both Sinks read from the same Channel, so k2 can take over the events k1 cannot deliver
# and no Channel fills up while its Sink is idle
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f failover.conf -Dflume.root.logger=INFO,console

Data is sent to hadoop02 first. If hadoop02 goes down, data is sent to hadoop03; once hadoop02 is started again, data goes back to hadoop02.

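2. Other Processors

① Default Processor

If no Processor is specified, Flume uses the Default Processor.

With the Default Processor, each Sink belongs to its own separate sink group, so there are as many sink groups as there are Sinks.

The Default Processor needs no configuration.

② Load Balancing Processor

Load Balancing Processor distributes load across Sinks and is worth considering when the data volume is large.

Flume offers two load balancing modes: round_robin and random.

The load balancing Processor that ships with Flume does not work very well in practice.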
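The round_robin mode simply cycles through the Sinks in order. A minimal JDK-only sketch of that idea follows; ToyRoundRobin and nextSink are hypothetical names, and the sketch ignores failed Sinks and back-off.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of round-robin selection over a fixed list of sink names.
final class ToyRoundRobin {
    private final List<String> sinks;
    private int next = 0;

    ToyRoundRobin(List<String> sinks) {
        if (sinks.isEmpty()) {
            throw new IllegalArgumentException("at least one sink is required");
        }
        this.sinks = new ArrayList<>(sinks);
    }

    // Returns the next sink in the cycle and advances the position.
    String nextSink() {
        String s = sinks.get(next);
        next = (next + 1) % sinks.size();
        return s;
    }

    public static void main(String[] args) {
        ToyRoundRobin rr = new ToyRoundRobin(Arrays.asList("k1", "k2"));
        for (int i = 0; i < 4; i++) {
            System.out.println(rr.nextSink());
        }
    }
}
```

The random mode would replace the counter with a random index, trading strict fairness for statelessness.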

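VI. Interceptor

1. Timestamp Interceptor

① Overview

Timestamp Interceptor adds a timestamp field to the headers to record when the event was collected.

Combined with HDFS Sink, Timestamp Interceptor makes it possible to store data by day.

② Configuration properties

Property	Description
type	Must be timestamp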
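Storing data by day works because a millisecond timestamp can be turned into a date-based directory name. The JDK-only sketch below illustrates that conversion; DayPath and dayDir are hypothetical names, and the explicit time zone argument is an assumption made so the result is deterministic.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Toy helper: builds a directory such as <base>/date=2022-03-15 from an epoch-millisecond timestamp.
final class DayPath {
    static String dayDir(String base, long timestampMillis, ZoneId zone) {
        String day = DateTimeFormatter.ofPattern("yyyy-MM-dd")
                .withZone(zone)
                .format(Instant.ofEpochMilli(timestampMillis));
        return base + "/date=" + day;
    }

    public static void main(String[] args) {
        // Events collected on the same day land in the same directory
        System.out.println(dayDir("/flumedata", System.currentTimeMillis(), ZoneOffset.UTC));
    }
}
```

Two events whose timestamps fall on different calendar days in the chosen zone produce different directories, which is what splits the data by day.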

③ Example

Create the configuration file

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
# Name the Interceptor
a1.sources.s1.interceptors = i1
# Use the Timestamp Interceptor
a1.sources.s1.interceptors.i1.type = timestamp

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f in.conf -Dflume.root.logger=INFO,console

To store data by day:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = timestamp

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flumedata/date=%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 3600

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f hdfsin.conf -Dflume.root.logger=INFO,console

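2. Host Interceptor

① Overview

Host Interceptor adds a host field to the headers.

Host Interceptor can be used to mark which host the data came from.

② Configuration properties

Property	Description
type	Must be host

③ Example

Create a configuration file with the following content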

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
# Name the Interceptors
a1.sources.s1.interceptors = i1 i2
# Use the Timestamp Interceptor
a1.sources.s1.interceptors.i1.type = timestamp
# Use the Host Interceptor
a1.sources.s1.interceptors.i2.type = host

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f in.conf -Dflume.root.logger=INFO,console

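3. Static Interceptor

① Overview

Static Interceptor adds a specified field to the headers.

This Interceptor can be used to mark the type of the data.

② Configuration properties

Property	Description
type	Must be static
key	Name of the field to add to the headers
value	Value of the field to add to the headers

③ Example

Create a configuration file with the following content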

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

# Name the Interceptors
a1.sources.s1.interceptors = i1 i2 i3
# Use the Timestamp Interceptor
a1.sources.s1.interceptors.i1.type = timestamp
# Use the Host Interceptor
a1.sources.s1.interceptors.i2.type = host
# Use the Static Interceptor
a1.sources.s1.interceptors.i3.type = static
a1.sources.s1.interceptors.i3.key = kind
a1.sources.s1.interceptors.i3.value = log

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f in.conf -Dflume.root.logger=INFO,console

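4. UUID Interceptor

① Overview

UUID Interceptor adds an id field to the headers.

It can be used to identify each event uniquely.

② Configuration properties

Property	Description
type	Must be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

③ Example

Create a configuration file with the following content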

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
# Name the Interceptors
a1.sources.s1.interceptors = i1 i2 i3 i4
# Use the Timestamp Interceptor
a1.sources.s1.interceptors.i1.type = timestamp
# Use the Host Interceptor
a1.sources.s1.interceptors.i2.type = host
# Use the Static Interceptor
a1.sources.s1.interceptors.i3.type = static
a1.sources.s1.interceptors.i3.key = kind
a1.sources.s1.interceptors.i3.value = log
# Use the UUID Interceptor
a1.sources.s1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f in.conf -Dflume.root.logger=INFO,console

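5. Search And Replace Interceptor

① Overview

Search And Replace Interceptor requires a regular expression. Data that matches the regular expression is replaced with the specified string.

The replacement applies to the data in the body, not to the data in the headers.

② Configuration properties

Property	Description
type	Must be search_replace
searchPattern	Regular expression to match
replaceString	String to substitute for each match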
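The effect of searchPattern and replaceString on a body can be reproduced with java.util.regex. The sketch below is a JDK-only illustration, not the interceptor's source; ToySearchReplace and apply are hypothetical names.

```java
import java.util.regex.Pattern;

// Toy model: replace every match of searchPattern in the body with replaceString.
final class ToySearchReplace {
    static String apply(String body, String searchPattern, String replaceString) {
        return Pattern.compile(searchPattern).matcher(body).replaceAll(replaceString);
    }

    public static void main(String[] args) {
        // Every digit in the body is masked; the headers would be left untouched
        System.out.println(apply("test1213312", "[0-9]", "*"));
    }
}
```

Note that in Java regex replacements the characters $ and \ have special meaning, so a replaceString containing them needs escaping.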

③ Example

Create a configuration file with the following content

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = http
a1.sources.s1.port = 8090
# Name the Interceptor
a1.sources.s1.interceptors = i1
# Interceptor type
a1.sources.s1.interceptors.i1.type = search_replace
a1.sources.s1.interceptors.i1.searchPattern = [0-9]
a1.sources.s1.interceptors.i1.replaceString = *

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f searchin.conf -Dflume.root.logger=INFO,console

curl -X POST -d '[{"headers":{"data":"2022-3-15"},"body":"test1213312"}]' http://hadoop01:8090

The headers are left unchanged, while the digits in the body are replaced.

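6. Regex Filtering Interceptor

① Overview

Regex Filtering Interceptor requires a regular expression.

If the excludeEvents property is not specified, it defaults to false.

If excludeEvents is not configured or is set to false, only events that match the regular expression are kept and all other events are filtered out. If excludeEvents is set to true, events that match the regular expression are filtered out and all other events are kept.

② Configuration properties

Property	Description
type	Must be regex_filter
regex	Regular expression to match
excludeEvents	true or false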
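The two settings of excludeEvents can be compared side by side with a small JDK-only sketch. It is not the interceptor's source; ToyRegexFilter and filter are hypothetical names, and the use of find-style matching (the pattern may match anywhere in the body) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Toy model of regex filtering over event bodies.
final class ToyRegexFilter {
    // excludeEvents = false keeps matching bodies; excludeEvents = true keeps non-matching bodies.
    static List<String> filter(List<String> bodies, String regex, boolean excludeEvents) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String body : bodies) {
            boolean matches = p.matcher(body).find();
            if (matches != excludeEvents) {
                kept.add(body);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> bodies = Arrays.asList("hello", "abc123", "42");
        System.out.println(filter(bodies, ".*[0-9].*", false));
        System.out.println(filter(bodies, ".*[0-9].*", true));
    }
}
```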

③ Example

Create a configuration file with the following content

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_filter
# Match every string that contains a digit
a1.sources.s1.interceptors.i1.regex = .*[0-9].*

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f regexin.conf -Dflume.root.logger=INFO,console

nc hadoop01 8090

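VII. Custom Interceptor

1. Overview

Flume also allows custom interceptors. Unlike other components, however, a custom Interceptor must additionally implement an inner interface of Interceptor.

2. Steps:

Create a Maven project and add the required dependencies

Define a class that implements the Interceptor interface and overrides its initialize, intercept, and close methods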

package org.example.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Mimics the Timestamp Interceptor
public class AuthInterceptor implements Interceptor {

    // Initialization; usually nothing to do here
    @Override
    public void initialize() {

    }

    // Intercept and process a single Event
    @Override
    public Event intercept(Event event) {
        // The timestamp lives in the headers, so fetch them first
        Map<String, String> headers = event.getHeaders();
        // Check whether the headers already carry a timestamp
        if (headers.containsKey("time") || headers.containsKey("timestamp"))
            // If one was already set, leave the event unchanged
            return event;
        // Otherwise add a timestamp
        headers.put("time", System.currentTimeMillis() + "");
        // Put the headers back into the Event object
        event.setHeaders(headers);
        return event;
    }

    // Batch processing
    @Override
    public List<Event> intercept(List<Event> list) {
        // Holds the processed Events
        List<Event> es = new ArrayList<>();
        for (Event event : list) {
            // Process the events one by one and collect the results
            es.add(intercept(event));
        }
        return es;
    }

    // Release the interceptor's resources
    @Override
    public void close() {}

    // Flume uses this nested class to build the interceptor instance
    public static class Builder implements Interceptor.Builder {
        // Create the interceptor to use
        @Override
        public Interceptor build() {
            return new AuthInterceptor();
        }

        // Read configuration properties
        @Override
        public void configure(Context context) {}
    }
}

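Define a static nested class that implements the Interceptor.Builder inner interface

Package the project as a jar and put it in the lib directory of the Flume installation

Create a configuration file with the following content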

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
# Specify the Interceptor
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = org.example.flume.interceptor.AuthInterceptor$Builder

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume

../bin/flume-ng agent -n a1 -c ../conf -f authin.conf -Dflume.root.logger=INFO,console

nc hadoop01 8090

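VIII. Cluster Monitoring with Ganglia

1. Overview

Ganglia is an open-source cluster monitoring project started at UC Berkeley, designed to measure the performance of thousands of nodes.

The core of Ganglia consists of three modules:

        gmond (Ganglia Monitoring Daemon) is a lightweight service that must be installed on every host whose metrics are to be collected. It gathers system metrics such as CPU, memory, disk, network, and the number of active processes.

        gmetad (Ganglia Meta Daemon) aggregates all the information and stores it on disk in RRD format.

        gweb (Ganglia Web) is the visualization tool provided by Ganglia, written in PHP. It offers web pages that display the metrics collected while the cluster is running in the form of charts.

2. Installation

① Install the httpd and php services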

yum -y install httpd php

② Install the other dependencies

yum -y install rrdtool perl-rrdtool rrdtool-devel

yum -y install apr-devel

③ Install EPEL

Download epel-release-7-13.noarch.rpm (it can be found online), or upload it to the host

Install it: rpm -ivh epel-release-7-13.noarch.rpm

④ Install Ganglia

yum -y install ganglia-gmetad

yum -y install ganglia-gmond

yum -y install ganglia-web

⑤ Edit the configuration files

vim /etc/httpd/conf.d/ganglia.conf

Change the content to:

<Location /ganglia>
  # Order deny,allow
  # Deny from all
  # Allow from 127.0.0.1
  # Allow from ::1
  # Allow from .example.com
  Require all granted
</Location>

vim /etc/ganglia/gmetad.conf

Modify the value of the data_source property

vim /etc/ganglia/gmond.conf

Modify the properties in cluster:

cluster {
  name = "hadoop01"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

Modify the properties in udp_send_channel

udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71
  host=192.168.186.128
  port = 8649
  ttl = 1
}

Modify the properties in udp_recv_channel

udp_recv_channel {
  # mcast_join = 239.2.11.71
  port = 8649
  bind = 192.168.186.128
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

vim /etc/selinux/config

Change the value of SELINUX to disabled

⑥ Reboot

reboot

⑦ Start Ganglia

systemctl start httpd

systemctl start gmetad

systemctl start gmond

⑧ Open the web page at http://hadoop01/ganglia

3. Monitoring

Modify the Flume configuration file

cd /home/software/apache-flume-1.9.0-bin/conf

cp flume-env.sh.template flume-env.sh

vim flume-env.sh

Append the following settings to the end of the file:

export JAVA_HOME=/home/software/jdk1.8.0_131
export JAVA_OPTS="-Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=192.168.186.128:8649 -Xms100m -Xmx200m"

Save and exit, then reload the file so the settings take effect:

source flume-env.sh

Start Flume

cd ../data

../bin/flume-ng agent -n a1 -c ../conf -f basic.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=192.168.186.128:8649

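In a new window, run nc hadoop01 8090

The monitoring charts can then be viewed in the browser.

Metrics:

Metric	Description
ChannelCapacity	Capacity of the Channel
ChannelFillPercentage	Percentage of the Channel capacity in use
ChannelSize	Number of events currently in the Channel
EventPutAttemptCount	Number of events the Source attempted to put into the Channel
EventPutSuccessCount	Number of events the Source successfully put into the Channel
EventTakeAttemptCount	Number of attempts to take events from the Channel for the Sink
EventTakeSuccessCount	Number of events successfully taken from the Channel by the Sink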
StartTime	Time at which the Channel started
StopTime	Time at which the Channel stopped
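The relationship between three of these metrics can be written down directly. The following JDK-only sketch assumes that the fill percentage is the current size divided by the capacity, times 100; ChannelFill and fillPercentage are hypothetical names, not Flume API.

```java
// Toy helper relating ChannelSize, ChannelCapacity, and ChannelFillPercentage.
final class ChannelFill {
    static double fillPercentage(long channelSize, long channelCapacity) {
        if (channelCapacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        return channelSize * 100.0 / channelCapacity;
    }

    public static void main(String[] args) {
        // 25,000 events in a Channel with capacity 100,000
        System.out.println(fillPercentage(25_000, 100_000));
    }
}
```

A fill percentage that stays close to 100 suggests the Sink cannot keep up with the Source, which is the usual cue to raise the capacity or add Sinks.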