Big Data 04 -- Custom Interceptor, Source, and Sink in the Flume Framework; Real-Time Monitoring with Ganglia

Latest recommended article published on 2023-11-18 11:04:59

粥ou

Category: Big Data Learning. Tags: big data, flume

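The batchSize parameter determines how many events a Source delivers to the Channel in one batch. Raising it moderately improves the throughput of moving events from the Source into the Channel.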
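
To make the effect of batchSize concrete, here is a minimal sketch in plain Java. It does not use the Flume API; the class BatchSketch and the method toBatches are hypothetical names invented for this illustration. It only shows that a larger batch size means fewer batches, and therefore fewer channel transactions, for the same number of events.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flume: shows how batchSize groups events.
final class BatchSketch {
    // Splits events into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> toBatches(List<T> events, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < events.size(); start += batchSize) {
            int end = Math.min(start + batchSize, events.size());
            batches.add(new ArrayList<>(events.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> events = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            events.add(i);
        }
        // In this model each batch stands for one channel transaction.
        System.out.println(toBatches(events, 100).size());  // 10
        System.out.println(toBatches(events, 1).size());    // 1000
    }
}
```

With 1000 events, a batch size of 100 needs 10 batches while a batch size of 1 needs 1000, which is the overhead the paragraph above refers to.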

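Setting the Channel type to memory gives the best performance, but events buffered in memory may be lost if the Flume process dies unexpectedly. Setting type to file gives better fault tolerance, at the cost of lower performance than the memory channel.

With a file Channel, pointing dataDirs at several directories on different disks can improve performance.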

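The capacity parameter sets the maximum number of events the Channel can hold. The transactionCapacity parameter sets the maximum number of events a Source may write to the Channel, or a Sink may read from it, in a single transaction. transactionCapacity must be at least as large as the batchSize of the Source and of the Sink, and no larger than capacity.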
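
The sizing rule can be written down as a small check. This is only a sketch in plain Java: ChannelSizing and check are hypothetical names, not Flume classes, and the rule encoded here (batchSize no larger than transactionCapacity, which in turn is no larger than capacity) is stated as an assumption about how the three values relate.

```java
// Hypothetical validator, not part of Flume: checks the sizing rule
// batchSize <= transactionCapacity <= capacity.
final class ChannelSizing {
    // Returns null when the settings are consistent, otherwise a description of the problem.
    static String check(int batchSize, int transactionCapacity, int capacity) {
        if (batchSize <= 0 || transactionCapacity <= 0 || capacity <= 0) {
            return "all sizes must be positive";
        }
        if (batchSize > transactionCapacity) {
            return "batchSize exceeds transactionCapacity";
        }
        if (transactionCapacity > capacity) {
            return "transactionCapacity exceeds capacity";
        }
        return null;
    }

    public static void main(String[] args) {
        // Values used by the configurations in this article: capacity 1000, transactionCapacity 100.
        System.out.println(check(100, 100, 1000));
        // A batch of 200 cannot fit into a transaction limited to 100 events.
        System.out.println(check(200, 100, 1000));
    }
}
```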

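Adding Sinks increases the rate at which events are consumed. More is not always better, though: use only as many as the load requires, because surplus Sinks occupy system resources for no benefit.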

The batchSize parameter determines how many events a Sink reads from the Channel in one batch. Raising it moderately improves the throughput of draining events from the Channel.

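By design, Flume rarely loses data, because it has a complete internal transaction mechanism: the Source-to-Channel hop is transactional and so is the Channel-to-Sink hop, so neither hop drops events on its own. Data can still be lost in two cases: the Channel is a memory channel and the agent goes down, or the Channel is full, the Source can no longer write to it, and the data that could not be written is lost.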

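Flume does not normally lose data, but it can produce duplicates. For example, if a Sink has already sent the data successfully but never receives the response, it sends the data again, and the receiver may then see the same data twice.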
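
One common way to tolerate such retries is to make the receiving side idempotent. The sketch below is plain Java and is not a Flume feature; DedupReceiver and the idea that every event carries a unique id are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical receiver, not a Flume API: models a retry after a lost response.
final class DedupReceiver {
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> stored = new ArrayList<>();

    // Stores the payload unless an event with the same id was already stored.
    // Returns true if the payload was stored, false if it was a duplicate.
    boolean receive(String eventId, String payload) {
        if (!seenIds.add(eventId)) {
            return false;
        }
        stored.add(payload);
        return true;
    }

    List<String> stored() {
        return stored;
    }

    public static void main(String[] args) {
        DedupReceiver receiver = new DedupReceiver();
        receiver.receive("e1", "first line");
        // The response is lost, so the sender retries the same event.
        receiver.receive("e1", "first line");
        System.out.println(receiver.stored().size());  // 1
    }
}
```

Without the id check the retried event would be stored twice, which is exactly the duplication described above.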

Custom Interceptor

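In real-world development, a single server may produce many types of logs, and different types may need to be sent to different analysis systems. This calls for the Multiplexing structure among Flume's topologies. Multiplexing works by sending each event to a Channel chosen according to the value of a particular key in the event's Header. We therefore need a custom Interceptor that assigns a different value to that Header key for each type of event.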
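
The routing idea can be modeled in a few lines of plain Java. This is a sketch, not Flume's own selector class: MultiplexingSketch is a hypothetical name, and the fallback channel for unmapped values is an assumption of the sketch. The header key type, the values atguigu and other, and the channel names c1 and c2 match the configuration used later in this article.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of multiplexing, not Flume's own selector class.
final class MultiplexingSketch {
    private final String headerKey;
    private final Map<String, String> mapping;
    private final String defaultChannel;

    MultiplexingSketch(String headerKey, Map<String, String> mapping, String defaultChannel) {
        this.headerKey = headerKey;
        this.mapping = new HashMap<>(mapping);
        this.defaultChannel = defaultChannel;
    }

    // Returns the channel mapped to the header value, or the default channel if none matches.
    String route(Map<String, String> headers) {
        String value = headers.get(headerKey);
        return mapping.getOrDefault(value, defaultChannel);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("atguigu", "c1");
        mapping.put("other", "c2");
        MultiplexingSketch selector = new MultiplexingSketch("type", mapping, "c2");

        Map<String, String> headers = new HashMap<>();
        headers.put("type", "atguigu");
        System.out.println(selector.route(headers));  // c1
    }
}
```

The interceptor's only job is to set the header value; the selector then does the routing.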

Package the finished code as a JAR and place it in the lib directory of the Flume installation (/opt/module/flume).

package com.atguigu.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Typeinterceptor implements Interceptor {
    // List that holds the events after the interceptor has processed them
    private List<Event> addHeaderEvents = new ArrayList<>();
    @Override
    public void initialize() {

    }
    // Handles a single event
    @Override
    public Event intercept(Event event) {
        // 1. Get the headers and the body
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());

        // 2. Set a different header value depending on whether the body contains "atguigu"
        if(body.contains("atguigu")){
            headers.put("type","atguigu");
        }else {
            headers.put("type","other");
        }
        // 3. Return the event
        return event;
    }
    // Handles a batch of events
    @Override
    public List<Event> intercept(List<Event> list) {
        // 1. Clear the list
        addHeaderEvents.clear();
        // 2. Iterate over the incoming events
        for (Event event : list) {
            addHeaderEvents.add(intercept(event));
        }
        // 3. Return the processed events
        return addHeaderEvents;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new Typeinterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

Edit the Flume configuration files
On hadoop102, configure Flume1 with one netcat source and two avro sinks (one per channel), together with the corresponding ChannelSelector and interceptor.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
# The type must be the fully qualified class name followed by $Builder
a1.sources.r1.interceptors.i1.type=com.atguigu.interceptor.Typeinterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.atguigu = c1
a1.sources.r1.selector.mapping.other = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

On hadoop103, configure Flume4 with one avro source and one logger sink.

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

On hadoop104, configure Flume3 with one avro source and one logger sink.

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

Custom Source

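A Source component can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy. The official distribution already provides many source types, but sometimes they do not meet the needs of a real project, and then we have to write a custom source.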

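The official documentation describes the interface for custom sources:

https://flume.apache.org/FlumeDeveloperGuide.html#source

According to it, a custom MySource must extend the AbstractSource class and implement the Configurable and PollableSource interfaces.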

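The methods to implement are:

getBackOffSleepIncrement() // backoff step size

getMaxBackOffSleepInterval() // maximum backoff time

configure(Context context) // initializes from the context (reads the settings in the configuration file)

process() // fetches data, wraps it in an event, and writes it to the channel; this method is called in a loop.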
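
To illustrate how the two backoff values could interact, here is a minimal sketch in plain Java. It assumes a linear backoff in which the sleep time grows by the increment on every consecutive BACKOFF result and is capped at the maximum interval; BackoffSketch and sleepMillis are hypothetical names, and the actual Flume runner may differ in detail.

```java
// Hypothetical model of a polling loop's backoff; not the actual Flume runner class.
final class BackoffSketch {
    // Sleep time after the n-th consecutive BACKOFF result:
    // grows by `increment` each time and is capped at `maxInterval`.
    static long sleepMillis(int consecutiveBackoffs, long increment, long maxInterval) {
        return Math.min(consecutiveBackoffs * increment, maxInterval);
    }

    public static void main(String[] args) {
        // Assumed example values: 1000 ms step, 5000 ms cap.
        for (int n = 1; n <= 6; n++) {
            System.out.println(n + " -> " + sleepMillis(n, 1000L, 5000L) + " ms");
        }
    }
}
```

Under this model, returning 0 from both methods means the loop does not sleep at all after a BACKOFF result.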

Use case: reading data from MySQL or from other file systems.

Requirement: use Flume to receive data, add a prefix to each record, and print it to the console. The prefix can be set in the Flume configuration file.

package com.atguigu.source;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;

public class Mysource extends AbstractSource implements Configurable, PollableSource {

    private String prefix; // prefix
    private String subfix; // suffix
    private Long delay;

    @Override
    public void configure(Context context) {

        prefix = context.getString("pre", "pre-"); // defaults to "pre-" when "pre" is not set
        subfix = context.getString("sub"); // no default given: null when "sub" is not set
        delay = context.getLong("delay",2000L);

    }
    @Override
    public Status process() throws EventDeliveryException {
        try {
            for (int i = 0; i < 5; i++) {
                // 1. Create a new event and header map for every record; reusing one
                //    event object would overwrite the body of events that are still
                //    buffered in a memory channel
                Event event = new SimpleEvent(); // SimpleEvent implements the Event interface
                HashMap<String, String> header = new HashMap<>();
                event.setHeaders(header);

                event.setBody((prefix + "atguigu" + i + subfix).getBytes()); // the body must be a byte array

                getChannelProcessor().processEvent(event);

            }
            Thread.sleep(delay);
            return Status.READY;
        } catch (Exception e) {
            e.printStackTrace();
            return Status.BACKOFF;
        }
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }


}

Package the finished code as a JAR and place it in the lib directory of the Flume installation (/opt/module/flume).

Configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = com.atguigu.source.Mysource
a1.sources.r1.delay = 1000
a1.sources.r1.pre = sb-
# Optional; if the key is missing, the suffix is null and the text "null" is appended
a1.sources.r1.sub = hah
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
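
The remark about the sub key deserves a short illustration: in Java, concatenating a null reference with a string produces the text "null". The sketch below is plain Java with hypothetical names (AffixSketch, decorate); it shows the behavior and one possible way to treat a missing suffix as empty, which is not what the Mysource class above does.

```java
// Hypothetical helper, not a Flume API: builds the event text with an optional prefix and suffix.
final class AffixSketch {
    // Treats a missing (null) prefix or suffix as an empty string.
    static String decorate(String prefix, String body, String suffix) {
        String safePrefix = (prefix == null) ? "" : prefix;
        String safeSuffix = (suffix == null) ? "" : suffix;
        return safePrefix + body + safeSuffix;
    }

    public static void main(String[] args) {
        String missing = null;
        // Plain concatenation turns a null reference into the text "null".
        System.out.println("sb-" + "atguigu0" + missing);          // sb-atguigu0null
        System.out.println(decorate("sb-", "atguigu0", missing));  // sb-atguigu0
        System.out.println(decorate("sb-", "atguigu0", "hah"));    // sb-atguigu0hah
    }
}
```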

Custom Sink

A Sink continuously polls the Channel for events and removes them in batches, writing each batch to a storage or indexing system or sending it on to another Flume Agent.

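A Sink is fully transactional. Before removing a batch of data from the Channel, each Sink starts a transaction with the Channel. Once the batch has been written successfully to the storage system or to the next Flume Agent, the Sink commits the transaction through the Channel. After the transaction is committed, the Channel deletes those events from its internal buffer.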
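
The take, commit, and rollback cycle described above can be modeled with an in-memory queue. This is a simplified sketch in plain Java with a single implicit transaction; TxChannelSketch is a hypothetical class and does not reproduce Flume's Channel or Transaction API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory model of take/commit/rollback; not Flume's Channel API.
final class TxChannelSketch {
    private final Deque<String> buffer = new ArrayDeque<>();
    private final List<String> taken = new ArrayList<>();

    void put(String event) {
        buffer.addLast(event);
    }

    // Takes the next event within the current transaction, or returns null if none is left.
    String take() {
        String event = buffer.pollFirst();
        if (event != null) {
            taken.add(event);
        }
        return event;
    }

    // Commit: the taken events are removed for good.
    void commit() {
        taken.clear();
    }

    // Rollback: the taken events return to the front of the buffer in their original order.
    void rollback() {
        for (int i = taken.size() - 1; i >= 0; i--) {
            buffer.addFirst(taken.get(i));
        }
        taken.clear();
    }

    int size() {
        return buffer.size();
    }

    public static void main(String[] args) {
        TxChannelSketch channel = new TxChannelSketch();
        channel.put("a");
        channel.put("b");
        channel.take();
        channel.rollback();                  // the write failed, so "a" goes back
        System.out.println(channel.size());  // 2
        channel.take();
        channel.commit();                    // the write succeeded, so "a" is removed
        System.out.println(channel.size());  // 1
    }
}
```

Because events only disappear on commit, a failed write leaves them in the channel for the next attempt.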

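Sink destinations include hdfs, logger, avro, thrift, ipc, file, null, HBase, solr, and custom sinks. The official distribution already provides many Sink types, but sometimes they do not meet the needs of a real project, and then we have to write a custom Sink.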

The official documentation also describes the interface for custom sinks: https://flume.apache.org/FlumeDeveloperGuide.html#sink

According to it, a custom MySink must extend the AbstractSink class and implement the Configurable interface.

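The methods to implement are:

configure(Context context) // initializes from the context (reads the settings in the configuration file)

process() // reads data (events) from the Channel; this method is called in a loop.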

Use case: reading data from the Channel and writing it to MySQL or to other file systems.

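Requirement:

Use Flume to receive data, add a prefix and a suffix to each record at the Sink, and print it to the console. The prefix and suffix can be set in the Flume job configuration file.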

package com.atguigu.sink;

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Mysink extends AbstractSink implements Configurable {
    private String prefix;
    private String subfix;
    // Create the Logger object
    private Logger logger = LoggerFactory.getLogger(Mysink.class);
    @Override
    public void configure(Context context) {
        prefix= context.getString("pre","pre-");
        subfix = context.getString("sub");
    }
    @Override
    public Status process() throws EventDeliveryException {
        // 1. Get the channel and begin a transaction
        Channel channel = getChannel();
        Transaction transaction = channel.getTransaction();
        transaction.begin();

        // 2. Take data from the channel and print it to the console
        try{
            // 2.1 Take one event; take() returns null when the channel is empty
            Event event = channel.take();
            if (event == null){
                // Nothing to process: end the transaction and back off
                // instead of spinning inside an open transaction
                transaction.commit();
                return Status.BACKOFF;
            }

            // 2.2 Process the event
            logger.info(prefix + new String(event.getBody())+subfix);
            // 2.3 Commit the transaction
            transaction.commit();

            return Status.READY;


        }catch (Exception e){
            // Roll back on failure
            transaction.rollback();
            return Status.BACKOFF;

        }finally {

            transaction.close();
        }
    }

}

Package the finished code as a JAR and place it in the lib directory of the Flume installation (/opt/module/flume).

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = com.atguigu.sink.Mysink
#a1.sinks.k1.pre = atguigu:
a1.sinks.k1.sub = :atguigu
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the job

[xwt@hadoop102 flume]$ bin/flume-ng agent -c conf/ -f job/mysink.conf -n a1 -Dflume.root.logger=INFO,console
[xwt@hadoop102 ~]$ nc localhost 44444

Flume Data Flow Monitoring with Ganglia (real-time monitoring with a third-party framework)

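Installing and Deploying Ganglia

Ganglia consists of three parts: gmond, gmetad, and gweb.

gmond (Ganglia Monitoring Daemon) is a lightweight service installed on every node from which metrics are to be collected. With gmond you can easily collect many system metrics, such as CPU, memory, disk, network, and active-process data.

gmetad (Ganglia Meta Daemon) is a service that aggregates all the information and stores it on disk in RRD format.

gweb (Ganglia Web) is Ganglia's visualization tool, a PHP front end that displays the data stored by gmetad in a browser. Its web interface presents the various metrics collected from the running cluster as charts.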

Install Ganglia

(1) Plan
hadoop102: web gmetad gmond
hadoop103: gmond
hadoop104: gmond

(2) Install epel-release on 102, 103, and 104

[xwt@hadoop102 flume]$ sudo yum -y install epel-release

(3) Install on 102
[xwt@hadoop102 flume]$ sudo yum -y install ganglia-gmetad 
[xwt@hadoop102 flume]$ sudo yum -y install ganglia-web
[xwt@hadoop102 flume]$ sudo yum -y install ganglia-gmond
(4) Install on 103 and 104
[xwt@hadoop103 flume]$ sudo yum -y install ganglia-gmond
[xwt@hadoop104 flume]$ sudo yum -y install ganglia-gmond

Edit the configuration file /etc/httpd/conf.d/ganglia.conf on 102

[xwt@hadoop102 flume]$ sudo vim /etc/httpd/conf.d/ganglia.conf

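Edit the configuration file /etc/ganglia/gmetad.conf on 102

Edit the configuration file /etc/ganglia/gmond.conf on 102, 103, and 104

Edit the configuration file /etc/selinux/config on 102

[xwt@hadoop102 flume]$ sudo vim /etc/selinux/config

Change the permissions of the /var/lib/ganglia directory: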

[xwt@hadoop102 flume]$ sudo chmod -R 777 /var/lib/ganglia

Start Ganglia

(1) Start on 102, 103, and 104
[xwt@hadoop102 flume]$ sudo systemctl start gmond
(2) Start on 102
[xwt@hadoop102 flume]$ sudo systemctl start httpd
[xwt@hadoop102 flume]$ sudo systemctl start gmetad

Open the Ganglia page in a browser: http://hadoop102/ganglia

Run Flume to test the monitoring

[xwt@hadoop102 flume]$ bin/flume-ng agent \
-c conf/ \
-n a1 \
-f job/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649