Flume Introduction and Configuration

Official site: http://flume.apache.org/

What is Flume

Flume is a distributed data collection framework.

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Collecting: the data source (source)
Aggregating: buffering/storage (channel)
Moving: delivery (sink)

Learning Flume is essentially learning how sources, channels, and sinks are combined.

Flume is a framework, and the framework itself defines no fixed combination of source, channel, and sink. For it to use a particular combination, we must describe that combination in a configuration file.

In other words, learning Flume means learning how to configure combinations of source, channel, and sink.

Once data stored in a channel has been taken by a sink, it is gone from the channel.

The channel is passive: it only stores data.

Event and agent

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.

Flume event = payload (data) + attributes (headers)

A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

Flume Sources

NetCat Source

Collect network data and print it to the console log
  1. Write the configuration file in the /home/hadoop/apps/flume/conf directory

    vim flume-net-log.conf

    # A Flume agent needs a source, a channel, and a sink configured
    # One agent can contain multiple sources, channels, and sinks,
    # so each source, channel, and sink must be given a name
    # Define the source, channel, and sink
    # a1 is the name of the agent
    # r1 is the name of the source
    # c1 is the name of the channel
    # k1 is the name of the sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    
    # Configure what kind of data source this is
    # netcat here acts as the server side, so netcat needs to be installed
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = hadoop101
    a1.sources.r1.port = 6666
    
    # Configure the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 10000
    a1.channels.c1.byteCapacityBufferPercentage = 20
    a1.channels.c1.byteCapacity = 800000
    
    # Configure the sink
    a1.sinks.k1.type = logger
    
    # Which channel the source writes to
    # Which channel the sink takes data from
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    
  2. Start Flume with the configuration file above

    flume-ng agent --conf ./ --conf-file ./flume-net-log.conf --name a1 -Dflume.root.logger=INFO,console
    
    ## Shortened form of the command
    [hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f flume-net-log.conf -Dflume.root.logger=INFO,console
    

Fix for when yum stops working

wget -O /etc/yum.repos.d/CentOS-Base.repo http://file.kangle.odata.cc/repo/Centos6.repo
wget -O /etc/yum.repos.d/epel.repo http://file.kangle.odata.cc/repo/epel6.repo
yum makecache

Install NetCat
  1. Extract the netcat package netcat-0.7.1.tar.gz (it extracts into the current directory and still needs to be compiled and installed)

    [hadoop@hadoop101 installPkg]$ tar -zxvf netcat-0.7.1.tar.gz
    
  2. Configure the installation path

    [hadoop@hadoop101 netcat-0.7.1]$ ./configure --prefix=/home/hadoop/apps/netcat/
    
  3. Compile and install (the src directory contains C source, so gcc must be installed before compiling)

    [hadoop@hadoop101 src]$ make && make install
    
  4. Configure environment variables

    [hadoop@hadoop101 bin]$ sudo vim /etc/profile
    ## netcat environment variables
    export NETCAT_HOME=/home/hadoop/apps/netcat
    export PATH=$PATH:$NETCAT_HOME/bin
    
    [hadoop@hadoop101 bin]$ . /etc/profile
    
  5. Start netcat as a socket client (hadoop101 and port 6666 here are the values configured in the flume-net-log.conf file created above)

    [hadoop@hadoop101 ~]$ nc hadoop101 6666
    hello world
    OK
    
  6. Console output

    2020-12-17 14:26:35,639 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:169)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/192.168.152.81:6666]
    2020-12-17 14:27:13,644 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
    Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world }
    

Exec Source

Monitor a file and print its data to the console log

The exec data source:
vim flume-exec-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/access.log

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
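
The agent is started the same way as in the NetCat example; a sketch, assuming the file above is saved in the same conf directory:

[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f flume-exec-log.conf -Dflume.root.logger=INFO,console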

Append data to access.log, and the Flume agent's console will print the corresponding events

[hadoop@hadoop101 data]$ echo java >> access.log
2020-12-17 15:10:54,796 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SOURCE, name: r1 started
2020-12-17 15:11:24,807 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
Event: { headers:{} body: 6A 61 76 61                                     java }

Spooling Directory Source

  • Unlike the Exec source, this source is reliable and will not lose data, even if Flume is restarted or killed.
  • Files placed in the spooling directory must be immutable and uniquely named.
  • Once a file has been fully ingested, it is renamed.

flume-spool-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/data/flumeSpool
a1.sources.r1.fileHeader = true
a1.sources.r1.basenameHeader = true

## Ignore files ending in .tmp
a1.sources.r1.ignorePattern = ^.*\\.tmp$

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
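
A quick test, assuming the agent is started with the file above and the spooling directory exists; once a file has been fully ingested, Flume renames it (with a .COMPLETED suffix by default), which can be seen with ls:

[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f flume-spool-log.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 ~]$ mkdir -p /home/hadoop/data/flumeSpool
[hadoop@hadoop101 ~]$ echo "hello spooldir" > /home/hadoop/data/flumeSpool/a.log
[hadoop@hadoop101 ~]$ ls /home/hadoop/data/flumeSpool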

Taildir Source – a key source

A source added in Flume 1.7.

Watches the specified files and tails them in near real time once new lines appended to each file are detected. If new lines are still being written, the source retries reading them until the write completes.

This source is reliable and will not lose data even when the tailed files rotate (for example, if Flume stops while the files are still being written to).

It periodically writes the last read position of each file, in JSON format, to a given position file. If Flume stops or goes down for any reason, it can resume tailing from the positions recorded in that file.

This source does not rename, delete, or otherwise modify the files it tails. It currently does not support tailing binary files; it reads text files line by line.

flume-taildir-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/hadoop/apps/flume/conf/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/hadoop/data/word.txt
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /home/hadoop/data/wc.txt
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.fileHeader = true

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
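
A quick way to see the reliability mechanism is to append to one of the tailed files and then look at the position file (paths as configured above); it records the inode, path, and byte position of each tailed file, which is what lets Flume resume after a restart:

[hadoop@hadoop101 ~]$ echo "hadoop flume" >> /home/hadoop/data/word.txt
[hadoop@hadoop101 ~]$ cat /home/hadoop/apps/flume/conf/taildir_position.json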

Flume Channels

Memory Channel

Events are stored in an in-memory queue. This gives high throughput, but data is lost if Flume fails.

File Channel

https://blogs.apache.org/flume/entry/apache_flume_filechannel

MemoryChannel provides high throughput but loses data in the event of a crash or power loss. A durable channel is therefore needed.

The goal of FileChannel is to provide a reliable, high-throughput channel. FileChannel guarantees that once a transaction has been committed, no data is lost due to a later crash or power failure.

taildir-file-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/hadoop/apps/flume/conf/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/hadoop/data/word.txt
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /home/hadoop/data/wc.txt
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.fileHeader = true

# Configure the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/apps/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/apps/flume/data

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
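
After the agent has processed some events, the channel's on-disk state can be inspected; checkpoint metadata and the event data (log) files live in the two directories configured above:

[hadoop@hadoop101 ~]$ ls /home/hadoop/apps/flume/checkpoint
[hadoop@hadoop101 ~]$ ls /home/hadoop/apps/flume/data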

Flume Sinks

HDFS Sink

This sink writes events to the Hadoop Distributed File System (HDFS).

It currently supports creating text and sequence files, with compression supported for both file types.

Files can be rolled (the current file closed and a new one created) periodically based on elapsed time, data size, or number of events.

net-mem-hdfs.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-

# Directory rolling (time-based bucketing of the path)
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

# File rolling configuration
# Roll based on time (seconds); 0 disables
a1.sinks.k1.hdfs.rollInterval = 30

# Roll based on size (bytes); 0 disables
a1.sinks.k1.hdfs.rollSize = 1024
# Roll based on number of events; 0 disables
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Configure the file type
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
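
A sketch of how to exercise this sink, assuming the file above is saved as net-mem-hdfs.conf and HDFS is running; the date/time part of the output path is filled in at write time, so the path used in the last command is illustrative:

[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-mem-hdfs.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello hdfs
OK
[hadoop@hadoop101 ~]$ hdfs dfs -ls -R /flume/events
[hadoop@hadoop101 ~]$ hdfs dfs -cat /flume/events/20-12-17/1430/events-*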

File Roll Sink

Stores events on the local file system.

net-mem-file.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data/flume

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
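
The file_roll sink expects its output directory to exist (creating it up front avoids startup errors); a quick test, assuming the configuration above is saved as net-mem-file.conf:

[hadoop@hadoop101 ~]$ mkdir -p /home/hadoop/data/flume
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-mem-file.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello file_roll
OK
[hadoop@hadoop101 ~]$ ls /home/hadoop/data/flume

By default the sink starts a new file every 30 seconds, so several small files will accumulate over time.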

AsyncHBaseSink

This sink uses an asynchronous model to write data to HBase.

net-mem-hbase.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = asynchbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
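
The AsyncHBase sink writes into an existing HBase table, so the table and column family named above need to be created first; a sketch using the HBase shell:

hbase(main):001:0> create 'foo_table', 'bar_cf'

After sending a few lines through nc, running scan 'foo_table' in the HBase shell should show the rows written by the sink.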

Chaining Flume agents

In Flume, one source feeding multiple channels and multiple sinks is called fan out;

multiple sources feeding a single channel and a single sink is called fan in;

but you cannot have multiple sources feeding multiple channels and multiple sinks at the same time.

multi-agent flow

The first Flume agent runs on hadoop101
The second Flume agent runs on hadoop102

Note: start the Flume agent on hadoop102 first

net-mem-avro.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4545

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

avro-mem-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 4545

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
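
Putting the two configurations together, the agents are started in the order noted above (the Avro source on hadoop102 must be listening before the Avro sink on hadoop101 connects); a sketch assuming both files are in each host's flume conf directory:

[hadoop@hadoop102 conf]$ flume-ng agent -n a1 -c ./ -f avro-mem-log.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-mem-avro.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 ~]$ nc hadoop101 6666

The text typed into nc on hadoop101 should then show up in the logger output on hadoop102.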

Multiplexing the flow

net-channels-sinks.conf

a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = k1 k2 k3

# Configure what kind of data source this is
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channels
a1.channels.c1.type = memory
a1.channels.c2.type = memory
a1.channels.c3.type = memory

# Configure the sinks
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
# Directory rolling (time-based bucketing of the path)
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# File rolling configuration
# Roll based on time (seconds); 0 disables
a1.sinks.k1.hdfs.rollInterval = 30
# Roll based on size (bytes); 0 disables
a1.sinks.k1.hdfs.rollSize = 1024
# Roll based on number of events; 0 disables
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Configure the file type
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.sinks.k2.type = logger

a1.sinks.k3.type = avro
a1.sinks.k3.hostname = hadoop102
a1.sinks.k3.port = 4545

# Which channels the source writes to
# Which channel each sink takes data from
a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3
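
With no selector configured, the source uses the default replicating channel selector, so every event above is copied to c1, c2, and c3. To multiplex by header value instead, a selector block along the following lines could be added (a sketch: the header name "state" and its values are assumptions, and would normally be set by an upstream interceptor or client):

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c3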

Interceptors

Timestamp Interceptor

This interceptor inserts into the event headers the time, in milliseconds, at which it processed the event.

net-timestamp.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure what kind of data source this is
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
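
When data is sent through nc, the logger sink should now show a timestamp key in the event headers instead of empty headers; the value is the processing time in epoch milliseconds, so the output has roughly this shape (values illustrative):

Event: { headers:{timestamp=1608187595639} body: 68 65 6C 6C 6F                hello }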

Host Interceptor

This interceptor inserts the hostname or IP address of the host the agent is running on into the event headers.

net-timestamp-host.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
a1.sources.r1.interceptors.i2.useIP = false

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Static Interceptor

Adds a custom, fixed header (key/value) to every event.

net-static.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = author
a1.sources.r1.interceptors.i1.value = lee

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Custom interceptor

  1. Write a class implementing Interceptor that converts the incoming data into JSON (modeled on the source code of the HostInterceptor class)

    Import the dependencies first

    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.7.0</version>
    </dependency>
    
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.72</version>
    </dependency>
    
    package com.bigdata.demo;
    
    import com.alibaba.fastjson.JSONObject;
    import com.google.common.collect.Lists;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;
    
    import java.io.UnsupportedEncodingException;
    import java.util.HashMap;
    import java.util.List;
    
    public class LogInterceptor implements Interceptor {
    
        private String colName;
        private String separator;
        private HashMap<String,Object> map;
    
        private LogInterceptor(String colName, String separator){
            this.colName = colName;
            this.separator = separator;
        }
        @Override
        public void initialize() {
            map = new HashMap<>();
        }
    
        @Override
        public Event intercept(Event event) {
            map.clear();
            byte[] body = event.getBody();
            try {
                String data = new String(body,"UTF-8");
                String[] datas = data.split(separator);
                String[] fields = colName.split(",");
                if(fields.length != datas.length){
                    return null;
                }
                for (int i = 0; i < datas.length; i++) {
                    map.put(fields[i],datas[i]);
                }
    
                // convert the map to a JSON string
                String json = JSONObject.toJSONString(map);
                // set the JSON string as the event body
                event.setBody(json.getBytes("UTF-8"));
    
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
            return event;
        }
    
        @Override
        public List<Event> intercept(List<Event> events) {
            List<Event> out = Lists.newArrayList();
            for (Event event : events) {
                Event outEvent = intercept(event);
                if (outEvent != null) {
                    out.add(outEvent);
                }
            }
            return out;
        }
    
        @Override
        public void close() {
            //no-op
        }
    
        public static class Builder implements Interceptor.Builder {
            private String colName;
            private String separator;
    
            @Override
            public Interceptor build() {
                return new LogInterceptor(colName,separator);
            }
    
            @Override
            public void configure(Context context) {
                colName = context.getString("colName","");
                separator = context.getString("separator"," ");
            }
        }
    }
    
  2. Package the code into a jar and put it into Flume's lib directory

  3. Write the configuration (a quick test follows as step 4 below)

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    
    # Configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = hadoop101
    a1.sources.r1.port = 6666
    
    # Configure the interceptor
    a1.sources.r1.interceptors = i1
    # The fully qualified name refers to the compiled class (not the .java file); the inner Builder class is referenced with a $
    a1.sources.r1.interceptors.i1.type = com.bigdata.demo.LogInterceptor$Builder
    a1.sources.r1.interceptors.i1.colName = id,name,age
    a1.sources.r1.interceptors.i1.separator = ,
    
    # Configure the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    
    # Configure the sink
    a1.sinks.k1.type = file_roll
    a1.sinks.k1.sink.directory = /home/hadoop/data
    
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
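
  4. Test: send a line matching the configured columns through nc, then look at the rolled file written by the file_roll sink in /home/hadoop/data (file names are timestamp-based); it should contain the JSON produced by the interceptor, e.g. {"id":"1","name":"tom","age":"20"} (key order may vary). A sketch:

    [hadoop@hadoop101 ~]$ nc hadoop101 6666
    1,tom,20
    OK
    [hadoop@hadoop101 data]$ ls /home/hadoop/data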
    

Custom HBase serializer

  1. Implement AsyncHbaseEventSerializer to write the data into an HBase table (modeled on the source code of the SimpleAsyncHbaseEventSerializer class)

    Import the dependency first

    <dependency>
        <groupId>org.apache.flume.flume-ng-sinks</groupId>
        <artifactId>flume-ng-hbase-sink</artifactId>
        <version>1.7.0</version>
    </dependency>
    
    package com.bigdata.demo;
    
    import com.google.common.base.Charsets;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.FlumeException;
    import org.apache.flume.conf.ComponentConfiguration;
    import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;
    import org.hbase.async.AtomicIncrementRequest;
    import org.hbase.async.PutRequest;
    
    import java.util.ArrayList;
    import java.util.List;
    
    public class LogHbaseEventSerializer implements AsyncHbaseEventSerializer {
        private byte[] table;
        private byte[] cf;
        private byte[] payload;
        private byte[] incrementColumn;
        private byte[] incrementRow;
        private String separator;
        private String pCol;
    
        @Override
        public void initialize(byte[] table, byte[] cf) {
            this.table = table;
            this.cf = cf;
        }
    
        @Override
        public List<PutRequest> getActions() {
            List<PutRequest> actions = new ArrayList<PutRequest>();
            if (pCol != null) {
                byte[] rowKey;
                try {
                    // build the row key: use the user id from the collected data as the row key
                    // parse out the id
                    String data = new String(payload);
                    String[] strings = data.split(separator);
                    String[] fields = pCol.split(",");
    
                    if(strings.length != fields.length){
                        return actions;
                    }
    
                    String id = strings[0];
                    rowKey = id.getBytes("UTF-8");
    
                    for (int i = 0; i < strings.length; i++) {
                        PutRequest putRequest =  new PutRequest(table, rowKey, cf,
                                fields[i].getBytes("UTF-8"), strings[i].getBytes("UTF-8"));
                        actions.add(putRequest);
                    }
    
                } catch (Exception e) {
                    throw new FlumeException("Could not get row key!", e);
                }
            }
            return actions;
        }
    
        public List<AtomicIncrementRequest> getIncrements() {
            List<AtomicIncrementRequest> actions = new ArrayList<AtomicIncrementRequest>();
            if (incrementColumn != null) {
                AtomicIncrementRequest inc = new AtomicIncrementRequest(table,
                        incrementRow, cf, incrementColumn);
                actions.add(inc);
            }
            return actions;
        }
    
        @Override
        public void cleanUp() {
            // TODO Auto-generated method stub
    
        }
    
        @Override
        public void configure(Context context) {
            // HBase column names
            pCol = context.getString("colName", "pCol");
            // separator of the data collected by Flume
            separator = context.getString("separator", ",");
            String iCol = context.getString("incrementColumn", "iCol");
    
    
            if (iCol != null && !iCol.isEmpty()) {
                incrementColumn = iCol.getBytes(Charsets.UTF_8);
            }
            incrementRow = context.getString("incrementRow", "incRow").getBytes(Charsets.UTF_8);
        }
    
        @Override
        public void setEvent(Event event) {
            this.payload = event.getBody();
        }
    
        @Override
        public void configure(ComponentConfiguration conf) {
            // TODO Auto-generated method stub
        }
    }
    
  2. Package the code into a jar and put it into Flume's lib directory

  3. Write the configuration (a quick test follows as step 4 below)

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    
    # Configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = hadoop101
    a1.sources.r1.port = 6666
    
    # Configure the channel
    a1.channels.c1.type = memory
    
    # Configure the sink
    a1.sinks.k1.type = asynchbase
    a1.sinks.k1.table = myhbase
    a1.sinks.k1.columnFamily = c
    a1.sinks.k1.serializer = com.bigdata.demo.LogHbaseEventSerializer
    a1.sinks.k1.serializer.colName = id,name,age
    a1.sinks.k1.serializer.separator = ,
    
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
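
  4. Create the table in HBase before starting the agent (the table name and column family come from the configuration above), then send a line through nc and scan the table. A sketch:

    hbase(main):001:0> create 'myhbase', 'c'
    
    [hadoop@hadoop101 ~]$ nc hadoop101 6666
    1,tom,20
    OK
    
    hbase(main):002:0> scan 'myhbase'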
    

How the agent works internally

(diagram omitted)

Flume failover and load balancing

Using sink groups

Failover

A sink group is attached to a single channel, and only one sink in the group takes data at a time. If that sink fails, another sink in the group can take over.

Each sink has an associated priority; the larger the number, the higher the priority.

failover.conf:

a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4545

a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /home/hadoop/data/flume

# Configure failover
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 50
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
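
One way to see the failover behaviour, sketched with configurations already used above: start the Avro receiver on hadoop102 (avro-mem-log.conf), start this agent on hadoop101, and send data through nc; events go through k1 (priority 50). Then stop the agent on hadoop102 and keep sending: once the failure is detected, events are delivered by k2 and appear as files under /home/hadoop/data/flume.

[hadoop@hadoop102 conf]$ flume-ng agent -n a1 -c ./ -f avro-mem-log.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f failover.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 ~]$ nc hadoop101 6666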

Load balancing

The load-balancing sink processor provides the ability to load-balance the flow over multiple sinks.

It maintains an indexed list of active sinks over which the load must be distributed.

The implementation supports distributing the load using either round_robin or random selection mechanisms.

The selection mechanism defaults to round_robin, but it can be overridden via configuration.

a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure load balancing
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin

# Configure the sinks
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data/flume01

a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /home/hadoop/data/flume

# Bind the channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
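
With round_robin selection, events sent through nc should end up spread across the two output directories. A sketch of a quick check; the file name load_balance.conf is an assumption, since the configuration above is not given a name in the text, and both directories are created up front:

[hadoop@hadoop101 ~]$ mkdir -p /home/hadoop/data/flume01 /home/hadoop/data/flume
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f load_balance.conf -Dflume.root.logger=INFO,console
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello k1
OK
hello k2
OK
[hadoop@hadoop101 ~]$ ls /home/hadoop/data/flume01 /home/hadoop/data/flume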