Flume: a data collection tool

A business system is essentially clients plus backend services.

Source, Channel, Sink, Event

A running Flume process is called an agent. A Flume collection system is a data transfer pipeline, simple or complex, formed by connecting agents together. Each agent has three components: Source, Channel, and Sink.
Source is like read: it reads data in.
Channel buffers the data (to decouple source and sink).
Sink is like write: it writes data out.

Event

An event is the form in which data is encapsulated inside a channel.
Therefore, after the source component obtains raw data, it must wrap it into events and put them into the channel.
After the sink component takes events out of the channel, it converts them into whatever form the target storage requires before writing them out.
An event object consists of two main parts: headers and body.
Headers is a Map<String, String> used to carry key-value metadata (flags, descriptions, etc.).
Body is a byte array (byte[]) that holds the actual data content.
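For illustration, a minimal Java sketch of building an event with Flume's EventBuilder (the header key and the sample record below are made up for this example):

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventDemo {
    public static void main(String[] args) {
        // headers: KV metadata describing the payload (the key name here is arbitrary)
        Map<String, String> headers = new HashMap<>();
        headers.put("logtype", "applog");

        // body: the raw data content as a byte array
        byte[] body = "1,zs,mi6,addcart,13845934468300,app".getBytes(StandardCharsets.UTF_8);

        Event event = EventBuilder.withBody(body, headers);
        System.out.println(event.getHeaders() + " -> " + new String(event.getBody(), StandardCharsets.UTF_8));
    }
}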

Interceptor

An interceptor gives the user a way to plug in custom data-processing logic.
Interceptors run after the source component: events produced by the source are passed into the interceptor and processed as needed.
Interceptors can also be chained together into an interceptor chain.
Flume ships with a number of commonly used built-in interceptors.
Users can also develop custom interceptors for their own data-processing needs.
This is one of Flume's extension points for custom logic.

channel selector

A source can be connected to multiple channels; how events are distributed among these n channels is decided by the configured channel selector.
There are two channel selector implementations: replicating and multiplexing.

sink processor

If sinks and channels are in a one-to-one relationship, no dedicated sink processor is needed.
If one channel is configured with multiple sinks, those sinks can be configured into a sink group.
How events are distributed among the sinks in a group is decided by the configured sink processor.
There are two kinds of sink processor: load_balance (round_robin or random) and failover.
load_balance means load balancing: for example, one sink takes records 1-10 and the other takes 11-20.
failover means failure switching: typically two sinks are configured but only one is actively working; if it goes down, the other one takes over.

Transaction: the transaction mechanism (at least once)

Flume does not implement exactly-once semantics, though it keeps moving closer to it; it can only guarantee that data is not lost.
at least once: no data is lost, but duplicates are possible.
at most once: no duplicates, but data may be lost.
A transaction guarantees one thing: an operation or piece of business logic split into multiple steps either succeeds as a whole or fails as a whole.
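A rough Java sketch of the take-side transaction pattern a sink typically follows (the batch size, the writeToTarget helper, and the error handling are simplified assumptions for illustration):

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

public class TakeTransactionSketch {
    // Drain up to batchSize events from the channel inside one transaction:
    // either the whole batch is committed, or it is rolled back and the events
    // stay in the channel to be re-delivered later (hence "at least once").
    public static void drainOneBatch(Channel channel, int batchSize) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            for (int i = 0; i < batchSize; i++) {
                Event event = channel.take();
                if (event == null) {
                    break;                // channel is empty for now
                }
                writeToTarget(event);     // hypothetical helper standing in for the real destination
            }
            tx.commit();                  // all steps succeed together
        } catch (Throwable t) {
            tx.rollback();                // or all fail together; events may be re-sent later (duplicates possible)
            throw t;
        } finally {
            tx.close();
        }
    }

    private static void writeToTarget(Event event) {
        System.out.println(new String(event.getBody()));
    }
}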

Case 1: annotated configuration

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# which channel this source connects to
a1.sources.r1.channels = c1
# the concrete source type; exec runs a shell command and captures its output
a1.sources.r1.type = exec
# how many records per batch
a1.sources.r1.batchSize = 100
# the command to run; tail -F tracks by file name, tail -f by inode (renaming a file keeps its inode); if the logging process's 1.txt is renamed, a new 1.txt is created and written to, and -F keeps following it by name
a1.sources.r1.command = tail -F /root/logs/a.log


# a memory-backed channel
a1.channels.c1.type = memory
# maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# maximum number of events per transaction; should be larger than the upstream batch size
a1.channels.c1.transactionCapacity = 200

# which channel this sink reads from
a1.sinks.k1.channel = c1
# a sink type that prints the received data to the console
a1.sinks.k1.type = logger

---------------------------------------------------------------------------------------

Under the Flume directory, create a folder for collection configs, e.g. agentsconf (any name works), and inside it create a file exec-m-logger.conf to hold the collection configuration above.

# start command
bin/flume-ng agent -c conf/ -f agentsconf/exec-m-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c: the directory containing Flume's own configuration files
-f: the path of the collection configuration file
-n a1: the name of the agent defined in the configuration file
-Dflume.root.logger=INFO,console: log level and destination; not required on every start

Case 3: a custom interceptor extracts the event timestamp from the data, which is then used when writing to HDFS

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.batchSize = 100
a1.sources.r1.command = tail -F /root/a.log
# name of the custom interceptor
a1.sources.r1.interceptors = i1
# fully qualified class name of the custom interceptor's Builder; note the $ before the inner class name
a1.sources.r1.interceptors.i1.type = cn._51doit.flnm.demo01.EventTimeStampInterceptor$EventTimeStampInterceptorBuilder
# parameter: field delimiter
a1.sources.r1.interceptors.i1.split_by = ,
# parameter: index of the timestamp field
a1.sources.r1.interceptors.i1.ts_index = 2


a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200


# which channel this sink reads from
a1.sinks.k1.channel = c1
# the concrete sink type
a1.sinks.k1.type = hdfs
# the HDFS path to write to; a date-based directory can follow the path
a1.sinks.k1.hdfs.path = hdfs://linux01:8020/doitedu01/
# file name prefix
a1.sinks.k1.hdfs.filePrefix = DoitEduData
# file name suffix
a1.sinks.k1.hdfs.fileSuffix = .log
# roll interval in seconds (stop writing the current file and roll over to the next one)
a1.sinks.k1.hdfs.rollInterval = 60
# roll the file when it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize = 268435456
# roll the file after this many events (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount = 0
# batch size
a1.sinks.k1.hdfs.batchSize = 100
# compression codec; supported codecs include gzip, bzip2, lzop and snappy
a1.sinks.k1.hdfs.codeC = gzip
# file format; three options: SequenceFile (binary KV, not human-readable but compact), DataStream (keeps the original format of the stream), CompressedStream (compressed)
a1.sinks.k1.hdfs.fileType = CompressedStream
# whether to use the local timestamp (default false); setting it to true uses the agent's local time, which is risky: because of transfer delays, 23:59 data may land in the next day's directory
a1.sinks.k1.hdfs.useLocalTimeStamp = false

Put the interceptor's jar into Flume's lib directory.
# lab manual
Simulated data:
for i in {1..1000000}; do echo "${i},zhangsan,`date +%s`000,iphone8plus,submit_order" >> a.log; sleep 0.5; done

# start command
bin/flume-ng agent -c conf/ -f agentsconf/exec-m-logger.conf -n a1 -Dflume.root.logger=INFO,console 

Source code of the custom interceptor jar

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;

public class EventTimeStampInterceptor implements Interceptor {
    private String splitby;
    private Integer ts_index;
    public EventTimeStampInterceptor(String splitby, Integer ts_index) {
        this.splitby=splitby;
        this.ts_index=ts_index;
    }
    // initialization method: called once, before the interception logic is first invoked
    public void initialize() {


    }
    // the method that contains the interception logic
    // assume the data being collected looks like: id,name,timestamp,devicetype,event
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        String line = new String(body);
        // extract the event timestamp from the event body
        String[] split = line.split(this.splitby);
        String timestampStr = split[this.ts_index];
        // put the timestamp into the event headers
        event.getHeaders().put("timestamp",timestampStr);
        return event;
    }

    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }

        return list;
    }
    // cleanup method: called once, before this interceptor instance is destroyed
    public void close() {

    }
    // the Builder is what Flume uses to construct the custom interceptor instance
    public static class EventTimeStampInterceptorBuilder implements Interceptor.Builder{

        String splitby;
        Integer ts_index;
        // Flume calls this method to create our custom interceptor instance
        public Interceptor build() {
            return new EventTimeStampInterceptor(splitby,ts_index);
        }
        // Flume passes the loaded configuration parameters in through this method
        public void configure(Context context) {
            splitby = context.getString("split_by", ",");
            ts_index = context.getInteger("ts_index", 2);

        }
    }
}

Cascading case: the receiving (network) source is avro, and correspondingly the upstream sink is also avro

Multiple upstream agents write to one downstream agent, which then writes to HDFS.
The key point is that the upstream sink and the downstream source must be able to connect to each other: the upstream sink acts as the client, the downstream source as the server.

# upstream configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.batchSize = 100
a1.sources.r1.command = tail -F /root/a.log

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

a1.sinks.k1.channel = c1
# avro sink: serializes events so the downstream avro source can receive them
a1.sinks.k1.type = avro
# hostname of the receiving machine
a1.sinks.k1.hostname = linux03
# port number (any free port)
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 100


# downstream configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
# the source type is avro
a1.sources.r1.type = avro
# bind address; 0.0.0.0 means all local IPs
a1.sources.r1.bind = 0.0.0.0
# port number
a1.sources.r1.port = 41414

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200


a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://linux01:8020/doitedu03/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = DoitEduData
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 268435456
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.useLocalTimeStamp = true

Start commands:
bin/flume-ng agent -c conf/ -f agentsconf/shangyou.conf -n a1 -Dflume.root.logger=INFO,console 
bin/flume-ng agent -c conf/ -f agentsconf/xiayou.conf -n a1 -Dflume.root.logger=INFO,console 
Data-generation scripts:
for i in {1..1000000}; do echo "${i},zhangsan,`date +%s`000,iphone8plus,submit_order" >> a.log; sleep 0.5; done
for i in {1..1000000}; do echo "${i},wwwwwwwwwwww,`date +%s`000,iphone8plus,submit_order" >> a.log; sleep 0.5; done

TAILDIR source: reads from two file directories and delivers the data to the configured destination

a1.sources = r1
a1.channels = c1
a1.sinks = k1


a1.sources.r1.channels = c1
a1.sources.r1.type = TAILDIR
# names of the two file groups to collect (group names, not file names)
a1.sources.r1.filegroups = g1 g2
# file path pattern for group g1
a1.sources.r1.filegroups.g1 = /root/logs/wxlog/event.*
# file path pattern for group g2
a1.sources.r1.filegroups.g2 = /root/logs/applog/event.*
# the value of a given header key can later decide which channel an event goes to; events from g1 get header k = wxlog
a1.sources.r1.headers.g1.k = wxlog
# events from g2 get header k = applog
a1.sources.r1.headers.g2.k = applog
a1.sources.r1.batchSize =100

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger




# start command
bin/flume-ng agent -c conf/ -f agentsconf/wx.conf -n a1 -Dflume.root.logger=INFO,console
# data-generation scripts
for i in {1..1000000}; do echo "${i},wxwxwxwxwxwxwxwxwxw" >> applog/event.log; sleep 0.5; done
for i in {1..1000000}; do echo "${i},ppppppppppppppppppp" >> wxlog/event.log; sleep 0.5; done

Replicating selector: one source connected to multiple channels

The upstream agent uses a replicating selector: one source feeds multiple channels and multiple sinks, so data can be distributed to multiple downstream agents (e.g. sent both to HDFS and to local storage). In effect, the log data that is read gets duplicated several times.


Downstream:
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200



a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger


# upstream

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2


a1.sources.r1.channels = c1 c2
a1.sources.r1.type = exec
a1.sources.r1.batchSize = 100
a1.sources.r1.command = tail -F /root/a.log
# this declares the replicating selector, so events are distributed to both c1 and c2
a1.sources.r1.selector.type = replicating
# channels listed here are optional: delivery to them is allowed to fail; without this setting, delivery to every channel must succeed
a1.sources.r1.selector.optional = c2

# channel c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

# channel c2
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 200

# sink k1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux02
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 100

# sink k2
a1.sinks.k2.channel = c2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux03
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 100


Data-generation script:
for i in {1..1000000}; do echo "${i},wxwxwxwxwxwxwxwxwxw" >> /root/a.log; sleep 0.5; done
Run commands:
bin/flume-ng agent -c conf/ -f agentsconf/xiayou.conf -n a1 -Dflume.root.logger=INFO,console 
bin/flume-ng agent -c conf/ -f agentsconf/shangyou.conf -n a1 -Dflume.root.logger=INFO,console 

Multiplexing selector (the source is of TAILDIR type)

This example collects data from two different directories (different paths) at the same time and stores it in different places by category (in this example, printed on two consoles).

# downstream
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200


# this prints to the console; to write to HDFS instead, change this sink
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger


# upstream

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2


a1.sources.r1.channels = c1 c2
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = g1 g2
a1.sources.r1.filegroups.g1 = /root/logs/wxlog
a1.sources.r1.filegroups.g2 = /root/logs/applog
a1.sources.r1.headers.g1.logtype = wxlog
a1.sources.r1.headers.g2.logtype = applog
a1.sources.r1.batchSize = 100
# declares the multiplexing selector
a1.sources.r1.selector.type = multiplexing
# the header key to inspect is logtype, which was added above
a1.sources.r1.selector.header = logtype
# mapping: if the value is wxlog, send to c1
a1.sources.r1.selector.mapping.wxlog = c1
# mapping: if the value is applog, send to c2
a1.sources.r1.selector.mapping.applog = c2
# anything that matches no mapping goes to c2 (hence "default")
a1.sources.r1.selector.default = c2

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 200


a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux02
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 100

a1.sinks.k2.channel = c2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux03
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 100



# data-generation scripts
for i in {1..1000000}; do echo "${i},wxwxwxwxwxwxwxwxwxw" >> /root/logs/applog; sleep 0.5; done
for i in {1..1000000}; do echo "${i},pppppppppppppppppppppp" >> /root/logs/wxlog; sleep 0.5; done

# run command
bin/flume-ng agent -c conf/ -f agentsconf/shangyou -n a1 -Dflume.root.logger=INFO,console 

Hands-on exercise

#Read data from one file, intercept it, and split the records into two categories written to two different directories on HDFS
Requirement: the log contains data like the following

1,zs,mi6,addcart,13845934468300,app
2,aa,mi6,addcart,13845934468300,app
3,aa,mi6,addcart,13845934468300,app
4,aa,mi6,addcart,13845934468300,app
5,bb,oppo6,addcart,13845934468300,wx
6,bb,oppo6,addcart,13845934468300,wx
7,bb,oppo6,addcart,13845934468300,wx
8,bb,oppo6,addcart,13845934468300,wx
9,bb,oppo6,addcart,13845934468300,wx

Different types of data need to be written to different HDFS directories:
hdfs://hdp01:8020/logdata/app/2021-01-08

hdfs://hdp01:8020/logdata/wx/2021-01-08




Script to simulate data:
#!/bin/bash
for i in {1..100000}
do
if [ $(($RANDOM % 2)) -eq 0 ] 
 then
   echo  "${i},aa,mi6,addcart,`date +%s`000,app" >> /root/logs/event.log
 else 
   echo  "${i},bb,mi6,addcart,`date +%s`000,wx" >> /root/logs/event.log
fi
sleep 0.2
done


Approach:

1. Use a custom interceptor to extract the channel-routing information from the data and put it into the header (a sketch of such an interceptor follows below).
2. Use a multiplexing selector to route events to channels based on that header.
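The demo02 interceptor referenced in the upstream config below is not reproduced in these notes; the following is only a hypothetical sketch of what it might look like, assuming it is the earlier EventTimeStampInterceptor generalized with a configurable headerName parameter (class name and bodies here are illustrative, not the author's actual demo02 code):

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;

// Hypothetical sketch: copies one delimited field of the event body into a configurable header.
public class FieldToHeaderInterceptor implements Interceptor {
    private final String headerName;
    private final String splitBy;
    private final int fieldIndex;

    private FieldToHeaderInterceptor(String headerName, String splitBy, int fieldIndex) {
        this.headerName = headerName;
        this.splitBy = splitBy;
        this.fieldIndex = fieldIndex;
    }

    public void initialize() { }

    public Event intercept(Event event) {
        String[] fields = new String(event.getBody()).split(splitBy);
        // e.g. headerName=flag with ts_index=5 puts "app"/"wx" into the header,
        // while headerName=timestamp with ts_index=4 puts the event time into the header
        event.getHeaders().put(headerName, fields[fieldIndex]);
        return event;
    }

    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    public void close() { }

    public static class Builder implements Interceptor.Builder {
        private String headerName;
        private String splitBy;
        private Integer fieldIndex;

        public Interceptor build() {
            return new FieldToHeaderInterceptor(headerName, splitBy, fieldIndex);
        }

        public void configure(Context context) {
            headerName = context.getString("headerName", "timestamp");
            splitBy = context.getString("split_by", ",");
            fieldIndex = context.getInteger("ts_index", 2);
        }
    }
}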



Upstream configuration file
# upstream
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2


a1.sources.r1.channels = c1 c2
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = g1
a1.sources.r1.filegroups.g1 = /root/logs/event.*
a1.sources.r1.batchSize = 100

# defines an interceptor chain
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = cn._51doit.flnm.demo02.EventTimeStampInterceptor$EventTimeStampInterceptorBuilder
# headerName is user-defined: a header is a key-value pair, and this sets the key to flag
a1.sources.r1.interceptors.i1.headerName = flag
a1.sources.r1.interceptors.i1.split_by = ,
a1.sources.r1.interceptors.i1.ts_index = 5

a1.sources.r1.interceptors.i2.type = cn._51doit.flnm.demo02.EventTimeStampInterceptor$EventTimeStampInterceptorBuilder
a1.sources.r1.interceptors.i2.headerName = timestamp
a1.sources.r1.interceptors.i2.split_by = ,
a1.sources.r1.interceptors.i2.ts_index = 4

a1.sources.r1.selector.type = multiplexing
# inspects the flag header set above (taken from field index 5), whose value here will be app or wx
a1.sources.r1.selector.header = flag
# if flag is app, route to c1
a1.sources.r1.selector.mapping.app = c1
# if flag is wx, route to c2
a1.sources.r1.selector.mapping.wx = c2
# anything that is neither app nor wx goes to c2
a1.sources.r1.selector.default = c2

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 200


a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux02
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 100

a1.sinks.k2.channel = c2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux03
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 100



# downstream: write to HDFS


a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 200
a1.channels.c1.keep-alive = 60



a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://linux01:8020/doitedu02/app/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = DoitEduData
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 268435456
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = false


Start command:
bin/flume-ng agent -c conf/ -f agentsconf/wx -n a1 -Dflume.root.logger=INFO,console 

Sink round-robin and random

round_robin (polls the sinks in turn)
random (picks a sink at random)

The round-robin behavior below is the default and does not need to be configured explicitly.

Multiple sinks attached to one channel without a sink group: the sinks coexist peacefully, one takes a batch and the next takes the following batch, which is effectively load balancing (data taken by one sink is not taken again by the other, so nothing is duplicated and nothing is lost).
上游:
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.batchSize = 100
a1.sources.r1.command = tail -F /root/logs/event.log


a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200


a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp02.doitedu.cn
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 100


a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hdp03.doitedu.cn
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 100



Downstream:
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger

Below is the random selector.

Multiple sinks attached to one channel, this time configured as a sink group with a load_balance processor and the random selector: the sinks contend for data, and the same sink may take data several times in a row. It is still load balancing (data taken by one sink is not taken again by the other, so nothing is duplicated and nothing is lost).
Upstream:
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.batchSize = 100
a1.sources.r1.command = tail -F /root/logs/event.log


a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200


a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp02.doitedu.cn
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 100


a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hdp03.doitedu.cn
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 100

# define the sink group and its sink processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random





Downstream:
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger

Sink failover

This is a high-availability pattern, mainly used in cascaded setups (multiple upstream agents feeding one downstream agent): the downstream gets a standby. Both the primary downstream agent and the standby are running, but the standby idles. Upstream, sink k1 connects to the primary and sink k2 to the standby; when the primary goes down, traffic switches to k2 and the standby takes over.

# upstream
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.batchSize = 100
a1.sources.r1.command = tail -F /root/logs/event.log


a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200


a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux02
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 100


a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux03
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 100

# define the sink group and its sink processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# the processor type that switches to another sink on failure
a1.sinkgroups.g1.processor.type = failover
# priorities: the higher the number, the higher the priority; the highest-priority sink connects to the primary (values chosen by the user)
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 1
# maximum back-off (penalty) time for a failed sink, in milliseconds
a1.sinkgroups.g1.processor.maxpenalty = 10000



Downstream:
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger

Start command and data-generation command:
bin/flume-ng agent -c conf/ -f agentsconf/shangyou -n a1 -Dflume.root.logger=INFO,console 
for i in {1..1000000}; do echo "${i},pppppppppppppppppppppp" >> /root/logs/event.log; sleep 0.5; done