Official documentation: https://flume.apache.org/FlumeUserGuide.html
Flume overview
Apache Flume is a distributed, highly reliable, and highly available tool for collecting, aggregating, and moving large volumes of log data from many different sources into a centralized data store.
Flume architecture
Client: where data is produced, e.g. a web server
Event: a single unit of data transferred through an Agent; for log data, one line typically maps to one event
Agent: an independent JVM process; Flume is deployed as one or more Agents
Each Agent contains three components: Source, Channel, and Sink
Source: where data is read from
Channel: how events are buffered in transit
Sink: where events are delivered to
Sources
netcat (monitor a port)
Create a conf file named netcat-flume-logger.conf under /opt/flume1.6/conf/job:
#Name: alias the three components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Source: netcat type, listening on local port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Sink: logger type, which prints to the console for easy testing
a1.sinks.k1.type = logger
#Channel: memory type
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Bind: wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Save the file, then start the agent:
flume-ng agent \
--conf /opt/flume1.6/conf \
--conf-file /opt/flume1.6/conf/job/netcat-flume-logger.conf \
--name a1 \
-Dflume.root.logger=INFO,console
--conf: the conf directory of the Flume installation
--conf-file: the path of the conf file you wrote
--name: the agent name used inside the conf file (a1)
-Dflume.root.logger=INFO,console: print output to the console; only needed when the sink type is logger
Shorthand form:
flume-ng agent -c conf -f /opt/flume1.6/conf/job/netcat-flume-logger.conf -n a1 -Dflume.root.logger=INFO,console
After it starts, open a second terminal window and connect to port 44444:
telnet localhost 44444
If that errors out, install telnet first:
yum list telnet*           # list telnet-related packages
yum install telnet-server  # install the telnet server
yum install telnet.*       # install the telnet client
Once connected, any characters typed into the telnet session are printed to the console of the agent window started earlier.
exec (monitor via a command, usually tail or cat)
Create a conf file named exec-flume-logger.conf under /opt/flume1.6/conf/job:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
# any file format works; a log file is the usual target
a1.sources.r1.command = tail -f /data/test.log
# the sink again prints to the console
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
After starting the agent, append text to /data/test.log and it shows up on the console.
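For example, from another terminal (a minimal sketch; the path matches the config above):
echo "hello flume" >> /data/test.log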
spooldir (monitor a directory)
Create a conf file named spooldir-flume-hdfs.conf under /opt/flume1.6/conf/job:
a1.sources = r1
a1.channels = c1
a1.sinks = s1
# source type: spooldir
a1.sources.r1.type = spooldir
# the directory to watch
a1.sources.r1.spoolDir = /data/test
a1.sources.r1.deserializer = LINE
a1.sources.r1.deserializer.maxLineLength = 600000
# only pick up files matching this regex
a1.sources.r1.includePattern = test__[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
# channel type: file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
# sink type: hdfs, so results land on HDFS
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.fileType = DataStream
# prefix added to files written to HDFS
a1.sinks.s1.hdfs.filePrefix = test_
# suffix added to files written to HDFS
a1.sinks.s1.hdfs.fileSuffix = .csv
# target HDFS path
a1.sinks.s1.hdfs.path = hdfs://192.168.226.101:9000/data/test/%Y-%m-%d
# use the local timestamp when resolving the path escapes
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.batchSize = 640
a1.sinks.s1.hdfs.rollCount = 0
a1.sinks.s1.hdfs.rollSize = 64000000
a1.sinks.s1.hdfs.rollInterval = 30
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
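Start this agent with the same pattern as before; the -Dflume.root.logger flag is not needed here since the sink is hdfs, not logger:
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/spooldir-flume-hdfs.conf -n a1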
taildir (monitor multiple files or directories; its distinguishing feature is resuming from where it left off)
Note: the TAILDIR source ships with Apache Flume 1.7+; on a stock 1.6 build it is only available if backported (as in some vendor distributions).
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/test1.log
a1.sources.r1.filegroups.f2 = /data/test2.txt
#store inode info so reading resumes after a restart; if unset, a default location is used
a1.sources.r1.positionFile = /opt/position/position.json
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
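Once the agent has read some data, the position file records each tracked file's inode, read offset, and path, which is what makes resuming possible. A sample of what /opt/position/position.json might contain (values are illustrative):
[{"inode":657425,"pos":128,"file":"/data/test1.log"},{"inode":657426,"pos":64,"file":"/data/test2.txt"}]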
avro
The avro source and sink are shown in the load-balancing and failover examples below.
Channels
memory (in-memory storage: fast, but events are lost if the agent dies)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /data/test.log
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
file (local-disk storage: durable, but slower)
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/kb07Flume/flumeFile/test
a1.sources.s1.deserializer = LINE
a1.sources.s1.deserializer.maxLineLength = 60000
a1.sources.s1.includePattern = test_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.batchSize = 640
a1.sinks.k1.brokerList = 192.168.226.101:9092
a1.sinks.k1.topic = test
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
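While this agent runs, a quick way to confirm the file channel is persisting events is to list its checkpoint and data directories (paths from the config above); Flume's checkpoint and log files should appear there:
ls /opt/kb07Flume/flumeFile/checkpoint/test
ls /opt/kb07Flume/flumeFile/data/test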
Sinks
logger (print to the console)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /data/test.log
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
hdfs (write to HDFS)
a1.sources = r1
a1.channels = c1
a1.sinks = s1
# source type: spooldir
a1.sources.r1.type = spooldir
# the directory to watch
a1.sources.r1.spoolDir = /data/test
a1.sources.r1.deserializer = LINE
a1.sources.r1.deserializer.maxLineLength = 600000
# only pick up files matching this regex
a1.sources.r1.includePattern = test__[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
# channel type: file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
# sink type: hdfs, so results land on HDFS
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.fileType = DataStream
# prefix added to files written to HDFS
a1.sinks.s1.hdfs.filePrefix = test_
# suffix added to files written to HDFS
a1.sinks.s1.hdfs.fileSuffix = .csv
# target HDFS path
a1.sinks.s1.hdfs.path = hdfs://192.168.226.101:9000/data/test/%Y-%m-%d
# use the local timestamp when resolving the path escapes
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.batchSize = 640
a1.sinks.s1.hdfs.rollCount = 0
a1.sinks.s1.hdfs.rollSize = 64000000
a1.sinks.s1.hdfs.rollInterval = 30
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
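Once events have flowed through, the output can be verified on HDFS; the dated subdirectory comes from the %Y-%m-%d escape in hdfs.path:
hdfs dfs -ls /data/test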
kafka (write to Kafka)
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/kb07Flume/flumeFile/test
a1.sources.s1.deserializer = LINE
a1.sources.s1.deserializer.maxLineLength = 60000
a1.sources.s1.includePattern = test_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.batchSize = 640
a1.sinks.k1.brokerList = 192.168.226.101:9092
a1.sinks.k1.topic = test
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
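Delivery can be checked with a console consumer. The exact flags depend on your Kafka version; on newer Kafka it would look like this (older 0.8/0.9 clients use --zookeeper instead of --bootstrap-server):
kafka-console-consumer.sh --bootstrap-server 192.168.226.101:9092 --topic test --from-beginning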
file_roll (write to local files)
See the replicating example below.
Channel selector: replicating (copying)
The pattern is one input, multiple outputs. Two channels and two sinks are needed: one copy goes to HDFS, the other is saved to local files. (Replicating is Flume's default channel selector, so no selector.type has to be set.)
Create flume1.conf:
#Name (replicating example)
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
#Source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /data/test1.log
#store inode info so reading resumes after a restart; if unset, a default location is used
a1.sources.r1.positionFile = /opt/position/position1.json
#Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#Sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop1
a1.sinks.k2.port = 4142
#Bind
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
Create flume2.conf:
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop1
a2.sources.r1.port = 4141
#Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.filePrefix = test_
a2.sinks.k1.hdfs.fileSuffix = .csv
a2.sinks.k1.hdfs.path = hdfs://192.168.226.101:9000/data/test/%Y-%m-%d
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.hdfs.batchSize = 640
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.rollSize = 64000000
a2.sinks.k1.hdfs.rollInterval = 30
#Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Create flume3.conf:
a3.sources = r1
a3.channels = c1
a3.sinks = k1
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop1
a3.sources.r1.port = 4142
#Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
#Sink: file_roll writes to a local directory, which must be created beforehand
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /data/file_row
#Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
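Because a1's avro sinks dial into the avro sources of a2 and a3, start the downstream agents first. A plausible startup sequence, one terminal per agent (assuming the files were saved under /opt/flume1.6/conf/job like the earlier examples):
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume3.conf -n a3
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume2.conf -n a2
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume1.conf -n a1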
Failover
The pattern is one channel feeding two sinks in a group; the sink with the higher priority is used first. Here k2 (priority 10) is preferred, and if k2 goes down, k1 takes over automatically.
Create three conf files.
flume1.conf
#Failover
#Name
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
a1.sinkgroups = g1
#Sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop1
a1.sinks.k2.port = 4142
#Sink Group
a1.sinkgroups.g1.sinks = k1 k2
#failover policy
a1.sinkgroups.g1.processor.type = failover
#priorities
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
#Bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
flume2.conf
#Name
a2.sources = r1
a2.sinks = k1
a2.channels = c1
#Sources
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop1
a2.sources.r1.port = 4141
#Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Sink
a2.sinks.k1.type = logger
#Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
flume3.conf
#Name
a3.sources = r1
a3.sinks = k1
a3.channels = c1
#Sources
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop1
a3.sources.r1.port = 4142
#Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
#Sink
a3.sinks.k1.type = logger
#Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
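To exercise the failover (a sketch, with conf paths assumed under /opt/flume1.6/conf/job as before): start a3 and a2 with the console logger flag so their logger sinks are visible, then start a1, and feed events through the netcat source:
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume3.conf -n a3 -Dflume.root.logger=INFO,console
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume2.conf -n a2 -Dflume.root.logger=INFO,console
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume1.conf -n a1
telnet localhost 44444
Events should appear on a3's console (k2 has priority 10). Kill the a3 JVM (find its pid with jps) and subsequent events switch to a2's console.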
Load balancing
The pattern is one source whose events are spread across the sinks in the group (randomly here), so no single sink becomes a hotspot.
First create three conf files.
Create flume1.conf:
#Load balancing
#Name
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
a1.sinkgroups = g1
#Sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop1
a1.sinks.k2.port = 4142
#Sink Group
a1.sinkgroups.g1.sinks = k1 k2
#load-balancing policy
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
#Bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Create flume2.conf:
#Name
a2.sources = r1
a2.sinks = k1
a2.channels = c1
#Sources
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop1
a2.sources.r1.port = 4141
#Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Sink
a2.sinks.k1.type = logger
#Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Create flume3.conf:
#Name
a3.sources = r1
a3.sinks = k1
a3.channels = c1
#Sources
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop1
a3.sources.r1.port = 4142
#Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
#Sink
a3.sinks.k1.type = logger
#Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
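The test mirrors the failover one: start a2 and a3 (with the console logger flag) and then a1, and type lines into the netcat source:
telnet localhost 44444
With the random selector, the lines should spread across the consoles of a2 and a3 rather than all landing on one.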
Custom interceptor
Write a custom interceptor in IDEA, package it, and upload the jar to /opt/flume1.6/lib.
Add the Maven dependency:
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.6.0</version>
</dependency>
package cn.kgc.kb07.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * @author WGY
 * Interceptor: tags each event with a "type" header based on its body.
 */
public class InterceptorDemo implements Interceptor {
    // a list to hold the events after headers have been added
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        // initialization
        addHeaderEvents = new ArrayList<>();
    }

    // process a single event
    @Override
    public Event intercept(Event event) {
        // get the event's headers
        Map<String, String> headers = event.getHeaders();
        // get the event's body
        String body = new String(event.getBody());
        // decide which header to add based on whether the body contains "hello"
        if (body.contains("hello")) {
            headers.put("type", "1");
        } else {
            headers.put("type", "2");
        }
        // return the tagged event (returning null would drop it)
        return event;
    }

    // process a batch of events
    @Override
    public List<Event> intercept(List<Event> list) {
        // 1. clear the list
        addHeaderEvents.clear();
        // 2. iterate over the batch
        for (Event event : list) {
            // 3. add headers to each event
            addHeaderEvents.add(intercept(event));
        }
        // return the result
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
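A typical packaging-and-deploy sequence (the jar name depends on your pom, so it is shown as a placeholder):
mvn clean package
cp target/<your-artifact>.jar /opt/flume1.6/lib/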
Then create a conf file that registers the interceptor and routes events by the "type" header it sets:
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# register the custom interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = cn.kgc.kb07.flume.InterceptorDemo$Builder
# multiplexing selector: route on the "type" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
# the interceptor sets type to "1" or "2", so map those values to the channels
a1.sources.r1.selector.mapping.1 = c1
a1.sources.r1.selector.mapping.2 = c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = gree
a1.sinks.k1.hdfs.fileSuffix = .csv
a1.sinks.k1.hdfs.path = hdfs://192.168.226.101:9000/data/greedemo/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 640
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.filePrefix = lijia
a1.sinks.k2.hdfs.fileSuffix = .csv
a1.sinks.k2.hdfs.path = hdfs://192.168.226.101:9000/data/lijiademo/%Y-%m-%d
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.batchSize = 640
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 100
a1.sinks.k2.hdfs.rollInterval = 3
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
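Start the agent as before and feed it data through the netcat source (interceptor-demo.conf is a hypothetical name; use whatever you saved the file as):
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/interceptor-demo.conf -n a1
telnet localhost 44444
Lines containing "hello" should end up under /data/greedemo on HDFS, and everything else under /data/lijiademo.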