I. Flume overview
Apache Flume, originally developed at Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. Website: http://flume.apache.org/
At its core, Apache Flume collects data from a data source (source) and delivers it to a specified destination (sink). To guarantee delivery, the data is buffered (channel) before being sent to the sink; only after the data has actually reached the sink does Flume delete its buffered copy.
The core role in a Flume system is the agent. An agent is a Java process that typically runs on a log-collection node.
Each agent acts as a data courier and contains three components:
• Source: the collection endpoint; connects to the data source and ingests data;
• Sink: the delivery endpoint; forwards collected data to the next-level agent or to the final storage system;
• Channel: the data transfer conduit inside the agent; moves data from the source to the sink;
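The hand-off described above — the source fills the channel, the sink drains it, and the buffered copy is deleted only after delivery succeeds — can be sketched as a toy model (plain Python, all names hypothetical; real Flume wraps these steps in channel transactions):

```python
from collections import deque

# Toy model of a Flume agent, for illustration only.
class ToyAgent:
    def __init__(self):
        self.channel = deque()   # buffers events between source and sink
        self.delivered = []      # stands in for the final destination

    def source_put(self, event):
        self.channel.append(event)   # source -> channel

    def sink_take(self):
        event = self.channel[0]      # peek, don't remove yet
        self.delivered.append(event) # "send" to the destination
        self.channel.popleft()       # delete the buffered copy only after delivery

agent = ToyAgent()
agent.source_put("log line 1")
agent.source_put("log line 2")
agent.sink_take()
print(agent.delivered)     # ['log line 1']
print(len(agent.channel))  # 1 event still buffered
```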
1. Common sources
(1) Avro (typically used when chaining multiple Flume agents end to end, acting as the bridge between a sink and the next source; enables cross-server transfer and multi-hop flows)
(2) Exec: tails a single specified file (e.g. tail -F xxx.log)
(3) Taildir *: monitors a directory; regular expressions can match files in the directory for real-time collection; supports resuming from a checkpoint
(4) Kafka *
(5) Spooling Directory: monitors all files under a directory (no checkpoint/resume support; certain files can be excluded; rarely used)
2. Common channels
(channels provide transactions and buffering)
(1) Memory * (fast reads/writes, small capacity, weaker durability)
(2) JDBC: uses the embedded Derby database
(3) File: on disk (relatively slow reads/writes, large capacity, better durability)
(4) Kafka *
3. Common sinks
(1) HDFS
(2) Logger
(3) Avro: converts data into Avro events and sends them to an RPC port (typically used when chaining multiple Flume agents end to end, as the bridge between a sink and the next source)
(4) File: local file system
(5) HBase
(6) ES
(7) Kafka
II. Installing and deploying Flume
1. Upload the installation package; install on the node1 machine
2. Extract and install
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /export/server/
cd /export/server
ln -s apache-flume-1.9.0-bin flume
chown -R root:root flume/
3. Edit the Flume environment file
cd /export/server/flume/conf/
mv flume-env.sh.template flume-env.sh
vim flume-env.sh
# edit line 22
export JAVA_HOME=/export/server/jdk
# edit line 35
export HADOOP_HOME=/export/server/hadoop
4. Integrate with HDFS: copy the HDFS configuration files
cp /export/server/hadoop/etc/hadoop/core-site.xml /export/server/hadoop/etc/hadoop/hdfs-site.xml /export/server/flume/conf/
5. Remove the guava jar bundled with Flume and replace it with Hadoop's
# remove the older jar
rm -rf /export/server/flume/lib/guava-11.0.2.jar
# copy in the newer jar
cp /export/server/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /export/server/flume/lib/
III. Flume examples
1. (exec-memory-logger) Collect a log file, buffer it in memory, print to the console
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /export/server/flume/datas/test.log
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
#define the sink
a1.sinks.k1.type = logger
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
2. (exec-memory-hdfs) Collect log data, buffer it in memory, write to HDFS
#define the agent and the names of its source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1: where to read data from
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /export/server/flume/datas/test.log
#define c1: where to buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
#define k1: where to send the data
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node1.itcast.cn:8020/flume/test1
#prefix for generated files
a1.sinks.k1.hdfs.filePrefix = nginx
#suffix for generated files
a1.sinks.k1.hdfs.fileSuffix = .log
#type of file written to HDFS: plain data stream
a1.sinks.k1.hdfs.fileType = DataStream
# Problem: by default Flume produces many small files (around 1 KB each) on HDFS, which is bad for HDFS storage
# File rolling policy:
#roll files on a time interval; usually disabled
a1.sinks.k1.hdfs.rollInterval = 0
#roll files by size; usually the byte count for 120-125 MB (kept small here for demonstration)
a1.sinks.k1.hdfs.rollSize = 1024
#roll files by event count; usually disabled
a1.sinks.k1.hdfs.rollCount = 0
#which channel s1 writes to
a1.sources.s1.channels = c1
#which channel k1 reads from
a1.sinks.k1.channel = c1
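The roll-by-size guideline above aims each rolled file just under the common 128 MB HDFS block size; the 1024 in this demo config is deliberately tiny. A quick calculation of a production-style value (the 131072000 figure is derived here, not taken from the source):

```python
# rollSize is given in bytes; 120-125 MB keeps each rolled file
# just under the common 128 MB HDFS block size.
hdfs_block = 128 * 1024 * 1024   # 134217728 bytes
roll_size = 125 * 1024 * 1024    # 131072000 bytes

print(roll_size)               # 131072000 -> a1.sinks.k1.hdfs.rollSize = 131072000
print(roll_size < hdfs_block)  # True: a rolled file never spills into a second block
```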
3. (exec-memory-kafka) Collect log data, buffer it in memory, send to Kafka
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1: where to read data from
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /export/server/flume/datas/test.log
#define c1: where to buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
#define k1: where to send the data
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092
a1.sinks.k1.kafka.topic = topic01
#which channel s1 writes to
a1.sources.s1.channels = c1
#which channel k1 reads from
a1.sinks.k1.channel = c1
IV. Sink processors
1. Load-balancing sink processor
Goal: collect a specified file on node1 and use a load-balancing sink processor to send the data to node2 and node3
(1) (TAILDIR-file-avro) Collect the log file on node1; configuration:
# load balancing lets k1 and k2 take data from the channel in turn
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
#define the source
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1
#the source file being monitored
a1.sources.s1.filegroups.f1 = /export/server/flume/logs/test.log
a1.sources.s1.channels = c1
#define the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
a1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define the sink
# sink group
a1.sinkgroups.g1.sinks = k1 k2
# configure the processor: load balancing
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
# how the processor distributes data (default round_robin; random is also available)
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
#a sink that sends data to node2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.batchSize = 1
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 1234
#a sink that sends data to node3
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.batchSize = 1
a1.sinks.k2.hostname = node3
a1.sinks.k2.port = 1234
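With the round_robin selector, the processor simply alternates between the sinks in group g1; a minimal sketch of that selection order (sink names from the config above, logic simplified, not Flume's actual implementation):

```python
from itertools import cycle

sinks = ["k1", "k2"]      # the sink group g1
selector = cycle(sinks)   # round_robin: alternate between the sinks

# the first four batches go k1, k2, k1, k2
order = [next(selector) for _ in range(4)]
print(order)  # ['k1', 'k2', 'k1', 'k2']
```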
(2) (avro-file-logger) Configuration for node2 and node3:
agent1.sources=r1
agent1.channels=c1
agent1.sinks=k1
#define an avro source
agent1.sources.r1.type = avro
agent1.sources.r1.channels = c1
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 1234
#define a file channel
agent1.channels.c1.type = file
#channel checkpoint location
agent1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
#where the channel stores its data
agent1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define a logger sink
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
(3) Start the Flume services on node2 and node3 first, then start the one on node1
2. Failover sink processor
Goal: collect a specified file on node1 and use a failover sink processor to send the data to node2 and node3
(1) (TAILDIR-file-avro) Collect the log file on node1; configuration:
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
#define the source
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1
a1.sources.s1.filegroups.f1 = /export/server/flume-1.9.0/logs/test.log
a1.sources.s1.channels = c1
#define the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
a1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define the sink
# sink group
a1.sinkgroups.g1.sinks = k1 k2
# configure the processor: failover
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
#maximum backoff (ms) applied to a failed higher-priority sink; if it recovers within this window, traffic is not shifted to the lower-priority sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
#a sink that sends data to node2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.batchSize = 1
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 1234
#a sink that sends data to node3
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.batchSize = 1
a1.sinks.k2.hostname = node3
a1.sinks.k2.port = 1234
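Under failover, all events go to the highest-priority live sink (k2 here, priority 10), and lower-priority sinks are used only when every higher-priority one is down. A simplified sketch of that selection rule (not Flume's actual implementation):

```python
# priorities from the config above
priorities = {"k1": 5, "k2": 10}
alive = {"k1": True, "k2": True}

def pick_sink():
    # highest-priority sink that is still alive
    candidates = [s for s in priorities if alive[s]]
    return max(candidates, key=lambda s: priorities[s])

print(pick_sink())   # 'k2' (priority 10 beats 5)
alive["k2"] = False  # node3 goes down
print(pick_sink())   # failover to 'k1'
```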
(2) (avro-file-logger) Configuration for node2 and node3:
agent1.sources=r1
agent1.channels=c1
agent1.sinks=k1
#define an avro source
agent1.sources.r1.type = avro
agent1.sources.r1.channels = c1
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 1234
#define a file channel
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
agent1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define a logger sink
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
(3) Start the Flume services on node2 and node3 first, then start the one on node1
V. Flume interceptors
1. Built-in interceptors
(1) Timestamp interceptor
Inserts a timestamp into the event headers
(2) Host interceptor
Inserts the host name into the headers
(3) Static interceptor
Inserts a specified key/value pair into the headers (to add multiple pairs, configure multiple static interceptors)
(4) Search and replace interceptor
Replaces characters matched by a regular expression
(5) Regex filter interceptor
Filters out events containing characters matched by a regular expression
(6) Regex extractor interceptor
Uses capture groups (i.e. parentheses) in a regular expression to pull information out of the body and place it in the headers
2. Built-in interceptor example
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define the source
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1
#the source file being monitored
a1.sources.s1.filegroups.f1 = /export/server/flume/logs/test.log
#to use several interceptor types, configure one per type, separated by spaces
a1.sources.s1.interceptors = i1 i2 i3 i5 i6
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i3.type = static
#key/value for the static interceptor
a1.sources.s1.interceptors.i3.key = name
a1.sources.s1.interceptors.i3.value = xt
#a1.sources.s1.interceptors.i4.type = search_replace
#pattern to search for
#a1.sources.s1.interceptors.i4.searchPattern = [a-z]
#replacement string
#a1.sources.s1.interceptors.i4.replaceString = *
a1.sources.s1.interceptors.i5.type = regex_filter
#pattern to filter on
a1.sources.s1.interceptors.i5.regex = ^A.*
a1.sources.s1.interceptors.i5.excludeEvents = true
a1.sources.s1.interceptors.i6.type = regex_extractor
a1.sources.s1.interceptors.i6.regex = (^[a-zA-Z]*)\\s([0-9]*$)
#extract two strings via the regex capture groups
a1.sources.s1.interceptors.i6.serializers = ser1 ser2
#name the two extracted strings
a1.sources.s1.interceptors.i6.serializers.ser1.name = word
a1.sources.s1.interceptors.i6.serializers.ser2.name = num
a1.sources.s1.channels = c1
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#define the sink
#define a logger sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
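The i6 regex_extractor splits a body such as `hello 123` into two capture groups and stores them in the headers under the serializer names word and num (the `\\s` in the properties file is simply an escaped `\s`). A quick check of what it extracts, sketched in Python:

```python
import re

# same pattern as i6 (the properties-file "\\s" is the regex \s)
pattern = re.compile(r"(^[a-zA-Z]*)\s([0-9]*$)")

headers = {}
body = "hello 123"
m = pattern.match(body)
if m:
    headers["word"] = m.group(1)  # serializer ser1
    headers["num"] = m.group(2)   # serializer ser2
print(headers)  # {'word': 'hello', 'num': '123'}
```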
3. Custom interceptor with a multiplexing channel selector
Goal: check whether a log line contains the string "spring". In the event headers, mark such events type_spring and all others type_other; type_spring events are sent to node2, the rest to node3.
(1) Add the dependency to the Maven project
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>
(2) Implement the Interceptor interface
Purpose: check whether a log line contains the string "spring"; if so, mark the event headers with type_spring, otherwise type_other.
public class WordInterceptor implements Interceptor {
//a list to hold the events processed by the interceptor
private List<Event> events;
public void initialize() {
//initialize the list
events = new ArrayList<Event>();
}
/**
* Handles a single event
*/
public Event intercept(Event event) {
//get the event's headers and body
Map<String, String> headers = event.getHeaders();
byte[] body = event.getBody();
//convert the raw body to a String
String bodyStr = new String(body);
//classify by whether the body contains "spring"
if (bodyStr.contains("spring")) {
headers.put("type", "type_spring");
} else {
headers.put("type", "type_other");
}
return event;
}
/**
* Handles a batch of events
*/
public List<Event> intercept(List<Event> list) {
events.clear();
for (Event event : list) {
events.add(intercept(event)); //delegate to the single-event method
}
return events;
}
public void close() {
}
/**
* This Builder is required so the custom interceptor can be
* referenced from the Flume configuration file
*/
public static class Builder implements Interceptor.Builder {
public Interceptor build() {
return new WordInterceptor();
}
public void configure(Context context) {
}
}
}
(3) Build the jar and upload it to the server
(4) Configure the Flume file on node1
# use the custom interceptor to route the data into groups
#define the agent
a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2
#define the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.88.151
a1.sources.s1.port = 2781
a1.sources.s1.interceptors = i1 i2
a1.sources.s1.interceptors.i1.type = org.example.WordInterceptor$Builder
# second interceptor
a1.sources.s1.interceptors.i2.type = timestamp
#multiplexing channel selector (routes each event to a channel based on the value of the type header)
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = type
a1.sources.s1.selector.mapping.type_spring = c1
a1.sources.s1.selector.mapping.type_other = c2
a1.sources.s1.selector.default = c2
a1.sources.s1.channels = c1 c2
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#define the sink
#a sink that sends data to node2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.batchSize = 1
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 1234
#a sink that sends data to node3
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.batchSize = 1
a1.sinks.k2.hostname = node3
a1.sinks.k2.port = 1234
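Putting the pieces together: the custom interceptor stamps the type header, and the multiplexing selector maps that value to a channel, falling back to the default c2. A simplified sketch of the routing decision (logic only, not Flume's API):

```python
# selector config from above, as data
mapping = {"type_spring": "c1", "type_other": "c2"}
default_channel = "c2"

def route(body):
    # what WordInterceptor does: stamp the type header
    header_type = "type_spring" if "spring" in body else "type_other"
    # what the multiplexing selector does: header value -> channel
    return mapping.get(header_type, default_channel)

print(route("spring boot started"))  # 'c1' -> avro sink k1 -> node2
print(route("plain log line"))       # 'c2' -> avro sink k2 -> node3
```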
(5) Configure the Flume files on node2 and node3
agent1.sources=r1
agent1.channels=c1
agent1.sinks=k1
#define an avro source
agent1.sources.r1.type = avro
agent1.sources.r1.channels = c1
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 1234
#define a file channel
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
agent1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define a logger sink
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1