Collecting Logs with Flume and Writing Them to HDFS (Data Warehouse Project)

Knight_AL

Article link: https://blog.csdn.net/qq_46548855/article/details/113835651

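This article walks through real-time log collection with the Taildir Source in Flume 1.8.0, including the monitored directories and the checkpoint (position) file. Because the log time and the system time do not always match, it also presents a custom interceptor that extracts the time from each log line and stores it in the event header, so that data is written to HDFS by log time. The steps for creating, configuring, and packaging the custom interceptor are included.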

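Choosing a Flume version

Flume 1.6

Neither Spooling Directory Source nor Exec Source can satisfy the requirement for dynamic, real-time collection.
Flume 1.7+
Provides the very convenient Taildir Source.
With this source you can monitor a directory and collect, in real time, the files whose names match a regular expression.

Taildir Source monitors a batch of files in real time and records the latest consumed position of each file, so events are not consumed again after the agent process restarts.
Flume 1.8.0 is the recommended version, because it fixes a Taildir Source bug that could lose data.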
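
To make the "record the latest consumed position" idea concrete, here is a minimal in-memory sketch. It is not Flume's implementation: the class name PositionTracker and its methods are hypothetical, file content is passed in as a string instead of being read from disk, and positions are counted in characters instead of bytes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-file position tracking (not Flume code).
public class PositionTracker {

    // file path -> index of the next unread character
    private final Map<String, Integer> positions = new HashMap<>();

    public PositionTracker() {
    }

    // Restore previously saved positions, as an agent would after a restart.
    public PositionTracker(Map<String, Integer> saved) {
        positions.putAll(saved);
    }

    // Return the complete lines added since the last poll and advance the position.
    public List<String> poll(String path, String content) {
        int pos = positions.getOrDefault(path, 0);
        List<String> lines = new ArrayList<>();
        int newline;
        while ((newline = content.indexOf('\n', pos)) >= 0) {
            lines.add(content.substring(pos, newline));
            pos = newline + 1;
        }
        positions.put(path, pos);
        return lines;
    }

    // Copy of the current positions; this plays the role of the saved checkpoint.
    public Map<String, Integer> snapshot() {
        return new HashMap<>(positions);
    }
}
```

A tracker rebuilt from snapshot() only returns lines appended after the saved position, which is why a restart does not re-consume old data.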

Project workflow

[Figure: project workflow diagram]

Core Flume configuration

Taildir Source configuration

a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /export/servers/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /export/data/test1/example.log
a1.sources.r1.filegroups.f2 = /export/data/test2/.*log.*

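Configuration notes

filegroups
- Names the file groups; there may be several, separated by spaces (Taildir Source can tail files in multiple directories at the same time).
positionFile
- Path of the checkpoint file. It stores, in JSON format, the position already read in each tailed file, which is what allows collection to resume where it left off.
filegroups.<filegroupName>
- Absolute path of the files in each file group; the file name part may be a regular expression.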
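
The matching rule for a file group can be sketched as follows, assuming (as the Flume user guide describes) that the regular expression applies to the file name only and the directory part is a literal path. The class FileGroupMatcher is hypothetical and only illustrates the rule; it expects an absolute path containing at least one "/".

```java
import java.util.regex.Pattern;

// Hypothetical helper: literal parent directory + regex on the file name.
public class FileGroupMatcher {

    private final String parentDir;
    private final Pattern fileNamePattern;

    // filegroupPath is a value such as /export/data/test2/.*log.*
    public FileGroupMatcher(String filegroupPath) {
        int slash = filegroupPath.lastIndexOf('/');
        this.parentDir = filegroupPath.substring(0, slash);
        this.fileNamePattern = Pattern.compile(filegroupPath.substring(slash + 1));
    }

    // True when the file sits directly in the parent directory
    // and its whole name matches the regular expression.
    public boolean matches(String absolutePath) {
        int slash = absolutePath.lastIndexOf('/');
        if (slash < 0) {
            return false;
        }
        String dir = absolutePath.substring(0, slash);
        String name = absolutePath.substring(slash + 1);
        return dir.equals(parentDir) && fileNamePattern.matcher(name).matches();
    }
}
```

With the f2 value from the configuration, /export/data/test2/app.log and /export/data/test2/app.log.1 match, while /export/data/test2/data.txt does not. Note that the "." in example.log is also a regex wildcard, so f1 would match a name like exampleXlog as well.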

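Sink configuration
The logs are collected into HDFS this time, so the HDFS Sink is used. The HDFS Sink needs its roll properties configured.

Based on the HDFS replica count
Property: hdfs.minBlockReplicas
Default: the same as the HDFS replication factor
Explanation
Setting hdfs.minBlockReplicas keeps Flume from reacting to HDFS block replication, so the configured roll policy (time interval, file size, event count, and so on) is not disturbed.
Example:
Suppose the HDFS replication factor is 3 and the roll interval is 10 seconds. If at second 2 Flume detects that HDFS is still replicating a block, it rolls the file right away, which overrides the configured roll policy. Setting hdfs.minBlockReplicas to 1 is therefore common: Flume no longer reacts to the replication, while HDFS still keeps 3 replicas.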

Complete configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /export/servers/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /export/data/test1/example.log
a1.sources.r1.filegroups.f2 = /export/data/test2/.*log.*

# Describe the sink
# Use the HDFS sink
a1.sinks.k1.type = hdfs
# HDFS directory, with time information
a1.sinks.k1.hdfs.path = /flume/tailout/%Y-%m-%d/
# Prefix of the generated HDFS file names
a1.sinks.k1.hdfs.filePrefix = events-
# Roll interval in seconds; the default is 30, and 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
# Roll size in bytes; 0 disables size-based rolling
a1.sinks.k1.hdfs.rollSize = 200000000
# Roll after this many events; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Replica policy
a1.sinks.k1.hdfs.minBlockReplicas=1
# File type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
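
The three roll settings in the configuration above combine as "roll when any enabled condition is met, where 0 disables a condition". The sketch below only illustrates that rule; the class RollPolicy and its method are hypothetical and are not how the HDFS sink is actually coded.

```java
// Hypothetical model of the rollInterval / rollSize / rollCount rule.
public class RollPolicy {

    private final long rollIntervalSeconds; // hdfs.rollInterval, 0 = disabled
    private final long rollSizeBytes;       // hdfs.rollSize, 0 = disabled
    private final long rollCount;           // hdfs.rollCount, 0 = disabled

    public RollPolicy(long rollIntervalSeconds, long rollSizeBytes, long rollCount) {
        this.rollIntervalSeconds = rollIntervalSeconds;
        this.rollSizeBytes = rollSizeBytes;
        this.rollCount = rollCount;
    }

    // True as soon as any enabled threshold has been reached.
    public boolean shouldRoll(long openSeconds, long bytesWritten, long eventsWritten) {
        boolean byTime = rollIntervalSeconds > 0 && openSeconds >= rollIntervalSeconds;
        boolean bySize = rollSizeBytes > 0 && bytesWritten >= rollSizeBytes;
        boolean byCount = rollCount > 0 && eventsWritten >= rollCount;
        return byTime || bySize || byCount;
    }
}
```

With new RollPolicy(0, 200000000L, 0), as configured here, a file only rolls once roughly 200 MB have been written, no matter how long it has been open. This avoids many small files, at the cost of data sitting in an open file for longer.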

Start the Flume agent to collect data

Create the directories

mkdir -p /export/data/test1/
mkdir -p /export/data/test2/

Upload the test data to the directories created above.

Link: https://pan.baidu.com/s/1jV49R7DLHbXilKj4JcoGLg
Extraction code: pv5l

Flume start command

bin/flume-ng agent --conf conf --conf-file conf/log2hdfs.conf --name a1 -Dflume.root.logger=INFO,console

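Question to consider: is the HDFS path correct?

Problem
With the agent configuration above, data can end up under the wrong path; it should be stored according to the time in the log.
The log time does not match the system time: hdfs.useLocalTimeStamp = true makes the sink fill in %Y-%m-%d from the agent's local clock.
(For example, a log line stamped 23:59:59 that passes through Flume and reaches HDFS after midnight ends up in the next day's directory.)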
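
A small sketch of the mismatch, using made-up timestamps: the same event is placed in different directories depending on whether the path is derived from the log time or from the arrival time. The class DateBucket is hypothetical; only the /flume/tailout/ prefix comes from the configuration above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that mimics the %Y-%m-%d part of hdfs.path.
public class DateBucket {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Directory that a given timestamp would be written to.
    public static String pathFor(LocalDateTime timestamp) {
        return "/flume/tailout/" + timestamp.format(DAY) + "/";
    }

    public static void main(String[] args) {
        // Example values only: the line was logged just before midnight ...
        LocalDateTime logTime = LocalDateTime.of(2021, 2, 17, 23, 59, 59);
        // ... and reached the sink a few seconds later, on the next day.
        LocalDateTime arrivalTime = LocalDateTime.of(2021, 2, 18, 0, 0, 2);

        System.out.println("by log time:     " + pathFor(logTime));
        System.out.println("by arrival time: " + pathFor(arrivalTime));
    }
}
```

Only the first path is the one a daily partition in the warehouse expects, which is the motivation for carrying the log time along with the event.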

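Custom Flume interceptor

Implementation steps

Create a Maven project
Create a class that implements the Interceptor interface provided by Flume
- Implement the required methods
  the intercept methods
  define a static nested class that implements the Interceptor.Builder interface
Package it as a jar and upload it to the lib folder of the Flume installation directory
Write the Flume agent configuration file so that it references the date information in the header
Implementation details
Create a Maven Java project and add the dependency (flume-ng-core has provided scope, so its version should ideally match the installed Flume)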

    <dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>


    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


</project>

Here are two sample log lines:
[Figure: screenshot of two sample log lines]

The custom Flume interceptor

package com.donglin;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CustomTimeInterceptor implements Interceptor {

    // Initialization; nothing to set up
    @Override
    public void initialize() {

    }

    // Intercepts a single event
    /*
    1 Parse the time out of the event
    2 Store the date (year-month-day) in the header
     */
    @Override
    public Event intercept(Event event) { // an event has a header and a body; the data is in the body
        byte[] body = event.getBody();
        String msg = new String(body, StandardCharsets.UTF_8);

        // Split on spaces
        String[] arr = msg.split(" ");
        // Some lines are dirty and do not split into enough fields, so check the length first
        String eventTime = "";
        if (arr.length > 11){
            eventTime = arr[4];
        }else {
            eventTime = "unknown";
        }

        // Get the event's headers and put the event time into them
        Map<String, String> headers = event.getHeaders();
        headers.put("event_time",eventTime);

        // Store the headers back into the event
        event.setHeaders(headers);
        return event;
    }

    // Intercepts a batch of events
    @Override
    public List<Event> intercept(List<Event> events) {

        // List that receives the intercepted events
        ArrayList<Event> list = new ArrayList<>();
        // Loop over the events and apply the single-event logic to each
        for (Event event : events) {
            Event newEvent = intercept(event);
            list.add(newEvent);
        }
        return list;
    }

    // Cleanup; nothing to release
    @Override
    public void close() {

    }

    // Flume calls this builder to obtain the custom interceptor
    public static class Builder implements Interceptor.Builder{

        // Just return a custom interceptor instance
        @Override
        public Interceptor build() {
            return new CustomTimeInterceptor();
        }

        @Override
        public void configure(Context context) {

        }

    }

}
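
The parsing rule inside intercept can be tried out without a Flume installation by pulling it into a plain static method. This is only a sketch: the class EventTimeParser is hypothetical, the fallback value is passed in as a parameter, and the sample line in main is invented to have twelve space-separated fields, since the real log format is only shown in a screenshot.

```java
// Hypothetical stand-alone version of the interceptor's parsing rule.
public class EventTimeParser {

    // Return field 4 when the line splits into more than 11 fields, else the fallback.
    public static String extractEventTime(String line, String fallback) {
        String[] fields = line.split(" ");
        return fields.length > 11 ? fields[4] : fallback;
    }

    public static void main(String[] args) {
        // Invented sample with 12 fields; the fifth one stands in for the date.
        String good = "f0 f1 f2 f3 2021-02-17 f5 f6 f7 f8 f9 f10 f11";
        String dirty = "truncated line";

        System.out.println(extractEventTime(good, "fallback"));
        System.out.println(extractEventTime(dirty, "fallback"));
    }
}
```

Because the returned value becomes a directory name in HDFS, it is worth checking on real data that field 4 contains only the date and no characters such as ":" or "/".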

Packaging
Copy the built jar into Flume's lib folder (/flume/lib) first.

Then add the following to the original configuration file:

 #interceptor
 a1.sources.r1.interceptors =i1 
 a1.sources.r1.interceptors.i1.type =com.donglin.CustomTimeInterceptor$Builder

Change

# HDFS directory, with time information
a1.sinks.k1.hdfs.path = /flume/tailout/%Y-%m-%d/

to

# HDFS directory, named after the event_time header
a1.sinks.k1.hdfs.path = /flume/tailout/%{event_time}/
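
What the %{event_time} placeholder does can be pictured as a simple substitution of header values into the path. The sketch below is an assumption-laden illustration, not Flume's own path-resolution code: the class HeaderPathResolver is hypothetical, and replacing a missing header with an empty string is a choice made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of %{header} substitution in a sink path.
public class HeaderPathResolver {

    // Matches %{name} and captures the header name.
    private static final Pattern HEADER_REF = Pattern.compile("%\\{([^}]+)\\}");

    public static String resolve(String pathTemplate, Map<String, String> headers) {
        Matcher m = HEADER_REF.matcher(pathTemplate);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption for this sketch: a missing header becomes an empty string.
            String value = headers.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

An event whose header carries event_time=2021-02-17 would resolve to /flume/tailout/2021-02-17/, regardless of when it reaches the sink.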

Final configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /export/servers/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /export/data/test1/example.log
a1.sources.r1.filegroups.f2 = /export/data/test2/.*log.*

#interceptor
a1.sources.r1.interceptors =i1 
a1.sources.r1.interceptors.i1.type =com.donglin.CustomTimeInterceptor$Builder

# Describe the sink
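# Use the HDFS sink
a1.sinks.k1.type = hdfs
# HDFS directory, named after the event_time header set by the interceptor
a1.sinks.k1.hdfs.path = /flume/tailout/%{event_time}/
# Prefix of the generated HDFS file names
a1.sinks.k1.hdfs.filePrefix = events-
# Roll interval in seconds; the default is 30, and 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
# Roll size in bytes; 0 disables size-based rolling
a1.sinks.k1.hdfs.rollSize = 200000000
# Roll after this many events; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Replica policy
a1.sinks.k1.hdfs.minBlockReplicas=1
# File type; the default is SequenceFile, DataStream writes plain text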
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1