文章目录
一 使用flume将日志文件收集到HDFS
logger server – flume读数据 – kafka – flume – hdfs
第一层flume:taildir source – memory channel(或者file channel) – kafka sink 【传统架构】【√】
taildir source – kafka channel 【使用kafka channel】
第二层flume:kafka source – memory channel(或者file channel) – hdfs sink 【传统架构】【√】需要添加拦截器,如果没有source,没有办法添加拦截器
kafka channel – hdfs sink 【使用kafka channel】
1 第一层flume实现过程(采集日志flume)
在实现第一层flume之前,需要进行数据清洗,将读取到的不符合规则的数据去除掉
(1)java实现过程
创建maven工程,编辑依赖信息(pom.xml)
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.hike.gmall</groupId>
    <!-- NOTE(review): "Colllect" (triple l) is kept as-is because the packaged
         jar name Colllect-1.0-SNAPSHOT-jar-with-dependencies.jar is referenced
         later in this document. -->
    <artifactId>Colllect</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <!-- The original POM declared maven-compiler-plugin twice with
                 conflicting source/target (7 vs 1.8). Duplicate plugin
                 declarations are invalid; merged into a single Java 8 entry. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <!-- Build a fat jar (…-jar-with-dependencies.jar) so fastjson is
                 bundled when the interceptor jar is dropped into flume/lib. -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <!-- provided: Flume supplies its own core classes at runtime, so the
             fat jar must not bundle them. -->
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
            <scope>provided</scope>
        </dependency>
        <!-- compile (bundled): fastjson is NOT on Flume's classpath, so it must
             be packed into the jar-with-dependencies. -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.62</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>
</project>
编辑代码
package com.hike.gmall.interceptor;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONException;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

/**
 * ETL interceptor for the first-tier Flume agent: drops events whose body is
 * not complete, parseable JSON, so malformed log lines never reach Kafka.
 *
 * <p>Registered in the agent config as
 * {@code com.hike.gmall.interceptor.ETLLogInterceptor$MyBuilder}.
 */
public class ETLLogInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No resources to initialize.
    }

    /**
     * Validates a single event.
     *
     * @param event the incoming Flume event
     * @return the event unchanged when its body parses as JSON, or
     *         {@code null} to signal that it should be dropped
     */
    @Override
    public Event intercept(Event event) {
        // Decode the body explicitly as UTF-8 rather than the platform charset.
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        // Use fastjson to verify the record is complete, well-formed JSON.
        try {
            JSON.parseObject(body);
        } catch (JSONException e) {
            // Incomplete/corrupt record: drop it.
            return null;
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        // An enhanced for-loop cannot remove elements while iterating;
        // use an explicit Iterator so Iterator.remove() is available.
        Iterator<Event> iterator = list.iterator();
        while (iterator.hasNext()) {
            if (intercept(iterator.next()) == null) {
                iterator.remove();
            }
        }
        return list;
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Builder instantiated by Flume from the agent configuration. */
    public static class MyBuilder implements Builder {
        @Override
        public Interceptor build() {
            return new ETLLogInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configurable options.
        }
    }
}
打包,选择带依赖的jar包 Colllect-1.0-SNAPSHOT-jar-with-dependencies.jar
上传到/opt/module/flume-1.9.0/lib目录下
(2)编写配置信息
# First-tier Flume agent: TAILDIR source -> Kafka channel.
# No sink is defined: the Kafka channel itself writes events into topic_log.
a1.sources = r1
a1.channels = c1
# TAILDIR tails every file matching app.* and supports resume-on-restart.
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
# Read offsets are persisted here so a restart does not re-read old data.
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/jobs/position/position.json
# ETL interceptor (from the jar uploaded to flume/lib) drops non-JSON events.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.hike.gmall.interceptor.ETLLogInterceptor$MyBuilder
# Kafka channel replaces channel+sink: events go straight into the topic.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = topic_log
# false => store the raw event body only (no Flume Avro wrapper), so plain
# Kafka consumers and the second-tier agent can read the messages directly.
a1.channels.c1.parseAsFlumeEvent = false
a1.sources.r1.channels = c1
/opt/module/flume-1.9.0/jobs/gmall
vim logserver-flume-kafka.conf
#启动消费者
kafka-console-consumer.sh --topic topic_log --bootstrap-server hadoop101:9092
#启动flume采集程序,查看消费者是否采集到数据
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/gmall/logserver-flume-kafka.conf -n a1 -Dflume.root.logger=INFO,console
#可以再执行log脚本生产一些数据,查看是否可以采集到
(3)第一层flume起停脚本
#!/bin/bash
# f1.sh -- start/stop the first-tier Flume agents (TAILDIR -> Kafka channel)
# on hadoop101 and hadoop102.
if [ $# -lt 1 ]
then
  echo "USAGE: f1.sh {start|stop}"
  # Exit non-zero so callers can detect the usage error.
  exit 1
fi
case $1 in
start)
  for i in hadoop101 hadoop102
  do
    # \$FLUME_HOME is escaped so it expands on the REMOTE host (my_env.sh was
    # distributed to every node), not on the machine running this script.
    ssh $i "nohup flume-ng agent -c \$FLUME_HOME/conf -f \$FLUME_HOME/jobs/gmall/logserver-flume-kafka.conf -n a1 -Dflume.root.logger=INFO,console 1>\$FLUME_HOME/logs/flume.log 2>&1 &"
  done
;;
stop)
  for i in hadoop101 hadoop102
  do
    # Locate the agent by its config file name and kill it.
    ssh $i "ps -ef | grep logserver-flume-kafka.conf | grep -v grep | awk '{print \$2}' | xargs -n1 kill -9"
  done
;;
*)
  echo "USAGE: f1.sh {start|stop}"
  exit 1
;;
esac
/opt/module/flume-1.9.0
mkdir logs
cd ~/bin
vim f1.sh
chmod u+x f1.sh
scp -r /opt/module/flume-1.9.0/ hadoop102:/opt/module/
scp /etc/profile.d/my_env.sh root@hadoop102:/etc/profile.d/
2 第二层flume实现过程(消费kafka数据flume)
第二层flume:kafka source – memory channel(或者file channel) – hdfs sink 【传统架构】【√】需要添加拦截器,如果没有source,没有办法添加拦截器
kafka channel – hdfs sink 【使用kafka channel】
为保证分区所用的时间就是用户行为日志的生成时间(因数据在传输过程中也会耗费时间),所以需要在kafka source中添加拦截器,使用日志自带的时间戳,不使用本地时间戳。
(1)拦截器实现过程
package com.hike.gmall.interceptor;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.List;

/**
 * Second-tier interceptor: copies the log record's own "ts" field into the
 * event's "timestamp" header so the HDFS sink partitions data by the log's
 * generation time rather than by its (later) arrival time.
 *
 * <p>Registered in the agent config as
 * {@code com.hike.gmall.interceptor.TimeStampInterceptor$MyBuilder}.
 */
public class TimeStampInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No resources to initialize.
    }

    /**
     * Stamps one event with its embedded log time.
     *
     * @param event the incoming Flume event
     * @return the same event, with a "timestamp" header when a "ts" field exists
     */
    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        try {
            JSONObject jsonObject = JSON.parseObject(body);
            String ts = jsonObject.getString("ts");
            // Guard against records without "ts": a null header value would
            // later break the HDFS sink's %Y-%m-%d path escaping.
            if (ts != null) {
                event.getHeaders().put("timestamp", ts);
            }
        } catch (Exception ignored) {
            // Malformed JSON should already have been filtered out by
            // ETLLogInterceptor in the first tier; if a bad record slips
            // through, pass it along without a timestamp header rather than
            // failing the whole batch.
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        // intercept(Event) never drops events here, so a plain loop suffices.
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Builder instantiated by Flume from the agent configuration. */
    public static class MyBuilder implements Builder {
        @Override
        public Interceptor build() {
            return new TimeStampInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configurable options.
        }
    }
}
(2)配置文件实现过程
将第二层flume配置到hadoop103上
scp -r /opt/module/flume-1.9.0/ hadoop103:/opt/module/
scp /etc/profile.d/my_env.sh root@hadoop103:/etc/profile.d/
#在hadoop103进行操作(第二层flume部署在hadoop103上)
/opt/module/flume-1.9.0/lib
rm -rf Colllect-1.0-SNAPSHOT-jar-with-dependencies.jar
#上传jar包
/opt/module/flume-1.9.0/jobs
mkdir filechannel
mkdir checkpoint
# Second-tier Flume agent: Kafka source -> file channel -> HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = topic_log
a1.sources.r1.kafka.consumer.group.id = gmall
a1.sources.r1.batchDurationMillis = 2000
# Interceptor copies each record's own "ts" field into the "timestamp" header
# so the HDFS sink buckets by log generation time, not arrival time.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.hike.gmall.interceptor.TimeStampInterceptor$MyBuilder
# File channel: buffers events on disk so data survives an agent crash.
a1.channels.c1.type = file
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/jobs/filechannel
# Maximum number of events the channel may hold.
a1.channels.c1.capacity = 1000000
# Checkpoint dir persists the channel's in-memory queue state to disk.
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/jobs/checkpoint
#a1.channels.c1.useDualCheckpoints = true
#a1.channels.c1.backupCheckpointDir = /opt/module/flume-1.9.0/jobs/checkpoint-bk
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.keep-alive = 5
a1.sinks.k1.type = hdfs
# %Y-%m-%d resolves from the "timestamp" header set by the interceptor above.
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.round = false
# Roll a new file every 10 s or 128 MB; never roll by event count.
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Output files are compressed streams using the lzop codec (not plain text).
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
/opt/module/flume-1.9.0/jobs/gmall
vim kafka-flume-hdfs.conf
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/gmall/kafka-flume-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
#完成之后可以在hdfs端查看到收集的文件
(3)第二层起停脚本
#!/bin/bash
# f2.sh -- start/stop the second-tier Flume agent (Kafka -> HDFS) on hadoop103.
if [ $# -lt 1 ]
then
  echo "USAGE: f2.sh {start|stop}"
  # Exit non-zero so callers can detect the usage error.
  exit 1
fi
case $1 in
start)
  for i in hadoop103
  do
    # \$FLUME_HOME is escaped so it expands on the REMOTE host (my_env.sh was
    # distributed to hadoop103), not on the machine running this script.
    ssh $i "nohup flume-ng agent -c \$FLUME_HOME/conf -f \$FLUME_HOME/jobs/gmall/kafka-flume-hdfs.conf -n a1 -Dflume.root.logger=INFO,console 1>\$FLUME_HOME/logs/flume.log 2>&1 &"
  done
;;
stop)
  for i in hadoop103
  do
    # Locate the agent by its config file name and kill it.
    ssh $i "ps -ef | grep kafka-flume-hdfs.conf | grep -v grep | awk '{print \$2}' | xargs -n1 kill -9"
  done
;;
*)
  echo "USAGE: f2.sh {start|stop}"
  exit 1
;;
esac