Introduction to Hudi, Installation, and Building from Source

open_test01

Last modified 2023-07-09 11:39:56

Categories: Hudi, Big Data Environment Setup. Tags: hadoop, hive, big data

First published 2023-05-06 14:28:57

Article link: https://blog.csdn.net/dafsq/article/details/130481971

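Introduction to Hudi

Overview of Hudi

Hudi (Hadoop Upserts Deletes and Incrementals) is a next-generation streaming data lake platform. Apache Hudi brings core warehouse and database functionality directly to the data lake. It provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping the data in open source file formats.

Hudi is well suited to streaming workloads, and it also lets you build efficient incremental batch pipelines.

Hudi can be used on any cloud storage platform. Its advanced performance optimizations speed up analytical workloads on any popular query engine, including Apache Spark, Flink, Presto, Trino, and Hive.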

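Hudi Features

A pluggable indexing mechanism supports fast upserts and deletes.
Table changes can be pulled incrementally for downstream processing.
Transactional commits and rollbacks are supported, with concurrency control.
SQL reads and writes are supported from engines such as Spark, Presto, Trino, Hive, and Flink.
Small files, data clustering, compaction, and cleaning are managed automatically.
Streaming ingestion comes with built-in CDC sources and tools.
Built-in metadata tracking provides scalable storage access.
Schema changes are supported in a backward-compatible way.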

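Use Cases

1) Near-real-time ingestion

Reduces reliance on a patchwork of separate tools.
Imports RDBMS data incrementally through CDC.
Limits the size and number of small files.

2) Near-real-time analytics

Uses fewer resources than second-level-latency stores (Druid, OpenTSDB).
Provides minute-level freshness while supporting more efficient queries.
Hudi is a library, so it is very lightweight.

3) Incremental pipelines

Distinguishes arrival time from event time in order to handle late data.
Shorter scheduling intervals reduce end-to-end latency (hours -> minutes) => Incremental Processing.

4) Incremental export

Replaces Kafka in some scenarios by exporting data to online serving stores, e.g. ES.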

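Installing Maven

Upload and extract the archive

Add Maven to the environment variables

Open the configuration file

vim /etc/profile

Add the following settings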

#MAVEN_HOME
export MAVEN_HOME=/opt/apache-maven-3.6.1
export PATH=$PATH:$MAVEN_HOME/bin

Save the file and apply the changes

source /etc/profile
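
Repeating the manual edit above can leave duplicate export lines in the profile. A minimal sketch of an idempotent alternative follows. The name add_maven_env is a hypothetical helper, not part of Maven, and the profile path and Maven directory are passed as arguments rather than hard-coded.

```shell
# add_maven_env is a hypothetical helper: it appends the MAVEN_HOME settings
# to a profile file only when no MAVEN_HOME export is present yet.
add_maven_env() {
  local profile="$1" maven_home="$2"
  # Skip the append when an export for MAVEN_HOME already exists.
  if grep -q '^export MAVEN_HOME=' "$profile" 2>/dev/null; then
    echo "MAVEN_HOME already configured in $profile"
    return 0
  fi
  {
    echo '#MAVEN_HOME'
    echo "export MAVEN_HOME=$maven_home"
    # Single quotes keep $PATH and $MAVEN_HOME unexpanded in the file.
    echo 'export PATH=$PATH:$MAVEN_HOME/bin'
  } >> "$profile"
  echo "MAVEN_HOME added to $profile"
}
```

For the setup in this article the call would be add_maven_env /etc/profile /opt/apache-maven-3.6.1 (run as root), followed by source /etc/profile.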

Verify the installation (for example, run mvn -v and check the reported Maven version)

Switch to the Aliyun mirror

Edit the settings.xml file in the apache-maven-3.6.1/conf directory

Add the following inside the <mirrors> tag

<!-- Add the Aliyun mirror -->
<mirror>
        <id>nexus-aliyun</id>
        <mirrorOf>central</mirrorOf>
        <name>Nexus aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

Set the local Maven repository path (this element goes directly under <settings>, not inside <mirrors>)

<localRepository>/opt/software/RepMaven</localRepository>

Building Hudi

Upload and extract the source package

Download location: https://github.com/apache/hudi/

tar -zxvf hudi-0.12.0.src.tgz -C /opt

Change the Hadoop and Hive versions for compatibility

Edit the pom.xml file in the hudi-0.12.0 directory

vim /opt/hudi-0.12.0/pom.xml

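The default Hadoop and Hive versions are from the 2.x series, so change them to the versions you are currently running (around line 113)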
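
The version change can also be scripted. The sketch below assumes the root POM declares the versions as single-line properties named hadoop.version and hive.version. The name set_pom_property is a hypothetical helper, and the Hive value mentioned in the usage note is only a placeholder for whatever version your cluster runs.

```shell
# set_pom_property is a hypothetical helper: it rewrites the value of a
# single-line <name>value</name> property in a POM file.
# Assumption: the property sits on one line, e.g. <hadoop.version>...</hadoop.version>.
set_pom_property() {
  local pom="$1" name="$2" value="$3"
  # Write to a temporary file and move it back, which avoids relying on
  # the non-portable in-place flag of sed.
  sed "s|<$name>[^<]*</$name>|<$name>$value</$name>|" "$pom" > "$pom.tmp" \
    && mv "$pom.tmp" "$pom"
}
```

Example: set_pom_property /opt/hudi-0.12.0/pom.xml hadoop.version 3.1.3, then the same call with hive.version and your Hive 3.x version (3.1.2 is assumed here as a placeholder). Hadoop 3.1.3 matches the -Dhadoop.version value used in the build command later in this article.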

Add a repository that mirrors Maven Central to speed up dependency downloads (around line 1213 of the POM)

    <repository>
        <id>nexus-aliyun</id>
        <name>nexus-aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>

Copy this block into the <repositories> tag

Patch the source for Hadoop 3 compatibility

vim /opt/hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java

Edit the source file (around line 110): add an extra null argument, then save and exit
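
As a sketch of what this change might look like when scripted: the code below assumes the call in question is the one-argument constructor new FSDataOutputStream(baos), which Hadoop 3 no longer provides, and that the added null fills the FileSystem.Statistics parameter. The exact line is not shown in this article, so check it in your own source tree before patching. The name patch_fs_output_stream is a hypothetical helper.

```shell
# patch_fs_output_stream is a hypothetical helper.
# Assumption: the line to change contains "new FSDataOutputStream(baos)";
# the patch adds null as a second constructor argument.
patch_fs_output_stream() {
  local src="$1"
  # In a basic regular expression the parentheses are literal characters.
  # A line that is already patched no longer contains "(baos)", so running
  # the helper twice leaves the file unchanged.
  sed 's|new FSDataOutputStream(baos)|new FSDataOutputStream(baos, null)|' "$src" > "$src.tmp" \
    && mv "$src.tmp" "$src"
}
```

It would be called with the path of HoodieParquetDataBlock.java given above. Keeping a backup copy of the file first makes it easy to diff the result.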

Resolve dependency conflicts in the Spark module

vim /opt/hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml

Around line 380, where the Hive dependency appears, make sure these exclusions are present

        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.pentaho</groupId>
          <artifactId>*</artifactId>
        </exclusion>

Around line 420, add or adjust the exclusions for the Hive JDBC dependency

      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>

Around line 440, hive-metastore

      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
      </exclusions>

Around line 463, hive-common

      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty.orbit</groupId>
          <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>

Finally, add the jetty dependencies manually at the end

    <!-- Add the jetty version configured by Hudi -->
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-server</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-util</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-webapp</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-http</artifactId>
      <version>${jetty.version}</version>
    </dependency>

Edit the hudi-utilities-bundle POM: exclude the older jetty and add the jetty version that Hudi specifies

vim /opt/hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml

Around line 350, change as follows

      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>

Around line 360

      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>

Around line 423

      <exclusions>
        <exclusion>
          <artifactId>servlet-api</artifactId>
          <groupId>javax.servlet</groupId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.pentaho</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>

Around line 455

      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>

Around line 476

      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
      </exclusions>

Around line 501

      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty.orbit</groupId>
          <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>

Finally, add the jetty version configured by Hudi

    <!-- Add the jetty version configured by Hudi -->
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-server</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-util</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-webapp</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-http</artifactId>
      <version>${jetty.version}</version>
    </dependency>

Running the build

Build with the target versions specified

mvn clean package -DskipTests -Dspark3.1 -Dscala-2.12 -Dhadoop.version=3.1.3 -Pflink-bundle-shade-hive3

Wait seven or eight minutes for all modules to finish building

Start the CLI client that ships with Hudi

hudi-cli/hudi-cli.sh

Copy the built bundle into Spark's jars directory (the jar is produced under packaging/hudi-spark-bundle/target)

cp hudi-spark3.1-bundle_2.12-0.12.0.jar /opt/module/spark/jars
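
Copying by hand depends on being in the right directory and typing the exact file name. The sketch below assumes the Spark bundle is written to packaging/hudi-spark-bundle/target under the source root, which is the standard Maven output location for that module. The name install_spark_bundle is a hypothetical helper.

```shell
# install_spark_bundle is a hypothetical helper: it looks for the Spark bundle
# jar in the build tree and copies it into Spark's jars directory.
# Assumption: the jar lives in packaging/hudi-spark-bundle/target.
install_spark_bundle() {
  local hudi_home="$1" spark_jars="$2" jar
  # Pick the bundle jar, skipping the sources jar and the unshaded
  # original-*.jar that the shade plugin may leave behind.
  jar=$(find "$hudi_home/packaging/hudi-spark-bundle/target" -maxdepth 1 \
          -name 'hudi-spark*-bundle_*.jar' \
          ! -name '*sources*' ! -name 'original-*' 2>/dev/null | head -n 1)
  if [ -z "$jar" ]; then
    echo "no Spark bundle jar found; did the build finish?" >&2
    return 1
  fi
  cp "$jar" "$spark_jars/" && echo "copied $(basename "$jar")"
}
```

With the paths used in this article the call would be install_spark_bundle /opt/hudi-0.12.0 /opt/module/spark/jars.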

Start spark-shell with the Kryo serializer configured

spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"

Run an example

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.common.model.HoodieRecord

// Table name and local base path for the demo copy-on-write table
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

// Generate 10 sample trip records and write them out as a new Hudi table
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

// Read the table back as a snapshot and query it with Spark SQL
val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()