Hudi Data Lake 1: Compiling Hudi

Apache Minor Trend

Last modified: 2023-01-09 15:58:26

Tags: big data, data warehouse

First published: 2022-12-29 14:11:15

Article link: https://blog.csdn.net/weixin_38136584/article/details/128476360

Apache Hudi (Hadoop Upserts Deletes and Incrementals)

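1. What is Hudi

Hudi is a data lake solution developed and open-sourced by Uber. It is positioned as a next-generation data warehouse solution and provides efficient upserts and near-real-time updates.
Core features:
- Openness: supports multiple upstream data source formats and multiple downstream query engines.
- Transaction support: supports updates on top of the file storage layout.
- Incremental processing with ACID semantics: incremental ETL, reducing update latency from daily to minute-level.
- Intelligent scheduling: small files are managed automatically.
Use cases:
- Near-real-time ingestion
- Near-real-time analytics
- Incremental processing pipelines
- Incremental export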

2. Hudi architecture diagram

[Image: Hudi architecture diagram]

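3. Hudi features

Pluggable indexing mechanism for fast upserts and deletes.
Incremental pulls of table changes for downstream processing.
Transactional commits and rollbacks, with concurrency control.
SQL reads and writes from Spark, Presto, Trino, Hive, Flink and other compute engines.
Automatic management of small files, with automatic clustering, compaction and (scheduled) cleaning.
Streaming ingestion with built-in CDC sources and tools.
Built-in metadata tracking for scalable storage access.
Schema evolution supported in a backward-compatible way.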

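4. Hudi use cases

Near-real-time writes
- Fewer fragmented, single-purpose tools to maintain
- Incremental import of RDBMS data via CDC
- Limits on the size and number of small files
Near-real-time analytics
- Saves resources compared with second-level-latency stores (Druid, OpenTSDB).
- Provides minute-level freshness and supports more efficient queries.
Incremental pipelines
- Distinguish arrival time from event time to handle late-arriving data.
- Shorter scheduling intervals reduce end-to-end latency (hours -> minutes) => incremental processing.
Incremental export
- Replaces Kafka in some scenarios, exporting data to online serving stores such as Elasticsearch.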

5. Build and installation

For Hudi's compatibility with other components, search Baidu for "hudi 0.11/0.12 release".

Component	Version
Hadoop	3.1.3
Hive	3.1.2
Flink	1.13.6, scala-2.12
Spark	3.2.2, scala-2.12

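1. Install Maven on the Linux server

1) Install Maven
(1) Upload apache-maven-3.6.1-bin.tar.gz to /opt/software, then extract and rename it

tar -zxvf /opt/software/apache-maven-3.6.1-bin.tar.gz -C /opt/module/
mv /opt/module/apache-maven-3.6.1 /opt/module/maven-3.6.1

(2) Add the environment variables to /etc/profile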

sudo vim /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven-3.6.1
export PATH=$PATH:$MAVEN_HOME/bin

(3) Verify the installation

source /etc/profile
mvn -v    # check that the installation succeeded
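
Editing /etc/profile by hand makes it easy to add the same lines twice. As a minimal sketch, the exports from step (2) could be appended only when they are missing. The function name add_maven_env is made up for illustration and is not part of Maven; the profile path is a parameter so the helper can be tried on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Maven): append the MAVEN_HOME exports once.
# $1 = profile file (for example /etc/profile)
# $2 = Maven installation directory
add_maven_env() {
  # Nothing to do if an export for MAVEN_HOME is already present.
  if grep -q '^export MAVEN_HOME=' "$1" 2>/dev/null; then
    return 0
  fi
  {
    echo '#MAVEN_HOME'
    echo "export MAVEN_HOME=$2"
    # Single quotes keep $PATH and $MAVEN_HOME literal in the profile.
    echo 'export PATH=$PATH:$MAVEN_HOME/bin'
  } >> "$1"
}
```

Writing to /etc/profile itself needs root privileges, so on the server the function would have to run in a root shell, for example add_maven_env /etc/profile /opt/module/maven-3.6.1.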

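2) Switch to the Aliyun mirror
(1) Edit settings.xml and point it at the Aliyun repository

vim /opt/module/maven-3.6.1/conf/settings.xml

<!-- Add the Aliyun mirror (place it inside the <mirrors> element) -->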
<mirror>
        <id>nexus-aliyun</id>
        <mirrorOf>central</mirrorOf>
        <name>Nexus aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
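
A mirror entry pasted outside the <mirrors> element is an easy mistake to make. The sketch below is a rough, line-based check that the mirror id appears between <mirrors> and </mirrors>. The function name has_mirror is hypothetical, and because this is not a real XML parser it does not understand comments or tags that share a line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if <id>ID</id> appears between
# <mirrors> and </mirrors> in the given settings.xml.
# $1 = path to settings.xml, $2 = mirror id to look for
has_mirror() {
  awk -v id="$2" '
    /<mirrors>/   { inside = 1 }
    /<\/mirrors>/ { inside = 0 }
    inside && index($0, "<id>" id "</id>") { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

For the file edited above, the call would be has_mirror /opt/module/maven-3.6.1/conf/settings.xml nexus-aliyun, which returns exit status 0 when the entry is in place.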

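2. Compile Hudi: fix compatibility with the Hadoop and Hive versions

1. Upload the source package
Upload hudi-0.12.0.src.tgz to /opt/software and extract it

tar -zxvf /opt/software/hudi-0.12.0.src.tgz -C /opt/software

The source can also be downloaded from GitHub: https://github.com/apache/hudi/

2. Modify the pom file
vim /opt/software/hudi-0.12.0/pom.xml
1) Add a repository to speed up dependency downloads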

    <repository>
        <id>nexus-aliyun</id>
        <name>nexus-aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>

2) Change the versions of the dependent components

<hadoop.version>3.1.3</hadoop.version>
<hive.version>3.1.2</hive.version>
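
After editing a large pom it helps to confirm that the properties really hold the intended values. The sketch below reads a property with sed. The function name pom_property is hypothetical, and it assumes the property's opening and closing tags are on one line, as in the two lines above.

```shell
#!/bin/sh
# Hypothetical helper: print the first value of a single-line
# <name>value</name> property found in a pom file.
# $1 = path to pom.xml, $2 = property name (for example hadoop.version)
pom_property() {
  # ':' is used as the sed delimiter because the closing tag contains '/'.
  sed -n "s:.*<$2>\(.*\)</$2>.*:\1:p" "$1" | head -n 1
}
```

Run against the edited file, pom_property /opt/software/hudi-0.12.0/pom.xml hadoop.version should print the version that was just set.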

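3. Modify the source code for Hadoop 3 compatibility
Hudi depends on Hadoop 2 by default. To make it compatible with Hadoop 3, changing the version is not enough; the following code must also be modified:

vim /opt/software/hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java

Modify line 110: the call originally takes a single argument; add null as the second argument:
[Image: screenshot of the modified line]
Otherwise the build fails with an error caused by the incompatibility between Hadoop 2.x and 3.x.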

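3. Compile Hudi: fix the Kafka dependency problem

Several Kafka dependencies must be installed manually; otherwise the build fails with the following error:
[Image: screenshot of the build error]
1) Download the jars
Download from: http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip
After extracting the archive, locate the following jars and upload them to the server hadoop1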

common-config-5.3.4.jar
common-utils-5.3.4.jar
kafka-avro-serializer-5.3.4.jar
kafka-schema-registry-client-5.3.4.jar

2) Install them into the local Maven repository

mvn install:install-file -DgroupId=io.confluent -DartifactId=common-config -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-config-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=common-utils -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-utils-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-avro-serializer -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-avro-serializer-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-schema-registry-client-5.3.4.jar
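
The four commands differ only in the artifact name, so they can be written as a loop. This is a sketch under assumptions: the function name install_confluent_jars and the MVN override variable are made up for illustration. Setting MVN="echo mvn" prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: install the four Confluent 5.3.4 jars into the
# local Maven repository with the same install:install-file calls as above.
# $1 = directory that holds the jars
# MVN (optional) = command to run instead of mvn, e.g. "echo mvn" for a dry run
install_confluent_jars() {
  jar_dir="$1"
  version=5.3.4
  for artifact in common-config common-utils \
                  kafka-avro-serializer kafka-schema-registry-client; do
    # ${MVN:-mvn} is left unquoted on purpose so "echo mvn" splits into two words.
    ${MVN:-mvn} install:install-file \
      -DgroupId=io.confluent \
      -DartifactId="$artifact" \
      -Dversion="$version" \
      -Dpackaging=jar \
      -Dfile="$jar_dir/$artifact-$version.jar" || return 1
  done
}
```

With the jars in the current directory, install_confluent_jars . runs the same four installs and stops at the first one that fails.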

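4. Compile Hudi: exclude the conflicting Jetty and fix the dependency conflict in the Spark module

1. Fix the Spark module dependency conflict
  With the Hive version changed to 3.1.2, Hive brings in Jetty 9.3, while Hudi itself uses Jetty 9.4, so the two conflict.
  1) Modify the pom file of hudi-spark-bundle to exclude the lower Jetty version and add the Jetty version specified by Hudi:
  vim /opt/software/hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml
  Around line 382, modify as follows (the changes were highlighted in red in the original post):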

    <!-- Hive -->
    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.pentaho</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service-rpc</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-common</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty.orbit</groupId>
          <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Add the Jetty version configured by Hudi -->
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-server</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-util</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-webapp</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-http</artifactId>
      <version>${jetty.version}</version>
    </dependency>

Otherwise, inserting data into a Hudi table with Spark fails with the following error:

java.lang.NoSuchMethodError: org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V

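[Image: screenshot of the error]
Cause of the problem above: a dependency conflict

1. When Spark writes data it depends on Hive; Hive has been changed to version 3.1.2, which brings its own Jetty 9.3.
2. Hudi's common module also brings in Jetty 9.4.
3. The two Jetty versions conflict, and the conflict has to be resolved manually.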
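
One way to see such a conflict before it surfaces at runtime is to list the Jetty versions in Maven's dependency tree. The sketch below filters the text output of mvn dependency:tree. The function name jetty_versions is hypothetical, and it assumes coordinates of the form group:artifact:type:version:scope, so entries that carry a classifier would be misread.

```shell
#!/bin/sh
# Hypothetical helper: read `mvn dependency:tree` output on stdin and
# print each distinct org.eclipse.jetty* version, one per line.
jetty_versions() {
  grep -o 'org\.eclipse\.jetty[^: ]*:[^: ]*:[^: ]*:[^: ]*' \
    | awk -F: '{ print $4 }' \
    | sort -u
}
```

A possible use, assuming the module's dependencies resolve, is to run mvn dependency:tree -Dincludes=org.eclipse.jetty in the bundle module and pipe the output into jetty_versions. More than one line of output means several Jetty versions are present.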

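5. Compile Hudi: exclude the conflicting Jetty and fix the dependency conflict in the Delta Streamer module

Modify the pom file of hudi-utilities-bundle to exclude the lower Jetty version and add the Jetty version specified by Hudi:

vim /opt/software/hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml

Around line 405, modify as follows: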

    <!-- Hoodie -->
    <dependency>
      <groupId>org.apache.hudi</groupId>
      <artifactId>hudi-common</artifactId>
      <version>${project.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hudi</groupId>
      <artifactId>hudi-client-common</artifactId>
      <version>${project.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>


    <!-- Hive -->
    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <artifactId>servlet-api</artifactId>
          <groupId>javax.servlet</groupId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.pentaho</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service-rpc</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-common</artifactId>
      <version>${hive.version}</version>
      <scope>${utilities.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty.orbit</groupId>
          <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Add the Jetty version configured by Hudi -->
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-server</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-util</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-webapp</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-http</artifactId>
      <version>${jetty.version}</version>
    </dependency>