Environment: CentOS 7, Hudi 0.9.0, JDK 8, Hadoop 2.7.1, Spark 3.0.1
Installing Maven
(1) Upload apache-maven-3.6.1-bin.tar.gz to the /opt/software directory on the Linux host
(2) Extract apache-maven-3.6.1-bin.tar.gz into /opt/module/
[atguigu@hadoop102 software]$ tar -zxvf apache-maven-3.6.1-bin.tar.gz -C /opt/module/
(3) Rename apache-maven-3.6.1 to maven
[atguigu@hadoop102 module]$ mv apache-maven-3.6.1/ maven
(4) Add the environment variables to /etc/profile
[atguigu@hadoop102 module]$ sudo vim /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven
export PATH=$PATH:$MAVEN_HOME/bin
(5) Verify the installation
[atguigu@hadoop102 module]$ source /etc/profile
[atguigu@hadoop102 module]$ mvn -v
(6) Edit settings.xml to point Maven at the Aliyun mirror
[atguigu@hadoop102 maven]$ cd conf
[atguigu@hadoop102 conf]$ vim settings.xml
<!-- Add the Aliyun mirror -->
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>central</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
Downloading Hudi, Option 1: Git clone
Installing Git
[atguigu@hadoop102 software]$ sudo yum install git
[atguigu@hadoop102 software]$ git --version
2.1.3 Building Hudi
[atguigu@hadoop102 software]$ cd /opt/module/
[atguigu@hadoop102 module]$ git clone https://github.com/apache/hudi.git && cd hudi
[atguigu@hadoop102 hudi]$ vim pom.xml
Add the Aliyun repository to the <repositories> section:
<repository>
<id>nexus-aliyun</id>
<name>nexus-aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
[atguigu@hadoop102 hudi]$ mvn clean package -DskipTests -DskipITs
You may hit an error:
error: Failed connect to github.com:443; Operation now in progress while accessing https://github.com/apache/hudi.git/info/refs
Fix:
1. Update yum first
yum -y update
2. Unset the SSH_ASKPASS variable
unset SSH_ASKPASS  (I did not need this step)
git clone https://github.com/apache/hudi.git
Downloading Hudi, Option 2: source tarball
Download the Hudi 0.9.0 source tarball from the Apache archive: http://archive.apache.org/dist/hudi/0.9.0/
Upload the tarball to the /opt/module directory, then extract it and create a symlink:
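A sketch of the extract-and-symlink commands; the tarball name hudi-0.9.0.src.tgz and the extracted directory name are assumptions based on the archive layout linked above:

```shell
# extract the uploaded source tarball into /opt/module
tar -zxvf /opt/software/hudi-0.9.0.src.tgz -C /opt/module/
# create a convenience symlink so later steps can use a stable path
ln -s /opt/module/hudi-0.9.0 /opt/module/hudi
```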
Run mvn clean install -DskipTests -Dscala-2.12 -Dspark3 to build; on success the output looks like the following:
[INFO] hudi-sync-common ................................... SUCCESS [ 4.328 s]
[INFO] hudi-hive-sync ..................................... SUCCESS [ 10.018 s]
[INFO] hudi-spark-datasource .............................. SUCCESS [ 0.091 s]
[INFO] hudi-spark-common_2.12 ............................. SUCCESS [02:34 min]
[INFO] hudi-spark3_2.12 ................................... SUCCESS [ 20.424 s]
[INFO] hudi-spark_2.12 .................................... SUCCESS [02:06 min]
[INFO] hudi-utilities_2.12 ................................ SUCCESS [01:33 min]
[INFO] hudi-utilities-bundle_2.12 ......................... SUCCESS [01:44 min]
[INFO] hudi-cli ........................................... SUCCESS [ 53.262 s]
[INFO] hudi-java-client ................................... SUCCESS [ 11.384 s]
[INFO] hudi-flink-client .................................. SUCCESS [ 34.410 s]
[INFO] hudi-spark2_2.12 ................................... SUCCESS [ 38.161 s]
[INFO] hudi-dla-sync ...................................... SUCCESS [ 7.919 s]
[INFO] hudi-sync .......................................... SUCCESS [ 0.074 s]
[INFO] hudi-hadoop-mr-bundle .............................. SUCCESS [ 9.576 s]
[INFO] hudi-hive-sync-bundle .............................. SUCCESS [ 3.319 s]
[INFO] hudi-spark3-bundle_2.12 ............................ SUCCESS [ 18.856 s]
[INFO] hudi-presto-bundle ................................. SUCCESS [ 12.718 s]
[INFO] hudi-timeline-server-bundle ........................ SUCCESS [ 19.230 s]
[INFO] hudi-hadoop-docker ................................. SUCCESS [ 5.501 s]
[INFO] hudi-hadoop-base-docker ............................ SUCCESS [01:56 min]
[INFO] hudi-hadoop-namenode-docker ........................ SUCCESS [ 1.158 s]
[INFO] hudi-hadoop-datanode-docker ........................ SUCCESS [ 1.226 s]
[INFO] hudi-hadoop-history-docker ......................... SUCCESS [ 1.192 s]
[INFO] hudi-hadoop-hive-docker ............................ SUCCESS [ 18.823 s]
[INFO] hudi-hadoop-sparkbase-docker ....................... SUCCESS [ 1.305 s]
[INFO] hudi-hadoop-sparkmaster-docker ..................... SUCCESS [ 1.204 s]
[INFO] hudi-hadoop-sparkworker-docker ..................... SUCCESS [ 1.590 s]
[INFO] hudi-hadoop-sparkadhoc-docker ...................... SUCCESS [ 1.179 s]
[INFO] hudi-hadoop-presto-docker .......................... SUCCESS [ 1.306 s]
[INFO] hudi-integ-test .................................... SUCCESS [01:12 min]
[INFO] hudi-integ-test-bundle ............................. SUCCESS [02:14 min]
[INFO] hudi-examples ...................................... SUCCESS [ 56.224 s]
[INFO] hudi-flink_2.12 .................................... SUCCESS [ 22.424 s]
[INFO] hudi-flink-bundle_2.12 ............................. SUCCESS [ 35.678 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 25:05 min
[INFO] Finished at: 2022-09-30T00:32:16+08:00
[INFO] ------------------------------------------------------------------------
[root@hadoop06 hudi]#
After the build completes, go to the $HUDI_HOME/hudi-cli directory and run the hudi-cli script; if it starts, the build succeeded, as shown below:
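The launch commands might look like the following; the symlinked path /opt/module/hudi and the script name hudi-cli.sh are assumptions based on the steps and repo layout above:

```shell
# enter the hudi-cli module of the built source tree and start the CLI
cd /opt/module/hudi/hudi-cli
./hudi-cli.sh
```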
HoodieSplashScreen loaded
===================================================================
* ___ ___ *
* /\__\ ___ /\ \ ___ *
* / / / /\__\ / \ \ /\ \ *
* / /__/ / / / / /\ \ \ \ \ \ *
* / \ \ ___ / / / / / \ \__\ / \__\ *
* / /\ \ /\__\ / /__/ ___ / /__/ \ |__| / /\/__/ *
* \/ \ \/ / / \ \ \ /\__\ \ \ \ / / / /\/ / / *
* \ / / \ \ / / / \ \ / / / \ /__/ *
* / / / \ \/ / / \ \/ / / \ \__\ *
* / / / \ / / \ / / \/__/ *
* \/__/ \/__/ \/__/ Apache Hudi CLI *
* *
===================================================================
Welcome to Apache Hudi CLI. Please type help if you are looking for help.
Installing Hadoop
If any of the commands below fail with a permission error, grant permissions first:
# e.g. if files under bin are not executable, run:
chmod 777 bin/*
HDFS must be formatted before the first start:
hadoop namenode -format
Start Hadoop:
start-all.sh
Stop:
stop-all.sh
Safe mode issue
[root@hadoop06 home]# hadoop fs -mkdir /data
mkdir: Cannot create directory /data. Name node is in safe mode.
[root@hadoop06 home]# hadoop fs -mkdir /log
mkdir: Cannot create directory /log. Name node is in safe mode.
[root@hadoop06 home]# hadoop dfsadmin -safemode leave
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Safe mode is OFF
Note:
1. If an *Node daemon is missing, fix core-site.xml and hdfs-site.xml, delete hadoop-2.7.1/tmp, then reformat and restart.
2. If an *Manager daemon is missing, fix mapred-site.xml and yarn-site.xml, then restart.
3. If a command is not found, hadoop-env.sh or /etc/profile is misconfigured.
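For notes 1 and 2 above, `jps` is a quick way to see which daemons are actually running:

```shell
# list the running Java processes; on a healthy single-node setup after
# start-all.sh you would typically expect NameNode, DataNode,
# SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself)
jps
```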
Installing Spark
[root@hadoop06 bin]# sh spark-shell --master=local
spark-shell: line 57: /home/software/spark-2.0.1-bin-hadoop2.7/bin/spark-submit: Permission denied
Fix:
[root@hadoop06 bin]# cd ..
[root@hadoop06 spark-2.0.1-bin-hadoop2.7]# chmod 777 bin/*
[root@hadoop06 spark-2.0.1-bin-hadoop2.7]# cd bin
[root@hadoop06 bin]# sh spark-shell --master=local
Installing Scala
Environment variables:
export SCALA_HOME=/home/software/scala-2.12.10
export PATH=$PATH:$MAVEN_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin
Running Hudi in spark-shell
First start spark-shell in local mode (--master local[2]), generate simulated Trip ride data, save it into a Hudi table, then load the data back from the Hudi table for query and analysis; the table data is ultimately stored on HDFS.
Run the following spark-shell command on the server. It pulls in the Hudi packages at startup; note that it requires internet access to download the jars from the remote repository:
spark-shell \
--master local[4] \
--packages org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
If the server has no internet access, upload the jars to the server first and point spark-shell at them with --jars (the full offline example, including the spark-avro jar, is shown further below):
spark-shell \
--master local[4] \
--jars /opt/module/Hudi/packaging/hudi-spark-bundle/target/hudi-spark3-bundle_2.12-0.9.0.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Result:
[root@hadoop06 bin]# sh spark-shell --master local[4] --packages org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/home/software/spark-2.0.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-avro_2.12 added as a dependency
org.apache.hudi#hudi-spark3-bundle_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found org.apache.spark#spark-avro_2.12;3.0.1 in central
found org.apache.spark#spark-tags_2.12;3.0.1 in central
found org.spark-project.spark#unused;1.0.0 in local-m2-cache
found org.apache.hudi#hudi-spark3-bundle_2.12;0.9.0 in local-m2-cache
downloading https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.0.1/spark-avro_2.12-3.0.1.jar ...
[SUCCESSFUL ] org.apache.spark#spark-avro_2.12;3.0.1!spark-avro_2.12.jar (326ms)
downloading file:/root/.m2/repository/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar ...
[SUCCESSFUL ] org.apache.hudi#hudi-spark3-bundle_2.12;0.9.0!hudi-spark3-bundle_2.12.jar (863ms)
downloading https://repo1.maven.org/maven2/org/apache/spark/spark-tags_2.12/3.0.1/spark-tags_2.12-3.0.1.jar ...
[SUCCESSFUL ] org.apache.spark#spark-tags_2.12;3.0.1!spark-tags_2.12.jar (160ms)
downloading file:/root/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar ...
[SUCCESSFUL ] org.spark-project.spark#unused;1.0.0!unused.jar (2ms)
:: resolution report :: resolve 7447ms :: artifacts dl 1357ms
:: modules in use:
org.apache.hudi#hudi-spark3-bundle_2.12;0.9.0 from local-m2-cache in [default]
org.apache.spark#spark-avro_2.12;3.0.1 from central in [default]
org.apache.spark#spark-tags_2.12;3.0.1 from central in [default]
org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 4 | 4 | 4 | 0 || 4 | 4 |
---------------------------------------------------------------------
:: problems summary ::
:::: ERRORS
SERVER ERROR: Bad Gateway url=http://dl.bintray.com/spark-packages/maven/org/apache/apache/18/apache-18.jar
SERVER ERROR: Bad Gateway url=http://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-parent_2.12/3.0.1/spark-parent_2.12-3.0.1.jar
SERVER ERROR: Bad Gateway url=http://dl.bintray.com/spark-packages/maven/org/sonatype/oss/oss-parent/9/oss-parent-9.jar
SERVER ERROR: Bad Gateway url=http://dl.bintray.com/spark-packages/maven/org/apache/apache/21/apache-21.jar
SERVER ERROR: Bad Gateway url=http://dl.bintray.com/spark-packages/maven/org/apache/hudi/hudi/0.9.0/hudi-0.9.0.jar
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
4 artifacts copied, 0 already retrieved (37885kB/45ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
0 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1519 [main] WARN org.apache.spark.SparkContext - Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.253.134:4040
Spark context available as 'sc' (master = local[4], app id = local-1664539108221).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
If you get a class-not-found error, check version compatibility.
Because I was using Spark 2.4.6 with Hudi 0.6.0 at first, there are two compatibility issues:
1) java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
This happens because the Avro version bundled with that Spark release is 1.7.7, which has no LogicalType class; the class first appeared in org.apache.avro:avro:1.8.0. You therefore need to pull in a newer spark-avro.
Starting Spark with internet access:
spark-shell \
--master local[4] \
--packages org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
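A quick way to confirm which Avro jars your Spark distribution bundles (that SPARK_HOME points at the install directory is an assumption):

```shell
# list the avro jars shipped with Spark; a 1.7.x avro here explains the
# missing org.apache.avro.LogicalType class described above
ls "$SPARK_HOME/jars" | grep -i avro
```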
Starting without internet access (the jars must be staged locally first):
sh spark-shell \
--master local[2] \
--jars /home/software/spark-2.0.1-bin-hadoop2.7/hudi-jar/hudi-spark3-bundle_2.12-0.9.0.jar,/home/software/spark-2.0.1-bin-hadoop2.7/hudi-jar/spark-avro_2.12-3.0.1.jar \
--conf spark.driver.extraClassPath=/home/software/spark-2.0.1-bin-hadoop2.7/hudi-jar/spark-avro_2.12-3.0.1.jar \
--conf spark.executor.extraClassPath=/home/software/spark-2.0.1-bin-hadoop2.7/hudi-jar/spark-avro_2.12-3.0.1.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
bin/spark-shell \
--master local[2] \
--jars /opt/module/spark-3.0.0-bin-hadoop2.7/hudi-jars/hudi-spark3-bundle_2.12-0.9.0.jar,/opt/module/spark-3.0.0-bin-hadoop2.7/hudi-jars/spark-avro_2.12-3.0.1.jar \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
Saving data to a Hudi table
1) Import the Spark and Hudi packages
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
2) Define variables: the table name and the storage path (the path can be local or HDFS)
## Table name
val tableName = "hudi_trips_cow"
## Storage path (local)
val basePath = "file:///home/hudi_trips_cow"
## Storage path (HDFS)
val basePath = "hdfs://hadoop06:9000/datas/hudi-warehouse/hudi_trips_cow"
3) Generate simulated Trip ride data
## Build a DataGenerator and generate 10 simulated Trip records
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
Data format:
{
"ts": 1653124172267,
"uuid": "80ad40d1-95f8-4677-ad1c-6eee1eeb72dd",
"rider": "rider-213",
"driver": "driver-213",
"begin_lat": 0.4726905879569653,
"begin_lon": 0.46157858450465483,
"end_lat": 0.754803407008858,
"end_lon": 0.9671159942018241,
"fare": 34.158284716382845,
"partitionpath": "americas/brazil/sao_paulo"
}
4) Convert the simulated List into a DataFrame and inspect it
## Convert to a DataFrame
val df = spark.read.json(spark.sparkContext.parallelize(inserts,2))
## View the schema
df.printSchema
## View the data
df.show()
5) Write the data to Hudi (this step had a problem)
With Spark 2.4.6:
Error message:
warning: there was one deprecation warning; re-run with -deprecation for details java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
Fix: install Spark 3.0.1; the steps below then succeed.
Specify hudi as the data source format and set the relevant write options:
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
## Key parameter notes
getQuickstartWriteConfigs: sets the shuffle parallelism used when writing/updating data into Hudi
PRECOMBINE_FIELD_OPT_KEY: the field used to pick the latest record when merging records with the same key
RECORDKEY_FIELD_OPT_KEY: the unique id of each record; multiple fields are supported
PARTITIONPATH_FIELD_OPT_KEY: the field used to build the partition path
6) HDFS layout (all data files are Parquet)
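Assuming the HDFS basePath defined earlier, the partitioned layout can be inspected with:

```shell
# recursively list the Hudi table directory; the parquet data files sit
# under the three partition levels, alongside the .hoodie metadata folder
hadoop fs -ls -R /datas/hudi-warehouse/hudi_trips_cow
```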
2.3 Reading data from the Hudi table
1) Load the data with the Spark SQL external data source API, specifying the hudi format and its options
val tripsSnapshotDF = spark.
read.
format("hudi").
load(basePath + "/*/*/*/*")
## Create a temporary view
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
## Parameter notes
The /*/*/*/* suffix after the table path is a glob pattern: the Hudi table here has three partition levels (equivalent to a Hive table with three partition columns), so one wildcard matches each partition level and the last matches the data files.
2) Query trips whose fare is greater than 20
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
2.4 Updating data in the Hudi table
Updating is similar to inserting new data: use the official DataGenerator utility to produce update records
val updates = convertToStringList(dataGen.generateUpdates(10))
val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
## Parameter notes
Append mode appends the new data (records with an existing key are upserted)
2.5 Deleting data from the Hudi table
1) First fetch 2 records from the Hudi table and build the delete payload
## Count the total number of records
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
## Take 2 records and build the delete payload
val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
val deletes = dataGen.generateDeletes(ds.collectAsList())
val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2));
2) Write the payload back to the Hudi table
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(OPERATION_OPT_KEY,"delete").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
Parameter notes
OPERATION_OPT_KEY set to "delete" removes the matching records (Append mode is required)