1 Getting started
1.1 Launching the Spark shell
# For Spark versions: 3.2 - 3.4
spark-shell --jars /path/to/jars/hudi-spark3.2-bundle_2.12-0.14.0.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
For Spark 3.2 and higher versions, use the Scala 2.12 bundle and add the extra config: --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
hudi-spark3.2-bundle_2.12-0.14.0.jar is the bundle jar built for Hudi 0.14.0.
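If the bundle jar has not been downloaded locally, an equivalent launch can resolve it from Maven Central with --packages instead (a sketch; assumes the driver node has network access):
# Alternative launch that resolves the Hudi bundle from Maven Central instead of a local jar
spark-shell --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'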
1.2 Importing dependencies and setting the table name
// spark-shell
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.table.HoodieTableConfig._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.common.model.HoodieRecord
import spark.implicits._
val tableName = "trips_table"
val basePath = "hdfs:///tmp/trips_table"
2 Creating the table
// scala
// First commit will auto-initialize the table, if it did not exist in the specified base path.
No separate table creation step is needed: if the table does not exist under the specified base path, the first batch write will create it.
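If you prefer to register the table in the catalog before any write (for example, to query it by name from Spark SQL right away), a DDL along the following lines can be used; this is a sketch based on Hudi's Spark SQL CREATE TABLE support, reusing the tableName and basePath defined above, and is entirely optional here:
// Optional: explicit creation via Spark SQL (sketch; the first DataFrame write below creates the table anyway)
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS $tableName (
    ts BIGINT, uuid STRING, rider STRING, driver STRING, fare DOUBLE, city STRING
  ) USING hudi
  PARTITIONED BY (city)
  LOCATION '$basePath'
""")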
3 Inserting data
// spark-shell
val columns = Seq("ts","uuid","rider","driver","fare","city")
val data =
Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
(1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
(1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
(1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo" ),
(1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));
var inserts = spark.createDataFrame(data).toDF(columns:_*)
inserts.write.format("hudi").
option(PARTITIONPATH_FIELD_NAME.key(), "city").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
mode(Overwrite) overwrites and recreates the table if it already exists. You can check whether data was generated under hdfs:///tmp/trips_table:
[root@master01 ~]# hdfs dfs -ls /tmp/trips_table
Found 4 items
drwxr-xr-x - root hdfs 0 2024-01-16 09:59 /tmp/trips_table/.hoodie
drwxr-xr-x - root hdfs 0 2024-01-16 09:59 /tmp/trips_table/chennai
drwxr-xr-x - root hdfs 0 2024-01-16 09:59 /tmp/trips_table/san_francisco
drwxr-xr-x - root hdfs 0 2024-01-16 09:59 /tmp/trips_table/sao_paulo
4 Querying data
4.1 Loading into a DataFrame
Hudi tables can be read back into a DataFrame and queried with Spark SQL.
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")
4.2 Queries
spark.sql("SELECT uuid, fare, ts, rider, driver, city FROM trips_table WHERE fare > 20.0").show()
+--------------------+-----+-------------+-------+--------+-------------+
| uuid| fare| ts| rider| driver| city|
+--------------------+-----+-------------+-------+--------+-------------+
|e96c4396-3fad-413...| 27.7|1695091554788|rider-C|driver-M|san_francisco|
|9909a8b1-2d15-4d3...| 33.9|1695046462179|rider-D|driver-L|san_francisco|
|e3cf430c-889d-401...|34.15|1695516137016|rider-F|driver-P| sao_paulo|
+--------------------+-----+-------------+-------+--------+-------------+
spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM trips_table").show()
+-------------------+--------------------+----------------------+-------+--------+-----+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+-------+--------+-----+
| 20240116101626540|20240116101626540...| san_francisco|rider-A|driver-K| 19.1|
| 20240116101626540|20240116101626540...| san_francisco|rider-C|driver-M| 27.7|
| 20240116101626540|20240116101626540...| san_francisco|rider-D|driver-L| 33.9|
| 20240116101626540|20240116101626540...| sao_paulo|rider-F|driver-P|34.15|
| 20240116101626540|20240116101626540...| chennai|rider-J|driver-T|17.85|
+-------------------+--------------------+----------------------+-------+--------+-----+
5 Updating data
Hudi tables can be updated by streaming in a DataFrame or using a standard UPDATE statement.
// Let's read data from the target Hudi table, modify the fare column for rider-D and update it.
val updatesDf = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-D").withColumn("fare", col("fare") * 10)
updatesDf.write.format("hudi").
option(OPERATION_OPT_KEY, "upsert").
option(PARTITIONPATH_FIELD_NAME.key(), "city").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
The save mode is now Append.
In general, always use Append mode unless you are creating the table for the first time.
Querying the data again will now show the updated trip records. Each write operation generates a new commit, identified by a timestamp. Look for changes in the _hoodie_commit_time and fare fields for the same _hoodie_record_key compared with previous commits.
To query the updated data, reload the Hudi table:
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")
spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM trips_table").show()
+-------------------+--------------------+----------------------+-------+--------+-----+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+-------+--------+-----+
| 20240116101626540|20240116101626540...| san_francisco|rider-A|driver-K| 19.1|
| 20240116101626540|20240116101626540...| san_francisco|rider-C|driver-M| 27.7|
| 20240116103152078|20240116101626540...| san_francisco|rider-D|driver-L|339.0|
| 20240116101626540|20240116101626540...| sao_paulo|rider-F|driver-P|34.15|
| 20240116101626540|20240116101626540...| chennai|rider-J|driver-T|17.85|
+-------------------+--------------------+----------------------+-------+--------+-----+
6 Deleting data
// spark-shell
// Let's delete rider: rider-F
val deletesDF = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-F")
deletesDF.write.format("hudi").
option(OPERATION_OPT_KEY, "delete").
option(PARTITIONPATH_FIELD_NAME.key(), "city").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
Query the data again to see the change:
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")
spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM trips_table").show()
+-------------------+--------------------+----------------------+-------+--------+-----+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+-------+--------+-----+
| 20240116101626540|20240116101626540...| san_francisco|rider-A|driver-K| 19.1|
| 20240116101626540|20240116101626540...| san_francisco|rider-C|driver-M| 27.7|
| 20240116103152078|20240116101626540...| san_francisco|rider-D|driver-L|339.0|
| 20240116101626540|20240116101626540...| chennai|rider-J|driver-T|17.85|
+-------------------+--------------------+----------------------+-------+--------+-----+
7 Time travel query
Hudi has supported time travel queries since 0.9.0. Three timestamp formats are currently supported, as shown below.
spark.read.format("hudi").
option("as.of.instant", "20210728141108100").
load(basePath)
spark.read.format("hudi").
option("as.of.instant", "2021-07-28 14:11:08.200").
load(basePath)
// It is equal to "as.of.instant = 2021-07-28 00:00:00"
spark.read.format("hudi").
option("as.of.instant", "2021-07-28").
load(basePath)
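As a concrete illustration on this table, travelling back to the first commit shown earlier (20240116101626540) should surface rider-D's original fare of 33.9 rather than the updated 339.0. A minimal sketch:
// Time travel to the first commit of this table (timestamp taken from the earlier query output)
val firstCommitDF = spark.read.format("hudi").
  option("as.of.instant", "20240116101626540").
  load(basePath)
firstCommitDF.filter($"rider" === "rider-D").select("rider", "driver", "fare").show()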
8 Incremental query
Hudi also supports incremental queries, which return the stream of records that changed after a given commit timestamp. Specify the beginTime of the incremental query; endTime is optional. If you want all changes made after a given commit (the common case), there is no need to specify endTime.
// load the data
spark.read.format("hudi").load(basePath).createOrReplaceTempView("trips_table")
// determine the beginTime
val commits = spark.sql("SELECT DISTINCT(_hoodie_commit_time) AS commitTime FROM trips_table ORDER BY commitTime").map(k => k.getString(0)).take(50)
commits: Array[String] = Array(20240116101626540, 20240116103152078)
val beginTime = commits(commits.length - 2) // commit time we are interested in
beginTime: String = 20240116101626540
// incrementally query data
val tripsIncrementalDF = spark.read.format("hudi").
option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME.key(), beginTime).
load(basePath)
tripsIncrementalDF.createOrReplaceTempView("trips_incremental")
spark.sql("SELECT `_hoodie_commit_time`, fare, rider, driver, uuid, ts FROM trips_incremental WHERE fare > 20.0").show()
+-------------------+-----+-------+--------+--------------------+-------------+
|_hoodie_commit_time| fare| rider| driver| uuid| ts|
+-------------------+-----+-------+--------+--------------------+-------------+
| 20240116103152078|339.0|rider-D|driver-L|9909a8b1-2d15-4d3...|1695046462179|
+-------------------+-----+-------+--------+--------------------+-------------+
This filters for records committed after beginTime with fare > 20.0.
With incremental queries, we can build streaming pipelines on top of batch data, as sketched below.
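A minimal sketch of such a pull-based pipeline, assuming the last processed commit time is checkpointed in a local variable (a real pipeline would persist this checkpoint externally):
// Sketch: poll the table for changes since the last processed commit, then advance the checkpoint
var lastCommit = beginTime
val changes = spark.read.format("hudi").
  option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME.key(), lastCommit).
  load(basePath)
if (!changes.isEmpty) {
  // ... process the changed records here ...
  lastCommit = changes.agg(max($"_hoodie_commit_time")).as[String].first()
}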
9 Point-in-time query
To query the data as of a specific point in time, point endTime at that instant and set beginTime to 000 (the earliest commit time).
The query returns data from commits in the range (beginTime, endTime], i.e. committed after beginTime and up to and including endTime.
val beginTime = "000"
val endTime = commits(commits.length - 2)
val tripsPointInTimeDF = spark.read.format("hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("SELECT `_hoodie_commit_time`, fare, rider, driver, uuid, ts FROM hudi_trips_point_in_time WHERE fare > 20.0").show()
+-------------------+-----+-------+--------+--------------------+-------------+
|_hoodie_commit_time| fare| rider| driver| uuid| ts|
+-------------------+-----+-------+--------+--------------------+-------------+
| 20240116101626540| 27.7|rider-C|driver-M|e96c4396-3fad-413...|1695091554788|
| 20240116101626540| 33.9|rider-D|driver-L|9909a8b1-2d15-4d3...|1695046462179|
| 20240116101626540|34.15|rider-F|driver-P|e3cf430c-889d-401...|1695516137016|
+-------------------+-----+-------+--------+--------------------+-------------+