Hudi 0.14.0 Integration with Spark 3.2.3 (Spark Shell)

1 Startup

1.1 Start the Spark Shell

# For Spark versions: 3.2 - 3.4
spark-shell --jars /path/to/jars/hudi-spark3.2-bundle_2.12-0.14.0.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
For Spark 3.2 and higher versions, use the Scala 2.12 bundle and add the config --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' (already included in the command above).

hudi-spark3.2-bundle_2.12-0.14.0.jar is the bundle jar produced when compiling Hudi 0.14.0.

1.2 Import Dependencies and Set the Table Name

// spark-shell
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.table.HoodieTableConfig._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.common.model.HoodieRecord
import spark.implicits._

val tableName = "trips_table"
val basePath = "hdfs:///tmp/trips_table"
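
If you are only testing without an HDFS cluster, the base path can also point at the local filesystem; a hedged alternative, not used in the rest of this walkthrough:

// alternative for local testing only: keep the table on the local filesystem instead of HDFS
// val basePath = "file:///tmp/trips_table"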

2 Create Table

// scala
// First commit will auto-initialize the table, if it did not exist in the specified base path.

No separate CREATE TABLE step is needed: if the table does not exist at the given base path, the first batch write will create it automatically. An explicit definition is also possible, as sketched below.
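
If you prefer to define the table explicitly (for example, to register it in the catalog configured in section 1.1), a hedged Spark SQL sketch matching the schema used in the next section:

// optional: create the table explicitly instead of relying on the first write
// (assumes the HoodieCatalog config from section 1.1; schema matches the inserts below)
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS $tableName (
    ts BIGINT, uuid STRING, rider STRING, driver STRING, fare DOUBLE, city STRING
  ) USING hudi
  PARTITIONED BY (city)
  LOCATION '$basePath'
""")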

3 Insert Data

// spark-shell
val columns = Seq("ts","uuid","rider","driver","fare","city")
val data =
  Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
    (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
    (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
    (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"    ),
    (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));

var inserts = spark.createDataFrame(data).toDF(columns:_*)
inserts.write.format("hudi").
  option(PARTITIONPATH_FIELD_NAME.key(), "city").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

mode(Overwrite) drops and recreates the table if it already exists. You can check whether data has been generated under hdfs:///tmp/trips_table:

[root@master01 ~]# hdfs dfs -ls /tmp/trips_table
Found 4 items
drwxr-xr-x   - root hdfs          0 2024-01-16 09:59 /tmp/trips_table/.hoodie
drwxr-xr-x   - root hdfs          0 2024-01-16 09:59 /tmp/trips_table/chennai
drwxr-xr-x   - root hdfs          0 2024-01-16 09:59 /tmp/trips_table/san_francisco
drwxr-xr-x   - root hdfs          0 2024-01-16 09:59 /tmp/trips_table/sao_paulo
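
Hudi 0.14.0 auto-generates record keys when none are configured, which is what the write above relies on. If you want an explicit record key and precombine field instead, a hedged variant of the same write (the option constants come from the imports in section 1.2):

// hedged variant: configure an explicit record key and precombine field
// instead of letting Hudi auto-generate the keys
inserts.write.format("hudi").
  option(RECORDKEY_FIELD_NAME.key(), "uuid").     // from KeyGeneratorOptions
  option(PRECOMBINE_FIELD.key(), "ts").           // from DataSourceWriteOptions
  option(PARTITIONPATH_FIELD_NAME.key(), "city").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)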

4 Query Data

4.1 Load into a DataFrame

Hudi tables can be read back into a DataFrame or queried with Spark SQL.

val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")

4.2 Query


spark.sql("SELECT uuid, fare, ts, rider, driver, city FROM  trips_table WHERE fare > 20.0").show()
+--------------------+-----+-------------+-------+--------+-------------+       
|                uuid| fare|           ts|  rider|  driver|         city|
+--------------------+-----+-------------+-------+--------+-------------+
|e96c4396-3fad-413...| 27.7|1695091554788|rider-C|driver-M|san_francisco|
|9909a8b1-2d15-4d3...| 33.9|1695046462179|rider-D|driver-L|san_francisco|
|e3cf430c-889d-401...|34.15|1695516137016|rider-F|driver-P|    sao_paulo|
+--------------------+-----+-------------+-------+--------+-------------+

spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM  trips_table").show()
+-------------------+--------------------+----------------------+-------+--------+-----+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|  rider|  driver| fare|
+-------------------+--------------------+----------------------+-------+--------+-----+
|  20240116101626540|20240116101626540...|         san_francisco|rider-A|driver-K| 19.1|
|  20240116101626540|20240116101626540...|         san_francisco|rider-C|driver-M| 27.7|
|  20240116101626540|20240116101626540...|         san_francisco|rider-D|driver-L| 33.9|
|  20240116101626540|20240116101626540...|             sao_paulo|rider-F|driver-P|34.15|
|  20240116101626540|20240116101626540...|               chennai|rider-J|driver-T|17.85|
+-------------------+--------------------+----------------------+-------+--------+-----+
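
The same snapshot query can also be written with the DataFrame API instead of SQL, for example:

// equivalent of the first SQL query above, using the DataFrame API
tripsDF.filter($"fare" > 20.0).
  select("uuid", "fare", "ts", "rider", "driver", "city").
  show()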

5 Update Data

Hudi tables can be updated by streaming in a DataFrame or using a standard UPDATE statement.

// Lets read data from target Hudi table, modify fare column for rider-D and update it. 
val updatesDf = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-D").withColumn("fare", col("fare") * 10)

updatesDf.write.format("hudi").
  option(OPERATION_OPT_KEY, "upsert").
  option(PARTITIONPATH_FIELD_NAME.key(), "city").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)

The save mode is now Append.
In general, always use Append mode unless you are creating the table for the first time.
Querying the data again now shows the updated trip record. Each write operation generates a new commit, identified by a timestamp. Compare the _hoodie_commit_time and fare values of the same _hoodie_record_key against the previous commit.
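
One quick way to confirm that the update created a new commit is to list the distinct commit times from the _hoodie_commit_time meta column (section 8 uses the same idea to pick a beginTime):

// list the commits written so far; after the update there should be two distinct commit times
spark.read.format("hudi").load(basePath).
  select("_hoodie_commit_time").distinct().
  orderBy("_hoodie_commit_time").
  show(false)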

To query the updated data, reload the Hudi table:

val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")

spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM  trips_table").show()
+-------------------+--------------------+----------------------+-------+--------+-----+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|  rider|  driver| fare|
+-------------------+--------------------+----------------------+-------+--------+-----+
|  20240116101626540|20240116101626540...|         san_francisco|rider-A|driver-K| 19.1|
|  20240116101626540|20240116101626540...|         san_francisco|rider-C|driver-M| 27.7|
|  20240116103152078|20240116101626540...|         san_francisco|rider-D|driver-L|339.0|
|  20240116101626540|20240116101626540...|             sao_paulo|rider-F|driver-P|34.15|
|  20240116101626540|20240116101626540...|               chennai|rider-J|driver-T|17.85|
+-------------------+--------------------+----------------------+-------+--------+-----+

6 Delete Data

// spark-shell
// Lets delete rider: rider-F
val deletesDF = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-F")

deletesDF.write.format("hudi").
  option(OPERATION_OPT_KEY, "delete").
  option(PARTITIONPATH_FIELD_NAME.key(), "city").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)

Query the data again to see the change:

val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")


spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM  trips_table").show()
+-------------------+--------------------+----------------------+-------+--------+-----+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|  rider|  driver| fare|
+-------------------+--------------------+----------------------+-------+--------+-----+
|  20240116101626540|20240116101626540...|         san_francisco|rider-A|driver-K| 19.1|
|  20240116101626540|20240116101626540...|         san_francisco|rider-C|driver-M| 27.7|
|  20240116103152078|20240116101626540...|         san_francisco|rider-D|driver-L|339.0|
|  20240116101626540|20240116101626540...|               chennai|rider-J|driver-T|17.85|
+-------------------+--------------------+----------------------+-------+--------+-----+

7 Time Travel Query

Hudi has supported time travel queries since version 0.9.0. Three time formats are currently supported, as shown below.

spark.read.format("hudi").
  option("as.of.instant", "20210728141108100").
  load(basePath)

spark.read.format("hudi").
  option("as.of.instant", "2021-07-28 14:11:08.200").
  load(basePath)

// It is equal to "as.of.instant = 2021-07-28 00:00:00"
spark.read.format("hudi").
  option("as.of.instant", "2021-07-28").
  load(basePath)
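
As a concrete, hedged example for the table built in this walkthrough: reading as of the first commit shown earlier (20240116101626540) should return the original data, i.e. rider-D's fare before the update in section 5 and rider-F before the delete in section 6.

// time travel back to the first commit of this walkthrough;
// rider-D should still have fare 33.9 and rider-F should still be present
val firstCommitDF = spark.read.format("hudi").
  option("as.of.instant", "20240116101626540").
  load(basePath)
firstCommitDF.select("_hoodie_commit_time", "rider", "driver", "fare").show()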

8 Incremental Query

Hudi also provides incremental queries, which retrieve the stream of records changed since a given commit timestamp. Specify a beginTime for the incremental query and, optionally, an endTime. If we want all changes after a given commit (the common case), the endTime does not need to be specified.

// load the data
spark.read.format("hudi").load(basePath).createOrReplaceTempView("trips_table")

// determine the beginTime for the incremental query
val commits = spark.sql("SELECT DISTINCT(_hoodie_commit_time) AS commitTime FROM trips_table ORDER BY commitTime").map(k => k.getString(0)).take(50)
commits: Array[String] = Array(20240116101626540, 20240116103152078)

val beginTime = commits(commits.length - 2) // commit time we are interested in
beginTime: String = 20240116101626540

// incrementally query data
val tripsIncrementalDF = spark.read.format("hudi").
  option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME.key(), beginTime).
  load(basePath)
tripsIncrementalDF.createOrReplaceTempView("trips_incremental")

spark.sql("SELECT `_hoodie_commit_time`, fare, rider, driver, uuid, ts FROM  trips_incremental WHERE fare > 20.0").show()
+-------------------+-----+-------+--------+--------------------+-------------+
|_hoodie_commit_time| fare|  rider|  driver|                uuid|           ts|
+-------------------+-----+-------+--------+--------------------+-------------+
|  20240116103152078|339.0|rider-D|driver-L|9909a8b1-2d15-4d3...|1695046462179|
+-------------------+-----+-------+--------+--------------------+-------------+

This filters out the records committed after beginTime with fare > 20.0.
With incremental queries we can build streaming pipelines on top of batch data, as sketched below.
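
A minimal sketch of that idea, assuming the job is re-run periodically and persists the last commit it has processed (the checkpoint handling here is illustrative, not part of the original example):

// hedged sketch: pull only data written after the last processed commit,
// process it, then advance the "checkpoint" to the newest commit seen
import org.apache.spark.sql.functions.max

var lastCommit = beginTime  // in a real pipeline this would be stored between runs
val nextBatchDF = spark.read.format("hudi").
  option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME.key(), lastCommit).
  load(basePath)

if (!nextBatchDF.isEmpty) {
  nextBatchDF.show()  // placeholder for the real processing
  lastCommit = nextBatchDF.agg(max($"_hoodie_commit_time")).as[String].first()
}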

9 Point-in-Time Query

To query data as of a specific point in time, set endTime to that point and set beginTime to "000" (i.e. the earliest possible commit time).
The returned data covers the commit range (beginTime, endTime].

val beginTime = "000" 
val endTime = commits(commits.length - 2) 

val tripsPointInTimeDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")

spark.sql("SELECT `_hoodie_commit_time`, fare, rider, driver, uuid, ts FROM  hudi_trips_point_in_time WHERE fare > 20.0").show()
+-------------------+-----+-------+--------+--------------------+-------------+ 
|_hoodie_commit_time| fare|  rider|  driver|                uuid|           ts|
+-------------------+-----+-------+--------+--------------------+-------------+
|  20240116101626540| 27.7|rider-C|driver-M|e96c4396-3fad-413...|1695091554788|
|  20240116101626540| 33.9|rider-D|driver-L|9909a8b1-2d15-4d3...|1695046462179|
|  20240116101626540|34.15|rider-F|driver-P|e3cf430c-889d-401...|1695516137016|
+-------------------+-----+-------+--------+--------------------+-------------+
