Section 2: Testing Spark operations on Hudi 0.9 with CDH 6.3.2 when the versions are incompatible

spark-shell operations

(1) When launching spark-shell you must specify the spark-avro module, because it is not in the default environment. The spark-avro version must match the Spark version (you can check on the Maven repository, https://mvnrepository.com/, whether a spark-avro artifact exists for your Spark version), and you must use the Hudi jar built earlier.

Note that spark-avro 3.0.0 is built against Scala 2.12; if you are on Apache Spark 3.0.0 or later, refer to Section 1 and compile Hudi with Scala 2.12.

Since this environment is CDH 6.3.2 with Spark 2.4.0, start spark-shell with the following command.

Local mode: --master local[*]

[xxx@xxx Hudi]# spark-shell \
--jars packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.9.0.jar \
--packages org.apache.spark:spark-avro_2.11:2.4.4 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
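
Once the shell is up, it is worth a quick sanity check that the session's versions match what the Hudi bundle was built against. A minimal check from the spark-shell prompt (nothing Hudi-specific is assumed here):

// Confirm the Spark and Scala versions of the running session, and that the
// Kryo serializer passed via --conf was actually picked up.
println(spark.version)                        // should report a 2.4.0 (CDH) build
println(scala.util.Properties.versionString)  // should report Scala 2.11.x
println(spark.sparkContext.getConf.get("spark.serializer"))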

Set up the table name

scala> import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.QuickstartUtils._
 
scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._
 
scala> import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.SaveMode._
 
scala> import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceReadOptions._
 
scala> import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceWriteOptions._
 
scala> import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.config.HoodieWriteConfig._
 
scala> val tableName = "hudi_trips_cow"
tableName: String = hudi_trips_cow
 
scala> val basePath = "file:///tmp/hudi_trips_cow"
basePath: String = file:///tmp/hudi_trips_cow
 
scala> val dataGen = new DataGenerator
dataGen: org.apache.hudi.QuickstartUtils.DataGenerator = org.apache.hudi.QuickstartUtils$DataGenerator@5cdd5ff9

Insert data

To add data: generate some records, load them into a DataFrame, and then write the DataFrame to the Hudi table.

scala> val inserts = convertToStringList(dataGen.generateInserts(10))
scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
scala> df.write.format("hudi").
        options(getQuickstartWriteConfigs).
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        option(TABLE_NAME, tableName).
        mode(Overwrite).
        save(basePath)

mode(Overwrite) overwrites and recreates the table if it already exists. You can check the /tmp/hudi_trips_cow path to see whether data was generated.

The write fails with an error:

df.write.format("hudi").options(getQuickstartWriteConfigs).option(PRECOMBINE_FIELD_OPT_KEY, "ts").option(RECORDKEY_FIELD_OPT_KEY, "uuid").option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").option(TABLE_NAME, tableName).mode(Overwrite).save(basePath)
warning: there was one deprecation warning; re-run with -deprecation for details
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.DataSourceUtils$.PARTITIONING_COLUMNS_KEY()Ljava/lang/String;
  at org.apache.hudi.DataSourceWriteOptions$.translateSqlOptions(DataSourceOptions.scala:203)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:158)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)

The cause is that the Spark version shipped with CDH is older than the Spark version Hudi was compiled against. Look closely at the class involved in the error:

org.apache.spark.sql.execution.datasources.DataSourceUtils

 

The CDH Spark build is older and is missing some of the code this class provides in newer releases.

A careful comparison shows that the extra code present in Spark 2.4.4 (but missing from CDH's 2.4.0) is referenced from only one place in Hudi. Inspecting those members:

val PARTITIONING_COLUMNS_KEY : java.lang.String = { /* compiled code */ }
def encodePartitioningColumns(columns : scala.Seq[scala.Predef.String]) : scala.Predef.String = { /* compiled code */ }
def decodePartitioningColumns(str : scala.Predef.String) : scala.Seq[scala.Predef.String] = { /* compiled code */ }
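
To confirm from the running shell that the CDH Spark build really lacks this member, a quick reflection probe can be used. This is only a sketch; the class and member names are taken from the stack trace above.

import scala.util.{Failure, Success, Try}

// Look up the PARTITIONING_COLUMNS_KEY getter on Spark's DataSourceUtils object.
// On CDH 6.3.2 (Spark 2.4.0) this lookup is expected to fail, which matches the
// NoSuchMethodError thrown from Hudi's translateSqlOptions.
Try(Class.forName("org.apache.spark.sql.execution.datasources.DataSourceUtils$")
      .getMethod("PARTITIONING_COLUMNS_KEY")) match {
  case Success(m) => println(s"present: $m")
  case Failure(e) => println(s"missing: $e")
}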

 

Since this code path does not appear to be needed for now, comment out the if block that references it (see the sketch below), recompile Hudi, and repeat the steps above; the write now succeeds.
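
For reference, the block being commented out lives in DataSourceWriteOptions.translateSqlOptions (DataSourceOptions.scala, around line 203 per the stack trace). The sketch below is only a paraphrase of its shape, not the exact Hudi 0.9.0 source; it shows why disabling it is low-risk for this test.

// Paraphrased shape of the guarded block in translateSqlOptions (not the exact
// Hudi source). It maps the columns passed to Spark's DataFrameWriter.partitionBy()
// (encoded under DataSourceUtils.PARTITIONING_COLUMNS_KEY, which CDH's Spark 2.4.0
// lacks) onto Hudi's partition-path option. The quickstart sets
// PARTITIONPATH_FIELD_OPT_KEY explicitly instead of using partitionBy(), so
// commenting this block out does not affect the steps above.
// if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)) {
//   val partitionColumns = optParams.get(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
//     .map(SparkDataSourceUtils.decodePartitioningColumns)
//     .getOrElse(Nil)
//   translatedOptParams = optParams ++ Map(
//     PARTITIONPATH_FIELD_OPT_KEY -> partitionColumns.mkString(","))
// }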

 [xxx@xxx~]# cd /tmp/hudi_trips_cow/

[xxx@xxx hudi_trips_cow]# ls

americas asia

Query data

scala> val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
tripsSnapshotDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

scala> spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+------------------+-------------------+-------------------+-------------+
|              fare|          begin_lon|          begin_lat|           ts|
+------------------+-------------------+-------------------+-------------+
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1631251396281|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1631217520301|
| 93.56018115236618|0.14285051259466197|0.21624150367601136|1631301858022|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1631481586975|
|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|1631289033325|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1631239743749|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|1631424371491|
| 41.06290929046368| 0.8192868687714224|  0.651058505660742|1631353184509|
+------------------+-------------------+-------------------+-------------+
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|     20200701105144|6007a624-d942-4e0...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|     20200701105144|db7c6361-3f05-48d...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|     20200701105144|dfd0e7d9-f10c-468...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|     20200701105144|e36365c8-5b3a-415...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|     20200701105144|fb92c00e-dea2-48e...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|     20200701105144|98be3080-a058-47d...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|     20200701105144|3dd6ef72-4196-469...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|     20200701105144|20f9463f-1c14-4e6...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|     20200701105144|1585ad3a-11c9-43c...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
|     20200701105144|d40daa90-cf1a-4d1...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+

Update data

Similar to inserting new data: use the data generator to create updates to the existing records, load them into a DataFrame, and write the DataFrame to the Hudi table.

scala> val updates = convertToStringList(dataGen.generateUpdates(10))
scala> val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
scala> df.write.format("hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, tableName).
     |   mode(Append).
     |   save(basePath)

Incremental query

Hudi also provides the ability to obtain the stream of records that have changed since a given commit timestamp. This is done with Hudi's incremental query by supplying the begin time from which changes should be streamed.

scala> spark.read.format("hudi").load(basePath + "/*/*/*/*").createOrReplaceTempView("hudi_trips_snapshot")
scala> val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
scala> val beginTime = commits(commits.length - 2)
beginTime: String = 20200701105144
scala> val tripsIncrementalDF = spark.read.format("hudi").
     |   option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
     |   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |   load(basePath)
scala> tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()
+-------------------+------------------+--------------------+-------------------+---+
|_hoodie_commit_time|              fare|           begin_lon|          begin_lat| ts|
+-------------------+------------------+--------------------+-------------------+---+
|     20200701110546|49.527694252432056|  0.5142184937933181| 0.7340133901254792|0.0|
|     20200701110546|  90.9053809533154| 0.19949323322922063|0.18294079059016366|0.0|
|     20200701110546|  98.3428192817987|  0.3349917833248327| 0.4777395067707303|0.0|
|     20200701110546| 90.25710109008239|  0.4006983139989222|0.08528650347654165|0.0|
|     20200701110546| 63.72504913279929|   0.888493603696927| 0.6570857443423376|0.0|
|     20200701110546| 29.47661370147079|0.010872312870502165| 0.1593867607188556|0.0|
+-------------------+------------------+--------------------+-------------------+---+

This returns the records committed after beginTime whose fare is greater than 20.

Point-in-time query

To query as of a specific point in time, point endTime at that time and set beginTime to "000" (meaning the earliest commit time).

scala> val beginTime = "000"
beginTime: String = 000
 
scala> val endTime = commits(commits.length - 2)
endTime: String = 20200701105144
scala> val tripsPointInTimeDF = spark.read.format("hudi").
     |   option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
     |   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |   option(END_INSTANTTIME_OPT_KEY, endTime).
     |   load(basePath)
scala> tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200701105144| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200701105144| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200701105144| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200701105144| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200701105144|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200701105144| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200701105144|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200701105144| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+

Delete data

scala> spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
res12: Long = 10
scala> val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
scala> val deletes = dataGen.generateDeletes(ds.collectAsList())
scala> val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2));
scala> df.write.format("hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(OPERATION_OPT_KEY,"delete").
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, tableName).
     |   mode(Append).
     |   save(basePath)
scala> val roAfterDeleteViewDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
scala> roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
scala> spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
res15: Long = 8

Deletes are only supported in Append mode.

Overwrite data

(1) For some batch ETL jobs, overwriting all data in the target partitions (i.e., recomputing those partitions in one pass) can be more efficient than an upsert, because the overwrite path skips the indexing and precombine steps that an upsert always performs.

scala> spark.read.format("hudi").load(basePath + "/*/*/*/*").select("uuid", "partitionpath").sort("partitionpath", "uuid").show(100, false)
 
21/08/04 13:00:08 WARN DefaultSource: Loading Base File Only View.
+------------------------------------+------------------------------------+
|uuid                                |partitionpath                       |
+------------------------------------+------------------------------------+
|0fb8f685-6db5-4d58-a7b1-79da5a1e0e00|americas/brazil/sao_paulo           |
|31705950-ccff-4555-9f95-afb1e2438346|americas/brazil/sao_paulo           |
|b3c04064-81aa-429b-a0eb-5b1e5c87dde8|americas/brazil/sao_paulo           |
|b24917ba-e241-4c1d-bb64-e5d70cb984a3|americas/united_states/san_francisco|
|c32e33b7-1fff-4422-a026-62efcf912863|americas/united_states/san_francisco|
|c4f2075a-a36d-45a5-ac8a-f2a18fabe2a2|americas/united_states/san_francisco|
|40fe2e44-7d38-4660-8369-987ec7c4ba82|asia/india/chennai                  |
|53081003-176e-47bd-b408-8130cd623f77|asia/india/chennai                  |
+------------------------------------+------------------------------------+
 
scala> val inserts = convertToStringList(dataGen.generateInserts(10))
scala> val df = spark.
     |   read.json(spark.sparkContext.parallelize(inserts, 2)).
     |   filter("partitionpath = 'americas/united_states/san_francisco'")
 
scala> df.write.format("hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(OPERATION_OPT_KEY,"insert_overwrite").
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, tableName).
     |   mode(Append).
     |   save(basePath)
 
scala> spark.
     |   read.format("hudi").
     |   load(basePath + "/*/*/*/*").
     |   select("uuid","partitionpath").
     |   sort("partitionpath","uuid").
     |   show(100, false)