Data Lake Hudi: Launching spark-shell

Launching spark-shell

When launching spark-shell you need to specify the spark-avro module, because it is not included in the default environment. The spark-avro version number must match the Spark version, here 2.4.5:

spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --jars /opt/module/hudi/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar

Set the Table Name

Set the table name, base path, and data generator:

scala> import org.apache.hudi.QuickstartUtils._

import org.apache.hudi.QuickstartUtils._

scala> import scala.collection.JavaConversions._

import scala.collection.JavaConversions._

scala> import org.apache.spark.sql.SaveMode._

import org.apache.spark.sql.SaveMode._

scala> import org.apache.hudi.DataSourceReadOptions._

import org.apache.hudi.DataSourceReadOptions._

scala> import org.apache.hudi.DataSourceWriteOptions._

import org.apache.hudi.DataSourceWriteOptions._

scala> import org.apache.hudi.config.HoodieWriteConfig._

import org.apache.hudi.config.HoodieWriteConfig._

scala> val tableName = "hudi_trips_cow"

tableName: String = hudi_trips_cow

scala> val basePath = "file:///tmp/hudi_trips_cow"

basePath: String = file:///tmp/hudi_trips_cow

scala> val dataGen = new DataGenerator

dataGen: org.apache.hudi.QuickstartUtils.DataGenerator = org.apache.hudi.QuickstartUtils$DataGenerator@5cdd5ff9

Insert Data

Generate some new records, load them into a DataFrame, and then write the DataFrame into the Hudi table:

scala> val inserts = convertToStringList(dataGen.generateInserts(10))

scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

scala> df.write.format("hudi").

     |   options(getQuickstartWriteConfigs).

     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").

     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").

     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").

     |   option(TABLE_NAME, tableName).

     |   mode(Overwrite).

     |   save(basePath)

mode(Overwrite) overwrites and recreates the table if it already exists. You can check whether data has been generated under the /tmp/hudi_trips_cow path:

[root@hadoop102 ~]# cd /tmp/hudi_trips_cow/

[root@hadoop102 hudi_trips_cow]# ls

americas  asia
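On subsequent writes you would normally use Append instead of Overwrite, so that new commits are added to the existing table rather than recreating it. A minimal sketch following the same write options as above, assuming the same spark-shell session (generateUpdates comes from the same QuickstartUtils DataGenerator):

```scala
// Generate updates for previously inserted records and append them as a new commit.
val updates = convertToStringList(dataGen.generateUpdates(10))
val updateDf = spark.read.json(spark.sparkContext.parallelize(updates, 2))

updateDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).          // Append adds a commit instead of recreating the table
  save(basePath)
```

Because the record keys (uuid) already exist in the table, Hudi upserts these rows instead of duplicating them.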

3.4 Query Data

scala> val tripsSnapshotDF = spark.

     |   read.

     |   format("hudi").

     |   load(basePath + "/*/*/*/*")

scala> tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

scala> spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()

+------------------+-------------------+-------------------+---+

|              fare|          begin_lon|          begin_lat| ts|

+------------------+-------------------+-------------------+---+

| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|

| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|

| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|

| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|

|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|

| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|

|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|

| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|

+------------------+-------------------+-------------------+---+

scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()

+-------------------+--------------------+----------------------+---------+----------+------------------+

|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|

+-------------------+--------------------+----------------------+---------+----------+------------------+
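The DataSourceReadOptions imported earlier also enable incremental queries, which return only the records written after a given commit time. A hedged sketch for Hudi 0.6, run in the same spark-shell session and assuming the table has at least two commits (commit times are taken from the _hoodie_commit_time column of the snapshot view registered above):

```scala
// Collect the existing commit times in order, then pick one to read after.
val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").
  map(k => k.getString(0)).take(50)
val beginTime = commits(commits.length - 2) // start after the second-to-last commit

// Incremental query: only records committed after beginTime are returned.
val tripsIncrementalDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
```

With only the initial commit there is nothing incremental to return, so run an update (Append write) first to produce a second commit.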
