pyspark && hudi


0. Overview

The sample code is mainly taken from the link below, with some modifications:

Quick-Start Guide - hudi.apache.org

The steps were run and debugged in jupyter, the output of each step is kept, and a few comparisons between spark dataframe and spark sql operations were added. The ipynb notebook is available at:

pyspark-hudi-quick-start.ipynb - github.com

It can be downloaded and run directly. The runtime environment is described in another article:

老冯: Building a spark & jupyter image - zhuanlan.zhihu.com

1. Launching pyspark

Starting pyspark is straightforward: just run spark_home/bin/pyspark. How do you make pyspark start a jupyter environment? Three environment variables are involved:

export PYSPARK_DRIVER_PYTHON=jupyter 
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
export PYSPARK_PYTHON=/usr/bin/python3

These tell pyspark to use the "jupyter" command as the driver Python, to pass it the "notebook" argument, and to make spark workers use "PYSPARK_PYTHON=/usr/bin/python3" when executing spark tasks. You can set these three variables at the OS level. If that causes problems, you can edit the pyspark launcher script directly and add the three variables around line 23:

[screenshot: the three export lines added around line 23 of bin/pyspark]

This is because setting "PYSPARK_DRIVER_PYTHON=jupyter" may interfere with the execution of "find-spark-home".

The following command starts pyspark; if the environment variables above are configured, it launches jupyter:

spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
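Once the session is up, a quick sanity check confirms which interpreters are actually in use. This is a minimal sketch of my own (not part of the original guide) and assumes the spark/sc variables that the pyspark shell creates:

import sys

# Interpreter running the driver; should match what PYSPARK_DRIVER_PYTHON points at
print("driver python:", sys.executable)

# Interpreter running the executors; should match PYSPARK_PYTHON
executor_python = sc.parallelize([0], 1) \
    .map(lambda _: __import__("sys").executable) \
    .collect()[0]
print("executor python:", executor_python)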

2. pyspark & hudi on jupyter

Hudi tables come in two types: Copy on Write (COW) and Merge on Read (MOR).

  • COW: files are merged at write time, which makes the table friendly to reads.
  • MOR: files are merged at read time, so writes are faster and reads are relatively slower.

The table used in this article is COW (a sketch of setting the type explicitly follows below).
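As an aside, the table type can also be chosen explicitly at write time. The sketch below is mine, not part of the quick-start, and assumes the hoodie.datasource.write.table.type option available in recent hudi releases; the examples in this article simply rely on the Copy on Write default.

# Sketch: choosing the table type explicitly when writing
# ('COPY_ON_WRITE' is the default; 'MERGE_ON_READ' merges at read time)
table_type_option = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',  # or 'MERGE_ON_READ'
}
# This dict would be merged into the write options used below, e.g. {**hudi_options, **table_type_option}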

1. Insert data

# Table name
tableName = "hudi_trips_cow"
# Storage path; written straight to the local filesystem here, but hdfs or other cloud storage also works.
basePath = "file:///opt/data/hudi_trips_cow"
# Hudi's helper class QuickstartUtils is used to generate test data
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

Generate the test data and convert it into a spark dataframe:

# Generate 10 records
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
# Convert to a dataframe, split across two partitions
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.show()

The data looks like this:

+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
|          begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|       partitionpath|    rider| ts|                uuid|
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
| 0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|americas/brazil/s...|rider-213|0.0|4c24b4a1-0168-427...|
| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|americas/brazil/s...|rider-213|0.0|14657a04-2488-440...|
| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|americas/united_s...|rider-213|0.0|3dad2536-7f33-418...|
|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|americas/united_s...|rider-213|0.0|d51ffec5-60ac-489...|
|   0.40613510977307| 0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|  asia/india/chennai|rider-213|0.0|24ca6908-cb75-4e4...|
| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|americas/united_s...|rider-213|0.0|47bc52aa-dd97-46d...|
| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|americas/united_s...|rider-213|0.0|5e83112b-49a9-414...|
| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|americas/brazil/s...|rider-213|0.0|e210af6c-edb2-416...|
|  0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|  asia/india/chennai|rider-213|0.0|567e5d3e-a436-4dc...|
|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|americas/united_s...|rider-213|0.0|a5547435-6566-47d...|
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+

There are 10 columns: begin_lat, begin_lon, driver, end_lat, end_lon, fare, partitionpath, rider, ts, uuid. Hudi organizes the table's directory structure by partitionpath, similar to a Hive partitioned table.
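For reference, the distinct partition values in the generated sample can be listed straight from the dataframe (a quick check of my own, using the df created above):

# Show the partition values the sample rows will be written under
df.select("partitionpath").distinct().show(truncate=False)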

The insert code:

# Hudi write options
hudi_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'uuid',
  'hoodie.datasource.write.partitionpath.field': 'partitionpath',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'insert',
  'hoodie.datasource.write.precombine.field': 'ts',
  'hoodie.upsert.shuffle.parallelism': 2, 
  'hoodie.insert.shuffle.parallelism': 2
}
# hoodie.datasource.write.precombine.field, default: ts
# Field used for pre-combining before the actual write. When two records have the same
# key (recordkey), the record with the largest precombine field value is kept, compared
# via Object.compareTo(..).

# hoodie.datasource.write.recordkey.field, default: uuid
# Record key field, used as the recordKey part of the HoodieKey. The actual value is
# obtained by calling .toString() on the field value. Nested fields can be specified
# with dot notation, e.g. a.b.c

# hoodie.datasource.write.partitionpath.field, default: partitionpath
# Partition path field, used as the partitionPath part of the HoodieKey. The actual
# value is obtained by calling .toString().

# hoodie.insert.shuffle.parallelism, hoodie.upsert.shuffle.parallelism
# Controls the initial parallelism used to read the input records when importing data.
# Make sure it is high enough, e.g. one partition per 1 GB of input data.

# hoodie.datasource.write.operation, default: upsert
# Whether the write is an upsert, insert, or bulk insert. Use bulkinsert to load new data
# into a table, then use upsert or insert afterwards.
# Bulk insert uses a disk-based write path to scale to large inputs without caching them.
# Use bulkinsert for very large volumes of data.

# "overwrite" should only be used the first time the table is created; all other writes should use "append"
df.write.format("hudi"). \
  options(**hudi_options). \
  mode("overwrite"). \
  save(basePath)

The directory structure after the table is initialized:

[screenshot: Hudi table directory layout after the initial write]

"hoodie.datasource.write.partitionpath.field" specifies which field to partition by. Each partition directory contains a parquet file holding the data.
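To inspect this layout without leaving the notebook, a small helper can walk the directory. This is my own sketch, not part of the quick-start, and it assumes the local file:// basePath used above (for hdfs or cloud storage you would list paths with the corresponding client instead). Re-running it after the later update and delete steps shows the additional file slices as well.

import os

def list_hudi_files(path):
    """Print every file under a local hudi table directory, including .hoodie metadata."""
    for root, _dirs, files in os.walk(path):
        for name in files:
            print(os.path.join(root, name))

# basePath is "file:///opt/data/hudi_trips_cow"; strip the scheme for os.walk
list_hudi_files("/opt/data/hudi_trips_cow")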

2. Query data

  1. Load the hudi table directory into a dataframe using the hudi format
  2. Register a temporary view
  3. Run spark sql against the view
# Example of reading the data; note how the load path is written
tripsSnapshotDF = spark. \
  read. \
  format("hudi"). \
  load(basePath + "/*/*/*/*")
# load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
# Create a view; from here on it is standard spark sql
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

# Get the record count (using the dataframe)
count = tripsSnapshotDF.count()

print("row count:", count)
# Get the record count (using sql)
spark.sql("select count(1) from  hudi_trips_snapshot").show()
row count: 10
+--------+
|count(1)|
+--------+
|      10|
+--------+

Query the data (selecting columns) with both the dataframe API and sql; the output is identical.

Note that the following three columns do not exist in the original data:

  • _hoodie_commit_time, the time the record was committed
  • _hoodie_record_key, the record key, taken from the uuid column of the original data
  • _hoodie_partition_path, taken from the partitionpath column of the original data
new_df = tripsSnapshotDF.select(["_hoodie_commit_time","_hoodie_record_key","_hoodie_partition_path","rider","driver","fare"])

new_df.show()

spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|     20200626130414|3dad2536-7f33-418...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|     20200626130414|d51ffec5-60ac-489...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|     20200626130414|47bc52aa-dd97-46d...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|     20200626130414|5e83112b-49a9-414...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|     20200626130414|a5547435-6566-47d...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|     20200626130414|4c24b4a1-0168-427...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|     20200626130414|14657a04-2488-440...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|     20200626130414|e210af6c-edb2-416...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|     20200626130414|24ca6908-cb75-4e4...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
|     20200626130414|567e5d3e-a436-4dc...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+

+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|     20200626130414|3dad2536-7f33-418...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|     20200626130414|d51ffec5-60ac-489...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|     20200626130414|47bc52aa-dd97-46d...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|     20200626130414|5e83112b-49a9-414...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|     20200626130414|a5547435-6566-47d...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|     20200626130414|4c24b4a1-0168-427...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|     20200626130414|14657a04-2488-440...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|     20200626130414|e210af6c-edb2-416...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|     20200626130414|24ca6908-cb75-4e4...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
|     20200626130414|567e5d3e-a436-4dc...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+

A query with a where clause, again with both the dataframe API and sql; the outputs are also identical.

tripsSnapshotDF.select(["_hoodie_commit_time","fare","begin_lon","begin_lat","ts"]).where("fare > 20.0").show()

spark.sql("select _hoodie_commit_time , fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200626130414| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200626130414| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200626130414| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200626130414| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200626130414|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200626130414|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200626130414| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200626130414| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+

+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200626130414| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200626130414| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200626130414| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200626130414| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200626130414|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200626130414|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200626130414| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200626130414| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+

3. Update data

The uuids in the update dataset all already exist in the table.

  • The write mode for updates is append.
# Generate 10 new update records; note that the write mode will be append
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
print("The original dataset:")
df.show()

print("The data already in the table:")
spark.sql("select uuid from hudi_trips_snapshot order by uuid").show()

# Parallelize the update data into two partitions
df_updates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
print("The update data to be written:")
df_updates.select(["uuid","ts"]).orderBy("uuid").show()
The original dataset:
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
|          begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|       partitionpath|    rider| ts|                uuid|
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
| 0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|americas/brazil/s...|rider-213|0.0|4c24b4a1-0168-427...|
| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|americas/brazil/s...|rider-213|0.0|14657a04-2488-440...|
| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|americas/united_s...|rider-213|0.0|3dad2536-7f33-418...|
|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|americas/united_s...|rider-213|0.0|d51ffec5-60ac-489...|
|   0.40613510977307| 0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|  asia/india/chennai|rider-213|0.0|24ca6908-cb75-4e4...|
| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|americas/united_s...|rider-213|0.0|47bc52aa-dd97-46d...|
| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|americas/united_s...|rider-213|0.0|5e83112b-49a9-414...|
| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|americas/brazil/s...|rider-213|0.0|e210af6c-edb2-416...|
|  0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|  asia/india/chennai|rider-213|0.0|567e5d3e-a436-4dc...|
|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|americas/united_s...|rider-213|0.0|a5547435-6566-47d...|
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+

The data already in the table:
+--------------------+
|                uuid|
+--------------------+
|14657a04-2488-440...|
|24ca6908-cb75-4e4...|
|3dad2536-7f33-418...|
|47bc52aa-dd97-46d...|
|4c24b4a1-0168-427...|
|567e5d3e-a436-4dc...|
|5e83112b-49a9-414...|
|a5547435-6566-47d...|
|d51ffec5-60ac-489...|
|e210af6c-edb2-416...|
+--------------------+

The update data to be written:
# 10 records but only 5 distinct uuids; every uuid already exists in the table
+--------------------+---+
|                uuid| ts|
+--------------------+---+
|14657a04-2488-440...|0.0|
|24ca6908-cb75-4e4...|0.0|
|47bc52aa-dd97-46d...|0.0|
|47bc52aa-dd97-46d...|0.0|
|47bc52aa-dd97-46d...|0.0|
|567e5d3e-a436-4dc...|0.0|
|567e5d3e-a436-4dc...|0.0|
|567e5d3e-a436-4dc...|0.0|
|d51ffec5-60ac-489...|0.0|
|d51ffec5-60ac-489...|0.0|
+--------------------+---+

Write the updates:

# Note that the mode here is append
df_updates.write.format("hudi"). \
  options(**hudi_options). \
  mode("append"). \
  save(basePath)

The table directory structure after the update:

[screenshot: table directory after the update, with a new parquet file in each partition]

Each partition now contains two parquet files; a new parquet file was generated in each to hold the new data.

Querying through the original view still returns the 10 rows as they were. Once spark loads data into a dataframe it is immutable, so changes in the underlying directory do not affect the dataframe.

spark.sql("select _hoodie_commit_time , fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200626052507| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200626052507| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200626052507| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200626052507| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200626052507|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200626052507|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200626052507| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200626052507| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+

To see the new data, reload the view:

spark. \
  read. \
  format("hudi"). \
  load(basePath + "/*/*/*/*"). \
  createOrReplaceTempView("hudi_trips_snapshot")
# Query after reloading
spark.sql("select _hoodie_commit_time , fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0 order by _hoodie_commit_time").show()

The dataset has changed:

+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200626134101|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200626134101|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200626134101| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200626134101| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
|     20200626134101| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200626134101| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200626134101| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200626134101| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
+-------------------+------------------+-------------------+-------------------+---+

4. Incremental query

Incremental queries are a feature that hudi specifically supports. Two options drive them:

  • 'hoodie.datasource.query.type': 'incremental'
  • 'hoodie.datasource.read.begin.instanttime': beginTime
commits = list(map(lambda row: row[0], spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").limit(50).collect()))
beginTime = commits[len(commits) - 2] # commit time we are interested in
beginTime = '20200626052507' # hard-coded from this run; overrides the value computed above
print("begin time:", beginTime)
# Incremental query with a begin time
incremental_read_options = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': beginTime,
}

tripsIncrementalDF = spark.read.format("hudi"). \
  options(**incremental_read_options). \
  load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()

The query only returns records committed after the begin time.

begin time: 20200626052507
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200626134101| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200626134101| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200626134101| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200626134101| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200626134101|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200626134101|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200626134101| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200626134101| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+
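The commit instants that these begin/end times refer to are recorded under the table's .hoodie directory. A small sketch of my own (again assuming the local basePath used above) lists the completed commits straight from disk:

import os

hoodie_dir = "/opt/data/hudi_trips_cow/.hoodie"
# On a Copy on Write table, each completed commit leaves an <instant>.commit file
commits_on_disk = sorted(f.split(".")[0] for f in os.listdir(hoodie_dir)
                         if f.endswith(".commit"))
print(commits_on_disk)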

5. Point-in-time query

Same approach as the incremental query, with an end time added.

A beginTime of "000" represents the earliest possible time.

beginTime = "000" 
endTime = commits[len(commits) - 2]

# query point in time data
point_in_time_read_options = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.end.instanttime': endTime,
  'hoodie.datasource.read.begin.instanttime': beginTime
}

tripsPointInTimeDF = spark.read.format("hudi"). \
  options(**point_in_time_read_options). \
  load(basePath)
print("begin time:", beginTime, "\tend time:", endTime)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
begin time: 000 	end time: 20200626134101
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200626134101| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200626134101| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200626134101| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200626134101| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200626134101|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200626134101|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200626134101| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200626134101| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+

6. Delete data

  1. Build a dataframe containing the record key (uuid), the partition field (partitionPath), and the precombine field (ts)
  2. Set hoodie.datasource.write.operation to delete
  3. Write the dataframe to the table path with write mode append

Prepare the records to delete:

# pyspark
# fetch total records count
count = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
print(count)
# fetch two records to be deleted
ds = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").limit(2)
ds.show()

These two records are the ones to be deleted:

10
+--------------------+--------------------+
|                uuid|       partitionPath|
+--------------------+--------------------+
|3dad2536-7f33-418...|americas/united_s...|
|d51ffec5-60ac-489...|americas/united_s...|
+--------------------+--------------------+

The delete code:

# issue deletes
hudi_delete_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'uuid',
  'hoodie.datasource.write.partitionpath.field': 'partitionpath',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'delete',
  'hoodie.datasource.write.precombine.field': 'ts',
  'hoodie.upsert.shuffle.parallelism': 2, 
  'hoodie.insert.shuffle.parallelism': 2
}

from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
print("Records to be deleted:")
print(deletes)
# The code on the official site may have a bug here.
# Official version: df = spark.sparkContext.parallelize(deletes).toDF(['partitionpath', 'uuid']).withColumn('ts', lit(0.0))
# ds above selected uuid first and partitionPath second, so the column names must follow the same order:
df = spark.sparkContext.parallelize(deletes).toDF(['uuid','partitionpath']).withColumn('ts', lit(0.0))
df.show()
df.write.format("hudi"). \
  options(**hudi_delete_options). \
  mode("append"). \
  save(basePath)

Reload the data; only 8 records remain.

roAfterDeleteViewDF = spark. \
  read. \
  format("hudi"). \
  load(basePath + "/*/*/*/*")
roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
# fetch should return (total - 2) records
result = spark.sql("select uuid, partitionPath from hudi_trips_snapshot")
result.show()
print(result.count())

Output:

+--------------------+--------------------+
|                uuid|       partitionPath|
+--------------------+--------------------+
|47bc52aa-dd97-46d...|americas/united_s...|
|5e83112b-49a9-414...|americas/united_s...|
|a5547435-6566-47d...|americas/united_s...|
|4c24b4a1-0168-427...|americas/brazil/s...|
|14657a04-2488-440...|americas/brazil/s...|
|e210af6c-edb2-416...|americas/brazil/s...|
|24ca6908-cb75-4e4...|  asia/india/chennai|
|567e5d3e-a436-4dc...|  asia/india/chennai|
+--------------------+--------------------+

8

The table directory after the delete: a third file appears under the united_states/san_francisco partition, containing the post-delete dataset.

[screenshot: table directory after the delete, with a third file in the affected partition]
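Re-running the list_hudi_files helper sketched earlier (my own helper, not from the post) makes the extra file slice visible without a screenshot:

# The delete commit adds a new parquet file slice in the affected partition
list_hudi_files("/opt/data/hudi_trips_cow")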

3. more

More hudi read/write configurations:

Configurations - hudi.apache.org

The generated data differs from run to run, so the output shown here is for reference only.
