![63e0fdfc25557f5c9dba148544272dbc.png](https://i-blog.csdnimg.cn/blog_migrate/d545a6604341284e4bd95e953e6a8c4d.png)
0. Overview
The example code comes mainly from the official Quick-Start Guide (hudi.apache.org), with some modifications. It was debugged in Jupyter, keeping the output of each step, and adding a few comparisons between Spark DataFrame and Spark SQL operations. The notebook code is at pyspark-hudi-quick-start.ipynb (github.com) and can be downloaded and run directly. For the runtime environment, see another article: 老冯: Building a spark & jupyter image (zhuanlan.zhihu.com).
![4e730a2e945c908089647c35497a6da8.png](https://i-blog.csdnimg.cn/blog_migrate/84d80fa8aef10b7492b4f79cd00d85a5.jpeg)
1. Starting pyspark
Starting pyspark is simple: just run spark_home/bin/pyspark. But how do you make pyspark start a Jupyter environment? Three environment variables are involved:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
export PYSPARK_PYTHON=/usr/bin/python3
These tell pyspark to use the "jupyter" command as the driver Python, passing it the argument "notebook". Spark workers execute Spark tasks with "PYSPARK_PYTHON=/usr/bin/python3". You can set these three variables in the OS environment. If that causes errors, you can instead edit the pyspark script directly and add the three variables around line 23:
![72d74c5e1fac7171e8f2f170b4fd3960.png](https://i-blog.csdnimg.cn/blog_migrate/2400dce19fd2600fe4696271bb4fb69b.jpeg)
This is because setting "PYSPARK_DRIVER_PYTHON=jupyter" may interfere with the execution of "find-spark-home".
The following command starts pyspark; if the environment above is configured, what actually launches is Jupyter:
spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
3. pyspark & hudi on jupyter
Hudi tables come in two types: Copy on Write (cow) and Merge on Read (mor).
- cow: files are merged at write time, which makes this table type read-friendly.
- mor: files are merged at read time; writes are faster, reads relatively slower.
The table in this article's example is cow.
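For reference, the table type is selected with a single write option; a minimal sketch (the option key and values follow the Hudi configuration docs, and COPY_ON_WRITE is the default, which is why the insert example later in this article does not set it explicitly):

```python
# Write option selecting the Hudi table type.
# COPY_ON_WRITE (cow) is the default; MERGE_ON_READ (mor) must be set explicitly.
cow_opts = {'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'}
mor_opts = {'hoodie.datasource.write.table.type': 'MERGE_ON_READ'}
print(cow_opts['hoodie.datasource.write.table.type'])
```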
1. Inserting data
# define the table name
tableName = "hudi_trips_cow"
# define the storage path; here a local file path, but HDFS or other cloud storage also works
basePath = "file:///opt/data/hudi_trips_cow"
# hudi's utility class QuickstartUtils, used to generate test data
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
Generate test data and convert it into a Spark DataFrame:
# generate 10 records
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
# convert to a DataFrame; the data is split into two partitions
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.show()
The data looks like this:
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
| begin_lat| begin_lon| driver| end_lat| end_lon| fare| partitionpath| rider| ts| uuid|
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
| 0.4726905879569653|0.46157858450465483|driver-213| 0.754803407008858| 0.9671159942018241|34.158284716382845|americas/brazil/s...|rider-213|0.0|4c24b4a1-0168-427...|
| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655| 43.4923811219014|americas/brazil/s...|rider-213|0.0|14657a04-2488-440...|
| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|americas/united_s...|rider-213|0.0|3dad2536-7f33-418...|
|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|americas/united_s...|rider-213|0.0|d51ffec5-60ac-489...|
| 0.40613510977307| 0.5644092139040959|driver-213| 0.798706304941517|0.02698359227182834|17.851135255091155| asia/india/chennai|rider-213|0.0|24ca6908-cb75-4e4...|
| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302| 0.362464770874404|19.179139106643607|americas/united_s...|rider-213|0.0|47bc52aa-dd97-46d...|
| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|americas/united_s...|rider-213|0.0|5e83112b-49a9-414...|
| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|americas/brazil/s...|rider-213|0.0|e210af6c-edb2-416...|
| 0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368| asia/india/chennai|rider-213|0.0|567e5d3e-a436-4dc...|
|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|americas/united_s...|rider-213|0.0|a5547435-6566-47d...|
+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
There are 10 columns in total: begin_lat, begin_lon, driver, end_lat, end_lon, fare, partitionpath, rider, ts, uuid. A Hudi table organizes its directory structure by partitionpath, similar to a Hive partitioned table.
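To make that layout concrete, here is an illustrative sketch of how a partitionpath value maps to a directory under basePath (only asia/india/chennai appears in full in the console output above; the other partition paths are truncated there):

```python
import posixpath

# basePath from the example above, with the file:// scheme stripped
base_dir = "/opt/data/hudi_trips_cow"
# a partitionpath value visible in full in the sample data
partition = "asia/india/chennai"

# each distinct partitionpath becomes a subdirectory that holds the
# parquet files of that partition, much like a Hive partitioned table
print(posixpath.join(base_dir, partition))
```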
The insert code:
# hudi write options
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}
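These options feed Spark's DataFrameWriter. A minimal sketch of the insert itself, wrapped in a helper so the call chain is visible; it follows the upstream quickstart pattern and assumes a live SparkSession with the hudi bundle on its classpath (`write_hudi` is a hypothetical helper name, not a Hudi API):

```python
def write_hudi(df, options, base_path, mode="overwrite"):
    """Write a DataFrame as a Hudi table at base_path.

    mode="overwrite" recreates the table; use "append" for later upserts.
    """
    (df.write.format("hudi")
        .options(**options)
        .mode(mode)
        .save(base_path))

# usage, with the df, hudi_options and basePath defined above:
# write_hudi(df, hudi_options, basePath)
```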