Overview
This article describes how to use Spark SQL to load a DataFrame into a Spark dynamic-partition table.
Scenario
There are many ways to load CSV and Parquet files into a Spark dynamic-partition table. This article shows how to do it through a Spark DataFrame.
Note: to keep performance acceptable, the number of distinct values in the partition column should not exceed a few tens of thousands. This limit is mentioned in the source-code comments of the partitionBy operator in Spark 2.3.
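The cardinality warning above can be checked cheaply before writing: counting the distinct values of the intended partition column tells you whether you are in a safe range. A minimal sketch, assuming a spark-shell session and a DataFrame `df` whose column `_c3` will become the partition key (both names are assumptions for illustration):

```scala
// Sanity check before a dynamic-partition write: count the distinct values
// of the would-be partition column (the column name _c3 is an assumption).
val distinctPartitions = df.select("_c3").distinct().count()
if (distinctPartitions > 10000L)
  println(s"Warning: $distinctPartitions partition values may make partitionBy slow")
```

A `distinct().count()` triggers a full shuffle, so run it once during development rather than on every load.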
Data preparation
Suppose we have a CSV data file named idtimedata with the following contents:
david,shenzhen,31,1,201903
eason,shenzhen,27,2,201904
jarry,wuhan,35,3,201904
aarry2,wuhan1,34,4,201904
barry3,wuhan2,33,5,201904
carry4,wuhan3,32,6,201904
darry5,wuhan4,31,7,201903
earry6,wuhan9,30,8,201903
Upload this file to the /curdata directory on HDFS:
hadoop fs -put idtimedata /curdata/
Loading the CSV file
Loading the CSV file into a single-level partitioned table
// Run in spark-shell with Hive support; the SparkSession is available as `spark`.
spark.sql("set hive.exec.dynamic.partition = true")
spark.sql("set hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("show databases").show()
spark.sql("create database if not exists testdb")
spark.sql("use testdb")
spark.sql("CREATE TABLE IF NOT EXISTS persons (name STRING, address STRING, age INT, time STRING) PARTITIONED BY (pid STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
// The CSV has no header, so columns are auto-named _c0 .. _c4.
val pdf = spark.read.format("csv").load("/curdata/idtimedata")
pdf.show()
// Stage the data in a temporary table partitioned by _c3; the tmp database must exist.
spark.sql("create database if not exists tmp")
pdf.write.partitionBy("_c3").mode("append").saveAsTable("tmp.tmp1")
spark.sql("select * from tmp.tmp1").show()
// In tmp.tmp1 the partition column _c3 comes last, so select * maps
// positionally to (name, address, age, time, pid).
spark.sql("insert into persons partition(pid) select * from tmp.tmp1").show()
spark.sql("select * from persons").show()
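The insert above relies on positional mapping between the `_c0`..`_c4` columns and the target table. An alternative sketch is to rename the headerless CSV columns with `toDF` right after reading, so the staging table carries meaningful names; the column names and the table name `tmp.tmp1_named` here are illustrative assumptions:

```scala
// Rename the headerless CSV columns so later SQL reads naturally
// (names and target table are hypothetical).
val named = spark.read.format("csv")
  .load("/curdata/idtimedata")
  .toDF("name", "address", "age", "id", "time")
named.write.partitionBy("id").mode("append").saveAsTable("tmp.tmp1_named")
```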
Loading the CSV file into a two-level partitioned table
// Same spark-shell session; repeat the dynamic-partition settings if needed.
spark.sql("set hive.exec.dynamic.partition = true")
spark.sql("set hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("show databases").show()
spark.sql("create database if not exists testdb")
spark.sql("use testdb")
spark.sql("CREATE TABLE IF NOT EXISTS persons2 (name STRING, address STRING, age INT) PARTITIONED BY (pid STRING, time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
val pdf = spark.read.format("csv").load("/curdata/idtimedata")
pdf.show()
// Stage the data with a two-level partitioning on _c3 and _c4.
spark.sql("create database if not exists tmp")
pdf.write.partitionBy("_c3", "_c4").mode("append").saveAsTable("tmp.tmp2")
spark.sql("select * from tmp.tmp2").show()
// select * on tmp.tmp2 yields (name, address, age, pid, time): data columns
// first, then the two partition columns in declaration order.
spark.sql("insert into persons2 partition(pid, time) select * from tmp.tmp2").show()
spark.sql("select * from persons2").show()
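The temporary table can also be skipped entirely: with dynamic partitioning enabled, DataFrameWriter.insertInto writes straight into an existing Hive table, matching columns by position with the partition columns last. A minimal sketch against the persons2 table above, assuming the same spark-shell session and the illustrative column names from before:

```scala
// Direct dynamic-partition insert without an intermediate table. insertInto
// matches columns positionally, so order them as (data columns..., pid, time).
spark.read.format("csv").load("/curdata/idtimedata")
  .toDF("name", "address", "age", "pid", "time")
  .write.mode("append")
  .insertInto("testdb.persons2")
```

Unlike saveAsTable, insertInto never creates or alters the table, so the target schema must already match; CSV columns read as strings are cast to the table's column types on insert.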
References
- https://issues.apache.org/jira/browse/SPARK-14927