Preface
In Spark SQL we may run into the following three ways of creating a table:
spark.sql("create table tb as select ...")
//formerly called registerTempTable; it was deprecated and replaced by createOrReplaceTempView,
//presumably to avoid confusion, which is exactly the distinction we discuss below.
df.createOrReplaceTempView("tb")
df.write.mode(SaveMode.Append).saveAsTable("tb")
So, what are the differences between these three?
The createOrReplaceTempView doc comment
/**
* Creates a local temporary view using the given name. The lifetime of this
* temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
*
* @group basic
* @since 2.0.0
*/
In plain terms: it creates a local temporary view with the given name, and the view lives exactly as long as the SparkSession that created the Dataset; it is not visible from other sessions.
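The session-scoped lifetime and the "replace" semantics can be sketched with a toy registry in plain Scala. This is an analogy only, not Spark's internals; FakeSession is a made-up name:

```scala
import scala.collection.mutable

// Toy model: each "session" owns its own view registry, so a view
// registered in one session is invisible to every other session.
class FakeSession {
  private val tempViews = mutable.Map[String, String]() // view name -> plan text

  def createOrReplaceTempView(name: String, plan: String): Unit =
    tempViews(name) = plan // "replace" semantics: silently overwrite any old entry

  def lookup(name: String): Option[String] = tempViews.get(name)
}

val s1 = new FakeSession
val s2 = new FakeSession
s1.createOrReplaceTempView("tb", "SELECT * FROM ans")
assert(s1.lookup("tb").isDefined) // visible in the creating session
assert(s2.lookup("tb").isEmpty)   // invisible in any other session
```

Nothing here touches the data itself: registering a view only stores a name-to-plan mapping, which foreshadows the laziness discussed in the summary.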
The saveAsTable doc comment
/**
* Saves the content of the `DataFrame` as the specified table.
*
* In the case the table already exists, behavior of this function depends on the
* save mode, specified by the `mode` function (default to throwing an exception).
* When `mode` is `Overwrite`, the schema of the `DataFrame` does not need to be
* the same as that of the existing table.
*
* When `mode` is `Append`, if there is an existing table, we will use the format and options of
* the existing table. The column order in the schema of the `DataFrame` doesn't need to be same
* as that of the existing table. Unlike `insertInto`, `saveAsTable` will use the column names to
* find the correct column positions. For example:
*
* {{{
* scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
* scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
* scala> sql("select * from t1").show
* +---+---+
* | i| j|
* +---+---+
* | 1| 2|
* | 4| 3|
* +---+---+
* }}}
*
* In this method, save mode is used to determine the behavior if the data source table exists in
* Spark catalog. We will always overwrite the underlying data of data source (e.g. a table in
* JDBC data source) if the table doesn't exist in Spark catalog, and will always append to the
* underlying data of data source if the table already exists.
*
* When the DataFrame is created from a non-partitioned `HadoopFsRelation` with a single input
* path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
* and Parquet), the table is persisted in a Hive compatible format, which means other systems
* like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
* specific format.
*
* @since 1.4.0
*/
In plain terms: this saves the DataFrame's content as the named table. If the table already exists, the behavior depends on the save mode set with mode (the default is to throw an exception). With Overwrite, the DataFrame's schema does not need to match the existing table's. With Append, Spark reuses the existing table's format and options, and the DataFrame's column order need not match the table's: unlike insertInto, saveAsTable resolves columns by name, which is why the (3, 4) row written with columns (j, i) in the example above comes back as (4, 3). The save mode only decides what happens when the table exists in the Spark catalog; if it does not exist there, the underlying data of the data source (e.g. a table behind a JDBC source) is always overwritten, and if it does exist, the data is always appended. Finally, when the DataFrame comes from a non-partitioned HadoopFsRelation with a single input path and the data source maps to a Hive builtin SerDe (i.e. ORC or Parquet), the table is persisted in a Hive-compatible format that other systems such as Hive can read; otherwise it is stored in a Spark SQL-specific format.
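The by-name column matching in Append mode can be illustrated with a small plain-Scala sketch. alignByName is a hypothetical helper for illustration, not Spark's actual code:

```scala
// Toy illustration of why appending (3, 4) with columns (j, i) yields row (4, 3):
// saveAsTable aligns appended data to the existing table's schema by column NAME,
// whereas insertInto takes columns by POSITION.
def alignByName(tableSchema: Seq[String],
                incomingSchema: Seq[String],
                row: Seq[Int]): Seq[Int] = {
  val byName = incomingSchema.zip(row).toMap // e.g. "j" -> 3, "i" -> 4
  tableSchema.map(byName)                    // reorder to the table's column order
}

val appended = alignByName(Seq("i", "j"), Seq("j", "i"), Seq(3, 4))
assert(appended == Seq(4, 3)) // matches the (4, 3) row in the doc example
```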
Creating the table with spark.sql("create table ... as select ...") behaves the same as saveAsTable: both produce a real, persistent table in the catalog.
Summary
createOrReplaceTempView is much like creating a view in a database, while saveAsTable genuinely creates a table, and the difference between a view and a table follows from that. In Spark, createOrReplaceTempView is lazy: it merely registers the DataFrame's logical plan under a name, launches no job, and materializes nothing, so the "table" is purely virtual. saveAsTable, by contrast, is an action: it triggers a job immediately and writes out a real, physical table.
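The lazy-versus-eager distinction can be sketched in plain Scala, as an analogy only (no Spark involved):

```scala
// Analogy: createOrReplaceTempView behaves like a lazy definition (nothing runs
// until the result is actually needed), while saveAsTable behaves like an eager
// one that does the work immediately.
var evaluations = 0
def expensiveQuery(): Int = { evaluations += 1; 42 }

lazy val viewLike = expensiveQuery() // registered, but not executed yet
assert(evaluations == 0)             // no "job" has run, like a temp view

val tableLike = expensiveQuery()     // executed right away, like saveAsTable
assert(evaluations == 1)

val forced = viewLike                // first use finally triggers execution
assert(evaluations == 2)
assert(forced == 42)
```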
Here is a real-world example. We had a SQL statement like this:
create table a1 as select
nname_en
from
ans
group by
nname_en
having
count(DISTINCT(active)) > 1;
create table b1 as select
nname_cn
from
ans
group by
nname_cn
having
count(DISTINCT(active)) > 1;
select
count(DISTINCT(a1.nname_en)) as nname_en_inactive,
count(DISTINCT(b1.nname_cn)) as nname_cn_inactive
from a1
join b1 on 1 = 1
With that many joins it kept failing with out-of-memory errors and never produced a result, so we wanted to break the statement apart, which led to the two approaches below:
// Approach 1: decompose with temp views
spark.sql("select nname_en from ans group by nname_en having count(DISTINCT(active)) > 1").createOrReplaceTempView("a1")
spark.sql("select nname_cn from ans group by nname_cn having count(DISTINCT(active)) > 1").createOrReplaceTempView("b1")
val df = spark.sql("select count(DISTINCT(a1.nname_en)) as nname_en_inactive, count(DISTINCT(b1.nname_cn)) as nname_cn_inactive from a1 join b1 on 1 = 1")
// Approach 2: decompose with real intermediate tables
spark.sql("create table a1 as select nname_en from ans group by nname_en having count(DISTINCT(active)) > 1")
spark.sql("create table b1 as select nname_cn from ans group by nname_cn having count(DISTINCT(active)) > 1")
val df = spark.sql("select count(DISTINCT(a1.nname_en)) as nname_en_inactive, count(DISTINCT(b1.nname_cn)) as nname_cn_inactive from a1 join b1 on 1 = 1")
Both snippets perform the same decomposition: the first uses createOrReplaceTempView, the second create table. With the same cluster resources, the first still ran out of memory and never finished, while the second produced the result easily. The Spark UI shows why: the first generates only a single stage with a huge number of tasks, whereas the second runs three stages, each with a task count in a reasonable range.
For SQL that joins many tables, we recommend decomposing it by materializing intermediate tables with create table ... as. Decomposing with createOrReplaceTempView does not actually break up the work; it only makes the SQL look cleaner.
References
Difference between df.SaveAsTable and spark.sql(Create table…)