Preface
In Spark SQL we may run into the following three ways of creating a table:
spark.sql("create table tb as select ...")
//formerly called registerTempTable; it was deprecated and replaced by createOrReplaceTempView,
//presumably to avoid confusion, which is exactly the distinction we discuss below.
df.createOrReplaceTempView("tb")
df.write.mode(SaveMode.Append).saveAsTable("tb")
So, what are the differences between these three?
The createOrReplaceTempView doc comment
/**
* Creates a local temporary view using the given name. The lifetime of this
* temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
*
* @group basic
* @since 2.0.0
*/
In plain terms: it creates a local temporary view with the given name, and the view lives exactly as long as the SparkSession that created the Dataset; it is not visible from other sessions.
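The session-scoped lifetime and the "replace" semantics can be sketched with a toy registry in plain Scala. This is an analogy only, not Spark's internals; FakeSession is a made-up name:

```scala
import scala.collection.mutable

// Toy model: each "session" owns its own view registry, so a view
// registered in one session is invisible to every other session.
class FakeSession {
  private val tempViews = mutable.Map[String, String]() // view name -> plan text

  def createOrReplaceTempView(name: String, plan: String): Unit =
    tempViews(name) = plan // "replace" semantics: silently overwrite any old entry

  def lookup(name: String): Option[String] = tempViews.get(name)
}

val s1 = new FakeSession
val s2 = new FakeSession
s1.createOrReplaceTempView("tb", "SELECT * FROM ans")
assert(s1.lookup("tb").isDefined) // visible in the creating session
assert(s2.lookup("tb").isEmpty)   // invisible in any other session
```

Nothing here touches the data itself: registering a view only stores a name-to-plan mapping, which foreshadows the laziness discussed in the summary.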
The saveAsTable doc comment
/**
* Saves the content of the `DataFrame` as the specified table.
*
* In the case the table already exists, behavior of this function depends on the
* save mode, specified by the `mode` function (default to throwing an exception).
* When `mode` is `Overwrite`, the schema of the `DataFrame` does not need to be
* the same as that of the existing table.
*
* When `mode` is `Append`, if there is an existing table, we will use the format and options of
* the existing table. The column order in the schema of the `DataFrame` doesn't need to be same
* as that of the existing table. Unlike `insertInto`, `saveAsTable` will use the column names to
* find the correct column positions. For example:
*
* {{{
* scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
* scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
* scala> sql("select * from t1").show
* +---+---+
* | i| j|
* +---+---+
* | 1| 2|
* | 4| 3|
* +---+---+
* }}}
*
* In this method, save mode is used to determine the behavior if the data source table exists in
* Spark catalog. We will always overwrite the underlying data of data source (e.g. a table in
* JDBC data source) if the table doesn't exist in Spark catalog, and will always append to the
* underlying data of data source if the table already exists.
*
* When the DataFrame is created from a non-partitioned `HadoopFsRelation` with a single input
* path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
* and Parquet), the table is persisted in a Hive compatible format, which means other systems
* like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
* specific format.
*
* @since 1.4.0
*/
In plain terms: this saves the DataFrame's content as the named table. If the table already exists, the behavior depends on the save mode set with mode (the default is to throw an exception). With Overwrite, the DataFrame's schema does not need to match the existing table's. With Append, Spark reuses the existing table's format and options, and the DataFrame's column order need not match the table's: unlike insertInto, saveAsTable resolves columns by name, which is why the (3, 4) row written with columns (j, i) in the example above comes back as (4, 3). The save mode only decides what happens when the table exists in the Spark catalog; if it does not exist there, the underlying data of the data source (e.g. a table behind a JDBC source) is always overwritten, and if it does exist, the data is always appended. Finally, when the DataFrame comes from a non-partitioned HadoopFsRelation with a single input path and the data source maps to a Hive builtin SerDe (i.e. ORC or Parquet), the table is persisted in a Hive-compatible format that other systems such as Hive can read; otherwise it is stored in a Spark SQL-specific format.
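The by-name column matching in Append mode can be illustrated with a small plain-Scala sketch. alignByName is a hypothetical helper for illustration, not Spark's actual code:

```scala
// Toy illustration of why appending (3, 4) with columns (j, i) yields row (4, 3):
// saveAsTable aligns appended data to the existing table's schema by column NAME,
// whereas insertInto takes columns by POSITION.
def alignByName(tableSchema: Seq[String],
                incomingSchema: Seq[String],
                row: Seq[Int]): Seq[Int] = {
  val byName = incomingSchema.zip(row).toMap // e.g. "j" -> 3, "i" -> 4
  tableSchema.map(byName)                    // reorder to the table's column order
}

val appended = alignByName(Seq("i", "j"), Seq("j", "i"), Seq(3, 4))
assert(appended == Seq(4, 3)) // matches the (4, 3) row in the doc example
```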
Creating the table with spark.sql("create table ... as select ...") behaves the same as saveAsTable: both produce a real, persistent table in the catalog.
Summary
createOrReplaceTempView is much like creating a view in a database, while saveAsTable genuinely creates a table, and the difference between a view and a table follows from that. In Spark, createOrReplaceTempView is lazy: it merely registers the DataFrame's logical plan under a name, launches no job, and materializes nothing, so the "table" is purely virtual. saveAsTable, by contrast, is an action: it triggers a job immediately and writes out a real, physical table.
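The lazy-versus-eager distinction can be sketched in plain Scala, as an analogy only (no Spark involved):

```scala
// Analogy: createOrReplaceTempView behaves like a lazy definition (nothing runs
// until the result is actually needed), while saveAsTable behaves like an eager
// one that does the work immediately.
var evaluations = 0
def expensiveQuery(): Int = { evaluations += 1; 42 }

lazy val viewLike = expensiveQuery() // registered, but not executed yet
assert(evaluations == 0)             // no "job" has run, like a temp view

val tableLike = expensiveQuery()     // executed right away, like saveAsTable
assert(evaluations == 1)

val forced = viewLike                // first use finally triggers execution
assert(evaluations == 2)
assert(forced == 42)
```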
Here is a real-world example. We had a SQL statement like this:
create table a1 as select
nname_en
from
ans
group by
nname_en
having
count(DISTINCT(active)) > 1;
create table b1 as select
nname_cn
from
ans
group by
nname_cn
having
count(DISTINCT(active)) > 1;
select
count(DISTINCT(a1.nname_en)) as nname_en_inactive,
count(DISTINCT(b1.nname_cn)) as nname_cn_inactive
from a1
join b1 on 1 = 1
With that many joins it kept failing with out-of-memory errors and never produced a result, so we wanted to break the statement apart, which led to the two approaches below:
// Approach 1: decompose with temp views
spark.sql("select nname_en from ans group by nname_en having count(DISTINCT(active)) > 1").createOrReplaceTempView("a1")
spark.sql("select nname_cn from ans group by nname_cn having count(DISTINCT(active)) > 1").createOrReplaceTempView("b1")
val df = spark.sql("select count(DISTINCT(a1.nname_en)) as nname_en_inactive, count(DISTINCT(b1.nname_cn)) as nname_cn_inactive from a1 join b1 on 1 = 1")
// Approach 2: decompose with real intermediate tables
spark.sql("create table a1 as select nname_en from ans group by nname_en having count(DISTINCT(active)) > 1")
spark.sql("create table b1 as select nname_cn from ans group by nname_cn having count(DISTINCT(active)) > 1")
val df = spark.sql("select count(DISTINCT(a1.nname_en)) as nname_en_inactive, count(DISTINCT(b1.nname_cn)) as nname_cn_inactive from a1 join b1 on 1 = 1")
Both snippets perform the same decomposition: the first uses createOrReplaceTempView, the second create table. With the same cluster resources, the first still ran out of memory and never finished, while the second produced the result easily. The Spark UI shows why: the first generates only a single stage with a huge number of tasks, whereas the second runs three stages, each with a task count in a reasonable range.
For SQL that joins many tables, we recommend decomposing it by materializing intermediate tables with create table ... as. Decomposing with createOrReplaceTempView does not actually break up the work; it only makes the SQL look cleaner.
References
Difference between df.SaveAsTable and spark.sql(Create table…)