Spark DataFrame write to Hive fails with: The format of the existing table is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`

lvwenyuan_1

Article link: https://blog.csdn.net/lvwenyuan_1/article/details/90697049

Scenario

The task was to parse a CSV file and write it into a Hive table that already exists. The write failed with this error:

org.apache.spark.sql.AnalysisException: The format of the existing table arcsoft_analysis.zz_table is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;

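Analysis:

If a Hive table is created from the Hive command line, its file format is determined by Hive's hive.default.fileformat setting. Common formats are TextFile, SequenceFile, RCFile, and ORC; when nothing is specified, the default is TextFile. To choose the format at creation time, append a clause such as stored as TextFile or stored as RCFile to the end of the CREATE TABLE statement.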
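To see which storage format a table ended up with, one option is to create a table with an explicit STORED AS clause and then inspect it. This is a minimal sketch that assumes a Spark build with Hive support and a reachable metastore; the table name demo_stored_as_orc and its columns are hypothetical, not from the original scenario.

```scala
import org.apache.spark.sql.SparkSession

object StoredAsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stored-as-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table: the STORED AS clause fixes the file format at creation time
    spark.sql(
      "CREATE TABLE IF NOT EXISTS demo_stored_as_orc (id INT, name STRING) STORED AS ORC")

    // Print the table metadata, which includes the SerDe and input/output format classes
    spark.sql("DESCRIBE FORMATTED demo_stored_as_orc").show(100, truncate = false)

    spark.stop()
  }
}
```

Running the same DESCRIBE FORMATTED statement against the existing target table should show the format it was created with.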

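df.write, however, defaults to Parquet with Snappy compression. If the table was created from the Hive command line, the formats do not match, so the write fails. If the table does not exist beforehand, saveAsTable creates it and no error occurs.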
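The mismatch can be sketched as an append that does not specify a format. The helper name appendWithDefaultFormat is hypothetical, and whether the exception is actually raised depends on how the target table was created and on the Spark version in use.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, SaveMode}

object DefaultFormatSketch {
  // Hypothetical helper: appends with the writer's default format (Parquet unless reconfigured)
  def appendWithDefaultFormat(df: DataFrame, table: String): Unit = {
    try {
      // No .format(...) call, so the default data source format is used
      df.write.mode(SaveMode.Append).saveAsTable(table)
    } catch {
      // For a table created through Hive DDL, this is where the format mismatch is reported
      case e: AnalysisException =>
        println(s"Write rejected: ${e.getMessage}")
    }
  }
}
```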

Solution:

Option 1:

This is the one I personally use.

df.write.format("Hive").mode(SaveMode.Append).saveAsTable("zz_table")

With the format set to Hive, it doesn't matter which file format the table was created with.

Test code:

  import org.apache.spark.sql.{SaveMode, SparkSession}

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aa")
      .enableHiveSupport()
      .getOrCreate()
    // Switch to the target database
    spark.sql("use arcsoft_analysis")
    // Reuse the schema of an existing table
    val mySchema = spark.table("closeli_user_info").schema
    // Build the DataFrame from the CSV file, applying that schema
    val df = spark.read.option("header", true)
      .option("inferSchema", false)
      .option("delimiter", ",")
      .schema(mySchema)
      .csv("/user/root/closeli_user_info.csv")
    // Write to Hive with saveAsTable, using the Hive format
    df.write.format("Hive").mode(SaveMode.Append).saveAsTable("zz_table")
    spark.close()
  }

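Option 2:

Another option is insertInto, although I don't really recommend it. As the Scaladoc in the insertInto source (quoted below) explains, insertInto resolves columns by position rather than by name, and it ignores the format and any options you set. It still gets the job done.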

df.write.insertInto("zz_table")

/**
   * Inserts the content of the `DataFrame` to the specified table. It requires that
   * the schema of the `DataFrame` is the same as the schema of the table.
   *
   * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just uses position-based
   * resolution. For example:
   *
   * {{{
   *    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
   *    scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
   *    scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
   *    scala> sql("select * from t1").show
   *    +---+---+
   *    |  i|  j|
   *    +---+---+
   *    |  5|  6|
   *    |  3|  4|
   *    |  1|  2|
   *    +---+---+
   * }}}
   *
   * Because it inserts data to an existing table, format or options will be ignored.
   *
   * @since 1.4.0
   */
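Because insertInto is position-based, one way to reduce the risk of silently swapped columns is to reorder the DataFrame to the target table's column order first. This is a sketch only: the helper name insertAligned is hypothetical, and it assumes the DataFrame has a column for every column name in the target table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object InsertAlignedSketch {
  // Hypothetical helper: reorders df to match the table's column order, then inserts
  def insertAligned(spark: SparkSession, df: DataFrame, table: String): Unit = {
    // Column names of the target table, in table order
    val targetColumns = spark.table(table).columns
    // Select by name so that position-based resolution lines up with the table;
    // this throws if df lacks one of the target columns
    val aligned = df.select(targetColumns.map(col): _*)
    aligned.write.insertInto(table)
  }
}
```

A call might look like insertAligned(spark, df, "zz_table").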

Option 3:

A third option is to register the DataFrame as a temporary view and insert the data with SQL.

df.createOrReplaceTempView("temp_tab")
spark.sql("insert into zz_table select * from temp_tab")
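Note that insert into ... select * also matches columns by position. A variation, sketched here under the assumption that the DataFrame's column names match the target table's, builds the select list from the table's own column order. The helper name insertViaSql and the view name temp_tab_aligned are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InsertViaSqlSketch {
  // Hypothetical helper: inserts through SQL with an explicit, table-ordered select list
  def insertViaSql(spark: SparkSession, df: DataFrame, table: String): Unit = {
    // Backtick-quote each target column and join them in table order
    val selectList = spark.table(table).columns.map(c => s"`$c`").mkString(", ")
    df.createOrReplaceTempView("temp_tab_aligned")
    spark.sql(s"insert into $table select $selectList from temp_tab_aligned")
  }
}
```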