InsertIntoHiveTable vs. InsertIntoHadoopFsRelationCommand in Spark (two ways of writing to Hive): differences and caveats

Article link: https://blog.csdn.net/monkeyboy_tech/article/details/136567260

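Based on Spark 3.5, this article covers the two modes Spark uses to write Hive tables: the Hive-native mode and the Spark native datasource mode. It describes what they share (when the table is created through Hive with a compression setting, the two modes can be swapped seamlessly) and where they differ (the writer that is invoked, how dynamic partitions are handled, and so on). In the author's overall tests, the second mode was about 10% faster than the first.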


Background

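This article is based on Spark 3.5.
Spark currently writes Hive tables in one of two ways: the Hive-native mode or the Spark native datasource mode. For inserts into partitioned tables, the configuration parameter spark.sql.hive.convertInsertingPartitionedTable switches between the two. It defaults to true, which selects the Spark native datasource mode (it only takes effect while spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc is also enabled).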
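A minimal sketch of flipping the switch and checking which write command the planner picks. It assumes spark-sql and spark-hive are on the classpath and that a local Hive-enabled session can be created; `demo_events` is a hypothetical table invented for this illustration, not one from the article.

```scala
import org.apache.spark.sql.SparkSession

object InsertModeSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: spark-sql and spark-hive are available; local mode is enough here.
    val spark = SparkSession.builder()
      .appName("insert-mode-sketch")
      .master("local[1]")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical partitioned Parquet table created with Hive syntax.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS demo_events (id BIGINT)
        |PARTITIONED BY (dt STRING)
        |STORED AS PARQUET
        |TBLPROPERTIES ('parquet.compression'='zstd')""".stripMargin)

    val insert =
      "INSERT INTO demo_events PARTITION (dt = '2024-01-01') SELECT CAST(1 AS BIGINT)"

    // Print the plan under both settings; EXPLAIN does not execute the insert.
    for (flag <- Seq("true", "false")) {
      spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", flag)
      println(s"spark.sql.hive.convertInsertingPartitionedTable=$flag")
      spark.sql(s"EXPLAIN $insert").show(truncate = false)
    }

    spark.stop()
  }
}
```

With the flag set to true the printed plan should name InsertIntoHadoopFsRelationCommand, and with false it should name InsertIntoHiveTable, which is a quick way to confirm which path a given job is on.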

Similarities

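If the table was created through Hive and table-level settings such as the compression codec are configured, for example 'parquet.compression'='zstd', both modes write zstd-compressed files, so the two modes can be swapped seamlessly.
When writing through the Hive-native path, the corresponding physical operator is InsertIntoHiveTable.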

When writing through the Spark native datasource path, the corresponding physical plan node is InsertIntoHadoopFsRelationCommand.

When the conversion is enabled, the plan passes through two rules: RelationConversions and DataSourceAnalysis.

The RelationConversions rule contains a conversion step, metastoreCatalog.convert, which passes the properties stored in Hive on to HadoopFsRelation, including the parquet.compression/comment properties from the Hive CREATE TABLE statement:

  val options = relation.tableMeta.properties.filter { case (k, _) => isParquetProperty(k) } ++
    relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
  convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet", isWrite)
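The option merging in that excerpt can be pictured with plain Scala maps. This is a simplified stand-in, not Spark's code: the `isParquetProperty` rule below is an assumption about how Parquet-related keys are recognised, and `mergeOptions` is a hypothetical helper name.

```scala
object ConvertOptionsSketch {
  // Assumed rule (not copied from Spark): keep keys that look Parquet-related.
  def isParquetProperty(key: String): Boolean =
    key.startsWith("parquet.") || key.contains(".parquet.")

  // Mirrors the shape of the excerpt: filtered table properties,
  // then storage properties, then the schema-merging option on top.
  def mergeOptions(
      tableProperties: Map[String, String],
      storageProperties: Map[String, String],
      mergeSchema: Boolean): Map[String, String] =
    tableProperties.filter { case (k, _) => isParquetProperty(k) } ++
      storageProperties +
      ("mergeSchema" -> mergeSchema.toString)
}
```

Under these assumptions, a table property such as parquet.compression survives the filter and reaches the datasource writer, while unrelated table properties are dropped. That is why the compression codec set in the Hive DDL still applies after the conversion.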

The DataSourceAnalysis rule then passes the options held by HadoopFsRelation to InsertIntoHadoopFsRelationCommand:

  case i @ InsertIntoStatement(
      l @ LogicalRelation(t: HadoopFsRelation, _, table, _), parts, _, query, overwrite, _, _)
      if query.resolved =>
    ...
    val insertCommand = InsertIntoHadoopFsRelationCommand(
      outputPath,
      staticPartitions,
      i.ifPartitionNotExists,
      partitionSchema,
      t.bucketSpec,
      t.fileFormat,
      t.options,
      actualQuery,
      mode,
      table,
      Some(t.location),
      actualQuery.output.map(_.name))
    if (overwrite && !insertCommand.dynamicPartitionOverwrite) {
      DDLUtils.verifyNotReadPath(actualQuery, outputPath, table)
    }
    insertCommand

Differences

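In the first, Hive-native mode, the write ultimately goes through the methods of HiveOutputWriter, which write using the parameters configured on the Hive table, for example: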
```
ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
```
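Once the data is written, a dynamic-partition insert also calls externalCatalog.loadDynamicPartitions, which is slow when many partitions are involved.
In older Hive versions (such as 1.2.1) this operation is single-threaded.
In newer versions (such as 2.2.x) it is multi-threaded; the thread count is set by hive.load.dynamic.partitions.thread, which defaults to 15.
See HIVE-14204 for details.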
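To picture why the thread count matters, here is a rough sketch comparing a serial loop with a fixed-size pool. It is not Hive's implementation: `loadPartition` is a hypothetical stand-in for one metastore round trip, and the sleep merely simulates latency.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LoadPartitionsSketch {
  // Hypothetical stand-in for registering one partition in the metastore.
  def loadPartition(spec: String): String = {
    Thread.sleep(5) // simulated round-trip latency
    spec
  }

  // One call after another, as in the single-threaded case.
  def loadSerially(specs: Seq[String]): Seq[String] =
    specs.map(loadPartition)

  // The same calls spread over a fixed-size thread pool.
  def loadWithPool(specs: Seq[String], threads: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try Await.result(Future.traverse(specs)(s => Future(loadPartition(s))), 1.minute)
    finally pool.shutdown()
  }
}
```

Both functions register the same partitions in the same order. With a pool of 15 threads the wall-clock time is bounded by roughly one fifteenth of the serial round trips, which is the effect the newer Hive versions rely on.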
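In the second, Spark native datasource mode, the write ultimately goes through the methods of ParquetOutputWriter, which use Spark's own Parquet writer.
Before any files are written, however, there is a catalog.listPartitions call. It queries the Hive Metastore, and if the existing Hive table has millions of partitions it can easily cause a Hive Metastore OOM.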
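The cost of that lookup depends on how many partitions match. The sketch below is an in-memory model, not the metastore API. It assumes the lookup is filtered by the static partition values of the INSERT, which is my reading of InsertIntoHadoopFsRelationCommand rather than something stated above. `PartitionSpec` and this `listPartitions` are hypothetical names.

```scala
object ListPartitionsSketch {
  type PartitionSpec = Map[String, String]

  // Hypothetical stand-in for a metastore lookup: return every partition
  // whose spec agrees with the given partial (static) spec.
  def listPartitions(
      all: Seq[PartitionSpec],
      partialSpec: Option[PartitionSpec]): Seq[PartitionSpec] =
    partialSpec match {
      case None => all
      case Some(static) =>
        all.filter(p => static.forall { case (k, v) => p.get(k).contains(v) })
    }
}
```

In this model a fully dynamic insert (empty static spec) matches every partition of the table, whereas pinning even one partition column narrows the result. On a table with millions of partitions, that difference is what decides how much the metastore has to return.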
Overall, in the author's tests the Spark native datasource mode was about 10% faster than the first mode.