Whether Hive metastore Parquet tables are converted to Spark's built-in Parquet reader is controlled by spark.sql.hive.convertMetastoreParquet, which defaults to true.
When it is set to true, Spark scans the table with org.apache.spark.sql.execution.FileSourceScanExec; otherwise it uses org.apache.spark.sql.hive.execution.HiveTableScanExec.
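For illustration, a minimal sketch of flipping the flag and checking which scan node appears in the physical plan; the database/table name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("convert-metastore-parquet-demo")
  .enableHiveSupport()
  .getOrCreate()

// Default (true): the plan shows a FileSourceScanExec ("FileScan parquet ...").
spark.sql("SET spark.sql.hive.convertMetastoreParquet=true")
spark.sql("SELECT * FROM some_db.some_parquet_table").explain()

// false: the plan shows a HiveTableScanExec ("HiveTableScan ...").
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark.sql("SELECT * FROM some_db.some_parquet_table").explain()
```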
HiveTableScanExec
~~Partitioning is based on the number and size of the files.
e.g. reading 2048M of data with an HDFS block size of 128M:
if the directory contains 1000 small files, 1000 partitions are generated;
if there is only 1 file, 16 partitions are generated;
if there is one large 1024M file and the remaining 999 files add up to 1024M, 1009 partitions are generated.~~
**In fact, each file will correspond to 2 partitions, as the source code shows:**
```scala
private[hive]
class HadoopTableReader(
    @transient private val attributes: Seq[Attribute],
    @transient private val partitionKeys: Seq[Attribute],
    @transient private val tableDesc: TableDesc,
    @transient private val sparkSession: SparkSession,
    hadoopConf: Configuration)
  extends TableReader with CastSupport with Logging {

  // Hadoop honors "mapreduce.job.maps" as hint,
  // but will ignore when mapreduce.jobtracker.address is "local".
  // https://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
  // mapred-default.xml
  //
  // In order keep consistency with Hive, we will let it be 0 in local mode also.
  private val _minSplitsPerRDD = if (sparkSession.sparkContext.isLocal) {
    0 // will splitted based on block by default.
  } else {
    math.max(hadoopConf.getInt("mapreduce.job.maps", 1),
      sparkSession.sparkContext.defaultMinPartitions)
  }
  // ...
```
The _minSplitsPerRDD is set as sparkSession.sparkContext.defaultMinPartitions (i.e. math.min(defaultParallelism, 2), so 2 on any cluster with at least two cores) unless mapreduce.job.maps is configured higher, and this value is passed to the underlying HadoopRDD as the minimum number of splits.
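To see how that minimum-splits hint changes partitioning, here is a toy reproduction (not the real Hadoop code) of the old mapred FileInputFormat split-size rule, splitSize = max(minSize, min(goalSize, blockSize)) with goalSize = totalSize / numSplits; the 128M file size is an illustrative value:

```scala
// Toy reproduction of the old FileInputFormat split-size computation.
def splitSize(totalSize: Long,
              numSplits: Int,
              blockSize: Long = 128L << 20, // 128M, as in the example above
              minSize: Long = 1L): Long = {
  val goalSize = totalSize / math.max(numSplits, 1)
  math.max(minSize, math.min(goalSize, blockSize))
}

// With the minSplits hint of 2 discussed above, a single 128M file is cut at
// goalSize = 64M, giving two 64M splits -> two partitions for that file.
val oneFile = 128L << 20
val size    = splitSize(oneFile, numSplits = 2)
println(size)                                     // 67108864 (64M)
println(math.ceil(oneFile.toDouble / size).toInt) // 2 splits
```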