Whether Hive metastore Parquet tables are converted to Spark's built-in Parquet reader is controlled by spark.sql.hive.convertMetastoreParquet, which defaults to true.
When it is set to true, Spark scans the table with org.apache.spark.sql.execution.FileSourceScanExec; otherwise it uses org.apache.spark.sql.hive.execution.HiveTableScanExec.
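For illustration, a minimal sketch of flipping the flag and checking which scan node appears in the physical plan; the database/table name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("convert-metastore-parquet-demo")
  .enableHiveSupport()
  .getOrCreate()

// Default (true): the plan shows a FileSourceScanExec ("FileScan parquet ...").
spark.sql("SET spark.sql.hive.convertMetastoreParquet=true")
spark.sql("SELECT * FROM some_db.some_parquet_table").explain()

// false: the plan shows a HiveTableScanExec ("HiveTableScan ...").
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark.sql("SELECT * FROM some_db.some_parquet_table").explain()
```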
HiveTableScanExec
~~Partitioning is based on the number and size of the files.
e.g. reading 2048M of data with an HDFS block size of 128M:
if the directory contains 1000 small files, 1000 partitions are generated;
if there is only 1 file, 16 partitions are generated;
if there is one large 1024M file and the remaining 999 files add up to 1024M, 1009 partitions are generated.~~
**In fact, each file will correspond to 2 partitions, as the source code shows:**
```scala
private[hive]
class HadoopTableReader(
    @transient private val attributes: Seq[Attribute],
    @transient private val partitionKeys: Seq[Attribute],
    @transient private val tableDesc: TableDesc,
    @transient private val sparkSession: SparkSession,
    hadoopConf: Configuration)
  extends TableReader with CastSupport with Logging {

  // Hadoop honors "mapreduce.job.maps" as hint,
  // but will ignore when mapreduce.jobtracker.address is "local".
  // https://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
  // mapred-default.xml
  //
  // In order keep consistency with Hive, we will let it be 0 in local mode also.
  private val _minSplitsPerRDD = if (sparkSession.sparkContext.isLocal) {
    0 // will splitted based on block by default.
  } else {
    math.max(hadoopConf.getInt("mapreduce.job.maps", 1),
      sparkSession.sparkContext.defaultMinPartitions)
  }
  // ...
```
The _minSplitsPerRDD is set as sparkSession.sparkContext.defaultMinPartitions (i.e. math.min(defaultParallelism, 2), so 2 on any cluster with at least two cores) unless mapreduce.job.maps is configured higher, and this value is passed to the underlying HadoopRDD as the minimum number of splits.
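To see how that minimum-splits hint changes partitioning, here is a toy reproduction (not the real Hadoop code) of the old mapred FileInputFormat split-size rule, splitSize = max(minSize, min(goalSize, blockSize)) with goalSize = totalSize / numSplits; the 128M file size is an illustrative value:

```scala
// Toy reproduction of the old FileInputFormat split-size computation.
def splitSize(totalSize: Long,
              numSplits: Int,
              blockSize: Long = 128L << 20, // 128M, as in the example above
              minSize: Long = 1L): Long = {
  val goalSize = totalSize / math.max(numSplits, 1)
  math.max(minSize, math.min(goalSize, blockSize))
}

// With the minSplits hint of 2 discussed above, a single 128M file is cut at
// goalSize = 64M, giving two 64M splits -> two partitions for that file.
val oneFile = 128L << 20
val size    = splitSize(oneFile, numSplits = 2)
println(size)                                     // 67108864 (64M)
println(math.ceil(oneFile.toDouble / size).toInt) // 2 splits
```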