Viewing Spark SQL partitions: how does Spark SQL determine the number of partitions it will use when loading data from a Hive table?

But I think that question did not get a correct answer. Note that the question here is asking how many partitions will be created when a DataFrame is produced by executing a SQL query against a Hive table using the SparkSession.sql method.

IIUC, the question above is distinct from asking how many partitions will be created when a DataFrame is produced by code such as spark.read.json("examples/src/main/resources/people.json"), which loads the data directly from the filesystem (which could be HDFS). I think the answer to this latter question is given by spark.sql.files.maxPartitionBytes:

spark.sql.files.maxPartitionBytes (default: 134217728, i.e. 128 MB): The maximum number of bytes to pack into a single partition when reading files.
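As a quick illustration (not part of the original question), here is a minimal Scala sketch, assuming a SparkSession and the people.json file shipped with the Spark examples, that reads files directly and prints the resulting partition count:

import org.apache.spark.sql.SparkSession

// Read a file straight from the filesystem and inspect how many partitions
// Spark SQL created for the resulting DataFrame.
val spark = SparkSession.builder()
  .appName("file-partition-check")   // hypothetical app name
  .getOrCreate()

// 134217728 (128 MB) is the default; set explicitly here only to make the knob visible.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L)

val jsonDf = spark.read.json("examples/src/main/resources/people.json")
println(s"partitions = ${jsonDf.rdd.getNumPartitions}")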

Experimentally, I have tried creating a DataFrame from a Hive table, and the number of partitions I get is not explained by (total data in the Hive table) / spark.sql.files.maxPartitionBytes.
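For concreteness, this is the kind of check I mean (sketch only; mydb.mytable is a hypothetical table name and spark is an existing SparkSession):

// Query a Hive table through SparkSession.sql and inspect the number of
// partitions of the resulting DataFrame.
val hiveDf = spark.sql("SELECT * FROM mydb.mytable")
println(s"partitions = ${hiveDf.rdd.getNumPartitions}")
// The number printed here is not simply (total table size) / spark.sql.files.maxPartitionBytes.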

Also, adding to the original question, it would be good to know how the number of partitions can be controlled, i.e. when one wants to force Spark to use a different number than it would use by default.


Solution

TL;DR: The default number of partitions when reading data from Hive is governed by the HDFS block size. The number of partitions can be increased by setting mapreduce.job.maps to an appropriate value, and decreased by setting mapreduce.input.fileinputformat.split.minsize to an appropriate value.
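As a hedged sketch of where those knobs live when running Spark (they are Hadoop properties, so they go into the Hadoop configuration; the values and the table name mydb.mytable below are only illustrative):

// Increase the number of partitions: request more map tasks, which shrinks
// the old-API goal size and therefore the split size.
spark.sparkContext.hadoopConfiguration.setInt("mapreduce.job.maps", 200)

// Decrease the number of partitions: raise the minimum split size (here 256 MB).
spark.sparkContext.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024)

val controlledDf = spark.sql("SELECT * FROM mydb.mytable")
println(controlledDf.rdd.getNumPartitions)

The same properties can also be supplied at submit time through the spark.hadoop.* prefix, e.g. --conf spark.hadoop.mapreduce.job.maps=200.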

Spark SQL creates an instance of HadoopRDD when loading data from a Hive table.

An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).

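One way to see this HadoopRDD for yourself (sketch; mydb.mytable is hypothetical and spark is an existing SparkSession) is to print the lineage of the DataFrame's underlying RDD:

val hiveScanDf = spark.sql("SELECT * FROM mydb.mytable")
// toDebugString prints the RDD lineage; for a Hive-backed scan it typically
// bottoms out in a HadoopRDD.
println(hiveScanDf.rdd.toDebugString)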

HadoopRDD in turn splits input files according to the computeSplitSize method defined in org.apache.hadoop.mapreduce.lib.input.FileInputFormat (the new API) and org.apache.hadoop.mapred.FileInputFormat (the old API).

New API:

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

Old API:

protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}

computeSplitSize splits files according to the HDFS blockSize, but if the blockSize is less than minSize or greater than maxSize, it is clamped to those extremes. The HDFS blockSize can be obtained from:

hdfs getconf -confKey dfs.blocksize
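To make the formula concrete, here is a small worked sketch with illustrative numbers (a 128 MB block size and default minSize/maxSize are assumptions, not values taken from the question):

// Worked example of computeSplitSize with illustrative values.
val blockSize = 128L * 1024 * 1024   // dfs.blocksize, e.g. 128 MB
val minSize   = 1L                   // mapreduce.input.fileinputformat.split.minsize (assumed default)
val maxSize   = Long.MaxValue        // mapreduce.input.fileinputformat.split.maxsize (assumed unset)

val splitSize = math.max(minSize, math.min(maxSize, blockSize))   // = 128 MB

// A 1 GB input file would then yield roughly:
val numSplits = math.ceil((1024L * 1024 * 1024).toDouble / splitSize).toInt   // = 8 partitions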

According to Hadoop: The Definitive Guide, Table 8.5, the minSize is obtained from mapreduce.input.fileinputformat.split.minsize and the maxSize is obtained from mapreduce.input.fileinputformat.split.maxsize.


However, the book also mentions regarding mapreduce.input.fileinputformat.split.maxsize that:

This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapreduce.job.maps (or the setNumMapTasks() method on JobConf).

This post also calculates the maxSize using the total input size divided by the number of map tasks.
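Putting the pieces together, here is a hedged sketch of the arithmetic (all numbers are made up for illustration): under the old API, raising mapreduce.job.maps shrinks the goal size and therefore the split size, which increases the number of partitions.

// Old-API sketch: the goal size is derived from total input / mapreduce.job.maps.
val totalInputSize = 10L * 1024 * 1024 * 1024      // e.g. 10 GB of table data (illustrative)
val numMapTasks    = 200L                          // mapreduce.job.maps (illustrative)
val goalSize       = totalInputSize / numMapTasks  // ≈ 51.2 MB

val blockSize = 128L * 1024 * 1024
val minSize   = 1L
val splitSize = math.max(minSize, math.min(goalSize, blockSize))   // ≈ 51.2 MB, below the block size

// Roughly totalInputSize / splitSize ≈ 200 splits, i.e. more partitions than the
// default of one per 128 MB HDFS block (which would give about 80).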
