Spark Partitioning Rules


321HMC123


Tags: spark

Article link: https://blog.csdn.net/xiaoxiaoniao135/article/details/124483511


I. Creating an RDD from an in-memory collection

// Sequences need to be sliced at the same set of index positions for operations
// like RDD.zip() to behave as expected
// length: size of the collection; numSlices: number of slices
// Returns the (start, end) index range of each partition; end is exclusive
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
}

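Example: with length = 7 and numSlices = 3, the partitions cover the following index ranges (start inclusive, end exclusive):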

partition	start	end
0	0	2
1	2	4
2	4	7
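
To make the table concrete, here is a minimal sketch in plain Scala (no Spark dependency) that applies the same arithmetic to a seven-element collection. The object name SliceDemo and the sample data are assumptions made for illustration; only the positions arithmetic comes from the snippet above.

```scala
object SliceDemo {
  // Same arithmetic as the positions function above,
  // returned as a List so the result is easy to inspect.
  def positions(length: Long, numSlices: Int): List[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }.toList

  // Cut a collection into slices using the computed index ranges.
  def slice[T](data: Vector[T], numSlices: Int): List[Vector[T]] =
    positions(data.length, numSlices).map { case (start, end) =>
      data.slice(start, end)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: seven elements, three slices.
    val data = Vector(1, 2, 3, 4, 5, 6, 7)
    println(positions(data.length, 3)) // List((0,2), (2,4), (4,7))
    println(slice(data, 3))            // List(Vector(1, 2), Vector(3, 4), Vector(5, 6, 7))
  }
}
```

Because both bounds are rounded down, the elements that do not divide evenly end up in the later partitions: here the last partition holds three elements while the first two hold two each.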

II. Creating an RDD from a file

1. Number of partitions

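// filepath: path to the file; may be absolute or relative, and may use * to match zero or more characters
// minPartitions: minimum number of partitions (for a single small file the actual count is usually minPartitions or minPartitions + 1)
context.textFile(filepath, minPartitions)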

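If minPartitions is omitted, it defaults to math.min(defaultParallelism, 2). In local mode, defaultParallelism is the number of threads given in the master setting; with local[*] that is the number of logical processors on the machine (on Windows: Task Manager > Performance > CPU > Logical processors).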
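
The default can be illustrated without starting Spark. The sketch below assumes a local[*] master, so it uses the JVM's logical processor count as a stand-in for defaultParallelism; the object and method names are hypothetical and are not Spark's API.

```scala
object DefaultMinPartitionsSketch {
  // Mirrors the rule quoted above: math.min(defaultParallelism, 2).
  def defaultMinPartitions(defaultParallelism: Int): Int =
    math.min(defaultParallelism, 2)

  def main(args: Array[String]): Unit = {
    // Assumption: master is local[*], where defaultParallelism
    // equals the number of logical processors seen by the JVM.
    val cores = Runtime.getRuntime.availableProcessors()
    println(s"logical processors = $cores, default minPartitions = ${defaultMinPartitions(cores)}")
  }
}
```

Whatever the core count, the default never exceeds 2, so on any multi-core machine textFile is called with minPartitions = 2 unless a value is passed explicitly.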

Why the parameter is named minPartitions rather than partitions:


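// totalSize: total number of bytes in the input file(s)
// numSplits = minPartitions: the requested minimum number of partitions
// goalSize: the target size of each partition, in bytes
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
// If the bytes left over after filling numSplits partitions exceed goalSize * 0.1,
// they go into one additional partition; otherwise they are merged into the last one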
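
The rule above can be sketched as a small simulation. This is a simplified model rather than Hadoop's actual implementation: it assumes a single file, takes the split size to be goalSize (ignoring the block size), and uses a minimum split size of 1 byte. The object name SplitSketch and its method are hypothetical; the 1.1 slack factor corresponds to the 10% threshold described above.

```scala
import scala.collection.mutable.ListBuffer

object SplitSketch {
  // Slack factor: the last split may be up to 10% larger than the others.
  val SplitSlop = 1.1

  // Returns (offset, length) in bytes for each split of a single file.
  // Assumption: split size == goalSize, floored at 1 byte.
  def splits(totalSize: Long, numSplits: Int): List[(Long, Long)] = {
    val goalSize = totalSize / (if (numSplits == 0) 1 else numSplits)
    val splitSize = math.max(1L, goalSize)
    val result = ListBuffer.empty[(Long, Long)]
    var remaining = totalSize
    // Keep cutting full-size splits while more than 1.1 split sizes remain.
    while (remaining.toDouble / splitSize > SplitSlop) {
      result += ((totalSize - remaining, splitSize))
      remaining -= splitSize
    }
    // Whatever is left becomes the final split.
    if (remaining != 0) result += ((totalSize - remaining, remaining))
    result.toList
  }

  def main(args: Array[String]): Unit = {
    // 7 bytes, minPartitions = 2: goalSize = 3, remainder 1 > 0.3, so 3 partitions.
    println(splits(7, 2))   // List((0,3), (3,3), (6,1))
    // 100 bytes, minPartitions = 3: goalSize = 33, remainder 1 <= 3.3, so 3 partitions.
    println(splits(100, 3)) // List((0,33), (33,33), (66,34))
  }
}
```

In the first case the leftover byte is more than 10% of goalSize and gets its own partition, which is why the requested value is only a minimum. In the second case the leftover byte is absorbed into the last partition and the count stays at 3.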