Spark - makeRDD Source Code Analysis
1. makeRDD actually just calls the parallelize(seq, numSlices) method; it is a thin wrapper around parallelize.
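For reference, the makeRDD overload in SparkContext is essentially a one-line delegation (a sketch based on the Spark source; exact details may vary across versions):

def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  // Delegate directly to parallelize -- makeRDD adds no logic of its own
  parallelize(seq, numSlices)
}

Keeping makeRDD as a wrapper gives callers a more intuitively named entry point while all of the partitioning logic lives in parallelize.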
2. Looking at the parallelize method, we can see that the rules for reading and partitioning the data are actually defined by the ParallelCollectionRDD class.
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
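As a quick usage sketch (assuming an already-constructed SparkContext named sc, which is not part of the original excerpt), the two calls below are equivalent and each produce an RDD with 3 partitions:

val data = Seq(1, 2, 3, 4, 5)
val rdd1 = sc.makeRDD(data, 3)      // the wrapper
val rdd2 = sc.parallelize(data, 3)  // what makeRDD calls internally
// Both RDDs are partitioned identically: rdd1.getNumPartitions == 3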
3. ParallelCollectionRDD calls slice(data, numSlices), which defines the concrete slicing rules. The slice method is analyzed in detail below.
def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  if (numSlices < 1) {
    // the number of partitions must be positive (at least 1)
    throw new IllegalArgumentException("Positive number of partitions required")
  }
  // Sequences need to be sliced at the same set of index positions for operations
  // like RDD.zip() to behave as expected
  // Based on the number of slices, this splits the sequence length and returns an
  // iterator of (from, until) index pairs, one per slice. When the collection
  // cannot be divided evenly, the extra elements go to the later partitions.
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }
  // ... (the rest of slice applies these positions to the input sequence)
}
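To make the split rule concrete, here is a minimal, self-contained sketch of the positions logic (the SliceDemo object and the data values are hypothetical, for illustration only). Splitting 5 elements into 3 slices yields the index pairs (0,1), (1,3), (3,5), so the partitions contain 1, 2, and 2 elements, with the remainder landing in the later partitions:

// Hypothetical standalone demo of the positions logic above -- not Spark code
object SliceDemo {
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5)
    // Split 5 elements into 3 slices: prints (0,1), (1,3), (3,5),
    // i.e. partitions [1], [2,3], [4,5]
    positions(data.length, 3).foreach { case (from, until) =>
      println(s"($from,$until) -> ${data.slice(from, until)}")
    }
  }
}

Note the integer division in start and end: it guarantees that partition sizes differ by at most one element, and that every sequence of the same length is sliced at exactly the same index positions, which is what RDD.zip() relies on.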