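When we call textFile(), we often pass an argument to specify the number of partitions. But does passing n really create exactly n partitions?
That is what I assumed at first, but reading the source showed otherwise. The parameter is named minPartitions: it is the minimum number of partitions, not the exact number that will be created.
When I started reading this part of the source, I assumed that the partitions, and therefore the number of tasks to launch, were already computed by the time textFile() returned. In fact it is the other way round. Operations such as textFile and map are lazy: map, a transformation, creates a new RDD from an existing one, and textFile likewise only defines an RDD without reading the file. Lazy operations do not trigger execution of the Spark program; they only record what should be done to the RDD. Only when an action runs are all the previously recorded operations executed.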
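To make the "record now, execute on action" idea concrete, here is a minimal sketch in plain Java. It does not use Spark at all: `LazyPipeline`, its `map` and `collect` methods, and the `evaluations` counter are hypothetical names invented for this illustration, and a real RDD is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Toy stand-in for an RDD (hypothetical, not Spark code): map() records, collect() executes. */
public class LazyPipeline {
    private final List<String> source;
    private final List<Function<String, String>> recorded;
    /** Counts how many times a recorded function has actually been applied. */
    public int evaluations = 0;

    public LazyPipeline(List<String> source) {
        this(source, new ArrayList<>());
    }

    private LazyPipeline(List<String> source, List<Function<String, String>> recorded) {
        this.source = source;
        this.recorded = recorded;
    }

    /** "Transformation": returns a new pipeline that remembers f; nothing is computed here. */
    public LazyPipeline map(Function<String, String> f) {
        List<Function<String, String>> next = new ArrayList<>(recorded);
        next.add(f);
        return new LazyPipeline(source, next);
    }

    /** "Action": only now are the recorded functions applied to the data. */
    public List<String> collect() {
        List<String> out = new ArrayList<>();
        for (String line : source) {
            String value = line;
            for (Function<String, String> f : recorded) {
                value = f.apply(value);
                evaluations++;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        LazyPipeline p = new LazyPipeline(Arrays.asList("ab", "cde"))
                .map(x -> x + "(" + x.length() + ")");
        System.out.println(p.evaluations); // 0: map() has only been recorded
        System.out.println(p.collect());   // [ab(2), cde(3)]
        System.out.println(p.evaluations); // 2: one application per input line
    }
}
```

The counter stays at 0 until `collect()` is called, which is the same behaviour the Spark snippet below relies on.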
val rdd1 = sc.textFile(input,3)
val rdd2 = rdd1.map( x => x + s"(${x.length})" )
rdd2.collect().mkString("\n")
In the code above, textFile and map are lazy operations and collect is an action; the computation only happens when collect is reached. Let's look at the code of collect and follow the calls.
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
// -------------------------------------------------------------------------------
/**
 * Run a job on all partitions in an RDD and return the results in an array.
 */
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}
// -------------------------------------------------------------------------------
/**
 * Run a job on a given set of partitions of an RDD, but take a function of type
 * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: Iterator[T] => U,
    partitions: Seq[Int]): Array[U] = {
  val cleanedFunc = clean(func)
  runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
}
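Source: collect is defined in RDD.scala; the two runJob overloads are in SparkContext.scala.
Following these three methods in order, the comment on the third one ("Run a job on a given set of partitions of an RDD") tells us that the RDD's partitions have already been computed by the time it is called. They are computed in the second method, in the expression rdd.partitions. Stepping into it shows the implementation: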
/**
 * Get the array of partitions of this RDD, taking into account whether the
 * RDD is checkpointed or not.
 */
final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      partitions_ = getPartitions
    }
    partitions_
  }
}
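Source: RDD.scala
Here we can see that getPartitions is called if the partitions have not been computed yet, and the result is cached in partitions_. Judging by the name, this is what we are looking for. For the RDD created by textFile, getPartitions is implemented as follows (lines marked with ** are where I set breakpoints):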
   override def getPartitions: Array[Partition] = {
     val jobConf = getJobConf()
     // add the credentials here as this can be called before SparkContext initialized
     SparkHadoopUtil.get.addCredentials(jobConf)
     val inputFormat = getInputFormat(jobConf)
**   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
**   val array = new Array[Partition](inputSplits.size)
     for (i <- 0 until inputSplits.size) {
**     array(i) = new HadoopPartition(id, i, inputSplits(i))
     }
     array
   }
Source: HadoopRDD.scala
From the code above, we step into inputFormat.getSplits(). The code is as follows (again, ** marks a line with a breakpoint):
   /** Splits files returned by {@link #listStatus(JobConf)} when
    *  they're too big.*/
   public InputSplit[] getSplits(JobConf job, int numSplits)
     throws IOException {
     FileStatus[] files = listStatus(job);
     // Save the number of input files for metrics/loadgen
     job.setLong(NUM_INPUT_FILES, files.length);
     long totalSize = 0;                           // compute total size
     for (FileStatus file: files) {                // check we have valid files
       if (file.isDirectory()) {
         throw new IOException("Not a file: "+ file.getPath());
       }
**     totalSize += file.getLen();
     }
**   long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
**   long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
       FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);
     // generate splits
**   ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
     NetworkTopology clusterMap = new NetworkTopology();
     for (FileStatus file: files) {
       Path path = file.getPath();
       long length = file.getLen();
       if (length != 0) {
         FileSystem fs = path.getFileSystem(job);
         BlockLocation[] blkLocations;
         if (file instanceof LocatedFileStatus) {
           blkLocations = ((LocatedFileStatus) file).getBlockLocations();
         } else {
           blkLocations = fs.getFileBlockLocations(file, 0, length);
         }
**       if (isSplitable(fs, path)) {
           long blockSize = file.getBlockSize();
**         long splitSize = computeSplitSize(goalSize, minSize, blockSize);
           long bytesRemaining = length;
**         while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
             String[] splitHosts = getSplitHosts(blkLocations,
                 length-bytesRemaining, splitSize, clusterMap);
**           splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                 splitHosts));
             bytesRemaining -= splitSize;
           }
**         if (bytesRemaining != 0) {
             String[] splitHosts = getSplitHosts(blkLocations, length
                 - bytesRemaining, bytesRemaining, clusterMap);
             splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                 splitHosts));
           }
         } else {
           String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
           splits.add(makeSplit(path, 0, length, splitHosts));
         }
       } else {
         //Create empty hosts array for zero length files
         splits.add(makeSplit(path, 0, length, new String[0]));
       }
     }
     LOG.debug("Total # of splits: " + splits.size());
**   return splits.toArray(new FileSplit[splits.size()]);
   }
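Source: FileInputFormat.java (the old org.apache.hadoop.mapred API, as the JobConf parameter shows)
Let's step through this code in detail. It first iterates over the input files and accumulates their total size with totalSize += file.getLen();. Next comes long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);, where numSplits is the minPartitions argument passed in earlier, i.e. 3. In my run this integer division gives goalSize = 2, the target number of bytes per split. ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits); creates an empty ArrayList of FileSplit with an initial capacity of numSplits (3); this is only a capacity hint, not a limit on how many splits can be added. The code then loops over every file and processes each one. isSplitable(fs, path) returns true by default (subclasses such as TextInputFormat return false for files compressed with a non-splittable codec). long splitSize = computeSplitSize(goalSize, minSize, blockSize); computes the split size. computeSplitSize is implemented as follows: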
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
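As a quick check of the formula, the sketch below evaluates it for three cases. The class name `SplitSizeDemo` and the input values (other than goalSize = 2 from my run) are made up for illustration; only the one-line formula itself is taken from the Hadoop source.

```java
public class SplitSizeDemo {
    /** Same formula as the computeSplitSize method quoted from FileInputFormat. */
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 32 * mb; // assumed block size for this illustration

        // goalSize smaller than blockSize: goalSize is used
        System.out.println(computeSplitSize(2, 1, blockSize));             // 2
        // goalSize larger than blockSize: capped at blockSize
        System.out.println(computeSplitSize(100 * mb, 1, blockSize) / mb); // 32
        // a large minSize overrides both
        System.out.println(computeSplitSize(2, 64 * mb, blockSize) / mb);  // 64
    }
}
```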
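Here blockSize is the block size of the file, which defaults to 32 MB on the local file system (HDFS typically uses a larger default, 128 MB since Hadoop 2.x). The return expression means: take the smaller of goalSize and blockSize, but never go below minSize. So when the target size per split is larger than blockSize, blockSize is used as the split size; otherwise goalSize is used. ((double) bytesRemaining)/splitSize > SPLIT_SLOP is the loop condition. SPLIT_SLOP = 1.1, so the loop body runs while the remaining byte count is more than 1.1 times the split size; a remainder of up to 10% over splitSize is therefore kept as a single final split instead of producing a tiny extra one. Inside the loop we have splits.add(makeSplit(path, length-bytesRemaining, splitSize, splitHosts));, where makeSplit is a factory method that creates the split object: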
/**
 * A factory that makes the split for this class. It can be overridden
 * by sub-classes to make sub-types
 */
protected FileSplit makeSplit(Path file, long start, long length,
                              String[] hosts) {
  return new FileSplit(file, start, length, hosts);
}
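After the loop, splits contains three FileSplit objects. At this point bytesRemaining = 2 and splitSize = 2, so the ratio is 1.0, which no longer exceeds 1.1, and the loop exits. (These values imply an 8-byte input file: three 2-byte splits plus 2 bytes left over.) The check if (bytesRemaining != 0) is true, so the remaining data is turned into one more FileSplit via makeSplit and added to splits. Then return splits.toArray(new FileSplit[splits.size()]); hands the result back to the inputSplits variable of getPartitions in HadoopRDD.scala. That method creates an array of Partition with size inputSplits.size, i.e. 4, fills it in the for loop with array(i) = new HadoopPartition(id, i, inputSplits(i)), and returns it. The partitions method in RDD.scala stores that array and returns it.
Execution then continues with runJob in SparkContext.scala: runJob(rdd, func, 0 until rdd.partitions.length).
At this point 4 Partitions have been created, even though minPartitions was 3, and one task is launched for each Partition.
In addition, when minPartitions is not specified in textFile, a default value is used, and that default can never be greater than 2: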
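The walkthrough can be reproduced without Spark or Hadoop. The following is a simplified sketch of the split loop for a single splittable file: `SplitCountDemo` and `splitLengths` are hypothetical names, host and block-location handling is left out, goalSize is computed from one file instead of the total over all input files, and minSize is assumed to be at least 1 (as it is in the quoted Hadoop code).

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCountDemo {
    static final double SPLIT_SLOP = 1.1; // same value as in FileInputFormat

    /** Returns the length of each split for one splittable file (simplified sketch). */
    public static List<Long> splitLengths(long fileLength, int numSplits,
                                          long minSize, long blockSize) {
        List<Long> lengths = new ArrayList<>();
        if (fileLength == 0) {
            lengths.add(0L); // a zero-length file still yields one empty split
            return lengths;
        }
        long goalSize = fileLength / (numSplits == 0 ? 1 : numSplits);
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining); // whatever is left becomes the last split
        }
        return lengths;
    }

    public static void main(String[] args) {
        long block = 32L * 1024 * 1024; // assumed block size
        // The case from the walkthrough: 8 bytes, minPartitions = 3
        System.out.println(splitLengths(8, 3, 1, block)); // [2, 2, 2, 2]
        // 9 bytes divides evenly, so exactly 3 splits
        System.out.println(splitLengths(9, 3, 1, block)); // [3, 3, 3]
        // 100 MB with minPartitions = 2: the block size caps the split size
        System.out.println(splitLengths(100L * 1024 * 1024, 2, 1, block).size()); // 4
    }
}
```

The 8-byte case yields four splits because integer division rounds goalSize down to 2, and the 100 MB case shows the split count exceeding minPartitions for a different reason: the split size is capped by the block size.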
/**
 * Default min number of partitions for Hadoop RDDs when not given by user
 * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
 * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
 */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
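A tiny sketch of the same rule in Java, to show the effect of the cap. `DefaultMinPartitionsDemo` is a hypothetical name and the parallelism values are arbitrary sample inputs, not values read from a real cluster.

```java
public class DefaultMinPartitionsDemo {
    /** Mirrors math.min(defaultParallelism, 2) from the Scala line quoted here. */
    public static int defaultMinPartitions(int defaultParallelism) {
        return Math.min(defaultParallelism, 2);
    }

    public static void main(String[] args) {
        System.out.println(defaultMinPartitions(1)); // 1
        System.out.println(defaultMinPartitions(2)); // 2
        System.out.println(defaultMinPartitions(8)); // 2: capped, however many cores there are
    }
}
```

Since this value is passed to getSplits as numSplits, it is still only a lower-bound hint: a large file can end up with many more than 2 partitions.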
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
This post is a branch of another post, so the link to the main post is attached here:
https://blog.csdn.net/qq_20064763/article/details/88392874