Spark Source Code Analysis: (2) 2. Details of Partition and Task Creation

empcl

Tags: Spark, source code analysis

Article link: https://blog.csdn.net/qq_20064763/article/details/88393205

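When calling textFile(), we often pass a second argument to specify the number of partitions. So if we pass n, do we get exactly n partitions?

That is what I believed at first, but reading the source code showed otherwise. The parameter is named minPartitions: it is the minimum number of partitions, not the exact number of partitions to create.

When I started reading this part of the source, I also assumed that the partitions and the number of tasks were already worked out at the moment textFile() reads the file. In fact it is the other way round. Operations such as textFile and map only define a new RDD (textFile from an external file, map from an existing RDD). They are lazy: they do not trigger execution of the Spark program, they only record what should be done to the RDD. Only when an action runs are all the previously recorded operations actually executed.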

val rdd1 = sc.textFile(input,3)
val rdd2 = rdd1.map( x => x + s"(${x.length})" )
rdd2.collect().mkString("\n")

In the code above, textFile and map are lazy operations, while collect is an action. The computation only happens when execution reaches collect. Let's look at the code for collect and follow the calls.

 /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

// -------------------------------------------------------------------------------
  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }
//--------------------------------------------------------------------------------
  /**
   * Run a job on a given set of partitions of an RDD, but take a function of type
   * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

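Source: SparkContext.scala (collect() itself is defined in RDD.scala)

Follow these three methods in order. The comment above the third one says it runs a job on a given set of partitions, so by the time it is called the RDD's partitions have already been computed. That computation is triggered in the second method, by rdd.partitions. Its implementation is: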

/**
   * Get the array of partitions of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        partitions_ = getPartitions
      }
      partitions_
    }
  }

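Source: RDD.scala

The code above calls getPartitions, and the name suggests this is what we are looking for. For textFile, getPartitions is implemented as follows; lines marked with ** are where I set breakpoints.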
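The null check on partitions_ is a compute-once, cache-afterwards pattern. Below is a minimal Java sketch of that pattern only. It is a toy model, not Spark code: the class name LazyPartitionsSketch, the int[] stand-in for Array[Partition], and the computeCount counter are all hypothetical, and the checkpointRDD branch is left out.

```java
import java.util.function.Supplier;

/** Toy model (not Spark code) of how RDD.partitions defers and caches getPartitions. */
class LazyPartitionsSketch {
    private int[] cached;          // plays the role of partitions_
    private int computeCount = 0;  // hypothetical counter: how often the expensive step ran
    private final Supplier<int[]> getPartitions;

    LazyPartitionsSketch(Supplier<int[]> getPartitions) {
        this.getPartitions = getPartitions;
    }

    /** Computes the partitions on first access only, then returns the cached array. */
    int[] partitions() {
        if (cached == null) {
            computeCount++;
            cached = getPartitions.get();
        }
        return cached;
    }

    int computeCount() {
        return computeCount;
    }

    public static void main(String[] args) {
        LazyPartitionsSketch rdd = new LazyPartitionsSketch(() -> new int[] {0, 1, 2, 3});
        System.out.println("computed before first access: " + rdd.computeCount());
        System.out.println("partitions: " + rdd.partitions().length);
        rdd.partitions(); // second access reuses the cached array
        System.out.println("computed after two accesses: " + rdd.computeCount());
    }
}
```

Constructing the object does no work; the supplier runs on the first call to partitions() and never again. This mirrors why nothing is split until an action such as collect reaches rdd.partitions.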

 override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
**  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
**  val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
**     array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }

Source: HadoopRDD.scala

From the code above we step into inputFormat.getSplits(). The code is shown below; lines marked with ** are where I set breakpoints.

 /** Splits files returned by {@link #listStatus(JobConf)} when
   * they're too big.*/ 
  public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
    FileStatus[] files = listStatus(job);
    
    // Save the number of input files for metrics/loadgen
    job.setLong(NUM_INPUT_FILES, files.length);
    long totalSize = 0;                           // compute total size
    for (FileStatus file: files) {                // check we have valid files
      if (file.isDirectory()) {
        throw new IOException("Not a file: "+ file.getPath());
      }
 **   totalSize += file.getLen();
    }

 ** long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
 ** long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
      FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

    // generate splits
 ** ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    NetworkTopology clusterMap = new NetworkTopology();
    for (FileStatus file: files) {
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        FileSystem fs = path.getFileSystem(job);
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
   **  if (isSplitable(fs, path)) {
          long blockSize = file.getBlockSize();
   **     long splitSize = computeSplitSize(goalSize, minSize, blockSize);

          long bytesRemaining = length;
   **     while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            String[] splitHosts = getSplitHosts(blkLocations,
                length-bytesRemaining, splitSize, clusterMap);
   **       splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                splitHosts));
            bytesRemaining -= splitSize;
          }

   **    if (bytesRemaining != 0) {
            String[] splitHosts = getSplitHosts(blkLocations, length
                - bytesRemaining, bytesRemaining, clusterMap);
            splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                splitHosts));
          }
        } else {
          String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
          splits.add(makeSplit(path, 0, length, splitHosts));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    LOG.debug("Total # of splits: " + splits.size());
 ** return splits.toArray(new FileSplit[splits.size()]);
  }

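Source: FileInputFormat.java

Let's step through this code in detail. It iterates over the input files and accumulates the total size with totalSize += file.getLen();. Next comes long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);, where numSplits is the minPartitions value passed in earlier, i.e. 3. In my run this integer division gives goalSize = 2, the target number of bytes per split. ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits); creates an empty ArrayList of FileSplit with an initial capacity of numSplits (3); this is only a capacity hint, so the list can end up holding more than 3 elements. The code then loops over every file and processes each one. isSplitable(fs, path) returns true by default in FileInputFormat; TextInputFormat overrides it to return false for files compressed with a codec that cannot be split. long splitSize = computeSplitSize(goalSize, minSize, blockSize); computes the split size. computeSplitSize is implemented as follows: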

  protected long computeSplitSize(long goalSize, long minSize,long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

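Here blockSize comes from file.getBlockSize(); for the local file system it defaults to 32 MB. The expression first takes the smaller of goalSize and blockSize, so the result is blockSize when goalSize is larger than blockSize and goalSize otherwise, and then raises that result to minSize if it is below minSize. The condition ((double) bytesRemaining)/splitSize > SPLIT_SLOP controls the loop. SPLIT_SLOP = 1.1, so the loop body runs only while the remaining bytes are more than 1.1 times the split size. Inside the loop, splits.add(makeSplit(path, length-bytesRemaining, splitSize,splitHosts)); adds one split, where makeSplit is the factory method that creates the split object: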

  /**
   * A factory that makes the split for this class. It can be overridden
   * by sub-classes to make sub-types
   */
  protected FileSplit makeSplit(Path file, long start, long length, 
                                String[] hosts) {
    return new FileSplit(file, start, length, hosts);
  }
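To make the computeSplitSize formula concrete, here is a small numeric sketch. The method body is copied from the source above; the class name SplitSizeSketch and the sample values (including the minSize of 16 in the last call) are assumptions chosen for illustration.

```java
/** Numeric check of the split-size formula: max(minSize, min(goalSize, blockSize)). */
class SplitSizeSketch {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024; // 32 MB, the block size mentioned in the text

        // goalSize below blockSize: goalSize wins
        System.out.println(computeSplitSize(2, 1, blockSize));
        // goalSize above blockSize: capped at blockSize
        System.out.println(computeSplitSize(64L * 1024 * 1024, 1, blockSize));
        // minSize above both candidates' minimum: minSize acts as a floor
        System.out.println(computeSplitSize(2, 16, blockSize));
    }
}
```

The three calls cover the three possible outcomes: goalSize, blockSize, and minSize. In the walkthrough of this post only the first case applies, which is why splitSize ends up equal to goalSize.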

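After the loop, splits already holds three FileSplit objects. At this point bytesRemaining = 2 and splitSize = 2, so the ratio is 1.0, which is not greater than 1.1, and the loop exits. Because bytesRemaining != 0, the if branch runs: the remaining data is turned into one more FileSplit by makeSplit and added to splits. Then return splits.toArray(new FileSplit[splits.size()]); hands the result back to the inputSplits variable in getPartitions in HadoopRDD.scala. The program creates an array named array of size inputSplits.size, i.e. 4, with element type Partition. The for loop fills it with array(i) = new HadoopPartition(id, i, inputSplits(i)) and then returns array. The partitions method in RDD.scala stores that return value in partitions_ and returns it.

Execution then continues with runJob in SparkContext.scala: runJob(rdd, func, 0 until rdd.partitions.length).

At this point 4 partitions have been created, and one task is launched for each partition.

In addition, when no minPartitions value is passed to textFile, a default value is used, and that default can never be greater than 2: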
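The walkthrough above can be replayed without Spark or Hadoop. The following is a minimal sketch of the split loop for a single, non-empty, splittable file. The class name SplitCountSketch and the long[] {start, length} representation are hypothetical, and the 8-byte file size is an assumption inferred from the numbers in the walkthrough (three splits of 2 bytes plus 2 remaining bytes); the post does not state the file size directly.

```java
import java.util.ArrayList;
import java.util.List;

/** Replays the split loop of FileInputFormat.getSplits for one non-empty file. */
class SplitCountSketch {
    static final double SPLIT_SLOP = 1.1;

    /** Returns {start, length} pairs; minSize is assumed to be at least 1. */
    static List<long[]> splits(long fileLength, int numSplits, long minSize, long blockSize) {
        long goalSize = fileLength / (numSplits == 0 ? 1 : numSplits);
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));

        List<long[]> result = new ArrayList<>();
        long bytesRemaining = fileLength;
        // Keep cutting full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[] {fileLength - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (bytesRemaining != 0) {
            result.add(new long[] {fileLength - bytesRemaining, bytesRemaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024;
        // Assumed 8-byte file, minPartitions = 3, as in the walkthrough
        for (long[] s : splits(8, 3, 1, blockSize)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With 8 bytes and numSplits = 3, goalSize is 8 / 3 = 2, so the file is cut into four 2-byte splits, one more than the requested minimum. With 6 bytes the same call yields exactly three splits, which shows that the extra partition comes from the integer division leaving a remainder.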

  /**
   * Default min number of partitions for Hadoop RDDs when not given by user
   * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
   * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
   */
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
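As a quick illustration of the cap, the sketch below re-expresses the formula in Java. The class name DefaultMinPartitionsSketch and the sample defaultParallelism values (8 and 1) are assumptions for illustration; the real value of defaultParallelism depends on the cluster and configuration.

```java
/** Java rendering of math.min(defaultParallelism, 2) from SparkContext. */
class DefaultMinPartitionsSketch {
    static int defaultMinPartitions(int defaultParallelism) {
        return Math.min(defaultParallelism, 2);
    }

    public static void main(String[] args) {
        // Hypothetical parallelism of 8: the default is capped at 2
        System.out.println(defaultMinPartitions(8));
        // Hypothetical parallelism of 1: the default is 1
        System.out.println(defaultMinPartitions(1));
    }
}
```

Because this value is only passed to getSplits as numSplits, it is still a minimum: a large input can produce many more than 2 partitions, since each split is also capped at the block size.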

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

This post is a sub-section of a larger post, so here is the link to the main post:

https://blog.csdn.net/qq_20064763/article/details/88392874
