What is an RDD
Simply put, an RDD is Spark's input, i.e. the data that gets fed into a computation.
Its full name is Resilient Distributed Dataset, meaning a fault-tolerant distributed dataset. Every RDD has five characteristics:
1. A list of partitions (splits). The data can be split up, just as in Hadoop, and only data that can be split can be computed in parallel.
2. A function for computing each partition; this is the compute function discussed below.
3. A list of dependencies on other RDDs. Dependencies are further divided into wide and narrow ones, and not every RDD has dependencies.
4. Optional: key-value RDDs are partitioned by hash, similar to the Partitioner interface in MapReduce, which controls which reducer a key is sent to.
5. Optional: a list of preferred locations for computing each partition; for example, the locations of an HDFS block are the preferred places to compute on that block.
Corresponding to these points, we can find the following four methods and one property in RDD:
/**
* Implemented by subclasses to return the set of partitions in this RDD. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getPartitions: Array[Partition]
/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
/**
* Optionally overridden by subclasses to specify how they are partitioned.
*/
@transient val partitioner: Option[Partitioner] = None
/**
* Optionally overridden by subclasses to specify placement preferences.
* The input is a split (partition); the output is the list of preferred node locations for it.
*/
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
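Putting these pieces together, a minimal custom RDD only has to implement getPartitions and compute; dependencies, partitioner, and preferred locations can keep their defaults. The sketch below is a hypothetical class for illustration only (real Spark provides ParallelCollectionRDD for this job): it exposes a local array as an RDD.

import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical example RDD: serves a local array, split into numSlices partitions.
// Note: the whole array is captured by the RDD and shipped with its tasks, which is
// fine for a sketch but not how ParallelCollectionRDD does it.
class LocalArrayRDD[T: ClassTag](sc: SparkContext, data: Array[T], numSlices: Int)
  extends RDD[T](sc, Nil) {   // Nil: this RDD has no parent dependencies (feature 3)

  private case class Slice(index: Int) extends Partition

  // Feature 1: the list of partitions
  override def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => Slice(i): Partition).toArray

  // Feature 2: how to compute one partition
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val i = split.index
    data.slice(i * data.length / numSlices, (i + 1) * data.length / numSlices).iterator
  }
  // Features 4 and 5 (partitioner, preferred locations) keep the default None / Nil.
}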
Transformations between the various RDD types
Example:
val hdfsFile = sc.textFile(args(1))
val flatMapRdd = hdfsFile.flatMap(s => s.split(" "))
val filterRdd = flatMapRdd.filter(_.length == 2)
val mapRdd = filterRdd.map(word => (word, 1))
val reduce = mapRdd.reduceByKey(_ + _)
textFile gives back a MapPartitionsRDD, i.e. a HadoopRDD with a map applied on top; in the Spark version shown here, flatMap and filter each produce another MapPartitionsRDD (older releases used dedicated FlatMappedRDD and FilteredRDD classes), and finally reduceByKey produces a ShuffledRDD.
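A quick way to check this chain is from the spark-shell, reusing the variable names above (the exact class names depend on the Spark version):

println(reduce.toDebugString)              // prints the whole lineage: ShuffledRDD at the top, HadoopRDD at the bottom
println(hdfsFile.getClass.getSimpleName)   // MapPartitionsRDD: the HadoopRDD is hidden behind textFile's map
println(reduce.getClass.getSimpleName)     // ShuffledRDD: produced by reduceByKey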
textFile(…)
[Source] SparkContext.scala
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString)
}
/** Get an RDD for a Hadoop file with an arbitrary InputFormat
*
* '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
* operation will create many references to the same object.
* If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
* copy them using a `map` function.
*/
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
assertNotStopped()
// A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions).setName(path)
}
Parameters:
1. the HDFS path
2. the InputFormat class
3. the key class (the Mapper's input key type)
4. the value class (the Mapper's input value type)
What it does:
1. Saves the Hadoop configuration into a broadcast variable (it can be around 10 KB, which is fairly large).
2. Defines a function that sets the input paths on the JobConf.
3. Constructs and returns a new HadoopRDD.
Tracing the code shows that defaultMinPartitions is math.min(defaultParallelism, 2), i.e. at most 2, so in practice you usually want to set a larger value yourself.
If what you are reading is not plain text but some other Hadoop file type, use the corresponding method instead (e.g. sequenceFile, hadoopFile, or newAPIHadoopFile), as the sketch below shows.
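For example, the sketch below (the path is a placeholder) calls hadoopFile directly with the same types textFile uses; textFile is just this call plus a map that keeps only the Text value:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile(
  "hdfs:///data/input.txt",       // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],          // key: byte offset of each line
  classOf[Text],                  // value: the line itself
  minPartitions = 8)
val lines = raw.map(pair => pair._2.toString)   // exactly what textFile does internally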
HadoopRDD
[Source] HadoopRDD.scala
override def getPartitions: Array[Partition] = {
val jobConf = getJobConf()
// add the credentials here as this can be called before SparkContext initialized
SparkHadoopUtil.get.addCredentials(jobConf)
val inputFormat = getInputFormat(jobConf)
// Compute the input splits
val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
val array = new Array[Partition](inputSplits.size)
for (i <- 0 until inputSplits.size) {
// Wrap each InputSplit in a HadoopPartition and put it into the array to return
array(i) = new HadoopPartition(id, i, inputSplits(i))
}
array
}
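Note that minPartitions is only a hint handed to InputFormat.getSplits; the actual number of partitions is whatever getSplits returns. A quick spark-shell check (placeholder path):

val rdd = sc.textFile("hdfs:///data/big.log", minPartitions = 8)
println(rdd.partitions.length)   // decided by InputFormat.getSplits, so it can exceed the hint for a large file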
override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
val iter = new NextIterator[(K, V)] {
// Cast the Partition to HadoopPartition
val split = theSplit.asInstanceOf[HadoopPartition]
logInfo("Input split: " + split.inputSplit)
val jobConf = getJobConf()
val inputMetrics = context.taskMetrics.getInputMetricsForReadMethod(DataReadMethod.Hadoop)
// Sets the thread local variable for the file's name
split.inputSplit.value match {
case fs: FileSplit => SqlNewHadoopRDDState.setInputFileName(fs.getPath.toString)
case _ => SqlNewHadoopRDDState.unsetInputFileName()
}
// Find a function that will return the FileSystem bytes read by this thread. Do this before
// creating RecordReader, because RecordReader's constructor might read some bytes
val bytesReadCallback = inputMetrics.bytesReadCallback.orElse {
split.inputSplit.value match {
case _: FileSplit | _: CombineFileSplit =>
SparkHadoopUtil.get.getFSBytesReadOnThreadCallback()
case _ => None
}
}
inputMetrics.setBytesReadCallback(bytesReadCallback)
var reader: RecordReader[K, V] = null
val inputFormat = getInputFormat(jobConf)
HadoopRDD.addLocalConfiguration(new SimpleDateFormat("yyyyMMddHHmm").format(createTime),
context.stageId, theSplit.index, context.attemptNumber, jobConf)
// Create the RecordReader for this InputSplit via InputFormat.getRecordReader
reader = inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
// Register an on-task-completion callback to close the input stream.
context.addTaskCompletionListener{ context => closeIfNeeded() }
val key: K = reader.createKey()
val value: V = reader.createValue()
// Override NextIterator's getNext: use the reader created above to read the next record
override def getNext(): (K, V) = {
try {
finished = !reader.next(key, value)
} catch {
case eof: EOFException =>
finished = true
}
if (!finished) {
inputMetrics.incRecordsRead(1)
}
(key, value)
}
...
}
new InterruptibleIterator[(K, V)](context, iter)
}
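Notice that getNext keeps calling reader.next(key, value) against the same two Writable objects, which is exactly why hadoopFile's scaladoc warns against caching the raw records. A sketch of the recommended fix (placeholder path): copy each record to immutable values with a map before caching.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile("hdfs:///data/input.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val safe = raw.map { case (offset, line) => (offset.get, line.toString) }   // real copies, no shared Writables
safe.cache()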
override def getPreferredLocations(split: Partition): Seq[String] = {
val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
val locs: Option[Seq[String]] = HadoopRDD.SPLIT_INFO_REFLECTIONS match {
case Some(c) =>
try {
val lsplit = c.inputSplitWithLocationInfo.cast(hsplit)
val infos = c.getLocationInfo.invoke(lsplit).asInstanceOf[Array[AnyRef]]
Some(HadoopRDD.convertSplitLocationInfo(infos))
} catch {
case e: Exception =>
logDebug("Failed to use InputSplitWithLocations.", e)
None
}
case None => None
}
// Otherwise fall back to InputSplit.getLocations to get the hosts where the split resides
locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
}
As you can see, the compute method takes a partition (split) and returns an Iterator for traversing that split's data.
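From the driver you can also ask any RDD for its preferred locations; for a HadoopRDD they are the hosts holding the corresponding HDFS blocks. A small spark-shell sketch (placeholder path):

val rdd = sc.textFile("hdfs:///data/input.txt")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}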
map(…)
[Source] RDD.scala
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
[Source] MapPartitionsRDD.scala
/**
* An RDD that applies the provided function to every partition of the parent RDD.
*/
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false)
extends RDD[U](prev) {
override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
override def getPartitions: Array[Partition] = firstParent[T].partitions
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
}
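The partitioner line is worth noting: preservesPartitioning defaults to false, so a plain map drops the parent's partitioner (the keys may have changed), while mapValues passes preservesPartitioning = true and keeps it. A quick spark-shell check:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))
println(pairs.partitioner)                   // Some(HashPartitioner)
println(pairs.map(identity).partitioner)     // None: map may change the keys
println(pairs.mapValues(_ + 1).partitioner)  // Some(HashPartitioner): keys are untouched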
Both getPartitions and compute are overridden, and both use firstParent[T]; firstParent[T] is fixed by the parent-class constructor that is invoked through inheritance:
[Source] RDD.scala
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {
/** Construct an RDD with just a one-to-one dependency on one parent */
def this(@transient oneParent: RDD[_]) =
this(oneParent.context , List(new OneToOneDependency(oneParent)))
/** Returns the first parent RDD */
protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
dependencies.head.rdd.asInstanceOf[RDD[U]]
}
...
}
It assigns the parent RDD, wrapped in a OneToOneDependency, to deps, so the HadoopRDD becomes the parent dependency of the MapPartitionsRDD. OneToOneDependency is a narrow dependency: the child RDD depends directly on its parent RDD (see the sketch below).
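You can see this dependency chain from the spark-shell, reusing the names from the word-count example above:

import org.apache.spark.OneToOneDependency

println(mapRdd.dependencies.head.isInstanceOf[OneToOneDependency[_]])   // true: a narrow dependency
println(mapRdd.dependencies.head.rdd eq filterRdd)                      // true: the child points straight at its parent
println(reduce.dependencies.head)                                       // a ShuffleDependency: reduceByKey is a wide dependency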
Conclusions:
1. getPartitions simply reuses the parent RDD's partition information.
2. compute wraps the parent RDD's iterator and applies the anonymous function f to each record as it is traversed.
So the compute function plays two distinct roles:
1. When the RDD has no dependencies, it builds an Iterator over the data directly from the partition (split) information.
2. When the RDD has a parent dependency, it wraps the parent RDD's iterator so that the anonymous function f is applied to each element as it is produced (see the sketch below).
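This composition of per-partition functions can be seen with plain Scala iterators, no cluster needed; each step in the chain only wraps its parent's iterator, and nothing runs until the data is pulled:

// Pretend this iterator came from one split of the HadoopRDD.
val partitionData: Iterator[String] = Iterator("spark rdd", "spark core")
val afterFlatMap = partitionData.flatMap(_.split(" "))    // what flatMap's compute does
val afterMap     = afterFlatMap.map(word => (word, 1))    // what map's compute does
println(afterMap.toList)   // List((spark,1), (rdd,1), (spark,1), (core,1))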
Now look at the map(cleanF) inside new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)); it is Scala's own Iterator.map:
[Source] Iterator.scala
/** Creates a new iterator that maps all produced values of this iterator
* to new values using a transformation function.
*
* @param f the transformation function
* @return a new iterator which transforms every value produced by this
* iterator by applying the function `f` to it.
* @note Reuse: $consumesAndProducesIterator
*/
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
def hasNext = self.hasNext
def next() = f(self.next())
}
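Iterator.map is lazy: the function runs only when an element is actually pulled, which is what lets Spark pipeline a chain of narrow transformations inside a single task. A tiny demonstration:

val it = Iterator(1, 2, 3).map { x => println(s"computing $x"); x * 2 }
println("nothing computed yet")
println(it.toList)   // the "computing ..." lines appear only now, followed by List(2, 4, 6)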
... to be continued ...