Spark源码解析之RDD

最新推荐文章于 2021-06-25 20:01:05 发布

AlferWei

最新推荐文章于 2021-06-25 20:01:05 发布

阅读量571

点赞数

分类专栏： Spark Spark专栏文章标签： spark 源码

本文链接：https://blog.csdn.net/OiteBody/article/details/52576272

版权

Spark 同时被 2 个专栏收录

32 篇文章 0 订阅

订阅专栏

Spark专栏

14 篇文章 7 订阅

订阅专栏

从HDFS读取并转换为HadoopRDD

    val path = args(0)
    val dateTime = DateTime.now().getMillis
    val jobName = "WordCount"
    val sparkConf = new SparkConf().setAppName(jobName)
    val sc = new SparkContext(sparkConf)
    val data = sc.textFile(path, 2)
    val paris = {
       data.flatMap(l => l.replaceAll("\\.|,|\"", " ").trim.split(" ")).filter(w => !w.trim.equals("")).map(w => (w, 1))
    }
    val count = paris.reduceByKey((a, b) => a + b).sortByKey()
    count.saveAsTextFile("output/" + jobName + "-" + dateTime)

SparkContext类中包好了实例化HadoopRDD的方法：


  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => **pair._2.toString**).setName(path)
  }
// 注： pair => pair._2 取hadoopFile[K,V]中的V，即对应的文本内容，而K代表文本在文件中的偏移量？

  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()
    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }

HadoopRDD继承了RDD：

class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)
  extends RDD[(K, V)](sc, Nil) with Logging

RDD是一个抽象类，包含了对rdd的各种集合操作，通过带函数参数的函数，实现对rdd的特定转换，如：

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

集成RDD的实现类在不同的场景下有不同的实现：
这里写图片描述

AlferWei

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Spark源码解析之RDD

从HDFS读取并转换为HadoopRDD val path = args(0) val dateTime = DateTime.now().getMillis val jobName = "WordCount" val sparkConf = new SparkConf().setAppName(jobName) val sc = new SparkContex
复制链接

扫一扫

专栏目录