【Spark】RDD操作详解4——Action算子

最新推荐文章于 2022-11-27 20:03:32 发布

JasonDing1354

最新推荐文章于 2022-11-27 20:03:32 发布

阅读量3.3k

点赞数

分类专栏：【Spark】文章标签： spark

本文链接：https://blog.csdn.net/JasonDing1354/article/details/46848763

版权

本质上在Actions算子中通过SparkContext执行提交作业的runJob操作，触发了RDD DAG的执行。
根据Action算子的输出空间将Action算子进行分类：无输出、 HDFS、 Scala集合和数据类型。

无输出

foreach

对RDD中的每个元素都应用f函数操作，不返回RDD和Array，而是返回Uint。

图中，foreach算子通过用户自定义函数对每个数据项进行操作。本例中自定义函数为println，控制台打印所有数据项。

源码：

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit) {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

HDFS

（1）saveAsTextFile

函数将数据输出，存储到HDFS的指定目录。将RDD中的每个元素映射转变为（Null，x.toString），然后再将其写入HDFS。

图中，左侧的方框代表RDD分区，右侧方框代表HDFS的Block。通过函数将RDD的每个分区存储为HDFS中的一个Block。

源码：

  /**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String) {
    // https://issues.apache.org/jira/browse/SPARK-2075
    //
    // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
    // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
    // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
    // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
    // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
    //
    // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
    // same bytecodes for `saveAsTextFile`.
    val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
    val textClassTag = implicitly[ClassTag[Text]]
    val r = this.mapPartitions { iter =>
      val text = new Text()
      iter.map { x =>
        text.set(x.toString)
        (NullWritable.get(), text)
      }
    }
    RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
  }

  /**
   * Save this RDD as a compressed text file, using string representations of elements.
   */
  def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
    // https://issues.apache.org/jira/browse/SPARK-2075
    val nullWritableClassTag = implicitly[C

最低0.47元/天解锁文章

JasonDing1354

关注

0
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
【Spark】RDD操作详解4——Action算子

本质上在Actions算子中通过SparkContext执行提交作业的runJob操作，触发了RDD DAG的执行。根据Action算子的输出空间将Action算子进行分类：无输出、 HDFS、 Scala集合和数据类型。无输出foreach对RDD中的每个元素都应用f函数操作，不返回RDD和Array，而是返回Uint。图中，foreach算子通过用户自定义函数对每个数据项进行操作。
复制链接

扫一扫