This article was first published on my personal blog: https://blog.smile13.com/articles/2018/11/30/1543589289882.html
I. Operators with no output
1. The foreach operator
Function: applies the function f to every element of the RDD; there is no return value.
Source:
/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
Example:
scala> val rdd1 = sc.parallelize(1 to 9)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> rdd1.foreach(x => printf("%d ", x))
1 2 3 4 5 6 7 8 9
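Because foreach returns Unit, side effects such as accumulators are the usual way to get a result back to the driver. A minimal sketch, not from the original post, assuming the Spark 2.x longAccumulator API (the accumulator name is only illustrative):

// Count the even elements via an accumulator, since foreach itself returns nothing.
val evenCount = sc.longAccumulator("evenCount")   // "evenCount" is an illustrative name
rdd1.foreach(x => if (x % 2 == 0) evenCount.add(1))
println(evenCount.value)                          // 4 for the elements 1 to 9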
2. The foreachPartition operator
Function: this function is similar to foreach. The difference is that foreach calls foreach on the iterator inside each partition, so the function you pass in is applied element by element, whereas foreachPartition hands the whole iterator of each partition to your function and lets the function process the iterator itself, which makes it easy to create expensive resources once per partition and can help avoid out-of-memory problems (see the sketch after the example below). Simply put, foreach's iterator works on the elements of the RDD, while foreachPartition's iterator works on the partition itself.
Source:
/**
 * Applies a function f to each partition of this RDD.
 */
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
Example:
scala> val rdd1 = sc.parallelize(1 to 9, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24

scala> rdd1.foreachPartition(x => printf("%s ", x.size))
4 5
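The per-partition benefit described above is easiest to see when a resource is created once per partition instead of once per element. A minimal sketch, not from the original post; the buffer here stands in for something expensive such as a database connection or HTTP client:

rdd1.foreachPartition { iter =>
  // Created once per partition, not once per element; a real job might open a
  // database connection here instead of a buffer.
  val buffer = scala.collection.mutable.ArrayBuffer[Int]()
  iter.foreach(buffer += _)
  println(s"writing ${buffer.size} elements in one batch")   // printed on the executors
}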
II. Operators that write to file systems such as HDFS
1. The saveAsTextFile operator
Function: writes the RDD out as text files to the local file system, HDFS, or another supported file system. Spark calls toString on each element, turning every element into one line of the output. Note that when saving to the local file system, the files are only written to the local directories of the machines where the executors run.
Source:
/**
 * Save this RDD as a text file, using string representations of elements.
 */
def saveAsTextFile(path: String): Unit = withScope {
  // https://issues.apache.org/jira/browse/SPARK-2075
  //
  // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
  // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
  // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
  // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
  // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
  //
  // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
  // same bytecodes for `saveAsTextFile`.
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
Example:
scala> val rdd1 = sc.parallelize(1 to 9, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24

scala> rdd1.saveAsTextFile("file:///opt/app/test/saveAsTextFileTest.txt")
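saveAsTextFile also has an overload that takes a compression codec. A brief sketch, assuming Hadoop's GzipCodec is on the classpath (the output path is only illustrative):

import org.apache.hadoop.io.compress.GzipCodec
// Each partition is written as a gzip-compressed part file.
rdd1.saveAsTextFile("file:///opt/app/test/saveAsTextFileGzTest", classOf[GzipCodec])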
2. The saveAsObjectFile operator
Function: saves the RDD as a SequenceFile of serialized objects to the local file system, HDFS, or another supported file system.
Source:
/**
 * Save this RDD as a SequenceFile of serialized objects.
 */
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
Example:
scala> val rdd1 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3), ("d", 5), ("a", 4)), 2)
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:24

scala> rdd1.saveAsObjectFile("file:///opt/app/test/saveAsObejctFileTest.txt")
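The saved objects can be read back with SparkContext.objectFile, supplying the element type. A brief sketch (not from the original post) that reuses the directory written above:

// Read the serialized objects back into an RDD of the original element type.
val restored = sc.objectFile[(String, Int)]("file:///opt/app/test/saveAsObejctFileTest.txt")
restored.collect()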
3. The saveAsHadoopFile operator
Function: saves the RDD as files on HDFS (or any Hadoop-supported file system). You can specify the outputKeyClass, the outputValueClass, and the compression format; each partition is written as one output file.
Source:
/**
 * Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
 * supporting the key and value types K and V in this RDD.
 *
 * @note We should make sure our tasks are idempotent when speculation is enabled, i.e. do
 * not use output committer that writes data directly.
 * There is an example in https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
 * result of using direct output committer with speculation enabled.
 */
def saveAsHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]],
    conf: JobConf = new JobConf(self.context.hadoopConfiguration),
    codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  hadoopConf.setOutputKeyClass(keyClass)
  hadoopConf.setOutputValueClass(valueClass)
  conf.setOutputFormat(outputFormatClass)
  for (c <- codec) {
    hadoopConf.setCompre