This article was first published on my personal blog: https://blog.smile13.com/articles/2018/11/30/1543589289882.html
I. Operators with no output
1. The foreach operator
Function: applies the function f to every element of the RDD; there is no return value.
Source:
/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
Example:
scala> val rdd1 = sc.parallelize(1 to 9)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> rdd1.foreach(x => printf("%d ", x))
1 2 3 4 5 6 7 8 9
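Because foreach returns Unit, side effects such as accumulators are the usual way to get a result back to the driver. A minimal sketch, not from the original post, assuming the Spark 2.x longAccumulator API (the accumulator name is only illustrative):

// Count the even elements via an accumulator, since foreach itself returns nothing.
val evenCount = sc.longAccumulator("evenCount")   // "evenCount" is an illustrative name
rdd1.foreach(x => if (x % 2 == 0) evenCount.add(1))
println(evenCount.value)                          // 4 for the elements 1 to 9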
2. The foreachPartition operator
Function: this function is similar to foreach. The difference is that foreach calls foreach on the iterator inside each partition, so the function you pass in is applied element by element, whereas foreachPartition hands the whole iterator of each partition to your function and lets the function process the iterator itself, which makes it easy to create expensive resources once per partition and can help avoid out-of-memory problems (see the sketch after the example below). Simply put, foreach's iterator works on the elements of the RDD, while foreachPartition's iterator works on the partition itself.
Source:
/**
 * Applies a function f to each partition of this RDD.
 */
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
Example:
scala> val rdd1 = sc.parallelize(1 to 9, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24

scala> rdd1.foreachPartition(x => printf("%s ", x.size))
4 5
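The per-partition benefit described above is easiest to see when a resource is created once per partition instead of once per element. A minimal sketch, not from the original post; the buffer here stands in for something expensive such as a database connection or HTTP client:

rdd1.foreachPartition { iter =>
  // Created once per partition, not once per element; a real job might open a
  // database connection here instead of a buffer.
  val buffer = scala.collection.mutable.ArrayBuffer[Int]()
  iter.foreach(buffer += _)
  println(s"writing ${buffer.size} elements in one batch")   // printed on the executors
}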
II. Operators that write to file systems such as HDFS
1. The saveAsTextFile operator
Function: writes the RDD out as text files to the local file system, HDFS, or another supported file system. Spark calls toString on each element, turning every element into one line of the output. Note that when saving to the local file system, the files are only written to the local directories of the machines where the executors run.
Source:
/**
 * Save this RDD as a text file, using string representations of elements.
 */
def saveAsTextFile(path: String): Unit = withScope {
  // https://issues.apache.org/jira/browse/SPARK-2075
  //
  // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
  // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
  // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
  // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
  // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
  //
  // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
  // same bytecodes for `saveAsTextFile`.
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
Example:
scala> val rdd1 = sc.parallelize(1 to 9, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24

scala> rdd1.saveAsTextFile("file:///opt/app/test/saveAsTextFileTest.txt")
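saveAsTextFile also has an overload that takes a compression codec. A brief sketch, assuming Hadoop's GzipCodec is on the classpath (the output path is only illustrative):

import org.apache.hadoop.io.compress.GzipCodec
// Each partition is written as a gzip-compressed part file.
rdd1.saveAsTextFile("file:///opt/app/test/saveAsTextFileGzTest", classOf[GzipCodec])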
2. The saveAsObjectFile operator
Function: saves the RDD as a SequenceFile of serialized objects to the local file system, HDFS, or another supported file system.
Source:
/**
 * Save this RDD as a SequenceFile of serialized objects.
 */
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
Example:
scala> val rdd1 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3), ("d", 5), ("a", 4)), 2)
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:24

scala> rdd1.saveAsObjectFile("file:///opt/app/test/saveAsObejctFileTest.txt")
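The saved objects can be read back with SparkContext.objectFile, supplying the element type. A brief sketch (not from the original post) that reuses the directory written above:

// Read the serialized objects back into an RDD of the original element type.
val restored = sc.objectFile[(String, Int)]("file:///opt/app/test/saveAsObejctFileTest.txt")
restored.collect()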
3. The saveAsHadoopFile operator
Function: saves the RDD as files on HDFS (or any Hadoop-supported file system). You can specify the outputKeyClass, the outputValueClass, and the compression format; each partition is written as one output file.
Source:
/**
 * Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
 * supporting the key and value types K and V in this RDD.
 *
 * @note We should make sure our tasks are idempotent when speculation is enabled, i.e. do
 * not use output committer that writes data directly.
 * There is an example in https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
 * result of using direct output committer with speculation enabled.
 */
def saveAsHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]],
    conf: JobConf = new JobConf(self.context.hadoopConfiguration),
    codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  hadoopConf.setOutputKeyClass(keyClass)
  hadoopConf.setOutputValueClass(valueClass)
  conf.setOutputFormat(outputFormatClass)
  for (c <- codec) {
    hadoopConf.setCompre