2. Spark Source Code: RDD Creation and Transformation

Latest recommended article published on 2024-08-07 16:37:36

灰二和杉菜



Category: Apache Spark. Tags: spark, RDD transformation source code, spark RDD, transformation process source code

Article link: https://blog.csdn.net/qq475781638/article/details/93066400


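This article walks through how Spark creates RDDs, including the HadoopRDD produced when reading from a file and the ParallelCollectionRDD produced from a collection. During transformation, the flatMap and map operations each wrap the data in a MapPartitionsRDD. reduceByKey then aggregates the data; internally it calls the combineByKeyWithClassTag method, which produces a shuffle-type RDD (ShuffledRDD). The whole flow shows how an RDD changes internally as it is transformed.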


This post looks at how RDDs are created and transformed, from the perspective of the source code.

RDD Creation

There are several ways to create an RDD; the most common is the sparkContext.textFile method.

 def textFile(
     path: String,
     minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
   
   assertNotStopped()
   hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
     minPartitions).map(pair => pair._2.toString).setName(path)
 }
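
To make the `.map(pair => pair._2.toString)` step in textFile concrete, here is a minimal sketch in plain Scala with no Spark dependency. It assumes the usual TextInputFormat behaviour, where each record is a (byte offset of the line start, line text) pair. `TextFileSketch`, `records` and its string-based `textFile` are hypothetical names used only for illustration; they are not part of Spark, and the sketch only handles "\n" line endings.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for the records that hadoopFile yields with
// TextInputFormat: the key is the byte offset where the line starts,
// the value is the line text. Not Spark's actual implementation.
object TextFileSketch {

  // Build (offset, line) pairs for "\n"-separated content.
  def records(content: String): Seq[(Long, String)] = {
    val lines = content.split("\n").toSeq
    // Running byte offsets: each line advances by its byte length plus
    // one byte for the newline. scanLeft yields one extra trailing
    // offset, which zip discards.
    val offsets = lines.scanLeft(0L) { (acc, line) =>
      acc + line.getBytes(StandardCharsets.UTF_8).length + 1
    }
    offsets.zip(lines)
  }

  // Mirrors the shape of textFile: take the pairs and keep only the value.
  def textFile(content: String): Seq[String] =
    records(content).map(pair => pair._2.toString)

  def main(args: Array[String]): Unit = {
    val content = "hello spark\nhello rdd\n"
    records(content).foreach(println)  // (offset, line) pairs
    textFile(content).foreach(println) // lines only, offsets dropped
  }
}
```

The point of the sketch is that the offset key carries no meaning for a line-oriented job, so textFile discards it and exposes an RDD[String]. In Spark itself the result of that map call is a MapPartitionsRDD whose parent is the RDD returned by hadoopFile.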

 def hadoopFile[K, V](
     path: String,
     inputFormatClass: Class[_ <: InputFormat[K, V]],
     keyClass: Class[K],
     valueClass: Class[V],
     minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
   
   assertNotStopped()

   // This is a hack to enforce loading hdfs-site.xml.
   // See SPARK-11227 for details.
   FileSystem.getLocal(hadoopConfiguration)

   // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
   val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
   val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputP