Spark源码解析之textFile

最新推荐文章于 2023-08-14 20:04:00 发布

AlferWei

最新推荐文章于 2023-08-14 20:04:00 发布

阅读量3.9k

点赞数 4

分类专栏： Spark Spark专栏文章标签： spark 源码 hdfs

本文链接：https://blog.csdn.net/OiteBody/article/details/54934317

版权

Spark 同时被 2 个专栏收录

32 篇文章 0 订阅

订阅专栏

Spark专栏

14 篇文章 7 订阅

订阅专栏

Spark加载文件的时候可以指定最小的partition数量，那么这个patition数量和读取文件时的split操作有什么联系呢？下面就跟着Spark源码，看看二者到底是什么关系。

/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(
   path: String,
   minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
 assertNotStopped()
 hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
   minPartitions).map(pair => pair._2.toString).setName(path)
  }

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

其中最重要的就是hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions)方法了：

def hadoopFile[K, V](
   path: String,
   inputFormatClass: Class[_ <: InputFormat[K, V]],
   keyClass: Class[K],
   valueClass: Class[V],
   minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
 assertNotStopped()

 // This is a hack to enforce loading hdfs-site.xml.
 // See SPARK-11227 for details.
 FileSystem.getLocal(hadoopConfiguration)

 // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
 val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
 val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
 new HadoopRDD(
   this,
   confBroadcast,
   Some(setInputPathsFunc),
   inputFormatClass,
   keyClass,
   valueClass,
   minPartitions).setName(path)
}

上面代码可以看到做了几件事情：
1、强制获取hadoopConfiguration，并将hadoopConfiguration进行广播；
2、设置任务的文件读取路径；
3、实例化HadoopRDD。

进入到HadoopRDD中，找到getPartitions():

 override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }

发现其中做了三件事情：
1、获取JobConf，并将其添加信任凭证；
2、获取输入路径格式，并将其按照minPartitions进行split；
3、根据输入的split的个数创建对应的HadoopPartition。

从源码中可以得出，textFile方法加载文件时，会根据minPartitions的个数进行split，如果不指定minPartitions的值则默认为defaultParallelism和2的最小值。进行split的方式，由要读取的文件类型动态决定。此处读取的文本文件，则根据类的继承关系，TextInputFormat －> FileInputFormat 中的getSplits(JobConf job, int numSplits)方法。

AlferWei

关注

4
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Spark源码解析之textFile

Spark加载文件的时候可以指定最小的partition数量，那么这个patition数量和读取文件时的split操作有什么联系呢？下面就跟着Spark源码，看看二者到底是什么关系。/*** Read a text file from HDFS, a local file system (available on all nodes), or any* Hadoop-supported file
复制链接

扫一扫