[Original] Big Data Fundamentals: Spark (7) - How Spark Splits Files on Read (i.e., the Number of RDD Partitions)

This article examines how the number of partitions is determined when an RDD is initialized in Spark 2.1.1, particularly when no minimum partition count is specified. Using SparkContext.textFile as the example, it drills down into HadoopRDD's getPartitions method, explains the role of InputFormat.getSplits, and touches on the split logic of different file formats such as Avro, ORC, Parquet, and plain text. In the default implementation a file is broken into multiple splits, taking into account whether the file is splittable, its block information, rack and host locality, and so on, in order to arrive at a sensible number of partitions.

spark 2.1.1

When Spark initializes an RDD it needs to read files, usually from HDFS. When reading, you can specify a minimum number of partitions, but this is only a hint: the actual number may be larger (for example, when there are very many or very large files) or smaller (for example, when there is only one small file). If no minimum partition count is specified, how is the default number of partitions of the resulting RDD decided?
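A quick way to see this in practice is to compare the partition count with and without an explicit minimum. The sketch below is illustrative only; it assumes a local SparkContext and a hypothetical input path:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("partition-count"))

    // Hypothetical path; replace with a real HDFS or local file.
    val path = "hdfs:///tmp/data/input.txt"

    // No minPartitions given: defaultMinPartitions (usually 2) is used as the hint.
    println(sc.textFile(path).getNumPartitions)

    // Explicit hint of 8: the actual count may still differ,
    // depending on file count, file size, block size and splittability.
    println(sc.textFile(path, 8).getNumPartitions)

    sc.stop()
  }
}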

Let's take SparkContext.textFile as an example and look at the code:

org.apache.spark.SparkContext

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

  /**
   * Default min number of partitions for Hadoop RDDs when not given by user
   * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
   * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
   */
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

  /** Get an RDD for a Hadoop file with an arbitrary InputFormat
   *
   * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
   * operation will create many references to the same object.
   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
   * copy them using a `map` function.
   */
  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to enforce loading hdfs-site.xml.
    // See SPARK-11227 for details.
    FileSystem.getLocal(hadoopConfiguration)

    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }
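As a quick sanity check on defaultMinPartitions (a sketch, assuming a local[4] master), defaultParallelism in local mode equals the number of worker threads, so defaultMinPartitions is capped at 2:

import org.apache.spark.{SparkConf, SparkContext}

object DefaultMinPartitionsCheck {
  def main(args: Array[String]): Unit = {
    // local[4]: defaultParallelism is 4, so defaultMinPartitions = min(4, 2) = 2
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("check"))
    println(sc.defaultParallelism)    // expected: 4
    println(sc.defaultMinPartitions)  // expected: 2
    sc.stop()
  }
}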

As you can see, this directly returns a HadoopRDD, and if no minimum partition count is passed in, defaultMinPartitions (normally 2) is used. So how is HadoopRDD implemented?

org.apache.spark.rdd.HadoopRDD

class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)
  extends RDD[(K, V)](sc, Nil) with Logging {
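The number of partitions ultimately comes from HadoopRDD.getPartitions, which asks the configured InputFormat for its splits via InputFormat.getSplits(jobConf, minPartitions) and turns each split into one partition. The standalone sketch below illustrates that same call against Hadoop's old-style TextInputFormat; the path and the minimum-split hint are placeholders, not part of Spark's source:

import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

object SplitCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical input path; replace with a real HDFS or local path.
    val path = "hdfs:///tmp/data"
    val minSplits = 2 // plays the same role as HadoopRDD's minPartitions

    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, path)

    val inputFormat = new TextInputFormat()
    inputFormat.configure(jobConf) // TextInputFormat reads compression codec settings from the JobConf

    // getSplits treats minSplits only as a hint; the real count depends on
    // file count, file size, block size and whether the files are splittable.
    val splits = inputFormat.getSplits(jobConf, minSplits)
    println(s"number of splits (== RDD partitions): ${splits.length}")
  }
}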