Spark (4) - File Read and Write Flow

1. Spark file read flow

Reading a local text file (sc is the SparkContext, e.g. the one created by spark-shell):

// path can point to a single file or a directory, and wildcards are also allowed
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/"
val rdd1 = sc.textFile(path, 2)

Reading a file from HDFS: sc.textFile("hdfs://s1:8020/user/hdfs/input")
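
A quick way to check what the suggested minimum of 2 partitions actually became is to ask the RDD itself; a small sketch reusing the sc, path and rdd1 from above:

println(rdd1.getNumPartitions)  // often larger than 2, e.g. roughly one partition per file under licenses/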

textFile()

Read a text file from HDFS, a local file system (available on all nodes),
or any Hadoop-supported file system URI, and return it as an RDD of Strings.

So, when reading a file with textFile(), what actually determines the partitioning? How many partitions are there, and how large is each one?

The key and value types of the RDD returned by textFile() are determined by the InputFormat. From the arguments passed to hadoopFile() we can see that
the value type is Text, i.e. the lines of the text file. So what does the key type, LongWritable, mean?
https://blog.csdn.net/lzm1340458776/article/details/42707047
It is the byte offset of each record (line) within the file; the key of the first record of a split is that partition's starting offset.
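
To see those (LongWritable, Text) pairs before textFile() maps them away, you can call hadoopFile() directly with the same arguments; a small sketch reusing the sc and path from above:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val pairs = sc.hadoopFile(path, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], 2)

// Hadoop reuses the Writable objects, so copy the values out before collecting.
pairs.map { case (offset, line) => (offset.get, line.toString) }
  .take(5)
  .foreach { case (off, text) => println(s"offset=$off  line=$text") }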

The main logic is as follows:

org.apache.spark.SparkContext.scala

textFile() takes the path of a text/HDFS file and a suggested minimum number of partitions (the default, defaultMinPartitions = min(defaultParallelism, 2), is usually 2). The actual partition count may be larger, for example when there are very many files or very large files, and it can also be smaller.
It returns an 'RDD of lines of the text file': the (key, value) pairs produced by hadoopFile() are map()-ed to value.toString, so only the text lines are returned.

  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // Hadoop's FileSystem, from org.apache.hadoop.fs.FileSystem
    // This is a hack to enforce loading hdfs-site.xml.
    FileSystem.getLocal(hadoopConfiguration)

    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }
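
Because hadoopFile() simply hands the path string to FileInputFormat.setInputPaths(jobConf, path), the path argument of textFile() may also be a comma-separated list of files, directories and globs. A small sketch (the two paths are the hypothetical ones used earlier in this post):

val combined = sc.textFile(
  "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/*," +
  "hdfs://s1:8020/user/hdfs/input")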

HadoopRDD extends RDD and uses the old MapReduce API ('org.apache.hadoop.mapred').
It provides the core functionality for reading data stored in Hadoop,
and overrides the getPartitions/compute/getPreferredLocations/persist methods.

org.apache.spark.rdd.HadoopRDD.scala

HadoopRDD's getPartitions() method

The input data is split into N pieces, and each piece corresponds to one partition of the RDD. The number of partitions determines the number of tasks and therefore the parallelism of the program.
HadoopRDD's getPartitions() method explains how Spark partitions are determined.

With comments added:

  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    try {
      // Based on the minPartitions passed into HadoopRDD and the jobConf, compute the input splits: allInputSplits.
      // getInputFormat(): for textFile() this is the TextInputFormat passed in via
      //                   hadoopFile(path, classOf[TextInputFormat], ...), which extends Hadoop's FileInputFormat.
      // getSplits() first computes a suitable splitSize from the JobConf and the file system's blockSize:
      //            in the old mapred API, splitSize = Math.max(minSize, Math.min(goalSize, blockSize)),
      //            where goalSize = totalSize / minPartitions; in the common case this works out to the
      //            file system's default blockSize (128 MB). The input files are then cut into pieces of
      //            splitSize, and the partition count is the number of pieces.
      // Hadoop code walkthrough: https://www.cnblogs.com/barneywill/p/10192800.html
      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
      val inputSplits = if (ignoreEmptySplits) {
        allInputSplits.filter(_.getLength > 0)
      } else {
        allInputSplits
      }
      // Create an Array[Partition] with one element per input split; each element holds a HadoopPartition.
      // HadoopPartition is an inner class of HadoopRDD that wraps around a Hadoop InputSplit.
      // Finally, return this array.
      val array = new Array[Partition](inputSplits.size)
      for (i <- 0 until inputSplits.size) {
        // HadoopPartition stores (rddId, index, InputSplit)
        array(i) = new HadoopPartition(id, i, inputSplits(i))
      }
      array
    } catch {
      case e: InvalidInputException if ignoreMissingFiles =>
        logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist and no" +
            s" partitions returned from this path.", e)
        Array.empty[Partition]
    }
  }
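
To connect this back to the textFile() examples in section 1, here is a hedged illustration (the file sizes are hypothetical) of what getSplits() produces:

// Many small files: a split never spans files, so each non-empty file yields
// at least one split, i.e. roughly one partition per file.
val licenses = sc.textFile("file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/", 2)
println(licenses.getNumPartitions)

// One large HDFS file (say ~1 GB with a 128 MB blockSize): splitSize == blockSize,
// so getSplits() returns about 1 GB / 128 MB = 8 partitions.
val big = sc.textFile("hdfs://s1:8020/user/hdfs/input", 2)
println(big.getNumPartitions)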

HadoopRDD's getPreferredLocations() method

From the InputSplit stored in the HadoopPartition, it obtains the locations (node host names) where the partition's data resides.

  override def getPreferredLocations(split: Partition): Seq[String] = {
    val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
    val locs = hsplit match {
      case lsplit: InputSplitWithLocationInfo =>
        HadoopRDD.convertSplitLocationInfo(lsplit.getLocationInfo)
      case _ => None
    }
    locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
  }
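
RDD exposes this through the public preferredLocations() method, so the block locations can be inspected directly; a small sketch assuming the HDFS file from section 1:

val rdd = sc.textFile("hdfs://s1:8020/user/hdfs/input", 2)
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}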

HadoopRDD's compute() method

This is a classic iterator pattern: the iterator encapsulates the details of the traversal.

I have seen leveldb's two-level iterator design before (written in 2013, and really well done).
leveldb's TwoLevelIterator works like this: the level-1 iterator points at a container, and only the level-2 iterator points at the actual elements.
It can iterate not only over the stored sstable objects; it also takes a function, BlockFunction, which lets it traverse the Block object data stored inside.
This is similar to deque in the C++ STL: every element of its map is a pointer to a larger contiguous buffer,
and the cur/first/last pointers over these segments make the whole structure appear contiguous.

Because HadoopRDD already has InputSplits, it does not seem to use a two-level iterator.

  override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
    val iter = new NextIterator[(K, V)] {
      ...
    }
    new InterruptibleIterator[(K, V)](context, iter)
  }

1. Get the file info from the HadoopPartition and update the global InputFileBlockHolder;
2. Define a few helper functions (updateBytesRead, etc.);
3. Using the jobConf and the inputSplit, call getRecordReader() on getInputFormat(jobConf)
   (i.e. TextInputFormat) to obtain a RecordReader[K, V]:

  reader =
        try {
          inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
        } catch {
          // error handling elided
        }

4. With the RecordReader in hand, register an on-task-completion callback that closes the input stream;
5. Override getNext() to return the RecordReader's next(key, value), and override close()
   (a simplified sketch follows this list);
6. Finally, return the iterator wrapped in an InterruptibleIterator together with the task context.
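
Below is a simplified, hedged sketch of the getNext() idea from step 5. The real code is an anonymous NextIterator inside compute(); this standalone version only shows the core loop of pulling records from the old-API RecordReader until next() returns false:

import org.apache.hadoop.mapred.RecordReader

def recordIterator[K, V](reader: RecordReader[K, V]): Iterator[(K, V)] =
  new Iterator[(K, V)] {
    private val key = reader.createKey()      // reused for every record, as Hadoop does
    private val value = reader.createValue()
    private var finished = false
    private var gotNext = false

    override def hasNext: Boolean = {
      if (!finished && !gotNext) {
        finished = !reader.next(key, value)   // false => end of this split
        gotNext = !finished
        if (finished) reader.close()          // mirrors the close() override in step 5
      }
      !finished
    }

    override def next(): (K, V) = {
      if (!hasNext) throw new NoSuchElementException("End of split")
      gotNext = false
      (key, value)                            // note: the key/value objects are reused
    }
  }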

2. RDD file write operations

saveAsTextFile lives in RDD.scala and is implemented on top of saveAsHadoopFile.
It uses the mapPartitions operator to turn each partition's elements into (NullWritable, Text) pairs, wraps the result with rddToPairRDDFunctions,
and then calls PairRDDFunctions.saveAsHadoopFile to write the files.

    RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
        .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
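
For reference, the user-facing call is just saveAsTextFile on any RDD; a minimal usage sketch with a hypothetical output path:

val lines = sc.parallelize(Seq("a", "b", "c"), 2)
lines.saveAsTextFile("hdfs://s1:8020/user/hdfs/output")
// produces output/part-00000 and output/part-00001 (one file per partition), plus _SUCCESS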

See the source in PairRDDFunctions.scala under org.apache.spark.rdd:

  /**
   * Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
   * supporting the key and value types K and V in this RDD.
   *
   * @note We should make sure our tasks are idempotent when speculation is enabled, i.e. 
   * do not use output committer that writes data directly.
   * There is an example in https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
   * result of using direct output committer with speculation enabled.
   */
  def saveAsHadoopFile(
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: OutputFormat[_, _]],
      conf: JobConf = new JobConf(self.context.hadoopConfiguration),
      codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope {
    // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
    val hadoopConf = conf
    hadoopConf.setOutputKeyClass(keyClass)
    hadoopConf.setOutputValueClass(valueClass)
    conf.setOutputFormat(outputFormatClass)
    for (c <- codec) {
      hadoopConf.setCompressMapOutput(true)
      hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true")
      hadoopConf.setMapOutputCompressorClass(c)
      hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", c.getCanonicalName)
      hadoopConf.set("mapreduce.output.fileoutputformat.compress.type",
        CompressionType.BLOCK.toString)
    }

    // Use configured output committer if already set
    if (conf.getOutputCommitter == null) {
      hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
    }

    // When speculation is on and output committer class name contains "Direct", we should warn
    // users that they may loss data if they are using a direct output committer.
    val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
    val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
    if (speculationEnabled && outputCommitterClass.contains("Direct")) {
      val warningMessage =
        s"$outputCommitterClass may be an output committer that writes data directly to " +
          "the final location. Because speculation is enabled, this output committer may " +
          "cause data loss (see the case in SPARK-10063). If possible, please use an output " +
          "committer that does not have this behavior (e.g. FileOutputCommitter)."
      logWarning(warningMessage)
    }

    FileOutputFormat.setOutputPath(hadoopConf,
      SparkHadoopWriterUtils.createPathFromString(path, hadoopConf))
    saveAsHadoopDataset(hadoopConf)
  }
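
A hedged usage sketch of the pair-RDD side (hypothetical path and data): saving through one of the saveAsHadoopFile overloads with a compression codec, which exercises the codec branch shown above:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.TextOutputFormat

val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")), 2)
pairs.saveAsHadoopFile[TextOutputFormat[String, String]](
  "hdfs://s1:8020/user/hdfs/output-gz", classOf[GzipCodec])
// each partition is written as a gzip-compressed part file, e.g. part-00000.gz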