How saveAsTextFile works: a source code walkthrough (Spark 3.0)

best啊李

Category: spark  Tags: spark

Article link: https://blog.csdn.net/qq_27015119/article/details/120050340

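This article looks at how `saveAsTextFile` is implemented in Spark 3.0. `mapPartitions` converts the data to Hadoop `Text` values, and `saveAsHadoopFile` is then called on the resulting `MapPartitionsRDD`. Along the way, the key and value classes, the compression codec, and related settings are configured through `PairRDDFunctions`. Finally, `SparkHadoopWriter` runs the tasks that write each partition's data, and each partition is committed once it has been written.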

`mapPartitions` wraps each element in a `Text` (Hadoop's writable string type) and returns a `MapPartitionsRDD`; `saveAsHadoopFile` is then called on that RDD.

/**
   * Note: saves the data to files using the specified compression codec.
   * Save this RDD as a compressed text file, using string representations of elements.
   */
  def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = withScope {
    // Note: mapPartitions returns a MapPartitionsRDD
    this.mapPartitions { iter =>
      // Note: wrap each element in a Text, Hadoop's writable string type
      val text = new Text()
      iter.map { x =>
        require(x != null, "text files do not allow null rows")
        text.set(x.toString)
        (NullWritable.get(), text)
      }
      // Note: the output is written with TextOutputFormat
    }.saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
  }
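
One detail in the method above is easy to miss: a single `Text` object is created per partition and refilled for every row, rather than allocating a new one each time. The sketch below reproduces that pattern in plain Scala so it runs without Spark or Hadoop. `MutableText` and `TextReuseDemo` are hypothetical stand-ins introduced only for illustration, and `Unit` takes the place of `NullWritable`. It assumes the consumer handles each record before pulling the next one, which is how a record-at-a-time writer behaves.

```scala
// Hypothetical stand-in for org.apache.hadoop.io.Text: a mutable string holder.
final class MutableText {
  private var value: String = ""
  def set(s: String): Unit = value = s
  override def toString: String = value
}

object TextReuseDemo {
  // Mirrors the function passed to mapPartitions:
  // one holder per partition, refilled for every row.
  def wrap[T](iter: Iterator[T]): Iterator[(Unit, MutableText)] = {
    val text = new MutableText()
    iter.map { x =>
      require(x != null, "text files do not allow null rows")
      text.set(x.toString)
      ((), text)
    }
  }

  // Reading each record as it is produced sees every value.
  val streamed: List[String] =
    wrap(Iterator(1, 2, 3)).map(_._2.toString).toList

  // Buffering the pairs first keeps three references to the same holder,
  // so only the last value written is visible.
  val buffered: List[String] =
    wrap(Iterator(1, 2, 3)).toList.map(_._2.toString)
}
```

The reuse saves one allocation per row, but it means the pairs coming out of this iterator should not be collected or cached before they are written.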

The `PairRDDFunctions` class then sets the key class, the value class, the compression codec, and so on.

  /**
   * Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
   * supporting the key and value types K and V in this RDD. Compress the result with the
   * supplied codec.
   */
  def saveAsHadoopFile[F <: OutputFormat[K, V]](
      path: String,
      codec: Class[_ <: CompressionCodec])(implicit fm: ClassTag[F]): Unit = self.withScope {
    val runtimeClass = fm.runtimeClass
    // Note: pass on the path, key class, value class, output format class, and codec
    saveAsHadoopFile(path, keyClass, valueClass, runtimeClass.asInstanceOf[Class[F]], codec)
  }
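
The overload above turns the type argument `F` into a `Class` object through the implicit `ClassTag`. A minimal sketch of that mechanism, in plain Scala and independent of Spark, follows. `ClassTagDemo` and `runtimeClassOf` are hypothetical names used only for this illustration.

```scala
import scala.reflect.ClassTag

object ClassTagDemo {
  // Hypothetical helper: recovers the runtime Class of a type argument,
  // the same way fm.runtimeClass does in saveAsHadoopFile[F].
  def runtimeClassOf[F](implicit fm: ClassTag[F]): Class[_] = fm.runtimeClass

  val stringClass: Class[_] = runtimeClassOf[String]

  // The type arguments of F are erased: only the outer class survives.
  val listClass: Class[_] = runtimeClassOf[List[Int]]
}
```

Because only the erased class is available at runtime, `TextOutputFormat[NullWritable, Text]` arrives as the plain `TextOutputFormat` class, which is why the source casts it with `asInstanceOf[Class[F]]` before handing it on.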


  /**
   * Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
   * supporting the key and value types K and V in this RDD. Compress with the supplied codec.
   */
  def saveAsHadoopFile(
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: OutputFormat[_, _]],
      codec: Class[_ <: CompressionCodec]): Unit = self.withScope {
    saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass,
      new JobConf(self.context.hadoopConfiguration), Option(codec))
  }