The Way of Spark Mastery (Advanced): Reading the Spark Source, Section 3: Submitting a Spark Job

Latest recommended article published on 2020-12-19 14:56:23

zhouzhihubeyond

Categories: Spark, The Way of Spark Mastery. Tags: spark, source code analysis

Article link: https://blog.csdn.net/lovehuangjiaju/article/details/49256603

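In the previous section we analyzed how SparkContext is created. In this section we look at how a job is submitted when an RDD is actually executed, using the same source code as before: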

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length == 0) {
      System.err.println("Usage: SparkWordCount <inputfile> <outputfile>")
      System.exit(1)
    }

    val conf = new SparkConf().setAppName("SparkWordCount")
    val sc = new SparkContext(conf)

    // Transformations only build up the RDD lineage; nothing runs yet
    val file = sc.textFile("file:///hadoopLearning/spark-1.5.1-bin-hadoop2.4/README.md")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    // Action: this call is what submits the job
    counts.saveAsTextFile("file:///hadoopLearning/spark-1.5.1-bin-hadoop2.4/countReslut.txt")

    sc.stop()
  }
}

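In the program above, counts.saveAsTextFile("file:///hadoopLearning/spark-1.5.1-bin-hadoop2.4/countReslut.txt") is an action, so Spark creates a Job to run the computation. Internally, saveAsTextFile delegates to the saveAsHadoopFile method of PairRDDFunctions: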

// Save the RDD to any Hadoop-supported file system (local files, HDFS, etc.) using a Hadoop OutputFormat class
/**
   * Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
   * supporting the key and value types K and V in this RDD.
   */
  def saveAsHadoopFile(
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: OutputFormat[_, _]],
      conf: JobConf = new JobConf(self.context.hadoopConfiguration),
      codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope {
    // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
    // Hadoop configuration
    val hadoopConf = conf
    hadoopConf.setOutputKeyClass(keyClass)
    hadoopConf.setOutputValueClass(valueClass)
    // Doesn't work in Scala 2.9 due to what may be a generics bug
    // TODO: Should we uncomment this for Scala 2.10?
    // conf.setOutputFormat(outputFormatClass)
    hadoopConf.set("mapred.output.format.class", outputFormatClass.getName)
    for (c <- codec) {
      hadoopConf.setCompressMapOutput(true)
      hadoopConf.set("mapred.output.compress", "true")
      hadoopConf.setMapOutputCompressorClass(c)
      hadoopConf.set("mapred.output.compression.codec", c.getCanonicalName)
      hadoopConf.set("mapred.output.compression.type", CompressionType.BLOCK.toString)
    }

    // Use configured output committer if already set
    if (conf.getOutputCommitter == null) {
      hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
    }

    FileOutputFormat.setOutputPath(hadoopConf,
      SparkHadoopWriter.createPathFromString(path, hadoopConf))
    // Call saveAsHadoopDataset to actually save the RDD
    saveAsHadoopDataset(hadoopConf)
  }
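
The for (c <- codec) block in the listing above relies on a Scala idiom: looping over an Option runs the body once when a value is present and not at all when it is None, so the compression keys are written only if a codec was passed in. Below is a minimal sketch of that idiom. It uses a plain mutable map as a stand-in for Hadoop's JobConf, and the names CodecConfigSketch, ToyConf and configure are made up for illustration; none of them are Spark or Hadoop API.

```scala
import scala.collection.mutable

object CodecConfigSketch {
  // Stand-in for Hadoop's JobConf: a mutable key/value map (assumption for illustration).
  type ToyConf = mutable.Map[String, String]

  // Mirrors the shape of the configuration step in saveAsHadoopFile:
  // the output format is always set, compression keys only when a codec is given.
  def configure(outputFormat: String, codec: Option[String]): ToyConf = {
    val conf: ToyConf = mutable.LinkedHashMap.empty[String, String]
    conf("mapred.output.format.class") = outputFormat
    // A for-loop over an Option runs its body zero or one time.
    for (c <- codec) {
      conf("mapred.output.compress") = "true"
      conf("mapred.output.compression.codec") = c
      conf("mapred.output.compression.type") = "BLOCK"
    }
    conf
  }

  def main(args: Array[String]): Unit = {
    val plain = configure("org.apache.hadoop.mapred.TextOutputFormat", None)
    val gzipped = configure(
      "org.apache.hadoop.mapred.TextOutputFormat",
      Some("org.apache.hadoop.io.compress.GzipCodec"))
    println(plain.keys.mkString(", "))   // only the output format key
    println(gzipped.keys.mkString(", ")) // output format plus three compression keys
  }
}
```

Calling configure with None leaves a single key in the map, while passing Some(codecName) adds the three compression keys, which is the same branching the real method gets from its codec parameter.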

saveAsHadoopFile then calls saveAsHadoopDataset, which in turn calls self.context.runJob, that is, the runJob method of SparkContext:

/**
   * Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for
   * that storage system. The JobConf should set an OutputFormat and any output paths required
   * (e.g. a table name to write to) in the same way as it would be configured for a Hadoop
   * MapReduce job.
   */
  def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
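
As a rough mental model of why the save call is the point where work starts, the sketch below imitates the two ideas involved: transformations such as map only wrap each partition lazily, and a runJob-style call applies one function to every partition's iterator and returns one result per partition. This is a toy model in plain Scala, not Spark's implementation. ToyRDD, the simplified runJob signature, and the partition index standing in for Spark's TaskContext are all assumptions made for illustration.

```scala
object RunJobSketch {
  // Toy stand-in for an RDD: each partition is a factory that produces its iterator on demand.
  final case class ToyRDD[T](partitions: Vector[() => Iterator[T]]) {
    // Transformation: wraps each partition's factory and computes nothing yet.
    def map[U](f: T => U): ToyRDD[U] =
      ToyRDD(partitions.map(p => () => p().map(f)))
  }

  // Model of the runJob idea: run func once per partition, collect one result per partition.
  // The Int argument is the partition index (a simplification of Spark's TaskContext).
  def runJob[T, U](rdd: ToyRDD[T], func: (Int, Iterator[T]) => U): Vector[U] =
    rdd.partitions.zipWithIndex.map { case (p, i) => func(i, p()) }

  def main(args: Array[String]): Unit = {
    var evaluated = 0
    val base = ToyRDD[Int](Vector(() => Iterator(1, 2, 3), () => Iterator(4, 5)))
    val doubled = base.map { x => evaluated += 1; x * 2 }
    println("after map: evaluated = " + evaluated) // prints 0, the map has not run

    // "Save" action: each task reports the records it would write for its partition.
    val written = runJob(doubled, (i: Int, it: Iterator[Int]) => "part-" + i + ": " + it.mkString(","))
    written.foreach(println)                          // part-0: 2,4,6 then part-1: 8,10
    println("after runJob: evaluated = " + evaluated) // prints 5
  }
}
```

The counter stays at 0 after map and reaches 5 only once runJob has consumed both partitions, which matches the behavior described earlier: the action, not the transformations, triggers the job.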
