Source Code Analysis and Usage of saveAsHadoopDataset and saveAsNewAPIHadoopDataset

邢为栋

Category: Bigdata. Tags: hadoop, spark, hbase

Article link: https://blog.csdn.net/xwd127429/article/details/108649941

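While looking into ways for Spark Streaming to save data to HBase, I found that one approach relies on two Spark operators, saveAsHadoopDataset and saveAsNewAPIHadoopDataset, which write an RDD to any storage system supported by Hadoop.

This article walks through the source code of the two operators and then shows how to use them with HBase as the target storage system.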

Spark version: 2.4.0-cdh6.3.2 (Scala 2.11 build).

HBase version: 2.1.0-cdh6.3.2.

Hadoop version: 3.0.0-cdh6.3.2.

Background

Before analyzing the two operators, a little background on Hadoop MapReduce is needed.

Hadoop MapReduce ships two sets of APIs:

mapred
mapreduce

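mapreduce is the newer of the two and is usually called the new API.

Most services in the Hadoop ecosystem integrate with both APIs, including Spark, which this article is concerned with.

The correspondence is: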

MapReduce API	Spark API
mapred	saveAsHadoopDataset
mapreduce	saveAsNewAPIHadoopDataset

Source Code Analysis

Both operators are defined in core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala.

The source of saveAsNewAPIHadoopDataset and saveAsHadoopDataset:

/**
* Output the RDD to any Hadoop-supported storage system with new Hadoop API, using a Hadoop
* Configuration object for that storage system. The Conf should set an OutputFormat and any
* output paths required (e.g. a table name to write to) in the same way as it would be
* configured for a Hadoop MapReduce job.
*
* @note We should make sure our tasks are idempotent when speculation is enabled, i.e. do
* not use output committer that writes data directly.
* There is an example in https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
* result of using direct output committer with speculation enabled.
*/
def saveAsNewAPIHadoopDataset(conf: Configuration): Unit = self.withScope {
  val config = new HadoopMapReduceWriteConfigUtil[K, V](new SerializableConfiguration(conf))
  SparkHadoopWriter.write(
    rdd = self,
    config = config)
}

/**
* Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for
* that storage system. The JobConf should set an OutputFormat and any output paths required
* (e.g. a table name to write to) in the same way as it would be configured for a Hadoop
* MapReduce job.
*/
def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
  val config = new HadoopMapRedWriteConfigUtil[K, V](new SerializableJobConf(conf))
  SparkHadoopWriter.write(
    rdd = self,
    config = config)
}

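The code comments are clear. Both methods write an RDD to a Hadoop-supported storage system; they differ only in the configuration object they take, and that difference mirrors the split between the two MapReduce APIs.

saveAsNewAPIHadoopDataset takes a Hadoop Configuration object. The Configuration should set an OutputFormat and any output path required; the output path can be a table name. The @note in the comment warns that when speculation is enabled the tasks must be idempotent, so an output committer that writes data directly should not be used (see SPARK-10063).

saveAsHadoopDataset takes a Hadoop JobConf object. The JobConf should set an OutputFormat and any output path required; again, the output path can be a table name.

Looking closely at the code, the important pieces are HadoopMapReduceWriteConfigUtil, HadoopMapRedWriteConfigUtil, and SparkHadoopWriter, all of which are defined in org/apache/spark/internal/io/SparkHadoopWriter.scala.

HadoopMapReduceWriteConfigUtil reads the Configuration and creates an output workflow from it; HadoopMapRedWriteConfigUtil does the same from the JobConf.

SparkHadoopWriter takes that output workflow, builds the job, and writes the RDD to the Hadoop-supported storage system.

The key part of SparkHadoopWriter's output logic is shown below: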

// Try to write all RDD partitions as a Hadoop OutputFormat.
try {
  val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
    // SPARK-24552: Generate a unique "attempt ID" based on the stage and task attempt numbers.
    // Assumes that there won't be more than Short.MaxValue attempts, at least not concurrently.
    val attemptId = (context.stageAttemptNumber << 16) | context.attemptNumber

    executeTask(
      context = context,
      config = config,
      jobTrackerId = jobTrackerId,
      commitJobId = commitJobId,
      sparkPartitionId = context.partitionId,
      sparkAttemptNumber = attemptId,
      committer = committer,
      iterator = iter)
  })

  committer.commitJob(jobContext, ret)
  logInfo(s"Job ${jobContext.getJobID} committed.")
} catch {
  case cause: Throwable =>
    logError(s"Aborting job ${jobContext.getJobID}.", cause)
    committer.abortJob(jobContext)
    throw new SparkException("Job aborted.", cause)
}

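In the code above, sparkContext.runJob creates a group of tasks, one per RDD partition (see the SparkContext source for details). Each task runs executeTask, which writes out the data of one partition. Once all tasks have finished, committer.commitJob commits the whole job; if anything throws along the way, the job is aborted and a SparkException is raised.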
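The control flow of that excerpt can be modelled in plain Java. The following is only a toy sketch and not Spark code: WriteFlowSketch, Committer, PartitionTask, and packAttemptId are hypothetical names introduced for illustration, and the tasks run sequentially here, whereas Spark schedules them on executors. What it does reproduce is the bit layout of the attempt ID and the rule that the job is committed only when every task succeeds.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the driver-side flow in the excerpt above; none of this is Spark code. */
class WriteFlowSketch {

    /** Hypothetical stand-in for the committer; only the two calls used here. */
    interface Committer {
        void commitJob(List<String> taskResults);
        void abortJob();
    }

    /** Hypothetical stand-in for executeTask: writes one partition and returns a message. */
    interface PartitionTask {
        String run(int partitionId, int attemptId) throws Exception;
    }

    /** Same bit layout as the excerpt: stage attempt shifted left by 16, task attempt in the low bits. */
    static int packAttemptId(int stageAttemptNumber, int taskAttemptNumber) {
        return (stageAttemptNumber << 16) | taskAttemptNumber;
    }

    /** Runs one task per partition; commits only if all succeed, otherwise aborts and rethrows. */
    static void write(int numPartitions, int stageAttemptNumber,
                      PartitionTask task, Committer committer) {
        try {
            List<String> results = new ArrayList<>();
            for (int partitionId = 0; partitionId < numPartitions; partitionId++) {
                // Every task is on its first attempt (attempt number 0) in this toy run.
                results.add(task.run(partitionId, packAttemptId(stageAttemptNumber, 0)));
            }
            committer.commitJob(results);
        } catch (Exception cause) {
            committer.abortJob();
            throw new RuntimeException("Job aborted.", cause);
        }
    }

    public static void main(String[] args) {
        Committer printing = new Committer() {
            @Override public void commitJob(List<String> taskResults) {
                System.out.println("committed " + taskResults);
            }
            @Override public void abortJob() {
                System.out.println("aborted");
            }
        };
        write(2, 1, (partitionId, attemptId) ->
                "partition " + partitionId + " attempt " + attemptId, printing);
    }
}
```

Because the stage attempt number occupies the bits above the low 16, two attempts of the same task in different stage attempts get different IDs, which is the uniqueness the SPARK-24552 comment is after.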

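Usage Examples

In application development, what we really care about is how to configure the operators, that is, how to build the JobConf or Configuration.

The following sections show how the two operators are used in practice with HBase.

Using saveAsNewAPIHadoopDataset

As mentioned above, saveAsNewAPIHadoopDataset takes a Hadoop Configuration object, which should set an OutputFormat and an output path; the output path can be a table name.

So we need to create a Hadoop Configuration for HBase and set the OutputFormat and the output table name on it.

Example: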

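import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public static void saveToHBase1(JavaPairRDD<String, Integer> rdd, List<String> resources, String hbaseTableName) {
    Configuration config = HBaseConfiguration.create();
    // Add the required configuration files (hbase-site.xml, core-site.xml)
    for (String resource : resources) {
        config.addResource(new Path(resource));
    }

    // New (mapreduce) API: OutputFormat class and output table are set by key
    config.set("mapreduce.job.outputformat.class", "org.apache.hadoop.hbase.mapreduce.TableOutputFormat");
    config.set("hbase.mapred.outputtable", hbaseTableName);

    // Turn each (rowKey, value) pair into an HBase Put on column cf:col
    JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = rdd.mapToPair(line -> {
        Put put = new Put(Bytes.toBytes(line._1));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(line._2.toString()));
        return new Tuple2<>(new ImmutableBytesWritable(), put);
    });
    hbasePuts.saveAsNewAPIHadoopDataset(config);
}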

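First we create a Hadoop Configuration object, then set the OutputFormat and the output table name on it.

saveAsNewAPIHadoopDataset corresponds to the mapreduce API, so the settings must follow the conventions of the Hadoop mapreduce API.

Reading the Hadoop mapreduce API and its source, I found that setOutputFormatClass on the Job class sets the OutputFormat: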

/**
 * Set the {@link OutputFormat} for the job.
 * @param cls the <code>OutputFormat</code> to use
 * @throws IllegalStateException if the job is submitted
 */
public void setOutputFormatClass(Class<? extends OutputFormat> cls
) throws IllegalStateException {
    ensureState(JobState.DEFINE);
    conf.setClass(OUTPUT_FORMAT_CLASS_ATTR, cls,
            OutputFormat.class);
}

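This method stores our class under OUTPUT_FORMAT_CLASS_ATTR, but an application that works with a plain Configuration rather than a Job cannot call it directly, so we need to find out what OUTPUT_FORMAT_CLASS_ATTR actually is. Reading further, Job implements the JobContext interface, which in turn extends the MRJobConfig interface, and that is where the constant is defined: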

public static final String OUTPUT_FORMAT_CLASS_ATTR = "mapreduce.job.outputformat.class";

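So in an application we can set the OutputFormat through the key mapreduce.job.outputformat.class.

That gives us the key. What should the value be?

The answer is the fully qualified name of a subclass of the mapreduce API's OutputFormat.

HBase provides such a subclass: org.apache.hadoop.hbase.mapreduce.TableOutputFormat.

This class also defines the key used to set the output table name: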
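To see why setting the raw key amounts to the same thing as calling a typed setter, here is a minimal sketch with a toy configuration class. ToyConf is a hypothetical stand-in backed by a map, not Hadoop's Configuration, and java.util.ArrayList and java.util.List merely play the roles of TableOutputFormat and OutputFormat so that the example runs without Hadoop on the classpath. The assumption it illustrates is that a class-valued setting ends up stored as the class name under a string key.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a Hadoop-style configuration (not the real class): a string-to-string map. */
class ToyConf {
    // Key literal quoted from MRJobConfig in the text above.
    static final String OUTPUT_FORMAT_CLASS_ATTR = "mapreduce.job.outputformat.class";

    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key) {
        return props.get(key);
    }

    /** Typed setter: checks that cls implements iface, then stores the class name as a string. */
    void setClass(String key, Class<?> cls, Class<?> iface) {
        if (!iface.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(cls.getName() + " is not a " + iface.getName());
        }
        set(key, cls.getName());
    }

    public static void main(String[] args) {
        // Typed route, analogous to calling a setter with a Class object.
        ToyConf typed = new ToyConf();
        typed.setClass(OUTPUT_FORMAT_CLASS_ATTR, java.util.ArrayList.class, java.util.List.class);

        // Raw route, analogous to config.set(key, className) in the example.
        ToyConf raw = new ToyConf();
        raw.set(OUTPUT_FORMAT_CLASS_ATTR, "java.util.ArrayList");

        // Both routes leave the same entry behind.
        System.out.println(typed.get(OUTPUT_FORMAT_CLASS_ATTR).equals(raw.get(OUTPUT_FORMAT_CLASS_ATTR)));
    }
}
```

The trade-off is that the typed route rejects a class of the wrong kind at configuration time, while a misspelled class name passed through the raw route only surfaces when the job tries to load it.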

/** Job parameter that specifies the output table. */
public static final String OUTPUT_TABLE = "hbase.mapred.outputtable";

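Using saveAsHadoopDataset

As mentioned above, saveAsHadoopDataset takes a Hadoop JobConf object, which should set an OutputFormat and an output path; the output path can be a table name.

So we need to create a JobConf from the HBase Hadoop Configuration and set the OutputFormat and the output table name on it.

Example: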

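import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public static void saveToHBase2(JavaPairRDD<String, Integer> rdd, List<String> resources, String hbaseTableName) {
    Configuration config = HBaseConfiguration.create();
    // Add the required configuration files (hbase-site.xml, core-site.xml)
    for (String resource : resources) {
        config.addResource(new Path(resource));
    }

    // Old (mapred) API: wrap the Configuration in a JobConf and use its setters
    JobConf jobConfig = new JobConf(config);
    jobConfig.setOutputFormat(TableOutputFormat.class);
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName);

    // Turn each (rowKey, value) pair into an HBase Put on column cf:col
    JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = rdd.mapToPair(line -> {
        Put put = new Put(Bytes.toBytes(line._1));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(line._2.toString()));
        return new Tuple2<>(new ImmutableBytesWritable(), put);
    });
    hbasePuts.saveAsHadoopDataset(jobConfig);
}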

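First we create a Hadoop Configuration object and wrap it in a JobConf, then set the OutputFormat and the output table name.

saveAsHadoopDataset corresponds to the mapred API, so the settings must follow the conventions of the Hadoop mapred API.

Reading the Hadoop mapred API and its source, I found that setOutputFormat on the JobConf class sets the OutputFormat: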

/**
 * Set the {@link OutputFormat} implementation for the map-reduce job.
 *
 * @param theClass the {@link OutputFormat} implementation for the map-reduce 
 *                 job.
 */
public void setOutputFormat(Class<? extends OutputFormat> theClass) {
    setClass("mapred.output.format.class", theClass, OutputFormat.class);
}

Since we have to create a JobConf instance anyway, we can call this method directly:

jobConfig.setOutputFormat(TableOutputFormat.class);

This TableOutputFormat is org.apache.hadoop.hbase.mapred.TableOutputFormat.

Setting the output table name:

jobConfig.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName);

Other Notes

One point in the example code above deserves attention:

JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = rdd.mapToPair(line -> {
    Put put = new Put(Bytes.toBytes(line._1));
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(line._2.toString()));
    return new Tuple2<>(new ImmutableBytesWritable(), put);
});

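This builds an RDD that can be saved to HBase.

org.apache.hadoop.hbase.mapred.TableOutputFormat extends FileOutputFormat<ImmutableBytesWritable, Put>, so the RDD must be built with matching key and value types.

org.apache.hadoop.hbase.mapreduce.TableOutputFormat extends OutputFormat<KEY, Mutation>; the KEY is ignored and the Mutation must be a Put or Delete instance, which is why the new-API example can pass an empty ImmutableBytesWritable as the key.

The class comment reads: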

/**
 * Convert Map/Reduce output and write it to an HBase table. The KEY is ignored
 * while the output value <u>must</u> be either a {@link Put} or a
 * {@link Delete} instance.
 */
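The contract in that comment can be made concrete with a small toy model. This is a sketch of the documented behaviour only: ToyTableWriter and the Toy* classes are hypothetical stand-ins invented here so the example runs without HBase, and the error message is likewise made up. It shows a writer that never reads the key and accepts only put-like or delete-like values.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the contract in the comment above; the Toy* classes are not HBase classes. */
class ToyTableWriter {

    /** Stand-in for a mutation: it carries its own row key. */
    static class ToyMutation {
        final String row;
        ToyMutation(String row) { this.row = row; }
    }

    static class ToyPut extends ToyMutation { ToyPut(String row) { super(row); } }
    static class ToyDelete extends ToyMutation { ToyDelete(String row) { super(row); } }
    static class ToyAppend extends ToyMutation { ToyAppend(String row) { super(row); } }

    /** Mutations accepted so far, in write order. */
    final List<ToyMutation> written = new ArrayList<>();

    /** The key is never read; the target row comes from the mutation itself. */
    <K> void write(K ignoredKey, ToyMutation value) throws IOException {
        if (!(value instanceof ToyPut) && !(value instanceof ToyDelete)) {
            throw new IOException("value must be a Put or a Delete");
        }
        written.add(value);
    }

    public static void main(String[] args) throws IOException {
        ToyTableWriter writer = new ToyTableWriter();
        writer.write(null, new ToyPut("row-1"));         // any key works, even null
        writer.write("unused", new ToyDelete("row-2"));  // the key's value does not matter
        System.out.println(writer.written.size());
    }
}
```

In the article's examples the row key travels inside the Put (new Put(Bytes.toBytes(line._1))), which matches this picture: the RDD key only has to satisfy the type signature.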

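Summary

This article analyzed the source code of saveAsNewAPIHadoopDataset and saveAsHadoopDataset, showed how to use them, and followed a few related threads in the Hadoop and HBase sources.

It also records the exploration process, which I hope helps readers understand the material better.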
