How to use hadoopRDD and newAPIHadoopRDD

張萠飛

Category: Spark. Tags: newAPIHadoopRDD, hadoopRDD

Article link: https://blog.csdn.net/zpf_940810653842/article/details/104815533

Table of Contents

hadoopRDD

newAPIHadoopRDD

Usage example

hadoopRDD

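Gets an RDD for a Hadoop-readable dataset from a Hadoop JobConf, given its InputFormat and any other necessary information (for example, the file name for a filesystem-based dataset, or the table name for HyperTable), using the older MapReduce API (org.apache.hadoop.mapred).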

/**
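   * @param conf JobConf for setting up the dataset. Note: this will be put into a Broadcast. Therefore, if you plan to reuse this conf to create multiple RDDs, you need to make sure you do not modify the conf.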
   * @param inputFormatClass Class of the InputFormat
   * @param keyClass Class of the keys
   * @param valueClass Class of the values
   * @param minPartitions Minimum number of Hadoop Splits to generate.
   *
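   * @note Because Hadoop's RecordReader class reuses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop Writable objects, you should first copy them using a map function.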
   */
  def hadoopRDD[K, V](
      conf: JobConf,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to force hdfs-site.xml to be loaded.
    // See SPARK-11227 for details.
    FileSystem.getLocal(conf)

    // Add the necessary security credentials to the JobConf before broadcasting it.
    SparkHadoopUtil.get.addCredentials(conf)
    new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minPartitions)
  }
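
The @note about reused Writable objects is easy to overlook, so here is a minimal sketch of the pitfall in plain Scala, with no Spark or Hadoop on the classpath. `MutableText`, `reusingReader`, and `WritableReuseDemo` are hypothetical names made up for this illustration: `MutableText` stands in for a Hadoop Writable such as `Text`, and `reusingReader` only imitates how a RecordReader hands back the same object for every record.

```scala
// Hypothetical stand-in for a Hadoop Writable such as Text: a single mutable holder.
final class MutableText(var value: String) {
  def set(v: String): Unit = value = v
  override def toString: String = value
}

object WritableReuseDemo {
  // Imitates a RecordReader: every record overwrites and returns the SAME object.
  def reusingReader(lines: Seq[String]): Iterator[MutableText] = {
    val shared = new MutableText("")
    lines.iterator.map { line => shared.set(line); shared }
  }

  // Buffering the records directly (what caching does) keeps N references to one
  // object, so every element shows the last value that was written into it.
  def bufferedWithoutCopy(lines: Seq[String]): List[String] =
    reusingReader(lines).toList.map(_.toString)

  // Copying each record into an immutable String before buffering keeps every value.
  def bufferedWithCopy(lines: Seq[String]): List[String] =
    reusingReader(lines).map(_.toString).toList

  def main(args: Array[String]): Unit = {
    val input = Seq("a", "b", "c")
    println(bufferedWithoutCopy(input)) // List(c, c, c)
    println(bufferedWithCopy(input))    // List(a, b, c)
  }
}
```

With a real RDD of (LongWritable, Text) pairs, the equivalent copy step would be something like rdd.map { case (k, v) => (k.get, v.toString) } placed before any cache, sort, or aggregation.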

newAPIHadoopRDD

Gets an RDD for a given Hadoop file, using an arbitrary new-API InputFormat and extra configuration options to pass to the input format.

/**
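   * @param conf Configuration for setting up the dataset. Note: this will be put into a Broadcast. Therefore, if you plan to reuse this conf to create multiple RDDs, you need to make sure you do not modify the conf.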
   * @param fClass Class of the InputFormat
   * @param kClass Class of the keys
   * @param vClass Class of the values
   *
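   * @note Because Hadoop's RecordReader class reuses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop Writable objects, you should first copy them using a map function.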
   */
  def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
      conf: Configuration = hadoopConfiguration,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V]): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to force hdfs-site.xml to be loaded.
    // See SPARK-11227 for details.
    FileSystem.getLocal(conf)

    // Add the necessary security credentials to the JobConf. Required to access secure HDFS.
    val jconf = new JobConf(conf)
    SparkHadoopUtil.get.addCredentials(jconf)
    new NewHadoopRDD(this, fClass, kClass, vClass, jconf)
  }
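
Since the conf parameter must not be modified once it has been handed over, one defensive pattern is to build a fresh Configuration for every RDD. The sketch below assumes Spark and the Hadoop client libraries are on the classpath; `PerInputConf` and `textLines` are hypothetical names for this illustration, not part of Spark.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PerInputConf {
  // Hypothetical helper: each call gets its own Configuration, so setting the
  // input path for one RDD never touches the conf used by another.
  def textLines(sc: SparkContext, path: String): RDD[String] = {
    // Copy constructor: start from the context's Hadoop settings without mutating them.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set(FileInputFormat.INPUT_DIR, path)

    sc.newAPIHadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      // Copy each reused Text into an immutable String before anything buffers it.
      .map { case (_, line) => line.toString }
  }
}
```

Calling PerInputConf.textLines(sc, somePath) twice with different paths then yields two independent RDDs, each backed by its own Configuration.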

Usage example

package com.zhangpengfei.spark.demo

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object MyHadoopRDD {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName(MyHadoopRDD.getClass.getSimpleName)
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Old API (org.apache.hadoop.mapred): the input path is set on a JobConf.
    val jobConf = new JobConf
    FileInputFormat.setInputPaths(jobConf, new Path("hdfs://192.168.78.135:9000/user/hive/warehouse/testhivedrivertable/a.txt"))
    val hadoopRDD = sc.hadoopRDD(jobConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
    println("hadoopRDD is Running.............")
    // foreach needs a function; each element is a (byte offset, line) pair.
    hadoopRDD.foreach(kv => println("hadoopRDD result: " + kv))

    // New API (org.apache.hadoop.mapreduce): the input path is set on a Configuration.
    // The data file is an input path, not a configuration resource, so addResource is not called.
    val configuration = new Configuration
    configuration.set(org.apache.hadoop.mapreduce.lib.input.FileInputFormat.INPUT_DIR, "hdfs://192.168.78.135:9000/user/hive/warehouse/testhivedrivertable/a.txt")
    val newAPIHadoopRDD = sc.newAPIHadoopRDD(configuration, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat], classOf[LongWritable], classOf[Text])
    println("newAPIHadoopRDD is Running............")
    newAPIHadoopRDD.foreach(kv => println("newAPIHadoopRDD result: " + kv))

    sc.stop()
  }
}
