Understanding Spark's RDD

This post takes a closer look at Spark's Resilient Distributed Datasets (RDDs). An RDD is Spark's core abstraction: immutable, partitioned, and fault-tolerant, created through parallel transformations such as map and filter. Its storage level is flexible and can be memory, disk, or a combination of the two. An RDD defines its data slices and its per-slice computation through the getPartitions and compute methods. Transformations are lazy; nothing is computed until an action triggers a job. RDDs can be created from a Hadoop file system or derived from a parent RDD, and when computed they read their data much like Hadoop MapReduce does.
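
As a minimal sketch (assuming an already-created SparkContext named sc), this shows how transformations stay lazy until an action runs, and how a storage level can be chosen explicitly:

  import org.apache.spark.storage.StorageLevel

  val nums  = sc.parallelize(1 to 10, numSlices = 2)  // build an RDD with 2 partitions
  val evens = nums.filter(_ % 2 == 0)                 // lazy transformation, nothing runs yet
  evens.persist(StorageLevel.MEMORY_AND_DISK)         // keep in memory, spill to disk if needed
  println(evens.count())                              // action: triggers the job, prints 5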

RDD is an abstract class that defines methods such as map() and reduce(), but in practice a subclass that extends RDD generally only needs to implement two methods:

  • def getPartitions: Array[Partition]
  • def compute(split: Partition, context: TaskContext): Iterator[T]

getPartitions() tells Spark how the input is split into partitions;

compute() emits every "row" of a given Partition ("row" is my own loose term; strictly speaking it is the unit of data that the function processes).
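
As a quick, hypothetical sketch of what those two methods look like in a user-defined RDD (this is illustrative and not Spark source code), consider an RDD that simply generates a range of numbers:

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical partition class: records which sub-range this slice covers.
  class RangePartition(override val index: Int, val start: Long, val end: Long)
    extends Partition

  // A minimal custom RDD over the numbers [0, total).
  class RangeRDD(sc: SparkContext, total: Long, numSlices: Int)
    extends RDD[Long](sc, Nil) {

    // Tell Spark how the input is split into partitions.
    override def getPartitions: Array[Partition] =
      Array.tabulate[Partition](numSlices) { i =>
        new RangePartition(i, i * total / numSlices, (i + 1) * total / numSlices)
      }

    // Produce every element of one partition.
    override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
      val p = split.asInstanceOf[RangePartition]
      (p.start until p.end).iterator
    }
  }

getPartitions() describes the slices once on the driver, while compute() runs on the executors, once per partition.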

Let's take HadoopRDD, which reads an HDFS file, as an example:

  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }

It simply wraps each InputSplit produced by the InputFormat into a HadoopPartition. Now look at compute():

  override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
    val iter = new NextIterator[(K, V)] {

      val split = theSplit.asInstanceOf[HadoopPartition]
      logInfo("Input split: " + split.inputSplit)
      var reader: RecordReader[K, V] = null
      val jobConf = getJobConf()
      val inputFormat = getInputFormat(jobConf)
      HadoopRDD.addLocalConfiguration(new SimpleDateFormat("yyyyMMddHHmm").format(createTime),
        context.stageId, theSplit.index, context.attemptId.toInt, jobConf)
      reader = inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)

      // Register an on-task-completion callback to close the input stream.
      context.addTaskCompletionListener{ context => closeIfNeeded() }
      val key: K = reader.createKey()
      val value: V = reader.createValue()

      // Set the task input metrics.
      val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
      try {
        /* bytesRead may not exactly equal the bytes read by a task: split boundaries aren't
         * always at record boundaries, so tasks may need to read into other splits to complete
         * a record. */
        inputMetrics.bytesRead = split.inputSplit.value.getLength()
      } catch {
        case e: java.io.IOException =>
          logWarning("Unable to get input size to set InputMetrics for task", e)
      }
      context.taskMetrics.inputMetrics = Some(inputMetrics)

      override def getNext() = {
        try {
          finished = !reader.next(key, value)
        } catch {
          case eof: EOFException =>
            finished = true
        }
        (key, value)
      }

      override def close() {
        try {
          reader.close()
        } catch {
          case e: Exception => logWarning("Exception in RecordReader.close()", e)
        }
      }
    }
    new InterruptibleIterator[(K, V)](context, iter)
  }

It drives the RecordReader to return a sequence of (K, V) key-value pairs, wrapped in an InterruptibleIterator so the task can be interrupted cleanly.
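
For reference, here is a hedged sketch of how this RDD is created from user code (the HDFS path is made up): sc.textFile() builds a HadoopRDD under the hood and keeps only the values, while sc.hadoopFile() exposes the (K, V) pairs directly:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  // Each InputSplit produced by TextInputFormat becomes one HadoopPartition.
  val kv = sc.hadoopFile[LongWritable, Text, TextInputFormat](
    "hdfs://namenode:8020/data/input.txt", minPartitions = 4)
  val lines = kv.map { case (_, text) => text.toString }  // copy out of the reused Writable
  println(lines.partitions.length)                        // number of input splits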

Now let's look at JdbcRDD, which reads from a database:

  override def getPartitions: Array[Partition] = {
    // bounds are inclusive, hence the + 1 here and - 1 on end
    val length = 1 + upperBound - lowerBound
    (0 until numPartitions).map(i => {
      val start = lowerBound + ((i * length) / numPartitions).toLong
      val end = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
      new JdbcPartition(i, start, end)
    }).toArray
  }

It simply splits the [lowerBound, upperBound] key range into numPartitions inclusive sub-ranges. For example, lowerBound = 1, upperBound = 100 and numPartitions = 3 yield the ranges [1, 33], [34, 66], and [67, 100]. Most of these parameters come from the constructor:

class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
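
As a hedged usage sketch (the JDBC URL, table, and column names are made up for illustration), a JdbcRDD could be built like this; note that the SQL must contain exactly two '?' placeholders, which compute() binds to each partition's lower and upper bounds:

  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD

  val users = new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/test", "user", "pass"),
    "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
    lowerBound = 1,
    upperBound = 1000000,
    numPartitions = 10,
    mapRow = rs => (rs.getLong("id"), rs.getString("name"))
  )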

Now look at its compute() method:

  override def compute(thePart: Partition, context: TaskContext) = new NextIterator[T] {
    context.addTaskCompletionListener{ context => closeIfNeeded() }
    val part = thePart.asInstanceOf[JdbcPartition]
    val conn = getConnection()
    val stmt = conn.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)

    // setFetchSize(Integer.MIN_VALUE) is a mysql driver specific way to force streaming results,
    // rather than pulling entire resultset into memory.
    // see http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
    if (conn.getMetaData.getURL.matches("jdbc:mysql:.*")) {
      stmt.setFetchSize(Integer.MIN_VALUE)
      logInfo("statement fetch size set to: " + stmt.getFetchSize + " to force MySQL streaming ")
    }

    stmt.setLong(1, part.lower)
    stmt.setLong(2, part.upper)
    val rs = stmt.executeQuery()

    override def getNext: T = {
      if (rs.next()) {
        mapRow(rs)
      } else {
        finished = true
        null.asInstanceOf[T]
      }
    }

    override def close() {
      try {
        if (null != rs && ! rs.isClosed()) {
          rs.close()
        }
      } catch {
        case e: Exception => logWarning("Exception closing resultset", e)
      }
      try {
        if (null != stmt && ! stmt.isClosed()) {
          stmt.close()
        }
      } catch {
        case e: Exception => logWarning("Exception closing statement", e)
      }
      try {
        if (null != conn && ! conn.isClosed()) {
          conn.close()
        }
        logInfo("closed connection")
      } catch {
        case e: Exception => logWarning("Exception closing connection", e)
      }
    }
  }

Like HadoopRDD's compute(), it simply streams rows back one at a time, mapping each through mapRow, and closes the ResultSet, Statement, and Connection when the task completes.