02 Spark RDD

Direct Known Subclasses:

BaseRRDD, CoGroupedRDD, EdgeRDD, HadoopRDD, JdbcRDD, NewHadoopRDD, PartitionPruningRDD, ShuffledRDD, UnionRDD, VertexRDD

public abstract class RDD<T>
extends java.lang.Object
implements scala.Serializable

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
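As a minimal sketch of that implicit-conversion mechanism (the PairOpsSketch object and its names are hypothetical, not from the original example): calling a key-value operation such as reduceByKey on a plain RDD[(String, Int)] compiles because the implicit conversion RDD.rddToPairRDDFunctions wraps it in PairRDDFunctions.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object PairOpsSketch {

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair_ops_sketch").setMaster("local[*]"))

    // RDD itself defines no reduceByKey/groupByKey; the implicit conversion
    // RDD.rddToPairRDDFunctions adds them to any RDD of key-value pairs
    val pairs: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 2), ("b", 3)))
    pairs.reduceByKey(_ + _).collect().foreach(println) // prints (a,3) and (b,3)

    sc.stop()
  }
}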

Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions. Please refer to the Spark paper for more details on RDD internals.
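To make the five properties concrete, below is a minimal custom-RDD sketch (RangeRDD is a hypothetical class, not part of Spark): it overrides only getPartitions and compute, passes Nil as the dependency list, and leaves the optional partitioner and preferred locations at their default implementations.

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}

// Hypothetical custom RDD: each partition computes its own slice of 0 until n
// (assumes numSlices divides n evenly, for brevity)
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  // property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => new Partition { override def index: Int = i }).toArray

  // property 2: a function for computing each split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val step = n / numSlices
    (split.index * step until (split.index + 1) * step).iterator
  }

  // properties 3-5 (dependencies, partitioner, preferred locations) keep their
  // defaults; the Nil passed to the constructor declares no parent dependencies
}

With this in place, new RangeRDD(sparkContext, 12, 3).collect() would run three tasks, one per partition.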



https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/rdd/RDD.html

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @author YaPeng Li
  * @version 0.0.1
  */
object CaseRdd {

  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("case_rdd").setMaster("local[*]")

    val sparkContext = new SparkContext(conf)

    val value: RDD[Int] = sparkContext.makeRDD(List(1, 2, 3, 4))

    println("log info: create rdd by makeRDD function: ")
    value.collect().foreach(println)

    println("log info: rdd transformation by map *2 function: ")
    val value1: RDD[Int] = value.map(num => {
      num * 2
    })
    value1.collect().foreach(println)

    println("log info: rdd transformation by map num+ function: ")
    val value2: RDD[String] = value.map(num => {
      "num" + num
    })
    value2.collect().foreach(println)

    println("log info: rdd transformation by mapPartitions num+ function: ")
    val value3: RDD[Int] = value.mapPartitions(num => {
      num.filter(_ == 2)
    })
    value3.collect().foreach(println)


    println("log info: rdd transformation by flatMap num+ function: ")
    val listNum: RDD[List[Int]] = sparkContext.makeRDD(List(List(1, 2), List(3, 4)))

    val value5: RDD[Int] = listNum.flatMap(list => list)

    value5.collect().foreach(println)

    println("log info: rdd transformation by glom num+ function: ")
    var rddGlom = sparkContext.makeRDD(1 to 10, 3)

    println(rddGlom.glom().collect())

    println("log info: rdd transformation by groupBy num+ function: ")
    val dataRDD = sparkContext.makeRDD(List(1, 2, 3, 4), 1)
    val value6 = dataRDD.groupBy(
      _ % 2
    )
    value6.collect().foreach(print)

    println("log info: rdd transformation by leftOuterJoin on pair RDDs: ")
    val dataRDD1 = sparkContext.makeRDD(List(("a", 1), ("b", 2), ("c", 3)))
    val dataRDD2 = sparkContext.makeRDD(List(("a", 1), ("b", 2), ("c", 3)))
    val rdd: RDD[(String, (Int, Option[Int]))] = dataRDD1.leftOuterJoin(dataRDD2)
    rdd.collect().foreach(println)

    // keep only the left-hand value of each joined pair
    val value7: RDD[Int] = rdd.map(t => t._2._1)
    value7.collect().foreach(println)

    sparkContext.stop()
  }
}
