(*) Resilient Distributed Dataset (RDD)
(*) The basic data abstraction in Spark
(*) The RDD concept, as described in the source code:
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
A set of partitions: the data is split into partitions, and each partition may be processed on a different worker.
* - A function for computing each split
A function that computes the data in each partition.
RDD functions (operators):
(1) Transformation (lazily evaluated)
(2) Action (triggers computation)
* - A list of dependencies on other RDDs
RDDs depend on one another: (1) narrow dependencies (2) wide dependencies
A job is divided into stages according to these dependencies.
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
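The properties listed above can be inspected directly on an RDD. The following is a minimal sketch, assuming a local master (`local[2]`); the object name `RddPropertiesSketch`, the helper `inspect`, and the sample data are made up for illustration. It checks the partition count, whether `map` yields a narrow dependency, whether `reduceByKey` yields a shuffle (wide) dependency, and whether the shuffled RDD carries a partitioner.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object RddPropertiesSketch {
  // Returns (partition count, map dependency is narrow,
  //          reduceByKey dependency is a shuffle, partitioner is defined)
  def inspect(sc: SparkContext): (Int, Boolean, Boolean, Boolean) = {
    val words  = sc.parallelize(Seq("a", "b", "a", "c"), 2) // 2 partitions requested
    val pairs  = words.map(w => (w, 1))                      // Transformation, lazy
    val counts = pairs.reduceByKey(_ + _)                    // Transformation that needs a shuffle

    val narrow = pairs.dependencies.head.isInstanceOf[NarrowDependency[_]]
    val wide   = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
    (words.partitions.length, narrow, wide, counts.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, no cluster required
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-properties"))
    try println(inspect(sc)) finally sc.stop()
  }
}
```

Nothing here runs a job: `partitions`, `dependencies` and `partitioner` describe the RDD, so no Action is needed to read them.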
How do you create an RDD? There are two ways:
(1) Use the sc.parallelize method
val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8),3)
(2) Create the RDD from an external data source, e.g. HDFS
val rdd2 = sc.textFile("hdfs://bigdata11:9000/input/data.txt")
val rdd2 = sc.textFile("/root/temp/input/data.txt")
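Both ways of creating an RDD can be tried without a cluster. This is a rough sketch under stated assumptions: a local master (`local[2]`) and a temporary local file standing in for the HDFS path used in these notes. The object name `CreateRddSketch` and the helper `run` are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddSketch {
  // Returns (partitions of rdd1, sum of rdd1, number of lines in rdd2)
  def run(sc: SparkContext): (Int, Int, Long) = {
    // (1) From an in-memory collection, split into 3 partitions
    val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 3)

    // (2) From an external source; a temp file replaces HDFS in this sketch
    val path = Files.createTempFile("rdd-demo", ".txt")
    Files.write(path, "line one\nline two\nline three\n".getBytes("UTF-8"))
    val rdd2 = sc.textFile(path.toUri.toString) // file:// URI, one element per line

    (rdd1.getNumPartitions, rdd1.reduce(_ + _), rdd2.count())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("create-rdd"))
    try println(run(sc)) finally sc.stop()
  }
}
```

The second argument of `parallelize` is the number of partitions; `textFile` chooses its partitioning from the input splits unless a minimum is passed.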
4. RDD caching: in memory by default
(*) Improves efficiency
(*) Default: the data is cached in memory
(*) How to use it: call the persist or cache method
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
def cache(): this.type = persist()
(*) Where the data is cached: defined by StorageLevel
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
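In the Spark source, the constructor arguments above are, in order, useDisk, useMemory, useOffHeap, deserialized and an optional replication factor (default 1). The constructor itself is private, but the flags can be read back from the predefined levels. A small sketch; the object name `StorageLevelSketch` and the helper `describe` are made up for illustration.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Renders the five settings of a storage level as one line of text
  def describe(level: StorageLevel): String =
    s"disk=${level.useDisk} memory=${level.useMemory} offHeap=${level.useOffHeap} " +
      s"deserialized=${level.deserialized} replication=${level.replication}"

  def main(args: Array[String]): Unit = {
    println(describe(StorageLevel.MEMORY_ONLY))
    println(describe(StorageLevel.MEMORY_AND_DISK_SER_2))
  }
}
```

Reading the names this way: the `_SER` suffix means the data is stored serialized (deserialized = false), and the `_2` suffix means each partition is replicated on two nodes.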
(*) Example:
Test data: the sales (orders) table from an Oracle database (about 920,000 records)
Steps
(1) Read the data
val rdd1 = sc.textFile("hdfs://bigdata11:9000/input/sales")
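(2) Compute
rdd1.count ---> Action; nothing is cached on this run
rdd1.cache ---> marks the RDD for caching but does not trigger computation; like a Transformation, cache is lazy
rdd1.count ---> triggers computation and caches the RDD's data
rdd1.count ---> where does the data come from this time? From the cache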
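The same sequence can be followed through the RDD's storage level. This is a minimal sketch, assuming a local master (`local[2]`) and generated numbers in place of the sales data; `CacheSketch` and `run` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Returns (level before cache, level after cache, element count, level after unpersist)
  def run(sc: SparkContext): (StorageLevel, StorageLevel, Long, StorageLevel) = {
    val rdd = sc.parallelize(1 to 100000, 4).map(_ * 2)

    val before = rdd.getStorageLevel // nothing requested yet
    rdd.cache()                      // only marks the RDD; no job runs here
    val marked = rdd.getStorageLevel // the level requested by cache()

    val n = rdd.count()              // first Action: computes and stores the partitions
    rdd.count()                      // second Action: can read the cached partitions

    rdd.unpersist()                  // release the cached blocks
    (before, marked, n, rdd.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cache-sketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

To use a level other than the default, call `persist(StorageLevel.MEMORY_AND_DISK)` (or another level) in place of `cache()`.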
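5. RDD fault tolerance: checkpoints. Two types: (1) local directory (2) HDFS directory
(1) Review of checkpoints: in HDFS, a checkpoint merges the metadata (the edits log into the fsimage)
In Oracle, a checkpoint wakes the database writer process (DBWn) at the highest priority to write dirty data from memory ---> the data files
(2) RDD checkpoints: a fault-tolerance mechanism that supplements the lineage ---> the record of the whole computation
The longer the lineage, the more likely a failure becomes. Once a checkpoint exists, a failed computation restarts from that checkpoint instead of from the beginning.
Two types: (1) Local directory: spark-shell must be running in local mode
(2) HDFS directory: required when spark-shell is running in cluster mode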
scala> sc.setCheckpointDir("hdfs://bigdata11:9000/spark/checkpoint")
scala> val rdd1 = sc.textFile("hdfs://bigdata11:9000/input/sales")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://bigdata11:9000/input/sales MapPartitionsRDD[41] at textFile at <console>:24
scala> rdd1.checkpoint
scala> rdd1.count
Explanation from the source code:
/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
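Following the recommendation in that comment, the RDD can be persisted before it is checkpointed, so that writing the checkpoint does not recompute it. A minimal sketch, assuming local mode (`local[2]`) and a temporary local directory as the checkpoint directory; `CheckpointSketch` and `run` are hypothetical names.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Returns (checkpointed before the Action, checkpointed after it, checkpoint file is known)
  def run(sc: SparkContext): (Boolean, Boolean, Boolean) = {
    // Local mode, so a local directory is enough for this sketch
    val dir = Files.createTempDirectory("spark-checkpoint").toUri.toString
    sc.setCheckpointDir(dir)

    val rdd = sc.parallelize(1 to 1000, 4).map(_ + 1)
    rdd.cache()       // keep the data in memory so the checkpoint write can reuse it
    rdd.checkpoint()  // only marks the RDD; must come before any job on this RDD

    val before = rdd.isCheckpointed
    rdd.count()       // the Action runs the job, then the checkpoint is written
    (before, rdd.isCheckpointed, rdd.getCheckpointFile.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Like cache, checkpoint is lazy: the files appear under the checkpoint directory only after the first Action on the RDD.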