Spark RDD
RDD: properties of the Resilient Distributed Dataset
- An RDD is made up of partitions; each partition is processed on a different Worker, which is how the computation is distributed (A list of partitions)
- An RDD provides operators that compute the data of each partition (A function for computing each split)
- RDDs depend on each other through wide and narrow dependencies (A list of dependencies on other RDDs)
- A custom partitioning rule can be supplied for key-value RDDs (Optionally, a Partitioner for key-value RDDs (the RDD is hash-partitioned))
- Computation is preferably scheduled on nodes close to the data (Optionally, a list of preferred locations to compute each split on (block locations for an HDFS file))
An RDD can be marked for caching with persist() or cache(). Caching is lazy: the data is materialized by the first action that runs, so the speedup only shows up from the second execution onward (see the example after the list of storage levels below).
Available storage levels (StorageLevel):
- NONE
- DISK_ONLY
- DISK_ONLY_2
- MEMORY_ONLY
- MEMORY_ONLY_2
- MEMORY_ONLY_SER
- MEMORY_ONLY_SER_2
- MEMORY_AND_DISK
- MEMORY_AND_DISK_2
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK_SER_2
- OFF_HEAP
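A minimal caching sketch for a spark-shell session (sc is the SparkContext provided by the shell; the HDFS path is the same placeholder used in the checkpoint example below):
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // only marks the RDD; nothing is cached yet
rdd.count                                   // first action: reads from HDFS and fills the cache
rdd.count                                   // second action: served from the cache, noticeably faster
rdd.unpersist()                             // drop the cached data when it is no longer needed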
RDD checkpoint types
- based on a local directory
- based on an HDFS directory
//set the checkpoint directory first (use an HDFS directory on a cluster)
sc.setCheckpointDir("hdfs://192.168.138.130:9000/checkpoint")
//create the RDD that should be checkpointed
val rdd1 = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
//mark the RDD for checkpointing; it is actually written when the next action runs
rdd1.checkpoint
RDD dependencies (wide and narrow dependencies)
Wide dependency: several partitions of the child RDD depend on the same partition of the parent RDD (this requires a shuffle)
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD
Wide dependencies (shuffles) are where the DAG is split into stages, as the small example below shows
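A small spark-shell illustration (the data is made up): flatMap and map create narrow dependencies, reduceByKey creates a wide dependency, and toDebugString shows the resulting stage boundary:
val words  = sc.parallelize(List("a b", "a c"))
val pairs  = words.flatMap(_.split(" ")).map(w => (w, 1))   // narrow dependencies: no shuffle
val counts = pairs.reduceByKey(_ + _)                       // wide dependency: shuffle, new stage
counts.toDebugString                                        // the ShuffledRDD entry marks the stage boundary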
Creating RDDs
(1)Create from an in-memory collection with SparkContext.parallelize
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),3)
(2)Create from an external data source
val rdd = sc.textFile("/root/spark_WordCount.text")
RDD operators
I. Transformation
(1)map(func) applies func to every element (like a for loop) and returns a new RDD
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.map(_*2)
rdd1.collect
(2)filter(func) keeps only the elements for which func returns true
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.filter(_>2)
rdd1.collect
(3)flatMap(func) map followed by flatten: each input element can produce zero or more output elements
val rdd = sc.parallelize(Array("a b c", "d e f", "g h i"))
val rdd1 = rdd.flatMap(_.split(" "))
rdd1.collect
(4)mapPartitions(func) operates on the RDD one whole partition at a time (see the example below)
(5)mapPartitionsWithIndex(func) like mapPartitions, but the partition index is also passed in
(6)sample(withReplacement, fraction, seed) samples a fraction of the data (see the example below)
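Small spark-shell examples for (4) and (6); sample sizes vary because fraction is only an expected proportion:
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
rdd.mapPartitions(iter => Iterator(iter.sum)).collect   // one sum per partition: Array(6, 15)
rdd.sample(false, 0.5).collect        // without replacement; size varies around half the data
rdd.sample(true, 0.5, 42).collect     // with replacement, fixed seed for reproducibility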
(7)union(otherDataset) set operation: union of two RDDs
val rdd = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8))
val rdd2 = rdd.union(rdd1)
rdd2.collect
rdd2.distinct.collect //deduplicate
val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
rdd2.collect
(8)intersection(otherDataset) set operation: intersection of two RDDs (see the example below)
(9)distinct([numTasks]) removes duplicate elements
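A short spark-shell example of both operators (sample data is made up):
val rdd  = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(3,4,5,6,7))
rdd.intersection(rdd1).collect                      // Array(3, 4, 5), order not guaranteed
sc.parallelize(List(1,1,2,2,3)).distinct.collect    // Array(1, 2, 3), order not guaranteed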
(10)groupByKey([numTasks]) groups the values of a key-value RDD by key
val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.groupByKey
rdd3.collect
(11)reduceByKey(func, [numTasks]) aggregates the values of each key with func
val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.reduceByKey(_+_)
rdd3.collect
(12)aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) per-key aggregation with separate within-partition and cross-partition functions (see the advanced operators section below)
(13)sortByKey([ascending], [numTasks]) sorts a key-value RDD by key (see the example below)
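A short sortByKey example in spark-shell (the key names are just sample data):
val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000), ("Arrow",1500)))
rdd.sortByKey().collect        // ascending by key
rdd.sortByKey(false).collect   // descending by key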
(14)sortBy(func, [ascending], [numTasks]) sorts by the value computed by func
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.sortBy(x => x,true)
rdd1.collect
(15)join(otherDataset, [numTasks]) joins two key-value RDDs by key (see the example below)
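A minimal join example in spark-shell (sample data is made up):
val rdd  = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny","A"), ("Freedom","B")))
rdd.join(rdd1).collect   // Array((Destiny,(1000,A)), (Freedom,(2000,B))), order not guaranteed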
(16)cogroup(otherDataset, [numTasks]) for each key, groups the values from both RDDs
val rdd = sc.parallelize(List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.cogroup(rdd1)
rdd2.collect
(17)cartesian(otherDataset) Cartesian product of two RDDs
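A tiny cartesian example in spark-shell (sample data is made up):
val rdd  = sc.parallelize(List(1, 2))
val rdd1 = sc.parallelize(List("a", "b"))
rdd.cartesian(rdd1).collect   // Array((1,a), (1,b), (2,a), (2,b))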
(18)pipe(command, [envVars]) pipes each partition through an external shell command
(19)coalesce(numPartitions) reduces the number of partitions (no shuffle by default)
(20)repartition(numPartitions) changes the number of partitions, always with a shuffle
(21)repartitionAndSortWithinPartitions(partitioner) repartitions with the given partitioner and sorts records within each partition
II. Action
(1)reduce(func) aggregates all elements with func and returns the result to the driver
val rdd = sc.parallelize(Array(1,2,3,4,5,6))
rdd.reduce(_+_)   // 21 (reduce is an action, so the result is a value, not an RDD to collect)
(2)collect() returns all elements to the driver as an array
(3)count() number of elements in the RDD
(4)first() first element
(5)take(n) first n elements
(6)takeSample(withReplacement, num, [seed]) random sample of num elements
(7)takeOrdered(n, [ordering]) first n elements in sorted order
(8)saveAsTextFile(path) writes the RDD as text files
(9)saveAsSequenceFile(path) writes a key-value RDD as a Hadoop SequenceFile
(10)saveAsObjectFile(path) writes the RDD as serialized Java objects
(11)countByKey() number of elements per key, returned as a Map
(12)foreach(func) like map, but runs func only for its side effects and returns nothing (see the examples below)
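A few of these actions in a spark-shell session (the output path is a placeholder):
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
rdd.count                 // 6
rdd.first                 // 5
rdd.take(3)               // Array(5, 6, 1)
rdd.takeOrdered(3)        // Array(1, 2, 3)
rdd.foreach(println)      // runs on the executors; output goes to executor stdout
rdd.saveAsTextFile("/tmp/rdd_output")                        // placeholder path; one part file per partition
sc.parallelize(List(("a",1), ("a",2), ("b",3))).countByKey   // Map(a -> 2, b -> 1)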
Advanced RDD operators
(1)mapPartitionsWithIndex operates on each partition of the RDD; the partition number is passed in as index
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
def func(index: Int, iter: Iterator[Int]): Iterator[String] = {
iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect
(2)aggregate
Aggregation: first aggregates locally within each partition, then aggregates the per-partition results globally
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
import scala.math.max
rdd.aggregate(0)(max(_,_), _+_)   // 9: max of each partition (3 and 6), then summed
rdd.aggregate(0)(_+_, _+_)        // 21
rdd.aggregate(10)(_+_, _+_)       // 51: the zero value 10 is applied once per partition and once more in the final combine
val rdd = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd.aggregate("*")(_+_, _+_)   // "**abc*def" or "**def*abc": which partition result comes first is not deterministic
(3)aggregateByKey
Like aggregate, but for key-value data: values are aggregated per key, first within each partition and then across partitions
val rdd = sc.parallelize(List(("cat",2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)),2)
def func(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect
import scala.math._
rdd.aggregateByKey(0)(math.max(_,_),_+_).collect
rdd.aggregateByKey(0)(_+_,_+_).collect
(4)coalesce
By default coalesce does not shuffle (the shuffle flag is false); pass true to allow a shuffle, which is required to increase the number of partitions
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.coalesce(3,true)
rdd1.partitions.length
(5)repartition
Always shuffles; equivalent to coalesce(numPartitions, shuffle = true)
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.repartition(3)
rdd1.partitions.length
(6)Other advanced operators
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html