sparkRDD总结

--------[pair]表示一个元组 ;如("ty",12)   With必须添加分区的类型------------------------------------------
aggregate                                    :聚合每个分区的值。每个分区中的聚合变量都是用零值初始化的。
aggregateByKey [Pair]                 :将相同的key进行聚合
cartesian                                      :笛卡尔积
checkpoint                                   :检查点
coalesce, repartition                    :重新分区,其中repartition调用coalesce方法
cogroup [pair], groupWith [Pair]   :分组压缩compact, groupWith [Pair]封装了cogroup方法
collect, toArray                             :聚集操作(单个元素)如list或Array集合展示
collectAsMap [pair]                      :聚集操作(两个元组做为一组,map的键值)map--(key,value)
combineByKey [pair]                    :收集操作(key相同的元组放在一起)
compute                                      :用户不应直接调用此函数--------------
context, sparkContext                 :获取上下文,获取SparkContext的实例变量
count                                           :得到Array、List中元素的个数。和list.length功能一样
countApprox                                :标记为实验特征!实验特性目前不包括在本文档中!…………………………
countApproxDistinct                     :计算不同值(去重后)的近似数目--【没有用过,不知道干嘛用】
countApproxDistinctByKey [pair]:计算(key,value)不同值的近似数目
countByKey [pair]                            :统计key相同的个数,并以(key,count)的形式返回
countByKeyApprox [pair]               :countByKey的近似版本,统计每个key出现的个数,并以(key,近似count)的形式返回
countByValue                                 :用于List中;统计list中相同元素出现的个数,同时支持list中的元组
countByValueApprox                      :countByValue的近似版本,统计list中相同元素出现的近似个数
dependencies                                 :返回RDD的依赖,使用b.dependencies.length获取依赖的个数
distinct                                            :List中,去除重复的元素
first                                                 :获取list中的第一个元素
filter                                                :过滤集合中的元素,map同样可以来做过滤,但是不成立时会返回()--Unit对象
filterByRange [Ordered]                 :添加一个范围,而这个返回中的元素必须能调用compareTo方法
filterWith                                         :这里的With指定分区索引转换后的类型,该类型会在后面的括号中使用。
                                                             x=> x表示把分区索引原样传入,因此b的类型与x=> x的返回值类型相同。
                                                             这里的b是分区索引号,a有5个分区(0,1,2,3,4),则b的取值为0,1,2,3,4(来自于后面的案例)
flatMap                                           :将多个集合压平,放到一起
flatMapValues [Pair]                      :(key,value)的键值对压缩
flatMapWith                                   :每个分区压缩时,将传入分区的类型
fold                                                :合并每个分区的值。每个分区中的聚合变量都是用零值初始化的。
foldByKey [Pair]                             :(key,value)的分区聚合
foreach                                           :为每个数据执行一次操纵
foreachPartition                             :每个分区各自执行foreach操作
foreachWith                                   :迭代每个数据项,并指定每个分区索引的类型,默认为Int,但是可以变成String,tuple等,
                                                             则在with后面的参数中来操作
fullOuterJoin [Pair]                         :两个(key,value)的数据进行全外连接聚合---返回带(k,(Option,Option))的元组。
                                                                    如(book,(Some(4),None)), (mouse,(None,Some(4)))
generator, setGenerator                 :打印依赖图时,允许设置一个字符串并绑定到RDD的名字后面,设置rd的别名
getCheckpointFile                           :返回检查点文件的路径,如果RDD还没有被checkpoint,则返回None。
preferredLocations                          :返回首选主机
getStorageLevel                             :检索RDD当前设置的存储级别
glom                                                :组装一个包含分区所有元素的数组 类似于----视图
groupBy                                          :按照标签进行分组//even和odd代表两个组,如果表达式为true,则放入even,否则放入odd   中
                                                               a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
groupByKey [Pair]                           :(k,v)来进行分组(4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)
histogram [Double]                          :并创建一个直方图,
id                                                      :获取RDD的ID 检索由其设备上下文分配给RDD的ID。
intersection                                       :两个RDD的交集(共同元素)
isCheckpointed                               :是否已创建检查点
iterator                                           :返回一个通用的迭代器
join [pair]                                       :(k,v)值相同的进行一个连接返回一个新的元组(k,(v1,v2))
keyBy                                           :将List中的每个元素转成(k,v);其中k由用户来设计如a.keyBy(_.length)
keys [pair]                                     :将list中的所有(k,v)中的k取出来
leftOuterJoin [pair]                         :(k,v)进行左外连接
lookup [pair]                                  :根据k在list中(k,v)中查找元素
map                                               :处理每行数据
mapPartitions                                :将每个分区执行map的处理
mapPartitionsWithContext             :自定义一个方法,然后就可以获取执行时的一些信息
mapPartitionsWithIndex                 :打印每个分区的内容,需要自己写func
mapPartitionsWithSplit                   :这个方法在API中被标记为已废弃...........................
mapValues [pair]                             :将输入进来的tuple(k,v)直接操作v的结果
mapWith                                          :(a => new scala.util.Random)指定为每个分区构造的对象r的类型,该构造在每个分区中只执行一次。
                                                                 (x,r)----x代表每个分区中的数据项,r是为该分区构造出来的对象(如这里的Random)。
                                                                 With操作都要先指定一个在后面操作中使用的实例对象。比如这里是一个随机生成的Random对象,也可以是一个字符串。
max                                                  :得到list中最大的值
mean [Double], meanApprox [Double]   :得到整个list的平均值
min                                                   :得到最小值
name, setName                                :允许使用自定义名称标记RDD。
partitionBy [Pair]                              :使用指定的Partitioner按key对(k,v)RDD重新分区
partitioner                                         :返回RDD所使用的分区器,类型为Option[Partitioner]
partitions                                          :获取partitions的信息。如得到分区数量r.partitions.length
persist, cache                                  :设置存储级别
pipe                                                  :获取每个分区的RDD数据,并通过stdin将其发送到shell命令。
randomSplit                                     :随机切分----机器学习
reduce                                             :按指定函数聚合所有元素(如求和)
reduceByKey [Pair], reduceByKeyLocally[Pair],:根据(k,v)按key聚合,key相同的value合并为一个值,得到(k,合并后的v)
reduceByKeyToDriver[Pair]                    :与reduceByKeyLocally相同,结果以Map形式返回到driver
repartition                                       :重置分区数
repartitionAndSortWithinPartitions [Ordered]   :重置分区数并在每个分区内按key排序
rightOuterJoin [Pair]                        :右外连接
sample                                            :随机生成a.sample(false, 0.1, 0).count【没有用过,不知道干嘛用】
sampleByKey [Pair]                         :随机生成【没有用过,不知道干嘛用】
sampleByKeyExact [Pair]                :随机生成【没有用过,不知道干嘛用】
saveAsHadoopFile [Pair]                  :将(k,v)结果保存为hadoop文件
saveAsHadoopDataset [Pair]             :将(k,v)结果保存到hadoop数据集中
saveAsNewAPIHadoopFile [Pair]     :将(k,v)结果保存到新hadoop中
saveAsObjectFile                             :以对象进行保存
saveAsSequenceFile [SeqFile]        :以序列进行保存
saveAsTextFile                                 :以文本文件进行保存
stats [Double]                                   :获取均值、方差和标准差。
sortBy                                               :指定函数进行排序:y.sortBy(c => c, false).collect
sortByKey [Ordered]                         :按照key进行排序c.sortByKey(true).collect//true是升序,false降序
stdev [Double], sampleStdev [Double]         :标准差、样本标准差
subtract                                             :减法
subtractByKey [Pair]                         :(k,v)k相同的做减法
sum [Double], sumApprox[Double]   :求和
take                                                   :获取前n项  take(n)
takeOrdered                                      :进行排序后再调用take(n)
takeSample                                       :获取一些样本-----大数据
treeAggregate                                   :默认情况下,使用的树是depth 2,但是这可以通过depth参数进行更改。
treeReduce                                       :它的工作方式与reduce类似,只是在多级树模式中减少了RDD的元素。
toDebugString                                   :返回一个字符串,该字符串包含关于RDD及其依赖项的调试信息
toJavaRDD                                      :将这个RDD对象嵌入到JavaRDD对象中并返回它。
toLocalIterator                                 :将RDD转换为主节点上的scala迭代器
top                                                   ://排序后取前二个 c.top(2)
toString                                           :转换成一个人类可读的RDD文本描述。
union, ++                                        :合并两个RDD(并集,不去重)
unpersist                                        :取消持久化,将RDD从内存/磁盘中移除
values [Pair]                                   :取出(k,v)中的所有v,返回一个新RDD
variance [Double], sampleVariance [Double]   :方差
zip                                                  :多个list压缩在一起
zipPartitions                                  :类似于zip。但是提供了对压缩过程的更多控制。
zipWithIndex                                 :索引从0开始。如果RDD分布在多个分区中,则会启动spark作业来执行此操作。
zipWithUniqueId                               :为每个元素分配唯一的Long型ID,与zipWithIndex不同,它不需要额外启动spark作业


aggregate
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// 让我们首先用分区标签打印出RDD的内容
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], 
[partID:1, val: 5], [partID:1, val: 6])

z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

//这个示例返回16,因为初始值是5
//分区0的聚合结果为max(5,1,2,3) = 5
//分区1的聚合结果为max(5,4,5,6) = 6
//最终合并分区结果时为 5 + 5 + 6 = 16
//注意,最终合并时也会再用一次初始值
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16


val z = sc.parallelize(List("a","b","c","d","e","f"),2)

// 让我们首先用分区标签打印出RDD的内容
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}

z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], 
[partID:1, val: f])

z.aggregate("")(_ + _, _+_)
res115: String = abcdef

//看看这里初始值“x”是如何应用三次的。
// -每个分区一次
// -在第二个reduce函数合并所有分区结果时再应用一次。
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc

// 下面是一些更高级的例子,有些比较难理解。

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10或者01
执行过程:分区0:""和"12"--->math.min(0,2).toString====="0"
          "0"和"23"--->math.min(1,2).toString====="1"
          分区1:""和"345"--->math.min(0,3).toString====="0"
          "0"和""--->math.min(1,0).toString====="0"
          最后合并:""+"1"+"0"="10"(分区结果返回顺序不同时为"01")


val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res144: String = 11
1、""和12的长度取最小==0--->0.toString---->"0"
2、"0"和23的长度取最小==1---->1.toString----->"1"

3、""和""的长度取最小====0---->0.toString----->"0"
4、"0"和345的长度取最小=====1----->1.toString---->"1"

最后执行:""+"1"+"1"="11"
因此执行是有顺序的

aggregateByKey [Pair]

与聚合函数类似,只是聚合应用于具有相同键的值。与聚合函数不同的是,初始值不应用于第二个reduce。
Listing Variants

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U


val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// 让我们看看分区中有什么
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect

res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)],
 [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
//该操作不会将100应用到第二个reduce上。
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))

cartesian(笛卡尔积)

计算两个RDD之间的笛卡尔积(即第一个RDD的每个项与第二个RDD的每个项连接),并将它们作为新的RDD返回。
警告:使用此功能时要小心。内存消耗很快就会成为一个问题!
Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]


val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10),
(4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

checkpoint

将在下一次计算RDD时创建检查点。检查点RDDs作为二进制文件存储在检查点目录中,可以使用Spark上下文指定该目录。
(警告:Spark采用惰性求值。在调用action之前,检查点不会真正创建。)

重要提示:目录"my_directory_name"应该存在于所有从节点(worker)上。作为一种替代方法,您也可以使用HDFS目录URL。

Listing Variants

def checkpoint()


sc.setCheckpointDir("my_directory_name")
val a = sc.parallelize(1 to 4)
a.checkpoint
a.count
14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15
...
14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB)
14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to
 file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11,
 new parent is RDD 12
res23: Long = 4

coalesce(合并), repartition(重新分区)
repartition(5)其实调用的是coalesce(5,true)  
 def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
将关联的数据合并到给定数量的分区中。repartition(numpartition)是coalesce(numpartition, shuffle = true)的缩写。
Listing Variants

def coalesce ( numPartitions : Int , shuffle : Boolean = false ): RDD [T]
def repartition ( numPartitions : Int ): RDD [T]

val y = sc.parallelize(1 to 10, 10)
//重新分区为2
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2
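
补充一个小示例(示意,沿用上面的y):coalesce在shuffle=false时只能减少分区数,不能增加分区数。
val z2 = y.coalesce(20, false)
z2.partitions.length
//结果仍为10;若要把分区数增加到20,需要coalesce(20, true)或repartition(20)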

cogroup [Pair]---分组压缩compact, groupWith [Pair]
  /** Alias for cogroup. */封装了cogroup方法
  def groupWith[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)])
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope {
    cogroup(other1, other2, other3, defaultPartitioner(self, other1, other2, other3))
  }
一组非常强大的函数,允许使用它们的键将最多3个键值RDDs组合在一起。

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]
def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def groupWith[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], IterableW1], Iterable[W2]))]

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.collect
res13: Array[(Int, String)] = Array((1,b), (2,b), (1,b), (3,b))
c.collect
res14: Array[(Int, String)] = Array((1,c), (2,c), (1,c), (3,c))
b.cogroup(c).collect
res12: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(1,(CompactBuffer(b, b),CompactBuffer(c, c))), //按key分组后,各RDD中该key对应的value各自放入一个CompactBuffer,
                                                //即使value相同,来自不同RDD的数据也不会合并到同一个CompactBuffer中
(3,(CompactBuffer(b),CompactBuffer(c))),
(2,(CompactBuffer(b),CompactBuffer(c))))
----Compact压缩,简洁的意思----
val d = a.map((_, "d"))
//cogroup一次最多可以再组合3个RDD,超过将无法编译。先把各RDD中key相同的value各自放入一个集合,再把这些集合按key放进同一个元组中
b.cogroup(c, d).collect
res0: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
 (1,(CompactBuffer(b, b),CompactBuffer(c, c),CompactBuffer(d, d))),
 (3,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))),
 (2,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d)))
 )


val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect //分组压缩在合并
res4: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
 (4,(CompactBuffer(kiwi),CompactBuffer(iPad))),
 (2,(CompactBuffer(banana),CompactBuffer())),
 (1,(CompactBuffer(apple),CompactBuffer(laptop, desktop))),
 (3,(CompactBuffer(orange),CompactBuffer())),
 (5,(CompactBuffer(),CompactBuffer(computer)))
 )

collect, toArray
/**封装了collect
   * Return an array that contains all of the elements in this RDD.
   */
  @deprecated("use collect", "1.0.0")
  def toArray(): Array[T] = withScope {
    collect()
  }
将RDD转换为Scala数组并返回它。如果您提供了一个标准的map-function(即f = T -> U),它将在将值插入结果数组之前应用。
Listing Variants

def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]


val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
//将各个分区中的数据收集上来
c.collect
res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

collectAsMap [Pair] 

类似于collect,但使用键值RDDs并将其转换为Scala映射以保留其键值结构。
Listing Variants

def collectAsMap(): Map[K, V]

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collectAsMap//以key--value的形式输出,key重复时只保留一个
res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
b.collect//以元组的形式输出,(Int, Int)表示一个元组
res6: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))


combineByKey [Pair]    Combiner---组合器,组合

非常有效的实现,通过一个接一个地应用多个聚合器来组合由两个组件元组组成的RDD的值。
Listing Variants
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
 mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null): RDD[(K, C)]

Example

val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
c.collect
res8: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey), (2,wolf), (2,bear), (2,bee))
    组合过程:createCombiner即List(_),对每个key在分区内第一次出现的value建立列表,如(1,dog)===>List(dog);
    mergeValue把同分区内相同key的后续value加入列表,如cat::List(dog)===>List(cat, dog);
    mergeCombiners把不同分区的列表合并,如List(cat, dog):::List(turkey)===>List(cat, dog, turkey)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

compute
执行依赖项并计算RDD的实际表示。用户不应直接调用此函数。
Listing Variants
def compute(split: Partition, context: TaskContext): Iterator[T]

context--语境,上下文, sparkContext
返回用于创建RDD的SparkContext。

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.context

res8: org.apache.spark.SparkContext = org.apache.spark.SparkContext@58c1c2f1

count
返回RDD中存储的项数。

Listing Variants

def count(): Long

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
//返回数据的长度。该list的长度为4
c.count
res2: Long = 4

countApprox Approx---大约
标记为实验特征!实验特性目前不包括在本文档中!
Listing Variants
def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]


countApproxDistinct

计算不同值的近似数目。对于分布在许多节点上的大型RDDs,该函数的执行速度可能比其他计数方法快。参数relativeSD(相对标准差)控制着计算的准确性。
Listing Variants

def countApproxDistinct(relativeSD: Double = 0.05): Long

Example

val a = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224

b.countApproxDistinct(0.05)
res15: Long = 9750

b.countApproxDistinct(0.01)
res16: Long = 9947

b.countApproxDistinct(0.001)
res0: Long = 10000


countApproxDistinctByKey [Pair]
  
类似于countApproxDistinct,但是计算每个不同键对应的不同值的大致数目。因此,RDD必须由二元组组成。
对于分布在许多节点上的大型RDDs,该函数的执行速度可能比其他计数方法快。参数relativeSD(相对标准差)控制着计算的准确性。
Listing Variants

def countApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]

Example

val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))

countByKey [Pair]
与count非常相似,但是对由二元组组成的RDD按每个不同的key分别统计value的个数。
Listing Variants

def countByKey(): Map[K, Long]

Example

val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
//count求个数,相同的key的个数累加
c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)

countByKeyApprox [Pair]
标记为实验特征!实验特性目前不包括在本文档中!
Listing Variants

def countByKeyApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[K, BoundedDouble]]

countByValue

返回一个映射,该映射包含RDD的所有惟一值及其各自的出现次数。(警告:此操作最终会把信息聚合到单个reducer中。)
Listing Variants

def countByValue(): Map[T, Long]

Example

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
//值相同的进行个数求和,不管是key,value类型还是value类型
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)


countByValueApprox
Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByValueApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[T, BoundedDouble]]

dependencies
  
返回此RDD所依赖的RDD。

Listing Variants

final def dependencies: Seq[Dependency[_]]

Example

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:12
b.dependencies.length
Int = 0

b.map(a => a).dependencies.length
res40: Int = 1

b.cartesian(a).dependencies.length
res41: Int = 2

b.cartesian(a).dependencies
res42: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,
 org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct
  
返回一个新的RDD,该RDD仅包含每个惟一值一次。

Listing Variants

def distinct(): RDD[T]
def distinct(numPartitions: Int): RDD[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
//去重数据
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2
//其中2,3指的是分区数量,设置多少似乎没有关系
a.distinct(3).partitions.length
res17: Int = 3

first
  
查找RDD的第一个数据项并返回它。

Listing Variants

def first(): T

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res1: String = Gnu

filter
  
计算RDD的每个数据项的布尔函数,并将返回true的函数项放入结果的RDD中。
和map的区别:map在条件不成立时会返回()(Unit对象)占位,而filter会直接丢弃该元素。
Listing Variants

def filter(f: T => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res3: Array[Int] = Array(2, 4, 6, 8, 10)
当您提供筛选器函数时,它必须能够处理RDD中包含的所有数据项。Scala提供了所谓的部分函数来处理混合数据类型。
(提示:当数据中混有不想处理的坏数据,而只想对匹配的好数据应用某种映射函数时,部分函数非常有用。
下面的示例也解释了为什么部分函数必须配合case使用。)

没有部分函数的混合数据示例

val b = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.filter(_ < 4).collect
<console>:15: error: value < is not a member of Any
这将失败,因为a的一些组件不能隐式地与整数进行比较。Collect使用函数对象的isDefinedAt属性来确定测试函数是否与每个数据项兼容。
只有通过此测试(=filter)的数据项才能使用function-object进行映射。
Examples for mixed data with partial functions


可通过case来过滤不需要的东西
val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.collect({case a: Int    => "is integer"
           case b: String => "is string" }).collect
res17: Array[String] = Array(is string, is string, is integer, is string)

val myfunc: PartialFunction[Any, Any] = {
  case a: Int    => "is integer"
  case b: String => "is string" }
myfunc.isDefinedAt("")
res21: Boolean = true

myfunc.isDefinedAt(1)
res22: Boolean = true

myfunc.isDefinedAt(1.5)
res23: Boolean = false

小心!上面的代码可以工作,因为它只检查类型本身!如果对这种类型使用操作,则必须显式声明您想要的类型,而不是任何类型。
否则,编译器(显然)不知道它应该生成什么字节码:
val myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}
<console>:10: error: value < is not a member of Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x < 4) => "x"}
myfunc2: PartialFunction[Int,Any] = <function1>


filterByRange [Ordered]
  
返回一个RDD,其中只包含指定key范围内的项。从我们的测试来看,这似乎只适用于数据是键值对并且已经按key排序的情况。
Listing Variants

def filterByRange(lower: K, upper: K): RDD[P]

Example


val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
val sortedRDD = randRDD.sortByKey()
//只适用于键值对。---对结果集的一个子集获取
sortedRDD.filterByRange(1, 3).collect
res66: Array[(Int, String)] = Array((1,screen), (2,cat), (3,book))

filterWith  (deprecated)
  
这是过滤器的扩展版本。它有两个函数参数。
第一个参数必须符合Int -> T,每个分区执行一次。它将把分区索引转换为类型T。
第二个函数看起来像(U, T) ->Boolean。T是转换后的分区索引,U是RDD中的数据项。
最后,函数必须返回true或false(即应用过滤器)。
Listing Variants

def filterWith[A: ClassTag](constructA: Int => A)(p: (T, A) => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
这里的x=> x把分区索引原样传入,因此b的类型与x=> x的返回值类型相同。
这里的b是分区索引号,a有5个分区(0,1,2,3,4),则b的取值为0,1,2,3,4
a.filterWith(x=> x)((a, b) =>  b == 0).collect
res30: Array[Int] = Array(1, 2)

//b=0,1,2,3,4==>a%(b+1)==0,b的值将会被带进去。(a代表每个分区中的数据项,b代表分区索引号,b的类型由x=> x决定,这里为Int)
a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect
res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)
//x=>x.toString 说明分区索引号的类型为字符串,因此b == "2"才能成功执行
a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect
res34: Array[Int] = Array(5, 6)

flatMap
  
类似于map,但允许在map函数中发出多个项目。
Listing Variants

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

Example
val a = sc.parallelize(1 to 4, 2) //分区0: 1 2 ; 分区1: 3 4
a.flatMap(1 to _).collect //1 to 1=(1); 1 to 2=(1,2); 1 to 3=(1,2,3); 1 to 4=(1,2,3,4)
res42: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2,
 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

//下面的程序生成列表项的随机副本(最多10份)。
val x  = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 
9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues
  
非常类似于mapValues,但是在映射期间折叠值的固有结构。
Listing Variants

def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.collect
res2: Array[(Int, String)] = Array((3,dog), (5,tiger), (4,lion), (3,cat), (7,panther), (5,eagle))
//先执行"x" + _ + "x"===>xdogx;,然后进行压平操作,然后和key生成2元tuple
b.flatMapValues("x" + _ + "x").collect
res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l), (4,i), 
(4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p), (7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a),
(5,g), (5,l), (5,e), (5,x))


flatMapWith (deprecated)
  
类似于flatMap,但允许从flatMap函数中访问分区索引或分区索引的派生。
Listing Variants

def flatMapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]): RDD[U]

Example

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)


fold
  
聚合每个分区的值。每个分区中的聚合变量都是用零值初始化的。
Listing Variants

def fold(zeroValue: T)(op: (T, T) => T): T

Example

val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6
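
补充一个小例子(示意),说明零值在每个分区各应用一次、合并分区结果时还会再应用一次:
val b = sc.parallelize(List(1,2,3), 3)
//3个分区各自计算 1+1=2、1+2=3、1+3=4,最后合并时再带上零值:1 + 2 + 3 + 4 = 10
b.fold(1)(_ + _)
//预期结果为10,而不是6+1=7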


foldByKey [Pair]
  
非常类似于fold,但是对RDD的每个key分别执行折叠。此函数仅在RDD由二元组组成时可用。
Listing Variants

def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]

Example

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
//key相同的聚合在一起。
b.foldByKey("")(_ + _).collect
res84: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
//key相同的聚合在一起
b.foldByKey("")(_ + _).collect
res85: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))

foreach
  
为每个数据项执行一个无参数函数。
Listing Variants

def foreach(f: T => Unit)

Example

val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)
c.foreach(x => println(x + "s are yummy"))
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy

foreachPartition
  
为每个分区执行一个无参数函数。通过迭代器参数提供对分区中包含的数据项的访问。
Listing Variants

def foreachPartition(f: Iterator[T] => Unit)

Example
//每个分区各自执行自己的求和
val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
b.foreachPartition(x => println(x.reduce(_ + _)))
6
15
24

foreachWith (Deprecated)
  
为每个数据项执行一个函数。第一个参数把分区索引转换为类型A(每个分区执行一次),第二个函数接收(数据项, A)。
Listing Variants

def foreachWith[A: ClassTag](constructA: Int => A)(f: (T, A) => Unit)

Example

val a = sc.parallelize(1 to 9, 3)
a.foreachWith(i => i)((x,i) => if (x % 2 == 1 && i % 2 == 0) println(x) )
1
3
7
9


fullOuterJoin [Pair]
  
执行两个成对的RDDs之间的完整外部连接。
Listing Variants

def fullOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], Option[W]))]
def fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], Option[W]))]

Example

val pairRDD1 = sc.parallelize(List( ("cat",2), ("cat", 5), ("book", 4),("cat", 12)))
val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("cat", 12)))
pairRDD1.fullOuterJoin(pairRDD2).collect

res5: Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)), (mouse,(None,Some(4))), (cup,(None,Some(5))),
 (cat,(Some(2),Some(2))), (cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))), (cat,(Some(5),Some(12))), (cat,(Some(12),Some(2))), 
 (cat,(Some(12),Some(12))))


generator, setGenerator
  
Allows setting a string that is attached to the end of the RDD's name when printing the dependency graph.
在打印依赖图时,允许设置一个字符串并绑定到RDD的名字后面

Listing Variants

@transient var generator
def setGenerator(_generator: String)

getCheckpointFile
  
Returns the path to the checkpoint file or null if RDD has not yet been checkpointed.
 
返回检查点文件的路径,如果RDD还没有被checkpoint,则返回None。
Listing Variants

def getCheckpointFile: Option[String]

Example

sc.setCheckpointDir("/home/cloudera/Documents")
val a = sc.parallelize(1 to 500, 5)
val b = a++a++a++a++a
b.getCheckpointFile
res49: Option[String] = None

b.checkpoint
b.getCheckpointFile
res54: Option[String] = None

b.collect
b.getCheckpointFile
res57: Option[String] = Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-d56580787b20/rdd-40)


preferredLocations
  
Returns the hosts which are preferred by this RDD. The actual preference of a specific host depends on various assumptions.
返回此RDD首选的主机。特定主机的实际首选项取决于各种假设。
Listing Variants

final def preferredLocations(split: Partition): Seq[String]
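
Example

下面是一个简单的调用示意:parallelize得到的RDD一般没有位置偏好,返回的Seq通常为空;基于HDFS文件的RDD则会返回保存该数据块的主机名列表。
val a = sc.parallelize(1 to 100, 2)
a.preferredLocations(a.partitions(0))
//对parallelize出来的RDD通常返回空的Seq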


getStorageLevel
  
检索RDD当前设置的存储级别。只有当RDD还没有设置存储级别时,才可以使用此命令来分配新的存储级别。
下面的示例显示了当您试图重新分配存储级别时将会出现的错误。
Listing Variants

def getStorageLevel

Example
persist 存留 坚持; 存留; 固执; 继续存在;

val a = sc.parallelize(1 to 100000, 2)
a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
a.getStorageLevel.description
String = Disk Serialized 1x Replicated

a.cache
java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after it was already assigned a level


glom
  
组装一个包含分区所有元素的数组,并将其嵌入到RDD中。每个返回的数组包含一个分区的内容。
Listing Variants

def glom(): RDD[Array[T]]

Example

glom 抢; 看; <俚>偷; 瞪;意思类似于视图
val a = sc.parallelize(1 to 100, 3)
a.glom.collect
res8: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
 26, 27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 
 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))


groupBy
  
以给定函数的返回值作为key,对RDD的元素进行分组,返回(key, 该组所有元素)形式的RDD。
Listing Variants

def groupBy[K: ClassTag](f: T => K): RDD[(K, Iterable[T])]
def groupBy[K: ClassTag](f: T => K, numPartitions: Int): RDD[(K, Iterable[T])]
def groupBy[K: ClassTag](f: T => K, p: Partitioner): RDD[(K, Iterable[T])]

Example

val a = sc.parallelize(1 to 9, 3)
//even和odd代表两个组,如果表达式为true,则放入even,否则放入odd中
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
  a % 2
}
a.groupBy(myfunc).collect
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
  a % 2
}
a.groupBy(x => myfunc(x), 3).collect
a.groupBy(myfunc(_), 1).collect
res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

import org.apache.spark.Partitioner
class MyPartitioner extends Partitioner {
def numPartitions: Int = 2
def getPartition(key: Any): Int =
{
    key match
    {
      case null     => 0
      case key: Int => key          % numPartitions
      case _        => key.hashCode % numPartitions
    }
  }
  override def equals(other: Any): Boolean =
  {
    other match
    {
      case h: MyPartitioner => true
      case _                => false
    }
  }
}
val a = sc.parallelize(1 to 9, 3)
val p = new MyPartitioner()
val b = a.groupBy((x:Int) => { x }, p)
val c = b.mapWith(i => i)((a, b) => (b, a))
c.collect
res42: Array[(Int, (Int, Seq[Int]))] = Array(
(0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))), (0,(6,ArrayBuffer(6))), (0,(8,ArrayBuffer(8))), 
(1,(9,ArrayBuffer(9))), (1,(3,ArrayBuffer(3))), (1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))), (1,(5,ArrayBuffer(5))))


groupByKey [Pair]
  
与groupBy非常相似,但不是提供函数,每一对的键组件将自动呈现给partitioner。
Listing Variants

def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
keyBy:以括号的内容作为key,以当前值作为value生成tuple元组("dog".length,"dog")
val b = a.keyBy(_.length)
//相同的key放到一个集合中。
b.groupByKey.collect
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, e


histogram [Double]
  
这些函数接收由Double组成的RDD,并创建直方图:既可以等宽分桶(桶的数量等于bucketCount),也可以根据用户通过Double数组提供的自定义桶边界创建任意宽度的分桶。
这两个变体的结果类型略有不同,
第一个函数将返回由两个数组组成的元组。
        第一个数组包含计算出的桶边界值,
        第二个数组包含相应的值计数(即直方图)。
函数的第二个变体只是将直方图作为整数数组返回。
Listing Variants

def histogram(bucketCount: Int): Pair[Array[Double], Array[Long]]
def histogram(buckets: Array[Double], evenBuckets: Boolean = false): Array[Long]

Example with even spacing

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(5)
res11: (Array[Double], Array[Long]) = (Array(1.1, 2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0, 0, 1, 4))

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.histogram(6)
res18: (Array[Double], Array[Long]) = (Array(1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3, 4))

Example with custom spacing

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(Array(0.0, 3.0, 8.0))
res14: Array[Long] = Array(5, 3)

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.histogram(Array(0.0, 5.0, 10.0))
res1: Array[Long] = Array(6, 9)

a.histogram(Array(0.0, 5.0, 10.0, 15.0))
res1: Array[Long] = Array(6, 8, 1)


id

检索由其设备上下文分配给RDD的ID。
Listing Variants

val id: Int

Example

val y = sc.parallelize(1 to 10, 10)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[57] at parallelize at <console>:27

y.id
res16: Int = 57 ===读取的是ParallelCollectionRDD[57]中的57


intersection //两个RDD中共同的数据集(交集)

返回同时存在于两个RDD中的元素。

Listing Variants

def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T]): RDD[T]

Example

val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)

z.collect
res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)


isCheckpointed

Indicates whether the RDD has been checkpointed. The flag will only raise once the checkpoint has really been created.
指示RDD是否已被checkpoint。只有在检查点真正创建之后,该标志才会变为true。
Listing Variants

def isCheckpointed: Boolean

Example

sc.setCheckpointDir("/home/cloudera/Documents")
c.isCheckpointed
res6: Boolean = false

c.checkpoint
c.isCheckpointed
res8: Boolean = false

c.collect
c.isCheckpointed
res9: Boolean = true


iterator

Returns a compatible iterator object for a partition of this RDD. This function should never be called directly.
返回此RDD分区的兼容迭代器对象。永远不要直接调用这个函数。
Listing Variants

final def iterator(split: Partition, context: TaskContext): Iterator[T]

join [Pair]

Performs an inner join using two key-value RDDs. Please note that the keys must be generally comparable to make this work.
使用两个键值RDD执行内连接。请注意,key必须是可以比较相等的,该操作才能正常工作。
Listing Variants

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
//只有做比较才能做等值连接
b.join(d).collect

res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), 
(6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)),
 (3,(rat,gnu)), (3,(rat,bee)))

keyBy

Constructs two-component tuples (key-value pairs) by applying a function on each data item. The result of the function becomes the
 key and the original data item becomes the value of the newly created tuples.
通过对每个数据项应用一个函数来构造两个组件元组(键-值对)。函数的结果成为键,原始数据项成为新创建元组的值。
Listing Variants

def keyBy[K](f: T => K): RDD[(K, T)]

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
b.collect
res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))


keys [Pair]

Extracts the keys from all contained tuples and returns them in a new RDD.
从所有包含的元组中提取键并在新的RDD中返回它们。
Listing Variants

def keys: RDD[K]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

b.values.collect
res52: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)

leftOuterJoin [Pair]

Performs an left outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work correctly.
使用两个键-值RDDs执行左外连接。请注意,这些键通常必须具有可比性,才能正确工作。
Listing Variants

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect

res1: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))),
 (6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), 
 (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant


lookup

扫描RDD,查找与提供的key匹配的所有条目,并以Scala序列返回它们的value。

Listing Variants

def lookup(key: K): Seq[V]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.collect
res57: Array[(Int, String)] = Array((3,dog), (5,tiger), (4,lion), (3,cat), (7,panther), (5,eagle))
//获取所有key为5的数据,以序列(Seq)返回其value
b.lookup(5)
res0: Seq[String] = WrappedArray(tiger, eagle)


map

对RDD的每个项应用转换函数,并将结果作为新的RDD返回。


Listing Variants

def map[U: ClassTag](f: T => U): RDD[U]

Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))


mapPartitions
  
这是一个专门的映射,每个分区只调用一次。
通过输入参数(Iterarator[T]),各个分区的整个内容都可以作为顺序的值流使用。
自定义函数必须返回另一个迭代器[U]。合并的结果迭代器将自动转换为新的RDD。
请注意,由于我们选择的分区,下面的结果中缺少元组(3,4)和(6,7)。

Listing Variants

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example 1

val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext)
  {
    val cur = iter.next;
    res .::= (pre, cur)
    pre = cur;
  }
  res.iterator
}
//每个分区将会执行myfunc这个函数
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
Example 2

val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9,10), 3)

自定义返回类型,但是返回类型必须是迭代器
def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
  var res = List[Int]()
  while (iter.hasNext) {
    val cur = iter.next;
    res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
  }
  res.iterator
}
x.mapPartitions(myfunc).collect
//有些数字根本没有输出。这是因为为它生成的随机数是零。
res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)
上面的程序也可以用flatMap来编写,如下所示。
Example 2 using flatmap

val x  = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 
8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)


mapPartitionsWithContext   (deprecated and developer API)

类似于mapPartitions,但允许访问关于mapper内部处理状态的信息。


Listing Variants

def mapPartitionsWithContext[U: ClassTag](f: (TaskContext, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example

val a = sc.parallelize(1 to 9, 3)
import org.apache.spark.TaskContext
def myfunc(tc: TaskContext, iter: Iterator[Int]) : Iterator[Int] = {
  tc.addOnCompleteCallback(() => println(
    "Partition: "     + tc.partitionId +
    ", AttemptID: "   + tc.attemptId ))
  
  iter.toList.filter(_ % 2 == 0).iterator
}
//获取内部处理信息。
a.mapPartitionsWithContext(myfunc).collect

14/04/01 23:05:48 INFO SparkContext: Starting job: collect at <console>:20
...
14/04/01 23:05:48 INFO Executor: Running task ID 0
Partition: 0, AttemptID: 0, Interrupted: false
...
14/04/01 23:05:48 INFO Executor: Running task ID 1
14/04/01 23:05:48 INFO TaskSetManager: Finished TID 0 in 470 ms on localhost (progress: 0/3)
...
14/04/01 23:05:48 INFO Executor: Running task ID 2
14/04/01 23:05:48 INFO TaskSetManager: Finished TID 1 in 23 ms on localhost (progress: 1/3)
14/04/01 23:05:48 INFO DAGScheduler: Completed ResultTask(0, 1)

res0: Array[Int] = Array(2, 6, 4, 8)

Partition: 0, AttemptID: 184
Partition: 1, AttemptID: 185
Partition: 2, AttemptID: 186
res62: Array[Int] = Array(2, 4, 6, 8)


mapPartitionsWithIndex

与mapPartitions相似,但需要两个参数。
第一个参数是分区的索引,第二个参数是遍历该分区中的所有项的迭代器。
输出是一个迭代器,在应用函数所编码的任何转换之后,它将包含项目列表。

Listing Variants
def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]


Example

val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
  iter.map(x => index + "," + x)
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)


mapPartitionsWithSplit

这个方法在API中被标记为已废弃。所以,你不应该再使用这个方法了。本文将不介绍不赞成使用的方法。

Listing Variants
def mapPartitionsWithSplit[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

mapValues [Pair]

获取由两个组件元组组成的RDD的值,并应用提供的函数来转换每个值。然后,它使用键和转换后的值形成新的双组件元组,并将它们存储在新的RDD中。

Listing Variants

def mapValues[U](f: V => U): RDD[(K, U)]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.collect
res65: Array[(Int, String)] = Array((3,dog), (5,tiger), (4,lion), (3,cat), (7,panther), (5,eagle))
//对val值进行重新处理
b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))


mapWith  (deprecated)

这是map的扩展版本。
它有两个函数参数。
    第一个参数必须符合Int -> T,每个分区执行一次。它会把分区索引转换成某个类型为T的对象,
        好处是可以在每个分区只执行一次初始化代码,比如创建一个随机数生成器对象。
    第二个函数必须符合(U, T) -> U。T是转换后的分区索引(对象),U是RDD的数据项。最后,函数必须返回类型为U的转换后数据项。

Listing Variants

def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]

Example

// generates 9 random numbers less than 1000. 
val x = sc.parallelize(1 to 9, 3)
(a => new scala.util.Random)指定为每个分区构造的对象r的类型,该构造在每个分区中只执行一次。
(x,r)----x代表每个分区中的数据项,r是为该分区构造出来的对象(这里是Random)。
With操作都要先指定一个供后面操作使用的实例对象。比如这里是一个随机生成的Random对象,也可以是一个字符串。
x.mapWith(a => new scala.util.Random)((x, r) => r.nextInt(1000)).collect
res0: Array[Int] = Array(940, 51, 779, 742, 757, 982, 35, 800, 15)

val a = sc.parallelize(1 to 9, 3)
val b = a.mapWith("Index:" + _)((a, b) => ("Value:" + a, b))
b.collect
res0: Array[(String, String)] = Array((Value:1,Index:0), (Value:2,Index:0), (Value:3,Index:0), (Value:4,Index:1), (Value:5,Index:1), 
(Value:6,Index:1), (Value:7,Index:2), (Value:8,Index:2), (Value:9,Index:2)


max

返回RDD中的最大元素
Listing Variants

def max()(implicit ord: Ordering[T]): T

Example

val y = sc.parallelize(10 to 30)
y.max
res75: Int = 30

val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (18, "cat")))
a.max
res6: (Int, String) = (18,cat)

mean [Double], meanApprox [Double] 平均值

调用stats并提取平均值组件。在某些情况下,函数的近似版本可以更快地完成。然而,它用准确性换取速度。

Listing Variants

def mean(): Double
def meanApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

Example

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.mean
res0: Double = 5.3
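
meanApprox的一个调用示意(参数仅为示例值,timeout单位是毫秒;getFinalValue()会阻塞直到得到最终的BoundedDouble):
val b = sc.parallelize(1 to 1000000, 10).map(_.toDouble)
val approx = b.meanApprox(100, 0.95)  //100毫秒内给出近似结果,置信度0.95
approx.getFinalValue()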

min

返回RDD中的最小元素

Listing Variants

def min()(implicit ord: Ordering[T]): T

Example

val y = sc.parallelize(10 to 30)
y.min
res75: Int = 10


val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (8, "cat")))
a.min
res4: (Int, String) = (3,tiger)

name, setName  设置一个别名

Allows a RDD to be tagged with a custom name.
允许使用自定义名称标记RDD。
Listing Variants

@transient var name: String
def setName(_name: String)

Example

val y = sc.parallelize(1 to 10, 10)
y.name
res13: String = null
y.setName("Fancy RDD Name")
y.name
res15: String = Fancy RDD Name


partitionBy [Pair]

使用给定的Partitioner按key对键值RDD重新分区。分区器实现作为第一个参数提供。
Listing Variants

def partitionBy(partitioner: Partitioner): RDD[(K, V)]
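
Example

下面补充一个简单的示意:用HashPartitioner把(k,v)RDD重新分成2个分区,key相同的记录会落在同一个分区中(按key的hashCode分配)。
import org.apache.spark.HashPartitioner
val pairRDD = sc.parallelize(List(("cat",2), ("dog",1), ("cat",5), ("mouse",4)), 3)
val p = pairRDD.partitionBy(new HashPartitioner(2))
p.partitions.length
//分区数变为2
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
p.mapPartitionsWithIndex(myfunc).collect
//可以看到相同key的元素位于同一个partID下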

partitioner 

Specifies a function pointer to the default partitioner that will be used for groupBy, subtract, reduceByKey (from PairedRDDFunctions), 
etc. functions.
指定一个指向默认分区器的函数指针,用于groupBy、subtract、reduceByKey(来自PairedRDDFunctions)等函数。
Listing Variants

@transient val partitioner: Option[Partitioner]
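
Example

一个简单的示意:普通RDD(如parallelize/keyBy之后)的partitioner为None;经过带分区数的groupByKey等操作后,会带上HashPartitioner,后续按key的操作可以据此避免重复shuffle。
val a = sc.parallelize(List("dog", "cat", "owl"), 2).keyBy(_.length)
a.partitioner
//None
val b = a.groupByKey(2)
b.partitioner
//Some(org.apache.spark.HashPartitioner@...)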


partitions 

Returns an array of the partition objects associated with this RDD.
返回与此RDD关联的分区对象数组。
Listing Variants

final def partitions: Array[Partition]


Example

val b = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
b.partitions
res48: Array[org.apache.spark.Partition] = Array(
    org.apache.spark.rdd.ParallelCollectionPartition@18aa,
    org.apache.spark.rdd.ParallelCollectionPartition@18ab)


persist, cache 

These functions can be used to adjust the storage level of a RDD. When freeing up memory, 
Spark will use the storage level identifier to decide which partitions should be kept. 
The parameterless variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY). 
(Warning: Once the storage level has been changed, it cannot be changed again!)
这些函数可用于调整RDD的存储级别。在释放内存时,Spark将使用存储级别标识符来决定应该保留哪些分区。
无参数变量persist()和cache()只是persist(StorageLevel.MEMORY_ONLY)的缩写。(警告:一旦存储级别被更改,就不能再更改!)
Listing Variants

def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]


Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
Spark将使用存储级别标识符来决定应该保留哪些分区。StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)共5个参数
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1) //persist()和cache()将useMemory和deserialized置为true


pipe 

获取每个分区的RDD数据,并通过stdin将其发送到shell命令。命令的结果输出被捕获并作为字符串值的RDD返回。
Listing Variants

def pipe(command: String): RDD[String]
def pipe(command: String, env: Map[String, String]): RDD[String]
def pipe(command: Seq[String], env: Map[String, String] = Map(), printPipeContext: (String => Unit) => Unit = null,
 printRDDElement: (T, String => Unit) => Unit = null): RDD[String]


Example

val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res2: Array[String] = Array(1, 4, 7)


randomSplit 

Randomly splits an RDD into multiple smaller RDDs according to a weights Array which 
specifies the percentage of the total data elements that is assigned to each smaller RDD. 
Note the actual size of each smaller RDD is only approximately equal to 
the percentages specified by the weights Array. The second example below shows 
the number of items in each smaller RDD does not exactly match the weights Array.   
A random optional seed can be specified. This function is useful for spliting data into a 
training set and a testing set for machine learning.
根据权重数组将RDD随机分割为多个较小的RDDs,权重数组指定分配给每个较小RDD的数据元素总数的百分比。
注意,每个较小RDD的实际大小仅大约等于weights数组指定的百分比。下面的第二个示例显示了每个较小RDD中的项数与weights数组不完全匹配。
可以指定一个可选的随机种子。该函数常用于机器学习中将数据分割为训练集和测试集。
Listing Variants

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]


Example
//用于机器学习测试集
val y = sc.parallelize(1 to 10)
val splits = y.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)//拿到splits集合中的第0个元素
val test = splits(1)//拿到splits集合中的第1个元素
training.collect
res85: Array[Int] = Array(1, 4, 5, 6, 8, 10)
test.collect
res86: Array[Int] = Array(2, 3, 7, 9)

val y = sc.parallelize(1 to 10)
val splits = y.randomSplit(Array(0.1, 0.3, 0.6))

val rdd1 = splits(0)
val rdd2 = splits(1)
val rdd3 = splits(2)

rdd1.collect
res87: Array[Int] = Array(4, 10)
rdd2.collect
res88: Array[Int] = Array(1, 3, 5, 8)
rdd3.collect
res91: Array[Int] = Array(2, 6, 7, 9)

reduce 

这个函数提供了Spark中众所周知的reduce功能。请注意,您提供的任何函数都应该是可交换的,以便产生可重复的结果。
Listing Variants

def reduce(f: (T, T) => T): T


Example

val a = sc.parallelize(1 to 100, 3)
a.reduce(_ + _)
res41: Int = 5050

reduceByKey [Pair],  reduceByKeyLocally [Pair], reduceByKeyToDriver [Pair]

This function provides the well-known reduce functionality in Spark. Please note that any function f you provide, should be 
commutative in order to generate reproducible results.
这个函数提供了Spark中众所周知的reduce功能。请注意,您提供的任何函数都应该是可交换的,以便产生可重复的结果。
Listing Variants

def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
def reduceByKeyToDriver(func: (V, V) => V): Map[K, V]


Example

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
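
reduceByKeyLocally / reduceByKeyToDriver 的一个小示意:与reduceByKey类似,但结果不再是RDD,而是直接以Scala Map返回到driver端。这里用计数这种可交换、可结合的操作,结果是确定的。
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, 1))
b.reduceByKeyLocally(_ + _)
//返回scala.collection.Map[Int,Int],这里应为Map(3 -> 5)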

repartition

This function changes the number of partitions to the number specified by the numPartitions parameter 
此函数将分区数量更改为numPartitions参数指定的分区数量
Listing Variants

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]


Example

val rdd = sc.parallelize(List(1, 2, 10, 4, 5, 2, 1, 1, 1), 3)
rdd.partitions.length
res2: Int = 3
val rdd2  = rdd.repartition(5)
rdd2.partitions.length
res6: Int = 5


repartitionAndSortWithinPartitions [Ordered]

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
根据给定的分区器重新分区RDD,并在每个生成的分区中,按键对记录进行排序。
Listing Variants

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]


Example

//首先,我们将进行未排序的范围分区
val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
val rPartitioner = new org.apache.spark.RangePartitioner(3, randRDD)
val partitioned = randRDD.partitionBy(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
partitioned.mapPartitionsWithIndex(myfunc).collect

res0: Array[String] = Array([partID:0, val: (2,cat)], [partID:0, val: (3,book)], [partID:0, val: (1,screen)], [partID:1, val: (4,tv)],
 [partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2, val: (7,cup)])


//现在我们重新分区,但这次是排序
val partitioned = randRDD.repartitionAndSortWithinPartitions(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
partitioned.mapPartitionsWithIndex(myfunc).collect

res1: Array[String] = Array([partID:0, val: (1,screen)], [partID:0, val: (2,cat)], [partID:0, val: (3,book)], [partID:1, val: (4,tv)],
 [partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2, val: (7,cup)])


rightOuterJoin [Pair]

使用两个键-值RDDs执行右外连接。请注意,这些键通常必须具有可比性,才能正确工作。
Listing Variants

def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], W))]


Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.rightOuterJoin(d).collect

res2: Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)),
 (6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), 
 (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)),
 (4,(None,wolf)), (4,(None,bear)))

sample


随机选择RDD中的一部分,并在新的RDD中返回它们。
Listing Variants

def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T]


Example

val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1, 0).count
res24: Long = 960

a.sample(true, 0.3, 0).count
res25: Long = 2888

a.sample(true, 0.3, 13).count
res26: Long = 2985

sampleByKey [Pair]

根据您希望在最终RDD中显示的每个键的比例,对键值对RDD进行随机抽样。
Listing Variants

def sampleByKey(withReplacement: Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K, V)]


Example

val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater")))
val sampleMap = List((7, 0.4), (6, 0.6)).toMap
randRDD.sampleByKey(false, sampleMap,42).collect

res6: Array[(Int, String)] = Array((7,cat), (6,mouse), (6,book), (6,screen), (7,heater))

sampleByKeyExact [Pair, experimental]


This is marked as experimental, so we do not document it here.
Listing Variants

def sampleByKeyExact(withReplacement: Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K, V)]


saveAsHadoopFile [Pair], saveAsHadoopDataset [Pair], saveAsNewAPIHadoopFile [Pair]

Saves the RDD in a Hadoop compatible format using any Hadoop outputFormat class the user specifies.
Listing Variants

def saveAsHadoopDataset(conf: JobConf)
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String)(implicit fm: ClassTag[F])
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String, codec: Class[_ <: CompressionCodec]) (implicit fm: ClassTag[F])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]], codec: Class[_ <: CompressionCodec])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]], 
    conf: JobConf = new JobConf(self.context.hadoopConfiguration), codec: Option[Class[_ <: CompressionCodec]] = None)
def saveAsNewAPIHadoopFile[F <: NewOutputFormat[K, V]](path: String)(implicit fm: ClassTag[F])
def saveAsNewAPIHadoopFile(path: String, keyClass: Class[_],
    valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = self.context.hadoopConfiguration)
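
Example

A minimal sketch, not taken from the original guide: it assumes Hadoop's Text/IntWritable classes and the new-API TextOutputFormat are on the classpath; the output path "hd_hadoop_file" is illustrative.

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val v = sc.parallelize(Array(("owl", 3), ("gnu", 4), ("dog", 1)), 2)
// convert keys and values to Hadoop Writables so they match the OutputFormat's type parameters
val w = v.map { case (word, n) => (new Text(word), new IntWritable(n)) }
// the type parameter selects the Hadoop OutputFormat class used to write the files
w.saveAsNewAPIHadoopFile[TextOutputFormat[Text, IntWritable]]("hd_hadoop_file")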

saveAsObjectFile

Saves the RDD in binary format.

Listing Variants

def saveAsObjectFile(path: String)

Example

val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect
res52: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 
 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

saveAsSequenceFile [SeqFile]

Saves the RDD as a Hadoop sequence file.

Listing Variants

def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None)

Example

val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
14/04/19 05:45:43 INFO FileOutputCommitter: Saved output of task 'attempt_201404190545_0000_m_000001_191' to file:/home/cloudera/hd_seq_file

[cloudera@localhost ~]$ ll ~/hd_seq_file
total 8
-rwxr-xr-x 1 cloudera cloudera 117 Apr 19 05:45 part-00000
-rwxr-xr-x 1 cloudera cloudera 133 Apr 19 05:45 part-00001
-rwxr-xr-x 1 cloudera cloudera   0 Apr 19 05:45 _SUCCESS

saveAsTextFile

Saves the RDD as text files. One line at a time.
Listing Variants

def saveAsTextFile(path: String)
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec])

Example without compression

val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")
14/04/03 21:11:36 INFO FileOutputCommitter: Saved output of task 'attempt_201404032111_0000_m_000002_71' to 
file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a


[cloudera@localhost ~]$ head -n 5 ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/part-00000
1
2
3
4
5

// Produces 3 output files since we created the RDD with 3 partitions
[cloudera@localhost ~]$ ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/
-rwxr-xr-x 1 cloudera cloudera 15558 Apr  3 21:11 part-00000
-rwxr-xr-x 1 cloudera cloudera 16665 Apr  3 21:11 part-00001
-rwxr-xr-x 1 cloudera cloudera 16671 Apr  3 21:11 part-00002

Example with compression

import org.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])

[cloudera@localhost ~]$ ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_b/
total 24
-rwxr-xr-x 1 cloudera cloudera 7276 Apr  3 21:29 part-00000.gz
-rwxr-xr-x 1 cloudera cloudera 6517 Apr  3 21:29 part-00001.gz
-rwxr-xr-x 1 cloudera cloudera 6525 Apr  3 21:29 part-00002.gz

val x = sc.textFile("mydata_b")
x.count
res2: Long = 10000

Example writing into HDFS

val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/cloudera/test");

val sp = sc.textFile("hdfs://localhost:8020/user/cloudera/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/cloudera/sp_x")

stats [Double]
Simultaneously computes the mean, variance and the standard deviation of all values in the RDD.

Listing Variants

def stats(): StatCounter

Example

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.stats
res16: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859)

sortBy

This function sorts the input RDD's data and stores it in a new RDD.
The first parameter requires you to specify a function which maps the input data into the key that you want to sort by.
The second parameter (optional) specifies whether you want the data to be sorted in ascending or descending order.
Listing Variants

def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.size)(implicit ord: Ordering[K],
 ctag: ClassTag[K]): RDD[T]

Example

val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
res101: Array[Int] = Array(1, 1, 2, 3, 5, 7)

y.sortBy(c => c, false).collect
res102: Array[Int] = Array(7, 5, 3, 2, 1, 1)

val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))
z.sortBy(c => c._1, true).collect
res109: Array[(String, Int)] = Array((A,26), (H,10), (L,5), (Z,1))

z.sortBy(c => c._2, true).collect
res108: Array[(String, Int)] = Array((Z,1), (L,5), (H,10), (A,26))

sortByKey [Ordered]

This function sorts the input RDD's data and stores it in a new RDD.
The output RDD is a shuffled RDD because it stores data that has been output by a reducer and shuffled. The implementation of this function is actually very clever.
First, it uses a range partitioner to partition the data into ranges within the shuffled RDD. Then it sorts these ranges individually with mapPartitions using standard sort mechanisms.
Listing Variants

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P]

Example

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))

val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res56: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))


stdev [Double], sampleStdev [Double]

Calls stats and extracts either the stdev-component or the corrected sampleStdev-component.
Listing Variants

def stdev(): Double
def sampleStdev(): Double

Example

val d = sc.parallelize(List(0.0, 0.0, 0.0), 3)
d.stdev
res10: Double = 0.0
d.sampleStdev
res11: Double = 0.0

val d = sc.parallelize(List(0.0, 1.0), 3)
d.stdev
d.sampleStdev
res18: Double = 0.5
res19: Double = 0.7071067811865476

val d = sc.parallelize(List(0.0, 0.0, 1.0), 3)
d.stdev
res14: Double = 0.4714045207910317
d.sampleStdev
res15: Double = 0.5773502691896257


subtract

Performs the well-known, standard set subtraction operation: A - B

Listing Variants

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], p: Partitioner): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.collect
res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)

subtractByKey [Pair]

Very similar to subtract, but instead of supplying a function, the key component of each pair is automatically used as the criterion for removing items from the first RDD.
Listing Variants

def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)
b.subtractByKey(d).collect
res15: Array[(Int, String)] = Array((4,lion))


sum [Double], sumApprox [Double]

Computes the sum of all values contained in the RDD. The approximate version of the function can finish somewhat faster in some scenarios.
 However, it trades accuracy for speed.

Listing Variants

def sum(): Double
def sumApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

Example

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res17: Double = 101.39999999999999
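
A minimal sketch of the approximate variant (assumed usage, not from the original guide): sumApprox returns a PartialResult[BoundedDouble], and getFinalValue() blocks until the job finishes or the timeout (in milliseconds) elapses.

val big = sc.parallelize(1 to 1000000, 8).map(_.toDouble)
// ask for an answer within 100 ms at 95% confidence
val approx = big.sumApprox(timeout = 100, confidence = 0.95)
approx.getFinalValue()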


take

Extracts the first n items of the RDD and returns them as an array. (Note: This sounds very easy, but it is actually quite a tricky problem
for the implementors of Spark because the items in question can be spread across many different partitions.)

Listing Variants

def take(num: Int): Array[T]

Example

val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.take(2)
res18: Array[String] = Array(dog, cat)

val b = sc.parallelize(1 to 10000, 5000)
b.take(100)
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 
 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 
 93, 94, 95, 96, 97, 98, 99, 100)

takeOrdered

Orders the data items of the RDD using their inherent implicit ordering function and returns the first n items as an array.

Listing Variants

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

Example

val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
// returns the 2 smallest elements according to the natural ordering
b.takeOrdered(2)
res19: Array[String] = Array(ape, cat)


takeSample

Behaves differently from sample in the following respects:
It returns an exact number of samples (hint: the second parameter).
It returns an Array instead of an RDD.
It internally randomizes the order of the items returned.

Listing Variants

def takeSample(withReplacement: Boolean, num: Int, seed: Int): Array[T]

Example

val x = sc.parallelize(1 to 1000, 3)
x.takeSample(true, 100, 1)
res3: Array[Int] = Array(339, 718, 810, 105, 71, 268, 333, 360, 341, 300, 68, 848, 431, 449, 773, 172, 802, 339, 431, 285, 937, 
301, 167, 69, 330, 864, 40, 645, 65, 349, 613, 468, 982, 314, 160, 675, 232, 794, 577, 571, 805, 317, 136, 860, 522, 45, 628, 
178, 321, 482, 657, 114, 332, 728, 901, 290, 175, 876, 227, 130, 863, 773, 559, 301, 694, 460, 839, 952, 664, 851, 260, 729, 
823, 880, 792, 964, 614, 821, 683, 364, 80, 875, 813, 951, 663, 344, 546, 918, 436, 451, 397, 670, 756, 512, 391, 70, 213, 896, 123, 858)


toDebugString

Returns a string that contains debug information about the RDD and its dependencies.

Listing Variants

def toDebugString: String

Example

val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res6: String = 
MappedRDD[15] at subtract at <console>:16 (3 partitions)
  SubtractedRDD[14] at subtract at <console>:16 (3 partitions)
    MappedRDD[12] at subtract at <console>:16 (3 partitions)
      ParallelCollectionRDD[10] at parallelize at <console>:12 (3 partitions)
    MappedRDD[13] at subtract at <console>:16 (3 partitions)
      ParallelCollectionRDD[11] at parallelize at <console>:12 (3 partitions)


toJavaRDD

Embeds this RDD object within a JavaRDD object and returns it.

Listing Variants

def toJavaRDD() : JavaRDD[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.toJavaRDD
res3: org.apache.spark.api.java.JavaRDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:12


toLocalIterator

Converts the RDD into a scala iterator at the master node.
Listing Variants

def toLocalIterator: Iterator[T]

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
val iter = z.toLocalIterator

iter.next
res51: Int = 1

iter.next
res52: Int = 2

top

Utilizes the implicit ordering of $T$ to determine the top $k$ values and returns them as an array.
Listing Variants

def top(num: Int)(implicit ord: Ordering[T]): Array[T]

Example

val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
// returns the 2 largest elements
c.top(2)
res28: Array[Int] = Array(9, 8)

toString

Assembles a human-readable textual description of the RDD.

Listing Variants

override def toString: String

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.toString
res61: String = ParallelCollectionRDD[80] at parallelize at <console>:21

val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater")))
val sortedRDD = randRDD.sortByKey()
sortedRDD.toString
res64: String = ShuffledRDD[88] at sortByKey at <console>:23


treeAggregate

Computes the same thing as aggregate, except it aggregates the elements of the RDD in a multi-level tree pattern. 
Another difference is that it does not use the initial value for the second reduce function (combOp). 
 By default a tree of depth 2 is used, but this can be changed via the depth parameter.

Listing Variants

def treeAggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0: ClassTag[U]): U

Example


val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// let's first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], 
[partID:1, val: 6])

z.treeAggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

// Note unlike normal aggregate. Tree aggregate does not apply the initial value for the second reduce
// This example returns 11 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(4, 5, 6) = 6
// final reduce across partitions will be 5 + 6 = 11
// note the final reduce does not include the initial value
z.treeAggregate(5)(math.max(_, _), _ + _)
res42: Int = 11
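
The depth parameter mentioned above can also be set explicitly; a minimal sketch (with only 2 partitions the extra tree level has nothing further to combine, so the result should match the zero-initialized call above):

// same aggregation as the zero-initialized call above, but requesting a deeper combine tree
z.treeAggregate(0)(math.max(_, _), _ + _, depth = 3)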

treeReduce

Works like reduce except reduces the elements of the RDD in a multi-level tree pattern.

Listing Variants

def  treeReduce(f: (T, T) ⇒ T, depth: Int = 2): T

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.treeReduce(_+_)
res49: Int = 21

union, ++

Performs the standard set operation: A union B

Listing Variants

def ++(other: RDD[T]): RDD[T]
def union(other: RDD[T]): RDD[T]

Example

val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)


unpersist

Dematerializes the RDD (i.e. Erases all data items from hard-disk and memory). However, the RDD object remains. 
If it is referenced in a computation, Spark will regenerate it automatically using the stored dependency graph.

Listing Variants

def unpersist(blocking: Boolean = true): RDD[T]

Example

val y = sc.parallelize(1 to 10, 10)
val z = (y++y)
z.collect
z.unpersist(true)
14/04/19 03:04:57 INFO UnionRDD: Removing RDD 22 from persistence list
14/04/19 03:04:57 INFO BlockManager: Removing RDD 22


values

Extracts the values from all contained tuples and returns them in a new RDD.

Listing Variants

def values: RDD[V]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res3: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)

variance [Double], sampleVariance [Double]

Calls stats and extracts either variance-component or corrected sampleVariance-component.
Listing Variants

def variance(): Double 
def sampleVariance(): Double

Example

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.variance
res70: Double = 10.605333333333332

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res14: Double = 66.04584444444443

x.sampleVariance
res13: Double = 74.30157499999999


zip

Joins two RDDs by combining the i-th of either partition with each other. 
The resulting RDD will consist of two-component tuples which are interpreted as 
key-value pairs by the methods provided by the PairRDDFunctions extension.
Listing Variants

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val a = sc.parallelize(1 to 100, 3)
val b = sc.parallelize(101 to 200, 3)
// corresponding partitions are zipped element by element
a.zip(b).collect
res1: Array[(Int, Int)] = Array((1,101), (2,102), (3,103), (4,104), (5,105), (6,106), (7,107), (8,108), (9,109), (10,110),
 (11,111), (12,112), (13,113), (14,114), (15,115), (16,116), (17,117), (18,118), (19,119), (20,120), (21,121), (22,122), (23,123),
 (24,124), (25,125), (26,126), (27,127), (28,128), (29,129), (30,130), (31,131), (32,132), (33,133), (34,134), (35,135), (36,136),
 (37,137), (38,138), (39,139), (40,140), (41,141), (42,142), (43,143), (44,144), (45,145), (46,146), (47,147), (48,148), (49,149), 
 (50,150), (51,151), (52,152), (53,153), (54,154), (55,155), (56,156), (57,157), (58,158), (59,159), (60,160), (61,161), (62,162), 
 (63,163), (64,164), (65,165), (66,166), (67,167), (68,168), (69,169), (70,170), (71,171), (72,172), (73,173), (74,174), (75,175),
 (76,176), (77,177), (78,...

val a = sc.parallelize(1 to 100, 3)
val b = sc.parallelize(101 to 200, 3)
val c = sc.parallelize(201 to 300, 3)
a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2 )).collect
res12: Array[(Int, Int, Int)] = Array((1,101,201), (2,102,202), (3,103,203), (4,104,204), (5,105,205), (6,106,206), (7,107,207), 
(8,108,208), (9,109,209), (10,110,210), (11,111,211), (12,112,212), (13,113,213), (14,114,214), (15,115,215), (16,116,216), (17,117,217), 
(18,118,218), (19,119,219), (20,120,220), (21,121,221), (22,122,222), (23,123,223), (24,124,224), (25,125,225), (26,126,226), 
(27,127,227), (28,128,228), (29,129,229), (30,130,230), (31,131,231), (32,132,232), (33,133,233), (34,134,234), (35,135,235),
 (36,136,236), (37,137,237), (38,138,238), (39,139,239), (40,140,240), (41,141,241), (42,142,242), (43,143,243), (44,144,244), 
 (45,145,245), (46,146,246), (47,147,247), (48,148,248), (49,149,249), (50,150,250), (51,151,251), (52,152,252), (53,153,253), 
 (54,154,254), (55,155,255)...


zipPartitions

Similar to zip. But provides more control over the zipping process.
Listing Variants

def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C])
                    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)
                    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])
                    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag]
                    (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)
                    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]

Example

val a = sc.parallelize(0 to 9, 3)
val b = sc.parallelize(10 to 19, 3)
val c = sc.parallelize(100 to 109, 3)
def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
{
  var res = List[String]()
  while (aiter.hasNext && biter.hasNext && citer.hasNext)
  {
    val x = aiter.next + " " + biter.next + " " + citer.next
    res ::= x
  }
  res.iterator
}
a.zipPartitions(b, c)(myfunc).collect
res50: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9 19 109, 8 18 108, 7 17 107, 6 16 106)

zipWithIndex

Zips the elements of the RDD with its element indexes. The indexes start from 0. If the RDD is spread across multiple partitions
 then a spark Job is started to perform this operation.
Listing Variants

def zipWithIndex(): RDD[(T, Long)]

Example
// zip each element with its index to form (element, index) tuples
val z = sc.parallelize(Array("A", "B", "C", "D"))
val r = z.zipWithIndex
r.collect
res110: Array[(String, Long)] = Array((A,0), (B,1), (C,2), (D,3))

val z = sc.parallelize(100 to 120, 5)
val r = z.zipWithIndex
r.collect
res11: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10),
 (111,11), (112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18), (119,19), (120,20))

zipWithUniqueId

This is different from zipWithIndex since just gives a unique id to each data element but the ids may not match the index number 
of the data element. This operation does not start a spark job even if the RDD is spread across multiple partitions.
Compare the results of the example below with that of the 2nd example of zipWithIndex. You should be able to see the difference.
Listing Variants

def zipWithUniqueId(): RDD[(T, Long)]

Example

val z = sc.parallelize(100 to 120, 5)
val r = z.zipWithUniqueId
r.collect

res12: Array[(Int, Long)] = Array((100,0), (101,5), (102,10), (103,15), (104,1), (105,6), (106,11), (107,16), (108,2), (109,7),
 (110,12), (111,17), (112,3), (113,8), (114,13), (115,18), (116,4), (117,9), (118,14), (119,19), (120,24))
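
As a sketch of the pattern behind these ids (assuming Spark's documented behavior that items in the k-th of n partitions receive ids k, n+k, 2n+k, ...), the assignment above can be reproduced by hand:

// rebuild each id as (index within partition) * numPartitions + partitionIndex
val n = z.partitions.length
val byFormula = z.mapPartitionsWithIndex { (p, it) =>
  it.zipWithIndex.map { case (v, i) => (v, i.toLong * n + p) }
}
byFormula.collect sameElements r.collect   // expected: true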
 
