The RDD API By Example

最新推荐文章于 2024-04-04 21:18:39 发布

houzhizhen

最新推荐文章于 2024-04-04 21:18:39 发布

阅读量1.1k

点赞数

分类专栏： spark

spark 专栏收录该内容

158 篇文章 2 订阅

订阅专栏

Zhen He

Associate Professor

Department of Computer Science and Computer Engineering
La Trobe University
Bundoora, Victoria 3086
Australia

Tel : + 61 3 9479 3036
Email: z.he@latrobe.edu.au

Building: Beth Gleeson, Room: 235

Original Page:http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

Our research group has a very strong focus on using and improvingApache Spark to solve real world programs. In order to do this we needto have a very solid understanding of the capabilities of Spark. So oneof the first things we have done is to go through the entire Spark RDDAPI and write examples to test their functionality. This has been avery useful exercise and we would like to share the examples witheveryone.

Authors of examples: Matthias Langer and Zhen He
Emails addresses: m.langer@latrobe.edu.au, z.he@latrobe.edu.au

These examples have only been tested for Spark version 1.4. We assumethe functionality of Spark is stable and therefore the examples shouldbe valid for later releases.

If you find any errors in the example we would love to hear about them so we can fix them up. So please email us to let us know.

The RDD API By Example

RDD is short for Resilient Distributed Dataset.RDDsare the workhorse of the Spark system. As a user, one can consider aRDD as a handle for a collection of individual data partitions, whichare the result of some computation.

However, an RDD is actually more than that. Oncluster installations, separate data partitions can be on separatenodes. Using the RDD as a handle one can access all partitions andperform computations and transformations using the contained data.Whenever a part of a RDD or an entire RDD is lost, the system is ableto reconstruct the data of lost partitions by using lineageinformation. Lineage refers to the sequence of transformations used toproduce the current RDD. As a result, Spark is able to recoverautomatically from most failures.

All RDDs available in Spark derive eitherdirectlyor indirectly from the class RDD. This class comes with a large set ofmethods that perform operations on the data within the associatedpartitions. The class RDD is abstract. Whenever, one uses a RDD, one isactually using a concertized implementation of RDD. Theseimplementations have to overwrite some core functions to make the RDDbehave as expected.

One reason why Spark has lately become a verypopular system for processing big data is that it does not imposerestrictions regarding what data can be stored within RDD partitions.The RDD API already contains many useful operations. But, because thecreators of Spark had to keep the core API of RDDs common enough tohandle arbitrary data-types, many convenience functionsare missing.

The basic RDD API considers each data item asa single value. However, users often want to work with key-value pairs. Therefore Spark extended the interface of RDD to provideadditional functions (PairRDDFunctions), which explicitly work on key-value pairs. Currently, there are four extensions to the RDD API available inspark. They are as follows:

DoubleRDDFunctions

This extension contains manyuseful methods for aggregating numeric values. They become available ifthe data items of an RDD are implicitly convertible to the Scala data-type double.

PairRDDFunctions

Methods defined inthis interfaceextension become available when the data items have a two componenttuple structure. Spark will interpret the first tuple item (i.e.tuplename. 1) as the key and the second item (i.e. tuplename. 2) as theassociated value.

OrderedRDDFunctions

Methods defined inthis interfaceextension become available if the data items are two-component tupleswhere the key is implicitly sortable.

SequenceFileRDDFunctions

This extensioncontainsseveral methods that allow users to create Hadoop sequence- les fromRDDs. The data items must be two compo- nent key-value tuples as required by the PairRDDFunctions. However, there areadditional requirements considering the convertibility of the tuplecomponents to Writable types.

Since Spark will make methods with extendedfunctionality automatically available to users when the data itemsfulfill the above described requirements, we decided to list allpossibleavailable functions in strictly alphabetical order. We will appendeither of the followingto the function-name to indicateit belongs to an extension that requires the data items to conform to acertain format or type.

[Double] - DoubleRDD Functions

[Ordered] -OrderedRDDFunctions

[Pair] - PairRDDFunctions

[SeqFile]- SequenceFileRDDFunctions

aggregate

The aggregate function allows the user to apply twodifferent reduce functions to the RDD. The first reduce function isapplied within each partition to reduce the data within each partitioninto a single result. The second reduce function is used to combine thedifferent reduced results of all partitions together to arrive at onefinal result. The ability to have two separate reduce functions forintra partition versus across partition reducing adds a lot offlexibility. For example the first reduce function can be the maxfunction and the second one can be the sum function. The user alsospecifies an initial value. Here are some important facts.

The initial value is applied at both levels of reduce. So both at the intra partition reduction and across partition reduction.
Both reduce functions have to be commutative andassociative.
Do not assume any execution order for either partitioncomputations or combining partitions.
Why would one want to use two input data types? Let usassume we do an archaeological site survey using a metal detector.While walking through the site we take GPS coordinates of importantfindings based on the output of the metal detector. Later, we intend todraw an image of a map that highlights these locations using the aggregate function. In this casethe zeroValuecould be an area map with no highlights. The possibly huge set of inputdata is stored as GPS coordinates across many partitions.seqOp (first reducer) could convert the GPScoordinates to map coordinates and put a marker on the map at therespective position.combOp (second reducer)willreceive these highlights as partial maps and combine them into a singlefinal output map.

Listing Variants

def aggregate[U:ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U):U

Examples 1

valz = sc.parallelize(List(1,2,3,4,5,6), 2)

// lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2],[partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1,val: 6])

z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note the final reduce include the initial value
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16

val z = sc.parallelize(List("a","b","c","d","e","f"),2)

//lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}

z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b],[partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1,val: f])

z.aggregate("")(_ + _, _+_)
res115: String = abcdef

// See here how the initial value "x" is applied three times.
// - once for each partition
// - once when combining all the partitions in the second reduce function.
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc

// Below are some more advanced examples. Some are quite tricky to work out.

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString,(x,y) => x + y)
res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString,(x,y) => x + y)
res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString,(x,y) => x + y)
res143: String = 10

The main issue with the code above is that the result ofthe inner min is a string oflength 1.
The zero in the output is due to the empty string being the last stringin the list. We see this result because we are not recursively reducingany further within the partition for the final string.

Examples 2

valz = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString,(x,y) => x + y)
res144: String = 11

In contrast to the previous example, this example has the empty stringat the beginning of the second partition. This results in length ofzero being input to the second reduce which then upgrades it a lengthof 1. (Warning: The above exampleshows bad design since the output is dependent on the order of the datainside the partitions.)

aggregateByKey [Pair]

Works like theaggregate function except the aggregation is applied to the values withthe same key. Also unlike the aggregate function the initial value isnot applied to the second reduce.

Listing Variants

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp:(U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Example

val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// lets have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect

res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val:(cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)],[partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))

cartesian

Computes the cartesian productbetween two RDDs (i.e. Each item of the first RDD is joined with eachitem of the second RDD) and returns them as a new RDD. (Warning: Be careful when using thisfunction.! Memory consumption can quickly become an issue!)

Listing Variants

def cartesian[U: ClassTag](other:RDD[U]): RDD[(T, U)]

Example

valx = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10),(2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10),(4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

checkpoint

Will create a checkpoint when the RDD is computed next. CheckpointedRDDs are stored as a binary file within the checkpoint directory whichcan be specified using the Spark context. (Warning: Spark applies lazy evaluation.Checkpointing will not occur until an action is invoked.)

Important note: the directory "my_directory_name" should exist inall slaves. As an alternative you could use an HDFS directory URL aswell.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")
val a = sc.parallelize(1 to 4)
a.checkpoint
a.count
14/02/25 18:13:53 INFO SparkContext: Starting job: count at<console>:15
...
14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as valuesto memory (estimated size 115.7 KB, free 296.3 MB)
14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 tofile:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11,new parent is RDD 12
res23: Long = 4

coalesce,repartition

Coalesces the associated data intoa given number of partitions. repartition(numPartitions)is simply an abbreviation for coalesce(numPartitions,shuffle = true).

Listing Variants

def coalesce ( numPartitions :Int , shuffle : Boolean = false ): RDD [T]
def repartition ( numPartitions : Int ): RDD [T]

Example

valy = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2

cogroup [Pair], groupWith[Pair]

A very powerful set of functionsthat allow grouping up to 3 key-value RDDs together using their keys.

Listing Variants

def cogroup[W](other: RDD[(K,W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K,(Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,(Iterable[V], Iterable[W]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]):RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)],numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)],partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1],Iterable[W2]))]
def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V],Iterable[W]))]
def groupWith[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]):RDD[(K, (Iterable[V], IterableW1], Iterable[W2]))]

Examples

vala = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c))),
(3,(ArrayBuffer(b),ArrayBuffer(c))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)

val d = a.map((_, "d"))
b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String],Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))
)

val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"),(4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1,"desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),
(2,(ArrayBuffer(banana),ArrayBuffer())),
(3,(ArrayBuffer(orange),ArrayBuffer())),
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),
(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect,toArray

Converts the RDD into a Scalaarray and returns it. If you provide a standard map-function (i.e. f =T -> U) it will be applied before inserting the values into theresult array.

Listing Variants

def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]

Example

valc = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

collectAsMap [Pair]

Similar to collect, but works on key-valueRDDs and converts them into Scala maps to preserve their key-valuestructure.

Listing Variants

def collectAsMap(): Map[K, V]

Example

vala = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collectAsMap
res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 ->3)

combineByKey[Pair]

Very efficient implementation thatcombines the values of a RDD consisting of two-component tuples byapplying multiple aggregators one after another.

Listing Variants

defcombineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) =>C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) =>C, mergeCombiners: (C, C) => C, partitioner: Partitioner,mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

Example

vala =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y ::x, (x:List[String], y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)),(2,List(gnu, rabbit, salmon, bee, bear, wolf)))

compute

Executes dependencies and computesthe actual representation of the RDD. This function should not becalled directly by users.

Listing Variants

def compute(split: Partition,context: TaskContext): Iterator[T]

context,sparkContext

Returns the SparkContext that was used tocreate the RDD.

Listing Variants

def compute(split: Partition,context: TaskContext): Iterator[T]

Example

valc = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.context
res8: org.apache.spark.SparkContext =org.apache.spark.SparkContext@58c1c2f1

count

Returns the number of items storedwithin a RDD.

Listing Variants

def count(): Long

Example

valc = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res2: Long = 4

countApprox

Marked as experimental feature! Experimental features are currently notcovered by this document!

Listing Variants

def (timeout: Long, confidence:Double = 0.95): PartialResult[BoundedDouble]

val rdd = sc.parallelize(1 to 1000, 10)
rdd.countApprox(100,0.5)
res27: org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble] = (final: [1000.000, 1000.000])
scala> rdd.countApprox(7,0.95)
res39: org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble] = (partial: [959.824, 1041.033])

countApproxDistinct

Computes the approximate number of distinct values. For large RDDswhich are spread across many nodes, this function may execute fasterthan other counting methods. The parameter relativeSD controls the accuracy ofthe computation.

Listing Variants

defcountApproxDistinct(relativeSD: Double = 0.05): Long

Example

vala = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224

b.countApproxDistinct(0.05)
res15: Long = 9750

b.countApproxDistinct(0.01)
res16: Long = 9947

b.countApproxDistinct(0.001)
res0: Long = 10000

countApproxDistinctByKey [Pair]

Similar to countApproxDistinct,but computes the approximate number of distinct values for eachdistinct key. Hence, the RDD must consist of two-component tuples. Forlarge RDDs which are spread across many nodes, this function mayexecute faster than other counting methods. The parameter relativeSD controls the accuracy ofthe computation.

Listing Variants

defcountApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int):RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner:Partitioner): RDD[(K, Long)]

Example

vala = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357),(Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455),(Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451),(Gnu,2521))

countByKey [Pair]

Very similar to count, but counts the values of a RDD consisting oftwo-component tuples for each distinct key separately.

Listing Variants

def countByKey(): Map[K, Long]

Example

valc = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3,"Dog")), 2)
c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)

countByKeyApprox [Pair]

Marked as experimental feature! Experimental features are currently notcovered by this document!

Listing Variants

def countByKeyApprox(timeout:Long, confidence: Double = 0.95): PartialResult[Map[K, BoundedDouble]]

countByValue

Returns a map that contains all unique values of the RDD and theirrespective occurrence counts. (Warning: This operation will finally aggregate the information in asingle reducer.)

Listing Variants

def countByValue(): Map[T, Long]

Example

valb = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3-> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox

Marked as experimental feature! Experimental features are currently notcovered by this document!

Listing Variants

def countByValueApprox(timeout:Long, confidence: Double = 0.95): PartialResult[Map[T, BoundedDouble]]

dependencies

Returns the RDD on which this RDD depends.

Listing Variants

final def dependencies:Seq[Dependency[_]]

Example

valb = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] atparallelize at <console>:12
b.dependencies.length
Int = 0

b.map(a => a).dependencies.length
res40: Int = 1

b.cartesian(a).dependencies.length
res41: Int = 2

b.cartesian(a).dependencies
res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct

Returns a new RDD that contains each unique value only once.

Listing Variants

def distinct(): RDD[T]
def distinct(numPartitions: Int): RDD[T]

Example

valc = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2

a.distinct(3).partitions.length
res17: Int = 3

first

Looks for the very first data item of the RDD and returns it.

Listing Variants

def first(): T

Example

valc = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res1: String = Gnu

filter

Evaluates a boolean function for each data item of the RDD and puts theitems for which the function returned trueinto the resulting RDD.

Listing Variants

def filter(f: T => Boolean):RDD[T]

Example

vala = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res3: Array[Int] = Array(2, 4, 6, 8, 10)

When you provide a filter function, it must be able to handle all dataitems contained in the RDD. Scala provides so-called partial functionsto deal with mixed data-types. (Tip: Partial functions are very usefulif you have some data which may be bad and you do not want to handlebut for the good data (matching data) you want to apply some kind ofmap function. The following article is good. It teaches you aboutpartial functions in a very nice way and explains why case has to beused for partial functions: article)

Examples for mixed data withoutpartial functions

valb = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.filter(_ < 4).collect
<console>:15: error: value < is not a member of Any

This fails because some components of a are not implicitly comparable against integers. Collectuses the isDefinedAtpropertyof a function-object to determine whether the test-function iscompatible with each data item. Only data items that pass this test (=filter)are then mapped usingthe function-object.

Examples for mixed data withpartial functions

vala = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.collect({case a: Int    => "is integer" |
           case b:String => "is string" }).collect
res17: Array[String] = Array(is string, is string, is integer, isstring)

val myfunc: PartialFunction[Any, Any] = {
case a: Int    => "is integer" |
case b: String => "is string" }
myfunc.isDefinedAt("")
res21: Boolean = true

myfunc.isDefinedAt(1)
res22: Boolean = true

myfunc.isDefinedAt(1.5)
res23: Boolean = false

Be careful! The above code works because it only checks the typeitself! If you use operations on this type, you have to explicitlydeclare what type you want instead of any. Otherwise the compiler does(apparently) not know what bytecode it should produce:

valmyfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}
<console>:10: error: value < is not a member of Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x < 4) =>"x"}
myfunc2: PartialFunction[Int,Any] = <function1>

filterByRange [Ordered]

Returns an RDD containing only the items in the key range specified.From our testing, it appears this only works if your data is in keyvalue pairs and it has already been sorted by key.

Listing Variants

def filterByRange(lower: K, upper: K): RDD[P]

Example

val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
val sortedRDD = randRDD.sortByKey()

sortedRDD.filterByRange(1, 3).collect
res66: Array[(Int, String)] = Array((1,screen), (2,cat), (3,book))

filterWith (deprecated)

This is an extended version of filter.It takes two function arguments. The first argument must conform to Int -> T and is executed onceper partition. It will transform the partition index to type T. The second function looks like (U, T) -> Boolean. T is the transformed partitionindex and U are the dataitems from the RDD. Finally the function has to return either true orfalse (i.e. Apply the filter).

Listing Variants

def filterWith[A:ClassTag](constructA: Int => A)(p: (T, A) => Boolean): RDD[T]

Example

vala = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) => b == 0).collect
res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) => a % (b+1) == 0).collect
res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) => b == "2").collect
res34: Array[Int] = Array(5, 6)

flatMap

Similar to map, but allowsemitting more than one item in the map function.

Listing Variants

def flatMap[U: ClassTag](f: T=> TraversableOnce[U]): RDD[U]

Example

vala = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5,1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3,4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

// The program below generates a random number of copies (up to 10) ofthe items in the list.
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5,5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9,9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues

Very similar to mapValues,but collapses the inherent structure of the values during mapping.

Listing Variants

def flatMapValues[U](f: V =>TraversableOnce[U]): RDD[(K, U)]

Example

vala = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther","eagle"), 2)
val b = a.map(x => (x.length, x))
b.flatMapValues("x" + _ + "x").collect
res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o), (3,g), (3,x),(5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l), (4,i),(4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p),(7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a),(5,g), (5,l), (5,e), (5,x))

flatMapWith (deprecated)

Similar to flatMap, butallows accessing the partition index or a derivative of the partitionindex from within the flatMap-function.

Listing Variants

def flatMapWith[A: ClassTag, U:ClassTag](constructA: Int => A, preservesPartitioning: Boolean =false)(f: (T, A) => Seq[U]): RDD[U]

Example

vala = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2,8, 2, 9)

fold

Aggregates the values of each partition. The aggregation variablewithin each partition is initialized with zeroValue.

Listing Variants

def fold(zeroValue: T)(op: (T, T)=> T): T

Example

vala = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6

foldByKey [Pair]

Very similar to fold, butperforms the folding separately for each key of the RDD. This functionis only available if the RDD consists of two-component tuples.

Listing Variants

def foldByKey(zeroValue: V)(func:(V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V):RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V)=> V): RDD[(K, V)]

Example

vala = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.foldByKey("")(_ + _).collect
res84: Array[(Int, String)] = Array((3,dogcatowlgnuant)

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther","eagle"), 2)
val b = a.map(x => (x.length, x))
b.foldByKey("")(_ + _).collect
res85: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther),(5,tigereagle))

foreach

Executes an parameterless function for each data item.

Listing Variants

def foreach(f: T => Unit)

Example

valc = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu","crocodile", "ant", "whale", "dolphin", "spider"), 3)
c.foreach(x => println(x + "s are yummy"))
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy

foreachPartition

Executes an parameterless function for each partition. Access to thedata items contained in the partition is provided via the iteratorargument.

Listing Variants

def foreachPartition(f:Iterator[T] => Unit)

Example

valb = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
b.foreachPartition(x => println(x.reduce(_ + _)))
6
15
24

foreachWith (Deprecated)

Executes an parameterless function for each partition. Access to thedata items contained in the partition is provided via the iteratorargument.

Listing Variants

def foreachWith[A:ClassTag](constructA: Int => A)(f: (T, A) => Unit)

Example

vala = sc.parallelize(1 to 9, 3)
a.foreachWith(i => i)((x,i) => if (x % 2 == 1 && i % 2 ==0) println(x) )
1
3
7
9

fullOuterJoin [Pair]

Performs the full outer join between two paired RDDs.

Listing Variants

def fullOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], Option[W]))]
def fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], Option[W]))]

Example

val pairRDD1 = sc.parallelize(List( ("cat",2), ("cat", 5), ("book", 4),("cat", 12)))
val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("cat", 12)))
pairRDD1.fullOuterJoin(pairRDD2).collect

res5: Array[(String, (Option[Int], Option[Int]))] =Array((book,(Some(4),None)), (mouse,(None,Some(4))),(cup,(None,Some(5))), (cat,(Some(2),Some(2))),(cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))),(cat,(Some(5),Some(12))), (cat,(Some(12),Some(2))),(cat,(Some(12),Some(12))))

generator, setGenerator

Allows setting a string that is attached to the end of the RDD's namewhen printing the dependency graph.

Listing Variants

@transient var generator
def setGenerator(_generator: String)

getCheckpointFile

Returns the path to the checkpoint file or null if RDD has not yet beencheckpointed.

Listing Variants

def getCheckpointFile:Option[String]

Example

sc.setCheckpointDir("/home/cloudera/Documents")
val a = sc.parallelize(1 to 500, 5)
val b = a++a++a++a++a
b.getCheckpointFile
res49: Option[String] = None

b.checkpoint
b.getCheckpointFile
res54: Option[String] = None

b.collect
b.getCheckpointFile
res57: Option[String] =Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-d56580787b20/rdd-40)

preferredLocations

Returns the hosts which are preferred by this RDD. The actualpreference of a specific host depends on various assumptions.

Listing Variants

final defpreferredLocations(split: Partition): Seq[String]

getStorageLevel

Retrieves the currently set storage level of the RDD. This can only beused to assign a new storage level if the RDD does not have a storagelevel set yet. The example below shows the error you will get, when youtry to reassign the storage level.

Listing Variants

def getStorageLevel

Example

vala = sc.parallelize(1 to 100000, 2)
a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
a.getStorageLevel.description
String = Disk Serialized 1x Replicated

a.cache
java.lang.UnsupportedOperationException: Cannot change storage level ofan RDD after it was already assigned a level

glom

Assembles an array that contains all elements of the partition andembeds it in an RDD. Each returned array contains the contents of one partition.

Listing Variants

def glom(): RDD[Array[T]]

Example

vala = sc.parallelize(1 to 100, 3)
a.glom.collect
res8: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78,79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96,97, 98, 99, 100))

groupBy

Listing Variants

def groupBy[K: ClassTag](f: T=> K): RDD[(K, Iterable[T])]
def groupBy[K: ClassTag](f: T => K, numPartitions: Int): RDD[(K,Iterable[T])]
def groupBy[K: ClassTag](f: T => K, p: Partitioner): RDD[(K,Iterable[T])]

Example

vala = sc.parallelize(1 to 9, 3)
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6,8)), (odd,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
a % 2
}
a.groupBy(myfunc).collect
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)),(1,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
a % 2
}
a.groupBy(x => myfunc(x), 3).collect
a.groupBy(myfunc(_), 1).collect
res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)),(1,ArrayBuffer(1, 3, 5, 7, 9)))

import org.apache.spark.Partitioner
class MyPartitioner extends Partitioner {
def numPartitions: Int = 2
def getPartition(key: Any): Int =
{
    key match
    {
      case null     => 0
      case key: Int =>key          %numPartitions
      case_        => key.hashCode %numPartitions
    }
}
override def equals(other: Any): Boolean =
{
    other match
    {
      case h: MyPartitioner => true
      case_               => false
    }
}
}
val a = sc.parallelize(1 to 9, 3)
val p = new MyPartitioner()
val b = a.groupBy((x:Int) => { x }, p)
val c = b.mapWith(i => i)((a, b) => (b, a))
c.collect
res42: Array[(Int, (Int, Seq[Int]))] = Array((0,(4,ArrayBuffer(4))),(0,(2,ArrayBuffer(2))), (0,(6,ArrayBuffer(6))), (0,(8,ArrayBuffer(8))),(1,(9,ArrayBuffer(9))), (1,(3,ArrayBuffer(3))), (1,(1,ArrayBuffer(1))),(1,(7,ArrayBuffer(7))), (1,(5,ArrayBuffer(5))))

groupByKey [Pair]

Very similar to groupBy, butinstead of supplying a function, the key-component of each pair willautomatically be presented to the partitioner.

Listing Variants

def groupByKey(): RDD[(K,Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

Example

vala = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider","eagle"), 2)
val b = a.keyBy(_.length)
b.groupByKey.collect
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)),(6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)),(5,ArrayBuffer(tiger, eagle)))

histogram [Double]

These functions take an RDD of doubles and create a histogram witheither even spacing (the number of buckets equals to bucketCount)or arbitrary spacing based on custom bucket boundaries suppliedby the user via an array of double values. The result type of bothvariants is slightly different, the first function will return a tupleconsisting of two arrays. The first array contains the computed bucketboundary values and the second array contains the corresponding countof values (i.e. the histogram).The second variant of the function will just return the histogram as anarray of integers.

Listing Variants

def histogram(bucketCount: Int):Pair[Array[Double], Array[Long]]
def histogram(buckets: Array[Double], evenBuckets: Boolean = false):Array[Long]

Example with even spacing

vala = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8,9.0), 3)
a.histogram(5)
res11: (Array[Double], Array[Long]) = (Array(1.1, 2.68, 4.26, 5.84,7.42, 9.0),Array(5, 0, 0, 1, 4))

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1,7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.histogram(6)
res18: (Array[Double], Array[Long]) = (Array(1.0, 2.5, 4.0, 5.5, 7.0,8.5, 10.0),Array(6, 0, 1, 1, 3, 4))

Example with custom spacing

vala = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8,9.0), 3)
a.histogram(Array(0.0, 3.0, 8.0))
res14: Array[Long] = Array(5, 3)

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1,7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.histogram(Array(0.0, 5.0, 10.0))
res1: Array[Long] = Array(6, 9)

a.histogram(Array(0.0, 5.0, 10.0, 15.0))
res1: Array[Long] = Array(6, 8, 1)

id

Retrieves the ID which has been assigned to the RDD by its devicecontext.

Listing Variants

val id: Int

Example

valy = sc.parallelize(1 to 10, 10)
y.id
res16: Int = 19

intersection

Returns the elements in the two RDDs which are the same.

Listing Variants

def intersection(other: RDD[T],numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord:Ordering[T] = null): RDD[T]
def intersection(other: RDD[T]): RDD[T]

Example

val x = sc.parallelize(1to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)

z.collect
res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)

isCheckpointed

Indicates whether the RDD has been checkpointed. The flag will onlyraise once the checkpoint has really been created.

Listing Variants

def isCheckpointed: Boolean

Example

sc.setCheckpointDir("/home/cloudera/Documents")
c.isCheckpointed
res6: Boolean = false

c.checkpoint
c.isCheckpointed
res8: Boolean = false

c.collect
c.isCheckpointed
res9: Boolean = true

iterator

Returns a compatible iterator object for a partition of this RDD. Thisfunction should never be called directly.

Listing Variants

final def iterator(split:Partition, context: TaskContext): Iterator[T]

join[Pair]

Performs an inner join using two key-value RDDs. Please note that thekeys must be generally comparable to make this work.

Listing Variants

def join[W](other: RDD[(K, W)]):RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V,W))]

Example

vala = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"),3)
val b = a.keyBy(_.length)
val c =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)
val d = c.keyBy(_.length)
b.join(d).collect

res0:Array[(Int, (String, String))] = Array((6,(salmon,salmon)),(6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)),(6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)),(3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)),(3,(rat,gnu)), (3,(rat,bee)))

keyBy

Constructs two-component tuples (key-value pairs) by applying afunction on each data item. The result of the function becomes the keyand the original data item becomes the value of the newly createdtuples.

Listing Variants

def keyBy[K](f: T => K):RDD[(K, T)]

Example

vala = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"),3)
val b = a.keyBy(_.length)
b.collect
res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon),(3,rat), (8,elephant))

keys [Pair]

Extracts the keys from all contained tuples and returnsthem in a new RDD.

Listing Variants

def keys: RDD[K]

Example

vala = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther","eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

leftOuterJoin [Pair]

Performs an left outer join using two key-value RDDs. Please note thatthe keys must be generally comparable to make this work correctly.

Listing Variants

def leftOuterJoin[W](other:RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K,(V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner):RDD[(K, (V, Option[W]))]

Example

vala = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"),3)
val b = a.keyBy(_.length)
val c =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect

res1: Array[(Int, (String, Option[String]))] =Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))),(6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))),(6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))),(3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(dog,Some(gnu))),(3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,(rat,Some(cat))),(3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None)))

lookup

Scans the RDD for all keys that match the provided value and returnstheir values as a Scala sequence.

Listing Variants

def lookup(key: K): Seq[V]

Example

vala = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther","eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)
res0: Seq[String] = WrappedArray(tiger, eagle)

map

Applies a transformation function on each item of the RDD and returnsthe result as a new RDD.

Listing Variants

def map[U: ClassTag](f: T =>U): RDD[U]

Example

vala = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"),3)
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6),(rat,3), (elephant,8))

mapPartitions

This is a specialized map that is called only once for each partition.The entire content of the respective partitions is available as asequential stream of values via the input argument ( Iterarator[T]). The custom functionmust return yet another Iterator[U].The combined result iterators are automatically converted into a newRDD. Please note, that the tuples (3,4) and (6,7) are missing from thefollowing result due to the partitioning we chose.

Listing Variants

def mapPartitions[U: ClassTag](f:Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false):RDD[U]

Example 1

vala = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
var res = List[(T, T)]()
var pre = iter.next
while (iter.hasNext)
{
    val cur = iter.next;
    res .::= (pre, cur)
    pre = cur;
}
res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9),(7,8))

Example 2

valx = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9,10),3)
def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
var res = List[Int]()
while (iter.hasNext) {
val cur = iter.next;
res = res :::List.fill(scala.util.Random.nextInt(10))(cur)
}
res.iterator
}
x.mapPartitions(myfunc).collect
// some of the number are not outputted at all. This is because therandom number generated for it is zero.
res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4,4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)

The above program can also be written using flatMap as follows.

Example 2 using flatmap

valx = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5,5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9,9, 10, 10, 10, 10, 10, 10, 10, 10)

mapPartitionsWithContext (deprecated and developer API)

Similar to mapPartitions, butallows accessing information about the processing state within themapper.

Listing Variants

def mapPartitionsWithContext[U:ClassTag](f: (TaskContext, Iterator[T]) => Iterator[U],preservesPartitioning: Boolean = false): RDD[U]

Example

vala = sc.parallelize(1 to 9, 3)
import org.apache.spark.TaskContext
def myfunc(tc: TaskContext, iter: Iterator[Int]) : Iterator[Int] = {
tc.addOnCompleteCallback(() => println(
"Partition: " +tc.partitionId +
", AttemptID: " +tc.attemptId ))

iter.toList.filter(_ % 2 == 0).iterator
}
a.mapPartitionsWithContext(myfunc).collect

14/04/01 23:05:48 INFO SparkContext: Starting job: collect at<console>:20
...
14/04/01 23:05:48 INFO Executor: Running task ID 0
Partition: 0, AttemptID: 0, Interrupted: false
...
14/04/01 23:05:48 INFO Executor: Running task ID 1
14/04/01 23:05:48 INFO TaskSetManager: Finished TID 0 in 470 ms onlocalhost (progress: 0/3)
...
14/04/01 23:05:48 INFO Executor: Running task ID 2
14/04/01 23:05:48 INFO TaskSetManager: Finished TID 1 in 23 ms onlocalhost (progress: 1/3)
14/04/01 23:05:48 INFO DAGScheduler: Completed ResultTask(0, 1)

?
res0: Array[Int] = Array(2, 6, 4, 8)

mapPartitionsWithIndex

Similar to mapPartitions, buttakes two parameters. The first parameter is the index of the partitionand the second is an iterator through all the items within thispartition. The output is an iterator containing the list of items afterapplying whatever transformation the function encodes.

Listing Variants

def mapPartitionsWithIndex[U:ClassTag](f: (Int, Iterator[T]) => Iterator[U],preservesPartitioning: Boolean = false): RDD[U]

Example

valx = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
iter.toList.map(x => index + "," + x).iterator
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8,2,9, 2,10)

mapPartitionsWithSplit

This method has been marked as deprecated in the API. So, you shouldnot use this method anymore. Deprecated methods will not be covered inthis document.

Listing Variants

def mapPartitionsWithSplit[U:ClassTag](f: (Int, Iterator[T]) => Iterator[U],preservesPartitioning: Boolean = false): RDD[U]

mapValues [Pair]

Takes the values of a RDD that consists of two-component tuples, andapplies the provided function to transform each value. Then, it formsnew two-component tuples using the key and the transformed value andstores them in a new RDD.

Listing Variants

def mapValues[U](f: V => U):RDD[(K, U)]

Example

vala = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther","eagle"), 2)
val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx),(3,xcatx), (7,xpantherx), (5,xeaglex))

mapWith (deprecated)

This is an extended version of map.It takes two function arguments. The first argument must conform to Int -> T and is executed onceper partition. It will map the partition index to some transformedpartition index of type T.This is where it is nice to do some kind of initialization code onceper partition. Like create a Random number generator object.The second function must conform to (U,T) -> U. T is thetransformed partition index and Uis a data item of the RDD. Finally the function has to return atransformed data item of type U.

Listing Variants

def mapWith[A: ClassTag, U:ClassTag](constructA: Int => A, preservesPartitioning: Boolean =false)(f: (T, A) => U): RDD[U]

Example

//generates 9 random numbers less than 1000.
val x = sc.parallelize(1 to 9, 3)
x.mapWith(a => new scala.util.Random)((x, r) =>r.nextInt(1000)).collect
res0: Array[Int] = Array(940, 51, 779, 742, 757, 982, 35, 800, 15)

val a = sc.parallelize(1 to 9, 3)
val b = a.mapWith("Index:" + _)((a, b) => ("Value:" + a, b))
b.collect
res0: Array[(String, String)] = Array((Value:1,Index:0),(Value:2,Index:0), (Value:3,Index:0), (Value:4,Index:1),(Value:5,Index:1), (Value:6,Index:1), (Value:7,Index:2),(Value:8,Index:2), (Value:9,Index:2)

max

Returns the largest element in the RDD

Listing Variants

def max()(implicit ord:Ordering[T]): T

Example

val y = sc.parallelize(10to 30)
y.max
res75: Int = 30

val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (18, "cat")))
a.max
res6: (Int, String) = (18,cat)

mean [Double],meanApprox [Double]

Calls stats and extracts themean component. The approximate version of the function can finishsomewhat faster in some scenarios. However, it trades accuracy forspeed.

Listing Variants

def mean(): Double
def meanApprox(timeout: Long, confidence: Double = 0.95):PartialResult[BoundedDouble]

Example

vala = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4,7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.mean
res0: Double = 5.3

min

Returns the smallest element in the RDD

Listing Variants

def min()(implicit ord:Ordering[T]): T

Example

val y = sc.parallelize(10to 30)
y.min
res75: Int = 10

val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (8, "cat")))
a.min
res4: (Int, String) = (3,tiger)

name, setName

Allows a RDD to be tagged with a custom name.

Listing Variants

@transient var name: String
def setName(_name: String)

Example

valy = sc.parallelize(1 to 10, 10)
y.name
res13: String = null
y.setName("Fancy RDD Name")
y.name
res15: String = Fancy RDD Name

partitionBy [Pair]

Repartitions as key-value RDD using its keys. The partitionerimplementation can be supplied as the first argument.

Listing Variants

def partitionBy(partitioner:Partitioner): RDD[(K, V)]

partitioner

Specifies a function pointer to the default partitioner that will beused for groupBy, subtract, reduceByKey (from PairedRDDFunctions), etc. functions.

Listing Variants

@transient val partitioner:Option[Partitioner]

partitions

Returns an array of the partition objects associated with this RDD.

Listing Variants

final def partitions:Array[Partition]

Example

valb = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
b.partitions
res48: Array[org.apache.spark.Partition] =Array(org.apache.spark.rdd.ParallelCollectionPartition@18aa,org.apache.spark.rdd.ParallelCollectionPartition@18ab)

persist, cache

These functions can be used to adjust the storage level of a RDD. Whenfreeing up memory, Spark will use the storage level identifier todecide which partitions should be kept. The parameterless variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY). (Warning: Once the storage level has beenchanged, it cannot be changed again!)

Listing Variants

def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]

Example

valc = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false,false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true,false, true, 1)

pipe

Takes the RDD data of each partition and sends it via stdin to ashell-command. The resulting output of the command is captured andreturned as a RDD of string values.

Listing Variants

def pipe(command: String):RDD[String]
def pipe(command: String, env: Map[String, String]): RDD[String]
def pipe(command: Seq[String], env: Map[String, String] = Map(),printPipeContext: (String => Unit) => Unit = null,printRDDElement: (T, String => Unit) => Unit = null): RDD[String]

Example

vala = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res2: Array[String] = Array(1, 4, 7)

randomSplit

Randomly splits an RDD into multiple smaller RDDs according to aweights Array which specifies the percentage of the total data elementsthat is assigned to each smaller RDD. Note the actual size of eachsmaller RDD is only approximately equal to the percentages specified bythe weights Array. The second example below shows the number of itemsin each smaller RDD does not exactly match the weights Array. Arandom optional seed can be specified. This function is useful forspliting data into a training set and a testing set for machinelearning.

Listing Variants

def randomSplit(weights:Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

Example

val y = sc.parallelize(1to 10)
val splits = y.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
training.collect
res:85 Array[Int] = Array(1, 4, 5, 6, 8, 10)
test.collect
res86: Array[Int] = Array(2, 3, 7, 9)

val y = sc.parallelize(1 to 10)
val splits = y.randomSplit(Array(0.1, 0.3, 0.6))

val rdd1 = splits(0)
val rdd2 = splits(1)
val rdd3 = splits(2)

rdd1.collect
res87: Array[Int] = Array(4, 10)
rdd2.collect
res88: Array[Int] = Array(1, 3, 5, 8)
rdd3.collect
res91: Array[Int] = Array(2, 6, 7, 9)

reduce

This function provides the well-known reducefunctionality in Spark. Please note that any function f you provide, should becommutative in order to generate reproducible results.

Listing Variants

def reduce(f: (T, T) => T): T

Example

vala = sc.parallelize(1 to 100, 3)
a.reduce(_ + _)
res41: Int = 5050

reduceByKey [Pair], reduceByKeyLocally[Pair], reduceByKeyToDriver[Pair]

This function provides the well-known reducefunctionality in Spark. Please note that any function f you provide, should becommutative in order to generate reproducible results.

Listing Variants

def reduceByKey(func: (V, V)=> V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V):RDD[(K, V)]
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
def reduceByKeyToDriver(func: (V, V) => V): Map[K, V]

Example

vala = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther","eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther),(5,tigereagle))

repartition

This function changes the number of partitions to the number specified by the numPartitions parameter

Listing Variants

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

Example

val rdd = sc.parallelize(List(1, 2, 10, 4, 5, 2, 1, 1, 1), 3)
rdd.partitions.length
res2: Int = 3
val rdd2 = rdd.repartition(5)
rdd2.partitions.length
res6: Int = 5

repartitionAndSortWithinPartitions [Ordered]

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.

Listing Variants

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

Example

// first we will do range partitioning which is not sorted
val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
val rPartitioner = new org.apache.spark.RangePartitioner(3, randRDD)
val partitioned = randRDD.partitionBy(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
partitioned.mapPartitionsWithIndex(myfunc).collect

res0: Array[String] = Array([partID:0, val: (2,cat)], [partID:0, val:(3,book)], [partID:0, val: (1,screen)], [partID:1, val: (4,tv)],[partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2,val: (7,cup)])

// now lets repartition but this time have it sorted
val partitioned = randRDD.repartitionAndSortWithinPartitions(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
partitioned.mapPartitionsWithIndex(myfunc).collect

res1: Array[String] = Array([partID:0, val: (1,screen)], [partID:0,val: (2,cat)], [partID:0, val: (3,book)], [partID:1, val: (4,tv)],[partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2,val: (7,cup)])

rightOuterJoin[Pair]

Performs an right outer join using two key-value RDDs. Please note thatthe keys must be generally comparable to make this work correctly.

Listing Variants

def rightOuterJoin[W](other:RDD[(K, W)]): RDD[(K, (Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K,(Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner):RDD[(K, (Option[V], W))]

Example

vala = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"),3)
val b = a.keyBy(_.length)
val c =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)
val d = c.keyBy(_.length)
b.rightOuterJoin(d).collect

res2: Array[(Int, (Option[String], String))] =Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)),(6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)),(6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)),(3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)),(3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)),(3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)),(4,(None,bear)))

sample

Randomly selects a fraction of the items of a RDD and returns them in anew RDD.

Listing Variants

def sample(withReplacement:Boolean, fraction: Double, seed: Int): RDD[T]

Example

vala = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1, 0).count
res24: Long = 960

a.sample(true, 0.3, 0).count
res25: Long = 2888

a.sample(true, 0.3, 13).count
res26: Long = 2985

sampleByKey [Pair]

Randomly samples the key value pair RDD according to the fraction of each key you want to appear in the final RDD.

Listing Variants

def sampleByKey(withReplacement: Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K, V)]

Example

val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater")))
val sampleMap = List((7, 0.4), (6, 0.6)).toMap
randRDD.sampleByKey(false, sampleMap,42).collect

res6: Array[(Int, String)] = Array((7,cat), (6,mouse), (6,book), (6,screen), (7,heater))

sampleByKeyExact [Pair, experimental]

This is labelled as experimental and so we do not document it.

Listing Variants

def sampleByKeyExact(withReplacement: Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K, V)]

saveAsHadoopFile [Pair],saveAsHadoopDataset[Pair], saveAsNewAPIHadoopFile[Pair]

Saves the RDD in a Hadoop compatible format using any HadoopoutputFormat class the user specifies.

Listing Variants

def saveAsHadoopDataset(conf:JobConf)
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String)(implicitfm: ClassTag[F])
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String, codec:Class[_ <: CompressionCodec]) (implicit fm: ClassTag[F])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass:Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], codec:Class[_ <: CompressionCodec])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass:Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf:JobConf = new JobConf(self.context.hadoopConfiguration), codec:Option[Class[_ <: CompressionCodec]] = None)
def saveAsNewAPIHadoopFile[F <: NewOutputFormat[K, V]](path:String)(implicit fm: ClassTag[F])
def saveAsNewAPIHadoopFile(path: String, keyClass: Class[_],valueClass: Class[_], outputFormatClass: Class[_ <:NewOutputFormat[_, _]], conf: Configuration =self.context.hadoopConfiguration)

saveAsObjectFile

Saves the RDD in binary format.

Listing Variants

def saveAsObjectFile(path: String)

Example

valx = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect
res52: Array[Int] = Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9,10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,100)

saveAsSequenceFile [SeqFile]

Saves the RDD as a Hadoop sequence file.

Listing Variants

def saveAsSequenceFile(path:String, codec: Option[Class[_ <: CompressionCodec]] = None)

Example

valv = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2),("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
14/04/19 05:45:43 INFO FileOutputCommitter: Saved output of task'attempt_201404190545_0000_m_000001_191' tofile:/home/cloudera/hd_seq_file

[cloudera@localhost ~]$ ll ~/hd_seq_file
total 8
-rwxr-xr-x 1 cloudera cloudera 117 Apr 19 05:45 part-00000
-rwxr-xr-x 1 cloudera cloudera 133 Apr 19 05:45 part-00001
-rwxr-xr-x 1 cloudera cloudera 0 Apr 19 05:45 _SUCCESS

saveAsTextFile

Saves the RDD as text files. One line at a time.

Listing Variants

def saveAsTextFile(path: String)
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec])

Example without compression

vala = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")
14/04/03 21:11:36 INFO FileOutputCommitter: Saved output of task'attempt_201404032111_0000_m_000002_71' tofile:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a

[cloudera@localhost ~]$ head -n 5~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/part-00000
1
2
3
4
5

// Produces 3 output files since we have created the a RDD with 3partitions
[cloudera@localhost ~]$ ll~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/
-rwxr-xr-x 1 cloudera cloudera 15558 Apr 3 21:11 part-00000
-rwxr-xr-x 1 cloudera cloudera 16665 Apr 3 21:11 part-00001
-rwxr-xr-x 1 cloudera cloudera 16671 Apr 3 21:11 part-00002

Example with compression

importorg.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])

[cloudera@localhost ~]$ ll~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_b/
total 24
-rwxr-xr-x 1 cloudera cloudera 7276 Apr 3 21:29 part-00000.gz
-rwxr-xr-x 1 cloudera cloudera 6517 Apr 3 21:29 part-00001.gz
-rwxr-xr-x 1 cloudera cloudera 6525 Apr 3 21:29 part-00002.gz

val x = sc.textFile("mydata_b")
x.count
res2: Long = 10000

Example writing into HDFS

valx = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/cloudera/test");

val sp = sc.textFile("hdfs://localhost:8020/user/cloudera/sp_data")
sp.flatMap(_.split("")).saveAsTextFile("hdfs://localhost:8020/user/cloudera/sp_x")

stats [Double]

Simultaneously computes the mean, variance and the standard deviationof all values in the RDD.

Listing Variants

def stats(): StatCounter

Example

valx = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09,21.0), 2)
x.stats
res16: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667,stdev: 8.126859)

sortBy

This function sorts the input RDD's data and stores it in a new RDD.The first parameter requires you to specify a function which mapsthe input data into the key that you want to sortBy. The secondparameter (optional) specifies whether you want the data to be sortedin ascending or descending order.

Listing Variants

def sortBy[K](f: (T) ⇒ K,ascending: Boolean = true, numPartitions: Int =this.partitions.size)(implicit ord: Ordering[K], ctag: ClassTag[K]):RDD[T]

Example

val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
res101: Array[Int] = Array(1, 1, 2, 3, 5, 7)

y.sortBy(c => c, false).collect
res102: Array[Int] = Array(7, 5, 3, 2, 1, 1)

val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))
z.sortBy(c => c._1, true).collect
res109: Array[(String, Int)] = Array((A,26), (H,10), (L,5), (Z,1))

z.sortBy(c => c._2, true).collect
res108: Array[(String, Int)] = Array((Z,1), (L,5), (H,10), (A,26))

sortByKey [Ordered]

This function sorts the input RDD's data and stores it in a new RDD.The output RDD is a shuffled RDD because it stores data that is outputby a reducer which has been shuffled. The implementation of thisfunction is actually very clever. First, it uses a range partitioner topartition the data in ranges within the shuffled RDD. Then it sortsthese ranges individually with mapPartitions using standard sortmechanisms.

Listing Variants

def sortByKey(ascending: Boolean= true, numPartitions: Int = self.partitions.size): RDD[P]

Example

vala = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4),(owl,3))
c.sortByKey(false).collect
res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2),(ant,5))

val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
res56: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65),(52,4))

stdev [Double],sampleStdev [Double]

Calls stats and extractseither stdev-component orcorrected sampleStdev-component.

Listing Variants

def stdev(): Double
def sampleStdev(): Double

Example

vald = sc.parallelize(List(0.0, 0.0, 0.0), 3)
d.stdev
res10: Double = 0.0
d.sampleStdev
res11: Double = 0.0

val d = sc.parallelize(List(0.0, 1.0), 3)
d.stdev
d.sampleStdev
res18: Double = 0.5
res19: Double = 0.7071067811865476

val d = sc.parallelize(List(0.0, 0.0, 1.0), 3)
d.stdev
res14: Double = 0.4714045207910317
d.sampleStdev
res15: Double = 0.5773502691896257

subtract

Performs the well known standard set subtraction operation: A - B

Listing Variants

def subtract(other: RDD[T]):RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], p: Partitioner): RDD[T]

Example

vala = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.collect
res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)

subtractByKey [Pair]

Very similar to subtract, butinstead of supplying a function, the key-component of each pair will beautomatically used as criterion for removing items from the first RDD.

Listing Variants

def subtractByKey[W:ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int):RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner):RDD[(K, V)]

Example

vala = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider","eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)
b.subtractByKey(d).collect
res15: Array[(Int, String)] = Array((4,lion))

sum [Double],sumApprox [Double]

Computes the sum of all values contained in the RDD. The approximateversion of the function can finish somewhat faster in some scenarios.However, it trades accuracy for speed.

Listing Variants

def sum(): Double
def sumApprox(timeout: Long, confidence: Double = 0.95):PartialResult[BoundedDouble]

Example

valx = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09,21.0), 2)
x.sum
res17: Double = 101.39999999999999

take

Extracts the first n items ofthe RDD and returns them as an array. (Note:This sounds very easy, but it is actually quite a tricky problem forthe implementors of Spark because the items in question can be in manydifferent partitions.)

Listing Variants

def take(num: Int): Array[T]

Example

valb = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.take(2)
res18: Array[String] = Array(dog, cat)

val b = sc.parallelize(1 to 10000, 5000)
b.take(100)
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

takeOrdered

Orders the data items of the RDD using their inherent implicit orderingfunction and returns the first nitems as an array.

Listing Variants

def takeOrdered(num:Int)(implicit ord: Ordering[T]): Array[T]

Example

valb = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.takeOrdered(2)
res19: Array[String] = Array(ape, cat)

takeSample

Behaves different from samplein the following respects:

It will return an exact number of samples (Hint: 2nd parameter)
It returns an Array instead of RDD.
It internally randomizes the order of the itemsreturned.

Listing Variants

def takeSample(withReplacement:Boolean, num: Int, seed: Int): Array[T]

Example

valx = sc.parallelize(1 to 1000, 3)
x.takeSample(true, 100, 1)
res3: Array[Int] = Array(339, 718, 810, 105, 71, 268, 333, 360, 341,300, 68, 848, 431, 449, 773, 172, 802, 339, 431, 285, 937, 301, 167,69, 330, 864, 40, 645, 65, 349, 613, 468, 982, 314, 160, 675, 232, 794,577, 571, 805, 317, 136, 860, 522, 45, 628, 178, 321, 482, 657, 114,332, 728, 901, 290, 175, 876, 227, 130, 863, 773, 559, 301, 694, 460,839, 952, 664, 851, 260, 729, 823, 880, 792, 964, 614, 821, 683, 364,80, 875, 813, 951, 663, 344, 546, 918, 436, 451, 397, 670, 756, 512,391, 70, 213, 896, 123, 858)

toDebugString

Returns a string that contains debug information about the RDD and itsdependencies.

Listing Variants

def toDebugString: String

Example

vala = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res6: String =
MappedRDD[15] at subtract at <console>:16 (3 partitions)
SubtractedRDD[14] at subtract at <console>:16 (3partitions)
    MappedRDD[12] at subtract at <console>:16 (3partitions)
      ParallelCollectionRDD[10] at parallelizeat <console>:12 (3 partitions)
    MappedRDD[13] at subtract at <console>:16 (3partitions)
      ParallelCollectionRDD[11] at parallelizeat <console>:12 (3 partitions)

toJavaRDD

Embeds this RDD object within a JavaRDD object and returns it.

Listing Variants

def toJavaRDD() : JavaRDD[T]

Example

valc = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.toJavaRDD
res3: org.apache.spark.api.java.JavaRDD[String] =ParallelCollectionRDD[6] at parallelize at <console>:12

toLocalIterator

Converts the RDD into a scala iterator at the master node.

Listing Variants

def toLocalIterator: Iterator[T]

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
val iter = z.toLocalIterator

iter.next
res51: Int = 1

iter.next
res52: Int = 2

top

Utilizes the implicit ordering of $T$ to determine the top $k$ valuesand returns them as an array.

Listing Variants

ddef top(num: Int)(implicit ord:Ordering[T]): Array[T]

Example

valc = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res28: Array[Int] = Array(9, 8)

toString

Assembles a human-readable textual description of the RDD.

Listing Variants

override def toString: String

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.toString
res61: String = ParallelCollectionRDD[80] at parallelize at <console>:21

val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater")))
val sortedRDD = randRDD.sortByKey()
sortedRDD.toString
res64: String = ShuffledRDD[88] at sortByKey at <console>:23

treeAggregate

Computes the same thing as aggregate, except it aggregates the elementsof the RDD in a multi-level tree pattern. Another difference is that itdoes not use the initial value for the second reduce function(combOp). By default a tree of depth 2 is used, but this can bechanged via the depth parameter.

Listing Variants

def treeAggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0: ClassTag[U]): U

Example

treeReduce

Works like reduce except reduces the elements of the RDD in a multi-level tree pattern.

Listing Variants

def treeReduce(f: (T, T) ⇒ T, depth: Int = 2): T

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.treeReduce(_+_)
res49: Int = 21

union, ++

Performs the standard set operation: A union B

Listing Variants

def ++(other: RDD[T]): RDD[T]
def union(other: RDD[T]): RDD[T]

Example

vala = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)

unpersist

Dematerializes the RDD (i.e. Erasesall data items from hard-disk and memory).However, the RDD object remains. If it is referenced in a computation,Spark will regenerate it automatically using the stored dependencygraph.

Listing Variants

def unpersist(blocking: Boolean =true): RDD[T]

Example

valy = sc.parallelize(1 to 10, 10)
val z = (y++y)
z.collect
z.unpersist(true)
14/04/19 03:04:57 INFO UnionRDD: Removing RDD 22 from persistence list
14/04/19 03:04:57 INFO BlockManager: Removing RDD 22

values

Extracts the values from all contained tuples and returns them in a newRDD.

Listing Variants

def values: RDD[V]

Example

vala = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther","eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res3: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)

variance [Double],sampleVariance[Double]

Calls stats and extracts either variance-componentor corrected sampleVariance-component.

Listing Variants

def variance(): Double
def sampleVariance(): Double

Example

vala = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4,7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.variance
res70: Double = 10.605333333333332

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29,11.09, 21.0), 2)
x.variance
res14: Double = 66.04584444444443

x.sampleVariance
res13: Double = 74.30157499999999

zip

Joins two RDDs by combining the i-th of either partition with eachother. The resulting RDD will consist of two-component tuples which areinterpreted as key-value pairs by the methods provided by thePairRDDFunctions extension.

Listing Variants

def zip[U: ClassTag](other:RDD[U]): RDD[(T, U)]

Example

vala = sc.parallelize(1 to 100, 3)
val b = sc.parallelize(101 to 200, 3)
a.zip(b).collect
res1: Array[(Int, Int)] = Array((1,101), (2,102), (3,103), (4,104),(5,105), (6,106), (7,107), (8,108), (9,109), (10,110), (11,111),(12,112), (13,113), (14,114), (15,115), (16,116), (17,117), (18,118),(19,119), (20,120), (21,121), (22,122), (23,123), (24,124), (25,125),(26,126), (27,127), (28,128), (29,129), (30,130), (31,131), (32,132),(33,133), (34,134), (35,135), (36,136), (37,137), (38,138), (39,139),(40,140), (41,141), (42,142), (43,143), (44,144), (45,145), (46,146),(47,147), (48,148), (49,149), (50,150), (51,151), (52,152), (53,153),(54,154), (55,155), (56,156), (57,157), (58,158), (59,159), (60,160),(61,161), (62,162), (63,163), (64,164), (65,165), (66,166), (67,167),(68,168), (69,169), (70,170), (71,171), (72,172), (73,173), (74,174),(75,175), (76,176), (77,177), (78,...

val a = sc.parallelize(1 to 100, 3)
val b = sc.parallelize(101 to 200, 3)
val c = sc.parallelize(201 to 300, 3)
a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2 )).collect
res12: Array[(Int, Int, Int)] = Array((1,101,201), (2,102,202),(3,103,203), (4,104,204), (5,105,205), (6,106,206), (7,107,207),(8,108,208), (9,109,209), (10,110,210), (11,111,211), (12,112,212),(13,113,213), (14,114,214), (15,115,215), (16,116,216), (17,117,217),(18,118,218), (19,119,219), (20,120,220), (21,121,221), (22,122,222),(23,123,223), (24,124,224), (25,125,225), (26,126,226), (27,127,227),(28,128,228), (29,129,229), (30,130,230), (31,131,231), (32,132,232),(33,133,233), (34,134,234), (35,135,235), (36,136,236), (37,137,237),(38,138,238), (39,139,239), (40,140,240), (41,141,241), (42,142,242),(43,143,243), (44,144,244), (45,145,245), (46,146,246), (47,147,247),(48,148,248), (49,149,249), (50,150,250), (51,151,251), (52,152,252),(53,153,253), (54,154,254), (55,155,255)...

zipParititions

Similar to zip. But providesmore control over the zipping process.

Listing Variants

def zipPartitions[B: ClassTag, V:ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) =>Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B],preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) =>Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B],rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) =>Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B],rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T],Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V:ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T],Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V:ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D],preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B],Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]

Example

vala = sc.parallelize(0 to 9, 3)
val b = sc.parallelize(10 to 19, 3)
val c = sc.parallelize(100 to 109, 3)
def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer:Iterator[Int]): Iterator[String] =
{
var res = List[String]()
while (aiter.hasNext && biter.hasNext &&citer.hasNext)
{
val x = aiter.next + " " + biter.next + " " +citer.next
res ::= x
}
res.iterator
}
a.zipPartitions(b, c)(myfunc).collect
res50: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 414 104, 3 13 103, 9 19 109, 8 18 108, 7 17 107, 6 16 106)

zipWithIndex

Zips the elements of the RDD with its element indexes. The indexesstart from 0. If the RDD is spread across multiple partitions then aspark Job is started to perform this operation.

Listing Variants

def zipWithIndex(): RDD[(T, Long)]

Example

val z =sc.parallelize(Array("A", "B", "C", "D"))
val r = z.zipWithIndex
res110: Array[(String, Long)] = Array((A,0), (B,1), (C,2), (D,3))

val z = sc.parallelize(100 to 120, 5)
val r = z.zipWithIndex
r.collect
res11:Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4),(105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (111,11),(112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18),(119,19), (120,20))

zipWithUniqueId

This is different from zipWithIndex since just gives a unique id toeach data element but the ids may not match the index number of thedata element. This operation does not start a spark job even if the RDDis spread across multiple partitions.
Compare the results of the example below with that of the 2nd exampleof zipWithIndex. You should be able to see the difference.

Listing Variants

def zipWithUniqueId(): RDD[(T,Long)]

Example

val z = sc.parallelize(100to 120, 5)
val r = z.zipWithUniqueId
r.collect

res12:Array[(Int, Long)] = Array((100,0), (101,5), (102,10), (103,15),(104,1), (105,6), (106,11), (107,16), (108,2), (109,7), (110,12),(111,17), (112,3), (113,8), (114,13), (115,18), (116,4), (117,9),(118,14), (119,19), (120,24))