Reposted from: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
The RDD API By Example
RDD is short for Resilient Distributed Dataset. RDDs are the workhorse of the Spark system. As a user, one can consider an RDD as a handle for a collection of individual data partitions, which are the result of some computation.
However, an RDD is actually more than that. On cluster installations, separate data partitions can be on separate nodes. Using the RDD as a handle one can access all partitions and perform computations and transformations using the contained data. Whenever a part of an RDD or an entire RDD is lost, the system is able to reconstruct the data of lost partitions by using lineage information. Lineage refers to the sequence of transformations used to produce the current RDD. As a result, Spark is able to recover automatically from most failures.
All RDDs available in Spark derive either directly or indirectly from the class RDD. This class comes with a large set of methods that perform operations on the data within the associated partitions. The class RDD is abstract. Whenever one uses an RDD, one is actually using a concrete implementation of RDD. These implementations have to override some core functions to make the RDD behave as expected.
One reason why Spark has lately become a very popular system for processing big data is that it does not impose restrictions regarding what data can be stored within RDD partitions. The RDD API already contains many useful operations. But, because the creators of Spark had to keep the core API of RDDs common enough to handle arbitrary data-types, many convenience functions are missing.
The basic RDD API considers each data item as a single value. However, users often want to work with key-value pairs. Therefore Spark extended the interface of RDD to provide additional functions (PairRDDFunctions), which explicitly work on key-value pairs. Currently, there are four extensions to the RDD API available in Spark. They are as follows:
DoubleRDDFunctions
This extension contains many useful methods for aggregating numeric values. They become available if the data items of an RDD are implicitly convertible to the Scala data-type double.
PairRDDFunctions
Methods defined in this interface extension become available when the data items have a two-component tuple structure. Spark will interpret the first tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as the associated value.
OrderedRDDFunctions
Methods defined in this interface extension become available if the data items are two-component tuples where the key is implicitly sortable.
SequenceFileRDDFunctions
This extension contains several methods that allow users to create Hadoop sequence files from RDDs. The data items must be two-component key-value tuples as required by the PairRDDFunctions. However, there are additional requirements considering the convertibility of the tuple components to Writable types.
Since Spark will make methods with extended functionality automatically available to users when the data items fulfill the above described requirements, we decided to list all possible available functions in strictly alphabetical order. We will append one of the following tags to the function name to indicate it belongs to an extension that requires the data items to conform to a certain format or type.
[Double] - Double RDD Functions
[Ordered] - OrderedRDDFunctions
[Pair] - PairRDDFunctions
[SeqFile] - SequenceFileRDDFunctions
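To see how these extensions kick in implicitly, consider the following small sketch (not part of the original article): a plain RDD[String] offers none of the PairRDDFunctions methods, but as soon as the data items are two-component tuples those methods can be called directly.

val words = sc.parallelize(List("spark", "rdd", "api", "fun"))
// words.reduceByKey(_ + _)                    // does not compile: RDD[String] has no key-value structure
val byLength = words.map(w => (w.length, w))   // RDD[(Int, String)]: PairRDDFunctions now apply
byLength.reduceByKey(_ + _).collect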
aggregate
The aggregate function allows the user to apply two different reduce functions to the RDD. The first reduce function is applied within each partition to reduce the data within each partition into a single result. The second reduce function is used to combine the different reduced results of all partitions together to arrive at one final result. The ability to have two separate reduce functions for intra partition versus across partition reducing adds a lot of flexibility. For example the first reduce function can be the max function and the second one can be the sum function. The user also specifies an initial value. Here are some important facts.
- The initial value is applied at both levels of reduce, i.e. both in the intra-partition reduction and in the across-partition reduction.
- Both reduce functions have to be commutative and associative.
- Do not assume any execution order for either partition computations or combining partitions.
- Why would one want to use two input data types? Let us assume we do an archaeological site survey using a metal detector. While walking through the site we take GPS coordinates of important findings based on the output of the metal detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp (first reducer) could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp (second reducer) will receive these highlights as partial maps and combine them into a single final output map.
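As a smaller illustration of the two-type idea (a sketch that is not part of the original examples), the accumulator below is a (sum, count) pair of type (Int, Int) while the RDD elements are plain Ints; seqOp folds single elements into the pair and combOp merges the per-partition pairs.

val nums = sc.parallelize(1 to 100, 4)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, n) => (acc._1 + n, acc._2 + 1),      // seqOp: fold one element into the (sum, count) accumulator
  (a, b)  => (a._1 + b._1, a._2 + b._2))     // combOp: merge the per-partition accumulators
sum.toDouble / count                         // average of 1..100 = 50.5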
Examples 1
val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note the final reduce include the initial value
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16

val z = sc.parallelize(List("a","b","c","d","e","f"),2)

//lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])

z.aggregate("")(_ + _, _+_)
res115: String = abcdef

// See here how the initial value "x" is applied three times.
//  - once for each partition
//  - once when combining all the partitions in the second reduce function.
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc

// Below are some more advanced examples. Some are quite tricky to work out.

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10
The main issue with the code above is that the result of the inner min is a string of length 1.
The zero in the output is due to the empty string being the last string in the list. We see this result because we are not recursively reducing any further within the partition for the final string.
Examples 2
val z = sc.parallelize(List("12","23","","345"),2) z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y) res144: String = 11 |
aggregateByKey [Pair]
Works like the aggregate function except the aggregation is applied to the values with the same key. Also, unlike the aggregate function, the initial value is not applied in the second reduce (the combine step).
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Example
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// lets have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
cartesian
Computes the cartesian product between two RDDs (i.e. each item of the first RDD is joined with each item of the second RDD) and returns them as a new RDD. (Warning: Be careful when using this function; memory consumption can quickly become an issue!)
Example
val x = sc.parallelize(List(1,2,3,4,5)) val y = sc.parallelize(List(6,7,8,9,10)) x.cartesian(y).collect res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10)) |
checkpoint
Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are stored as a binary file within the checkpoint directory which can be specified using the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)
Important note: the directory "my_directory_name" should exist in all slaves. As an alternative you could use an HDFS directory URL as well.
Listing Variants
Example
sc.setCheckpointDir("my_directory_name") val a = sc.parallelize(1 to 4) a.checkpoint a.count 14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15 ... 14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB) 14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12 res23: Long = 4 |
coalesce, repartition
Coalesces the associated data into the given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).
def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
def repartition(numPartitions: Int): RDD[T]
Example
val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2
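A small additional sketch (not from the original page) of the difference between the two functions: without a shuffle, coalesce can only reduce the number of partitions, while repartition (i.e. coalesce with shuffle = true) can also increase it.

val y = sc.parallelize(1 to 10, 2)
y.coalesce(5).partitions.length                  // still 2: without a shuffle partitions can only be merged
y.coalesce(5, shuffle = true).partitions.length  // 5
y.repartition(5).partitions.length               // 5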
cogroup [Pair], groupWith [Pair]
A very powerful set of functions that allow grouping up to three key-value RDDs together using their keys.
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def groupWith[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
Examples
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
  (2,(ArrayBuffer(b),ArrayBuffer(c))),
  (3,(ArrayBuffer(b),ArrayBuffer(c))),
  (1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)

val d = a.map((_, "d"))
b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
  (2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
  (3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
  (1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))
)

val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
  (4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),
  (2,(ArrayBuffer(banana),ArrayBuffer())),
  (3,(ArrayBuffer(orange),ArrayBuffer())),
  (1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),
  (5,(ArrayBuffer(),ArrayBuffer(computer))))
collect, toArray
Converts the RDD into a Scala array and returns it. If you provide an optional partial function, collect applies it first to filter and transform the matching data items.
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]
Example
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2) c.collect res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat) |
collectAsMap [Pair]
Similar to collect, but works on key-value RDDs and converts them into a Scala map that preserves the key-value structure.
Example
val a = sc.parallelize(List(1, 2, 1, 3), 1) val b = a.zip(a) b.collectAsMap res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3) |
combineByKey [Pair]
Very efficient implementation that combines the values of a two-component tuple RDD by applying several aggregators one after another: createCombiner turns a single value into an initial combiner, mergeValue folds further values of the same key into an existing combiner, and mergeCombiners merges combiners across partitions.
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null): RDD[(K, C)]
Example
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3) val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3) val c = b.zip(a) val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y) d.collect res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf))) |
compute
Computes the content of a given partition. This method is implemented by the concrete RDD subclasses and is called by the framework; it should not be invoked by users directly.
context, sparkContext
Returns the SparkContext that was used to create the RDD.
Example
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2) c.context res8: org.apache.spark.SparkContext = org.apache.spark.SparkContext@58c1c2f1 |
count
Example
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2) c.count res2: Long = 4 |
countApprox
Approximate version of count. It returns a potentially incomplete result within the given timeout (in milliseconds); the confidence parameter expresses the probability that the true count lies within the bounds of the returned estimate.
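A minimal sketch of how it can be called (not from the original page; it assumes a Spark version where countApprox returns a PartialResult[BoundedDouble]):

val a = sc.parallelize(1 to 1000000, 20)
val pr = a.countApprox(timeout = 100, confidence = 0.90)
pr.initialValue     // estimate with bounds, possibly based on only some partitions
pr.getFinalValue()  // blocks until the exact count is available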
countApproxDistinct
Computes the approximate number of distinct values. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.
Listing Variants
Example
val a = sc.parallelize(1 to 10000, 20) val b = a++a++a++a++a b.countApproxDistinct(0.1) res14: Long = 8224 b.countApproxDistinct(0.05) res15: Long = 9750 b.countApproxDistinct(0.01) res16: Long = 9947 b.countApproxDistinct(0.001) res0: Long = 10000 |
countApproxDistinctByKey [Pair]
Similar to countApproxDistinct , but computes the approximate number of distinct values for each distinct key. Hence, the RDD must consist of two-component tuples. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.
Listing Variants
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]
Example
val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2) val b = sc.parallelize(a.takeSample(true, 10000, 0), 20) val c = sc.parallelize(1 to b.count().toInt, 20) val d = b.zip(c) d.countApproxDistinctByKey(0.1).collect res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494)) d.countApproxDistinctByKey(0.01).collect res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513)) d.countApproxDistinctByKey(0.001).collect res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521)) |
countByKey [Pair]
Example
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2) c.countByKey res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1) |
countByKeyApprox [Pair]
Approximate version of countByKey. It returns a potentially incomplete map from keys to estimated counts within the given timeout.
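A minimal sketch of how it can be called (not from the original page; the exact bounds depend on timing and the Spark version):

val pairs = sc.parallelize((1 to 100000).map(i => (i % 4, i)), 10)
val pr = pairs.countByKeyApprox(timeout = 100, confidence = 0.95)
pr.getFinalValue()   // Map of key -> BoundedDouble; here every key occurs 25000 times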
countByValue
Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)
Listing Variants
Example
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1)) b.countByValue res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1) |
countByValueApprox
Approximate version of countByValue. It returns a potentially incomplete map from values to estimated counts within the given timeout.
dependencies
Returns the dependencies of this RDD, i.e. references to the parent RDDs it was derived from.
Listing Variants
Example
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1)) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:12 b.dependencies.length Int = 0 b.map(a => a).dependencies.length res40: Int = 1 b.cartesian(a).dependencies.length res41: Int = 2 b.cartesian(a).dependencies res42: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa, org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd) |
distinct
Returns a new RDD that contains each unique value only once.
Listing Variants
def distinct(numPartitions: Int): RDD[T]
Example
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2) c.distinct.collect res6: Array[String] = Array(Dog, Gnu, Cat, Rat) val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)) a.distinct(2).partitions.length res16: Int = 2 a.distinct(3).partitions.length res17: Int = 3 |
first
Looks for the very first data item of the RDD and returns it.
Listing Variants
Example
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2) c.first res1: String = Gnu |
filter
Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.
Listing Variants
Example
val a = sc.parallelize(1 to 10, 3) val b = a.filter(_ % 2 == 0) b.collect res3: Array[Int] = Array(2, 4, 6, 8, 10) |
When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data-types. (Tip: Partial functions are very useful if you have some data which may be bad and you do not want to handle but for the good data (matching data) you want to apply some kind of map function. The following article is good. It teaches you about partial functions in a very nice way and explains why case has to be used for partial functions: article )
Examples for mixed data without partial functions
val b = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.filter(_ < 4).collect
<console>:15: error: value < is not a member of Any
This fails because some components of a are not implicitly comparable against integers. Collect uses the isDefinedAt property of a function-object to determine whether the test-function is compatible with each data item. Only data items that pass this test (=filter) are then mapped using the function-object.
Examples for mixed data with partial functions
val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog")) a.collect({case a: Int => "is integer" | case b: String => "is string" }).collect res17: Array[String] = Array(is string, is string, is integer, is string) val myfunc: PartialFunction[Any, Any] = { case a: Int => "is integer" | case b: String => "is string" } myfunc.isDefinedAt("") res21: Boolean = true myfunc.isDefinedAt(1) res22: Boolean = true myfunc.isDefinedAt(1.5) res23: Boolean = false |
Be careful! The above code works because it only checks the type itself! If you use operations on this type, you have to explicitly declare what type you want instead of Any. Otherwise the compiler apparently cannot determine what bytecode it should produce:
val myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}
<console>:10: error: value < is not a member of Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x < 4) => "x"}
myfunc2: PartialFunction[Int,Any] = <function1>
filterByRange [Ordered]
Returns an RDD containing only the items in the key range specified. From our testing, it appears this only works if your data is in key value pairs and it has already been sorted by key.
Listing Variants
Example
val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3) val sortedRDD = randRDD.sortByKey() sortedRDD.filterByRange(1, 3).collect res66: Array[(Int, String)] = Array((1,screen), (2,cat), (3,book)) |
filterWith (deprecated)
This is an extended version of filter . It takes two function arguments. The first argument must conform to Int -> T and is executed once per partition. It will transform the partition index to type T . The second function looks like (U, T) -> Boolean . T is the transformed partition index and U are the data items from the RDD. Finally the function has to return either true or false (i.e. Apply the filter) .
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3) val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0) b.collect res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9) val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5) a.filterWith(x=> x)((a, b) => b == 0).collect res30: Array[Int] = Array(1, 2) a.filterWith(x=> x)((a, b) => a % (b+1) == 0).collect res33: Array[Int] = Array(1, 2, 4, 6, 8, 10) a.filterWith(x=> x.toString)((a, b) => b == "2").collect res34: Array[Int] = Array(5, 6) |
flatMap
Similar to map , but allows emitting more than one item in the map function.
Listing Variants
Example
val a = sc.parallelize(1 to 10, 5) a.flatMap(1 to _).collect res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3) // The program below generates a random number of copies (up to 10) of the items in the list. val x = sc.parallelize(1 to 10, 3) x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10) |
flatMapValues
Very similar to mapValues , but collapses the inherent structure of the values during mapping.
Listing Variants
Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2) val b = a.map(x => (x.length, x)) b.flatMapValues("x" + _ + "x").collect res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l), (4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p), (7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g), (5,l), (5,e), (5,x)) |
flatMapWith (deprecated)
Similar to flatMap , but allows accessing the partition index or a derivative of the partition index from within the flatMap-function.
Listing Variants
Example
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3) a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9) |
fold
Aggregates the values of each partition. The aggregation variable within each partition is initialized with zeroValue .
Listing Variants
Example
val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6
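The example above has one element per partition, which hides how the zero value is used. The following additional sketch (not from the original page) shows that zeroValue is applied once per partition and once more when the partition results are combined:

val b = sc.parallelize(1 to 6, 3)
b.fold(10)(_ + _)
// per-partition sums: (10+1+2), (10+3+4), (10+5+6); combined: 10 + 13 + 17 + 21 = 61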
foldByKey [Pair]
Very similar to fold , but performs the folding separately for each key of the RDD. This function is only available if the RDD consists of two-component tuples.
Listing Variants
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]
Example
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2) val b = a.map(x => (x.length, x)) b.foldByKey("")(_ + _).collect res84: Array[(Int, String)] = Array((3,dogcatowlgnuant) val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2) val b = a.map(x => (x.length, x)) b.foldByKey("")(_ + _).collect res85: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle)) |
foreach
Executes a side-effecting function (one without a return value) for each data item.
Listing Variants
Example
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3) c.foreach(x => println(x + "s are yummy")) lions are yummy gnus are yummy crocodiles are yummy ants are yummy whales are yummy dolphins are yummy spiders are yummy |
foreachPartition
Executes a side-effecting function once for each partition. Access to the data items contained in the partition is provided via the iterator argument.
Listing Variants
Example
val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3) b.foreachPartition(x => println(x.reduce(_ + _))) 6 15 24 |
foreachWith (Deprecated)
Similar to foreach, but additionally makes the partition index (or a value derived from it via the first function argument) available to the function applied to each data item.
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3) a.foreachWith(i => i)((x,i) => if (x % 2 == 1 && i % 2 == 0) println(x) ) 1 3 7 9 |
fullOuterJoin [Pair]
Performs the full outer join between two paired RDDs.
Listing Variants
def fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], Option[W]))]
Example
val pairRDD1 = sc.parallelize(List( ("cat",2), ("cat", 5), ("book", 4),("cat", 12))) val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("cat", 12))) pairRDD1.fullOuterJoin(pairRDD2).collect res5: Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)), (mouse,(None,Some(4))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))), (cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))), (cat,(Some(5),Some(12))), (cat,(Some(12),Some(2))), (cat,(Some(12),Some(12)))) |
generator, setGenerator
Allows setting a string that is attached to the end of the RDD's name when printing the dependency graph.
Listing Variants
def setGenerator(_generator: String)
getCheckpointFile
Returns the path to the checkpoint file, wrapped in an Option; the result is None if the RDD has not yet been checkpointed.
Listing Variants
Example
sc.setCheckpointDir("/home/cloudera/Documents") val a = sc.parallelize(1 to 500, 5) val b = a++a++a++a++a b.getCheckpointFile res49: Option[String] = None b.checkpoint b.getCheckpointFile res54: Option[String] = None b.collect b.getCheckpointFile res57: Option[String] = Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-d56580787b20/rdd-40) |
preferredLocations
Returns the hosts which are preferred by this RDD. The actual preference of a specific host depends on various assumptions.
Listing Variants
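A quick illustrative sketch (not part of the original page): an RDD built with parallelize usually reports no preferred hosts, whereas an RDD read from HDFS would report the hosts holding the corresponding blocks.

val a = sc.parallelize(1 to 100, 2)
a.preferredLocations(a.partitions(0))   // Seq[String]; empty for a parallelized collection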
getStorageLevel
Retrieves the currently set storage level of the RDD. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. The example below shows the error you will get, when you try to reassign the storage level.
Listing Variants
Example
val a = sc.parallelize(1 to 100000, 2) a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY) a.getStorageLevel.description String = Disk Serialized 1x Replicated a.cache java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after it was already assigned a level |
glom
Assembles an array that contains all elements of the partition and embeds it in an RDD. Each returned array contains the contents of one partition.
Listing Variants
Example
val a = sc.parallelize(1 to 100, 3) a.glom.collect res8: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)) |
groupBy
Listing Variants
def groupBy[K: ClassTag](f: T => K, numPartitions: Int): RDD[(K, Iterable[T])]
def groupBy[K: ClassTag](f: T => K, p: Partitioner): RDD[(K, Iterable[T])]
Example
val a = sc.parallelize(1 to 9, 3)
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int = {
  a % 2
}
a.groupBy(myfunc).collect
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int = {
  a % 2
}
a.groupBy(x => myfunc(x), 3).collect
a.groupBy(myfunc(_), 1).collect
res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

import org.apache.spark.Partitioner
class MyPartitioner extends Partitioner {
  def numPartitions: Int = 2
  def getPartition(key: Any): Int = {
    key match {
      case null     => 0
      case key: Int => key % numPartitions
      case _        => key.hashCode % numPartitions
    }
  }
  override def equals(other: Any): Boolean = {
    other match {
      case h: MyPartitioner => true
      case _                => false
    }
  }
}
val a = sc.parallelize(1 to 9, 3)
val p = new MyPartitioner()
val b = a.groupBy((x:Int) => { x }, p)
val c = b.mapWith(i => i)((a, b) => (b, a))
c.collect
res42: Array[(Int, (Int, Seq[Int]))] = Array((0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))), (0,(6,ArrayBuffer(6))), (0,(8,ArrayBuffer(8))), (1,(9,ArrayBuffer(9))), (1,(3,ArrayBuffer(3))), (1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))), (1,(5,ArrayBuffer(5))))
groupByKey [Pair]
Very similar to groupBy , but instead of supplying a function, the key-component of each pair will automatically be presented to the partitioner.
Listing Variants
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2) val b = a.keyBy(_.length) b.groupByKey.collect res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle))) |
histogram [Double]
These functions take an RDD of doubles and create a histogram with either even spacing (the number of buckets equal to bucketCount) or arbitrary spacing based on custom bucket boundaries supplied by the user via an array of double values. The result type of the two variants differs slightly: the first function returns a tuple of two arrays, where the first array contains the computed bucket boundary values and the second array contains the corresponding counts (i.e. the histogram); the second variant returns just the histogram as an array of longs.
Listing Variants
def histogram(buckets: Array[Double], evenBuckets: Boolean = false): Array[Long]
Example with even spacing
val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3) a.histogram(5) res11: (Array[Double], Array[Long]) = (Array(1.1, 2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0, 0, 1, 4)) val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3) a.histogram(6) res18: (Array[Double], Array[Long]) = (Array(1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3, 4)) |
Example with custom spacing
val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3) a.histogram(Array(0.0, 3.0, 8.0)) res14: Array[Long] = Array(5, 3) val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3) a.histogram(Array(0.0, 5.0, 10.0)) res1: Array[Long] = Array(6, 9) a.histogram(Array(0.0, 5.0, 10.0, 15.0)) res1: Array[Long] = Array(6, 8, 1) |
id
Retrieves the ID which has been assigned to the RDD by its SparkContext.
Listing Variants
Example
val y = sc.parallelize(1 to 10, 10) y.id res16: Int = 19 |
intersection
Returns the elements in the two RDDs which are the same.
Listing Variants
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T]): RDD[T]
Example
val x = sc.parallelize(1 to 20) val y = sc.parallelize(10 to 30) val z = x.intersection(y) z.collect res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11) |
isCheckpointed
Indicates whether the RDD has been checkpointed. The flag is only set once the checkpoint has actually been created.
Listing Variants
Example
sc.setCheckpointDir("/home/cloudera/Documents") c.isCheckpointed res6: Boolean = false c.checkpoint c.isCheckpointed res8: Boolean = false c.collect c.isCheckpointed res9: Boolean = true |
iterator
Returns a compatible iterator object for a partition of this RDD. This function should never be called directly.
Listing Variants
join [Pair]
Performs an inner join using two key-value RDDs. Please note that the keys must be generally comparable to make this work.
Listing Variants
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3) val b = a.keyBy(_.length) val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3) val d = c.keyBy(_.length) b.join(d).collect res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee))) |
keyBy
Constructs two-component tuples (key-value pairs) by applying a function on each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuples.
Listing Variants
Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3) val b = a.keyBy(_.length) b.collect res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant)) |
keys [Pair]
Extracts the keys from all contained tuples and returns them in a new RDD.
Listing Variants
Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2) val b = a.map(x => (x.length, x)) b.keys.collect res2: Array[Int] = Array(3, 5, 4, 3, 7, 5) |
leftOuterJoin [Pair]
Performs a left outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work correctly.
Listing Variants
def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]
Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3) val b = a.keyBy(_.length) val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3) val d = c.keyBy(_.length) b.leftOuterJoin(d).collect res1: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None))) |
lookup
Scans the RDD for all items whose key matches the provided key and returns their values as a Scala sequence.
Listing Variants
Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2) val b = a.map(x => (x.length, x)) b.lookup(5) res0: Seq[String] = WrappedArray(tiger, eagle) |
map
Applies a transformation function on each item of the RDD and returns the result as a new RDD.
Listing Variants
Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3) val b = a.map(_.length) val c = a.zip(b) c.collect res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8)) |
mapPartitions
This is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The custom function must return yet another Iterator[U]. The combined result iterators are automatically converted into a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the following result due to the partitioning we chose.
Listing Variants
Example 1
val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext) {
    val cur = iter.next
    res .::= (pre, cur)
    pre = cur
  }
  res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
Example 2
val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
  var res = List[Int]()
  while (iter.hasNext) {
    val cur = iter.next
    res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
  }
  res.iterator
}
x.mapPartitions(myfunc).collect
// Some of the numbers are not output at all, because the random number generated for them was zero.
res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)
The above program can also be written using flatMap as follows.
Example 2 using flatmap
val x = sc.parallelize(1 to 10, 3) x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10) |
mapPartitionsWithContext (deprecated and developer API)
Similar to mapPartitions , but allows accessing information about the processing state within the mapper.
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3) import org.apache.spark.TaskContext def myfunc(tc: TaskContext, iter: Iterator[Int]) : Iterator[Int] = { tc.addOnCompleteCallback(() => println( "Partition: " + tc.partitionId + ", AttemptID: " + tc.attemptId )) iter.toList.filter(_ % 2 == 0).iterator } a.mapPartitionsWithContext(myfunc).collect 14/04/01 23:05:48 INFO SparkContext: Starting job: collect at <console>:20 ... 14/04/01 23:05:48 INFO Executor: Running task ID 0 Partition: 0, AttemptID: 0, Interrupted: false ... 14/04/01 23:05:48 INFO Executor: Running task ID 1 14/04/01 23:05:48 INFO TaskSetManager: Finished TID 0 in 470 ms on localhost (progress: 0/3) ... 14/04/01 23:05:48 INFO Executor: Running task ID 2 14/04/01 23:05:48 INFO TaskSetManager: Finished TID 1 in 23 ms on localhost (progress: 1/3) 14/04/01 23:05:48 INFO DAGScheduler: Completed ResultTask(0, 1) ? res0: Array[Int] = Array(2, 6, 4, 8) |
mapPartitionsWithIndex
Similar to mapPartitions , but takes two parameters. The first parameter is the index of the partition and the second is an iterator through all the items within this partition. The output is an iterator containing the list of items after applying whatever transformation the function encodes.
Listing Variants
Example
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3) def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = { iter.map(x => index + "," + x) } x.mapPartitionsWithIndex(myfunc).collect() res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10) |
mapPartitionsWithSplit
This method has been marked as deprecated in the API. So, you should not use this method anymore. Deprecated methods will not be covered in this document.
Listing Variants
mapValues [Pair]
Takes the values of a RDD that consists of two-component tuples, and applies the provided function to transform each value. Then, it forms new two-component tuples using the key and the transformed value and stores them in a new RDD.
Listing Variants
Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2) val b = a.map(x => (x.length, x)) b.mapValues("x" + _ + "x").collect res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex)) |
mapWith (deprecated)
This is an extended version of map . It takes two function arguments. The first argument must conform to Int -> T and is executed once per partition. It will map the partition index to some transformed partition index of type T . This is where it is nice to do some kind of initialization code once per partition. Like create a Random number generator object. The second function must conform to (U, T) -> U . T is the transformed partition index and U is a data item of the RDD. Finally the function has to return a transformed data item of type U .
Listing Variants
Example
// generates 9 random numbers less than 1000.
val x = sc.parallelize(1 to 9, 3)
x.mapWith(a => new scala.util.Random)((x, r) => r.nextInt(1000)).collect
res0: Array[Int] = Array(940, 51, 779, 742, 757, 982, 35, 800, 15)

val a = sc.parallelize(1 to 9, 3)
val b = a.mapWith("Index:" + _)((a, b) => ("Value:" + a, b))
b.collect
res0: Array[(String, String)] = Array((Value:1,Index:0), (Value:2,Index:0), (Value:3,Index:0), (Value:4,Index:1), (Value:5,Index:1), (Value:6,Index:1), (Value:7,Index:2), (Value:8,Index:2), (Value:9,Index:2))
max
Returns the largest element in the RDD
Listing Variants
Example
val y = sc.parallelize(10 to 30) y.max res75: Int = 30 val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (18, "cat"))) a.max res6: (Int, String) = (18,cat) |
mean [Double], meanApprox [Double]
Calls stats and extracts the mean component. The approximate version of the function can finish somewhat faster in some scenarios. However, it trades accuracy for speed.
Listing Variants
def meanApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
Example
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3) a.mean res0: Double = 5.3 |
min
Returns the smallest element in the RDD
Listing Variants
Example
val y = sc.parallelize(10 to 30) y.min res75: Int = 10 val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (8, "cat"))) a.min res4: (Int, String) = (3,tiger) |
name, setName
Allows a RDD to be tagged with a custom name.
Listing Variants
def setName(_name: String)
Example
val y = sc.parallelize(1 to 10, 10) y.name res13: String = null y.setName("Fancy RDD Name") y.name res15: String = Fancy RDD Name |
partitionBy [Pair]
Repartitions the given key-value RDD using its keys. The partitioner implementation can be supplied as the first argument.
Listing Variants
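A short illustrative sketch (not part of the original page), using the built-in HashPartitioner:

import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List((1, "a"), (2, "b"), (3, "c"), (4, "d")), 2)
val partitioned = pairs.partitionBy(new HashPartitioner(3))
partitioned.partitions.length   // 3
partitioned.partitioner         // Some(...) holding the HashPartitioner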
partitioner
Returns the partitioner associated with this RDD, if one has been set. It is used by functions such as groupBy, subtract and reduceByKey (from PairRDDFunctions) to determine how the data is partitioned.
Listing Variants
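A small sketch (not from the original page) showing that a freshly parallelized pair RDD has no partitioner until one is assigned:

import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List((1, "a"), (2, "b"), (3, "c")), 2)
pairs.partitioner                                        // None
pairs.partitionBy(new HashPartitioner(2)).partitioner    // Some(...) holding the HashPartitioner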
partitions
Returns an array of the partition objects associated with this RDD.
Listing Variants
Example
val b = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2) b.partitions res48: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@18aa, org.apache.spark.rdd.ParallelCollectionPartition@18ab) |
persist, cache
These functions can be used to adjust the storage level of a RDD. When freeing up memory, Spark will use the storage level identifier to decide which partitions should be kept. The parameterless variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY) . (Warning: Once the storage level has been changed, it cannot be changed again!)
Listing Variants
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]
Example
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2) c.getStorageLevel res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1) c.cache c.getStorageLevel res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1) |
pipe
Takes the RDD data of each partition and sends it via stdin to a shell-command. The resulting output of the command is captured and returned as a RDD of string values.
Listing Variants
def pipe(command: String, env: Map[String, String]): RDD[String]
def pipe(command: Seq[String], env: Map[String, String] = Map(), printPipeContext: (String => Unit) => Unit = null, printRDDElement: (T, String => Unit) => Unit = null): RDD[String]
Example
val a = sc.parallelize(1 to 9, 3) a.pipe("head -n 1").collect res2: Array[String] = Array(1, 4, 7) |
randomSplit
Randomly splits an RDD into multiple smaller RDDs according to a weights Array which specifies the percentage of the total data elements that is assigned to each smaller RDD. Note the actual size of each smaller RDD is only approximately equal to the percentages specified by the weights Array. The second example below shows the number of items in each smaller RDD does not exactly match the weights Array. An optional random seed can be specified. This function is useful for splitting data into a training set and a testing set for machine learning.
Listing Variants
Example
val y = sc.parallelize(1 to 10)
val splits = y.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
training.collect
res85: Array[Int] = Array(1, 4, 5, 6, 8, 10)
test.collect
res86: Array[Int] = Array(2, 3, 7, 9)

val y = sc.parallelize(1 to 10)
val splits = y.randomSplit(Array(0.1, 0.3, 0.6))
val rdd1 = splits(0)
val rdd2 = splits(1)
val rdd3 = splits(2)
rdd1.collect
res87: Array[Int] = Array(4, 10)
rdd2.collect
res88: Array[Int] = Array(1, 3, 5, 8)
rdd3.collect
res91: Array[Int] = Array(2, 6, 7, 9)
reduce
This function provides the well-known reduce functionality in Spark. Please note that any function f you provide should be commutative and associative in order to generate reproducible results.
Listing Variants
Example
val a = sc.parallelize(1 to 100, 3) a.reduce(_ + _) res41: Int = 5050 |
reduceByKey [Pair], reduceByKeyLocally [Pair], reduceByKeyToDriver [Pair]
This function provides the well-known reduce functionality in Spark, applied separately to the values of each key. Please note that any function f you provide should be commutative and associative in order to generate reproducible results.
Listing Variants
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
def reduceByKeyToDriver(func: (V, V) => V): Map[K, V]
Example
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2) val b = a.map(x => (x.length, x)) b.reduceByKey(_ + _).collect res86: Array[(Int, String)] = Array((3,dogcatowlgnuant)) val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2) val b = a.map(x => (x.length, x)) b.reduceByKey(_ + _).collect res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle)) |
repartition
This function changes the number of partitions to the number specified by the numPartitions parameter
Listing Variants
Example
val rdd = sc.parallelize(List(1, 2, 10, 4, 5, 2, 1, 1, 1), 3) rdd.partitions.length res2: Int = 3 val rdd2 = rdd.repartition(5) rdd2.partitions.length res6: Int = 5 |
repartitionAndSortWithinPartitions [Ordered]
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
Listing Variants
Example
// first we will do range partitioning which is not sorted
val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
val rPartitioner = new org.apache.spark.RangePartitioner(3, randRDD)
val partitioned = randRDD.partitionBy(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
partitioned.mapPartitionsWithIndex(myfunc).collect
res0: Array[String] = Array([partID:0, val: (2,cat)], [partID:0, val: (3,book)], [partID:0, val: (1,screen)], [partID:1, val: (4,tv)], [partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2, val: (7,cup)])

// now lets repartition but this time have it sorted
val partitioned = randRDD.repartitionAndSortWithinPartitions(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
partitioned.mapPartitionsWithIndex(myfunc).collect
res1: Array[String] = Array([partID:0, val: (1,screen)], [partID:0, val: (2,cat)], [partID:0, val: (3,book)], [partID:1, val: (4,tv)], [partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2, val: (7,cup)])
rightOuterJoin [Pair]
Performs a right outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work correctly.
Listing Variants
def rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], W))]
Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3) val b = a.keyBy(_.length) val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3) val d = c.keyBy(_.length) b.rightOuterJoin(d).collect res2: Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear))) |
sample
Randomly selects a fraction of the items of a RDD and returns them in a new RDD.
Listing Variants
Example
val a = sc.parallelize(1 to 10000, 3) a.sample(false, 0.1, 0).count res24: Long = 960 a.sample(true, 0.3, 0).count res25: Long = 2888 a.sample(true, 0.3, 13).count res26: Long = 2985 |
sampleByKey [Pair]
Randomly samples the key value pair RDD according to the fraction of each key you want to appear in the final RDD.
Listing Variants
Example
val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater"))) val sampleMap = List((7, 0.4), (6, 0.6)).toMap randRDD.sampleByKey(false, sampleMap,42).collect res6: Array[(Int, String)] = Array((7,cat), (6,mouse), (6,book), (6,screen), (7,heater)) |
sampleByKeyExact [Pair, experimental]
This is labelled as experimental and so we do not document it.
Listing Variants
saveAsHadoopFile [Pair], saveAsHadoopDataset [Pair], saveAsNewAPIHadoopFile [Pair]
Saves the RDD in a Hadoop compatible format using any Hadoop outputFormat class the user specifies.
Listing Variants
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String)(implicit fm: ClassTag[F])
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String, codec: Class[_ <: CompressionCodec]) (implicit fm: ClassTag[F])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], codec: Class[_ <: CompressionCodec])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: JobConf = new JobConf(self.context.hadoopConfiguration), codec: Option[Class[_ <: CompressionCodec]] = None)
def saveAsNewAPIHadoopFile[F <: NewOutputFormat[K, V]](path: String)(implicit fm: ClassTag[F])
def saveAsNewAPIHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = self.context.hadoopConfiguration)
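A minimal illustrative sketch (not part of the original page), writing key-value pairs with the old-API TextOutputFormat; the output path "hadoop_text_out" is just a placeholder.

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat
val v = sc.parallelize(List(("owl", 3), ("gnu", 4), ("dog", 1)), 2)
v.map { case (k, n) => (new Text(k), new IntWritable(n)) }
 .saveAsHadoopFile[TextOutputFormat[Text, IntWritable]]("hadoop_text_out")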
saveAsObjectFile
Saves the RDD in binary format.
Listing Variants
Example
val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect
res52: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
saveAsSequenceFile [SeqFile]
Saves the RDD as a Hadoop sequence file.
Listing Variants
Example
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2) v.saveAsSequenceFile("hd_seq_file") 14/04/19 05:45:43 INFO FileOutputCommitter: Saved output of task 'attempt_201404190545_0000_m_000001_191' to file:/home/cloudera/hd_seq_file [cloudera@localhost ~]$ ll ~/hd_seq_file total 8 -rwxr-xr-x 1 cloudera cloudera 117 Apr 19 05:45 part-00000 -rwxr-xr-x 1 cloudera cloudera 133 Apr 19 05:45 part-00001 -rwxr-xr-x 1 cloudera cloudera 0 Apr 19 05:45 _SUCCESS |
saveAsTextFile
Saves the RDD as text files, writing one element per line.
Listing Variants
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec])
Example without compression
val a = sc.parallelize(1 to 10000, 3) a.saveAsTextFile("mydata_a") 14/04/03 21:11:36 INFO FileOutputCommitter: Saved output of task 'attempt_201404032111_0000_m_000002_71' to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a [cloudera@localhost ~]$ head -n 5 ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/part-00000 1 2 3 4 5 // Produces 3 output files since we have created the a RDD with 3 partitions [cloudera@localhost ~]$ ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/ -rwxr-xr-x 1 cloudera cloudera 15558 Apr 3 21:11 part-00000 -rwxr-xr-x 1 cloudera cloudera 16665 Apr 3 21:11 part-00001 -rwxr-xr-x 1 cloudera cloudera 16671 Apr 3 21:11 part-00002 |
Example with compression
import org.apache.hadoop.io.compress.GzipCodec a.saveAsTextFile("mydata_b", classOf[GzipCodec]) [cloudera@localhost ~]$ ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_b/ total 24 -rwxr-xr-x 1 cloudera cloudera 7276 Apr 3 21:29 part-00000.gz -rwxr-xr-x 1 cloudera cloudera 6517 Apr 3 21:29 part-00001.gz -rwxr-xr-x 1 cloudera cloudera 6525 Apr 3 21:29 part-00002.gz val x = sc.textFile("mydata_b") x.count res2: Long = 10000 |
Example writing into HDFS
val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3) x.saveAsTextFile("hdfs://localhost:8020/user/cloudera/test"); val sp = sc.textFile("hdfs://localhost:8020/user/cloudera/sp_data") sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/cloudera/sp_x") |