<!DOCTYPE html>
<HTML><HEAD>
<TITLE>Apache Spark RDD API Examples</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</HEAD>
<BODY>
<TABLE style="width: 80%;" border="0" cellspacing="10" cellpadding="0">
<TBODY>
<TR>
<TD valign="top" style="width: 1807px;" colspan="3"> </TD></TR>
<TR>
<TD valign="top" style="width: 332px;">
<P style="text-align: left; margin-left: 40px;"><A
href="http://homepage.cs.latrobe.edu.au/zhe/index.html">Home</A></P>
<P style="text-align: left;"><SPAN style="font-weight: bold;">RDD function
calls</SPAN><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#aggregate">aggregate</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#aggregateByKey">aggregateByKey
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#cartesian">cartesian</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#checkpoint">checkpoint<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#coalesce">coalesce,
repartition</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#cogroup">cogroup
[Pair], groupWith [Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A
href="#collect">collect,
toArray</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#collectAsMap">collectAsMap
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#combineByKey">combineByKey
[Pair]</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#compute">compute</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#context">context,
sparkContext</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#count">count</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#countApprox">countApprox</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#countApproxDistinct">countApproxDistinct</A></P>
<DIV style="margin-left: 40px;"><A href="#countApproxDistinceByKey">countApproxDistinctByKey
[Pair]</A></DIV>
<P style="text-align: left; margin-left: 40px;"><A href="#countByKey">countByKey
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#countByKeyApprox">countByKeyApprox
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#countByValue">countByValue</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#countByValueApprox">countByValueApprox</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#dependencies">dependencies</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#distinct">distinct</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#first">first</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#filter">filter</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#filterByRange">filterByRange
[Ordered]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#filterWith">filterWith</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#flatMap">flatMap</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#flatMapValues">flatMapValues
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#flatMapWith">flatMapWith</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#fold">fold</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#foldByKey">foldByKey
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#foreach">foreach</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#foreachPartition">foreachPartition</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#foreachWith">foreachWith</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#fullOuterJoin">fullOuterJoin
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#generator">generator,
setGenerator</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#getCheckpointFile">getCheckpointFile</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#preferredLocations">preferredLocations</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#getStorageLevel">getStorageLevel</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#glom">glom</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#groupBy">groupBy</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#groupByKey">groupByKey
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#histogram">histogram
[Double]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#id">id</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#intersection">intersection<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#isCheckpointed">isCheckpointed</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#iterator">iterator</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#join">join
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#keyBy">keyBy</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#keys">keys
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#leftOuterJoin">leftOuterJoin
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#lookup">lookup
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A
href="#map">map</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#mapPartitions">mapPartitions</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#mapPartitionsWithContext">mapPartitionsWithContext</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#mapPartitionsWithIndex">mapPartitionsWithIndex</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#mapPartitionsWithSplit">mapPartitionsWithSplit</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#mapValues">mapValues
[Pair]</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#mapWith">mapWith</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#max">max</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#mean">mean
[Double], meanApprox [Double]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#min">min</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#name">name,
setName</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#partitionBy">partitionBy
[Pair]</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#partitioner">partitioner</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#partitions">partitions</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#persist">persist,
cache</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#pipe">pipe</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#randomSplit">randomSplit<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#reduce">reduce</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#reduceByKey">reduceByKey
[Pair], reduceByKeyLocally [Pair], reduceByKeyToDriver [Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#repartition">repartition</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#repartitionAndSortWithinPartitions">repartitionAndSortWithinPartitions
[Ordered]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#rightOuterJoin">rightOuterJoin
[Pair] </A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#sample">sample</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#sampleByKey">sampleByKey
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#sampleByKeyExact">sampleByKeyExact
[Pair]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#saveAsHadoopFile">saveAsHadoopFile
[Pair], saveAsHadoopDataset [Pair], saveAsNewAPIHadoopFile
[Pair]</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#saveAsObjectFile">saveAsObjectFile</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#saveAsSequenceFile">saveAsSequenceFile
[SeqFile]<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#saveAsTextFile">saveAsTextFile</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#stats">stats
[Double]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#sortBy">sortBy<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#sortByKey">sortByKey
[Ordered]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#stdev">stdev
[Double], sampleStdev [Double]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#subtract">subtract<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#subtractByKey">subtractByKey
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#sum">sum
[Double], sumApprox [Double]</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#take">take</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#takeOrdered">takeOrdered</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#takeSample">takeSample</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#treeAggregate">treeAggregate</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#treeReduce">treeReduce<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#toDebugString">toDebugString</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#toJavaRDD">toJavaRDD</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#toLocalIterator">toLocalIterator<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#top">top</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#toString">toString</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#union">union,
++</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#unpersist">unpersist</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#values">values
[Pair]</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#variance">variance
[Double], sampleVariance [Double]</A><BR></P>
<P style="text-align: left; margin-left: 40px;"><A href="#zip">zip</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#zipPartitions">zipPartitions</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#zipWithIndex">zipWithIndex</A></P>
<P style="text-align: left; margin-left: 40px;"><A href="#zipWithUniqueId">zipWithUniqueId<BR></A></P>
<P style="text-align: left; margin-left: 40px;"><BR></P>
<P></P></TD>
<TD style="width: 0px; vertical-align: top;"><BR></TD>
<TD style="width: 1807px; vertical-align: top;"><BR><BR>Our research group
has a strong focus on using and improving Apache Spark to solve real-world
problems. Doing so requires a very solid understanding of the
capabilities of Spark. So one of the first things we
have done is to go through the entire Spark RDD API and write examples to
test the functionality of each operation. This has been a very useful
exercise and we would like to share the examples with everyone.<BR><BR>Authors of
the examples: Matthias Langer and Zhen He<BR>Email addresses:
m.langer@latrobe.edu.au, z.he@latrobe.edu.au<BR><BR>These examples have
only been tested for Spark version 1.4. We assume the functionality of
Spark is stable and therefore that the examples remain valid for later
releases.<BR><BR>If you find any errors in the examples we would love to
hear about them so we can fix them. So please email us to let us
know.<BR><BR><BR><BIG><SPAN style="font-weight: bold;">The RDD API By
Example</SPAN></BIG><BR><BR>
<P class="p9 ft4">RDD is short for Resilient Distributed Dataset. RDDs are
the workhorse of the Spark system. As a user, one can consider an RDD as a
handle for a collection of individual data partitions, which are the
result of some computation.</P>
<P class="p22 ft4">However, an RDD is actually more than that. On cluster
installations, separate data partitions can be on separate nodes. Using
the RDD as a handle one can access all partitions and perform computations
and transformations using the contained data. Whenever a part of an RDD or
an entire RDD is lost, the system is able to reconstruct the data of lost
partitions by using lineage information. Lineage refers to the sequence of
transformations used to produce the current RDD. As a result, Spark is
able to recover automatically from most failures.</P>
<P class="p23 ft8">All RDDs available in Spark derive either directly or
indirectly from the class RDD. This class comes with a large set of
methods that perform operations on the data within the associated
partitions. The class RDD is abstract. Whenever one uses an RDD, one is
actually using a concrete implementation of RDD. These implementations
have to override some core functions to make the RDD behave as
expected.</P>
<P class="p24 ft4">One reason why Spark has lately become a very popular
system for processing big data is that it does not impose restrictions
regarding what data can be stored within RDD partitions. The RDD API
already contains many useful operations. But, because the creators of
Spark had to keep the core API of RDDs common enough to handle arbitrary
<NOBR>data-types,</NOBR> many convenience functions are missing.</P>
<P class="p10 ft4">The basic RDD API considers each data item as a single
value. However, users often want to work with <NOBR>key-value</NOBR>
pairs. Therefore Spark extended the interface of RDD to provide
additional functions (PairRDDFunctions) which explicitly work on
<NOBR>key-value</NOBR> pairs. Currently, there are four extensions to the
RDD API available in Spark. They are as follows:</P>
<P class="p25 ft4">DoubleRDDFunctions <BR></P>
<DIV style="margin-left: 40px;">This extension contains many useful
methods for aggregating numeric values. They become available if the data
items of an RDD are implicitly convertible to the Scala
<NOBR>data-type</NOBR> Double.</DIV>
<P class="p26 ft4">PairRDDFunctions <BR></P>
<P class="p26 ft4" style="margin-left: 40px;">Methods defined in this
interface extension become available when the data items have a
<NOBR>two-component</NOBR> tuple structure. Spark will interpret the first
tuple item (i.e. tuplename._1) as the key and the second item (i.e.
tuplename._2) as the associated value.</P>
<P class="p27 ft4">OrderedRDDFunctions <BR></P>
<P class="p27 ft4" style="margin-left: 40px;">Methods defined in this
interface extension become available if the data items are two-component
tuples where the key is implicitly sortable.</P>
<P class="p28 ft9">SequenceFileRDDFunctions <BR></P>
<P class="p29 ft4" style="margin-left: 40px;">This extension contains
several methods that allow users to create Hadoop sequence files from
RDDs. The data items must be <NOBR>two-component</NOBR>
<NOBR>key-value</NOBR> tuples as required by the PairRDDFunctions.
However, there are additional requirements concerning the convertibility
of the tuple components to Writable types.</P>
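<P class="p30 ft4" style="margin-left: 40px;">As a plain-Scala illustration (no Spark required), the two-component tuple structure that PairRDDFunctions and the other pair-based extensions rely on looks like this:</P>

```scala
// A key-value pair in Scala is just a two-component tuple.
val pair = ("cat", 2)

// Spark interprets the first tuple component (_1) as the key ...
val key = pair._1

// ... and the second component (_2) as the associated value.
val value = pair._2
```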
<P class="p30 ft4">Since Spark will make methods with extended
functionality automatically available to users when the data items
fulfill the above described requirements, we decided to list all possible
available functions in strictly alphabetical order. We will append one of
the following tags to the <NOBR>function-name</NOBR> to indicate it belongs
to an extension that requires the data items to conform to a certain
format or type.</P>
<P class="p30 ft4"><SPAN class="ft10">[Double] </SPAN>-
DoubleRDDFunctions<BR></P>
<P class="p30 ft4"><SPAN class="ft10">[Ordered]</SPAN> -
OrderedRDDFunctions<BR></P>
<P class="p30 ft4"><SPAN class="ft10">[Pair] -
PairRDDFunctions<BR></SPAN></P>
<P class="p30 ft4"><SPAN class="ft10"></SPAN><SPAN
class="ft10">[SeqFile]</SPAN> - SequenceFileRDDFunctions</P>
<P class="p30 ft4"><BR></P>
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><BR><A
name="aggregate"></A><BR>aggregate</SPAN></BIG></BIG><BR><BR>The <SPAN
style="font-weight: bold;">aggregate</SPAN> function allows the user to
apply <SPAN style="font-weight: bold;">two</SPAN> different reduce
functions to the RDD. The first reduce function is applied within each
partition to reduce the data within each partition into a single result.
The second reduce function is used to combine the different reduced
results of all partitions together to arrive at one final result. The
ability to have two separate reduce functions for intra-partition versus
across-partition reducing adds a lot of flexibility. For example, the first
reduce function can be the max function and the second one can be the sum
function. The user also specifies an initial value. Here are some
important facts.
<UL>
<LI>The initial value is applied at both levels of reduce: both in the
intra-partition reduction and in the across-partition reduction.<BR></LI>
<LI>Both reduce functions have to be commutative and associative.</LI>
<LI>Do not assume any execution order for either partition computations
or combining partitions.</LI>
<LI>Why would one want to use two input data types? Let us assume we do
an archaeological site survey using a metal detector. While walking
through the site we take GPS coordinates of important findings based on
the output of the metal detector. Later, we intend to draw an image of a
map that highlights these locations using the <SPAN style="font-weight: bold;">aggregate
</SPAN>function. In this case the <SPAN
style="font-weight: bold;">zeroValue</SPAN> could be an area map with no
highlights. The possibly huge set of input data is stored as GPS
coordinates across many partitions. <SPAN
style="font-weight: bold;">seqOp (first reducer)</SPAN> could convert
the GPS coordinates to map coordinates and put a marker on the map at
the respective position. <SPAN style="font-weight: bold;">combOp (second
reducer) </SPAN>will receive these highlights as partial maps and
combine them into a single final output map.</LI></UL><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def aggregate[U: ClassTag](zeroValue:
U)(seqOp: (U, T) => U, combOp: (U, U) => U): U<BR></DIV><BR>
<P class="p30 ft4" style="font-weight: bold;">Examples 1</P>
<DIV style="margin-left: 40px;">
<TABLE style="width: 559px; height: 340px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
z = sc.parallelize(List(1,2,3,4,5,6), 2)<BR><BR>// lets first print
out the contents of the RDD with partition labels<BR>def
myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {<BR>
iter.toList.map(x => "[partID:" + index + ", val: "
+ x + "]").iterator<BR>}<BR><BR>
z.mapPartitionsWithIndex(myfunc).collect<BR>res28: Array[String] =
Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3],
[partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])<BR><BR>
z.aggregate(0)(math.max(_, _), _ + _)<BR>res40: Int = 9<BR><BR>//
This example returns 16 since the initial value is 5<BR>// reduce of
partition 0 will be max(5, 1, 2, 3) = 5<BR>// reduce of partition 1
will be max(5, 4, 5, 6) = 6<BR>// final reduce across partitions
will be 5 + 5 + 6 = 16<BR>// note the final reduce include the
initial value<BR>z.aggregate(5)(math.max(_, _), _ + _)<BR>res29: Int
= 16<BR><BR><BR>val z =
sc.parallelize(List("a","b","c","d","e","f"),2)<BR><BR>//lets first
print out the contents of the RDD with partition labels<BR>def
myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] =
{<BR> iter.toList.map(x => "[partID:" + index + ",
val: " + x + "]").iterator<BR>}<BR><BR>
z.mapPartitionsWithIndex(myfunc).collect<BR>res31: Array[String] =
Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c],
[partID:1, val: d], [partID:1, val: e], [partID:1, val: f])<BR><BR>
z.aggregate("")(_ + _, _+_)<BR>res115: String = abcdef<BR><BR>// See
here how the initial value "x" is applied three times.<BR>// -
once for each partition<BR>// - once when combining all the
partitions in the second reduce function.<BR>z.aggregate("x")(_ + _,
_+_)<BR>res116: String = xxdefxabc<BR><BR>// Below are some more
advanced examples. Some are quite tricky to work out.<BR><BR>val z =
sc.parallelize(List("12","23","345","4567"),2)<BR>
z.aggregate("")((x,y) => math.max(x.length, y.length).toString,
(x,y) => x + y)<BR>res141: String = 42<BR><BR>
z.aggregate("")((x,y) => math.min(x.length, y.length).toString,
(x,y) => x + y)<BR>res142: String = 11<BR><BR>val z =
sc.parallelize(List("12","23","345",""),2)<BR>z.aggregate("")((x,y)
=> math.min(x.length, y.length).toString, (x,y) => x + y)<BR>
res143: String = 10</TD></TR></TBODY></TABLE></DIV><SPAN style="font-weight: bold;"><BR></SPAN>The
main issue with the code above is that the result of the inner <SPAN
style="font-weight: bold;">min</SPAN> is a string of length 1. <BR>The
zero in the output is due to the empty string being the last string in the
list: its length of 0 is the final result within the partition, and it is
not reduced any further.<BR><BR><SPAN style="font-weight: bold;">Examples
2</SPAN><BR><BR>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"></SPAN><BR>
<TABLE style="width: 519px; height: 71px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
z = sc.parallelize(List("12","23","","345"),2)<BR>
z.aggregate("")((x,y) => math.min(x.length, y.length).toString,
(x,y) => x + y)<BR>res144: String =
11</TD></TR></TBODY></TABLE><BR></DIV>In contrast to the previous example,
this example has the empty string at the beginning of the second
partition. This results in a length of zero being input to the second
reduce step, which then upgrades it to a length of 1. <SPAN
style="font-style: italic;">(Warning: The above example shows bad design
since the output is dependent on the order of the data inside the
partitions.)</SPAN><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="aggregateByKey"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">aggregateByKey</SPAN></BIG></BIG>
[Pair]<BR><BR>
<DIV style="width: 1799px; text-align: left;">Works like the aggregate
function, except the aggregation is applied to the values with the same
key. Also, unlike the aggregate function, the initial value is not applied
to the second reduce.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;"><BR></DIV>
<DIV style="margin-left: 40px;">def aggregateByKey[U](zeroValue: U)(seqOp:
(U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K,
U)]<BR>def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U,
V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]<BR>
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U,
V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K,
U)]<BR></DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 584px; height: 25px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val pairRDD = sc.parallelize(List(
("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12),
("mouse", 2)), 2)<BR><BR>// lets have a look at what is in the
partitions<BR>def myfunc(index: Int, iter: Iterator[(String, Int)])
: Iterator[String] = {<BR> iter.toList.map(x => "[partID:"
+ index + ", val: " + x + "]").iterator<BR>}<BR>
pairRDD.mapPartitionsWithIndex(myfunc).collect<BR><BR>res2:
Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val:
(cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)],
[partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])<BR><BR>
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect<BR>res3:
Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))<BR><BR>
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect<BR>res4:
Array[(String, Int)] = Array((dog,100), (cat,200),
(mouse,200))<BR><BR></TD></TR></TBODY></TABLE><BR><BR>
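<P class="p30 ft4">The per-key, two-level reduce can be sketched in plain Scala (the helper below is an illustration of the semantics only, not of Spark's internals). Note how the zeroValue only seeds the per-partition seqOp level, while combOp merges the per-partition results as they are:</P>

```scala
// Plain-Scala sketch of aggregateByKey: within each (simulated) partition
// the values of each key are folded with seqOp starting from zeroValue;
// combOp then merges the per-partition results WITHOUT the zeroValue.
def simulateAggregateByKey[K, V, U](partitions: List[List[(K, V)]], zeroValue: U)(
    seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] =
  partitions
    .map(_.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft(zeroValue)(seqOp)   // seqOp, per partition
    })
    .reduce { (m1, m2) =>                              // combOp, across partitions
      (m1.keySet ++ m2.keySet).map { k =>
        k -> List(m1.get(k), m2.get(k)).flatten.reduce(combOp)
      }.toMap
    }

// Mirrors pairRDD.aggregateByKey(0)(math.max(_, _), _ + _) above:
// cat: max(0,2,5) = 5 and max(0,12) = 12, then 5 + 12 = 17
// mouse: 4 + 2 = 6; dog: 12
val byKey = simulateAggregateByKey(
  List(List(("cat", 2), ("cat", 5), ("mouse", 4)),
       List(("cat", 12), ("dog", 12), ("mouse", 2))), 0)(math.max(_, _), _ + _)
```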
<HR style="width: 100%; height: 2px;">
<BR><BIG><BIG><A name="cartesian"></A><BR style="font-weight: bold;"><SPAN
style="font-weight: bold;">cartesian</SPAN></BIG></BIG><BR><BR>
<DIV style="text-align: left;">Computes the cartesian product between two
RDDs (i.e. each item of the first RDD is joined with each item of the
second RDD) and returns them as a new RDD. <SPAN style="font-style: italic;">(Warning:
Be careful when using this function! Memory consumption can quickly
become an issue!)</SPAN><BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;"><BR></DIV>
<DIV style="margin-left: 40px;">def cartesian[U: ClassTag](other: RDD[U]):
RDD[(T, U)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"></SPAN><BR>
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1,2,3,4,5))<BR>val y =
sc.parallelize(List(6,7,8,9,10))<BR>x.cartesian(y).collect<BR>res0:
Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6),
(2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10),
(4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9),
(5,10))</TD></TR></TBODY></TABLE><BR></DIV><BR><BR>
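<P class="p30 ft4">Ordering aside, the result above is the same pairing a plain-Scala for-comprehension produces, which also makes the memory warning concrete: the output size is the product of the input sizes.</P>

```scala
// Plain-Scala equivalent of the cartesian product: every element of the
// first collection is paired with every element of the second.
val xs = List(1, 2, 3, 4, 5)
val ys = List(6, 7, 8, 9, 10)
val product = for (a <- xs; b <- ys) yield (a, b)
// 5 * 5 = 25 pairs: (1,6), (1,7), ..., (5,10)
```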
<HR style="width: 100%; height: 2px;">
<BR><A name="checkpoint"></A><BR><BR><BIG
style="font-weight: bold;"><BIG>checkpoint</BIG></BIG><BR><BR>Will create
a checkpoint when the RDD is computed next. Checkpointed RDDs are stored
as a binary file within the checkpoint directory which can be specified
using the Spark context.<SPAN style="font-style: italic;"> (Warning: Spark
applies lazy evaluation. Checkpointing will not occur until an action is
invoked.)</SPAN><BR><BR>Important note: the directory
"my_directory_name" should exist on all slave nodes. As an alternative
you could also use an HDFS directory URL.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;"><BR></DIV>
<DIV style="margin-left: 40px;">def checkpoint()<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"></SPAN><BR>
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">sc.setCheckpointDir("my_directory_name")<BR>
val a = sc.parallelize(1 to 4)<BR>a.checkpoint<BR>a.count<BR>
14/02/25 18:13:53 INFO SparkContext: Starting job: count at
< console>:15<BR>...<BR>14/02/25 18:13:53 INFO MemoryStore:
Block broadcast_5 stored as values to memory (estimated size 115.7
KB, free 296.3 MB)<BR>14/02/25 18:13:53 INFO RDDCheckpointData: Done
checkpointing RDD 11 to
file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11,
new parent is RDD 12<BR>res23: Long =
4</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;"><A name="coalesce"></A><BR></SPAN></BIG></BIG></P>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">coalesce,
repartition</SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">Coalesces the associated data into a given
number of partitions. <SPAN
style="font-style: italic;">repartition(numPartitions)</SPAN> is simply an
abbreviation for <SPAN style="font-style: italic;">coalesce(numPartitions,
shuffle = true)</SPAN>.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def coalesce ( numPartitions : Int ,
shuffle : Boolean = false ): RDD [T]<BR>def repartition ( numPartitions :
Int ): RDD [T] </DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>val z = y.coalesce(2, false)<BR>
z.partitions.length<BR>res9: Int =
2</TD></TR></TBODY></TABLE></DIV><BR><BR>
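<P class="p30 ft4">The resulting partition layout can be pictured in plain Scala by chunking a range into single-element partitions and regrouping them. This only illustrates the effect of a non-shuffling coalesce (adjacent partitions merged), not Spark's actual placement logic:</P>

```scala
// Picture 10 elements spread over 10 partitions, as in y above ...
val partitions = (1 to 10).map(i => List(i)).toList

// ... then coalesced (shuffle = false) into 2 partitions: groups of
// adjacent partitions are merged, so no data is redistributed by key.
val coalesced = partitions.grouped(5).map(_.flatten).toList
// coalesced.length = 2
```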
<HR style="width: 100%; height: 2px;">
<BR><A name="cogroup"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">cogroup
<SMALL>[Pair]</SMALL>, groupWith
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">A very powerful set of functions that allow
grouping up to 3 key-value RDDs together using their keys.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def cogroup[W](other: RDD[(K, W)]):
RDD[(K, (Iterable[V], Iterable[W]))]<BR>def cogroup[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]<BR>def
cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(Iterable[V], Iterable[W]))]<BR>def cogroup[W1, W2](other1: RDD[(K, W1)],
other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1],
Iterable[W2]))]<BR>def cogroup[W1, W2](other1: RDD[(K, W1)], other2:
RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1],
Iterable[W2]))]<BR>def cogroup[W1, W2](other1: RDD[(K, W1)], other2:
RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V],
Iterable[W1], Iterable[W2]))]<BR>def groupWith[W](other: RDD[(K, W)]):
RDD[(K, (Iterable[V], Iterable[W]))]<BR>def groupWith[W1, W2](other1:
RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1],
Iterable[W2]))] </DIV><BR><SPAN
style="font-weight: bold;">Examples</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1, 2, 1, 3), 1)<BR>val b = a.map((_,
"b"))<BR>val c = a.map((_, "c"))<BR>b.cogroup(c).collect<BR>res7:
Array[(Int, (Iterable[String], Iterable[String]))] = Array(<BR>
(2,(ArrayBuffer(b),ArrayBuffer(c))),<BR>
(3,(ArrayBuffer(b),ArrayBuffer(c))),<BR>(1,(ArrayBuffer(b,
b),ArrayBuffer(c, c)))<BR>)<BR><BR>val d = a.map((_, "d"))<BR>
b.cogroup(c, d).collect<BR>res9: Array[(Int, (Iterable[String],
Iterable[String], Iterable[String]))] = Array(<BR>
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),<BR>
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),<BR>
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))<BR>
)<BR><BR>val x = sc.parallelize(List((1, "apple"), (2, "banana"),
(3, "orange"), (4, "kiwi")), 2)<BR>val y = sc.parallelize(List((5,
"computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)<BR>
x.cogroup(y).collect<BR>res23: Array[(Int, (Iterable[String],
Iterable[String]))] = Array(<BR>
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))), <BR>
(2,(ArrayBuffer(banana),ArrayBuffer())), <BR>
(3,(ArrayBuffer(orange),ArrayBuffer())),<BR>
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),<BR>
(5,(ArrayBuffer(),ArrayBuffer(computer))))</TD></TR></TBODY></TABLE></DIV><BR><BR>
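The examples above can also be followed without a Spark shell. The sketch below is a plain-Scala simulation of what cogroup computes for two inputs; it illustrates the semantics only, not Spark's distributed implementation, and cogroupLocal is a hypothetical helper name:

```scala
// Simulate RDD.cogroup for two inputs: for every key present in either
// input, collect the values from both sides (empty Seq if absent).
def cogroupLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
  val keys = (left.map(_._1) ++ right.map(_._1)).distinct
  keys.map { k =>
    (k, (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
  }.toMap
}

val x = Seq((1, "apple"), (2, "banana"))
val y = Seq((1, "laptop"), (1, "desktop"), (5, "computer"))
val grouped = cogroupLocal(x, y)
// grouped(1) == (Seq("apple"), Seq("laptop", "desktop"))
// grouped(2) == (Seq("banana"), Seq())   -- key only on the left
// grouped(5) == (Seq(), Seq("computer")) -- key only on the right
```

Note how keys missing from one side still appear in the result, paired with an empty collection, exactly as in the x.cogroup(y) example above.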
<HR style="width: 100%; height: 2px;">
<BR><A name="collect"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">collect,
toArray</SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">Converts the RDD into a Scala array and
returns it. The variant that takes a partial function (f:
PartialFunction[T, U]) applies it to every element for which it is defined
and returns an RDD of the results.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def collect(): Array[T]<BR>def collect[U:
ClassTag](f: PartialFunction[T, U]): RDD[U]<BR>def toArray(): Array[T]
</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 62px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>c.collect<BR>res29: Array[String] = Array(Gnu, Cat, Rat, Dog,
Gnu, Rat)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="collectAsMap"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">collectAsMap
<SMALL>[Pair]</SMALL> </SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">Similar to <SPAN style="font-style: italic;">collect</SPAN>,
but works on key-value RDDs and converts them into a Scala map, preserving
the key-value structure. If a key occurs more than once, only one of its
values is kept.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def collectAsMap(): Map[K, V]
</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 62px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1, 2, 1, 3), 1)<BR>val b = a.zip(a)<BR>
b.collectAsMap<BR>res1: scala.collection.Map[Int,Int] = Map(2 ->
2, 1 -> 1, 3 -> 3)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="combineByKey"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">combineByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">An efficient way to combine the values of an
RDD of key-value pairs. It takes three functions: createCombiner turns the
first value seen for a key within a partition into an accumulator of type
C, mergeValue folds each further value into that accumulator, and
mergeCombiners merges the accumulators produced by different
partitions.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def combineByKey[C](createCombiner: V
=> C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C):
RDD[(K, C)]<BR>def combineByKey[C](createCombiner: V => C, mergeValue:
(C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int):
RDD[(K, C)]<BR>def combineByKey[C](createCombiner: V => C, mergeValue:
(C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner,
mapSideCombine: Boolean = true, serializerClass: String = null): RDD[(K,
C)] </DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 153px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)<BR>val c =
b.zip(a)<BR>val d = c.combineByKey(List(_), (x:List[String],
y:String) => y :: x, (x:List[String], y:List[String]) => x :::
y)<BR>d.collect<BR>res16: Array[(Int, List[String])] =
Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee,
bear, wolf)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
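The division of labour between the three functions can be followed without Spark. The sketch below is a plain-Scala simulation of the semantics (not Spark's shuffle machinery), and combineByKeyLocal is a hypothetical helper name:

```scala
// Simulate combineByKey locally. partitions: the RDD as one Seq per partition.
def combineByKeyLocal[K, V, C](
    partitions: Seq[Seq[(K, V)]],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): Map[K, C] = {
  // Pass 1, per partition: the first value seen for a key goes through
  // createCombiner, every further value through mergeValue.
  val perPartition: Seq[Map[K, C]] = partitions.map { part =>
    part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
    }
  }
  // Pass 2, across partitions: merge the per-partition accumulators,
  // which is what the shuffle does in Spark.
  perPartition.flatten.groupBy(_._1).map { case (k, kcs) =>
    (k, kcs.map(_._2).reduce(mergeCombiners))
  }
}

val parts = Seq(Seq((1, "dog"), (1, "cat")), Seq((2, "gnu"), (1, "owl")))
val res = combineByKeyLocal[Int, String, List[String]](
  parts, List(_), (c, v) => v :: c, (a, b) => a ::: b)
// res(1) contains "dog", "cat" and "owl"; res(2) == List("gnu")
```

As in the zipped example above, the element order within a combined list depends on how the partitions were merged, so only the set of values per key is deterministic.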
<HR style="width: 100%; height: 2px;">
<BR><A name="compute"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">compute</SPAN></BIG></BIG><BR></P>
<DIV style="text-align: left;">Executes dependencies and computes the
actual representation of the RDD. This function should not be called
directly by users.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def compute(split: Partition, context:
TaskContext): Iterator[T] </DIV><BR>
<HR style="width: 100%; height: 2px;">
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;"><A name="context"></A></SPAN></BIG></BIG></P>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">context,
sparkContext</SPAN></BIG></BIG><BR></P>
<DIV style="text-align: left;">Returns the <SPAN style="font-style: italic;">SparkContext</SPAN>
that was used to create the RDD.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def context: SparkContext<BR>def
sparkContext: SparkContext </DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.context<BR>res8: org.apache.spark.SparkContext =
org.apache.spark.SparkContext@58c1c2f1</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="count"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">count</SPAN></BIG></BIG><BR></P>
<DIV style="text-align: left;">Returns the number of items stored in an
RDD.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def count(): Long </DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.count<BR>res2: Long = 4</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countApprox"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">countApprox</SPAN></BIG></BIG><BR></P>Marked as
an experimental feature! Experimental features are currently not covered by
this document!
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def countApprox(timeout: Long, confidence:
Double = 0.95): PartialResult[BoundedDouble]<BR></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countApproxDistinct"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">countApproxDistinct</SPAN></BIG></BIG><BR><BR>
Computes the approximate number of distinct values. For large RDDs which
are spread across many nodes, this function may execute faster than other
counting methods. The parameter <SPAN
style="font-style: italic;">relativeSD</SPAN> controls the accuracy of the
computation.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countApproxDistinct(relativeSD: Double
= 0.05): Long<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10000, 20)<BR>val b = a++a++a++a++a<BR>
b.countApproxDistinct(0.1)<BR>res14: Long = 8224<BR><BR>
b.countApproxDistinct(0.05)<BR>res15: Long = 9750<BR><BR>
b.countApproxDistinct(0.01)<BR>res16: Long = 9947<BR><BR>
b.countApproxDistinct(0.001)<BR>res0: Long =
10000<BR></TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countApproxDistinceByKey"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">countApproxDistinctByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR> <BR>Similar to
<SPAN style="font-style: italic;">countApproxDistinct</SPAN>, but computes
the approximate number of distinct values for each distinct key. Hence,
the RDD must consist of two-component tuples. For large RDDs which are
spread across many nodes, this function may execute faster than other
counting methods. The parameter <SPAN
style="font-style: italic;">relativeSD</SPAN> controls the accuracy of the
computation.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countApproxDistinctByKey(relativeSD:
Double = 0.05): RDD[(K, Long)]<BR>def countApproxDistinctByKey(relativeSD:
Double, numPartitions: Int): RDD[(K, Long)]<BR>def
countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner):
RDD[(K, Long)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>val b =
sc.parallelize(a.takeSample(true, 10000, 0), 20)<BR>val c =
sc.parallelize(1 to b.count().toInt, 20)<BR>val d = b.zip(c)<BR>
d.countApproxDistinctByKey(0.1).collect<BR>res15: Array[(String,
Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414),
(Gnu,2494))<BR><BR>d.countApproxDistinctByKey(0.01).collect<BR>
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455),
(Dog,2425), (Gnu,2513))<BR><BR>
d.countApproxDistinctByKey(0.001).collect<BR>res0: Array[(String,
Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451),
(Gnu,2521))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countByKey"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">countByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR></P>Very similar to count,
but counts the values of an RDD consisting of two-component tuples for
each distinct key separately.
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByKey(): Map[K,
Long]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3,
"Dog")), 2)<BR>c.countByKey<BR>res3: scala.collection.Map[Int,Long]
= Map(3 -> 3, 5 -> 1)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="countByKeyApprox"></A><BR><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">countByKeyApprox
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR></P>Marked as an experimental
feature! Experimental features are currently not covered by this document!
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByKeyApprox(timeout: Long,
confidence: Double = 0.95): PartialResult[Map[K,
BoundedDouble]]<BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="countByValue"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">countByValue</SPAN></BIG></BIG><BR><BR>
Returns a map that contains all unique values of the RDD and their
respective occurrence counts. <SPAN style="font-style: italic;">(Warning:
this operation finally aggregates all the information in a single
reducer.)</SPAN><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByValue(): Map[T,
Long]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))<BR>
b.countByValue<BR>res27: scala.collection.Map[Int,Long] = Map(5
-> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4
-> 2, 7 -> 1)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
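For comparison, the same counting can be done on a local Scala collection with groupBy; a minimal sketch of what countByValue computes:

```scala
// countByValue on a local collection: group equal values and count each group.
val data = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)
val counts: Map[Int, Long] =
  data.groupBy(identity).map { case (v, occurrences) => (v, occurrences.size.toLong) }
// counts(1) == 6, counts(2) == 3, counts(4) == 2, as in res27 above
```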
<HR style="width: 100%; height: 2px;">
<BR><A name="countByValueApprox"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">countByValueApprox</SPAN></BIG></BIG><BR></P>
Marked as experimental feature! Experimental features are currently not
covered by this document!
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByValueApprox(timeout: Long,
confidence: Double = 0.95): PartialResult[Map[T,
BoundedDouble]]<BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="dependencies"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">dependencies</SPAN></BIG></BIG><BR>
<BR>Returns the dependencies of this RDD, i.e. the parent RDDs it was
derived from.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">final def dependencies:
Seq[Dependency[_]]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))<BR>b:
org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at
parallelize at <console>:12<BR>b.dependencies.length<BR>Int =
0<BR><BR>b.map(a => a).dependencies.length<BR>res40: Int =
1<BR><BR>val a = sc.parallelize(1 to 10)<BR>
b.cartesian(a).dependencies.length<BR>res41: Int =
2<BR><BR>b.cartesian(a).dependencies<BR>res42:
Seq[org.apache.spark.Dependency[_]] =
List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,
org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="distinct"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">distinct</SPAN></BIG></BIG><BR>
<BR>Returns a new RDD that contains each unique value only
once.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def distinct(): RDD[T]<BR>def
distinct(numPartitions: Int): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>c.distinct.collect<BR>res6: Array[String] = Array(Dog, Gnu,
Cat, Rat)<BR><BR>val a =
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))<BR>
a.distinct(2).partitions.length<BR>res16: Int = 2<BR><BR>
a.distinct(3).partitions.length<BR>res17: Int =
3</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="first"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">first</SPAN></BIG></BIG><BR>
<BR>Returns the first data item of the RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def first(): T<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.first<BR>res1: String = Gnu</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="filter"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">filter</SPAN></BIG></BIG><BR>
<BR>Evaluates a boolean function for each data item of the RDD and
puts the items for which the function returned <SPAN style="font-style: italic;">true</SPAN>
into the resulting RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def filter(f: T => Boolean):
RDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10, 3)<BR>val b = a.filter(_ % 2 == 0)<BR>
b.collect<BR>res3: Array[Int] = Array(2, 4, 6, 8,
10)</TD></TR></TBODY></TABLE></DIV><BR>When you provide a filter function,
it must be able to handle every data item contained in the RDD. Scala
provides so-called partial functions to deal with mixed data types. (Tip:
partial functions are very useful if some of your data is bad and you do
not want to handle it, while for the good (matching) data you want to apply
some kind of map function. The following <A href="http://blog.bruchez.name/2011/10/scala-partial-functions-without-phd.html">article</A>
explains partial functions nicely and why case has to be used to define
them.)<BR><BR><SPAN
style="font-weight: bold;">Examples for mixed data without partial
functions</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(1 to 8)<BR>b.filter(_ < 4).collect<BR>res15:
Array[Int] = Array(1, 2, 3)<BR><BR>val a =
sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))<BR>
a.filter(_ < 4).collect<BR><console>:15: error: value <
is not a member of Any</TD></TR></TBODY></TABLE></DIV><BR>This fails
because some components of <SPAN style="font-style: italic;">a
</SPAN>are not implicitly comparable against integers. The partial-function
variant of collect uses the <SPAN style="font-style: italic;">isDefinedAt
</SPAN>property of a function object to determine whether the test function
is compatible with each data item. Only data items that pass this test
<SPAN style="font-style: italic;">(= filter) </SPAN>are then mapped using
the function object.<BR><BR><SPAN style="font-weight: bold;">Examples
for mixed data with partial functions</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))<BR>
a.collect({case a: Int => "is integer"<BR>
case b:
String => "is string" }).collect<BR>res17: Array[String] =
Array(is string, is string, is integer, is string)<BR><BR>val
myfunc: PartialFunction[Any, Any] = {<BR> case a:
Int => "is integer"<BR> case b: String
=> "is string" }<BR>myfunc.isDefinedAt("")<BR>res21: Boolean =
true<BR><BR>myfunc.isDefinedAt(1)<BR>res22: Boolean = true<BR><BR>
myfunc.isDefinedAt(1.5)<BR>res23: Boolean =
false</TD></TR></TBODY></TABLE></DIV><BR><BR>Be careful! The above code
works because it only matches on the type itself. If you apply operations
to the matched value, you have to declare a concrete type instead of Any;
otherwise the compiler cannot resolve the operation:<BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) =>
"x"}<BR><console>:10: error: value < is not a member of
Any<BR><BR>val myfunc2: PartialFunction[Int, Any] = {case x if (x
< 4) => "x"}<BR>myfunc2: PartialFunction[Int,Any] =
<function1></TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="filterByRange"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">filterByRange
</SPAN></BIG></BIG>[Ordered]<BR> <BR>Returns an RDD containing only
the items in the specified key range. From our testing, it appears this
only works on RDDs of key-value pairs that have already been sorted by
key.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def filterByRange(lower: K, upper: K):
RDD[P]<BR></DIV><SPAN style="font-weight: bold;"><BR>
Example</SPAN><BR><BR><BR>
<TABLE style="width: 643px; height: 33px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val randRDD = sc.parallelize(List(
(2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1,
"screen"), (5, "heater")), 3)<BR>val sortedRDD =
randRDD.sortByKey()<BR><BR>sortedRDD.filterByRange(1, 3).collect<BR>
res66: Array[(Int, String)] = Array((1,screen), (2,cat),
(3,book))<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="filterWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">filterWith</SPAN></BIG></BIG>
<BIG><SPAN style="font-weight: bold;">(deprecated)</SPAN></BIG><BR>
<BR>This is an extended version of <SPAN
style="font-style: italic;">filter</SPAN>. It takes two function
arguments. The first argument must conform to <SPAN style="font-style: italic;">Int
=> A</SPAN> and is executed once per partition; it transforms the
partition index into a value of type <SPAN style="font-style: italic;">A</SPAN>.
The second function has the form <SPAN style="font-style: italic;">(T, A)
=> Boolean</SPAN>, where <SPAN style="font-style: italic;">T</SPAN> is a
data item from the RDD and <SPAN style="font-style: italic;">A</SPAN> is
the transformed partition index; it returns true for the items to keep
<SPAN style="font-style: italic;">(i.e. it applies the
filter)</SPAN>.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def filterWith[A: ClassTag](constructA:
Int => A)(p: (T, A) => Boolean): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>val b = a.filterWith(i =>
i)((x,i) => x % 2 == 0 || i % 2 == 0)<BR>b.collect<BR>res37:
Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)<BR><BR>val a =
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)<BR>a.filterWith(x=>
x)((a, b) => b == 0).collect<BR>res30: Array[Int] =
Array(1, 2)<BR><BR>a.filterWith(x=> x)((a, b) => a %
(b+1) == 0).collect<BR>res33: Array[Int] = Array(1, 2, 4, 6, 8,
10)<BR><BR>a.filterWith(x=> x.toString)((a, b) => b ==
"2").collect<BR>res34: Array[Int] = Array(5,
6)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="flatMap"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">flatMap</SPAN></BIG></BIG><BR>
<BR>Similar to <SPAN style="font-style: italic;">map</SPAN>, but
allows emitting more than one item in the map function.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def flatMap[U: ClassTag](f: T =>
TraversableOnce[U]): RDD[U]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10, 5)<BR>a.flatMap(1 to _).collect<BR>
res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4,
5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1,
2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)<BR><BR>
sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x,
x)).collect<BR>res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3,
3)<BR><BR>// The program below generates a random number of copies
(up to 10) of the items in the list.<BR>val x =
sc.parallelize(1 to 10, 3)<BR>
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect<BR><BR>
res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9,
9, 9, 9, 10, 10, 10, 10, 10, 10, 10,
10)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="flatMapValues"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">flatMapValues
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR>
<BR>Very similar to <SPAN
style="font-style: italic;">mapValues</SPAN>, but the mapping function may
return several values per input value; the results are flattened and each
output value is paired with the original key.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def flatMapValues[U](f: V =>
TraversableOnce[U]): RDD[(K, U)]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.flatMapValues("x" + _ + "x").collect<BR>res6: Array[(Int, Char)] =
Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g),
(5,e), (5,r), (5,x), (4,x), (4,l), (4,i), (4,o), (4,n), (4,x),
(3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p), (7,a), (7,n),
(7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g),
(5,l), (5,e), (5,x))</TD></TR></TBODY></TABLE></DIV><BR><BR>
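flatMapValues can be understood as a flatMap that re-attaches the key; the plain-Scala sketch below (local collections only, no Spark) mirrors the first two pairs of the example above:

```scala
// flatMapValues expressed with flatMap on a local list of pairs:
// apply f to each value and pair every produced element with the original key.
val pairs = List((3, "dog"), (3, "cat"))
def f(v: String): Iterable[Char] = "x" + v + "x"   // same function as in the example above
val result = pairs.flatMap { case (k, v) => f(v).map(c => (k, c)) }
// result == List((3,'x'), (3,'d'), (3,'o'), (3,'g'), (3,'x'),
//                (3,'x'), (3,'c'), (3,'a'), (3,'t'), (3,'x'))
```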
<HR style="width: 100%; height: 2px;">
<BR><A name="flatMapWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">flatMapWith</SPAN></BIG></BIG>
<BIG><SPAN style="font-weight: bold;">(deprecated)</SPAN></BIG><BR>
<BR>Similar to <SPAN style="font-style: italic;">flatMap</SPAN>, but
allows accessing the partition index or a derivative of the partition
index from within the flatMap-function.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def flatMapWith[A: ClassTag, U:
ClassTag](constructA: Int => A, preservesPartitioning: Boolean =
false)(f: (T, A) => Seq[U]): RDD[U]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)<BR>a.flatMapWith(x
=> x, true)((x, y) => List(y, x)).collect<BR>res58: Array[Int]
= Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2,
9)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="fold"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">fold</SPAN></BIG></BIG><BR> <BR>
Aggregates the values of each partition. The aggregation variable within
each partition is initialized with <SPAN
style="font-style: italic;">zeroValue</SPAN>, which is also used once more
when the per-partition results are merged, so it should be the neutral
element of the operation.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def fold(zeroValue: T)(op: (T, T) =>
T): T<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1,2,3), 3)<BR>a.fold(0)(_ + _)<BR>res59:
Int = 6</TD></TR></TBODY></TABLE></DIV><BR><BR>
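Because zeroValue is applied once per partition and once more for the final merge, a non-neutral zeroValue changes the result. The sketch below simulates this behaviour with plain Scala collections (foldLocal is a hypothetical helper, and the three single-element partitions mirror the example above):

```scala
// Simulate RDD.fold locally: fold each partition with zeroValue,
// then fold the per-partition results with zeroValue once more,
// which is what Spark does when it merges partitions.
def foldLocal[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T =
  partitions.map(_.foldLeft(zeroValue)(op)).foldLeft(zeroValue)(op)

val parts = Seq(Seq(1), Seq(2), Seq(3))   // List(1,2,3) in 3 partitions
val neutral = foldLocal(parts, 0)(_ + _)  // 6, matching the Spark example
val skewed  = foldLocal(parts, 1)(_ + _)  // 10: zeroValue counted once per partition plus once for the merge
```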
<HR style="width: 100%; height: 2px;">
<BR><A name="foldByKey"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foldByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR> <BR>Very similar to
<SPAN style="font-style: italic;">fold</SPAN>, but performs the folding
separately for each key of the RDD. This function is only available if the
RDD consists of two-component tuples.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foldByKey(zeroValue: V)(func: (V, V)
=> V): RDD[(K, V)]<BR>def foldByKey(zeroValue: V, numPartitions:
Int)(func: (V, V) => V): RDD[(K, V)]<BR>def foldByKey(zeroValue: V,
partitioner: Partitioner)(func: (V, V) => V): RDD[(K,
V)]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)<BR>
val b = a.map(x => (x.length, x))<BR>b.foldByKey("")(_ +
_).collect<BR>res84: Array[(Int, String)] =
Array((3,dogcatowlgnuant))<BR><BR>val a = sc.parallelize(List("dog",
"tiger", "lion", "cat", "panther", "eagle"), 2)<BR>val b = a.map(x
=> (x.length, x))<BR>b.foldByKey("")(_ + _).collect<BR>res85:
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther),
(5,tigereagle))</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="foreach"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foreach</SPAN></BIG></BIG><BR>
<BR>Executes a side-effecting function (with no return value) for each
data item.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foreach(f: T =>
Unit)<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu",
"crocodile", "ant", "whale", "dolphin", "spider"), 3)<BR>
c.foreach(x => println(x + "s are yummy"))<BR>lions are yummy<BR>
gnus are yummy<BR>crocodiles are yummy<BR>ants are yummy<BR>whales
are yummy<BR>dolphins are yummy<BR>spiders are
yummy</TD></TR></TBODY></TABLE><BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="foreachPartition"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foreachPartition</SPAN></BIG></BIG><BR>
<BR>Executes a function without a return value once for each partition. Access
to the data items contained in the partition is provided via the iterator
argument.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foreachPartition(f: Iterator[T] =>
Unit)<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)<BR>
b.foreachPartition(x => println(x.reduce(_ + _)))<BR>6<BR>15<BR>
24</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="foreachWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foreachWith</SPAN></BIG></BIG>
<BIG><SPAN style="font-weight: bold;">(Deprecated)</SPAN></BIG><BR>
<BR>Executes a function without a return value for each data item. In
addition, a value of type A is built once per partition from the partition
index via constructA and passed to the function together with each
item.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foreachWith[A: ClassTag](constructA:
Int => A)(f: (T, A) => Unit)<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>a.foreachWith(i => i)((x,i)
=> if (x % 2 == 1 && i % 2 == 0) println(x) )<BR>1<BR>
3<BR>7<BR>9</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="fullOuterJoin"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">fullOuterJoin</SPAN></BIG></BIG><BIG><SPAN
style="font-weight: bold;"></SPAN></BIG> [Pair]<BR> <BR>Performs the
full outer join between two paired RDDs.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def fullOuterJoin[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (Option[V], Option[W]))]<BR>def
fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]<BR>
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner):
RDD[(K, (Option[V], Option[W]))]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 637px; height: 26px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val pairRDD1 = sc.parallelize(List(
("cat",2), ("cat", 5), ("book", 4),("cat", 12)))<BR>val pairRDD2 =
sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("cat",
12)))<BR>pairRDD1.fullOuterJoin(pairRDD2).collect<BR><BR>res5:
Array[(String, (Option[Int], Option[Int]))] =
Array((book,(Some(4),None)), (mouse,(None,Some(4))),
(cup,(None,Some(5))), (cat,(Some(2),Some(2))),
(cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))),
(cat,(Some(5),Some(12))), (cat,(Some(12),Some(2))),
(cat,(Some(12),Some(12))))<BR></TD></TR></TBODY></TABLE><BR><BR><BR>
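The semantics can be sketched with a hypothetical local helper (plain Scala, no Spark): a key missing on one side is padded with None, and keys present on both sides yield the cross product of their values, as the cat entries above show.

```scala
// Hypothetical local helper mirroring fullOuterJoin semantics.
def fullOuterJoin[K, V, W](left: Seq[(K, V)],
                           right: Seq[(K, W)]): Seq[(K, (Option[V], Option[W]))] = {
  val keys = (left.map(_._1) ++ right.map(_._1)).distinct
  keys.flatMap { k =>
    val ls: Seq[Option[V]] = left.collect { case (`k`, v) => Some(v) }
    val rs: Seq[Option[W]] = right.collect { case (`k`, w) => Some(w) }
    val lSide = if (ls.isEmpty) Seq(None) else ls // pad the missing side
    val rSide = if (rs.isEmpty) Seq(None) else rs
    for (l <- lSide; r <- rSide) yield (k, (l, r))
  }
}

val r = fullOuterJoin(List("cat" -> 2, "book" -> 4),
                      List("cat" -> 12, "cup" -> 5))
// r: List((cat,(Some(2),Some(12))), (book,(Some(4),None)), (cup,(None,Some(5))))
```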
<HR style="width: 100%; height: 2px;">
<BR><A name="generator"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">generator,
setGenerator</SPAN></BIG></BIG><BR> <BR>Allows setting a string that
is attached to the end of the RDD's name when printing the dependency
graph.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">@transient var generator<BR>def
setGenerator(_generator: String)<BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="getCheckpointFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">getCheckpointFile</SPAN></BIG></BIG><BR>
<BR>Returns the path to the checkpoint file, or None if the RDD has not
yet been checkpointed.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def getCheckpointFile:
Option[String]</DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">sc.setCheckpointDir("/home/cloudera/Documents")<BR>
val a = sc.parallelize(1 to 500, 5)<BR>val b = a++a++a++a++a<BR>
b.getCheckpointFile<BR>res49: Option[String] = None<BR><BR>
b.checkpoint<BR>b.getCheckpointFile<BR>res54: Option[String] =
None<BR><BR>b.collect<BR>b.getCheckpointFile<BR>res57:
Option[String] =
Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-d56580787b20/rdd-40)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="preferredLocations"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">preferredLocations</SPAN></BIG></BIG><BR>
<BR>Returns the hosts on which the given partition of this RDD is
preferably computed, typically the hosts that hold the underlying data
locally.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">final def preferredLocations(split:
Partition): Seq[String]</DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="getStorageLevel"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">getStorageLevel</SPAN></BIG></BIG><BR>
<BR>Retrieves the currently set storage level of the RDD. A new storage
level can only be assigned if the RDD does not have one set yet. The
example below shows the error you will get when you try to reassign the
storage level.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def getStorageLevel</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100000, 2)<BR>
a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)<BR>
a.getStorageLevel.description<BR>String = Disk Serialized 1x
Replicated<BR><BR>a.cache<BR>
java.lang.UnsupportedOperationException: Cannot change storage level
of an RDD after it was already assigned a
level</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="glom"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">glom</SPAN></BIG></BIG><BR>
<BR>Assembles an array that contains all elements of the partition
and embeds it in an RDD. Each returned array contains the contents of one
partition.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def glom(): RDD[Array[T]]</DIV><BR><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100, 3)<BR>a.glom.collect<BR>res8:
Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 100))</TD></TR></TBODY></TABLE></DIV><SPAN
style="font-weight: bold;"><BR><BR></SPAN>
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="groupBy"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">groupBy</SPAN></BIG></BIG><BR>
<BR>Applies a function to each item to derive a key, and groups all items
that share a key into an Iterable in the resulting pair
RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def groupBy[K: ClassTag](f: T => K):
RDD[(K, Iterable[T])]<BR>def groupBy[K: ClassTag](f: T => K,
numPartitions: Int): RDD[(K, Iterable[T])]<BR>def groupBy[K: ClassTag](f:
T => K, p: Partitioner): RDD[(K, Iterable[T])]</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>a.groupBy(x => { if (x % 2 ==
0) "even" else "odd" }).collect<BR>res42: Array[(String, Seq[Int])]
= Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7,
9)))<BR><BR>val a = sc.parallelize(1 to 9, 3)<BR>def myfunc(a: Int)
: Int =<BR>{<BR> a % 2<BR>}<BR>a.groupBy(myfunc).collect<BR>
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)),
(1,ArrayBuffer(1, 3, 5, 7, 9)))<BR><BR>val a = sc.parallelize(1 to
9, 3)<BR>def myfunc(a: Int) : Int =<BR>{<BR> a % 2<BR>}<BR>
a.groupBy(x => myfunc(x), 3).collect<BR>a.groupBy(myfunc(_),
1).collect<BR>res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2,
4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))<BR><BR>import
org.apache.spark.Partitioner<BR>class MyPartitioner extends
Partitioner {<BR>def numPartitions: Int = 2<BR>def getPartition(key:
Any): Int =<BR>{<BR> key match<BR>
{<BR> case
null => 0<BR>
case key: Int =>
key %
numPartitions<BR> case
_ => key.hashCode %
numPartitions<BR> }<BR> }<BR>
override def equals(other: Any): Boolean =<BR> {<BR>
other match<BR> {<BR>
case h: MyPartitioner => true<BR>
case
_
=> false<BR> }<BR> }<BR>}<BR>val a =
sc.parallelize(1 to 9, 3)<BR>val p = new MyPartitioner()<BR>val b =
a.groupBy((x:Int) => { x }, p)<BR>val c = b.mapWith(i =>
i)((a, b) => (b, a))<BR>c.collect<BR>res42: Array[(Int, (Int,
Seq[Int]))] = Array((0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))),
(0,(6,ArrayBuffer(6))), (0,(8,ArrayBuffer(8))),
(1,(9,ArrayBuffer(9))), (1,(3,ArrayBuffer(3))),
(1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))),
(1,(5,ArrayBuffer(5))))<BR></TD></TR></TBODY></TABLE></DIV><SPAN style="font-weight: bold;"><BR><BR><BR></SPAN>
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="groupByKey"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">groupByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR> <BR>Very similar to
<SPAN style="font-style: italic;">groupBy</SPAN>, but instead of deriving
the key with a supplied function, the key component of each pair is used
directly for grouping.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def groupByKey(): RDD[(K,
Iterable[V])]<BR>def groupByKey(numPartitions: Int): RDD[(K,
Iterable[V])]<BR>def groupByKey(partitioner: Partitioner): RDD[(K,
Iterable[V])]</DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider",
"eagle"), 2)<BR>val b = a.keyBy(_.length)<BR>
b.groupByKey.collect<BR>res11: Array[(Int, Seq[String])] =
Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)),
(3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger,
eagle)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="histogram"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">histogram
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR> <BR>These functions
take an RDD of doubles and create a histogram with either even spacing
(the number of buckets equal to <SPAN
style="font-style: italic;">bucketCount</SPAN>) or arbitrary spacing based
on custom bucket boundaries supplied by the user as an array of
double values. The result types of the two variants differ: the
first function returns a tuple of two arrays, where the first
array contains the computed bucket boundary values and the second array
contains the corresponding counts of values <SPAN style="font-style: italic;">(i.e.
the histogram)</SPAN>. The second variant simply returns
the histogram as an array of longs.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def histogram(bucketCount: Int):
Pair[Array[Double], Array[Long]]<BR>def histogram(buckets: Array[Double],
evenBuckets: Boolean = false): Array[Long]</DIV><BR><SPAN style="font-weight: bold;">Example
with even spacing</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6,
8.8, 9.0), 3)<BR>a.histogram(5)<BR>res11: (Array[Double],
Array[Long]) = (Array(1.1, 2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0,
0, 1, 4))<BR><BR>val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1,
1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)<BR>
a.histogram(6)<BR>res18: (Array[Double], Array[Long]) = (Array(1.0,
2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3,
4))</TD></TR></TBODY></TABLE></DIV><BR><BR><SPAN style="font-weight: bold;">Example
with custom spacing</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6,
8.8, 9.0), 3)<BR>a.histogram(Array(0.0, 3.0, 8.0))<BR>res14:
Array[Long] = Array(5, 3)<BR><BR>val a = sc.parallelize(List(9.1,
1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)<BR>a.histogram(Array(0.0, 5.0, 10.0))<BR>res1: Array[Long]
= Array(6, 9)<BR><BR>a.histogram(Array(0.0, 5.0, 10.0, 15.0))<BR>
res1: Array[Long] = Array(6, 8, 1)</TD></TR></TBODY></TABLE></DIV><BR><BR>
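The even-spacing variant can be sketched on a local collection (an illustrative reimplementation under assumed semantics, not Spark's actual code); it reproduces the first result shown above:

```scala
// Sketch of even-spacing histogram semantics: bucketCount equal-width
// buckets between min and max; the last bucket also includes the maximum.
def evenHistogram(data: Seq[Double], bucketCount: Int): (Seq[Double], Seq[Long]) = {
  val lo = data.min
  val width = (data.max - lo) / bucketCount
  val bounds = (0 to bucketCount).map(i => lo + i * width)
  val counts = new Array[Long](bucketCount)
  for (x <- data)
    counts(math.min(((x - lo) / width).toInt, bucketCount - 1)) += 1
  (bounds, counts.toSeq)
}

val (bounds, counts) =
  evenHistogram(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 5)
// counts: List(5, 0, 0, 1, 4), matching res11 above
```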
<HR style="width: 100%; height: 2px;">
<BR><A name="id"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">id</SPAN></BIG></BIG><BR><BR>Retrieves the ID
which has been assigned to the RDD by its SparkContext.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">val id: Int</DIV><BR><SPAN style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>y.id<BR>res16: Int =
19</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="intersection"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">intersection</SPAN></BIG></BIG><BR><BR>
Returns the elements that occur in both RDDs, with duplicates removed.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def intersection(other: RDD[T],
numPartitions: Int): RDD[T]<BR>def intersection(other: RDD[T],
partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]<BR>def
intersection(other: RDD[T]): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example<BR><BR></SPAN>
<TABLE style="width: 611px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val x = sc.parallelize(1 to 20)<BR>
val y = sc.parallelize(10 to 30)<BR>val z =
x.intersection(y)<BR><BR>z.collect<BR>res74: Array[Int] = Array(16,
12, 20, 13, 17, 14, 18, 10, 19, 15,
11)<BR></TD></TR></TBODY></TABLE><BR><BR><BR>
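Locally, the same semantics (common elements, duplicates removed) can be sketched with plain Scala sets:

```scala
// intersection keeps only the elements present in both inputs and
// removes duplicates; Set intersection models this locally.
val x = (1 to 20).toList
val y = (10 to 30).toList
val z = x.toSet.intersect(y.toSet)
// z: the numbers 10 through 20, in no particular order (as with the RDD result)
```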
<HR style="width: 100%; height: 2px;">
<BR><A name="isCheckpointed"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">isCheckpointed</SPAN></BIG></BIG><BR><BR>
Indicates whether the RDD has been checkpointed. The flag only becomes true
once the checkpoint data has actually been written out.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def isCheckpointed: Boolean</DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">sc.setCheckpointDir("/home/cloudera/Documents")<BR>
c.isCheckpointed<BR>res6: Boolean = false<BR><BR>c.checkpoint<BR>
c.isCheckpointed<BR>res8: Boolean = false<BR><BR>c.collect<BR>
c.isCheckpointed<BR>res9: Boolean =
true</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="iterator"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">iterator</SPAN></BIG></BIG><BR><BR>
Returns a compatible iterator object for a partition of this RDD. This
function should never be called directly.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">final def iterator(split: Partition,
context: TaskContext): Iterator[T]<BR></DIV><SPAN style="font-weight: bold;"><BR><BR></SPAN>
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="join"></A><BR><BR></SPAN><BIG><BIG><SPAN
style="font-weight: bold;">join<SMALL>
[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Performs an inner join using two
key-value RDDs. Note that the keys must be comparable for the
join to work correctly.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def join[W](other: RDD[(K, W)]): RDD[(K,
(V, W))]<BR>def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K,
(V, W))]<BR>def join[W](other: RDD[(K, W)], partitioner: Partitioner):
RDD[(K, (V, W))]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 639px; height: 159px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val d = c.keyBy(_.length)<BR>b.join(d).collect<BR><BR>res0:
Array[(Int, (String, String))] = Array((6,(salmon,salmon)),
(6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)),
(6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)),
(3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)),
(3,(rat,cat)), (3,(rat,gnu)),
(3,(rat,bee)))</TD></TR></TBODY></TABLE></DIV><BR>
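A minimal local sketch of the inner-join semantics (plain Scala, with a hypothetical innerJoin helper): every left pair is matched against every right pair that carries the same key.

```scala
// Inner join on local pair sequences: emit one output pair per matching
// key combination, as RDD.join does per key.
def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

val left  = List("dog", "rat").map(x => (x.length, x))         // keys 3, 3
val right = List("cat", "gnu", "bear").map(x => (x.length, x)) // keys 3, 3, 4
val joined = innerJoin(left, right)
// joined: List((3,(dog,cat)), (3,(dog,gnu)), (3,(rat,cat)), (3,(rat,gnu)))
```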
<HR style="width: 100%; height: 2px;">
<BR><A name="keyBy"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">keyBy</SPAN></BIG></BIG><BR><BR>Constructs
two-component tuples (key-value pairs) by applying a function on each data
item. The result of the function becomes the key and the original data
item becomes the value of the newly created tuples.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def keyBy[K](f: T => K): RDD[(K,
T)]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>b.collect<BR>res26:
Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon),
(3,rat), (8,elephant))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="keys"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">keys
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><SPAN
style="font-weight: bold;"><BR><BR></SPAN>Extracts the keys from all
contained tuples and returns them in a new RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def keys: RDD[K]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.keys.collect<BR>res2: Array[Int] = Array(3, 5, 4, 3, 7,
5)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="leftOuterJoin"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">leftOuterJoin
 <SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Performs a left outer
join using two key-value RDDs. Note that the keys must be
comparable for the join to work correctly.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def leftOuterJoin[W](other: RDD[(K, W)]):
RDD[(K, (V, Option[W]))]<BR>def leftOuterJoin[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (V, Option[W]))]<BR>def
leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(V, Option[W]))]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val d = c.keyBy(_.length)<BR>
b.leftOuterJoin(d).collect<BR><BR>res1: Array[(Int, (String,
Option[String]))] = Array((6,(salmon,Some(salmon))),
(6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))),
(6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))),
(6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))),
(3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))),
(3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))),
(8,(elephant,None)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="lookup"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">lookup</SPAN></BIG></BIG><BR><BR>
Scans the RDD for all entries whose key matches the given key and returns
their values as a Scala sequence.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def lookup(key: K): Seq[V]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.lookup(5)<BR>res0: Seq[String] = WrappedArray(tiger,
eagle)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="map"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">map</SPAN></BIG></BIG><BR><BR>Applies a
transformation function on each item of the RDD and returns the result as
a new RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def map[U: ClassTag](f: T => U):
RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;">Example</SPAN><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.map(_.length)<BR>val c = a.zip(b)<BR>
c.collect<BR>res0: Array[(String, Int)] = Array((dog,3), (salmon,6),
(salmon,6), (rat,3), (elephant,8))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitions"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitions</SPAN></BIG></BIG><BR><BR>
This is a specialized map that is called only once for each partition. The
entire content of the respective partitions is available as a sequential
stream of values via the input argument (<SPAN
style="font-style: italic;">Iterator[T]</SPAN>). The custom function
must return yet another <SPAN
style="font-style: italic;">Iterator[U]</SPAN>. The combined result
iterators are automatically converted into a new RDD. Please note that
the tuples (3,4) and (6,7) are missing from the following result due to
the partitioning we chose.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def mapPartitions[U: ClassTag](f:
Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false):
RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;">Example
1</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>def myfunc[T](iter: Iterator[T]) :
Iterator[(T, T)] = {<BR> var res = List[(T, T)]()<BR>
var pre = iter.next<BR> while (iter.hasNext)<BR> {<BR>
val cur = iter.next;<BR> res
.::= (pre, cur)<BR> pre = cur;<BR> }<BR>
res.iterator<BR>}<BR>a.mapPartitions(myfunc).collect<BR>res0:
Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9),
(7,8))</TD></TR></TBODY></TABLE></DIV><BR><SPAN style="font-weight: bold;">Example
2</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9,10), 3)<BR>def
myfunc(iter: Iterator[Int]) : Iterator[Int] = {<BR> var res =
List[Int]()<BR> while (iter.hasNext) {<BR>
val cur = iter.next;<BR> res = res :::
List.fill(scala.util.Random.nextInt(10))(cur)<BR> }<BR>
res.iterator<BR>}<BR>x.mapPartitions(myfunc).collect<BR>// some of
the numbers are not output at all, because the random
count generated for them is zero.<BR>res8: Array[Int] = Array(1, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7,
9, 9, 10)</TD></TR></TBODY></TABLE></DIV><BR>The above program can also be
written using <SPAN style="font-weight: bold;">flatMap</SPAN> as
follows.<BR><BR><SPAN style="font-weight: bold;">Example 2 using
flatmap<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(1 to 10, 3)<BR>
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect<BR><BR>
res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9,
9, 9, 9, 10, 10, 10, 10, 10, 10, 10,
10)</TD></TR></TBODY></TABLE></DIV><BR><BR>
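Why are (3,4) and (6,7) absent from Example 1? Because each call of myfunc sees only one partition. A local sketch (three lists standing in for the three partitions; within-partition ordering simplified):

```scala
// Each inner list plays the role of one partition; consecutive pairs are
// built per partition only, so cross-partition pairs like (3,4) never occur.
val partitions = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
val result = partitions.flatMap(part => part.zip(part.tail))
// result: List((1,2), (2,3), (4,5), (5,6), (7,8), (8,9))
```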
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitionsWithContext"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitionsWithContext</SPAN></BIG></BIG> <SPAN
style="font-weight: bold;"> (deprecated and developer API)</SPAN><BR><BR>
Similar to <SPAN style="font-style: italic;">mapPartitions</SPAN>, but
allows accessing information about the processing state within the
mapper.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def mapPartitionsWithContext[U:
ClassTag](f: (TaskContext, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>import
org.apache.spark.TaskContext<BR>def myfunc(tc: TaskContext, iter:
Iterator[Int]) : Iterator[Int] = {<BR>
tc.addOnCompleteCallback(() => println(<BR>
"Partition: " + tc.partitionId +<BR>
", AttemptID: " + tc.attemptId ))<BR>
<BR> iter.toList.filter(_ % 2 == 0).iterator<BR>}<BR>
a.mapPartitionsWithContext(myfunc).collect<BR><BR>14/04/01 23:05:48
INFO SparkContext: Starting job: collect at &lt;console&gt;:20<BR>
...<BR>14/04/01 23:05:48 INFO Executor: Running task ID 0<BR>
Partition: 0, AttemptID: 0, Interrupted: false<BR>...<BR>14/04/01
23:05:48 INFO Executor: Running task ID 1<BR>14/04/01 23:05:48 INFO
TaskSetManager: Finished TID 0 in 470 ms on localhost (progress:
0/3)<BR>...<BR>14/04/01 23:05:48 INFO Executor: Running task ID
2<BR>14/04/01 23:05:48 INFO TaskSetManager: Finished TID 1 in 23 ms
on localhost (progress: 1/3)<BR>14/04/01 23:05:48 INFO DAGScheduler:
Completed ResultTask(0, 1)<BR><BR>res0: Array[Int] = Array(2,
6, 4, 8)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitionsWithIndex"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitionsWithIndex</SPAN></BIG></BIG><BR><BR>
Similar to <SPAN style="font-style: italic;">mapPartitions</SPAN>, but
takes two parameters. The first parameter is the index of the partition
and the second is an iterator through all the items within this
partition. The output is an iterator containing the list of items after
applying whatever transformation the function encodes.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def mapPartitionsWithIndex[U: ClassTag](f:
(Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean =
false): RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)<BR>def
myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {<BR>
iter.toList.map(x => index + "," + x).iterator<BR>}<BR>
x.mapPartitionsWithIndex(myfunc).collect()<BR>res10: Array[String] =
Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9,
2,10)</TD></TR></TBODY></TABLE></DIV><BR><BR>
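The partition-index tagging above can be mimicked locally (a sketch; zipWithIndex supplies the index that Spark passes as the first parameter):

```scala
// Tag every element with the index of its "partition", mirroring myfunc above.
val partitions = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9, 10))
val tagged = partitions.zipWithIndex.flatMap { case (part, index) =>
  part.map(x => s"$index,$x")
}
// tagged: List("0,1", "0,2", "0,3", "1,4", ..., "2,10")
```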
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitionsWithSplit"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitionsWithSplit</SPAN></BIG></BIG><BR><BR>
This method has been marked as deprecated in the API, so it should no
longer be used. Deprecated methods will not be covered in this
document.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def mapPartitionsWithSplit[U: ClassTag](f:
(Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean =
false): RDD[U]<SPAN style="font-weight: bold;"><BR><BR><BR></SPAN>
</DIV><SPAN style="font-weight: bold;"></SPAN>
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="mapValues"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">mapValues
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Takes the values of a RDD
that consists of two-component tuples, and applies the provided function
to transform each value. Then, it forms new two-component tuples using the
key and the transformed value and stores them in a new
RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def mapValues[U](f: V => U): RDD[(K,
U)]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.mapValues("x" + _ + "x").collect<BR>res5: Array[(Int, String)] =
Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx),
(5,xeaglex))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapWith</SPAN></BIG></BIG>
<SPAN style="font-weight: bold;">(deprecated)</SPAN><BR><BR>This is
an extended version of <SPAN style="font-style: italic;">map</SPAN>. It
takes two function arguments. The first argument must conform to <SPAN
style="font-style: italic;">Int => A</SPAN> and is executed once per
partition. It maps the partition index to some value of type <SPAN
style="font-style: italic;">A</SPAN>, which makes it a convenient place
for per-partition initialization code, such as creating a random number
generator object. The second function must conform to <SPAN
style="font-style: italic;">(T, A) => U</SPAN>, where <SPAN
style="font-style: italic;">T</SPAN> is a data item of the RDD and <SPAN
style="font-style: italic;">A</SPAN> is the value produced for its
partition. It returns a transformed data item of type <SPAN style="font-style: italic;">U</SPAN>.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def mapWith[A: ClassTag, U:
ClassTag](constructA: Int => A, preservesPartitioning: Boolean =
false)(f: (T, A) => U): RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example<BR></SPAN><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">//
generates 9 random numbers less than 1000. <BR>val x =
sc.parallelize(1 to 9, 3)<BR>x.mapWith(a => new
scala.util.Random)((x, r) => r.nextInt(1000)).collect<BR>res0:
Array[Int] = Array(940, 51, 779, 742, 757, 982, 35, 800, 15)<BR><BR>
val a = sc.parallelize(1 to 9, 3)<BR>val b = a.mapWith("Index:" +
_)((a, b) => ("Value:" + a, b))<BR>b.collect<BR>res0:
Array[(String, String)] = Array((Value:1,Index:0),
(Value:2,Index:0), (Value:3,Index:0), (Value:4,Index:1),
(Value:5,Index:1), (Value:6,Index:1), (Value:7,Index:2),
(Value:8,Index:2),
(Value:9,Index:2))<BR><BR><BR></TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="max"></A><BR><BIG><BIG><SPAN
style="font-weight: bold;">max</SPAN></BIG></BIG><SPAN style="font-weight: bold;"></SPAN><BR><BR>
Returns the largest element in the RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def max()(implicit ord: Ordering[T]):
T<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example<BR></SPAN><BR>
<TABLE style="width: 630px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val y = sc.parallelize(10 to
30)<BR>y.max<BR>res75: Int = 30<BR><BR>val a =
sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (18,
"cat")))<BR>a.max<BR>res6: (Int, String) =
(18,cat)<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mean"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">mean <SMALL>[Double]</SMALL>, meanApprox
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Calls <SPAN style="font-style: italic;">stats</SPAN>
and extracts the mean component. The approximate version of the function
can finish somewhat faster in some scenarios. However, it trades accuracy
for speed.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def mean(): Double<BR>def
meanApprox(timeout: Long, confidence: Double = 0.95):
PartialResult[BoundedDouble]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1,
7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)<BR>a.mean<BR>res0: Double =
5.3</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="min"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">min</SPAN></BIG></BIG><BR><BR>Returns the
smallest element in the RDD.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def min()(implicit ord: Ordering[T]):
T<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 637px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val y = sc.parallelize(10 to
30)<BR>y.min<BR>res75: Int = 10<BR><BR><BR>val a =
sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (8,
"cat")))<BR>a.min<BR>res4: (Int, String) =
(3,tiger)<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="name"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">name, setName</SPAN></BIG></BIG><BR><BR>Allows
a RDD to be tagged with a custom name.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">@transient var name: String<BR>def
setName(_name: String)<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>y.name<BR>res13: String =
null<BR>y.setName("Fancy RDD Name")<BR>y.name<BR>res15: String =
Fancy RDD Name</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="partitionBy"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">partitionBy
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Repartitions the key-value
RDD using its keys. The partitioner implementation can be supplied as the
first argument.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def partitionBy(partitioner: Partitioner):
RDD[(K, V)]<BR></DIV><SPAN style="font-weight: bold;"><BR><BR></SPAN>
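The original gives no spark-shell example for this function. As a minimal pure-Scala sketch that runs outside Spark, the hash-partitioning rule can be modelled as follows; SimpleHashPartitioner is a hypothetical stand-in for org.apache.spark.HashPartitioner, and the non-negative-modulus rule is an assumption modelled on its behaviour.

```scala
// Hypothetical stand-in for org.apache.spark.HashPartitioner (an assumption,
// not the real class): a partitioner is just a mapping from key to partition id.
class SimpleHashPartitioner(val numPartitions: Int) {
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // keep the partition id non-negative
  }
}

val p = new SimpleHashPartitioner(3)
val pairs = List((2, "cat"), (6, "mouse"), (7, "cup"), (3, "book"))
// partitionBy would co-locate each pair according to its key:
val placed = pairs.groupBy { case (k, _) => p.getPartition(k) }
```

After partitionBy, all pairs sharing a key are guaranteed to sit in the same partition, which is what makes key-based operations such as join cheap afterwards.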
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="partitioner"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">partitioner
</SPAN></BIG></BIG><BR><BR>References the default
partitioner that will be used by functions such as <SPAN
style="font-style: italic;">groupBy</SPAN>, <SPAN style="font-style: italic;">subtract</SPAN>
and <SPAN style="font-style: italic;">reduceByKey</SPAN> (from <SPAN style="font-style: italic;">PairRDDFunctions</SPAN>).<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">@transient val partitioner:
Option[Partitioner]</DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="partitions"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">partitions
</SPAN></BIG></BIG><BR><BR>Returns an array of the partition objects
associated with this RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">final def partitions:
Array[Partition]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>b.partitions<BR>res48: Array[org.apache.spark.Partition] =
Array(org.apache.spark.rdd.ParallelCollectionPartition@18aa,
org.apache.spark.rdd.ParallelCollectionPartition@18ab)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="persist"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">persist,
cache </SPAN></BIG></BIG><BR><BR>These functions can be used to adjust the
storage level of a RDD. When freeing up memory, Spark will use the storage
level identifier to decide which partitions should be kept. The
parameterless variants <SPAN style="font-style: italic;">persist()</SPAN>
and <SPAN style="font-style: italic;">cache()</SPAN> are just
abbreviations for <SPAN
style="font-style: italic;">persist(StorageLevel.MEMORY_ONLY)</SPAN>.
<SPAN style="font-style: italic;">(Warning: Once the storage level has
been changed, it cannot be changed again!)</SPAN><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def cache(): RDD[T]<BR>def persist():
RDD[T]<BR>def persist(newLevel: StorageLevel): RDD[T]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example<BR></SPAN><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>c.getStorageLevel<BR>res0:
org.apache.spark.storage.StorageLevel = StorageLevel(false, false,
false, false, 1)<BR>c.cache<BR>c.getStorageLevel<BR>res2:
org.apache.spark.storage.StorageLevel = StorageLevel(false, true,
false, true, 1)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="pipe"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">pipe </SPAN></BIG></BIG><BR><BR>Takes the RDD
data of each partition and sends it via stdin to a shell-command. The
resulting output of the command is captured and returned as a RDD of
string values.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def pipe(command: String): RDD[String]<BR>
def pipe(command: String, env: Map[String, String]): RDD[String]<BR>def
pipe(command: Seq[String], env: Map[String, String] = Map(),
printPipeContext: (String => Unit) => Unit = null, printRDDElement:
(T, String => Unit) => Unit = null): RDD[String]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>a.pipe("head -n 1").collect<BR>
res2: Array[String] = Array(1, 4,
7)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><A name="randomSplit"></A><BR>
randomSplit </SPAN></BIG></BIG><BR><BR>Randomly splits an RDD into
multiple smaller RDDs according to a weights Array, which specifies the
fraction of the total data elements assigned to each smaller RDD. Note
that the actual size of each smaller RDD only approximately matches the
specified weights; the second example below shows that the number of
items in each smaller RDD does not exactly match the weights Array. An
optional random seed can be specified. This function is useful for
splitting data into a training set and a testing set for machine
learning.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def randomSplit(weights: Array[Double],
seed: Long = Utils.random.nextLong): Array[RDD[T]]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BIG><BIG><SPAN style="font-weight: bold;"><BR></SPAN></BIG></BIG>
<TABLE style="width: 602px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val y = sc.parallelize(1 to 10)<BR>
val splits = y.randomSplit(Array(0.6, 0.4), seed = 11L)<BR>val
training = splits(0)<BR>val test = splits(1)<BR>training.collect<BR>
res85: Array[Int] = Array(1, 4, 5, 6, 8, 10)<BR>test.collect<BR>
res86: Array[Int] = Array(2, 3, 7, 9)<BR><BR>val y =
sc.parallelize(1 to 10)<BR>val splits = y.randomSplit(Array(0.1,
0.3, 0.6))<BR><BR>val rdd1 = splits(0)<BR>val rdd2 = splits(1)<BR>
val rdd3 = splits(2)<BR><BR>rdd1.collect<BR>res87: Array[Int] =
Array(4, 10)<BR>rdd2.collect<BR>res88: Array[Int] = Array(1, 3, 5,
8)<BR>rdd3.collect<BR>res91: Array[Int] = Array(2, 6, 7,
9)<BR></TD></TR></TBODY></TABLE><BIG><BIG><SPAN
style="font-weight: bold;"><BR></SPAN></BIG></BIG>
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><BR><A name="reduce"></A><BR>
reduce </SPAN></BIG></BIG><BR><BR>This function provides the well-known
<SPAN style="font-style: italic;">reduce</SPAN> functionality in Spark.
Please note that any function <SPAN style="font-style: italic;">f</SPAN>
you provide should be commutative and associative in order to generate
reproducible results.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def reduce(f: (T, T) => T):
T</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100, 3)<BR>a.reduce(_ + _)<BR>res41: Int =
5050</TD></TR></TBODY></TABLE></DIV><BR><BR>
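Why commutativity and associativity matter can be shown with a small pure-Scala simulation (a sketch run outside Spark; the three-partition split below is hypothetical, chosen only for illustration). Each partition is reduced locally and the partial results are then combined, so a function like subtraction, which is neither commutative nor associative, produces a result that depends on partition order.

```scala
// Simulate a 3-partition RDD reduce: each partition is reduced locally,
// then the partial results are combined in whatever order they arrive.
val partitions = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))

def simulatedReduce(parts: List[List[Int]], f: (Int, Int) => Int): Int =
  parts.map(_.reduce(f)).reduce(f)

// Addition is commutative and associative: partition order does not matter.
val sumA = simulatedReduce(partitions, _ + _)
val sumB = simulatedReduce(partitions.reverse, _ + _)

// Subtraction is not: reversing the partition order changes the result.
val subA = simulatedReduce(partitions, _ - _)
val subB = simulatedReduce(partitions.reverse, _ - _)
```

Here sumA and sumB are both 45, while subA and subB differ, which is exactly the kind of irreproducibility the warning above is about.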
<HR style="width: 100%; height: 2px;">
<BR><A name="reduceByKey"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">reduceByKey
<SMALL>[Pair]</SMALL>, reduceByKeyLocally <SMALL>[Pair],</SMALL>
reduceByKeyToDriver <SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>This
function provides the well-known <SPAN
style="font-style: italic;">reduce</SPAN> functionality in Spark, applied
separately to the values of each key. Please note that any function
<SPAN style="font-style: italic;">f</SPAN> you provide should be
commutative and associative in order to generate reproducible
results.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def reduceByKey(func: (V, V) => V):
RDD[(K, V)]<BR>def reduceByKey(func: (V, V) => V, numPartitions: Int):
RDD[(K, V)]<BR>def reduceByKey(partitioner: Partitioner, func: (V, V)
=> V): RDD[(K, V)]<BR>def reduceByKeyLocally(func: (V, V) => V):
Map[K, V]<BR>def reduceByKeyToDriver(func: (V, V) => V): Map[K,
V]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)<BR>
val b = a.map(x => (x.length, x))<BR>b.reduceByKey(_ +
_).collect<BR>res86: Array[(Int, String)] =
Array((3,dogcatowlgnuant))<BR><BR>val a = sc.parallelize(List("dog",
"tiger", "lion", "cat", "panther", "eagle"), 2)<BR>val b = a.map(x
=> (x.length, x))<BR>b.reduceByKey(_ + _).collect<BR>res87:
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther),
(5,tigereagle))</TD></TR></TBODY></TABLE></DIV><BR>
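A plain-Scala model of the semantics can help read the examples above (a sketch run outside Spark, not Spark's shuffle-based implementation; modelReduceByKey is a hypothetical helper): reduceByKey behaves like grouping the pairs by key and reducing each group's values with f.

```scala
// Pure-Scala model of reduceByKey's result: group the pairs by key,
// then reduce the values of each group with f.
def modelReduceByKey[K, V](pairs: List[(K, V)], f: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

val words = List("dog", "cat", "owl", "gnu", "ant").map(x => (x.length, x))
// All five words have length 3, so they concatenate under the single key 3.
val merged = modelReduceByKey(words, (a: String, b: String) => a + b)
```

This mirrors the first example above, which yields Array((3,dogcatowlgnuant)).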
<HR style="width: 100%; height: 2px;">
<BR><A name="repartition"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">repartition</SPAN></BIG></BIG><BR><BR>
This function changes the number of partitions to the number specified by
the numPartitions parameter.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def repartition(numPartitions:
Int)(implicit ord: Ordering[T] = null): RDD[T]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 631px; height: 24px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val rdd = sc.parallelize(List(1, 2,
10, 4, 5, 2, 1, 1, 1), 3)<BR>rdd.partitions.length<BR>res2: Int =
3<BR>val rdd2 = rdd.repartition(5)<BR>
rdd2.partitions.length<BR>res6: Int =
5<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A
name="repartitionAndSortWithinPartitions"></A><BR><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">repartitionAndSortWithinPartitions</SPAN></BIG></BIG>
[Ordered]<BR><BR>Repartitions the RDD according to the given partitioner
and, within each resulting partition, sort records by their
keys.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def
repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K,
V)]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 683px; height: 23px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">// first we will do range
partitioning which is not sorted<BR>val randRDD =
sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3,
"book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)<BR>val
rPartitioner = new org.apache.spark.RangePartitioner(3, randRDD)<BR>
val partitioned = randRDD.partitionBy(rPartitioner)<BR>def
myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String]
= {<BR> iter.toList.map(x => "[partID:" + index + ",
val: " + x + "]").iterator<BR>}<BR>
partitioned.mapPartitionsWithIndex(myfunc).collect<BR><BR>res0:
Array[String] = Array([partID:0, val: (2,cat)], [partID:0, val:
(3,book)], [partID:0, val: (1,screen)], [partID:1, val: (4,tv)],
[partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2,
val: (7,cup)])<BR><BR><BR>// now let's repartition but this time
have it sorted<BR>val partitioned =
randRDD.repartitionAndSortWithinPartitions(rPartitioner)<BR>def
myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String]
= {<BR> iter.toList.map(x => "[partID:" + index + ",
val: " + x + "]").iterator<BR>}<BR>
partitioned.mapPartitionsWithIndex(myfunc).collect<BR><BR>res1:
Array[String] = Array([partID:0, val: (1,screen)], [partID:0, val:
(2,cat)], [partID:0, val: (3,book)], [partID:1, val: (4,tv)],
[partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2,
val: (7,cup)])<BR></TD></TR></TBODY></TABLE><BR><BR><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="rightOuterJoin"></A><BR><BR><BR><SPAN style="font-weight: bold;"></SPAN><BIG><BIG><SPAN
style="font-weight: bold;">rightOuterJoin
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Performs a right outer
join using two key-value RDDs. Please note that the keys must be generally
comparable to make this work correctly.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def rightOuterJoin[W](other: RDD[(K, W)]):
RDD[(K, (Option[V], W))]<BR>def rightOuterJoin[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (Option[V], W))]<BR>def
rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(Option[V], W))]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val d = c.keyBy(_.length)<BR>
b.rightOuterJoin(d).collect<BR><BR>res2: Array[(Int,
(Option[String], String))] = Array((6,(Some(salmon),salmon)),
(6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)),
(6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)),
(6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)),
(3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)),
(3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)),
(4,(None,wolf)), (4,(None,bear)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="sample"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">sample</SPAN></BIG></BIG><BR><BR>
Randomly selects a fraction of the items of a RDD and returns them in a
new RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def sample(withReplacement: Boolean,
fraction: Double, seed: Int): RDD[T]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10000, 3)<BR>a.sample(false, 0.1,
0).count<BR>res24: Long = 960<BR><BR>a.sample(true, 0.3,
0).count<BR>res25: Long = 2888<BR><BR>a.sample(true, 0.3,
13).count<BR>res26: Long = 2985</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="sampleByKey"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">sampleByKey</SPAN></BIG></BIG>
[Pair]<BR><BR>Randomly samples the key-value pair RDD according to the
fraction of each key you want to appear in the final RDD.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def sampleByKey(withReplacement: Boolean,
fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K,
V)]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 640px; height: 24px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val randRDD = sc.parallelize(List(
(7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6,
"screen"), (7, "heater")))<BR>val sampleMap = List((7, 0.4), (6,
0.6)).toMap<BR>randRDD.sampleByKey(false,
sampleMap,42).collect<BR><BR>res6: Array[(Int, String)] =
Array((7,cat), (6,mouse), (6,book), (6,screen),
(7,heater))<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="sampleByKeyExact"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">sampleByKeyExact</SPAN></BIG></BIG>
[Pair, experimental]<BR><BR>This is labelled as experimental and so we do
not document it.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def sampleByKeyExact(withReplacement:
Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong):
RDD[(K, V)]<BR></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsHadoopFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsHadoopFile
<SMALL>[Pair]</SMALL>, saveAsHadoopDataset <SMALL>[Pair]</SMALL>,
saveAsNewAPIHadoopFile <SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>
Saves the RDD in a Hadoop compatible format using any Hadoop outputFormat
class the user specifies.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def saveAsHadoopDataset(conf: JobConf)<BR>
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String)(implicit
fm: ClassTag[F])<BR>def saveAsHadoopFile[F <: OutputFormat[K,
V]](path: String, codec: Class[_ <: CompressionCodec]) (implicit fm:
ClassTag[F])<BR>def saveAsHadoopFile(path: String, keyClass: Class[_],
valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_,
_]], codec: Class[_ <: CompressionCodec])<BR>def saveAsHadoopFile(path:
String, keyClass: Class[_], valueClass: Class[_], outputFormatClass:
Class[_ <: OutputFormat[_, _]], conf: JobConf = new
JobConf(self.context.hadoopConfiguration), codec: Option[Class[_ <:
CompressionCodec]] = None)<BR>def saveAsNewAPIHadoopFile[F <:
NewOutputFormat[K, V]](path: String)(implicit fm: ClassTag[F])<BR>def
saveAsNewAPIHadoopFile(path: String, keyClass: Class[_], valueClass:
Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf:
Configuration = self.context.hadoopConfiguration)</DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsObjectFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsObjectFile</SPAN></BIG></BIG><BR><BR>
Saves the RDD in binary format.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def saveAsObjectFile(path:
String)<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(1 to 100, 3)<BR>
x.saveAsObjectFile("objFile")<BR>val y =
sc.objectFile[Int]("objFile")<BR>y.collect<BR>res52: Array[Int]
= Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsSequenceFile"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsSequenceFile
<SMALL>[SeqFile]</SMALL></SPAN></BIG></BIG><BR><BR>Saves the RDD as a
Hadoop sequence file.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def saveAsSequenceFile(path: String,
codec: Option[Class[_ <: CompressionCodec]] = None)<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1),
("cat",2), ("ant",5)), 2)<BR>v.saveAsSequenceFile("hd_seq_file")<BR>
14/04/19 05:45:43 INFO FileOutputCommitter: Saved output of task
'attempt_201404190545_0000_m_000001_191' to
file:/home/cloudera/hd_seq_file<BR><BR>[cloudera@localhost ~]$ ll
~/hd_seq_file<BR>total 8<BR>-rwxr-xr-x 1 cloudera cloudera 117 Apr
19 05:45 part-00000<BR>-rwxr-xr-x 1 cloudera cloudera 133 Apr 19
05:45 part-00001<BR>-rwxr-xr-x 1 cloudera cloudera 0 Apr
19 05:45 _SUCCESS</TD></TR></TBODY></TABLE></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsTextFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsTextFile</SPAN></BIG></BIG><BR><BR>
Saves the RDD as text files, one element per line.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def saveAsTextFile(path: String)<BR>def
saveAsTextFile(path: String, codec: Class[_ <:
CompressionCodec])<BR></DIV><BR><SPAN style="font-weight: bold;">Example
without compression</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10000, 3)<BR>
a.saveAsTextFile("mydata_a")<BR>14/04/03 21:11:36 INFO
FileOutputCommitter: Saved output of task
'attempt_201404032111_0000_m_000002_71' to
file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a<BR><BR><BR>
[cloudera@localhost ~]$ head -n 5
~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/part-00000<BR>
1<BR>2<BR>3<BR>4<BR>5<BR><BR>// Produces 3 output files since we
have created the RDD with 3 partitions<BR>[cloudera@localhost ~]$
ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/<BR>
-rwxr-xr-x 1 cloudera cloudera 15558 Apr 3 21:11
part-00000<BR>-rwxr-xr-x 1 cloudera cloudera 16665 Apr 3 21:11
part-00001<BR>-rwxr-xr-x 1 cloudera cloudera 16671 Apr 3 21:11
part-00002</TD></TR></TBODY></TABLE></DIV><BR><BR><SPAN style="font-weight: bold;">Example
with compression</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">import
org.apache.hadoop.io.compress.GzipCodec<BR>
a.saveAsTextFile("mydata_b", classOf[GzipCodec])<BR><BR>
[cloudera@localhost ~]$ ll
~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_b/<BR>total
24<BR>-rwxr-xr-x 1 cloudera cloudera 7276 Apr 3 21:29
part-00000.gz<BR>-rwxr-xr-x 1 cloudera cloudera 6517 Apr 3
21:29 part-00001.gz<BR>-rwxr-xr-x 1 cloudera cloudera 6525 Apr
3 21:29 part-00002.gz<BR><BR>val x = sc.textFile("mydata_b")<BR>
x.count<BR>res2: Long = 10000</TD></TR></TBODY></TABLE></DIV><BR><BR><SPAN
style="font-weight: bold;">Example writing into HDFS<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)<BR>
x.saveAsTextFile("hdfs://localhost:8020/user/cloudera/test");<BR><BR>
val sp =
sc.textFile("hdfs://localhost:8020/user/cloudera/sp_data")<BR>
sp.flatMap(_.split("
")).saveAsTextFile("hdfs://localhost:8020/user/cloudera/sp_x")</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="stats"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">stats
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Simultaneously computes
the mean, variance and the standard deviation of all values in the
RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def stats():
StatCounter<BR></DIV><BR><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29,
11.09, 21.0), 2)<BR>x.stats<BR>res16:
org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667,
stdev: 8.126859)</TD></TR></TBODY></TABLE></DIV><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;"></SPAN></BIG></BIG>
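What makes `stats` cheap on a distributed RDD is that per-partition summaries can be merged without revisiting the data. As a minimal plain-Scala sketch (no Spark; the `Summary`, `merge` and `summarize` names are ours, not Spark's StatCounter API), the merge step combines two partial counts, means and sums of squared deviations:

```scala
// Sketch of merging per-partition statistics into global count/mean/stdev,
// in the spirit of org.apache.spark.util.StatCounter. All names are
// illustrative, not Spark's API.
case class Summary(count: Long, mean: Double, m2: Double) {
  def stdev: Double = math.sqrt(m2 / count)
  def merge(o: Summary): Summary = {
    val n = count + o.count
    val delta = o.mean - mean
    // Parallel mean/variance merge: shift the mean, combine squared deviations.
    val mergedMean = mean + delta * o.count / n
    val mergedM2 = m2 + o.m2 + delta * delta * count * o.count / n
    Summary(n, mergedMean, mergedM2)
  }
}

def summarize(xs: Seq[Double]): Summary = {
  val n = xs.size.toLong
  val mean = xs.sum / n
  Summary(n, mean, xs.map(x => (x - mean) * (x - mean)).sum)
}

val data = List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0)
// Split into two "partitions", summarize each, then merge the summaries.
val (p0, p1) = data.splitAt(5)
val merged = summarize(p0).merge(summarize(p1))
```

Merging the two halves gives the same count, mean and stdev as a single pass over the whole list, which matches the values the example above shows.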
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN
style="font-weight: bold;"></SPAN></BIG></BIG><BIG><BIG><SPAN style="font-weight: bold;"><BR><A
name="sortBy"></A><BR>sortBy</SPAN></BIG></BIG><BIG><BIG><SPAN style="font-weight: bold;"><BR></SPAN></BIG></BIG><BR>
This function sorts the input RDD's data and stores it in a new RDD. The
first parameter is a function that maps each input item to the key you
want to sort by. The optional second parameter specifies whether the data
should be sorted in ascending or descending
order.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def sortBy[K](f: (T) ⇒ K, ascending:
Boolean = true, numPartitions: Int = this.partitions.size)(implicit ord:
Ordering[K], ctag: ClassTag[K]): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;"><BR></SPAN></BIG></BIG>
<TABLE style="width: 686px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;"><BR>val y = sc.parallelize(Array(5,
7, 1, 3, 2, 1))<BR>y.sortBy(c => c, true).collect<BR>res101:
Array[Int] = Array(1, 1, 2, 3, 5, 7)<BR><BR>y.sortBy(c => c,
false).collect<BR>res102: Array[Int] = Array(7, 5, 3, 2, 1,
1)<BR><BR>val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z",
1), ("L", 5)))<BR>z.sortBy(c => c._1, true).collect<BR>res109:
Array[(String, Int)] = Array((A,26), (H,10), (L,5), (Z,1))<BR><BR>
z.sortBy(c => c._2, true).collect<BR>res108: Array[(String, Int)]
= Array((Z,1), (L,5), (H,10),
(A,26))<BR></TD></TR></TBODY></TABLE><BIG><BIG><SPAN style="font-weight: bold;"><BR><BR></SPAN></BIG></BIG>
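Conceptually, sortBy is just keyBy(f), sortByKey, then dropping the keys again. A minimal plain-Scala sketch of that decomposition (no Spark; `sortBySketch` is our illustrative helper, not a Spark method):

```scala
// Plain-Scala sketch of sortBy's decomposition: key each item with f,
// sort by that key (ascending or descending), then drop the key again.
def sortBySketch[T, K](xs: Seq[T], f: T => K, ascending: Boolean = true)
                      (implicit ord: Ordering[K]): Seq[T] = {
  val keyed = xs.map(x => (f(x), x))                // keyBy(f)
  val effOrd = if (ascending) ord else ord.reverse  // ascending flag
  keyed.sortBy(_._1)(effOrd).map(_._2)              // sortByKey(...).values
}

val ys = Seq(5, 7, 1, 3, 2, 1)
val asc = sortBySketch[Int, Int](ys, c => c)
val desc = sortBySketch[Int, Int](ys, c => c, ascending = false)
```

With the identity key function this reproduces the ascending and descending results of the example above.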
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><BR><A
name="sortByKey"></A><BR><BR>sortByKey
<SMALL>[Ordered]</SMALL></SPAN></BIG></BIG><BR><BR>This function sorts the
input RDD's data and stores it in a new RDD. The output RDD is a shuffled
RDD, since the data has to be redistributed across partitions before it
can be sorted. The implementation is actually quite clever: first it uses
a range partitioner to partition the data into ranges within the shuffled
RDD, then it sorts each range individually with mapPartitions using
standard sort mechanisms.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def sortByKey(ascending: Boolean = true,
numPartitions: Int = self.partitions.size): RDD[P]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)<BR>
val b = sc.parallelize(1 to a.count.toInt, 2)<BR>val c =
a.zip(b)<BR>c.sortByKey(true).collect<BR>res74: Array[(String, Int)]
= Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))<BR>
c.sortByKey(false).collect<BR>res75: Array[(String, Int)] =
Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))<BR><BR>val a =
sc.parallelize(1 to 100, 5)<BR>val b = a.cartesian(a)<BR>val c =
sc.parallelize(b.takeSample(true, 5, 13), 2)<BR>val d =
c.sortByKey(false)<BR>d.collect<BR>res56: Array[(Int, Int)] = Array((96,9),
(84,76), (59,59), (53,65), (52,4))</TD></TR></TBODY></TABLE></DIV><BR><BR>
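The range-partition-then-local-sort strategy described above can be sketched in plain Scala (no Spark; `rangePartitionSort` and its bounds are our illustrative names): because every key in partition i is less than or equal to every key in partition i+1, concatenating the locally sorted partitions yields a totally ordered result.

```scala
// Plain-Scala sketch of sortByKey's strategy: range-partition the pairs
// by key, sort each range locally, then concatenate the sorted ranges.
def rangePartitionSort[K, V](pairs: Seq[(K, V)], bounds: Seq[K])
                            (implicit ord: Ordering[K]): Seq[(K, V)] = {
  // A pair goes to the first range whose upper bound is >= its key;
  // the last partition takes everything beyond the final bound.
  def partIdx(k: K): Int = {
    val i = bounds.indexWhere(b => ord.lteq(k, b))
    if (i >= 0) i else bounds.size
  }
  val buckets = pairs.groupBy(p => partIdx(p._1))
  (0 to bounds.size).flatMap { i =>
    buckets.getOrElse(i, Seq.empty).sortBy(_._1) // local sort per range
  }
}

val pairs = Seq(("owl", 3), ("ant", 5), ("gnu", 4), ("dog", 1), ("cat", 2))
// Two range bounds -> three "partitions"; the result equals a global sort.
val sorted = rangePartitionSort(pairs, Seq("czz", "gzz"))
```

In Spark the bounds come from sampling the keys, so the ranges are roughly balanced; here they are fixed for illustration.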
<HR style="width: 100%; height: 2px;">
<BR><A name="stdev"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">stdev <SMALL>[Double], sampleStdev
[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Calls <SPAN style="font-style: italic;">stats</SPAN>
and extracts either the <SPAN
style="font-style: italic;">stdev</SPAN> component (population standard
deviation) or the Bessel-corrected <SPAN
style="font-style: italic;">sampleStdev</SPAN> component (sample standard
deviation).<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def stdev(): Double<BR>def sampleStdev():
Double<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
d = sc.parallelize(List(0.0, 0.0, 0.0), 3)<BR>d.stdev<BR>res10:
Double = 0.0<BR>d.sampleStdev<BR>res11: Double = 0.0<BR><BR>val d =
sc.parallelize(List(0.0, 1.0), 3)<BR>d.stdev<BR>d.sampleStdev<BR>
res18: Double = 0.5<BR>res19: Double = 0.7071067811865476<BR><BR>val
d = sc.parallelize(List(0.0, 0.0, 1.0), 3)<BR>d.stdev<BR>res14:
Double = 0.4714045207910317<BR>d.sampleStdev<BR>res15: Double =
0.5773502691896257</TD></TR></TBODY></TABLE></DIV><BR><BR>
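The only difference between the two variants is the divisor of the squared deviations: n for stdev (population) versus n - 1 for sampleStdev (Bessel's correction). A minimal plain-Scala sketch (no Spark; the helper names are ours):

```scala
// Plain-Scala sketch: stdev divides the summed squared deviations by n,
// sampleStdev by n - 1 (Bessel's correction).
def stdevSketch(xs: Seq[Double]): Double = {
  val m = xs.sum / xs.size
  math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
}

def sampleStdevSketch(xs: Seq[Double]): Double = {
  val m = xs.sum / xs.size
  math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1))
}

val d = Seq(0.0, 1.0)
// mean 0.5; squared deviations sum to 0.5
val pop = stdevSketch(d)        // sqrt(0.5 / 2) = 0.5
val samp = sampleStdevSketch(d) // sqrt(0.5 / 1) = 0.7071...
```

This reproduces the res18/res19 pair in the example above for the two-element RDD.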
<HR style="width: 100%; height: 2px;">
<BR><A name="subtract"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">subtract</SPAN></BIG></BIG><BR><BR>
Performs the well-known standard set subtraction operation: A -
B<BR><BR><SPAN style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def subtract(other: RDD[T]): RDD[T]<BR>def
subtract(other: RDD[T], numPartitions: Int): RDD[T]<BR>def subtract(other:
RDD[T], p: Partitioner): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>val b = sc.parallelize(1 to 3,
3)<BR>val c = a.subtract(b)<BR>c.collect<BR>res3: Array[Int] =
Array(6, 9, 4, 7, 5, 8)</TD></TR></TBODY></TABLE></DIV><BR><BR>
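On plain collections the same operation amounts to filtering out every element that occurs in the other collection. A minimal plain-Scala sketch (no Spark; `subtractSketch` is our illustrative name; note that, like Spark's subtract, duplicates in A that survive the filter are kept):

```scala
// Plain-Scala sketch of subtract: keep every element of a that does not
// occur in b. Surviving duplicates in a are all kept.
def subtractSketch[T](a: Seq[T], b: Seq[T]): Seq[T] = {
  val remove = b.toSet
  a.filterNot(remove)
}

val a = 1 to 9
val b = 1 to 3
val c = subtractSketch(a, b)
```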
<HR style="width: 100%; height: 2px;">
<BR><A name="subtractByKey"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">subtractByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Very similar to <SPAN
style="font-style: italic;">subtract</SPAN>, but instead of supplying a
function, the key component of each pair is automatically used as the
criterion for removing items from the first RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def subtractByKey[W: ClassTag](other:
RDD[(K, W)]): RDD[(K, V)]<BR>def subtractByKey[W: ClassTag](other: RDD[(K,
W)], numPartitions: Int): RDD[(K, V)]<BR>def subtractByKey[W:
ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K,
V)]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider",
"eagle"), 2)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("ant", "falcon", "squid"), 2)<BR>val d =
c.keyBy(_.length)<BR>b.subtractByKey(d).collect<BR>res15:
Array[(Int, String)] =
Array((4,lion))</TD></TR></TBODY></TABLE></DIV><BR><BR>
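The key-based removal can be sketched in plain Scala (no Spark; `subtractByKeySketch` is our illustrative name): drop every pair from the first collection whose key also occurs in the second, regardless of the values.

```scala
// Plain-Scala sketch of subtractByKey: remove pairs whose key occurs in
// the other collection, ignoring values entirely.
def subtractByKeySketch[K, V, W](left: Seq[(K, V)],
                                 right: Seq[(K, W)]): Seq[(K, V)] = {
  val keys = right.map(_._1).toSet
  left.filterNot { case (k, _) => keys(k) }
}

val b = Seq("dog", "tiger", "lion", "cat", "spider", "eagle")
          .map(w => (w.length, w)) // keyBy(_.length)
val d = Seq("ant", "falcon", "squid").map(w => (w.length, w))
val res = subtractByKeySketch(b, d)
```

Only "lion" survives, since 4 is the only left-hand key that does not occur on the right, matching the example above.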
<HR style="width: 100%; height: 2px;">
<BR><A name="sum"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">sum <SMALL>[Double], sumApprox
[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Computes the sum of all values
contained in the RDD. The approximate version of the function can finish
somewhat faster in some scenarios. However, it trades accuracy for
speed.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def sum(): Double<BR>def
sumApprox(timeout: Long, confidence: Double = 0.95):
PartialResult[BoundedDouble]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29,
11.09, 21.0), 2)<BR>x.sum<BR>res17: Double =
101.39999999999999</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="take"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">take</SPAN></BIG></BIG><BR><BR>Extracts the
first <SPAN style="font-style: italic;">n</SPAN> items of the RDD and
returns them as an array. <SPAN style="font-style: italic;">(Note: This
sounds very easy, but it is actually quite a tricky problem for the
implementors of Spark because the items in question can be in many
different partitions.)</SPAN><BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def take(num: Int):
Array[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"),
2)<BR>b.take(2)<BR>res18: Array[String] = Array(dog, cat)<BR><BR>val
b = sc.parallelize(1 to 10000, 5000)<BR>b.take(100)<BR>res6:
Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
100)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="takeOrdered"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">takeOrdered</SPAN></BIG></BIG><BR><BR>
Orders the data items of the RDD using their implicit ordering
function and returns the first <SPAN style="font-style: italic;">n</SPAN>
items as an array.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def takeOrdered(num: Int)(implicit ord:
Ordering[T]): Array[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"),
2)<BR>b.takeOrdered(2)<BR>res19: Array[String] = Array(ape,
cat)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="takeSample"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">takeSample</SPAN></BIG></BIG><BR><BR>
Behaves differently from <SPAN style="font-style: italic;">sample</SPAN> in
the following respects:<BR>
<UL>
<LI> It will return an exact number of samples <SPAN style="font-style: italic;">(Hint:
2nd parameter)</SPAN></LI>
<LI> It returns an Array instead of an RDD.</LI>
<LI> It internally randomizes the order of the items
returned.</LI></UL><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def takeSample(withReplacement: Boolean,
num: Int, seed: Int): Array[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(1 to 1000, 3)<BR>x.takeSample(true, 100, 1)<BR>
res3: Array[Int] = Array(339, 718, 810, 105, 71, 268, 333, 360, 341,
300, 68, 848, 431, 449, 773, 172, 802, 339, 431, 285, 937, 301,
167, 69, 330, 864, 40, 645, 65, 349, 613, 468, 982, 314, 160, 675,
232, 794, 577, 571, 805, 317, 136, 860, 522, 45, 628, 178, 321, 482,
657, 114, 332, 728, 901, 290, 175, 876, 227, 130, 863, 773, 559,
301, 694, 460, 839, 952, 664, 851, 260, 729, 823, 880, 792, 964,
614, 821, 683, 364, 80, 875, 813, 951, 663, 344, 546, 918, 436, 451,
397, 670, 756, 512, 391, 70, 213, 896, 123,
858)</TD></TR></TBODY></TABLE></DIV><BR><BR>
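The exact-count contract can be sketched in plain Scala (no Spark; `takeSampleSketch` is our illustrative name, and only the with-replacement case is shown): with a seeded RNG it always returns exactly num elements as a local collection rather than an RDD.

```scala
import scala.util.Random

// Plain-Scala sketch of takeSample's contract (with replacement): draw
// exactly num elements using a seeded RNG, returning a local collection.
def takeSampleSketch[T](xs: IndexedSeq[T], num: Int, seed: Long): IndexedSeq[T] = {
  val rng = new Random(seed)
  IndexedSeq.fill(num)(xs(rng.nextInt(xs.size))) // sample with replacement
}

val x = (1 to 1000).toIndexedSeq
val sample = takeSampleSketch(x, 100, seed = 1L)
```

Unlike sample, whose fraction parameter only gives an expected size, the result here is always exactly 100 elements.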
<HR style="width: 100%; height: 2px;">
<BR><A name="toDebugString"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toDebugString</SPAN></BIG></BIG><BR><BR>
Returns a string that contains debug information about the RDD and its
dependencies.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def toDebugString:
String<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>val b = sc.parallelize(1 to 3,
3)<BR>val c = a.subtract(b)<BR>c.toDebugString<BR>res6: String =
<BR>MappedRDD[15] at subtract at <console>:16 (3
partitions)<BR> SubtractedRDD[14] at subtract at
<console>:16 (3 partitions)<BR>
MappedRDD[12] at subtract at <console>:16 (3 partitions)<BR>
ParallelCollectionRDD[10] at
parallelize at <console>:12 (3 partitions)<BR>
MappedRDD[13] at subtract at <console>:16
(3 partitions)<BR>
ParallelCollectionRDD[11] at parallelize at <console>:12 (3
partitions)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="toJavaRDD"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toJavaRDD</SPAN></BIG></BIG><BR><BR>
Embeds this RDD object within a JavaRDD object and returns
it.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def toJavaRDD() :
JavaRDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.toJavaRDD<BR>res3: org.apache.spark.api.java.JavaRDD[String] =
ParallelCollectionRDD[6] at parallelize at
<console>:12</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="toLocalIterator"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toLocalIterator</SPAN></BIG></BIG><BR><BR>
Converts the RDD into a Scala iterator at the master node.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def toLocalIterator:
Iterator[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 627px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z =
sc.parallelize(List(1,2,3,4,5,6), 2)<BR>val iter =
z.toLocalIterator<BR><BR>iter.next<BR>res51: Int = 1<BR><BR>
iter.next<BR>res52: Int = 2<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="top"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">top</SPAN></BIG></BIG><BR><BR>Utilizes the
implicit ordering of <SPAN style="font-style: italic;">T</SPAN> to determine the top <SPAN style="font-style: italic;">k</SPAN> values and returns them
as an array.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def top(num: Int)(implicit ord:
Ordering[T]): Array[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)<BR>c.top(2)<BR>
res28: Array[Int] = Array(9, 8)</TD></TR></TBODY></TABLE></DIV><BR>
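top and takeOrdered are two ends of the same sort: top(n) is takeOrdered(n) under the reversed ordering. A minimal plain-Scala sketch (no Spark; the `*Sketch` helpers are our illustrative names):

```scala
// Plain-Scala sketch: takeOrdered takes the first n under the ordering,
// top takes the first n under the reversed ordering.
def takeOrderedSketch[T](xs: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
  xs.sorted(ord).take(n)

def topSketch[T](xs: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
  takeOrderedSketch(xs, n)(ord.reverse)

val c = Seq(6, 9, 4, 7, 5, 8)
val top2 = topSketch(c, 2)            // highest two values: 9, 8
val bottom2 = takeOrderedSketch(c, 2) // lowest two values: 4, 5
```

This reproduces res28 from the example above and the corresponding takeOrdered behaviour.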
<HR style="width: 100%; height: 2px;">
<BR><A name="toString"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toString</SPAN></BIG></BIG><BR><BR>
Assembles a human-readable textual description of the
RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">override def toString:
String<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
z = sc.parallelize(List(1,2,3,4,5,6), 2)<BR>z.toString<BR>res61:
String = ParallelCollectionRDD[80] at parallelize at
<console>:21<BR><BR>val randRDD = sc.parallelize(List(
(7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6,
"screen"), (7, "heater")))<BR>val sortedRDD =
randRDD.sortByKey()<BR>sortedRDD.toString<BR>res64: String =
ShuffledRDD[88] at sortByKey at
<console>:23<BR><BR></TD></TR></TBODY></TABLE></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="treeAggregate"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">treeAggregate</SPAN></BIG></BIG><BR><BR>
Computes the same thing as aggregate, except it aggregates the elements of
the RDD in a multi-level tree pattern. Another difference is that it does
not use the initial value for the second reduce function (combOp).
By default a tree of depth 2 is used, but this can be changed via the
depth parameter.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def treeAggregate[U](zeroValue: U)(seqOp:
(U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0:
ClassTag[U]): U<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR><BR>
<TABLE style="width: 713px; height: 376px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z =
sc.parallelize(List(1,2,3,4,5,6), 2)<BR><BR>// lets first print out
the contents of the RDD with partition labels<BR>def myfunc(index:
Int, iter: Iterator[(Int)]) : Iterator[String] = {<BR>
iter.toList.map(x => "[partID:" + index + ", val: " + x +
"]").iterator<BR>}<BR><BR>
z.mapPartitionsWithIndex(myfunc).collect<BR>res28: Array[String] =
Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3],
[partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])<BR><BR>
z.treeAggregate(0)(math.max(_, _), _ + _)<BR>res40: Int = 9<BR><BR>
// Note: unlike normal aggregate, treeAggregate does not apply the
initial value for the second reduce<BR>// This example returns 11
since the initial value is 5<BR>// reduce of partition 0 will be
max(5, 1, 2, 3) = 5<BR>// reduce of partition 1 will be max(4, 5, 6)
= 6<BR>// final reduce across partitions will be 5 + 6 = 11<BR>//
note the final reduce does not include the initial value<BR>
z.treeAggregate(5)(math.max(_, _), _ + _)<BR>res42: Int =
11</TD></TR></TBODY></TABLE><BR><BR>
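The zero-value behaviour in the comments above can be sketched in plain Scala (no Spark; `treeAggregateSketch` is our illustrative name, and it flattens the multi-level tree into a single combine step, which is enough to show the semantics): each partition folds with the zero value, but the cross-partition combine is a plain reduce that never sees the zero again.

```scala
// Plain-Scala sketch: seqOp folds each "partition" with the zero value,
// combOp reduces the partial results WITHOUT reusing the zero.
def treeAggregateSketch[T, U](partitions: Seq[Seq[T]], zero: U)
                             (seqOp: (U, T) => U, combOp: (U, U) => U): U =
  partitions.map(p => p.foldLeft(zero)(seqOp)).reduce(combOp)

val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
// per-partition max with zero 5: max(5,1,2,3) = 5 and max(5,4,5,6) = 6;
// cross-partition sum without the zero: 5 + 6 = 11
val result = treeAggregateSketch(parts, 5)(math.max, _ + _)
```

With zero 0 the same call gives 3 + 6 = 9, matching res40 in the example above.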
<HR style="width: 100%; height: 2px;">
<BR><A name="treeReduce"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">treeReduce</SPAN></BIG></BIG><BR><BR>
Works like reduce, except it reduces the elements of the RDD in a multi-level
tree pattern.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def treeReduce(f: (T, T) ⇒ T, depth:
Int = 2): T<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 753px; height: 43px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z =
sc.parallelize(List(1,2,3,4,5,6), 2)<BR>z.treeReduce(_+_)<BR>res49:
Int = 21<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="union"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">union, ++</SPAN></BIG></BIG><BR><BR>Performs
the standard set operation: A union B<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def ++(other: RDD[T]): RDD[T]<BR>def
union(other: RDD[T]): RDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 3, 1)<BR>val b = sc.parallelize(5 to 7,
1)<BR>(a ++ b).collect<BR>res0: Array[Int] = Array(1, 2, 3, 5, 6,
7)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="unpersist"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">unpersist</SPAN></BIG></BIG><BR><BR>
Dematerializes the RDD <SPAN style="font-style: italic;">(i.e. erases all
data items from disk and memory)</SPAN>. However, the RDD object
remains. If it is referenced in a computation, Spark will regenerate it
automatically using the stored dependency graph.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def unpersist(blocking: Boolean = true):
RDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>val z = (y++y)<BR>z.collect<BR>
z.unpersist(true)<BR>14/04/19 03:04:57 INFO UnionRDD: Removing RDD
22 from persistence list<BR>14/04/19 03:04:57 INFO BlockManager:
Removing RDD 22</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="values"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">values</SPAN></BIG></BIG><BR><BR>
Extracts the values from all contained tuples and returns them in a new
RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def values: RDD[V]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.values.collect<BR>res3: Array[String] = Array(dog, tiger, lion,
cat, panther, eagle)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="variance"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">variance
<SMALL>[Double]</SMALL>, sampleVariance
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Calls stats and extracts
either the <SPAN style="font-style: italic;">variance</SPAN> component
(population variance) or the Bessel-corrected <SPAN
style="font-style: italic;">sampleVariance</SPAN> component (sample
variance).<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def variance(): Double<BR>def
sampleVariance(): Double<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1,
7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)<BR>a.variance<BR>res70:
Double = 10.605333333333332<BR><BR>val x = sc.parallelize(List(1.0,
2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)<BR>
x.variance<BR>res14: Double = 66.04584444444443<BR><BR>
x.sampleVariance<BR>res13: Double =
74.30157499999999</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="zip"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">zip</SPAN></BIG></BIG><BR><BR>Joins two RDDs by
combining the i-th item of each RDD into a pair. The resulting RDD
consists of two-component tuples which are interpreted as key-value
pairs by the methods provided by the PairRDDFunctions
extension.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zip[U: ClassTag](other: RDD[U]):
RDD[(T, U)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100, 3)<BR>val b = sc.parallelize(101 to
200, 3)<BR>a.zip(b).collect<BR>res1: Array[(Int, Int)] =
Array((1,101), (2,102), (3,103), (4,104), (5,105), (6,106), (7,107),
(8,108), (9,109), (10,110), (11,111), (12,112), (13,113), (14,114),
(15,115), (16,116), (17,117), (18,118), (19,119), (20,120),
(21,121), (22,122), (23,123), (24,124), (25,125), (26,126),
(27,127), (28,128), (29,129), (30,130), (31,131), (32,132),
(33,133), (34,134), (35,135), (36,136), (37,137), (38,138),
(39,139), (40,140), (41,141), (42,142), (43,143), (44,144),
(45,145), (46,146), (47,147), (48,148), (49,149), (50,150),
(51,151), (52,152), (53,153), (54,154), (55,155), (56,156),
(57,157), (58,158), (59,159), (60,160), (61,161), (62,162),
(63,163), (64,164), (65,165), (66,166), (67,167), (68,168),
(69,169), (70,170), (71,171), (72,172), (73,173), (74,174),
(75,175), (76,176), (77,177), (78,...<BR><BR>val a =
sc.parallelize(1 to 100, 3)<BR>val b = sc.parallelize(101 to 200,
3)<BR>val c = sc.parallelize(201 to 300, 3)<BR>
a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2 )).collect<BR>
res12: Array[(Int, Int, Int)] = Array((1,101,201), (2,102,202),
(3,103,203), (4,104,204), (5,105,205), (6,106,206), (7,107,207),
(8,108,208), (9,109,209), (10,110,210), (11,111,211), (12,112,212),
(13,113,213), (14,114,214), (15,115,215), (16,116,216),
(17,117,217), (18,118,218), (19,119,219), (20,120,220),
(21,121,221), (22,122,222), (23,123,223), (24,124,224),
(25,125,225), (26,126,226), (27,127,227), (28,128,228),
(29,129,229), (30,130,230), (31,131,231), (32,132,232),
(33,133,233), (34,134,234), (35,135,235), (36,136,236),
(37,137,237), (38,138,238), (39,139,239), (40,140,240),
(41,141,241), (42,142,242), (43,143,243), (44,144,244),
(45,145,245), (46,146,246), (47,147,247), (48,148,248),
(49,149,249), (50,150,250), (51,151,251), (52,152,252),
(53,153,253), (54,154,254),
(55,155,255)...</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="zipPartitions"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">zipPartitions</SPAN></BIG></BIG><BR><BR>
Similar to <SPAN style="font-style: italic;">zip</SPAN>, but provides more
control over the zipping process.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zipPartitions[B: ClassTag, V:
ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]):
RDD[V]<BR>def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B],
preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) =>
Iterator[V]): RDD[V]<BR>def zipPartitions[B: ClassTag, C: ClassTag, V:
ClassTag](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B],
Iterator[C]) => Iterator[V]): RDD[V]<BR>def zipPartitions[B: ClassTag,
C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C],
preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C])
=> Iterator[V]): RDD[V]<BR>def zipPartitions[B: ClassTag, C: ClassTag,
D: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f:
(Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]):
RDD[V]<BR>def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V:
ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D],
preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B],
Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(0 to 9, 3)<BR>val b = sc.parallelize(10 to 19,
3)<BR>val c = sc.parallelize(100 to 109, 3)<BR>def myfunc(aiter:
Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]):
Iterator[String] =<BR>{<BR> var res = List[String]()<BR>
while (aiter.hasNext && biter.hasNext &&
citer.hasNext)<BR> {<BR> val x = aiter.next
+ " " + biter.next + " " + citer.next<BR> res ::=
x<BR> }<BR> res.iterator<BR>}<BR>a.zipPartitions(b,
c)(myfunc).collect<BR>res50: Array[String] = Array(2 12 102, 1 11
101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9 19 109, 8 18 108, 7
17 107, 6 16 106)</TD></TR></TBODY></TABLE><BR></DIV><BR>
<HR style="width: 100%; height: 2px;">
<A name="zipWithIndex"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">zipWithIndex</SPAN></BIG></BIG><BR><BR>
Zips the elements of the RDD with their element indexes. The indexes start
from 0. If the RDD is spread across multiple partitions, a Spark job
is started to perform this operation.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zipWithIndex(): RDD[(T,
Long)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 629px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z = sc.parallelize(Array("A",
"B", "C", "D"))<BR>val r = z.zipWithIndex<BR>r.collect<BR>res110: Array[(String,
Long)] = Array((A,0), (B,1), (C,2), (D,3))<BR><BR>val z =
sc.parallelize(100 to 120, 5)<BR>val r = z.zipWithIndex<BR>
r.collect<BR>res11: Array[(Int, Long)] = Array((100,0), (101,1),
(102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8),
(109,9), (110,10), (111,11), (112,12), (113,13), (114,14), (115,15),
(116,16), (117,17), (118,18), (119,19),
(120,20))<BR><BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="zipWithUniqueId"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">zipWithUniqueId</SPAN></BIG></BIG><BR><BR>
This is different from zipWithIndex in that it just gives a unique id to
each data element, but the ids may not match the index number of the data
element. This operation does not start a Spark job even if the RDD is
spread across multiple partitions.<BR>Compare the results of the example
below with that of the 2nd example of zipWithIndex. You should be able to
see the difference.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zipWithUniqueId(): RDD[(T,
Long)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 672px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z = sc.parallelize(100 to 120,
5)<BR>val r = z.zipWithUniqueId<BR>r.collect<BR><BR>res12:
Array[(Int, Long)] = Array((100,0), (101,5), (102,10), (103,15),
(104,1), (105,6), (106,11), (107,16), (108,2), (109,7), (110,12),
(111,17), (112,3), (113,8), (114,13), (115,18), (116,4), (117,9),
(118,14), (119,19),
(120,24))<BR></TD></TR></TBODY></TABLE><BR><BR></TD></TR></TBODY></TABLE>
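The ids above follow a simple rule: the k-th item of partition p (out of n partitions) receives id p + k*n, which is why no job is needed. A minimal plain-Scala sketch (no Spark; the helper names are ours, and `slice` mimics how parallelize splits 100 to 120 into 5 partitions):

```scala
// Plain-Scala sketch of zipWithUniqueId's id rule: the k-th item of
// partition p (out of n partitions) gets id p + k*n.
def zipWithUniqueIdSketch[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
  val n = partitions.size
  partitions.zipWithIndex.flatMap { case (part, p) =>
    part.zipWithIndex.map { case (item, k) => (item, p + k.toLong * n) }
  }
}

// Mimic parallelize's slicing: partition i holds [i*len/n, (i+1)*len/n).
def slice[T](xs: IndexedSeq[T], n: Int): Seq[IndexedSeq[T]] =
  (0 until n).map(i => xs.slice(i * xs.size / n, (i + 1) * xs.size / n))

val parts = slice((100 to 120).toIndexedSeq, 5)
val ids = zipWithUniqueIdSketch(parts)
```

This reproduces the example's output: (100,0), (101,5), (102,10), (103,15) for partition 0, and (120,24) as the final element of the last, larger partition.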
<TABLE style="width: 80%;" border="0" cellspacing="10" cellpadding="0">
<TBODY>
<TR>
<TD valign="top" style="width: 1807px;" colspan="3"> </TD></TR>
<TR>
<TD valign="top" style="width: 332px;">
<P></P></TD>
<TD style="width: 0px; vertical-align: top;"><BR></TD>
<TD style="width: 1807px; vertical-align: top;"><BR><BR>Our research group
has a very strong focus on using and improving Apache Spark to solve real
world problems. In order to do this we need to have a very solid
understanding of the capabilities of Spark. So one of the first things we
have done is to go through the entire Spark RDD API and write examples to
test the functionality of each method. This has been a very useful
exercise and we would like to share the examples with
everyone.<BR><BR>Authors of
examples: Matthias Langer and Zhen He<BR>Email addresses:
m.langer@latrobe.edu.au, z.he@latrobe.edu.au<BR><BR>These examples have
only been tested for Spark version 1.4. We assume the functionality of
Spark is stable and therefore the examples should be valid for later
releases.<BR><BR>If you find any errors in the examples we would love to
hear about them so we can fix them up. So please email us to let us
know.<BR><BR><BR><BIG><SPAN style="font-weight: bold;">The RDD API By
Example</SPAN></BIG><BR><BR>
<P class="p9 ft4">RDD is short for Resilient Distributed Dataset. RDDs are
the workhorse of the Spark system. As a user, one can consider an RDD as a
handle for a collection of individual data partitions, which are the
result of some computation.</P>
<P class="p22 ft4">However, an RDD is actually more than that. On cluster
installations, separate data partitions can be on separate nodes. Using
the RDD as a handle one can access all partitions and perform computations
and transformations using the contained data. Whenever a part of an RDD or
an entire RDD is lost, the system is able to reconstruct the data of lost
partitions by using lineage information. Lineage refers to the sequence of
transformations used to produce the current RDD. As a result, Spark is
able to recover automatically from most failures.</P>
<P class="p23 ft8">All RDDs available in Spark derive either directly or
indirectly from the class RDD. This class comes with a large set of
methods that perform operations on the data within the associated
partitions. The class RDD is abstract. Whenever one uses an RDD, one is
actually using a concrete implementation of RDD. These implementations
have to override some core functions to make the RDD behave as
expected.</P>
<P class="p24 ft4">One reason why Spark has lately become a very popular
system for processing big data is that it does not impose restrictions
regarding what data can be stored within RDD partitions. The RDD API
already contains many useful operations. But, because the creators of
Spark had to keep the core API of RDDs common enough to handle arbitrary
<NOBR>data-types,</NOBR> many convenience functions are missing.</P>
<P class="p10 ft4">The basic RDD API considers each data item as a single
value. However, users often want to work with <NOBR>key-value</NOBR>
pairs. Therefore Spark extended the interface of RDD to provide
additional functions (PairRDDFunctions), which explicitly work on
<NOBR>key-value</NOBR> pairs. Currently, there are four extensions to the
RDD API available in Spark. They are as follows:</P>
<P class="p25 ft4">DoubleRDDFunctions <BR></P>
<DIV style="margin-left: 40px;">This extension contains many useful
methods for aggregating numeric values. They become available if the data
items of an RDD are implicitly convertible to the Scala
<NOBR>data-type</NOBR> double.</DIV>
<P class="p26 ft4">PairRDDFunctions <BR></P>
<P class="p26 ft4" style="margin-left: 40px;">Methods defined in this
interface extension become available when the data items have a
two-component tuple structure. Spark will interpret the first tuple item
(i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as
the associated value.</P>
<P class="p27 ft4">OrderedRDDFunctions <BR></P>
<P class="p27 ft4" style="margin-left: 40px;">Methods defined in this
interface extension become available if the data items are two-component
tuples where the key is implicitly sortable.</P>
<P class="p28 ft9">SequenceFileRDDFunctions <BR></P>
<P class="p29 ft4" style="margin-left: 40px;">This extension contains
several methods that allow users to create Hadoop sequence files from
RDDs. The data items must be two-component <NOBR>key-value</NOBR>
tuples as required by the PairRDDFunctions. However, there are additional
requirements considering the convertibility of the tuple components to
Writable types.</P>
<P class="p30 ft4">Since Spark will make methods with extended
functionality automatically available to users when the data items
fulfill the above described requirements, we decided to list all
available functions in strictly alphabetical order. We will append one
of the following to the <NOBR>function-name</NOBR> to indicate it belongs
to an extension that requires the data items to conform to a certain
format or type.</P>
<P class="p30 ft4"><SPAN class="ft10">[Double] </SPAN>-
DoubleRDDFunctions<BR></P>
<P class="p30 ft4"><SPAN class="ft10">[Ordered]</SPAN> -
OrderedRDDFunctions<BR></P>
<P class="p30 ft4"><SPAN class="ft10">[Pair] -
PairRDDFunctions<BR></SPAN></P>
<P class="p30 ft4"><SPAN class="ft10"></SPAN><SPAN
class="ft10">[SeqFile]</SPAN> - SequenceFileRDDFunctions</P>
<P class="p30 ft4"><BR></P>
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><BR><A
name="aggregate"></A><BR>aggregate</SPAN></BIG></BIG><BR><BR>The <SPAN
style="font-weight: bold;">aggregate</SPAN> function allows the user to
apply <SPAN style="font-weight: bold;">two</SPAN> different reduce
functions to the RDD. The first reduce function is applied within each
partition to reduce the data within each partition into a single result.
The second reduce function is used to combine the different reduced
results of all partitions together to arrive at one final result. The
ability to have two separate reduce functions for intra partition versus
across partition reducing adds a lot of flexibility. For example the first
reduce function can be the max function and the second one can be the sum
function. The user also specifies an initial value. Here are some
important facts.
<UL>
<LI>The initial value is applied at both levels of reduce. So both at
the intra partition reduction and across partition reduction.<BR></LI>
<LI>Both reduce functions have to be commutative and associative.</LI>
<LI>Do not assume any execution order for either partition computations
or combining partitions.</LI>
<LI>Why would one want to use two input data types? Let us assume we do
an archaeological site survey using a metal detector. While walking
through the site we take GPS coordinates of important findings based on
the output of the metal detector. Later, we intend to draw an image of a
map that highlights these locations using the <SPAN style="font-weight: bold;">aggregate
</SPAN>function. In this case the <SPAN
style="font-weight: bold;">zeroValue</SPAN> could be an area map with no
highlights. The possibly huge set of input data is stored as GPS
coordinates across many partitions. <SPAN
style="font-weight: bold;">seqOp (first reducer)</SPAN> could convert
the GPS coordinates to map coordinates and put a marker on the map at
the respective position. <SPAN style="font-weight: bold;">combOp (second
reducer) </SPAN>will receive these highlights as partial maps and
combine them into a single final output map.</LI></UL><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def aggregate[U: ClassTag](zeroValue:
U)(seqOp: (U, T) => U, combOp: (U, U) => U): U<BR></DIV><BR>
<P class="p30 ft4" style="font-weight: bold;">Examples 1</P>
<DIV style="margin-left: 40px;">
<TABLE style="width: 559px; height: 340px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
z = sc.parallelize(List(1,2,3,4,5,6), 2)<BR><BR>// lets first print
out the contents of the RDD with partition labels<BR>def
myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {<BR>
iter.toList.map(x => "[partID:" + index + ", val: "
+ x + "]").iterator<BR>}<BR><BR>
z.mapPartitionsWithIndex(myfunc).collect<BR>res28: Array[String] =
Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3],
[partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])<BR><BR>
z.aggregate(0)(math.max(_, _), _ + _)<BR>res40: Int = 9<BR><BR>//
This example returns 16 since the initial value is 5<BR>// reduce of
partition 0 will be max(5, 1, 2, 3) = 5<BR>// reduce of partition 1
will be max(5, 4, 5, 6) = 6<BR>// final reduce across partitions
will be 5 + 5 + 6 = 16<BR>// note the final reduce includes the
initial value<BR>z.aggregate(5)(math.max(_, _), _ + _)<BR>res29: Int
= 16<BR><BR><BR>val z =
sc.parallelize(List("a","b","c","d","e","f"),2)<BR><BR>//lets first
print out the contents of the RDD with partition labels<BR>def
myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] =
{<BR> iter.toList.map(x => "[partID:" + index + ",
val: " + x + "]").iterator<BR>}<BR><BR>
z.mapPartitionsWithIndex(myfunc).collect<BR>res31: Array[String] =
Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c],
[partID:1, val: d], [partID:1, val: e], [partID:1, val: f])<BR><BR>
z.aggregate("")(_ + _, _+_)<BR>res115: String = abcdef<BR><BR>// See
here how the initial value "x" is applied three times.<BR>// -
once for each partition<BR>// - once when combining all the
partitions in the second reduce function.<BR>z.aggregate("x")(_ + _,
_+_)<BR>res116: String = xxdefxabc<BR><BR>// Below are some more
advanced examples. Some are quite tricky to work out.<BR><BR>val z =
sc.parallelize(List("12","23","345","4567"),2)<BR>
z.aggregate("")((x,y) => math.max(x.length, y.length).toString,
(x,y) => x + y)<BR>res141: String = 42<BR><BR>
z.aggregate("")((x,y) => math.min(x.length, y.length).toString,
(x,y) => x + y)<BR>res142: String = 11<BR><BR>val z =
sc.parallelize(List("12","23","345",""),2)<BR>z.aggregate("")((x,y)
=> math.min(x.length, y.length).toString, (x,y) => x + y)<BR>
res143: String = 10</TD></TR></TBODY></TABLE></DIV><SPAN style="font-weight: bold;"><BR></SPAN>The
main issue with the code above is that the result of the inner <SPAN
style="font-weight: bold;">min</SPAN> is a string of length 1. <BR>The
zero in the output is due to the empty string being the last string in the
list. Because the empty string is the last element in its partition, there
is no further reduction step within the partition that would turn that
zero into a string of length 1.<BR><BR><SPAN style="font-weight: bold;">Examples
2</SPAN><BR><BR>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"></SPAN><BR>
<TABLE style="width: 519px; height: 71px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
z = sc.parallelize(List("12","23","","345"),2)<BR>
z.aggregate("")((x,y) => math.min(x.length, y.length).toString,
(x,y) => x + y)<BR>res144: String =
11</TD></TR></TBODY></TABLE><BR></DIV>In contrast to the previous example,
this example has the empty string at the beginning of the second
partition. This results in a length of zero being input to the next
reduce step, which then upgrades it to a length of 1. <SPAN
style="font-style: italic;">(Warning: The above example shows bad design
since the output is dependent on the order of the data inside the
partitions.)</SPAN><BR><BR>
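The two-level reduction described above can be emulated in plain Python (a sketch of the semantics, not Spark code; names and the partition layout are illustrative). Note how the zero value enters both the per-partition fold and the final combine, reproducing the results of the first example:

```python
from functools import reduce

# Plain-Python emulation (not Spark) of aggregate's two-level reduction:
# seq_op folds each partition starting from zero_value, then comb_op folds
# the per-partition results, again starting from zero_value.
def aggregate(partitions, zero_value, seq_op, comb_op):
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, per_partition, zero_value)

parts = [[1, 2, 3], [4, 5, 6]]          # List(1,2,3,4,5,6) with 2 partitions

result0 = aggregate(parts, 0, max, lambda a, b: a + b)   # 3 + 6 = 9
result5 = aggregate(parts, 5, max, lambda a, b: a + b)   # 5 + 5 + 6 = 16
```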
<HR style="width: 100%; height: 2px;">
<BR><A name="aggregateByKey"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">aggregateByKey</SPAN></BIG></BIG>
[Pair]<BR><BR>
<DIV style="width: 1799px; text-align: left;">Works like the aggregate
function, except the aggregation is applied to the values with the same
key. Also, unlike the aggregate function, the initial value is not applied
to the second reduce.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;"><BR></DIV>
<DIV style="margin-left: 40px;">def aggregateByKey[U](zeroValue: U)(seqOp:
(U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K,
U)]<BR>def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U,
V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]<BR>
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U,
V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K,
U)]<BR></DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 584px; height: 25px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val pairRDD = sc.parallelize(List(
("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12),
("mouse", 2)), 2)<BR><BR>// lets have a look at what is in the
partitions<BR>def myfunc(index: Int, iter: Iterator[(String, Int)])
: Iterator[String] = {<BR> iter.toList.map(x => "[partID:"
+ index + ", val: " + x + "]").iterator<BR>}<BR>
pairRDD.mapPartitionsWithIndex(myfunc).collect<BR><BR>res2:
Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val:
(cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)],
[partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])<BR><BR>
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect<BR>res3:
Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))<BR><BR>
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect<BR>res4:
Array[(String, Int)] = Array((dog,100), (cat,200),
(mouse,200))<BR><BR></TD></TR></TBODY></TABLE><BR><BR>
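A plain-Python sketch of these semantics (not Spark code; names are illustrative) shows why the initial value only affects the per-partition step, reproducing the results above:

```python
from collections import defaultdict

# Plain-Python emulation (not Spark) of aggregateByKey: the zero value is
# used only inside each partition (seq_op); the cross-partition merge
# (comb_op) combines the per-partition results without it.
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    per_partition = []
    for part in partitions:
        acc = defaultdict(lambda: zero_value)
        for k, v in part:
            acc[k] = seq_op(acc[k], v)
        per_partition.append(acc)
    merged = {}
    for acc in per_partition:
        for k, v in acc.items():
            merged[k] = comb_op(merged[k], v) if k in merged else v
    return merged

parts = [[("cat", 2), ("cat", 5), ("mouse", 4)],
         [("cat", 12), ("dog", 12), ("mouse", 2)]]

r0 = aggregate_by_key(parts, 0, max, lambda a, b: a + b)
r100 = aggregate_by_key(parts, 100, max, lambda a, b: a + b)
```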
<HR style="width: 100%; height: 2px;">
<BR><BIG><BIG><A name="cartesian"></A><BR style="font-weight: bold;"><SPAN
style="font-weight: bold;">cartesian</SPAN></BIG></BIG><BR><BR>
<DIV style="text-align: left;">Computes the cartesian product between two
RDDs (i.e. each item of the first RDD is joined with each item of the
second RDD) and returns them as a new RDD. <SPAN style="font-style: italic;">(Warning:
Be careful when using this function! Memory consumption can quickly
become an issue!)</SPAN><BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;"><BR></DIV>
<DIV style="margin-left: 40px;">def cartesian[U: ClassTag](other: RDD[U]):
RDD[(T, U)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"></SPAN><BR>
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1,2,3,4,5))<BR>val y =
sc.parallelize(List(6,7,8,9,10))<BR>x.cartesian(y).collect<BR>res0:
Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6),
(2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10),
(4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9),
(5,10))</TD></TR></TBODY></TABLE><BR></DIV><BR><BR>
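The same product can be sketched in plain Python (not Spark code) with itertools.product, which makes the size blow-up explicit: the result always holds len(x) * len(y) pairs:

```python
from itertools import product

# Plain-Python sketch (not Spark) of cartesian: every pairing of the two
# inputs, so the result has len(x) * len(y) elements -- which is why
# memory consumption grows so quickly.
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]
pairs = list(product(x, y))   # 25 pairs for 5 x 5 inputs
```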
<HR style="width: 100%; height: 2px;">
<BR><A name="checkpoint"></A><BR><BR><BIG
style="font-weight: bold;"><BIG>checkpoint</BIG></BIG><BR><BR>Will create
a checkpoint when the RDD is computed next. Checkpointed RDDs are stored
as a binary file within the checkpoint directory which can be specified
using the Spark context.<SPAN style="font-style: italic;"> (Warning: Spark
applies lazy evaluation. Checkpointing will not occur until an action is
invoked.)</SPAN><BR><BR>Important note: the directory
"my_directory_name" should exist on all slaves. As an alternative you
could use an HDFS directory URL as well.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;"><BR></DIV>
<DIV style="margin-left: 40px;">def checkpoint()<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"></SPAN><BR>
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">sc.setCheckpointDir("my_directory_name")<BR>
val a = sc.parallelize(1 to 4)<BR>a.checkpoint<BR>a.count<BR>
14/02/25 18:13:53 INFO SparkContext: Starting job: count at
< console>:15<BR>...<BR>14/02/25 18:13:53 INFO MemoryStore:
Block broadcast_5 stored as values to memory (estimated size 115.7
KB, free 296.3 MB)<BR>14/02/25 18:13:53 INFO RDDCheckpointData: Done
checkpointing RDD 11 to
file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11,
new parent is RDD 12<BR>res23: Long =
4</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;"><A name="coalesce"></A><BR></SPAN></BIG></BIG></P>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">coalesce,
repartition</SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">Coalesces the associated data into a given
number of partitions. <SPAN
style="font-style: italic;">repartition(numPartitions)</SPAN> is simply an
abbreviation for <SPAN style="font-style: italic;">coalesce(numPartitions,
shuffle = true)</SPAN>.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def coalesce ( numPartitions : Int ,
shuffle : Boolean = false ): RDD [T]<BR>def repartition ( numPartitions :
Int ): RDD [T] </DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>val z = y.coalesce(2, false)<BR>
z.partitions.length<BR>res9: Int =
2</TD></TR></TBODY></TABLE></DIV><BR><BR>
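A plain-Python sketch (not Spark code; the grouping rule is illustrative, Spark's actual partition assignment differs) of shuffle-free coalescing, where whole existing partitions are merged into the requested number of new ones rather than redistributing data element by element:

```python
# Plain-Python sketch (not Spark) of coalesce without a shuffle: existing
# partitions are grouped into num_partitions new partitions. The modulo
# grouping below is an illustrative assignment, not Spark's exact one.
def coalesce(partitions, num_partitions):
    new_parts = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        new_parts[i % num_partitions].extend(part)
    return new_parts

ten_parts = [[i] for i in range(1, 11)]      # 1 to 10 across 10 partitions
two_parts = coalesce(ten_parts, 2)           # now only 2 partitions
```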
<HR style="width: 100%; height: 2px;">
<BR><A name="cogroup"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">cogroup
<SMALL>[Pair]</SMALL>, groupWith
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">A very powerful set of functions that allow
grouping up to 3 key-value RDDs together using their keys.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def cogroup[W](other: RDD[(K, W)]):
RDD[(K, (Iterable[V], Iterable[W]))]<BR>def cogroup[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]<BR>def
cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(Iterable[V], Iterable[W]))]<BR>def cogroup[W1, W2](other1: RDD[(K, W1)],
other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1],
Iterable[W2]))]<BR>def cogroup[W1, W2](other1: RDD[(K, W1)], other2:
RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1],
Iterable[W2]))]<BR>def cogroup[W1, W2](other1: RDD[(K, W1)], other2:
RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V],
Iterable[W1], Iterable[W2]))]<BR>def groupWith[W](other: RDD[(K, W)]):
RDD[(K, (Iterable[V], Iterable[W]))]<BR>def groupWith[W1, W2](other1:
RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1],
Iterable[W2]))] </DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN>s<BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 108px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1, 2, 1, 3), 1)<BR>val b = a.map((_,
"b"))<BR>val c = a.map((_, "c"))<BR>b.cogroup(c).collect<BR>res7:
Array[(Int, (Iterable[String], Iterable[String]))] = Array(<BR>
(2,(ArrayBuffer(b),ArrayBuffer(c))),<BR>
(3,(ArrayBuffer(b),ArrayBuffer(c))),<BR>(1,(ArrayBuffer(b,
b),ArrayBuffer(c, c)))<BR>)<BR><BR>val d = a.map((_, "d"))<BR>
b.cogroup(c, d).collect<BR>res9: Array[(Int, (Iterable[String],
Iterable[String], Iterable[String]))] = Array(<BR>
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),<BR>
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),<BR>
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))<BR>
)<BR><BR>val x = sc.parallelize(List((1, "apple"), (2, "banana"),
(3, "orange"), (4, "kiwi")), 2)<BR>val y = sc.parallelize(List((5,
"computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)<BR>
x.cogroup(y).collect<BR>res23: Array[(Int, (Iterable[String],
Iterable[String]))] = Array(<BR>
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))), <BR>
(2,(ArrayBuffer(banana),ArrayBuffer())), <BR>
(3,(ArrayBuffer(orange),ArrayBuffer())),<BR>
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),<BR>
(5,(ArrayBuffer(),ArrayBuffer(computer))))</TD></TR></TBODY></TABLE></DIV><BR><BR>
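The last example can be sketched in plain Python (no Spark; the helper name cogroup and the list-based "RDDs" are illustrative only): for every key occurring in either RDD, the matching values from each side are collected into separate sequences, with an empty sequence where a side has no match.

```python
# Plain-Python sketch of cogroup semantics for two key-value "RDDs".
def cogroup(left, right):
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {k: ([v for kk, v in left if kk == k],
                [v for kk, v in right if kk == k])
            for k in keys}

x = [(1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")]
y = [(5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")]
print(cogroup(x, y)[1])   # -> (['apple'], ['laptop', 'desktop'])
print(cogroup(x, y)[2])   # -> (['banana'], [])
```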
<HR style="width: 100%; height: 2px;">
<BR><A name="collect"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">collect,
toArray</SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">Converts the RDD into a Scala array and
returns it. If you provide a standard map-function (i.e. f = T -> U) it
will be applied before inserting the values into the result
array.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def collect(): Array[T]<BR>def collect[U:
ClassTag](f: PartialFunction[T, U]): RDD[U]<BR>def toArray(): Array[T]
</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 62px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>c.collect<BR>res29: Array[String] = Array(Gnu, Cat, Rat, Dog,
Gnu, Rat)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="collectAsMap"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">collectAsMap
<SMALL>[Pair]</SMALL> </SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">Similar to <SPAN style="font-style: italic;">collect</SPAN>,
but works on key-value RDDs and converts them into Scala maps to preserve
their key-value structure.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def collectAsMap(): Map[K, V]
</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 522px; height: 62px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1, 2, 1, 3), 1)<BR>val b = a.zip(a)<BR>
b.collectAsMap<BR>res1: scala.collection.Map[Int,Int] = Map(2 ->
2, 1 -> 1, 3 -> 3)</TD></TR></TBODY></TABLE></DIV><BR><BR>
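Note that a map can hold each key only once, so duplicate keys collapse; a plain-Python sketch of the same idea (no Spark involved):

```python
# Sketch of collectAsMap semantics: key-value pairs become a map, so
# when a key occurs more than once only one value survives (in this
# Python illustration, the later pair overwrites the earlier one).
pairs = [(1, 1), (2, 2), (1, 3)]
as_map = dict(pairs)
print(as_map)  # -> {1: 3, 2: 2}
```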
<HR style="width: 100%; height: 2px;">
<BR><A name="combineByKey"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">combineByKey[Pair]
</SPAN></BIG></BIG><BR><BR></P>
<DIV style="text-align: left;">Very efficient implementation that combines
the values of a RDD consisting of two-component tuples by applying
multiple aggregators one after another.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def combineByKey[C](createCombiner: V
=> C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C):
RDD[(K, C)]<BR>def combineByKey[C](createCombiner: V => C, mergeValue:
(C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int):
RDD[(K, C)]<BR>def combineByKey[C](createCombiner: V => C, mergeValue:
(C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner,
mapSideCombine: Boolean = true, serializerClass: String = null): RDD[(K,
C)] </DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 153px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)<BR>val c =
b.zip(a)<BR>val d = c.combineByKey(List(_), (x:List[String],
y:String) => y :: x, (x:List[String], y:List[String]) => x :::
y)<BR>d.collect<BR>res16: Array[(Int, List[String])] =
Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee,
bear, wolf)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
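The three aggregators can be sketched in plain Python (no Spark; the function name and the list-of-lists "partitions" are illustrative only): createCombiner starts a combiner from the first value seen for a key within a partition, mergeValue folds further values of that key in, and mergeCombiners joins the per-partition combiners.

```python
# Plain-Python sketch of combineByKey over small in-memory "partitions".
def combine_by_key(partitions, create, merge_value, merge_combiners):
    per_part = []
    for part in partitions:
        combiners = {}
        for k, v in part:
            if k in combiners:
                combiners[k] = merge_value(combiners[k], v)   # same partition
            else:
                combiners[k] = create(v)                      # first value for key
        per_part.append(combiners)
    result = {}
    for combiners in per_part:                                # merge partitions
        for k, c in combiners.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

parts = [[(1, "dog"), (1, "cat")], [(2, "gnu"), (1, "turkey")]]
print(combine_by_key(parts, lambda v: [v],
                     lambda c, v: c + [v], lambda a, b: a + b))
# -> {1: ['dog', 'cat', 'turkey'], 2: ['gnu']}
```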
<HR style="width: 100%; height: 2px;">
<BR><A name="compute"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">compute</SPAN></BIG></BIG><BR></P>
<DIV style="text-align: left;">Executes dependencies and computes the
actual representation of the RDD. This function should not be called
directly by users.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def compute(split: Partition, context:
TaskContext): Iterator[T] </DIV><BR>
<HR style="width: 100%; height: 2px;">
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;"><A name="context"></A></SPAN></BIG></BIG></P>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">context,
sparkContext</SPAN></BIG></BIG><BR></P>
<DIV style="text-align: left;">Returns the <SPAN style="font-style: italic;">SparkContext</SPAN>
that was used to create the RDD.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def compute(split: Partition, context:
TaskContext): Iterator[T] </DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.context<BR>res8: org.apache.spark.SparkContext =
org.apache.spark.SparkContext@58c1c2f1</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="count"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">count</SPAN></BIG></BIG><BR></P>
<DIV style="text-align: left;">Returns the number of items stored within a
RDD.<BR></DIV>
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def count(): Long </DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.count<BR>res2: Long = 4</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countApprox"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">countApprox</SPAN></BIG></BIG><BR></P>Marked as
experimental feature! Experimental features are currently not covered by
this document!
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def (timeout: Long, confidence: Double =
0.95): PartialResult[BoundedDouble]<BR></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countApproxDistinct"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">countApproxDistinct</SPAN></BIG></BIG><BR><BR>
Computes the approximate number of distinct values. For large RDDs which
are spread across many nodes, this function may execute faster than other
counting methods. The parameter <SPAN
style="font-style: italic;">relativeSD</SPAN> controls the accuracy of the
computation.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countApproxDistinct(relativeSD: Double
= 0.05): Long<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10000, 20)<BR>val b = a++a++a++a++a<BR>
b.countApproxDistinct(0.1)<BR>res14: Long = 8224<BR><BR>
b.countApproxDistinct(0.05)<BR>res15: Long = 9750<BR><BR>
b.countApproxDistinct(0.01)<BR>res16: Long = 9947<BR><BR>
b.countApproxDistinct(0.001)<BR>res0: Long =
10000<BR></TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countApproxDistinceByKey"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">countApproxDistinctByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR> <BR>Similar to
<SPAN style="font-style: italic;">countApproxDistinct</SPAN>, but computes
the approximate number of distinct values for each distinct key. Hence,
the RDD must consist of two-component tuples. For large RDDs which are
spread across many nodes, this function may execute faster than other
counting methods. The parameter <SPAN
style="font-style: italic;">relativeSD</SPAN> controls the accuracy of the
computation.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countApproxDistinctByKey(relativeSD:
Double = 0.05): RDD[(K, Long)]<BR>def countApproxDistinctByKey(relativeSD:
Double, numPartitions: Int): RDD[(K, Long)]<BR>def
countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner):
RDD[(K, Long)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>val b =
sc.parallelize(a.takeSample(true, 10000, 0), 20)<BR>val c =
sc.parallelize(1 to b.count().toInt, 20)<BR>val d = b.zip(c)<BR>
d.countApproxDistinctByKey(0.1).collect<BR>res15: Array[(String,
Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414),
(Gnu,2494))<BR><BR>d.countApproxDistinctByKey(0.01).collect<BR>
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455),
(Dog,2425), (Gnu,2513))<BR><BR>
d.countApproxDistinctByKey(0.001).collect<BR>res0: Array[(String,
Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451),
(Gnu,2521))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="countByKey"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN style="font-weight: bold;">countByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR></P>Very similar to count,
but counts the values of an RDD consisting of two-component tuples for each
distinct key separately.
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByKey(): Map[K,
Long]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3,
"Dog")), 2)<BR>c.countByKey<BR>res3: scala.collection.Map[Int,Long]
= Map(3 -> 3, 5 -> 1)</TD></TR></TBODY></TABLE></DIV><BR><BR>
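In plain Python the same semantics amount to counting keys while ignoring the values entirely (no Spark involved; illustration only):

```python
# countByKey semantics: occurrence count per distinct key.
from collections import Counter

c = [(3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")]
counts = Counter(k for k, _ in c)
print(dict(counts))  # -> {3: 3, 5: 1}
```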
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="countByKeyApprox"></A><BR><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">countByKeyApprox
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR></P>Marked as an experimental
feature! Experimental features are currently not covered by this document!
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByKeyApprox(timeout: Long,
confidence: Double = 0.95): PartialResult[Map[K,
BoundedDouble]]<BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="countByValue"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">countByValue</SPAN></BIG></BIG><BR><BR>
Returns a map that contains all unique values of the RDD and their
respective occurrence counts. <SPAN style="font-style: italic;">(Warning:
This operation ultimately aggregates the information in a single
reducer.)</SPAN><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByValue(): Map[T,
Long]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))<BR>
b.countByValue<BR>res27: scala.collection.Map[Int,Long] = Map(5
-> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4
-> 2, 7 -> 1)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
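The exact-count semantics can be sketched in plain Python (no Spark; illustration only): every distinct element is mapped to its number of occurrences, gathered into a single map.

```python
# countByValue semantics: exact occurrence count of every distinct element.
from collections import Counter

b = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1]
counts = Counter(b)
print(counts[1], counts[2], counts[4])  # -> 6 3 2
```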
<HR style="width: 100%; height: 2px;">
<BR><A name="countByValueApprox"></A><BR>
<P class="p30 ft4"><BIG><BIG><SPAN
style="font-weight: bold;">countByValueApprox</SPAN></BIG></BIG><BR></P>
Marked as an experimental feature! Experimental features are currently not
covered by this document!
<DIV style="margin-left: 40px;"><SPAN
style="font-weight: bold;"><BR></SPAN></DIV><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def countByValueApprox(timeout: Long,
confidence: Double = 0.95): PartialResult[Map[T,
BoundedDouble]]<BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="dependencies"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">dependencies</SPAN></BIG></BIG><BR>
<BR>Returns the dependencies of this RDD, i.e. the parent RDDs it was
derived from and how it depends on them.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">final def dependencies:
Seq[Dependency[_]]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))<BR>b:
org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at
parallelize at <console>:12<BR>b.dependencies.length<BR>Int =
0<BR><BR>b.map(a => a).dependencies.length<BR>res40: Int =
1<BR><BR>b.cartesian(a).dependencies.length<BR>res41: Int =
2<BR><BR>b.cartesian(a).dependencies<BR>res42:
Seq[org.apache.spark.Dependency[_]] =
List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,
org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="distinct"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">distinct</SPAN></BIG></BIG><BR>
<BR>Returns a new RDD that contains each unique value only
once.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def distinct(): RDD[T]<BR>def
distinct(numPartitions: Int): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>c.distinct.collect<BR>res6: Array[String] = Array(Dog, Gnu,
Cat, Rat)<BR><BR>val a =
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))<BR>
a.distinct(2).partitions.length<BR>res16: Int = 2<BR><BR>
a.distinct(3).partitions.length<BR>res17: Int =
3</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="first"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">first</SPAN></BIG></BIG><BR>
<BR>Returns the very first data item of the
RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def first(): T<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.first<BR>res1: String = Gnu</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="filter"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">filter</SPAN></BIG></BIG><BR>
<BR>Evaluates a boolean function for each data item of the RDD and
puts the items for which the function returned <SPAN style="font-style: italic;">true</SPAN>
into the resulting RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def filter(f: T => Boolean):
RDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10, 3)<BR>val b = a.filter(_ % 2 == 0)<BR>
b.collect<BR>res3: Array[Int] = Array(2, 4, 6, 8,
10)</TD></TR></TBODY></TABLE></DIV><BR>When you provide a filter function,
it must be able to handle all data items contained in the RDD. Scala
provides so-called partial functions to deal with mixed data types. (Tip:
Partial functions are very useful if you have some data that you do not
want to handle, while for the remaining (matching) data you want to apply
some kind of map function. The following <A href="http://blog.bruchez.name/2011/10/scala-partial-functions-without-phd.html">article</A> explains
partial functions in a very nice way, including why <SPAN
style="font-style: italic;">case</SPAN> has to be used for them.)<BR><BR><SPAN
style="font-weight: bold;">Examples for mixed data without partial
functions</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(1 to 8)<BR>b.filter(_ < 4).collect<BR>res15:
Array[Int] = Array(1, 2, 3)<BR><BR>val a =
sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))<BR>
a.filter(_ < 4).collect<BR><console>:15: error: value <
is not a member of Any</TD></TR></TBODY></TABLE></DIV><BR>This fails
because some components of <SPAN style="font-style: italic;">a
</SPAN>are not implicitly comparable against integers. The partial-function
variant of <SPAN style="font-style: italic;">collect</SPAN> uses the
<SPAN style="font-style: italic;">isDefinedAt </SPAN>property of a
function-object to determine whether the test-function is compatible with
each data item. Only data items that pass this test <SPAN style="font-style: italic;">(= the filter)
</SPAN>are then mapped using the function-object.<BR><BR><SPAN style="font-weight: bold;">Examples
for mixed data with partial functions</SPAN><BR><BR>
for mixed data with partial functions</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))<BR>
a.collect({case a: Int => "is integer"<BR>
case b:
String => "is string" }).collect<BR>res17: Array[String] =
Array(is string, is string, is integer, is string)<BR><BR>val
myfunc: PartialFunction[Any, Any] = {<BR> case a:
Int => "is integer"<BR> case b: String
=> "is string" }<BR>myfunc.isDefinedAt("")<BR>res21: Boolean =
true<BR><BR>myfunc.isDefinedAt(1)<BR>res22: Boolean = true<BR><BR>
myfunc.isDefinedAt(1.5)<BR>res23: Boolean =
false</TD></TR></TBODY></TABLE></DIV><BR><BR>Be careful! The above code
works only because it checks the type itself. If you apply operations to
values of this type, you have to declare the type you want explicitly
instead of <SPAN style="font-style: italic;">Any</SPAN>; otherwise the
compiler cannot resolve the operation:<BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) =>
"x"}<BR><console>:10: error: value < is not a member of
Any<BR><BR>val myfunc2: PartialFunction[Int, Any] = {case x if (x
< 4) => "x"}<BR>myfunc2: PartialFunction[Int,Any] =
<function1></TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="filterByRange"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">filterByRange
</SPAN></BIG></BIG>[Ordered]<BR> <BR>Returns an RDD containing only
the items in the specified key range. From our testing, it appears this
only works if your data is in key-value pairs and has already been
sorted by key.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def filterByRange(lower: K, upper: K):
RDD[P]<BR></DIV><SPAN style="font-weight: bold;"><BR>
Example</SPAN><BR><BR><BR>
<TABLE style="width: 643px; height: 33px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val randRDD = sc.parallelize(List(
(2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1,
"screen"), (5, "heater")), 3)<BR>val sortedRDD =
randRDD.sortByKey()<BR><BR>sortedRDD.filterByRange(1, 3).collect<BR>
res66: Array[(Int, String)] = Array((1,screen), (2,cat),
(3,book))<BR></TD></TR></TBODY></TABLE><BR><BR>
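The inclusive-bounds behaviour can be sketched in plain Python (no Spark; the helper name is illustrative only — Spark can additionally skip whole partitions when the RDD is range-partitioned):

```python
# filterByRange semantics: keep only pairs whose key lies in
# the inclusive range [lower, upper].
def filter_by_range(pairs, lower, upper):
    return [(k, v) for k, v in pairs if lower <= k <= upper]

rdd = sorted([(2, "cat"), (6, "mouse"), (7, "cup"), (3, "book"),
              (4, "tv"), (1, "screen"), (5, "heater")])
print(filter_by_range(rdd, 1, 3))
# -> [(1, 'screen'), (2, 'cat'), (3, 'book')]
```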
<HR style="width: 100%; height: 2px;">
<BR><A name="filterWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">filterWith</SPAN></BIG></BIG>
<BIG><SPAN style="font-weight: bold;">(deprecated)</SPAN></BIG><BR>
<BR>This is an extended version of <SPAN
style="font-style: italic;">filter</SPAN>. It takes two function
arguments. The first argument must conform to <SPAN style="font-style: italic;">Int
-> A</SPAN> and is executed once per partition. It transforms the
partition index into a value of type <SPAN style="font-style: italic;">A</SPAN>. The
second function looks like<SPAN style="font-style: italic;"> (T, A) ->
Boolean</SPAN>, where <SPAN style="font-style: italic;">T</SPAN> is a data
item from the RDD and <SPAN style="font-style: italic;">A</SPAN> is the
transformed partition index. This function returns either
true or false <SPAN style="font-style: italic;">(i.e. it applies the
filter)</SPAN>.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def filterWith[A: ClassTag](constructA:
Int => A)(p: (T, A) => Boolean): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>val b = a.filterWith(i =>
i)((x,i) => x % 2 == 0 || i % 2 == 0)<BR>b.collect<BR>res37:
Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)<BR><BR>val a =
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)<BR>a.filterWith(x=>
x)((a, b) => b == 0).collect<BR>res30: Array[Int] =
Array(1, 2)<BR><BR>a.filterWith(x=> x)((a, b) => a %
(b+1) == 0).collect<BR>res33: Array[Int] = Array(1, 2, 4, 6, 8,
10)<BR><BR>a.filterWith(x=> x.toString)((a, b) => b ==
"2").collect<BR>res34: Array[Int] = Array(5,
6)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="flatMap"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">flatMap</SPAN></BIG></BIG><BR>
<BR>Similar to <SPAN style="font-style: italic;">map</SPAN>, but
allows emitting more than one item in the map function.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def flatMap[U: ClassTag](f: T =>
TraversableOnce[U]): RDD[U]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10, 5)<BR>a.flatMap(1 to _).collect<BR>
res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4,
5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1,
2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)<BR><BR>
sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x,
x)).collect<BR>res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3,
3)<BR><BR>// The program below generates a random number of copies
(up to 10) of the items in the list.<BR>val x =
sc.parallelize(1 to 10, 3)<BR>
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect<BR><BR>
res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9,
9, 9, 9, 10, 10, 10, 10, 10, 10, 10,
10)</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
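The core semantics can be sketched in plain Python (no Spark; the helper name is illustrative only): apply f to every item, then concatenate the resulting sequences into one flat result.

```python
# flatMap semantics: map each item to a sequence, then flatten.
def flat_map(items, f):
    return [y for x in items for y in f(x)]

print(flat_map([1, 2, 3], lambda x: [x, x, x]))
# -> [1, 1, 1, 2, 2, 2, 3, 3, 3]
print(flat_map(range(1, 4), lambda x: range(1, x + 1)))
# -> [1, 1, 2, 1, 2, 3]
```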
<HR style="width: 100%; height: 2px;">
<BR><A name="flatMapValues"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">flatMapValues</SPAN></BIG></BIG><BR>
<BR>Very similar to <SPAN
style="font-style: italic;">mapValues</SPAN>, but collapses the inherent
structure of the values during mapping.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def flatMapValues[U](f: V =>
TraversableOnce[U]): RDD[(K, U)]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.flatMapValues("x" + _ + "x").collect<BR>res6: Array[(Int, Char)] =
Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g),
(5,e), (5,r), (5,x), (4,x), (4,l), (4,i), (4,o), (4,n), (4,x),
(3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p), (7,a), (7,n),
(7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g),
(5,l), (5,e), (5,x))</TD></TR></TBODY></TABLE></DIV><BR><BR>
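In plain Python the same semantics look as follows (no Spark; illustration only): f maps each value to a sequence, and every element of that sequence is re-paired with the original key. Because a string is a sequence of characters, "x" + v + "x" flattens into one (key, char) pair per character.

```python
# flatMapValues semantics: flatten the mapped values, keep the keys.
def flat_map_values(pairs, f):
    return [(k, y) for k, v in pairs for y in f(v)]

b = [(3, "dog"), (3, "cat")]
print(flat_map_values(b, lambda v: "x" + v + "x"))
# -> [(3, 'x'), (3, 'd'), (3, 'o'), (3, 'g'), (3, 'x'),
#     (3, 'x'), (3, 'c'), (3, 'a'), (3, 't'), (3, 'x')]
```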
<HR style="width: 100%; height: 2px;">
<BR><A name="flatMapWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">flatMapWith</SPAN></BIG></BIG>
<BIG><SPAN style="font-weight: bold;">(deprecated)</SPAN></BIG><BR>
<BR>Similar to <SPAN style="font-style: italic;">flatMap</SPAN>, but
allows accessing the partition index or a derivative of the partition
index from within the flatMap-function.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def flatMapWith[A: ClassTag, U:
ClassTag](constructA: Int => A, preservesPartitioning: Boolean =
false)(f: (T, A) => Seq[U]): RDD[U]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)<BR>a.flatMapWith(x
=> x, true)((x, y) => List(y, x)).collect<BR>res58: Array[Int]
= Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2,
9)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="fold"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">fold</SPAN></BIG></BIG><BR> <BR>
Aggregates the values of each partition. The aggregation variable within
each partition, as well as the one used to merge the per-partition
results, is initialized with <SPAN
style="font-style: italic;">zeroValue</SPAN>.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def fold(zeroValue: T)(op: (T, T) =>
T): T<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1,2,3), 3)<BR>a.fold(0)(_ + _)<BR>res59:
Int = 6</TD></TR></TBODY></TABLE></DIV><BR><BR>
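The two-level folding can be sketched in plain Python (no Spark; the function name and the list-of-lists "partitions" are illustrative only). Note that a non-identity zeroValue is applied once per partition plus once more when merging the per-partition results.

```python
# Sketch of fold over partitions: op folds each partition starting from
# zeroValue, then folds the per-partition results, again from zeroValue.
from functools import reduce

def fold(partitions, zero, op):
    per_part = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_part, zero)

parts = [[1], [2], [3]]                      # List(1,2,3) in 3 partitions
print(fold(parts, 0, lambda a, b: a + b))    # -> 6
print(fold(parts, 1, lambda a, b: a + b))    # -> 10 (zero added 3 + 1 times)
```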
<HR style="width: 100%; height: 2px;">
<BR><A name="foldByKey"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foldByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR> <BR>Very similar to
<SPAN style="font-style: italic;">fold</SPAN>, but performs the folding
separately for each key of the RDD. This function is only available if the
RDD consists of two-component tuples.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foldByKey(zeroValue: V)(func: (V, V)
=> V): RDD[(K, V)]<BR>def foldByKey(zeroValue: V, numPartitions:
Int)(func: (V, V) => V): RDD[(K, V)]<BR>def foldByKey(zeroValue: V,
partitioner: Partitioner)(func: (V, V) => V): RDD[(K,
V)]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)<BR>
val b = a.map(x => (x.length, x))<BR>b.foldByKey("")(_ +
_).collect<BR>res84: Array[(Int, String)] =
Array((3,dogcatowlgnuant))<BR><BR>val a = sc.parallelize(List("dog",
"tiger", "lion", "cat", "panther", "eagle"), 2)<BR>val b = a.map(x
=> (x.length, x))<BR>b.foldByKey("")(_ + _).collect<BR>res85:
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther),
(5,tigereagle))</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
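The per-key folding can be sketched on a local pair collection as follows (a plain-Scala simulation ignoring partitioning; foldByKeySketch is an illustrative name, and zeroValue is assumed to be the neutral element of func):

```scala
// Simulate foldByKey: group the pairs by key, then fold each group's values.
def foldByKeySketch[K, V](pairs: Seq[(K, V)], zeroValue: V)(func: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zeroValue)(func) }

val b = Seq("dog", "cat", "owl", "gnu", "ant").map(x => (x.length, x))
foldByKeySketch(b, "")(_ + _)   // Map(3 -> "dogcatowlgnuant"), as in res84 above
```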
<HR style="width: 100%; height: 2px;">
<BR><A name="foreach"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foreach</SPAN></BIG></BIG><BR>
<BR>Executes a side-effecting function for each data
item. Note that on a cluster the println output of the example below would appear in the executor logs rather than in the driver shell.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foreach(f: T =>
Unit)<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu",
"crocodile", "ant", "whale", "dolphin", "spider"), 3)<BR>
c.foreach(x => println(x + "s are yummy"))<BR>lions are yummy<BR>
gnus are yummy<BR>crocodiles are yummy<BR>ants are yummy<BR>whales
are yummy<BR>dolphins are yummy<BR>spiders are
yummy</TD></TR></TBODY></TABLE><BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="foreachPartition"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foreachPartition</SPAN></BIG></BIG><BR>
<BR>Executes a side-effecting function once for each partition. The data
items contained in the partition are made available via the iterator
argument.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foreachPartition(f: Iterator[T] =>
Unit)<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)<BR>
b.foreachPartition(x => println(x.reduce(_ + _)))<BR>6<BR>15<BR>
24</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="foreachWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">foreachWith</SPAN></BIG></BIG>
<BIG><SPAN style="font-weight: bold;">(Deprecated)</SPAN></BIG><BR>
<BR>Executes a function for each data item, together with an additional
argument that is created once per partition from the partition index by
the constructA function.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def foreachWith[A: ClassTag](constructA:
Int => A)(f: (T, A) => Unit)<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>a.foreachWith(i => i)((x,i)
=> if (x % 2 == 1 && i % 2 == 0) println(x) )<BR>1<BR>
3<BR>7<BR>9</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="fullOuterJoin"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">fullOuterJoin</SPAN></BIG></BIG><BIG><SPAN
style="font-weight: bold;"></SPAN></BIG></BIG> [Pair]<BR> <BR>Performs a
full outer join between two paired RDDs. Keys that occur in only one of
the RDDs are paired with None on the missing side.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def fullOuterJoin[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (Option[V], Option[W]))]<BR>def
fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]<BR>
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner):
RDD[(K, (Option[V], Option[W]))]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 637px; height: 26px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val pairRDD1 = sc.parallelize(List(
("cat",2), ("cat", 5), ("book", 4),("cat", 12)))<BR>val pairRDD2 =
sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("cat",
12)))<BR>pairRDD1.fullOuterJoin(pairRDD2).collect<BR><BR>res5:
Array[(String, (Option[Int], Option[Int]))] =
Array((book,(Some(4),None)), (mouse,(None,Some(4))),
(cup,(None,Some(5))), (cat,(Some(2),Some(2))),
(cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))),
(cat,(Some(5),Some(12))), (cat,(Some(12),Some(2))),
(cat,(Some(12),Some(12))))<BR></TD></TR></TBODY></TABLE><BR><BR><BR>
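The Option wrapping on both sides, and the per-key cross product, can be illustrated with a local sketch (plain Scala, ignoring partitioning; fullOuterJoinSketch is an illustrative name):

```scala
// Simulate fullOuterJoin: keys missing on one side yield None there;
// keys present on both sides produce the cross product of their values.
def fullOuterJoinSketch[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)])
    : Seq[(K, (Option[V], Option[W]))] = {
  val keys = (left.map(_._1) ++ right.map(_._1)).distinct
  keys.flatMap { k =>
    val vs = left.collect { case (`k`, v) => v }
    val ws = right.collect { case (`k`, w) => w }
    if (ws.isEmpty)      vs.map(v => (k, (Some(v): Option[V], None: Option[W])))
    else if (vs.isEmpty) ws.map(w => (k, (None: Option[V], Some(w): Option[W])))
    else for (v <- vs; w <- ws) yield (k, (Some(v): Option[V], Some(w): Option[W]))
  }
}

val r = fullOuterJoinSketch(Seq(("cat", 2), ("book", 4)), Seq(("cat", 12), ("cup", 5)))
// contains (book,(Some(4),None)), (cup,(None,Some(5))), (cat,(Some(2),Some(12)))
```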
<HR style="width: 100%; height: 2px;">
<BR><A name="generator"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">generator,
setGenerator</SPAN></BIG></BIG><BR> <BR>Allows setting a string that
is attached to the end of the RDD's name when printing the dependency
graph.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">@transient var generator<BR>def
setGenerator(_generator: String)<BR></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="getCheckpointFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">getCheckpointFile</SPAN></BIG></BIG><BR>
<BR>Returns the path to the checkpoint file, or None if the RDD has not
yet been checkpointed.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def getCheckpointFile:
Option[String]</DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">sc.setCheckpointDir("/home/cloudera/Documents")<BR>
val a = sc.parallelize(1 to 500, 5)<BR>val b = a++a++a++a++a<BR>
b.getCheckpointFile<BR>res49: Option[String] = None<BR><BR>
b.checkpoint<BR>b.getCheckpointFile<BR>res54: Option[String] =
None<BR><BR>b.collect<BR>b.getCheckpointFile<BR>res57:
Option[String] =
Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-d56580787b20/rdd-40)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="preferredLocations"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">preferredLocations</SPAN></BIG></BIG><BR>
<BR>Returns the hosts on which the given partition of this RDD is
preferably computed. The actual preference for a specific host depends on
factors such as data locality.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">final def preferredLocations(split:
Partition): Seq[String]</DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="getStorageLevel"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">getStorageLevel</SPAN></BIG></BIG><BR>
<BR>Retrieves the currently set storage level of the RDD. A new storage
level can only be assigned if the RDD does not have one set yet. The
example below shows the error you get
when you try to reassign the storage level.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def getStorageLevel</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100000, 2)<BR>
a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)<BR>
a.getStorageLevel.description<BR>String = Disk Serialized 1x
Replicated<BR><BR>a.cache<BR>
java.lang.UnsupportedOperationException: Cannot change storage level
of an RDD after it was already assigned a
level</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="glom"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">glom</SPAN></BIG></BIG><BR>
<BR>Assembles, for each partition, an array containing all of that
partition's elements and returns these arrays as a new RDD. Each returned
array holds the contents of exactly one
partition.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def glom(): RDD[Array[T]]</DIV><BR><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100, 3)<BR>a.glom.collect<BR>res8:
Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 100))</TD></TR></TBODY></TABLE></DIV><SPAN
style="font-weight: bold;"><BR><BR></SPAN>
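For intuition only: on a range RDD whose partitions are filled evenly, glom behaves like grouped on a local collection. This is a rough analogy, not how Spark actually splits the data:

```scala
// Rough local analogy of glom: one array per "partition".
val parts: Array[Array[Int]] = (1 to 9).toArray.grouped(3).toArray
// Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))
```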
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="groupBy"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">groupBy</SPAN></BIG></BIG><BR>
<BR>Groups the data items using a user-supplied discriminator function;
the function's result becomes the key under which the items are
collected.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def groupBy[K: ClassTag](f: T => K):
RDD[(K, Iterable[T])]<BR>def groupBy[K: ClassTag](f: T => K,
numPartitions: Int): RDD[(K, Iterable[T])]<BR>def groupBy[K: ClassTag](f:
T => K, p: Partitioner): RDD[(K, Iterable[T])]</DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>a.groupBy(x => { if (x % 2 ==
0) "even" else "odd" }).collect<BR>res42: Array[(String, Seq[Int])]
= Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7,
9)))<BR><BR>val a = sc.parallelize(1 to 9, 3)<BR>def myfunc(a: Int)
: Int =<BR>{<BR> a % 2<BR>}<BR>a.groupBy(myfunc).collect<BR>
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)),
(1,ArrayBuffer(1, 3, 5, 7, 9)))<BR><BR>val a = sc.parallelize(1 to
9, 3)<BR>def myfunc(a: Int) : Int =<BR>{<BR> a % 2<BR>}<BR>
a.groupBy(x => myfunc(x), 3).collect<BR>a.groupBy(myfunc(_),
1).collect<BR>res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2,
4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))<BR><BR>import
org.apache.spark.Partitioner<BR>class MyPartitioner extends
Partitioner {<BR>def numPartitions: Int = 2<BR>def getPartition(key:
Any): Int =<BR>{<BR> key match<BR>
{<BR> case
null => 0<BR>
case key: Int =>
key %
numPartitions<BR> case
_ => key.hashCode %
numPartitions<BR> }<BR> }<BR>
override def equals(other: Any): Boolean =<BR> {<BR>
other match<BR> {<BR>
case h: MyPartitioner => true<BR>
case
_
=> false<BR> }<BR> }<BR>}<BR>val a =
sc.parallelize(1 to 9, 3)<BR>val p = new MyPartitioner()<BR>val b =
a.groupBy((x:Int) => { x }, p)<BR>val c = b.mapWith(i =>
i)((a, b) => (b, a))<BR>c.collect<BR>res42: Array[(Int, (Int,
Seq[Int]))] = Array((0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))),
(0,(6,ArrayBuffer(6))), (0,(8,ArrayBuffer(8))),
(1,(9,ArrayBuffer(9))), (1,(3,ArrayBuffer(3))),
(1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))),
(1,(5,ArrayBuffer(5))))<BR></TD></TR></TBODY></TABLE></DIV><SPAN style="font-weight: bold;"><BR><BR><BR></SPAN>
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="groupByKey"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">groupByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR> <BR>Very similar to
<SPAN style="font-style: italic;">groupBy</SPAN>, but instead of supplying
a function, the key component of each pair is used directly for grouping
and is presented to the partitioner.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def groupByKey(): RDD[(K,
Iterable[V])]<BR>def groupByKey(numPartitions: Int): RDD[(K,
Iterable[V])]<BR>def groupByKey(partitioner: Partitioner): RDD[(K,
Iterable[V])]</DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider",
"eagle"), 2)<BR>val b = a.keyBy(_.length)<BR>
b.groupByKey.collect<BR>res11: Array[(Int, Seq[String])] =
Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)),
(3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger,
eagle)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
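In local terms, keyBy(_.length) followed by groupByKey amounts to grouping a pair collection by its key component. A sketch (plain Scala, ignoring partitioning; groupByKeySketch is an illustrative name):

```scala
// Simulate keyBy + groupByKey on a local collection.
def groupByKeySketch[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

val pairs = Seq("dog", "tiger", "lion", "cat", "spider", "eagle").map(x => (x.length, x))
val grouped = groupByKeySketch(pairs)
// grouped(3) == Seq("dog", "cat"); grouped(5) == Seq("tiger", "eagle")
```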
<HR style="width: 100%; height: 2px;">
<BR><A name="histogram"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">histogram
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR> <BR>These functions
take an RDD of doubles and create a histogram with either even spacing
(the number of buckets equals <SPAN
style="font-style: italic;">bucketCount</SPAN>) or arbitrary spacing based
on custom bucket boundaries supplied by the user via an array of
double values. The result types of the two variants differ slightly: the
first function returns a tuple of two arrays, where the first
array contains the computed bucket boundary values and the second array
contains the corresponding count of values <SPAN style="font-style: italic;">(i.e.
the histogram)</SPAN>. The second variant simply returns
the histogram as an array of longs.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def histogram(bucketCount: Int):
Pair[Array[Double], Array[Long]]<BR>def histogram(buckets: Array[Double],
evenBuckets: Boolean = false): Array[Long]</DIV><BR><SPAN style="font-weight: bold;">Example
with even spacing</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6,
8.8, 9.0), 3)<BR>a.histogram(5)<BR>res11: (Array[Double],
Array[Long]) = (Array(1.1, 2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0,
0, 1, 4))<BR><BR>val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1,
1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)<BR>
a.histogram(6)<BR>res18: (Array[Double], Array[Long]) = (Array(1.0,
2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3,
4))</TD></TR></TBODY></TABLE></DIV><BR><BR><SPAN style="font-weight: bold;">Example
with custom spacing</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 65px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6,
8.8, 9.0), 3)<BR>a.histogram(Array(0.0, 3.0, 8.0))<BR>res14:
Array[Long] = Array(5, 3)<BR><BR>val a = sc.parallelize(List(9.1,
1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)<BR>a.histogram(Array(0.0, 5.0, 10.0))<BR>res1: Array[Long]
= Array(6, 9)<BR><BR>a.histogram(Array(0.0, 5.0, 10.0, 15.0))<BR>
res1: Array[Long] = Array(6, 8, 1)</TD></TR></TBODY></TABLE></DIV><BR><BR>
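The custom-spacing bucket logic can be sketched locally as follows. This is a plain-Scala simulation under stated assumptions (histogramSketch is an illustrative name): buckets are half-open intervals except the last, whose upper bound is inclusive, and values outside the boundary range are dropped.

```scala
// Simulate histogram with custom bucket boundaries.
def histogramSketch(data: Seq[Double], buckets: Array[Double]): Array[Long] = {
  val counts = Array.fill(buckets.length - 1)(0L)
  for (v <- data) {
    if (v == buckets.last) counts(counts.length - 1) += 1   // inclusive upper edge
    else {
      val i = buckets.lastIndexWhere(_ <= v)
      if (i >= 0 && i < counts.length) counts(i) += 1       // drop out-of-range values
    }
  }
  counts
}

val a = Seq(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0)
histogramSketch(a, Array(0.0, 3.0, 8.0))   // Array(5, 3), matching res14 above
```

Note how 8.8 and 9.0 fall outside the last boundary and are simply not counted.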
<HR style="width: 100%; height: 2px;">
<BR><A name="id"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">id</SPAN></BIG></BIG><BR><BR>Retrieves the ID
which has been assigned to the RDD by its SparkContext.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">val id: Int</DIV><BR><SPAN style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>y.id<BR>res16: Int =
19</TD></TR></TBODY></TABLE></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="intersection"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">intersection</SPAN></BIG></BIG><BR><BR>
Returns the elements that occur in both RDDs; the result contains no duplicates.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def intersection(other: RDD[T],
numPartitions: Int): RDD[T]<BR>def intersection(other: RDD[T],
partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]<BR>def
intersection(other: RDD[T]): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example<BR><BR></SPAN>
<TABLE style="width: 611px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val x = sc.parallelize(1 to 20)<BR>
val y = sc.parallelize(10 to 30)<BR>val z =
x.intersection(y)<BR><BR>z.collect<BR>res74: Array[Int] = Array(16,
12, 20, 13, 17, 14, 18, 10, 19, 15,
11)<BR></TD></TR></TBODY></TABLE><BR><BR><BR>
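Locally, this behaviour, including the deduplication, corresponds to a plain set intersection:

```scala
// intersection behaves like a set intersection: shared elements, no duplicates.
val x = (1 to 20).toSet
val y = (10 to 30).toSet
val z = (x & y).toSeq.sorted   // Seq(10, 11, ..., 20)
```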
<HR style="width: 100%; height: 2px;">
<BR><A name="isCheckpointed"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">isCheckpointed</SPAN></BIG></BIG><BR><BR>
Indicates whether the RDD has been checkpointed. The flag only becomes
true once the checkpoint has actually been materialized.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def isCheckpointed: Boolean</DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">sc.setCheckpointDir("/home/cloudera/Documents")<BR>
c.isCheckpointed<BR>res6: Boolean = false<BR><BR>c.checkpoint<BR>
c.isCheckpointed<BR>res8: Boolean = false<BR><BR>c.collect<BR>
c.isCheckpointed<BR>res9: Boolean =
true</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="iterator"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">iterator</SPAN></BIG></BIG><BR><BR>
Returns a compatible iterator object for a partition of this RDD. This
function should never be called directly.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">final def iterator(split: Partition,
context: TaskContext): Iterator[T]<BR></DIV><SPAN style="font-weight: bold;"><BR><BR></SPAN>
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="join"></A><BR><BR></SPAN><BIG><BIG><SPAN
style="font-weight: bold;">join<SMALL>
[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Performs an inner join using two
key-value RDDs. Note that the keys must be comparable for this to
work.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def join[W](other: RDD[(K, W)]): RDD[(K,
(V, W))]<BR>def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K,
(V, W))]<BR>def join[W](other: RDD[(K, W)], partitioner: Partitioner):
RDD[(K, (V, W))]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 639px; height: 159px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val d = c.keyBy(_.length)<BR>b.join(d).collect<BR><BR>res0:
Array[(Int, (String, String))] = Array((6,(salmon,salmon)),
(6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)),
(6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)),
(3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)),
(3,(rat,cat)), (3,(rat,gnu)),
(3,(rat,bee)))</TD></TR></TBODY></TABLE></DIV><BR>
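The inner-join semantics, a cross product per matching key, can be sketched on local pair collections (plain Scala, ignoring partitioning; joinSketch is an illustrative name):

```scala
// Simulate an inner join: every value pairing of a shared key appears once.
def joinSketch[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

val b = Seq((3, "dog"), (6, "salmon"))
val d = Seq((3, "cat"), (3, "gnu"), (6, "rabbit"))
joinSketch(b, d)
// Seq((3,(dog,cat)), (3,(dog,gnu)), (6,(salmon,rabbit)))
```

This cross-product behaviour is why duplicate keys on both sides, as with "salmon" above, multiply the number of output rows.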
<HR style="width: 100%; height: 2px;">
<BR><A name="keyBy"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">keyBy</SPAN></BIG></BIG><BR><BR>Constructs
two-component tuples (key-value pairs) by applying a function on each data
item. The result of the function becomes the key and the original data
item becomes the value of the newly created tuples.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def keyBy[K](f: T => K): RDD[(K,
T)]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>b.collect<BR>res26:
Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon),
(3,rat), (8,elephant))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="keys"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">keys
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><SPAN
style="font-weight: bold;"><BR><BR></SPAN>Extracts the keys from all
contained tuples and returns them in a new RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def keys: RDD[K]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.keys.collect<BR>res2: Array[Int] = Array(3, 5, 4, 3, 7,
5)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="leftOuterJoin"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">leftOuterJoin
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Performs a left outer
join using two key-value RDDs. Note that the keys must be comparable for
this to work correctly.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def leftOuterJoin[W](other: RDD[(K, W)]):
RDD[(K, (V, Option[W]))]<BR>def leftOuterJoin[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (V, Option[W]))]<BR>def
leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(V, Option[W]))]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val d = c.keyBy(_.length)<BR>
b.leftOuterJoin(d).collect<BR><BR>res1: Array[(Int, (String,
Option[String]))] = Array((6,(salmon,Some(salmon))),
(6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))),
(6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))),
(6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))),
(3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))),
(3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))),
(8,(elephant,None)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
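The None padding for unmatched left-side keys can be sketched locally (plain Scala, ignoring partitioning; leftOuterJoinSketch is an illustrative name):

```scala
// Simulate a left outer join: unmatched left keys are kept, paired with None.
def leftOuterJoinSketch[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)])
    : Seq[(K, (V, Option[W]))] =
  left.flatMap { case (k, v) =>
    val ws = right.collect { case (`k`, w) => w }
    if (ws.isEmpty) Seq((k, (v, None: Option[W])))
    else ws.map(w => (k, (v, Some(w): Option[W])))
  }

val res = leftOuterJoinSketch(Seq((8, "elephant"), (3, "dog")), Seq((3, "cat")))
// Seq((8,(elephant,None)), (3,(dog,Some(cat))))
```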
<HR style="width: 100%; height: 2px;">
<BR><A name="lookup"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">lookup</SPAN></BIG></BIG><BR><BR>
Scans the RDD for all entries with the given key and returns their
values as a Scala sequence.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def lookup(key: K): Seq[V]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.lookup(5)<BR>res0: Seq[String] = WrappedArray(tiger,
eagle)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="map"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">map</SPAN></BIG></BIG><BR><BR>Applies a
transformation function on each item of the RDD and returns the result as
a new RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def map[U: ClassTag](f: T => U):
RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;">Example</SPAN><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.map(_.length)<BR>val c = a.zip(b)<BR>
c.collect<BR>res0: Array[(String, Int)] = Array((dog,3), (salmon,6),
(salmon,6), (rat,3), (elephant,8))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitions"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitions</SPAN></BIG></BIG><BR><BR>
This is a specialized map that is called only once for each partition. The
entire content of the respective partitions is available as a sequential
stream of values via the input argument (<SPAN
style="font-style: italic;">Iterator[T]</SPAN>). The custom function
must return yet another <SPAN
style="font-style: italic;">Iterator[U]</SPAN>. The combined result
iterators are automatically converted into a new RDD. Note that
the tuples (3,4) and (6,7) are missing from the following result due to
the partitioning we chose.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def mapPartitions[U: ClassTag](f:
Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false):
RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;">Example
1</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>def myfunc[T](iter: Iterator[T]) :
Iterator[(T, T)] = {<BR> var res = List[(T, T)]()<BR>
var pre = iter.next<BR> while (iter.hasNext)<BR> {<BR>
val cur = iter.next;<BR> res
.::= (pre, cur)<BR> pre = cur;<BR> }<BR>
res.iterator<BR>}<BR>a.mapPartitions(myfunc).collect<BR>res0:
Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9),
(7,8))</TD></TR></TBODY></TABLE></DIV><BR><SPAN style="font-weight: bold;">Example
2</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9,10), 3)<BR>def
myfunc(iter: Iterator[Int]) : Iterator[Int] = {<BR> var res =
List[Int]()<BR> while (iter.hasNext) {<BR>
val cur = iter.next;<BR> res = res :::
List.fill(scala.util.Random.nextInt(10))(cur)<BR> }<BR>
res.iterator<BR>}<BR>x.mapPartitions(myfunc).collect<BR>// some
numbers do not appear at all, because the random
count generated for them is zero<BR>res8: Array[Int] = Array(1, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7,
9, 9, 10)</TD></TR></TBODY></TABLE></DIV><BR>The above program can also be
written using <SPAN style="font-weight: bold;">flatMap</SPAN> as
follows.<BR><BR><SPAN style="font-weight: bold;">Example 2 using
flatMap<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(1 to 10, 3)<BR>
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect<BR><BR>
res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9,
9, 9, 9, 10, 10, 10, 10, 10, 10, 10,
10)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitionsWithContext"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitionsWithContext</SPAN></BIG></BIG> <SPAN
style="font-weight: bold;"> (deprecated and developer API)</SPAN><BR><BR>
Similar to <SPAN style="font-style: italic;">mapPartitions</SPAN>, but
allows accessing information about the processing state within the
mapper.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR><BR>
<DIV style="margin-left: 40px;">def mapPartitionsWithContext[U:
ClassTag](f: (TaskContext, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>import
org.apache.spark.TaskContext<BR>def myfunc(tc: TaskContext, iter:
Iterator[Int]) : Iterator[Int] = {<BR>
tc.addOnCompleteCallback(() => println(<BR>
"Partition: " + tc.partitionId +<BR>
", AttemptID: " + tc.attemptId ))<BR>
<BR> iter.toList.filter(_ % 2 == 0).iterator<BR>}<BR>
a.mapPartitionsWithContext(myfunc).collect<BR><BR>14/04/01 23:05:48
INFO SparkContext: Starting job: collect at< console>:20<BR>
...<BR>14/04/01 23:05:48 INFO Executor: Running task ID 0<BR>
Partition: 0, AttemptID: 0, Interrupted: false<BR>...<BR>14/04/01
23:05:48 INFO Executor: Running task ID 1<BR>14/04/01 23:05:48 INFO
TaskSetManager: Finished TID 0 in 470 ms on localhost (progress:
0/3)<BR>...<BR>14/04/01 23:05:48 INFO Executor: Running task ID
2<BR>14/04/01 23:05:48 INFO TaskSetManager: Finished TID 1 in 23 ms
on localhost (progress: 1/3)<BR>14/04/01 23:05:48 INFO DAGScheduler:
Completed ResultTask(0, 1)<BR><BR>?<BR>res0: Array[Int] = Array(2,
6, 4, 8)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitionsWithIndex"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitionsWithIndex</SPAN></BIG></BIG><BR><BR>
Similar to <SPAN style="font-style: italic;">mapPartitions</SPAN>, but
takes two parameters. The first parameter is the index of the partition
and the second is an iterator through all the items within this
partition. The output is an iterator containing the list of items after
applying whatever transformation the function encodes.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def mapPartitionsWithIndex[U: ClassTag](f:
(Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean =
false): RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)<BR>def
myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {<BR>
iter.toList.map(x => index + "," + x).iterator<BR>}<BR>
x.mapPartitionsWithIndex(myfunc).collect()<BR>res10: Array[String] =
Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9,
2,10)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapPartitionsWithSplit"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapPartitionsWithSplit</SPAN></BIG></BIG><BR><BR>
This method has been marked as deprecated in the API, so it should no
longer be used. Deprecated methods are not covered in this
document.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants</SPAN><BR>
<DIV style="margin-left: 40px;">def mapPartitionsWithSplit[U: ClassTag](f:
(Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean =
false): RDD[U]<SPAN style="font-weight: bold;"><BR><BR><BR></SPAN>
</DIV><SPAN style="font-weight: bold;"></SPAN>
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="mapValues"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">mapValues
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Takes the values of an RDD
that consists of two-component tuples, and applies the provided function
to transform each value. It then forms new two-component tuples from the
key and the transformed value and stores them in a new
RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def mapValues[U](f: V => U): RDD[(K,
U)]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.mapValues("x" + _ + "x").collect<BR>res5: Array[(Int, String)] =
Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx),
(5,xeaglex))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mapWith"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">mapWith</SPAN></BIG></BIG>
<SPAN style="font-weight: bold;">(deprecated)</SPAN><BR><BR>This is
an extended version of <SPAN style="font-style: italic;">map</SPAN>. It
takes two function arguments. The first argument must conform to <SPAN
style="font-style: italic;">Int => A</SPAN> and is executed once per
partition. It maps the partition index to some value of type <SPAN
style="font-style: italic;">A</SPAN>, which makes it a convenient place
for per-partition initialization, such as creating a random number
generator object. The second function must conform to <SPAN
style="font-style: italic;">(T, A) => U</SPAN>, where <SPAN
style="font-style: italic;">T</SPAN> is a data item of the RDD and <SPAN
style="font-style: italic;">A</SPAN> is the per-partition value. It
returns a transformed data item of type <SPAN style="font-style: italic;">U</SPAN>.<BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def mapWith[A: ClassTag, U:
ClassTag](constructA: Int => A, preservesPartitioning: Boolean =
false)(f: (T, A) => U): RDD[U]<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example<BR></SPAN><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">//
generates 9 random numbers less than 1000. <BR>val x =
sc.parallelize(1 to 9, 3)<BR>x.mapWith(a => new
scala.util.Random)((x, r) => r.nextInt(1000)).collect<BR>res0:
Array[Int] = Array(940, 51, 779, 742, 757, 982, 35, 800, 15)<BR><BR>
val a = sc.parallelize(1 to 9, 3)<BR>val b = a.mapWith("Index:" +
_)((a, b) => ("Value:" + a, b))<BR>b.collect<BR>res0:
Array[(String, String)] = Array((Value:1,Index:0),
(Value:2,Index:0), (Value:3,Index:0), (Value:4,Index:1),
(Value:5,Index:1), (Value:6,Index:1), (Value:7,Index:2),
(Value:8,Index:2),
(Value:9,Index:2))<BR><BR><BR></TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="max"></A><BR><BIG><BIG><SPAN
style="font-weight: bold;">max</SPAN></BIG></BIG><SPAN style="font-weight: bold;"></SPAN><BR><BR>
Returns the largest element in the RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def max()(implicit ord: Ordering[T]):
T<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example<BR></SPAN><BR>
<TABLE style="width: 630px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val y = sc.parallelize(10 to
30)<BR>y.max<BR>res75: Int = 30<BR><BR>val a =
sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (18,
"cat")))<BR>a.max<BR>res6: (Int, String) =
(18,cat)<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="mean"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">mean <SMALL>[Double]</SMALL>, meanApprox
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Calls <SPAN style="font-style: italic;">stats</SPAN>
and extracts the mean component. The approximate version of the function
can finish somewhat faster in some scenarios. However, it trades accuracy
for speed.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def mean(): Double<BR>def
meanApprox(timeout: Long, confidence: Double = 0.95):
PartialResult[BoundedDouble]<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1,
7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)<BR>a.mean<BR>res0: Double =
5.3</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="min"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">min</SPAN></BIG></BIG><BR><BR>Returns the
smallest element in the RDD.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def min()(implicit ord: Ordering[T]):
T<BR></DIV><SPAN style="font-weight: bold;"><BR></SPAN><SPAN style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 637px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val y = sc.parallelize(10 to
30)<BR>y.min<BR>res75: Int = 10<BR><BR><BR>val a =
sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (8,
"cat")))<BR>a.min<BR>res4: (Int, String) =
(3,tiger)<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="name"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">name, setName</SPAN></BIG></BIG><BR><BR>Allows
an RDD to be tagged with a custom name.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">@transient var name: String<BR>def
setName(_name: String)<BR></DIV><SPAN
style="font-weight: bold;"><BR></SPAN><SPAN
style="font-weight: bold;"></SPAN><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>y.name<BR>res13: String =
null<BR>y.setName("Fancy RDD Name")<BR>y.name<BR>res15: String =
Fancy RDD Name</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="partitionBy"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">partitionBy
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Repartitions the key-value
RDD using its keys. The partitioner implementation is supplied as the
first argument.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def partitionBy(partitioner: Partitioner):
RDD[(K, V)]<BR></DIV><SPAN style="font-weight: bold;"><BR><BR></SPAN>
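<SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
The original page lists no example for this function; the following is a
minimal sketch (the commented results are assumptions, not captured shell
output):<BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit"), 2)<BR>
val b = a.keyBy(_.length)<BR>// supply the partitioner implementation
as the first argument<BR>val c = b.partitionBy(new
org.apache.spark.HashPartitioner(3))<BR>c.partitions.length<BR>//
3: equal keys now reside in the same
partition</TD></TR></TBODY></TABLE></DIV><BR><BR>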
<HR style="width: 100%; height: 2px;">
<SPAN style="font-weight: bold;"><BR><A
name="partitioner"></A><BR><BR></SPAN><BIG><BIG><SPAN style="font-weight: bold;">partitioner
</SPAN></BIG></BIG><BR><BR>Returns the partitioner, if any, associated
with the RDD. It is used as the default partitioner for functions such as
<SPAN style="font-style: italic;">groupBy</SPAN>, <SPAN
style="font-style: italic;">subtract</SPAN> and <SPAN
style="font-style: italic;">reduceByKey</SPAN> (from <SPAN
style="font-style: italic;">PairRDDFunctions</SPAN>).<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">@transient val partitioner:
Option[Partitioner]</DIV><BR><BR>
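<SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
No example is given above; as a minimal sketch (the commented results are
assumptions, not captured shell output):<BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List((1, "a"), (2, "b")), 2)<BR>a.partitioner<BR>
// None until a partitioner is set<BR>val b = a.partitionBy(new
org.apache.spark.HashPartitioner(2))<BR>b.partitioner<BR>//
Some(HashPartitioner) afterwards</TD></TR></TBODY></TABLE></DIV><BR><BR>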
<HR style="width: 100%; height: 2px;">
<BR><A name="partitions"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">partitions
</SPAN></BIG></BIG><BR><BR>Returns an array of the partition objects
associated with this RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">final def partitions:
Array[Partition]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>b.partitions<BR>res48: Array[org.apache.spark.Partition] =
Array(org.apache.spark.rdd.ParallelCollectionPartition@18aa,
org.apache.spark.rdd.ParallelCollectionPartition@18ab)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="persist"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">persist,
cache </SPAN></BIG></BIG><BR><BR>These functions can be used to adjust the
storage level of an RDD. When freeing up memory, Spark will use the storage
level identifier to decide which partitions should be kept. The
parameterless variants <SPAN style="font-style: italic;">persist()</SPAN>
and <SPAN style="font-style: italic;">cache()</SPAN> are just
abbreviations for <SPAN
style="font-style: italic;">persist(StorageLevel.MEMORY_ONLY)</SPAN>.
<SPAN style="font-style: italic;">(Warning: Once the storage level has
been changed, it cannot be changed again!)</SPAN><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def cache(): RDD[T]<BR>def persist():
RDD[T]<BR>def persist(newLevel: StorageLevel): RDD[T]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example<BR></SPAN><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),
2)<BR>c.getStorageLevel<BR>res0:
org.apache.spark.storage.StorageLevel = StorageLevel(false, false,
false, false, 1)<BR>c.cache<BR>c.getStorageLevel<BR>res2:
org.apache.spark.storage.StorageLevel = StorageLevel(false, true,
false, true, 1)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="pipe"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">pipe </SPAN></BIG></BIG><BR><BR>Takes the RDD
data of each partition and sends it via stdin to a shell-command. The
resulting output of the command is captured and returned as an RDD of
string values.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def pipe(command: String): RDD[String]<BR>
def pipe(command: String, env: Map[String, String]): RDD[String]<BR>def
pipe(command: Seq[String], env: Map[String, String] = Map(),
printPipeContext: (String => Unit) => Unit = null, printRDDElement:
(T, String => Unit) => Unit = null): RDD[String]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>a.pipe("head -n 1").collect<BR>
res2: Array[String] = Array(1, 4,
7)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><A name="randomSplit"></A><BR>
randomSplit </SPAN></BIG></BIG><BR><BR>Randomly splits an RDD into
multiple smaller RDDs according to a weights Array which specifies the
percentage of the total data elements that is assigned to each smaller
RDD. Note the actual size of each smaller RDD is only approximately equal
to the percentages specified by the weights Array. The second example
below shows the number of items in each smaller RDD does not exactly match
the weights Array. An optional random seed can be specified. This
function is useful for splitting data into a training set and a test set
for machine learning.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def randomSplit(weights: Array[Double],
seed: Long = Utils.random.nextLong): Array[RDD[T]]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BIG><BIG><SPAN style="font-weight: bold;"><BR></SPAN></BIG></BIG>
<TABLE style="width: 602px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val y = sc.parallelize(1 to 10)<BR>
val splits = y.randomSplit(Array(0.6, 0.4), seed = 11L)<BR>val
training = splits(0)<BR>val test = splits(1)<BR>training.collect<BR>
res85: Array[Int] = Array(1, 4, 5, 6, 8, 10)<BR>test.collect<BR>
res86: Array[Int] = Array(2, 3, 7, 9)<BR><BR>val y =
sc.parallelize(1 to 10)<BR>val splits = y.randomSplit(Array(0.1,
0.3, 0.6))<BR><BR>val rdd1 = splits(0)<BR>val rdd2 = splits(1)<BR>
val rdd3 = splits(2)<BR><BR>rdd1.collect<BR>res87: Array[Int] =
Array(4, 10)<BR>rdd2.collect<BR>res88: Array[Int] = Array(1, 3, 5,
8)<BR>rdd3.collect<BR>res91: Array[Int] = Array(2, 6, 7,
9)<BR></TD></TR></TBODY></TABLE><BIG><BIG><SPAN
style="font-weight: bold;"><BR></SPAN></BIG></BIG>
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><BR><A name="reduce"></A><BR>
reduce </SPAN></BIG></BIG><BR><BR>This function provides the well-known
<SPAN style="font-style: italic;">reduce</SPAN> functionality in Spark.
Please note that any function <SPAN style="font-style: italic;">f</SPAN>
you provide should be commutative and associative in order to generate
reproducible results.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def reduce(f: (T, T) => T):
T</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100, 3)<BR>a.reduce(_ + _)<BR>res41: Int =
5050</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="reduceByKey"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">reduceByKey
<SMALL>[Pair]</SMALL>, reduceByKeyLocally <SMALL>[Pair],</SMALL>
reduceByKeyToDriver <SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>This
function provides the well-known <SPAN
style="font-style: italic;">reduce</SPAN> functionality in Spark, applied
separately to the values of each key. Please note that any function <SPAN
style="font-style: italic;">f</SPAN> you provide should be commutative and
associative in order to generate reproducible
results.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def reduceByKey(func: (V, V) => V):
RDD[(K, V)]<BR>def reduceByKey(func: (V, V) => V, numPartitions: Int):
RDD[(K, V)]<BR>def reduceByKey(partitioner: Partitioner, func: (V, V)
=> V): RDD[(K, V)]<BR>def reduceByKeyLocally(func: (V, V) => V):
Map[K, V]<BR>def reduceByKeyToDriver(func: (V, V) => V): Map[K,
V]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)<BR>
val b = a.map(x => (x.length, x))<BR>b.reduceByKey(_ +
_).collect<BR>res86: Array[(Int, String)] =
Array((3,dogcatowlgnuant))<BR><BR>val a = sc.parallelize(List("dog",
"tiger", "lion", "cat", "panther", "eagle"), 2)<BR>val b = a.map(x
=> (x.length, x))<BR>b.reduceByKey(_ + _).collect<BR>res87:
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther),
(5,tigereagle))</TD></TR></TBODY></TABLE></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="repartition"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">repartition</SPAN></BIG></BIG><BR><BR>
This function changes the number of partitions to the number specified by
the numPartitions parameter.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def repartition(numPartitions:
Int)(implicit ord: Ordering[T] = null): RDD[T]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 631px; height: 24px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val rdd = sc.parallelize(List(1, 2,
10, 4, 5, 2, 1, 1, 1), 3)<BR>rdd.partitions.length<BR>res2: Int =
3<BR>val rdd2 = rdd.repartition(5)<BR>
rdd2.partitions.length<BR>res6: Int =
5<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A
name="repartitionAndSortWithinPartitions"></A><BR><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">repartitionAndSortWithinPartitions</SPAN></BIG></BIG>
[Ordered]<BR><BR>Repartition the RDD according to the given partitioner
and, within each resulting partition, sort records by their
keys.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def
repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K,
V)]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 683px; height: 23px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">// first we will do range
partitioning which is not sorted<BR>val randRDD =
sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3,
"book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)<BR>val
rPartitioner = new org.apache.spark.RangePartitioner(3, randRDD)<BR>
val partitioned = randRDD.partitionBy(rPartitioner)<BR>def
myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String]
= {<BR> iter.toList.map(x => "[partID:" + index + ",
val: " + x + "]").iterator<BR>}<BR>
partitioned.mapPartitionsWithIndex(myfunc).collect<BR><BR>res0:
Array[String] = Array([partID:0, val: (2,cat)], [partID:0, val:
(3,book)], [partID:0, val: (1,screen)], [partID:1, val: (4,tv)],
[partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2,
val: (7,cup)])<BR><BR><BR>// now lets repartition but this time
have it sorted<BR>val partitioned =
randRDD.repartitionAndSortWithinPartitions(rPartitioner)<BR>def
myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String]
= {<BR> iter.toList.map(x => "[partID:" + index + ",
val: " + x + "]").iterator<BR>}<BR>
partitioned.mapPartitionsWithIndex(myfunc).collect<BR><BR>res1:
Array[String] = Array([partID:0, val: (1,screen)], [partID:0, val:
(2,cat)], [partID:0, val: (3,book)], [partID:1, val: (4,tv)],
[partID:1, val: (5,heater)], [partID:2, val: (6,mouse)], [partID:2,
val: (7,cup)])<BR></TD></TR></TBODY></TABLE><BR><BR><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><BR><A name="rightOuterJoin"></A><BR><BR><BR><SPAN style="font-weight: bold;"></SPAN><BIG><BIG><SPAN
style="font-weight: bold;">rightOuterJoin
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Performs a right outer
join using two key-value RDDs. Please note that the keys must be generally
comparable to make this work correctly.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def rightOuterJoin[W](other: RDD[(K, W)]):
RDD[(K, (Option[V], W))]<BR>def rightOuterJoin[W](other: RDD[(K, W)],
numPartitions: Int): RDD[(K, (Option[V], W))]<BR>def
rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(Option[V], W))]</DIV><BR><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "salmon", "salmon", "rat",
"elephant"), 3)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),
3)<BR>val d = c.keyBy(_.length)<BR>
b.rightOuterJoin(d).collect<BR><BR>res2: Array[(Int,
(Option[String], String))] = Array((6,(Some(salmon),salmon)),
(6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)),
(6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)),
(6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)),
(3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)),
(3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)),
(4,(None,wolf)), (4,(None,bear)))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="sample"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">sample</SPAN></BIG></BIG><BR><BR>
Randomly selects a fraction of the items of an RDD and returns them in a
new RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def sample(withReplacement: Boolean,
fraction: Double, seed: Int): RDD[T]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10000, 3)<BR>a.sample(false, 0.1,
0).count<BR>res24: Long = 960<BR><BR>a.sample(true, 0.3,
0).count<BR>res25: Long = 2888<BR><BR>a.sample(true, 0.3,
13).count<BR>res26: Long = 2985</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="sampleByKey"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">sampleByKey</SPAN></BIG></BIG>
[Pair]<BR><BR>Randomly samples the key-value pair RDD according to the
fraction of each key you want to appear in the final RDD.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def sampleByKey(withReplacement: Boolean,
fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K,
V)]</DIV><BR><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 640px; height: 24px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val randRDD = sc.parallelize(List(
(7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6,
"screen"), (7, "heater")))<BR>val sampleMap = List((7, 0.4), (6,
0.6)).toMap<BR>randRDD.sampleByKey(false,
sampleMap,42).collect<BR><BR>res6: Array[(Int, String)] =
Array((7,cat), (6,mouse), (6,book), (6,screen),
(7,heater))<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="sampleByKeyExact"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">sampleByKeyExact</SPAN></BIG></BIG>
[Pair, experimental]<BR><BR>This is labelled as experimental and so we do
not document it.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def sampleByKeyExact(withReplacement:
Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong):
RDD[(K, V)]<BR></DIV><BR><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsHadoopFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsHadoopFile
<SMALL>[Pair]</SMALL>, saveAsHadoopDataset <SMALL>[Pair]</SMALL>,
saveAsNewAPIHadoopFile <SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>
Saves the RDD in a Hadoop compatible format using any Hadoop outputFormat
class the user specifies.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR><BR></SPAN>
<DIV style="margin-left: 40px;">def saveAsHadoopDataset(conf: JobConf)<BR>
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String)(implicit
fm: ClassTag[F])<BR>def saveAsHadoopFile[F <: OutputFormat[K,
V]](path: String, codec: Class[_ <: CompressionCodec]) (implicit fm:
ClassTag[F])<BR>def saveAsHadoopFile(path: String, keyClass: Class[_],
valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_,
_]], codec: Class[_ <: CompressionCodec])<BR>def saveAsHadoopFile(path:
String, keyClass: Class[_], valueClass: Class[_], outputFormatClass:
Class[_ <: OutputFormat[_, _]], conf: JobConf = new
JobConf(self.context.hadoopConfiguration), codec: Option[Class[_ <:
CompressionCodec]] = None)<BR>def saveAsNewAPIHadoopFile[F <:
NewOutputFormat[K, V]](path: String)(implicit fm: ClassTag[F])<BR>def
saveAsNewAPIHadoopFile(path: String, keyClass: Class[_], valueClass:
Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf:
Configuration = self.context.hadoopConfiguration)</DIV><BR><BR>
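<BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
No example appears above; the following minimal sketch uses Hadoop's
classic <SPAN style="font-style: italic;">TextOutputFormat</SPAN> (chosen
here as an assumption; any OutputFormat class the user specifies would
work):<BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">import
org.apache.hadoop.mapred.TextOutputFormat<BR>val v =
sc.parallelize(List(("owl", 3), ("gnu", 4), ("dog", 1)), 2)<BR>//
save the pair RDD with an explicit Hadoop OutputFormat class<BR>
v.saveAsHadoopFile[TextOutputFormat[String,
Int]]("hd_file")</TD></TR></TBODY></TABLE></DIV><BR><BR>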
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsObjectFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsObjectFile</SPAN></BIG></BIG><BR><BR>
Saves the RDD in binary format.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def saveAsObjectFile(path:
String)<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(1 to 100, 3)<BR>
x.saveAsObjectFile("objFile")<BR>val y =
sc.objectFile[Int]("objFile")<BR>y.collect<BR>res52: Array[Int]
= Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsSequenceFile"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsSequenceFile
<SMALL>[SeqFile]</SMALL></SPAN></BIG></BIG><BR><BR>Saves the RDD as a
Hadoop sequence file.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def saveAsSequenceFile(path: String,
codec: Option[Class[_ <: CompressionCodec]] = None)<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1),
("cat",2), ("ant",5)), 2)<BR>v.saveAsSequenceFile("hd_seq_file")<BR>
14/04/19 05:45:43 INFO FileOutputCommitter: Saved output of task
'attempt_201404190545_0000_m_000001_191' to
file:/home/cloudera/hd_seq_file<BR><BR>[cloudera@localhost ~]$ ll
~/hd_seq_file<BR>total 8<BR>-rwxr-xr-x 1 cloudera cloudera 117 Apr
19 05:45 part-00000<BR>-rwxr-xr-x 1 cloudera cloudera 133 Apr 19
05:45 part-00001<BR>-rwxr-xr-x 1 cloudera cloudera 0 Apr
19 05:45 _SUCCESS</TD></TR></TBODY></TABLE></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="saveAsTextFile"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">saveAsTextFile</SPAN></BIG></BIG><BR><BR>
Saves the RDD as text files, one element per line.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def saveAsTextFile(path: String)<BR>def
saveAsTextFile(path: String, codec: Class[_ <:
CompressionCodec])<BR></DIV><BR><SPAN style="font-weight: bold;">Example
without compression</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 10000, 3)<BR>
a.saveAsTextFile("mydata_a")<BR>14/04/03 21:11:36 INFO
FileOutputCommitter: Saved output of task
'attempt_201404032111_0000_m_000002_71' to
file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a<BR><BR><BR>
[cloudera@localhost ~]$ head -n 5
~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/part-00000<BR>
1<BR>2<BR>3<BR>4<BR>5<BR><BR>// Produces 3 output files since we
have created the RDD with 3 partitions<BR>[cloudera@localhost ~]$
ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/<BR>
-rwxr-xr-x 1 cloudera cloudera 15558 Apr 3 21:11
part-00000<BR>-rwxr-xr-x 1 cloudera cloudera 16665 Apr 3 21:11
part-00001<BR>-rwxr-xr-x 1 cloudera cloudera 16671 Apr 3 21:11
part-00002</TD></TR></TBODY></TABLE></DIV><BR><BR><SPAN style="font-weight: bold;">Example
with compression</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">import
org.apache.hadoop.io.compress.GzipCodec<BR>
a.saveAsTextFile("mydata_b", classOf[GzipCodec])<BR><BR>
[cloudera@localhost ~]$ ll
~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_b/<BR>total
24<BR>-rwxr-xr-x 1 cloudera cloudera 7276 Apr 3 21:29
part-00000.gz<BR>-rwxr-xr-x 1 cloudera cloudera 6517 Apr 3
21:29 part-00001.gz<BR>-rwxr-xr-x 1 cloudera cloudera 6525 Apr
3 21:29 part-00002.gz<BR><BR>val x = sc.textFile("mydata_b")<BR>
x.count<BR>res2: Long = 10000</TD></TR></TBODY></TABLE></DIV><BR><BR><SPAN
style="font-weight: bold;">Example writing into HDFS<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)<BR>
x.saveAsTextFile("hdfs://localhost:8020/user/cloudera/test");<BR><BR>
val sp =
sc.textFile("hdfs://localhost:8020/user/cloudera/sp_data")<BR>
sp.flatMap(_.split("
")).saveAsTextFile("hdfs://localhost:8020/user/cloudera/sp_x")</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="stats"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">stats
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Simultaneously computes
the mean, variance and the standard deviation of all values in the
RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def stats():
StatCounter<BR></DIV><BR><SPAN
style="font-weight: bold;">Example<BR><BR></SPAN>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29,
11.09, 21.0), 2)<BR>x.stats<BR>res16:
org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667,
stdev: 8.126859)</TD></TR></TBODY></TABLE></DIV><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;"></SPAN></BIG></BIG>
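The StatCounter returned by stats also exposes the individual statistics as fields. A minimal sketch, assuming the usual spark-shell session where sc is in scope (field availability may vary slightly by Spark version):

```scala
// Assumes a running spark-shell, so `sc` (the SparkContext) is in scope.
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29,
  11.09, 21.0), 2)
val s = x.stats
s.count           // number of elements (9)
s.mean            // arithmetic mean
s.sum             // total of all values
s.variance        // population variance (stdev squared)
s.sampleVariance  // corrected sample variance
s.stdev           // population standard deviation
```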
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN
style="font-weight: bold;"></SPAN></BIG></BIG><BIG><BIG><SPAN style="font-weight: bold;"><BR><A
name="sortBy"></A><BR>sortBy</SPAN></BIG></BIG><BIG><BIG><SPAN style="font-weight: bold;"><BR></SPAN></BIG></BIG><BR>
This function sorts the input RDD's data and stores it in a new RDD. The
first parameter is a function that maps each input item to the key you
want to sort by. The optional second parameter specifies whether the data
should be sorted in ascending or descending
order.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def sortBy[K](f: (T) ⇒ K, ascending:
Boolean = true, numPartitions: Int = this.partitions.size)(implicit ord:
Ordering[K], ctag: ClassTag[K]): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;"><BR></SPAN></BIG></BIG>
<TABLE style="width: 686px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;"><BR>val y = sc.parallelize(Array(5,
7, 1, 3, 2, 1))<BR>y.sortBy(c => c, true).collect<BR>res101:
Array[Int] = Array(1, 1, 2, 3, 5, 7)<BR><BR>y.sortBy(c => c,
false).collect<BR>res102: Array[Int] = Array(7, 5, 3, 2, 1,
1)<BR><BR>val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z",
1), ("L", 5)))<BR>z.sortBy(c => c._1, true).collect<BR>res109:
Array[(String, Int)] = Array((A,26), (H,10), (L,5), (Z,1))<BR><BR>
z.sortBy(c => c._2, true).collect<BR>res108: Array[(String, Int)]
= Array((Z,1), (L,5), (H,10),
(A,26))<BR></TD></TR></TBODY></TABLE><BIG><BIG><SPAN style="font-weight: bold;"><BR><BR></SPAN></BIG></BIG>
<HR style="width: 100%; height: 2px;">
<BIG><BIG><SPAN style="font-weight: bold;"><BR><A
name="sortByKey"></A><BR><BR>sortByKey
<SMALL>[Ordered]</SMALL></SPAN></BIG></BIG><BR><BR>This function sorts the
input RDD's data and stores it in a new RDD. The output RDD is a
ShuffledRDD because it holds data emitted by a reducer after a shuffle.
The implementation of this function is actually quite clever: it first
uses a range partitioner to partition the data into ranges within the
shuffled RDD, and then sorts each range individually with
mapPartitions using standard sort mechanisms.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def sortByKey(ascending: Boolean = true,
numPartitions: Int = self.partitions.size): RDD[P]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)<BR>
val b = sc.parallelize(1 to a.count.toInt, 2)<BR>val c =
a.zip(b)<BR>c.sortByKey(true).collect<BR>res74: Array[(String, Int)]
= Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))<BR>
c.sortByKey(false).collect<BR>res75: Array[(String, Int)] =
Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))<BR><BR>val a =
sc.parallelize(1 to 100, 5)<BR>val b = a.cartesian(a)<BR>val c =
sc.parallelize(b.takeSample(true, 5, 13), 2)<BR>val d =
c.sortByKey(false)<BR>d.collect<BR>res56: Array[(Int, Int)] = Array((96,9),
(84,76), (59,59), (53,65), (52,4))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="stdev"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">stdev <SMALL>[Double], sampleStdev
[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Calls <SPAN style="font-style: italic;">stats</SPAN>
and extracts either the <SPAN
style="font-style: italic;">stdev</SPAN> component or the corrected <SPAN
style="font-style: italic;">sampleStdev</SPAN> component.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def stdev(): Double<BR>def sampleStdev():
Double<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
d = sc.parallelize(List(0.0, 0.0, 0.0), 3)<BR>d.stdev<BR>res10:
Double = 0.0<BR>d.sampleStdev<BR>res11: Double = 0.0<BR><BR>val d =
sc.parallelize(List(0.0, 1.0), 3)<BR>d.stdev<BR>d.sampleStdev<BR>
res18: Double = 0.5<BR>res19: Double = 0.7071067811865476<BR><BR>val
d = sc.parallelize(List(0.0, 0.0, 1.0), 3)<BR>d.stdev<BR>res14:
Double = 0.4714045207910317<BR>d.sampleStdev<BR>res15: Double =
0.5773502691896257</TD></TR></TBODY></TABLE></DIV><BR><BR>
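The difference between the two variants is the divisor: stdev uses the population formula (divide by n), while sampleStdev applies Bessel's correction (divide by n - 1). The List(0.0, 1.0) case above can be checked in plain Scala:

```scala
// Verify the 0.0 / 1.0 example by hand: mean = 0.5, squared deviations sum to 0.5.
val xs = List(0.0, 1.0)
val mean = xs.sum / xs.size
val sqDev = xs.map(x => (x - mean) * (x - mean)).sum
math.sqrt(sqDev / xs.size)        // 0.5                (stdev: divide by n)
math.sqrt(sqDev / (xs.size - 1))  // 0.7071067811865476 (sampleStdev: divide by n - 1)
```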
<HR style="width: 100%; height: 2px;">
<BR><A name="subtract"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">subtract</SPAN></BIG></BIG><BR><BR>
Performs the well-known set subtraction operation: A -
B<BR><BR><SPAN style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def subtract(other: RDD[T]): RDD[T]<BR>def
subtract(other: RDD[T], numPartitions: Int): RDD[T]<BR>def subtract(other:
RDD[T], p: Partitioner): RDD[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>val b = sc.parallelize(1 to 3,
3)<BR>val c = a.subtract(b)<BR>c.collect<BR>res3: Array[Int] =
Array(6, 9, 4, 7, 5, 8)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="subtractByKey"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">subtractByKey
<SMALL>[Pair]</SMALL></SPAN></BIG></BIG><BR><BR>Very similar to <SPAN
style="font-style: italic;">subtract</SPAN>, but instead of supplying a
function, the key component of each pair is automatically used as the
criterion for removing items from the first RDD.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def subtractByKey[W: ClassTag](other:
RDD[(K, W)]): RDD[(K, V)]<BR>def subtractByKey[W: ClassTag](other: RDD[(K,
W)], numPartitions: Int): RDD[(K, V)]<BR>def subtractByKey[W:
ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K,
V)]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider",
"eagle"), 2)<BR>val b = a.keyBy(_.length)<BR>val c =
sc.parallelize(List("ant", "falcon", "squid"), 2)<BR>val d =
c.keyBy(_.length)<BR>b.subtractByKey(d).collect<BR>res15:
Array[(Int, String)] =
Array((4,lion))</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="sum"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">sum <SMALL>[Double], sumApprox
[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Computes the sum of all values
contained in the RDD. The approximate version of the function can finish
somewhat faster in some scenarios. However, it trades accuracy for
speed.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def sum(): Double<BR>def
sumApprox(timeout: Long, confidence: Double = 0.95):
PartialResult[BoundedDouble]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29,
11.09, 21.0), 2)<BR>x.sum<BR>res17: Double =
101.39999999999999</TD></TR></TBODY></TABLE></DIV><BR><BR>
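sumApprox returns a PartialResult rather than a plain Double. A hedged sketch of how it might be used, assuming a spark-shell with sc in scope (the exact PartialResult accessors are an assumption to check against your Spark version):

```scala
// Assumes a running spark-shell, so `sc` is in scope.
val x = sc.parallelize(1 to 1000000, 10).map(_.toDouble)
// Wait up to 500 ms; if the job has not finished, an estimate is returned.
val approx = x.sumApprox(500, 0.95)
// The current value is a BoundedDouble: an estimate with low/high
// confidence bounds at the requested confidence level.
approx.initialValue
```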
<HR style="width: 100%; height: 2px;">
<BR><A name="take"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">take</SPAN></BIG></BIG><BR><BR>Extracts the
first <SPAN style="font-style: italic;">n</SPAN> items of the RDD and
returns them as an array. <SPAN style="font-style: italic;">(Note: This
sounds very easy, but it is actually quite a tricky problem for the
implementors of Spark because the items in question can be in many
different partitions.)</SPAN><BR><BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def take(num: Int):
Array[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"),
2)<BR>b.take(2)<BR>res18: Array[String] = Array(dog, cat)<BR><BR>val
b = sc.parallelize(1 to 10000, 5000)<BR>b.take(100)<BR>res6:
Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
100)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="takeOrdered"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">takeOrdered</SPAN></BIG></BIG><BR><BR>
Orders the data items of the RDD using their inherent implicit ordering
function and returns the first <SPAN style="font-style: italic;">n</SPAN>
items as an array.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def takeOrdered(num: Int)(implicit ord:
Ordering[T]): Array[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"),
2)<BR>b.takeOrdered(2)<BR>res19: Array[String] = Array(ape,
cat)</TD></TR></TBODY></TABLE></DIV><BR><BR>
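Because the ordering is an implicit parameter, a different Ordering can also be passed explicitly. A minimal sketch, assuming a spark-shell with sc in scope (a reversed Ordering makes takeOrdered behave like top):

```scala
// Assumes a running spark-shell, so `sc` is in scope.
val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
// Pass a reversed Ordering explicitly: smallest-first becomes largest-first.
b.takeOrdered(2)(Ordering[String].reverse)
// res: Array[String] = Array(salmon, gnu)
```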
<HR style="width: 100%; height: 2px;">
<BR><A name="takeSample"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">takeSample</SPAN></BIG></BIG><BR><BR>
Behaves differently from <SPAN style="font-style: italic;">sample</SPAN> in
the following respects:<BR>
<UL>
<LI> It returns an exact number of samples <SPAN style="font-style: italic;">(the
second parameter)</SPAN>.</LI>
<LI> It returns an Array instead of an RDD.</LI>
<LI> It internally randomizes the order of the items
returned.</LI></UL><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def takeSample(withReplacement: Boolean,
num: Int, seed: Int): Array[T]<BR></DIV><BR><SPAN style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
x = sc.parallelize(1 to 1000, 3)<BR>x.takeSample(true, 100, 1)<BR>
res3: Array[Int] = Array(339, 718, 810, 105, 71, 268, 333, 360, 341,
300, 68, 848, 431, 449, 773, 172, 802, 339, 431, 285, 937, 301,
167, 69, 330, 864, 40, 645, 65, 349, 613, 468, 982, 314, 160, 675,
232, 794, 577, 571, 805, 317, 136, 860, 522, 45, 628, 178, 321, 482,
657, 114, 332, 728, 901, 290, 175, 876, 227, 130, 863, 773, 559,
301, 694, 460, 839, 952, 664, 851, 260, 729, 823, 880, 792, 964,
614, 821, 683, 364, 80, 875, 813, 951, 663, 344, 546, 918, 436, 451,
397, 670, 756, 512, 391, 70, 213, 896, 123,
858)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="toDebugString"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toDebugString</SPAN></BIG></BIG><BR><BR>
Returns a string that contains debug information about the RDD and its
dependencies.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def toDebugString:
String<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 9, 3)<BR>val b = sc.parallelize(1 to 3,
3)<BR>val c = a.subtract(b)<BR>c.toDebugString<BR>res6: String =
<BR>MappedRDD[15] at subtract at <console>:16 (3
partitions)<BR> SubtractedRDD[14] at subtract at
<console>:16 (3 partitions)<BR>
MappedRDD[12] at subtract at <console>:16 (3 partitions)<BR>
ParallelCollectionRDD[10] at
parallelize at <console>:12 (3 partitions)<BR>
MappedRDD[13] at subtract at <console>:16
(3 partitions)<BR>
ParallelCollectionRDD[11] at parallelize at <console>:12 (3
partitions)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="toJavaRDD"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toJavaRDD</SPAN></BIG></BIG><BR><BR>
Embeds this RDD object within a JavaRDD object and returns
it.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def toJavaRDD() :
JavaRDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)<BR>
c.toJavaRDD<BR>res3: org.apache.spark.api.java.JavaRDD[String] =
ParallelCollectionRDD[6] at parallelize at
<console>:12</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="toLocalIterator"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toLocalIterator</SPAN></BIG></BIG><BR><BR>
Converts the RDD into a Scala iterator on the driver node.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def toLocalIterator:
Iterator[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 627px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z =
sc.parallelize(List(1,2,3,4,5,6), 2)<BR>val iter =
z.toLocalIterator<BR><BR>iter.next<BR>res51: Int = 1<BR><BR>
iter.next<BR>res52: Int = 2<BR></TD></TR></TBODY></TABLE><BR><BR>
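The practical benefit over collect is memory: partitions are shipped to the driver one at a time as the iterator advances. A sketch, assuming a spark-shell with sc in scope:

```scala
// Assumes a running spark-shell, so `sc` is in scope.
val z = sc.parallelize(1 to 100, 4)
// Only one partition needs to fit in driver memory at any moment.
var total = 0
for (x <- z.toLocalIterator) total += x
// total: Int = 5050
```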
<HR style="width: 100%; height: 2px;">
<BR><A name="top"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">top</SPAN></BIG></BIG><BR><BR>Utilizes the
implicit ordering of T to determine the top k values and returns them
as an array.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def top(num: Int)(implicit ord:
Ordering[T]): Array[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)<BR>c.top(2)<BR>
res28: Array[Int] = Array(9, 8)</TD></TR></TBODY></TABLE></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="toString"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">toString</SPAN></BIG></BIG><BR><BR>
Assembles a human-readable textual description of the
RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">override def toString:
String<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
z = sc.parallelize(List(1,2,3,4,5,6), 2)<BR>z.toString<BR>res61:
String = ParallelCollectionRDD[80] at parallelize at
<console>:21<BR><BR>val randRDD = sc.parallelize(List(
(7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6,
"screen"), (7, "heater")))<BR>val sortedRDD =
randRDD.sortByKey()<BR>sortedRDD.toString<BR>res64: String =
ShuffledRDD[88] at sortByKey at
<console>:23<BR><BR></TD></TR></TBODY></TABLE></DIV><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="treeAggregate"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">treeAggregate</SPAN></BIG></BIG><BR><BR>
Computes the same thing as aggregate, except it aggregates the elements of
the RDD in a multi-level tree pattern. Another difference is that it does
not use the initial value for the second reduce function (combOp).
By default a tree of depth 2 is used, but this can be changed via the
depth parameter.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def treeAggregate[U](zeroValue: U)(seqOp:
(U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0:
ClassTag[U]): U<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR><BR>
<TABLE style="width: 713px; height: 376px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z =
sc.parallelize(List(1,2,3,4,5,6), 2)<BR><BR>// lets first print out
the contents of the RDD with partition labels<BR>def myfunc(index:
Int, iter: Iterator[(Int)]) : Iterator[String] = {<BR>
iter.toList.map(x => "[partID:" + index + ", val: " + x +
"]").iterator<BR>}<BR><BR>
z.mapPartitionsWithIndex(myfunc).collect<BR>res28: Array[String] =
Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3],
[partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])<BR><BR>
z.treeAggregate(0)(math.max(_, _), _ + _)<BR>res40: Int = 9<BR><BR>
// Note: unlike normal aggregate, treeAggregate does not apply the
initial value in the second reduce<BR>// This example returns 11
since the initial value is 5<BR>// reduce of partition 0 will be
max(5, 1, 2, 3) = 5<BR>// reduce of partition 1 will be max(4, 5, 6)
= 6<BR>// final reduce across partitions will be 5 + 6 = 11<BR>//
note the final reduce does not include the initial value<BR>
z.treeAggregate(5)(math.max(_, _), _ + _)<BR>res42: Int =
11</TD></TR></TBODY></TABLE><BR><BR>
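The depth parameter can also be passed explicitly. A sketch, assuming a spark-shell with sc in scope (with only a handful of partitions the result is identical; the tree shape only matters for RDDs with very many partitions):

```scala
// Assumes a running spark-shell, so `sc` is in scope.
val z = sc.parallelize(1 to 1000, 8)
// depth = 2 is the default; a larger depth adds intermediate combine
// stages, reducing the amount of data the driver has to merge at the end.
z.treeAggregate(0)(_ + _, _ + _, depth = 3)
// res: Int = 500500
```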
<HR style="width: 100%; height: 2px;">
<BR><A name="treeReduce"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">treeReduce</SPAN></BIG></BIG><BR><BR>
Works like reduce except reduces the elements of the RDD in a multi-level
tree pattern.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def treeReduce(f: (T, T) ⇒ T, depth:
Int = 2): T<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 753px; height: 43px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z =
sc.parallelize(List(1,2,3,4,5,6), 2)<BR>z.treeReduce(_+_)<BR>res49:
Int = 21<BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="union"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">union, ++</SPAN></BIG></BIG><BR><BR>Performs
the standard set operation: A union B. Note that duplicates are not
removed.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def ++(other: RDD[T]): RDD[T]<BR>def
union(other: RDD[T]): RDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 3, 1)<BR>val b = sc.parallelize(5 to 7,
1)<BR>(a ++ b).collect<BR>res0: Array[Int] = Array(1, 2, 3, 5, 6,
7)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="unpersist"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">unpersist</SPAN></BIG></BIG><BR><BR>
Dematerializes the RDD <SPAN style="font-style: italic;">(i.e., erases all
data items from disk and memory)</SPAN>. However, the RDD object
remains. If it is referenced in a computation, Spark will regenerate it
automatically using the stored dependency graph.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def unpersist(blocking: Boolean = true):
RDD[T]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
y = sc.parallelize(1 to 10, 10)<BR>val z = (y++y)<BR>z.collect<BR>
z.unpersist(true)<BR>14/04/19 03:04:57 INFO UnionRDD: Removing RDD
22 from persistence list<BR>14/04/19 03:04:57 INFO BlockManager:
Removing RDD 22</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="values"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">values</SPAN></BIG></BIG><BR><BR>
Extracts the values from all contained tuples and returns them in a new
RDD.<BR><BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def values: RDD[V]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther",
"eagle"), 2)<BR>val b = a.map(x => (x.length, x))<BR>
b.values.collect<BR>res3: Array[String] = Array(dog, tiger, lion,
cat, panther, eagle)</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="variance"></A><BR><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">variance
<SMALL>[Double]</SMALL>, sampleVariance
<SMALL>[Double]</SMALL></SPAN></BIG></BIG><BR><BR>Calls stats and extracts
either the <SPAN style="font-style: italic;">variance</SPAN> component or
the corrected <SPAN
style="font-style: italic;">sampleVariance</SPAN> component.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def variance(): Double<BR>def
sampleVariance(): Double<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1,
7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)<BR>a.variance<BR>res70:
Double = 10.605333333333332<BR><BR>val x = sc.parallelize(List(1.0,
2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)<BR>
x.variance<BR>res14: Double = 66.04584444444443<BR><BR>
x.sampleVariance<BR>res13: Double =
74.30157499999999</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="zip"></A><BR><BR><BIG><BIG><SPAN
style="font-weight: bold;">zip</SPAN></BIG></BIG><BR><BR>Joins two RDDs by
pairing up the i-th elements of each. The resulting RDD
will consist of two-component tuples which are interpreted as key-value
pairs by the methods provided by the PairRDDFunctions extension. Note
that both RDDs need to have the same number of partitions and the same
number of elements in each partition.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zip[U: ClassTag](other: RDD[U]):
RDD[(T, U)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(1 to 100, 3)<BR>val b = sc.parallelize(101 to
200, 3)<BR>a.zip(b).collect<BR>res1: Array[(Int, Int)] =
Array((1,101), (2,102), (3,103), (4,104), (5,105), (6,106), (7,107),
(8,108), (9,109), (10,110), (11,111), (12,112), (13,113), (14,114),
(15,115), (16,116), (17,117), (18,118), (19,119), (20,120),
(21,121), (22,122), (23,123), (24,124), (25,125), (26,126),
(27,127), (28,128), (29,129), (30,130), (31,131), (32,132),
(33,133), (34,134), (35,135), (36,136), (37,137), (38,138),
(39,139), (40,140), (41,141), (42,142), (43,143), (44,144),
(45,145), (46,146), (47,147), (48,148), (49,149), (50,150),
(51,151), (52,152), (53,153), (54,154), (55,155), (56,156),
(57,157), (58,158), (59,159), (60,160), (61,161), (62,162),
(63,163), (64,164), (65,165), (66,166), (67,167), (68,168),
(69,169), (70,170), (71,171), (72,172), (73,173), (74,174),
(75,175), (76,176), (77,177), (78,...<BR><BR>val a =
sc.parallelize(1 to 100, 3)<BR>val b = sc.parallelize(101 to 200,
3)<BR>val c = sc.parallelize(201 to 300, 3)<BR>
a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2 )).collect<BR>
res12: Array[(Int, Int, Int)] = Array((1,101,201), (2,102,202),
(3,103,203), (4,104,204), (5,105,205), (6,106,206), (7,107,207),
(8,108,208), (9,109,209), (10,110,210), (11,111,211), (12,112,212),
(13,113,213), (14,114,214), (15,115,215), (16,116,216),
(17,117,217), (18,118,218), (19,119,219), (20,120,220),
(21,121,221), (22,122,222), (23,123,223), (24,124,224),
(25,125,225), (26,126,226), (27,127,227), (28,128,228),
(29,129,229), (30,130,230), (31,131,231), (32,132,232),
(33,133,233), (34,134,234), (35,135,235), (36,136,236),
(37,137,237), (38,138,238), (39,139,239), (40,140,240),
(41,141,241), (42,142,242), (43,143,243), (44,144,244),
(45,145,245), (46,146,246), (47,147,247), (48,148,248),
(49,149,249), (50,150,250), (51,151,251), (52,152,252),
(53,153,253), (54,154,254),
(55,155,255)...</TD></TR></TBODY></TABLE></DIV><BR><BR>
<HR style="width: 100%; height: 2px;">
<BR><A name="zipPartitions"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">zipPartitions</SPAN></BIG></BIG><BR><BR>
Similar to <SPAN style="font-style: italic;">zip</SPAN>, but provides more
control over the zipping process.<BR><BR><SPAN
style="font-weight: bold;">Listing Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zipPartitions[B: ClassTag, V:
ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]):
RDD[V]<BR>def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B],
preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) =>
Iterator[V]): RDD[V]<BR>def zipPartitions[B: ClassTag, C: ClassTag, V:
ClassTag](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B],
Iterator[C]) => Iterator[V]): RDD[V]<BR>def zipPartitions[B: ClassTag,
C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C],
preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C])
=> Iterator[V]): RDD[V]<BR>def zipPartitions[B: ClassTag, C: ClassTag,
D: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f:
(Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]):
RDD[V]<BR>def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V:
ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D],
preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B],
Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<DIV style="margin-left: 40px;">
<TABLE style="width: 586px; height: 54px; text-align: left;" border="1"
cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD
style="vertical-align: top; background-color: rgb(242, 242, 242);">val
a = sc.parallelize(0 to 9, 3)<BR>val b = sc.parallelize(10 to 19,
3)<BR>val c = sc.parallelize(100 to 109, 3)<BR>def myfunc(aiter:
Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]):
Iterator[String] =<BR>{<BR> var res = List[String]()<BR>
while (aiter.hasNext && biter.hasNext &&
citer.hasNext)<BR> {<BR> val x = aiter.next
+ " " + biter.next + " " + citer.next<BR> res ::=
x<BR> }<BR> res.iterator<BR>}<BR>a.zipPartitions(b,
c)(myfunc).collect<BR>res50: Array[String] = Array(2 12 102, 1 11
101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9 19 109, 8 18 108, 7
17 107, 6 16 106)</TD></TR></TBODY></TABLE><BR></DIV><BR>
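One concrete bit of extra control: unlike zip, zipPartitions only requires the RDDs to have the same number of partitions, not the same number of elements per partition, because your function receives the raw iterators. A sketch, assuming a spark-shell with sc in scope (the helper name pad is illustrative):

```scala
// Assumes a running spark-shell, so `sc` is in scope.
val a = sc.parallelize(1 to 8, 2)  // 4 elements per partition
val b = sc.parallelize(1 to 4, 2)  // 2 elements per partition; zip would fail here
// Pad the shorter iterator instead of failing.
def pad(ai: Iterator[Int], bi: Iterator[Int]): Iterator[String] =
  ai.map(x => x + "-" + (if (bi.hasNext) bi.next.toString else "none"))
a.zipPartitions(b)(pad).collect
```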
<HR style="width: 100%; height: 2px;">
<A name="zipWithIndex"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">zipWithIndex</SPAN></BIG></BIG><BR><BR>
Zips the elements of the RDD with their element indexes. The indexes start
from 0. If the RDD is spread across multiple partitions, a Spark job
is started to perform this operation.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zipWithIndex(): RDD[(T,
Long)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 629px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z = sc.parallelize(Array("A",
"B", "C", "D"))<BR>val r = z.zipWithIndex<BR>r.collect<BR>res110: Array[(String,
Long)] = Array((A,0), (B,1), (C,2), (D,3))<BR><BR>val z =
sc.parallelize(100 to 120, 5)<BR>val r = z.zipWithIndex<BR>
r.collect<BR>res11: Array[(Int, Long)] = Array((100,0), (101,1),
(102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8),
(109,9), (110,10), (111,11), (112,12), (113,13), (114,14), (115,15),
(116,16), (117,17), (118,18), (119,19),
(120,20))<BR><BR></TD></TR></TBODY></TABLE><BR><BR>
<HR style="width: 100%; height: 2px;">
<A name="zipWithUniqueId"></A><BR><BR><BIG><BIG><SPAN style="font-weight: bold;">zipWithUniqueId</SPAN></BIG></BIG><BR><BR>
This is different from zipWithIndex: it gives each data element a unique
id, but the ids may not match the element's index number. Items in the
k-th partition receive the ids k, k+n, k+2n, and so on, where n is the
number of partitions. This operation does not start a Spark job even if
the RDD is spread across multiple partitions.<BR>Compare the results of
the example below with those of the second zipWithIndex example; you
should be able to see the difference.<BR><BR><SPAN style="font-weight: bold;">Listing
Variants<BR></SPAN><BR>
Variants<BR></SPAN><BR>
<DIV style="margin-left: 40px;">def zipWithUniqueId(): RDD[(T,
Long)]<BR></DIV><BR><SPAN
style="font-weight: bold;">Example</SPAN><BR><BR>
<TABLE style="width: 672px; height: 28px; text-align: left; margin-left: 40px;"
border="1" cellspacing="2" cellpadding="2">
<TBODY>
<TR>
<TD style="vertical-align: top;">val z = sc.parallelize(100 to 120,
5)<BR>val r = z.zipWithUniqueId<BR>r.collect<BR><BR>res12:
Array[(Int, Long)] = Array((100,0), (101,5), (102,10), (103,15),
(104,1), (105,6), (106,11), (107,16), (108,2), (109,7), (110,12),
(111,17), (112,3), (113,8), (114,13), (115,18), (116,4), (117,9),
(118,14), (119,19),
(120,24))<BR></TD></TR></TBODY></TABLE><BR><BR></TD></TR></TBODY></TABLE>
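The id scheme behind the output above can be sketched in plain Scala, with the RDD modelled as a sequence of partitions (a hypothetical illustration, not Spark's code): the i-th element of partition k gets the id k + i * n for n partitions, which requires no global counting.

```scala
// Hypothetical sketch of zipWithUniqueId's id scheme: the i-th element
// of partition k gets id k + i * n, for n partitions. Not Spark's code.
object ZipWithUniqueIdSketch {
  def zipWithUniqueId[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val n = partitions.size.toLong
    partitions.zipWithIndex.flatMap { case (part, k) =>
      // No counting job needed: each id depends only on k, i and n.
      part.zipWithIndex.map { case (x, i) => (x, k + i * n) }
    }
  }
}
```

Applied to 100 to 120 split into 5 partitions, this scheme reproduces the ids 0, 5, 10, 15 for the first partition, 1, 6, 11, 16 for the second, and so on, matching the example output.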
</BODY></HTML>