The Way of Spark (Advanced): Spark from Beginner to Mastery, Section 6: The Spark Programming Model (Part 3)

Main topics in this section

  1. RDD transformation (continued)
  2. RDD actions

1. RDD transformation (continued)

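(1) repartitionAndSortWithinPartitions(partitioner)
repartitionAndSortWithinPartitions is a variant of repartition. It repartitions the RDD with the given partitioner and sorts the records by key within each resulting partition. This is more efficient than calling repartition and then sorting each partition, because the sort is pushed down into the shuffle.
Function definition: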
/** 
* Repartition the RDD according to the given partitioner and, within each resulting partition, 
* sort records by their keys. 

* This is more efficient than calling repartition and then sorting within each partition 
* because it can push the sorting down into the shuffle machinery. 
*/ 
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

Usage example:

```scala
scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

scala> val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:21

scala> data.repartitionAndSortWithinPartitions(new HashPartitioner(3)).collect
res3: Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (2,3), (2,4), (5,4))
```
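To make the per-partition result easier to picture, here is a minimal plain-Scala sketch (no Spark required) that mimics the effect on the same six records. The object name `RepartitionSortModel` and the helper `partitionOf` are made up for illustration. The helper assumes that the key's hash code modulo the partition count, adjusted to be non-negative, selects the partition, which is how `HashPartitioner` is documented to behave. It is a model of the outcome, not Spark's shuffle code.

```scala
object RepartitionSortModel {
  val records = List((1, 3), (1, 2), (5, 4), (1, 4), (2, 3), (2, 4))
  val numPartitions = 3

  // Hypothetical helper: non-negative hash of the key modulo the partition count.
  def partitionOf(key: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Route each record to its partition, then sort each partition by key only.
  val partitions: Vector[List[(Int, Int)]] =
    (0 until numPartitions).toVector.map { p =>
      records.filter { case (k, _) => partitionOf(k) == p }.sortBy(_._1)
    }

  def main(args: Array[String]): Unit =
    partitions.zipWithIndex.foreach { case (part, i) => println(s"partition $i: $part") }
}
```

Keys 2 and 5 share a partition and come out ordered by key. The three records with key 1 have no guaranteed order among themselves, because only the keys are compared.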

(2)aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

aggregateByKey aggregates the values that share a key in a pair RDD, starting from a neutral initial value (the "zero value"). It is defined as follows:
/** 
* Aggregate the values of each key, using given combine functions and a neutral “zero value”. 
* This function can return a different result type, U, than the type of the values in this RDD, 
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s, 
* as in scala.TraversableOnce. The former operation is used for merging values within a 
* partition, and the latter is used for merging values between partitions. To avoid memory 
* allocation, both of these functions are allowed to modify and return their first argument 
* instead of creating a new U. 
*/ 
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, 
combOp: (U, U) => U): RDD[(K, U)]

Example code:

```scala
import org.apache.spark.SparkContext._ // pair RDD implicits; only required before Spark 1.3
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(List((1, 3), (1, 2), (1, 4), (2, 3), (2, 4)))

    // Merges one value into the running result within a partition.
    def seqOp(a: Int, b: Int): Int = {
      println("seq: " + a + "\t " + b)
      math.max(a, b)
    }

    // Merges two partial results that come from different partitions.
    def combineOp(a: Int, b: Int): Int = {
      println("comb: " + a + "\t " + b)
      a + b
    }

    val localIterator = data.aggregateByKey(1)(seqOp, combineOp).toLocalIterator
    for (i <- localIterator) println(i)
    sc.stop()
  }
}
```

Output:

seq: 1 3 
seq: 3 2 
seq: 3 4 
seq: 1 3 
seq: 3 4

(1,4) 
(2,4)

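Judging from the output, seqOp was applied but combineOp was never called, and the result was the same on Spark 1.5, 1.4 and 1.3. The article at http://www.iteblog.com/archives/1261 demonstrates aggregateByKey on Spark 1.1 and gets the result it expects. The difference is most likely caused by partitioning rather than by the Spark version: combOp only merges partial results that come from different partitions. Here the master is "local" and parallelize is called without a partition count, so all the data ends up in a single partition and there is nothing for combOp to merge.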
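One way to see when each function runs is a small plain-Scala model (no Spark required) of the two phases described in the doc comment: seqOp folds values inside a partition, and combOp merges results across partitions. Everything in it (`AggregateByKeyModel`, the local `aggregateByKey` helper, the call counter) is hypothetical and only illustrates the semantics under the assumption that each inner sequence stands for one partition. It is not Spark's implementation.

```scala
object AggregateByKeyModel {
  // Hypothetical model: each inner Seq plays the role of one partition.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Phase 1: inside each partition, fold every key's values into a U, starting from zeroValue.
    val partials: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Phase 2: merge the per-partition maps; combOp runs only for keys seen in several partitions.
    partials.foldLeft(Map.empty[K, U]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(prev => combOp(prev, u)).getOrElse(u))
      }
    }
  }

  val data = List((1, 3), (1, 2), (1, 4), (2, 3), (2, 4))
  var combCalls = 0
  def seqOp(a: Int, b: Int): Int = math.max(a, b)
  def combOp(a: Int, b: Int): Int = { combCalls += 1; a + b }

  // One partition: every key is finished by seqOp alone.
  val onePartition = aggregateByKey(Seq(data), 1)(seqOp, combOp) // Map(1 -> 4, 2 -> 4)
  val combCallsWithOnePartition = combCalls                       // 0

  // Two partitions: key 1 appears in both, so combOp merges its two partial results (3 + 4).
  val twoPartitions = aggregateByKey(Seq(data.take(2), data.drop(2)), 1)(seqOp, combOp) // Map(1 -> 7, 2 -> 4)
  val combCallsWithTwoPartitions = combCalls - combCallsWithOnePartition                // 1

  def main(args: Array[String]): Unit = {
    println(s"one partition:  $onePartition, combOp calls: $combCallsWithOnePartition")
    println(s"two partitions: $twoPartitions, combOp calls: $combCallsWithTwoPartitions")
  }
}
```

In the Spark program itself, passing an explicit partition count to parallelize, for example `sc.parallelize(List(...), 2)`, should spread the values of a key over more than one partition and make the "comb:" lines appear.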

RDDs offer many other useful transformations; see the API documentation: http://spark.apache.org/docs/latest/api/scala/index.html

2. RDD actions

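This subsection introduces the commonly used actions. The collect method used earlier is itself an action: it returns all elements of the RDD to the driver as an array. Its definition is: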

/** 
* Return an array that contains all of the elements in this RDD. 
*/ 
def collect(): Array[T]

(1) reduce(func)
reduce combines all elements of the RDD into a single value using a commutative and associative binary operator. Its definition is:
/** 
* Reduces the elements of this RDD using the specified commutative and 
* associative binary operator. 
*/ 
def reduce(f: (T, T) => T): T 
Usage example:

```scala
scala> val data = sc.parallelize(1 to 9)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:22

scala> data.reduce((x, y) => x + y)
res12: Int = 45

scala> data.reduce(_ + _)
res13: Int = 45
```
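The requirement that the operator be commutative and associative matters because the elements are reduced partition by partition and the partial results are then reduced again. The following plain-Scala sketch (no Spark required) models that two-step process. `ReduceModel` and its local `reduce` helper are hypothetical names, and the fixed merge order is a simplification, since a real cluster gives no such guarantee.

```scala
object ReduceModel {
  // Hypothetical model of a distributed reduce: reduce every partition, then reduce the partial results.
  def reduce[T](partitions: Seq[Seq[T]])(f: (T, T) => T): T =
    partitions.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)

  val numbers = (1 to 9).toList
  val onePartition = Seq(numbers)
  val twoPartitions = Seq(numbers.take(4), numbers.drop(4))

  // Addition is commutative and associative: the partitioning does not matter.
  val sumOne = reduce(onePartition)(_ + _)  // 45
  val sumTwo = reduce(twoPartitions)(_ + _) // 45

  // Subtraction is neither: the result depends on how the data is split.
  val diffOne = reduce(onePartition)(_ - _)  // -43
  val diffTwo = reduce(twoPartitions)(_ - _) // 17

  def main(args: Array[String]): Unit =
    println(s"sum: $sumOne / $sumTwo, difference: $diffOne / $diffTwo")
}
```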

(2) count()

/** 
* Return the number of elements in the RDD. 
*/ 
def count(): Long

Usage example:

```scala
scala> val data = sc.parallelize(1 to 9)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:22

scala> data.count
res14: Long = 9
```

(3) first()
/**
* Return the first element in this RDD.
*/
def first(): T

```scala
scala> val data = sc.parallelize(1 to 9)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:22

scala> data.first
res15: Int = 1
```

(4) take(n)

/** 
* Take the first num elements of the RDD. It works by first scanning one partition, and use the 
* results from that partition to estimate the number of additional partitions needed to satisfy 
* the limit. 

* @note due to complications in the internal implementation, this method will raise 
* an exception if called on an RDD of Nothing or Null
*/ 
def take(num: Int): Array[T]

```scala
scala> val data = sc.parallelize(1 to 9)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:22

scala> data.take(2)
res16: Array[Int] = Array(1, 2)
```
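The doc comment describes take as scanning one partition first and reading further partitions only when more elements are needed. A deliberately simplified plain-Scala sketch of that idea follows (no Spark required). `TakeModel` is a hypothetical name, and the model reads exactly one extra partition per round, whereas Spark estimates how many additional partitions to scan. It only illustrates why take can be much cheaper than collect.

```scala
import scala.collection.mutable.ArrayBuffer

object TakeModel {
  // Hypothetical model: returns the first n elements and the number of partitions that were read.
  def take[T](partitions: Seq[Seq[T]], n: Int): (Seq[T], Int) = {
    val buf = ArrayBuffer.empty[T]
    var scanned = 0
    // Stop as soon as enough elements have been gathered.
    while (buf.size < n && scanned < partitions.size) {
      buf ++= partitions(scanned).take(n - buf.size)
      scanned += 1
    }
    (buf.toList, scanned)
  }

  val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))

  val (firstTwo, scannedForTwo) = take(partitions, 2)   // (List(1, 2), 1)
  val (firstFive, scannedForFive) = take(partitions, 5) // (List(1, 2, 3, 4, 5), 2)

  def main(args: Array[String]): Unit = {
    println(s"take(2): $firstTwo after reading $scannedForTwo partition(s)")
    println(s"take(5): $firstFive after reading $scannedForFive partition(s)")
  }
}
```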

(5) takeSample(withReplacement, num, [seed])

Returns a fixed-size random sample of the RDD's elements as an array.
/** 
* Return a fixed-size sampled subset of this RDD in an array 

* @param withReplacement whether sampling is done with replacement 
* @param num size of the returned sample 
* @param seed seed for the random number generator 
* @return sample of specified size in an array 
*/ 
// TODO: rewrite this without return statements so we can wrap it in a scope 
def takeSample( 
withReplacement: Boolean, 
num: Int, 
seed: Long = Utils.random.nextLong): Array[T]

```scala
scala> val data = sc.parallelize(1 to 9)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:22

scala> data.takeSample(false, 5)
res17: Array[Int] = Array(6, 7, 4, 1, 2)

scala> data.takeSample(true, 5)
res18: Array[Int] = Array(3, 3, 8, 3, 8)
```
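The two calls differ only in withReplacement, which explains why the second result contains repeated values. The plain-Scala sketch below (no Spark required) shows what the flag and the seed mean on a local collection. `TakeSampleModel` and its local `takeSample` helper are hypothetical stand-ins, and Spark's distributed sampling is implemented differently.

```scala
import scala.util.Random

object TakeSampleModel {
  // Hypothetical local stand-in for takeSample on a non-empty collection.
  def takeSample[T](data: Vector[T], withReplacement: Boolean, num: Int, seed: Long): Vector[T] = {
    val rng = new Random(seed)
    if (withReplacement) Vector.fill(num)(data(rng.nextInt(data.size))) // an element may be drawn repeatedly
    else rng.shuffle(data).take(num)                                    // every element is used at most once
  }

  val data = (1 to 9).toVector

  val without = takeSample(data, withReplacement = false, num = 5, seed = 42L)
  val withRepl = takeSample(data, withReplacement = true, num = 5, seed = 42L)
  // The same seed reproduces the same sample.
  val again = takeSample(data, withReplacement = true, num = 5, seed = 42L)

  def main(args: Array[String]): Unit = {
    println(s"without replacement: $without")
    println(s"with replacement:    $withRepl")
  }
}
```

Passing an explicit seed, as in the third call, is the usual way to make a sampling step repeatable in tests.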

(6) takeOrdered(n, [ordering])

/** 
* Returns the first k (smallest) elements from this RDD as defined by the specified 
* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]]. 
* For example: 
* {{{
* sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1) 
* // returns Array(2) 

* sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2) 
* // returns Array(2, 3) 
* }}} 

* @param num k, the number of elements to return 
* @param ord the implicit ordering for T 
* @return an array of top elements 
*/ 
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
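Since the implicit Ordering decides what "smallest" means, the same method can return the largest elements when the ordering is reversed. The plain-Scala sketch below (no Spark required) models this: each partition keeps its own n best candidates, and the candidates are then merged. `TakeOrderedModel` and its local `takeOrdered` helper are hypothetical, and the merge strategy is a simplification rather than Spark's actual code.

```scala
object TakeOrderedModel {
  // Hypothetical model: keep the n smallest elements of each partition, then merge the candidates.
  def takeOrdered[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    partitions.map(_.sorted(ord).take(n)).foldLeft(Seq.empty[T]) { (best, candidates) =>
      (best ++ candidates).sorted(ord).take(n)
    }

  val partitions = Seq(Seq(10, 4), Seq(2, 12, 3))

  // Default ascending Ordering[Int]: the two smallest elements.
  val smallestTwo = takeOrdered(partitions, 2) // Seq(2, 3)

  // Reversed ordering: the two largest elements, largest first.
  val largestTwo = takeOrdered(partitions, 2)(Ordering[Int].reverse) // Seq(12, 10)

  def main(args: Array[String]): Unit =
    println(s"smallest two: $smallestTwo, largest two: $largestTwo")
}
```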

(7) saveAsTextFile(path)

Saves the RDD as text, one element per line. The path names an output directory that holds one part file per partition, not a single file. With a local file system the output is written locally; on a cluster whose default file system is HDFS it is written to HDFS.
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit

```scala
scala> data.saveAsTextFile("/data.txt")
```

(8) countByKey()
Counts the number of elements for each key in the RDD and returns the result to the driver as a local Map.
/** 
* Count the number of elements for each key, collecting the results to a local Map. 

* Note that this method should only be used if the resulting map is expected to be small, as 
* the whole thing is loaded into the driver’s memory. 
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which 
* returns an RDD[T, Long] instead of a map. 
*/ 
def countByKey(): Map[K, Long]
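The doc comment mentions `rdd.mapValues(_ => 1L).reduceByKey(_ + _)` as the alternative for large results. The plain-Scala sketch below (no Spark required) shows on a small local list that both formulations produce the same counts. `CountByKeyModel` is a hypothetical name, and the local collection calls only imitate the RDD operations.

```scala
object CountByKeyModel {
  val data = List((1, 3), (1, 2), (5, 4), (1, 4), (2, 3), (2, 4))

  // countByKey: the values are ignored, only the occurrences of each key are counted.
  val counts: Map[Int, Long] =
    data.groupBy(_._1).map { case (k, pairs) => k -> pairs.size.toLong }

  // Local analogue of mapValues(_ => 1L).reduceByKey(_ + _): replace each value by 1L, then sum per key.
  val countsViaReduce: Map[Int, Long] =
    data.map { case (k, _) => (k, 1L) }
      .groupBy(_._1)
      .map { case (k, ones) => k -> ones.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit =
    println(s"counts: $counts, via reduce: $countsViaReduce")
}
```

In Spark the practical difference is where the result lives: countByKey brings the whole map to the driver, while the reduceByKey form keeps the counts distributed as an RDD.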

使用示例:

<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:22

scala> data.countByKey()
res22: scala.collection.Map[Int,Long] = Map(1 -> 3, 5 -> 1, 2 -> 2)
</code>
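
The Scaladoc above recommends `rdd.mapValues(_ => 1L).reduceByKey(_ + _)` when the result may be too large for the driver. The following is a minimal sketch of that alternative, assuming a local master; the object name `CountByKeySketch` and the helper `distributedCounts` are hypothetical names introduced only for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountByKeySketch {
  // Counts elements per key but keeps the result distributed as an RDD
  // instead of collecting it into a local Map (hypothetical helper).
  def distributedCounts(data: RDD[(Int, Int)]): RDD[(Int, Long)] =
    data.mapValues(_ => 1L).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CountByKeySketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(List((1, 3), (1, 2), (5, 4), (1, 4), (2, 3), (2, 4)), 3)
      // collect() is safe here only because the sample data is tiny.
      distributedCounts(data).collect().sortBy(_._1).foreach(println)
      // prints (1,3), (2,2) and (5,1), one per line
    } finally {
      sc.stop()
    }
  }
}
```

Unlike countByKey, `reduceByKey` is a transformation, so nothing runs until an action such as `collect` or `saveAsTextFile` is called on the resulting RDD.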


(8) foreach(func) 
The foreach method applies the given function to every element of the RDD. 
// Actions (launch a job to return a value to the user program)

/** 
* Applies a function f to all elements of this RDD. 
*/ 
def foreach(f: T => Unit): Unit

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">import org.apache.spark.{SparkConf, SparkContext}

object ForEachDemo {
  def main(args: Array[String]): Unit = {
    // Run locally; this demo does not need any command-line arguments.
    val conf = new SparkConf().setAppName("ForEachDemo").setMaster("local")
    val sc = new SparkContext(conf)

    // A small RDD of (key, value) pairs.
    val data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(2,4)))

    // foreach is an action: the function is applied to each element on the executors.
    data.foreach(x => println("key=" + x._1 + ",value=" + x._2))
    sc.stop()
  }
}</code>
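
Since the function passed to foreach runs on the executors, its side effects (such as println) are only visible on the driver console when running in local mode, and the element order is not guaranteed. The sketch below shows two common ways to get results back to the driver. It assumes Spark 2.0 or later, where `sc.longAccumulator` is available; the object name `ForeachSketch` and the helper `sumValues` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ForeachSketch {
  // Sums all values with an accumulator that is updated inside foreach
  // (hypothetical helper; requires Spark 2.0+ for longAccumulator).
  def sumValues(sc: SparkContext, data: RDD[(Int, Int)]): Long = {
    val acc = sc.longAccumulator("valueSum")
    // Runs on the executors; accumulator updates are merged back on the driver.
    data.foreach { case (_, v) => acc.add(v.toLong) }
    acc.sum
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(List((1, 3), (1, 2), (1, 4), (2, 3), (2, 4)))
      println("sum of values = " + sumValues(sc, data)) // prints "sum of values = 16"
      // For small RDDs, collect first so that printing happens on the driver.
      data.collect().foreach(x => println("key=" + x._1 + ",value=" + x._2))
    } finally {
      sc.stop()
    }
  }
}
```

Calling `collect()` brings every element to the driver, so it is only appropriate when the RDD is known to be small.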


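Spark also provides other very useful operations, such as foldByKey and sampleByKey (strictly speaking, these two are transformations on key-value RDDs rather than actions); see the API documentation: http://spark.apache.org/docs/latest/api/scala/index.html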
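
As a small illustration of foldByKey, the sketch below sums the values for each key. It assumes a local master and the sample data used earlier in this article; the object name `FoldByKeySketch` and the helper `sumByKey` are hypothetical. Note that foldByKey returns a new RDD, so an action such as `collect` is still needed to trigger the computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FoldByKeySketch {
  // Sums the values of each key, starting from the zero value 0 (hypothetical helper).
  // The zero value may be applied more than once per key (once in each partition),
  // so it should be the identity of the combining function.
  def sumByKey(data: RDD[(Int, Int)]): RDD[(Int, Int)] =
    data.foldByKey(0)(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FoldByKeySketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(List((1, 3), (1, 2), (5, 4), (1, 4), (2, 3), (2, 4)), 3)
      sumByKey(data).collect().sortBy(_._1).foreach(println)
      // prints (1,9), (2,7) and (5,4), one per line
    } finally {
      sc.stop()
    }
  }
}
```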

Reposted from: http://blog.csdn.net/lovehuangjiaju/article/details/48622757
