Spark local mode vs. cluster mode use cases: Spark fundamentals and principles


weixin_39706561


1. Overview

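Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. It is a cluster computing environment similar to Hadoop; the biggest difference is that Spark uses in-memory distributed datasets, and its hallmarks are large scale and low latency. Spark supports Scala, Python, and Java.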

Comparison with MapReduce

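Spark models a job as a directed acyclic graph (DAG), a topology in which no path that starts at a node ever leads back to it. Compared with MapReduce, Spark has the following advantages: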

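MapReduce usually writes intermediate results to HDFS, whereas Spark is an in-memory parallel big-data framework that keeps intermediate results in memory, which makes it efficient for iterative workloads.
MapReduce always spends a lot of time sorting, yet some scenarios need no sorting; Spark avoids the overhead of unnecessary sorts.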

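In short, unlike MapReduce, Spark keeps the intermediate output of a job in memory and so no longer needs to read and write HDFS between steps, which makes Spark better suited to data mining and machine learning.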

AMPLab and the Spark ecosystem

[Figure: the Spark ecosystem]

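AMPLab is a very strong lab that works on big data and cloud computing in close collaboration with industry; it previously produced Mesos, Hadoop Online, and CrowdDB. Many well-known companies such as Twitter and LinkedIn like to hire from Berkeley, and Twitter even ran a dedicated course, Analyzing Big Data with Twitter. The lab is also proud of BDAS (the Berkeley Data Analytics Stack, pronounced "badass"): The lab that created Spark wants to speed up everything, including cures for cancer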

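In 2013, leading members of the Berkeley AMPLab left to found Databricks. Within half a year the company held two summits with 1,000 attendees and won over many Hadoop veterans. A look at the summit sponsors shows that every Hadoop vendor came, and technology companies were courting it as well: Cloudera, Hortonworks, MapR, DataStax, Yahoo, Ooyala. According to the CTO, the volume and activity of new Spark code that year far exceeded that of Hadoop itself, and the company planned to launch a commercial product, Cloud.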

Run modes

Local (for development and testing)
Standalone (standalone cluster mode)
Spark on YARN
Spark on Mesos
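Each run mode above is selected through the master URL handed to Spark (for example with spark-submit --master). The sketch below is a plain-Scala illustration, not Spark code: the RunModes object, its case classes, and the host names in the usage note are hypothetical, and only the URL formats themselves come from Spark.

```scala
// Hypothetical helper: maps each run mode to the master URL string Spark accepts.
object RunModes {
  sealed trait Mode
  case class Local(threads: Int) extends Mode                  // single JVM, for development and testing
  case class Standalone(host: String, port: Int) extends Mode  // Spark's own cluster manager
  case object Yarn extends Mode                                // Hadoop YARN
  case class Mesos(host: String, port: Int) extends Mode       // Apache Mesos

  def masterUrl(mode: Mode): String = mode match {
    case Local(n)         => s"local[$n]"
    case Standalone(h, p) => s"spark://$h:$p"
    case Yarn             => "yarn"
    case Mesos(h, p)      => s"mesos://$h:$p"
  }
}
```

For instance, RunModes.masterUrl(RunModes.Local(2)) gives local[2], which runs everything in one JVM with two threads, while RunModes.Standalone("master-host", 7077) gives spark://master-host:7077 for a standalone cluster.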

2. How Spark works

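Spark has two roles, Driver and Worker. The Driver program launches multiple Workers; the Workers load data from the file system and produce RDDs (an RDD is a data structure that holds the data), then cache them in memory partition by partition.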

[Figure: Spark execution steps]

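RDD is the core of Spark. Once you can explain clearly how RDDs achieve fault tolerance and process data efficiently, you have grasped the core principles of Spark; the sections below cover each point in turn.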

Spark combines four parameters, useDisk, useMemory, deserialized, and replication, into 11 caching strategies (storage levels).

Spark operators fall roughly into three categories:

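Transformation operators on Value-type data: these do not trigger job submission and work on data items of Value type.
Transformation operators on Key-Value-type data: these do not trigger job submission and work on data items of Key-Value type.
Action operators: these trigger SparkContext to submit a job.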

RDD

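RDD stands for Resilient Distributed Dataset. It is a fault-tolerant, parallel data structure that lets users explicitly store data on disk or in memory and control how the data is partitioned. An RDD can be regarded as a Spark object that itself lives in memory (for example, a computation over a file is an RDD).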

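RDDs also provide a rich set of operations for working with this data. An RDD is a read-only, partitioned collection of records that can only be created through deterministic operations on stored data or on other RDDs (transformations). The ways to create one are: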

From an in-memory collection
From file system input (local files, HDFS, HBase)
By transforming a parent RDD

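An RDD abstracts the data, which is spread across the nodes. An RDD is partitioned, and the number of partitions can be specified. By default, one HDFS block corresponds to one partition.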

Wide and narrow dependencies

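An RDD can contain multiple partitions, each of which is a fragment of the dataset. RDDs can depend on one another. If each partition of the parent RDD is used by at most one partition of a child RDD, the dependency is a narrow dependency; if multiple child RDD partitions may depend on it, the dependency is a wide dependency.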

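Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD. Operations such as groupByKey, reduceByKey, and sortByKey produce wide dependencies and cause a shuffle.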

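Narrow dependency: each partition of the parent RDD is used by only one partition of the child RDD. Operations such as map, filter, and union produce narrow dependencies.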

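The join operation has two cases. If each partition of one RDD is joined with only a known number of partitions of the other RDD, the join is a narrow dependency, as in the left half of the original figure (join with inputs co-partitioned). Otherwise the join is a wide dependency, as in the right half of the original figure (join with inputs not co-partitioned): because the transformation needs all partitions of the parent RDDs, it involves a shuffle.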
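The difference between the two dependency types can be sketched with plain Scala collections. This is a simplified model, not Spark code: an RDD is modeled as a Vector of partitions, the Dependencies object and its function names are hypothetical, and the target partition of a record is assumed to be key % numParts.

```scala
object Dependencies {
  type Part = Vector[(Int, String)]

  // Parent RDD modeled as two partitions of (key, value) records.
  val parent: Vector[Part] = Vector(
    Vector(1 -> "a", 2 -> "b"),
    Vector(1 -> "c", 3 -> "d")
  )

  // Narrow: a map-style operation keeps every record in the partition it came from.
  def mapValuesNarrow(rdd: Vector[Part])(f: String => String): Vector[Part] =
    rdd.map(_.map { case (k, v) => (k, f(v)) })

  // Wide: grouping by key moves each record to the partition chosen by its key,
  // so one parent partition can feed several child partitions (a shuffle).
  def groupByKeyWide(rdd: Vector[Part], numParts: Int): Vector[Map[Int, Vector[String]]] = {
    val shuffled = rdd.flatten.groupBy { case (k, _) => k % numParts }
    Vector.tabulate(numParts) { p =>
      shuffled.getOrElse(p, Vector.empty)
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2) }
    }
  }

  // For each parent partition, the set of child partitions that read from it.
  def childrenOf(rdd: Vector[Part], numParts: Int): Vector[Set[Int]] =
    rdd.map(_.map { case (k, _) => k % numParts }.toSet)
}
```

With two child partitions, Dependencies.childrenOf(Dependencies.parent, 2) shows that parent partition 0 feeds child partitions 0 and 1, which is exactly what makes the grouping a wide dependency; the map-style operation never moves a record across partitions.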

Fault tolerance

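Most RDD operations run in memory and a few go to disk. reduceByKey, for example, needs disk: for data safety the data is written to disk first and then read back into memory. So how does Spark handle wide and narrow dependencies? Each RDD records the parent RDDs it depends on, so when some partitions of an RDD are lost, they can be recovered quickly through parallel recomputation: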

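Narrow dependency: each partition is used by at most one child partition. Because there are no multiple dependencies, a partition can be processed in one pass on a single node, and if data is lost or corrupted it can be recovered quickly from the parent RDD.
Wide dependency: each partition can be used by multiple child partitions. Because of these multiple dependencies, the next step can start only after all the data arriving at a node has been processed, and if data is lost or corrupted, recovery is very expensive. The data from all upstream nodes must therefore be materialized (persisted) beforehand so that it can be recovered.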

Efficiency

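RDDs offer two features, persistence and partitioning, which users control through the persist and partitionBy functions. The partitioning of RDDs, together with parallel computation (SparkContext provides the parallelize function), lets Spark make better use of scalable hardware resources. Combining partitioning with persistence makes processing massive data even more efficient. For example: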

input

The partitionBy function takes a Partitioner object, for example:

val partitioner = new HashPartitioner(sc.defaultParallelism)
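How a hash partitioner picks a partition can be sketched in plain Scala. The HashPartitionerModel object below is a hypothetical stand-in that mirrors the idea behind HashPartitioner (the key's hashCode modulo the number of partitions, kept non-negative, with null keys sent to partition 0); it is a model for illustration, not the Spark class.

```scala
object HashPartitionerModel {
  // Non-negative remainder, so negative hash codes still map to a valid index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key: null goes to partition 0, everything else by hashCode.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k    => nonNegativeMod(k.hashCode, numPartitions)
  }
}
```

Because the index depends only on the key, records with equal keys always land in the same partition, which is what lets a later reduceByKey or join on co-partitioned data avoid a further shuffle.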

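As noted earlier, an RDD is essentially an in-memory dataset, and when it is accessed, the pointer refers only to the part relevant to the operation. For example, take a column-oriented data structure in which one column is implemented as an array of Int and another as an array of Float. If only the Int field is needed, the RDD's pointer can access just the Int array, which avoids scanning the whole data structure.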

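RDD operations fall into two classes: transformations and actions. No matter how many transformations are applied, the RDD performs no actual computation; computation is triggered only when an action runs (internally, RDDs are based on iterators). This makes data access more efficient and avoids the memory cost of large intermediate results.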
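The same lazy behaviour can be observed with an ordinary Scala Iterator, which is a reasonable mental model since RDDs are iterator-based internally. The LazyDemo object and its counter are hypothetical names used only for this sketch; no Spark is involved.

```scala
object LazyDemo {
  var calls = 0                        // counts how many times the map function runs

  val source = Iterator(1, 2, 3, 4)
  // "Transformation": describes the work, runs nothing yet.
  val doubled = source.map { x => calls += 1; x * 2 }
  val callsBeforeAction = calls        // still 0 at this point

  // "Action": pulling the results forces the computation.
  val result = doubled.toList
  val callsAfterAction = calls         // 4, one call per element
}
```

The map function is not called until toList consumes the iterator, just as a chain of RDD transformations does nothing until an action such as collect or count is invoked.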

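In the implementation, each transformation has a corresponding type that extends RDD; for example, map returns a MappedRDD and flatMap returns a FlatMappedRDD (class names from the early Spark versions this article describes). Calling map or flatMap merely passes the current RDD object to the corresponding RDD object. For example: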

def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this,sc.clean(f))

Each of these RDD subclasses defines a compute function. It is invoked when an action is called, and internally it performs the transformation through an iterator:

private[spark] class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U) extends RDD[U](prev) {
 override def getPartitions:Array[Partition] = firstParent[T].partitions
 override def compute(split:Partition, context: TaskContext) = firstParent[T].iterator(split, context).map(f)
}

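RDD is the core of Spark and the foundation of its whole architecture. It is an immutable data structure; it is a distributed data structure that spans a cluster; it can be partitioned by the key of its records; it offers coarse-grained operations, all of which are partition-aware; and it keeps data in memory, which gives low latency.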

3. Common RDD operations

Value-type Transformation operators

Transformation operators on Value-type data can be grouped by the relationship between the operator's input partitions and output partitions:

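1) One-to-one between input and output partitions: map, flatMap, mapPartitions, glom

2) Many-to-one between input and output partitions: union, cartesian

3) Many-to-many between input and output partitions: groupBy

4) Output partitions are a subset of the input partitions: filter, distinct, subtract, sample, takeSample

5) A special one-to-one type, the Cache type: the cache operator caches the partitions of an RDD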

map

Each element of the dataset is passed through a user-defined function, producing a new RDD; this new RDD is called a MappedRDD.

Example 1

val a = sc.parallelize(List("dog","cat","hippopotamus","sheep","pig"),3);
val b = a.map(_.length);
val c = a.zip(b); // zip combines two RDDs into an RDD of key/value pairs
c.collect

Result:

res3: Array[(String, Int)] = Array((dog,3), (cat,3), (hippopotamus,12), (sheep,5), (pig,3))

val a = sc.parallelize(List("dog","cat","hippopotamus","sheep","pig"),3);
val b = a.map(_.split(","));
b.collect;

Result:

res4: Array[Array[String]] = Array(Array(dog), Array(cat), Array(hippopotamus), Array(sheep), Array(pig))

flatMap

Similar to map, but each input element can be mapped to zero or more output items, and the results are flattened before being returned.

Example 1

val a = sc.parallelize(1 to 10, 5);
a.flatMap(1 to _).collect;

Result:

res7: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Example 2

sc.parallelize(List(1,2,3,2)).flatMap(x => List(x,x,x)).collect;

Result:

res7: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3, 2, 2, 2)

mapPartitions

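Similar to map. map applies its function to every element of every partition, whereas mapPartitions applies its function once per partition; the function has type Iterator[T] => Iterator[U]. With N elements in M partitions, the map function is called N times and the mapPartitions function M times. When the mapping would otherwise keep creating objects, mapPartitions can be much more efficient than map. For example, when writing to a database, map needs a connection object for every element, while mapPartitions needs only one connection object per partition.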
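The N-versus-M call count can be made concrete with a small plain-Scala model. The MapPartitionsModel object and its Connection class are hypothetical stand-ins for an expensive resource such as a database connection; partitions are modeled as nested Vectors rather than a real RDD.

```scala
object MapPartitionsModel {
  // Hypothetical stand-in for an expensive resource such as a database connection.
  var connectionsOpened = 0
  class Connection { connectionsOpened += 1; def write(x: Int): Int = x }

  // N = 6 elements spread over M = 3 partitions.
  val partitions: Vector[Vector[Int]] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6))

  // map-style: the function runs once per element, so it opens N connections.
  def perElement(): Int = {
    connectionsOpened = 0
    partitions.foreach(_.foreach(x => new Connection().write(x)))
    connectionsOpened
  }

  // mapPartitions-style: the function runs once per partition, so it opens M connections.
  def perPartition(): Int = {
    connectionsOpened = 0
    partitions.foreach { part =>
      val conn = new Connection()
      part.foreach(conn.write)
    }
    connectionsOpened
  }
}
```

Here perElement() opens six connections and perPartition() opens three, matching the N and M counts described above.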

Example

val name = List(("name","zhangsan"),("sex","man"),("address","pek"),("username","zhangsan"));
val rdd = sc.parallelize(name,2);
rdd.mapPartitions(x => x.filter(_._2 =="zhangsan"))
.foreachPartition(p=>{
 println(p.toList)
 println("===== partition separator =====")
})

Result:

List((name,zhangsan))

===== partition separator =====

List((username,zhangsan))

===== partition separator =====

glom

Turns the elements of type T in each partition of the RDD into an array Array[T].

Example

val number = sc.parallelize(1 to 100, 3);
number.glom.collect;

Result:

res12: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))

union

union merges the data of two RDDs and returns their union; elements that appear in both RDDs are not deduplicated.

Example

val a = sc.parallelize(1 to 3, 1);
val b = sc.parallelize(1 to 7,1);
(a ++ b).collect;

Result:

res15: Array[Int] = Array(1, 2, 3, 1, 2, 3, 4, 5, 6, 7)

cartesian

Computes the Cartesian product of all elements of two RDDs.

val x = sc.parallelize(List(1,2,3,4,5));
val y = sc.parallelize(List(6,7,8,9,10));
x.cartesian(y).collect;

Result:

res16: Array[(Int, Int)] = Array((1,6), (1,7), (2,6), (2,7), (1,8), (1,9), (1,10), (2,8), (2,9), (2,10), (3,6), (3,7), (4,6), (4,7), (5,6), (5,7), (3,8), (3,9), (3,10), (4,8), (4,9), (4,10), (5,8), (5,9), (5,10))

groupBy

Generates a key for each element with the given function and puts elements with the same key together.

Example

val a = sc.parallelize(1 to 9, 3);
a.groupBy(x => { if (x%2 ==0) "even" else "odd"}).collect;

Result:

res17: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9)))

filter

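Filters elements: the function f is applied to each element; elements for which it returns true are kept in the RDD and those for which it returns false are dropped.

Example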

val a = sc.parallelize(1 to 10 ,3);
val b = a.filter(_ % 2 ==0);
b.collect;

Result:

res18: Array[Int] = Array(2, 4, 6, 8, 10)

distinct

distinct removes duplicate elements.

Example:

val str = sc.parallelize(List("abc","adc","qwe","aaa","adc"),2);
str.distinct.collect;

Result:

res19: Array[String] = Array(adc, abc, qwe, aaa)

subtract

Returns the elements of this RDD that do not appear in the other RDD.

Example

val a = sc.parallelize(1 to 6, 3);
val b = sc.parallelize(1 to 3, 3);
val c = a.subtract(b);
c.collect;

Result:

res21: Array[Int] = Array(6, 4, 5)

sample

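Randomly samples roughly the given fraction of the data using the specified random seed. withReplacement indicates whether a sampled element is put back: true means sampling with replacement, false means sampling without replacement.

Example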

val a = sc.parallelize(1 to 10000,3);
a.sample(false ,0.1,0).count;

Result:

res22: Long = 1032

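takeSample

takeSample() works on the same principle as sample, but it samples a fixed number of elements instead of a fraction, and it returns not an RDD but the equivalent of calling collect() on the sampled data: a local array on the driver.

Example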

val x = sc.parallelize(1 to 1000,3);
x.takeSample(true,100,1);

Result:

res25: Array[Int] = Array(764, 815, 274, 452, 39, 538, 238, 544, 475, 480, 416, 868, 517, 363, 39, 316, 37, 90, 210, 202, 335, 773, 572, 243, 354, 305, 584, 820, 528, 749, 188, 366, 913, 667, 214, 540, 807, 738, 204, 968, 39, 863, 541, 703, 397, 489, 172, 29, 211, 542, 600, 977, 941, 923, 900, 485, 575, 650, 258, 31, 737, 155, 685, 562, 223, 675, 330, 864, 291, 536, 392, 108, 188, 408, 475, 565, 873, 504, 34, 343, 79, 493, 868, 974, 973, 110, 587, 457, 739, 745, 977, 800, 783, 59, 276, 987, 160, 351, 515, 901)

cache, persist

cache and persist both cache an RDD so that it does not have to be recomputed when it is used later, which can save a great deal of running time.

Example 1

val c = sc.parallelize(List("aaa","bbb","ccc","ddd","aaa"),2);
c.getStorageLevel;

Result:

org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)

Example 2

c.cache;
c.getStorageLevel;

Result:

res28: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)

Key-Value Transformation operators

mapValues

mapValues applies map to the value V of each [K,V] pair.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x)) // 3,"dog" 5,"tiger" ...
b.mapValues("x" + _ + "x").collect

Result:

res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx),(3,xcatx), (7,xpantherx), (5,xeaglex))

combineByKey

Combines the values of each key with user-supplied aggregation functions, turning an input of type RDD[(K,V)] into RDD[(K,C)].

Example

val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d.collect

Result:

res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
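The three functions passed to combineByKey can be traced with a plain-Scala model. The CombineByKeyModel object below is a hypothetical sketch that takes partitions as nested Seqs and merges them in partition order; real Spark merges partitions in whatever order the shuffle delivers them, so element order inside each list may differ.

```scala
object CombineByKeyModel {
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,          // first value seen for a key in a partition
      mergeValue: (C, V) => C,         // further values for that key in the same partition
      mergeCombiners: (C, C) => C      // combiners for the same key from different partitions
  ): Map[K, C] = {
    // Step 1: combine within each partition.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
      }
    }
    // Step 2: merge the per-partition combiners for each key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }
}
```

Feeding it the three partitions of the example above, Seq(1 -> "dog", 1 -> "cat", 2 -> "gnu"), Seq(2 -> "salmon", 2 -> "rabbit", 1 -> "turkey"), and Seq(2 -> "wolf", 2 -> "bear", 2 -> "bee"), with the same three functions yields List(cat, dog, turkey) for key 1 and List(gnu, rabbit, salmon, bee, bear, wolf) for key 2.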

reduceByKey

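For an RDD of key-value pairs, applies a binary function to reduce the values of elements that share a key. The values of all elements with the same key are thus reduced to a single value, which is paired with the key to form a new key-value pair.

Example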

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect

Result:

res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect

Result:

res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))

partitionBy

Repartitions a key-value RDD according to the given Partitioner.

cogroup

cogroup works on the key-value elements of two RDDs: for each key, the values from each RDD are gathered into a separate collection.

Example

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b")) // (1,b) (2,b) (1,b) (3,b)
val c = a.map((_, "c")) // (1,c) (2,c) (1,c) (3,c)
b.cogroup(c).collect // 1 -> ([b, b], [c, c]); 2 -> ([b], [c]); 3 -> ([b], [c])

Result:

res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

Example 2

val d = a.map((_, "d"))
b.cogroup(c, d).collect

Result:

res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))

)

Example 3

val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect

Result:

res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

join

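Joins two key-value RDDs. Internally it performs a cogroup on the two RDDs and then emits every pairing of values that share a key.

Example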

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect

Result:

res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
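How a join follows from a cogroup can be sketched on local collections. The JoinModel object is a hypothetical plain-Scala model: inputs are Seqs of pairs instead of RDDs, and no partitioning or shuffle is modeled.

```scala
object JoinModel {
  // cogroup: for every key in either input, collect the values from each side.
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftVals  = left.collect { case (`k`, a) => a }
      val rightVals = right.collect { case (`k`, b) => b }
      (k, (leftVals, rightVals))
    }.toMap
  }

  // Inner join: pair every left value with every right value of the same key.
  // Keys missing on one side contribute nothing because one of the Seqs is empty.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      for (a <- as; b <- bs) yield (k, (a, b))
    }
}
```

A key such as 8 (elephant) that exists only on the left side still appears in the cogroup result with an empty right-hand collection, but it produces no rows in the inner join; leftOuterJoin and rightOuterJoin differ only in how they treat those empty sides.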

leftOuterJoin

Performs a left outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work correctly.

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect

Result:
res1: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None)))

rightOuterJoin

Performs a right outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work correctly.

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.rightOuterJoin(d).collect

Result:
res2: Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))

Action operators

Action operators trigger SparkContext to submit a job.

foreach

Applies a function to every element; in this example it prints each one.

val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)
c.foreach(x => println(x + "s are yummy"))

Result:

lions are yummy

gnus are yummy

crocodiles are yummy

ants are yummy

whales are yummy

dolphins are yummy

spiders are yummy

saveAsTextFile

Saves the RDD as text files, for example on HDFS.

val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")

saveAsObjectFile

Serializes the elements of the RDD as objects and stores them in files. On HDFS, a SequenceFile is used by default.

Example

val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect

Result:

res52: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

collect

Collects the data of the RDD into an Array on the driver; use it only when the data volume is small.

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect

Result:

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

collectAsMap

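Returns a hash map containing all key-value pairs of the RDD; if a key is repeated, later elements overwrite earlier ones. (The zip function used in the example combines two RDDs into a key-value RDD.)

Example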

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collectAsMap

Result:

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)

reduceByKeyLocally

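Reduces the values of each key and returns the result directly to the driver as a Map, that is, a reduce followed by collectAsMap. The examples here use reduceByKey followed by collect, which yields the same key-value pairs as an Array; calling b.reduceByKeyLocally(_ + _) instead returns them as a Map.

Example 1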

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect

Result:

res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))

Example 2:

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect

Result:

res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))

lookup

Looks up all values for a given key; applies to key-value RDDs.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)

Result:
res0: Seq[String] = WrappedArray(tiger, eagle)

count

Returns the number of elements in the RDD.

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count

Result:

res2: Long = 4

top

Returns the K largest elements.

Example

val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)

Result:

res28: Array[Int] = Array(9, 8)

reduce

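Combines the elements of the RDD with a binary function, much like reduceLeft on a local collection. Because partitions are reduced in parallel, the function should be commutative and associative.

Example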

val a = sc.parallelize(1 to 100, 3)
a.reduce(_ + _)

Result:

res41: Int = 5050

fold

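fold() is similar to reduce(): it takes a function with the same signature as the one reduce takes, plus an initial value that seeds the first call. With addition as the function, the result is (number of partitions + 1) * (initial value) + (sum of the list values), because the initial value is applied once in every partition and once more when the partition results are merged.

Example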

val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)

Result:

res59: Int = 6
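The (partitions + 1) * (initial value) formula is easier to see with a non-zero initial value. The FoldModel object below is a hypothetical plain-Scala model of fold over explicit partitions, assuming List(1,2,3) is split into three single-element partitions as in the example above.

```scala
object FoldModel {
  // zero seeds the fold inside each partition, then seeds the final merge once more.
  def fold[T](partitions: Seq[Seq[T]])(zero: T)(op: (T, T) => T): T =
    partitions.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  val parts = Seq(Seq(1), Seq(2), Seq(3))   // List(1,2,3) spread over 3 partitions
  val withZero0 = fold(parts)(0)(_ + _)     // (3 + 1) * 0 + 6 = 6
  val withZero1 = fold(parts)(1)(_ + _)     // (3 + 1) * 1 + 6 = 10
}
```

With an initial value of 0 the seed is invisible, which is why the example above returns 6; with an initial value of 1 the seed is counted four times and the result becomes 10.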

aggregate

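First aggregates the elements within each partition with the first function, then combines the per-partition results with the second function, much like fold.

Example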


val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// lets first print out the contents of the RDD with partition labels

def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {

 iter.map(x => "[partID:" + index + ", val: " + x + "]")

}

z.mapPartitionsWithIndex(myfunc).collect

res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9

// This example returns 16 since the initial value is 5

// reduce of partition 0 will be max(5, 1, 2, 3) = 5

// reduce of partition 1 will be max(5, 4, 5, 6) = 6

// final reduce across partitions will be 5 + 5 + 6 = 16

// note the final reduce include the initial value

z.aggregate(5)(math.max(_, _), _ + _)

res29: Int = 16

val z = sc.parallelize(List("a","b","c","d","e","f"),2)

//lets first print out the contents of the RDD with partition labels

def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {

 iter.map(x => "[partID:" + index + ", val: " + x + "]")

}

z.mapPartitionsWithIndex(myfunc).collect

res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

// See here how the initial value "x" is applied three times.

// - once for each partition

// - once when combining all the partitions in the second reduce function.

z.aggregate("x")(_ + _, _+_)

res116: String = xxdefxabc

// Below are some more advanced examples. Some are quite tricky to work out.

val z = sc.parallelize(List("12","23","345","4567"),2)

z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)

res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

res143: String = 10
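The results above can be checked by hand with a plain-Scala model of aggregate. The AggregateModel object is a hypothetical sketch that takes the partitions explicitly and merges them in partition order; Spark merges partition results in completion order, which is why the string examples above can come out as 42 where this model gives 24.

```scala
object AggregateModel {
  // seqOp folds the elements inside each partition starting from zero;
  // combOp then merges the partition results, again starting from zero.
  def aggregate[T, U](partitions: Seq[Seq[T]])(zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  val ints = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
  val r0 = aggregate(ints)(0)(math.max(_, _), _ + _)   // 0 + max(0,1,2,3) + max(0,4,5,6) = 9
  val r5 = aggregate(ints)(5)(math.max(_, _), _ + _)   // 5 + max(5,1,2,3) + max(5,4,5,6) = 16

  val strs = Seq(Seq("12", "23"), Seq("345", "4567"))
  // Each partition yields the string form of a length: "2" and "4".
  val rs = aggregate(strs)("")((x, y) => math.max(x.length, y.length).toString, _ + _)
}
```

Because combOp here is string concatenation, which is not commutative, the final string depends on the merge order; sums and maxima do not have that problem.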

The END.
