Spark Operators in Practice

     This article follows the posts of "lxw的大数据田地" (lxw's big data field). Some of the examples and code felt a bit obscure when I worked through them, so I have added my own understanding. Many thanks to the author, whose big data series has helped me a great deal; readers who want to learn big data can study it on his site.

The original article is at: http://lxw1234.com/archives/2015/07/363.htm

Operators for creating RDDs:

Creating an RDD from a collection

parallelize

scala> var rdd=sc.parallelize(1 to 10)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[86] at parallelize at <console>:27

 

scala> rdd.collect

res127: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

 

scala> rdd.partitions.size

res128: Int = 4

 

scala> var rdd=sc.parallelize(1 to 10,6)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[87] at parallelize at <console>:27

 

scala> rdd.partitions.size

res129: Int = 6

 

makeRDD

scala> var collect=Seq((1 to 10,Seq("spark","hbase")),(11 to 15,Seq("hdfs","impala")))

collect: Seq[(scala.collection.immutable.Range.Inclusive, Seq[String])] = List((Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),List(spark, hbase)), (Range(11, 12, 13, 14, 15),List(hdfs, impala)))

 

scala> var rdd=sc.makeRDD(collect)

rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[88] at makeRDD at <console>:29

 

scala> rdd.partitions.size

res130: Int = 2

 

scala> rdd.preferredLocations(rdd.partitions(0))

res134: Seq[String] = List(spark, hbase)

 

scala> rdd.preferredLocations(rdd.partitions(1))

res135: Seq[String] = List(hdfs, impala)

This form of makeRDD lets you specify preferred locations for each partition, which helps with later scheduling and tuning.

 

Creating an RDD from external storage

textFile

scala> var rdd1=sc.textFile("/tmp/lw/1.txt")

rdd1: org.apache.spark.rdd.RDD[String] = /tmp/lw/1.txt MapPartitionsRDD[90] at textFile at <console>:27

 

scala> rdd1.count

res0: Long = 4

Note: if a local file path is used, the file must exist on both the Driver and the Executor nodes, for example as in the sketch below.
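A minimal sketch of the two cases (the paths are hypothetical): sc.textFile accepts HDFS URIs as well as local file:// URIs.

// Hypothetical paths, for illustration only
var hdfsRdd  = sc.textFile("hdfs:///tmp/lw/1.txt")    // read from HDFS
var localRdd = sc.textFile("file:///tmp/lw/1.txt")    // local path: the file must exist on the Driver and on every Executor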

Creating an RDD by reading from HBase

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}

 

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat

import org.apache.hadoop.hbase.mapreduce.TableInputFormat

 

scala> import org.apache.hadoop.hbase.client.HBaseAdmin

import org.apache.hadoop.hbase.client.HBaseAdmin

 

scala> val conf = HBaseConfiguration.create()

 

scala> conf.set(TableInputFormat.INPUT_TABLE,"lw")

 

scala> var hbaseRDD = sc.newAPIHadoopRDD(conf,

classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],

classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],

classOf[org.apache.hadoop.hbase.client.Result])

 

scala> hbaseRDD.count

res52: Long = 1
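Each record of hbaseRDD is a (ImmutableBytesWritable, Result) pair. A minimal sketch of pulling one column out of each Result (the column family "f1" and qualifier "c1" are hypothetical placeholders):

import org.apache.hadoop.hbase.util.Bytes

// Extract a single column value from every row; replace "f1"/"c1" with a real family/qualifier
hbaseRDD.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("f1"), Bytes.toBytes("c1")))
}.collect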

Transformation operators

map, flatMap, distinct

map

scala> var res=rdd1.map(_.split("\\s"))

res: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:29

 

scala> res.collect

res2: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive), Array(""))

 

flatMap

 

scala> var flatMapResult=rdd1.flatMap(_.split("\\s"))

flatMapResult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:29

 

scala> flatMapResult.collect

res4: Array[String] = Array(hello, world, hello, spark, hello, hive, "")

 

distinct

scala> rdd1.map(_.toUpperCase).collect

res6: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, "")

 

scala> rdd1.flatMap(_.toUpperCase).collect

res7: Array[Char] = Array(H, E, L, L, O,  , W, O, R, L, D, H, E, L, L, O,  , S, P, A, R, K, H, E, L, L, O,  , H, I, V, E)

 

scala> rdd1.flatMap(_.toUpperCase).distinct.collect

res8: Array[Char] = Array(L, R, P,  , H, V, D, O, A, I, K, S, E, W)

 

Repartitioning operators: coalesce and repartition. Note that coalesce can only reduce the number of partitions unless its shuffle argument is set to true; repartition always shuffles.

scala> rdd1.partitions.size

res9: Int = 2

 

scala> rdd1.coalesce(4).partitions.size

res10: Int = 2

 

scala> rdd1.coalesce(1).partitions.size

res11: Int = 1

 

scala> rdd1.coalesce(4,true).partitions.size

res12: Int = 4

 

scala> rdd1.partitions.size

res13: Int = 2

 

scala> rdd1.repartition(3).partitions.size

res14: Int = 3

 

RDD splitting operators: randomSplit and glom

randomSplit splits one RDD into several RDDs according to the given weights. The weights parameter is an array of Doubles; the second parameter is the random seed and can usually be ignored.

scala> var randomSplitResult=rdd1.randomSplit(Array(1.0,2.0,3.0,4.0))

randomSplitResult: Array[org.apache.spark.rdd.RDD[String]] = Array(MapPartitionsRDD[22] at randomSplit at <console>:29,

         MapPartitionsRDD[23] at randomSplit at <console>:29,

         MapPartitionsRDD[24] at randomSplit at <console>:29,

         MapPartitionsRDD[25] at randomSplit at <console>:29)

randomSplit returns an array of RDDs.

scala> randomSplitResult(0).collect

res16: Array[String] = Array()

 

scala> randomSplitResult(3).collect

res17: Array[String] = Array(hello world, hello spark, hello hive, "")

 

scala> randomSplitResult(2).collect

res18: Array[String] = Array()

scala> randomSplitResult.size

res21: Int = 4

glom

def glom(): RDD[Array[T]]

This function turns the type-T elements of each partition of the RDD into an Array[T], so that each partition becomes a single array element.

 

scala> var glomResult=rdd1.glom

glomResult: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[26] at glom at <console>:29

 

scala> glomResult.collect

res22: Array[Array[String]] = Array(Array(hello world, hello spark), Array(hello hive, ""))

 

scala> rdd1.collect

res23: Array[String] = Array(hello world, hello spark, hello hive, "")

glom puts the elements of each partition into one array, so here the result becomes 2 arrays, one per partition.

 

union

def union(other: RDD[T]): RDD[T]

This function is straightforward: it merges two RDDs without removing duplicates.

scala> var rdd2=sc.makeRDD(1 to 3,3)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at makeRDD at <console>:27

 

scala> var rdd3=sc.makeRDD(3 to 5,3)

rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at makeRDD at <console>:27

 

scala> rdd2.union(rdd3).collect

res24: Array[Int] = Array(1, 2, 3, 3, 4, 5)

 

 

intersection

def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

This function returns the intersection of the two RDDs, with duplicates removed. The numPartitions parameter specifies the number of partitions of the resulting RDD; the partitioner parameter specifies the partitioning function.

scala> var rdd2=sc.makeRDD(1 to 3,3)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at makeRDD at <console>:27

 

scala> var rdd3=sc.makeRDD(3 to 5,3)

rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at <console>:27

 

scala> rdd2.intersection(rdd3).collect

res26: Array[Int] = Array(3)

 

scala> rdd2.intersection(rdd3).partitions.size

res1: Int = 3

 

subtract

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

This function is similar to intersection, but returns the elements that appear in this RDD and not in the other RDD, without removing duplicates. The parameters have the same meaning as for intersection.

scala> rdd2.subtract(rdd3).collect

res3: Array[Int] = Array(1, 2)

 

scala> rdd3.subtract(rdd2).collect

res4: Array[Int] = Array(4, 5)

 

mapPartitions

def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

This function is similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of individual elements. If the mapping has to create expensive extra objects, mapPartitions is much more efficient than map. For example, when writing all the data in an RDD to a database over JDBC, using map might create one connection per element, which is very costly; with mapPartitions only one connection per partition is needed (see the sketch below). The preservesPartitioning parameter indicates whether the partitioner information of the parent RDD is preserved.
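A rough sketch of that JDBC pattern (the connection URL and table are hypothetical, the RDD is assumed to be an RDD[Int] named rdd, and a suitable JDBC driver must be on the classpath), opening one connection per partition rather than per element:

import java.sql.DriverManager

// One JDBC connection per partition; each partition yields the number of rows it wrote
var written = rdd.mapPartitions { iter =>
  val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/test", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO kv(v) VALUES (?)")
  val rows = iter.map { v => stmt.setInt(1, v); stmt.executeUpdate() }.sum
  stmt.close()
  conn.close()
  Iterator(rows)
}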

scala> var rdd4=sc.makeRDD(1 to 5,2)

rdd4: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at makeRDD at <console>:27

 

scala> var rdd5=rdd4.mapPartitions{x=>{

     | var result=List[Int]()

     | var i=0

     | while(x.hasNext){

     | i+=x.next()}

     | result.::(i).iterator}}

rdd5: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at mapPartitions at <console>:29

 

scala> rdd5.collect

res5: Array[Int] = Array(3, 12)

 

scala> rdd5.partitions.size

res6: Int = 2

 

mapPartitionsWithIndex

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

This works like mapPartitions, except that the function receives two parameters, the first of which is the partition index.

scala> var rdd6=rdd4.mapPartitionsWithIndex{(x,iter)=>{

     | var result=List[String]()

     | var i=0

     | while(iter.hasNext){

     | i+=iter.next()

     | }

     | result.::(x+"|"+i).iterator}}

rdd6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at mapPartitionsWithIndex at <console>:29

 

scala> rdd6.collect

res7: Array[String] = Array(0|3, 1|12)

 

zip

def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]

zip combines two RDDs into an RDD of key/value pairs. It requires that the two RDDs have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.

scala> rdd2.partitions.size

res14: Int = 3

 

scala> var rdd7=sc.makeRDD(Seq("A","B","C"),3)

rdd7: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at makeRDD at <console>:27

 

scala> rdd2.zip(rdd7).collect

res16: Array[(Int, String)] = Array((1,A), (2,B), (3,C))

 

zipPartitions

zipPartitions combines several RDDs into a new RDD partition by partition. The RDDs being combined must have the same number of partitions, but there is no requirement on the number of elements within each partition. The function has several overloads, which fall into three groups:

Variants taking one other RDD

def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

The only difference between the two is the preservesPartitioning parameter, i.e. whether the partitioner information of the parent RDD is preserved. The mapping function f receives the iterators of the two RDDs.

scala> var rdd1=sc.makeRDD(1 to 5,3)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at makeRDD at <console>:27

 

scala> var rdd2=sc.makeRDD(Seq("A","B","C","D","E"),3)

rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at makeRDD at <console>:27

 

scala> rdd1.mapPartitionsWithIndex{(x,iter)=>{

     | var result=List[String]()

     | while(iter.hasNext){

     | result::=("part_"+x+"|"+iter.next())}

     | result.iterator}

     | }.collect

res17: Array[String] = Array(part_0|1, part_1|3, part_1|2, part_2|5, part_2|4)

 

scala> rdd2.mapPartitionsWithIndex{(x,iter)=>{

     | var result=List[String]()

     | while(iter.hasNext){

     | result::=("part_"+x+"|"+iter.next())}

     | result.iterator}

     | }.collect

res18: Array[String] = Array(part_0|A, part_1|C, part_1|B, part_2|E, part_2|D)

 

scala> rdd1.zipPartitions(rdd2){(rdd1Iter,rdd2Iter)=>{

     | var result=List[String]()

     | while(rdd1Iter.hasNext&&rdd2Iter.hasNext){

     | result::=(rdd1Iter.next()+"_"+rdd2Iter.next())}

     | result.iterator}

     | }.collect

res19: Array[String] = Array(1_A, 3_C, 2_B, 5_E, 4_D)

 

Variants taking two other RDDs

def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]

def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]

Usage is the same as above, except that these variants take two other RDDs, so the mapping function f receives three iterators.

scala> rdd1

res20: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at makeRDD at <console>:27

 

scala> rdd2

res21: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at makeRDD at <console>:27
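The rdd3 used in the next example is not defined in this excerpt; judging from its partition dump below, it was presumably created along these lines (an assumption):

// Assumed definition, inferred from the output below
var rdd3 = sc.makeRDD(Seq("a","b","c","d","e"),3)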

 

scala> rdd3.mapPartitionsWithIndex{(x,iter)=>{

     | var result=List[String]()

     | while(iter.hasNext){

     | result::=("part_"+x+"|"+iter.next())}

     | result.iterator}

     | }.collect

res22: Array[String] = Array(part_0|a, part_1|c, part_1|b, part_2|e, part_2|d)

 

scala> var rdd4=rdd1.zipPartitions(rdd2,rdd3){(rdd1Iter,rdd2Iter,rdd3Iter)=>{

     | var result=List[String]()

     | while(rdd1Iter.hasNext&&rdd2Iter.hasNext&&rdd3Iter.hasNext){

     | result::=(rdd1Iter.next()+"_"+rdd2Iter.next()+"_"+rdd3Iter.next())}

     | result.iterator}

     | }

rdd4: org.apache.spark.rdd.RDD[String] = ZippedPartitionsRDD3[38] at zipPartitions at <console>:33

 

scala> rdd4.collect

res23: Array[String] = Array(1_A_a, 3_C_c, 2_B_b, 5_E_e, 4_D_d)

Variants taking three other RDDs

def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]

def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]

Usage is the same as above, only with one more RDD.

 

zipWithIndex

def zipWithIndex(): RDD[(T, Long)]

This function pairs each element of the RDD with its index (position) in the RDD, producing key/value pairs.

scala> rdd2.zipWithIndex().collect

res1: Array[(String, Long)] = Array((A,0), (B,1),(C,2), (D,3), (E,4))

zipWithUniqueId

def zipWithUniqueId(): RDD[(T, Long)]

This function pairs each element of the RDD with a unique ID, producing key/value pairs. The unique ID is generated as follows:

the unique ID of the first element of each partition is the partition index;

the unique ID of the Nth element of each partition is (unique ID of the previous element) + (total number of partitions of the RDD).

scala> rdd2.zipWithUniqueId().collect

res2: Array[(String, Long)] = Array((A,0), (B,1), (C,4), (D,2), (E,5))
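To see how these IDs come about: with 3 partitions here, partition 0 holds A, partition 1 holds B and C, and partition 2 holds D and E (matching the mapPartitionsWithIndex dump of rdd2 shown earlier), so A→0, B→1, C→1+3=4, D→2, E→2+3=5.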

 

Spark operators: listing the elements and element counts of each RDD partition

Spark RDDs are partitioned. When creating an RDD you can usually specify the number of partitions. If you don't, an RDD created from a collection defaults to the number of CPU cores allocated to the application, and an RDD created from an HDFS file defaults to the number of Blocks in the file.

You can use the RDD's mapPartitionsWithIndex method to list the elements and element count of each partition.
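The rdd3 used below is not defined in this excerpt; judging from the output (the numbers 1 to 50 spread over 15 partitions), it was presumably created roughly as:

// Assumed definition, inferred from the 15-partition output below
var rdd3 = sc.makeRDD(1 to 50, 15)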

 

// Count the number of elements in each partition of rdd3

scala> rdd3.mapPartitionsWithIndex{(partIdx,iter)=>{

     | var part_map=scala.collection.mutable.Map[String,Int]()  // map holding the partition name and its element count

     | while(iter.hasNext){  // iterate over the partition's elements

     | var part_name="part_"+partIdx  // build the map key from the partition index

     | if(part_map.contains(part_name)){  // does the map already contain this key?

     | var ele_cnt=part_map(part_name)  // read the current count into a temporary variable

     | part_map(part_name)=ele_cnt+1}  // end if: increment the element count for this partition

     | else{

     | part_map(part_name)=1}  // end else: first element seen for this key, so the count is 1

     | iter.next()}  // end while: advance the iterator

     | part_map.iterator}  // end of the inner function

     | }.collect  // end of mapPartitionsWithIndex, then collect

res5: Array[(String, Int)] = Array((part_0,3), (part_1,3), (part_2,4), (part_3,3), (part_4,3), (part_5,4), (part_6,3), (part_7,3), (part_8,4), (part_9,3), (part_10,3), (part_11,4), (part_12,3), (part_13,3), (part_14,4))

// part_0 through part_14: the number of elements in each partition

 

// List the elements contained in each partition of rdd3

scala> rdd3.mapPartitionsWithIndex{(partIdx,iter)=>{

     | var part_map=scala.collection.mutable.Map[String,List[Int]]()  // map holding the partition name and the list of its elements

     | while(iter.hasNext){  // while the current partition still has elements

     | var part_name="part_"+partIdx  // build the map key from the partition index

     | var elem=iter.next()  // take the next element of the current partition

     | if(part_map.contains(part_name)){  // does the map already contain this key?

     | var elems=part_map(part_name)  // read the element list for this key

     | elems::=elem  // prepend the current element to the list

     | part_map(part_name)=elems}else{  // store the list back as the value

     | part_map(part_name)=List[Int]{elem}}}  // first element of this partition: start a new list

     | part_map.iterator}}.collect

res6: Array[(String, List[Int])] = Array((part_0,List(3, 2, 1)), (part_1,List(6, 5, 4)), (part_2,List(10, 9, 8, 7)), (part_3,List(13, 12, 11)), (part_4,List(16, 15, 14)), (part_5,List(20, 19, 18, 17)), (part_6,List(23, 22, 21)), (part_7,List(26, 25, 24)), (part_8,List(30, 29, 28, 27)), (part_9,List(33, 32, 31)), (part_10,List(36, 35, 34)), (part_11,List(40, 39, 38, 37)), (part_12,List(43, 42, 41)), (part_13,List(46, 45, 44)), (part_14,List(50, 49, 48, 47)))

// part_0 through part_14: the elements contained in each partition

 

partitionBy, mapValues, flatMapValues

partitionBy

def partitionBy(partitioner: Partitioner): RDD[(K, V)]

This function uses the given partitioner to produce a new ShuffledRDD, repartitioning the original RDD.

scala> var rdd4=sc.makeRDD(

     | Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)

rdd4: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[8] at makeRDD at <console>:24

 

scala> rdd4.mapPartitionsWithIndex{(partIdx,iter)=>{

     | var part_map=scala.collection.mutable.Map[String,List[(Int,String)]]()

     | while(iter.hasNext){var part_name="part_"+partIdx

     | var elem=iter.next()

     | if(part_map.contains(part_name)){

     | var elems=part_map(part_name)

     | elems::=elem

     | part_map(part_name)=elems}else{

     | part_map(part_name)=List[(Int,String)]{elem}}

     | }

     | part_map.iterator

     | }}.collect

res8: Array[(String, List[(Int, String)])] = Array((part_0,List((2,B), (1,A))), (part_1,List((4,D), (3,C))))

// (2,B) and (1,A) are in part_0; (4,D) and (3,C) are in part_1

// Repartition using partitionBy

scala> var rdd5=rdd4.partitionBy(new org.apache.spark.HashPartitioner(2))

rdd5: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[10] at partitionBy at <console>:26

 

scala> rdd5.mapPartitionsWithIndex{(partIdx,iter)=>{

     | var part_map=scala.collection.mutable.Map[String,List[(Int,String)]]()

     | while(iter.hasNext){var part_name="part_"+partIdx

     | var elem=iter.next()

     | if(part_map.contains(part_name)){

     | var elems=part_map(part_name)

     | elems::=elem

     | part_map(part_name)=elems}else{

     | part_map(part_name)=List[(Int,String)]{elem}}

     | }

     | part_map.iterator}}.collect

 

res9: Array[(String, List[(Int, String)])] = Array((part_0,List((4,D), (2,B))), (part_1,List((1,A), (3,C))))

// (4,D) and (2,B) are now in part_0; (1,A) and (3,C) are in part_1

 

mapValues

def mapValues[U](f: (V) => U): RDD[(K, U)]

Like the basic map transformation, except that mapValues applies the function only to the V part of each [K,V] pair.

scala> rdd4.collect

res10: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D))

 

scala> rdd4.mapValues(x=>x+"_").collect

res11: Array[(Int, String)] = Array((1,A_), (2,B_), (3,C_), (4,D_))

 

flatMapValues

def flatMapValues[U](f: (V) => TraversableOnce[U]): RDD[(K, U)]

Like the basic flatMap transformation, except that flatMapValues applies flatMap only to the V part of each [K,V] pair.

scala> rdd4.flatMapValues(x=>x+"_").collect

res13: Array[(Int, Char)] = Array((1,A), (1,_), (2,B), (2,_), (3,C), (3,_), (4,D), (4,_))

 

combineByKey and foldByKey

combineByKey

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]

This function converts an RDD[K,V] into an RDD[K,C]; the types V and C may be the same or different.

Its parameters:

createCombiner: the combiner-creation function, which turns a V into a C; its input is a V from RDD[K,V] and its output is a C.

mergeValue: the value-merging function, which merges a C and a V into a C; input (C,V), output C.

mergeCombiners: the combiner-merging function, which merges two C values into one C; input (C,C), output C.

numPartitions: the number of partitions of the result RDD; by default the original number of partitions is kept.

partitioner: the partitioning function; HashPartitioner by default.

mapSideCombine: whether to combine on the map side, similar to the combiner in MapReduce; true by default.

scala> var rdd6=sc.makeRDD(Array(("A",1),("A",3),("C",1),("C",3),("C",1)))

rdd6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[14] at makeRDD at <console>:24

 

scala> rdd6.combineByKey(

     | (v:Int)=>v+"_",

     | (c:String,v:Int)=>c+"@"+v,

     | (c1:String,c2:String)=>c1+"$"+c2).collect

res14: Array[(String, String)] = Array((A,3_$1_), (C,1_$3_$1_))

The three mapping functions here are:

createCombiner: (V) => C

(v : Int) => v + "_"  // append the character "_" to each V, returning a C (String)

mergeValue: (C, V) => C

(c : String, v : Int) => c + "@" + v  // merge a C and a V with the character "@" in between, returning a C (String)

mergeCombiners: (C, C) => C

(c1 : String, c2 : String) => c1 + "$" + c2  // merge two C values with "$" in between, returning a C (String)

The other parameters keep their default values.

In the end, the RDD[String,Int] is converted into an RDD[String,String].

scala> rdd6.combineByKey(

     | (v:Int)=>List(v),

     | (c:List[Int],v:Int)=>v::c,

     | (c1:List[Int],c2:List[Int])=>c1:::c2).collect

res16: Array[(String, List[Int])] = Array((A,List(3, 1)), (C,List(1, 3, 1)))

 

This converts the RDD[String,Int] into an RDD[String,List[Int]].

 

foldByKey

def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]

def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]

def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]

This function folds/merges the V values of an RDD[K,V] by key. The zeroValue parameter means that the folding function is first applied to zeroValue and each V (initializing the V values), and then the folding function is applied across the initialized V values.

scala> var rdd7=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))

rdd7: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[17] at makeRDD at <console>:24

 

scala> rdd7.foldByKey(0)(_+_).collect

res17: Array[(String, Int)] = Array((A,2), (B,3), (C,1))

// Sum the V values of each key in rdd7. Note zeroValue=0: each V is first initialized with the folding function (+); e.g. for ("A",0),("A",2), applying zeroValue gives ("A",0+0),("A",2+0), i.e. ("A",0),("A",2); then the folding function is applied across the initialized values, yielding (A,0+2), i.e. (A,2).

 

scala> rdd7.foldByKey(2)(_+_).collect

res18: Array[(String, Int)] = Array((A,6), (B,7), (C,3))

// First zeroValue=2 is applied to each V, giving ("A",0+2),("A",2+2), i.e. ("A",2),("A",4); then the folding function is applied across the initialized values, yielding (A,2+4), i.e. (A,6).

 

scala> rdd7.foldByKey(0)(_*_).collect

res19: Array[(String, Int)] = Array((A,0), (B,0), (C,0))

// First zeroValue=0 is applied to each V; note that this time the folding function is multiplication, giving ("A",0*0),("A",2*0), i.e. ("A",0),("A",0); applying the folding function then yields (A,0*0), i.e. (A,0). The same happens for the other keys, so every V ends up as 0.

 

scala> rdd7.foldByKey(1)(_*_).collect

res78: Array[(String, Int)] = Array((A,0), (B,2), (C,1))

// When the folding function is multiplication, zeroValue must be set to 1 to get the expected result.

 

groupByKey, reduceByKey, reduceByKeyLocally

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

This function merges the V values of each key K in an RDD[K,V] into a single collection Iterable[V].

The numPartitions parameter specifies the number of partitions; the partitioner parameter specifies the partitioning function.

 

scala> var rdd1=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[22] at makeRDD at <console>:24

 

scala> rdd1.groupByKey().collect

res21: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(2, 0)), (B,CompactBuffer(1, 2)), (C,CompactBuffer(1)))

 

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

 

This function combines the V values of each key K in an RDD[K,V] using the given function.

The numPartitions parameter specifies the number of partitions;

the partitioner parameter specifies the partitioning function.

 

scala> rdd1.partitions.size

res22: Int = 5

 

scala> var rdd2=rdd1.reduceByKey((x,y)=>x+y)

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[24] at reduceByKey at <console>:26

 

scala> rdd2.collect

res23: Array[(String, Int)] = Array((A,2), (B,3), (C,1))

 

scala> rdd2.partitions.size

res24: Int = 5

 

scala> var rdd2=rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y)=>x+y)

 

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at reduceByKey at <console>:26

 

scala> rdd2.collect

res25: Array[(String, Int)] = Array((B,3), (A,2), (C,1))

 

scala> rdd2.partitions.size

res26: Int = 2

 

reduceByKeyLocally

def reduceByKeyLocally(func: (V, V) => V): Map[K,V]

This function combines the V values of each key K in an RDD[K,V] using the given function, but the result is returned as a Map[K,V] rather than an RDD[K,V].

 

scala> rdd1.reduceByKeyLocally((x,y)=>x+y)

res27: scala.collection.Map[String,Int] = Map(A -> 2, B -> 3, C -> 1)

 

scala> rdd1.collect

res28: Array[(String, Int)] = Array((A,0), (A,2), (B,1), (B,2), (C,1))

 

cogroup and join

cogroup

## Variants taking one other RDD

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]

## Variants taking two other RDDs

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

## Variants taking three other RDDs

def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

cogroup is like a full outer join in SQL: it returns records from both RDDs, with empty collections for keys that have no match on one side.

The numPartitions parameter specifies the number of partitions of the result.

The partitioner parameter specifies the partitioning function.

## Example with one other RDD

scala> var rdd1=sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)

rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24

 

scala> var rdd2=sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)

rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[28] at makeRDD at <console>:24

 

scala> var rdd3=rdd1.cogroup(rdd2).collect

rdd3: Array[(String, (Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d))), (A,(CompactBuffer(1),CompactBuffer(a))), (C,(CompactBuffer(3),CompactBuffer(c))))

 

scala> rdd1.cogroup(rdd2).partitions.size

res30: Int = 2

 

scala> var rdd4=rdd1.cogroup(rdd2,3)

rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[38] at cogroup at <console>:28

 

scala> rdd4.partitions.size

res31: Int = 3

 

scala> rdd4.collect

res33: Array[(String, (Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer())), (C,(CompactBuffer(3),CompactBuffer(c))), (A,(CompactBuffer(1),CompactBuffer(a))), (D,(CompactBuffer(),CompactBuffer(d))))

 

scala> rdd1.partitions.size

res34: Int = 2

 

scala> rdd2.partitions.size

res35: Int = 2
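The rdd3 defined above is a collected Array, not an RDD, so for the three-RDD cogroup below it must have been redefined as a pair RDD. Judging from the result, presumably something like:

// Assumed redefinition, inferred from the cogroup output below
var rdd3 = sc.makeRDD(Array(("A","A"),("E","E")),2)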

 

scala> var rdd4=rdd1.cogroup(rdd2,rdd3)

rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String], Iterable[String]))] = MapPartitionsRDD[41] at cogroup at <console>:30

 

scala> rdd4.partitions.size

res36: Int = 2

 

scala> rdd4.collect

res37: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())), (A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))), (C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())), (E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))

 

join

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

join is like an inner join in SQL: it returns only the pairs whose key K exists in both RDDs. join works on exactly two RDDs; to join more RDDs, simply join repeatedly.

The numPartitions parameter specifies the number of partitions of the result.

The partitioner parameter specifies the partitioning function.

scala> rdd1.join(rdd2).collect

res38: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))

 

leftOuterJoin, rightOuterJoin, subtractByKey

leftOuterJoin

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]

leftOuterJoin is like a left outer join in SQL: the result is driven by the left-hand RDD, with None for keys that have no match. It works on exactly two RDDs; to join more RDDs, join repeatedly.

The numPartitions parameter specifies the number of partitions of the result.

The partitioner parameter specifies the partitioning function.

scala> rdd1.collect

res43: Array[(String, String)] = Array((A,1), (B,2), (C,3))

 

scala> rdd2.collect

res44: Array[(String, String)] = Array((A,a), (C,c), (D,d))

 

scala> rdd1.leftOuterJoin(rdd2).collect

res45: Array[(String, (String, Option[String]))] = Array((B,(2,None)), (A,(1,Some(a))), (C,(3,Some(c))))

 

rightOuterJoin

def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]

def rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], W))]

def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], W))]

rightOuterJoin is like a right outer join in SQL: the result is driven by the RDD passed as the argument, with None for keys that have no match. It works on exactly two RDDs; to join more RDDs, join repeatedly.

The numPartitions parameter specifies the number of partitions of the result.

The partitioner parameter specifies the partitioning function.

scala> rdd1.rightOuterJoin(rdd2).collect

res46: Array[(String, (Option[String], String))] = Array((D,(None,d)), (A,(Some(1),a)), (C,(Some(3),c)))

 

subtractByKey

def subtractByKey[W](other: RDD[(K, W)])(implicit arg0: ClassTag[W]): RDD[(K, V)]

def subtractByKey[W](other: RDD[(K, W)], numPartitions: Int)(implicit arg0: ClassTag[W]): RDD[(K, V)]

def subtractByKey[W](other: RDD[(K, W)], p: Partitioner)(implicit arg0: ClassTag[W]): RDD[(K, V)]

subtractByKey is similar to the basic subtract transformation, except that it works on the keys: it returns the elements whose key appears in this RDD and not in the other RDD.

The numPartitions parameter specifies the number of partitions of the result.

The partitioner parameter specifies the partitioning function.

scala> rdd1.subtractByKey(rdd2).collect

res47: Array[(String, String)] = Array((B,2))

 

first, count, reduce, collect

 

scala> rdd1.collect

res48: Array[(String, String)] = Array((A,1), (B,2), (C,3))

 

scala> rdd1.first

res49: (String, String) = (A,1)

 

scala> var rdd3=sc.makeRDD(Seq(10,4,23,33,2))

rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at makeRDD at <console>:24

 

scala> rdd3.first

res50: Int = 10

count

def count(): Long

count returns the number of elements in the RDD.

scala> rdd3.count

res51: Long = 5

 

reduce

def reduce(f: (T, T) => T): T

Performs a binary computation over the elements of the RDD using the function f and returns the result.

 

scala> rdd3.reduce(_+_)

res52: Int = 72

 

scala> var rdd4=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))

rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[56] at makeRDD at <console>:24

 

scala> rdd4.reduce((x,y)=>{

     | (x._1+y._1,x._2+y._2)})

res53: (String, Int) = (BACBA,6)

 

collect

def collect(): Array[T]

collect gathers the RDD into an array on the driver.

 

scala> rdd3.collect

res54: Array[Int] = Array(10, 4, 23, 33, 2)

 

take, top, takeOrdered

take

def take(num: Int): Array[T]

take returns the elements of the RDD with indices 0 through num-1, without sorting.

 

scala> rdd3.take(2)

res55: Array[Int] = Array(10, 4)

 

top

def top(num: Int)(implicit ord: Ordering[T]): Array[T]

top returns the first num elements of the RDD according to the default ordering (descending) or a specified ordering.

 

scala> rdd3.top(3)

res56: Array[Int] = Array(33, 23, 10)

// Specify a custom ordering

scala> implicit val myOrder=implicitly[Ordering[Int]].reverse

myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@58ac9ec9

 

scala> rdd3.top(3)

res57: Array[Int] = Array(2, 4, 10)

 

takeOrdered

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

takeOrdered is similar to top, but returns elements in the opposite order. (Note that the reversed implicit ordering defined above is still in scope, which is why takeOrdered(3) returns the largest elements here.)

scala> rdd3.takeOrdered(3)

res60: Array[Int] = Array(33, 23, 10)

 

aggregate, fold, lookup

aggregate

def aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)(implicit arg0: ClassTag[U]): U

aggregate aggregates the elements of an RDD: it first uses seqOp to aggregate the T-typed elements within each partition into a value of type U, and then uses combOp to aggregate the per-partition U values into a single U. Note that both seqOp and combOp use the zeroValue, whose type is U.
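rdd1 has been redefined since the earlier pair-RDD examples; judging from the output below (1 to 10 in two partitions), presumably:

// Assumed redefinition, inferred from the output below
var rdd1 = sc.makeRDD(1 to 10, 2)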

scala> rdd1.collect

res61: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

 

scala> rdd1.mapPartitionsWithIndex{(partIdx,iter)=>{

     | var part_map=scala.collection.mutable.Map[String,List[Int]]()

     | while(iter.hasNext){

     | var part_name="part_"+partIdx

     | var elem=iter.next()

     | if(part_map.contains(part_name)){

     | var elems=part_map(part_name)

     | elems::=elem

     | part_map(part_name)=elems}else{

     | part_map(part_name)=List[Int]{elem}}}

     | part_map.iterator}}.collect

res62: Array[(String, List[Int])] = Array((part_0,List(5, 4, 3, 2, 1)), (part_1,List(10, 9, 8, 7, 6)))

## The first partition contains 5,4,3,2,1

## The second partition contains 10,9,8,7,6

scala> rdd1.aggregate(1)({(x:Int,y:Int)=>x+y},

     | {(a:Int,b:Int)=>a+b})

res63: Int = 58

Why is the result 58? Look at the calculation:

## First, (x : Int, y : Int) => x + y is applied iteratively within each partition, using zeroValue = 1

## That is, part_0: zeroValue+5+4+3+2+1 = 1+5+4+3+2+1 = 16

## part_1: zeroValue+10+9+8+7+6 = 1+10+9+8+7+6 = 41

## Then the two partition results are combined with (a : Int, b : Int) => a + b, again using zeroValue = 1

## That is: zeroValue+part_0+part_1 = 1 + 16 + 41 = 58

scala> rdd1.aggregate(2)({(x:Int,y:Int)=>x+y},

     | {(a:Int,b:Int)=>a*b})

res64: Int = 1428

## This time zeroValue = 2

## part_0: zeroValue+5+4+3+2+1 = 2+5+4+3+2+1 = 17

## part_1: zeroValue+10+9+8+7+6 = 2+10+9+8+7+6 = 42

## Finally: zeroValue*part_0*part_1 = 2 * 17 * 42 = 1428

So zeroValue not only determines the type U but also has a decisive influence on the result; take special care with it.

fold

def fold(zeroValue: T)(op: (T, T) => T): T

fold is a simplified aggregate in which seqOp and combOp are the same function op.

scala> rdd1.fold(1)((x,y)=>x+y)

res65: Int = 58

lookup

def lookup(key: K): Seq[V]

lookup works on RDDs of (K,V) pairs: given a key K, it returns all the V values in the RDD for that key.

scala> rdd2.collect

res66: Array[(String, String)] = Array((A,a), (C,c), (D,d))

 

scala> rdd2.lookup("A")

res68: Seq[String] = WrappedArray(a)

countByKey, foreach, foreachPartition, sortBy

countByKey

def countByKey(): Map[K,Long]

countByKey counts the number of occurrences of each key K in an RDD[K,V].

scala> var rdd1=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("B",3)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[64] at makeRDD at <console>:26

 

scala> rdd1.countByKey

res69: scala.collection.Map[String,Long] = Map(A -> 2, B -> 3)

 

foreach

def foreach(f: (T) => Unit): Unit

foreach iterates over the RDD, applying the function f to each element.

Note, however, that foreach on an RDD only takes effect on the Executors, not on the Driver.

For example, rdd.foreach(println) only prints to the Executors' stdout; nothing is visible on the Driver.

At least that is how it behaves for me in Spark 1.4; I am not sure whether it is always the case.

In such situations, combining foreach with an accumulator shared variable is a good choice.

scala> var cnt=sc.accumulator(0)

warning: there were two deprecation warnings; re-run with -deprecation for details

cnt: org.apache.spark.Accumulator[Int] = 0

 

scala> var rdd1=sc.makeRDD(1 to 10,2)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:26

 

scala> rdd1.foreach(x=>cnt+=x)

 

scala> cnt.value

res72: Int = 55

 

scala> rdd1.collect.foreach(println)

1

2

3

4

5

6

7

8

9

10

foreachPartition

def foreachPartition(f: (Iterator[T]) => Unit): Unit

foreachPartition is similar to foreach, except that f is applied once per partition (to the partition's iterator).

scala> var allsize=sc.accumulator(0)

warning: there were two deprecation warnings; re-run with -deprecation for details

allsize: org.apache.spark.Accumulator[Int] = 0

 

scala> rdd1.foreachPartition{x=>{

     | allsize += x.size}}

 

scala> println(allsize.value)

10

sortBy

def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

sortBy sorts the elements of the RDD according to the given key function f. (In the output below the order looks inverted because the reversed implicit Ordering defined earlier is still in scope.)

scala> var rdd1=sc.makeRDD(Seq(3,6,7,1,2,0),2)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[68] at makeRDD at <console>:26

 

scala> rdd1.sortBy(x=>x).collect

res79: Array[Int] = Array(7, 6, 3, 2, 1, 0)

 

scala> rdd1.sortBy(x=>x,false).collect

res80: Array[Int] = Array(0, 1, 2, 3, 6, 7)


 

scala> var rdd1=sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[79] at makeRDD at <console>:26

 

scala> rdd1.sortBy(x=>x).collect

res81: Array[(String, Int)] = Array((A,2), (A,1), (B,7), (B,6), (B,3))

 

scala> rdd1.sortBy(x=>x._2,false).collect

res82: Array[(String, Int)] = Array((A,1), (A,2), (B,3), (B,6), (B,7))

 

saveAsTextFile, saveAsSequenceFile, saveAsObjectFile

saveAsSequenceFile

saveAsSequenceFile saves the RDD to HDFS in SequenceFile format. Its usage is the same as saveAsTextFile.

saveAsObjectFile

def saveAsObjectFile(path: String): Unit

saveAsObjectFile serializes the elements of the RDD into objects and stores them in files.

On HDFS, the SequenceFile format is used by default. A sketch of all three save calls follows below.
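A minimal sketch of the three save operations (the output paths are hypothetical; saveAsSequenceFile needs a pair RDD whose key and value types can be converted to Hadoop Writables):

// Hypothetical paths under /tmp/lw
val pairs = sc.makeRDD(Array(("A",1),("B",2),("C",3)))
pairs.saveAsTextFile("/tmp/lw/out_text")      // one plain-text file per partition, records written via toString
pairs.saveAsSequenceFile("/tmp/lw/out_seq")   // keys/values written as Hadoop Writables
pairs.saveAsObjectFile("/tmp/lw/out_obj")     // Java-serialized objects stored in a SequenceFile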

saveAsHadoopFile and saveAsHadoopDataset

saveAsHadoopFile

def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], codec: Class[_ <: CompressionCodec]): Unit

def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: JobConf = …, codec: Option[Class[_ <: CompressionCodec]] = None): Unit

saveAsHadoopFile stores the RDD in files on HDFS, using the old Hadoop API.

You can specify the outputKeyClass and outputValueClass as well as the compression format, as in the sketch below.

Each partition produces one output file.
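A minimal sketch of the first overload (the path and classes are chosen for illustration, reusing the hypothetical pairs RDD from the sketch above):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.TextOutputFormat

// Old-API text output with gzip compression; the output path is hypothetical
pairs.saveAsHadoopFile("/tmp/lw/out_hadoop",
  classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[GzipCodec])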

saveAsHadoopDataset

def saveAsHadoopDataset(conf: JobConf): Unit

saveAsHadoopDataset is used to save an RDD to storage other than HDFS, such as HBase.

In the JobConf, you usually need to pay attention to or set five things:

the output path, the class of the keys, the class of the values, the RDD's output format (OutputFormat), and the compression-related parameters. A sketch of writing to HBase follows below.
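A rough sketch of writing a pair RDD into an HBase table via saveAsHadoopDataset, using the old HBase mapred API (the table name "lw", column family "f1", qualifier "c1" and the pairs RDD are all hypothetical, and the exact Put API varies between HBase versions):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf

val hbaseConf = HBaseConfiguration.create()
val jobConf = new JobConf(hbaseConf)
jobConf.setOutputFormat(classOf[TableOutputFormat])   // old-API HBase output format
jobConf.set(TableOutputFormat.OUTPUT_TABLE, "lw")     // target table

pairs.map { case (k, v) =>
  val put = new Put(Bytes.toBytes(k))                                             // row key
  put.add(Bytes.toBytes("f1"), Bytes.toBytes("c1"), Bytes.toBytes(v.toString))    // family, qualifier, value
  (new ImmutableBytesWritable, put)
}.saveAsHadoopDataset(jobConf)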

 

 
