This article follows the posts of "lxw的大数据田地" (lxw's big-data blog). Some of the examples and code felt a little obscure when I worked through them, so I have added my own notes and explanations. Many thanks to the author: his series of big-data articles helped me a great deal, and readers who want to learn big data should study them at his site.
The original article is at: http://lxw1234.com/archives/2015/07/363.htm
Operators that create RDDs
Creating an RDD from a collection
parallelize
scala> var rdd=sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[86] at parallelize at <console>:27
scala> rdd.collect
res127: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd.partitions.size
res128: Int = 4
scala> var rdd=sc.parallelize(1 to 10,6)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[87] at parallelize at <console>:27
scala> rdd.partitions.size
res129: Int = 6
makeRDD
scala> var collect=Seq((1 to 10,Seq("spark","hbase")),(11 to 15,Seq("hdfs","impala")))
collect: Seq[(scala.collection.immutable.Range.Inclusive, Seq[String])] = List((Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),List(spark, hbase)), (Range(11, 12, 13, 14, 15),List(hdfs, impala)))
scala> var rdd=sc.makeRDD(collect)
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[88] at makeRDD at <console>:29
scala> rdd.partitions.size
res130: Int = 2
scala> rdd.preferredLocations(rdd.partitions(0))
res134: Seq[String] = List(spark, hbase)
scala> rdd.preferredLocations(rdd.partitions(1))
res135: Seq[String] = List(hdfs, impala)
This overload of makeRDD lets you specify preferred locations for each partition, which helps later scheduling and tuning.
Creating an RDD from external storage
textFile
scala> var rdd1=sc.textFile("/tmp/lw/1.txt")
rdd1: org.apache.spark.rdd.RDD[String] = /tmp/lw/1.txt MapPartitionsRDD[90] at textFile at <console>:27
scala> rdd1.count
res0: Long = 4
Note: if you pass a local file path, the file must exist on both the Driver and the Executors.
Creating an RDD by reading from HBase:
scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin
scala> val conf = HBaseConfiguration.create()
scala> conf.set(TableInputFormat.INPUT_TABLE,"lw")
scala> var hbaseRDD = sc.newAPIHadoopRDD(conf,
classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
scala> hbaseRDD.count
res52: Long = 1
Transformation operators
map, flatMap, distinct
map
scala> var res=rdd1.map(_.split("\\s"))
res: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:29
scala> res.collect
res2: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive), Array(""))
flatMap
scala> var flatMapResult=rdd1.flatMap(_.split("\\s"))
flatMapResult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:29
scala> flatMapResult.collect
res4: Array[String] = Array(hello, world, hello, spark, hello, hive, "")
distinct
scala> rdd1.map(_.toUpperCase).collect
res6: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, "")
scala> rdd1.flatMap(_.toUpperCase).collect
res7: Array[Char] = Array(H, E, L, L, O,  , W, O, R, L, D, H, E, L, L, O,  , S, P, A, R, K, H, E, L, L, O,  , H, I, V, E)
scala> rdd1.flatMap(_.toUpperCase).distinct.collect
res8: Array[Char] = Array(L, R, P,  , H, V, D, O, A, I, K, S, E, W)
Repartitioning operators: coalesce, repartition
scala> rdd1.partitions.size
res9: Int = 2
scala> rdd1.coalesce(4).partitions.size
res10: Int = 2
scala> rdd1.coalesce(1).partitions.size
res11: Int = 1
scala> rdd1.coalesce(4,true).partitions.size
res12: Int = 4
scala> rdd1.partitions.size
res13: Int = 2
scala> rdd1.repartition(3).partitions.size
res14: Int = 3
Note that coalesce(4) did not increase the partition count: without shuffle (the default), coalesce can only reduce the number of partitions. To increase it, pass shuffle=true, or use repartition, which is simply coalesce with shuffle=true.
RDD-splitting operators: randomSplit, glom
randomSplit splits one RDD into several RDDs according to the weights parameter, an Array[Double]; the second parameter is the random seed and can usually be ignored.
scala> var randomSplitResult=rdd1.randomSplit(Array(1.0,2.0,3.0,4.0))
randomSplitResult: Array[org.apache.spark.rdd.RDD[String]] = Array(MapPartitionsRDD[22] at randomSplit at <console>:29,
MapPartitionsRDD[23] at randomSplit at <console>:29,
MapPartitionsRDD[24] at randomSplit at <console>:29,
MapPartitionsRDD[25] at randomSplit at <console>:29)
randomSplit returns an array of RDDs. Because the split is random and the source RDD here is tiny, some of the resulting RDDs can be empty:
scala> randomSplitResult(0).collect
res16: Array[String] = Array()
scala> randomSplitResult(3).collect
res17: Array[String] = Array(hello world, hello spark, hello hive, "")
scala> randomSplitResult(2).collect
res18: Array[String] = Array()
scala> randomSplitResult.size
res21: Int = 4
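The weighted assignment can be sketched in a pure-Python simulation (an illustration only, not Spark code; the bucket-assignment details here are my own assumption about how a weighted split works):

```python
import random

def random_split(elems, weights, seed=0):
    """Simulate randomSplit: each element lands in one of len(weights)
    buckets with probability proportional to that bucket's weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)          # cumulative probability boundaries
    buckets = [[] for _ in weights]
    for x in elems:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:              # first boundary at or above r wins
                buckets[i].append(x)
                break
    return buckets

buckets = random_split(range(100), [1.0, 2.0, 3.0, 4.0])
print([len(b) for b in buckets])  # roughly 10/20/30/40, varying with the seed
```

Because the assignment is per-element random, splitting an RDD with only 4 elements can easily leave some buckets empty, exactly as in the transcript above.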
glom
def glom(): RDD[Array[T]]
glom converts each partition's elements of type T into a single Array[T], so each partition produces exactly one array.
scala> var glomResult=rdd1.glom
glomResult: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[26] at glom at <console>:29
scala> glomResult.collect
res22: Array[Array[String]] = Array(Array(hello world, hello spark), Array(hello hive, ""))
scala> rdd1.collect
res23: Array[String] = Array(hello world, hello spark, hello hive, "")
glom gathers each partition's elements into one array; since rdd1 has two partitions here, the result is two arrays.
union
def union(other: RDD[T]): RDD[T]
This one is straightforward: it merges two RDDs without deduplicating.
scala> var rdd2=sc.makeRDD(1 to 3,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at makeRDD at <console>:27
scala> var rdd3=sc.makeRDD(3 to 5,3)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at makeRDD at <console>:27
scala> rdd2.union(rdd3).collect
res24: Array[Int] = Array(1, 2, 3, 3, 4, 5)
intersection
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord:Ordering[T] = null): RDD[T]
intersection returns the intersection of two RDDs, deduplicated. The parameter numPartitions sets the number of partitions of the result; partitioner sets the partitioning function.
scala> var rdd2=sc.makeRDD(1 to 3,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at makeRDD at <console>:27
scala> var rdd3=sc.makeRDD(3 to 5,3)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at <console>:27
scala> rdd2.intersection(rdd3).collect
res26: Array[Int] = Array(3)
scala> rdd2.intersection(rdd3).partitions.size
res1: Int = 3
subtract
def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T]= null): RDD[T]
subtract has the same shape as intersection, but returns the elements that appear in this RDD and not in otherRDD, without deduplicating. The parameters mean the same as for intersection.
scala> rdd2.subtract(rdd3).collect
res3: Array[Int] = Array(1, 2)
scala> rdd3.subtract(rdd2).collect
res4: Array[Int] = Array(4, 5)
mapPartitions
def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
This is like map, except that the mapping function's argument changes from each element of the RDD to an iterator over each partition. When the mapping needs to create expensive extra objects, mapPartitions can be far more efficient than map. For example, to write all of an RDD's data to a database over JDBC, map would create one connection per element, which is very costly, whereas mapPartitions needs only one connection per partition. The parameter preservesPartitioning indicates whether to keep the parent RDD's partitioner.
scala> var rdd4=sc.makeRDD(1 to 5,2)
rdd4: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at makeRDD at <console>:27
scala> var rdd5=rdd4.mapPartitions{x=>{
     | var result=List[Int]()
     | var i=0
     | while(x.hasNext){
     | i+=x.next()}
     | result.::(i).iterator}}
rdd5: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at mapPartitions at <console>:29
scala> rdd5.collect
res5: Array[Int] = Array(3, 12)
scala> rdd5.partitions.size
res6: Int = 2
mapPartitionsWithIndex
def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
Works like mapPartitions, except that f takes two arguments, the first being the partition index.
scala> var rdd6=rdd4.mapPartitionsWithIndex{(x,iter)=>{
     | var result=List[String]()
     | var i=0
     | while(iter.hasNext){
     | i+=iter.next()
     | }
     | result.::(x+"|"+i).iterator}}
rdd6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at mapPartitionsWithIndex at <console>:29
scala> rdd6.collect
res7: Array[String] = Array(0|3, 1|12)
zip
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
zip combines two RDDs into an RDD of Key/Value pairs. It requires the two RDDs to have the same number of partitions and the same number of elements per partition; otherwise an exception is thrown.
scala> rdd2.partitions.size
res14: Int = 3
scala> var rdd7=sc.makeRDD(Seq("A","B","C"),3)
rdd7: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at makeRDD at <console>:27
scala> rdd2.zip(rdd7).collect
res16: Array[(Int, String)] = Array((1,A), (2,B), (3,C))
zipPartitions
zipPartitions combines several RDDs partition-by-partition into a new RDD. The RDDs must have the same number of partitions, but there is no requirement on the number of elements per partition. The function has several overloads, in three groups:
One other RDD as parameter
def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
The only difference between the two is preservesPartitioning, i.e. whether to keep the parent RDD's partitioner. The mapping function f takes the two RDDs' partition iterators as arguments.
scala> var rdd1=sc.makeRDD(1 to 5,3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at makeRDD at <console>:27
scala> var rdd2=sc.makeRDD(Seq("A","B","C","D","E"),3)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at makeRDD at <console>:27
scala> rdd1.mapPartitionsWithIndex{(x,iter)=>{
     | var result=List[String]()
     | while(iter.hasNext){
     | result::=("part_"+x+"|"+iter.next())}
     | result.iterator}
     | }.collect
res17: Array[String] = Array(part_0|1, part_1|3, part_1|2, part_2|5, part_2|4)
scala> rdd2.mapPartitionsWithIndex{(x,iter)=>{
     | var result=List[String]()
     | while(iter.hasNext){
     | result::=("part_"+x+"|"+iter.next())}
     | result.iterator}
     | }.collect
res18: Array[String] = Array(part_0|A, part_1|C, part_1|B, part_2|E, part_2|D)
scala> rdd1.zipPartitions(rdd2){(rdd1Iter,rdd2Iter)=>{
     | var result=List[String]()
     | while(rdd1Iter.hasNext&&rdd2Iter.hasNext){
     | result::=(rdd1Iter.next()+"_"+rdd2Iter.next())}
     | result.iterator}
     | }.collect
res19: Array[String] = Array(1_A, 3_C, 2_B, 5_E, 4_D)
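The partition-wise pairing can be replayed off-Spark. Below is a minimal pure-Python sketch (an illustration, not Spark code; the partition layouts are copied from the mapPartitionsWithIndex output above, and this version emits elements in natural order rather than the reversed order produced by the List-prepend in the Scala example):

```python
def zip_partitions(parts_a, parts_b, f):
    """Simulate zipPartitions: both RDDs must have the same number of
    partitions; f combines each pair of partition iterators."""
    assert len(parts_a) == len(parts_b), "partition counts must match"
    return [list(f(iter(a), iter(b))) for a, b in zip(parts_a, parts_b)]

# Layouts from the transcript: rdd1 = [1],[3,2],[5,4]; rdd2 = [A],[C,B],[E,D]
out = zip_partitions([[1], [3, 2], [5, 4]],
                     [["A"], ["C", "B"], ["E", "D"]],
                     lambda xs, ys: ("%s_%s" % (x, y) for x, y in zip(xs, ys)))
print(out)  # [['1_A'], ['3_C', '2_B'], ['5_E', '4_D']]
```

Flattened, this reproduces the order 1_A, 3_C, 2_B, 5_E, 4_D seen in res19, and shows why unequal element counts per partition are fine: each partition pair is zipped independently.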
Two other RDDs as parameters
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
Usage is the same as above, except that these take two other RDDs and f receives three partition iterators.
scala> rdd1
res20: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at makeRDD at <console>:27
scala> rdd2
res21: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at makeRDD at <console>:27
scala> rdd3.mapPartitionsWithIndex{(x,iter)=>{
     | var result=List[String]()
     | while(iter.hasNext){
     | result::=("part_"+x+"|"+iter.next())}
     | result.iterator}
     | }.collect
res22: Array[String] = Array(part_0|a, part_1|c, part_1|b, part_2|e, part_2|d)
scala> var rdd4=rdd1.zipPartitions(rdd2,rdd3){(rdd1Iter,rdd2Iter,rdd3Iter)=>{
     | var result=List[String]()
     | while(rdd1Iter.hasNext&&rdd2Iter.hasNext&&rdd3Iter.hasNext){
     | result::=(rdd1Iter.next()+"_"+rdd2Iter.next()+"_"+rdd3Iter.next())}
     | result.iterator}
     | }
rdd4: org.apache.spark.rdd.RDD[String] = ZippedPartitionsRDD3[38] at zipPartitions at <console>:33
scala> rdd4.collect
res23: Array[String] = Array(1_A_a, 3_C_c, 2_B_b, 5_E_e, 4_D_d)
Three other RDDs as parameters
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
Usage is the same as above, just with one more RDD.
zipWithIndex
def zipWithIndex(): RDD[(T, Long)]
zipWithIndex pairs each element of the RDD with its index (ID) in the RDD as key/value pairs.
scala> rdd2.zipWithIndex().collect
res1: Array[(String, Long)] = Array((A,0), (B,1),(C,2), (D,3), (E,4))
zipWithUniqueId
def zipWithUniqueId(): RDD[(T, Long)]
zipWithUniqueId pairs each element with a unique ID. The ID is generated as follows:
the first element in each partition gets the partition index as its ID;
the N-th element in a partition gets (the previous element's ID) + (the RDD's total number of partitions).
scala> rdd2.zipWithUniqueId().collect
res2: Array[(String, Long)] = Array((A,0), (B,1), (C,4), (D,2), (E,5))
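To see where these IDs come from, here is a pure-Python rendering of the scheme (an illustration, not Spark code; the partition layout part_0=[A], part_1=[B,C], part_2=[D,E] is taken from the zipPartitions example above):

```python
def zip_with_unique_id(partitions):
    """Simulate zipWithUniqueId: the k-th element of partition i
    gets the ID i + k * numPartitions, which is what the two rules
    in the text amount to."""
    n = len(partitions)
    return [(x, i + k * n)
            for i, part in enumerate(partitions)
            for k, x in enumerate(part)]

print(zip_with_unique_id([["A"], ["B", "C"], ["D", "E"]]))
# [('A', 0), ('B', 1), ('C', 4), ('D', 2), ('E', 5)]
```

This matches the transcript: A starts partition 0 (ID 0), B starts partition 1 (ID 1), C follows B (1 + 3 = 4), D starts partition 2 (ID 2), E follows D (2 + 3 = 5).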
Spark operators: inspecting the elements and counts of each RDD partition
Spark RDDs are partitioned. When creating an RDD you can usually specify the number of partitions. If you don't, an RDD built from a collection defaults to the number of CPU cores allocated to the application, while an RDD built from an HDFS file defaults to the file's number of blocks.
You can use the RDD's mapPartitionsWithIndex method to inspect each partition's elements and their count.
//Count the number of elements in each partition of rdd3
scala> rdd3.mapPartitionsWithIndex{(partIdx,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,Int]()//map holding partition name and element count
     | while(iter.hasNext){//iterate over the partition's elements
     | var part_name="part_"+partIdx//use the partition index to build the map key
     | if(part_map.contains(part_name)){//check whether the map already has this key
     | var ele_cnt=part_map(part_name)//copy the current count into a temporary
     | part_map(part_name)=ele_cnt+1}//if: increment the element count
     | else{
     | part_map(part_name)=1}//else: first element for this key, count = 1
     | iter.next()}//while: advance the iterator
     | part_map.iterator}//inner
     | }.collect//outer
res5: Array[(String, Int)] = Array((part_0,3), (part_1,3), (part_2,4), (part_3,3), (part_4,3), (part_5,4), (part_6,3), (part_7,3), (part_8,4), (part_9,3), (part_10,3), (part_11,4), (part_12,3), (part_13,3), (part_14,4))
The element counts of each partition, from part_0 to part_14.
//List the elements contained in each partition of rdd3
scala> rdd3.mapPartitionsWithIndex{(partIdx,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[Int]]()//map holding partition name and the list of its elements
     | while(iter.hasNext){//does the current partition have another element?
     | var part_name="part_"+partIdx//use the partition index to build the map key
     | var elem=iter.next()//take the partition's next element
     | if(part_map.contains(part_name)){//check whether the map already has this key
     | var elems=part_map(part_name)//copy the key's element list into elems
     | elems::=elem//prepend the element to the list
     | part_map(part_name)=elems}else{//store the list back as the value
     | part_map(part_name)=List[Int]{elem}}}//else: first element for this partition
     | part_map.iterator}}.collect
res6: Array[(String, List[Int])] = Array((part_0,List(3, 2, 1)), (part_1,List(6, 5, 4)), (part_2,List(10, 9, 8, 7)), (part_3,List(13, 12, 11)), (part_4,List(16, 15, 14)), (part_5,List(20, 19, 18, 17)), (part_6,List(23, 22, 21)), (part_7,List(26, 25, 24)), (part_8,List(30, 29, 28, 27)), (part_9,List(33, 32, 31)), (part_10,List(36, 35, 34)), (part_11,List(40, 39, 38, 37)), (part_12,List(43, 42, 41)), (part_13,List(46, 45, 44)), (part_14,List(50, 49, 48, 47)))
//The elements contained in each partition, from part_0 to part_14
partitionBy, mapValues, flatMapValues
partitionBy
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
partitionBy repartitions the original RDD according to the given partitioner, producing a new ShuffledRDD.
scala> var rdd4=sc.makeRDD(
     | Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
rdd4: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[8] at makeRDD at <console>:24
scala> rdd4.mapPartitionsWithIndex{(partIdx,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[(Int,String)]]()
     | while(iter.hasNext){var part_name="part_"+partIdx
     | var elem=iter.next()
     | if(part_map.contains(part_name)){
     | var elems=part_map(part_name)
     | elems::=elem
     | part_map(part_name)=elems}else{
     | part_map(part_name)=List[(Int,String)]{elem}}
     | }
     | part_map.iterator
     | }}.collect
res8: Array[(String, List[(Int, String)])] = Array((part_0,List((2,B), (1,A))), (part_1,List((4,D), (3,C))))
//(2,B),(1,A) are in part_0; (4,D),(3,C) are in part_1
//Repartition with partitionBy
scala> var rdd5=rdd4.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd5: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[10] at partitionBy at <console>:26
scala> rdd5.mapPartitionsWithIndex{(partIdx,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[(Int,String)]]()
     | while(iter.hasNext){var part_name="part_"+partIdx
     | var elem=iter.next()
     | if(part_map.contains(part_name)){
     | var elems=part_map(part_name)
     | elems::=elem
     | part_map(part_name)=elems}else{
     | part_map(part_name)=List[(Int,String)]{elem}}
     | }
     | part_map.iterator}}.collect
res9: Array[(String, List[(Int, String)])] = Array((part_0,List((4,D), (2,B))), (part_1,List((1,A), (3,C))))
//(4,D),(2,B) are now in part_0; (1,A),(3,C) are in part_1
mapValues
def mapValues[U](f: (V) => U): RDD[(K, U)]
Same as the basic map transformation, except mapValues applies the function only to the V of each [K,V] pair.
scala> rdd4.collect
res10: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D))
scala> rdd4.mapValues(x=>x+"_").collect
res11: Array[(Int, String)] = Array((1,A_), (2,B_), (3,C_), (4,D_))
flatMapValues
def flatMapValues[U](f: (V) => TraversableOnce[U]):RDD[(K, U)]
Same as the basic flatMap transformation, except flatMapValues applies flatMap only to the V of each [K,V] pair.
scala> rdd4.flatMapValues(x=>x+"_").collect
res13: Array[(Int, Char)] = Array((1,A), (1,_), (2,B), (2,_), (3,C), (3,_), (4,D), (4,_))
combineByKey, foldByKey
combineByKey
def combineByKey[C](createCombiner: (V) => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: (V) => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]
def combineByKey[C](createCombiner: (V) => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner,mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]
combineByKey turns an RDD[K,V] into an RDD[K,C]; the types V and C may be the same or different.
Its parameters:
createCombiner: combiner-creation function that turns a V into a C; input is a V from the RDD[K,V], output is a C
mergeValue: merges a C and a V into a C; input is (C,V), output is C
mergeCombiners: merges two C values into one C; input is (C,C), output is C
numPartitions: number of partitions of the result RDD; defaults to keeping the original partitioning
partitioner: partitioning function; defaults to HashPartitioner
mapSideCombine: whether to combine on the map side, like the combiner in MapReduce; defaults to true
scala> var rdd6=sc.makeRDD(Array(("A",1),("A",3),("C",1),("C",3),("C",1)))
rdd6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[14] at makeRDD at <console>:24
scala> rdd6.combineByKey(
     | (v:Int)=>v+"_",
     | (c:String,v:Int)=>c+"@"+v,
     | (c1:String,c2:String)=>c1+"$"+c2).collect
res14: Array[(String, String)] = Array((A,3_$1_), (C,1_$3_$1_))
The three mapping functions here are:
createCombiner: (V) => C
(v : Int) => v + "_" //append the character _ to each V, returning a C (String)
mergeValue: (C, V) => C
(c : String, v : Int) => c + "@" + v //merge a C and a V with the character @ in between, returning a C (String)
mergeCombiners: (C, C) => C
(c1 : String, c2 : String) => c1 + "$" + c2 //merge two C values with $ in between, returning a C (String)
The other parameters keep their defaults.
In the end, the RDD[String,Int] is turned into an RDD[String,String].
scala> rdd6.combineByKey(
     | (v:Int)=>List(v),
     | (c:List[Int],v:Int)=>v::c,
     | (c1:List[Int],c2:List[Int])=>c1:::c2).collect
res16: Array[(String, List[Int])] = Array((A,List(3, 1)), (C,List(1, 3, 1)))
Here the RDD[String,Int] is turned into an RDD[String,List[Int]].
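The three-function contract can be replayed outside Spark. Here is a minimal pure-Python model (an illustration, not Spark code; the two-partition layout below is my own assumption, although for this particular input the result is the same under any layout):

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Simulate combineByKey: build a combiner per key inside each
    partition, then merge the per-partition combiners across partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for k, v in part:
            # first value for a key creates a combiner; later values merge in
            local[k] = merge_value(local[k], v) if k in local else create_combiner(v)
        for k, c in local.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

parts = [[("A", 1), ("A", 3)], [("C", 1), ("C", 3), ("C", 1)]]
print(combine_by_key(parts,
                     lambda v: [v],           # createCombiner: V -> C
                     lambda c, v: [v] + c,    # mergeValue: prepend, like v::c
                     lambda c1, c2: c1 + c2)) # mergeCombiners: concat, like c1:::c2
# {'A': [3, 1], 'C': [1, 3, 1]}
```

This reproduces res16 above: within a partition each key's values are folded into one combiner, and combiners from different partitions are then concatenated.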
foldByKey
def foldByKey(zeroValue: V)(func: (V, V) => V):RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func:(V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner:Partitioner)(func: (V, V) => V): RDD[(K, V)]
foldByKey folds and merges the V values of an RDD[K,V] by key. The zeroValue is first combined with each key's values inside every partition as the starting value, and the per-partition results are then merged with the same function. In the runs below the RDD happens to have one element per partition, so zeroValue effectively gets applied to every single V before the fold.
scala> var rdd7=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd7: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[17] at makeRDD at <console>:24
scala> rdd7.foldByKey(0)(_+_).collect
res17: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
//Accumulate the V values of each key in rdd7. Note zeroValue=0: each V is first initialized with it, and the fold function (+) is then applied. For ("A",0),("A",2): applying zeroValue to each V gives ("A",0+0),("A",2+0), i.e. ("A",0),("A",2); applying + to the initialized values then gives (A,0+2), i.e. (A,2).
scala> rdd7.foldByKey(2)(_+_).collect
res18: Array[(String, Int)] = Array((A,6), (B,7), (C,3))
//First zeroValue=2 is applied to each V, giving ("A",0+2),("A",2+2), i.e. ("A",2),("A",4); applying + to the initialized values then gives (A,2+4), i.e. (A,6).
scala> rdd7.foldByKey(0)(_*_).collect
res19: Array[(String, Int)] = Array((A,0), (B,0), (C,0))
//First zeroValue=0 is applied to each V; note the fold function is now multiplication, giving ("A",0*0),("A",2*0), i.e. ("A",0),("A",0); applying * then yields (A,0*0), i.e. (A,0). The same happens for every other K, so all values end up 0.
scala> rdd7.foldByKey(1)(_*_).collect
res78: Array[(String, Int)] = Array((A,0), (B,2), (C,1))
//When the fold function is multiplication, zeroValue must be set to 1 to get the result we want.
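The role of zeroValue can be made concrete with a pure-Python model (an illustration, not Spark code). It assumes one element per partition, as in the transcripts above (the same 5-element data gets 5 partitions in this shell, as the reduceByKey example later shows), which is why zeroValue ends up applied to every single V:

```python
def fold_by_key(partitions, zero, op):
    """Simulate foldByKey: inside each partition, fold each key's values
    starting from zero; then merge the per-partition results with op
    alone (no extra zero at merge time)."""
    merged = {}
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = op(local.get(k, zero), v)  # zero used once per key per partition
        for k, v in local.items():
            merged[k] = op(merged[k], v) if k in merged else v
    return merged

# One element per partition, matching the 5-partition layout
parts = [[("A", 0)], [("A", 2)], [("B", 1)], [("B", 2)], [("C", 1)]]
print(fold_by_key(parts, 0, lambda x, y: x + y))  # {'A': 2, 'B': 3, 'C': 1}
print(fold_by_key(parts, 2, lambda x, y: x + y))  # {'A': 6, 'B': 7, 'C': 3}
print(fold_by_key(parts, 1, lambda x, y: x * y))  # {'A': 0, 'B': 2, 'C': 1}
```

The model also makes the caveat visible: because zeroValue is applied once per key per partition, the result of a non-neutral zeroValue (like 2 with +) depends on how the data is partitioned.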
groupByKey, reduceByKey, reduceByKeyLocally
groupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
groupByKey collects, for each K of an RDD[K,V], all of its V values into one Iterable[V].
The parameter numPartitions sets the number of partitions; partitioner sets the partitioning function.
scala> var rdd1=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[22] at makeRDD at <console>:24
scala> rdd1.groupByKey().collect
res21: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(2, 0)), (B,CompactBuffer(1, 2)), (C,CompactBuffer(1)))
reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
reduceByKey combines, for each K of an RDD[K,V], all of its V values using the given function.
The parameter numPartitions sets the number of partitions;
the parameter partitioner sets the partitioning function.
scala> rdd1.partitions.size
res22: Int = 5
scala> var rdd2=rdd1.reduceByKey((x,y)=>x+y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[24] at reduceByKey at <console>:26
scala> rdd2.collect
res23: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
scala> rdd2.partitions.size
res24: Int = 5
scala> var rdd2=rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y)=>x+y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at reduceByKey at <console>:26
scala> rdd2.collect
res25: Array[(String, Int)] = Array((B,3), (A,2), (C,1))
scala> rdd2.partitions.size
res26: Int = 2
reduceByKeyLocally
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
reduceByKeyLocally combines each K's values with the given function like reduceByKey, but returns the result as a Map[K,V] on the driver rather than an RDD[K,V].
scala> rdd1.reduceByKeyLocally((x,y)=>x+y)
res27: scala.collection.Map[String,Int] = Map(A -> 2, B -> 3, C -> 1)
scala> rdd1.collect
res28: Array[(String, Int)] = Array((A,0), (A,2), (B,1), (B,2), (C,1))
cogroup, join
cogroup
##One other RDD as parameter
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
##Two other RDDs as parameters
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
##Three other RDDs as parameters
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
cogroup is like a full outer join in SQL: it returns records from both the left and right RDDs, with empty collections where a key has no match.
The parameter numPartitions sets the number of partitions of the result.
The parameter partitioner sets the partitioning function.
##Example with one other RDD
scala> var rdd1=sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24
scala> var rdd2=sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[28] at makeRDD at <console>:24
scala> var rdd3=rdd1.cogroup(rdd2).collect
rdd3: Array[(String, (Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d))), (A,(CompactBuffer(1),CompactBuffer(a))), (C,(CompactBuffer(3),CompactBuffer(c))))
scala> rdd1.cogroup(rdd2).partitions.size
res30: Int = 2
scala> var rdd4=rdd1.cogroup(rdd2,3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[38] at cogroup at <console>:28
scala> rdd4.partitions.size
res31: Int = 3
scala> rdd4.collect
res33: Array[(String, (Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer())), (C,(CompactBuffer(3),CompactBuffer(c))), (A,(CompactBuffer(1),CompactBuffer(a))), (D,(CompactBuffer(),CompactBuffer(d))))
scala> rdd1.partitions.size
res34: Int = 2
scala> rdd2.partitions.size
res35: Int = 2
scala> var rdd4=rdd1.cogroup(rdd2,rdd3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String], Iterable[String]))] = MapPartitionsRDD[41] at cogroup at <console>:30
scala> rdd4.partitions.size
res36: Int = 2
scala> rdd4.collect
res37: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())), (A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))), (C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())), (E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))
join
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
join is like an inner join in SQL: it returns only the records whose K can be matched in both RDDs. join works on exactly two RDDs; to join more, chain several joins.
The parameter numPartitions sets the number of partitions of the result.
The parameter partitioner sets the partitioning function.
scala> rdd1.join(rdd2).collect
res38: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
leftOuterJoin, rightOuterJoin, subtractByKey
leftOuterJoin
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]
leftOuterJoin is like a left outer join in SQL: the result is driven by the left RDD, with None for keys that have no match. It works on exactly two RDDs; to join more, chain several joins.
The parameter numPartitions sets the number of partitions of the result.
The parameter partitioner sets the partitioning function.
scala> rdd1.collect
res43: Array[(String, String)] = Array((A,1), (B,2), (C,3))
scala> rdd2.collect
res44: Array[(String, String)] = Array((A,a), (C,c), (D,d))
scala> rdd1.leftOuterJoin(rdd2).collect
res45: Array[(String, (String, Option[String]))] = Array((B,(2,None)), (A,(1,Some(a))), (C,(3,Some(c))))
rightOuterJoin
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], W))]
rightOuterJoin is like a right outer join in SQL: the result is driven by the RDD passed as the argument, with None for keys that have no match. It works on exactly two RDDs; to join more, chain several joins.
The parameter numPartitions sets the number of partitions of the result.
The parameter partitioner sets the partitioning function.
scala> rdd1.rightOuterJoin(rdd2).collect
res46: Array[(String, (Option[String], String))] = Array((D,(None,d)), (A,(Some(1),a)), (C,(Some(3),c)))
subtractByKey
def subtractByKey[W](other: RDD[(K, W)])(implicit arg0: ClassTag[W]): RDD[(K, V)]
def subtractByKey[W](other: RDD[(K, W)], numPartitions: Int)(implicit arg0: ClassTag[W]): RDD[(K, V)]
def subtractByKey[W](other: RDD[(K, W)], p: Partitioner)(implicit arg0: ClassTag[W]): RDD[(K, V)]
subtractByKey is like the basic transformation subtract, except that it operates on the K: it returns the pairs whose key appears in this RDD but not in otherRDD.
The parameter numPartitions sets the number of partitions of the result.
The parameter partitioner sets the partitioning function.
scala> rdd1.subtractByKey(rdd2).collect
res47: Array[(String, String)] = Array((B,2))
first, count, reduce, collect
first
def first(): T
first returns the first element of the RDD, without sorting.
scala> rdd1.collect
res48: Array[(String, String)] = Array((A,1), (B,2), (C,3))
scala> rdd1.first
res49: (String, String) = (A,1)
scala> var rdd3=sc.makeRDD(Seq(10,4,23,33,2))
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at makeRDD at <console>:24
scala> rdd3.first
res50: Int = 10
count
def count(): Long
count returns the number of elements in the RDD.
scala> rdd3.count
res51: Long = 5
reduce
def reduce(f: (T, T) ⇒ T): T
reduce performs a binary computation over the RDD's elements with the function f and returns the result.
scala> rdd3.reduce(_+_)
res52: Int = 72
scala> var rdd4=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[56] at makeRDD at <console>:24
scala> rdd4.reduce((x,y)=>{
     | (x._1+y._1,x._2+y._2)})
res53: (String, Int) = (BACBA,6)
collect
def collect(): Array[T]
collect gathers the RDD's elements into an array on the driver.
scala> rdd3.collect
res54: Array[Int] = Array(10, 4, 23, 33, 2)
take, top, takeOrdered
take
def take(num: Int): Array[T]
take returns the elements at indices 0 through num-1 of the RDD, without sorting.
scala> rdd3.take(2)
res55: Array[Int] = Array(10, 4)
top
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
top returns the first num elements of the RDD according to the default (descending) order or a specified ordering.
scala> rdd3.top(3)
res56: Array[Int] = Array(33, 23, 10)
//Specify an ordering
scala> implicit val myOrder=implicitly[Ordering[Int]].reverse
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@58ac9ec9
scala> rdd3.top(3)
res57: Array[Int] = Array(2, 4, 10)
takeOrdered
def takeOrdered(num: Int)(implicit ord: Ordering[T]):Array[T]
takeOrdered is similar to top, but returns elements in the order opposite to top's. (Note that the reversed implicit Ordering defined above is still in scope, which is why the result below is descending.)
scala> rdd3.takeOrdered(3)
res60: Array[Int] = Array(33, 23, 10)
aggregate, fold, lookup
aggregate
def aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)(implicit arg0: ClassTag[U]): U
aggregate aggregates the RDD's elements: seqOp first aggregates each partition's elements of type T into a U, then combOp aggregates the per-partition U values into a single U. Note carefully that both seqOp and combOp use the zeroValue, whose type is U.
scala> rdd1.collect
res61: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd1.mapPartitionsWithIndex{(partIdx,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[Int]]()
     | while(iter.hasNext){
     | var part_name="part_"+partIdx
     | var elem=iter.next()
     | if(part_map.contains(part_name)){
     | var elems=part_map(part_name)
     | elems::=elem
     | part_map(part_name)=elems}else{
     | part_map(part_name)=List[Int]{elem}}}
     | part_map.iterator}}.collect
res62: Array[(String, List[Int])] = Array((part_0,List(5, 4, 3, 2, 1)), (part_1,List(10, 9, 8, 7, 6)))
##The first partition contains 5,4,3,2,1
##The second partition contains 10,9,8,7,6
scala> rdd1.aggregate(1)({(x:Int,y:Int)=>x+y},
     | {(a:Int,b:Int)=>a+b})
res63: Int = 58
Why is the result 58? Here is the computation:
##First, (x : Int,y : Int) => x + y runs inside each partition, using the zeroValue 1
##i.e. part_0: zeroValue+5+4+3+2+1 = 1+5+4+3+2+1 = 16
##     part_1: zeroValue+10+9+8+7+6 = 1+10+9+8+7+6 = 41
##Then the two partition results are merged with (a : Int,b : Int) => a + b, again using the zeroValue 1
##i.e. zeroValue+part_0+part_1 = 1 + 16 + 41 = 58
scala> rdd1.aggregate(2)({(x:Int,y:Int)=>x+y},
     | {(a:Int,b:Int)=>a*b})
res64: Int = 1428
##This time zeroValue=2
##part_0: zeroValue+5+4+3+2+1 = 2+5+4+3+2+1 = 17
##part_1: zeroValue+10+9+8+7+6 = 2+10+9+8+7+6 = 42
##Finally: zeroValue*part_0*part_1 = 2 * 17 * 42 = 1428
So zeroValue not only fixes the type U; it also has a decisive effect on the result. Use it with particular care.
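The two-level fold described above can be replayed in a pure-Python model (an illustration, not Spark code; the partition layout is the one shown by mapPartitionsWithIndex above):

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Simulate RDD.aggregate: fold each partition with seq_op starting
    from zero, then fold the per-partition results with comb_op,
    again starting from zero."""
    per_part = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_part, zero)

parts = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(aggregate(parts, 1, lambda x, y: x + y, lambda a, b: a + b))  # 58
print(aggregate(parts, 2, lambda x, y: x + y, lambda a, b: a * b))  # 1428
```

The second call makes the caveat concrete: zeroValue=2 enters the computation three times (once per partition and once in the combine step), which is why the result balloons to 1428.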
fold
def fold(zeroValue: T)(op: (T, T) ⇒ T): T
fold is a simplified aggregate that uses the same function op for both seqOp and combOp.
scala> rdd1.fold(1)((x,y)=>x+y)
res65: Int = 58
lookup
def lookup(key: K): Seq[V]
lookup applies to RDDs of type (K,V): given a key K, it returns all the V values in the RDD for that key.
scala> rdd2.collect
res66: Array[(String, String)] = Array((A,a), (C,c), (D,d))
scala> rdd2.lookup("A")
res68: Seq[String] = WrappedArray(a)
countByKey, foreach, foreachPartition, sortBy
countByKey
def countByKey(): Map[K, Long]
countByKey counts the number of occurrences of each K in an RDD[K,V].
scala> var rdd1=sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("B",3)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[64] at makeRDD at <console>:26
scala> rdd1.countByKey
res69: scala.collection.Map[String,Long] = Map(A -> 2, B -> 3)
foreach
def foreach(f: (T) ⇒ Unit): Unit
foreach iterates over the RDD, applying f to every element.
Be aware that foreach runs on the Executors, not on the Driver.
For example, rdd.foreach(println) prints only to the Executors' stdout; the Driver sees nothing.
At least that is how it behaves for me on Spark 1.4; I am not sure it is universal.
In such cases, combining foreach with an accumulator shared variable is a good choice.
scala> var cnt=sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
cnt: org.apache.spark.Accumulator[Int] = 0
scala> var rdd1=sc.makeRDD(1 to 10,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:26
scala> rdd1.foreach(x=>cnt+=x)
scala> cnt.value
res72: Int = 55
scala> rdd1.collect.foreach(println)
1
2
3
4
5
6
7
8
9
10
foreachPartition
def foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit
foreachPartition is like foreach, except that f is applied once per partition.
scala> var allsize=sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
allsize: org.apache.spark.Accumulator[Int] = 0
scala> rdd1.foreachPartition{x=>{
     | allsize += x.size}}
scala> println(allsize.value)
10
sortBy
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
sortBy sorts the RDD's elements by the given key function. (Note that the reversed implicit Ordering[Int] defined in the top example is still in scope in this shell, which is why ascending=true below produces a descending result.)
scala> var rdd1=sc.makeRDD(Seq(3,6,7,1,2,0),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[68] at makeRDD at <console>:26
scala> rdd1.sortBy(x=>x).collect
res79: Array[Int] = Array(7, 6, 3, 2, 1, 0)
scala> rdd1.sortBy(x=>x,false).collect
res80: Array[Int] = Array(0, 1, 2, 3, 6, 7)
scala> var rdd1=sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[79] at makeRDD at <console>:26
scala> rdd1.sortBy(x=>x).collect
res81: Array[(String, Int)] = Array((A,2), (A,1), (B,7), (B,6), (B,3))
scala> rdd1.sortBy(x=>x._2,false).collect
res82: Array[(String, Int)] = Array((A,1), (A,2), (B,3), (B,6), (B,7))
saveAsTextFile, saveAsSequenceFile, saveAsObjectFile
saveAsSequenceFile
saveAsSequenceFile saves the RDD to HDFS in SequenceFile format. Usage is the same as saveAsTextFile.
saveAsObjectFile
def saveAsObjectFile(path: String): Unit
saveAsObjectFile serializes the RDD's elements as objects and stores them in a file.
On HDFS it uses the SequenceFile format by default.
saveAsHadoopFile, saveAsHadoopDataset
saveAsHadoopFile
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], codec: Class[_ <: CompressionCodec]): Unit
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: JobConf = …, codec: Option[Class[_ <: CompressionCodec]] = None): Unit
saveAsHadoopFile stores the RDD in files on HDFS, using the old Hadoop API.
You can specify the outputKeyClass, the outputValueClass, and a compression format.
One file is written per partition.
saveAsHadoopDataset
def saveAsHadoopDataset(conf: JobConf): Unit
saveAsHadoopDataset saves the RDD to storage other than HDFS, such as HBase.
In the JobConf, you usually need to set or pay attention to five things:
the output path, the key class, the value class, the RDD's output format (OutputFormat), and any compression-related parameters.