Spark is billed as an in-memory computing framework, but only most of its processing actually happens in memory; shuffle, for example, still goes through disk, and that is one of the factors affecting overall Spark performance. This section walks through some commonly used Spark operators. The main differences between Actions and Transformations are: 1. an Action triggers a job; 2. an Action's result is either returned to the client or written to storage such as HDFS, whereas a Transformation always returns an RDD.
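As a minimal sketch of that distinction (assuming `sc` is an existing SparkContext): a Transformation such as map only builds a new RDD lazily, while an Action such as reduce or saveAsTextFile actually triggers the job.

    val nums = sc.parallelize(Array(1, 2, 3, 4))

    // Transformation: returns a new RDD, nothing is computed yet (lazy).
    val doubled = nums.map(_ * 2)

    // Action: triggers the job and returns a value to the driver.
    val total = doubled.reduce(_ + _)   // 20

    // Action: writes the result to storage such as HDFS (path is a placeholder).
    // doubled.saveAsTextFile("hdfs://...")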
- Transformation operators
- map: applies the given function to every element of the RDD, e.g. `sc.parallelize(d).map(w => w + 1).foreach(println)`
- flatMap: like map, but the function returns a collection of zero or more elements for each input element and the results are flattened into a single RDD, e.g. `sc.parallelize(d).flatMap(line => line.split(" ")).foreach(println)`
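To make the map/flatMap difference concrete, here is a small sketch (again assuming an existing `sc`); printed order may vary:

    val lines = sc.parallelize(Array("hadoop spark", "spark flume"))

    // map: one output element per input element -> RDD[Array[String]]
    lines.map(line => line.split(" ")).foreach(arr => println(arr.mkString(",")))
    // hadoop,spark
    // spark,flume

    // flatMap: the per-element arrays are flattened -> RDD[String]
    lines.flatMap(line => line.split(" ")).foreach(println)
    // hadoop / spark / spark / flume, one word per line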
- groupBy vs groupByKey:
  1. groupBy takes a function that extracts the grouping key, e.g. `groupBy(w => w._1)` (the element `w` can be any n-tuple); it returns `(key, CompactBuffer(w1, w2, ...))` containing every element with that key.

    sc.parallelize(d).map(w => (w, 1, 2, 3)).groupBy(w => w._2).foreach(println)
    // (1,CompactBuffer((A,1,2,3), (B,1,2,3), (A,1,2,3), (C,1,2,3)))

  2. groupByKey takes no function, but the data must already be in (key, value) form (a 2-tuple); it returns `(key, CompactBuffer(v1, v2, ...))`.

    sc.parallelize(d).map(w => (w, (1, 2))).groupByKey().foreach(println)
    // (B,CompactBuffer((1,2)))
    // (A,CompactBuffer((1,2), (1,2)))
    // (C,CompactBuffer((1,2)))

  sortByKey also works on (key, value) data; passing `false` sorts by key in descending order:

    val a = sc.parallelize(d).map(w => (w, (1, 2))).sortByKey(false)
    a.foreach(println)
- join / leftOuterJoin: the inputs must be in (key, value) form; the left RDD's keys are matched against the right RDD's keys, and values for the same key are combined, so the result is `(key, (v1, v2))` (for leftOuterJoin the right-hand value is wrapped in an Option; see the small sketch after the union notes below).

    val d  = Array(("A",(10,11)), ("B",(20,12)), ("C",(30,13)), ("C",(30,13)), ("D",(30,13)))
    val d1 = Array(("A","令狐冲"), ("B","张无忌"), ("C","老顽童"), ("C","老顽童1"))
    val s1 = sc.parallelize(d)
    val s2 = sc.parallelize(d1)
    s1.leftOuterJoin(s2).foreach(println)

- cogroup: groups both RDDs by key and returns `(key, (Iterable of the first RDD's values for that key, Iterable of the second RDD's values for that key))`.

    val d  = Array(("A",(10,11)), ("B",(20,12)), ("B",(20,12)), ("C",(30,13)), ("D",(30,13)))
    val d1 = Array(("A","令狐冲"), ("B","张无忌"), ("C","老顽童"), ("B","张无忌2"))
    val s1 = sc.parallelize(d)
    val s2 = sc.parallelize(d1)
    s1.cogroup(s2).foreach(println)

- union: concatenates two RDDs without removing duplicates.

    val d  = Array(1, 2, 3, 4, 5)
    val d1 = Array(4, 5, 6, 7)
    val s1 = sc.parallelize(d)
    val s2 = sc.parallelize(d1)
    s1.union(s2).foreach(println)
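Back to leftOuterJoin: because the right-hand side comes back as an Option, a common follow-up is to unwrap it with getOrElse. A minimal self-contained sketch (assuming `sc`):

    val left  = sc.parallelize(Array(("A", 10), ("B", 20), ("D", 40)))
    val right = sc.parallelize(Array(("A", "令狐冲"), ("B", "张无忌")))

    // "D" exists only on the left, so its right-hand side is None.
    left.leftOuterJoin(right)
      .map { case (k, (l, r)) => (k, l, r.getOrElse("no match")) }
      .foreach(println)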
- intersection: returns the elements that appear in both RDDs.

    val d  = Array(("A","令狐冲"), ("B","张无忌"))
    val d1 = Array(("A","令狐冲"), ("B","张无忌"), ("C","老顽童"), ("C","老顽童1"))
    val s1 = sc.parallelize(d)
    val s2 = sc.parallelize(d1)
    s1.intersection(s2).foreach(println)

- distinct: removes duplicate elements.

    val d1 = Array(("A","令狐冲"), ("B","张无忌"), ("C","老顽童"), ("C","老顽童1"), ("A","令狐冲"))
    val s2 = sc.parallelize(d1)
    s2.distinct().foreach(println)

- cartesian: Cartesian product of the two RDDs.

    val d  = Array(1, 2, 3)
    val d1 = Array(4, 5, 6)
    val s1 = sc.parallelize(d)
    val s2 = sc.parallelize(d1)
    s1.cartesian(s2).foreach(println)
    // (1,4) (1,5) (1,6) (2,4) (2,5) (2,6) (3,4) (3,5) (3,6)
- mapPartitions vs map: they do similar work, but differ in two ways:
  1. map iterates one element at a time, while mapPartitions iterates one partition at a time. If each element needs a database connection, mapPartitions lets you open one connection per partition, whereas map would effectively create one connection per element (a sketch of that pattern follows after the repartition/coalesce notes below).
  2. mapPartitions handles a whole partition at once, so consider whether memory is sufficient.

    val d1 = Array(1, 2, 3, 4)
    val s1 = sc.parallelize(d1, 2)
    s1.mapPartitions { t =>
      val t1 = scala.collection.mutable.ListBuffer.empty[Int]
      for (i <- t) {
        t1 += (i + 10)
      }
      t1.iterator
    }.foreach(println)

- repartition vs coalesce: both change the number of partitions; they differ in whether a shuffle (wide dependency) or a narrow dependency is used.

    val d1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val s1 = sc.parallelize(d1)
    s1.foreach(println)
    s1.repartition(2).foreach(println)

  1. repartition always shuffles; with the data above the two partitions come out as (1,3,5,7,9) and (2,4,6,8,10), since the target partition is `(hashCode & Int.MaxValue) % numPartitions`.

    val d1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val s1 = sc.parallelize(d1)
    s1.foreach(println)
    s1.coalesce(2, true).foreach(println)

  2. coalesce takes a shuffle flag: with `true` the data is shuffled as above, giving (1,3,5,7,9) and (2,4,6,8,10); with `false` partitions are merged without a shuffle and the order stays (1,2,3,4,5,6,7,8,9,10).

  Conclusions, with N = original number of partitions and M = target number:
  1. N < M: shuffle must be set to true (wide dependency).
  2. N > M with a moderate difference (e.g. N = 1000, M = 100): set shuffle to false (narrow dependency).
  3. N >> M (e.g. N = 100, M = 1): setting shuffle to true usually performs better; with false everything runs in one stage and the lack of parallelism hurts performance.
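Returning to the mapPartitions point above, the database-connection scenario is the usual reason to prefer it over map. The sketch below only marks where the per-partition setup and teardown would go; the "connection" is left as comments rather than a real client:

    val nums = sc.parallelize(Array(1, 2, 3, 4), 2)

    nums.mapPartitions { iter =>
      // per-partition setup: e.g. open ONE database connection here
      val results = iter.map { i =>
        // per-element work that would reuse that single connection
        i * 10
      }.toList                // drain the iterator before tearing down
      // per-partition teardown: close the connection here
      results.iterator
    }.foreach(println)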
- sample: random sampling; `withReplacement` controls whether sampling is with replacement, `fraction` is the expected sampling ratio (not exact), and `seed` is the random seed.

- aggregateByKey: the first argument is the initial (zero) value; the second is the function for map-side local aggregation (like a combiner); the third is the function for the global aggregation (like a reduce).

    val d1 = Array("line hadoop spark hadoop", "spark we")
    val s1 = sc.parallelize(d1)
    val r1 = s1.flatMap(line => line.split(" "))
    r1.map(w => (w, 1)).aggregateByKey(10)(_ + _, _ + _).foreach(println)
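A case where the two functions differ makes that split clearer. This is only an illustrative sketch (assuming `sc`; the exact result depends on how elements land in the two partitions): within each partition we keep the maximum value per key, and across partitions we sum those per-partition maxima.

    val pairs = sc.parallelize(Array(("a", 3), ("a", 1), ("b", 7), ("a", 5), ("b", 2)), 2)

    // seqOp: per-partition max for each key; combOp: sum of the per-partition results.
    pairs.aggregateByKey(0)((m, v) => math.max(m, v), (x, y) => x + y).foreach(println)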
- mapPartitionsWithIndex: the only difference from mapPartitions is that the function also receives the partition index.

    val d1 = Array(1, 2, 3, 4, 5, 6)
    val s1 = sc.parallelize(d1, 2)
    s1.mapPartitionsWithIndex { (index, p) =>
      val buffer = scala.collection.mutable.ListBuffer.empty[String]
      for (i <- p) {
        buffer += index + "->" + i
      }
      buffer.iterator
    }.foreach(println)

- repartitionAndSortWithinPartitions:
  1. The data must be in (key, value) form.
  2. It repartitions the data first and then sorts within each resulting partition (each partition is ordered, but the data set as a whole is not).
  3. It is better than calling repartition followed by sortByKey, because the sort is performed inside the shuffle itself.

    val d1 = Array(("A",(3,8)), ("B",(3,8)), ("C",(3,8)), ("E",(3,8)))
    val s1 = sc.parallelize(d1)
    s1.repartitionAndSortWithinPartitions(new MyPartitioner(2)).foreach(println)
    import org.apache.spark.Partitioner

    // Custom partitioner used by repartitionAndSortWithinPartitions above:
    // keys are assigned to partitions by their hash code.
    class MyPartitioner(partitions: Int) extends Partitioner {
      require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

      override def numPartitions: Int = partitions

      override def getPartition(key: Any): Int = {
        // Note: keys with a negative hashCode would need a non-negative mod here.
        key.hashCode() % numPartitions
      }
    }
- Actions (collect, take, countByKey, takeSample, takeOrdered (sorts ascending and takes the first N), top (sorts descending and takes the first N))
  Examples of the common Actions (each snippet stands on its own):

    // reduce: aggregates all elements and returns the result to the driver.
    val d1 = Array(1, 2, 3, 4, 5)
    val s1 = sc.parallelize(d1)
    println(s1.reduce(_ + _))

    // collect: brings the whole RDD back to the driver as an Array.
    val d1 = Array(1, 2, 3, 4, 5)
    val d2 = Array(5, 6, 7, 8)
    val s1 = sc.parallelize(d1)
    val s2 = sc.parallelize(d2)
    for (i <- s1.union(s2).collect()) {
      println(i)
    }

    // take(n): returns the first n elements to the driver.
    val d1 = Array(1, 2, 3, 4, 5)
    val d2 = Array(5, 6, 7, 8)
    val s1 = sc.parallelize(d1)
    val s2 = sc.parallelize(d2)
    for (i <- s1.union(s2).take(3)) {
      println(i)
    }

    // countByKey: note the difference from reduceByKey -- reduceByKey is a
    // Transformation that returns an RDD, while countByKey is an Action that
    // returns a Map to the driver.
    val d1 = Array("hadoop spark", "spark flume")
    val s1 = sc.parallelize(d1)
    for (i <- s1.flatMap(line => line.split(" ")).map(w => (w, 1)).countByKey()) {
      println(i)
    }

    // takeSample: returns a random sample of the requested size to the driver.
    val d1 = Array(1, 2, 3, 4, 5)
    val s1 = sc.parallelize(d1)
    for (i <- s1.takeSample(false, 2)) {
      println(i)
    }
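  takeOrdered and top, mentioned above, behave as follows; a minimal sketch (assuming `sc`):

    val nums = sc.parallelize(Array(5, 1, 4, 2, 3))

    // takeOrdered: sorts ascending and returns the first n elements.
    nums.takeOrdered(3).foreach(println)   // 1, 2, 3

    // top: sorts descending and returns the first n elements.
    nums.top(3).foreach(println)           // 5, 4, 3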