A Summary of Spark Operators

Below are demos of some common Spark operators. Running them and looking at the output is a good way to become more fluent with the operators and to understand what they actually do.

Transformation operators:

1.map

    /**
      * The map operator passes each element of the RDD through a user-defined function
      * and builds a new RDD from the elements it returns.
      */
    import org.apache.spark.{SparkConf, SparkContext}   // assumed by all of the snippets below

    val conf = new SparkConf().setAppName("mapDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val numRDD = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
    numRDD.map(num => num * 2).foreach(println)
    sc.stop()

2.reduceByKey

    // reduceByKey aggregates the values that share the same key with the user-defined function.
    val conf = new SparkConf().setMaster("local").setAppName("reduceByKey")
    val sc = new SparkContext(conf)
    val datasRDD = sc.parallelize(Array(Tuple2("class1",98),
      Tuple2("class2",96),
      Tuple2("class1",90),
      Tuple2("class2",100),
      Tuple2("class1",94)))
    val result = datasRDD.map(m => (m._1, m._2)).reduceByKey((x, y) => x + y)
    result.foreach(i => println("total score of " + i._1 + ": " + i._2))

3.groupByKey

    // groupByKey groups the records by key; each key maps to an Iterable of its values.
    val conf = new SparkConf().setAppName("groupByKey").setMaster("local")
    val sc = new SparkContext(conf)
    val datas = sc.parallelize(Array(
      Tuple2("class1",98),
      Tuple2("class2",96),
      Tuple2("class1",90),
      Tuple2("class2",100),
      Tuple2("class1",94)))
    datas.groupByKey().foreach(
      i => {
        println("class: " + i._1)
        i._2.foreach(item => println(item))
        println("----------------------")
      }
    )

4.sortByKey

    // sortByKey sorts the RDD by key; passing false sorts in descending order.
    val conf = new SparkConf().setAppName("sortByKey").setMaster("local")
    val sc = new SparkContext(conf)
    val datasRDD = sc.parallelize(Array(
      Tuple2(98,"张三"),
      Tuple2(100,"李四"),
      Tuple2(92,"Jack"),
      Tuple2(96,"tom")
    ))
    datasRDD.map(m => (m._1, m._2)).sortByKey(false).foreach(i => println(i._2 + "'s score is: " + i._1))
  

5.cogroup

    // cogroup groups both RDDs by key: for every key that appears in either RDD it returns
    // (key, (Iterable of values from this RDD, Iterable of values from the other RDD)).
    // Keys with no match on one side still appear, with an empty Iterable on that side.
    val conf = new SparkConf().setAppName("cogroup").setMaster("local")
    val sc = new SparkContext(conf)
    val infoRDD = sc.parallelize(Array(
      Tuple2(1,"张三"),
      Tuple2(2,"李四"),
      Tuple2(3,"jack"),
      Tuple2(4,"tom")
    ))
    val scoresRDD = sc.parallelize(Array(
      Tuple2(1,98),
      Tuple2(2,96),
      Tuple2(3,92),
      Tuple2(5,99)
    ))
    val result = infoRDD.cogroup(scoresRDD)
    result.foreach(m => {
      println(m)
      println(m._1)
      println(m._2._1.toString())
      println(m._2._2.toString())
      println("================================")
    })

6.join

/**
  * The join operator joins two RDDs of <key, value> pairs: each pair of values whose
  * keys match is handed to the downstream processing, much like an inner join in Hive/SQL.
  * Only keys present in both RDDs appear in the result.
  */
    val conf = new SparkConf().setMaster("local").setAppName("join")
    val sc = new SparkContext(conf)
    val infoRDD = sc.parallelize(Array(
      Tuple2(1, "张三"),
      Tuple2(2, "李四"),
      Tuple2(3, "jack"),
      Tuple2(4, "tom")
    ))
    val scoresRDD = sc.parallelize(Array(
      Tuple2(1, 98),
      Tuple2(2, 96),
      Tuple2(3, 92),
      Tuple2(5, 99)
    ))
    val result = infoRDD.join(scoresRDD)
    result.foreach(i => {
      println(i)
      println("学号:" + i._1)
      println("姓名:" + i._2._1)
      println("成绩:" + i._2._2)
      println("----------------")
    }
    )

7.filter

/**
  * The filter operator evaluates a predicate on every element of the RDD: elements for
  * which it returns true are kept, the rest are dropped.
  */
    val conf = new SparkConf().setMaster("local[2]").setAppName("filter")
    val sc = new SparkContext(conf)
    val numRDD = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    numRDD.filter(num => num % 2 == 0).foreach(println)

Summary:

1). reduceByKey vs. groupByKey: reduceByKey first combines values locally within each partition and then aggregates the combined results after the shuffle, while groupByKey ships every single value across the shuffle before grouping, so reduceByKey is usually more efficient. reduceByKey returns (key, aggregated value) pairs; groupByKey returns (key, Iterable of values) pairs (see the sketch after this list).

2). cogroup vs. join: cogroup returns every key from either RDD, with an empty Iterable on the side where a key has no match (compare the outputs of the two demos above), whereas join returns only the keys that match on both sides (see the sketch after this list).
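A minimal sketch of both comparisons; the class/score pairs and the names/marks RDDs below are made up purely for illustration:

    val conf = new SparkConf().setMaster("local").setAppName("compareDemo")
    val sc = new SparkContext(conf)

    // reduceByKey vs groupByKey: same totals, but groupByKey needs an extra step
    // and moves every individual value across the shuffle.
    val scores = sc.parallelize(Array(("class1", 98), ("class2", 96), ("class1", 90)))
    scores.reduceByKey(_ + _).foreach(println)             // (class1,188), (class2,96)
    scores.groupByKey().mapValues(_.sum).foreach(println)  // same totals, usually slower

    // join vs cogroup when some keys exist on only one side.
    val names = sc.parallelize(Array((1, "leo"), (2, "jack")))
    val marks = sc.parallelize(Array((1, 600), (3, 550)))
    names.join(marks).foreach(println)     // only key 1 appears: (1,(leo,600))
    names.cogroup(marks).foreach(println)  // keys 1, 2 and 3 all appear, with empty Iterables where unmatched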

Action operators:

1.count

/**
  * The count operator returns the total number of elements in the RDD.
  */
    val conf = new SparkConf().setAppName("countDemo").setMaster("local")
    val sc = new SparkContext(conf)

    // mock data
    val datas = sc.parallelize(Array(
      Tuple2("class1","leo"),
      Tuple2("class2","jack"),
      Tuple2("class1","jen"),
      Tuple2("class2","tom"),
      Tuple2("class1","marray")
    ))
    // after reduceByKey only one record per class remains, so this counts the distinct classes
    val result = datas.map(m => (m._1, 1)).reduceByKey(_ + _).count()
    println(result)   // 2

2.countByKey

// countByKey counts the number of elements for each key and returns the result
// to the driver as a local Map[K, Long].

    val conf = new SparkConf().setAppName("countByKeyDemo").setMaster("local")
    val sc = new SparkContext(conf)

    // mock data
    val datas = sc.parallelize(Array(
      Tuple2("class1","leo"),
      Tuple2("class2","jack"),
      Tuple2("class1","jen"),
      Tuple2("class2","tom"),
      Tuple2("class1","marray")
    ))
    datas.countByKey().foreach(m => println(m._1 + " has " + m._2 + " students"))

3.collect & take

    /**
      * The collect operator pulls all elements of the RDD back to the driver program;
      * take(n) fetches only the first n elements.
      */
    val conf = new SparkConf().setAppName("collectAndTake").setMaster("local")
    val sc = new SparkContext(conf)
    val datas = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
    val collectResult = datas.map(m=>2*m).collect()
    val takeResult = datas.map(m=>2*m).take(3)
    for (elem <- collectResult) {
      println(elem)
    }
    println("----------------------------")
    for (elem <- takeResult) {
      println(elem)
    }

4.reduce

/**
  * The reduce operator aggregates all elements of the RDD:
  * the first and second elements are combined, that result is combined with the third,
  * that result with the fourth, and so on.
  */
    val conf = new SparkConf().setMaster("local").setAppName("reduce")
    val sc = new SparkContext(conf)
    val datas = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
    val result = datas.reduce((m,n) => m+n)
    println(result)

5.foreach

    // foreach runs the given function on every element of the RDD (on the executors); here it simply prints each element.
    val conf = new SparkConf().setMaster("local").setAppName("foreach")
    val sc = new SparkContext(conf)
    val datas = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
    datas.foreach(println)

Advanced operators:

1.cartesian

/**
  * cartesian computes the Cartesian product of two RDDs. For example, if each RDD holds
  * 10 elements, every element of one RDD is paired with every element of the other,
  * giving 100 pairs in total.
  */
    val conf = new SparkConf().setAppName("cartesian").setMaster("local")
    val sc = new SparkContext(conf)
    val clothRDD = sc.parallelize(Array("jacket","trench coat","shell jacket","down jacket"))
    val kuziRDD = sc.parallelize(Array("jeans","leather pants","dress pants","sweatpants"))
    val resultRDD = clothRDD.cartesian(kuziRDD)
    resultRDD.foreach(m => println(m._1 + " goes with " + m._2))
    println(resultRDD.count())   // 4 * 4 = 16 combinations

2.distinct

    // distinct removes duplicate elements; here it deduplicates the user ids extracted from the log lines.
    val conf = new SparkConf().setMaster("local").setAppName("distinct")
    val sc = new SparkContext(conf)
    val logsRDD = sc.parallelize(Array(
      "user1 2016-01-01 23:58:42",
      "user1 2016-01-01 23:58:43",
      "user1 2016-01-01 23:58:44",
      "user2 2016-01-01 12:58:42",
      "user2 2016-01-01 12:58:46",
      "user3 2016-01-01 12:58:42",
      "user4 2016-01-01 12:58:42",
      "user5 2016-01-01 12:58:42",
      "user6 2016-01-01 12:58:42",
      "user6 2016-01-01 12:58:45"
    ))
    val resultRDD = logsRDD.map(m=>m.split(" ")(0)).distinct()
    resultRDD.foreach(println)
    println(resultRDD.count())

3.coalesce & repartition

/**
  * repartition changes the number of partitions of an RDD (up or down) and always
  * performs a shuffle. coalesce is mainly used to reduce the number of partitions and
  * by default does so without a shuffle; repartition(n) is equivalent to
  * coalesce(n, shuffle = true).
  */
    import scala.collection.mutable.ArrayBuffer

    val conf = new SparkConf().setAppName("coalesce").setMaster("local")
    val sc = new SparkContext(conf)
    val datasRDD = sc.parallelize(Array("tom","jack","leo","张三","李四","王五"), 3)
    val resultRDD1=datasRDD.repartition(2)
    val resultRDD2 = datasRDD.coalesce(2)
    resultRDD1.mapPartitionsWithIndex((x, y) => {
      val arrayBuffer = ArrayBuffer[String]()
      while (y.hasNext) {
        val info = "data in partition " + (x + 1) + ": " + y.next()
        arrayBuffer += info
      }
      arrayBuffer.iterator
    }).foreach(println)
    println("-----------------------------------------")
    resultRDD2.mapPartitionsWithIndex((x, y) => {
      val arrayBuffer = ArrayBuffer[String]()
      while (y.hasNext) {
        val info = "data in partition " + (x + 1) + ": " + y.next()
        arrayBuffer += info
      }
      arrayBuffer.iterator
    }).foreach(println)
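As a side note, coalesce can also increase the partition count if it is allowed to shuffle; a minimal sketch reusing the datasRDD (3 partitions) from above:

    // Without a shuffle, coalesce cannot go above the current 3 partitions;
    // with shuffle = true it can, and repartition(5) is just shorthand for that.
    println(datasRDD.coalesce(5).getNumPartitions)                  // still 3
    println(datasRDD.coalesce(5, shuffle = true).getNumPartitions)  // 5
    println(datasRDD.repartition(5).getNumPartitions)               // 5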

4.intersection

    val conf = new SparkConf().setAppName("intersection").setMaster("local")
    val sc = new SparkContext(conf)
    val stus1 = sc.parallelize(Array("leo","jack","tom","marry"))
    val stus2 = sc.parallelize(Array("leo","jack","devid","honny"))
    val resultRDD = stus1.intersection(stus2)
    resultRDD.foreach(println)
  

5.mapPartitions

/**
  * The function passed to mapPartitions is applied once per partition, i.e. it receives
  * the entire content of a partition (as an iterator) rather than one element at a time.
  */
    import scala.collection.mutable.ArrayBuffer

    val conf = new SparkConf().setMaster("local").setAppName("mapPartitions")
    val sc = new SparkContext(conf)
    val stusRDD = sc.parallelize(Array("leo", "jack", "tom", "marry"), 2)
    // local lookup table mapping each student to a score
    val scores = Map(("leo", 600), ("jack", 620), ("tom", 650), ("marry", 500), ("jen", 550))
    val resultRDD = stusRDD.mapPartitions(m => {
      val result = ArrayBuffer[Int]()
      var score = 0
      while (m.hasNext) {
        score = scores(m.next())
        result += score
      }
      result.iterator

    })
    resultRDD.foreach(println)
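The usual reason to prefer mapPartitions over map is that per-partition setup work runs only once per partition instead of once per element. A rough sketch continuing with the sc from the demo above; DecimalFormat here just stands in for any expensive setup, such as opening a database connection:

    val nums = sc.parallelize(1 to 10, 2)
    val formatted = nums.mapPartitions { iter =>
      val fmt = new java.text.DecimalFormat("000")   // created once per partition, not once per element
      iter.map(n => fmt.format(n.toLong))
    }
    formatted.foreach(println)   // 001, 002, ..., 010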

6.mapPartitionsWithIndex

    val conf = new SparkConf().setAppName("mapPartitionsWithIndex").setMaster("local")
    val sc = new SparkContext(conf)
    val stusRDD = sc.parallelize(Array("leo","jack","tom","marry","jenny"),2)
    val resultRDD = stusRDD.mapPartitionsWithIndex((m,n)=>{
      val result=ArrayBuffer[String]()
      while(n.hasNext){
        val stuName=n.next()
        val info="学生"+stuName+"在"+(m+1)+"班"
        result+=info
      }
      result.iterator
    })
    resultRDD.foreach(println)

7.mapValues

    val conf = new SparkConf().setMaster("local").setAppName("mapValues")
    val sc = new SparkContext(conf)
    val values = sc.parallelize(Array(
      Tuple2("class2",96),
      Tuple2("class1",90),
      Tuple2("class2",100),
      Tuple2("class1",94)
    ))
    val result = values.mapValues(x=>x*x).collect()
    result.foreach(println)

8.subtract

    // Set difference: intRDD1 is List(1,2,3,5,5); subtracting intRDD2 List(5,6) removes the 5s, so the result is 1, 2, 3.
    val conf = new SparkConf().setMaster("local").setAppName("subtract")
    val sc = new SparkContext(conf)
    val intRDD1 = sc.parallelize(List(1,2,3,5,5))
    val intRDD2 = sc.parallelize(List(5,6))
    val resultRDD = intRDD1.subtract(intRDD2)
    resultRDD.foreach(println)

9.union

    // union (equivalent to "++") concatenates two RDDs without removing duplicate elements.
    // It does not shuffle; in the common case the result simply keeps the partitions of both
    // inputs, so its partition count is the sum of the two.
    val conf = new SparkConf().setAppName("union").setMaster("local")
    val sc = new SparkContext(conf)
    val stusRDD1 = sc.parallelize(Array("leo","jack","tom"))
    val stusRDD2 = sc.parallelize(Array("tom","marry","jenny"))
    val resultRDD = stusRDD1.union(stusRDD2)
    resultRDD.foreach(println)
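A quick way to check the partition behaviour mentioned in the comment above (getNumPartitions just reports how many partitions each RDD has):

    println(stusRDD1.getNumPartitions)   // partitions of the first input
    println(stusRDD2.getNumPartitions)   // partitions of the second input
    println(resultRDD.getNumPartitions)  // here, the sum of the two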
   

 
