Summary of Spark Operators

Latest recommended article published on 2023-03-01 11:17:19

治愈爱吃肉

Category: Big Data. Tags: spark, big data, scala

Article link: https://blog.csdn.net/qq_40565265/article/details/112001969

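Transformation Operators

Transformation operators do not submit a job. They describe the intermediate processing steps of a job and are evaluated lazily: computation is only triggered when an **action operator** runs.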
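
The lazy behaviour can be illustrated without Spark. The sketch below is only an analogy in plain Scala, not Spark code: an `Iterator` stands in for a chain of transformations and `toList` stands in for an action. The names `LazySketch`, `calls` and `pipeline` are made up for this illustration.

```scala
object LazySketch {
  // Counts how many times the mapping function has actually run.
  var calls = 0

  // Describes the work (like a transformation); nothing runs yet.
  def pipeline(data: List[Int]): Iterator[Int] =
    data.iterator.map { x => calls += 1; x * 2 }
}
```

Calling `LazySketch.pipeline(List(1, 2, 3))` leaves `calls` at 0. Only consuming the iterator, for example with `toList`, runs the function once per element.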

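Value Types

map

Returns a new RDD in which each element is the result of applying func to one input element.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val listRdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5))
    // 4. Double every element
    val rsRdd: RDD[Int] = listRdd.map(_ * 2)
    // 5. Print the elements (output order across partitions is not guaranteed)
    rsRdd.foreach(println(_))
  }
}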

mapPartitions

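Similar to map, but runs separately on each partition of the RDD. If the RDD has N elements in M partitions, the function passed to map is called N times, while the function passed to mapPartitions is called M times: once per partition, with each call processing all elements of that partition.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val listRdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5))
    // 4. Double every element, one partition (iterator) at a time
    val rsRdd: RDD[Int] = listRdd.mapPartitions(iter => iter.map(_ * 2))
    // 5. Print
    rsRdd.foreach(println)
  }
}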
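
The difference in call counts can be checked with a small model. This is a sketch in plain Scala collections, assuming an RDD can be pictured as a sequence of partitions; `MapPartitionsSketch`, `mapLike` and `mapPartitionsLike` are hypothetical helpers, not Spark APIs.

```scala
object MapPartitionsSketch {
  // ASSUMPTION: an RDD is modelled as a sequence of partitions.
  type Partitions[T] = Seq[Seq[T]]

  // map-style: f is invoked once per element.
  // Returns the mapped data and the number of calls to f.
  def mapLike[T, U](parts: Partitions[T])(f: T => U): (Partitions[U], Int) = {
    var calls = 0
    val out = parts.map(_.map { x => calls += 1; f(x) })
    (out, calls)
  }

  // mapPartitions-style: f is invoked once per partition with an iterator.
  // Returns the mapped data and the number of calls to f.
  def mapPartitionsLike[T, U](parts: Partitions[T])(
      f: Iterator[T] => Iterator[U]): (Partitions[U], Int) = {
    var calls = 0
    val out = parts.map { p => calls += 1; f(p.iterator).toList }
    (out, calls)
  }
}
```

With five elements spread over two partitions, both helpers produce the same data, but `mapLike` reports five calls and `mapPartitionsLike` reports two. This is why per-partition setup work, such as opening a connection, is usually placed inside mapPartitions.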

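mapPartitionsWithIndex(`func`)

Similar to mapPartitions, but func takes an additional integer argument holding the index of the partition. When run on an RDD of type T, func must therefore have the type (Int, Iterator[T]) => Iterator[U].

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD with 2 partitions
    val listRdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5), 2)
    // 4. Pair every element with the index of the partition it lives in
    val rsRdd: RDD[(Int, Int)] = listRdd.mapPartitionsWithIndex((index, iter) => {
      iter.map((index, _))
    })
    // 5. Print (partitionIndex, element) pairs
    rsRdd.foreach(println)
  }
}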

flatMap(`func`)

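Applies func to every element and then flattens the results, so each input element can produce zero or more output elements.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD of lines
    val listRdd: RDD[String] = sc.makeRDD(List("Hello World", "Hello Spark"))
    // 4. Split each line into words and flatten the result
    val rsRdd: RDD[String] = listRdd.flatMap(_.split(" "))
    // 5. Print
    rsRdd.foreach(println)
  }
}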

glom

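Collects the elements of each partition into an array, producing a new RDD of type RDD[Array[T]].

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD with 3 partitions
    val listRdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8), 3)
    // 4. Turn every partition into one array
    val rsRdd: RDD[Array[Int]] = listRdd.glom()
    // 5. Print one line per partition
    rsRdd.foreach(item => {
      println(item.mkString(","))
    })
  }
}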
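
To see what glom would group together, it helps to know which elements land in which partition. The sketch below assumes the list is cut into contiguous ranges of near-equal size; that slicing rule is an assumption made for illustration and is not stated in this article. `GlomSketch.slice` is a hypothetical helper in plain Scala.

```scala
object GlomSketch {
  // Splits data into numSlices contiguous ranges of near-equal size.
  // ASSUMPTION: illustrative slicing rule, not taken from the article.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices > 0, "numSlices must be positive")
    (0 until numSlices).map { i =>
      // Integer arithmetic spreads the remainder over the later slices.
      val start = (i.toLong * data.length / numSlices).toInt
      val end = ((i + 1).toLong * data.length / numSlices).toInt
      data.slice(start, end)
    }
  }
}
```

Under this assumption, `GlomSketch.slice(1 to 8, 3)` gives the groups 1,2 / 3,4,5 / 6,7,8, which is the shape of data that glom exposes as one array per partition.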

groupBy

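Groups elements by the return value of the supplied function; the elements that share a key are placed in one iterable.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val listRdd: RDD[Int] = sc.makeRDD(1 to 20)
    // 4. Group by remainder modulo 2 (key 0 = even numbers, key 1 = odd numbers)
    val rsRdd: RDD[(Int, Iterable[Int])] = listRdd.groupBy(_ % 2)
    // 5. Print each key with its group
    rsRdd.foreach(t => println(t._1, t._2))
  }
}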

filter

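Returns a new RDD made up of the input elements for which func returns true.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val listRdd: RDD[Int] = sc.makeRDD(1 to 10)
    // 4. Keep only the even numbers
    val rsRdd: RDD[Int] = listRdd.filter(_ % 2 == 0)
    // 5. Print
    rsRdd.foreach(println)
  }
}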

distinct

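Removes duplicate elements from the RDD. The optional argument sets the number of partitions of the result.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD containing duplicates
    val listRdd: RDD[Int] = sc.makeRDD(List(1, 2, 1, 2, 1, 2))
    // 4. Remove duplicates and store the result in 2 partitions
    val rsRdd: RDD[Int] = listRdd.distinct(2)
    // 5. Print
    rsRdd.foreach(println)
  }
}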

repartition(`numPartitions`)

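Repartitions the RDD to the given number of partitions by randomly reshuffling all the data over the network.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD with 4 partitions
    val listRdd: RDD[Int] = sc.makeRDD(1 to 10, 4)
    println("repartition before:" + listRdd.partitions.size)
    // 4. Repartition into 2 partitions
    val rsRdd: RDD[Int] = listRdd.repartition(2)
    // 5. Print the new partition count
    println("repartition after:" + rsRdd.partitions.size)
  }
}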

sortBy(`func`)

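Applies func to each element first and sorts the elements by comparing the results.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val listRdd: RDD[Int] = sc.makeRDD(1 to 10)
    // 4. Sort in ascending order (the default)
    val sortAsc: Array[Int] = listRdd.sortBy(x => x).collect()
    println(sortAsc.mkString(","))
    // 5. Sort in descending order (ascending = false)
    val sortDesc: Array[Int] = listRdd.sortBy(x => x, false).collect()
    println(sortDesc.mkString(","))
  }
}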

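Interaction Between Two Value-Type RDDs

union

Returns a new RDD containing the union of the source RDD and the argument RDD. Duplicates are not removed.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create two RDDs
    val list1: RDD[Int] = sc.makeRDD(1 to 2)
    val list2: RDD[Int] = sc.makeRDD(2 to 4)
    // 4. Compute the union (the shared element 2 appears twice)
    val rsRdd: Array[Int] = list1.union(list2).collect()
    // 5. Print
    rsRdd.foreach(println)
  }
}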

subtract

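Difference: returns the elements of the source RDD that do not appear in the argument RDD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create two RDDs
    val list1: RDD[Int] = sc.makeRDD(1 to 10)
    val list2: RDD[Int] = sc.makeRDD(5 to 10)
    // 4. Compute the difference (elements of list1 that are not in list2)
    val rsRdd: Array[Int] = list1.subtract(list2).collect()
    // 5. Print
    rsRdd.foreach(println)
  }
}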
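
Because the difference is one-directional, swapping the two RDDs changes the answer. The following sketch models that behaviour with plain Scala collections rather than Spark; `SubtractSketch.subtractLike` is a hypothetical helper introduced only for this illustration.

```scala
object SubtractSketch {
  // Keeps the elements of `left` that do not occur in `right`.
  // Elements that exist only in `right` are never part of the result.
  def subtractLike[T](left: Seq[T], right: Seq[T]): Seq[T] = {
    val exclude = right.toSet
    left.filterNot(exclude)
  }
}
```

For the data used above, `subtractLike(1 to 10, 5 to 10)` keeps 1, 2, 3, 4, while `subtractLike(5 to 10, 1 to 10)` is empty.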

intersection

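Intersection: returns a new RDD containing the elements that appear in both the source RDD and the argument RDD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create two RDDs
    val list1: RDD[Int] = sc.makeRDD(1 to 10)
    val list2: RDD[Int] = sc.makeRDD(5 to 10)
    // 4. Compute the intersection
    val rsRdd: Array[Int] = list1.intersection(list2).collect()
    // 5. Print
    rsRdd.foreach(println)
  }
}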

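Key-Value Types

groupByKey

groupByKey also operates on each key, but it only gathers the values of a key into a single sequence, giving an RDD of type RDD[(K, Iterable[V])]. Any aggregation has to be applied to that sequence afterwards.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD of lines
    val list: RDD[String] = sc.makeRDD(List("Hello World", "Hello Scala", "Spark Spark Spark"))
    // 4. Split into words and map each word to (word, 1)
    val mapRdd: RDD[(String, Int)] = list.flatMap(_.split(" ")).map((_, 1))
    // 5. Group the 1s of each word
    val groupByKeyRdd: RDD[(String, Iterable[Int])] = mapRdd.groupByKey()
    groupByKeyRdd.foreach(println)
    // 6. Aggregate: every value is 1, so the group size is the word count
    val wcRdd: RDD[(String, Int)] = groupByKeyRdd.map(t => (t._1, t._2.size))
    wcRdd.foreach(println)
  }
}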

reduceByKey

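Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated with the given reduce function. The number of reduce tasks can be set through the optional second argument.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD of lines
    val list: RDD[String] = sc.makeRDD(List("Hello World", "Hello Scala", "Spark Spark Spark"))
    // 4. Split into words and map each word to (word, 1)
    val mapRdd: RDD[(String, Int)] = list.flatMap(_.split(" ")).map((_, 1))
    // 5. Word count: sum the 1s of each word
    val reduceMapRdd: RDD[(String, Int)] = mapRdd.reduceByKey(_ + _)
    // 6. Print
    reduceMapRdd.foreach(println)
  }
}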
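
Both word counts above give the same result, but reduceByKey can merge values inside each partition before data is exchanged, whereas groupByKey moves every pair. The sketch below models that two-step aggregation with plain Scala collections; it is not Spark's implementation, and `ReduceByKeySketch`, `combine` and `reduceByKeyLike` are hypothetical names.

```scala
object ReduceByKeySketch {
  // ASSUMPTION: a pair RDD is modelled as a sequence of partitions.
  type Partitions[K, V] = Seq[Seq[(K, V)]]

  // Merges the values of equal keys with f, keeping one value per key.
  private def combine[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  // Step 1: combine inside each partition.
  // Step 2: combine the partial results of all partitions.
  // Returns the final map and the number of records left after step 1.
  def reduceByKeyLike[K, V](parts: Partitions[K, V])(
      f: (V, V) => V): (Map[K, V], Int) = {
    val partial = parts.map(p => combine(p)(f).toSeq)
    val exchanged = partial.flatten
    (combine(exchanged)(f), exchanged.size)
  }
}
```

If the seven (word, 1) pairs from the example are placed in two partitions, one holding Hello, World, Hello, Scala and the other holding Spark three times, only four partial records remain after the first step instead of seven.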

sortByKey

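Called on an RDD of (K, V) pairs whose key type K can be ordered (in Scala, an implicit Ordering[K] must be available). Returns an RDD of (K, V) pairs sorted by key in ascending or descending order.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create a pair RDD
    val listRdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 3), (1, 2), (1, 4), (2, 3), (3, 6), (3, 8)))
    // 4. Sort by key in ascending order
    val sortAsc: Array[(Int, Int)] = listRdd.sortByKey(true).collect()
    println("Ascending:" + sortAsc.mkString(","))
    // 5. Sort by key in descending order
    val sortDesc: Array[(Int, Int)] = listRdd.sortByKey(false).collect()
    println("Descending:" + sortDesc.mkString(","))
  }
}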

join

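Called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (V, W)) that pairs up all elements sharing the same key.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create two pair RDDs with the same keys
    val r1: RDD[(Int, String)] = sc.makeRDD(Array((1, "a"), (2, "b"), (3, "c")))
    val r2: RDD[(Int, String)] = sc.makeRDD(Array((1, "a"), (2, "b"), (3, "c")))
    // 4. Join on the key
    val rsRdd: Array[(Int, (String, String))] = r1.join(r2).collect()
    // 5. Print
    rsRdd.foreach(println)
  }
}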
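
The example above has exactly one value per key on each side, which hides two properties of an inner join: keys present on only one side are dropped, and repeated keys produce every combination. The plain-Scala sketch below illustrates this; `JoinSketch.joinLike` is a hypothetical helper and is not how Spark executes a join.

```scala
object JoinSketch {
  // Inner join on the key: every (k, v) on the left is paired
  // with every (k, w) on the right that has the same key.
  def joinLike[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))
}
```

Joining `Seq(1 -> "a", 1 -> "b", 2 -> "c")` with `Seq(1 -> "x", 3 -> "y")` keeps only key 1 and yields two pairs, one for each left-hand value.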

Action Operators

Operators of this kind cause the **SparkContext to submit a Job**.

reduce

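Aggregates all elements of the RDD with func: data is aggregated within each partition first and then across partitions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val rdd: RDD[Int] = sc.makeRDD(1 to 5)
    // 4. Sum all elements
    val rs: Int = rdd.reduce(_ + _)
    // 5. Print the result
    println(rs)
  }
}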
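
Because aggregation happens per partition and then across partitions, the function passed to reduce should be associative and commutative; otherwise the result depends on how the data happens to be partitioned. The following plain-Scala sketch models the two phases; `ReduceSketch.reduceLike` is a hypothetical helper, not a Spark API.

```scala
object ReduceSketch {
  // Phase 1: reduce every non-empty partition on its own.
  // Phase 2: reduce the partial results.
  // Throws if all partitions are empty, since there is nothing to reduce.
  def reduceLike[T](parts: Seq[Seq[T]])(f: (T, T) => T): T =
    parts.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)
}
```

With addition, splitting 1 to 5 as (1, 2 | 3, 4, 5) or as (1, 2, 3 | 4, 5) gives 15 either way. With subtraction, which is not associative, the same two splits give 5 and -3.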

collect

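Returns all elements of the dataset to the driver program as an array.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val listRdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5))
    // 4. Bring all elements back to the driver as an array
    val ints: Array[Int] = listRdd.collect()
    println(ints.mkString(","))
  }
}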

count

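Returns the number of elements in the RDD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val rdd: RDD[Int] = sc.makeRDD(1 to 10)
    // 4. Count the elements
    val rddCount: Long = rdd.count()
    println(rddCount)
  }
}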

first

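Returns the first element of the RDD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val rdd: RDD[Int] = sc.makeRDD(1 to 10)
    // 4. Return the first element of the RDD
    val firstNum: Int = rdd.first()
    println(firstNum)
  }
}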

take

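Returns an array made up of the first N elements of the RDD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val rdd: RDD[Int] = sc.makeRDD(1 to 10)
    // 4. Take the first 3 elements (no sorting is applied)
    val first3: Array[Int] = rdd.take(3)
    first3.foreach(println)
  }
}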

takeOrdered

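Returns an array made up of the first N elements of the RDD after sorting.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create an RDD
    val rdd: RDD[Int] = sc.makeRDD(List(5, 3, 2, 1, 7, 6))
    // 4. Take the 3 smallest elements (natural ascending order)
    val smallest3: Array[Int] = rdd.takeOrdered(3)
    println(smallest3.mkString(","))
  }
}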
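
takeOrdered picks its order from an implicit Ordering, so supplying a reversed ordering returns the largest elements instead of the smallest. The plain-Scala sketch below models that behaviour by sorting locally; `TakeOrderedSketch.takeOrderedLike` is a hypothetical helper, and sorting the whole collection is a simplification made for illustration.

```scala
object TakeOrderedSketch {
  // Returns the first n elements of data under the given ordering.
  // Natural order yields the smallest n; a reversed order yields the largest n.
  def takeOrderedLike[T](data: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    data.sorted(ord).take(n)
}
```

For the list used above, the natural ordering gives 1, 2, 3, and passing `Ordering[Int].reverse` explicitly gives 7, 6, 5.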

countByKey

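For an RDD of (K, V) pairs, returns a map of (K, Long) giving the number of elements for each key.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create a local Spark configuration
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // 2. Create the Spark context
    val sc = new SparkContext(config)
    // 3. Create a pair RDD
    val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 3), (1, 2), (1, 4), (2, 3), (3, 6), (3, 8)))
    // 4. Count the elements of each key
    val countKey: collection.Map[Int, Long] = rdd.countByKey()
    // 5. Print
    println(countKey)
  }
}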
