Spark Operators: Transformations


An introduction to the commonly used Spark Transformations operators.

map(func)
  Return a new distributed dataset formed by passing each element of the source through a function func.
  (Applies a function to every element of the RDD and returns a new RDD.)

filter(func)
  Return a new dataset formed by selecting those elements of the source on which func returns true.
  (Evaluates each element of the RDD and returns only the elements that satisfy the condition.)

flatMap(func)
  Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
  (Like map, but each element may produce zero or more output elements.)

mapPartitions(func)
  Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
  (Like map, but func is applied to each partition as a whole.)

groupByKey([numPartitions])
  When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
  (Groups the elements by key and returns (Key, Iterable<Value>) pairs; see the aggregateByKey sketch after this list.)

reduceByKey(func, [numPartitions])
  When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
  (Reduces the values of each key with the given function.)

sortByKey([ascending], [numPartitions])
  When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
  (Sorts the dataset by key.)
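
The groupByKey note above recommends reduceByKey or aggregateByKey whenever the goal is a per-key aggregation. As a minimal sketch (the object name and sample data below are made up for illustration and are not part of the original demo), aggregateByKey can compute a per-key sum and count in a single pass and then derive the average, without materializing the full list of values per key the way groupByKey does:

import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("AggregateByKeyApp").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)), 2)
    // aggregateByKey(zeroValue)(seqOp, combOp):
    //   seqOp  merges one value into the per-partition accumulator (sum, count)
    //   combOp merges accumulators coming from different partitions
    val sumCount = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b)   => (a._1 + b._1, a._2 + b._2)
    )
    // average per key; prints (a,3.0) and (b,3.0), order may vary
    sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }.foreach(println)
    sc.stop()
  }
}

Like reduceByKey, aggregateByKey combines values on the map side before the shuffle, which is why the documentation prefers it over groupByKey for sums and averages.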

Demo

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Spark Transformations demo
  */
object TransformationsApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TransformationsApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    // create a dataset
    val data1 = sc.parallelize(Array("a","b","c","d","e"),2)
    // map demo: operate on every element, returning a tuple for each
    //val mapData = data1.map((_,1)).foreach(println)
    /*
     (a,1)
     (b,1)
     (c,1)
     (d,1)
     (e,1)
      */
    // filter demo: keep only the elements that satisfy the predicate
    //val filterData = data1.filter(x => x == "a").foreach(println) // prints: a

    val data2 = sc.parallelize(Array(Array("a","b","c","d","e"),Array("q","w","r")))
    //val mapData2 = data2.map((_,1)).foreach(println(_))
    /*
    ([Ljava.lang.String;@5c4cd0b8,1)
    ([Ljava.lang.String;@b40de16,1)
     */
    //val flatMapData = data2.flatMap(_.map((_,1))).foreach(println(_))
    /*
    (a,1)
    (b,1)
    (c,1)
    (d,1)
    (e,1)
    (q,1)
    (w,1)
    (r,1)
    */
    // Comparing the two results shows that flatMap first flattens the elements and then applies the map function.

//    val mapPartitions=data1.mapPartitions(x=>{
//      x.map((_,1))
//    }).foreach(println(_))
    /*
    (a,1)
    (c,1)
    (b,1)
    (d,1)
    (e,1)
     */
    // The result is similar to map's, but mapPartitions operates on each partition.
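    // The practical difference from map: mapPartitions calls func once per
    // partition with an Iterator, so per-partition setup work (for example
    // opening a database connection) is paid once per partition instead of
    // once per element. A minimal sketch; "prefix" is only an illustrative
    // stand-in for such setup and is not part of the original demo:
//    data1.mapPartitions { iter =>
//      val prefix = "partition-"   // built once per partition
//      iter.map(x => prefix + x)   // reused for every element in the partition
//    }.foreach(println(_))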
    val data3 = sc.parallelize(Array("a","b","c","d","e","a","a","d","d"),1)
    // group by key, returning (Key, Iterable[V]) pairs
    //val groupByKeyData = data3.map((_,1)).groupByKey().foreach(println(_))
    /*
    (e,CompactBuffer(1))
    (d,CompactBuffer(1, 1))
    (a,CompactBuffer(1, 1, 1))
    (b,CompactBuffer(1, 1))
    (c,CompactBuffer(1))
     */
    // aggregate by key, returning the reduced value for each key
    //val reduceByKeyData = data3.map((_,1)).reduceByKey(_+_).foreach(println(_))
   /*
    (e,1)
    (a,3)
    (d,2)
    (c,1)
    (b,2)
    */
    // sort by key; note that the printed order is only guaranteed within each partition
    data3.map((_,1)).reduceByKey(_+_).sortByKey().foreach(println(_))
    // sortBy lets you choose the field to sort on; ascending by default
    data3.map((_,1)).reduceByKey(_+_).sortBy(_._2).foreach(println(_))
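    // sortBy also takes an ascending flag (true by default); a minimal sketch
    // of sorting by value in descending order:
    //data3.map((_,1)).reduceByKey(_+_).sortBy(_._2, ascending = false).foreach(println(_))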
    sc.stop()
  }
}