转换算子(TransFormation)和执行算子(Action)

最新推荐文章于 2024-05-27 11:05:12 发布

孤独の√ 3

最新推荐文章于 2024-05-27 11:05:12 发布

阅读量2k

点赞数 1

分类专栏：大数据 # spark 文章标签： spark

本文链接：https://blog.csdn.net/LCY_1013/article/details/105188074

版权

目录:一、转换算子(TransFormation)二、执行算子(Action)一、转换算子(TransFormation)func:functionmap(func)返回一个新的RDD，该RDD由每一个输入元素经过func函数转换后组成mapPartitions(func)类似于map，但独立地在RDD的每一个分片上运行，因此在类型为T(泛型)的RDD运行时，func的函数必须是...

摘要由CSDN通过智能技术生成

一、转换算子(TransFormation)

func:function

map(func)
返回一个新的RDD，该RDD由每一个输入元素经过func函数转换后组成
mapPartitions(func)
类似于map，但独立地在RDD的每一个分片上运行，因此在类型为T(泛型)的RDD运行时，
func的函数必须是Iterator[T]=>Iterator[U].
假设RDD有N个元素，有M个分区，那么map函数被调用N次，而mapPartitions则被调用M次，一个函数一次处理所有分区

def mappatitions(): Unit = {
  val conf = new SparkConf().setAppName("cogroup").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val a = sc.parallelize(1 to 9, 3)

  def doubleFunc(iter: Iterator[Int]): Iterator[Int] = {
    var res = List[Int]()
    while (iter.hasNext) {
      val cur = iter.next;
      res.::=((cur * 2))
    }
    res.iterator
  }
  val result = a.mapPartitions(doubleFunc)
  result.collect().foreach(println)
}

mapPartitionsWithIndex(func)
类似于mapPartitions，但func带有一个整数表示分片的索引值，因此类型为T的RDD运行时，
func的函数类型必须是(Int,Iterator[T]) => Iterator[U]

def mappatitionswithindex() {
  val conf = new SparkConf().setAppName("mappatitionswithindex").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val rdd1: RDD[Int] = sc.makeRDD(1 to 10, 2)
  val rdd2: RDD[String] = rdd1.mapPartitionsWithIndex((x, iter) => {
    var result = List[String]()
    result.::=(x + "|" + iter.toList)
    result.iterator
  })
  val res: Array[String] = rdd2.collect()
  for (x <- res) {
    println(x)
  }
}

glom()
将每一个分区形成一个数组，形成新的RDD类型=>RDD[Array[T]]
filter(func)
返回一个新的RDD，该RDD由经过func函数计算后返回值为true的输入元素组成

// 创建RDD
 val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4, 5))
//  获得所有的偶数。
 val num: RDD[Int] = rdd.filter(aaa => aaa % 2==0 )
 num.foreach(line => println(line))

flatMap(func)
类似于map，但是每一个输入元素可以被映射为0或多个输出元素(所以func应该返回一个序列，而不是单一元素)
sample(withReplacement,fraction,seed)
以指定的随机种子随机抽样出数量为fraction的数据，withReplacement表示是抽出的数据是否放回，true为有放回的抽样，false为无放回的抽样，seed用于指定随机数生成器种子。
例如：从RDD中随机且有放回的抽出50%的数据，随机种子值为3(即可能以1,2,3的其中一个起始值)
distinct([numTasks])
对RDD进行去重后返回一个新的RDD。默认情况下，只有8个并行任务来操作，但是可以传入一个可选的numTasks参数改变它。

def distinct(): Unit = {
  val conf = new SparkConf().setAppName("distinct").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val distinctRdd = sc.parallelize(List(1, 2, 1, 5, 2, 9, 6, 1))
  val res: RDD[Int] = distinctRdd.distinct(2)
  // distinct 中的参数numTask 表示 数据先对2整除，其他的依次余1，余2，余3 。局部无序，整体有序。
  res.foreach(println)
}

partitionBy
对RDD进行分区操作，如果原有

最低0.47元/天解锁文章

孤独の√ 3

关注

1
点赞
踩
5

收藏

觉得还不错? 一键收藏
0
评论
转换算子(TransFormation)和执行算子(Action)

目录:一、转换算子(TransFormation)二、执行算子(Action)一、转换算子(TransFormation)func:functionmap(func)返回一个新的RDD，该RDD由每一个输入元素经过func函数转换后组成mapPartitions(func)类似于map，但独立地在RDD的每一个分片上运行，因此在类型为T(泛型)的RDD运行时，func的函数必须是...
复制链接

扫一扫