mapWith has been deprecated since Spark 1.0; use mapPartitionsWithIndex instead.
====The original mapWith usage
mapWith is another variant of map: map takes a single input function, while mapWith takes two. It is defined as follows:
def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]
The first function, constructA, takes the RDD's partition index (starting from 0) as input and produces a value of the new type A;
the second function, f, takes a pair (T, A) as input (where T is an element of the original RDD and A is the output of the first function) and produces a value of type U.
Example: multiply the partition index by 10, then add 2, and use the result as the elements of the new RDD.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
x.mapWith(a => a * 10)((a, b) => (b + 2)).collect
res4: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)
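For comparison, the same result can be produced with the replacement API described in the next section. A minimal sketch (the val name x2 is only illustrative): every element of a partition is mapped to index * 10 + 2.
val x2 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
x2.mapPartitionsWithIndex((index, it) => it.map(_ => index * 10 + 2)).collect
// expected: Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)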
====Using the new mapPartitionsWithIndex
mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
Similar to mapPartitions, its function signature is:
def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
The func passed to mapPartitionsWithIndex takes two parameters: the first is the partition index, and the second is an iterator over that partition's data. It returns an iterator of the transformed elements. In the test below, the partition index is emitted together with each element of its partition.
Test:
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
  iter.toList.map(x => index + "-" + x).iterator
}
//myfunc: (index: Int, iter: Iterator[Int])Iterator[String]
x.mapPartitionsWithIndex(myfunc).collect()
res: Array[String] = Array(0-1, 0-2, 0-3, 1-4, 1-5, 1-6, 2-7, 2-8, 2-9, 2-10)
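The second parameter, preservesPartitioning, defaults to false. Setting it to true only makes sense when the function leaves the keys of a pair RDD unchanged, so Spark can keep the existing partitioner and avoid a later shuffle. A minimal sketch (the data and val names are illustrative, not from the original):
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).partitionBy(new HashPartitioner(2))
val tagged = pairs.mapPartitionsWithIndex(
  (index, it) => it.map { case (k, v) => (k, index + "-" + v) },
  preservesPartitioning = true) // keys are unchanged, so the HashPartitioner stays valid
// tagged.partitioner should still be Some(HashPartitioner)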
A complete runnable class:
package yanan.spark.core.transformations.example

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object TransformationsTest {

  // Pairs up consecutive elements within a single partition.
  def mapPartitionsFunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
    var res = List[(T, T)]()
    var pre = iter.next()
    while (iter.hasNext) {
      val cur = iter.next()
      res = (pre, cur) :: res
      pre = cur
    }
    res.iterator
  }

  def mapPartitionsTest(sc: SparkContext) = {
    val a = sc.parallelize(1 to 9, 3)
    // Pairs never cross partition boundaries because each partition is processed independently.
    a.mapPartitions(mapPartitionsFunc).collect.foreach(println)
  }

  def mapValuesTest(sc: SparkContext) = {
    val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    val b = a.map(x => (x.length, x))
    // Wraps every value in "x", e.g. (3,xdogx), (5,xtigerx), ...
    b.mapValues("x" + _ + "x").collect.foreach(println)
  }

  def mapWithTest(sc: SparkContext) = {
    val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
    x.mapWith(a => a * 10)((a, b) => (b + 2)).collect.foreach(println)
    //res4: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)

    // mapWith is deprecated; use mapPartitionsWithIndex instead.
    val parallel = sc.parallelize(1 to 9)
    parallel.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) => it.toList.map(x => index + ", " + x).iterator).collect
    // parallel.collect.foreach(println)

    val y = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
    y.mapPartitionsWithIndex(myfuncPartitionsWithIndex).collect()
    //res: Array[String] = Array(0-1, 0-2, 0-3, 1-4, 1-5, 1-6, 2-7, 2-8, 2-9, 2-10)
  }

  def myfuncPartitionsWithIndex(index: Int, iter: Iterator[Int]): Iterator[String] = {
    iter.toList.map(x => index + "-" + x).iterator
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Book example: Scala").setMaster("local[2]")
    val sc = new SparkContext(conf)
    //mapPartitionsTest(sc)
    mapWithTest(sc)
    sc.stop()
  }
}
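One caveat: mapPartitionsFunc calls iter.next() unconditionally, so it throws NoSuchElementException on an empty partition (possible when an RDD has more partitions than elements). A minimal guarded variant as a sketch (the name mapPartitionsFuncSafe is mine, not from the original):
def mapPartitionsFuncSafe[T](iter: Iterator[T]): Iterator[(T, T)] = {
  if (!iter.hasNext) Iterator.empty
  else {
    var res = List[(T, T)]()
    var pre = iter.next()
    while (iter.hasNext) {
      val cur = iter.next()
      res = (pre, cur) :: res  // prepend each consecutive pair
      pre = cur
    }
    res.iterator
  }
}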
References:
https://www.zybuluo.com/jewes/note/35032
http://debugo.com/spark-programming-model/