Scala: the difference between map and mapPartitions

鸭梨山大哎

Article link: https://blog.csdn.net/u010711495/article/details/109763949

map - iterate over the elements and transform each one

Definition

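The parameter is a function f: T => U; for example, T could be Int and U could be Double.
map returns an RDD[U]: the element type U can be any type and always matches the return type of f.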

def map[U : ClassTag](f: T => U): RDD[U]

Example

scala> val rdd=sc.parallelize(Array(1,2,3,4,5,6,7),2)
// pass a function literal as the argument
scala> rdd.map(x=>x*2).collect
res100: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14)
// shorthand using the placeholder syntax
scala> rdd.map(_*2).collect
res99: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14)
// each element is transformed; the result type need not match the original element type
scala> rdd.map(_*2.1).collect
res101: Array[Double] = Array(2.1, 4.2, 6.300000000000001, 8.4, 10.5, 12.600000000000001, 14.700000000000001)

mapPartitions

Definition

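The parameter is a function that takes an iterator over one partition's elements and returns an iterator. It is called once per partition rather than once per element.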

def mapPartitions[U : ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example

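import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsExample {
  // Called once per partition: takes that partition's iterator and returns a new one.
  def myfun(iter: Iterator[Int]): Iterator[Int] =
    for (e <- iter) yield e * 2

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7), 2)
    // Prints each doubled value (2, 4, ..., 14) on its own line;
    // foreach returns Unit, not an Array.
    rdd.mapPartitions(myfun).foreach(println)
    sc.stop()
  }
}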

More examples

scala> val rdd=sc.parallelize(Array(1,2,3,4,5,6,7),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[145] at parallelize at <console>:24
// iter is an Iterator, so calling map on it also returns an Iterator
scala> rdd.mapPartitions(iter=>iter.map(_*2)).collect
res104: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14)
// calling filter on an iterator also returns an iterator
scala> rdd.mapPartitions(iter=>iter.filter(_%2==0)).collect
res15: Array[Int] = Array(2, 4, 6)
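
Because the function passed to mapPartitions sees a whole partition at once, it can also return a different number of elements than it received, for example one sum per partition. Below is a minimal sketch of that idea, assuming Spark is on the classpath. The names PartitionSums and sumPartition are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSums {
  // Hypothetical helper: collapses one partition into a single element, its sum.
  def sumPartition(iter: Iterator[Int]): Iterator[Int] = Iterator(iter.sum)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitionSums").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7), 2)
      // One output element per partition, not one per input element.
      val sums = rdd.mapPartitions(sumPartition).collect()
      println(sums.mkString(", "))
    } finally sc.stop()
  }
}
```

If the two partitions hold 1, 2, 3 and 4, 5, 6, 7, this should print 6, 22. A plain map cannot express this, since it always produces exactly one output per input element.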

mapPartitionsWithIndex

This is the same as mapPartitions, except that the function also receives the partition index.

def mapPartitionsWithIndex[U : ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example

scala> val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7), 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> rdd.mapPartitionsWithIndex((index,x)=>x.map(index+":"+_*2)).collect
res19: Array[String] = Array(0:2, 0:4, 0:6, 1:8, 1:10, 1:12, 1:14)
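
The partition index is also handy for inspecting how the data was split. The sketch below collects each partition's contents together with its index; it assumes Spark is on the classpath, and PartitionContents and tagPartition are illustrative names only. It calls toList on the iterator, which loads the whole partition into memory, so it is only suitable for small data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionContents {
  // Hypothetical helper: pairs a partition index with everything in that partition.
  def tagPartition(index: Int, iter: Iterator[Int]): Iterator[(Int, List[Int])] =
    Iterator((index, iter.toList))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitionContents").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7), 2)
      val layout = rdd
        .mapPartitionsWithIndex((index, iter) => tagPartition(index, iter))
        .collect()
      // Print one line per partition, listing its index and elements.
      layout.foreach { case (index, elems) =>
        println(s"partition $index: ${elems.mkString(", ")}")
      }
    } finally sc.stop()
  }
}
```

Given the output of the example above, partition 0 should list 1, 2, 3 and partition 1 should list 4, 5, 6, 7.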

Summary

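Anything that can be done with map can also be done with mapPartitions (for example iter => iter.map(f)); the reverse does not hold, since map cannot drop elements or combine the elements of a partition.
The main difference is the granularity of the call: the function passed to map is applied to each element of the RDD, while the function passed to mapPartitions is applied once to each partition.
mapPartitions is often the better choice when each call needs an expensive resource such as a database connection: opening one connection per partition is much cheaper than opening one per element.
mapPartitionsWithIndex works like mapPartitions; its function just takes the partition index as an extra parameter.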
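
The connection point can be sketched as follows. FakeConnection is a hypothetical stand-in for a real database client, not an actual API, and the counter only works because local mode runs the tasks inside the driver JVM. Spark is assumed to be on the classpath.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for an expensive resource such as a database connection.
class FakeConnection {
  FakeConnection.opened.incrementAndGet()
  def lookup(x: Int): String = s"row-$x"
  def close(): Unit = ()
}

object FakeConnection {
  // Counts how many connections were opened in this JVM.
  val opened = new AtomicInteger(0)
}

object ConnectionPerPartition {
  // Opens one connection for the whole partition instead of one per element.
  def lookupPartition(iter: Iterator[Int]): Iterator[String] = {
    val conn = new FakeConnection
    // Materialize the results before closing: iterators are lazy, so returning
    // iter.map(conn.lookup) directly would use the connection after close().
    val rows = iter.map(conn.lookup).toList
    conn.close()
    rows.iterator
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("connectionPerPartition").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7), 2)

      // map: the function runs per element, so a connection is opened per element.
      rdd.map { x =>
        val conn = new FakeConnection
        try conn.lookup(x) finally conn.close()
      }.collect()
      val openedByMap = FakeConnection.opened.getAndSet(0)

      // mapPartitions: the function runs per partition, so one connection each.
      rdd.mapPartitions(lookupPartition).collect()
      val openedByMapPartitions = FakeConnection.opened.get()

      println(s"map opened $openedByMap, mapPartitions opened $openedByMapPartitions")
    } finally sc.stop()
  }
}
```

With 7 elements in 2 partitions, map should open 7 connections and mapPartitions only 2. The trade-off in this sketch is that toList holds a whole partition's results in memory; for large partitions a real implementation would need a way to close the connection only once the iterator is exhausted.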
