map vs mapPartitions

blueheart丶

Version: Apache Spark 1.6.0

Source: RDD.scala

I. Source code walkthrough
1. The map operator

// Transformations (return a new RDD)

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

2. The mapPartitions operator

/**
 * Return a new RDD by applying a function to each partition of this RDD.
 *
 * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
 * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
 */
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
    preservesPartitioning)
}
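Both definitions build the same MapPartitionsRDD; map only wraps the element function in iter.map. The sketch below checks that equivalence on a small local RDD. The object name MapEquivalence, the square function, and the local[2] master are assumptions made for illustration and do not come from the Spark source.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapEquivalence {
  // Element-level function used by both variants (illustrative).
  def square(x: Int): Int = x * x

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("map-equivalence").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10, 2)

      // Element-wise version.
      val viaMap = rdd.map(square).collect().toList

      // Partition-wise version with the same body that RDD.map builds: iter.map(f).
      val viaMapPartitions = rdd.mapPartitions(iter => iter.map(square)).collect().toList

      println(viaMap == viaMapPartitions) // prints: true
    } finally {
      sc.stop()
    }
  }
}
```

Seen this way, any speedup from mapPartitions has to come from work that is moved out of the per-element function, not from the operator itself.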

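II. Comparison
1. Example
Suppose an RDD has 10 partitions holding 1000 records in total, so each partition holds 1000 / 10 = 100 records. We traverse it once with map and once with mapPartitions.

(1) Traversing with map(func)
When map(func) is applied to the RDD, func is applied to every record, so in this case func is called 1000 times. In a time-critical application, any per-call overhead is therefore paid 1000 times.

(2) Traversing with mapPartitions(func)
When mapPartitions(func) is called on the RDD, func is invoked once per partition rather than once per record. In this case it is called 10 times (the number of partitions). This lets a time-critical application avoid repeating work, such as setup and teardown, for every record.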
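The 1000-versus-10 call counts can be observed directly. This is a minimal sketch that assumes a local Spark context; the object name CallCounts and the helper partitionSize are made up for the example. Each call emits exactly one output element, so counting outputs counts calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CallCounts {
  // Called once per partition: consumes the iterator and emits a single value.
  def partitionSize(iter: Iterator[Int]): Iterator[Int] = Iterator(iter.size)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("call-counts").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, 10) // 1000 records in 10 partitions

      // map: one output element per call, one call per record.
      val mapCalls = rdd.map(_ => 1).count()

      // mapPartitions: one output element per call, one call per partition.
      val sizes = rdd.mapPartitions(partitionSize).collect()

      println(s"map calls: $mapCalls")                   // 1000
      println(s"mapPartitions calls: ${sizes.length}")   // 10
      println(s"records per partition: ${sizes.toList}") // ten entries of 100
    } finally {
      sc.stop()
    }
  }
}
```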

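2. Advantages of mapPartitions
(1) Machine learning applications, deep learning in particular: vectorized execution can be far faster than a simple for loop, and mapPartitions lets you hand a whole partition to vectorized code. The source this post draws on claims a speedup of 300x or more (a factor of 300, not 300 percent); the actual gain depends on the workload.

(2) Creating and cleaning up connections is expensive, and doing so for every element makes the code inefficient. This applies to database connections and to other kinds of connections. With mapPartitions, the init/cleanup cycle runs only once for the whole partition.

(3) The JVM's JIT compiler has to profile the code and recompile it before it runs fast, after which the CPU's out-of-order execution can keep the hardware busy. With mapPartitions the hot path is a plain loop, which may be easier for the JIT to analyze and optimize than a chain of function calls.

(4) With map(), the function passed as the argument is invoked once per element. The source estimates this overhead at 10-15 ns per call and attributes it to saving and reloading CPU registers (stack pointer, base pointer, and instruction pointer).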
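Point (2) is the easiest one to show in code. In the sketch below a java.security.MessageDigest stands in for the expensive connection: it is created once per partition and reused for every record. The object name PerPartitionSetup, the helper names, and the sample data are assumptions for illustration; a real job would open and close its own resource in the same places.

```scala
import java.security.MessageDigest
import org.apache.spark.{SparkConf, SparkContext}

object PerPartitionSetup {
  private def toHex(bytes: Array[Byte]): String =
    bytes.map(b => "%02x".format(b)).mkString

  // Runs once per partition: the digest is created once and reused for every record.
  def hashPartition(records: Iterator[String]): Iterator[String] = {
    val digest = MessageDigest.getInstance("SHA-256") // one-time setup for this partition
    // digest(bytes) hashes the input and resets the instance, so reuse is safe here.
    records.map(r => toHex(digest.digest(r.getBytes("UTF-8"))))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("per-partition-setup").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("spark", "map", "mapPartitions", "rdd"), 2)
      words.mapPartitions(hashPartition).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With map, the getInstance call would have to sit inside the element function and would run for every record.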

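3. What is map good for, compared with mapPartitions?
(1) A simpler API that is easy to write and easy to understand; existing functions written for the elements of a List / Array / Map can be used directly.

(2) It is a carry-over from functional programming, which is a minor benefit.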

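map:
A traversal operator that visits every element of the RDD; the unit of traversal is a single record.

mapPartitions:
A traversal operator whose function may change the element type of the RDD. The unit of traversal is a partition: the function receives an Iterator over all records of one partition. It does not change the number of partitions, so it does not raise the RDD's parallelism by itself. If the function buffers the iterator or builds a large result, a whole partition's data sits in memory at once.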

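So which of the two operators is more efficient for traversing an RDD?

mapPartitions is usually more efficient, provided there is per-partition work (such as setup) that it lets you do once instead of once per record.

mapPartitions can use more memory. If the result computed for one partition is very large, it may cause an OOM. How can that be solved?

Use the repartition operator to increase the number of partitions of the RDD, so that the result computed for each partition becomes much smaller.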
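The memory concern can be sketched as follows. Whether a partition is held in memory depends on what the function does with its iterator: bufferedSum materializes everything, while streamingSum keeps only a running total. The example also applies repartition and compares the largest partition before and after. The names PartitionMemory, bufferedSum, and streamingSum, and the partition counts 10 and 50, are assumptions chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionMemory {
  // Buffers the whole partition in an array before summing: memory grows with partition size.
  def bufferedSum(iter: Iterator[Int]): Iterator[Long] = {
    val all = iter.toArray
    Iterator(all.map(_.toLong).sum)
  }

  // Streams through the iterator: only the running total is kept.
  def streamingSum(iter: Iterator[Int]): Iterator[Long] =
    Iterator(iter.foldLeft(0L)(_ + _))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-memory").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, 10)
      val sizesBefore = rdd.mapPartitions(it => Iterator(it.size)).collect()

      // More partitions means fewer records for each function call to handle.
      val finer = rdd.repartition(50)
      val sizesAfter = finer.mapPartitions(it => Iterator(it.size)).collect()

      println(s"largest partition before: ${sizesBefore.max}, after: ${sizesAfter.max}")
      // The total is the same whichever partitioning is used.
      println(finer.mapPartitions(streamingSum).collect().sum)
    } finally {
      sc.stop()
    }
  }
}
```

Note that repartition shuffles the data, so it has a cost of its own; rewriting the function to stream through the iterator is often the cheaper fix when that is possible.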

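mapPartitions use cases
It is typically used when writing an RDD's results to a database (MySQL, Oracle, Redis): with one connection per partition, this operator is well suited to inserting data into a database.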
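The write pattern can be sketched without committing to a particular database. Everything named here is hypothetical: Sink is a stand-in for a real connection, BufferSink exists only to make the sketch runnable, and the batch size is arbitrary. The point is the shape of the per-partition function: open once, write in batches, always close.

```scala
import scala.collection.mutable.ArrayBuffer

object PartitionWriter {
  // Hypothetical stand-in for a database connection; not a real driver API.
  trait Sink {
    def write(batch: Seq[String]): Unit
    def close(): Unit
  }

  // In-memory sink used only to make this sketch runnable.
  class BufferSink extends Sink {
    val batches = ArrayBuffer.empty[Seq[String]]
    var closed = false
    def write(batch: Seq[String]): Unit = { batches += batch }
    def close(): Unit = { closed = true }
  }

  // Opens one sink for the partition, writes in batches, and always closes it.
  // Returns the number of records written.
  def writePartition(records: Iterator[String], open: () => Sink, batchSize: Int): Int = {
    val sink = open()
    try {
      var written = 0
      records.grouped(batchSize).foreach { batch =>
        sink.write(batch)
        written += batch.size
      }
      written
    } finally {
      sink.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val sink = new BufferSink
    val n = writePartition(Iterator("a", "b", "c"), () => sink, 2)
    println(s"wrote $n records in ${sink.batches.size} batches, closed = ${sink.closed}")
  }
}
```

On an RDD[String], a function like this would be passed to mapPartitions (wrapping the count in an Iterator) or to the action foreachPartition, with the sink factory creating the connection on the executor rather than on the driver.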

Link: https://www.jianshu.com/p/f6c541aef8b3
Original: https://blog.csdn.net/high2011/article/details/79384159
