map vs mapPartitions
>>> rdd=sc.parallelize(range(1,11))
>>> rdd.glom().collect()
[[1, 2], [3, 4, 5], [6, 7], [8, 9, 10]]
>>> rdd.map(lambda num: num + 1).collect()  # ordinary element-wise map
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>>
>>> def fn1(iterator):  # receives an iterator over one partition's elements
...     arr = []
...     for n in iterator:
...         arr.append(n + 1)
...     return arr
...
>>> rdd.mapPartitions(fn1).collect()  # per-partition map
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>>
>>> def fn2(iterator):  # generator version: yields results lazily
...     for n in iterator:
...         yield n + 1
...
>>> rdd.mapPartitions(fn2).collect()
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>>
>>> rdd.mapPartitions(fn1).glom().collect()
[[2, 3], [4, 5, 6], [7, 8], [9, 10, 11]]
>>> rdd.mapPartitions(fn2).glom().collect()
[[2, 3], [4, 5, 6], [7, 8], [9, 10, 11]]
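A partition function does not have to return one output per input element. The sketch below is plain Python, not a Spark run: it reuses the partition layout printed by glom() above and applies a hypothetical helper, partition_sum, to each partition in turn, which stands in for rdd.mapPartitions(partition_sum).glom().collect().

```python
def partition_sum(iterator):
    """Yield a single value: the sum of one partition's elements."""
    yield sum(iterator)

# Partition layout copied from the glom() output above (an assumption
# for this sketch; a real RDD may be split differently).
partitions = [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10]]

# Plain-Python stand-in for applying the function once per partition.
sums = [list(partition_sum(iter(p))) for p in partitions]
print(sums)  # [[3], [12], [13], [27]]
```

Each partition collapses to a single value here, which an element-wise map cannot express.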
foreach vs foreachPartition
>>> rdd=sc.parallelize(range(1,11))
>>> rdd.foreach(lambda n : print(n) )
1
2
3
4
5
8
9
10
6
7
>>> def fn3(iterator):  # receives an iterator over one partition's elements
...     for n in iterator:
...         print(n)
...
>>> rdd.foreachPartition(fn3)
8
9
10
3
4
5
1
2
6
7
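Ordinary functions vs partition functions (diagram)
An ordinary function operates on a single element, while a partition function operates on a whole partition. The function passed to a partition operation therefore usually has to iterate over the partition's elements itself.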
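To make the difference concrete, here is a minimal plain-Python sketch that counts how often each kind of function is invoked. It does not use Spark: simulate_map and simulate_map_partitions are hypothetical stand-ins for map and mapPartitions, and the partition layout is assumed to match the glom() output shown earlier.

```python
# Partition layout assumed from the earlier glom() output.
partitions = [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10]]

calls = {"element": 0, "partition": 0}

def add_one(x):
    """Element-wise function: called once per element."""
    calls["element"] += 1
    return x + 1

def add_one_partition(iterator):
    """Partition function: called once per partition, loops over elements itself."""
    calls["partition"] += 1
    for x in iterator:
        yield x + 1

def simulate_map(parts, f):
    # Stand-in for rdd.map(f): apply f to every element of every partition.
    return [[f(x) for x in part] for part in parts]

def simulate_map_partitions(parts, f):
    # Stand-in for rdd.mapPartitions(f): hand f an iterator per partition.
    return [list(f(iter(part))) for part in parts]

by_element = simulate_map(partitions, add_one)
by_partition = simulate_map_partitions(partitions, add_one_partition)

print(by_element == by_partition)  # True
print(calls)                       # {'element': 10, 'partition': 4}
```

Both approaches produce the same result, but the partition function runs 4 times instead of 10, which is why per-partition operations are often chosen when each call carries some fixed setup cost.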