In the previous post we studied how to compute a histogram over an RDD[String].
By using implicit type conversion, we can add helper methods to the Map class and make the code look better.
Before Scala 2.10, adding a new helper method to an existing type through implicit conversion required defining a new class plus an implicit function that performs the conversion, like the following:
```scala
class mapHelper[K,V](m: Map[K,V]) {
  def updatedWith(c: K, d: V)(f: V => V) = {
    m.updated(c, f(m.getOrElse(c, d)))
  }
}
implicit def toMapHelper[K,V](m: Map[K,V]) = new mapHelper(m)
```
As you can see, the “toMapHelper” method is somewhat redundant. Since Scala 2.10, the implicit keyword can be applied to a class, so that the class constructor itself serves as the implicit conversion:
```scala
implicit class mapHelper[K,V](m: Map[K,V]) {...
```
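To make the contrast concrete, here is a minimal self-contained sketch of the implicit-class form in use. The object name `ImplicitClassDemo`, the method name `updatedOrElse` and the sample data are assumptions for illustration. The method is renamed because Scala 2.13 added its own `Map.updatedWith` with a different signature, and the wrapping object is there because an implicit class cannot be a top-level definition in Scala 2.

```scala
object ImplicitClassDemo {
  // Implicit class: the constructor doubles as the Map -> MapHelper conversion,
  // so no separate `implicit def` is needed (Scala 2.10+).
  implicit class MapHelper[K, V](m: Map[K, V]) {
    // Apply f to the value stored under key c, or to the default d if c is absent.
    def updatedOrElse(c: K, d: V)(f: V => V): Map[K, V] =
      m.updated(c, f(m.getOrElse(c, d)))
  }

  val counts: Map[String, Int] = Map("a" -> 1)

  // Both calls resolve through the implicit conversion to MapHelper.
  val bumped: Map[String, Int] = counts.updatedOrElse("a", 0)(_ + 1) // key present
  val added: Map[String, Int] = counts.updatedOrElse("b", 0)(_ + 1)  // key absent
}
```

Here `bumped` increments the existing entry for "a", while `added` starts "b" from the default before applying the function.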
Now let’s create the helper functions for the histogram:
```scala
implicit class histMap[K](m: Map[K,Long]) {
  def addcount(c: K, n: Long) = {
    m.updated(c, (m.getOrElse(c, 0L) + n))
  }
  def ::(n: Map[K,Long]) = {
    (n /: m) { case (map, (k, v)) => map.updated(k, v + map.getOrElse(k, 0L)) }
  }
}
```
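Here we defined two methods on the implicit class “histMap”, which wraps “Map[K,Long]” objects: “addcount” adds n to the count of one key, and “::” merges two count maps by summing the counts of matching keys. With these, the histogram can be computed as: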
```scala
scala> d.aggregate(Map[String,Long]())(_.addcount(_, 1), _::_)
```
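The same two helpers can be tried without a Spark cluster. The sketch below is an assumption-laden stand-in: two hypothetical in-memory partitions replace the RDD `d`, a per-partition `foldLeft` plays the role of the sequence operator, and `reduce` plays the role of the combine operator. The merge uses `foldLeft` instead of `/:`, which is deprecated in newer Scala versions; the names `HistogramDemo`, `partitions` and `perPartition` are illustrative only.

```scala
object HistogramDemo {
  implicit class HistMap[K](m: Map[K, Long]) {
    // Add n to the count stored under key c (starting from 0 if absent).
    def addcount(c: K, n: Long): Map[K, Long] =
      m.updated(c, m.getOrElse(c, 0L) + n)

    // Right-associative operator: `a :: b` is b.::(a), so b's counts are folded into a.
    def ::(n: Map[K, Long]): Map[K, Long] =
      m.foldLeft(n) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0L)) }
  }

  // Two hypothetical partitions standing in for the RDD[String] `d`.
  val partitions: Seq[Seq[String]] = Seq(Seq("a", "b", "a"), Seq("b", "c"))

  // Sequence step: count the records of each partition separately.
  val perPartition: Seq[Map[String, Long]] =
    partitions.map(p => p.foldLeft(Map.empty[String, Long])(_.addcount(_, 1)))

  // Combine step: merge the per-partition histograms.
  val hist: Map[String, Long] = perPartition.reduce(_ :: _)
}
```

With this sample data, "a" and "b" each end up with a count of 2 and "c" with a count of 1.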
So far, we have used immutable.Map to store the counts. According to this benchmark (Scala Map Benchmark), immutable.Map is significantly slower than mutable.Map, and mutable.Map is significantly slower than java.util.HashMap. With the scala.collection.JavaConversions.mapAsScalaMap implicit conversion, a java.util.HashMap can be treated pretty much like a Scala mutable.Map. (JavaConversions was deprecated in Scala 2.12 and removed in 2.13, where scala.jdk.CollectionConverters with an explicit asScala call takes its place.)
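As a rough illustration of treating a java.util.HashMap as a Scala mutable map, the following sketch assumes Scala 2.13 or later and uses `scala.jdk.CollectionConverters` with an explicit `asScala` rather than the implicit JavaConversions route. The object name `JavaMapDemo` and the sample data are hypothetical.

```scala
import scala.jdk.CollectionConverters._
import scala.collection.mutable

object JavaMapDemo {
  val backing = new java.util.HashMap[String, Long]()

  // asScala wraps the Java map without copying; writes go through to `backing`.
  val counts: mutable.Map[String, Long] = backing.asScala

  // Count a few sample records through the Scala view.
  for (c <- Seq("a", "b", "a")) counts(c) = counts.getOrElse(c, 0L) + 1
}
```

Because the wrapper is only a view, the counts are stored in the Java map itself and remain visible through `backing`.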
If we want to use mutable.Map to store the counts, we can simply do
```scala
import scala.collection.mutable._
```
and change
```scala
m.updated(c, (m.getOrElse(c, 0L) + n))
```
to
```scala
m += (c -> (m.getOrElse(c, 0L) + n))
```
According to the API documentation, the difference between m.updated and m += is that the “updated” method returns a new instance of mutable.Map, while “+=” modifies and returns the existing instance. I’m not sure whether there is a real performance difference; I’m just playing it safe here.
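The difference in behaviour can be checked directly on a small map. This is a minimal sketch with hypothetical names (`UpdatedVsPlusEq`, `m`, `copy`, `same`); on newer Scala versions the call to `updated` on a mutable map may emit a deprecation warning, but it still compiles.

```scala
import scala.collection.mutable

object UpdatedVsPlusEq {
  val m: mutable.Map[String, Long] = mutable.Map("a" -> 1L)

  // `updated` builds a new mutable map and leaves `m` untouched.
  val copy: mutable.Map[String, Long] = m.updated("a", 2L)

  // `+=` changes `m` in place and returns the very same instance.
  val same: mutable.Map[String, Long] = (m += ("b" -> 1L))
}
```

Reference equality (`same eq m`) holds for the `+=` result but not for the `updated` result, which is the allocation that the in-place version avoids.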
However, with this approach the “d.aggregate” step returns a mutable.Map, because the signature of aggregate is
```scala
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
```
As you can see, the return type is the same as the type of the internal accumulator (U), so a mutable accumulator also yields a mutable result.
To solve this problem, let’s use a more flexible approach, “mapPartitions(…).reduce(…)”:
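The effect of that signature can be reproduced locally. In the sketch below, `aggregateLocal` is a hypothetical stand-in with the same type shape as aggregate (no Spark, no ClassTag), applied to hypothetical in-memory partitions. Its zero value is passed by name so that each partition starts from a fresh accumulator, which is an assumption of this sketch rather than a statement about how Spark handles the zero value.

```scala
import scala.collection.mutable

object AggregateShapeDemo {
  // Hypothetical local stand-in for aggregate: fold each partition, then combine.
  def aggregateLocal[T, U](parts: Seq[Seq[T]])(zero: => U)(seqOp: (U, T) => U, combOp: (U, U) => U): U =
    parts.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  val parts: Seq[Seq[String]] = Seq(Seq("a", "b", "a"), Seq("b", "c"))

  // U is fixed by the zero value, so the result is a mutable.Map as well.
  val result: mutable.Map[String, Long] =
    aggregateLocal(parts)(mutable.Map.empty[String, Long])(
      (m, c) => m += (c -> (m.getOrElse(c, 0L) + 1)),
      (a, b) => { b.foreach { case (k, v) => a(k) = a.getOrElse(k, 0L) + v }; a }
    )

  // An explicit conversion is needed to hand back an immutable Map.
  val frozen: Map[String, Long] = result.toMap
}
```

The caller only gets an immutable Map by adding a conversion such as `toMap` after the aggregation.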
```scala
scala> import scala.collection.mutable
scala> def histP(iter: Iterator[String]): Iterator[Map[String,Long]] = {
     |   val hist: mutable.Map[String,Long] = mutable.Map()
     |   while (iter.hasNext) {
     |     val c = iter.next
     |     hist.update(c, hist.getOrElse(c, 0L) + 1)
     |   }
     |   Iterator(hist.toMap)
     | }
scala> d.mapPartitions(histP(_)).reduce(_::_)
```
The code above does the same thing as the aggregate version. However, the histP method returns an Iterator of immutable Map, even though it uses a mutable map internally. Assuming the RDD[String] has many records in each partition, most of the map updates happen in place on the mutable.Map, and only the per-partition results are merged as immutable maps.
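The pattern can be exercised without Spark. In this sketch, hypothetical in-memory partitions replace the RDD, `flatMap` over them stands in for mapPartitions, and a plain `merge` function replaces the `::` operator so that the block is self-contained; all of these names are assumptions for illustration.

```scala
import scala.collection.mutable

object MapPartitionsDemo {
  // Per-partition logic: mutate a local map, then publish an immutable snapshot.
  def histP(iter: Iterator[String]): Iterator[Map[String, Long]] = {
    val hist = mutable.Map.empty[String, Long]
    iter.foreach(c => hist.update(c, hist.getOrElse(c, 0L) + 1))
    Iterator(hist.toMap)
  }

  // Plain-function equivalent of the `::` merge: sum the counts of matching keys.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  val partitions: Seq[Seq[String]] = Seq(Seq("x", "y", "x"), Seq("y", "z", "z"))

  // flatMap over partitions plays the role of mapPartitions; reduce merges the results.
  val hist: Map[String, Long] =
    partitions.flatMap(p => histP(p.iterator)).reduce((a, b) => merge(a, b))
}
```

Only one immutable map per partition takes part in the merge, so the number of immutable updates depends on the number of distinct keys per partition rather than on the number of records.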
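This post explored how to use Scala's implicit conversions to streamline the histogram computation on an RDD, introducing helper functions through an implicit class to simplify the code. It also showed how to move from an immutable Map to a mutable Map while keeping the computation efficient, and how to compute the histogram with `aggregate` and with `mapPartitions`.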