In the previous post, we studied how to calculate a histogram on an RDD[String].
By using implicit type conversion, we can add helper methods to the Map class and make the code look better.
Before Scala 2.10, adding a new helper method to an existing type through implicit conversion required defining a new class plus an implicit function to do the conversion, as follows:
```scala
class mapHelper[K,V](m: Map[K,V]) {
  def updatedWith(c: K, d: V)(f: V => V) = {
    m.updated(c, f(m.getOrElse(c, d)))
  }
}
implicit def toMapHelper[K,V](m: Map[K,V]) = new mapHelper(m)
```
As you can see, the “toMapHelper” method is somewhat redundant. Since 2.10, we can put the implicit keyword on the class itself, so that the class constructor is used for the implicit conversion:
```scala
implicit class mapHelper[K,V](m: Map[K,V]){...
```
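To make the effect of the implicit class concrete, here is a minimal sketch of how such a helper is called once it is in scope. The object name `MapHelperSketch` and the method name `adjusted` are made up for this illustration. A different method name is used because newer Scala versions (2.13 and later) already define their own `updatedWith` on Map with a different signature.

```scala
object MapHelperSketch {
  // In compiled Scala 2 code an implicit class must be nested in an object, class, or trait.
  implicit class MapHelper[K, V](m: Map[K, V]) {
    // Apply f to the current value of key c, or to the default d if c is absent.
    // `adjusted` is a hypothetical name chosen for this sketch.
    def adjusted(c: K, d: V)(f: V => V): Map[K, V] =
      m.updated(c, f(m.getOrElse(c, d)))
  }

  val counts: Map[String, Long] = Map("a" -> 1L)

  // The compiler wraps `counts` in MapHelper because Map itself has no `adjusted` method.
  val result: Map[String, Long] =
    counts.adjusted("a", 0L)(_ + 1L).adjusted("b", 0L)(_ + 1L)

  def main(args: Array[String]): Unit = println(result)
}
```

Here `result` holds a count of 2 for "a" and 1 for "b", while `counts` itself is left untouched because the underlying Map is immutable.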
Now let’s create the helper functions for the histogram:
```scala
implicit class histMap[K](m: Map[K,Long]) {
  def addcount(c: K, n: Long) = {
    m.updated(c, (m.getOrElse(c, 0L) + n))
  }
  def ::(n: Map[K,Long]) = {
    (n /: m) {
      case (map, (k, v)) => map.updated(k, v + map.getOrElse(k, 0L))
    }
  }
}
```
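Here we defined two methods on the implicit class “histMap”, which wraps “Map[K,Long]” objects: “addcount” adds n to the count of one key, and “::” merges two count maps by summing the counts per key. With this, we can compute the histogram as: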
```scala
scala> d.aggregate(Map[String,Long]())(_.addcount(_, 1), _::_)
```
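The two arguments passed to aggregate can be tried without a Spark cluster. The following is only a sketch: a nested `Seq` stands in for the partitions of the RDD (an assumption made for this illustration), `foldLeft` plays the role of the per-partition seqOp, and `reduce` plays the role of combOp. The names `HistogramAggregateSketch`, `partitions`, and `perPartition` are hypothetical, and `foldLeft` replaces `/:` because the latter is deprecated in newer Scala versions.

```scala
object HistogramAggregateSketch {
  implicit class HistMap[K](m: Map[K, Long]) {
    // Add n to the count stored for key c.
    def addcount(c: K, n: Long): Map[K, Long] =
      m.updated(c, m.getOrElse(c, 0L) + n)

    // Merge the counts of m into n, summing the counts of shared keys.
    def ::(n: Map[K, Long]): Map[K, Long] =
      m.foldLeft(n) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0L)) }
  }

  // Hypothetical stand-in for an RDD[String]: each inner Seq is one partition.
  val partitions: Seq[Seq[String]] = Seq(Seq("a", "b", "a"), Seq("b", "c"))

  // seqOp: fold every partition into its own count map, starting from the empty map.
  val perPartition: Seq[Map[String, Long]] =
    partitions.map(_.foldLeft(Map.empty[String, Long])(_.addcount(_, 1L)))

  // combOp: merge the per-partition maps pairwise.
  val histogram: Map[String, Long] = perPartition.reduce(_ :: _)

  def main(args: Array[String]): Unit = println(histogram)
}
```

With this input the first partition yields counts of 2 for "a" and 1 for "b", the second yields 1 each for "b" and "c", and the merged `histogram` has 2 for "a", 2 for "b", and 1 for "c".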
So far, we have used immutable.Map to store the counts. According to the “Scala Map Benchmark”, immutable.Map is significantly slower than mutable.Map, and mutable.Map is significantly slower than java.util.HashMap. With the scala.collection.JavaConversions.mapAsScalaMap implicit conversion, we can pretty much treat a java.util.HashMap as a Scala mutable.Map.
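As a rough illustration of treating a java.util.HashMap as a Scala mutable.Map, the sketch below wraps a Java map and updates it through the Scala interface. Note the swap: instead of the implicit JavaConversions.mapAsScalaMap named above, it uses the explicit `asScala` decorator from scala.collection.JavaConverters, because JavaConversions is no longer available from Scala 2.13 on (JavaConverters itself is deprecated there in favour of scala.jdk.CollectionConverters). The object and value names are hypothetical.

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable

object JavaMapSketch {
  val javaMap = new java.util.HashMap[String, Long]()

  // asScala returns a mutable.Map view backed by javaMap; no data is copied.
  val counts: mutable.Map[String, Long] = javaMap.asScala

  // Updates made through the Scala view are written to the underlying Java map.
  for (c <- Seq("a", "b", "a"))
    counts.update(c, counts.getOrElse(c, 0L) + 1L)

  def main(args: Array[String]): Unit = println(javaMap)
}
```

After the loop, `javaMap` itself contains a count of 2 for "a" and 1 for "b", which shows that the wrapper is a view and not a copy.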
If we want to use mutable.Map to store the counts, we can simply do
```scala
import scala.collection.mutable._
```
and change
```scala
m.updated(c, (m.getOrElse(c, 0L)+n))
```
to
```scala
m += (c->(m.getOrElse(c, 0L)+n))
```
According to the API documentation, the difference between m.updated and m += is that the “updated” method returns a new instance of mutable.Map, while “+=” returns the existing instance. I’m not sure whether there is a real performance difference, so I just play it safe here.
However, with this approach the “d.aggregate” step returns a mutable.Map, because the signature of aggregate is
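The difference between the two calls can be checked directly on a small map. This is just a sketch with made-up names (`UpdatedVsPlusEq`, `copy`, `same`). It says nothing about performance, and on Scala 2.13 or later the `updated` call on a mutable.Map produces a deprecation warning.

```scala
import scala.collection.mutable

object UpdatedVsPlusEq {
  val m: mutable.Map[String, Long] = mutable.Map("a" -> 1L)

  // updated builds a new map containing the new binding; m keeps its old value.
  val copy: mutable.Map[String, Long] = m.updated("a", 2L)

  // += modifies m in place and returns m itself.
  val same: mutable.Map[String, Long] = m += ("b" -> 1L)

  def main(args: Array[String]): Unit = {
    println(copy ne m) // reference inequality: copy is a separate instance
    println(same eq m) // reference equality: same is m
  }
}
```

Because `updated` has to build a second map on every call, using `+=` inside a counting loop avoids creating one extra map per record.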
```scala
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
```
As you can see, the return type is the same as the type of the internal accumulator (U).
To solve this problem, let’s use a more flexible approach: “mapPartitions(…).reduce(…)”
```scala
scala> import scala.collection.mutable
scala> def histP(iter: Iterator[String]): Iterator[Map[String,Long]] = {
     |   val hist: mutable.Map[String,Long] = mutable.Map()
     |   while (iter.hasNext) {
     |     val c = iter.next
     |     hist.update(c, hist.getOrElse(c, 0L) + 1)
     |   }
     |   Iterator(hist.toMap)
     | }
scala> d.mapPartitions(histP(_)).reduce(_::_)
```
The code above does the same thing as the aggregate method. However, the histP method returns an Iterator of immutable Map, although internally it uses a mutable map. Assuming the RDD[String] has many records in each partition, most of the map updates happen on the mutable.Map and are performed in place.
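The same pattern, mutating locally and handing back an immutable snapshot, can be exercised without Spark. In the sketch below a nested `Seq` stands in for the RDD partitions and a plain `merge` function stands in for the `::` helper. Both are assumptions made for this illustration, as are the names `HistPSketch` and `partitions`.

```scala
import scala.collection.mutable

object HistPSketch {
  // Count the strings of one partition in a mutable map, then return an immutable copy.
  def histP(iter: Iterator[String]): Iterator[Map[String, Long]] = {
    val hist = mutable.Map.empty[String, Long]
    iter.foreach(c => hist.update(c, hist.getOrElse(c, 0L) + 1L))
    Iterator(hist.toMap)
  }

  // Hypothetical stand-in for the `::` helper: sum the counts of two immutable maps.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // Hypothetical local data: each inner Seq plays the role of one partition.
  val partitions: Seq[Seq[String]] = Seq(Seq("x", "y", "x"), Seq("y", "z"))

  // One immutable map per partition, merged pairwise like reduce on the RDD.
  val histogram: Map[String, Long] =
    partitions.flatMap(p => histP(p.iterator)).reduce((a, b) => merge(a, b))

  def main(args: Array[String]): Unit = println(histogram)
}
```

Only one immutable map is created per partition (by `toMap`), plus the intermediate maps built during the merge, so the number of immutable updates scales with the number of distinct keys per partition, not with the number of records.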