Histogram with Spark (2) – Implicit class

In the previous post we studied how to calculate a histogram on an RDD[String].
By using implicit type conversion, we can add helper methods to the Map class and make the code look better.

Before Scala 2.10, adding a new helper method to an existing type through implicit type conversion required defining a new class plus an implicit function to do the conversion, like the following:

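// wrapper class that carries the extra method
class mapHelper[K,V](m : Map[K,V]){
   // apply f to the current value of key c (or to the default d if c is absent)
   def updatedWith(c : K,d : V)(f : V => V) = {
     m.updated(c,f(m.getOrElse(c,d)))
   }
}
// implicit function that converts a Map into the wrapper
implicit def toMapHelper[K,V](m : Map[K,V]) = new mapHelper(m)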

As you can see, the “toMapHelper” method is somewhat redundant. Since Scala 2.10, we can put the implicit keyword on the class itself, so that the class constructor is used for the implicit conversion:

implicit class mapHelper[K,V](m : Map[K,V]){...
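
To make the implicit class form concrete, here is a minimal, self-contained sketch. The names `MapHelperSketch`, `MapHelper`, `updatedWithDefault`, `counts`, `bumped` and `added` are made up for this illustration. The method is renamed from `updatedWith` only as a precaution, because Scala 2.13 added its own `updatedWith` to Map with a different signature. The implicit class is wrapped in an object because in compiled Scala 2 code an implicit class cannot be declared at the top level.

```scala
object MapHelperSketch {
  // Since Scala 2.10: the constructor doubles as the implicit conversion,
  // so no separate "implicit def" is needed.
  implicit class MapHelper[K, V](m: Map[K, V]) {
    // Apply f to the current value of key c (or to the default d if c is absent).
    def updatedWithDefault(c: K, d: V)(f: V => V): Map[K, V] =
      m.updated(c, f(m.getOrElse(c, d)))
  }

  val counts: Map[String, Long] = Map("a" -> 1L)

  // Existing key: 1 becomes 2.
  val bumped: Map[String, Long] = counts.updatedWithDefault("a", 0L)(_ + 1)

  // Missing key: starts from the default 0 and becomes 1.
  val added: Map[String, Long] = counts.updatedWithDefault("b", 0L)(_ + 1)

  def main(args: Array[String]): Unit = {
    println(bumped)
    println(added)
  }
}
```

The calls on `counts` compile only because the compiler silently wraps the Map in `MapHelper`; the Map class itself is not changed.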

Now let’s create the helper methods for the histogram:

implicit class histMap[K](m : Map[K,Long]){
   // add n to the count stored for key c
   def addcount(c : K,n : Long) = {
     m.updated(c, m.getOrElse(c, 0L) + n)
   }
   // merge two count maps: /: is foldLeft, starting from n and folding in the entries of m
   def :: (n : Map[K,Long]) = {
     (n /: m){ case (map,(k,v)) => map.updated(k, v + map.getOrElse(k, 0L)) }
   }
}

Here we defined two methods on the implicit class “histMap”, which wraps “Map[K,Long]” objects. With this, we can compute the histogram as:

scala> d.aggregate(Map[String,Long]())(_.addcount(_, 1), _ :: _)
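
To see what the two functions passed to aggregate do, the following sketch mimics the call on plain Scala collections, without Spark. It is only an approximation: `partitions` is a hypothetical stand-in for an RDD[String] with two partitions, and the names `AggregateSketch`, `HistMap`, `perPartition` and `hist` are made up for this illustration. `foldLeft` is used instead of `/:`, which is the same operation.

```scala
object AggregateSketch {
  implicit class HistMap[K](m: Map[K, Long]) {
    // Add n to the count stored for key c.
    def addcount(c: K, n: Long): Map[K, Long] =
      m.updated(c, m.getOrElse(c, 0L) + n)

    // The name ends with ':', so it is right-associative: `a :: b` calls `b.::(a)`.
    // Starts from n and folds the entries of m into it.
    def ::(n: Map[K, Long]): Map[K, Long] =
      m.foldLeft(n) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0L)) }
  }

  // Hypothetical stand-in for an RDD[String] with two partitions.
  val partitions: Seq[Seq[String]] = Seq(Seq("a", "b", "a"), Seq("b", "c"))

  // Role of seqOp: fold every partition into its own map, starting from the zero value.
  val perPartition: Seq[Map[String, Long]] =
    partitions.map(p => p.foldLeft(Map[String, Long]())(_.addcount(_, 1)))

  // Role of combOp: merge the per-partition maps into one.
  val hist: Map[String, Long] = perPartition.reduce(_ :: _)

  def main(args: Array[String]): Unit = println(hist)
}
```

In this sketch "a" and "b" each end up with a count of 2 and "c" with a count of 1.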

So far, we have used immutable.Map to store the counts. According to this benchmark (Scala Map Benchmark), immutable.Map is significantly slower than mutable.Map, and mutable.Map is significantly slower than java.util.HashMap. With the scala.collection.JavaConversions.mapAsScalaMap implicit conversion, we can pretty much treat a java.util.HashMap as a Scala mutable.Map.
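
As a rough illustration of treating a java.util.HashMap as a mutable.Map, here is a small sketch. Note that it swaps in the explicit `.asScala` from scala.collection.JavaConverters rather than the implicit JavaConversions mentioned above, since JavaConversions was deprecated in Scala 2.12 and removed in 2.13 (on 2.13 JavaConverters itself only produces a deprecation warning). The names `JavaMapSketch`, `javaMap` and `counts` are made up for this illustration.

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable

object JavaMapSketch {
  val javaMap = new java.util.HashMap[String, Long]()

  // asScala wraps the Java map in a mutable.Map view; nothing is copied.
  val counts: mutable.Map[String, Long] = javaMap.asScala

  // Updates made through the Scala view are written to the underlying HashMap.
  for (c <- Seq("a", "b", "a"))
    counts(c) = counts.getOrElse(c, 0L) + 1

  def main(args: Array[String]): Unit = println(javaMap)
}
```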

If we want to use mutable.Map to store the counts, we can simply do

import scala.collection.mutable._

and change

m.updated(c, (m.getOrElse(c, 0L)+n))

to

m += (c -> (m.getOrElse(c, 0L)+n))

According to the API documentation, the difference between m.updated and m += is that the “updated” method returns a new instance of mutable.Map, while “+=” returns the existing instance. I’m not sure how large the performance difference really is, so “+=” is used here just to play it safe.
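
The difference in behaviour (not in speed) can be checked with a few lines. This is a small sketch with made-up names (`UpdatedVsPlusEq`, `fresh`, `same` and so on); it measures nothing and only shows which call returns a new instance. On Scala 2.13, `updated` on a mutable.Map still works but is deprecated.

```scala
import scala.collection.mutable

object UpdatedVsPlusEq {
  val m: mutable.Map[String, Long] = mutable.Map("a" -> 1L)

  // updated builds a new map and leaves m untouched.
  val fresh = m.updated("a", 2L)
  val freshIsNewInstance: Boolean = !(fresh eq m)
  val valueAfterUpdated: Long = m("a") // read before the += below, still the old value

  // += changes m in place and returns m itself.
  val same = (m += ("a" -> 3L))
  val sameIsM: Boolean = same eq m

  def main(args: Array[String]): Unit =
    println((freshIsNewInstance, valueAfterUpdated, sameIsM, m("a"), fresh("a")))
}
```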

However, with this approach the “d.aggregate” step returns a mutable.Map, because the signature of aggregate is

def aggregate[U](zeroValue : U)(seqOp : (U, T) ⇒ U, combOp : (U, U) ⇒ U)(implicit arg0 : ClassTag[U]) : U

As you can see, the return type is the same as the type of the internal accumulator (U).

To solve this problem, let’s use a more flexible approach, “mapPartitions(…).reduce(…)”:


scala> import scala.collection.mutable

scala> def histP(iter : Iterator[String]) : Iterator[Map[String,Long]] = {
      | val hist : mutable.Map[String,Long] = mutable.Map()
      | while (iter.hasNext){
      |   val c = iter.next()
      |   hist.update(c, hist.getOrElse(c, 0L) + 1)
      | }
      | Iterator(hist.toMap)
      | }

scala> d.mapPartitions(histP(_)).reduce(_ :: _)

The code above does the same thing as the aggregate method. However, the histP method returns an Iterator of immutable Map, although internally it uses a mutable map. Assuming the RDD[String] has a lot of records in each partition, most of the map updates happen in place on the mutable.Map.
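
The same flow can be tried without a cluster. The sketch below is an approximation on plain Scala collections: `partitions` is a hypothetical stand-in for the RDD[String], `flatMap` plays the role of mapPartitions, and the names `MapPartitionsSketch`, `HistMap` and `hist` are made up for this illustration.

```scala
import scala.collection.mutable

object MapPartitionsSketch {
  implicit class HistMap[K](m: Map[K, Long]) {
    // Right-associative merge: `a :: b` calls `b.::(a)` and folds b's entries into a.
    def ::(n: Map[K, Long]): Map[K, Long] =
      m.foldLeft(n) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0L)) }
  }

  // Count inside one partition with a mutable map, then hand back an immutable copy.
  def histP(iter: Iterator[String]): Iterator[Map[String, Long]] = {
    val hist = mutable.Map[String, Long]()
    while (iter.hasNext) {
      val c = iter.next()
      hist.update(c, hist.getOrElse(c, 0L) + 1)
    }
    Iterator(hist.toMap)
  }

  // Hypothetical stand-in for an RDD[String] with two partitions.
  val partitions: Seq[Seq[String]] = Seq(Seq("a", "b", "a"), Seq("b", "c"))

  // One immutable map per partition, merged pairwise by the implicit `::`.
  val hist: Map[String, Long] =
    partitions.flatMap(p => histP(p.iterator)).reduce(_ :: _)

  def main(args: Array[String]): Unit = println(hist)
}
```

Only the merge of the per-partition results goes through immutable maps here; every per-record update happens on the mutable map inside `histP`.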

