Histogram with Spark (2) – Implicit class


License: CC 4.0 BY-SA

Tags: #Scala #spark

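This post looks at how Scala's implicit conversions can tidy up the histogram computation on an RDD by adding helper methods through an implicit class. It also covers switching the counts from an immutable Map to a mutable Map while staying efficient, and shows how to compute the histogram with `aggregate` and with `mapPartitions`.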

In the previous post we studied how to calculate the histogram of an RDD[String].
By using implicit type conversion, we can add helper methods to the Map class and make the code look better.

Before Scala 2.10, adding a helper method to an existing type through implicit conversion required defining a new class plus an implicit function to perform the conversion, like the following:

 
  
    class mapHelper[K,V](m: Map[K,V]) {
      def updatedWith(c: K, d: V)(f: V => V) = {
        m.updated(c, f(m.getOrElse(c, d)))
      }
    }
    implicit def toMapHelper[K,V](m: Map[K,V]) = new mapHelper(m)
       
 
 

As you can see, the “toMapHelper” method is somewhat redundant. Since 2.10, we can put the implicit keyword on the class itself, so the constructor of the class is used for the implicit conversion:

 
    implicit class mapHelper[K,V](m: Map[K,V]) {...
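To make the implicit class form concrete, here is a minimal, self-contained sketch. The names `MapHelperSketch` and `updatedWithDefault` are hypothetical; the method is renamed here only because Scala 2.13 added its own `updatedWith` to Map with a different signature. In Scala 2 an implicit class has to live inside an object, class, or trait, which is why everything is wrapped in an object.

```scala
object MapHelperSketch {
  // The constructor doubles as the implicit conversion from Map[K, V].
  implicit class MapHelper[K, V](m: Map[K, V]) {
    // Apply f to the current value of c, or to the default d if c is absent.
    def updatedWithDefault(c: K, d: V)(f: V => V): Map[K, V] =
      m.updated(c, f(m.getOrElse(c, d)))
  }

  val counts: Map[String, Int] = Map("a" -> 1)

  // Existing key: 1 becomes 2.
  val bumped: Map[String, Int] = counts.updatedWithDefault("a", 0)(_ + 1)

  // Missing key: starts from the default 0, so "b" maps to 1.
  val added: Map[String, Int] = counts.updatedWithDefault("b", 0)(_ + 1)
}
```

The call sites read as if Map had the method all along, and no separate `implicit def` is needed.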

Now let’s create the helper functions for the histogram:

 
    implicit class histMap[K](m: Map[K,Long]) {
      def addcount(c: K, n: Long) = {
        m.updated(c, (m.getOrElse(c, 0L) + n))
      }
      def ::(n: Map[K,Long]) = {
        (n /: m) { case (map, (k, v)) => map.updated(k, v + map.getOrElse(k, 0L)) }
      }
    }

Here we defined two methods on the implicit class “histMap”, which wraps “Map[K,Long]” objects: “addcount” adds n to the count of one key, and “::” merges two count maps. With this, we can compute the histogram as:

 
  
    scala> d.aggregate(Map[String,Long]())(_.addcount(_,1), _::_)
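To see what the two helpers do without a Spark cluster, the following sketch replays the same logic on plain Scala collections. It is only an approximation of `aggregate`: the nested `Seq` called `partitions` is a hypothetical stand-in for an RDD[String] with two partitions, and `foldLeft` is used in place of `/:` (the two are equivalent, and `/:` is deprecated in Scala 2.13).

```scala
object HistMapSketch {
  implicit class HistMap[K](m: Map[K, Long]) {
    // Add n to the count stored under key c.
    def addcount(c: K, n: Long): Map[K, Long] =
      m.updated(c, m.getOrElse(c, 0L) + n)

    // Method names ending in ':' are right-associative, so `a :: b`
    // means b.::(a): the entries of b (m) are folded into a (n).
    def ::(n: Map[K, Long]): Map[K, Long] =
      m.foldLeft(n) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0L)) }
  }

  // Hypothetical data standing in for the partitions of an RDD[String].
  val partitions: Seq[Seq[String]] = Seq(Seq("a", "b", "a"), Seq("b", "c"))

  // Plays the role of seqOp: one histogram per partition.
  val perPartition: Seq[Map[String, Long]] =
    partitions.map(p => p.foldLeft(Map.empty[String, Long])((acc, s) => acc.addcount(s, 1)))

  // Plays the role of combOp: merge the per-partition histograms.
  val hist: Map[String, Long] = perPartition.reduce((a, b) => a :: b)
}
```

With this sample data, `hist` holds a count of 2 for "a", 2 for "b", and 1 for "c".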
       
 
 

So far, we used immutable.Map to store the counts. According to this benchmark (Scala Map Benchmark), immutable.Map is significantly slower than mutable.Map, and mutable.Map is significantly slower than java.util.HashMap. With the implicit conversion scala.collection.JavaConversions.mapAsScalaMap, we can treat a java.util.HashMap pretty much as a Scala mutable.Map. (JavaConversions was deprecated in Scala 2.12 and removed in 2.13; newer code uses explicit converters and `.asScala` instead.)
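As a rough sketch of treating a java.util.HashMap as a Scala mutable.Map, the example below uses the explicit `asScala` converter from `scala.collection.JavaConverters` rather than the implicit JavaConversions mentioned above. That substitution is my assumption about what is available: JavaConverters exists in Scala 2.11 through 2.13 (with a deprecation warning in 2.13). The object name `JavaMapSketch` is hypothetical.

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable

object JavaMapSketch {
  // The Java map that actually stores the counts.
  val javaCounts = new java.util.HashMap[String, Long]()

  // A mutable.Map view backed by javaCounts; no data is copied.
  val counts: mutable.Map[String, Long] = javaCounts.asScala

  // Updates through the Scala view land in the underlying Java map.
  Seq("a", "b", "a").foreach { c =>
    counts.update(c, counts.getOrElse(c, 0L) + 1)
  }
}
```

Because the wrapper is only a view, reads and writes through `counts` go straight to the HashMap.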

If we want to use mutable.Map to store the counts, we can simply do

    import scala.collection.mutable._

and change

    m.updated(c, (m.getOrElse(c, 0L) + n))

to

    m += (c -> (m.getOrElse(c, 0L) + n))

According to the API documentation, the difference between m.updated and m += is that the “updated” method returns a new instance of mutable.Map, while “+=” returns the existing instance. I’m not sure how large the performance difference is in practice, so I just play it safe here and use “+=”.

However, with this approach the “d.aggregate” step returns a mutable.Map, because the signature of aggregate is
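The difference between the two calls can be checked with a small sketch like the one below (the object name `UpdatedVsPlusEq` is hypothetical). Note that in Scala 2.13 `updated` on a mutable.Map still works but is deprecated.

```scala
import scala.collection.mutable

object UpdatedVsPlusEq {
  val m: mutable.Map[String, Long] = mutable.Map("a" -> 1L)

  // updated builds a new map; m itself is left untouched.
  val copy = m.updated("a", 2L)
  val originalAfterUpdated: Long = m("a")
  val copyIsSameInstance: Boolean = copy eq m

  // += modifies m in place and returns that same instance.
  val same = m += ("a" -> 3L)
  val plusEqIsSameInstance: Boolean = same eq m
  val originalAfterPlusEq: Long = m("a")
}
```

Since `updated` has to produce a separate map on every call, `+=` is the natural choice when the map is only an accumulator.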

 
  
    def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
       
 
 

As you can see, the return type is the same as the type of the internal accumulator (U), so accumulating in a mutable.Map means getting a mutable.Map back.

To solve this problem, let’s use a more flexible approach: “mapPartitions(…).reduce(…)”

 
    scala> import scala.collection.mutable
    scala> def histP(iter: Iterator[String]): Iterator[Map[String,Long]] = {
         |   val hist: mutable.Map[String,Long] = mutable.Map()
         |   while (iter.hasNext) {
         |     val c = iter.next()
         |     hist.update(c, hist.getOrElse(c, 0L) + 1)
         |   }
         |   Iterator(hist.toMap)
         | }
    scala> d.mapPartitions(histP(_)).reduce(_::_)

The code above does the same thing as the aggregate version. However, the histP method returns an Iterator of immutable Map, although internally it uses a mutable map. Assuming the RDD[String] has many records in each partition, most of the map updates happen on the mutable.Map and are done in place; only the per-partition results are merged as immutable maps.
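The same pattern can be tried locally with plain iterators. This is only a sketch under the assumption that each inner `Seq` stands in for one partition; `HistPSketch` and the sample data are hypothetical, and `flatMap` over the partitions imitates what `mapPartitions` does on an RDD.

```scala
import scala.collection.mutable

object HistPSketch {
  implicit class HistMap[K](m: Map[K, Long]) {
    // `a :: b` is b.::(a): fold the entries of b (m) into a (n).
    def ::(n: Map[K, Long]): Map[K, Long] =
      m.foldLeft(n) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0L)) }
  }

  // Count in place with a mutable map, then hand back one immutable snapshot.
  def histP(iter: Iterator[String]): Iterator[Map[String, Long]] = {
    val counts = mutable.Map.empty[String, Long]
    iter.foreach(c => counts.update(c, counts.getOrElse(c, 0L) + 1))
    Iterator(counts.toMap)
  }

  // Hypothetical partitions, including an empty one.
  val partitions: Seq[Seq[String]] = Seq(Seq("x", "y", "x"), Seq("y", "y"), Seq.empty)

  // One histogram per partition, merged pairwise with `::`.
  val hist: Map[String, Long] =
    partitions.iterator.flatMap(p => histP(p.iterator)).reduce((a, b) => a :: b)
}
```

An empty partition simply contributes an empty map, so the merge step needs no special case for it.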