Histogram in Spark (1)

bzhangusc

Tags: Scala, Spark

Article link: https://blog.csdn.net/bzhangusc/article/details/42963433

Spark’s DoubleRDDFunctions provides a histogram function for RDD[Double]. However, there is no histogram function for RDD[String]. Here is a quick exercise that builds one. We will use an immutable Map in this exercise.

Create a dummy RDD[String] and apply the aggregate method to calculate the histogram:

    scala> val d = sc.parallelize((1 to 10).map(_ % 3).map("val" + _.toString))

    scala> d.aggregate(Map[String,Int]())(
         | (m,c) => m.updated(c, m.getOrElse(c,0)+1),
         | (m,n) => (m /: n){case (map,(k,v)) => map.updated(k, v+map.getOrElse(k,0))}
         | )
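
To make the two roles of the functions passed to aggregate concrete, here is a minimal plain-Scala sketch that imitates the computation on local collections, with no Spark involved. The names `partitions`, `seqOp`, `combOp` and the partition size of 4 are assumptions made up for this illustration; Spark decides the real partitioning itself.

```scala
// Plain-Scala simulation of what aggregate computes; no Spark required.
// `partitions`, `seqOp` and `combOp` are illustrative names, not Spark API.
val data: Seq[String] = (1 to 10).map(_ % 3).map("val" + _.toString)

// Pretend the data is split into partitions of up to 4 elements each.
val partitions: Seq[Seq[String]] = data.grouped(4).toSeq

// seqOp: fold one element into a partition-local histogram.
def seqOp(m: Map[String, Int], c: String): Map[String, Int] =
  m.updated(c, m.getOrElse(c, 0) + 1)

// combOp: merge two partition-local histograms by summing counts per key.
def combOp(m: Map[String, Int], n: Map[String, Int]): Map[String, Int] =
  n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

val zero = Map.empty[String, Int]

// Step 1: each partition is reduced independently, starting from the zero value.
val perPartition: Seq[Map[String, Int]] = partitions.map(_.foldLeft(zero)(seqOp))

// Step 2: the partial results are merged into a single histogram.
val histogram: Map[String, Int] = perPartition.foldLeft(zero)(combOp)
```

Here `histogram` ends up with val0 -> 3, val1 -> 4 and val2 -> 3, the same counts we would expect from the aggregate call on the RDD.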
       
 
 

The second function passed to aggregate merges two maps. We can factor it out into a standalone Scala function:

    scala> def mapadd[T](m: Map[T,Int], n: Map[T,Int]) = {
         | (m /: n){case (map,(k,v)) => map.updated(k, v+map.getOrElse(k,0))}
         | }
       
 
 

It combines the histograms computed on different partitions:

    scala> mapadd(Map("a"->1,"b"->2),Map("a"->2,"c"->1))
    res3: scala.collection.mutable.Map[String,Int] = Map(b -> 2, a -> 3, c -> 1)
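
Two side notes on the merge function, as a hedged sketch. The `/:` operator is an alias for foldLeft and is deprecated as of Scala 2.13, so the same logic can be written with foldLeft directly; `mapAddFold` below is a hypothetical name used only here. Also, the mutable.Map type in the transcript suggests that scala.collection.mutable.Map had been imported in that REPL session; with the default Map the same call should return an immutable map with the same entries.

```scala
// Same merge logic as mapadd, written with foldLeft instead of the /: operator.
// `mapAddFold` is a hypothetical name used only for this sketch.
def mapAddFold[T](m: Map[T, Int], n: Map[T, Int]): Map[T, Int] =
  n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

// Uses the default (immutable) Map; keys present in both maps have their counts summed.
val merged = mapAddFold(Map("a" -> 1, "b" -> 2), Map("a" -> 2, "c" -> 1))
// merged holds a -> 3, b -> 2, c -> 1 (Map equality ignores entry order)
```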
       
 
 

Using mapadd, we can rewrite the aggregate step:

 
        scala> d.aggregate(Map[String,Int]())( 
       
        | (m,c) 
        = 
        >m.updated(c,m.getOrElse(c, 
        0 
        )+ 
        1 
        ), 
       
        | mapadd( 
        _ 
        , 
        _ 
        ) 
       
        | )
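
One reason a merge like mapadd is a reasonable combiner: as far as we understand, Spark does not promise a particular order in which per-partition results are merged, so the combine step should give the same answer regardless of grouping or order, and the empty map should act as a neutral zero value. The sketch below checks this on three made-up partition histograms; `merge` is a hypothetical stand-in for mapadd so that the snippet is self-contained.

```scala
// Hypothetical stand-in for mapadd so this sketch is self-contained.
def merge[T](m: Map[T, Int], n: Map[T, Int]): Map[T, Int] =
  n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

// Three made-up partition-level histograms.
val p1 = Map("val0" -> 1, "val1" -> 2, "val2" -> 1)
val p2 = Map("val0" -> 1, "val1" -> 1, "val2" -> 2)
val p3 = Map("val0" -> 1, "val1" -> 1)
val empty = Map.empty[String, Int]

// Different groupings and orders of the same three partial results.
val leftFirst  = merge(merge(p1, p2), p3)
val rightFirst = merge(p1, merge(p2, p3))
val reordered  = merge(p3, merge(p2, p1))

// Merging with the empty map on either side leaves a histogram unchanged.
val withZero = merge(empty, merge(p1, empty))
```

All three merged results are equal as maps, and `withZero` equals `p1`, which is what we want from a combiner and its zero value.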
