Reprinted from:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
- Avoid `reduceByKey` when the input and output value types are different. For example, consider writing a transformation that finds all the unique strings corresponding to each key. One way would be to use `map` to transform each element into a `Set` and then combine the `Set`s with `reduceByKey`:
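A sketch of that approach (the `SparkSession` setup and sample data are illustrative, not from the article):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unique-strings").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(
  Seq(("fruit", "apple"), ("fruit", "apple"), ("fruit", "pear")))

// Inefficient: every record allocates a fresh single-element Set, and
// each reduceByKey merge allocates yet another Set for the union.
val uniquesByKey = rdd
  .map { case (k, v) => (k, Set(v)) }
  .reduceByKey(_ ++ _)
```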
This code results in tons of unnecessary object creation because a new set must be allocated for each record. It's better to use `aggregateByKey`, which performs the map-side aggregation more efficiently:
- Avoid the `flatMap-join-groupBy` pattern. When two datasets are already grouped by key and you want to join them and keep them grouped, you can just use `cogroup`. That avoids all the overhead associated with unpacking and repacking the groups; see the sketch below.
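A minimal sketch contrasting the two patterns; the datasets and names are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cogroup-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext
val clicks = sc.parallelize(Seq(("alice", "home"), ("alice", "cart"), ("bob", "home")))
val orders = sc.parallelize(Seq(("alice", 42), ("bob", 7)))

// Anti-pattern: ungroup two grouped datasets with flatMap, join, then
// groupByKey to rebuild the groups -- an extra shuffle for nothing.
val slow = clicks.groupByKey()
  .flatMap { case (u, pages) => pages.map(p => (u, p)) }
  .join(orders.groupByKey().flatMap { case (u, xs) => xs.map(x => (u, x)) })
  .groupByKey()

// Better: cogroup joins and groups in a single pass, yielding
// (key, (all clicks for the key, all orders for the key)).
val fast = clicks.cogroup(orders)
```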