Optimize map performance with mapPartitions


bzhangusc

Tags: Scala, Spark

Article link: https://blog.csdn.net/bzhangusc/article/details/43151727


As we saw in the previous article, "CSV Parser", we may need to create a new object for each record of an RDD, as in:

 
    def mLine(line: String) = {
      val parser = new CSVParser('\t')
      parser.parseLine(line)
    }
    ...
    ...myRDD.map(mLine(_).size)...

The mLine function is passed to the RDD's map method. In this case, a new parser object is created for every record, even though all of these parsers are identical.

In fact, whenever we apply a complicated operation to each record, there is a good chance we need to create some helper objects inside map. By combining mapPartitions with Scala's Iterator.map, we can avoid this unnecessary object creation: the helper is built once per partition and reused for every record in it. Let's rewrite the above example with mapPartitions:

 
    def pLines(lines: Iterator[String]) = {
      val parser = new CSVParser('\t')
      lines.map(parser.parseLine(_).size)
    }
    ...
    myRDD.mapPartitions(pLines)

On my single-box test machine, the execution time of the same task dropped from 65 seconds to 35 seconds. Surprisingly, the opencsv parser with the mapPartitions optimization is significantly faster than map(_.split('\t')).
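
To make the difference in object creation concrete without a Spark cluster, here is a minimal sketch in plain Scala. It is not the code from the test above: TabParser is a hypothetical stand-in for an expensive helper such as the opencsv parser, a Seq[Seq[String]] plays the role of an RDD's partitions, and the created counter exists only to show how many helpers each approach builds.

```scala
import java.util.concurrent.atomic.AtomicInteger

object MapPartitionsSketch {
  // Illustration only: counts how many helper objects have been constructed.
  val created = new AtomicInteger(0)

  // Hypothetical stand-in for an expensive helper (e.g. a CSV parser).
  class TabParser {
    created.incrementAndGet()
    def parseLine(line: String): Array[String] = line.split('\t')
  }

  // Mirrors myRDD.map(mLine(_).size): a new parser for every record.
  def perRecord(partitions: Seq[Seq[String]]): Seq[Int] =
    partitions.flatMap(_.map(line => new TabParser().parseLine(line).length))

  // Mirrors myRDD.mapPartitions(pLines): one parser per partition,
  // reused for every record through Iterator.map.
  def perPartition(partitions: Seq[Seq[String]]): Seq[Int] =
    partitions.flatMap { lines =>
      val parser = new TabParser
      lines.iterator.map(parser.parseLine(_).length).toList
    }

  def main(args: Array[String]): Unit = {
    // Two simulated partitions holding three records in total.
    val data = Seq(Seq("a\tb", "c\td\te"), Seq("f"))

    created.set(0)
    val fieldCounts1 = perRecord(data)
    println(s"per record:    $fieldCounts1, parsers created: ${created.get}")

    created.set(0)
    val fieldCounts2 = perPartition(data)
    println(s"per partition: $fieldCounts2, parsers created: ${created.get}")
  }
}
```

Both functions return the same field counts, but the per-record version builds one parser per record (three here) while the per-partition version builds one per partition (two here). On a real RDD the gap grows with the number of records per partition, which is where the saving comes from.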
