How to implement a secondary sort in Spark, recorded here for future reference.
1. Test file testsortTwo:
- [root@tongji ~]# hadoop fs -cat /user/wzx/testsortTwo
- 1444697637.786 180.175.251.34 wv.88mf.com _trackClick 174|139||-17718753 0000436cc2ad45bb8df6a70bd09e146f
- 1444695603.085 218.22.168.122 wv.17mf.com _trackPageview 0002a9ed7d754a08957912700e36d731
- 1444696305.588 106.110.49.210 wv.88mf.com _trackPageview 00034c9597df47b6a3041635334daa3d
- 1444696305.588 221.2.101.146 wv.77mf.com _trackMover 446,650|492,635|520,629 000364344c8649f8bdbf66dba76f8ed1
- 1444695543.619 120.193.187.66 c.mfniu.com _trackPageview 00042d7207ee4eb29d1604c724629182
- 1444697033.836 183.54.102.45 c.mfniu.com _trackPageview 000436b51eb844aa9e002ff62c21168c
- 1444696305.588 58.215.136.139 wv.88mf.com _trackPageview 00051113efbf4ae1a805b2bf262ca26d
- 1444697308.329 61.164.41.227 wv.17mf.com _trackPageview 00054c1620814bfcaeba6a88d6b3c54c
Spark code (sort by the first column, then by the third column):
val text = sc.textFile("/user/wzx/testsortTwo")
// Build a composite key (timestamp, host); the remaining fields become the value.
val rdd1 = text.map(x => x.split(" ")).map { x =>
  val len = x.length
  if (len == 5) {
    ((x(0), x(2)), (x(1), x(3), x(4)))
  } else if (len == 6) {
    ((x(0), x(2)), (x(1), x(3), x(4), x(5)))
  } else {
    ((x(0), x(2)), x(1))
  }
}
// sortByKey on the tuple key orders by timestamp first, then host.
val rdd2 = rdd1.groupByKey().sortByKey()
rdd2.collect()
Result:
- ((1444695543.619,c.mfniu.com),CompactBuffer((120.193.187.66,_trackPageview,00042d7207ee4eb29d1604c724629182))),
- ((1444695603.085,wv.17mf.com),CompactBuffer((218.22.168.122,_trackPageview,0002a9ed7d754a08957912700e36d731))),
- ((1444696305.588,wv.77mf.com),CompactBuffer((221.2.101.146,_trackMover,446,650|492,635|520,629,000364344c8649f8bdbf66dba76f8ed1))),
- ((1444696305.588,wv.88mf.com),CompactBuffer((106.110.49.210,_trackPageview,00034c9597df47b6a3041635334daa3d), (58.215.136.139,_trackPageview,00051113efbf4ae1a805b2bf262ca26d))),
- ((1444697033.836,c.mfniu.com),CompactBuffer((183.54.102.45,_trackPageview,000436b51eb844aa9e002ff62c21168c))),
- ((1444697308.329,wv.17mf.com),CompactBuffer((61.164.41.227,_trackPageview,00054c...
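The trick above is that `sortByKey` uses the implicit lexicographic ordering of the tuple key, so sorting by `(x(0), x(2))` sorts by the first column and breaks ties on the third. The same logic can be traced locally with plain Scala collections (no SparkContext needed); the lines below are shortened samples with placeholder hashes `a`, `b`, `c`, not the real log data:

```scala
// Local sketch of the composite-key secondary sort, using Scala collections
// in place of an RDD. groupBy stands in for groupByKey, sortBy for sortByKey.
val lines = Seq(
  "1444696305.588 106.110.49.210 wv.88mf.com _trackPageview a",
  "1444695603.085 218.22.168.122 wv.17mf.com _trackPageview b",
  "1444696305.588 58.215.136.139 wv.88mf.com _trackPageview c"
)
val grouped = lines
  .map(_.split(" "))
  .map(x => ((x(0), x(2)), (x(1), x(3), x(4))))  // composite key (timestamp, host)
  .groupBy(_._1)                                  // like groupByKey
  .map { case (k, vs) => (k, vs.map(_._2)) }
  .toSeq
  .sortBy(_._1)                                   // lexicographic on the tuple key
grouped.foreach(println)
```

Note that `groupByKey` collects all values for a key on one executor; for large groups a production job would usually reach for `reduceByKey` or `repartitionAndSortWithinPartitions` instead.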
2. Test file testsort:
- wzx 2321
- admin 462
- yxy 21323
- zov 32
- wzx 123
- vi 2
- wzx 3
- wzx 9
- yxy 223
Spark code:
val text = sc.textFile("/user/wzx/testsort")
val rdd1 = text.map(x => x.split(" "))
  .map(x => (x(0), x(1).toInt))
  .groupByKey()
  .sortByKey(true)                                   // keys ascending
  .map(x => (x._1, x._2.toList.sortWith(_ > _)))     // values descending
rdd1.collect
Result:
Array((admin,List(462)), (vi,List(2)), (wzx,List(2321, 123, 9, 3)), (yxy,List(21323, 223)), (zov,List(32)))
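Here the secondary ordering is applied inside each group rather than through the key: keys are sorted ascending by `sortByKey(true)`, and each group's values are sorted descending with `sortWith(_ > _)`. A minimal local sketch of the same pipeline on plain Scala collections (sample data taken from testsort, abbreviated):

```scala
// Local sketch of example 2: group counts per key, sort keys ascending,
// then sort each group's values in descending order.
val pairs = Seq(("wzx", 2321), ("admin", 462), ("wzx", 123), ("wzx", 3))
val result = pairs
  .groupBy(_._1)                                        // like groupByKey
  .map { case (k, vs) => (k, vs.map(_._2).sortWith(_ > _)) }
  .toSeq
  .sortBy(_._1)                                         // like sortByKey(true)
result.foreach(println)
```

This mirrors the RDD version exactly: `sortWith(_ > _)` on an `Int` list yields descending order, so the largest count per key comes first.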
Source: ITPUB blog, http://blog.itpub.net/29754888/viewspace-1826229/. Please credit the original when reprinting.