Spark is an in-memory distributed computing framework, and its performance is impressive.
Picking up where the last post left off: with the Spark cluster deployed, I wanted to benchmark Spark's performance.
1. Environment
For an overview of the cluster, see the earlier post Spark Hadoop集群部署与Spark操作HDFS运行详解.
The cluster now holds a file of roughly 7 GB that pairs phone numbers with IP addresses.
hadoop dfs -dus /dw/spark/mobile.txt
hdfs://web02.dw:9000/dw/spark/mobile.txt 7056656190
The only field we care about is the IP address.
hadoop dfs -cat /dw/spark/mobile.txt | more
2014-04-21 104497 15936529112 2 2011-01-11 09:58:47 0 0 2011-01-11 09:58:50 2011-01-19 09:58:50 61.172.242.36 2011-01-19 08:59:47 0 0
2014-04-21 111864 13967013250 2 2010-11-28 21:06:56 0 0 2010-11-28 21:06:57 2010-12-06 21:06:57 61.172.242.36 2010-12-06 20:08:11 0 0
2014-04-21 116368 15957685805 2 2011-06-27 17:05:55 0 0 2011-06-27 17:06:01 2011-07-05 17:06:01 10.129.20.108 2011-07-05 16:11:05 0 0
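Each record packs a lot of fields, but only the IPv4 token matters here. A minimal sketch for pulling it out: since the exact separator and column position of the IP are assumptions on my part, matching the dotted-quad shape with a regex is safer than splitting by a fixed index.

```scala
// Sketch: extract the first IPv4-looking token from a raw record.
// The field layout (separator, column position) is assumed, so we
// match the dotted-quad pattern instead of splitting by index.
object ExtractIp {
  private val IpPattern = """\b(?:\d{1,3}\.){3}\d{1,3}\b""".r
  def ipOf(line: String): Option[String] = IpPattern.findFirstIn(line)
}
```

On the sample records above this yields Some("61.172.242.36") and Some("10.129.20.108"); dates, times, and phone numbers contain no dotted quads, so they never match.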
2. count and TopN
2.1 Counting the file
Spark makes it very easy to get the equivalent of SQL's select count(1) from xxx: just call the RDD's count function.
Launch spark-shell:
bin/spark-shell
scala> val data = sc.textFile("/dw/spark/mobile.txt")
14/05/14 17:23:33 INFO MemoryStore: ensureFreeSpace(73490) called with curMem=0, maxMem=308713881
14/05/14 17:23:33 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 71.8 KB, free 294.3 MB)
data: org.ap
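The TopN half of this section boils down to a wordcount plus a sort. The same aggregation, sketched on a plain Scala collection so it runs without a cluster (the topN helper is illustrative, not from the post; on an RDD the equivalent shape would be map to (ip, 1), then reduceByKey, then sort descending and take n):

```scala
// Sketch: top-N most frequent IPs over a local collection.
// On an RDD the analogous pipeline is:
//   rdd.map(ip => (ip, 1)).reduceByKey(_ + _).sortBy(-_._2).take(n)
def topN(ips: Seq[String], n: Int): Seq[(String, Int)] =
  ips.groupBy(identity)                          // ip -> all occurrences
     .map { case (ip, hits) => (ip, hits.size) } // ip -> count
     .toSeq
     .sortBy(-_._2)                              // most frequent first
     .take(n)
```

For example, topN(Seq("a", "b", "a", "c", "a", "b"), 2) returns the two most frequent keys with their counts.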