Over 7 million rows of data; each query was run twice.
------------------rdd---------------
val rdd = sc.textFile("hdfs://master:9000/spark/SogouQ/")
rdd.cache()
rdd.count()
6/09/09 19:19:11 INFO scheduler.DAGScheduler: Job 1 finished: count at <console>:24, took 15.594766 s
res1: Long = 7265051
--------------spark sql---------------------------------------
select count(*) from hive_test
7265051
Time taken: 15.448 seconds, Fetched 1 row(s)
---------------hive---------------------------------------------------
select count(*) from hive_test
First run:
OK
7265051
Time taken: 168.611 seconds, Fetched: 1 row(s)
Second run:
OK
7265051
Time taken: 96.413 seconds, Fetched: 1 row(s)
---------------------------------------------------------------------
Summary: Spark is by far the fastest. Both the raw RDD count and Spark SQL finished in about 15 s, while Hive took 168 s on the first run and 96 s on the second.
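As a quick sanity check, the speedup implied by the wall-clock times above can be computed directly (a small sketch using only the timings reported in this log):

```python
# Speedup of Spark SQL over Hive, from the timings recorded above.
spark_sql = 15.448    # seconds, Spark SQL count(*)
hive_first = 168.611  # seconds, Hive, first run (cold)
hive_second = 96.413  # seconds, Hive, second run (OS/HDFS caches warm)

print(f"vs Hive first run:  {hive_first / spark_sql:.1f}x faster")
print(f"vs Hive second run: {hive_second / spark_sql:.1f}x faster")
# → vs Hive first run:  10.9x faster
# → vs Hive second run: 6.2x faster
```

So even against a warmed-up Hive run, Spark is roughly 6x faster on this count, and about 11x faster on the cold run.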