More code at: https://github.com/xubo245/SparkLearning
Spark code 1: RDDparallelizeSaveAsFile
Main functionality:
1. Generate n random numbers in parallel, tally how often each value occurs, sort by count, and save the result to HDFS
2. Time the compute phase and the save phase separately
Code:
package LocalSpark

/**
  * Created by xubo on 2016/3/3.
  */
import org.apache.spark._
import scala.util.Random
import java.text.SimpleDateFormat
import java.util.Date
import scala.math._

object RDDparallelizeSaveAsFile {
  def main(args: Array[String]) {
    // val conf = new SparkConf().setAppName("RDDparallelize").setMaster("local")
    val conf = new SparkConf().setAppName("RDDparallelize").setMaster("spark://Master:7077")
    val spark = new SparkContext(conf)
    var startTime = System.currentTimeMillis()
    // n comes from the first argument, defaulting to 10 million
    var ar1 = if (args.length > 0) args(0).toInt else 10000000
    ar1 = min(ar1, Int.MaxValue)
    // generate n random ints in [0, 1000), count occurrences of each value,
    // then sort by count (ascending)
    val data = spark.parallelize(1 to ar1)
      .map(num => (new Random()).nextInt(1000))
      .map(num => (num, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2)
    var endTime = System.currentTimeMillis()
    println("compute:" + (endTime - startTime) + "ms")
    // timestamped output directory on HDFS
    val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
    val soutput = "hdfs://Master:9000/output/" + iString
    println(soutput)
    startTime = System.currentTimeMillis()
    data.saveAsTextFile(soutput)
    endTime = System.currentTimeMillis()
    println("saveAsTextFile:" + (endTime - startTime) + "ms")
    spark.stop()
  }
}
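The counting pipeline above can be sketched without Spark using plain Scala collections. This is a local analogue for illustration only (`groupBy` plus a per-group size playing the role of `reduceByKey(_ + _)`), not the Spark API, and the seed and n are made up for reproducibility:

```scala
import scala.util.Random

object LocalAnalogue {
  def main(args: Array[String]): Unit = {
    val rng = new Random(42) // fixed seed so the run is reproducible
    val n = 100000
    // same shape as the RDD pipeline: n random ints in [0, 1000),
    // counted per value, then sorted by count ascending
    val counts = (1 to n)
      .map(_ => rng.nextInt(1000))
      .groupBy(identity)                     // local analogue of reduceByKey(_ + _)
      .map { case (k, vs) => (k, vs.size) }
      .toSeq
      .sortBy(_._2)                          // analogue of sortBy(_._2)
    println(counts.map(_._2).sum)            // 100000: the per-key counts add back up to n
  }
}
```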
Script:
#!/usr/bin/env bash
spark-submit --name RDDparallelizeSaveAsFile \
--class LocalSpark.RDDparallelizeSaveAsFile \
--master spark://Master:7077 \
--executor-memory 512M \
--total-executor-cores 22 scala2.jar 2147483645
Master is the hostname of the master node; replace it with the real IP/hostname when running.
Note: the maximum Int is 2147483647; the script passes 2147483645.
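For reference, Scala's `Int.MaxValue` is 2147483647, so the `min(ar1, Int.MaxValue)` clamp in the code can never change an Int argument. A quick check:

```scala
import scala.math.min

object MaxValueCheck {
  def main(args: Array[String]): Unit = {
    println(Int.MaxValue)            // 2147483647
    // clamping an Int against Int.MaxValue is a no-op:
    // no Int can exceed Int.MaxValue
    val ar1 = 2147483645
    println(min(ar1, Int.MaxValue))  // 2147483645
  }
}
```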
With n = 2147483645:
Test results:
Cluster mode:
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom2.sh
compute:270435ms
hdfs://Master:9000/output/20160303203717406
saveAsTextFile:3658ms
Local mode:
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom2.sh
compute:380ms
hdfs://Master:9000/output/20160303205225605
saveAsTextFile:472576ms
Second run:
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom2.sh
compute:462ms
hdfs://Master:9000/output/20160303210423094
saveAsTextFile:494303ms
The local runs are clearly much slower on saveAsTextFile. Why do the two timers disagree so much? The code was changed between runs, and RDDs are lazily executed: transformations only run when an action is invoked, so the whole computation is deferred into saveAsTextFile. That is why "compute" looks fast and "saveAsTextFile" looks slow.
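The same lazy behavior can be seen with plain Scala views, which serve here as a local stand-in for RDD laziness (this is not Spark itself): building the pipeline is instant, and the work only happens when a terminal operation forces it, just as saveAsTextFile forces the RDD above.

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // like an RDD transformation chain: building the view runs nothing yet
    val pipeline = (1 to 1000000).view.map { x => evaluated += 1; x * 2 }
    println(evaluated)       // 0: the map has not executed
    val total = pipeline.sum // forcing the view, like calling an action
    println(evaluated)       // 1000000: now every element was processed
  }
}
```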
Partial output file contents:
(571,2143167)
(544,2143383)
(919,2143604)
(832,2143630)
(184,2143638)
(26,2143763)
(228,2143871)
(360,2143921)
(76,2143931)
(92,2144076)
(820,2144118)
(286,2144162)
(831,2144198)
With n = 10000000:
Cluster mode:
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom.sh
compute:5423ms
hdfs://Master:9000/output/20160303204414173
saveAsTextFile:3530ms
Local mode:
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom.sh
compute:442ms
hdfs://Master:9000/output/20160303205045778
saveAsTextFile:5112ms
Modified code:
package LocalSpark

/**
  * Created by xubo on 2016/3/3.
  */
import org.apache.spark._
import scala.util.Random
import java.text.SimpleDateFormat
import java.util.Date
import scala.math._

object RDDparallelizeSaveAsFile {
  def main(args: Array[String]) {
    // val conf = new SparkConf().setAppName("RDDparallelize").setMaster("local")
    val conf = new SparkConf().setAppName("RDDparallelize").setMaster("spark://Master:7077")
    val spark = new SparkContext(conf)
    var startTime = System.currentTimeMillis()
    var ar1 = if (args.length > 0) args(0).toInt else 10000000
    println("length:" + ar1)
    ar1 = min(ar1, Int.MaxValue)
    val data = spark.parallelize(1 to ar1)
      .map(num => (new Random()).nextInt(1000))
      .map(num => (num, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2)
    // count() is an action: it forces the whole RDD lineage to execute here,
    // so the "compute" timer below now measures the real work
    println(data.count())
    var endTime = System.currentTimeMillis()
    println("compute:" + (endTime - startTime) + "ms")
    val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
    val soutput = "hdfs://Master:9000/output/" + iString
    println(soutput)
    startTime = System.currentTimeMillis()
    data.saveAsTextFile(soutput)
    endTime = System.currentTimeMillis()
    println("saveAsTextFile:" + (endTime - startTime) + "ms")
    spark.stop()
  }
}
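The `1000` printed by `data.count()` in the runs below is the number of distinct keys, not n: `Random.nextInt(1000)` only produces values in [0, 1000), so after `reduceByKey` at most 1000 pairs remain. A local sketch of that bound (the n here is made up for illustration):

```scala
import scala.util.Random

object DistinctKeys {
  def main(args: Array[String]): Unit = {
    val rng = new Random()
    val keys = (1 to 100000).map(_ => rng.nextInt(1000)).distinct
    // every generated value lies in [0, 1000)
    println(keys.forall(k => k >= 0 && k < 1000)) // true
    // so there are at most 1000 distinct keys, and for n this large
    // almost surely exactly 1000
    println(keys.size)
  }
}
```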
Run results:
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom.sh
length:10000000
1000
compute:4361ms
hdfs://Master:9000/output/20160303211526342
saveAsTextFile:1229ms
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom2.sh
length:2147483645
1000
compute:464324ms
hdfs://Master:9000/output/20160303212342077
saveAsTextFile:1237ms
hadoop@Master:~/cloud/testByXubo/spark/testRandom$ ./submitJobTestRandom2.sh
length:2147483645
1000
compute:274269ms
hdfs://Master:9000/output/20160303214238474
saveAsTextFile:2662ms