- Version info: Spark 2.2 on Hadoop 2.6
1. Testing 1 GB of data in the Spark shell
(1) Test SQL: SELECT * FROM rddTable order by age
First start the Spark cluster: /…/…/sbin/start-all.sh
Then launch the Spark shell:
./spark-shell --master spark://10.47.85.158:7077 --executor-memory 2g --conf spark.default.parallelism=64
Spark master web UI: 10.47.85.158:8080
Spark job web UI: 10.47.85.158:4040
The code below is run inside the Spark shell:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(id: Long, name: String, age: Int, department: String, job_number: Long, hire_date: String, on_the_job: Boolean)
val rddpeople = sc.textFile("hdfs://10.47.85.158:9000/SQLData/100M/SQLData_1G.txt").map(_.split(",")).map(p => Person(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), p(4).trim.toLong, p(5), p(6).trim.toBoolean)).toDF()
rddpeople.createOrReplaceTempView("rddTable")
sqlContext.sql("SELECT * FROM rddTable order by age").write.format("csv").save("hdfs://10.47.85.158:9000/nht/spark1g/result1g111/")
Elapsed time over three runs: 11 s / 13 s / 12 s
//sqlContext.sql("SELECT * FROM rddTable order by age").write.save("hdfs://10.47.85.158:9000/nht/spark1g/result1g/") // the default output format is Parquet, compressed with Snappy
//sqlContext.sql("SELECT * FROM rddTable order by age").show() // show() displays 20 rows by default
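The RDD-to-DataFrame pipeline above hinges on the split-and-map parsing step. Here is a standalone sketch of just that parsing logic in plain Scala (no Spark required); the sample row is made-up illustration data, not from the benchmark file:

```scala
// Same schema as the spark-shell session above.
case class Person(id: Long, name: String, age: Int, department: String,
                  job_number: Long, hire_date: String, on_the_job: Boolean)

// Parse one comma-separated line into a Person, mirroring
// map(_.split(",")).map(p => Person(...)) from the shell code.
def parseLine(line: String): Person = {
  val p = line.split(",")
  Person(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3),
         p(4).trim.toLong, p(5), p(6).trim.toBoolean)
}

// Hypothetical sample row for illustration.
val sample = "1,Alice,30,Engineering,1001,2015-06-01,true"
val person = parseLine(sample)
assert(person.age == 30 && person.on_the_job)
```

Note that split(",") will throw at runtime on malformed rows (missing fields, non-numeric values, embedded commas); the test file is assumed to be clean CSV.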
2. Submitting a Java program with spark-submit
Reference blog: https://blog.csdn.net/qq_21383435/article/details/77428659
Modify the code from that post, then submit it to the cluster. The submit command varies with the deployment mode (local, standalone, yarn-client, and yarn-cluster):
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
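For reference, a filled-in submission against the standalone master above might look like the following; the jar path and main class are hypothetical placeholders, not taken from the original notes:

```shell
# Hypothetical example: submit a Spark SQL job to the standalone master.
# sparksql-test.jar and com.example.SparkSQLTest are placeholder names.
./bin/spark-submit \
  --class com.example.SparkSQLTest \
  --master spark://10.47.85.158:7077 \
  --deploy-mode client \
  --executor-memory 2g \
  --conf spark.default.parallelism=64 \
  /path/to/sparksql-test.jar \
  hdfs://10.47.85.158:9000/SQLData/100M/SQLData_1G.txt
```

Here --deploy-mode client matches the standalone setup used in the shell test; for yarn-cluster, change --master to yarn and --deploy-mode to cluster.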