Spark 1.5.2: Building a jar in Eclipse and Submitting It to a Cluster
Environment:
Windows 7
Ubuntu
Spark 1.5.2
1. WordCountSpark.scala code:
import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCountSpark {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is org.test.WordCount <master> <input> <output>")
      return
    }
    // Older SparkContext constructor: master, app name, Spark home, and jars to ship to executors
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile(args(1))
    // Split each line on whitespace, emit (word, 1) pairs, and sum the counts per word
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1)).reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
    sc.stop()
  }
}
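Before packaging the jar, it can help to sanity-check the same pipeline in local mode. The following sketch is not from the original post: the object name WordCountLocalCheck, the local[2] master, and the inline sample lines are assumptions used only for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal local-mode check of the word-count pipeline (illustrative only).
object WordCountLocalCheck {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCountLocalCheck").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val counts = sc.parallelize(Seq("hello spark", "hello hdfs"))   // hypothetical sample lines
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)   // expect (hello,2), (spark,1), (hdfs,1)
    sc.stop()
  }
}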
2. Submission script submitJob.sh:
#!/usr/bin/env bash
./spark-submit --name WordCountSpark \
--class WordCountSpark \
--master spark://219.219.220.149:7077 \
--executor-memory 512M \
--total-executor-cores 1 WordCountSpark.jar local /input/* /output/201601262158
Directory: /home/hadoop/cloud/spark-1.5.2/bin
The three values after the jar name are the program arguments: args(0) = local (the master string handed to the SparkContext constructor), args(1) = /input/* (input path), args(2) = /output/201601262158 (output path).
3. Build WordCountSpark.scala into a jar in Eclipse and upload it with rz to /home/hadoop/cloud/spark-1.5.2/bin;
Put the input data under /input on HDFS:
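If you prefer staging the data from code rather than with shell commands, the Hadoop FileSystem API can copy a local file into HDFS. This is only an illustrative sketch: the object name PutInput and the local file words.txt are assumptions, and it expects the cluster's Hadoop configuration (fs.defaultFS) to be on the classpath.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative: copy a local file into HDFS /input (words.txt is a hypothetical file).
object PutInput {
  def main(args: Array[String]) {
    val fs = FileSystem.get(new Configuration())   // picks up core-site.xml from the classpath
    fs.mkdirs(new Path("/input"))
    fs.copyFromLocalFile(new Path("words.txt"), new Path("/input/words.txt"))
    fs.close()
  }
}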
4. Run:
hadoop@Master:~/cloud/spark-1.5.2/bin$ ./submitJob.sh
Result: (run log screenshot not reproduced here)
References:
[1] http://bit1129.iteye.com/blog/2172164
[2] http://blog.csdn.net/ggz631047367/article/details/50185181
Similarly, the second program can be built and submitted with the same steps (a note on its sort step follows the code):
import org.apache.spark._
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    // Define the runtime configuration for the Spark application
    /*
    Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
    Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load
    values from any `spark.*` Java system properties set in your application as well. In this case,
    parameters you set directly on the `SparkConf` object take priority over system properties.
    */
    val conf = new SparkConf()
    conf.setAppName("SparkWordCount")
    // Define the Spark context
    /*
    Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
    cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
    Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before
    creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
    @param config a Spark Config object describing the application configuration. Any settings in
    this config overrides the default configs as well as system properties.
    */
    val sc = new SparkContext(conf)
    // Get the text from HDFS (lazily, nothing is read yet), building a MappedRDD
    val rdd = sc.textFile(args(0))
    // If this line reports "value reduceByKey is not a member of org.apache.spark.rdd.RDD[(String, Int)]",
    // add "import org.apache.spark.SparkContext._"
    rdd.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map(x => (x._2, x._1))     // swap to (count, word) so sortByKey can order by count
      .sortByKey(false)           // descending by count
      .map(x => (x._2, x._1))     // swap back to (word, count)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}
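The swap-sortByKey-swap sequence above orders words by descending count. As a side note (not part of the original post), the same ordering can be expressed with RDD.sortBy, available since Spark 1.0; the object name SparkWordCountSortBy below is a hypothetical name used only for this sketch.

import org.apache.spark.{SparkConf, SparkContext}

// Same word count, but sorted with RDD.sortBy instead of the swap/sortByKey/swap steps.
// args(0) and args(1) are the input and output paths, as in SparkWordCount above.
object SparkWordCountSortBy {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SparkWordCountSortBy"))
    val counts = sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.sortBy(_._2, ascending = false).saveAsTextFile(args(1))   // descending by count
    sc.stop()
  }
}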
Script (submitJob_SparkWordCount.sh):
#!/usr/bin/env bash
./spark-submit --name SparkWordCount \
--class SparkWordCount \
--master spark://219.219.220.149:7077 \
--executor-memory 512M \
--total-executor-cores 1 SparkWordCount.jar /input/* /output/201601262211
Run:
hadoop@Master:~/cloud/spark-1.5.2/bin$ ./submitJob_SparkWordCount.sh