1. Following the previous articles, set up the Spark on YARN cluster, i.e. both Hadoop and Spark are installed and working.
Start Hadoop and YARN:
/usr/local/hadoop/sbin/start-all.sh
Verify the daemons with jps:
6661 NameNode
7163 ResourceManager
7300 NodeManager
7012 SecondaryNameNode
7512 Jps
6795 DataNode
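Besides jps, you can optionally check that the NodeManager has registered with the ResourceManager:
# List the NodeManagers known to YARN
yarn node -list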
2. Open Eclipse and create a Maven project.
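If you prefer the command line, an equivalent quickstart project can be generated with Maven directly; a minimal sketch, where the groupId/artifactId simply match the package and jar names used later in this article:
# Generate a quickstart Maven project matching the names used below
mvn archetype:generate \
  -DgroupId=com.fei.simple_project \
  -DartifactId=simple-project \
  -DarchetypeArtifactId=maven-archetype-quickstart \
  -DinteractiveMode=false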
3. Edit pom.xml and add the Spark jar dependency:
<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
    <scope>provided</scope>
</dependency>
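To confirm that the dependency resolves (optional), you can print the project's dependency tree from the project directory:
# List resolved dependencies and filter for the Spark core artifact
mvn dependency:tree | grep spark-core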
4. Right-click the project and choose Run As -> Maven install.
This downloads the dependency jars.
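The same build can also be run outside Eclipse; a sketch, assuming you are in the project root containing pom.xml:
# Equivalent command-line build; downloads dependencies and installs the artifact
mvn clean install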
5. Call Spark from the App.java main class:
package com.fei.simple_project;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

/**
 * Simple Spark application: counts the lines in README.md
 * that contain "a" and the lines that contain "b".
 */
public class App {
    public static void main(String[] args) {
        // A relative path is resolved against the default filesystem (HDFS on this cluster)
        String logFile = "README.md";
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Cache the RDD because it is reused by both counts
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Hello World!");
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
6. Run As -> Maven install again.
The jar is generated under the target directory.
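You can check that the class made it into the jar (optional); a sketch, assuming the default artifact name:
# List the jar contents and look for the App class
jar tf target/simple-project-0.0.1-SNAPSHOT.jar | grep App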
7. Run spark.sh, which contains the following spark-submit command:
/usr/local/spark/bin/spark-submit --class "com.fei.simple_project.App" --master local[4] /home/tizen/share/working-dir/spark/simple-project/target/simple-project-0.0.1-SNAPSHOT.jar
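Since the cluster runs Spark on YARN, the same jar can also be submitted to YARN instead of the local master; a sketch, assuming HADOOP_CONF_DIR points at the Hadoop configuration directory:
# Submit to YARN (client mode); requires HADOOP_CONF_DIR to be set
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
/usr/local/spark/bin/spark-submit \
  --class "com.fei.simple_project.App" \
  --master yarn \
  --deploy-mode client \
  /home/tizen/share/working-dir/spark/simple-project/target/simple-project-0.0.1-SNAPSHOT.jar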
8. Check the result:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://namenode:9000/user/tizen/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
This means README.md has not been uploaded to the cluster yet: Spark resolved the relative path against the default filesystem (hdfs://namenode:9000) and the user's HDFS home directory /user/tizen, where the file does not exist.
9. Upload README.md to HDFS:
hdfs dfs -put README.md README.md
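To confirm the upload (optional), list the HDFS home directory:
# The file should now appear under /user/<username> (here /user/tizen)
hdfs dfs -ls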
10. Run spark.sh again:
16/01/23 21:49:23 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/01/23 21:49:23 INFO DAGScheduler: ResultStage 1 (count at App.java:25) finished in 0.047 s
16/01/23 21:49:23 INFO DAGScheduler: Job 1 finished: count at App.java:25, took 0.155340 s
Hello World!
Lines with a: 58, lines with b: 26
16/01/23 21:49:23 INFO SparkContext: Invoking stop() from shutdown hook
16/01/23 21:49:23 INFO SparkUI: Stopped Spark web UI at http://192.168.0.101:4040
16/01/23 21:49:23 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
The result is now printed.
While the application is running, you can also check the Spark web UI at:
http://192.168.0.101:4040