In this article, William will try to build our first Spark program and run it on the Spark cluster created in the previous article.
Java Program
For a Java program, Maven is a good choice for managing dependencies and the steps of the build and release process. Use Maven to generate the Java project:
mvn archetype:generate -DgroupId=com.spark.example -DartifactId=JavaSparkPi
Maven creates the files for us in the standard directory layout:
JavaSparkPi
|-src
  |-main
    |-java
      |-com
        |-spark
          |-example
            |-App.java
  |-test
    |-java
      |-com
        |-spark
          |-example
            |-AppTest.java
|-pom.xml
Next we will use the official Spark example JavaSparkPi.java as our sample program, so replace the generated App.java with it:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.spark.example;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import java.util.ArrayList;
import java.util.List;
/**
 * Computes an approximation of Pi.
 * Usage: JavaSparkPi [slices]
 */
public final class JavaSparkPi {

    public static void main(String[] args) throws Exception {
        // Create the SparkConf object
        SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
        // Create the JavaSparkContext object
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
        int n = 100000 * slices;
        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            l.add(i);
        }

        JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

        int count = dataSet.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        }).reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer integer, Integer integer2) {
                return integer + integer2;
            }
        });

        System.out.println("Pi is roughly " + 4.0 * count / n);

        jsc.stop();
    }
}
This program computes an approximation of Pi. It accepts an optional slices parameter (default 2): the larger slices is, the more precise the result, but the greater the amount of computation.
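The distributed map/reduce above boils down to a plain Monte Carlo estimate: sample points uniformly in the [-1, 1] square, count the fraction landing inside the unit circle, and that fraction approaches pi/4. As a sanity check, the same computation can be sketched locally without Spark (the class and method names below are illustrative, not part of the Spark example):

```java
import java.util.Random;

public final class LocalPi {

    // Estimate Pi by sampling `samples` random points in the [-1, 1] square
    // and counting how many fall inside the unit circle. A fixed seed makes
    // the run reproducible, unlike Math.random() in the Spark example.
    static double estimate(int samples, long seed) {
        Random rnd = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rnd.nextDouble() * 2 - 1;
            double y = rnd.nextDouble() * 2 - 1;
            if (x * x + y * y < 1) {
                inside++;
            }
        }
        // Hit ratio ~= pi/4, so multiply by 4 to recover pi.
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimate(200000, 42L));
    }
}
```

With 200,000 samples the estimate typically lands within about 0.01 of Pi; Spark's contribution in the real program is only to spread those samples across the cluster.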
Configuring pom Dependencies
The JavaSparkPi class references classes in the org.apache.spark package, so the dependency must be declared in the pom file. Edit JavaSparkPi/pom.xml and add the following inside the dependencies element:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
</dependency>
We can inspect the dependency tree with the mvn dependency command:
mvn dependency:tree -Ddetail=true
[INFO] com.spark.example:JavaSparkPi:jar:1.0-SNAPSHOT
[INFO] +- junit:junit:jar:3.8.1:test
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.0:compile
[INFO] +- com.twitter:chill_2.10:jar:0.5.0:compile
[INFO] | \- com.esotericsoftware.kryo:kryo:jar:2.21:compile
[INFO] | +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile
[INFO] | +- com.esotericsoftware.minlog:minlog:jar:1.2:compile
[INFO] | \- org.objenesis:objenesis:jar:1.2:compile
[INFO] +- com.twitter:chill-java:jar:0.5.0:compile
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO] | | +- commons-cli:commons-cli:jar:1.2:compile
[INFO] | | +- org.apache.commons:commons-math:jar:2.1:compile
[INFO] | | +- xmlenc:xmlenc:jar:0.52:compile
[INFO] | | +- commons-io:commons-io:jar:2.1:compile
[INFO] | | +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] | | +- commons-lang:commons-lang:jar:2.5:compile
[INFO] | | +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] | | | +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] | | | +- commons-digester:commons-digester:jar:1.8:compile
[INFO] | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO] | | +- org.apache.avro:avro:jar:1.7.4:compile
[INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
[INFO] | | \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO] | | \- org.tukaani:xz:jar:1.0:compile
[INFO] | +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
[INFO] | | \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO] | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
[INFO] | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
[INFO] | | | | +- com.google.inject:guice:jar:3.0:compile
[INFO] | | | | | +- javax.inject:javax.inject:jar:1:compile
[INFO] | | | | | \- aopalliance:aopalliance:jar:1.0:compile
[INFO] | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
[INFO] | | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
[INFO] | | | | | | +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
[INFO] | | | | | | \- com.sun.jersey:jersey-client:jar:1.9:compile
[INFO] | | | | | \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
[INFO] | | | | | +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
[INFO] | | | | | | \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
[INFO] | | | | | | \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
[INFO] | | | | | | \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
[INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
[INFO] | | | | | | \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
[INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
[INFO] | | | | | \- org.glassfish:javax.servlet:jar:3.1:compile
[INFO] | | | | +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] | | | | | +- asm:asm:jar:3.1:compile
[INFO] | | | | | \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO] | | | | +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO] | | | | | +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] | | | | | | \- stax:stax-api:jar:1.0.1:compile
[INFO] | | | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] | | | | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] | | | | | | \- javax.activation:activation:jar:1.1:compile
[INFO] | | | | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO] | | | | | \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO] | | | | \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO] | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO] | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO] | +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO] | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO] | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO] | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO] | \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO] +- org.apache.spark:spark-network-common_2.10:jar:1.2.0:compile
[INFO] +- org.apache.spark:spark-network-shuffle_2.10:jar:1.2.0:compile
[INFO] +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO] | +- commons-codec:commons-codec:jar:1.3:compile
[INFO] | \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] | +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] | | \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] | +- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] | | \- jline:jline:jar:0.9.94:compile
[INFO] | \- com.google.guava:guava:jar:14.0.1:compile
[INFO] +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] | +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] | +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] | | +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] | | \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] | \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] | \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] | \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO] +- org.apache.commons:commons-lang3:jar:3.3.2:compile
[INFO] +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO] +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO] +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO] +- org.slf4j:jul-to-slf4j:jar:1.7.5:compile
[INFO] +- org.slf4j:jcl-over-slf4j:jar:1.7.5:compile
[INFO] +- log4j:log4j:jar:1.2.17:compile
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO] +- com.ning:compress-lzf:jar:1.0.0:compile
[INFO] +- org.xerial.snappy:snappy-java:jar:1.1.1.6:compile
[INFO] +- net.jpountz.lz4:lz4:jar:1.2.0:compile
[INFO] +- org.roaringbitmap:RoaringBitmap:jar:0.4.5:compile
[INFO] +- commons-net:commons-net:jar:2.2:compile
[INFO] +- org.spark-project.akka:akka-remote_2.10:jar:2.3.4-spark:compile
[INFO] | +- org.spark-project.akka:akka-actor_2.10:jar:2.3.4-spark:compile
[INFO] | | \- com.typesafe:config:jar:1.2.1:compile
[INFO] | +- io.netty:netty:jar:3.8.0.Final:compile
[INFO] | +- org.spark-project.protobuf:protobuf-java:jar:2.5.0-spark:compile
[INFO] | \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:compile
[INFO] +- org.spark-project.akka:akka-slf4j_2.10:jar:2.3.4-spark:compile
[INFO] +- org.scala-lang:scala-library:jar:2.10.4:compile
[INFO] +- org.json4s:json4s-jackson_2.10:jar:3.2.10:compile
[INFO] | +- org.json4s:json4s-core_2.10:jar:3.2.10:compile
[INFO] | | +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
[INFO] | | +- com.thoughtworks.paranamer:paranamer:jar:2.6:compile
[INFO] | | \- org.scala-lang:scalap:jar:2.10.0:compile
[INFO] | | \- org.scala-lang:scala-compiler:jar:2.10.0:compile
[INFO] | | \- org.scala-lang:scala-reflect:jar:2.10.0:compile
[INFO] | \- com.fasterxml.jackson.core:jackson-databind:jar:2.3.1:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
[INFO] | \- com.fasterxml.jackson.core:jackson-core:jar:2.3.1:compile
[INFO] +- org.apache.mesos:mesos:jar:shaded-protobuf:0.18.1:compile
[INFO] +- io.netty:netty-all:jar:4.0.23.Final:compile
[INFO] +- com.clearspring.analytics:stream:jar:2.7.0:compile
[INFO] +- com.codahale.metrics:metrics-core:jar:3.0.0:compile
[INFO] +- com.codahale.metrics:metrics-jvm:jar:3.0.0:compile
[INFO] +- com.codahale.metrics:metrics-json:jar:3.0.0:compile
[INFO] +- com.codahale.metrics:metrics-graphite:jar:3.0.0:compile
[INFO] +- org.tachyonproject:tachyon-client:jar:0.5.0:compile
[INFO] | \- org.tachyonproject:tachyon:jar:0.5.0:compile
[INFO] +- org.spark-project:pyrolite:jar:2.0.1:compile
[INFO] +- net.sf.py4j:py4j:jar:0.8.2.1:compile
[INFO] \- org.spark-project.spark:unused:jar:1.0.0:compile
Configuring maven-shade-plugin
By default, Maven keeps all dependency jars in a local repository folder named .m2, so the jar produced by the package phase only references those dependencies by name and does not bundle their actual contents. When the program runs on a cluster, the situation is different: we can hardly guarantee that every worker host has all the dependency jars, so we need to build an uber JAR (also called an assembly JAR) that contains all of the dependencies. This can be done with the maven-shade-plugin; add the following to the pom:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Generating the JAR with Maven
mvn clean compile package
After it completes successfully, we can see a JavaSparkPi-1.0-SNAPSHOT.jar file generated in the JavaSparkPi/target folder. It is 76 MB, showing that all the dependency jars have been bundled in.
Here William is only demonstrating how to build an uber JAR. In fact, the spark-core jar is guaranteed to exist on every cluster host, so there is no need to include it in the packaged jar.
Modify JavaSparkPi/pom.xml to set the scope of the spark-core dependency to provided:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
    <scope>provided</scope>
</dependency>
Run mvn clean compile package again; this time JavaSparkPi-1.0-SNAPSHOT.jar is only 5 KB and no longer contains the spark-core dependency jars.
Submitting the Program to the Spark Cluster
./bin/spark-submit --class com.spark.example.JavaSparkPi --master spark://192.168.32.130:7077 $HOME/JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar 1000
Python Program
Python is a scripting language, so deployment is relatively simple and needs no compilation step. Again we take the Pi-approximation example program:
./bin/spark-submit \
--master spark://192.168.32.130:7077 \
examples/src/main/python/pi.py \
1000
Referenced .zip, .egg, and .py files can be added with the --py-files argument.