Every time you submit an application to YARN in cluster mode, Spark uploads the assembly JAR (and your application JAR) to the staging directory on HDFS, as the following log lines show:

21 Oct 2014 14:23:22,006 INFO [main] (org.apache.spark.Logging$class.logInfo:59) - Uploading file:/home/spark-1.1.0-bin-2.2.0/lib/spark-assembly-1.1.0-hadoop2.2.0.jar to hdfs://my/user/iteblog/...../spark-assembly-1.1.0-hadoop2.2.0.jar
21 Oct 2014 14:23:23,465 INFO [main] (org.apache.spark.Logging$class.logInfo:59) - Uploading file:/export1/spark/spark-1.0.1-bin-hadoop2/spark-1.0-SNAPSHOT.jar to hdfs://my/user/iteblog/.sparkStaging/application_1413861490879_0010/spark-1.0-SNAPSHOT.jar
To avoid this upload on every submission, first put the Spark assembly JAR in a fixed location on HDFS:

bin/hadoop fs -mkdir /home/iteblog/spark_lib
bin/hadoop fs -put spark-assembly-1.1.0-hadoop2.2.0.jar /home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar
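If you prefer doing this from code rather than the shell, here is a minimal Scala sketch using the Hadoop FileSystem API to the same effect. UploadAssembly is a hypothetical helper, and the paths are the ones from this example, so adjust them to your cluster:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object UploadAssembly {
  def main(args: Array[String]): Unit = {
    // Paths taken from the example above; adjust to your own cluster.
    val local = new Path("file:///home/spark-1.1.0-bin-2.2.0/lib/spark-assembly-1.1.0-hadoop2.2.0.jar")
    val dest = new Path("hdfs://my/home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar")
    val fs = FileSystem.get(new URI("hdfs://my"), new Configuration())
    fs.mkdirs(dest.getParent)                       // same effect as: hadoop fs -mkdir
    fs.copyFromLocalFile(false, true, local, dest)  // same effect as: hadoop fs -put
    fs.close()
  }
}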
Then edit the spark-defaults.conf file and add the following:

spark.yarn.jar=hdfs://my/home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar

That is, point spark.yarn.jar at the Spark assembly we just placed on HDFS.
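Alternatively, the same property can be set for a single submission with spark-submit's --conf option, e.g. --conf spark.yarn.jar=hdfs://my/home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar.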
Now submit your application again:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 512m \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples*.jar \
10
You can see from the logs that the spark-assembly-1.1.0-hadoop2.2.0.jar file no longer gets uploaded.
Unfortunately, if your HDFS is deployed in HA mode (note that the hdfs://my URIs above use a nameservice name, not a real host), the configuration above does not work in Spark 1.1.0. The culprit is the compareFs function in Spark's YARN client:

/** See if two file systems are the same or not. */
private def compareFs(srcFs: FileSystem, destFs: FileSystem): Boolean = {
  val srcUri = srcFs.getUri()
  val dstUri = destFs.getUri()
  if (srcUri.getScheme() == null) {
    return false
  }
  if (!srcUri.getScheme().equals(dstUri.getScheme())) {
    return false
  }
  var srcHost = srcUri.getHost()
  var dstHost = dstUri.getHost()
  if ((srcHost != null) && (dstHost != null)) {
    try {
      srcHost = InetAddress.getByName(srcHost).getCanonicalHostName()
      dstHost = InetAddress.getByName(dstHost).getCanonicalHostName()
    } catch {
      case e: UnknownHostException =>
        return false
    }
    if (!srcHost.equals(dstHost)) {
      return false
    }
  } else if (srcHost == null && dstHost != null) {
    return false
  } else if (srcHost != null && dstHost == null) {
    return false
  }
  if (srcUri.getPort() != dstUri.getPort()) {
    false
  } else {
    true
  }
}
Look closely at the two InetAddress.getByName(...).getCanonicalHostName() calls: the code assumes that the host part of the HDFS URI we pass in is a resolvable hostname or IP address. But if your HDFS runs in HA mode (or behind viewfs), the host part is actually the name of an HDFS nameservice, such as my in hdfs://my/.... Resolving it throws an UnknownHostException, compareFs returns false, Spark decides the two file systems are different, and the assembly JAR is uploaded all over again.
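To see the failure mode concretely, here is a minimal sketch. HaHostDemo is hypothetical, and it assumes, as in the URIs above, that my is an HA nameservice name with no matching DNS entry:

import java.net.{InetAddress, URI, UnknownHostException}

object HaHostDemo {
  def main(args: Array[String]): Unit = {
    // The "host" of an HA URI is the nameservice name, not a machine name.
    val host = new URI("hdfs://my/user/iteblog").getHost  // "my"
    try {
      println(InetAddress.getByName(host).getCanonicalHostName())
    } catch {
      case e: UnknownHostException =>
        // This is exactly what compareFs hits: it returns false,
        // and the assembly JAR gets uploaded again.
        println(s"cannot resolve '$host'")
    }
  }
}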
If you are in a hurry to use this feature, you can change your compareFs function to the implementation below. Besides Guava's Objects, it needs the java.net and Hadoop imports shown:

import java.net.{InetAddress, UnknownHostException}
import com.google.common.base.Objects
import org.apache.hadoop.fs.FileSystem
/**
 * Return whether the two file systems are the same.
 */
private def compareFs(srcFs: FileSystem, destFs: FileSystem): Boolean = {
  val srcUri = srcFs.getUri()
  val dstUri = destFs.getUri()
  if (srcUri.getScheme() == null || srcUri.getScheme() != dstUri.getScheme()) {
    return false
  }

  var srcHost = srcUri.getHost()
  var dstHost = dstUri.getHost()

  // In HA or when using viewfs, the host part of the URI may not actually
  // be a host, but the name of the HDFS namespace. Those names won't
  // resolve, so avoid even trying if they match.
  if (srcHost != null && dstHost != null && srcHost != dstHost) {
    try {
      srcHost = InetAddress.getByName(srcHost).getCanonicalHostName()
      dstHost = InetAddress.getByName(dstHost).getCanonicalHostName()
    } catch {
      case e: UnknownHostException =>
        return false
    }
  }

  Objects.equal(srcHost, dstHost) && srcUri.getPort() == dstUri.getPort()
}
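A quick way to convince yourself the fix behaves as intended is to run the same comparison on raw URIs. CompareFsDemo below is a hypothetical standalone transcription of the fixed logic, with FileSystem replaced by URI so it runs without a cluster:

import java.net.{InetAddress, UnknownHostException, URI}
import com.google.common.base.Objects

object CompareFsDemo {
  // Hypothetical transcription of the fixed compareFs, operating on URIs.
  def sameFs(srcUri: URI, dstUri: URI): Boolean = {
    if (srcUri.getScheme() == null || srcUri.getScheme() != dstUri.getScheme()) {
      return false
    }
    var srcHost = srcUri.getHost()
    var dstHost = dstUri.getHost()
    // Identical nameservice names short-circuit: no DNS lookup is attempted.
    if (srcHost != null && dstHost != null && srcHost != dstHost) {
      try {
        srcHost = InetAddress.getByName(srcHost).getCanonicalHostName()
        dstHost = InetAddress.getByName(dstHost).getCanonicalHostName()
      } catch {
        case e: UnknownHostException => return false
      }
    }
    Objects.equal(srcHost, dstHost) && srcUri.getPort() == dstUri.getPort()
  }

  def main(args: Array[String]): Unit = {
    val staging = new URI("hdfs://my/user/iteblog/.sparkStaging")
    val sparkJar = new URI("hdfs://my/home/iteblog/spark_lib")
    println(sameFs(staging, sparkJar))                // true: same HA nameservice
    println(sameFs(new URI("file:///tmp"), sparkJar)) // false: different schemes
  }
}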
Then rebuild the Spark source (not sure how to build it? See 《用Maven编译Spark 1.1.0》, "Building Spark 1.1.0 with Maven").
According to Cloudera's official blog, if you use Cloudera Manager, the Spark assembly JAR is automatically uploaded to HDFS, for example to the hdfs://my/home/iteblog/spark_lib/ directory used above. I don't have Cloudera Manager installed; if you do, give it a try.