Spark Optimization: Stop Applications from Uploading Dependency JARs to HDFS

Every time you submit a Spark application to YARN in cluster mode, the client uploads the Spark assembly JAR and your application JAR to HDFS, as the following log shows:

21 Oct 2014 14:23:22,006 INFO [main] (org.apache.spark.Logging$class.logInfo:59) -
Uploading file:/home/spark-1.1.0-bin-2.2.0/lib/spark-assembly-1.1.0-hadoop2.2.0.jar to
hdfs://my/user/iteblog/...../spark-assembly-1.1.0-hadoop2.2.0.jar
21 Oct 2014 14:23:23,465 INFO [main] (org.apache.spark.Logging$class.logInfo:59) -
Uploading file:/export1/spark/spark-1.0.1-bin-hadoop2/spark-1.0-SNAPSHOT.jar to
hdfs://my/user/iteblog/.sparkStaging/application_1413861490879_0010/spark-1.0-SNAPSHOT.jar

Uploading these JARs on every submission wastes both time and network bandwidth, and the assembly JAR rarely changes. To avoid the repeated upload, first put the assembly JAR into a fixed directory on HDFS:

bin/hadoop fs -mkdir /home/iteblog/spark_lib
bin/hadoop fs -put spark-assembly-1.1.0-hadoop2.2.0.jar /home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar

Then edit the spark-defaults.conf file and add the following line:

spark.yarn.jar=hdfs://my/home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar

That is, point spark.yarn.jar at the copy of the Spark assembly that now lives on HDFS.
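As a quick sanity check that the setting is being picked up, you can print it from inside an application. This is a minimal sketch (the configuration key is real; the object name is just illustrative). Note that spark-defaults.conf is read by spark-submit, which turns each entry into a spark.* system property that SparkConf then sees:

import org.apache.spark.SparkConf

object CheckYarnJar {
  def main(args: Array[String]): Unit = {
    // loadDefaults = true makes SparkConf pick up spark.* system properties,
    // which is where spark-submit puts the entries from spark-defaults.conf.
    val conf = new SparkConf(true)
    println(conf.get("spark.yarn.jar", "<not set>"))
  }
}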

Now submit your application again:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 512m \
    --executor-memory 2g \
    --executor-cores 1 \
    lib/spark-examples*.jar \
    10

You will see in the log that spark-assembly-1.1.0-hadoop2.2.0.jar is no longer being uploaded.

Unfortunately, if your HDFS runs in HA mode (or uses viewfs), this trick fails in Spark 1.1.0: the JARs still get uploaded. The culprit is the function Spark uses to decide whether the source and destination file systems are the same:

/** See if two file systems are the same or not. */
private def compareFs(srcFs: FileSystem, destFs: FileSystem): Boolean = {
  val srcUri = srcFs.getUri()
  val dstUri = destFs.getUri()
  if (srcUri.getScheme() == null) {
    return false
  }
  if (!srcUri.getScheme().equals(dstUri.getScheme())) {
    return false
  }
  var srcHost = srcUri.getHost()
  var dstHost = dstUri.getHost()
  if ((srcHost != null) && (dstHost != null)) {
    try {
      srcHost = InetAddress.getByName(srcHost).getCanonicalHostName()
      dstHost = InetAddress.getByName(dstHost).getCanonicalHostName()
    } catch {
      case e: UnknownHostException =>
        return false
    }
    if (!srcHost.equals(dstHost)) {
      return false
    }
  } else if (srcHost == null && dstHost != null) {
    return false
  } else if (srcHost != null && dstHost == null) {
    return false
  }
  if (srcUri.getPort() != dstUri.getPort()) {
    false
  } else {
    true
  }
}

Look closely at the two InetAddress.getByName calls: the code assumes the host part extracted from the HDFS URI is a concrete, resolvable host (an IP or DNS name). But with HDFS HA, the "host" in a path like hdfs://my/... is really the name of the HDFS nameservice, not a machine. Resolving it throws UnknownHostException, compareFs returns false, and Spark re-uploads the JARs as if the file systems were different.
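You can reproduce the failing step outside of Spark. A minimal sketch, assuming "my" (the nameservice in the URIs above) does not resolve on your network:

import java.net.{InetAddress, UnknownHostException}

object NameserviceResolveDemo {
  def main(args: Array[String]): Unit = {
    // "my" is the HDFS nameservice from hdfs://my/..., not a real machine.
    val nameservice = "my"
    try {
      println(InetAddress.getByName(nameservice).getCanonicalHostName())
    } catch {
      case e: UnknownHostException =>
        // This is what happens inside compareFs: the lookup throws,
        // compareFs returns false, and Spark re-uploads the JARs.
        println(s"cannot resolve '$nameservice': $e")
    }
  }
}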

If you are in a hurry to use this feature, you can change your compareFs function to the following implementation:

import com.google.common.base.Objects

/**
 * Return whether the two file systems are the same.
 */
private def compareFs(srcFs: FileSystem, destFs: FileSystem): Boolean = {
  val srcUri = srcFs.getUri()
  val dstUri = destFs.getUri()
  if (srcUri.getScheme() == null || srcUri.getScheme() != dstUri.getScheme()) {
    return false
  }

  var srcHost = srcUri.getHost()
  var dstHost = dstUri.getHost()

  // In HA or when using viewfs, the host part of the URI may not actually
  // be a host, but the name of the HDFS namespace. Those names won't
  // resolve, so avoid even trying if they match.
  if (srcHost != null && dstHost != null && srcHost != dstHost) {
    try {
      srcHost = InetAddress.getByName(srcHost).getCanonicalHostName()
      dstHost = InetAddress.getByName(dstHost).getCanonicalHostName()
    } catch {
      case e: UnknownHostException =>
        return false
    }
  }

  Objects.equal(srcHost, dstHost) && srcUri.getPort() == dstUri.getPort()
}
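To see the guard in action without a Hadoop cluster, here is a self-contained sketch of the same comparison applied directly to URIs (the URIs are illustrative, and Scala's null-safe == stands in for Guava's Objects.equal):

import java.net.{InetAddress, UnknownHostException, URI}

object CompareFsDemo {
  // Same logic as the fixed compareFs, but on plain URIs.
  def sameFileSystem(srcUri: URI, dstUri: URI): Boolean = {
    if (srcUri.getScheme() == null || srcUri.getScheme() != dstUri.getScheme()) {
      return false
    }
    var srcHost = srcUri.getHost()
    var dstHost = dstUri.getHost()
    // Identical host strings (e.g. the same HA nameservice) are accepted
    // without a DNS lookup; only differing hosts get resolved.
    if (srcHost != null && dstHost != null && srcHost != dstHost) {
      try {
        srcHost = InetAddress.getByName(srcHost).getCanonicalHostName()
        dstHost = InetAddress.getByName(dstHost).getCanonicalHostName()
      } catch {
        case _: UnknownHostException => return false
      }
    }
    srcHost == dstHost && srcUri.getPort() == dstUri.getPort()
  }

  def main(args: Array[String]): Unit = {
    val src = new URI("hdfs://my/home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar")
    val dst = new URI("hdfs://my/user/iteblog/.sparkStaging/application_1413861490879_0010")
    // Prints true: same scheme, same nameservice "my", same (absent) port,
    // and no DNS resolution was attempted.
    println(sameFileSystem(src, dst))
  }
}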

Then rebuild Spark from source (not sure how to build it? See 《用Maven编译Spark 1.1.0》, "Building Spark 1.1.0 with Maven").
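For reference, the YARN build command given in the Spark 1.1 "Building Spark with Maven" documentation for Hadoop 2.2 is shown below; adjust the profile and hadoop.version to match your cluster:

mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package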

According to a post on Cloudera's official blog, if you use Cloudera Manager, the Spark assembly JAR is uploaded to HDFS automatically (into a directory such as the hdfs://my/home/iteblog/spark_lib/ used above). I don't have Cloudera Manager installed, so I can't verify this; if you do, give it a try.
