Spark projects frequently need to ship external dependencies along with the application itself. The simplest approach is to bundle the compiled classes and all dependencies into a single jar, which is easy to upload and deploy. For Scala projects this is done with sbt-assembly, which plays a role similar to Maven's assembly plugin. See the sbt-assembly project page:
https://github.com/sbt/sbt-assembly
Installing sbt-assembly
Add the following to plugins.sbt to install the plugin and specify a resolver for downloading it. For a per-project setup, this file lives at <project root>/project/plugins.sbt; it can also go in the global configuration under C:\Users\userName\.sbt\0.1X\plugins:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.4")
resolvers += Resolver.url("bintray-sbt-plugins", url("http://dl.bintray.com/sbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
Choosing the sbt-assembly version
For sbt 0.13.6+, use sbt-assembly 0.14.4
For sbt 0.13.x, use 0.11.2
For sbt 0.12, use 0.9.2
To check which sbt version you have:
Run sbt on the command line to enter the sbt shell, then execute:
sbtVersion
Excluding jars
sbt-assembly packages everything declared in the project's libraryDependencies. Dependencies that should not be bundled can be excluded by marking them "provided":
[build.sbt]
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided"
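For context, a minimal build.sbt might look like the sketch below; the project name, version, and the extra Kafka dependency are placeholders, while the Spark and Scala versions match the article's example. Spark itself is marked "provided" because the cluster already supplies it at runtime, so sbt-assembly bundles only the remaining dependencies:

```scala
// build.sbt -- a sketch; name/version are placeholders
name := "my-spark-app"
version := "0.1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Supplied by the Spark cluster at runtime -> excluded from the fat jar
  "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
  // Not present on the cluster -> stays in the default scope and gets bundled
  "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0"
)
```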
Excluding the Scala library jar
Create an assembly.sbt file in the project root and add the configuration below. (Note: sbt-assembly settings can live either in <project root>/build.sbt or in an assembly.sbt file in the project root.)
[assembly.sbt]
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
Explicitly excluding a specific jar
[assembly.sbt]
assemblyExcludedJars in assembly := {
val cp = (fullClasspath in assembly).value
cp filter {_.data.getName == "compile-0.1.0.jar"}
}
Compile and package with sbt:
sbt clean compile assembly
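Two optional settings are often handy at this stage. Both are standard sbt-assembly keys; the jar name pattern shown here is only an illustrative example:

```scala
// assembly.sbt -- optional tweaks (a sketch; the name pattern is illustrative)
// Give the fat jar a custom file name:
assemblyJarName in assembly := s"${name.value}-${version.value}-fat.jar"
// Skip running tests during `sbt assembly`:
test in assembly := {}
```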
Jar conflicts
When jars conflict during assembly, you need to configure assemblyMergeStrategy.
For example, a conflict may produce an error like this:
[error] 1 error was encountered during merge
java.lang.RuntimeException: deduplicate: different file contents found in the following:
/Users/xueyintao/.ivy2/cache/org.apache.spark/spark-streaming-kafka-0-10_2.11/jars/spark-streaming-kafka-0-10_2.11-2.1.0.jar:org/apache/spark/unused/UnusedStubClass.class
/Users/xueyintao/.ivy2/cache/org.apache.spark/spark-tags_2.11/jars/spark-tags_2.11-2.1.0.jar:org/apache/spark/unused/UnusedStubClass.class
/Users/xueyintao/.ivy2/cache/org.spark-project.spark/unused/jars/unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
at sbtassembly.Assembly$.applyStrategies(Assembly.scala:140)
The error shows that the same class, org/apache/spark/unused/UnusedStubClass.class, is declared in three different jars.
Add an assemblyMergeStrategy configuration to resolve it:
[assembly.sbt]
assemblyMergeStrategy in assembly := {
  case PathList(ps @ _*) if ps.last endsWith "UnusedStubClass.class" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
This configuration keeps only the first copy whenever a class path ends with UnusedStubClass.class. For the other merge strategies, see the official documentation and configure them to fit your project.
Other use cases
Packaging dependencies and project classes as two separate jars
To bundle all dependencies into one jar, without your own project's classes, run:
sbt assemblyPackageDependency
This produces a jar whose name ends with -deps.jar.
To package only your own project's classes, add:
[assembly.sbt]
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false, includeDependency = false)
Then package with the command sbt assembly.
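With the two jars built, one way to submit the job (a sketch; the jar paths, file names, and main class are placeholders that depend on your project) is to ship the dependency jar through spark-submit's --jars option and use the slim project jar as the application jar:

```shell
# Build the dependency jar (-deps.jar) and the slim project jar
sbt assemblyPackageDependency
sbt assembly

# Ship dependencies separately; the application jar stays small
spark-submit \
  --class com.example.Main \
  --jars target/scala-2.11/my-spark-app-assembly-0.1.0-deps.jar \
  target/scala-2.11/my-spark-app-0.1.0.jar
```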
If multiple files share the same relative path (e.g. a resource named application.conf in multiple dependency JARs), the default strategy is to verify that all candidates have the same contents and error out otherwise. This behavior can be configured on a per-path basis using either one of the following built-in strategies or writing a custom one:
- MergeStrategy.deduplicate is the default described above
- MergeStrategy.first picks the first of the matching files in classpath order
- MergeStrategy.last picks the last one
- MergeStrategy.singleOrError bails out with an error message on conflict
- MergeStrategy.concat simply concatenates all matching files and includes the result
- MergeStrategy.filterDistinctLines also concatenates, but leaves out duplicates along the way
- MergeStrategy.rename renames the files originating from jar files
- MergeStrategy.discard simply discards matching files
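A common concrete case: Typesafe Config reference.conf files ship inside many jars and must be concatenated rather than deduplicated. The sketch below combines a few of the strategies above; adjust the patterns to your project, and note that blanket-discarding META-INF is aggressive:

```scala
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat        // merge config fragments from all jars
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```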
An example from the author's own setup:
- sbt 0.13.8
- scala 2.11.6
- assembly 0.13.0

mergeStrategy in assembly := {
  case x if x.startsWith("META-INF") => MergeStrategy.discard // Bumf
  case x if x.endsWith(".html") => MergeStrategy.discard // More bumf
  case x if x.contains("slf4j-api") => MergeStrategy.last
  case x if x.contains("org/cyberneko/html") => MergeStrategy.first
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last // For Log$Logger.class
  case x =>
    val oldStrategy = (mergeStrategy in assembly).value
    oldStrategy(x)
}