When submitting jobs to the local Hadoop environment, I often run into a pile of dependencies that have to be shipped along, and I don't want to bundle them all into the application jar. A better option is the spark-submit --jars flag. Copying and pasting the jar paths by hand is fine when there are only a few of them, but it gets tedious once there are many. So I wanted a script that does it in one step, which led to the code below:
1. Add a sparkJars task to build.gradle
/**
 * Write the dependency jar paths into a file, for use with spark-submit --jars.
 */
task sparkJars {
    // doLast ensures the file is only written when this task is actually run,
    // not on every Gradle configuration pass
    doLast {
        def jarsFile = new File(System.getProperty("user.home"), "sparkJars.txt")
        System.out.println("sparkJarsFile: " + jarsFile.absolutePath)
        if (jarsFile.exists()) {
            jarsFile.delete()
        }
        def printWriter = jarsFile.newPrintWriter()
        configurations.default.each {
            // skip lombok, it is only needed at compile time
            if (it.isFile() && !it.name.contains("lombok")) {
                printWriter.println(it.absolutePath)
            }
        }
        printWriter.close()
    }
}
The task writes the absolute path of each local dependency jar into /home/user/sparkJars.txt, i.e. sparkJars.txt under the current user's home directory.
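For reference, the generated file is just a plain list of absolute jar paths, one per line, which is what the submit script below relies on. The entries here are hypothetical examples of what it might contain when the dependencies come out of the local Gradle cache (<hash> stands for the cache's checksum directory):

/home/user/.gradle/caches/modules-2/files-2.1/org.apache.spark/spark-streaming_2.11/2.4.0/<hash>/spark-streaming_2.11-2.4.0.jar
/home/user/.gradle/caches/modules-2/files-2.1/com.fasterxml.jackson.core/jackson-databind/2.9.8/<hash>/jackson-databind-2.9.8.jar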
2. Write the submit-job.sh script
#!/bin/sh

# write the dependency jar paths into the sparkJars file
../gradlew sparkJars

# create the comma-separated jar list for spark-submit --jars
CLASS_PATH=""
for jar in `cat ~/sparkJars.txt`; do
    CLASS_PATH="$CLASS_PATH,$jar"
done
# strip the leading comma
CLASS_PATH=${CLASS_PATH#,}

MAIN_CLASS="com.my.detection.OrderDetectionStreaming"

JAVA_CMD="spark-submit --master yarn --deploy-mode cluster \
  --class $MAIN_CLASS \
  --jars $CLASS_PATH \
  --conf 'spark.driver.extraJavaOptions=-Dspring.profiles.active=develop' \
  --conf 'spark.executor.extraJavaOptions=-Dspring.profiles.active=develop' \
  build/libs/rule-engine-0.0.1-SNAPSHOT.jar"

echo $JAVA_CMD
eval $JAVA_CMD
The script simply joins the entries in sparkJars.txt with commas and passes them to spark-submit via --jars. The dependencies are then distributed from the driver to the rest of the cluster, and the program runs directly without bundling them into the application jar.
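As a usage sketch, assuming the script sits in the module directory (so the Gradle wrapper is one level up, as written above, and build/libs/ is directly underneath), submitting the job is then just:

chmod +x submit-job.sh
./submit-job.sh

Because the script echoes the fully expanded spark-submit command before eval runs it, you can check the --jars list once before the job is actually submitted.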