Resolving Spark Dependency JAR Conflicts


dengqian2095


Tags: java, big data

Original link: http://www.cnblogs.com/robynn/p/8444077.html


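Background:
Our company chose Apache Beam for developing big data programs. Apache Beam provides a set of general-purpose Java APIs. "General-purpose" here means that a program written with Apache Beam
can run, without any code changes, on today's popular compute frameworks such as Spark and Flink.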

A Beam program that runs on Spark depends on JARs from Spark, Hadoop, and even Kafka.

If the Beam program is packaged as a single fat JAR, it runs on Spark without problems:
spark-submit --master yarn-cluster --name RealTimeAPP --class com.data.analytics.app.RealTimeAPP RealTimeAPP-1.0.0-SNAPSHOT.jar --runner=SparkRunner

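The problem is that such a fat JAR is too large, more than 200 MB. Management wanted the small core program separated from the large dependency JAR, so that the dependency JAR is uploaded to the server only once
and only the core program is replaced on the server when it changes (this works only as long as the core program's dependencies have not changed).
So the spark-submit command became: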

# ${JARS} is the path to the dependency JARs, which we put on HDFS.
# Separate multiple JARs with commas, with no spaces before or after each comma.
spark-submit --master yarn-cluster \
    --name ${CLASS_NAME} \
    --class ${PACKAGE_NAME}.${CLASS_NAME} \
    --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
    --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
    --conf spark.yarn.maxAppAttempts=${SPARK_YARN_MAXAPPATTEMPTS} \
    --conf spark.yarn.am.attemptFailuresValidityInterval=${SPARK_YARN_AM_ATTEMPTFAILURESVALIDITYINTERVAL} \
    --conf spark.yarn.max.executor.failures=${SPARK_YARN_MAX_EXECUTOR_FAILURES} \
    --conf spark.yarn.executor.failuresValidityInterval=${SPARK_YARN_EXECUTOR_FAILURESVALIDITYINTERVAL} \
    --conf spark.streaming.receiver.writeAheadLog.enable=${SPARK_STREAMING_RECEIVER_WRITEAHEADLOG_ENABLE} \
    --files ${CONF_DIR}/metrics.properties,${CONF_DIR}/log4j.properties,${CONF_DIR}/conf.properties \
    --driver-memory ${DRIVER_MEMORY} \
    --executor-memory ${EXECUTOR_MEMORY} \
    --num-executors ${NUM_EXECUTORS} \
    --jars ${JARS} \
    ${JAR_DIR}/${JAR_FILE} \
    --runner=SparkRunner \
    --batchIntervalMillis=60000
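
The value passed to --jars has to be one comma-separated string with no spaces. A minimal sketch of building it in the shell follows; the helper name `join_jars` and the example HDFS paths are hypothetical and not part of the original setup.

```shell
# join_jars PATH...
# Hypothetical helper: joins its arguments with commas and no spaces,
# which is the form the --jars option expects.
join_jars() {
  local IFS=,
  printf '%s\n' "$*"
}

# Placeholder paths, for illustration only.
JARS=$(join_jars \
  "hdfs:///user/spark/lib/app/deps-a.jar" \
  "hdfs:///user/spark/lib/app/deps-b.jar")
printf '%s\n' "${JARS}"
```

Building the list this way avoids a stray space after a comma, which would split the value into separate arguments.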



That was a long lead-in. Now to the actual topic:

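Later a new project needed Beam SQL, and a new problem appeared:
Beam SQL ran fine in Spark local mode but failed in yarn-cluster and yarn-client modes.
The cause turned out to be a series of JAR conflicts.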
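
One way to narrow down conflicts like these is to look for class files that appear both in the application's dependency JAR and in a JAR the cluster already provides. A rough sketch, assuming entry listings produced with `jar tf`; the function name `common_classes` and the file names are hypothetical. A shared entry is only a candidate, since it causes trouble only when the two copies differ.

```shell
# common_classes LIST_A LIST_B
# Each argument is a text file with one JAR entry per line.
# Prints the .class entries that appear in both listings.
common_classes() {
  grep '\.class$' "$1" | grep -F -x -f "$2" | sort -u
}

# Hypothetical usage (file names are placeholders):
#   jar tf my-dependencies.jar > deps.txt
#   jar tf spark-assembly.jar  > spark.txt
#   common_classes deps.txt spark.txt
```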

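Solution:

1. Pass --conf spark.driver.userClassPathFirst=true to spark-submit

spark.driver.userClassPathFirst=true
With this option set, Java classes are loaded in this order: user classpath -> Spark classpath -> system classpath. Spark's documentation marks the option as experimental.

I also tried --driver-class-path ****.jar, but with that option classes were loaded only from the specified JAR, and the job failed and exited whenever a class was missing from it.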

2. Exclude a large batch of Java classes in the pom file, because some classes must come from the Spark environment or the job will not run:

                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/LICENSE</exclude>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                    <exclude>**/*.java</exclude>
                    <exclude>org/apache/hadoop/conf/**/*.class</exclude>
                    <exclude>org/apache/hadoop/fs/**/*.class</exclude>
                    <exclude>org/apache/hadoop/io/**/*.class</exclude>
                    <exclude>org/apache/hadoop/security/**/*.class</exclude>
                    <exclude>org/apache/hadoop/ipc/**/*.class</exclude>
                    <exclude>org/apache/log4j/**</exclude>
                    <exclude>org/slf4j/**</exclude>
                    <exclude>log4j.properties</exclude>
                    <exclude>com/codahale/**</exclude>
                    <exclude>scala/**</exclude>
                    <exclude>org/apache/hadoop/yarn/api/records/impl/pb/**</exclude>
                    <exclude>org/apache/hadoop/yarn/api/impl/pb/**</exclude>
                    <exclude>org/apache/hadoop/yarn/api/protocolrecords/impl/pb/**</exclude>
                    <exclude>org/apache/hadoop/net/**</exclude>
                    <exclude>org/apache/hadoop/hdfs/protocol/proto/**</exclude>
                    <exclude>com/google/protobuf/**</exclude>
                    <exclude>org/apache/spark/**</exclude>
                    <exclude>akka/**</exclude>
                    <exclude>org/apache/hadoop/util/**</exclude>
                  </excludes>
                </filter>
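
After rebuilding, it can be worth checking that none of the excluded packages still ended up in the packaged JAR. A minimal sketch, assuming the JAR's entries have been listed into a text file with `jar tf`; the function name `leaked_entries` is hypothetical, and the prefix list covers only a subset of the excludes above.

```shell
# leaked_entries LISTING
# LISTING is a text file with one JAR entry per line.
# Prints entries under packages that the filter is meant to exclude.
# Prints nothing (and returns non-zero) when the listing is clean.
leaked_entries() {
  grep -E '^(org/apache/spark/|scala/|akka/|com/google/protobuf/|org/slf4j/|org/apache/log4j/|com/codahale/)' "$1"
}

# Hypothetical usage (file name is a placeholder):
#   jar tf my-dependencies.jar > deps.txt
#   leaked_entries deps.txt || echo "no excluded packages found"
```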



spark local:
spark-submit \
--master local \
--name RealTimeAPP2 \
--conf spark.driver.userClassPathFirst=true \
--class com.data.analytics.app.RealTimeAPP2 \
--jars RealTimeAPP2-dependencies-1.0.0-SNAPSHOT.jar \
RealTimeAPP2-1.0.0-SNAPSHOT.jar \
--runner=SparkRunner

spark yarn-cluster:
spark-submit \
--master yarn-cluster \
--name RealTimeAPP2 \
--conf spark.driver.userClassPathFirst=true \
--class com.data.analytics.app.RealTimeAPP2 \
--jars hdfs://testserver:8020/user/spark/lib/RealTimeAPP2/RealTimeAPP2-dependencies-1.0.0-SNAPSHOT.jar \
RealTimeAPP2-1.0.0-SNAPSHOT.jar \
--runner=SparkRunner



Apache Beam: 2.2.0
Apache Spark: 1.6.3