Compiling and Packaging Spark-Adaptive


敏叔V587


Category: Big Data

Article link: https://blog.csdn.net/zhuxuemin1991/article/details/105025283


Contents

Preface
Downloading the Source
Building from Source

Preface

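spark-adaptive is commonly referred to as Spark adaptive execution. It exists because Spark versions before 3.0 are fairly naive about choosing the number of parallel tasks, and an unsuitable setting wastes resources. I recently needed to evaluate it, so here are the steps I went through. For an introduction to adaptive execution, I recommend these earlier articles:
Spark SQL Adaptive Execution in Practice at 100 TB (Spark SQL在100TB上的自适应执行实践)
A Survey of Spark Adaptive Execution (Spark Adaptive Execution调研)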

Downloading the Source

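The first step is, of course, to download the source code from GitHub: source repository
Because cloning from there was painfully slow for me, I forked a copy on Gitee to make repeated downloads easier:
fork address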

Building from Source

Changing the Repository Address

Downloads from overseas servers are slow here as well, so I pointed the mirror in my Maven settings.xml at the Aliyun repository:

<mirror>
  <id>alimaven</id>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <mirrorOf>central</mirrorOf>
</mirror>

Adjusting JVM Parameters

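Maven is itself a Java program, and the Spark source tree is large. I had noticed the build falling into full GCs, so the JVM parameters needed adjusting. My machine has plenty of memory, so I raised the heap size. You can either run this in the console for the current session or configure it as an environment variable: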

export MAVEN_OPTS="-Xms12g -Xmx12g -XX:+UseG1GC"
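
The 12 GB heap above suits a machine with plenty of memory. On a smaller machine the value needs scaling down. The following is only a sketch: the function name suggest_maven_opts and the policy it encodes (half of total RAM, at least 2 GB, at most 12 GB) are my own assumptions for illustration, not a Spark or Maven recommendation.

```shell
# Hypothetical helper: derive MAVEN_OPTS from total memory given in KiB.
# Policy (an assumption): use half of RAM, clamped to the range 2g..12g.
suggest_maven_opts() (
  total_kb=$1
  heap_gb=$(( total_kb / 1024 / 1024 / 2 ))
  if [ "$heap_gb" -gt 12 ]; then heap_gb=12; fi
  if [ "$heap_gb" -lt 2 ]; then heap_gb=2; fi
  echo "-Xms${heap_gb}g -Xmx${heap_gb}g -XX:+UseG1GC"
)

# Example on Linux, reading MemTotal from /proc/meminfo:
#   export MAVEN_OPTS="$(suggest_maven_opts "$(awk '/MemTotal/ {print $2}' /proc/meminfo)")"
```

The helper only prints a string, so the result can be inspected before it is exported.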

Build Process

The spark-adaptive project is laid out the same way as Spark itself, so the distribution is built the same way:

./dev/make-distribution.sh --name spark-ae-2.3 --pip --r --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver  -Pyarn -DskipTests

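Note that the trailing arguments depend on what you need. Spark can be built with either Maven or sbt; this command uses Maven. The pom.xml at the root of the source tree contains profile definitions such as: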

......
<profile>
  <id>hadoop-2.6</id>
  <!-- Default hadoop profile. Uses global properties. -->
</profile>
<profile>
  <id>hadoop-2.7</id>
  <properties>
    <hadoop.version>2.7.3</hadoop.version>
    <curator.version>2.7.1</curator.version>
  </properties>
</profile>
<profile>
  <id>yarn</id>
  <modules>
    <module>resource-managers/yarn</module>
    <module>common/network-yarn</module>
  </modules>
</profile>
......

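In a Maven project, profiles like these supply different configuration for different environments. Passing -Pyarn, for example, adds the YARN modules to the build. The --pip option builds the Python package, so it naturally requires a working Python environment, and the same goes for --r and R. Many build failures come from pasting a packaging command without having those dependencies installed. You can either fix the dependencies or sidestep the problem, as I did here, by leaving out the R and Python packaging for now. The final command is therefore: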

./dev/make-distribution.sh --name spark-ae-2.3 --tgz  -Phadoop-2.6 -Phive -Phive-thriftserver  -Pyarn -DskipTests
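
One way to avoid choosing between the two commands by hand is to add the optional flags only when the tools are present. This is a rough sketch: have_tool and dist_args are hypothetical helper names, and finding pip or R on the PATH is only a proxy for a usable Python or R environment, so a build can still fail on missing packages.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_tool() { command -v "$1" >/dev/null 2>&1; }

# Print the argument list for ./dev/make-distribution.sh, adding the
# Python and R packaging flags only when the tools appear to be installed.
dist_args() (
  args="--name spark-ae-2.3 --tgz"
  if have_tool pip; then args="$args --pip"; fi
  if have_tool R; then args="$args --r -Psparkr"; fi
  echo "$args -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DskipTests"
)

# Usage (word splitting of the unquoted substitution is intended):
#   ./dev/make-distribution.sh $(dist_args)
```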

The build prints a lot of intermediate output. If something fails, read the log:

......
+ cp -r /home/hdfs/spark-adaptive/examples/src/main /home/hdfs/spark-adaptive/dist/examples/src/
+ cp /home/hdfs/spark-adaptive/LICENSE /home/hdfs/spark-adaptive/dist
+ cp -r /home/hdfs/spark-adaptive/licenses /home/hdfs/spark-adaptive/dist
+ cp /home/hdfs/spark-adaptive/NOTICE /home/hdfs/spark-adaptive/dist
+ '[' -e /home/hdfs/spark-adaptive/CHANGES.txt ']'
+ cp -r /home/hdfs/spark-adaptive/data /home/hdfs/spark-adaptive/dist
+ '[' false == true ']'
+ echo 'Skipping building python distribution package'
Skipping building python distribution package
+ '[' false == true ']'
+ echo 'Skipping building R source package'
Skipping building R source package
+ mkdir /home/hdfs/spark-adaptive/dist/conf
+ cp /home/hdfs/spark-adaptive/conf/docker.properties.template /home/hdfs/spark-adaptive/conf/fairscheduler.xml.template /home/hdfs/spark-adaptive/conf/log4j.properties.template /home/hdfs/spark-adaptive/conf/metrics.properties.template /home/hdfs/spark-adaptive/conf/slaves.template /home/hdfs/spark-adaptive/conf/spark-defaults.conf.template /home/hdfs/spark-adaptive/conf/spark-env.sh.template /home/hdfs/spark-adaptive/dist/conf
+ cp /home/hdfs/spark-adaptive/README.md /home/hdfs/spark-adaptive/dist
+ cp -r /home/hdfs/spark-adaptive/bin /home/hdfs/spark-adaptive/dist
+ cp -r /home/hdfs/spark-adaptive/python /home/hdfs/spark-adaptive/dist
+ '[' false == true ']'
+ cp -r /home/hdfs/spark-adaptive/sbin /home/hdfs/spark-adaptive/dist
+ '[' -d /home/hdfs/spark-adaptive/R/lib/SparkR ']'
+ '[' true == true ']'
+ TARDIR_NAME=spark-2.3.2-bin-spark-ae-2.3
+ TARDIR=/home/hdfs/spark-adaptive/spark-2.3.2-bin-spark-ae-2.3
+ rm -rf /home/hdfs/spark-adaptive/spark-2.3.2-bin-spark-ae-2.3
+ cp -r /home/hdfs/spark-adaptive/dist /home/hdfs/spark-adaptive/spark-2.3.2-bin-spark-ae-2.3
+ tar czf spark-2.3.2-bin-spark-ae-2.3.tgz -C /home/hdfs/spark-adaptive spark-2.3.2-bin-spark-ae-2.3
+ rm -rf /home/hdfs/spark-adaptive/spark-2.3.2-bin-spark-ae-2.3

The last part of the output includes:
tar czf spark-2.3.2-bin-spark-ae-2.3.tgz -C /home/hdfs/spark-adaptive spark-2.3.2-bin-spark-ae-2.3
which shows that compilation succeeded and the script went on to package the result.
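
To double-check the archive, you can list it and confirm its top-level directory. A minimal sketch, where tarball_top_dir is a hypothetical helper name:

```shell
# Print the top-level directory stored in a gzip-compressed tarball.
tarball_top_dir() {
  tar tzf "$1" 2>/dev/null | head -n 1 | cut -d/ -f1
}

# Usage with the archive name from the log above:
#   tarball_top_dir spark-2.3.2-bin-spark-ae-2.3.tgz
```

For the archive built here, the result should match the TARDIR_NAME value shown in the log.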

The freshly built package is ready!


Error Analysis

My build only went smoothly after many attempts, fixing each problem as it came up. Here are the most representative errors.
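
When a build fails, the first error line in the log is usually the best place to start reading. The sketch below assumes the build output was saved to a file, for example by piping the build through tee; first_error and the log file name are hypothetical.

```shell
# Print the first error line (prefixed with its line number) from a saved
# build log. Matching is case-insensitive because Maven prints [ERROR]
# while the Scala compiler output uses [error].
first_error() {
  grep -in '\[error\]' "$1" | head -n 1
}

# Usage:
#   ./dev/make-distribution.sh --name spark-ae-2.3 --tgz -Pyarn -DskipTests 2>&1 | tee build.log
#   first_error build.log
```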

Plugin and Repository Configuration Problems

09:39:12 [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-tags_2.12: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed: A required class was missing while executing net.alchim31.maven:scala-maven-plugin:3.2.2:compile: xsbt/Analyzer$

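The first thing to do when an error appears is stay calm. Maven build errors tend to be cryptic: this one looks like a missing class, but the real problem is a plugin dependency.
The root cause is that Spark is a mixed Scala and Java build. Compiling the Scala code requires a dedicated plugin, which is not in your local repository at first and has to be downloaded. The section where Maven configures the plugin download location points at the central repository: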

<pluginRepositories>
  <pluginRepository>
    <id>central</id>
    <url>https://repo.maven.apache.org/maven2</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </pluginRepository>
</pluginRepositories>

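Network access from within China is not very reliable, and some plugins fail to download. When a plugin problem like this shows up, first make sure the plugin can actually be downloaded. The fix is straightforward: switch to the Aliyun repository or your company's internal repository. The entry can go in either settings.xml (where it must sit inside a profile) or pom.xml.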

<pluginRepositories>
  <pluginRepository>
    <id>alimaven</id>
    <name>central</name>
    <url>http://maven.aliyun.com/nexus/content/repositories/central</url>
  </pluginRepository>
</pluginRepositories>
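
To see whether the plugin actually reached the local repository, you can map its coordinates to a directory and look inside. This is a sketch under assumptions: local_repo_dir is a hypothetical helper, M2_REPO is a variable introduced here only for illustration, and ~/.m2/repository is Maven's default location, which a custom localRepository setting would change. An existing directory does not prove that every transitive dependency of the plugin was downloaded.

```shell
# Map Maven coordinates (groupId:artifactId:version) to the matching
# directory in the local repository.
local_repo_dir() (
  repo=${M2_REPO:-$HOME/.m2/repository}
  group=$(echo "$1" | cut -d: -f1 | tr . /)
  artifact=$(echo "$1" | cut -d: -f2)
  version=$(echo "$1" | cut -d: -f3)
  echo "$repo/$group/$artifact/$version"
)

# Usage with the plugin named in the error message:
#   ls "$(local_repo_dir net.alchim31.maven:scala-maven-plugin:3.2.2)"
```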

Scala Version Problems

Here is another error that looks like a missing class:

10:11:38 [error] /home/jenkins/workspace/spark-adaptive/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:96: exception during macro expansion: 
10:11:38 [error] java.lang.NoClassDefFoundError: scala/runtime/LazyRef
10:11:38 [error] 	at org.scalactic.MacroOwnerRepair$Utils.repairOwners(MacroOwnerRepair.scala:66)
10:11:38 [error] 	at org.scalactic.MacroOwnerRepair.repairOwners(MacroOwnerRepair.scala:46)
10:11:38 [error] 	at org.scalactic.BooleanMacro.genMacro(BooleanMacro.scala:837)
10:11:38 [error] 	at org.scalatest.AssertionsMacro$.assert(AssertionsMacro.scala:34)
10:11:38 [error]         assert(toUTF8(s).contains(toUTF8(substring)) === s.contains(substring))
10:11:38 [error]

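This one also takes some analysis. scala/runtime/LazyRef is a basic runtime class that ships with the Scala library itself (it exists only in Scala 2.12 and later). When a class from the standard library cannot be found, suspect a Scala version mismatch. The Spark build anticipates this and provides a script for switching Scala versions, which has to be run before the build: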

./dev/change-scala-version.sh 2.11

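We should also understand why the class was missing. When building with Scala 2.12, the Scala version of the dependencies resolved from the Maven repository has to change as well. Around line 2723 of pom.xml is where the Scala versions are defined: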

<!-- Exists for backwards compatibility; profile doesn't do anything -->
    <profile>
      <id>scala-2.11</id>
    </profile>
    <profile>
      <id>scala-2.12</id>
      <properties>
        <scala.version>2.12.4</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
      </properties>
      ......
    </profile>

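Maven switches profiles with the -P option, so the newer version only takes effect if we explicitly add -Pscala-2.12. In short, when you switch Scala versions, every piece has to be switched together. Do it all in one pass and things go much more smoothly.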
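
The point about switching everything together can be sketched as a small helper that derives both steps from a single version argument, so the script and the profile cannot drift apart. scala_switch_plan is a hypothetical name. It only prints the commands, which lets you review them before running anything.

```shell
# Print the two commands that must agree on the Scala binary version:
# the version-switch script and the build with the matching profile.
scala_switch_plan() (
  case "$1" in
    2.11) profile="-Pscala-2.11" ;;
    2.12) profile="-Pscala-2.12" ;;
    *) echo "unsupported Scala version: $1" >&2; exit 1 ;;
  esac
  echo "./dev/change-scala-version.sh $1"
  echo "./dev/make-distribution.sh --name spark-ae-2.3 --tgz $profile -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DskipTests"
)

# Usage:
#   scala_switch_plan 2.12
```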

External Dependency Problems

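Most of the remaining failures come from external dependencies. Options such as --pip and --r require the corresponding commands to be available in the environment. The build chains its steps together one after another, like candied haws on a skewer. When it reaches the Python step it uses pip to fetch Python packages, and R works the same way. You therefore need to install those packages beforehand. For Python, for example, run this first: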

sudo pip install wheel
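
Before starting a long build, it can help to confirm that the module is importable. A minimal sketch: has_py_module is a hypothetical helper, and the PYTHON variable is an assumption added here so that a specific interpreter (for example python3) can be chosen.

```shell
# Succeed if the given Python module can be imported by the interpreter
# named in $PYTHON (defaults to "python").
has_py_module() {
  "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1
}

# Usage:
#   has_py_module wheel || echo "wheel is missing; install it before building with --pip"
```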

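For R, the R packages have to be installed offline, after which the whole distribution builds.
Remember to analyze problems calmly. Useful information is hard to find on Baidu, so the best approach is to identify the root cause before attempting a fix.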
