Compiling Spark 2.3.0 without Hive


Setting up a Hive on Spark environment -- Compiling Spark

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started

  • Per the Hive wiki above, a Hive on Spark deployment requires a Spark build that does not include the Hive jars.

Note that you must have a version of Spark which does not include the Hive jars. Meaning one which was not built with the Hive profile. If you will use Parquet tables, it's recommended to also enable the "parquet-provided" profile. Otherwise there could be conflicts in Parquet dependency. To remove Hive jars from the installation, simply use the following command under your Spark repository:

The wiki gives the following Spark build command:

Since Spark 2.3.0:

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"

  1. -Pyarn: build with YARN support
  2. hadoop-provided: do not bundle the Hadoop jars (they are provided at runtime)
  3. hadoop-2.7: target the Hadoop 2.7 release line
  4. parquet-provided: do not bundle the Parquet jars
  5. orc-provided: do not bundle the ORC jars
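Because of the "provided" profiles above, the resulting distribution should contain no Hive jars at all. A minimal post-build check can confirm this (the helper name and the dist/jars path are my own sketch, not from the wiki):

```shell
# List any Hive-related jars found in a built distribution's jars directory.
# Empty output (grep exits 1) means the build is Hive-free, as Hive on Spark requires.
list_hive_jars() {
    ls "$1" | grep -i 'hive'
}

# Usage after make-distribution.sh finishes, from the Spark source root:
#   list_hive_jars dist/jars || echo "clean: no hive jars"
```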
  • The latest Hive release at the time of writing is hive-3.1.2. According to the wiki, the Spark version compatible with hive-3.1.2 is Spark 2.3.0.

Version Compatibility

Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with a specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. Below is a list of Hive versions and their corresponding compatible Spark versions.

Hive Version    Spark Version
master          2.3.0
3.0.x           2.3.0
2.3.x           2.0.0
2.2.x           1.6.0
2.1.x           1.6.0
2.0.x           1.5.0
1.2.x           1.3.1
1.1.x           1.2.0
  • Per the official Spark documentation, building Spark requires Maven 3.3.9 or newer and Java 8+.

http://spark.apache.org/docs/2.3.0/building-spark.html

Building Apache Spark

Apache Maven

The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+. Note that support for Java 7 was removed as of Spark 2.2.0.

  • With that settled, let's build spark-2.3.0.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

System environment

OS: CentOS Linux release 7.7.1908 (Core)

[ghl@ghlhost etc]$ cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

Java: java version "1.8.0_231"

[ghl@ghlhost etc]$ java -version
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)

Maven: apache-maven-3.6.3

[ghl@ghlhost etc]$ mvn -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /u01/apache-maven-3.6.3
Java version: 1.8.0_231, vendor: Oracle Corporation, runtime: /u01/jdk1.8.0_231/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-1062.9.1.el7.x86_64", arch: "amd64", family: "unix"

Spark: spark-2.3.0 (source)

Enter the Spark source root directory:

[ghl@ghlhost spark-2.3.0]$ ll
total 288
-rw-r--r--  1 ghl ghl   2318 Feb 23  2018 appveyor.yml
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 assembly
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 bin
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 build
drwxr-xr-x  9 ghl ghl   4096 Feb 23  2018 common
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 conf
-rw-r--r--  1 ghl ghl    995 Feb 23  2018 CONTRIBUTING.md
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 core
drwxr-xr-x  5 ghl ghl   4096 Feb 23  2018 data
drwxr-xr-x  6 ghl ghl   4096 Feb 23  2018 dev
drwxr-xr-x  9 ghl ghl   4096 Feb 23  2018 docs
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 examples
drwxr-xr-x 15 ghl ghl   4096 Feb 23  2018 external
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 graphx
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 hadoop-cloud
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 launcher
-rw-r--r--  1 ghl ghl  18045 Feb 23  2018 LICENSE
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 licenses
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 mllib
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 mllib-local
-rw-r--r--  1 ghl ghl  24913 Feb 23  2018 NOTICE
-rw-r--r--  1 ghl ghl 101688 Feb 23  2018 pom.xml
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 project
drwxr-xr-x  6 ghl ghl   4096 Feb 23  2018 python
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 R
-rw-r--r--  1 ghl ghl   3809 Feb 23  2018 README.md
drwxr-xr-x  5 ghl ghl   4096 Feb 23  2018 repl
drwxr-xr-x  5 ghl ghl   4096 Feb 23  2018 resource-managers
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 sbin
-rw-r--r--  1 ghl ghl  17624 Feb 23  2018 scalastyle-config.xml
drwxr-xr-x 29 ghl ghl   4096 Feb 23  2018 spark
drwxr-xr-x  6 ghl ghl   4096 Feb 23  2018 sql
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 streaming
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 tools
[ghl@ghlhost spark-2.3.0]$ pwd
/home/ghl/softwares/spark-2.3.0

Set MAVEN_OPTS, as the Spark build docs recommend, so Maven does not run out of memory:

[ghl@ghlhost spark-2.3.0]$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
[ghl@ghlhost spark-2.3.0]$ echo $MAVEN_OPTS
-Xmx2g -XX:ReservedCodeCacheSize=512m

Modify pom.xml, replacing the default central repository with the Aliyun mirror (pom.xml lines 235-236 and 248-249):


   230	  <repositories>
   231	    <repository>
   232	      <id>central</id>
   233	      <!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
   234	      <name>Maven Repository</name>
   235	      <!--<url>https://repo.maven.apache.org/maven2</url>-->
   236	      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
   237	      <releases>
   238	        <enabled>true</enabled>
   239	      </releases>
   240	      <snapshots>
   241	        <enabled>false</enabled>
   242	      </snapshots>
   243	    </repository>
   244	  </repositories>
   245	  <pluginRepositories>
   246	    <pluginRepository>
   247	      <id>central</id>
   248	      <!--<url>https://repo.maven.apache.org/maven2</url>-->
   249	      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
   250	      <releases>
   251	        <enabled>true</enabled>
   252	      </releases>
   253	      <snapshots>
   254	        <enabled>false</enabled>
   255	      </snapshots>
   256	    </pluginRepository>
   257	  </pluginRepositories>

 

Start the build

./dev/make-distribution.sh --name "hadoop277-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided" -Dhadoop.version=2.7.7

Here "-Dhadoop.version=2.7.7" pins the exact Hadoop version to 2.7.7.

main:
[INFO] Executed tasks
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 2.3.0:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [02:45 min]
[INFO] Spark Project Tags ................................. SUCCESS [ 21.019 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 11.515 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 17.450 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 18.672 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  8.324 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 18.598 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 44.480 s]
[INFO] Spark Project Core ................................. SUCCESS [07:05 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 40.373 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 31.986 s]
[INFO] Spark Project Streaming ............................ SUCCESS [01:10 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:26 min]
[INFO] Spark Project SQL .................................. SUCCESS [05:16 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:22 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 11.906 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:33 min]
[INFO] Spark Project REPL ................................. SUCCESS [  9.409 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 12.235 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 37.168 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  9.423 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 20.819 s]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 18.046 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 31.207 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  8.645 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  30:53 min
[INFO] Finished at: 2020-01-14T10:42:13+08:00
[INFO] ------------------------------------------------------------------------

After a successful build, a file named spark-2.3.0-bin-hadoop277-without-hive.tgz appears in the source root directory (the hadoop277-without-hive suffix comes from --name "hadoop277-without-hive"). This is the Spark distribution without the Hive dependency jars that we need.

[ghl@ghlhost spark-2.3.0]$ ll -t
total 131152
-rw-rw-r--  1 ghl ghl 133992952 Jan 14 10:42 spark-2.3.0-bin-hadoop277-without-hive.tgz
drwxrwxr-x 11 ghl ghl      4096 Jan 14 10:42 dist
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:42 assembly
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:41 examples
drwxr-xr-x  6 ghl ghl      4096 Jan 14 10:41 repl
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:39 mllib
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:26 streaming
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:25 graphx
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:24 core
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:17 mllib-local
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:16 launcher
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:14 tools
drwxrwxr-x  8 ghl ghl      4096 Jan 14 10:14 target
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:10 build
-rw-r--r--  1 ghl ghl    101845 Jan 14 09:31 pom.xml
drwxr-xr-x 29 ghl ghl      4096 Feb 23  2018 spark
drwxr-xr-x  6 ghl ghl      4096 Feb 23  2018 sql
-rw-r--r--  1 ghl ghl      2318 Feb 23  2018 appveyor.yml
drwxr-xr-x  9 ghl ghl      4096 Feb 23  2018 common
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 conf
-rw-r--r--  1 ghl ghl       995 Feb 23  2018 CONTRIBUTING.md
drwxr-xr-x  5 ghl ghl      4096 Feb 23  2018 data
drwxr-xr-x  6 ghl ghl      4096 Feb 23  2018 dev
drwxr-xr-x 15 ghl ghl      4096 Feb 23  2018 external
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 project
-rw-r--r--  1 ghl ghl     17624 Feb 23  2018 scalastyle-config.xml
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 bin
drwxr-xr-x  9 ghl ghl      4096 Feb 23  2018 docs
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 hadoop-cloud
-rw-r--r--  1 ghl ghl     18045 Feb 23  2018 LICENSE
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 licenses
-rw-r--r--  1 ghl ghl     24913 Feb 23  2018 NOTICE
drwxr-xr-x  6 ghl ghl      4096 Feb 23  2018 python
drwxr-xr-x  3 ghl ghl      4096 Feb 23  2018 R
-rw-r--r--  1 ghl ghl      3809 Feb 23  2018 README.md
drwxr-xr-x  5 ghl ghl      4096 Feb 23  2018 resource-managers
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 sbin
[ghl@ghlhost spark-2.3.0]$
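As a final sanity check (my own addition, not part of the original steps), you can list the tarball's contents and confirm no Hive jars were bundled:

```shell
# Print any Hive jars bundled inside a Spark distribution tarball.
# No output (grep exits 1) means the distribution is Hive-free.
hive_jars_in_tgz() {
    tar -tzf "$1" | grep -i 'hive[^/]*\.jar$'
}

# Usage:
#   hive_jars_in_tgz spark-2.3.0-bin-hadoop277-without-hive.tgz \
#       || echo "no hive jars bundled"
```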

 

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1. The first build attempt failed because it used the default Maven repository <url>https://repo.maven.apache.org/maven2</url>.

After switching to <url>https://maven.aliyun.com/nexus/content/groups/public/</url>, the second build completed in about half an hour.
2. The first few build steps are quite slow; be patient.
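An alternative to editing pom.xml (my own suggestion, not from the original steps) is to configure the Aliyun mirror once in ~/.m2/settings.xml, which build/mvn also honors, so the source tree stays untouched:

```shell
# Write a Maven settings file that mirrors the 'central' repository to Aliyun.
# WARNING: this overwrites an existing ~/.m2/settings.xml -- back it up first.
mkdir -p "$HOME/.m2"
cat > "$HOME/.m2/settings.xml" <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>aliyun-central</id>
      <mirrorOf>central</mirrorOf>
      <name>Aliyun public mirror</name>
      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
    </mirror>
  </mirrors>
</settings>
EOF
```

Maven matches mirrors by repository id, so `<mirrorOf>central</mirrorOf>` also covers the `central` repository declared in Spark's own pom.xml.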

[ghl@ghlhost spark-2.3.0]$ ./dev/make-distribution.sh --name "hadoop277-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided" -Dhadoop.version=2.7.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/home/ghl/softwares/spark-2.3.0
+ DISTDIR=/home/ghl/softwares/spark-2.3.0/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/home/ghl/softwares/spark-2.3.0/build/mvn
+ ((  5  ))
+ case $1 in
+ NAME=hadoop277-without-hive
+ shift
+ shift
+ ((  3  ))
+ case $1 in
+ MAKE_TGZ=true
+ shift
+ ((  2  ))
+ case $1 in
+ break
+ '[' -z /u01/jdk1.8.0_231 ']'
+ '[' -z /u01/jdk1.8.0_231 ']'
++ command -v git
+ '[' ']'
++ command -v /home/ghl/softwares/spark-2.3.0/build/mvn
+ '[' '!' /home/ghl/softwares/spark-2.3.0/build/mvn ']'
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=project.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ VERSION=2.3.0
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=scala.binary.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ SCALA_VERSION=2.11
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=hadoop.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ SPARK_HADOOP_VERSION=2.7.7
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=project.activeProfiles -pl sql/hive -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ fgrep --count '<id>hive</id>'
++ echo -n

3. The compiled Spark distribution has been uploaded:

https://download.csdn.net/download/ghl0451/12101209

It is still under review at the moment.
