Environment
Reference (official build docs):
https://spark.apache.org/docs/3.2.0/building-spark.html
The Maven-based build is the build of reference for Apache Spark.
Building Spark using Maven requires Maven 3.6.3 and Java 8.
Spark requires Scala 2.12;
support for Scala 2.11 was removed in Spark 3.0.0.
OS: building on Windows
Java: 1.8
Maven: 3.6.3
Scala: 2.12.14
Note: Maven 3.8.3 was tried first, but the build kept failing; after switching to Maven 3.6.3 it passed.
Download the source
Download page: http://spark.apache.org/downloads.html
Pick the source package: https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0.tgz
Alternatively, clone the repository from GitHub.
Changing the spark-shell startup LOGO
1. Extract spark-3.2.0.tgz on the local Windows machine.
2. Open the source tree in IDEA.
The code to modify in spark-3.2.0 is the REPL class org.apache.spark.repl.SparkILoop, which prints the welcome banner when spark-shell starts.
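The banner is just a string printed at startup. The standalone sketch below illustrates the kind of change made: the Banner object and the welcome helper are invented for this example (they are not Spark APIs); a custom tag is appended after "Welcome to", matching the output shown at the end of this article.

```scala
// Illustrative sketch only: Banner/welcome are invented names. In the real
// source you edit the banner string inside SparkILoop itself.
object Banner {
  // The real code uses org.apache.spark.SPARK_VERSION; hard-coded here.
  val sparkVersion = "3.2.0"

  // Appends a custom tag right after "Welcome to".
  def welcome(tag: String): String =
    s"""Welcome to $tag
       |      ____              __
       |     / __/__  ___ _____/ /__
       |    _\\ \\/ _ \\/ _ `/ __/  '_/
       |   /___/ .__/\\_,_/_/ /_/\\_\\   version $sparkVersion
       |      /_/
       |""".stripMargin

  def main(args: Array[String]): Unit =
    println(welcome("021-上海-Jerry"))
}
```

Rebuilding the repl module after such an edit is what makes the customized banner appear in the packaged spark-shell.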
Editing the pom.xml files
Open the top-level pom.xml and add the following entry to the <modules> section (around line 100 of the file):
<module>sql/hive-thriftserver</module>
A few more changes are needed: dependencies with <scope>provided</scope> are expected to be supplied by the runtime, so they are missing from the classpath when running inside IDEA, which leads to ClassNotFoundException. For example, without these changes, launching SparkSQLCLIDriver after the build fails with:
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
...
Caused by: java.lang.ClassNotFoundException: com.google.common.cache.CacheLoader
...
Edit the pom.xml under the sql/hive-thriftserver module and comment out the provided scopes:
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<!-- <scope>provided</scope>-->
</dependency>
Make the corresponding changes in the top-level pom.xml:
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-http</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-continuation</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlets</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-proxy</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-client</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-util</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-security</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-plus</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
And change the scope of the following dependencies to compile:
<dependency>
<groupId>xml-apis</groupId>
<artifactId>xml-apis</artifactId>
<version>1.4.01</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
Building the source
On the Windows machine, enter the spark-3.2.0 directory, right-click to open Git Bash, and run the build command:
./dev/make-distribution.sh \
--name spark-3.2.0 \
--tgz -Phive \
-Phive-thriftserver \
-Pyarn -Phadoop-3.2 \
-Dhadoop.version=3.2.2 \
-Dscala.version=2.12.14
Note: to have Maven build Spark with Hive support, the -Phive and -Phive-thriftserver profiles must be given.
With a poor network connection, the build can stall at the following step; if the download does not succeed, the build fails:
exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.15/scala-2.12.15.tgz
To avoid this, download scala-2.12.15.tgz manually, place it under spark-3.2.0/build/, and extract it there. The subsequent build then goes much more smoothly.
An alternative way to build:
./build/mvn -Phadoop-3.2 -Pyarn -Dhadoop.version=3.2.2 -Phive -Phive-thriftserver -DskipTests clean package
If the build runs out of memory, set:
export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
Importing into IDEA
First set the Maven home path in IDEA to the Maven 3.6.3 installation, then:
1. Right-click the top-level pom.xml and choose Add as Maven Project.
2. Right-click pom.xml → Maven → Reimport to load the dependencies.
3. Rebuild the whole project so that anything left inconsistent by the import gets recompiled.
4. After the import, a few entries in the top-level pom.xml may still be flagged in red; these can simply be deleted (removing them does no harm as long as everything runs).
The import is then complete.
Connecting to a Hive database
(This requires the build to have included -Phive -Phive-thriftserver, as noted above.)
In the sql/hive-thriftserver module, locate the resources directory under src/main and copy the cluster's hive-site.xml into it:
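For this purpose, a minimal hive-site.xml usually only needs to point at the metastore. The sketch below assumes the metastore runs on this cluster's hadoop001 host with the default port 9083; adjust both to your environment:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Remote Hive metastore; host and port must match your cluster -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop001:9083</value>
  </property>
</configuration>
```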
On the server, the metastore service needs to be running:
hive --service metastore &
Then locate SparkSQLCLIDriver in IDEA:
src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
Running its main program at first fails with:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
...
Cause: a Spark application must be told its run mode (the master URL) when it starts.
Fix: add these VM options to the run configuration: -Djline.WindowsTerminal.directConsole=false -Dspark.master=local
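Why -Dspark.master works: SparkConf, when constructed with defaults loaded, picks up every JVM system property whose name starts with spark.. The standalone sketch below (no Spark dependency; SparkPropsDemo is invented for illustration) mimics that lookup:

```scala
object SparkPropsDemo {
  // Simplified stand-in for SparkConf(loadDefaults = true), which copies every
  // JVM system property starting with "spark." into the configuration.
  def sparkProps(): Map[String, String] =
    sys.props.toMap.filter { case (k, _) => k.startsWith("spark.") }

  def main(args: Array[String]): Unit = {
    System.setProperty("spark.master", "local") // stands in for the -D flag
    println(sparkProps().getOrElse("spark.master", "<unset>"))
  }
}
```

Running this prints local, which is exactly the value SparkContext sees when the VM option is set.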
With the options in place, SparkSQLCLIDriver starts successfully, and SQL statements execute as expected.
Verifying the spark-shell LOGO change
After the build above finishes, the spark-3.2.0 directory contains a tarball: spark-3.2.0-bin-spark-3.2.0.tgz
Upload it to the server, set up the environment variables, and start spark-shell:
[ruoze@hadoop001 bin]$ ./spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://hadoop001:4040
Spark context available as 'sc' (master = local[*], app id = local-1644494050341).
Spark session available as 'spark'.
Welcome to 021-上海-Jerry
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.0
/_/
Using Scala version 2.12.14 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
References:
https://www.cnblogs.com/chuijingjing/p/14660893.html
https://blog.csdn.net/qq_43081842/article/details/105777311
https://blog.csdn.net/u011622631/article/details/106231663
https://blog.csdn.net/qq_21355765/article/details/81743815