Compiling Spark from Source
Current Environment and Official Documentation
- Build environment and versions
  - System: CentOS Linux release 7.6.1810 (Core)
  - JDK: java version "1.8.0_231" (HotSpot)
  - Scala: Scala-2.11.12
  - Maven: Apache Maven 3.6.3
  - Spark: spark-2.4.4
- Official build documentation: http://spark.apache.org/docs/latest/building-spark.html
Environment Variable Configuration
- It is recommended to create a dedicated user for deploying big-data components (~/.bash_profile), though you can also use root (/etc/profile)
# Java
export JAVA_HOME=/opt/bigdata/jdk1.8.0_231/
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
# Scala
export SCALA_HOME=/opt/bigdata/scala-2.11.12/
export PATH=$SCALA_HOME/bin:$PATH
# Maven
export MAVEN_HOME=/home/drawit/app/apache-maven-3.6.3/
export PATH=$MAVEN_HOME/bin:$PATH
Downloading the Spark Source
- From the Spark GitHub repository
  - Click "releases", pick a release such as v2.4.4, and download the tar.gz archive
  - Download and extract it to /home/drawit/compile/spark-2.4.4
- From the official download page
  - Choose a version, e.g. 2.4.4
  - Choose "Source Code" as the package type
  - Click spark-2.4.4.tgz to download
Compiling Spark with Maven
- Add the Aliyun mirror to settings.xml (recommended; much faster)
<mirror>
  <id>alimaven</id>
  <mirrorOf>central</mirrorOf>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</mirror>
- Set the MAVEN_OPTS environment variable
export MAVEN_OPTS="-Xmx2G -XX:ReservedCodeCacheSize=512M"
- Run the package command
./build/mvn -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pflume -DskipTests clean package
- You can use your own mvn directly, but it is recommended to build with the ./build/mvn bundled in the source tree (see the Notes section for why)
- To package a distribution like the official Spark releases, build with ./dev/make-distribution.sh:
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pflume
- Option reference
  - --name: suffix for the name of the generated package
  - --pip: also build the Python pip package
  - --r: also build the R package (see the Notes section)
  - --tgz: produce a .tar.gz archive
  - --mvn: path to the mvn executable (better left unset; the bundled ./build/mvn is used by default, see the Notes section for why)
  - -P: a Maven flag that activates a profile defined in pom.xml
- When the build finishes, spark-2.4.4-bin-custom-spark.tgz appears in the source directory
- Ready to deploy! ^_^
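Before deploying, it is worth a quick check that the archive was actually produced and lists cleanly; a minimal sketch (the package name follows the --name option used above):

```shell
# Sanity-check the generated distribution archive before deploying
PKG=spark-2.4.4-bin-custom-spark.tgz   # name follows the --name option above
if [ -f "$PKG" ]; then
  tar tzf "$PKG" | head -n 5           # peek at the first few entries
  echo "package ready: $PKG"
else
  echo "package missing: $PKG - re-run make-distribution.sh"
fi
```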
Notes
- You do not have to run the mvn package step manually first; you can call make-distribution.sh directly and it will run the Maven build for you
- The source tree ships with ./build/mvn, which is used by default; you can point to your own Maven with --mvn
- ./build/mvn checks whether the MAVEN_HOME environment variable is set and, if so, uses your own mvn (your own settings.xml still applies)
- ./build/mvn also downloads Zinc into ./build/ automatically and uses a zinc server, which makes compilation noticeably faster
- Zinc-related log lines
  - Without zinc: [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
  - With zinc: [INFO] Using zinc server for incremental compilation
- If no Hadoop version is specified, 2.6 is the default
- If no Hive version is specified, 1.2.1 is the default
- Kafka support is built against 0.10 automatically; for 0.8, add -Pkafka-0-8
- To build SparkR, you must install R first:
sudo yum install epel-release
sudo yum install R
sudo Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='https://mirrors.tuna.tsinghua.edu.cn/CRAN/')"
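Once R is set up, a quick check that the required packages can actually be loaded; a hedged sketch (skips gracefully when Rscript is absent):

```shell
# Verify the packages needed by the SparkR build can be loaded (skips if R is absent)
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'print(sapply(c("knitr","rmarkdown","devtools","e1071","survival"), requireNamespace, quietly=TRUE))'
else
  echo "Rscript not on PATH - install R first"
fi
```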
- Appendix: how to install R packages - examples
  - Run R to enter the R shell
  - Set a mirror
    - Globally:
options(repos=structure(c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/")))
    - For a single package:
install.packages("stringi", repos="https://mirrors.tuna.tsinghua.edu.cn/CRAN/")
  - Installation examples
install.packages("knitr")
install.packages("stringi")
install.packages("e1071")
install.packages("rmarkdown")
install.packages("testthat")
- Want to change the Scala version?
./dev/change-scala-version.sh 2.12
./build/mvn -Pscala-2.12 compile
- If you build inside a virtual machine, clock drift can occur. Fix: VMware -> right-click the VM -> Settings -> Options -> VMware Tools time synchronization -> check "Synchronize guest time with host"
  - If that doesn't work, reinstall VMware Tools
- Want to build on Windows?
  - Install git, then run mvn from Git Bash
- Want to debug in IDEA?
  - Project Structure -> Modules -> spark-core_2.11 -> change guava's scope to compile
- Master: launch
org.apache.spark.deploy.master.Master
- Worker: launch
org.apache.spark.deploy.worker.Worker
  - Program arguments for the Worker's main:
--webui-port 8081 spark://192.168.0.101:7077 --cores 2 --memory 2G
  - Don't pass localhost as the IP! -_-!
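Outside the IDE, the same two classes can be started from a built distribution via the bundled spark-class launcher; a sketch under assumed paths (the install directory and the 192.168.0.101 address are illustrative):

```shell
# Start a standalone Master and Worker using spark-class (illustrative paths)
SPARK_HOME=${SPARK_HOME:-/opt/bigdata/spark-2.4.4-bin-custom-spark}  # assumed install dir
if [ -x "$SPARK_HOME/bin/spark-class" ]; then
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.master.Master \
      --host 192.168.0.101 --port 7077 --webui-port 8080 &
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker \
      --webui-port 8081 spark://192.168.0.101:7077 --cores 2 --memory 2G &
else
  echo "spark-class not found under $SPARK_HOME"
fi
```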
Appendix
- Output of a successful Maven build
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 2.4.4:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  1.219 s]
[INFO] Spark Project Tags ................................. SUCCESS [  2.499 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  3.600 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  1.534 s]
[INFO] Spark Project Networking ........................... SUCCESS [  3.028 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  1.543 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  4.439 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  2.090 s]
[INFO] Spark Project Core ................................. SUCCESS [01:26 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 27.461 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 18.232 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 32.276 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:17 min]
[INFO] Spark Project SQL .................................. SUCCESS [02:08 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:22 min]
[INFO] Spark Project Tools ................................ SUCCESS [  6.143 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 45.921 s]
[INFO] Spark Project REPL ................................. SUCCESS [ 10.522 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [  3.728 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 32.081 s]
[INFO] Spark Project Mesos ................................ SUCCESS [ 24.496 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 29.383 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  1.430 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 16.237 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 19.414 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 16.065 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  2.265 s]
[INFO] Spark Avro ......................................... SUCCESS [ 11.898 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [  3.773 s]
[INFO] Spark Project External Flume ....................... SUCCESS [  6.348 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [  1.910 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
- Output when make-distribution.sh completes
+ '[' true == true ']'
+ TARDIR_NAME=spark-2.4.4-bin-skey-spark
+ TARDIR=/home/drawit/compile/spark-2.4.4-mvn/spark-2.4.4-bin-skey-spark
+ rm -rf /home/drawit/compile/spark-2.4.4-mvn/spark-2.4.4-bin-skey-spark
+ cp -r /home/drawit/compile/spark-2.4.4-mvn/dist /home/drawit/compile/spark-2.4.4-mvn/spark-2.4.4-bin-skey-spark
+ tar czf spark-2.4.4-bin-skey-spark.tgz -C /home/drawit/compile/spark-2.4.4-mvn spark-2.4.4-bin-skey-spark
+ rm -rf /home/drawit/compile/spark-2.4.4-mvn/spark-2.4.4-bin-skey-spark
[drawit@skey01 spark-2.4.4-mvn]$ ls
appveyor.yml     core      graphx           mllib          R                      sql
assembly         data      hadoop-cloud     mllib-local    README.md              streaming
bin              dev       launcher         NOTICE         repl                   target
build            dist      LICENSE          NOTICE-binary  resource-managers      tools
common           docs      LICENSE-binary   pom.xml        sbin
conf             examples  licenses         project        scalastyle-config.xml
CONTRIBUTING.md  external  licenses-binary  python         spark-2.4.4-bin-skey-spark.tgz