Preface: the Spark source code is much easier to follow once you have some hands-on experience with Spark, and reading with a concrete question in mind pays off doubly. For example: how does Spark carry out a spark-submit job, and how does it load its configuration files?
Source Code Environment
Tool | Version
---|---
OS | macOS
Java | 1.8
Scala | 2.11
Maven | 3.5
Hadoop | 2.7.3
Spark | 2.3.2
Downloading the Source
This walkthrough uses Spark 2.3.2.
Source download link:
https://archive.apache.org/dist/spark/spark-2.3.2/
Building the Source
Set up the Java, Scala, Maven, and Hadoop environments yourself.
My environment variables are as follows:
```shell
SCALA_HOME=/Library/scala/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
M2_HOME=/Library/maven/apache-maven-3.5.0
export PATH=$PATH:$M2_HOME/bin
HADOOP_HOME=/Users/lofty/source_code_study/hadoop-2.7.3/
export PATH=$PATH:$HADOOP_HOME/bin
HADOOP_CONF_DIR=/Users/lofty/source_code_study/hadoop-2.7.3/etc/hadoop
export HADOOP_CONF_DIR
SPARK_HOME=/Users/lofty/source_code_study/spark-2.3.2
SPARK_CONF_DIR=/Users/lofty/source_code_study/spark-2.3.2/conf
export PATH=$PATH:$SPARK_HOME/bin
```
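Before compiling, it helps to confirm that each tool is actually reachable on PATH. A small sketch of such a check (it only probes for the commands; the paths and versions above are from my machine):

```shell
# Sanity-check that each build tool is on PATH before compiling.
# Versions should also match the table above (java -version, mvn -v, etc.).
for tool in java scala mvn hadoop; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found ($(command -v "$tool"))"
  else
    echo "$tool: MISSING from PATH"
  fi
done
```

Any line reporting MISSING means the corresponding `export PATH` entry above was not picked up by the current shell.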
After downloading, extract the archive: `tar -zxvf spark-2.3.2.tgz`
Enter the Spark root directory and run (note that the Hadoop maven profile is named `hadoop-2.7`; the patch version goes in `-Dhadoop.version`):

```shell
mvn -T 4 -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package
```
To test it, run `./spark-shell` from the bin directory.
Startup first shows the JVM command that the launch scripts assemble:

```
/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/bin/java -cp /Users/lofty/source_code_study/spark-2.3.2/conf/:/Users/lofty/source_code_study/spark-2.3.2/assembly/target/scala-2.11/jars/*:/Users/lofty/source_code_study/hadoop-2.7.3/etc/hadoop/ -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell
```
```
20/03/23 21:38:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.103:4040
Spark context available as 'sc' (master = local[*], app id = local-1584970741868).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
OK, the build succeeded.
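As a quick end-to-end check beyond just reaching the prompt, a tiny job can be typed into the shell (a sketch; `sc` is the SparkContext the REPL creates for you, as the startup banner says):

```
scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050
```

If this returns 5050, the assembled jars and the local master are working.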
Next, import the source into IDEA.
After loading completes, the directory structure looks as follows:
First remove all jars under Libraries, then add the freshly built jars to Libraries.
The built jars live under:
$SPARK_HOME/assembly/target/scala-2.11/jars/
A few Jetty jars need to be added as well.
Once they are added, configure IDEA not to build the project on launch.
This avoids the missing-class errors during source debugging that are caused by the many
`<scope>provided</scope>` dependencies in pom.xml.
Run SparkPi from the spark-examples module, with build-before-run disabled (this saves a lot of build time) and the launch parameter set to:
-Dspark.master=local
Run it.
OK, it works.
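For reference, the run configuration looks roughly like this (field names are IDEA's; `SparkPi`'s optional program argument is the number of partitions to sample over):

```
Main class:   org.apache.spark.examples.SparkPi
VM options:   -Dspark.master=local
Program args: 10
```

Setting the master via a VM option works because SparkConf reads system properties prefixed with `spark.`, so no code changes to the example are needed.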
The next post starts reading the Spark source from the program entry point, spark-submit.