Official build documentation: http://spark.apache.org/docs/latest/building-spark.html
Spark source code on GitHub: https://github.com/apache/spark
Building the Source
-
Modify the relevant configuration
Comment out the version-detection block around line 128 of the dev/make-distribution.sh script and hard-code the versions instead:
VERSION=2.4.5
SCALA_VERSION=2.12.10
SPARK_HADOOP_VERSION=2.6.0-cdh5.16.2
SPARK_HIVE=1
==> replacing the original detection block, now commented out:
# VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
#     | grep -v "INFO"\
#     | grep -v "WARNING"\
#     | tail -n 1)
# SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
#     | grep -v "INFO"\
#     | grep -v "WARNING"\
#     | tail -n 1)
# SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
#     | grep -v "INFO"\
#     | grep -v "WARNING"\
#     | tail -n 1)
# SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
#     | grep -v "INFO"\
#     | grep -v "WARNING"\
#     | fgrep --count "<id>hive</id>";\
#     # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
#     # because we use "set -o pipefail"
#     echo -n)
Change the Maven repository address in pom.xml, around line 253:
Maven Repository
<!-- <url>https://repo.maven.apache.org/maven2</url> -->
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
Also add the CDH repository:
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
-
Set the memory for the Maven build: export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
-
Package: ./build/mvn -DskipTests clean package
-
Build the distribution: ./dev/make-distribution.sh --name 2.6.0-cdh5.16.2 --tgz -Pyarn -Phive -Phive-thriftserver -Pscala-2.12 -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.16.2
The package produced by the build: spark-2.4.5-bin-2.6.0-cdh5.16.2.tgz
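As a quick sanity check of the build output, a minimal sketch (directory name assumed from the package name above; adjust to your environment) is to unpack the tarball and print the version banner:

# Assumed paths derived from the package name above
tar -zxvf spark-2.4.5-bin-2.6.0-cdh5.16.2.tgz
cd spark-2.4.5-bin-2.6.0-cdh5.16.2
# Prints the Spark version, Scala version and build revision of the new package
./bin/spark-submit --version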
Importing the Source into IDEA
1. It is best to add a mirror to Maven's settings.xml first:
<mirror>
  <id>alimaven</id>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <mirrorOf>central</mirrorOf>
</mirror>
2. Switching to Scala 2.12
Spark 2.4.5's default pom.xml uses Scala 2.11, so it needs to be changed. In the Spark source directory, run the following command from Git Bash:
./dev/change-scala-version.sh 2.12
3. Build the source
mvn clean package -Dmaven.test.skip
The build will keep downloading the required jars. After it finishes, running the example programs fails because the spark-version-info.properties file cannot be found; this is because the step that generates spark-version-info.properties was never executed.
4. The pom of the core module needs to be modified; this is the maven-antrun-plugin execution that generates the build info:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-antrun-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-resources</phase>
      <configuration>
        <!-- Execute the shell script to generate the spark build information. -->
        <target>
          <exec executable="bash">
            <arg value="${project.basedir}/../build/spark-build-info"/>
            <arg value="${project.build.directory}/extra-resources"/>
            <arg value="${project.version}"/>
          </exec>
        </target>
      </configuration>
      <goals>
        <goal>run</goal>
      </goals>
    </execution>
  </executions>
</plugin>
5. Run the command to generate spark-version-info.properties
admin@DESKTOP-80USIPP MINGW64 /d/ruozespace/spark-2.4.8
$ build/spark-build-info /core/target/extra-resources spark-core_2.12
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
The "fatal: Not a git repository" messages above are only notices; spark-version-info.properties has already been generated in the output directory passed to the command.
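As a sketch of how to confirm this (using the output directory passed to the command above), you can simply print the generated file:

# Path as passed to build/spark-build-info in the command above
cat /core/target/extra-resources/spark-version-info.properties
# It typically contains keys such as version, user, revision, branch, date and url;
# the git-related values stay empty here because the tree is not a git repository,
# which is exactly what the "fatal" messages above refer to.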
Testing
When running test code, remember to check "Include dependencies with 'Provided' scope" in the run configuration.
Some code will show up red in the editor, but this does not affect running; if it bothers you, find another source file and copy its License header over.
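If you also want a quick sanity check outside IDEA, a minimal sketch (assuming the source tree was built with the commands above) is to run one of the bundled examples from the source directory:

# Run a bundled example on a local master after the build
./bin/run-example SparkPi 10
# Or start an interactive shell against the freshly built jars
./bin/spark-shell --master local[2]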