1. Overview
2. Versions and environment
Item | Version | Notes |
---|---|---|
OS | Windows 10 | |
JDK | 1.8 | |
Scala | 2.11.12 | |
Spark | 2.4.8 | |
Maven | 5.8.1 | The version that matches the source tree can be used |
sbt | 1.4 | Not found to be needed |
IDEA | 2020.03 | |
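3. Basic environment setup
Install the following components by following their own documentation:
- JDK, installed locally on Windows 10.
- Scala, installed locally on Windows 10.
- Maven, installed locally on Windows 10.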
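Before moving on, it can help to confirm that the tools are actually reachable from the shell. The following is a minimal sketch, assuming Git Bash (or any POSIX shell) and the default command names `java`, `scala`, and `mvn`; the helper name `check_tool` is made up for this example.

```shell
# Report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Check each component from the list above.
for tool in java scala mvn; do
  check_tool "$tool"
done
```

Any line that starts with `missing:` usually means the install directory has not been added to PATH yet.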
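4. Getting the source
Fork the Spark source [1] into your personal Git repository, then configure GitHub in IDEA and pull the fork. This is straightforward, so the steps are omitted here.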
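5. IDEA settings
5.1 Configuring and updating the Maven plugin in IDEA
For a settings.xml configured with mirrors in China, see Appendix 1.
5.2 Importing the Spark modules into IDEA
6. Building Spark and running the JavaWordCount example
6.1 Building Spark for a specific version
In the IDEA terminal, change to the root of the Spark source tree, specify the Hadoop and YARN versions, and build: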
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
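Note: Hadoop 2.7 is more stable, but my local environment runs Hadoop 2.6, so I kept that version.
On success, the output looks like the figure below [2].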
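If you build against a different Hadoop version, the profile name and the `hadoop.version` property have to change together. The sketch below only prints the command rather than running it. It assumes the profile is always named `hadoop-<major>.<minor>`, as in the command above; `spark_build_cmd` is a hypothetical helper, not part of Spark.

```shell
# Print the Maven build command for a given Hadoop version.
# Assumption: the Maven profile is named hadoop-<major>.<minor>.
spark_build_cmd() {
  local hadoop_version="$1"
  # Drop the last dotted component: 2.6.0 -> 2.6
  local profile="hadoop-${hadoop_version%.*}"
  echo "mvn -Pyarn -P${profile} -Dhadoop.version=${hadoop_version} -DskipTests clean package"
}

spark_build_cmd 2.6.0
```

Check that the resulting profile actually exists in the pom.xml of the branch you are building before using the printed command.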
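6.2 Handling the spark-version-info.properties file
Run build/spark-build-info from Git Bash as administrator to generate the spark-version-info.properties file [3]:
build/spark-build-info D:\directory-path
Copy the generated spark-version-info.properties file into the root directory of the spark-core_2.11-2.4.0-SNAPSHOT.jar archive. (First check whether spark-version-info.properties already exists in the jar root; copy it only if it does not.)
Note: without this file, the program fails with "Could not find spark-version-info.properties" [4]; see Appendix 2 for the full error.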
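The check-then-copy step can also be done from the command line instead of an archive tool. This is a sketch under assumptions: it relies on the JDK's `jar` tool being on PATH, the paths in the usage comment are placeholders, and the function names are invented for illustration.

```shell
# Succeeds (exit 0) when the jar listing read from stdin has no
# spark-version-info.properties entry at the jar root.
# Carriage returns are stripped in case the listing uses Windows line endings.
jar_lacks_version_info() {
  ! tr -d '\r' | grep -qxF 'spark-version-info.properties'
}

# Add the properties file to the jar root only if it is not there yet.
add_version_info() {
  local jar_file="$1" props_file="$2"
  if jar tf "$jar_file" | jar_lacks_version_info; then
    # -C switches directory first, so the file is stored without a path prefix
    jar uf "$jar_file" -C "$(dirname "$props_file")" "$(basename "$props_file")"
  fi
}

# Usage (placeholder paths):
# add_version_info path/to/spark-core_2.11-2.4.0-SNAPSHOT.jar path/to/spark-version-info.properties
```

Running `jar tf` on the archive afterwards should list spark-version-info.properties as a top-level entry.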
In my case the build/spark-build-info command above failed, so I opened the script and generated spark-version-info.properties by hand.
The key shell statements are:
echo_build_properties() {
  echo version=$1                                   # version number
  echo user=$USER                                   # user name
  echo revision=$(git rev-parse HEAD)               # full commit hash
  echo branch=$(git rev-parse --abbrev-ref HEAD)    # branch name
  echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)          # build time in UTC
  echo url=$(git config --get remote.origin.url | sed 's|https://\(.*\)@\(.*\)|https://\2|')    # remote URL with credentials stripped
}
Running it in Git Bash produces the result shown in the figure below.
The final spark-version-info.properties file is listed in Appendix 3.
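The `sed` expression on the `url` line is the least obvious part of the function: it removes any `user:token@` credentials embedded in an HTTPS remote URL so they do not end up in the properties file. A small sketch of that one step, using a made-up remote URL and a hypothetical wrapper name `strip_credentials`:

```shell
# Same substitution as in echo_build_properties:
# keep only the part after '@' when the URL carries credentials.
strip_credentials() {
  sed 's|https://\(.*\)@\(.*\)|https://\2|'
}

# URL with embedded credentials (example value, not a real remote)
echo 'https://user:token@github.com/example/spark.git' | strip_credentials
# URL without credentials passes through unchanged
echo 'https://github.com/example/spark.git' | strip_credentials
```

Both commands print `https://github.com/example/spark.git`.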
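6.3 Adding the jar to the required classpath
This article uses a successful run of the JavaWordCount program as the check that the source environment works, so the relevant jars are added to the examples module. [5]
6.4 Run configuration for JavaWordCount
6.5 Setting up the input data file
Note: the contents of the input data file cnt.txt are listed in Appendix 4.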
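To know what a correct run should report, the word totals for the sample data can be computed independently in the shell. This is only a cross-check sketch: it writes the seven sample lines to a temporary file (rather than assuming where cnt.txt lives) and counts them with standard tools; `count_words` is a name chosen for this example.

```shell
# Recreate the sample input in a temporary file.
data_file="$(mktemp)"
printf 'a\nb\nc\na\nc\nb\na\n' > "$data_file"

# Count occurrences of each line and print them as "word: count".
count_words() {
  sort "$1" | uniq -c | awk '{print $2 ": " $1}'
}

count_words "$data_file"
```

This prints `a: 3`, `b: 2`, and `c: 2`. JavaWordCount should report the same totals, although Spark may list the words in a different order.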
The Spark source debugging and reading environment is now ready!
Appendix
1). Maven settings.xml with mirrors in China
<?xml version="1.0" encoding="UTF-8"?>
<settings xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.1.0 http://maven.apache.org/xsd/settings-1.1.0.xsd" xmlns="http://maven.apache.org/SETTINGS/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<!-- Local repository location -->
<localRepository>D:\mvn_res</localRepository>
<servers>
<server>
<id>AUTOHOME</id>
<username>admin</username>
<password>admin123</password>
</server>
</servers>
<mirrors>
<mirror>
<id>alimaven</id>
<mirrorOf>central</mirrorOf>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
</mirror>
<mirror>
<id>alimaven-public</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>
<mirror>
<id>central</id>
<name>Maven Repository Switchboard</name>
<url>http://repo1.maven.org/maven2/</url>
<mirrorOf>central</mirrorOf>
</mirror>
<mirror>
<id>repo2</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://repo2.maven.org/maven2/</url>
</mirror>
<mirror>
<id>ibiblio</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://mirrors.ibiblio.org/pub/mirrors/maven2/</url>
</mirror>
<mirror>
<id>jboss-public-repository-group</id>
<mirrorOf>central</mirrorOf>
<name>JBoss Public Repository Group</name>
<url>http://repository.jboss.org/nexus/content/groups/public</url>
</mirror>
<mirror>
<id>google-maven-central</id>
<name>Google Maven Central</name>
<url>https://maven-central.storage.googleapis.com</url>
<mirrorOf>central</mirrorOf>
</mirror>
<!-- Mirror of the central repository in China -->
<mirror>
<id>maven.net.cn</id>
<name>one of the central mirrors in china</name>
<url>http://maven.net.cn/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>
<!-- Additional entry for the repo2.maven.org mirror -->
<mirror>
<id>repo2-alt</id>
<mirrorOf>central</mirrorOf>
<url>http://repo2.maven.org/maven2/</url>
</mirror>
</mirrors>
<profiles>
<profile>
<id>rep</id>
<repositories>
<!-- Add the Cloudera repository -->
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>
<repository>
<!-- Aliyun mirror in China -->
<id>alimaven</id>
<name>Maven Aliyun Mirror</name>
<url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<!-- Aliyun mirror in China -->
<!-- Huawei Cloud MRS development Maven repository -->
<repository>
<id>huaweicloudsdk</id>
<url>https://repo.huaweicloud.com/repository/maven/huaweicloudsdk/</url>
<releases><enabled>true</enabled></releases>
<snapshots><enabled>true</enabled></snapshots>
</repository>
<!-- Huawei Cloud MRS development Maven repository -->
</repositories>
</profile>
</profiles>
<activeProfiles>
<!-- Must match the id of a profile declared in the profiles section -->
<activeProfile>rep</activeProfile>
</activeProfiles>
</settings>
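A settings.xml assembled from several sources tends to pick up repeated mirror ids, and Maven expects each mirror id to be unique. The following sketch lists any id that occurs more than once inside the `<mirrors>` section. It is a line-based text filter, not an XML parser, so it assumes one `<id>` element per line as in the listing above; `duplicate_mirror_ids` is a hypothetical helper name and the sample input is made up.

```shell
# Print mirror ids that appear more than once within <mirrors>...</mirrors>.
# Reads the settings file from stdin.
duplicate_mirror_ids() {
  awk '/<mirrors>/ {m=1} /<\/mirrors>/ {m=0} m' \
    | grep -o '<id>[^<]*</id>' \
    | sed 's/<[^>]*>//g' \
    | sort | uniq -d
}

# Small made-up sample; for a real file use: duplicate_mirror_ids < settings.xml
duplicate_mirror_ids <<'EOF'
<mirrors>
<mirror>
<id>alimaven</id>
</mirror>
<mirror>
<id>alimaven</id>
</mirror>
<mirror>
<id>central</id>
</mirror>
</mirrors>
EOF
```

For the sample this prints `alimaven`. An empty result means no mirror id is repeated.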
2). Full error output for "Could not find spark-version-info.properties"
Connected to the target VM, address: '127.0.0.1:60929', transport: 'socket'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.package$.<init>(package.scala:93)
at org.apache.spark.package$.<clinit>(package.scala)
at org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:183)
at org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:183)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.SparkContext.logInfo(SparkContext.scala:73)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:183)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2526)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
at org.apache.spark.examples.JavaWordCount.main(JavaWordCount.java:44)
Caused by: org.apache.spark.SparkException: Could not find spark-version-info.properties
at org.apache.spark.package$SparkBuildInfo$.<init>(package.scala:62)
at org.apache.spark.package$SparkBuildInfo$.<clinit>(package.scala)
... 13 more
Disconnected from the target VM, address: '127.0.0.1:60929', transport: 'socket'
Process finished with exit code 1
3). Contents of spark-version-info.properties
version=2.4.8
user=root
revision=4be566062defa249435c4d72xxxxxxxxxxxxxx
branch=branch-2.4
date=2022-02-16T09:58:54Z
url=https://github.com/your-github-account/spark.git
4). Contents of the input data file cnt.txt
a
b
c
a
c
b
a
References
[1] https://github.com/apache/spark Official Spark source repository
[2] https://blog.csdn.net/qq_27667379/article/details/80251068 First step in analyzing the Spark source: setting up a source reading environment
[3] https://blog.csdn.net/u011055139/article/details/81611814 Setting up a Spark 2.4.0 source reading environment on Windows 10
[4] https://blog.csdn.net/ggz631047367/article/details/53811213 Spark 2.1 source analysis 1: setting up an IDEA source reading environment on Win10
[5] https://www.cnblogs.com/mracale/p/10493823.html Three ways to add a jar in IntelliJ IDEA