Note: my local network is slow and unstable, so I rented an Alibaba Cloud server to do the compilation, then downloaded the finished build to my local machine.
The final step, configuring the Spark client, was done on my local virtual machine.
1. Build Environment
Follow the build steps from the official Spark documentation.
Open "older versions and other resources" and find the documentation matching your version; here that is 2.4.2.
First, create the following directories:
[root@hadoop001 ~]# ll
total 24
drwxr-xr-x 5 root root 4096 Jul 30 15:47 app
drwxr-xr-x 2 root root 4096 Jul 30 15:42 data
drwxr-xr-x 2 root root 4096 Jul 30 15:42 lib
drwxr-xr-x 7 root root 4096 Jul 30 16:27 maven_repo
drwxr-xr-x 2 root root 4096 Jul 30 15:45 soft
drwxr-xr-x 3 root root 4096 Jul 30 15:47 source
Installation packages:
[root@hadoop001 ~]# cd soft/
[root@hadoop001 soft]# ll
total 213916
-rw-r--r-- 1 root root 9136463 Jun 20 19:08 apache-maven-3.6.1-bin.tar.gz
-rw-r--r-- 1 root root 173271626 Jul 10 19:41 jdk-8u45-linux-x64.gz
-rw-r--r-- 1 root root 20467943 Jun 20 22:19 scala-2.12.8.tgz
-rw-r--r-- 1 root root 16165557 Jul 30 07:32 spark-2.4.2.tgz
Spark download:
https://archive.apache.org/dist/spark/spark-2.4.2/
The following Scala and Maven versions also work; download links:
https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz
2. Extract the Packages and Configure Environment Variables
[root@hadoop001 soft]# yum install git # install git
[root@hadoop001 soft]# tar -zxvf apache-maven-3.6.1-bin.tar.gz -C ../app
[root@hadoop001 soft]# tar -zxvf jdk-8u45-linux-x64.gz -C ../app
[root@hadoop001 soft]# tar -zxvf scala-2.12.8.tgz -C ../app
[root@hadoop001 soft]# tar -zxvf spark-2.4.2.tgz -C ../source/
Configure the environment variables:
[root@hadoop001 soft]# vi /etc/profile
# append the following lines
export JAVA_HOME=/root/app/jdk1.8.0_45
export PATH=$JAVA_HOME/bin:$PATH
export MAVEN_HOME=/root/app/apache-maven-3.6.1
export PATH=$MAVEN_HOME/bin:$PATH
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=2048m" # give Maven more memory to speed up compilation
export SCALA_HOME=/root/app/scala-2.12.8
export PATH=$SCALA_HOME/bin:$PATH
[root@hadoop001 soft]# source /etc/profile
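One detail worth noting about the exports above: each `export PATH=...:$PATH` prepends its directory, so the most recently added entry is searched first. A minimal sketch using the same paths as the profile:

```shell
# Each export prepends to PATH, so later entries are resolved first.
export JAVA_HOME=/root/app/jdk1.8.0_45
export PATH=$JAVA_HOME/bin:$PATH
export MAVEN_HOME=/root/app/apache-maven-3.6.1
export PATH=$MAVEN_HOME/bin:$PATH
# Print the first two PATH entries: Maven's bin, then the JDK's bin.
echo "$PATH" | tr ':' '\n' | head -n 2
```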
3. Configure the Maven Local Repository Path
[root@hadoop001 conf]# pwd
/root/app/apache-maven-3.6.1/conf
[root@hadoop001 conf]# vi settings.xml
<!-- localRepository
| The path to the local repository maven will use to store artifacts.
|
| Default: ${user.home}/.m2/repository
<localRepository>/path/to/local/repo</localRepository>
-->
# add the following at this location
<localRepository>/root/maven_repo</localRepository>
4. Configure pom.xml in the Spark Source Tree
Configuration used on the Alibaba Cloud host:
[root@hadoop001 spark-2.4.2]# pwd
/root/source/spark-2.4.2
[root@hadoop001 spark-2.4.2]# vi pom.xml
<repositories>
# comment out this repository block; note that the explanatory sentence below was moved inside the new comment markers
<!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution
<repository>
<id>central</id>
<name>Maven Repository</name>
<url>https://repo.maven.apache.org/maven2</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
-->
# add this configuration
<repository>
<id>maven-ali</id>
<url>http://maven.aliyun.com/nexus/content/groups/public//</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
<updatePolicy>always</updatePolicy>
<checksumPolicy>fail</checksumPolicy>
</snapshots>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
If building on a local virtual machine instead, adding only the Cloudera repository is enough:
<repository>
<id>cloudera</id>
<name>cloudera repository</name>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
5. Configure make-distribution.sh in the Spark Source Tree
[root@hadoop001 dev]# pwd
/root/source/spark-2.4.2/dev
[root@hadoop001 dev]# vi make-distribution.sh
# comment out the following block
#VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | tail -n 1)
#SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | tail -n 1)
#SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | tail -n 1)
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | fgrep --count "<id>hive</id>";\
# # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
# # because we use "set -o pipefail"
# echo -n)
# add the following instead
VERSION=2.4.2 # Spark version
SCALA_VERSION=2.11 # Scala binary version (Spark 2.4.2 builds against 2.11 unless -Pscala-2.12 is passed; the shell output below reports 2.11.12)
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0 # Hadoop version
SPARK_HIVE=1 # enable Hive support
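Hardcoding these variables skips the slow `mvn help:evaluate` invocations the script would otherwise run. They also feed into the name of the output tarball; a sketch of the script's naming convention (`spark-$VERSION-bin-$NAME.tgz`, where `NAME` falls back to the Hadoop version when `--name` is not passed):

```shell
# How make-distribution.sh derives the tarball name from these values.
VERSION=2.4.2
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0
NAME=$SPARK_HADOOP_VERSION   # default when --name is not given
echo "spark-$VERSION-bin-$NAME.tgz"
```

This is exactly the file name you should see in the verification step later.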
6. Compile Spark
[root@hadoop001 spark-2.4.2]# pwd
/root/source/spark-2.4.2
[root@hadoop001 spark-2.4.2]# ./dev/make-distribution.sh \
--name 2.6.0-cdh5.7.0 \
--tgz \
-Dhadoop.version=2.6.0-cdh5.7.0 \
-Phadoop-2.6 \
-Phive \
-Phive-thriftserver \
-Pyarn
Then the long wait begins...
7. Verify the Build
When compilation finishes, a spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz package is produced in the source root, which indicates success:
[root@hadoop001 spark-2.4.2]# ll
...
-rw-r--r-- 1 root root 214594144 Jul 30 17:00 spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz
...
Also take a look at the maven_repo local repository, which now contains a large number of downloaded dependencies:
[root@hadoop001 maven_repo]# pwd
/root/maven_repo
[root@hadoop001 maven_repo]# ll
total 184
drwxr-xr-x 3 root root 4096 Jul 30 16:37 antlr
drwxr-xr-x 3 root root 4096 Jul 30 16:40 aopalliance
drwxr-xr-x 4 root root 4096 Jul 30 16:51 asm
drwxr-xr-x 3 root root 4096 Jul 30 16:37 avalon-framework
drwxr-xr-x 3 root root 4096 Jul 30 16:37 backport-util-concurrent
drwxr-xr-x 3 root root 4096 Jul 30 16:47 cglib
drwxr-xr-x 3 root root 4096 Jul 30 16:28 classworlds
drwxr-xr-x 26 root root 4096 Jul 30 16:53 com
drwxr-xr-x 4 root root 4096 Jul 30 16:40 commons-beanutils
drwxr-xr-x 3 root root 4096 Jul 30 16:37 commons-chain
drwxr-xr-x 3 root root 4096 Jul 30 16:30 commons-cli
drwxr-xr-x 3 root root 4096 Jul 30 16:28 commons-codec
drwxr-xr-x 3 root root 4096 Jul 30 16:36 commons-collections
drwxr-xr-x 3 root root 4096 Jul 30 16:40 commons-configuration
drwxr-xr-x 3 root root 4096 Jul 30 16:48 commons-dbcp
drwxr-xr-x 3 root root 4096 Jul 30 16:36 commons-digester
drwxr-xr-x 3 root root 4096 Jul 30 16:40 commons-el
drwxr-xr-x 3 root root 4096 Jul 30 16:40 commons-httpclient
drwxr-xr-x 3 root root 4096 Jul 30 16:27 commons-io
drwxr-xr-x 3 root root 4096 Jul 30 16:36 commons-lang
drwxr-xr-x 4 root root 4096 Jul 30 16:37 commons-logging
drwxr-xr-x 3 root root 4096 Jul 30 16:40 commons-net
drwxr-xr-x 3 root root 4096 Jul 30 16:47 commons-pool
drwxr-xr-x 3 root root 4096 Jul 30 16:36 commons-validator
drwxr-xr-x 3 root root 4096 Jul 30 16:37 dom4j
drwxr-xr-x 5 root root 4096 Jul 30 16:53 io
drwxr-xr-x 3 root root 4096 Jul 30 16:53 it
drwxr-xr-x 11 root root 4096 Jul 30 16:48 javax
drwxr-xr-x 3 root root 4096 Jul 30 16:47 javolution
drwxr-xr-x 3 root root 4096 Jul 30 16:58 jline
drwxr-xr-x 3 root root 4096 Jul 30 16:53 joda-time
drwxr-xr-x 3 root root 4096 Jul 30 16:27 junit
drwxr-xr-x 4 root root 4096 Jul 30 16:47 log4j
drwxr-xr-x 3 root root 4096 Jul 30 16:37 logkit
drwxr-xr-x 3 root root 4096 Jul 30 16:53 mysql
drwxr-xr-x 8 root root 4096 Jul 30 16:56 net
drwxr-xr-x 46 root root 4096 Jul 30 16:56 org
drwxr-xr-x 3 root root 4096 Jul 30 16:36 oro
drwxr-xr-x 3 root root 4096 Jul 30 16:37 sslext
drwxr-xr-x 3 root root 4096 Jul 30 16:47 stax
drwxr-xr-x 5 root root 4096 Jul 30 16:40 tomcat
drwxr-xr-x 4 root root 4096 Jul 30 16:47 xalan
drwxr-xr-x 3 root root 4096 Jul 30 16:36 xerces
drwxr-xr-x 3 root root 4096 Jul 30 16:36 xml-apis
drwxr-xr-x 3 root root 4096 Jul 30 16:40 xmlenc
drwxr-xr-x 3 root root 4096 Jul 30 16:38 xmlunit
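The directory names above follow Maven's standard local-repository layout: the groupId's dots become directory separators, followed by the artifactId and version. A sketch with an illustrative coordinate (the artifact name here is only an example):

```shell
# Maven stores an artifact at groupId-with-slashes/artifactId/version/.
group=org.apache.spark
artifact=spark-core_2.11
version=2.4.2
path="$(echo "$group" | tr '.' '/')/$artifact/$version/$artifact-$version.jar"
echo "$path"
```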
8. Configure the Spark Client
[root@vm01 ~]# su - hadoop
[hadoop@vm01 software]$ tar -zxvf spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz -C ~/app/
[hadoop@vm01 software]$ vim ~/.bash_profile
export SCALA_HOME=/home/hadoop/app/scala-2.12.8
export PATH=$SCALA_HOME/bin:$PATH
export JAVA_HOME=/usr/java/jdk1.8.0_45
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export MAVEN_HOME=/home/hadoop/app/apache-maven-3.6.1
export PATH=$MAVEN_HOME/bin:$PATH
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
export SPARK_HOME=/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
[hadoop@vm01 ~]$ source ~/.bash_profile
[hadoop@vm01 bin]$ pwd
/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/bin
[hadoop@vm01 bin]$ rm -fr *.cmd
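The `*.cmd` files are Windows launch scripts and are dead weight on Linux, so removing them is harmless. A quick demonstration in a throwaway directory (file names are placeholders) showing the glob leaves the Unix launchers untouched:

```shell
# Scratch-directory demo: rm -f *.cmd removes only the Windows launchers.
demo=$(mktemp -d)
cd "$demo"
touch spark-shell spark-shell.cmd spark-submit spark-submit.cmd
rm -f *.cmd
ls -1   # spark-shell and spark-submit remain
```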
[hadoop@vm01 bin]$ ./spark-shell
19/07/30 22:12:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://vm01:4040
Spark context available as 'sc' (master = local[*], app id = local-1564549938818).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.2
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
scala> :quit