Environment
- CentOS 7.8
- Maven 3.6.3
- Spark 3.0.1
- CDH 6.3.1
1. Install Maven
1.1 Download
wget http://mirrors.tuna.tsinghua.edu.cn/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
1.2 Extract
tar -zxvf apache-maven-3.6.3-bin.tar.gz -C /home/ifeng/app
cd /home/ifeng/app
ln -s apache-maven-3.6.3/ maven
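The symlink is what keeps the MAVEN_HOME path configured in /etc/profile version-independent: a later upgrade is just repointing the link. A quick sketch on throwaway directories (the paths here are illustrative only):

```shell
# Two fake Maven installs; switching versions is a single repoint,
# nothing else (PATH, MAVEN_HOME) has to change.
mkdir -p /tmp/app/apache-maven-3.6.3 /tmp/app/apache-maven-3.8.0
ln -sfn /tmp/app/apache-maven-3.6.3 /tmp/app/maven
ln -sfn /tmp/app/apache-maven-3.8.0 /tmp/app/maven   # "upgrade" = repoint
readlink /tmp/app/maven
```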
1.3 Permissions
The blunt approach:
[root@ifeng app]# chmod 777 apache-maven-3.6.3/ -R
[root@ifeng app]# chmod 777 maven -R
1.4 Configure environment variables
vi /etc/profile
MAVEN_HOME=/home/ifeng/app/maven
export PATH=${MAVEN_HOME}/bin:${PATH}
source /etc/profile
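A quick sanity check that the profile changes actually took effect, run in the current shell with the same paths as above:

```shell
# Re-apply the two profile lines and confirm PATH picked them up.
MAVEN_HOME=/home/ifeng/app/maven
export PATH=${MAVEN_HOME}/bin:${PATH}
case ":$PATH:" in
  *":${MAVEN_HOME}/bin:"*) echo "PATH ok" ;;
  *)                       echo "PATH missing Maven bin" ;;
esac
# Prints the Maven version once the binaries are in place:
command -v mvn >/dev/null 2>&1 && mvn -version | head -n 1
```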
2. Build Spark 2.4.6 (Maven)
2.1 Download Spark
http://archive.apache.org/dist/spark/
(http://archive.apache.org/dist/spark/spark-2.4.6/)
2.2 Extract
tar -zxvf spark-2.4.6.tgz
2.3 Build (mvn)
Before building, raise Maven's memory limits (set this in /etc/profile or the current shell; the defaults are too small for a Spark build):
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
2.3.1 build/mvn
Spark now comes packaged with a self-contained Maven installation to ease building and deployment of Spark from source located under the build/ directory. This script will automatically download and setup all necessary build requirements (Maven, Scala, and Zinc) locally within the build/ directory itself. It honors any mvn binary if present already, however, will pull down its own copy of Scala and Zinc regardless to ensure proper version requirements are met. build/mvn execution acts as a pass through to the mvn call allowing easy transition from previous build methods. As an example, one can build a version of Spark as follows:
./build/mvn -DskipTests clean package
This method does not produce an installable distribution package.
With no domestic mirror configured, the build finally finished after 5 h 16 min. Hats off to the learning environment inside China.
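One way to cut a multi-hour build down is to point Maven at a mirror inside China before starting. The Aliyun repository URL below is a commonly used public mirror (verify it is still current). Note that `mirrorOf` is deliberately limited to `central`, so the Cloudera repository added later is not swallowed by the mirror:

```shell
# Normally this file lives at ~/.m2/settings.xml; written to a scratch
# directory here so it can be inspected before installing it for real.
mkdir -p /tmp/m2-demo
cat > /tmp/m2-demo/settings.xml <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>aliyun</id>
      <!-- mirror central only, so the Cloudera repo still resolves -->
      <mirrorOf>central</mirrorOf>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>
  </mirrors>
</settings>
EOF
grep -c '<mirror>' /tmp/m2-demo/settings.xml
```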
2.3.2 Building a Runnable Distribution
To create a Spark distribution like those on the Spark downloads page, laid out so that it is runnable, use ./dev/make-distribution.sh in the project root directory. Like a direct Maven build, it can be configured with Maven profile settings and so on. Example:
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
# Replace custom-spark with your Hadoop version
# --tgz produces a .tgz package after the build
# -Psparkr is not needed here
# -Pmesos is not needed
# -Pkubernetes is not needed unless you run on K8s
./dev/make-distribution.sh \
--name 2.6.0-cdh5.16.2 \
--pip --r --tgz -Phive -Phive-thriftserver -Pyarn
Reading make-distribution.sh
dev/make-distribution.sh
Around line 128: detecting the versions through Maven is slow, so they can be hard-coded instead:
VERSION=2.4.6 # Spark version
SCALA_VERSION=2.12
SPARK_HADOOP_VERSION=2.6.0-cdh5.16.2
SPARK_HIVE=1 # build with Hive support (1 = yes)
#VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | tail -n 1)
#SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | tail -n 1)
#SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | tail -n 1)
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | fgrep --count "<id>hive</id>";\
# # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
# # because we use "set -o pipefail"
# echo -n)
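The commented-out lines above are what the script normally does: run `mvn help:evaluate`, filter out Maven's log lines, and keep the last line, which is the bare evaluated value. A self-contained sketch of that filtering, with the Maven output simulated:

```shell
# Simulated `mvn help:evaluate -Dexpression=project.version` output;
# the real command interleaves INFO/WARNING log lines with the value.
mvn_output='[INFO] Scanning for projects...
[WARNING] some warning
2.4.6'
VERSION=$(printf '%s\n' "$mvn_output" | grep -v "INFO" | grep -v "WARNING" | tail -n 1)
echo "$VERSION"
```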
Specifying the Hadoop Version
In the pom:
# Spark 3.0.1 defaults to Hadoop 2.7.4; this can be changed
<hadoop.version>2.7.4</hadoop.version>
<protobuf.version>2.5.0</protobuf.version>
<yarn.version>${hadoop.version}</yarn.version>
./build/mvn -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package
# -Dhadoop.version specifies the Hadoop version, overriding the value in the pom
./dev/make-distribution.sh \
--name 2.6.0-cdh5.16.2 \
--tgz \
-Pyarn \
-Phive \
-Phive-thriftserver \
-Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.16.2 \
-Pscala-2.12 -Dscala.version=2.12.10
# -Phadoop-x.y selects the matching profile in the pom:
<profile>
<id>hadoop-2.7</id>
<!-- Default hadoop profile. Uses global properties. -->
</profile>
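To see which `-Phadoop-*` profiles a given Spark source tree actually defines, grep its pom. Sketched below against a trimmed stand-in for the real pom.xml (the fragment and path are illustrative):

```shell
# Trimmed stand-in for Spark's pom.xml:
cat > /tmp/pom-demo.xml <<'EOF'
<profiles>
  <profile><id>hadoop-2.6</id></profile>
  <profile><id>hadoop-2.7</id></profile>
</profiles>
EOF
# On a real checkout, run the same grep against pom.xml in the root:
grep -o '<id>hadoop-[^<]*</id>' /tmp/pom-demo.xml | sed 's/<\/\?id>//g'
```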
Spark's pom has no CDH repository, so add it to pom.xml manually:
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
Change Scala Version
When another supported Scala version such as 2.12 is wanted, Spark can be built for it. Change the major Scala version with:
./dev/change-scala-version.sh 2.12
For Maven, also enable the matching profile:
./build/mvn -Pscala-2.12 compile
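change-scala-version.sh is essentially a scripted rewrite of the `_2.x` artifact suffixes (and the `scala.binary.version` property) across all the poms. A rough one-line stand-in, run on a fake pom fragment rather than a real checkout:

```shell
# Fake pom fragment with a Scala-suffixed artifactId:
printf '<artifactId>spark-core_2.11</artifactId>\n' > /tmp/pom-scala-demo.xml
# Roughly what the script does to every pom in the tree:
sed -i 's/_2\.11/_2.12/g' /tmp/pom-scala-demo.xml
cat /tmp/pom-scala-demo.xml
```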
./dev/make-distribution.sh \
--name 2.6.0-cdh5.16.2 \
--pip --r --tgz \
-Phive -Phive-thriftserver \
-Phadoop-2.6 \
-Dhadoop.version=2.6.0-cdh5.16.2 \
-Pyarn
Another long download begins.
Good night.
The build is done. Result:
spark-2.4.6-bin-2.6.0-cdh5.16.2.tgz
3. Build Spark 1.5 (Maven)
http://archive.apache.org/dist/spark/spark-1.5.2/
./dev/change-scala-version.sh 2.10
Spark's pom has no CDH repository, so add it to pom.xml manually:
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
./build/mvn -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package
# -Dhadoop.version specifies the Hadoop version, overriding the value in the pom
./make-distribution.sh \
--name 2.6.0-cdh5.16.2 \
--tgz \
-Pyarn \
-Phive \
-Phive-thriftserver \
-Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.16.2 \
-Pscala-2.10 -Dscala.version=2.10.7
Around line 128: detecting the versions through Maven is slow, so hard-code them:
VERSION=1.5.2 # Spark version
SCALA_VERSION=2.10
SPARK_HADOOP_VERSION=2.6.0-cdh5.16.2
SPARK_HIVE=1 # build with Hive support (1 = yes)
4. Build Spark 3.0.1 (Maven)
The default Scala version is already 2.12.
1 Edit make-distribution.sh
VERSION=3.0.1 # Spark version
SCALA_VERSION=2.12
SPARK_HADOOP_VERSION=2.6.0-cdh5.16.2
SPARK_HIVE=1 # build with Hive support (1 = yes)
2 Run make-distribution
# -Dhadoop.version specifies the Hadoop version, overriding the value in the pom
./dev/make-distribution.sh \
--name 3.0.1-cdh5.16.2 \
--tgz \
-Pyarn \
-Phive \
-Phive-thriftserver \
-Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.16.2
Another pitfall in the official source?
The build fails against CDH's Hadoop 2.6 because LogAggregationContext there lacks the rolled-log setter methods. Edit /home/ifeng/sourcecode/spark-3.0.1/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (around line 298) to call them via reflection:
sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
  try {
    val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
    // The direct setter calls below do not compile against Hadoop 2.6,
    // so they are replaced with reflective lookups:
    // logAggregationContext.setRolledLogsIncludePattern(includePattern)
    // sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
    //   logAggregationContext.setRolledLogsExcludePattern(excludePattern)
    // }
    val setRolledLogsIncludePatternMethod =
      logAggregationContext.getClass.getMethod("setRolledLogsIncludePattern", classOf[String])
    setRolledLogsIncludePatternMethod.invoke(logAggregationContext, includePattern)
    sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
      val setRolledLogsExcludePatternMethod =
        logAggregationContext.getClass.getMethod("setRolledLogsExcludePattern", classOf[String])
      setRolledLogsExcludePatternMethod.invoke(logAggregationContext, excludePattern)
    }
    // ... (rest of the original try/catch block unchanged)
5. CDH 6.3.1
Component versions in CDH 6.3.1:
CDH 6.3.1 Packaging
Component            Version
Apache Avro          1.8.2
Apache Flume         1.9.0
Apache Hadoop        3.0.0
Apache HBase         2.1.4
HBase Indexer        1.5
Apache Hive          2.1.1
Hue                  4.3.0
Apache Impala        3.2.0
Apache Kafka         2.2.1
Kite SDK             1.0.0
Apache Kudu          1.10.0
Apache Solr          7.4.0
Apache Oozie         5.1.0
Apache Parquet       1.9.0
Parquet-format       2.4.0
Apache Pig           0.17.0
Apache Sentry        2.1.0
Apache Spark         2.4.0
Apache Sqoop         1.4.7
Apache ZooKeeper     3.4.5
1 Set the version numbers
VERSION=3.0.1 # Spark version
SCALA_VERSION=2.12
SPARK_HADOOP_VERSION=3.0.0-cdh6.3.1
SPARK_HIVE=1 # build with Hive support (1 = yes)
2 Build
./dev/make-distribution.sh \
--name 3.0.1-cdh6.3.1 \
--tgz \
-Pyarn \
-Phive \
-Phive-thriftserver \
-Phadoop-3.0 -Dhadoop.version=3.0.0-cdh6.3.1
Connecting Spark 3.x to Hive
vim spark-env.sh
export SPARK_DIST_CLASSPATH=$(/home/ifeng/app/hadoop/bin/hadoop classpath)
export JAVA_HOME=/usr/java/jdk1.8.0_181
export CLASSPATH=$CLASSPATH:/home/ifeng/app/hive/lib
export SCALA_HOME=/usr/scala
export HADOOP_CONF_DIR=/home/ifeng/app/hadoop/etc/hadoop
export HIVE_CONF_DIR=/home/ifeng/app/hive/conf
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/ifeng/app/hive/lib/mysql-connector-java-5.1.47-bin.jar
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveCtx = new HiveContext(sc)
scala> val studentRDD = hiveCtx.sql("select * from sparktest.student").rdd
The JDBC driver cannot be found…
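A likely cause of the "driver not found" error: SPARK_CLASSPATH is deprecated in Spark 2.x and later, so the MySQL connector may never reach the driver's classpath. Passing the jar explicitly to spark-shell is more reliable. The block below only assembles the command (the jar path matches the spark-env.sh above):

```shell
JDBC_JAR=/home/ifeng/app/hive/lib/mysql-connector-java-5.1.47-bin.jar
# --jars ships the jar to the executors; --driver-class-path puts it on
# the driver's classpath, which is what JDBC DriverManager needs.
echo "spark-shell --jars $JDBC_JAR --driver-class-path $JDBC_JAR"
```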