Installing R and its dependent libraries
- Install the R language environment
- Install the R libraries:
R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
R -e "devtools::install_version('testthat', version = '1.0.2', repos='http://cran.us.r-project.org')"
R -e "install.packages(c('roxygen2'), repos='http://cran.us.r-project.org')"
- Install MiKTeX: download the dmg from https://miktex.org/download and install it
- Run ./R/run-tests.sh to test the R environment
- Run the following to check the SparkR package:
cd R && /Library/Frameworks/R.framework/Resources/bin/R CMD check --as-cran --no-tests SparkR_2.3.0.tar.gz
Build error:
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
checking PDF version of manual without hyperrefs or index ... ERROR
This error is caused by the missing MiKTeX (LaTeX) installation; install it, or skip the PDF manual by adding --no-manual to R CMD check.
Building the Spark package
git clone https://github.com/Intel-bigdata/spark-adaptive.git
./dev/make-distribution.sh --name spark-ae-2.3 --pip --r --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DskipTests
Production deployment
sudo tar --same-owner -zxvf spark-2.3.0-bin-spark-ae-2.3.tgz
sudo chown -R hadoop:hadoop spark-2.3.0-bin-spark-ae-2.3
sudo scp -pr hadoop@x.x.x.x:/usr/lib/spark-2.3.0-bin-spark-ae-2.3 /usr/lib/spark-2.3.0-bin-spark-ae-2.3
sudo mkdir -p /etc/spark/conf && sudo chown -R hadoop:hadoop /etc/spark && sudo chmod -R 755 /etc/spark
sudo rm -rf /usr/lib/spark-2.3.0-bin-spark-ae-2.3/conf && sudo ln -s /etc/spark/conf /usr/lib/spark-2.3.0-bin-spark-ae-2.3/conf
sudo mv spark-2.3.0-bin-spark-ae-2.3/conf/* /etc/spark/conf/
sudo ln -fTs /usr/lib/spark-2.3.0-bin-spark-ae-2.3 /usr/lib/spark
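The layout the commands above produce (shipped conf moved to /etc/spark/conf, symlinked back into the install dir, plus a version-independent /usr/lib/spark link) can be sketched end-to-end. This is a simulation inside a temp directory, not the real /usr/lib and /etc paths; `ln -fns` is used instead of GNU-only `-fTs` so it also works on macOS:

```shell
# Sketch: reproduce the conf-symlink layout in a temp dir (paths are stand-ins).
set -e
root="$(mktemp -d)"
mkdir -p "$root/usr/lib/spark-2.3.0-bin-spark-ae-2.3/conf" "$root/etc/spark"
echo "spark.master yarn" > "$root/usr/lib/spark-2.3.0-bin-spark-ae-2.3/conf/spark-defaults.conf"

# Move the shipped conf into /etc/spark/conf and symlink it back,
# then point a version-independent /usr/lib/spark at the install dir.
mv "$root/usr/lib/spark-2.3.0-bin-spark-ae-2.3/conf" "$root/etc/spark/conf"
ln -s "$root/etc/spark/conf" "$root/usr/lib/spark-2.3.0-bin-spark-ae-2.3/conf"
ln -fns "$root/usr/lib/spark-2.3.0-bin-spark-ae-2.3" "$root/usr/lib/spark"

# The config is now reachable through the version-independent path.
cat "$root/usr/lib/spark/conf/spark-defaults.conf"
```

Upgrades then only need to repoint the /usr/lib/spark symlink; the config in /etc/spark/conf survives untouched.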
Building Apache Spark with SBT
Building Spark 3.0
git clone git@github.com:apache/spark.git
sbt package
Spark 3.0 builds with Scala 2.12 by default; building with Scala 2.11 fails, because 2.11 does not auto-convert a Scala lambda to a Java Runnable (SAM conversion is a 2.12 feature):
[error] /Users/wankun/ws/apache/spark/core/src/main/scala/org/apache/spark/util/logging/DriverLogger.scala:178: type mismatch;
[error] found : () => Unit
[error] required: Runnable
[error] threadpool.execute(() => DfsAsyncWriter.this.close())
[error] ^
Building Spark 2.4
sbt -Dscala.version=2.11.12 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver package
Building CDH Spark spark2-2.4.0-cloudera1
git clone https://github.com/cloudera/spark.git
# Create a branch from the corresponding tag
git checkout -b spark2-2.4.0-cloudera1 spark2-2.4.0-cloudera1
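The tag-to-branch checkout above can be tried in a throwaway repo. Everything here is a stand-in; note the demo adds a `-build` suffix to the branch name, since reusing the tag name verbatim (as above) works but makes `spark2-2.4.0-cloudera1` ambiguous between the tag and the branch:

```shell
# Scratch repo demonstrating "create a local branch from a tag".
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git -c user.email=a@b -c user.name=demo commit -q --allow-empty -m "release"
git tag spark2-2.4.0-cloudera1

# Same form as above: git checkout -b <branch> <tag>
git checkout -q -b spark2-2.4.0-cloudera1-build spark2-2.4.0-cloudera1
git rev-parse --abbrev-ref HEAD
```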
- The sbt build file in this release has a bug: in project/SparkBuild.scala the right-hand Seq lists seven project names (including "hive-exec") while the pattern on the left binds only six, so loading the build fails with a MatchError. Fix:
project/SparkBuild.scala
val sqlProjects@Seq(catalyst, sql, hive, hiveThriftServer, sqlKafka010, avro) = Seq(
- "catalyst", "sql", "hive", "hive-thriftserver", "sql-kafka-0-10", "avro", "hive-exec"
+ "catalyst", "sql", "hive", "hive-thriftserver", "sql-kafka-0-10", "avro"
).map(ProjectRef(buildLocation, _))
- Add custom repositories: create build/sbt-config/repositories and point sbt at it via SBT_OPTS in build/sbt (see the build/sbt diff below)
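The build/sbt-config/repositories file itself is not shown in these notes. An sbt proxy-repositories file generally looks like the following; the mirror names and URLs here are placeholders, substitute your internal mirrors:

```text
[repositories]
  local
  my-maven-proxy: https://repo.example.com/maven/
  my-ivy-proxy: https://repo.example.com/ivy/, [organization]/[module]/[revision]/[type]s/[artifact](-[classifier]).[ext]
  maven-central
```

With -Dsbt.override.build.repos=true set, sbt resolves all dependencies through only the resolvers listed in this file, ignoring those declared in the build.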
- Update the corresponding component versions in pom.xml:
--- a/pom.xml
+++ b/pom.xml
@@ -129,15 +129,15 @@
<hbase.version>${cdh.hbase.version}</hbase.version>
<hbase.artifact>hbase-server</hbase.artifact>
<flume.version>${cdh.flume-ng.version}</flume.version>
- <zookeeper.version>${cdh.zookeeper.version}</zookeeper.version>
- <curator.version>${cdh.curator.version}</curator.version>
+ <zookeeper.version>3.4.5-cdh5.13.3</zookeeper.version>
+ <curator.version>2.7.1</curator.version>
<hive.group>org.apache.hive</hive.group>
<!-- Version used in Maven Hive dependency -->
- <hive.version>${cdh.hive.version}</hive.version>
+ <hive.version>1.1.0-cdh5.13.3</hive.version>
<!-- Version used for internal directory structure -->
<hive.version.short>1.1.0</hive.version.short>
<derby.version>10.12.1.1</derby.version>
- <parquet.version>${cdh.parquet.version}</parquet.version>
+ <parquet.version>1.5.0-cdh5.13.3</parquet.version>
<orc.version>1.5.5</orc.version>
<orc.classifier>nohive</orc.classifier>
<hive.parquet.version>${parquet.version}</hive.parquet.version>
@@ -147,7 +147,7 @@
<ivy.version>2.4.0</ivy.version>
<oro.version>2.0.8</oro.version>
<codahale.metrics.version>3.1.5</codahale.metrics.version>
- <avro.version>${cdh.avro.version}</avro.version>
+ <avro.version>1.7.6-cdh5.13.3</avro.version>
<avro.mapred.classifier>hadoop2</avro.mapred.classifier>
<aws.kinesis.client.version>1.8.10</aws.kinesis.client.version>
<!-- Should be consistent with Kinesis client dependency -->
--- a/build/sbt
+++ b/build/sbt
@@ -47,6 +47,9 @@ realpath () {
)
}
+SBT_REPOSITORIES_CONFIG="$(dirname "$(realpath "$0")")/sbt-config/repositories"
+export SBT_OPTS="-Dsbt.override.build.repos=true -Dsbt.repository.config=$SBT_REPOSITORIES_CONFIG"
+
. "$(dirname "$(realpath "$0")")"/sbt-launch-lib.bash
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -291,11 +291,11 @@ if [ -d "$SPARK_HOME/R/lib/SparkR" ]; then
fi
# CDH: remove scripts for which the actual code is not included.
-rm "$DISTDIR/bin/spark-sql"
-rm "$DISTDIR/bin/beeline"
-rm "$DISTDIR/bin/sparkR"
-rm "$DISTDIR/sbin/start-thriftserver.sh"
-rm "$DISTDIR/sbin/stop-thriftserver.sh"
+# rm "$DISTDIR/bin/spark-sql"
+# rm "$DISTDIR/bin/beeline"
+# rm "$DISTDIR/bin/sparkR"
+# rm "$DISTDIR/sbin/start-thriftserver.sh"
+# rm "$DISTDIR/sbin/stop-thriftserver.sh"
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -43,3 +43,7 @@ addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.11")
// the plugin; this is tracked at SPARK-14401.
addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark")
+
+logLevel := Level.Debug
+
+addSbtPlugin("io.get-coursier" % "sbt-coursier" % "1.0.3")
List of modified files:
modified: build/sbt
new file: build/sbt-config/repositories
modified: pom.xml
modified: project/MimaBuild.scala
modified: project/MimaExcludes.scala
modified: project/SparkBuild.scala
modified: project/plugins.sbt
new file: project/project/build.properties
new file: project/project/plugins.sbt
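The new project/project/build.properties pins the sbt version for the meta-build. Its content is typically a single line; 0.13.17 is an assumption based on the sbt version the Spark 2.4 branch ships with, so check project/build.properties in your own checkout and use the same value:

```text
sbt.version=0.13.17
```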
sbt -Dscala.version=2.11.12 -Pyarn -Phive -Phive-thriftserver package
- You can pass -d to sbt to debug the build
- If you hit strange errors, try sbt clean and then rebuild
Building a CDH distribution
dev/make-distribution.sh: apply the same change as in the spark2-2.4.0-cloudera1 section above, commenting out the rm lines for spark-sql, beeline, sparkR, and the thriftserver scripts.
pom.xml
<repositories>
+ <repository>
+ <id>spring</id>
+ <name>Spring repo</name>
+ <url>https://repo.spring.io/plugins-release/</url>
+ <releases>
+ <enabled>true</enabled>
+ </releases>
+ </repository>
<repository>
./dev/make-distribution.sh --name 2.6.0-cdh5.13.3 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.13.3 -Dhive.version=1.1.0-cdh5.13.3 -Dzookeeper.version=3.4.5-cdh5.13.3 -DskipTests -T 4C
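After make-distribution.sh finishes, it is worth verifying that the CDH-versioned jars actually made it into the tarball. A minimal, self-contained sketch of that check (it builds a stand-in tarball with one fake jar; against a real build, run only the last line on the real spark-*-bin-2.6.0-cdh5.13.3.tgz):

```shell
set -e
work="$(mktemp -d)"
# Stand-in for the real distribution tarball: one CDH-versioned jar inside.
mkdir -p "$work/dist/jars"
touch "$work/dist/jars/hadoop-common-2.6.0-cdh5.13.3.jar"
tar -czf "$work/dist.tgz" -C "$work" dist

# The actual check: list the tarball and grep for the expected CDH version.
tar -tzf "$work/dist.tgz" | grep 'cdh5.13.3'
```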