Spark: Learning, Downloading, Compiling, Installing, and Running

xl132598798

Category: BIG_DATD. Tags: spark, big data, hadoop

Article link: https://blog.csdn.net/xl132598798/article/details/105669254

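Learning resources:
The official website: it is entirely in English, so finding what you need takes patience
Databricks: follow its Blog
The source code: refer to the configuration files it contains

Apache Spark™ is a unified analytics engine for large-scale data processing.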

Download

Download Spark from the Download page of the official website.

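Compile

Build options
- Build with Maven
- Build with SBT
- Build a distribution package with make-distribution.sh

This article builds with Maven.

Mind the version requirements

Building Spark requires specific versions of Maven and Java. Download and extract the matching versions of both.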

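(a) Configure Java

#Set JAVA_HOME
$ sudo vi /etc/profile
export JAVA_HOME=/XXX/XXX/jdkX.X.X_XX
export PATH=$PATH:$JAVA_HOME/bin

After saving and exiting, run source /etc/profile to apply the changes.

#If the system still does not pick up this version, remove the preinstalled JDK
rpm -qa | grep jdk
rpm -e --nodeps <jdk package name>
#Locate the stale java launcher, then delete it (for example /usr/bin/java)
which java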
(b) Configure Maven

$ sudo vi /etc/profile
export MAVEN_HOME=
export PATH=$PATH:$MAVEN_HOME/bin
#Configure Maven to use more memory than usual by setting MAVEN_OPTS
#(Java 8 and later have no permanent generation, so -XX:MaxPermSize can be dropped there)
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024M"

PS: ReservedCodeCacheSize is optional, but leaving these settings out can lead to errors such as:

[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.12/classes...
[ERROR] Java heap space -> [Help 1]

After saving and exiting, run source /etc/profile to apply the changes.
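
Before starting a long build, it can help to confirm that both variables point at real installations. The following is a minimal sketch and not part of Spark: check_home is a hypothetical helper name, and it assumes the usual layout in which the launcher sits at bin/java or bin/mvn under the home directory.

```shell
#!/bin/sh
# check_home is a hypothetical helper (not shipped with Spark, Maven, or the JDK).
# $1 = variable name used in messages
# $2 = directory to check
# $3 = launcher expected under "$2/bin"
check_home() {
  if [ -n "$2" ] && [ -x "$2/bin/$3" ]; then
    echo "$1 looks valid: $2"
  else
    echo "$1 is unset or has no executable bin/$3: '$2'" >&2
    return 1
  fi
}

# Report on both variables; "|| true" keeps the script going after a failed check.
check_home JAVA_HOME "${JAVA_HOME:-}" java || true
check_home MAVEN_HOME "${MAVEN_HOME:-}" mvn || true
```

Run it in a new shell after source /etc/profile; a failed check usually means the path in /etc/profile is mistyped or the archive was extracted somewhere else.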

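Add the following public DNS servers to /etc/resolv.conf:

nameserver 8.8.8.8
nameserver 8.8.4.4

Add the following to spark/dev/make-distribution.sh to speed up the build. Hardcoding these values lets the script skip the Maven invocations it would otherwise run to detect them.

#Spark version
VERSION=
#Scala version
SCALA_VERSION=
#Hadoop version
SPARK_HADOOP_VERSION=
#Enable Hive support in Spark
SPARK_HIVE=1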

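The build machine needs Internet access during compilation so that dependencies can be downloaded.

$ ./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-X.X -Phive -Phive-thriftserver -Pyarn

#After the build finishes, extract the archive

tar -zxf spark-X.X.X-bin-custom-spark.tgz -C /opt/modules/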

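Installing Scala

The official website also states which Scala version is required. Download the matching version of Scala and extract it to a directory of your choice.

Configure the environment variables

$ sudo vi /etc/profile
export SCALA_HOME=/XXX/XXX/scala-X.XX.X
export PATH=$PATH:$SCALA_HOME/bin
$ source /etc/profile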

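Run

[Figure: log output of a successful run]
Open the Spark context web UI to check the status of the running services.

Spark run modes: Standalone

Standalone (Spark's own cluster manager)

This mode requires JDK, Scala, Hadoop, and Spark. To install Spark Standalone, you only need to place a compiled version of Spark on each node of the cluster.

[Figure: Spark cluster architecture]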

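$ ./sbin/start-master.sh: starts a standalone master. Once started, the master prints a spark://HOST:PORT URL for itself. Workers use this URL to connect to the master, and you can pass it as the "master" argument to SparkContext. The URL is also shown on the master's web UI.

$ ./sbin/start-slave.sh <master-spark-URL>: starts a worker and connects it to the master.

To start a Spark standalone cluster with the launch scripts, create a file named conf/slaves in the Spark directory and list the hostnames of all machines that should run Spark workers. If the file does not exist, the launch scripts default to a single machine (localhost). Note: the master machine must be able to reach every worker over SSH without a password (using a private key).

#slaves: one hostname per line
<hostname of worker1>
<hostname of worker2>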
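
As a sketch, the file can be generated from a list of hostnames. The names worker1 and worker2 are placeholders, and the assumption is that SPARK_HOME points at the Spark directory; when it is unset, the script falls back to a temporary directory so it can be tried without touching a real installation.

```shell
#!/bin/sh
# Placeholder hostnames; replace with the machines that should run workers.
WORKER_HOSTS="worker1 worker2"

# Assumption: SPARK_HOME points at the Spark directory.
# If it is unset, write into a scratch directory instead.
CONF_DIR="${SPARK_HOME:-$(mktemp -d)}/conf"
mkdir -p "$CONF_DIR"

# Write one hostname per line. This overwrites an existing slaves file.
: > "$CONF_DIR/slaves"
for host in $WORKER_HOSTS; do
  echo "$host" >> "$CONF_DIR/slaves"
done

# Show the result for a quick visual check.
cat "$CONF_DIR/slaves"
```

Spark 3.1 and later name this file conf/workers instead, so adjust the file name to match the version you built.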

Once this file is in place, you can run the following scripts on the Spark master:

#Starts a master instance on the machine the script is executed on.
$ sbin/start-master.sh
#Starts a slave instance on each machine specified in the conf/slaves file.
$ sbin/start-slaves.sh  
#Starts a slave instance on the machine the script is executed on.
$ sbin/start-slave.sh
#Starts both a master and a number of slaves as described above.
$ sbin/start-all.sh
#Stops the master that was started via the sbin/start-master.sh script.
$ sbin/stop-master.sh
#Stops all slave instances on the machines specified in the conf/slaves file.
$ sbin/stop-slaves.sh
#Stops both the master and the slaves as described above.
$ sbin/stop-all.sh

You can configure the cluster further through conf/spark-env.sh. Use conf/spark-env.sh.template as a starting point, then copy the finished file to every worker machine.

#spark-env.sh
export JAVA_HOME=/XXX/XXX/jdkX.X.X_XX
export SCALA_HOME=/XXX/XXX/scala-X.XX.X
#The Hadoop cluster must be running when this option is set
export HADOOP_CONF_DIR=/opt/modules/hadoop-X.X.X/etc/hadoop
export SPARK_CONF_DIR=/opt/modules/spark-X.X.X/conf
#Bind the master to a specific hostname or IP address
export SPARK_MASTER_HOST=
#Port for the master (default: 7077)
export SPARK_MASTER_PORT=7077
#Port for the master web UI (default: 8080)
export SPARK_MASTER_WEBUI_PORT=8080
#Total number of cores Spark applications may use on the machine
export SPARK_WORKER_CORES=1
#Total amount of memory Spark applications may use on the machine
export SPARK_WORKER_MEMORY=1g
#Port the Spark worker starts on (default: random)
export SPARK_WORKER_PORT=7078
#Port for the worker web UI (default: 8081)
export SPARK_WORKER_WEBUI_PORT=8081

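To connect an application to the cluster, simply pass the master's URL (spark://IP:PORT) to the SparkContext constructor. To run an interactive Spark shell against the cluster: $ ./bin/spark-shell --master spark://IP:PORT

The spark-submit script submits a compiled Spark application to the cluster. For standalone clusters, Spark supports two deploy modes:
1. client mode: the driver is launched in the same process as the client that submits the application.
2. cluster mode: the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application, without waiting for the application to finish.
If your application is launched through spark-submit, the application jar is automatically distributed to all worker nodes. Any additional jars the application depends on are specified with the --jars flag, separated by commas (for example, --jars jar1,jar2).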
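
To make the two modes concrete, here is a sketch of a submission in cluster mode. It only prints the command so it can be inspected before anything is launched. The master host, main class, and jar paths are all placeholders; --class, --master, --deploy-mode, and --jars are standard spark-submit options.

```shell
#!/bin/sh
# All values below are placeholders for illustration.
MASTER_URL="spark://master-host:7077"
MAIN_CLASS="com.example.MyApp"
APP_JAR="/path/to/my-app.jar"
EXTRA_JARS="/path/to/jar1.jar,/path/to/jar2.jar"   # comma-separated, no spaces
DEPLOY_MODE="cluster"                              # or "client"

# Assemble the command as a string and print it (dry run).
SUBMIT_CMD="./bin/spark-submit --class $MAIN_CLASS --master $MASTER_URL --deploy-mode $DEPLOY_MODE --jars $EXTRA_JARS $APP_JAR"
echo "$SUBMIT_CMD"

# To submit for real, run the printed command from the Spark directory:
# $SUBMIT_CMD
```

Setting DEPLOY_MODE to client keeps the driver in the submitting process, which is convenient when you want the application's output in your terminal.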