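Learning Spark: Setting Up a Fully Distributed Spark Environment
This post is a hands-on tutorial for getting started with Spark.
Spark
Download and install Spark
Download Spark
For setting up Hadoop and installing other components of its ecosystem, see my other post, a one-stop guide to installing Hadoop and its ecosystem components. It is worth bookmarking.
Download Spark from the open-source community site. Make sure the Spark build matches your Hadoop version, otherwise you will run into errors.
I installed hadoop-2.7, so I downloaded spark-2.4.0-bin-hadoop2.7.tgz.
Install Spark
Copy the downloaded archive to the home directory of the virtual machine with a remote file-transfer tool (I used XFTP), then extract it into the /usr/spark directory: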
[root@master ~]# tar -xvf spark-2.4.0-bin-hadoop2.7.tgz -C /usr/spark/
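One thing that can trip up the command above: tar's -C option does not create the target directory, so extraction fails if /usr/spark does not exist yet. A minimal sketch that wraps both steps, assuming a hypothetical helper named extract_to (my own name, not part of Spark or tar):

```shell
# extract_to ARCHIVE DEST: create DEST if needed, then unpack ARCHIVE into it.
# Hypothetical helper for illustration only.
extract_to() {
    archive=$1
    dest=$2
    # mkdir -p is a no-op when the directory already exists.
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}

# Example with the paths used in this post:
# extract_to ~/spark-2.4.0-bin-hadoop2.7.tgz /usr/spark
```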
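Get to know Spark's directory layout
- Spark directory layout
bin directory:
Commands that open a Spark shell: spark-shell (Scala), pyspark (Python), and sparkR (R)
spark-submit submits jobs, much like the hadoop jar command
sbin directory: holds start-all.sh and stop-all.sh, the scripts that start and stop the Spark cluster
conf directory: holds Spark's configuration files, such as spark-env.sh.template ("env" is short for environment)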
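As a quick sanity check after extraction, the files named in this list can be verified with a few lines of shell. This is only a sketch; check_spark_layout is a hypothetical helper name, and the file list is limited to what the list above mentions.

```shell
# check_spark_layout SPARK_DIR: print every expected file that is missing
# and return non-zero if any is. Hypothetical helper for illustration only.
check_spark_layout() {
    dir=$1
    missing=0
    for f in bin/spark-shell bin/pyspark bin/sparkR bin/spark-submit \
             sbin/start-all.sh sbin/stop-all.sh conf/spark-env.sh.template; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example with the path used in this post:
# check_spark_layout /usr/spark/spark-2.4.0-bin-hadoop2.7
```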
Configure environment variables
- Edit /etc/profile
# set spark
export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
- Reload /etc/profile
[root@master ~]# source /etc/profile
- Copy the profile to the worker nodes
[root@master ~]# scp /etc/profile root@slave1:/etc/profile
[root@master ~]# scp /etc/profile root@slave2:/etc/profile
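- Reload the profile on each worker node (log in to slave1 and slave2 and run source /etc/profile there; new login shells pick it up automatically)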
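To confirm on any node that the reload took effect, one option is to check whether the Spark bin directory is on PATH. The helper below is a sketch with a hypothetical name, on_path. Note that the profile above adds only $SPARK_HOME/bin, not sbin, which is why the start and stop scripts later in this post are run from inside the sbin directory.

```shell
# on_path DIR: succeed if DIR is one of the colon-separated entries in $PATH.
# Hypothetical helper for illustration only.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example, after `source /etc/profile`:
# on_path "$SPARK_HOME/bin" && echo "Spark bin directory is on PATH"
```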
Configure the fully distributed Spark cluster
Configure Spark's environment file, spark-env.sh
[root@master ~]# cd /usr/spark/spark-2.4.0-bin-hadoop2.7/conf/
[root@master conf]# cp spark-env.sh.template spark-env.sh
[root@master conf]# vim spark-env.sh
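At the end of the file, add a HADOOP_CONF_DIR setting that points to the Hadoop configuration directory.
HADOOP_CONF_DIR tells Spark where to find Hadoop's configuration files.
With it set, Spark can read the HDFS and YARN settings.
With it set, Spark also uses HDFS as its default file system, so paths without a scheme resolve to HDFS.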
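The exact lines appended to spark-env.sh are not reproduced in this post. As a sketch only, an addition for this layout might look like the following. The Hadoop path is inferred from the Hadoop start-up commands used in the test section of this post, and SPARK_MASTER_HOST is my own assumption rather than something the post shows.

```shell
# Appended to conf/spark-env.sh (assumed values; adjust to your installation).

# Hadoop client configuration directory, so Spark can find the HDFS and YARN settings.
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.3/etc/hadoop

# Hostname the standalone Master binds to (assumption: the node named "master").
export SPARK_MASTER_HOST=master
```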
Configure Spark's slaves file
[root@master conf]# cp slaves.template slaves
[root@master conf]# vim slaves
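The contents of the slaves file are not shown here. In Spark 2.4 the file lists one worker hostname per line, so for this cluster it would presumably contain slave1 and slave2 (hostnames taken from the scp commands in this post). A sketch that writes the file without an editor, using a hypothetical helper named write_slaves:

```shell
# write_slaves CONF_DIR HOST...: write one worker hostname per line to CONF_DIR/slaves.
# Hypothetical helper for illustration only; it overwrites any existing file.
write_slaves() {
    conf_dir=$1
    shift
    printf '%s\n' "$@" > "$conf_dir/slaves"
}

# Example with the paths and hostnames used in this post:
# write_slaves /usr/spark/spark-2.4.0-bin-hadoop2.7/conf slave1 slave2
```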
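Sync the files
Spark's deployment modes include Local (single machine), Standalone (Spark's own simple cluster manager), YARN (YARN as the cluster manager), and Mesos (Mesos as the cluster manager). Local runs on one machine; the other three are distributed deployments.
Here we configure Standalone cluster mode, a commonly used option in which the Spark cluster manages its own resources.
- Sync the spark directory from master to slave1 and slave2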
[root@master ~]# scp -r /usr/spark/ root@slave1:/usr/
[root@master ~]#
[root@master ~]# scp -r /usr/spark/ root@slave2:/usr/
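With more than a couple of workers, the same copy can be written as a loop. The sketch below assumes a hypothetical helper, sync_spark, and a hypothetical RUN variable that turns it into a dry run; neither comes from Spark.

```shell
# sync_spark HOST...: copy /usr/spark to /usr/ on each host, stopping at the first failure.
# Set RUN=echo to print the scp commands instead of executing them.
# Hypothetical helper for illustration only.
sync_spark() {
    for host in "$@"; do
        ${RUN:-} scp -r /usr/spark/ "root@$host:/usr/" || return 1
    done
}

# Example with the hostnames used in this post:
# sync_spark slave1 slave2
```

Running it once with RUN=echo is a cheap way to check the generated commands before anything is copied.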
Test the fully distributed Spark cluster
- Start the Hadoop cluster
[root@master ~]# cd /usr/hadoop/hadoop-2.7.3/sbin/
[root@master sbin]# ./start-all.sh
- Start the Spark cluster
[root@master ~]# cd /usr/spark/spark-2.4.0-bin-hadoop2.7/sbin/
[root@master sbin]# ./start-all.sh
- List the processes on each node
[root@master sbin]# jps
2225 Master
1559 NameNode
1928 ResourceManager
2312 Jps
1759 SecondaryNameNode
[root@master sbin]# ssh slave1 jps
3460 Jps
3116 DataNode
3374 Worker
3215 NodeManager
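Instead of reading the jps output by eye, the check can be scripted. A minimal sketch, assuming a hypothetical helper named has_daemon that reads jps output on standard input and matches the process name in the second column:

```shell
# has_daemon NAME: read `jps` output on stdin; succeed if a process named NAME is listed.
# Hypothetical helper for illustration only.
has_daemon() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Example with the hostnames used in this post:
# jps | has_daemon Master && ssh slave1 jps | has_daemon Worker && echo "cluster is up"
```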
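Note: running the cluster this way is called Standalone mode.
Spark cluster Web UI: http://master:8080
The page shows that the Spark master is listening on port 7077 (spark://master:7077).
Submit jobs to the Spark cluster
- Review the basic usage of spark-submit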
[root@master sbin]# spark-submit
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Cluster deploy mode only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
[root@master sbin]#
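To see these options in action, the SparkPi example bundled with Spark can be submitted to the standalone master. This is a sketch under assumptions: the jar name spark-examples_2.11-2.4.0.jar is what I expect for this Spark build (check with ls $SPARK_HOME/examples/jars), and submit_pi and the SPARK_SUBMIT override are hypothetical names introduced for illustration.

```shell
# submit_pi [SLICES]: run the bundled SparkPi example on the standalone cluster.
# SPARK_SUBMIT can be overridden (e.g. SPARK_SUBMIT=echo) to preview the arguments.
# Hypothetical helper; the examples jar name is an assumption to verify locally.
submit_pi() {
    "${SPARK_SUBMIT:-spark-submit}" \
        --master spark://master:7077 \
        --class org.apache.spark.examples.SparkPi \
        "${SPARK_HOME:-/usr/spark/spark-2.4.0-bin-hadoop2.7}/examples/jars/spark-examples_2.11-2.4.0.jar" \
        "${1:-10}"
}

# Example: estimate Pi using 100 partitions.
# submit_pi 100
```

The --master URL matches the one used for spark-shell in this post, and --class names the application's main class as described in the usage text above.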
Stop the Spark cluster
[root@master ~]# cd /usr/spark/spark-2.4.0-bin-hadoop2.7/sbin/
[root@master sbin]# ./stop-all.sh
Entering and exiting spark-shell
[root@master sbin]# ./start-all.sh
[root@master sbin]# spark-shell --master spark://master:7077
…………………………
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://master:7077, app id = app-20200516221317-0000).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
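As the banner shows, Spark's prebuilt binary package already bundles a Scala runtime.
To leave the Scala shell, type :quit (typing sys.exit or pressing Ctrl+D also works):
scala> :quit
Entering and exiting pyspark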
[root@master sbin]# pyspark --master spark://master:7077
Python 2.7.5 (default, Apr 11 2018, 07:36:10)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Python version 2.7.5 (default, Apr 11 2018 07:36:10)
SparkSession available as 'spark'.
>>>
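>>> exit()  # or press Ctrl+D to leave the Python shell
That wraps up the cluster setup and the basics of opening and closing the shells. The next post will cover processing and analyzing data with Spark in detail.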