Big Data Ecosystem Components: Learning Spark (Setting Up a Fully Distributed Spark Environment)

Latest recommended article published on 2022-04-18 08:50:06

shy-2


Category: Big Data Technology and Applications. Tags: distributed, big data, linux, hadoop, spark

Article link: https://blog.csdn.net/qq_44032277/article/details/106100193


Learning Spark: Setting Up a Fully Distributed Spark Environment

This post is a tutorial on getting started with Spark.

Spark

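Downloading and installing Spark

Downloading Spark

For setting up Hadoop and installing other ecosystem components, see my other post, "One Article to Set Up Hadoop and Its Ecosystem Components" (worth bookmarking).

Download Spark from the open-source project site (Apache Spark), and make sure the package matches your Hadoop version; a mismatched build can fail at runtime.
I have hadoop-2.7 installed, so I downloaded spark-2.4.0-bin-hadoop2.7.tgz.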

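Installing Spark

Copy the downloaded archive to the home directory of the virtual machine with a file transfer tool (I used XFTP), then extract it into /usr/spark. Create the directory first, because tar -C requires it to exist.

[root@master ~]# mkdir -p /usr/spark
[root@master ~]# tar -xvf spark-2.4.0-bin-hadoop2.7.tgz -C /usr/spark/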

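Getting to know Spark's directory layout

bin directory:
Commands that start an interactive shell: spark-shell (Scala), pyspark (Python), and sparkR (R)
spark-submit submits applications, similar to the hadoop jar command

sbin directory: holds the cluster control scripts; start-all.sh and stop-all.sh start and stop the Spark cluster

conf directory: holds Spark's configuration files, such as spark-env.sh.template (env is short for environment)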

Configuring environment variables

Edit /etc/profile and add:

# set spark
export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

Reload /etc/profile

[root@master ~]# source /etc/profile

Copy the file to the worker nodes

[root@master ~]# scp /etc/profile root@slave1:/etc/profile

[root@master ~]# scp /etc/profile root@slave2:/etc/profile

Reload the file on each worker node: run source /etc/profile there, or log in again.

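Configuring the fully distributed Spark cluster

Configuring Spark's environment file `spark-env.sh`

[root@master ~]# cd /usr/spark/spark-2.4.0-bin-hadoop2.7/conf/
[root@master conf]# cp spark-env.sh.template spark-env.sh
[root@master conf]# vim spark-env.sh

Append the cluster settings at the end of the file:
[Screenshot: lines appended to spark-env.sh]

HADOOP_CONF_DIR tells Spark where Hadoop's configuration files are.
With it set, Spark can read the HDFS and YARN settings.
With it set, paths that do not name a file system are resolved against HDFS by default.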
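
The settings themselves are not shown in the text above, so here is a sketch of what such a block might look like. Every value is an assumption rather than the author's actual configuration: the JDK path is hypothetical, the Hadoop path is inferred from the hadoop-2.7.3 directory used later in this post, and write_spark_env is a helper name made up for this example. JAVA_HOME, HADOOP_CONF_DIR, SPARK_MASTER_HOST, and SPARK_MASTER_PORT are standard variables read by Spark's scripts.

```shell
# Sketch: append typical standalone-mode settings to a spark-env.sh file.
# The values are illustrative assumptions; adjust them to your own machines.
write_spark_env() {
  local target="$1"   # path to the spark-env.sh file to append to
  cat >> "$target" <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_171                      # hypothetical JDK location
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.3/etc/hadoop   # where Hadoop's *-site.xml files live
export SPARK_MASTER_HOST=master                              # hostname of the master node
export SPARK_MASTER_PORT=7077                                # default standalone master port
EOF
}
```

It could be called as write_spark_env /usr/spark/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh, which has the same effect as typing the four lines in vim.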

Configuring Spark's slaves file

[root@master conf]# cp slaves.template slaves
[root@master conf]# vim slaves

[Screenshot: contents of the slaves file]
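
The slaves file lists one worker hostname per line. A minimal sketch, assuming the workers are slave1 and slave2 (the hosts used in the scp commands in this post); write_workers_file is a hypothetical helper, not part of Spark.

```shell
# Sketch: write one worker hostname per line into a slaves file.
# Usage: write_workers_file <path-to-slaves-file> <host> [<host> ...]
write_workers_file() {
  local target="$1"
  shift
  # printf repeats the format for every remaining argument, giving one host per line.
  printf '%s\n' "$@" > "$target"
}
```

For example, write_workers_file /usr/spark/spark-2.4.0-bin-hadoop2.7/conf/slaves slave1 slave2 would produce a two-line file. The master is left out here on the assumption that it should not also run a Worker.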

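Syncing the files

Spark supports several deployment modes: Local (single machine), Standalone (Spark's own simple cluster manager), YARN (YARN as the cluster manager), and Mesos (Mesos as the cluster manager). Local runs on a single machine; the other three are distributed deployments.

Here we configure Standalone cluster mode, a commonly used setup in which the Spark cluster manages its own resources.

Copy the spark directory from master to slave1 and slave2:

[root@master ~]# scp -r /usr/spark/ root@slave1:/usr/
[root@master ~]# scp -r /usr/spark/ root@slave2:/usr/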
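
With more than a couple of workers, the same copy can be written as a loop. This is a sketch only: sync_to_workers and the DRY_RUN switch are names invented for this example, and it assumes root SSH access to each host, as in the commands above.

```shell
# Sketch: copy a directory to the same parent path on several worker hosts.
# Usage: sync_to_workers <source-dir> <dest-parent> <host> [<host> ...]
# Set DRY_RUN=1 to print the scp commands instead of running them.
sync_to_workers() {
  local src="$1" dest="$2"
  shift 2
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp -r $src root@$host:$dest"
    else
      scp -r "$src" "root@$host:$dest"
    fi
  done
}
```

Running it once with DRY_RUN=1, for example sync_to_workers /usr/spark/ /usr/ slave1 slave2, is a cheap way to confirm the host list before anything is copied.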

Testing the fully distributed Spark cluster

Start the Hadoop cluster

[root@master ~]# cd /usr/hadoop/hadoop-2.7.3/sbin/
[root@master sbin]# ./start-all.sh

Start the Spark cluster

[root@master ~]# cd /usr/spark/spark-2.4.0-bin-hadoop2.7/sbin/
[root@master sbin]# ./start-all.sh

List the processes on each node

[root@master sbin]# jps
2225 Master
1559 NameNode
1928 ResourceManager
2312 Jps
1759 SecondaryNameNode

[root@master sbin]# ssh slave1 jps
3460 Jps
3116 DataNode
3374 Worker
3215 NodeManager
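
Checking for the Master and Worker processes can also be scripted. A minimal sketch, assuming jps prints one "PID Name" pair per line as in the output above; has_daemon is a hypothetical helper, not a Spark or JDK command.

```shell
# Sketch: succeed if jps-style output on stdin contains a process with the given name.
# Usage: jps | has_daemon <ProcessName>
has_daemon() {
  # Compare the second column exactly, so "Master" does not match "ApplicationMaster".
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Example calls on a running cluster:
#   jps | has_daemon Master && echo "Master is running"
#   ssh slave1 jps | has_daemon Worker && echo "Worker is running on slave1"
```

Matching on the whole process name rather than a substring avoids false positives from similarly named JVMs.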

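Note: running the cluster this way is called Standalone mode.

Spark cluster web UI: http://master:8080
The page shows the master URL spark://master:7077; 7077 is the port on which the Spark master accepts connections.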

Submitting applications to the Spark cluster

The basic usage of spark-submit

[root@master sbin]# spark-submit 
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
      
[root@master sbin]#
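
As a concrete case of the first usage form, the sketch below prints a spark-submit command for the SparkPi example that ships with the Spark binary package. It assumes the master URL spark://master:7077 seen in this post and the examples jar name used by the 2.4.0 / Scala 2.11 build; pi_submit_command is a hypothetical helper, and the command is printed rather than run so it can be checked first.

```shell
# Sketch: build the spark-submit command line for the bundled SparkPi example.
# Usage: pi_submit_command <master-url> [<slices>]
pi_submit_command() {
  local master_url="$1"
  local slices="${2:-10}"   # SparkPi argument: number of partitions to spread the work over
  # Fall back to this post's install path when SPARK_HOME is not set.
  local spark_home="${SPARK_HOME:-/usr/spark/spark-2.4.0-bin-hadoop2.7}"
  echo "spark-submit --class org.apache.spark.examples.SparkPi" \
       "--master ${master_url}" \
       "${spark_home}/examples/jars/spark-examples_2.11-2.4.0.jar ${slices}"
}

# Print the command for the cluster described in this post.
pi_submit_command spark://master:7077 100
```

To run it for real, pipe the printed line to sh (pi_submit_command spark://master:7077 100 | sh). If the job succeeds, the driver output should include a line beginning with "Pi is roughly", and the application should appear on the web UI at http://master:8080.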

Stopping the Spark cluster

[root@master ~]# cd /usr/spark/spark-2.4.0-bin-hadoop2.7/sbin/
[root@master sbin]# ./stop-all.sh

Entering and exiting spark-shell

[root@master sbin]# ./start-all.sh 

[root@master sbin]# spark-shell --master spark://master:7077
…………………………
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://master:7077, app id = app-20200516221317-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

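As the banner shows, the prebuilt Spark binary package already bundles a Scala runtime (2.11.12 here), so Scala does not need to be installed separately.

To leave the Scala shell, type :quit (or sys.exit, or press Ctrl+D):

scala> :quit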

Entering and exiting pyspark

[root@master sbin]# pyspark --master spark://master:7077
Python 2.7.5 (default, Apr 11 2018, 07:36:10) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.5 (default, Apr 11 2018 07:36:10)
SparkSession available as 'spark'.
>>>

>>> exit()  # or press Ctrl+D to leave the Python shell

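That covers the setup and the basics of opening and closing the shells. The next post will cover processing and analyzing data with Spark in detail.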
