Spark Environment Setup (Spark on YARN | Standalone)

This article walks through setting up a Spark environment on CentOS: disabling the firewall, configuring the hostname and passwordless SSH login, and installing the JDK, HDFS, and Spark. It covers the installation steps for both YARN and Standalone modes, how to run remote and local tests, and how to configure and start the History Server to record the historical state of Spark jobs.

Spark Yarn|Standalone

Base Environment
  • Disable the firewall
[root@centos ~]# service iptables stop # stop the firewall service
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@centos ~]# chkconfig iptables off # disable the firewall at boot
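
The commands above target CentOS 6, which this article assumes. On CentOS 7 or later the iptables service is replaced by firewalld, so a rough equivalent (an assumption about your environment, not used in the rest of this article) would be:

[root@centos ~]# systemctl stop firewalld      # stop the firewall
[root@centos ~]# systemctl disable firewalld   # disable it at boot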
  • Set the hostname
 [root@centos ~]# cat /etc/sysconfig/network
 NETWORKING=yes
 HOSTNAME=centos
  • Map the hostname to the IP address
[root@centos ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.58.24 centos
  • Reboot the CentOS system

  • Configure passwordless SSH login

[root@centos ~]# ssh-keygen -t rsa

[root@centos ~]# ssh-copy-id centos
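
To confirm passwordless login is working, try logging in to the host by name; it should not prompt for a password (an optional check):

[root@centos ~]# ssh centos   # should log in without asking for a password
[root@centos ~]# exit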
  • Install the JDK and configure JAVA_HOME
[root@centos ~]# rpm -ivh jdk-8u191-linux-x64.rpm

[root@centos ~]# vi ~/.bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@centos ~]# source .bashrc
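
A quick sanity check that the JDK is installed and JAVA_HOME is picked up (optional):

[root@centos ~]# java -version
[root@centos ~]# echo $JAVA_HOME   # should print /usr/java/latest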
Install HDFS
[root@centos ~]# tar -zxf hadoop-2.9.2.tar.gz -C /usr/
[root@centos ~]# vi ~/.bashrc
HADOOP_HOME=/usr/hadoop-2.9.2
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
[root@centos ~]# source .bashrc
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml
<!-- NameNode access endpoint -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://centos:9000</value>
</property>
<!-- Base working directory for HDFS -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/slaves
centos
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
<!-- Block replication factor -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- Host that runs the Secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>centos:50090</value>
</property>
<!-- Maximum number of files a DataNode serves concurrently -->
<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>
<!-- Number of DataNode server threads -->
<property>
    <name>dfs.datanode.handler.count</name>
    <value>6</value>
</property>

[root@centos ~]# hdfs namenode -format # create the fsimage file required to start the NameNode
[root@centos ~]# start-dfs.sh 
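
The word-count tests later in this article read hdfs:///demo/words, which does not exist yet. A minimal sketch of preparing that input is shown below; the local file name and its contents are made up for illustration:

[root@centos ~]# echo "this is a demo this is a test" > words   # sample input (made up)
[root@centos ~]# hdfs dfs -mkdir -p /demo
[root@centos ~]# hdfs dfs -put words /demo/
[root@centos ~]# hdfs dfs -ls /demo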

Install Spark (Standalone)
[root@centos ~]# tar -zxf spark-2.4.3-bin-without-hadoop.tgz -C /usr/
[root@centos ~]# mv /usr/spark-2.4.3-bin-without-hadoop/ /usr/spark-2.4.3
[root@centos ~]# cd /usr/spark-2.4.3/

[root@centos spark-2.4.3]# mv conf/slaves.template conf/slaves
[root@centos spark-2.4.3]# vi conf/slaves
centos

[root@centos spark-2.4.3]# mv conf/spark-env.sh.template conf/spark-env.sh
[root@centos spark-2.4.3]# vi conf/spark-env.sh
SPARK_MASTER_HOST=centos
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=2
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_MASTER_HOST
export SPARK_MASTER_PORT
export SPARK_WORKER_CORES
export SPARK_WORKER_MEMORY
export SPARK_WORKER_INSTANCES
export LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH

[root@centos spark-2.4.3]# ./sbin/start-all.sh  # only needed in Standalone mode
[root@centos spark-2.4.3]# jps
8064 Jps
2066 NameNode
2323 SecondaryNameNode
7912 Worker
7801 Master
7981 Worker
2157 DataNode

You can now open the Spark Master web UI at http://centos:8080/ (the two Worker processes in the jps output above correspond to SPARK_WORKER_INSTANCES=2).
Test the cluster's compute capability:

[root@centos spark-2.4.3]# ./bin/spark-shell 
	--master spark://centos:7077  # connect to the cluster Master
	--deploy-mode client          # how the Driver runs: must be client for spark-shell
	--total-executor-cores 4      # total cores allocated to this application
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://centos:4040
Spark context available as 'sc' (master = spark://centos:7077, app id = app-20190924232452-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("hdfs:///demo/words")
        .flatMap(_.split(" "))
        .map((_,1))
        .groupBy(t=>t._1)
        .map(t=>(t._1,t._2.size))
        .sortBy(t=>t._2,false,4)
        .saveAsTextFile("hdfs:///demo/results")
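
After the job finishes you can inspect the output directly from HDFS. Note that saveAsTextFile refuses to write to a path that already exists, so the results directory has to be removed before re-running the test (the same applies to the YARN and spark-submit examples below):

[root@centos ~]# hdfs dfs -cat /demo/results/part-*   # view the word-count output
[root@centos ~]# hdfs dfs -rm -r /demo/results        # clear it before running the job again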

YARN
  • Modify yarn-site.xml

    [root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/yarn-site.xml
    <!-- Auxiliary shuffle service required by the MapReduce framework -->
    <property> 
        <name>yarn.nodemanager.aux-services</name> 
        <value>mapreduce_shuffle</value> 
    </property> 
    <!-- Host that runs the ResourceManager -->
    <property> 
        <name>yarn.resourcemanager.hostname</name> 
        <value>centos</value> 
    </property> 
    <!-- Disable the physical memory check -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name> 
        <value>false</value> 
    </property> 
    <!-- Disable the virtual memory check -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    
    
  • Modify mapred-site.xml

 [root@centos ~]# mv /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml.template /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml 
 [root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml
 <!-- Run the MapReduce framework on YARN -->
 <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
 </property>
 
  • Start the YARN service
[root@centos ~]# start-yarn.sh
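
Once YARN is up, the ResourceManager and a NodeManager should be running alongside the HDFS daemons, and the ResourceManager web UI is available on its default port at http://centos:8088 (an optional check):

[root@centos ~]# jps   # expect ResourceManager and NodeManager in addition to NameNode/DataNode/SecondaryNameNode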

Install Spark (on YARN)
[root@centos ~]# tar -zxf spark-2.4.3-bin-without-hadoop.tgz -C /usr/
[root@centos ~]# mv /usr/spark-2.4.3-bin-without-hadoop/ /usr/spark-2.4.3
[root@centos ~]# cd /usr/spark-2.4.3/
[root@centos spark-2.4.3]# mv conf/spark-env.sh.template conf/spark-env.sh
[root@centos spark-2.4.3]# vi conf/spark-env.sh
HADOOP_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
YARN_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
SPARK_EXECUTOR_CORES=4
SPARK_EXECUTOR_MEMORY=1g
SPARK_DRIVER_MEMORY=1g
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR
export YARN_CONF_DIR
export SPARK_EXECUTOR_CORES
export SPARK_DRIVER_MEMORY
export SPARK_EXECUTOR_MEMORY
export LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH

Note: unlike Standalone mode, there is no need to run start-all.sh here, because execution of the tasks is handed over to YARN.

[root@centos spark-2.4.3]# ./bin/spark-shell
	--master yarn                 # submit the application to YARN
	--deploy-mode client          # how the Driver runs: must be client for spark-shell
	--executor-cores 4            # cores per Executor process
	--num-executors 2             # number of Executor processes

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/25 00:14:40 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/09/25 00:14:43 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
Spark context Web UI available at http://centos:4040
Spark context available as 'sc' (master = yarn, app id = application_1569341195065_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("hdfs:///demo/words")
        .flatMap(_.split(" "))
        .map((_,1))
        .groupBy(t=>t._1)
        .map(t=>(t._1,t._2.size))
        .sortBy(t=>t._2,false,4)
        .saveAsTextFile("hdfs:///demo/results")
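
While the shell is attached, the application is also tracked by YARN; besides the ResourceManager web UI at http://centos:8088, a quick way to check from another terminal is:

[root@centos ~]# yarn application -list   # the spark-shell session shows up as a RUNNING application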

Build and Deployment

Remote test

  • Add the Spark development dependency
<properties>
    <spark.version>2.4.3</spark.version>
    <scala.version>2.11</scala.version>
</properties>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version}</artifactId>
    <version>${spark.version}</version>
    <!-- provided: excluded from the jar when Maven packages the project -->
    <scope>provided</scope>
</dependency>

  • Write the word-count application (the SparkWordCount class referenced below by spark-submit)
package com.baizhi.demo01

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    //1. Create the SparkContext
    val sparkConf = new SparkConf()
      .setAppName("wordcount")
      .setMaster("spark://centos:7077")
    val sc = new SparkContext(sparkConf)
    //2. Create the distributed dataset (RDD)
    val lines: RDD[String] = sc.textFile("hdfs:///demo/words")
    //3. Transform the dataset
    val transformRDD: RDD[(String, Int)] = lines.flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(t => t._1)
      .map(t => (t._1, t._2.size))
      .sortBy(t => t._2, false, 4)
    //4. Trigger an action on the RDD to submit the job
    transformRDD.saveAsTextFile("hdfs:///demo/results")
    //5. Release resources
    sc.stop()
  }
}
  • Add the Maven plugins
<!-- Compile the Scala sources into the jar during the package phase -->
<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>4.0.1</version>
    <executions>
        <execution>
            <id>scala-compile-first</id>
            <phase>process-resources</phase>
            <goals>
                <goal>add-source</goal>
                <goal>compile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
<!-- Shade the dependency jars into the final jar (fat jar) -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>
  • Run the Maven package command
mvn package
  • Two jars are produced under the project's target directory
original-xxx-1.0-SNAPSHOT.jar // without third-party dependencies
xxx-1.0-SNAPSHOT.jar          // everything except provided-scope dependencies is packed in ---> fat jar
  • Submit the Spark job with the spark-submit command
[root@centos spark-2.4.3]# ./bin/spark-submit \
							--master spark://centos:7077 \
							--deploy-mode cluster \
							--class com.baizhi.demo01.SparkWordCount \
							--driver-cores 2 \
							--total-executor-cores 4 \
							/root/rdd-1.0-SNAPSHOT.jar
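
For the YARN setup described earlier, a roughly equivalent submit is sketched below (same class and jar as above). Keep in mind that the example class hard-codes setMaster("spark://centos:7077"), and properties set in code take precedence over spark-submit flags, so that call would have to be removed (letting --master decide) before submitting to YARN:

[root@centos spark-2.4.3]# ./bin/spark-submit \
							--master yarn \
							--deploy-mode cluster \
							--class com.baizhi.demo01.SparkWordCount \
							--num-executors 2 \
							--executor-cores 2 \
							/root/rdd-1.0-SNAPSHOT.jar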

Local test

//1. Create the SparkContext
val sparkConf = new SparkConf()
.setAppName("wordcount")
.setMaster("local[6]")
val sc = new SparkContext(sparkConf)
//2. Create the distributed dataset (RDD)
val lines:RDD[String] = sc.textFile("file:///D:/demo/words")
//3. Transform the dataset
val transformRDD:RDD[(String,Int)] = lines.flatMap(_.split(" "))
.map((_, 1))
.groupBy(t => t._1)
.map(t => (t._1, t._2.size))
.sortBy(t => t._2, false, 4)
//4. Trigger an action on the RDD to submit the job
transformRDD.saveAsTextFile("file:///D:/demo/results")
//5. Release resources
sc.stop()

Note: when running locally, no Spark cluster is required; just comment out <scope>provided</scope> so the Spark dependency is available on the run-time classpath.

./bin/spark-shell 
	--master local[6]         # run locally with 6 worker threads
	--deploy-mode client      # how the Driver runs: must be client
	--total-executor-cores 4  # compute resources (only meaningful on a cluster manager)

History Server

The History Server records the historical state of applications as they execute, so that completed jobs can still be reviewed afterwards.

  • Add to spark-env.sh
[root@centos spark-2.4.3]# vi conf/spark-env.sh
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs"
export SPARK_HISTORY_OPTS
  • Modify spark-defaults.conf
[root@centos spark-2.4.3]# mv conf/spark-defaults.conf.template conf/spark-defaults.conf
[root@centos spark-2.4.3]# vi conf/spark-defaults.conf
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs
  • Create the spark-logs directory on HDFS (where the History Server stores its data)
[root@centos ~]# hdfs dfs -mkdir /spark-logs
  • Start the History Server
[root@centos spark-2.4.3]# ./sbin/start-history-server.sh
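
After any application runs with event logging enabled, its event-log file should appear under the configured directory and the application should be listed in the History Server UI (an optional check):

[root@centos ~]# hdfs dfs -ls /spark-logs   # one event-log file per application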

Verify the service is running: http://centos:18080
