Spark from Getting Started to Giving Up — Installing and Starting Spark 2.4.7 (Part 2)

Spark Version

  Spark is a top-level Apache open-source project (see the official site) and is still actively updated; as of this post (2020-12-02) the latest release is Spark 3.0.1, released Sep 08, 2020. Since my company currently runs Spark 2.4.7 in production, the rest of this series is based on Spark 2.4.7.

  Version covered: Spark 2.4.7

  To do a good job, you must first sharpen your tools! This post walks through installing and starting Spark.

Installation Prerequisites

  As mentioned in Spark from Getting Started to Giving Up — Getting to Know Spark (Part 1), Spark is only a compute framework that replaces MapReduce, so it has to rely on an external file system. For open-source storage we usually go with Hadoop's HDFS, typically with Hive on top of it. For installing HDFS and Hive, setting up passwordless SSH across the cluster, and the required Java version, please refer to the following posts; those steps are not repeated here:

  1. Hadoop Cluster Big Data Solutions — Building Hadoop 3.X in HA Mode (Part 2)
  2. Hive from Getting Started to Giving Up — Hive Installation and Basic Usage (Part 2)

  Following the principle of data locality, it is best to install Spark on every host that runs an HDFS NameNode or DataNode, so we will install Spark on node1, node2, node3 and node4 of the test cluster.

Scala Installation

  The Spark framework itself is written in Scala and naturally exposes a Scala API. Whether you are pulling data ad hoc with Spark or writing full Spark applications, Scala is concise and convenient, and even the interactive shell launched by spark-shell defaults to Scala. All of this makes Scala and Spark a natural pair. Even James Gosling, the father of Java, reportedly said that if he had to pick a language other than Java, he would choose Scala. So before installing Spark, install Scala first.

  Scala official site: link

  1. Upload scala-2.11.12.tgz to /data/tools on node1 and extract it with the following command:
tar -zxvf scala-2.11.12.tgz
  2. Set the Scala environment variables:
# Edit the environment file /etc/profile
sudo vim /etc/profile

# Append the Scala paths at the bottom of /etc/profile, then save with :wq
export SCALA_HOME=/data/tools/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin

# Reload the environment file
source /etc/profile
  3. Test that Scala installed correctly. Because the environment variable is set, you can run scala from any folder to start the Scala interactive shell; type 1+1 and it should return an Int named res0 with the value 2, as shown below. Congratulations, Scala is installed. Press Ctrl+C (or type :quit) to leave the shell.
[hadoop@node1 tools]$ scala
Welcome to Scala 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211).
Type in expressions for evaluation. Or try :help.

scala> 1+1
res0: Int = 2

scala>

  4. Install Scala on node2, node3 and node4 as well; scp the tarball to each node as shown below:
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node2:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  40.0MB/s   00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node3:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  42.8MB/s   00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node4:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  47.8MB/s   00:00

  5. I won't bother writing a shell loop for just a few machines; instead I cheat with MobaXterm's "Write commands on all terminals" feature (multi-terminal sync), which sends the same keystrokes to every open remote window at once. It is a very handy feature; see the screenshots below (a loop-based alternative is sketched after the figures).

Figure 1: Enabling MobaXterm's "Write commands on all terminals" feature

Figure 2: Using MobaXterm's "Write commands on all terminals" feature
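
  For reference, the same distribution can also be scripted as a small loop instead of typing into synced terminals; this is only a sketch, assuming passwordless SSH from node1 to the other nodes and the same /data/tools layout on every host:
# Hypothetical helper loop, run from node1; adjust host names and paths to your own cluster
for host in node2 node3 node4; do
  scp /data/tools/scala-2.11.12.tgz ${host}:/data/tools/        # copy the tarball over
  ssh ${host} "cd /data/tools && tar -zxvf scala-2.11.12.tgz"   # extract it remotely
done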

  That completes the Scala installation.

Spark 2.4.7 Installation

Downloading Spark 2.4.7

  Official download: link
  After downloading, upload spark-2.4.7-bin-hadoop2.7.tgz to the /data/tools folder on node1 and extract it with the following command:

tar -zxvf spark-2.4.7-bin-hadoop2.7.tgz

Modifying the Spark Configuration Files

  Spark's configuration files live under /data/tools/spark-2.4.7-bin-hadoop2.7/conf; the directory initially looks like this:

-rwxr-xr-x. 1 hadoop hadoop    996 Sep  8 13:48 docker.properties.template
-rwxr-xr-x. 1 hadoop hadoop   1105 Sep  8 13:48 fairscheduler.xml.template
-rwxr-xr-x. 1 hadoop hadoop   2025 Sep  8 13:48 log4j.properties.template
-rwxr-xr-x. 1 hadoop hadoop   7801 Sep  8 13:48 metrics.properties.template
-rwxr-xr-x. 1 hadoop hadoop    865 Sep  8 13:48 slaves.template
-rwxr-xr-x. 1 hadoop hadoop   1292 Sep  8 13:48 spark-defaults.conf.template
-rwxr-xr-x. 1 hadoop hadoop   4221 Sep  8 13:48 spark-env.sh.template

  The files we need to modify are spark-env.sh, spark-defaults.conf and slaves. They do not appear in the listing above because you have to create them yourself by copying them out of the corresponding .template files.

Modifying spark-env.sh
  1. First, copy the file out of its template:
# Go to /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# Create spark-env.sh from the template
cp ./spark-env.sh.template ./spark-env.sh
  2. Edit spark-env.sh as shown below. Only the uncommented lines at the very end matter, and the values should be adjusted to your own machines. The line export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath) must be present, otherwise you are likely to hit exceptions, and /data/tools/hadoop-2.8.5/bin/hadoop has to match your own Hadoop path, so don't copy it blindly (a quick check follows the file contents below):
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS


export JAVA_HOME=/data/tools/jdk1.8.0_211
export SCALA_HOME=/data/tools/scala-2.11.12
export HADOOP_HOME=/data/tools/hadoop-2.8.5
export HADOOP_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop
export YARN_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop

export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077

export SPARK_DRIVER_MEMORY=1G
export SPARK_EXECUTOR_CORES=4
export SPARK_EXECUTOR_MEMORY=2G

export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=2G
export SPARK_WORKER_INSTANCES=1

export SPARK_LOG_DIR=/data/logs/spark/
export SPARK_WORKER_DIR=/data/logs/spark/worker
export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=4000 -Dspark.history.retainedApplications=100 -Dspark.history.fs.logDirectory=hdfs://dw-cluster:8020/opt/spark/applicationHistory"
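
  Since the SPARK_DIST_CLASSPATH line above shells out to hadoop classpath, it is worth checking that the command resolves on your machine before going further; a quick sanity check, using the same Hadoop path configured above:
# Should print a long, colon-separated list of Hadoop jar and conf paths;
# if it errors out, fix HADOOP_HOME first, otherwise Spark will not find the Hadoop classes
/data/tools/hadoop-2.8.5/bin/hadoop classpath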

Modifying spark-defaults.conf
  1. As before, copy the file out of its template:
# Go to /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# Create spark-defaults.conf from the template
cp ./spark-defaults.conf.template ./spark-defaults.conf
  2. Edit spark-defaults.conf with the contents below; node1 is the hostname of one of my nodes, and hdfs://dw-cluster:8020/user/hive/warehouse is the root of the Hive warehouse on my HA HDFS (a note on the HDFS paths follows the file contents below):
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

spark.master                            spark://node1:7077
spark.eventLog.enabled                  true
spark.eventLog.dir                      hdfs://dw-cluster:8020/opt/spark/applicationHistory
spark.serializer                        org.apache.spark.serializer.KryoSerializer
spark.eventLog.compress                 true
spark.yarn.historyServer.address        http://node1:18018
spark.sql.warehouse.dir                 hdfs://dw-cluster:8020/user/hive/warehouse
spark.sql.parquet.enableVectorizedReader        false
spark.sql.parquet.writeLegacyFormat             true
spark.debug.maxToStringFields           100
spark.network.timeout                   300000
spark.yarn.jars               hdfs://dw-cluster/tmp/spark_jars/*.jar
spark.port.maxRetries   100
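
  Note that two of the values above point at HDFS locations that Spark will not create on its own: spark.eventLog.dir must exist before applications (and the history server) can write event logs to it, and spark.yarn.jars expects the Spark jars to have been uploaded already. A minimal sketch, assuming the dw-cluster namespace and paths configured above:
# Create the event log directory on HDFS
hdfs dfs -mkdir -p hdfs://dw-cluster:8020/opt/spark/applicationHistory

# Upload the Spark jars referenced by spark.yarn.jars
hdfs dfs -mkdir -p hdfs://dw-cluster/tmp/spark_jars
hdfs dfs -put /data/tools/spark-2.4.7-bin-hadoop2.7/jars/*.jar hdfs://dw-cluster/tmp/spark_jars/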

Modifying slaves
  1. As before, copy the file out of its template:
# Go to /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# Create slaves from the template
cp ./slaves.template ./slaves
  2. Edit slaves as shown below; it simply lists the Spark worker nodes:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
node2
node3
node4

Adding hive-site.xml

  Copy the hive-site.xml from your existing Hive installation into Spark's conf directory with the command below, adjusting the paths to your own Hive and Spark locations:

cp /data/tools/apache-hive-2.3.5-bin/conf/hive-site.xml /data/tools/spark-2.4.7-bin-hadoop2.7/conf/

  The purpose is to enable Spark on Hive, i.e. using spark-sql to read and manipulate the data stored in Hive.

Adding mysql-connector-java-8.0.13.jar

  Because the Hive metastore lives in MySQL, the MySQL JDBC driver used by Hive, mysql-connector-java-8.0.13.jar, needs to be copied over as well:

cp /data/tools/apache-hive-2.3.5-bin/lib/mysql-connector-java-8.0.13.jar /data/tools/spark-2.4.7-bin-hadoop2.7/jars/

  With that, Spark on Hive truly works.

  All of the configuration files are now in place!

Copying Spark to the Other Nodes

  First, on node1, scp spark-2.4.7-bin-hadoop2.7.tgz to the other nodes:

scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node2:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node3:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node4:/data/tools/

  Use MobaXterm's "Write commands on all terminals" feature to extract spark-2.4.7-bin-hadoop2.7.tgz on every node, then copy all of node1's configuration files to the other nodes as follows:

scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node2:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node3:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node4:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/

Configuring the Spark 2.4.7 Environment Variables

  Using MobaXterm's "Write commands on all terminals" feature, configure the Spark environment variables on node1, node2, node3 and node4: run sudo vim /etc/profile and append the following at the end:

export SPARK_HOME=/data/tools/spark-2.4.7-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
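
  Refresh the profile and do a quick sanity check that the new PATH entry works (any node will do):
source /etc/profile
spark-submit --version   # should print the Spark 2.4.7 version banner
which spark-shell        # should resolve to /data/tools/spark-2.4.7-bin-hadoop2.7/bin/spark-shell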

Starting Spark 2.4.7

  1. Pick any node; here node1 is used for the demonstration. Start the Spark services as follows (a quick verification tip follows the command output below):
[hadoop@node1 conf]$ cd /data/tools/spark-2.4.7-bin-hadoop2.7/sbin/
[hadoop@node1 sbin]$ ll
total 92
-rwxr-xr-x. 1 hadoop hadoop 2803 Sep  8 13:48 slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1429 Sep  8 13:48 spark-config.sh
-rwxr-xr-x. 1 hadoop hadoop 5689 Sep  8 13:48 spark-daemon.sh
-rwxr-xr-x. 1 hadoop hadoop 1262 Sep  8 13:48 spark-daemons.sh
-rwxr-xr-x. 1 hadoop hadoop 1190 Sep  8 13:48 start-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1274 Sep  8 13:48 start-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 2050 Sep  8 13:48 start-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1877 Sep  8 13:48 start-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1423 Sep  8 13:48 start-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1279 Sep  8 13:48 start-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 3151 Sep  8 13:48 start-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1527 Sep  8 13:48 start-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1857 Sep  8 13:48 start-thriftserver.sh
-rwxr-xr-x. 1 hadoop hadoop 1478 Sep  8 13:48 stop-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1056 Sep  8 13:48 stop-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 1080 Sep  8 13:48 stop-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1227 Sep  8 13:48 stop-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1084 Sep  8 13:48 stop-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1067 Sep  8 13:48 stop-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1557 Sep  8 13:48 stop-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1064 Sep  8 13:48 stop-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1066 Sep  8 13:48 stop-thriftserver.sh
[hadoop@node1 sbin]$ pwd
/data/tools/spark-2.4.7-bin-hadoop2.7/sbin
[hadoop@node1 sbin]$ ./start-all.sh
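
  If start-all.sh comes back cleanly, a Master process should be running on node1 and a Worker on every host listed in slaves; a quick way to confirm, assuming the JDK's jps tool is on the PATH:
# On node1: expect a "Master" process in the list
jps

# On the workers (run locally or via ssh from node1): expect a "Worker" process
ssh node2 jps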
  2. Start spark-shell and spark-sql for interactive use.
  • First, test spark-sql: type spark-sql on node1, as shown below;
[hadoop@node1 ~]$ spark-sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
……

  It then loads for a while; if nothing errors out you are basically done (if something does, fix the corresponding issue first). Next, test a query against Hive data. I created a student table in Hive beforehand with only two columns, id and sname, and a single row in it;

Table 1: Records in the Hive table student
id    sname
1     rowyet

  Our spark-sql test query then behaves as follows:

spark-sql> select * from student;
20/12/14 02:27:33 INFO metastore.HiveMetaStore: 0: get_table : db=dw tbl=student
20/12/14 02:27:33 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr      cmd=get_table : db=dw tbl=student
20/12/14 02:27:33 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 558.8 KB, free 412.7 MB)
20/12/14 02:27:34 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 51.4 KB, free 412.7 MB)
20/12/14 02:27:34 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on node1:37157 (size: 51.4 KB, free: 413.7 MB)
20/12/14 02:27:34 INFO spark.SparkContext: Created broadcast 1 from
20/12/14 02:27:35 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/14 02:27:35 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 3 output partitions
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376)
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Missing parents: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376), which has no missing parents
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 8.2 KB, free 412.7 MB)
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.4 KB, free 412.7 MB)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on node1:37157 (size: 4.4 KB, free: 413.7 MB)
20/12/14 02:27:35 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1184
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1, 2))
20/12/14 02:27:35 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
20/12/14 02:27:35 INFO spark.ContextCleaner: Cleaned accumulator 36
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.144.138, executor 0, partition 0, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 192.168.144.140, executor 2, partition 1, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, 192.168.144.139, executor 1, partition 2, ANY, 7963 bytes)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.138:46066 (size: 4.4 KB, free: 1007.7 MB)
20/12/14 02:27:36 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.138:46066 (size: 51.4 KB, free: 1007.6 MB)
20/12/14 02:27:45 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.140:37936 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:28:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.140:37936 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:28:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 46439 ms on 192.168.144.138 (executor 0) (1/3)
20/12/14 02:28:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 59298 ms on 192.168.144.140 (executor 2) (2/3)
20/12/14 02:28:47 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.139:43240 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:29:01 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.139:43240 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:29:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 94410 ms on 192.168.144.139 (executor 1) (3/3)
20/12/14 02:29:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/12/14 02:29:09 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 94.516 s
20/12/14 02:29:09 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 94.576264 s
1       rowyet
Time taken: 97.013 seconds, Fetched 1 row(s)

  • Now test spark-shell: type spark-shell on node1. As shown below, it prints the Spark version banner and then drops into the Scala interactive shell;
[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:20:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/12/14 03:20:37 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://node1:4041
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214032040-0004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

  Next, check that spark-shell itself works correctly: initialize a DataFrame with a single column (the concept will be covered later in the series; this is just a test!) and display it, as follows:

[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:33:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://node1:4040
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214033402-0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val myRange = spark.range(1000).toDF("number")
myRange: org.apache.spark.sql.DataFrame = [number: bigint]

scala> myRange.show
+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
+------+
only showing top 20 rows



Spark Monitoring UI

  Spark's default monitoring UI is on port 8080: open http://node1:8080/ in a browser to see which jobs are running, which have finished, the cluster's resource usage, and so on.

Figure 3: The Spark monitoring UI
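
  Because spark.eventLog.enabled and SPARK_HISTORY_OPTS were configured earlier, you can also start the history server to browse finished applications; a sketch using the script from the sbin listing above (port 4000 comes from the -Dspark.history.ui.port setting in spark-env.sh):
# Start the history server on node1
/data/tools/spark-2.4.7-bin-hadoop2.7/sbin/start-history-server.sh
# Then open http://node1:4000/ to browse the event logs written to
# hdfs://dw-cluster:8020/opt/spark/applicationHistory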

Starting Spark on YARN

  Spark on YARN is the most widely used mode in the open-source world. It will be covered in detail later; this is just a teaser on how to launch it. The main parameters are as follows:

--master: where the master runs, i.e. the master address (default: local)
--deploy-mode: whether the driver runs inside the cluster on a worker/YARN (cluster mode) or locally (client mode)
--class: the main class of your application; used with spark-submit
--executor-memory: memory per executor
--executor-cores: number of cores per executor
--num-executors: number of executors (with the values below the job gets 20 cores and 32 GB of memory in total)
--queue: the YARN queue the job runs in
--conf: extra configuration in key=value format; if the value contains spaces, wrap it in quotes, e.g. "key=value"

# Launch spark-sql (interactive shells must run the driver in client deploy mode)
spark-sql --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m

# Launch spark-shell (interactive shells must run the driver in client deploy mode)
spark-shell --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m

# Submit a packaged application with spark-submit
spark-submit --master yarn --deploy-mode cluster --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m  --class cn.ruoying.esapp.App  hdfs:///app/hive_to_es/etl_jar/SparkOnHiveToEs_v1.jar
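
  Once a job has been submitted in cluster mode, the driver runs inside YARN rather than in your local console, so you track it with the YARN CLI; a couple of standard commands (the application id below is only a placeholder):
# List the applications currently running on YARN
yarn application -list

# Fetch the aggregated logs of an application (hypothetical id shown)
yarn logs -applicationId application_1607900000000_0001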

  And with that, the entire Spark 2.4.7 installation and configuration is complete!
