一、Spark是什么?
Spark是基于DAG(有向无环图)的内存计算引擎.
二、Spark的安装部署
Spark和MapReduce效率不同的原因:
Spark把运算的中间数据存放在内存,迭代计算效率更高;
MapReduce的中间结果需要落地,需要保存到磁盘,这样必然会有磁盘io操作,影响性能。
Standalone模式
单机(Master+Worker同时运行)
ly@hadoop00 spark-2.2.0-bin-hadoop2.7 % echo $SPARK_HOME
/Users/ly/apps/spark-2.2.0-bin-hadoop2.7
$SPARK_HOME/sbin/start-all.sh
ly@hadoop00 spark-2.2.0-bin-hadoop2.7 % jps
19984 Master
20015 Worker
ly@hadoop00 spark-2.2.0-bin-hadoop2.7 % netstat -an|grep 7077
tcp4 0 0 192.168.237.1.7077 192.168.237.1.61730 ESTABLISHED
tcp4 0 0 192.168.237.1.61730 192.168.237.1.7077 ESTABLISHED
tcp4 0 0 192.168.237.1.7077 *.* LISTEN
ly@hadoop00 spark-2.2.0-bin-hadoop2.7 % netstat -an|grep 8080
tcp46 0 0 *.8080 *.* LISTEN
tcp4 0 0 10.9.70.214.61305 109.244.128.37.8080 ESTABLISHED
Spark Standalone Master : spark://hadoop00:7077
webui http://hadoop00:8080
多机(1个Master+N个Worker)
Master节点配置:
# spark-env.sh
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_251.jdk/Contents/Home
#使spark运行在yarn上,必配,否则连不上YARN,并访问不了HDFS
export HADOOP_HOME=${HADOOP_HOME:-/Users/ly/apps/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/Users/ly/apps/hadoop/etc/hadoop}
# slaves
hadoop00
hadoop01
Slave节点配置:
# spark-env.sh
export JAVA_HOME=${JAVA_HOME:-/opt/jdk1.8.0_251}
export SPARK_MASTER_HOST=hadoop00
在Master上运行sbin/start-all.sh
ly@hadoop00 spark-2.2.0-bin-hadoop2.7 % sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /Users/ly/apps/spark-2.2.0-binhadoop2.7/logs/spark-ly-org.apache.spark.deploy.master.Master-1-hadoop00.out
hadoop00: starting org.apache.spark.deploy.worker.Worker, logging to /Users/ly/apps/spark-2.2.0-
bin-hadoop2.7/logs/spark-ly-org.apache.spark.deploy.worker.Worker-1-hadoop00.out
hadoop01: starting org.apache.spark.deploy.worker.Worker, logging to /Users/ly/apps/spark-2.2.0-
bin-hadoop2.7/logs/spark-ly-org.apache.spark.deploy.worker.Worker-1-hadoop01.out
ly@hadoop00 spark-2.2.0-bin-hadoo