Spark Part 1: Basic Environment Setup

stay_running

Tags: spark

Article link: https://blog.csdn.net/weixin_38602383/article/details/94337614

Why Use Spark

[Image not preserved]

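Installing the Spark Environment

CentOS 6.5
Server mode
Static IP address configuration
vi /etc/sysconfig/network-scripts/ifcfg-eth0
The main entries to edit are IPADDR, NETMASK, GATEWAY, BOOTPROTO, and DNS1.
DNS resolution configuration
vi /etc/resolv.conf
Add a line of the form "nameserver DNS-server-address" (keyword and address separated by a space, not an equals sign).
Configure /etc/hosts so that the machines in the cluster can identify each other by name.
Upload the Spark archive (1.5.0) and the Hadoop archive (2.7.1).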
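
As a rough illustration of the three files mentioned above, the following sketch writes sample contents into a scratch directory rather than into /etc, so nothing on the system is changed. Every address and host name in it (192.168.1.x, spark-master, spark-worker1) is a made-up placeholder, not a value from the original setup.

```shell
# Sketch only: sample contents go to a scratch directory, not to /etc.
# All addresses and host names below are hypothetical placeholders.
demo_dir="$(mktemp -d)"

# Static IP settings, as they might appear in ifcfg-eth0.
cat > "$demo_dir/ifcfg-eth0" <<'EOF'
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.101
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
DNS1=192.168.1.1
EOF

# resolv.conf separates the keyword and the address with a space.
cat > "$demo_dir/resolv.conf" <<'EOF'
nameserver 192.168.1.1
EOF

# hosts entries so the cluster machines can resolve each other by name.
cat > "$demo_dir/hosts" <<'EOF'
127.0.0.1      localhost
192.168.1.101  spark-master
192.168.1.102  spark-worker1
EOF

echo "Sample files written to $demo_dir"
```

After reviewing the samples, the same lines can be adapted by hand in the real files with vi, as described above.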

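Tips:
It is best to finish setting up one machine first and then clone the virtual machine, which saves configuration and upload time.
VMware was used as the virtualization software. With VirtualBox, communication between the virtual machines suffered severe packet loss, for reasons unknown.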

Setting Up Spark

Configuring Spark's local environment variables
cp conf/spark-env.sh.template conf/spark-env.sh
vi conf/spark-env.sh
[Image not preserved]
The main setting to change is SPARK_LOCAL_IP.
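
The screenshot of the edited file is not preserved, so the following is only a minimal sketch of what the change might look like. It assumes the machine's address is 192.168.1.101 (a placeholder) and writes to a scratch file instead of the real conf/spark-env.sh.

```shell
# Sketch only: works on a scratch file so a real conf/spark-env.sh is untouched.
# 192.168.1.101 is a hypothetical address; use the machine's own IP.
local_env="$(mktemp)"

# spark-env.sh is a shell script sourced by Spark's launch scripts,
# so a setting is an ordinary exported variable assignment.
echo 'export SPARK_LOCAL_IP=192.168.1.101' >> "$local_env"

cat "$local_env"
```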

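Spark Shell

The Spark shell is a very good tool for learning Spark. It can execute Scala scripts and run Spark programs. The figure below shows what the Spark shell is used for:
[Image not preserved]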

Connecting to a Local Spark Service

bin/spark-shell --master local
The master URL passed to Spark can be in one of the following formats:
local: Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K]: Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*]: Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT: Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT: Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://…
yarn-client: Connect to a YARN cluster in client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
yarn-cluster: Connect to a YARN cluster in cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
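
To make the formats above concrete, here is a small sketch of a helper that assembles a master URL and prints the matching spark-shell command instead of running it. The function name master_url and the address 192.168.1.101 are illustrative assumptions; only the URL formats and default ports come from the list above.

```shell
# Sketch only: master_url is a hypothetical helper, not part of Spark.
# It prints a master URL in one of the formats listed above.
master_url() {
  local mode="$1" arg="${2:-}"
  case "$mode" in
    local)
      # No argument: a single worker thread; otherwise local[K] or local[*].
      if [ -z "$arg" ]; then echo "local"; else echo "local[$arg]"; fi ;;
    standalone)
      # Standalone master on the default port 7077.
      echo "spark://$arg:7077" ;;
    mesos)
      # Mesos master on the default port 5050.
      echo "mesos://$arg:5050" ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1 ;;
  esac
}

# Print (not run) the commands for a few common cases.
# local[*] is wrapped in single quotes so the shell does not treat it as a glob.
echo "bin/spark-shell --master $(master_url local)"
echo "bin/spark-shell --master '$(master_url local '*')'"
echo "bin/spark-shell --master $(master_url standalone 192.168.1.101)"
```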

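Connecting to a Spark Cluster Service (requires HDFS)

All cluster configuration below refers to standalone mode.
Configure the Hadoop environment (pseudo-distributed):
Set JAVA_HOME: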
vi etc/hadoop/hadoop-env.sh
vi etc/hadoop/core-site.xml
[Image not preserved]
vi etc/hadoop/hdfs-site.xml
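
The screenshot of these files is not preserved. As a hedged sketch, the values below are the usual ones for a pseudo-distributed Hadoop 2.x setup (fs.defaultFS pointing at localhost:9000 and a replication factor of 1); the original post may have used different values, for example the machine's IP instead of localhost. The sketch writes to a scratch directory, not to etc/hadoop.

```shell
# Sketch only: sample files go to a scratch directory, not to etc/hadoop.
# The values are the common pseudo-distributed defaults and are an assumption
# about what the lost screenshot showed.
conf_dir="$(mktemp -d)"

# core-site.xml: where clients find the HDFS NameNode.
cat > "$conf_dir/core-site.xml" <<'EOF'
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF

# hdfs-site.xml: a single machine can only hold one replica of each block.
cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF

echo "Sample files written to $conf_dir"
```

If other machines need to reach this HDFS instance, localhost would likely have to be replaced with an address they can resolve.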

Passwordless SSH to the Local Machine

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
ssh localhost

If other machines also need passwordless SSH,
reuse the key pair generated in the previous step and run:
ssh-copy-id root@target-machine

Formatting the File System and Starting the Services

bin/hdfs namenode -format
sbin/start-dfs.sh

Creating New Directories and Adding a File

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/spark
bin/hdfs dfs -put /root/spark-1.5.0-bin-hadoop2.6/README.md /user/spark
bin/hdfs dfs -ls /user/spark

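Common HDFS problems:
Use jps to list the background Java processes and make sure all HDFS processes have started.
After the file system has been reformatted, some processes may fail to start. The fix is to stop the services, delete the name, data, and other node data under /tmp/hadoop-root (rm -fr), and then restart the services.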
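
The jps check above can be automated with a small helper. This is a sketch under the assumption that a pseudo-distributed start-dfs.sh launches the NameNode, DataNode, and SecondaryNameNode daemons; the function name check_hdfs_daemons is made up for illustration.

```shell
# Sketch only: check_hdfs_daemons is a hypothetical helper.
# It reads jps output on stdin and reports any HDFS daemon that is missing.
check_hdfs_daemons() {
  local jps_output missing=0 daemon
  jps_output="$(cat)"
  for daemon in NameNode DataNode SecondaryNameNode; do
    # jps prints "<pid> <main class>", so match the full class name
    # to avoid NameNode also matching SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" | grep -Eq "^[0-9]+ ${daemon}\$"; then
      echo "missing: $daemon"
      missing=1
    fi
  done
  return "$missing"
}

# Typical use on the Hadoop machine (requires a JDK on PATH):
#   jps | check_hdfs_daemons && echo "HDFS daemons are up"
```

The function returns a non-zero status when any daemon is missing, so it can be chained with && or used in an if statement.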

Making the Spark Cluster Aware of HDFS

vi conf/spark-env.sh

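[Image not preserved]
**Note:** The final SPARK_MASTER setting must be given as an IP address; otherwise slaves on other machines cannot reach port 7077. You can check the background master process with ps -fe | grep java.

If the --ip argument shown for the master process is an IP address, there is no problem. If it is a host name, slaves on other machines cannot connect to port 7077. Likewise, when spark-shell or a slave connects to the cluster, the Spark address must be spark://ip:7077 rather than spark://hostname:7077.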
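
Since the screenshot of the file is not preserved, the following is only a guess at the kind of entries it contained. SPARK_MASTER_IP and HADOOP_CONF_DIR are real Spark 1.x settings, but the IP address and the Hadoop path are placeholders, and the sketch writes to a scratch file rather than the real conf/spark-env.sh.

```shell
# Sketch only: writes to a scratch file instead of a real conf/spark-env.sh.
# The IP address and the Hadoop path are hypothetical placeholders.
cluster_env="$(mktemp)"

cat >> "$cluster_env" <<'EOF'
# Bind the standalone master to an IP address rather than a host name.
export SPARK_MASTER_IP=192.168.1.101
# Directory holding core-site.xml and hdfs-site.xml, so Spark can find HDFS.
export HADOOP_CONF_DIR=/root/hadoop-2.7.1/etc/hadoop
EOF

cat "$cluster_env"
```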
Start the Spark cluster, using one master and one worker as an example:

# Start the master
sbin/start-master.sh
# Start a worker
sbin/start-slave.sh spark://ip:7077
# Start the history server to help with analysis and debugging
sbin/start-history-server.sh
# Connect to the cluster
bin/spark-shell --master spark://ip:7077

There is also a simpler way to start all worker nodes at once:

cp conf/slaves.template conf/slaves
vi conf/slaves

Add the host names of all workers to this file.
Note: this only works if the master can SSH to every worker without a password.
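
As a brief illustration of the file format, the sketch below writes a sample worker list to a scratch file (not to conf/slaves) and counts the entries. The host names spark-worker1 and spark-worker2 are invented placeholders.

```shell
# Sketch only: writes a sample to a scratch file, not to conf/slaves.
# spark-worker1 and spark-worker2 are hypothetical host names.
slaves_file="$(mktemp)"

# One worker host per line; lines starting with # are comments.
cat > "$slaves_file" <<'EOF'
# Spark workers
spark-worker1
spark-worker2
EOF

# Count the entries that are neither blank nor comments as a sanity check.
worker_count="$(grep -Evc '^[[:space:]]*(#|$)' "$slaves_file")"
echo "workers listed: $worker_count"
```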

Stopping and Restarting the Services

sbin/stop-all.sh
sbin/start-all.sh
