I've been meaning to set up spark-yarn-cluster on my own. The environment at work was installed by someone else, and without having gone through it once myself I never felt fully confident. Another important reason: the company servers are fairly powerful, so Spark jobs rarely expose performance problems or out-of-memory errors there, which actually makes it harder to learn Spark in depth.
0. Environment
hp ProBook, intel core i3-4030u 1.9GHz, nominally quad-core (actually dual-core with hyper-threading), 8G RAM;
Windows 7 Ultimate;
VirtualBox 5.0.16 r105871, with the bundled Guest Additions;
VMs installed from CentOS-7-x86_64-DVD-1511;
jdk-8u74-linux-x64.tar.gz;
scala-2.11.8.tgz;
hadoop-2.6.4.tar.gz;
spark-1.6.1-bin-hadoop2.6.tgz;
Three VMs in total: hmaster, hslave1, hslave2, each with 4G dynamically allocated RAM and a 30G disk;
The walkthrough is organized into the following parts, assuming VirtualBox and the VMs are already installed:
- ssh login;
- Oracle JDK;
- Scala 2.11.8;
- Hadoop installation;
- Spark installation;
- troubleshooting;
1. ssh login
note: the steps below must be executed on every VM
1) Edit each VM's hosts file: /etc/hosts
192.168.56.103 hmaster
192.168.56.104 hslave1
192.168.56.105 hslave2
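Rather than editing /etc/hosts by hand on all three VMs, the entries above can be appended in one idempotent step. A minimal sketch; the file path is a parameter so it can be dry-run against a scratch file before touching the real /etc/hosts:

```shell
#!/bin/sh
# Append the cluster entries to a hosts file, skipping any already present,
# so the function is safe to re-run. Pass the target path as an argument,
# e.g.:  add_cluster_hosts /etc/hosts   (as root)
add_cluster_hosts() {
  hosts_file="$1"
  for entry in "192.168.56.103 hmaster" \
               "192.168.56.104 hslave1" \
               "192.168.56.105 hslave2"; do
    grep -qxF "$entry" "$hosts_file" || echo "$entry" >> "$hosts_file"
  done
}
```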
2) Create the user: hadoop
# useradd hadoop
# passwd hadoop
3) Add the user to the wheel group (requires root)
# usermod -a -G wheel hadoop
4) Grant sudo privileges (as root; visudo checks the syntax before saving)
# visudo
Enable one of the two wheel lines (the second one skips the password prompt):
%wheel ALL=(ALL) ALL
%wheel ALL=(ALL) NOPASSWD: ALL
5) File permissions (as root; hadoop's home is spelled out because ~ would expand to /root here)
# chown -R hadoop /home/hadoop
# chown -R hadoop /opt
# chmod -R 775 /home/hadoop
# chmod -R 775 /opt
6) Switch to the hadoop user
# su hadoop
7) Passwordless ssh login
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hmaster
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave2
$ chmod 0600 ~/.ssh/authorized_keys
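Typing the three ssh-copy-id lines on every one of the three VMs gets tedious. A small sketch that emits the commands for all nodes (hostnames are the ones defined in /etc/hosts above); the output can be reviewed and then piped to sh once the key pair exists:

```shell
#!/bin/sh
# Emit one ssh-copy-id command per cluster node; review, then run with:
#   gen_keycopy_cmds | sh
# Hostnames match the /etc/hosts entries set up earlier.
gen_keycopy_cmds() {
  for host in hmaster hslave1 hslave2; do
    echo "ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$host"
  done
}
```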
note: I first followed Installing Hadoop 2.6.0 on CentOS 7, but it omits the privilege-granting part, which left a trap; steps 3), 4), and 5) were filled in later from How to setup Keyless SSH with non root users in CentOS.
2. Install the Oracle JDK
1) On hmaster, unpack the JDK (note the tarball extracts to jdk1.8.0_74, not the archive name):
# cd /opt
# tar -zxf jdk-8u74-linux-x64.tar.gz
# mv jdk1.8.0_74 jdk
2) Copy it to the slave VMs
# scp -r jdk hslave1:/opt
# scp -r jdk hslave2:/opt
3) Run alternatives on every VM
# alternatives --install /usr/bin/java java /opt/jdk/bin/java 2
# alternatives --config java # pick the number for /opt/jdk/bin/java
# alternatives --install /usr/bin/jar jar /opt/jdk/bin/jar 2
# alternatives --install /usr/bin/javac javac /opt/jdk/bin/javac 2
# alternatives --set jar /opt/jdk/bin/jar
# alternatives --set javac /opt/jdk/bin/javac
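The same install/set dance repeats for each tool on each VM. A sketch that emits all the alternatives commands in one go; it uses `--set` for java as well, instead of the interactive `--config`, and assumes the /opt/jdk layout from step 1):

```shell
#!/bin/sh
# Emit the alternatives install/set commands for each JDK tool.
# Review, then run the output as root on each VM:
#   gen_alternatives_cmds | sh
# Assumes the JDK was moved to /opt/jdk as in step 1).
gen_alternatives_cmds() {
  for tool in java jar javac; do
    echo "alternatives --install /usr/bin/$tool $tool /opt/jdk/bin/$tool 2"
    echo "alternatives --set $tool /opt/jdk/bin/$tool"
  done
}
```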
Finally, run java -version to confirm the installation.
4) Set environment variables
# vi /etc/bashrc
export JAVA_HOME=/opt/jdk
export JRE_HOME=/opt/jdk/jre
export PATH=$PATH:/opt/jdk/bin:/opt/jdk/jre/bin
alias ll='ls -l --color'
alias cp='cp -i'
alias mv='mv -i'
alias rm='rm -i'
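After sourcing the new /etc/bashrc, it is worth a quick sanity check that the variables actually resolve to real binaries before moving on to Hadoop. A minimal sketch; the JDK path defaults to $JAVA_HOME but can be passed explicitly:

```shell
#!/bin/sh
# Verify that a JDK directory (default: $JAVA_HOME) contains executable
# java and javac binaries; prints a message and returns non-zero if not.
check_java_env() {
  java_home="${1:-$JAVA_HOME}"
  [ -x "$java_home/bin/java" ]  || { echo "java not found under $java_home"; return 1; }
  [ -x "$java_home/bin/javac" ] || { echo "javac not found under $java_home"; return 1; }
  echo "JDK ok: $java_home"
}

# Usage on a VM after `source /etc/bashrc`:
#   check_java_env          # checks $JAVA_HOME, i.e. /opt/jdk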