Environment: Ubuntu 16.04 (virtual machines)
Distributed cluster:
192.168.159.128 vm01
192.168.159.129 vm02
192.168.159.130 vm03
For a single-node (pseudo-distributed) setup, simply replace the other nodes' hostnames with the single node's hostname throughout the Hadoop configuration.
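For the hostnames above (vm01, vm02, vm03) to resolve, every node needs matching entries in /etc/hosts. A minimal sketch, using the addresses from this walkthrough:

```shell
# Append the cluster's hostname-to-IP mapping to /etc/hosts on every node
sudo tee -a /etc/hosts <<'EOF'
192.168.159.128 vm01
192.168.159.129 vm02
192.168.159.130 vm03
EOF
```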
Mirror: Aliyun
After creating a new VM it is best to switch the package mirror; dependency downloads are much faster.
To switch sources:
cd /etc/apt/
sudo cp sources.list sources.list.bak
sudo vim sources.list
sudo apt-get update
Aliyun sources:
deb http://mirrors.aliyun.com/ubuntu/ xenial main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse
## proposed (pre-release) sources
deb http://mirrors.aliyun.com/ubuntu/ xenial-proposed main restricted universe multiverse
# source packages
deb-src http://mirrors.aliyun.com/ubuntu/ xenial main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse
## proposed (pre-release) sources
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-proposed main restricted universe multiverse
# Canonical partner and extras
deb http://archive.canonical.com/ubuntu/ xenial partner
deb http://extras.ubuntu.com/ubuntu/ xenial main
Grant the current user sudo privileges (the user in this walkthrough is hadoop):
sudo adduser hadoop sudo
Passwordless SSH login
ssh localhost
On the first connection SSH prints a prompt; type yes, then enter your password to log in to the local machine.
Exit that ssh session to return to the original terminal, then generate a key pair with ssh-keygen and add it to the authorized keys:
exit                                    # leave the ssh localhost session
cd ~/.ssh/                              # if this directory is missing, run ssh localhost once first
ssh-keygen -t rsa                       # press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys   # authorize the new public key
Now `ssh localhost` logs in without asking for a password.
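For the distributed cluster, the master also needs passwordless access to the workers. A sketch, assuming the hadoop user already exists on vm02 and vm03 (ssh-copy-id appends your public key to the remote authorized_keys):

```shell
# Run on vm01 after generating the key pair above.
# Each worker's password is asked for once.
ssh-copy-id hadoop@vm02
ssh-copy-id hadoop@vm03
# Afterwards these should run without a password prompt:
ssh hadoop@vm02 hostname
ssh hadoop@vm03 hostname
```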
Setting up Java
Extract the downloaded JDK archive jdk-8u161-linux-x64.tar.gz and add the environment variables to /etc/profile:
#JAVA_HOME
export JAVA_HOME=/opt/modules/jdk1.8.0_161   # set to your own install path
export PATH=$PATH:$JAVA_HOME/bin
Save and exit, then run `source /etc/profile` to make the variables take effect.
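A quick check that the variables took effect (assuming the install path above):

```shell
# Reload the profile and confirm the JDK is on PATH
source /etc/profile
echo "$JAVA_HOME"   # should print the JDK directory you configured
java -version       # should report the installed JDK version
```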
Setting up Hadoop
Download the matching Hadoop release; this walkthrough uses Hadoop 2.6.4.
Extract the archive, then add Hadoop's environment variables to /etc/profile:
export HADOOP_HOME=/opt/hadoop-2.6.4
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Edit the following files under $HADOOP_HOME/etc/hadoop/:
(1) hadoop-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_161
(2) yarn-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_161
(3) mapred-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_161
(4) core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://vm01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-2.6.4/tmp</value>
</property>
<property>
<name>fs.checkpoint.period</name>
<value>3600</value>
</property>
<property>
<name>fs.checkpoint.size</name>
<value>67108864</value>
</property>
</configuration>
The fs.defaultFS value must point at the NameNode host (use this machine's hostname or IP). The hadoop.tmp.dir directory has to be created manually.
(5) hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.http.address</name>
<value>vm01:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>vm02:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop-2.6.4/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop-2.6.4/data</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
The vm01 and vm02 values above are this cluster's hostnames: dfs.http.address points at the NameNode and dfs.namenode.secondary.http-address at the SecondaryNameNode. Replace them with the hostnames of the corresponding nodes in your cluster (for pseudo-distributed, the single node's hostname), and likewise in the files below.
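The local directories referenced in core-site.xml and hdfs-site.xml (hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir) are not created automatically. A sketch, assuming the /opt/hadoop-2.6.4 layout used above:

```shell
# Create the tmp, name and data directories on every node.
# HADOOP_DIR is just a convenience variable for the paths configured above.
HADOOP_DIR=${HADOOP_DIR:-/opt/hadoop-2.6.4}
mkdir -p "$HADOOP_DIR/tmp" "$HADOOP_DIR/name" "$HADOOP_DIR/data"
# The directories must be writable by the user that runs Hadoop:
chown -R "$(whoami)" "$HADOOP_DIR" 2>/dev/null || true
ls "$HADOOP_DIR"
```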
(6)yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>vm01</value>
</property>
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Log retention time (in seconds) -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>106800</value>
</property>
<!--
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>vm02:8088</value>
</property>
-->
<property>
<name>yarn.resourcemanager.address</name>
<value>vm01:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>vm01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>vm01:8031</value>
</property>
<!-- Do not enforce virtual-memory limits on MR containers -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<!-- Virtual-to-physical memory ratio (default is 2.1) -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
</configuration>
(7) mapred-site.xml (if only mapred-site.xml.template exists, copy it to mapred-site.xml first)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!--
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx1024m -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8883</value>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx1024m -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8884</value>
</property>
-->
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>1</value>
</property>
</configuration>
(8) Format HDFS (run once, on the NameNode)
bin/hdfs namenode -format
(9) Start Hadoop
sbin/start-all.sh
Use jps, or the web UI, to check whether the NameNode, DataNode, ResourceManager and NodeManager processes started successfully.
Single node:
[hadoop@hadoop]$ jps
82334 DataNode
82757 NodeManager
82874 Jps
82248 NameNode
82507 ResourceManager
Distributed:
hadoop@vm01:~$ jps
2355 Jps
1685 ResourceManager
1434 NameNode
hadoop@vm02:~$ jps
1925 Jps
1338 DataNode
1418 SecondaryNameNode
hadoop@vm03:~$ jps
1717 Jps
1303 DataNode
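Beyond jps, two quick checks (the ports are the ones configured above, which are the Hadoop 2.x defaults):

```shell
# On the NameNode host: live DataNodes should match the number of workers
hdfs dfsadmin -report
# Web UIs (from a browser):
#   http://vm01:50070  - HDFS NameNode (dfs.http.address above)
#   http://vm01:8088   - YARN ResourceManager
```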
Hadoop setup is now complete.
Setting up Spark
The Spark build must match the installed Hadoop version, and because Spark is written in Scala, download the matching Scala version as well.
Download the Spark package and extract it:
sudo tar -zxvf spark-2.1.0-bin-hadoop2.6.tgz -C /opt/
sudo mv /opt/spark-2.1.0-bin-hadoop2.6/ /opt/spark-2.1.0   # rename the directory
Add Spark's environment variables to /etc/profile:
#SPARK_HOME
export SPARK_HOME=/opt/spark-2.1.0
export PATH=$PATH:${SPARK_HOME}/bin
Configure Spark:
cd /opt/spark-2.1.0/conf/
cp spark-env.sh.template spark-env.sh
sudo vi spark-env.sh
Append the following to the end of spark-env.sh:
export JAVA_HOME=/opt/modules/jdk1.8.0_161
export SPARK_WORKER_MEMORY=1g
export SPARK_MASTER_IP=vm01
export HADOOP_HOME=/opt/hadoop-2.6.4
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Configure the slave hosts:
cp slaves.template slaves
sudo vim slaves
Add the worker hostnames:
vm02
vm03
Distribute the configured Spark directory to every slave node (repeat for each worker, e.g. vm03):
scp -r /opt/spark-2.1.0 hadoop@vm02:/opt/
Verify the installation by starting the Spark daemons (note this is Spark's own start-all.sh, run from $SPARK_HOME):
sbin/start-all.sh
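Beyond starting the daemons, a functional check is to run the bundled SparkPi example against the standalone master (7077 and 8080 are the standalone-mode defaults; paths assume the layout above):

```shell
cd /opt/spark-2.1.0
# Submit the bundled example to the standalone master on vm01
MASTER=spark://vm01:7077 ./bin/run-example SparkPi 10
# The finished job and the vm02/vm03 workers should appear
# in the master's web UI at http://vm01:8080
```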