Setting up a big-data environment starts with building a Hadoop cluster. As a distributed storage and compute platform, Hadoop offers three installation modes (per the official documentation):
- Standalone (local) mode: mainly for debugging MapReduce code on your own machine.
- Pseudo-distributed mode: all daemons run as separate processes on a single machine, simulating a real distributed environment.
- Fully distributed mode: the cluster is deployed across multiple machines (or multiple virtual machines).
Prerequisites
Install the JDK
a) Download jdk-8u65-linux-x64.tar.gz
b) Extract it
tar -xzvf jdk-8u65-linux-x64.tar.gz
c) Create the /soft directory
d) Move the extracted files into /soft
e) Create a symbolic link
ln -s /soft/jdk1.8.0_65 /soft/jdk
f) Verify the JDK installation
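Step e) exists so that everything downstream (PATH, JAVA_HOME) can reference a stable path while the JDK version changes underneath. A minimal sketch of the pattern, run in a scratch directory rather than /soft so no root is needed (the 1.8.0_91 upgrade target is hypothetical):

```shell
# Demo of the version-agnostic symlink pattern from steps c)-e)
soft=$(mktemp -d)
mkdir -p "$soft/jdk1.8.0_65/bin"         # stands in for the extracted tarball
ln -s "$soft/jdk1.8.0_65" "$soft/jdk"    # stable name -> versioned directory
readlink "$soft/jdk"
# a later upgrade (hypothetical version) is just re-pointing one link:
ln -sfn "$soft/jdk1.8.0_91" "$soft/jdk"
readlink "$soft/jdk"
```

Because PATH and JAVA_HOME only ever mention /soft/jdk, swapping JDK versions never requires touching /etc/profile again.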
Configure environment variables
1. Edit /etc/profile
sudo nano /etc/profile
export JAVA_HOME=/soft/jdk
export PATH=$PATH:$JAVA_HOME/bin
2. Make the changes take effect immediately
source /etc/profile
3. From any directory, test that it works
$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
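A quick way to confirm that both exports are in effect for the current shell (a sketch; /soft/jdk is the path configured above):

```shell
# Verify JAVA_HOME and PATH after sourcing /etc/profile
export JAVA_HOME=/soft/jdk
export PATH=$PATH:$JAVA_HOME/bin
echo "$JAVA_HOME"
case ":$PATH:" in
  *":$JAVA_HOME/bin:"*) echo "PATH ok" ;;
  *) echo "PATH missing \$JAVA_HOME/bin" ;;
esac
```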
Install Hadoop
1. Install Hadoop
a) Download hadoop-2.7.3.tar.gz
b) Extract it
tar -xzvf hadoop-2.7.3.tar.gz
c) Move the extracted files into /soft
d) Create a symbolic link
ln -s /soft/hadoop-2.7.3 /soft/hadoop
e) Verify the Hadoop installation
$ cd /soft/hadoop/bin
$ ./hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
2. Configure the Hadoop environment variables
$>sudo nano /etc/profile
...
export JAVA_HOME=/soft/jdk
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/soft/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
3. Make the changes take effect
$>source /etc/profile
Standalone mode
No separate Hadoop daemons need to be started; jobs run locally in a single JVM, which is what makes this mode convenient for debugging MapReduce code.
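Standalone mode still runs real MapReduce jobs against ordinary local files. A smoke-test sketch using the examples jar that ships with the 2.7.3 release (the hadoop command itself is only echoed here, since it requires the installation above; the input file and output path are illustrative):

```shell
# Prepare a local input file for a wordcount smoke test
mkdir -p input
printf 'hello hadoop\nhello world\n' > input/words.txt
# The job to run on the installed machine; in standalone mode it reads and
# writes plain local files -- no HDFS, no daemons:
echo 'hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output'
# the counts would land in output/part-r-00000
```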
Pseudo-distributed mode
a) Go to the ${HADOOP_HOME}/etc/hadoop directory
b) Edit core-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
c) Edit hdfs-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
d) Edit mapred-site.xml
Note: this file does not exist by default; create it from the template first: cp mapred-site.xml.template mapred-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
e) Edit yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
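A malformed site file makes every daemon die at startup with a parse error, so it is worth checking well-formedness right after editing. A sketch that validates a scratch copy of the core-site.xml above with Python's bundled XML parser:

```shell
# Well-formedness check for a Hadoop site file (demo on a scratch copy)
dir=$(mktemp -d)
cat > "$dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
EOF
# a broken file raises ParseError here instead of printing the root tag
python3 -c 'import sys, xml.etree.ElementTree as ET
print(ET.parse(sys.argv[1]).getroot().tag)' "$dir/core-site.xml"
```

Run the same one-liner against each of the four files before starting anything.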
Configure SSH
1) Check that the SSH packages are installed (openssh-server + openssh-clients + openssh)
$>yum list installed | grep ssh
2) Check that the sshd process is running
$>ps -Af | grep sshd
3) Generate a public/private key pair on the client side
$>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
4) This creates the ~/.ssh directory, containing id_rsa (private key) + id_rsa.pub (public key)
5) Append the public key to ~/.ssh/authorized_keys (the file name and location are fixed)
$>cd ~/.ssh
$>cat id_rsa.pub >> authorized_keys
6) Change the permissions of authorized_keys to 644
$>chmod 644 authorized_keys
7) Test
$>ssh localhost
Last login: Tue Jul 28 09:10:26 2020
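Steps 3) ~ 6) can be rehearsed in a scratch directory before touching the real ~/.ssh; same flags, disposable paths:

```shell
# Rehearse the key setup against a scratch directory instead of ~/.ssh
d=$(mktemp -d)
ssh-keygen -t rsa -P '' -f "$d/id_rsa" -q     # step 3): keypair, empty passphrase
cat "$d/id_rsa.pub" >> "$d/authorized_keys"   # step 5): append the public key
chmod 644 "$d/authorized_keys"                # step 6): StrictModes rejects writable files
ls "$d"
```

The permission step matters: with the default StrictModes, sshd silently ignores an authorized_keys file that is group- or world-writable.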
Fully distributed mode
1. Prepare 3 (or more) machines.
2. Edit the hostname and hosts files
[/etc/hostname]
s202
[/etc/hosts]
192.168.156.1 s201
192.168.156.2 s202
192.168.156.3 s203
192.168.156.4 s204
3. Configure a static IP address
4. Restart the network service
$>sudo service network restart
5. Repeat steps 2 ~ 4 on each remaining machine
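The hosts entries in step 2 follow a single pattern, so they can be generated instead of typed. A sketch (IP prefix 192.168.156 taken from the text; s204 assumed because the later steps address it; leading zeros dropped since some resolvers parse them as octal):

```shell
# Generate the /etc/hosts lines for s201..s204
for i in 1 2 3 4; do
  printf '192.168.156.%d s20%d\n' "$i" "$i"
done
```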
Prepare SSH and the cluster configuration for the fully distributed setup
-------------------------
1. Delete /home/centos/.ssh/* on every host
2. Generate a key pair on host s201
$>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
3. Copy s201's public key file id_rsa.pub to hosts s201 ~ s204,
placing it at /home/centos/.ssh/authorized_keys
$>scp id_rsa.pub centos@s201:/home/centos/.ssh/authorized_keys
$>scp id_rsa.pub centos@s202:/home/centos/.ssh/authorized_keys
$>scp id_rsa.pub centos@s203:/home/centos/.ssh/authorized_keys
$>scp id_rsa.pub centos@s204:/home/centos/.ssh/authorized_keys
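The four scp commands differ only in the host name, so a loop makes the pattern explicit (the commands are echoed rather than executed here, since the hosts only exist on the cluster). In practice, ssh-copy-id centos@HOST accomplishes the same thing and fixes permissions as a bonus:

```shell
# Generate step 3's copy commands, one per host
for h in s201 s202 s203 s204; do
  echo "scp id_rsa.pub centos@$h:/home/centos/.ssh/authorized_keys"
done
```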
4. Configure the fully distributed setup (${HADOOP_HOME}/etc/hadoop/)
[core-site.xml]
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://s201/</value>
</property>
<!-- configure a new local data directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/centos/hadoop</value>
</property>
</configuration>
[hdfs-site.xml]
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
[mapred-site.xml]
Unchanged (same as in pseudo-distributed mode)
[yarn-site.xml]
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>s201</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
[slaves]
s202
s203
s204
[hadoop-env.sh]
...
export JAVA_HOME=/soft/jdk
5. Distribute the configuration. The files above are kept as a separate config set in /soft/hadoop/etc/full; if you edited them in place under etc/hadoop, copy them out first (cp -r hadoop full) so each mode keeps its own directory.
$>cd /soft/hadoop/etc/
$>scp -r full centos@s202:/soft/hadoop/etc/
$>scp -r full centos@s203:/soft/hadoop/etc/
$>scp -r full centos@s204:/soft/hadoop/etc/
6. Delete the symbolic links
$>cd /soft/hadoop/etc
$>rm hadoop
(if etc/hadoop is still the original directory rather than a symlink, remove it with rm -rf hadoop instead)
$>ssh s202 rm /soft/hadoop/etc/hadoop
$>ssh s203 rm /soft/hadoop/etc/hadoop
$>ssh s204 rm /soft/hadoop/etc/hadoop
7. Create the symbolic links
$>cd /soft/hadoop/etc/
$>ln -s full hadoop
$>ssh s202 ln -s /soft/hadoop/etc/full /soft/hadoop/etc/hadoop
$>ssh s203 ln -s /soft/hadoop/etc/full /soft/hadoop/etc/hadoop
$>ssh s204 ln -s /soft/hadoop/etc/full /soft/hadoop/etc/hadoop
8. Delete the temporary directory files
$>cd /tmp
$>rm -rf hadoop-centos
$>ssh s202 rm -rf /tmp/hadoop-centos
$>ssh s203 rm -rf /tmp/hadoop-centos
$>ssh s204 rm -rf /tmp/hadoop-centos
9. Delete the Hadoop logs
$>cd /soft/hadoop/logs
$>rm -rf *
$>ssh s202 rm -rf /soft/hadoop/logs/*
$>ssh s203 rm -rf /soft/hadoop/logs/*
$>ssh s204 rm -rf /soft/hadoop/logs/*
10. Format the file system
$>hdfs namenode -format
(hadoop namenode -format still works but is deprecated in favor of the hdfs command)
11. Start the Hadoop daemons
$>start-all.sh
(start-all.sh is deprecated; running start-dfs.sh followed by start-yarn.sh is the recommended equivalent)
To check whether the Hadoop processes across the cluster have started, you can write a small script
[xcall.sh] script
#!/bin/bash
# run the given command on every host in the cluster (s201 ~ s204)
params="$@"
for (( i=201 ; i <= 204 ; i = $i + 1 )) ; do
  tput setaf 2
  echo ============= s$i =============
  tput setaf 7
  ssh -4 s$i "source /etc/profile ; $params"
done
Check the processes via the script
$ xcall.sh jps
With the configuration above you should see NameNode, SecondaryNameNode and ResourceManager on s201, and DataNode and NodeManager on s202 ~ s204.
This post covered installing Hadoop in its three modes. These setups are mainly suited to building a local test environment on your own machine; real production clusters generally run in an HA configuration backed by ZooKeeper, which I won't go into here. I'll fill that gap in a later post once we get to ZooKeeper.
This article was put together from the notes I kept while first learning big data. If anything falls short, corrections and suggestions are welcome.
Next up, I plan to share a post on how to write MapReduce jobs, and on how MR jobs actually run under the hood.