Preparation
Environment setup
Download links:
VM12: https://pan.baidu.com/s/1hsvkHe8 password: ycan
CentOS 7.0: https://pan.baidu.com/s/1nvUmu05 password: ktqh
jdk1.8: https://pan.baidu.com/s/1bo69W67 password: 3vol
hadoop2.7.3: https://pan.baidu.com/s/1qYiJgT2 password: d96k
Prepare three machines: one master and two slaves.
192.168.122.128 master
192.168.122.129 slave1
192.168.122.130 slave2
Add the IP address and hostname of each machine to the hosts file on all three machines:
vi /etc/hosts
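Assuming the example IPs above, the three entries can also be staged non-interactively. This sketch builds them in a scratch file, so nothing touches /etc/hosts until you append the result yourself:

```shell
# Stage the cluster hostname mappings in a scratch file for review.
scratch=$(mktemp)

while read -r ip name; do
    # Skip a mapping if its hostname is already present.
    grep -q "[[:space:]]$name\$" "$scratch" || echo "$ip $name" >> "$scratch"
done <<'EOF'
192.168.122.128 master
192.168.122.129 slave1
192.168.122.130 slave2
EOF

cat "$scratch"
# After reviewing:  cat "$scratch" >> /etc/hosts  (run on each of the three machines)
```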
Passwordless SSH login (as root; a dedicated hadoop user can also be created)
Check that the clocks are consistent (time synchronization):
date
If the clocks are out of sync, see "Time synchronization with NTP on CentOS 7".
Disable the firewall:
systemctl stop firewalld.service
Disable it at boot as well:
systemctl disable firewalld.service
Generate an SSH key pair:
ssh-keygen -t rsa -P ''
Check that the key pair was generated (there should be two files, id_rsa and id_rsa.pub):
ls /root/.ssh/
Create the authorized_keys file
On master, create an authorized_keys file and verify that it was created:
touch /root/.ssh/authorized_keys
ls /root/.ssh/
Copy the contents of id_rsa.pub from all three machines into authorized_keys.
For example, the RSA public keys generated in this walkthrough are:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDuFQVKFAn/hodZUwj9xSybiKIP2XJHqxZfgNz7ADSFwEORMAkKpUyJ51fM/S+uW9oNSlSG/TyfvSSme1fE6xut0t4iPLV2dp9Ia+dPDs9ub3XEyvId1ADMNhO3SveuMVNPpJ50PiBnmqgHQ1OuPMopgfgRFWmbodLmz0gtmJZ6KubI3P90Do44X1+TJdX+eRECFomefayj23x+/xBVLxKXQH7+vNVn4vIM8JIWFFT8XEBN2+HKxCqEN4yilTIk+X5Ov10sfJcQhNlivThV+t9AeBH/T6J4bLOrdiQYNTnMuN+Ii5tNc7fKpzdaCmlJmzaxzESrXQRtu+7C3areZYe9 root@master
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDMQ8YA7APRnS5Rt3y5OIJexL2A6a0KR+MhLpMInnGMzEcpItryIU8FBPV3fmsKdhzr99pxryLSxQibvHxQo1Kx2FUN1HUTW4fftZsum+ddGY6w+/iQefjbddrmUzaZUxhsHuqCBb80UclfbR7BcRv1FQDelyig2FU9U28LjU9iTvwdEzttdBq433GL/2lDC1xw2tidWkc0CfjACprzjJ16vzb88awm8VOTp5ExylD7gT8sXmAsmAr3W8FsilKFKCrLwCEop3/r+6g8eIDM53XOt7UciK/FJAyCarKbUexeEfBqpzeilW1wcHd/5DiLJgCZ2fJhJnI+3xQKGv9xdYoR root@slave1
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDKuppRgfn5Tx7ST3C17jfguukTaaJVJdWbFziVm/jbwU57o4CmN7CuTzI4VvEnVeVsKTN8S5+rxC3hBIMjuJbVopR8vjHLSd7ysByUiFUusg7RPmJRMlZ0LwWMJCUm9E/xIoq9zNGr38u0yKNjS27PYf8PLgYQx2qHUGbla3KlSX5i81hxyeF/sHqfn6F+RQ/BAxVziu7atDTZF+RojYfiw087Zp/57Th6ouSPIeObTeYkJjFFENavsCcDwbqUnMyndDoPbCqV/f0494HSFZWPX8KUVfWnnJ1HQWp37vgZV8uU59OMLibYCD6t/p4Qfvp0/CCgFW8a6XoYXwYcm/tl root@slave2
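Instead of pasting the three public keys together by hand, the merge can be scripted. A minimal sketch, assuming the id_rsa.pub files from the three hosts have already been collected into one directory (the keys/ directory and the stand-in key strings below are hypothetical, not part of the original walkthrough):

```shell
# Hypothetical staging directory with one collected .pub file per host;
# on the real cluster these would be fetched from each machine's
# /root/.ssh/id_rsa.pub, e.g. via scp.
workdir=$(mktemp -d)
cd "$workdir"
mkdir keys
echo "ssh-rsa AAAA...master root@master" > keys/master.pub
echo "ssh-rsa AAAA...slave1 root@slave1" > keys/slave1.pub
echo "ssh-rsa AAAA...slave2 root@slave2" > keys/slave2.pub

# Concatenate every collected key into one authorized_keys file.
cat keys/*.pub > authorized_keys
chmod 600 authorized_keys   # sshd ignores key files with loose permissions
wc -l authorized_keys
```

In practice, running `ssh-copy-id root@master` from each machine achieves the same merge without hand-editing any file.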
Copy authorized_keys from master to slave1 and slave2:
scp /root/.ssh/authorized_keys root@192.168.122.129:/root/.ssh
scp /root/.ssh/authorized_keys root@192.168.122.130:/root/.ssh
Verify that the copy succeeded.
Test that the machines can log in to each other without a password (log out with exit):
ssh master
ssh slave1
ssh slave2
Install the JDK
1. Copy the JDK archive from the local machine to the cluster
Connect to the cluster with SecureCRT, type rz in the terminal, and select the JDK file to upload.
In this example, create an /opt/java directory and copy the archive into it.
2. Extract the JDK archive with tar
cd /opt/java
tar -zxvf jdk-8u60-linux-x64.tar.gz
ls
3. Edit the configuration file /etc/profile
vi /etc/profile
Add the Java environment variables:
export JAVA_HOME=/opt/java/jdk1.8.0_60
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin
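Before editing the real /etc/profile, the three export lines can be dry-run from a scratch file to confirm they expand as intended; a minimal sketch (the scratch-file approach is an addition, not one of the original steps):

```shell
# Write the three export lines to a scratch file and source it in the
# current shell, without touching /etc/profile.
profile_test=$(mktemp)
cat > "$profile_test" <<'EOF'
export JAVA_HOME=/opt/java/jdk1.8.0_60
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin
EOF

. "$profile_test"
echo "JAVA_HOME=$JAVA_HOME"
```

Note that $JAVA_HOME is expanded when the file is sourced, which is why the JAVA_HOME export must come before the CLASSPATH and PATH lines that reference it.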
Apply the changes:
source /etc/profile
Check the Java version:
java -version
4. Troubleshooting (the new environment may not take effect)
For licensing reasons, most Linux distributions ship with OpenJDK preinstalled, and OpenJDK's java command is already on the PATH. After installing the Sun JDK, the system therefore has two JDKs, OpenJDK and the Sun JDK. How do we make the Sun JDK replace the original OpenJDK?
Finding the cause
Check which directories the java command lives in:
whereis java
Output:
java: /usr/bin/java /usr/lib/java /etc/java /usr/share/java /opt/java/jdk1.8.0_60/bin/java /usr/share/man/man1/java.1.gz
Here /opt/java/jdk1.8.0_60/bin/java is the Sun JDK we installed, while /usr/bin/java is where the system's default java command lives. Looking further:
ls -la /usr/bin/java
Output:
lrwxrwxrwx. 1 root root 22 Nov 23 23:44 /usr/bin/java -> /etc/alternatives/java
Go into the /etc/alternatives directory:
cd /etc/alternatives
ls -la
Output (one relevant line):
lrwxrwxrwx. 1 root root 70 Nov 23 23:44 java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.x86_64/jre/bin/java
Here is the cause: the system's default java points to OpenJDK's java command, which is why the environment variables we configured in /etc/profile never take effect.
The fix is to repoint this java symlink at our Sun JDK binary, /opt/java/jdk1.8.0_60/bin/java. Changing the symlink target solves the problem.
Check the current default java configuration:
update-alternatives --display java
Output (excerpt):
java - status is auto.
 link currently points to /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.x86_64/jre/bin/java
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/jre/bin/java - priority 1700091
This shows that the system default java is OpenJDK's (note the priority values).
Register the Sun JDK we installed:
update-alternatives --install /usr/bin/java java /opt/java/jdk1.8.0_60/bin/java 170130
update-alternatives --config java
Output:
There are 3 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
   1           /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/jre/bin/java
*+ 2           /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.x86_64/jre/bin/java
   3           /opt/java/jdk1.8.0_60/bin/java

Enter to keep the current selection[+], or type selection number:
Enter the number of the JDK we installed; in this example, 3.
Because the priority we assigned is lower than OpenJDK's, the choice must be made manually. If the registered priority were higher than OpenJDK's, no manual choice would be needed: the system automatically selects the alternative with the highest priority.
Check the Java version:
java -version
Install Hadoop
1. Copy the Hadoop archive to the cluster (see the JDK copy method above)
In this example, create an /opt/hadoop directory.
2. Extract it with tar
3. Configure the environment
Add the Hadoop environment variables to /etc/profile:
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
4. Create the Hadoop working directories
mkdir /root/hadoop
mkdir /root/hadoop/tmp
mkdir /root/hadoop/var
mkdir /root/hadoop/dfs
mkdir /root/hadoop/dfs/name
mkdir /root/hadoop/dfs/data
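The six mkdir calls above can be collapsed into a single `mkdir -p`, which creates missing parents and is safe to rerun. A sketch against a scratch root (HADOOP_DATA here is a stand-in for /root/hadoop so the snippet can run anywhere):

```shell
# Scratch stand-in for /root/hadoop.
HADOOP_DATA="$(mktemp -d)/hadoop"

# One call creates the whole tree: tmp, var, dfs/name and dfs/data.
mkdir -p "$HADOOP_DATA/tmp" "$HADOOP_DATA/var" \
         "$HADOOP_DATA/dfs/name" "$HADOOP_DATA/dfs/data"

find "$HADOOP_DATA" -type d | sort
```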
5. Edit the series of configuration files under etc/hadoop in the Hadoop directory
In this example, the directory is /opt/hadoop/hadoop-2.7.3/etc/hadoop.
Edit core-site.xml
Add the following:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>
Edit hadoop-env.sh
Change the line
export JAVA_HOME=${JAVA_HOME}
to:
export JAVA_HOME=/opt/java/jdk1.8.0_60
where the path is the one the JDK was installed to.
Edit hdfs-site.xml
Add the following inside the <configuration> tags:
<property>
    <name>dfs.name.dir</name>
    <value>/root/hadoop/dfs/name</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>
<property>
    <name>dfs.data.dir</name>
    <value>/root/hadoop/dfs/data</value>
    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>true</value>
    <description>Enable permission checking on HDFS.</description>
</property>
Note: if dfs.permissions is set to false, files can be created on HDFS without any permission checks. Convenient as that is, it makes accidental deletion easier to cause, so set it to true, or simply delete this property node, since true is the default anyway.
Create and edit mapred-site.xml
Copy the mapred-site.xml.template template:
cp mapred-site.xml.template mapred-site.xml
ls
Edit it:
vi mapred-site.xml
Add the following inside the <configuration> tags:
<property>
    <name>mapred.job.tracker</name>
    <value>master:49001</value>
</property>
<property>
    <name>mapred.local.dir</name>
    <value>/root/hadoop/var</value>
</property>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
Edit the slaves file
Replace the localhost entry in it with:
slave1
slave2
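The two-hostname slaves file can be generated in one step; a sketch using a scratch path (the real file lives at /opt/hadoop/hadoop-2.7.3/etc/hadoop/slaves):

```shell
# Scratch stand-in for etc/hadoop/slaves.
slaves_file=$(mktemp)

# printf emits one hostname per line, which is the format slaves expects.
printf '%s\n' slave1 slave2 > "$slaves_file"

cat "$slaves_file"
```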
Edit yarn-site.xml
Add the following inside the <configuration> tags:
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
</property>
<property>
    <description>The address of the applications manager interface in the RM.</description>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
    <description>The http address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
    <description>The https address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.https.address</name>
    <value>${yarn.resourcemanager.hostname}:8090</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${yarn.resourcemanager.hostname}:8031</value>
</property>
<property>
    <description>The address of the RM admin interface.</description>
    <name>yarn.resourcemanager.admin.address</name>
    <value>${yarn.resourcemanager.hostname}:8033</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>Maximum memory a single container can request, in MB; the default is 8192 MB.</description>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Note: setting yarn.nodemanager.vmem-check-enabled to false disables the virtual-memory check. This is very useful when installing on virtual machines; with it in place, later operations run into fewer problems. On physical hardware with plenty of memory, you can drop this setting.
Start Hadoop
1. Run the initialization on the namenode
Since master is the namenode and slave1 and slave2 are datanodes, initialization, that is, formatting HDFS, only needs to run on the namenode:
cd /opt/hadoop/hadoop-2.7.3/bin
./hadoop namenode -format
Since the Hadoop environment was configured earlier, you can also run it directly:
hadoop namenode -format
After a stream of output finishes scrolling, some configuration information is shown. Check whether the /root/hadoop/dfs/name directory now contains a current folder with several files in it:
cd /root/hadoop/dfs/name
ls
cd current
ls
2. Run the start command on the namenode
Run the start script:
cd /opt/hadoop/hadoop-2.7.3/sbin/
./start-all.sh
Test Hadoop
Open the URLs:
http://192.168.122.128:50070
http://192.168.122.128:8088
If the Hadoop web pages come up,
the entire Hadoop environment has been set up successfully.
WordCount test
Create two files locally:
cd /opt
mkdir file
cd file
echo "hello world" >> file1.txt
echo "hello hadoop" >> file2.txt
ls
more file1.txt
more file2.txt
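What wordcount should output can be predicted locally with plain shell tools before the job is ever submitted; this preview recreates the two files in a scratch directory and performs the same split-sort-count:

```shell
# Recreate the two input files in a scratch directory.
workdir=$(mktemp -d)
echo "hello world" >> "$workdir/file1.txt"
echo "hello hadoop" >> "$workdir/file2.txt"

# Split into one word per line, sort, and count duplicates; this mirrors
# what the wordcount example computes (hadoop 1, hello 2, world 1).
cat "$workdir"/file*.txt | tr ' ' '\n' | sort | uniq -c
```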
Create the input directory on the Hadoop cluster (the output directory is created by the job itself):
hadoop fs -mkdir -p /test/hadoop/input
Upload the files to the cluster:
hadoop fs -put /opt/file/file*.txt /test/hadoop/input
hadoop fs -ls /test/hadoop/input
Run the MapReduce test:
hadoop jar /opt/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /test/hadoop/input /test/hadoop/output
Note: this invokes the wordcount method in the MapReduce examples jar under the share directory; the last two arguments are the input and output paths.
Result:
17/11/29 14:33:01 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.122.128:8032
17/11/29 14:33:03 INFO input.FileInputFormat: Total input paths to process : 2
17/11/29 14:33:03 INFO mapreduce.JobSubmitter: number of splits:2
17/11/29 14:33:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511935947803_0003
17/11/29 14:33:04 INFO impl.YarnClientImpl: Submitted application application_1511935947803_0003
17/11/29 14:33:04 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1511935947803_0003/
17/11/29 14:33:04 INFO mapreduce.Job: Running job: job_1511935947803_0003
17/11/29 14:33:14 INFO mapreduce.Job: Job job_1511935947803_0003 running in uber mode : false
17/11/29 14:33:14 INFO mapreduce.Job: map 0% reduce 0%
17/11/29 14:33:29 INFO mapreduce.Job: map 50% reduce 0%
17/11/29 14:33:30 INFO mapreduce.Job: map 100% reduce 0%
17/11/29 14:33:38 INFO mapreduce.Job: map 100% reduce 100%
17/11/29 14:33:38 INFO mapreduce.Job: Job job_1511935947803_0003 completed successfully
17/11/29 14:33:39 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=55
FILE: Number of bytes written=355783
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=247
HDFS: Number of bytes written=25
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=27082
Total time spent by all reduces in occupied slots (ms)=6022
Total time spent by all map tasks (ms)=27082
Total time spent by all reduce tasks (ms)=6022
Total vcore-milliseconds taken by all map tasks=27082
Total vcore-milliseconds taken by all reduce tasks=6022
Total megabyte-milliseconds taken by all map tasks=27731968
Total megabyte-milliseconds taken by all reduce tasks=6166528
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=41
Map output materialized bytes=61
Input split bytes=222
Combine input records=4
Combine output records=4
Reduce input groups=3
Reduce shuffle bytes=61
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1132
CPU time spent (ms)=4410
Physical memory (bytes) snapshot=500752384
Virtual memory (bytes) snapshot=6232047616
Total committed heap usage (bytes)=307437568
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=25
File Output Format Counters
Bytes Written=25
Check the result:
hadoop fs -ls /test/hadoop/output
Found 2 items
-rw-r--r-- 2 root supergroup 0 2017-11-29 14:33 /test/hadoop/output/_SUCCESS
-rw-r--r-- 2 root supergroup 25 2017-11-29 14:33 /test/hadoop/output/part-r-00000
hadoop fs -cat /test/hadoop/output/part-r-00000
hadoop 1
hello 2
world 1
You can also see the execution result of the submitted job at http://192.168.122.128:8088/cluster/app.
At this point, the WordCount test has passed.
Problems encountered
The test pages would not open
After the setup was finished, the test pages would not open. I assumed the configuration was wrong and spent a long time troubleshooting, until I finally asked a friend what was going on: "Did you disable your proxy?" Disabling the proxy made the test succeed, a mix of relief and despair. Alternatively, keep the proxy and add the cluster's IP pattern to its no-proxy rules.
hadoop: command not found...
Add the Hadoop environment variables to /etc/profile (and run source /etc/profile afterwards):
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin