This article covers building and configuring a Hadoop big-data environment. Hadoop has three major distributions: the original Apache release, which is the baseline but rarely used in enterprises; Cloudera's CDH, a repackaging of the Apache version that is widely used by companies in China; and Hortonworks' HDP. The article walks through three setups: pseudo-distributed, fully distributed, and high-availability (HA).
Hadoop and component downloads (pick the version you need): https://archive.cloudera.com/cdh5/cdh/5/
Hadoop documentation: https://hadoop.apache.org/docs/r2.6.0/ or http://wiki.apache.org/hadoop/
Install the JDK before installing Hadoop.
I. Pseudo-distributed installation and deployment
1. Extract the tarball to /opt: tar -zxvf hadoop-2.5.0-cdh5.3.6.tar.gz -C /opt
2. Configure HDFS:
(1)vi etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_171
(2)vi etc/hadoop/core-site.xml:
<property>
<name>fs.defaultFS</name> # specifies the NameNode address
<value>hdfs://localhost:9000</value> # use your own hostname here
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-2.5.0-cdh5.3.6/data/tmp</value> # create the directory first: mkdir -p /opt/hadoop-2.5.0-cdh5.3.6/data/tmp
</property>
(3)vi etc/hadoop/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
3. Format the NameNode file system:
bin/hdfs namenode -format
Start the NameNode: sbin/hadoop-daemon.sh start namenode
Start the DataNode: sbin/hadoop-daemon.sh start datanode
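With both daemons up, jps should show a NameNode and a DataNode process. A minimal sketch of such a check, run against sample jps-style output here (check_daemons is a made-up helper for this sketch, not part of Hadoop; on a real node call it as check_daemons "$(jps)"):

```shell
# Illustrative helper: given jps-style output, confirm the expected
# daemons are present.
check_daemons() {
  for d in NameNode DataNode; do
    echo "$1" | grep -q "$d" || { echo "missing: $d"; return 1; }
  done
  echo "all daemons running"
}
check_daemons "$(printf '2301 NameNode\n2402 DataNode\n2500 Jps')"  # → all daemons running
```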
4. Configure YARN
(1) vi etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
(2)vi etc/hadoop/yarn-env.sh
export JAVA_HOME=/opt/jdk1.8.0_171
(3) vi etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>zhangbk:10020</value> # your hostname
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
5. Configure slaves to specify the DataNode hosts:
localhost # node hostname
Start YARN (sbin/yarn-daemon.sh start resourcemanager, then sbin/yarn-daemon.sh start nodemanager) before using the YARN web UI.
6. Configure environment variables (e.g. in /etc/profile):
export HADOOP_HOME=/opt/hadoop-2.5.0-cdh5.3.6
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
HDFS web UI: http://localhost:50070
YARN ResourceManager web UI: http://localhost:8088
7. Test
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar wordcount /input /output/wdcount2
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar wordcount /input /output/wdcount2
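Before running the example, /input must exist in HDFS and contain text files (bin/hdfs dfs -mkdir /input; bin/hdfs dfs -put <files> /input). What the wordcount job computes can be sketched locally; this awk one-liner is only an illustration of the logic, not of how the job runs:

```shell
# Local sketch of the wordcount logic (illustration only; the real job
# distributes the same counting over HDFS blocks and YARN containers).
printf 'hadoop yarn\nhadoop hdfs\n' > /tmp/wc_demo.txt
awk '{for(i=1;i<=NF;i++) c[$i]++} END{for(w in c) print w, c[w]}' /tmp/wc_demo.txt | sort
# → hadoop 2
#   hdfs 1
#   yarn 1
```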
Problem encountered:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cause and fix:
1. The JDK is 64-bit, but the bundled native Hadoop library was compiled for 32-bit.
2. The CDH tarball ships an empty lib/native directory; copying the native files from another build fixes it.
3. Where to find lib/native: from https://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/5.3.6/RPMS/x86_64/ download hadoop-2.5.0+cdh5.3.6+898-1.cdh5.3.6.p0.18.el5.x86_64.rpm (match your version), then unpack the rpm:
rpm2cpio hadoop-2.5.0+cdh5.3.6+898-1.cdh5.3.6.p0.18.el5.x86_64.rpm | cpio -idmv
After unpacking, the files under usr/lib/hadoop/lib/native/ are the ones needed; copy them into lib/native under the Hadoop install directory and fix the symlinks.
II. Fully distributed environment setup
Hadoop 2.x deployment modes
* Local Mode
* Distributed Mode
* pseudo-distributed
one machine runs all the daemons,
including the worker daemons DataNode and NodeManager
* fully distributed
multiple worker nodes:
DataNodes
NodeManagers
listed in the configuration file
$HADOOP_HOME/etc/hadoop/slaves
================================================================
1. Environment preparation
Three machines (RHEL 6):
192.168.159.21 192.168.159.22 192.168.159.23
hadoop-senior01 hadoop-senior02 hadoop-senior03
1.5 GB RAM 1 GB RAM 1 GB RAM
1 CPU 1 CPU 1 CPU
Configure the host mappings in /etc/hosts on every machine:
192.168.159.21 hadoop-senior01.zhangbk.com hadoop-senior01
192.168.159.22 hadoop-senior02.zhangbk.com hadoop-senior02
192.168.159.23 hadoop-senior03.zhangbk.com hadoop-senior03
Set the hostname on each machine in /etc/sysconfig/network:
192.168.159.21: HOSTNAME=hadoop-senior01.zhangbk.com
192.168.159.22: HOSTNAME=hadoop-senior02.zhangbk.com
192.168.159.23: HOSTNAME=hadoop-senior03.zhangbk.com
Disable the firewall:
service iptables stop # stop the firewall now
chkconfig iptables off # keep it off after reboot
netstat -apn | grep 8080 # check whether a port is in use
=====================================================================
2. Cluster layout and file configuration
hadoop-senior01 hadoop-senior02 hadoop-senior03
HDFS
NameNode
DataNode DataNode DataNode
SecondaryNameNode
YARN
ResourceManager
NodeManager NodeManager NodeManager
MapReduce
JobHistoryServer
File configuration
* hdfs
* hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_171
* core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-senior01.zhangbk.com:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-2.5.0-cdh5.3.6/data/tmp</value> # create the directory first: mkdir -p /opt/hadoop-2.5.0-cdh5.3.6/data/tmp
</property>
* hdfs-site.xml
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-senior03.zhangbk.com:50090</value>
</property>
* slaves
hadoop-senior01.zhangbk.com
hadoop-senior02.zhangbk.com
hadoop-senior03.zhangbk.com
* yarn
* yarn-env.sh
export JAVA_HOME=/opt/jdk1.8.0_171
* yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-senior02.zhangbk.com</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>640800</value>
</property>
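A quick sanity check on the retention value above: 640800 seconds works out to 7 days and 10 hours, so aggregated logs are kept for about a week:

```shell
# 640800 s ≈ 7.4 days; shell arithmetic gives whole days and hours.
echo $((640800 / 86400))          # whole days → 7
echo $((640800 % 86400 / 3600))   # remaining hours → 10
```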
* mapreduce
* mapred-env.sh
export JAVA_HOME=/opt/jdk1.8.0_171
* mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-senior01.zhangbk.com:10020</value>
</property>
3. Passwordless SSH login
ssh-keygen -t rsa # press Enter through all the prompts
This creates three files under ~/.ssh:
authorized_keys — the authorized public keys
id_rsa — the private key
id_rsa.pub — the public key
ssh-copy-id hadoop-senior01.zhangbk.com # copy the public key to each machine that needs it
Copy the installation to the other nodes with scp:
scp -r ./hadoop-2.5.0-cdh5.3.6 hadoop-senior02.zhangbk.com:/opt
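Pushing the same build to every node can be scripted. This dry-run sketch (hostnames from this article; push_plan is a made-up helper) only prints the scp commands, so the list can be reviewed before piping it to sh:

```shell
# Dry run: print the copy command for each remaining node;
# review the output, then pipe it to sh to actually copy.
push_plan() {
  for h in hadoop-senior02.zhangbk.com hadoop-senior03.zhangbk.com; do
    echo "scp -r /opt/hadoop-2.5.0-cdh5.3.6 $h:/opt"
  done
}
push_plan
```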
Format the file system:
bin/hdfs namenode -format
Problem encountered:
Cannot assign requested address; For more details
If the NameNode and the ResourceManager are on different machines, YARN must not be started on the NameNode host;
start it on the ResourceManager's machine instead. That resolved the problem.
======================================================================
After the cluster is built
* basic tests
start the services, check that they work, run a simple job
* hdfs
read and write operations
bin/hdfs dfs -mkdir -p /user/beifeng/tmp/conf
bin/hdfs dfs -put etc/hadoop/*-site.xml /user/beifeng/tmp/conf
bin/hdfs dfs -text /user/beifeng/tmp/conf/core-site.xml
* yarn
run jar
* mapreduce
bin/yarn jar share/hadoop/mapreduce/hadoop*example*.jar wordcount /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output
* benchmarks
measure the cluster's performance
* hdfs
write throughput
read throughput
* cluster monitoring
Cloudera
Cloudera Manager
* deploys and installs clusters
* monitors clusters
* synchronizes cluster configuration
* alerting, etc.
=============================================================
Cluster clocks must be synchronized
* pick one machine
as the time server
* all other machines sync to it on a schedule
for example, once every ten minutes
# rpm -qa|grep ntp
# vi /etc/ntp.conf
restrict 192.168.159.0 mask 255.255.255.0 nomodify notrap # uncomment; use your own subnet
#server 0.rhel.pool.ntp.org # comment out
#server 1.rhel.pool.ntp.org # comment out
#server 2.rhel.pool.ntp.org # comment out
server 127.127.1.0 # local clock # uncomment
fudge 127.127.1.0 stratum 10 # uncomment
# vi /etc/sysconfig/ntpd
# Drop root to id 'ntp:ntp' by default.
SYNC_HWCLOCK=yes
OPTIONS="-u ntp:ntp -p /var/run/ntpd.pid -g"
[root@hadoop-senior hadoop-2.5.0]# service ntpd status
ntpd is stopped
[root@hadoop-senior hadoop-2.5.0]# service ntpd start
Starting ntpd: [ OK ]
[root@hadoop-senior hadoop-2.5.0]# chkconfig ntpd on
crontab -e # as root
0-59/10 * * * * /usr/sbin/ntpdate hadoop-senior01.zhangbk.com # sync time from the time server every ten minutes
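The step value in the minute field means the entry fires at minutes 0, 10, 20, …, 50, i.e. six syncs per hour; a quick check:

```shell
# Minutes at which the cron spec 0-59/10 fires, and syncs per hour.
seq 0 10 59 | tr '\n' ' '   # → 0 10 20 30 40 50
echo
echo $((60 / 10))           # syncs per hour → 6
```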
Adjusting a server's clock manually:
[root@hadoop-senior03 ~]# date -s 2019-06-19
Wed Jun 19 00:00:00 PDT 2019
[root@hadoop-senior03 ~]# date -s 22:45:30
Wed Jun 19 22:45:30 PDT 2019
III. High-availability (HA) distributed environment setup
1. Install and deploy ZooKeeper first
YARN, HBase, and Spark all rely on ZooKeeper,
an Apache project that provides coordination services for distributed applications.
Use an odd number of servers (one becomes the leader).
tickTime: the heartbeat interval between ZooKeeper servers, and between clients and servers.
dataDir: the directory where ZooKeeper stores its data; by default the transaction log is written here as well.
clientPort: the port clients connect on; ZooKeeper listens on it for client requests.
ZooKeeper client commands:
bin/zkCli.sh -server localhost:2181
ls,get,create,delete,set
Build the ZooKeeper cluster:
mkdir -p /opt/zookeeper-3.4.5-cdh5.3.6/data/zkData
mv conf/zoo_sample.cfg conf/zoo.cfg
vi conf/zoo.cfg
dataDir=/opt/zookeeper-3.4.5-cdh5.3.6/data/zkData
# the port at which the clients will connect
clientPort=2181
server.1=hadoop-senior01.zhangbk.com:2888:3888
server.2=hadoop-senior02.zhangbk.com:2888:3888
server.3=hadoop-senior03.zhangbk.com:2888:3888
vi data/zkData/myid
1
scp -r zookeeper-3.4.5-cdh5.3.6 root@hadoop-senior02.zhangbk.com:/opt
scp -r zookeeper-3.4.5-cdh5.3.6 root@hadoop-senior03.zhangbk.com:/opt
On the second and third nodes, edit myid and set it to 2 and 3 respectively.
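Each server's myid must match its server.N entry in zoo.cfg. The assignment can be sketched as a loop; here it is simulated in a temp directory (on a real cluster each node writes its own id under dataDir, the zkData directory above):

```shell
# Sketch: assign myid 1..3 for a three-node quorum, simulated locally
# with one subdirectory per node (hostnames from this article).
base=$(mktemp -d)
i=1
for node in hadoop-senior01 hadoop-senior02 hadoop-senior03; do
  mkdir -p "$base/$node/zkData"
  echo "$i" > "$base/$node/zkData/myid"
  i=$((i + 1))
done
cat "$base/hadoop-senior03/zkData/myid"   # → 3
```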
Problem: starting the ZooKeeper cluster from a script
The script runs, but jps shows no process.
Analysis
Interactive vs. non-interactive shells and login vs. non-login shells behave differently.
A login shell reads /etc/profile and then the first of ~/.bash_profile, ~/.bash_login, and ~/.profile that exists, executing the commands in it (unless --noprofile is given); a non-login shell reads only /etc/bash.bashrc and ~/.bashrc.
Running the command by hand uses a login shell, while the script runs in a non-login shell. My environment (JDK 1.8 and so on) was configured only in /etc/profile,
so running /opt/module/zookeeper-3.4.10/bin/zkServer.sh start from the script failed.
Fixes
Append the profile settings to .bashrc: cat /etc/profile >> ~/.bashrc
Or add export JAVA_HOME=/opt/module/jdk1.8.0_191 near the top of zookeeper/bin/zkEnv.sh
Hadoop HA distributed cluster setup
Distributed storage
* NameNode
holds the metadata; for a file such as
/user/beifeng/tmp/core-site.xml that means:
file name, path, owner, group, permissions, replication factor, ...
* DataNode
stores file contents as 128 MB blocks
on local disk
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/data</value>
</property>
* Client
-> NameNode ->
put
hadoop 2.x
HDFS HA is available since 2.2.0
* NameNode Active
* NameNode Standby
HDFS
edits log
records every change; the standby replays it to stay in sync
NN Active
NN Standby
Key points when configuring HA
* shared edits
JournalNodes
* NameNode
one Active, one Standby
* Client
accesses through a proxy
* fencing
only one NameNode may serve clients at any moment
the method used here is sshfence
the two NameNodes must be able to ssh to each other without a password
21 (NameNode) ssh -> 22
22 (NameNode) ssh -> 21
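The mutual key exchange sshfence relies on can be sketched as a dry run; fence_ssh_plan is a made-up helper that only prints what to run, with each printed command executed on the named source host:

```shell
# Dry run: the key exchange needed in both directions for sshfence
# (NameNode hostnames from this article).
fence_ssh_plan() {
  nn1=hadoop-senior01.zhangbk.com
  nn2=hadoop-senior02.zhangbk.com
  echo "on $nn1: ssh-copy-id $nn2"
  echo "on $nn2: ssh-copy-id $nn1"
}
fence_ssh_plan
```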
============================================================
2. Plan the cluster and configure the files
Cluster layout:
21 22 23
NameNode NameNode
JournalNode JournalNode JournalNode
DataNode DataNode DataNode
Configuration files (reference: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Configuration_details)
hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_171
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-2.5.0-cdh5.3.6/data/tmp</value> # create the directory first: mkdir -p /opt/hadoop-2.5.0-cdh5.3.6/data/tmp
</property>
hdfs-site.xml
<property>
<name>dfs.nameservices</name>
<value>ns1</value>
</property>
<property>
<name>dfs.ha.namenodes.ns1</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1.nn1</name>
<value>hadoop-senior01.zhangbk.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1.nn2</name>
<value>hadoop-senior02.zhangbk.com:8020</value>
</property>
#namenode http web address
<property>
<name>dfs.namenode.http-address.ns1.nn1</name>
<value>hadoop-senior01.zhangbk.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.ns1.nn2</name>
<value>hadoop-senior02.zhangbk.com:50070</value>
</property>
#namenode shared edits address
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-senior01.zhangbk.com:8485;hadoop-senior02.zhangbk.com:8485;hadoop-senior03.zhangbk.com:8485/ns1</value>
</property>
#hdfs proxy client
<property>
<name>dfs.client.failover.proxy.provider.ns1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
#namenode fence
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/opt/hadoop-2.5.0-cdh5.3.6/data/dfs/jn</value>
</property>
slaves
hadoop-senior01.zhangbk.com
hadoop-senior02.zhangbk.com
hadoop-senior03.zhangbk.com
scp -r /opt/hadoop-2.5.0-cdh5.3.6/etc/hadoop root@hadoop-senior02.zhangbk.com:/opt/hadoop-2.5.0-cdh5.3.6/etc/
QJM HA startup
step 1: on each JournalNode host, start the journalnode
sbin/hadoop-daemon.sh start journalnode
step 2: on nn1, format the file system and start the namenode
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
step 3: on nn2, sync nn1's metadata
bin/hdfs namenode -bootstrapStandby
step 4: start nn2
sbin/hadoop-daemon.sh start namenode
step 5: switch nn1 to Active
bin/hdfs haadmin -transitionToActive nn1
step 6: on nn1, start all the datanodes
sbin/hadoop-daemons.sh start datanode
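The steps above must run on the right hosts in the right order. This dry-run sketch (qjm_bootstrap_plan is a made-up helper; hostnames from this article) prints the full sequence so it can be reviewed before anything is executed:

```shell
# Dry run of the QJM HA bootstrap order; prints commands, executes nothing.
qjm_bootstrap_plan() {
  for jn in hadoop-senior01 hadoop-senior02 hadoop-senior03; do
    echo "on $jn: sbin/hadoop-daemon.sh start journalnode"
  done
  echo "on nn1: bin/hdfs namenode -format"
  echo "on nn1: sbin/hadoop-daemon.sh start namenode"
  echo "on nn2: bin/hdfs namenode -bootstrapStandby"
  echo "on nn2: sbin/hadoop-daemon.sh start namenode"
  echo "on nn1: bin/hdfs haadmin -transitionToActive nn1"
  echo "on nn1: sbin/hadoop-daemons.sh start datanode"
}
qjm_bootstrap_plan
```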
Automatic failover with ZooKeeper
After HA starts, both NameNodes are standby; one of them is then elected Active.
hdfs-site.xml
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
core-site.xml
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop-senior01.zhangbk.com:2181,hadoop-senior02.zhangbk.com:2181,hadoop-senior03.zhangbk.com:2181</value>
</property>
Starting automatic failover:
(1) stop all HDFS services: sbin/stop-dfs.sh
(2) start the ZooKeeper cluster: bin/zkServer.sh start
(3) initialize the HA state in ZooKeeper: bin/hdfs zkfc -formatZK
(4) start the HDFS services: sbin/start-dfs.sh
(5) on each NameNode host, start the DFSZKFailoverController: sbin/hadoop-daemon.sh start zkfc
the machine where zkfc is started first gets the Active NameNode
Verify: kill the Active NameNode process,
or cut the Active NameNode machine off the network: service network stop
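To simulate a failure, kill the Active NameNode's JVM. A small sketch of pulling the pid out of jps output, run here against sample output (pid_of is a made-up helper; on a real node: kill -9 "$(pid_of NameNode "$(jps)")"):

```shell
# Illustrative: extract a daemon's pid from jps-style output.
pid_of() { echo "$2" | awk -v d="$1" '$2 == d { print $1 }'; }
pid_of NameNode "$(printf '4321 NameNode\n5678 DataNode')"   # → 4321
```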
=========================================================
NameNode
Can there be more than one NameNode? (HDFS federation)
NameNode NameNode NameNode
metadata metadata metadata
e.g. separate namespaces for log data, e-commerce data, call-detail records,
all sharing the same pool of DataNodes
=============================================================
Copying data between clusters
for example from a test cluster to a production cluster,
or between clusters running different Hadoop versions:
hadoop distcp -i hftp://sourceFS:8020/src hdfs://destFS:8020/dest
distcp itself runs as a MapReduce job
=============================================================
YARN
ResourceManager
* manages the cluster's resources
* allocates and schedules cluster resources
NodeManagers
Note: if you run into problems during the installation, feel free to get in touch!