Introduction
Hadoop is an open-source software framework from Apache, implemented in Java, that serves as a platform for developing and running software that processes data at scale. It lets you process large data sets across clusters of machines using a simple programming model.
Architecture
Hadoop has three core components:
HDFS (distributed file system): stores massive data sets
MapReduce (distributed computation framework): processes massive data sets
YARN (job scheduling and cluster resource management framework): schedules resources and tasks
Hadoop生态圈
当下的Hadoop已经成长为一个庞大的体系,随着生态系统的成长,新出现的项目越来越多,其中不乏一些非Apache主管的项目,
这些项目对HADOOP是很好的补充或者更高层的抽象。比如:
HDFS:分布式文件系统
MAPREDUCE:分布式运算程序开发框架
HIVE:基于HADOOP的分布式数据仓库,提供基于SQL的查询数据操作
HBASE:基于HADOOP的分布式海量数据库
ZOOKEEPER:分布式协调服务基础组件
Mahout:基于mapreduce/spark/flink等分布式运算框架的机器学习算法库
OOZIE:工作流调度框架
Sqoop:数据导入导出工具(比如用于mysql和HDFS之间)
FLUME:日志数据采集框架
IMPALA:基于hive的实时sql查询分析
Installation
Cluster plan
| | hadoop100 | hadoop101 | hadoop102 |
| --- | --- | --- | --- |
| IP | 192.168.100.100 | 192.168.100.101 | 192.168.100.102 |
| Prerequisites | JDK, ZooKeeper | JDK, ZooKeeper | JDK, ZooKeeper |
| ZooKeeper | zk | zk | zk |
| HDFS | JournalNode | JournalNode | JournalNode |
| | NameNode | NameNode | |
| | DataNode | DataNode | DataNode |
| YARN | ResourceManager (rm1) | | ResourceManager (rm2) |
| | NodeManager | NodeManager | NodeManager |
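The plan assumes every node can resolve the hostnames hadoop100–hadoop102 to the IPs above. A minimal sketch of the mapping, written to a temp file for illustration (on a real cluster the same three lines would go into /etc/hosts on every machine, which is an assumption of this guide, not shown in it):

```shell
hosts_sample=$(mktemp)   # stand-in for /etc/hosts
cat > "$hosts_sample" <<'EOF'
192.168.100.100 hadoop100
192.168.100.101 hadoop101
192.168.100.102 hadoop102
EOF
# look up a hostname's IP the way hosts-file resolution would
lookup_ip() { awk -v h="$1" '$2 == h {print $1}' "$hosts_sample"; }
lookup_ip hadoop101   # prints 192.168.100.101
```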
Extract the archive
tar -zxvf hadoop-3.2.2.tar.gz -C /opt/software/
# rename the directory
cd /opt/software
mv hadoop-3.2.2 hadoop
Configure the daemon users
Hadoop 3.x refuses to start daemons as root unless the daemon users are declared, so add them at the top of each start/stop script.
cd /opt/software/hadoop/sbin
vim start-dfs.sh
# add at the top:
#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
HDFS_JOURNALNODE_USER=root
vim stop-dfs.sh
# add at the top:
#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
HDFS_JOURNALNODE_USER=root
vim start-yarn.sh
# add at the top:
#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
vim stop-yarn.sh
# add at the top:
#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Edit the configuration files
cd /opt/software/hadoop/etc/hadoop/
Set the DataNode hosts
vim workers
hadoop100
hadoop101
hadoop102
Set JAVA_HOME
vim hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_271
vim mapred-env.sh
export JAVA_HOME=/opt/jdk1.8.0_271
vim yarn-env.sh
export JAVA_HOME=/opt/jdk1.8.0_271
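When this setup is re-run, the same export line can end up appended twice. A small guarded-append sketch (demoed against a temp file standing in for hadoop-env.sh, so it is safe to run anywhere):

```shell
line='export JAVA_HOME=/opt/jdk1.8.0_271'
target=$(mktemp)      # stand-in for hadoop-env.sh
# -q quiet, -x whole-line match, -F fixed string: append only if absent
grep -qxF "$line" "$target" || echo "$line" >> "$target"
grep -qxF "$line" "$target" || echo "$line" >> "$target"   # second run is a no-op
grep -cxF "$line" "$target"   # prints 1
```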
Edit the main configuration files
vim hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- number of replicas per block; usually no larger than the number of DataNodes -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- block size: 128 MB (deprecated name; newer releases use dfs.blocksize) -->
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<!-- mycluster is a user-chosen nameservice id; it is referenced by the properties below and must match the value used in core-site.xml -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<!-- mycluster has two NameNodes: nn1 and nn2 -->
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<!-- enable the WebHDFS REST interface -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<!-- RPC address of nn1 -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hadoop100:9000</value>
</property>
<!-- HTTP address of nn1 (NameNode web UI) -->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hadoop100:50070</value>
</property>
<!-- RPC address of nn2 -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hadoop101:9000</value>
</property>
<!-- HTTP address of nn2 (NameNode web UI) -->
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hadoop101:50070</value>
</property>
<!-- where the NameNodes read/write shared edits: the JournalNode quorum (an odd number of nodes, so a majority can be reached) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop100:8485;hadoop101:8485;hadoop102:8485/mycluster</value>
</property>
<!-- local directory where each JournalNode stores edits and state -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/opt/data/hadoopdata/dfs/journaldata</value>
</property>
<!-- proxy provider clients use to locate the active NameNode during failover -->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- fencing method(s) used to avoid HA split-brain; list one method per line if several are used -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- sshfence requires passwordless ssh (as root here); this is the private key it uses -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- ssh connect timeout for fencing, in milliseconds -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
<!-- disable permission checking (left commented out here; the same property is set in core-site.xml) -->
<!-- <property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property> -->
<!-- local directory where the NameNode stores its metadata -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/data/hadoopdata/dfs/namenode</value>
</property>
<!-- local directory where the DataNode stores block data -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/data/hadoopdata/dfs/datanode</value>
</property>
<!-- enable automatic NameNode failover for the mycluster nameservice -->
<property>
<name>dfs.ha.automatic-failover.enabled.mycluster</name>
<value>true</value>
</property>
</configuration>
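Two of the magic numbers above are easier to audit as arithmetic. A quick sanity check, pure shell and safe to run anywhere:

```shell
echo $((128 * 1024 * 1024))   # dfs.block.size: 128 MB in bytes -> 134217728
echo $((30 * 1000))           # fencing ssh timeout: 30 s in ms -> 30000
```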
vim core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- default file system: the HA nameservice defined in hdfs-site.xml -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<!-- I/O buffer size for HDFS operations -->
<property>
<name>io.file.buffer.size</name>
<value>4096</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/opt/data/hadoopdata/dfs/tmp</value>
</property>
<!-- ZooKeeper quorum used for HA -->
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
</property>
<!-- ipc retry settings, to avoid a ConnectException while the JournalNode service is still coming up -->
<property>
<name>ipc.client.connect.max.retries</name>
<value>100</value>
</property>
<property>
<name>ipc.client.connect.retry.interval</name>
<value>10000</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<!-- static web UI user: root -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
<!-- disable HDFS permission checking (an HDFS property, conventionally placed in hdfs-site.xml, but HDFS daemons read both files) -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
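With several *-site.xml files in play, it helps to spot-check a single property value. On a live node, `hdfs getconf -confKey <name>` is the proper tool; the hypothetical helper below does a cruder grep/sed extraction and is demoed against a small sample file, not the real config:

```shell
# crude extractor: find the <name> line, take the <value> on the next line
get_prop() {
  grep -A1 "<name>$2</name>" "$1" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
sample=$(mktemp)   # tiny sample config for the demo
cat > "$sample" <<'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
</configuration>
EOF
get_prop "$sample" fs.defaultFS   # prints hdfs://mycluster
```

Note the helper assumes `<name>` and `<value>` sit on consecutive lines, which holds for the configs in this guide but not for arbitrary XML.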
vim mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- MapReduce job history server host and port -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop100:10020</value>
</property>
<!-- job history server web UI host and port -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop100:19888</value>
</property>
<!-- show at most 20000 finished jobs in the history server web UI -->
<property>
<name>mapreduce.jobhistory.joblist.cache.size</name>
<value>20000</value>
</property>
<!-- where job run logs are stored -->
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
</property>
</configuration>
vim yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- per-host ResourceManager id, per the cluster plan: rm1 on hadoop100, rm2 on hadoop102; comment this property out on hadoop101 -->
<property>
<name>yarn.resourcemanager.ha.id</name>
<value>rm1</value>
</property>
<!-- enable ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- how reducers fetch map output -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- logical name of the YARN HA cluster -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yarncluster</value>
</property>
<!-- ResourceManager ids under that cluster name -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hadoop100</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop102</value>
</property>
<!-- address of the ZooKeeper cluster -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
</property>
<!-- enable automatic state recovery -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>hadoop100:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>hadoop100:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm1</name>
<value>hadoop100:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm1</name>
<value>hadoop100:8033</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm1</name>
<value>hadoop100:23142</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>hadoop100:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>hadoop102:8088</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>hadoop102:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>hadoop102:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
<value>hadoop102:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm2</name>
<value>hadoop102:8033</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm2</name>
<value>hadoop102:23142</value>
</property>
<!-- store ResourceManager state in the ZooKeeper cluster -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- ZooKeeper connection address for the state store (yarn.resourcemanager.zk-address above already covers this; a duplicate zk-address property has been removed here) -->
<!-- <property>
<name>yarn.resourcemanager.zk-state-store.address</name>
<value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
</property> -->
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- keep aggregated logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/software/hadoop/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/opt/software/hadoop/yarn/logs</value>
</property>
<property>
<name>mapreduce.shuffle.port</name>
<value>23080</value>
</property>
<!-- failover proxy provider used by YARN clients -->
<property>
<name>yarn.client.failover-proxy-provider</name>
<value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
<!-- znode path under which leader-election state is kept; note this is a path inside ZooKeeper, not a local directory (the original text pointed it at the zoo.cfg data dir, which is a misconception; /yarn-leader-election is the default) -->
<property>
<name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name>
<value>/yarn-leader-election</value>
</property>
<!-- hadoop classpath -->
<property>
<name>yarn.application.classpath</name>
<value>/opt/software/hadoop/etc/hadoop:/opt/software/hadoop/share/hadoop/common/lib/*:/opt/software/hadoop/share/hadoop/common/*:/opt/software/hadoop/share/hadoop/hdfs:/opt/software/hadoop/share/hadoop/hdfs/lib/*:/opt/software/hadoop/share/hadoop/hdfs/*:/opt/software/hadoop/share/hadoop/mapreduce/lib/*:/opt/software/hadoop/share/hadoop/mapreduce/*:/opt/software/hadoop/share/hadoop/yarn:/opt/software/hadoop/share/hadoop/yarn/lib/*:/opt/software/hadoop/share/hadoop/yarn/*</value>
</property>
</configuration>
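Two notes on the values above. The long yarn.application.classpath string need not be typed by hand: on a node where Hadoop is installed, running `hadoop classpath` prints the equivalent string. And the log retention value is just 7 days expressed in seconds:

```shell
echo $((7 * 24 * 3600))   # yarn.log-aggregation.retain-seconds: 7 days -> 604800
```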
Distribute to the other nodes
scp -r /opt/software/hadoop hadoop101:/opt/software/hadoop
scp -r /opt/software/hadoop hadoop102:/opt/software/hadoop
Then edit /opt/software/hadoop/etc/hadoop/yarn-site.xml on the other nodes: comment out the yarn.resourcemanager.ha.id property on hadoop101, and change its value to rm2 on hadoop102.
Add environment variables
vim /etc/profile
#HADOOP_HOME
export HADOOP_HOME=/opt/software/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# sync /etc/profile to every node (rsync.sh is the custom distribution script)
rsync.sh /etc/profile
# reload the environment on every node (all.sh runs a command on all nodes)
all.sh "source /etc/profile"
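Before syncing cluster-wide, it is worth checking locally that the variables resolve as intended. A minimal sketch (assumes the two export lines above have just been added):

```shell
export HADOOP_HOME=/opt/software/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# verify PATH now contains the Hadoop bin directory
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) echo "PATH ok" ;;
  *) echo "PATH is missing $HADOOP_HOME/bin" >&2 ;;
esac
```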
Register with ZooKeeper
ZooKeeper must already be running; see the ZooKeeper article in this series for installation and startup.
# create the HA znode in ZooKeeper (run on hadoop100)
/opt/software/hadoop/bin/hdfs zkfc -formatZK
# start the JournalNodes on all nodes (they must be up before formatting the NameNode)
sh all.sh "/opt/software/hadoop/sbin/hadoop-daemon.sh start journalnode"
# on hadoop100: format and start the first NameNode
/opt/software/hadoop/bin/hdfs namenode -format
/opt/software/hadoop/sbin/hadoop-daemon.sh start namenode
# on hadoop101: copy the formatted metadata from nn1, then start the standby NameNode
/opt/software/hadoop/bin/hdfs namenode -bootstrapStandby
/opt/software/hadoop/sbin/hadoop-daemon.sh start namenode
# on hadoop100 and hadoop101: start the ZK failover controllers
/opt/software/hadoop/sbin/hadoop-daemon.sh start zkfc
Hadoop cluster start/stop script
vim /opt/script/hadoop.sh
#!/bin/bash
## start/stop the whole Hadoop cluster (zk.sh and all.sh are the helper scripts used earlier)
hadoop_start(){
sh /opt/script/zk.sh start
ssh hadoop100 "/opt/software/hadoop/sbin/hadoop-daemon.sh start zkfc"
ssh hadoop101 "/opt/software/hadoop/sbin/hadoop-daemon.sh start zkfc"
ssh hadoop100 "/opt/software/hadoop/sbin/start-dfs.sh;/opt/software/hadoop/sbin/start-yarn.sh"
ssh hadoop100 "/opt/software/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver"
echo "****************** hadoop start *********************"
}
hadoop_stop(){
# sh /opt/script/zk.sh stop
ssh hadoop100 "/opt/software/hadoop/sbin/stop-all.sh"
ssh hadoop100 "/opt/software/hadoop/sbin/hadoop-daemon.sh stop zkfc"
ssh hadoop101 "/opt/software/hadoop/sbin/hadoop-daemon.sh stop zkfc"
ssh hadoop100 "/opt/software/hadoop/bin/mapred --daemon stop historyserver"
echo "****************** hadoop stop *********************"
}
hadoop_status(){
# sh /opt/script/zk.sh status
sh /opt/script/all.sh "jps"
echo "****************** hadoop status *********************"
}
case $1 in
"start"){
hadoop_start
};;
"stop"){
hadoop_stop
};;
"restart"){
hadoop_stop
sleep 2
hadoop_start
};;
"status"){
hadoop_status
};;
*){
echo "[ERROR] invalid argument: use start|stop|restart|status"
};;
esac
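The script is invoked as, e.g., `sh /opt/script/hadoop.sh start`. The dispatch pattern it uses is easy to test in isolation; here is a self-contained version with stubbed actions (the echo bodies are hypothetical stand-ins, not the real start/stop logic):

```shell
dispatch() {
  case $1 in
    start)   echo "starting" ;;
    stop)    echo "stopping" ;;
    restart) echo "stopping"; echo "starting" ;;   # stop, then start
    status)  echo "status" ;;
    *)       echo "usage: start|stop|restart|status" ;;
  esac
}
dispatch start   # prints "starting"
```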
Verification
Start ZooKeeper first, then Hadoop, and run jps on all three hosts to confirm that the expected Hadoop processes are up.
Then open the two NameNode web UIs (hadoop100 and hadoop101): one should be active and the other standby, and the Datanodes page should show 3 live nodes.
# NameNode web UI (nn1)
http://192.168.100.100:50070/
# NameNode web UI (nn2)
http://192.168.100.101:50070/
# YARN web UI
http://192.168.100.100:8088/