03. Big Data: Hadoop

Introduction

Hadoop is an open-source software framework under Apache, implemented in Java. It is a platform for developing and running software that processes data at scale: it allows large data sets to be processed in a distributed fashion across clusters of many machines using simple programming models.

Architecture

Hadoop's core components are (a small job sketch follows the list):

HDFS (distributed file system): large-scale data storage
MAPREDUCE (distributed computation framework): large-scale data processing
YARN (job scheduling and cluster resource management framework): resource and task scheduling
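
As a quick illustration of how the three components cooperate, the sketch below runs the word-count example that ships with the Hadoop distribution: the input and output live on HDFS, the job itself is a MapReduce program, and YARN schedules it on the cluster. This is a hedged sketch that assumes the cluster built in the Installation section below and the hadoop-3.2.2 examples jar; the file and directory names are purely illustrative.

# put a local file into HDFS (storage: HDFS)
hdfs dfs -mkdir -p /demo/input
hdfs dfs -put /etc/hosts /demo/input
# submit the bundled word-count job (compute: MapReduce, scheduling: YARN)
hadoop jar /opt/software/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /demo/input /demo/output
# read the result back from HDFS
hdfs dfs -cat /demo/output/part-r-00000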

Hadoop Ecosystem

Today Hadoop has grown into a large ecosystem. As the ecosystem has matured, more and more projects have appeared, including some that are not managed by Apache.
These projects complement HADOOP or provide higher-level abstractions on top of it. For example:
HDFS: distributed file system
MAPREDUCE: distributed computation framework
HIVE: distributed data warehouse on top of HADOOP, providing SQL-based data querying
HBASE: distributed database for massive data on top of HADOOP
ZOOKEEPER: basic distributed coordination service
Mahout: machine learning library on top of distributed engines such as mapreduce/spark/flink
OOZIE: workflow scheduling framework
Sqoop: data import/export tool (for example between mysql and HDFS)
FLUME: log data collection framework
IMPALA: real-time SQL query and analysis based on hive


Installation

Cluster planning

|               | hadoop100                       | hadoop101                       | hadoop102                    |
| ------------- | ------------------------------- | ------------------------------- | ---------------------------- |
| IP            | 192.168.100.100                 | 192.168.100.101                 | 192.168.100.102              |
| Prerequisites | JDK, ZooKeeper (zk)             | JDK, ZooKeeper (zk)             | JDK, ZooKeeper (zk)          |
| HDFS          | JournalNode, NameNode, DataNode | JournalNode, NameNode, DataNode | JournalNode, DataNode        |
| YARN          | ResourceManager, NodeManager    | NodeManager                     | ResourceManager, NodeManager |

The JDK and ZooKeeper must be installed on all three nodes beforehand; a quick prerequisite check is sketched below the table.
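
A hedged pre-flight check for the prerequisites, assuming the JDK lives at /opt/jdk1.8.0_271 and ZooKeeper at /opt/software/zookeeper (the paths used later in this guide); adjust if your layout differs:

# JDK present on every node
/opt/jdk1.8.0_271/bin/java -version
# ZooKeeper running on every node
/opt/software/zookeeper/bin/zkServer.sh status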

Unpack the archive

tar -zxvf hadoop-3.2.2.tar.gz -C /opt/software/
#rename the directory
cd /opt/software
mv hadoop-3.2.2 hadoop

Configure the default daemon users

cd /opt/software/hadoop/sbin

vim start-dfs.sh

#add the following
#!/usr/bin/env bash 
HDFS_DATANODE_USER=root 
HADOOP_SECURE_DN_USER=hdfs 
HDFS_NAMENODE_USER=root 
HDFS_SECONDARYNAMENODE_USER=root 
HDFS_JOURNALNODE_USER=root

vim stop-dfs.sh

#add the following
#!/usr/bin/env bash 
HDFS_DATANODE_USER=root 
HADOOP_SECURE_DN_USER=hdfs 
HDFS_NAMENODE_USER=root 
HDFS_SECONDARYNAMENODE_USER=root 
HDFS_JOURNALNODE_USER=root

vim start-yarn.sh

#add the following
#!/usr/bin/env bash 
YARN_RESOURCEMANAGER_USER=root 
HADOOP_SECURE_DN_USER=yarn 
YARN_NODEMANAGER_USER=root

vim stop-yarn.sh

#add the following
#!/usr/bin/env bash 
YARN_RESOURCEMANAGER_USER=root 
HADOOP_SECURE_DN_USER=yarn 
YARN_NODEMANAGER_USER=root
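
As an alternative to editing the four sbin scripts, Hadoop 3.x also picks these *_USER variables up from etc/hadoop/hadoop-env.sh. A hedged sketch, not part of the original steps, that keeps them in one place:

# optional: add to /opt/software/hadoop/etc/hadoop/hadoop-env.sh instead of the sbin scripts
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_JOURNALNODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root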

Edit the related configuration files

cd /opt/software/hadoop/etc/hadoop/

Set the DataNode (worker) nodes

vim workers

hadoop100 
hadoop101 
hadoop102 

Set the JDK path (JAVA_HOME)

vim hadoop-env.sh

export JAVA_HOME=/opt/jdk1.8.0_271

vim mapred-env.sh

export JAVA_HOME=/opt/jdk1.8.0_271

vim yarn-env.sh

export JAVA_HOME=/opt/jdk1.8.0_271

Edit the XML configuration files

vim hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Number of replicas per block in HDFS; normally no larger than the number of DataNodes -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Block size (128 MB) -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <!-- mycluster is a user-defined nameservice id; it is referenced below and must match core-site.xml -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- mycluster has two NameNodes: nn1 and nn2 -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Enable the WebHDFS REST interface (served on the NameNode HTTP address) -->
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <!-- RPC address of nn1 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>hadoop100:9000</value>
  </property>
  <!-- HTTP address of nn1 (NameNode web UI) -->
  <property>
    <name>dfs.namenode.http-address.mycluster.nn1</name>
    <value>hadoop100:50070</value>
  </property>
  <!-- RPC address of nn2 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>hadoop101:9000</value>
  </property>
  <!-- HTTP address of nn2 (NameNode web UI) -->
  <property>
    <name>dfs.namenode.http-address.mycluster.nn2</name>
    <value>hadoop101:50070</value>
  </property>
  <!-- Location of the NameNode's shared edits on the JournalNodes; use an odd number of JournalNodes so quorum voting works -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop100:8485;hadoop101:8485;hadoop102:8485/mycluster</value>
  </property>
  <!-- Local directory where each JournalNode stores its edits and state -->
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/opt/data/hadoopdata/dfs/journaldata</value>
  </property>
  <!-- Proxy provider that clients use to find the active NameNode during failover -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <!-- Fencing method(s), one per line, used to prevent split-brain during failover -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <!-- sshfence requires passwordless SSH (as root here); private key used for the SSH connection -->
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
  </property>
  <!-- SSH connect timeout for fencing -->
  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>30000</value>
  </property>
  <!-- Disable permission checks (optional) -->
  <!-- <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property> -->
  <!-- Local directory where the NameNode stores its fsimage/edits metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/data/hadoopdata/dfs/namenode</value>
  </property>
  <!-- Local directory where the DataNode stores block data -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/data/hadoopdata/dfs/datanode</value>
  </property>
  <!-- Enable automatic NameNode failover for the mycluster nameservice -->
  <property>
    <name>dfs.ha.automatic-failover.enabled.mycluster</name>
    <value>true</value>
  </property>
</configuration>
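
To sanity-check these HA settings once the environment variables below are in place, the NameNode list can be read straight back from the configuration. A hedged sketch; the output should match nn1/nn2 above:

# list the NameNodes resolved from hdfs-site.xml
hdfs getconf -namenodes
# confirm the nameservice id
hdfs getconf -confKey dfs.nameservices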

vim core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Default file system: the HDFS nameservice defined in hdfs-site.xml -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <!-- I/O buffer size used when operating on HDFS -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/data/hadoopdata/dfs/tmp</value>
  </property>
  <!-- ZooKeeper quorum used for HA -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
  </property>
  <!-- Raise the IPC retry settings to avoid ConnectException while the JournalNodes are still starting -->
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>100</value>
  </property>
  <property>
    <name>ipc.client.connect.retry.interval</name>
    <value>10000</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
  <!-- Static user for the web UIs (root here) -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
  </property>
  <!-- Disable permission checks -->
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
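
Similarly, the core-site values can be read back to confirm the default file system and the ZooKeeper quorum. A hedged sketch:

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey ha.zookeeper.quorum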

vim mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Run MapReduce on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Host and port of the MapReduce JobHistory server -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop100:10020</value>
  </property>
  <!-- Host and port of the JobHistory web UI -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop100:19888</value>
  </property>
  <!-- Show at most 20000 finished jobs in the JobHistory web UI -->
  <property>
    <name>mapreduce.jobhistory.joblist.cache.size</name>
    <value>20000</value>
  </property>
  <!-- Directories for job run logs -->
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
  </property>
</configuration>
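
Once the cluster is running, the JobHistory server configured above is started on hadoop100 and its web UI should answer on port 19888. A hedged sketch; the batch script at the end of this guide also starts it:

# on hadoop100
/opt/software/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
# web UI: http://192.168.100.100:19888/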

vim yarn-site.xml

<?xml version="1.0"?>
<configuration>
  <!-- Set per ResourceManager: rm1 on hadoop100, rm2 on hadoop102; comment this property out on hadoop101 -->
  <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
  </property>
  <!-- Enable ResourceManager HA -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <!-- How reducers fetch map output -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Logical name of the YARN HA cluster -->
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarncluster</value>
  </property>
  <!-- ResourceManagers in the HA cluster -->
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>hadoop100</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>hadoop102</value>
  </property>
  <!-- ZooKeeper cluster address -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
  </property>
  <!-- Enable automatic recovery of ResourceManager state -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>hadoop100:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>hadoop100:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>hadoop100:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>hadoop100:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.admin.address.rm1</name>
    <value>hadoop100:23142</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>hadoop100:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>hadoop102:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>hadoop102:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>hadoop102:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>hadoop102:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>hadoop102:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.admin.address.rm2</name>
    <value>hadoop102:23142</value>
  </property>
  <!-- Store ResourceManager state in the ZooKeeper cluster -->
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <!-- Optional explicit ZooKeeper address for the state store -->
  <!-- <property>
    <name>yarn.resourcemanager.zk-state-store.address</name>
    <value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
  </property> -->
  <!-- ZooKeeper cluster list -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <!-- Enable log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Keep aggregated logs for 7 days -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/opt/software/hadoop/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/opt/software/hadoop/yarn/logs</value>
  </property>
  <property>
    <name>mapreduce.shuffle.port</name>
    <value>23080</value>
  </property>
  <!-- Failover proxy provider used by YARN clients -->
  <property>
    <name>yarn.client.failover-proxy-provider</name>
    <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
  </property>
  <!-- ZooKeeper base path for automatic-failover state; kept consistent with the data directory configured in zoo.cfg -->
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name>
    <value>/opt/software/zookeeper/data</value>
  </property>
  <!-- hadoop classpath -->
  <property>
    <name>yarn.application.classpath</name>
    <value>/opt/software/hadoop/etc/hadoop:/opt/software/hadoop/share/hadoop/common/lib/*:/opt/software/hadoop/share/hadoop/common/*:/opt/software/hadoop/share/hadoop/hdfs:/opt/software/hadoop/share/hadoop/hdfs/lib/*:/opt/software/hadoop/share/hadoop/hdfs/*:/opt/software/hadoop/share/hadoop/mapreduce/lib/*:/opt/software/hadoop/share/hadoop/mapreduce/*:/opt/software/hadoop/share/hadoop/yarn:/opt/software/hadoop/share/hadoop/yarn/lib/*:/opt/software/hadoop/share/hadoop/yarn/*</value>
  </property>
</configuration>
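
Once both ResourceManagers are running, their HA state can be checked against the rm1/rm2 ids defined above. A hedged sketch; one should report active, the other standby:

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2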

Distribute to the other nodes

scp -r /opt/software/hadoop hadoop101:/opt/software/hadoop
scp -r /opt/software/hadoop hadoop102:/opt/software/hadoop

Then edit (vim) /opt/software/hadoop/etc/hadoop/yarn-site.xml on the other nodes

yarn.resourcemanager.ha.id: comment the property out on hadoop101 and change its value to rm2 on hadoop102, as sketched below.
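
A hedged sketch of that per-node adjustment (the sed call is only illustrative; editing the files by hand works just as well):

# hadoop102: change the ha.id value from rm1 to rm2
ssh hadoop102 "sed -i 's#<value>rm1</value>#<value>rm2</value>#' /opt/software/hadoop/etc/hadoop/yarn-site.xml"
# hadoop101: edit /opt/software/hadoop/etc/hadoop/yarn-site.xml by hand and comment out the whole yarn.resourcemanager.ha.id property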

Add environment variables

vim /etc/profile

#HADOOP_HOME
export HADOOP_HOME=/opt/software/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
#sync the profile to the other nodes
rsync.sh /etc/profile
#make the environment variables take effect
all.sh "source /etc/profile"

Register with ZooKeeper

ZooKeeper must already be running; for ZooKeeper installation and startup see the earlier article 大数据之zookeeper.

#create the HA znode in ZooKeeper (run on hadoop100)
/opt/software/hadoop/bin/hdfs zkfc -formatZK
#start the JournalNode on all nodes
sh all.sh "/opt/software/hadoop/sbin/hadoop-daemon.sh start journalnode"
#run on hadoop100
/opt/software/hadoop/bin/hdfs namenode -format
/opt/software/hadoop/sbin/hadoop-daemon.sh start namenode
#run on hadoop101
/opt/software/hadoop/bin/hdfs namenode -bootstrapStandby
/opt/software/hadoop/sbin/hadoop-daemon.sh start namenode
#run on hadoop100 and hadoop101
/opt/software/hadoop/sbin/hadoop-daemon.sh start zkfc
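
At this point the HA pair can be verified from the command line. A hedged sketch; with automatic failover enabled, one NameNode should become active shortly after the ZKFCs start:

# one should report active, the other standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# optionally inspect ZooKeeper: run /opt/software/zookeeper/bin/zkCli.sh and then "ls /hadoop-ha"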

Hadoop batch start/stop script

vim /opt/script/hadoop.sh

#!/bin/bash
## Hadoop cluster start/stop script
hadoop_start(){
  sh /opt/script/zk.sh start
  ssh hadoop100 "/opt/software/hadoop/sbin/hadoop-daemon.sh start zkfc"
  ssh hadoop101 "/opt/software/hadoop/sbin/hadoop-daemon.sh start zkfc"
  ssh hadoop100 "/opt/software/hadoop/sbin/start-dfs.sh;/opt/software/hadoop/sbin/start-yarn.sh"
  ssh hadoop100 "/opt/software/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver"
  echo "****************** hadoop start *********************"
}
hadoop_stop(){
  # sh /opt/script/zk.sh stop
  ssh hadoop100 "/opt/software/hadoop/sbin/stop-all.sh"
  ssh hadoop100 "/opt/software/hadoop/sbin/hadoop-daemon.sh stop zkfc"
  ssh hadoop101 "/opt/software/hadoop/sbin/hadoop-daemon.sh stop zkfc"
  ssh hadoop100 "/opt/software/hadoop/bin/mapred --daemon stop historyserver"
  echo "****************** hadoop stop *********************"
}
hadoop_status(){
  # sh /opt/script/zk.sh status
  sh /opt/script/all.sh "jps"
  echo "****************** hadoop status *********************"
}
case $1 in
"start"){
  hadoop_start
};;
"stop"){
  hadoop_stop
};;
"restart"){
  hadoop_stop
  sleep 2
  hadoop_start
};;
"status"){
  hadoop_status
};;
*){
  echo "[ERROR] invalid argument: please pass start|stop|restart|status"
};;
esac
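
Usage of the script, as a hedged sketch (it assumes the zk.sh and all.sh helper scripts referenced above exist under /opt/script):

chmod +x /opt/script/hadoop.sh
sh /opt/script/hadoop.sh start
sh /opt/script/hadoop.sh status
sh /opt/script/hadoop.sh stop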

Verification

Start ZooKeeper first, then Hadoop; jps on all three machines should show the corresponding Hadoop processes.
The NameNode web UIs (nodes 100 and 101) should show one active and one standby NameNode, and there should be 3 live DataNodes. Equivalent CLI checks are sketched after the URLs.
#NameNode web UI
http://192.168.100.100:50070/
#NameNode web UI
http://192.168.100.101:50070/
#YARN web UI
http://192.168.100.100:8088/
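
Equivalent checks from the command line, as a hedged sketch run on hadoop100:

# all daemons on the three nodes
sh /opt/script/all.sh "jps"
# expect 3 live DataNodes in the report
hdfs dfsadmin -report
# expect 3 registered NodeManagers
yarn node -list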
