Hadoop Cluster Administration and Maintenance

1. Safe Mode

Check whether the namenode is in safe mode:

[root@bdpnamenodemaster ~]# hadoop dfsadmin -safemode get
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is OFF in bdpnamenodebackup.edw.com/192.168.1.40:8020
Safe mode is OFF in bdpnamenodemaster.edw.com/192.168.1.39:8020

Enter and leave safe mode:

hadoop dfsadmin -safemode enter
hadoop dfsadmin -safemode leave

Wait for the namenode to exit safe mode before carrying out a command (useful in scripts):

hadoop dfsadmin -safemode wait
  • The system leaves safe mode automatically 30 seconds after the "minimal replication condition" is met.
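
For example, a maintenance script can block until the namenode is writable before doing any work; a minimal sketch (the target path is hypothetical):

hdfs dfsadmin -safemode wait        # blocks until safe mode is OFF
hdfs dfs -mkdir -p /backup/staging  # now safe to write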

2. Audit Logging

  • Hadoop's audit logging is implemented with log4j at the INFO level; it is disabled by default: log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN.
  • Enable the audit feature by replacing WARN with INFO; a dedicated path for the audit log can also be configured in log4j.
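
A minimal log4j.properties sketch that enables auditing and routes audit events to their own file (the appender name AUDIT and the file path are assumptions, not Hadoop defaults):

# raise the audit logger to INFO and detach it from the root logger's output
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=INFO,AUDIT
log4j.additivity.org.apache.hadoop.fs.FSNamesystem.audit=false
# hypothetical dedicated appender for audit events
log4j.appender.AUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.AUDIT.File=/var/log/hadoop/hdfs-audit.log
log4j.appender.AUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.AUDIT.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n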

3. Tools

3.1 hadoop dfsadmin

[root@bdpnamenodemaster etc]# hadoop dfsadmin
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Usage: hdfs dfsadmin
Note: Administrative commands can only be run as the HDFS superuser.
        [-report [-live] [-dead] [-decommissioning]]
        [-safemode <enter | leave | get | wait>]
        [-saveNamespace]
        [-rollEdits]
        [-restoreFailedStorage true|false|check]
        [-refreshNodes]
        [-setQuota <quota> <dirname>...<dirname>]
        [-clrQuota <dirname>...<dirname>]
        [-setSpaceQuota <quota> <dirname>...<dirname>]
        [-clrSpaceQuota <dirname>...<dirname>]
        [-finalizeUpgrade]
        [-rollingUpgrade [<query|prepare|finalize>]]
        [-refreshServiceAcl]
        [-refreshUserToGroupsMappings]
        [-refreshSuperUserGroupsConfiguration]
        [-refreshCallQueue]
        [-refresh <host:ipc_port> <key> [arg1..argn]
        [-reconfig <datanode|...> <host:ipc_port> <start|status|properties>]
        [-printTopology]
        [-refreshNamenodes datanode_host:ipc_port]
        [-deleteBlockPool datanode_host:ipc_port blockpoolId [force]]
        [-setBalancerBandwidth <bandwidth in bytes per second>]
        [-fetchImage <local directory>]
        [-allowSnapshot <snapshotDir>]
        [-disallowSnapshot <snapshotDir>]
        [-shutdownDatanode <datanode_host:ipc_port> [upgrade]]
        [-getDatanodeInfo <datanode_host:ipc_port>]
        [-metasave filename]
        [-triggerBlockReport [-incremental] <datanode_host:ipc_port>]
        [-help [cmd]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
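
In day-to-day use, the most common invocations are a cluster status report and a metadata dump for diagnosis; for example (the dump filename is illustrative):

hdfs dfsadmin -report              # capacity, live/dead datanodes, per-node usage
hdfs dfsadmin -metasave meta.txt   # writes block and replication state to meta.txt under the namenode's log directory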

3.2 Checking the Health of Files in HDFS

[root@bdpnamenodemaster etc]# hadoop fsck
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Usage: DFSck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
        <path>  start checking from this path
        -move   move corrupted files to /lost+found
        -delete delete corrupted files
        -files  print out files being checked
        -openforwrite   print out files opened for write
        -includeSnapshots       include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
        -list-corruptfileblocks print out list of missing blocks and files they belong to
        -blocks print out block report
        -locations      print out locations for every block
        -racks  print out network topology for data-node locations

        -blockId        print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

Please Note:
        1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually  tagged CORRUPT or HEALTHY depending on their block allocation status
        2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
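
A typical health check starts at the root and then drills into a suspect directory; for example (the /user/hive path is illustrative):

hdfs fsck /                                      # overall health summary for the whole filesystem
hdfs fsck /user/hive -files -blocks -locations   # per-file block report with replica locations
hdfs fsck / -list-corruptfileblocks              # missing blocks and the files they belong to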


3.3 DataNode Block Scanner

  • Each datanode runs a block scanner, which periodically verifies all the blocks stored on the node (every 504 hours, i.e. three weeks, by default; configurable via the dfs.datanode.scan.period.hours property). This catches and repairs bad blocks before a client reads them. The scanner throttles itself to limit the disk bandwidth it consumes on the DataNode.
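
A sketch of shortening the scan period in hdfs-site.xml, assuming you want a weekly scan (168 hours) instead of the three-week default:

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>168</value> <!-- verify every block once a week -->
</property>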

3.4 Balancer (block balancing): an unbalanced cluster reduces MapReduce data locality

  • The balancer is a Hadoop daemon; start it with start-balancer.sh.
    To keep the load on the cluster low and avoid disturbing other users, the balancer is designed to run in the background. The bandwidth it may use to copy data between nodes is also limited: 1 MB/s by default, configurable via the dfs.balance.bandwidthPerSec property in hdfs-site.xml (in bytes).
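
For example (10% is the balancer's conventional default threshold; the bandwidth figure is illustrative):

start-balancer.sh -threshold 10               # run until every datanode is within 10% of average utilization
hdfs dfsadmin -setBalancerBandwidth 10485760  # raise the copy bandwidth to 10 MB/s without restarting datanodes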

3.5 Backup Tool

  • distcp is an ideal backup tool: its parallel file-copy capability can write backups to another HDFS cluster or to another Hadoop-supported filesystem such as S3 or KFS.
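
A minimal sketch, assuming a second cluster whose namenode is nn2.example.com (hostnames and paths are hypothetical):

hadoop distcp hdfs://nn1.example.com:8020/data hdfs://nn2.example.com:8020/backup/data
hadoop distcp -update hdfs://nn1.example.com:8020/data hdfs://nn2.example.com:8020/backup/data  # incremental: copy only files that changed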

4. Commissioning and Decommissioning Nodes

4.1 Commissioning New Nodes (Expanding the Cluster)

(1) Add the network addresses of the new nodes to the include file (see the configuration sketch at the end of this section);
(2) hadoop dfsadmin -refreshNodes  # update the namenode with the new set of approved datanodes;
(3) hadoop mradmin -refreshNodes  # update the jobtracker with the new set of approved tasktrackers;
(4) Update the slaves file with the new nodes, so that the Hadoop control scripts include them in future operations;
(5) Start the new datanodes and tasktrackers;
(6) Check that the new datanodes and tasktrackers appear in the web UI.

HDFS does not automatically move blocks off old datanodes onto new ones to rebalance the cluster; you need to run the balancer yourself.
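
A sketch of wiring up the include file, assuming it lives at /etc/hadoop/conf/include with one hostname per line (the path is an assumption; dfs.hosts is the standard HDFS property and mapred.hosts its jobtracker counterpart):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/include</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts</name>
  <value>/etc/hadoop/conf/include</value>
</property>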

4.2 Decommissioning Nodes (Removing Them from the Cluster)

HDFS tolerates datanode failures, but that does not mean datanodes may be killed arbitrarily. With three-way replication, for example, shutting down three datanodes on different racks at the same time makes data loss very likely.
The correct approach is to tell the namenode which datanodes are to be retired, so that Hadoop can replicate their blocks to other datanodes before those nodes are shut down.

(1) Add the network addresses of the nodes to be decommissioned to the exclude file. Do not update the include file.
(2) Update the namenode with the new set of approved datanodes: % hadoop dfsadmin -refreshNodes
(3) Update the jobtracker with the new set of approved tasktrackers: % hadoop mradmin -refreshNodes
(4) Go to the web UI and check whether the admin state of the datanodes being decommissioned has changed to "Decommission In Progress", meaning they are in the process of being decommissioned: they are copying their blocks to other datanodes.
(5) When the state of all of them has changed to "Decommissioned", every block has been replicated; shut down the decommissioned nodes.
(6) Remove the nodes from the include file and run:
% hadoop dfsadmin -refreshNodes
% hadoop mradmin -refreshNodes
(7) Remove the nodes from the slaves file.
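
Decommission progress can also be watched from the command line; a minimal sketch (the grep pattern matches the per-node "Decommission Status" field of the report):

hdfs dfsadmin -report | grep "Decommission Status"   # one line per datanode: Normal, Decommission in progress, or Decommissioned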
