2023-11-14 15:09:16
#When does HDFS enter safe mode:
1. On first startup, while the NameNode checks all nodes;
2. Too few DataNodes: the number of live DataNodes falls below the NameNode's configured minimum (see the knob sketch after this list);
3. Corrupt blocks on DataNodes: HDFS enters safe mode and tries to repair them automatically until the number of corrupt blocks drops below the acceptable threshold;
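Two of these triggers map directly to hdfs-site.xml knobs. A minimal sketch with the stock 2.x defaults (verify them against your Hadoop version); the block-ratio threshold dfs.namenode.safemode.threshold-pct is shown at the end of this note:
<!-- minimum number of live DataNodes required before leaving safe mode (default 0) -->
<property>
  <name>dfs.namenode.safemode.min.datanodes</name>
  <value>0</value>
</property>
<!-- extra time to stay in safe mode after the thresholds are met, in ms (default 30000) -->
<property>
  <name>dfs.namenode.safemode.extension</name>
  <value>30000</value>
</property>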
#HBase write path:
1. The client asks ZooKeeper which RegionServer hosts the meta table (say, node A);
2. The client reads the meta table from RegionServer A, which returns the table's metadata;
3. From that metadata, the client learns the data should be written to a specific region on RegionServer B;
4. The client sends the write to RegionServer B, which first appends it to the WAL (write-ahead log) and then applies it to the MemStore; the MemStore is eventually flushed to StoreFiles on disk;
When HBase writes data, which RegionServer the data goes to is decided by the Master's region assignment, with ZooKeeper helping keep that state consistent; this is somewhat like Elasticsearch's write-routing mechanism (you can inspect the assignments yourself, as in the sketch below).
When a particular RegionServer cannot accept writes, the client still writes to it, because the metadata it fetched says the data belongs on that node; so the client cannot detect whether HDFS has entered safe mode, and even if it could, a write it refuses to make would not be redirected to another node.
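To see the region-to-RegionServer assignment the client resolves in steps 1-3, you can scan the meta table from the HBase shell. A minimal sketch ('mytable' is a hypothetical table name):
# inside `hbase shell`: show which server hosts each region (info:server column of hbase:meta)
scan 'hbase:meta', {COLUMNS => 'info:server'}
# a test write; the client routes it to the owning RegionServer on its own
put 'mytable', 'row1', 'cf:c1', 'v1'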
#Therefore, when safe mode kicks in, directions to consider (a quick triage sketch follows the list):
1) First startup;
2) Too few DataNodes;
3) Adjust the NameNode's DataNode threshold;
4) Adjust the acceptable corrupt-block threshold;
5) Lower the threshold for entering safe mode;
6) Check disk health;
7) Node failures;
8) Manually exit safe mode;
9) HDFS automatic failover
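A quick first pass before digging into any single direction; all three are standard hdfs commands:
hdfs dfsadmin -safemode get      # is the NameNode currently in safe mode?
hdfs dfsadmin -report            # live/dead DataNodes and per-node capacity
hdfs fsck /                      # overall block health (watch for CORRUPT / missing blocks)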
#Configure automatic failover with ZKFC; the process name is DFSZKFailoverController:
https://developer.aliyun.com/article/1250213
Automatic failover addresses NameNode node failures and lost data blocks; the required properties follow, and the startup commands are sketched after the config.
Official docs:
https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Configuration_overview (Automatic Failover section)
hdfs-site.xml:
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
core-site.xml:
<property>
  <name>ha.zookeeper.quorum</name>
  <value>master:2181,slave2:2181,slave3:2181</value>
</property>
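With the two properties above in place, the ZKFC znode has to be initialized once in ZooKeeper and the zkfc daemon started on each NameNode host, per the 2.10 HA docs:
# one-time: create the ZKFC state znode (run on one NameNode as the HDFS user)
hdfs zkfc -formatZK
# start the failover controller on each NameNode host (Hadoop 2.x; 3.x uses `hdfs --daemon start zkfc`)
hadoop-daemon.sh start zkfc
# verify: jps should now list DFSZKFailoverController
jps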
#How to manually exit safe mode
Official docs:
https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfsadmin
hdfs dfsadmin -safemode enter|leave|get|wait|forceExit
Safe mode maintenance command. Safe mode is a Namenode state in which it
1. does not accept changes to the name space (read-only)
2. does not replicate or delete blocks.
Safe mode is entered automatically at Namenode startup, and leaves safe mode automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. If Namenode detects any anomaly then it will linger in safe mode till that issue is resolved. If that anomaly is the consequence of a deliberate action, then administrator can use -safemode forceExit to exit safe mode. The cases where forceExit may be required are
1. Namenode metadata is not consistent. If Namenode detects that metadata has been modified out of band and can cause data loss, then Namenode will enter forceExit state. At that point user can either restart Namenode with correct metadata files or forceExit (if data loss is acceptable).
2. Rollback causes metadata to be replaced and rarely it can trigger safe mode forceExit state in Namenode. In that case you may proceed by issuing -safemode forceExit.
Safe mode can also be entered manually, but then it can only be turned off manually as well.
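In practice the subcommands compose into a small maintenance routine; a sketch (start-batch-job.sh is a hypothetical placeholder):
hdfs dfsadmin -safemode get      # check the current state
hdfs dfsadmin -safemode wait     # block until the NameNode leaves safe mode on its own
hdfs dfsadmin -safemode leave    # or force it out immediately, once you know why it entered
# e.g. gate a batch job on safe mode having cleared:
hdfs dfsadmin -safemode wait && ./start-batch-job.sh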
#Check disk health:
hdfs fsck / -files -blocks -locations
#Check whether data blocks are corrupt:
1. Run fsck; if the printed status is CORRUPT and CORRUPT FILES is greater than 0, data blocks are corrupt:
hdfs fsck /
2. Locate the corrupt blocks; this prints the affected HDFS file paths:
hdfs fsck / -list-corruptfileblocks
3. Delete or recover the blocks
Delete: removing the corrupt files is enough, e.g. hdfs fsck / -delete, or hdfs fsck / -move to salvage what is readable into /lost+found first;
Recover: for a file stuck with an open lease, hdfs debug recoverLease -path <file> -retries 3; blocks that still have healthy replicas elsewhere are re-replicated by HDFS automatically;
If DataNodes were removed or re-added, hdfs dfsadmin -refreshNodes makes the NameNode re-read the include/exclude host files and pick up the node changes;
4. Re-run hdfs fsck / to confirm the corrupt blocks are resolved (the whole loop is sketched below);
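The whole check-and-clean loop in one sketch; the destructive steps are commented out and should only run once the affected files are confirmed expendable:
hdfs fsck /                              # overall health; look for CORRUPT FILES > 0
hdfs fsck / -list-corruptfileblocks      # which files/blocks are affected
# hdfs fsck / -move                      # salvage readable data into /lost+found
# hdfs fsck / -delete                    # or drop the corrupt files outright
hdfs fsck /                              # re-check until the status is HEALTHY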
# Lower the threshold for entering safe mode by editing hdfs-site.xml (0.999f is the default; set it lower, then restart the NameNode for the change to take effect)
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>
Talked with ops at noon: the overseas environment entered safe mode because of node communication problems, and a restart resolved it; this may well have been case 2 above, too few live DataNodes triggering safe mode.
Given the HBase write path, we can try tuning the session timeout between HBase and ZooKeeper, zookeeper.session.timeout (default 3 min); with a shorter timeout, the Master learns sooner that a node has communication problems, immediately reassigns the regions that node was serving to other machines, and rebalances (a config sketch follows).
Setting it too low can cause frequent rebalancing, so consider disabling the automatic balancer and running it from a scheduled script instead.
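A sketch of both knobs; the timeout value is an example, not a recommendation, and the cron line assumes the hbase shell is on PATH:
hbase-site.xml:
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value> <!-- example: 60 s -->
</property>
# inside `hbase shell`: turn off the automatic balancer
balance_switch false
# cron entry: run the balancer manually every night at 03:00
0 3 * * * echo "balancer" | hbase shell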