HDP Learning -- Managing HDFS Storage (01)


1. NameNode Persistence

For performance, the current state of the HDFS file system is kept in the NameNode's memory; whenever a user or application requests file system information, it is served from that in-memory state. When a client modifies the file system, the state held in NameNode memory must be updated.

Memory is fast but volatile. A hardware or power failure would lose the file system state, so to allow recovery, the in-memory state is periodically persisted to disk.

When a client writes to HDFS, the modification request is first persisted to a disk-based edit log. Only after the edit log entry has been written is the in-memory file system state updated and a success response returned to the client.

Multiple copies of the edit log can be kept, as determined by the dfs.namenode.edits.dir property in hdfs-site.xml. An administrator can configure a comma-separated list of directory paths, each holding a separate copy of the edit log. For example, one path might be on a local disk while a second is on a different local disk; a path can also map to a directory mounted from a remote NFS server.
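As a minimal sketch, a dfs.namenode.edits.dir entry in hdfs-site.xml might look like the following; the directory paths are hypothetical (the third assumes an NFS mount on the NameNode host) and should be replaced with mounts that actually exist:

```xml
<!-- hdfs-site.xml: redundant edit log locations (example paths are hypothetical) -->
<property>
  <name>dfs.namenode.edits.dir</name>
  <!-- two local disks plus a directory backed by a remote NFS mount -->
  <value>/hadoop/hdfs/edits1,/data2/hdfs/edits,/mnt/nfs/hdfs/edits</value>
</property>
```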
The full file system state is periodically persisted to an fsimage file, an operation known as checkpointing.

The figure below shows how the NameNode persists file system information:

[Figure: NameNode persisting file system state]

2. NameNode Checkpoint Operations

The NameNode must periodically perform a checkpoint; otherwise the edits file grows without bound. The location of the fsimage file is determined by the dfs.namenode.name.dir property in hdfs-site.xml, and multiple copies of the fsimage can be kept by configuring a comma-separated list of directories. A checkpoint consumes significant CPU and memory, so to preserve NameNode performance the work is offloaded to the Secondary NameNode. If the NameNode is configured for high availability (HA), the Standby NameNode performs checkpointing instead of a Secondary NameNode.
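A sketch of the related hdfs-site.xml settings follows. dfs.namenode.name.dir mirrors the fsimage across directories, while dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns control when a checkpoint is triggered. The paths are hypothetical and the values shown are the common upstream defaults, so verify them against your own distribution:

```xml
<!-- hdfs-site.xml: fsimage locations and checkpoint triggers (illustrative values) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- hypothetical paths; each directory holds a full copy of the fsimage -->
  <value>/hadoop/hdfs/namenode,/data2/hdfs/namenode</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>  <!-- checkpoint at least every hour (seconds) -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>  <!-- or after this many uncheckpointed transactions -->
</property>
```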

The figure below shows the NameNode checkpoint process:

[Figure: NameNode checkpoint flow]

3. NameNode Startup

The NameNode startup process is as follows:

1. When a NameNode starts up, it enters **safemode**, which is a read-only mode.
2. The NameNode loads its latest fsimage and edits files into memory.
3. The NameNode performs a checkpoint that merges the information in the fsimage and edits files. This creates an up-to-date, in-memory image of the file system.
4. The NameNode writes an up-to-date fsimage file to disk and creates a new, empty edits file to track file system changes. At this point it begins to listen for RPC or HTTP HDFS client requests, but it remains in read-only mode: the lists of data blocks that contain HDFS file data are not persisted on the NameNode, so each DataNode must send its block list to the NameNode in a block report.
5. The DataNodes send their block lists to the NameNode, which aggregates this information to rebuild the file block maps in memory. The block maps are used to locate and read file data; the purpose of safemode is to give the NameNode the time necessary to rebuild them.
6. Once the block maps have been rebuilt, the NameNode exits safemode and begins normal read-write operation. A NameNode may exit safemode only when 99.9% of all data blocks are minimally replicated, meaning at least one replica is available. This percentage is set by the **dfs.namenode.safemode.threshold-pct** property in **hdfs-site.xml** (see the configuration sketch below).
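The safemode exit threshold from step 6 can be tuned in hdfs-site.xml. The sketch below shows the property with its usual upstream default of 0.999, plus dfs.namenode.safemode.extension, which keeps the NameNode in safemode for a grace period after the threshold is reached; treat the values as illustrative defaults rather than HDP-specific settings:

```xml
<!-- hdfs-site.xml: safemode exit conditions (illustrative defaults) -->
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999</value>  <!-- fraction of blocks that must be minimally replicated -->
</property>
<property>
  <name>dfs.namenode.safemode.extension</name>
  <value>30000</value>  <!-- stay in safemode this many ms after threshold is met -->
</property>
```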

The figure below shows the NameNode startup process:

[Figure: NameNode startup flow]

4. DataNode Availability

The NameNode determines whether a DataNode is available by listening for its heartbeats. All of the DataNode availability settings live in hdfs-site.xml:

- `dfs.heartbeat.interval`: the DataNode heartbeat interval; 3 seconds by default.
- `dfs.namenode.stale.datanode.interval`: if the NameNode receives no heartbeat from a DataNode within this interval (30 seconds by default), it marks the DataNode as `stale`. The minimum value is three heartbeat intervals.
- `dfs.namenode.avoid.read.stale.datanode`: when `true`, the NameNode places `stale` DataNodes at the end of the node list it returns for client read requests. Defaults to `true` in HDP.
- `dfs.namenode.avoid.write.stale.datanode`: when `true`, the NameNode avoids writing data to `stale` DataNodes. Defaults to `true` in HDP.
- `dfs.namenode.write.stale.datanode.ratio`: only when the fraction of DataNodes that are `stale` exceeds this ratio does the NameNode resume writing to `stale` DataNodes, which prevents overloading the remaining healthy nodes.
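Pulled together, a hypothetical hdfs-site.xml fragment reflecting the settings described above might look like this; the values mirror the defaults mentioned in the text (the write-ratio default is assumed from upstream Hadoop), so check your own distribution before relying on them:

```xml
<!-- hdfs-site.xml: DataNode availability settings (illustrative values) -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>  <!-- seconds between DataNode heartbeats -->
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>  <!-- ms without a heartbeat before a node is stale -->
</property>
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>  <!-- read from stale nodes only as a last resort -->
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>  <!-- skip stale nodes when placing new replicas -->
</property>
<property>
  <name>dfs.namenode.write.stale.datanode.ratio</name>
  <value>0.5</value>  <!-- resume writing to stale nodes above this fraction -->
</property>
```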

If the NameNode receives no heartbeat from a DataNode for an extended period, it declares that DataNode dead. The timeout is (2 × dfs.namenode.heartbeat.recheck-interval) + (10 × dfs.heartbeat.interval), which works out to 10 minutes 30 seconds (630 seconds) with the default settings. The NameNode then schedules new replicas of the blocks that were stored on the dead DataNode.
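A worked example of that formula, assuming the stock defaults (dfs.namenode.heartbeat.recheck-interval = 300000 ms, dfs.heartbeat.interval = 3 s; verify these against your cluster):

```xml
<!-- hdfs-site.xml: inputs to the dead-DataNode timeout (assumed defaults) -->
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>  <!-- milliseconds: 5 minutes -->
</property>
<!-- timeout = 2 * recheck-interval + 10 * heartbeat interval
             = 2 * 300 s + 10 * 3 s = 630 s = 10 min 30 s -->
```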

DataNode availability and heartbeats:

[Figure: DataNode heartbeats and availability]

When a DataNode disk fails:

[Figure: DataNode disk failure handling]

5. Handling Corrupt Data Blocks

[Figure: handling corrupt data blocks]

Over time, disk media suffers a slow deterioration that can change a data bit, effectively corrupting a data file. This is sometimes referred to as data rot. The more data stored on disk, the higher the probability of experiencing a small amount of data rot.

HDFS typically stores a massive amount of data on disk and is therefore more susceptible to data rot than other file system types. HDFS was designed with this in mind and includes two techniques to check data for corruption.

When an HDFS client reads a data block, it performs a checksum verification. If the checksums for the block are okay, the client informs the DataNode, which records this information. This is an acceptable technique for files that are regularly read by clients. However, HDFS commonly holds data that is rarely read and therefore would not have its checksums regularly checked by clients.

A second technique can regularly verify the checksums of data blocks that have not been recently read by an HDFS client. Each DataNode runs a block scanner that reads through unread data blocks on a cyclical basis and verifies their checksums. The block scanner does not fix corrupted blocks.

The block scanner is configured with the dfs.datanode.scan.period.hours property. A value of 0 disables the block scanner; a positive number defines the time period within which all data blocks must be checked. A DataNode will not scan any individual block more than once in the specified period, and the scanner adjusts its read rate to ensure it completes within the configured period.

If you have large volumes of typically unread data and would like to enable block scanning, add dfs.datanode.scan.period.hours to hdfs-site.xml and set it to a positive number. For example, a value of 336 (hours) would ensure that each unread block has its checksum verified at least every two weeks.
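A minimal sketch of that two-week configuration (the property name is standard; 336 hours = 14 days is just the example from the text):

```xml
<!-- hdfs-site.xml: verify each unread block at least once every two weeks -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>336</value>  <!-- 336 hours = 14 days; 0 would disable the scanner -->
</property>
```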
Whenever an HDFS client or the block scanner detects a corrupt block, it notifies the NameNode. The NameNode marks the replica as corrupt but does not immediately schedule its deletion. Instead, it replicates a good copy of the block from another DataNode; once the good replica count reaches the block's replication factor, the corrupt replica is scheduled for removal. The goal of this process is to preserve data as long as possible, so even if all replicas of a block are corrupt, the policy allows the user to retrieve data from the corrupt replicas.

If all replicas of a block are corrupted, but in different places, there is a possibility of manually examining and fixing the data. There is no automatic utility to perform this type of data repair.