HDFS Architecture Overview

Introduction


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/hdfs/.


Assumptions and Goals


Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.


Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.


Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.


Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.


“Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.


Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.


NameNode and DataNodes


HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
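
A toy model may make the division of labor concrete. The sketch below (hypothetical names, not Hadoop's actual classes) keeps only namespace and block-location metadata on the NameNode, while the block bytes themselves live on DataNodes:

```python
# Toy model of the NameNode/DataNode split (illustrative only, not Hadoop's API).
class NameNode:
    """Holds metadata only: the namespace and the block -> DataNode mapping."""
    def __init__(self):
        self.namespace = {}        # file path -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode ids
        self.next_block_id = 0

    def allocate_block(self, path, datanode_ids):
        # A namespace operation: record a new block and where its replicas live.
        block_id = self.next_block_id
        self.next_block_id += 1
        self.namespace.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanode_ids)
        return block_id

class DataNode:
    """Holds the actual block bytes."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}           # block id -> data

# A client writes one block: it asks the NameNode where the replicas should
# go, then ships the bytes directly to those DataNodes. User data never
# flows through the NameNode.
nn = NameNode()
dns = {i: DataNode(i) for i in range(3)}
bid = nn.allocate_block("/foodir/myfile.txt", [0, 1, 2])
for i in nn.block_locations[bid]:
    dns[i].blocks[bid] = b"hello"
```

Note that the NameNode only ever hands out identities and mappings; the client-to-DataNode transfer carries the data.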

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.


The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.


The File System Namespace


HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.


The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.


Data Replication


HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.


Replica Placement: The First Baby Steps


The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.


Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.


The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.


For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

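
The three-replica rule described above can be sketched as follows. This illustrates the stated placement policy only, not the actual Hadoop placement code; the topology map and function name are hypothetical:

```python
import random

def place_replicas(writer_node, topology, seed=0):
    """Sketch of HDFS's default policy for replication factor 3: one replica
    on the writer's node (local rack), one on a node in a different (remote)
    rack, and one on another node in that same remote rack.
    `topology` maps rack id -> list of node names."""
    rng = random.Random(seed)
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    # Two distinct nodes on the chosen remote rack.
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]
```

Because the second and third replicas share a rack, the write crosses racks only once; that is the inter-rack write traffic saving the text describes, at the cost of spreading the block over two racks instead of three.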

The current, default replica placement policy described here is a work in progress.

Replica Selection


To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

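
The preference order (same node, then same rack, then same data center, then anything remote) amounts to a distance function over replica locations. A minimal sketch, with hypothetical (datacenter, rack, node) tuples standing in for real network topology:

```python
def pick_replica(reader, replicas):
    """Sketch of closest-replica selection (not the Hadoop API).
    Each location is a (datacenter, rack, node) tuple; smaller
    distance means a cheaper read path."""
    def distance(loc):
        if loc == reader:
            return 0              # same node
        if loc[:2] == reader[:2]:
            return 1              # same rack
        if loc[0] == reader[0]:
            return 2              # same data center
        return 3                  # remote data center
    return min(replicas, key=distance)
```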

Safemode


On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.

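
The exit condition can be expressed in a few lines. This is a sketch of the stated rule only; the real NameNode tracks block reports incrementally rather than rescanning every block, and applies the extra ~30 second grace period on top:

```python
def safemode_exit(block_replica_counts, min_replicas, threshold):
    """A block is 'safely replicated' once at least `min_replicas` of its
    replicas have checked in with the NameNode. The NameNode may leave
    Safemode once the fraction of safe blocks reaches the configured
    threshold."""
    safe = sum(1 for n in block_replica_counts if n >= min_replicas)
    return safe / len(block_replica_counts) >= threshold
```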

The Persistence of File System Metadata


The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.


The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.

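
The checkpoint amounts to replaying the EditLog over the FsImage. A minimal sketch, modeling the FsImage as a path-to-replication-factor map; the transaction record formats here are invented for illustration:

```python
def checkpoint(fsimage, editlog):
    """Sketch of the startup checkpoint: apply every logged transaction to
    the in-memory FsImage, after which the EditLog can be truncated because
    its effects are now captured by the persistent FsImage."""
    image = dict(fsimage)
    for op, path, *args in editlog:
        if op == "create":
            image[path] = args[0]          # replication factor at creation
        elif op == "set_replication":
            image[path] = args[0]          # changed replication factor
        elif op == "delete":
            image.pop(path, None)
    return image, []                       # new FsImage, truncated EditLog
```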

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport.


The Communication Protocols


All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.


Robustness


The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.


Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

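
The bookkeeping described here, marking silent DataNodes dead and finding blocks that have fallen below their replication factor, can be sketched as follows (the function name and data shapes are hypothetical):

```python
def find_under_replicated(last_heartbeat, now, timeout, block_map, target_rf):
    """Sketch of the NameNode's view: nodes whose last Heartbeat is older
    than `timeout` are marked dead, and blocks whose live replica count
    falls below the target replication factor are queued for
    re-replication."""
    dead = {n for n, t in last_heartbeat.items() if now - t > timeout}
    needs_copy = []
    for block, nodes in block_map.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < target_rf:
            needs_copy.append(block)
    return dead, needs_copy
```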

Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.


Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.

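
The read-side check can be sketched as follows. The real client stores CRC-style checksums in a hidden sibling file in the same namespace; here any digest (SHA-256) stands in for the idea, and the names are hypothetical:

```python
import hashlib

def block_checksum(data):
    # Any digest illustrates the idea; HDFS uses CRC-based checksums.
    return hashlib.sha256(data).hexdigest()

def read_with_verification(block_id, replicas, checksums):
    """Sketch of the client-side check: try each DataNode holding a replica,
    and accept the first whose bytes match the stored checksum. A mismatch
    means that replica is corrupt, so the client falls through to the next."""
    for node, data in replicas.items():
        if block_checksum(data) == checksums[block_id]:
            return node, data
    raise IOError("all replicas of block %s are corrupt" % block_id)
```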

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.


The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.

Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.


Data Organization


Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.

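
Chopping a file into 64 MB chunks is simple arithmetic; every block is full-size except possibly the last:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical HDFS block size cited above

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file of `file_size` bytes occupies.
    All blocks are BLOCK_SIZE except possibly the last one."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(BLOCK_SIZE, remaining))
        remaining -= sizes[-1]
    return sizes
```

For example, a 200 MB file occupies three full 64 MB blocks plus one 8 MB block, and each block may land on a different DataNode.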

Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

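
The staging behavior, buffer locally and flush only on a full block plus a final flush at close, can be sketched with a hypothetical class in which `flush_fn` stands in for the whole NameNode/DataNode interaction:

```python
class StagingClient:
    """Sketch of client-side staging: application writes accumulate in a
    local temporary buffer; only when a full block's worth of data has been
    buffered does the client contact the cluster and flush the block."""
    def __init__(self, block_size, flush_fn):
        self.block_size = block_size
        self.flush_fn = flush_fn
        self.buffer = b""

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            self.flush_fn(self.buffer[:self.block_size])
            self.buffer = self.buffer[self.block_size:]

    def close(self):
        if self.buffer:                    # remaining un-flushed data
            self.flush_fn(self.buffer)
            self.buffer = b""
```

Until `close` returns, anything still sitting in the local buffer has not reached a DataNode, which is why a NameNode failure before close loses the file.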

The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.

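
A compressed sketch of the pipeline: each portion is written locally and then forwarded downstream, so every DataNode in the chain ends up with the full block. (Real portions are 4 KB, and receiving and forwarding overlap in time; this sequential loop only models where the data ends up.)

```python
def pipeline_write(block, datanodes, portion=4):
    """Sketch of replication pipelining: the block is cut into small
    portions, and each DataNode stores a portion and forwards it to the
    next node in the list until every node holds the whole block."""
    stores = {dn: b"" for dn in datanodes}
    for i in range(0, len(block), portion):
        chunk = block[i:i + portion]
        for dn in datanodes:   # each node writes the portion, then forwards it
            stores[dn] += chunk
    return stores
```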

Accessibility


HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:


Action / Command

  • Create a directory named /foodir: bin/hadoop dfs -mkdir /foodir
  • Remove a directory named /foodir: bin/hadoop dfs -rmr /foodir
  • View the contents of a file named /foodir/myfile.txt: bin/hadoop dfs -cat /foodir/myfile.txt

FS shell is targeted for applications that need a scripting language to interact with the stored data.

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:


Action / Command

  • Put the cluster in Safemode: bin/hadoop dfsadmin -safemode enter
  • Generate a list of DataNodes: bin/hadoop dfsadmin -report
  • Recommission or decommission DataNode(s): bin/hadoop dfsadmin -refreshNodes

Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.


Space Reclamation


File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.


A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
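
The expiry policy can be sketched as a sweep over /trash entries (hypothetical function; in HDFS the NameNode applies the policy, and the 6-hour default is the one stated above):

```python
def expire_trash(trash, now, max_age=6 * 3600):
    """Sketch of the default /trash policy: entries older than `max_age`
    seconds are permanently removed from the namespace; anything newer can
    still be restored by moving it back out of /trash.
    `trash` maps path -> deletion timestamp."""
    kept = {path: t for path, t in trash.items() if now - t <= max_age}
    deleted = sorted(set(trash) - set(kept))
    return kept, deleted
```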

Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.

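
The NameNode's side of this is just choosing which replicas are surplus. A sketch that keeps an arbitrary first `new_factor` replicas (the real choice also weighs rack placement and node utilization):

```python
def excess_replicas(replica_nodes, new_factor):
    """Sketch: when the replication factor is lowered, the NameNode selects
    the surplus replicas to drop and notifies the hosting DataNodes on
    their next Heartbeat exchange."""
    return replica_nodes[new_factor:]
```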

References

HDFS Java API: https://hadoop.apache.org/core/docs/current/api/

HDFS source code: https://hadoop.apache.org/hdfs/version_control.html

by Dhruba Borthakur
