The Evolution of Append in HDFS

File Appends in HDFS

There is some confusion about the state of the file append operation in HDFS. It was in, now it’s out. Why was it removed, and when will it be reinstated? This post looks at some of the history behind HDFS’s support for file appends.

Background

Early versions of HDFS had no support for an append operation. Once a file was closed, it was immutable and could only be changed by writing a new copy with a different filename. This style of file access actually fits very nicely with MapReduce, where you write the output of a data processing job to a set of new files; this is much more efficient than manipulating the input files that are already in place.

A file didn’t exist until it had been successfully closed (by calling FSDataOutputStream’s close() method). If the client failed before it closed the file, or if the close() method failed by throwing an exception, then (to other clients at least), it was as if the file had never been written. The only way to recover the file was to rewrite it from the beginning. MapReduce worked well with this behavior, since it would simply rerun the task that had failed from the beginning.

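To make the write-once model concrete, here is a minimal sketch of the pattern (the path and data are illustrative, not from the original post): the file only becomes complete and visible to other clients once close() returns successfully.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a brand-new file; in pre-append HDFS there is no way to reopen it later.
    Path file = new Path("/tmp/write-once-example.txt"); // illustrative path
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("some records\n");

    // Only a successful close() makes the file complete for other clients;
    // if the client fails before this point, the file must be rewritten
    // from the beginning.
    out.close();
  }
}
```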

First Steps Toward Append

It was not until the 0.15.0 release of Hadoop that open files were visible in the filesystem namespace (HADOOP-1708). Until that point, they magically appeared after they had been written and closed. At the same time, the contents of files could be read by other clients as they were being written, although only the last fully-written block was visible (see HADOOP-89). This made it possible to gauge the progress of a file that was being written, albeit in a crude manner. Additionally, tools such as hadoop fs -tail (and its web UI equivalent) were introduced and allowed users to view the contents of a file as it was being written, block by block.

Stronger Requirements

For some applications, the API offered by HDFS was not strong enough. For example, a database, such as HBase, which wishes to write its transaction log to HDFS, cannot do so in a reliable fashion. For this application, some form of sync operation is needed, which guarantees that the bytes up to a given point in the stream are persisted (like POSIX’s fsync). In the event of the process crashing, it can recover its previous state by playing through the transaction log, and then it can open the log in append mode to write new entries to it.

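As a rough sketch of the transaction-log pattern described above (the class is hypothetical, not HBase code, and it assumes a release where append() and hflush() are available; older releases exposed the flush call as sync()):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalWriter { // hypothetical helper, for illustration only
  private final FSDataOutputStream out;

  WalWriter(FileSystem fs, Path log) throws IOException {
    // After a crash, the recovered process replays the existing log and then
    // reopens it in append mode to write new entries; otherwise it creates it.
    out = fs.exists(log) ? fs.append(log) : fs.create(log);
  }

  void logEntry(byte[] record) throws IOException {
    out.write(record);
    // Make sure the bytes written so far reach the datanodes before the
    // transaction is acknowledged (the fsync-like guarantee described above).
    out.hflush();
  }

  void close() throws IOException {
    out.close();
  }
}
```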

Similarly, writing unbounded log files to HDFS is unsatisfactory, since it is generally unacceptable to lose up to a block’s worth of log records if the client writing the log stream fails. The application should be able to choose how much data it is prepared to lose, since there is a trade-off between performance and reliability. A database transaction log would opt for reliability, while an application for logging web page accesses might tolerate a few lost records in exchange for better throughput.

HADOOP-1700 was opened in August 2007 to add an append operation to HDFS, available through the append() methods on FileSystem. (The issue also includes discussion about a truncate operation, which sets the end of the file to a particular position, causing data at the end of the file to be deleted. However, truncates have never been implemented.) A little under a year later, in July 2008, the append operation was committed in time for the 0.19.0 release of Hadoop Core.

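The operation itself is a method on FileSystem that returns an output stream positioned at the end of an existing file. A minimal sketch of its use (the path and record are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  // Add one record to the end of an existing file.
  static void appendRecord(FileSystem fs, Path file, String record) throws IOException {
    FSDataOutputStream out = fs.append(file); // FileSystem.append(Path)
    out.writeBytes(record);
    out.close();
  }
}
```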

Implementing the append operation necessitated substantial changes to the core HDFS code. For example, in a pre-append world, HDFS blocks were immutable. However, if you can append to a file, then the last (incomplete) block is mutated, so there needs to be some way of updating its identity so that (for example) a datanode that is down while the block is updated is recognized as holding an out-of-date version of the block when it becomes available again. This is solved by adding a generation stamp to each block, an incrementing integer that records the version of a particular block. There are a host of other technical challenges to solve, many of which are articulated in the design document attached to HADOOP-1700.

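As a toy illustration of the generation-stamp idea (these are simplified classes, not the actual HDFS data structures): a replica reported with an older stamp than the one the namenode currently records is recognized as out of date.

```java
public class GenerationStampExample {

  // Simplified stand-in for a block replica reported by a datanode.
  static final class ReportedReplica {
    final long blockId;
    final long generationStamp; // incremented each time the block is reopened for append

    ReportedReplica(long blockId, long generationStamp) {
      this.blockId = blockId;
      this.generationStamp = generationStamp;
    }
  }

  // A replica is stale if its stamp is older than the stamp currently recorded
  // for that block, e.g. because the datanode was down while the block was
  // being appended to.
  static boolean isStale(ReportedReplica replica, long currentGenerationStamp) {
    return replica.generationStamp < currentGenerationStamp;
  }

  public static void main(String[] args) {
    ReportedReplica fromRestartedNode = new ReportedReplica(42L, 3L);
    long currentStamp = 5L; // the block was appended to while the node was down
    System.out.println(isStale(fromRestartedNode, currentStamp)); // prints "true"
  }
}
```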

Append Redux

After the work on HADOOP-1700 was committed, it was noticed in October 2008 that one of the guarantees of the append function, that readers can read data that has been flushed (via FSDataOutputStream’s sync() method) by the writer, was not working (HDFS-200 “In HDFS, sync() not yet guarantees data available to the new readers”). Further related issues were found:

  • HDFS-142 “Datanode should delete files under tmp when upgraded from 0.17″
  • HADOOP-4692 “Namenode in infinite loop for replicating/deleting corrupted block”
  • HDFS-145 “FSNameSystem#addStoredBlock does not handle inconsistent block length correctly”
  • HDFS-168 “Block report processing should compare generation stamp”

Because of these problems, append support was disabled in the 0.19.1 release of Hadoop Core and in 0.20.0, the first release of the 0.20 branch. A configuration parameter, dfs.support.append, which is false by default, was introduced (HADOOP-5332) to make it easy to enable or disable append functionality (note that append functionality is still unstable, so this flag should be set to true only on development or test clusters).

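For completeness, this is roughly how the flag looks when set; note that dfs.support.append is a cluster-side property that belongs in hdfs-site.xml on the namenode and datanodes, so setting it on a client Configuration, as sketched here, only illustrates the property name and default.

```java
import org.apache.hadoop.conf.Configuration;

public class AppendFlagExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Defaults to false; enable only on development or test clusters, since
    // append support was still considered unstable at the time.
    conf.setBoolean("dfs.support.append", true);

    System.out.println("dfs.support.append = "
        + conf.getBoolean("dfs.support.append", false));
  }
}
```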

This prompted developers to step back and take a fresh look at the problem. One of the first actions was to create an umbrella issue (HDFS-265 “Revisit append”) with a new design document attached, which aimed to build on the work done to date on appends and provide a foundation for solving the remaining implementation challenges in a coherent fashion. It provides input to the open Jira issues mentioned above—it does not seek to replace them.

A group of interested Hadoop committers had a meeting at Yahoo!’s offices on May 22, 2009 to discuss the requirements for appends. They reached agreement on the precise semantics for the sync operation and renamed it to hflush in the design document in HDFS-265 to avoid confusion with other sync operations. They agreed on API3, which guarantees that data is flushed to all datanodes holding replicas for the current block, but not necessarily to the operating system buffers or the datanodes’ persistent disks.

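In terms of the API as it later shipped (a sketch assuming a release that exposes hflush()), the agreed semantics look roughly like this: after hflush() returns, a new reader can see everything written so far, but nothing is guaranteed to have reached the datanodes’ disks.

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushVisibilityExample {
  static void demo(FileSystem fs, Path file) throws Exception {
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("entry 1\n");

    // hflush(): the bytes have reached every datanode holding a replica of the
    // current block and are visible to new readers, but they may still be in
    // datanode memory rather than in OS buffers or on disk.
    out.hflush();

    // A reader opened after the flush can read the flushed bytes even though
    // the writer has not closed the file.
    FSDataInputStream in = fs.open(file);
    byte[] buffer = new byte[8];
    int bytesRead = in.read(buffer);
    System.out.println("read " + bytesRead + " bytes");
    in.close();

    out.close();
  }
}
```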

At the time of this writing, not all of these issues have been fixed, but hopefully they will be fixed in time for a 0.21 release.

The Future

Getting appends supported has been a stormy ride. It’s not over yet, but when it is finished, it will enable a new class of applications to be built upon HDFS.

When appends are done, what will be next? Record appends from multiple concurrent writers (like Google’s GFS)? Truncation? File writes at arbitrary offsets? Bear in mind, however, that every new feature adds complexity and therefore may compromise reliability or performance, so each must be very carefully considered before being added.
