Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: 0.22.0
- Component/s: hdfs client, name-node
- Labels: None
- Hadoop Flags: Incompatible change, Reviewed
Description
Currently, map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
The downsides are multiple:
- Even with a few thousand files to process, the number of RPCs quickly becomes noticeable.
- The current implementation of getFileBlockLocations is too slow, since each call results in a 'search' in the namesystem. With a few thousand input files, that means as many RPCs and 'searches', as sketched below.
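A minimal sketch of this per-file pattern, assuming the standard Hadoop FileSystem API; the input directory argument and the split-computation comment are placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockLocations {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path inputDir = new Path(args[0]); // e.g. the job's input directory

    // One listStatus RPC, then one getFileBlockLocations RPC per file;
    // each of the latter triggers a separate 'search' in the namesystem.
    // (Subdirectory entries are ignored here for brevity.)
    for (FileStatus stat : fs.listStatus(inputDir)) {
      BlockLocation[] blocks =
          fs.getFileBlockLocations(stat, 0, stat.getLen());
      // ...compute splits for this file from 'blocks'...
    }
  }
}
```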
It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory and return the block-locations for all files in that directory. This would eliminate the per-file RPC and replace the per-file 'search' with a single 'scan'.
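One possible shape for such a bulk call; this signature is illustrative only, not a committed API (compare FileSystem#listLocatedStatus in the related HADOOP-6870, linked below):

```java
// Hypothetical bulk variant: one RPC and a single namesystem 'scan'
// for the whole directory. The BlockLocation[][] return type (one
// array of block locations per file, in listing order) is an
// assumption made for illustration.
public BlockLocation[][] getFileBlockLocations(Path dir) throws IOException;
```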
When I tested this with terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...
Attachments
Issue Links
This issue blocks:
- MAPREDUCE-1981: Improve getSplits performance by using listFiles, the new FileSystem API

This issue relates to:
- HADOOP-6870: Add FileSystem#listLocatedStatus to list a directory's content together with each file's block locations
An alternative to passing directories might be to pass a list of files. The request might get larger, but this is more precise: e.g., when only a subset of the files in a directory will be used, only that subset need be passed. Since globbing is done client-side, this requires two round trips, one to list the files and one to list their blocks, but that would still be a huge improvement over per-file RPCs.
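A sketch of that two-round-trip pattern; FileSystem#globStatus is the real client-side globbing call, but the bulk getFileBlockLocations(FileStatus[]) overload is hypothetical, shown only to illustrate this comment:

```java
// Round trip 1: expand the glob client-side into an explicit file list
// ('fs' is a FileSystem instance as in the earlier sketch; the path
// pattern is a made-up example).
FileStatus[] matches = fs.globStatus(new Path("/input/2010-*/part-*"));

// Round trip 2 (hypothetical): one bulk RPC for exactly those files.
// No such overload exists in FileSystem; it illustrates passing only
// the subset of files that will actually be used.
BlockLocation[][] blocks = fs.getFileBlockLocations(matches);
```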