FileSystem.getFileBlockLocations

This page covers a new feature in Hadoop: a batched `getFileBlockLocations` addition to the FileSystem API, intended to speed up input-split computation for MapReduce applications. The existing implementation triggers one RPC and one namesystem search per file; the proposed API fetches block locations for all files in a directory in a single call, cutting the number of RPCs and improving performance. Testing showed that a job with 8000 input files dropped from 8s to 4s. The discussion weighed different designs, including passing directories versus lists of files, and converged on a new method that accepts an array of file and directory paths and returns their block locations in bulk.

Details

  • Type: New Feature
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: None
  • Fix Version/s: 0.22.0
  • Component/s: hdfs client, name-node
  • Labels: None
  • Hadoop Flags: Incompatible change, Reviewed

Description

Currently map-reduce applications (specifically, file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
The downsides are multiple:

  1. Even with only a few thousand files to process, the number of RPCs quickly becomes noticeable.
  2. The current implementation of getFileBlockLocations is slow, since each call results in a 'search' in the namesystem. With a few thousand input files, that means as many RPCs and as many 'searches'.

It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory and return the block-locations for all files in that directory. We could eliminate the per-file RPC and also replace the per-file 'search' with a single 'scan'.

When I tested this with terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...
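The RPC saving described above is easy to see in a toy model. The following is a minimal, self-contained sketch (the mock class and method names are illustrative, not the actual HDFS client or NameNode API) contrasting one RPC per file with a single batched call:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchLocationsSketch {
    /** Stand-in for the namenode; counts RPCs to show the savings. */
    static class MockNameNode {
        int rpcCount = 0;
        private final Map<String, String[]> blocksByFile = new HashMap<>();

        void addFile(String path, String... blockHosts) {
            blocksByFile.put(path, blockHosts);
        }

        // Current pattern: one RPC (and one namespace search) per file.
        String[] getFileBlockLocations(String path) {
            rpcCount++;
            return blocksByFile.get(path);
        }

        // Proposed pattern: one RPC returns locations for many paths at once.
        Map<String, String[]> getBatchBlockLocations(List<String> paths) {
            rpcCount++;
            Map<String, String[]> result = new HashMap<>();
            for (String p : paths) {
                result.put(p, blocksByFile.get(p));
            }
            return result;
        }
    }

    public static void main(String[] args) {
        MockNameNode nn = new MockNameNode();
        List<String> inputs = new ArrayList<>();
        for (int i = 0; i < 8000; i++) {
            String path = "/input/part-" + i;
            nn.addFile(path, "host-" + (i % 3));
            inputs.add(path);
        }

        // Old pattern: 8000 RPCs for 8000 input files.
        for (String p : inputs) {
            nn.getFileBlockLocations(p);
        }
        int perFileRpcs = nn.rpcCount;

        // New pattern: a single batched RPC.
        nn.rpcCount = 0;
        nn.getBatchBlockLocations(inputs);
        int batchedRpcs = nn.rpcCount;

        System.out.println(perFileRpcs + " vs " + batchedRpcs);  // prints "8000 vs 1"
    }
}
```

The real win is larger than the RPC count alone suggests, since each eliminated RPC also carried a per-file namesystem lookup on the server side.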

Attachments (all Text File, licensed for inclusion in ASF works; uploaded by Hairong Kuang)

  1. hdfsListFiles.patch, 43 kB, 23/Jul/10 19:03
  2. hdfsListFiles1.patch, 40 kB, 27/Jul/10 04:42
  3. hdfsListFiles2.patch, 42 kB, 31/Jul/10 00:07
  4. hdfsListFiles3.patch, 54 kB, 02/Aug/10 22:30
  5. hdfsListFiles4.patch, 47 kB, 11/Aug/10 18:20
  6. hdfsListFiles5.patch, 48 kB, 11/Aug/10 19:32

Issue Links

Activity

Doug Cutting added a comment - 08/May/09 17:27

An alternative to passing directories might be to pass a list of files. The request might get larger, but this is more precise, e.g., when only a subset of files in a directory will be used, only that subset need be passed. Since globbing is client-side, this requires two round trips, one to list files and one to list their blocks, but that would still be a huge improvement over per-file RPC.

Doug Cutting added a comment - 08/May/09 17:48

How about adding something like:
Map<FileStatus, BlockLocation[]> listBlockLocations(Path[]);
This would permit a glob-free job to get everything it needs in a single RPC, and a globbing job to do so with two RPCs.

Arun C Murthy added a comment - 08/May/09 17:49

Map<FileStatus, BlockLocation[]> listBlockLocations(Path[]);

+1

Konstantin Shvachko added a comment - 08/May/09 18:14

Currently getBlockLocations(src, offset, length) returns a class called LocatedBlocks, which contains a list of LocatedBlock belonging to the file.

public class LocatedBlocks implements Writable {
  private long fileLength;
  private List<LocatedBlock> blocks; // array of blocks with prioritized locations
}

The question is whether we should modify LocatedBlocks, which would include the map proposed by Doug, and extend the semantics of getBlockLocations() to handle directories, or should we introduce a new method (rpc) getBlockLocations(srcDir) returning LocatedBlockMap.
Is there a reason to keep current per file getBlockLocations() if we had a more generic method?

Doug Cutting added a comment - 08/May/09 18:28

> Is there a reason to keep current per file getBlockLocations() if we had a more generic method?

Not that I can think of. +1 for replacing it.

dhruba borthakur added a comment - 10/May/09 09:40

If we adopt the approach that Doug has suggested, then the namenode still has to search for each input path in the file system namespace. This approach still has the advantage that the number of RPC calls is reduced. If we adopt Arun's proposal, which specifies a directory and has the RPC call return the splits of all the files in that directory, then it reduces the number of searches in the FS namespace as well as the number of RPC calls. I was kind-of leaning towards Arun's proposal, but Doug's approach is a little more flexible in nature, isn't it?

Arun C Murthy added a comment - 11/May/09 17:50

Dhruba, I was thinking it was implicit in Doug's proposal that if one of the paths in the Path[] is a directory, then the new api would return block-locations of all its children (non-recursively?), which would satisfy the original requirement. Doug, can you please confirm?
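If that reading is right, the directory-expansion rule could look like the following toy sketch (all names are hypothetical, and the non-recursive expansion is an assumption from the comment above, not the committed API):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DirectoryExpansionSketch {
    static class MockNamespace {
        // Maps a directory to its immediate children, and a file to its block hosts.
        Map<String, List<String>> children = new HashMap<>();
        Map<String, String[]> fileBlocks = new HashMap<>();

        boolean isDirectory(String path) {
            return children.containsKey(path);
        }

        // Expand each directory path (non-recursively) into its files,
        // then return block locations for the resulting set of files.
        Map<String, String[]> listBlockLocations(String[] paths) {
            Map<String, String[]> result = new LinkedHashMap<>();
            for (String p : paths) {
                if (isDirectory(p)) {
                    for (String child : children.get(p)) {
                        result.put(child, fileBlocks.get(child));
                    }
                } else {
                    result.put(p, fileBlocks.get(p));
                }
            }
            return result;
        }
    }
}
```

With this rule, a Path[] mixing plain files and directories still resolves to a flat file-to-locations map in a single call, which is what both the directory-based and list-based proposals ultimately need.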
