Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: 0.22.0
- Component/s: hdfs client, name-node
- Labels: None
- Hadoop Flags: Incompatible change, Reviewed
Description
Currently, map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
The downsides are multiple:
- Even with a few thousand files to process, the number of RPCs quickly becomes noticeable.
- The current implementation of getFileBlockLocations is too slow, since each call results in a 'search' in the namesystem. With a few thousand input files, that means as many RPCs and 'searches', as sketched below.
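A minimal sketch of this per-file pattern, assuming the standard Hadoop FileSystem API; the input directory argument and the split-computation comment are placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockLocations {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path inputDir = new Path(args[0]); // e.g. the job's input directory

    // One listStatus RPC, then one getFileBlockLocations RPC per file;
    // each of the latter triggers a separate 'search' in the namesystem.
    // (Subdirectory entries are ignored here for brevity.)
    for (FileStatus stat : fs.listStatus(inputDir)) {
      BlockLocation[] blocks =
          fs.getFileBlockLocations(stat, 0, stat.getLen());
      // ...compute splits for this file from 'blocks'...
    }
  }
}
```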
It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory and return the block-locations for all files in that directory. This would eliminate the per-file RPC and replace the per-file 'search' with a single 'scan'.
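One possible shape for such a bulk call; this signature is illustrative only, not a committed API (compare FileSystem#listLocatedStatus in the related HADOOP-6870, linked below):

```java
// Hypothetical bulk variant: one RPC and a single namesystem 'scan'
// for the whole directory. The BlockLocation[][] return type (one
// array of block locations per file, in listing order) is an
// assumption made for illustration.
public BlockLocation[][] getFileBlockLocations(Path dir) throws IOException;
```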
When I tested this with terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...
Attachments
Issue Links
This issue blocks:
- MAPREDUCE-1981: Improve getSplits performance by using listFiles, the new FileSystem API

This issue relates to:
- HADOOP-6870: Add FileSystem#listLocatedStatus to list a directory's content together with each file's block locations
An alternative to passing directories might be to pass a list of files. The request might get larger, but this is more precise: e.g., when only a subset of the files in a directory will be used, only that subset need be passed. Since globbing is done client-side, this requires two round trips, one to list the files and one to list their blocks, but that would still be a huge improvement over per-file RPCs.
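A sketch of that two-round-trip pattern; FileSystem#globStatus is the real client-side globbing call, but the bulk getFileBlockLocations(FileStatus[]) overload is hypothetical, shown only to illustrate this comment:

```java
// Round trip 1: expand the glob client-side into an explicit file list
// ('fs' is a FileSystem instance as in the earlier sketch; the path
// pattern is a made-up example).
FileStatus[] matches = fs.globStatus(new Path("/input/2010-*/part-*"));

// Round trip 2 (hypothetical): one bulk RPC for exactly those files.
// No such overload exists in FileSystem; it illustrates passing only
// the subset of files that will actually be used.
BlockLocation[][] blocks = fs.getFileBlockLocations(matches);
```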