48. Why should you run the HDFS balancer periodically? (Choose three)
A. To ensure that there is capacity in HDFS for additional data
B. To ensure that all blocks in the cluster are 128MB in size
C. To help HDFS deliver consistent performance under heavy loads
D. To ensure that there is consistent disk utilization across the DataNodes
E. To improve data locality for MapReduce
Answer: C,D,E
Explanation:
https://www.quora.com/It-is-recommended-that-you-run-the-HDFS-balancer-periodically-Why-Choose-3
E: Balancer does not take data locality into consideration unless it is moving a block. In a cluster that is balanced up to its threshold, it will not move a block just because it is violating the locality policy. (Use a setrep +/setrep - process instead.)
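The "setrep +/setrep -" process mentioned above can be sketched with the real `hdfs dfs -setrep` command: temporarily raising a file's replication factor makes the NameNode place extra replicas (following the placement policy), and lowering it back deletes surplus copies, which can end up improving locality. The path below is hypothetical.

```shell
# Hypothetical path. Raise replication from 3 to 4 and wait (-w) until the
# extra replicas are placed, then drop back to 3 so HDFS removes surplus copies.
hdfs dfs -setrep -w 4 /data/hot/dataset
hdfs dfs -setrep 3 /data/hot/dataset
```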
(A brief note: see the O'Reilly excerpt below, which supports this option.)
A: Think of the Towers of Hanoi puzzle. If you move a ring from one peg to another, has the total number of rings changed? No. Same thing with HDFS: balancing has no impact on total capacity, which would be the same if you balanced or not. It does help the NN place new blocks by allowing for more possible places to place data and ensure that it can replicate blocks more evenly -- i.e., better utilize that capacity. But capacity doesn't magically become available by moving it unless your systems are so broken that they are losing blocks during the move.
C: Under heavy loads, rack locality is more important than node locality because newer data is more likely to be read than older data. (See studies by Y! and others.) Given A above, running the balancer is unlikely to have any significant impact on performance unless the datanodes are extremely unbalanced (e.g., after a major delete or recently added nodes).
B: Clearly balancer doesn't change the block size. distcp, however, can.
D: This was the sole reason that the balancer was written. Balancing the utilization across disks goes back to the explanation given in B. Source: I was there.
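On B above: the balancer never rewrites blocks, but `distcp` copies files and the copies are written with whatever block size is in effect, so it can effectively change block sizes. A minimal sketch, with hypothetical paths, using the real `dfs.blocksize` property (268435456 bytes = 256 MB):

```shell
# Hypothetical source/destination paths. distcp rewrites the files, so the
# copies are stored with the 256 MB block size passed via -D.
hadoop distcp \
  -D dfs.blocksize=268435456 \
  hdfs:///data/old hdfs:///data/rewritten-256mb
```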
O'Reilly (Hadoop: The Definitive Guide):
Over time, the distribution of blocks across datanodes can become unbalanced. An unbalanced cluster can affect locality for MapReduce, and it puts a greater strain on the highly utilized datanodes, so it’s best avoided.
The balancer program is a Hadoop daemon that redistributes blocks by moving them from overutilized datanodes to underutilized datanodes, while adhering to the block replica placement policy that makes data loss unlikely by placing block replicas on different racks.
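In practice, the balancer described above is started with the `hdfs balancer` command. The `-threshold` option is real; 10 is its default value, shown here explicitly for illustration:

```shell
# Move blocks until every DataNode's disk utilization is within 10 percentage
# points of the cluster-wide average utilization (10 is the default threshold).
hdfs balancer -threshold 10
```

A lower threshold produces a more tightly balanced cluster at the cost of moving more data over the network.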