CCAH-500 Question 45: You want to minimize the chance of data loss in your cluster. What should you do?

worgent

Article link: https://blog.csdn.net/tianbaochao/article/details/51698341

45. You have a 20-node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High Availability (HA). You want to minimize the chance of data loss in your cluster. What should you do?

A. Add another master node to increase the number of nodes running the JournalNode, which increases the number of machines available to HA to create a quorum.

B. Set an HDFS replication factor that provides data redundancy, protecting against node failure.

C. Run a Secondary NameNode on a different master from the NameNode in order to provide automatic recovery from a NameNode failure.

D. Run the ResourceManager on a different master from the NameNode in order to load-share HDFS metadata processing.

E. Configure the cluster’s disk drives with an appropriate fault-tolerant RAID level.

Answer: B (corrected from D, the answer originally given)

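Explanation and references:

The correct choice is B. The HDFS replication factor keeps copies of each block on several DataNodes, so the data survives the failure of a node.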
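A rough way to see why the replication factor matters: the sketch below estimates the chance that a single block loses every replica when several DataNodes fail at once. It assumes replicas sit on distinct DataNodes chosen uniformly at random, which is a simplification (real HDFS placement is rack-aware), and the function name `block_loss_probability` is hypothetical.

```python
from math import comb


def block_loss_probability(datanodes: int, failed: int, replication: int) -> float:
    """Probability that all replicas of one block sit on failed nodes.

    Assumption: replicas live on `replication` distinct DataNodes chosen
    uniformly at random, and `failed` nodes are lost at the same time.
    """
    if replication > datanodes:
        raise ValueError("replication factor cannot exceed the number of DataNodes")
    if failed < replication:
        return 0.0  # at least one replica is guaranteed to survive
    # Hypergeometric case: every replica falls inside the failed set.
    return comb(failed, replication) / comb(datanodes, replication)


if __name__ == "__main__":
    # 18 slave nodes as in the question; 3 simultaneous failures is an assumed scenario.
    for r in (1, 2, 3):
        p = block_loss_probability(datanodes=18, failed=3, replication=r)
        print(f"replication={r}: P(block lost) = {p:.4f}")
```

Under this simplified model, with 18 DataNodes and 3 simultaneous failures, the per-block loss probability drops from about 0.17 at replication 1 to about 0.001 at replication 3.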

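Option D: the ResourceManager is part of YARN and relates to MapReduce; it has essentially nothing to do with HDFS.

Option C: a Secondary NameNode cannot perform automatic recovery. In an HA setup, the active/standby NameNodes together with ZooKeeper provide automatic recovery.

Option A: JournalNodes do not require an additional master node; they can run on slave nodes.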
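For option A, the quorum arithmetic can be sketched as follows. The Hadoop QJM documentation states that with N JournalNodes the system tolerates at most (N - 1) / 2 failures; the helper names `quorum_size` and `tolerated_failures` are hypothetical and only illustrate that rule.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest majority of JournalNodes that must acknowledge an edit."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes: int) -> int:
    """JournalNode failures survivable while a majority remains: (N - 1) / 2, rounded down."""
    return (journal_nodes - 1) // 2


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} JournalNodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Note that the JournalNode quorum protects the shared edit log, not block data, so more JournalNodes do not add block replicas.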

http://www.aiotestking.com/cloudera/what-should-you-do-5/

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.

O'Reilly, Hadoop: The Definitive Guide:

WHY NOT USE RAID?

HDFS clusters do not benefit from using RAID (redundant array of independent disks) for datanode storage (although RAID is recommended for the namenode’s disks, to protect against corruption of its metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.

Furthermore, RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (just a bunch of disks) configuration used by HDFS, which round-robins HDFS blocks between all disks.

This is because RAID 0 read and write operations are limited by the speed of the slowest-responding disk in the RAID array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk. Disk performance often shows considerable variation in practice, even for disks of the same model. In some benchmarking carried out on a Yahoo! cluster, JBOD performed 10% faster than RAID 0 in one test (Gridmix) and 30% better in another (HDFS write throughput).

Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to become unavailable.
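The slowest-disk argument in the quoted passage can be illustrated with a toy model. This is only a sketch: it assumes a striped RAID 0 operation touches every disk and finishes with the slowest one, while JBOD sends each operation to a single disk in round-robin order. The latency values and function names are hypothetical, not measurements.

```python
from statistics import mean


def raid0_op_time(disk_latencies_ms):
    """A striped operation touches every disk, so it finishes with the slowest one."""
    return max(disk_latencies_ms)


def jbod_avg_op_time(disk_latencies_ms):
    """Round-robin placement sends each operation to one disk, so the average
    operation time is the mean of the per-disk latencies."""
    return mean(disk_latencies_ms)


if __name__ == "__main__":
    latencies = [8.0, 9.0, 10.0, 14.0]  # hypothetical per-disk latencies in ms
    print("RAID 0 time per operation:", raid0_op_time(latencies))
    print("JBOD average time per operation:", jbod_avg_op_time(latencies))
```

In this model the JBOD average can never exceed the RAID 0 time, and the gap grows with the variation between disks.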
