Why Should HBase RegionServer & Hadoop DataNode Colocate?

First, some basic background. HBase, as a distributed NoSQL database, calls its slave (worker) nodes "RegionServers"; all data reads, writes, and scans are served by these RegionServers. On the other hand, as a member of the Hadoop family, HBase does not reinvent a data storage service: it works directly on HDFS. More precisely, the storage layer underneath a RegionServer is the "DataNode", as shown in the following diagram.

Original source: http://blog.csdn.net/bluishglc/article/details/60739183. Please credit the original when reposting.

[Diagram: a RegionServer reading and writing its data through the DataNode running on the same host]

The diagram above is my own. Not 100% of a RegionServer's data is read locally, but HBase keeps the local fraction as high as possible; you can also search for the details yourself. For reference, see the first chart in this article on HBase architecture: An In-Depth Look at the HBase Architecture.
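HBase actually tracks this "local fraction" per region as a data-locality metric: the share of a region's HFile bytes whose HDFS block replicas sit on the RegionServer's own host. Here is a minimal sketch of that idea; the function name and the block list format are illustrative, not HBase's actual API:

```python
def locality_index(block_replicas, regionserver_host):
    """Fraction of a region's HFile bytes whose block replica
    lives on the same host as the RegionServer serving it.
    block_replicas: list of (host, size_in_mb) pairs."""
    total = sum(size for _, size in block_replicas)
    local = sum(size for host, size in block_replicas
                if host == regionserver_host)
    return local / total if total else 0.0

# A region served on rs1: two 128 MB blocks local, one 64 MB block remote.
blocks = [("rs1", 128), ("rs1", 128), ("rs7", 64)]
print(locality_index(blocks, "rs1"))  # 0.8
```

A locality index near 1.0 means nearly all reads stay on the local disks; after a RegionServer restart or a region move it drops, and major compaction rewrites the HFiles locally to bring it back up.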

Neither "RegionServer" nor "DataNode" is a physical concept; they are just services, and any server running them can be called a "RegionServer" or a "DataNode". Drawing an analogy between HBase and MySQL (or any RDBMS): a RegionServer corresponds to the MySQL process, and a DataNode corresponds to Ext4, NTFS, or any other file system. All the physical files holding HBase data are stored on DataNodes. So it is easy to see that in an HDFS/HBase cluster, each physical node should install and start both a RegionServer and a DataNode; in other words, RegionServer and DataNode always co-exist on every slave node of a Hadoop cluster. This is also the architecture the SOH infrastructure follows. Please also check out this thread on the same topic: Should the HBase region server and Hadoop data node on the same machine?
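The reason colocation pays off is HDFS's default replica placement: when the writer is itself a DataNode, the first replica of every block lands on the writer's own host, so a RegionServer flushing or compacting HFiles always ends up with a local copy. A toy sketch of that placement policy (host names and rack layout are hypothetical, and real HDFS weighs load and free space as well):

```python
import random

# Hypothetical cluster layout: host -> rack.
RACKS = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}

def place_replicas(writer_host, racks=RACKS):
    """Toy sketch of HDFS default placement for replication factor 3:
    replica 1 on the writer's own DataNode (the local copy that makes
    RegionServer/DataNode colocation worthwhile), replica 2 on a node
    in a different rack, replica 3 on another node in replica 2's rack."""
    first = writer_host
    other_rack = [h for h in racks if racks[h] != racks[writer_host]]
    second = random.choice(other_rack)
    same_rack = [h for h in racks
                 if racks[h] == racks[second] and h != second]
    third = random.choice(same_rack) if same_rack else random.choice(other_rack)
    return [first, second, third]

print(place_replicas("dn1"))  # e.g. ['dn1', 'dn3', 'dn4']
```

If the RegionServer's host runs no DataNode, replica 1 has nowhere local to go, and every subsequent read of that block must cross the network.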

If an existing Hadoop cluster with dozens, or perhaps more than a hundred, nodes wants to allocate several instances as dedicated HBase nodes, it will run into the following trouble:

[Diagram: dedicated HBase nodes reading and writing data from remote DataNodes over the network]

This diagram makes the trouble clear: data storage does not sit alongside data processing, so every read, write, and scan must go through the network, crossing multiple servers.

This is exactly like installing MySQL on a server but making it read and write its data files not from local disk but from a REMOTE SHARED FOLDER on other servers. We all know how slow it is to open even one file from a remote share; how terrible would it be to read and write 10 TB of data that way? And worse, it has to happen in real time.
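A back-of-envelope comparison makes the gap concrete. The throughput numbers below are illustrative assumptions, not benchmarks: ten nodes each scanning their own local disks at roughly 1 GB/s, versus all reads funneling through a shared 10 GbE link (about 1.25 GB/s):

```python
# Rough scan-time estimate for 10 TB of data (assumed, not measured, figures).
DATA_TB = 10
local_gbs   = 1.0 * 10   # GB/s: 10 nodes, each scanning its own disks in parallel
network_gbs = 1.25       # GB/s: one saturated 10 GbE pipe shared by all readers

local_hours  = DATA_TB * 1024 / local_gbs   / 3600
remote_hours = DATA_TB * 1024 / network_gbs / 3600
print(f"local: {local_hours:.2f} h, remote: {remote_hours:.2f} h")
```

Under these assumptions the remote scan takes about 8x longer, and in practice the shared link is also carrying replication and shuffle traffic, so the real gap is usually worse.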

It seems there are 2 possible improvements:

  1. Let the RegionServer nodes be DataNodes too.

  2. Let all DataNodes be RegionServers too.

Option 1 does not help. Say the existing Hadoop cluster has 100 instances: 95 are pure DataNodes and 5 are DataNode + RegionServer. Then only about 5% of the data can be read and written locally; the other 95% still comes from remote instances, because, to avoid data skew, HDFS has to distribute blocks across all 100 DataNodes evenly.
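The arithmetic behind that 5% figure is simply the colocated nodes' share of the cluster, assuming blocks end up spread evenly:

```python
total_datanodes = 100
colocated = 5  # nodes running both a DataNode and a RegionServer

# With HDFS spreading blocks evenly to avoid skew, the share of data a
# colocated RegionServer can reach on its own disks is just its share
# of the DataNodes.
local_fraction = colocated / total_datanodes
print(f"{local_fraction:.0%} local, {1 - local_fraction:.0%} over the network")
# 5% local, 95% over the network
```

Adding a few more mixed nodes only moves this fraction linearly; short of running RegionServers on most of the cluster, the bulk of reads stay remote.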

Option 2 is, first of all, a big decision for the existing Hadoop cluster: it is an architecture-level change. And even if it could be done, the actual result would still be disappointing. Here is the reason:

For almost the same reason that a RegionServer needs to co-exist with a DataNode, in YARN a NodeManager also always co-exists with a DataNode: the NodeManager is in charge of running M/R jobs and likewise tries its best to read and write data locally. Jobs running on YARN get their hardware resources allocated and coordinated by YARN, but HBase does not! From a resource-allocation point of view, HBase and YARN are competitors, so we rarely see HBase and YARN installed on the same instances in production. This is not the fault of HBase or YARN; they do totally different jobs, and it is hard to allocate appropriate, balanced resources to both.
