Will all HFiles managed by a regionserver be kept open?


I hadn't read the code carefully, so I asked this question on the HBase mailing list. In fact, a closer look at the Bigtable paper makes it clear the files must be kept open. My current conclusion is that HBase random-read performance is determined by a few factors (a small sketch of the workload in question follows the list):

1) HDFS seek operations: on average, how many seeks does each random Get cause?

2) Memory copying: this matters especially when data locality is poor, e.g., when the DataNode and the regionserver are not on the same node.

3) The block cache?
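To make the question concrete, here is a minimal sketch of the random-Get workload these factors apply to, using the old Java client API. The table name, family, and row-key scheme are made-up assumptions; the point is only to show where each Get's cost (seeks, copies, cache hits) is incurred.

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test_table");   // hypothetical table name
    byte[] family = Bytes.toBytes("cf");             // hypothetical family
    Random rnd = new Random();

    int n = 10000;
    long start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
      // One random Get: locate the region/RS, then read from its store files,
      // which on a cache miss means HDFS pread(s) plus copying the block.
      Get get = new Get(Bytes.toBytes("row-" + rnd.nextInt(1000000)));
      get.addFamily(family);
      Result r = table.get(get);
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("avg Get latency (ms): " + (double) elapsed / n);
    table.close();
  }
}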

 

Hi, I know that, generally, a regionserver manages HRegions, and at the HDFS layer the data in an HRegion is stored in HFile format. I want to know whether all HFiles are kept open and whether things like the block index are loaded up front to improve lookup performance. If so, what happens when this exceeds the memory limit?

Thanks.

Stack

to user · Jan 13 (6 days ago)
Yes, all files are opened on startup and kept open. Opening an hbase storefile/hfile includes loading the file index and metadata. In our experience, this overhead has been small. It's currently not accounted for in our general memory-counting. We should for sure add it.

St.Ack
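As a rough illustration of what "kept open" means operationally: every store file under a region's family directory corresponds to a reader that the regionserver opens at region-open time and keeps open, with that file's block index held in its heap. The sketch below only lists those files with the plain HDFS API; the /hbase/<table>/<region>/<family> path is a made-up example of the 0.90-era layout, not something from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListStoreFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical family directory; the real one depends on hbase.rootdir,
    // the table name, the region's encoded name, and the family name.
    Path familyDir = new Path("/hbase/test_table/1234567890abcdef/cf");
    long totalBytes = 0;
    for (FileStatus f : fs.listStatus(familyDir)) {
      // Each HFile here is opened by the regionserver and stays open; its
      // index and metadata are loaded once and sit in regionserver memory.
      System.out.println(f.getPath().getName() + "\t" + f.getLen() + " bytes");
      totalBytes += f.getLen();
    }
    System.out.println("store files kept open for this family, total bytes: " + totalBytes);
  }
}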

Tao Xie

to user · Jan 13 (6 days ago)
Thanks for your response, Stack. I have a further question as I try to understand hbase.
In my mind, a Get is executed in the following process:

hbase client <=> RS <=> DN

1) the hbase client finds the RS managing the key; 2) the RS knows the relevant HFiles and fetches data from the DataNode, which may be a pread plus a scan within the HBase data block; 3) the resulting record is returned to the client.

Is this correct? So the most expensive operation is step 2? Are there any other time-consuming steps?


2011/1/13 Stack  <stack@duboce.net>
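Step 2 is essentially a positioned read (pread) against HDFS followed by a scan inside the fetched block. Below is a minimal sketch of what a pread looks like at the Hadoop FileSystem API level; the file path, block offset, and block size are made-up values, and HBase of course does this internally through its HFile readers rather than by hand.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path hfile = new Path("/hbase/test_table/1234567890abcdef/cf/somehfile"); // hypothetical
    long blockOffset = 128 * 1024;   // where the block index says the block starts
    int blockSize = 64 * 1024;       // a typical HFile block size
    byte[] block = new byte[blockSize];

    FSDataInputStream in = fs.open(hfile);
    // Positioned read: it does not move the stream's file pointer, so many
    // concurrent Gets can share one open stream. On a cold read this costs a
    // disk seek on the DataNode plus copying the block over the socket into
    // the regionserver, which then scans it for the requested key.
    int read = in.read(blockOffset, block, 0, blockSize);
    System.out.println("read " + read + " bytes; the RS would now scan this block");
    in.close();
  }
}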


Ryan Rawson

to user · Jan 13 (6 days ago)
Retrieving data from disk is the dominant element, until you are fully cached, in which case other factors inside the regionserver become dominant. At that point memory copying, GC, algorithmic complexity, etc. become important.
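Whether you end up "fully cached" mostly depends on the block cache. Below is a hedged sketch of creating a family with block caching enabled, and marked in-memory so its blocks are favored in the cache; the table and family names are invented, and the cache's overall size is the hfile.block.cache.size fraction of the regionserver heap.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateCachedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor("test_table"); // hypothetical name
    HColumnDescriptor cf = new HColumnDescriptor("cf");          // hypothetical family
    cf.setBlockCacheEnabled(true);  // cache data blocks read from this family's HFiles
    cf.setInMemory(true);           // hint to keep this family's blocks resident in the cache
    table.addFamily(cf);

    admin.createTable(table);
  }
}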

Tao Xie

to user · Jan 14 (4 days ago)
Is the HDFS seek the dominant cost in retrieving data? If records are small (~1 KB) and most requests are random Gets, how many seeks will happen on average during a Get? By the way, what do you mean by memory copying? When will it cause large overhead? Thanks.

2011/1/13 Ryan Rawson  <ryanobjc@gmail.com>


Jean-Daniel Cryans

to user · 3:08 (7 hours ago)
There should be as many seeks as there are store files in the region that's serving the data. There's also the family dimension, e.g. if you read from only 1 family then only those store files are read.

So on average, I'd say you'll do 3 seeks, since you do a minor compaction once you reach 4 store files in a family.

What he meant by memory copying is just that the data has to be copied from the socket when you read from HDFS and then into the outbound socket for the client, after the region server does whatever processing it needs to do. I guess the more data you read, the longer it takes to copy in RAM?

J-D
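Since only the store files of the families (and columns) you actually ask for are consulted, it helps to restrict a Get explicitly, as in the sketch below (table, family, and row key are made up). The per-family store-file count that J-D's 3-seek estimate rests on is in turn governed by the compaction settings, e.g. the hbase.hstore.compactionThreshold property.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test_table");   // hypothetical table

    Get get = new Get(Bytes.toBytes("row-42"));      // hypothetical row key
    // Only family "cf" is read, so only its store files are seeked; other
    // families' HFiles are not touched by this Get at all.
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));

    Result r = table.get(get);
    byte[] value = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
    System.out.println(value == null ? "not found" : Bytes.toString(value));
    table.close();
  }
}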

 
