Demystifying the Skip Scan in Phoenix

Today the Phoenix blog is brought to you by my esteemed colleague and man of many hats, Mujtaba Chohan, who today is wearing his performance engineer hat.

SKIP SCAN

Phoenix 1.2 uses a Skip Scan for intra-row scanning, which gives a significant performance improvement over Multi Gets and Range Scans when rows are retrieved based on a given set of keys.


The Skip Scan leverages the SEEK_NEXT_USING_HINT return code of the HBase Filter. It stores information about which set of keys or key ranges is being searched for in each column. It then takes a key (passed to it during filter evaluation) and determines whether that key falls within one of the combinations or ranges. If not, it figures out the next highest key that the scan should jump to.
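The seek-hint idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single key column, an in-memory sorted list standing in for the rows of a region, and a hypothetical function name `skip_scan` that is not part of Phoenix or HBase.

```python
from bisect import bisect_left


def skip_scan(sorted_rows, wanted_keys):
    """Toy single-column skip scan; returns (matches, rows_examined)."""
    wanted = sorted(set(wanted_keys))
    matches, examined = [], 0
    pos = 0  # current scanner position in sorted_rows
    w = 0    # index of the smallest wanted key not yet ruled out
    while pos < len(sorted_rows) and w < len(wanted):
        row = sorted_rows[pos]
        examined += 1
        # Drop wanted keys that sort before the current row.
        while w < len(wanted) and wanted[w] < row:
            w += 1
        if w == len(wanted):
            break  # nothing left to look for at or after this row
        if wanted[w] == row:
            matches.append(row)  # analogous to including the row
            pos += 1
        else:
            # Analogous to SEEK_NEXT_USING_HINT: jump straight to the first
            # row >= the next wanted key instead of stepping row by row.
            pos = bisect_left(sorted_rows, wanted[w], pos)
    return matches, examined


rows = [f"k{i:04d}" for i in range(1000)]
found, examined = skip_scan(rows, ["k0005", "k0500", "k0999"])
print(found, examined)
```

With 1000 rows and three wanted keys, the sketch examines 6 rows rather than all 1000, which is the effect the filter's seek hint is meant to achieve.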

Input to the SkipScanFilter is a List<List<KeyRange>> where the top level list represents each column in the row key (i.e. each primary key part), and the inner list represents ORed together byte array boundaries.

Consider the following query:

SELECT * from T
WHERE ((KEY1 >='a' AND KEY1 <= 'b') OR (KEY1 > 'c' AND KEY1 <= 'e')) AND
KEY2 IN (1, 2)

List<List<KeyRange>> for SkipScanFilter for the above query would be:

  • [[[a - b], (c - e]], [1, 2]]

where [[a - b], (c - e]] holds the ranges for KEY1 (a square bracket marks an inclusive bound, a parenthesis an exclusive one) and [1, 2] holds the keys for KEY2.

[Figure: the Skip Scan running over sample data]
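To make the nested structure concrete, here is a hedged sketch that builds the two slots directly from the WHERE clause of the query above and tests a few composite keys against them. The `KeyRange` class here is a hypothetical Python stand-in, not Phoenix's own class, and plain strings and integers are used in place of the byte arrays Phoenix actually compares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRange:
    """Hypothetical stand-in for a key range with inclusive/exclusive bounds."""
    lower: object
    upper: object
    lower_inclusive: bool = True
    upper_inclusive: bool = True

    @classmethod
    def point(cls, key):
        # A single key is a range whose bounds are equal and inclusive.
        return cls(key, key)

    def contains(self, key):
        above = key > self.lower or (self.lower_inclusive and key == self.lower)
        below = key < self.upper or (self.upper_inclusive and key == self.upper)
        return above and below


# Outer list: one slot per primary key column. Inner list: ORed ranges.
slots = [
    # KEY1 >= 'a' AND KEY1 <= 'b'  OR  KEY1 > 'c' AND KEY1 <= 'e'
    [KeyRange("a", "b"), KeyRange("c", "e", lower_inclusive=False)],
    # KEY2 IN (1, 2)
    [KeyRange.point(1), KeyRange.point(2)],
]


def matches(row_key, slots):
    """A row matches when every key part falls in one of its slot's ranges."""
    return all(
        any(r.contains(part) for r in ranges)
        for part, ranges in zip(row_key, slots)
    )


for key in [("a", 1), ("c", 1), ("d", 2), ("e", 3)]:
    print(key, matches(key, slots))
```

The ranges within a slot are ORed and the slots are ANDed, which mirrors how the WHERE clause combines the conditions on KEY1 and KEY2.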


PERFORMANCE

For this performance comparison, we are using simulated data for a real use case outlined on the HBase user mailing list.

Number of rows: 1 billion.
- The key consists of 50 million OBJECTID values and 20 FIELDTYPE values (50 million x 20 = 1 billion rows). Each key has 10 ATTRIBID values, and VAL is a random integer.

Phoenix Create Table DDL
CREATE TABLE T(
OBJECTID INTEGER NOT NULL, FIELDTYPE CHAR(2) NOT NULL,
CF.ATTRIBID INTEGER,CF.VAL INTEGER 
CONSTRAINT PK PRIMARY KEY (OBJECTID,FIELDTYPE)) 
COMPRESSION='GZ', BLOCKSIZE='4096'

Query 
SELECT AVG(VAL) FROM T
WHERE OBJECTID IN (250K RANDOM OBJECTIDs) AND 
FIELDTYPE = 'F1' AND 
ATTRIBID='A1'


IN-MEMORY TEST
Time taken to run the query when rows are fetched from the HBase Block Cache.

Test            Time
Phoenix         1.7 sec
Batched Gets    4.0 sec

DISK READ TEST
Time taken to run the query when data is fetched from disk.

Test               Time
Phoenix            37 sec
Batched Gets       82 sec
Range Scan         12 mins
Hive over HBase    20+ mins
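
As a quick check on what these timings imply, the ratios against Phoenix can be worked out directly from the disk read figures. The snippet below is only arithmetic on the numbers reported above; the Hive over HBase row is left out because "20+ mins" is a lower bound rather than a measurement.

```python
# Disk read timings from the table above, converted to seconds.
disk_read_seconds = {
    "Phoenix": 37,
    "Batched Gets": 82,
    "Range Scan": 12 * 60,
}

baseline = disk_read_seconds["Phoenix"]
speedups = {
    name: seconds / baseline
    for name, seconds in disk_read_seconds.items()
    if name != "Phoenix"
}

for name, factor in speedups.items():
    print(f"{name}: Phoenix is {factor:.1f}x faster")
# Batched Gets: Phoenix is 2.2x faster
# Range Scan: Phoenix is 19.5x faster
```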


SERIAL TEST
To further illustrate the performance gain from the Skip Scan, we compare Phoenix Serial Skip Scan performance (phoenix.query.threadPoolSize=1) against Serial Batched Gets and Scan. The total number of rows is 8M (all rows fit in the HBase block cache). The percentage of random keys passed in the IN clause is varied on the X axis.

Phoenix Create Table DDL

CREATE TABLE T(
KEY VARCHAR NOT NULL PRIMARY KEY,
CF.A BIGINT, CF.B BIGINT, CF2.C BIGINT)

Query
SELECT A FROM T
WHERE KEY IN (?,?,?...)
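
A test like this needs the IN list to grow with the percentage of keys selected. The sketch below shows one way to pick a random percentage of keys and build the matching parameterized statement. The helper names `build_in_query` and `sample_keys` are hypothetical, the key format is made up for the example, and no database driver is called, so how the statement and parameters are executed is left as an assumption.

```python
import random


def build_in_query(num_keys):
    """Build the serial-test query with one '?' placeholder per key."""
    if num_keys < 1:
        raise ValueError("IN list needs at least one key")
    placeholders = ", ".join("?" for _ in range(num_keys))
    return f"SELECT A FROM T WHERE KEY IN ({placeholders})"


def sample_keys(all_keys, percent, seed=0):
    """Pick a random `percent` of the keys (at least one), in sorted order."""
    count = max(1, round(len(all_keys) * percent / 100))
    return sorted(random.Random(seed).sample(all_keys, count))


# Made-up keys standing in for the table's row keys.
all_keys = [f"key{i:07d}" for i in range(1000)]
params = sample_keys(all_keys, percent=1)
sql = build_in_query(len(params))
print(len(params), sql)
```

Binding the keys as parameters keeps the statement text short and avoids quoting issues, at the cost of one placeholder per key.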

[Figure: Comparison of Serial Skip Scan vs. Serial Batched Gets and Scan, varying the percentage of keys passed in the IN clause]

CONCLUSION

Because the Skip Scan uses reseek, it is about 3 times faster than Batched Gets. The Skip Scan can be 20x faster than scans over large data sets that cannot all fit into memory, and it is 8x faster even if the data is in memory (when 1% of the rows are selected). This comes in addition to Phoenix's already fast performance from server-side coprocessors for aggregation and from query parallelization, which is yet another reason to use the latest Phoenix release!


CONFIGURATION
HBase 0.94.7
Hadoop 1.0.4
Region Servers (RS): 4 (6 Core 3GHz, 12GB RAM, with 8GB set as the HBase heap on each RS)
Total number of regions: 20
Note: All the keys passed in the IN clause are present; therefore, Bloom Filters were not used.

   http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html
