Fetch data from an HBASE database in R using the rhbase package

Sometimes you need to analyze a dataset that is stored in HBASE tables on a Hadoop cluster. I recently ran into this situation, and Revolution Analytics's rhbase package came to the rescue. The tutorial on the rhbase wiki is well documented, but I ran into some issues along the way, so I decided to write a step-by-step guide for future reference.

The following software must be installed on the server and the client.

Server: Linux-Ubuntu, Hadoop, HBASE, R, Rstudio-server, thrift

Client: Browser

For this guide I assume Ubuntu, Hadoop, and HBASE are already installed and configured on the server side and there are some data tables in the HBASE database.

1. R

sudo apt-get install r-base

2. Rstudio Server

Install Rstudio server by following the instructions given on its download site.

3. Apache Thrift

Ubuntu

$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

CentOS5/Rhel5

$ sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

Build Thrift according to its instructions. Then update PKG_CONFIG_PATH in your bashrc by opening the file from the terminal (sudo is not needed to edit your own bashrc):

nano ~/.bashrc

And paste the following line at the end

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/

Then source the bashrc via

source ~/.bashrc
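
If you repeat this setup on several machines, pasting the line by hand can leave duplicates in the bashrc. Below is a minimal sketch of an alternative, assuming a bash-like shell; append_once is a hypothetical helper name of my own, not something shipped with Thrift or rhbase.

```shell
# append_once FILE LINE
# Hypothetical helper: appends LINE to FILE only if that exact line
# is not already present, so running it twice changes nothing.
append_once() {
    local file="$1" line="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage (the export line is the one given in the step above):
# append_once ~/.bashrc 'export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/'
```

The single quotes in the usage line matter: they keep $PKG_CONFIG_PATH unexpanded so that it is evaluated each time the bashrc is sourced.
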
  • Verify that the pkg-config path is correct by typing this in the terminal:

pkg-config --cflags thrift

It should return -I/usr/local/include/thrift, not something like -I/usr/local/include.
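
The check above can also be scripted, which helps when the output contains several flags. This is only a sketch: has_thrift_include is a hypothetical helper name, and it looks for the exact include flag quoted above.

```shell
# has_thrift_include "FLAGS"
# Hypothetical helper: succeeds only if FLAGS contains the exact token
# -I/usr/local/include/thrift (a bare -I/usr/local/include does not count).
has_thrift_include() {
    case " $1 " in
        *" -I/usr/local/include/thrift "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# if has_thrift_include "$(pkg-config --cflags thrift)"; then
#     echo "pkg-config path looks right"
# else
#     echo "unexpected flags; re-check PKG_CONFIG_PATH" >&2
# fi
```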

  • Copy the Thrift library (adjust the version in the file name to the one you built)

sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/
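
The file name in the copy command is tied to Thrift 0.9.0. If you built a different version, a sketch like the following can locate the versioned library first. find_thrift_libs is a hypothetical helper name, and the default directory is the /usr/local/lib location used above.

```shell
# find_thrift_libs [DIR]
# Hypothetical helper: prints every versioned libthrift shared object
# (libthrift-<version>.so) found in DIR, one path per line.
find_thrift_libs() {
    local dir="${1:-/usr/local/lib}" f
    for f in "$dir"/libthrift-*.so; do
        # When nothing matches, the glob stays literal; -e filters that out.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}

# Usage, mirroring the copy step above:
# find_thrift_libs /usr/local/lib | while read -r lib; do
#     sudo cp "$lib" /usr/lib/
# done
```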

4. Start HBASE and Thrift server

HBASE

start-hbase.sh

HBASE should now be running. To verify that Hadoop and all related services are up, and to query HBASE, do the following:

jps

You should see something like,

  • 6162 DataNode
  • 6739 TaskTracker
  • 502 JobTracker
  • 7029 HMaster
  • 14867 Jps
  • 12245 Main
  • 5924 NameNode
  • 7320 HRegionServer
  • 13740 ThriftServer
  • 6412 SecondaryNameNode

Then open the HBASE shell to inspect the tables:

hbase shell
hbase(main):003:0> list #To see list of tables
hbase(main):003:0> describe('TABLE_NAME') #Small description of concerned table
hbase(main):003:0> scan('TABLE_NAME') #Get content of that table
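
Reading the jps listing by eye gets tedious on a busy node. As a rough sketch, the check can be reduced to "which expected daemons are missing". missing_daemons is a hypothetical helper name, and the daemon names in the usage line are simply taken from the sample listing above.

```shell
# missing_daemons NAME...
# Hypothetical helper: reads jps output on stdin ("<pid> <name>" per line)
# and prints each expected NAME that does not appear, in the order given.
missing_daemons() {
    awk -v want="$*" '
        { seen[$2] = 1 }                      # remember every process name
        END {
            n = split(want, names, " ")
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen)) print names[i]
        }'
}

# Usage:
# jps | missing_daemons NameNode DataNode HMaster HRegionServer ThriftServer
```

Empty output means every listed daemon was found.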

Thrift

hbase thrift start

If it throws an error then try this

hbase thrift start -threadpool
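
Before moving on to R, it can save time to confirm that something is actually listening on the Thrift port. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command; port_open is a hypothetical helper name, and the IP address in the usage lines is the example address used in this guide.

```shell
# port_open HOST PORT
# Hypothetical helper: succeeds if a TCP connection to HOST:PORT can be
# opened within 2 seconds, using bash's built-in /dev/tcp redirection.
port_open() {
    local host="$1" port="$2"
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (9090 is the Thrift port used in the R code in step 7):
# if port_open 192.168.20.10 9090; then
#     echo "Thrift server is reachable"
# else
#     echo "nothing is listening on port 9090" >&2
# fi
```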

5. rhbase

Installing the rhbase package takes two steps: download the source tarball, then install it.

wget https://raw.github.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
R CMD INSTALL rhbase_1.2.0.tar.gz

6. Log in to Rstudio server

Enter the server's IP address with port 8787 in the browser, for example 192.168.20.10:8787

7. Query HBASE from R

library(rhbase)
hostLoc = '192.168.20.10'  # Give your server IP
port = 9090  # Default port for the Thrift service

# Connect once. serialize="character" is only needed if the table
# stores character data; otherwise drop that argument.
hb.init(hostLoc, port, serialize="character")

hb.list.tables()
hb.describe.table("TABLE_NAME")

# Read the table one row at a time until the scanner is exhausted
data <- c()
iter <- hb.scan(tablename='TABLE_NAME', startrow="1", colspec="FamilyName:")
while(length(row <- iter$get(1))>0){
  data <- c(data, row)
}
iter$close()  # Release the scanner on the server

Enjoy! You can now browse, read, write, and modify tables stored in HBASE through R.
