Sometimes you need to analyze a dataset that is stored in HBase tables on a Hadoop cluster. I recently ran into this situation, and Revolution Analytics’s rhbase package came to the rescue. The tutorial on the rhbase wiki is well documented, but I ran into a few issues along the way, so I decided to write a step-by-step guide for future reference.
The following must be installed on the server and the client.
Server: Ubuntu Linux, Hadoop, HBase, R, RStudio Server, Thrift
Client: Browser
This guide assumes that Ubuntu, Hadoop, and HBase are already installed and configured on the server, and that the HBase database already contains some tables.
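Before going further, it can help to confirm that the assumed tools really are available on the server. The following is only a sketch: the helper name `missing_tools` is made up for illustration, and it assumes the `hadoop` and `hbase` launchers are on the PATH of the current user.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list that is
# not found on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names for the prerequisites of this guide.
missing=$(missing_tools hadoop hbase)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "hadoop and hbase are both on PATH"
fi
```

The same helper can be called again later with `R` and `thrift` once steps 1 and 3 are done.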
1. R
sudo apt-get install r-base
2. RStudio Server
Install RStudio Server by following the instructions on its download site
3. Apache Thrift
- Download Apache Thrift version 0.9.0 rather than 0.9.1
- Install all Thrift prerequisites
Ubuntu
$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
CentOS 5/RHEL 5
$ sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel
- Build Thrift according to its instructions
- Update PKG_CONFIG_PATH in your .bashrc. Open the file in an editor (sudo is not needed, since the file belongs to you):
nano ~/.bashrc
And paste the following line at the end
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
Then source the bashrc via
source ~/.bashrc
- Verify that the pkg-config path is correct by typing this in the terminal:
pkg-config --cflags thrift
It should return -I/usr/local/include/thrift and not something like -I/usr/local/include
- Copy Thrift library
sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/
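The PKG_CONFIG_PATH update and the pkg-config check from this step can also be scripted, which is handy when the setup has to be repeated on several machines. This is a minimal sketch, not part of the rhbase instructions; the helper names `ensure_line` and `cflags_ok` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if it is not already
# there, so re-running the setup does not duplicate the export.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Hypothetical helper: succeed only if the compiler flags contain an include
# path that ends in /include/thrift, e.g. -I/usr/local/include/thrift.
cflags_ok() {
    case " $1 " in
        *"/include/thrift "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example run (commented out so that sourcing this file changes nothing):
# ensure_line "$HOME/.bashrc" 'export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/'
# cflags_ok "$(pkg-config --cflags thrift)" && echo "pkg-config path looks right"
```

Remember that a new export in .bashrc only takes effect after the file is sourced again or a new shell is opened.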
4. Start HBase and Thrift server
HBase
start-hbase.sh
HBase should now be running. First check with jps that Hadoop and all related services are up, then query HBase from its shell:
jps
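You should see something like the following (ThriftServer is listed only once the Thrift server, started later in this step, is running):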
- 6162 DataNode
- 6739 TaskTracker
- 502 JobTracker
- 7029 HMaster
- 14867 Jps
- 12245 Main
- 5924 NameNode
- 7320 HRegionServer
- 13740 ThriftServer
- 6412 SecondaryNameNode
hbase shell
hbase(main):003:0> list #To see list of tables
hbase(main):003:0> describe('TABLE_NAME') #Small description of concerned table
hbase(main):003:0> scan('TABLE_NAME') #Get content of that table
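Comparing the jps listing by eye gets tedious on a busy node. As a rough sketch, the check could be automated with a small filter that reads jps output and reports which expected processes are absent. The helper name `missing_daemons` is hypothetical, and the process names in the example are the ones from the listing above.

```shell
#!/bin/sh
# Hypothetical helper: read `jps` output on stdin and print each expected
# process name (given as arguments) that does not appear in it.
missing_daemons() {
    # jps prints "<pid> <name>", so the second field is the process name.
    running=$(awk '{print $2}')
    for name in "$@"; do
        printf '%s\n' "$running" | grep -qxF -- "$name" || printf '%s\n' "$name"
    done
}

# Example (commented out; needs a running cluster):
# jps | missing_daemons NameNode DataNode HMaster HRegionServer
```

An empty result means every listed process was found.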
Thrift
hbase thrift start
If it throws an error, try this instead:
hbase thrift start -threadpool
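The Thrift server keeps running in the terminal where it was started, so the remaining steps are easiest from a second terminal. To be sure the server is actually accepting connections before moving on, a small retry loop can probe the port. This is a sketch under assumptions: the helper name `wait_for` is hypothetical, and the example relies on the `nc` utility being installed.

```shell
#!/bin/sh
# Hypothetical helper: run a probe command once per second until it succeeds
# or the given number of attempts is used up.
wait_for() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] && sleep 1
    done
    return 1
}

# Example (commented out; 9090 is the default Thrift port used later in this guide):
# wait_for 30 nc -z 192.168.20.10 9090 && echo "Thrift server is accepting connections"
```

Passing the probe as a command keeps the loop reusable, for instance to wait for RStudio Server on port 8787.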
5. rhbase
Installing the rhbase package generally takes two steps: download the source tarball, then install it with R.
wget https://raw.github.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
R CMD INSTALL rhbase_1.2.0.tar.gz
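To confirm that the installation worked, and that R can find the Thrift library copied in step 3, it is possible to try loading the package from the command line. The following is a sketch only; the helper name `r_pkg_installed` is hypothetical and it assumes `Rscript` is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named R package can be loaded.
r_pkg_installed() {
    Rscript -e "quit(status = if (requireNamespace('$1', quietly = TRUE)) 0 else 1)" >/dev/null 2>&1
}

# Example (commented out): download and install rhbase only when it is missing.
# r_pkg_installed rhbase || {
#     wget https://raw.github.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
#     R CMD INSTALL rhbase_1.2.0.tar.gz
# }
```

If the check still fails after installation, running `Rscript -e 'library(rhbase)'` directly shows the underlying error message.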
6. Log in to RStudio Server
Type the server’s IP address followed by port 8787 into the browser. For example, 192.168.20.10:8787
7. Query HBase from R
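library(rhbase)
hostLoc <- '192.168.20.10' # Your server's IP
port <- 9090 # Default port of the Thrift service
# Pass serialize="character" only if the table stores plain strings; otherwise omit it.
# It is given in the same call because a second hb.init() without host and port
# would reinitialize the connection with the default host and port.
hb.init(hostLoc, port, serialize="character")
hb.list.tables()
hb.describe.table("TABLE_NAME")
data <- list()
iter <- hb.scan(tablename='TABLE_NAME', startrow="1", colspec="FamilyName:")
# Fetch one row at a time until the scanner returns nothing more
while(length(row <- iter$get(1))>0){
  data <- c(data, row)
}
iter$close() # Release the scanner once all rows are read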
Enjoy! You can now browse, read, write, and modify tables stored in HBase through R.