Article source: http://blog.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/
I will update this article later with the latest HBase version, 0.20.6!
NOTE (Sep 7, 2009): Updated info on the need to use Amazon Private DNS names and clarified the need for the masters, slaves and regionservers files. Also updated to use HBase 0.20.0 Release Candidate 3.
Introduction
As someone who has “skipped” Java and wants to learn as little as possible about it, and who has not had much experience with Hadoop so far, I found that HBase deployment has a big learning curve. So some of the things I describe below may be obvious to those who have experience in those domains.
Where are the docs for HBase 0.20?
If you go to the HBase wiki, you will find that there is not much documentation on the 0.20 version. This puzzled me, since all the twittering, blog posting and other buzz was about people using 0.20 even though it’s “pre-release”.
One of the great things about going to meetups such as the HBase Meetup is that you can talk to the folks who actually wrote the thing and ask them, “Where is the documentation for HBase 0.20?”
Turns out it’s in the HBase 0.20.0 distribution, in the docs directory. The easiest thing is to get the pre-built 0.20.0 Release Candidate 3. If you download the source from the version control repository, you have to build the documentation using Ant. If you are a Java/Ant kind of person that might not be hard, but just to build the docs you have to satisfy a number of dependencies first.
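If you just want the bundled documentation without building anything, pulling down the pre-built tarball and looking in its docs directory is enough:
wget http://people.apache.org/~stack/hbase-0.20.0-candidate-3/hbase-0.20.0.tar.gz
tar xzf hbase-0.20.0.tar.gz
ls hbase-0.20.0/docs/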
What we learnt with 0.19.x
We have been learning a lot about making an HBase cluster work at a basic level. I had a lot of problems getting 0.19.x running beyond a single node in Pseudo-Distributed mode. I think a lot of my problems were just not understanding how it all fit together with Hadoop and what the different startup/shutdown scripts did.
Then we finally tried the HBase EC2 Scripts, even though they use an AMI based on Fedora 8 and seemed wired to 0.19.0. They are pretty nice scripts if you want an opinionated HBase cluster set up, and they did educate us on how to get a cluster going. They have a bit of strangeness in that a script, /root/hbase_init, is called at boot time to configure all the Hadoop and HBase conf files and then call the Hadoop and HBase startup scripts. Something like this is kind of needed for Amazon EC2, since you don’t really know what the IP address/FQDN is until boot time.
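Just to illustrate the boot-time idea, a rough sketch of what such a script does on the master might look like this (this is not the actual /root/hbase_init; the @MASTER@ placeholder in the config templates is an invention for illustration):
#!/bin/sh
# Fill in the hostname that is only known at boot, then start the daemons.
MASTER_FQDN=`hostname -f`
sed -i "s/@MASTER@/$MASTER_FQDN/g" /mnt/hadoop/conf/core-site.xml /mnt/hbase/conf/hbase-site.xml
/mnt/hadoop/bin/start-dfs.sh
/mnt/hbase/bin/start-hbase.sh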
The scripts also set up one Amazon security group for the cluster master and one for the rest of the cluster. I believe they then use these groups to identify the cluster’s instances as well.
The main thing we got out of it was that, mainly by going through the /root/hbase_init script, we were able to figure out what the process was for bringing up Hadoop/HBase as a cluster.
We did build a staging cluster with these scripts, and we were able to pretty easily change them to use 0.19.3 instead of 0.19.0. But their opinions differed from ours on many things. Plus, after talking to the folks at the HBase Meetup, and having all sorts of weird problems with our app on 0.19.3, we were convinced that our future is in HBase 0.20. And since 0.20 introduces new things like using ZooKeeper to manage master selection, it does not seem worth it for us to continue using these scripts, though they helped our learning quite a bit!
Building an HBase 0.20.0 Cluster
This post will use the pre-built HBase 0.20.0 Release Candidate 3 and the pre-built standard Hadoop 0.20.0 release.
This post will show how to do all this “by hand”. Hopefully we’ll have an article on how to do all this with Chef sometime soon.
The HBase folks say that you really should have at least five regionservers and one master. The master and several of the regionservers can also run the ZooKeeper quorum. Of course, the master server is also going to run the Hadoop NameNode and Secondary NameNode. The five other nodes then run the Hadoop HDFS DataNodes as well as the HBase regionservers. When you build out larger clusters, you will probably want to dedicate machines to ZooKeeper and to hot-standby HBase masters. The NameNode is still a Single Point of Failure (SPOF); rumour has it that this will be fixed in Hadoop 0.21.
We’re not using Map/Reduce yet, so we won’t go into that, but it’s just a matter of running different startup scripts to make the same nodes do Map/Reduce as well as HDFS and HBase.
In this example, we’re installing and running everything as root. It can also be done as a dedicated user such as hadoop, as described in the earlier blog post Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard.
Getting the pre-requisites in order
We started with the vanilla Alestic Ubuntu 9.04 Jaunty 64-bit Server AMI (ami-5b46a732) and instantiated six High-CPU Large instances. You really want as much memory and as many cores as you can get. The package setup below can be done by hand or combined with the shell scripting described in the section Installing Hadoop and HBase.
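For reference, launching the instances with the EC2 API command-line tools looks roughly like this (the keypair name and instance type below are placeholders; substitute whatever matches your account and sizing):
ec2-run-instances ami-5b46a732 -n 6 -k your-keypair -t c1.xlarge
Once the instances are up and you can ssh in as root, bring the base system up to date: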
apt-get update
apt-get upgrade
Then install the Sun Java 6 JDK via apt-get:
apt-get install sun-java6-jdk
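It is worth confirming where the JDK ended up, since the same path is needed for JAVA_HOME in both the Hadoop and HBase configs later on:
java -version
ls -d /usr/lib/jvm/java-6-sun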
Downloading Hadoop and HBase
You can use the production Hadoop 0.20.0 release, which you can find at the mirrors listed at http://www.apache.org/dyn/closer.cgi/hadoop/core/. The example below uses one mirror:
wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz
You can download the HBase 0.20.0 Release Candidate 3 in prebuilt form from http://people.apache.org/~stack/hbase-0.20.0-candidate-3/. (You can get the source out of version control: http://hadoop.apache.org/hbase/version_control.html, but you’ll have to figure out how to build it.)
wget http://people.apache.org/~stack/hbase-0.20.0-candidate-3/hbase-0.20.0.tar.gz
Installing Hadoop and HBase
Assume you are running in your home directory on the master server, that the target for the versioned packages is /mnt/pkgs, and that a link in /mnt will serve as the home path for hadoop and hbase.
You can do some simple scripting to perform the following on all the nodes at once:
Create a file named servers containing the fully qualified domain names of all your servers, including “localhost” for the master.
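For example, a six-node cluster’s servers file would contain localhost plus the private DNS name of each of the other five nodes, one per line (only two private DNS names appear in this post, so the rest are omitted here):
localhost
domU-12-31-39-06-9D-C1.compute-1.internal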
Make sure you can ssh to all the servers from the master, ideally using ssh keys. On the master:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
On each of your region servers, make sure that the contents of id_dsa.pub are also in their authorized_keys file (don’t delete any other keys you already have in authorized_keys!).
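A quick way to confirm that key-based login works everywhere is a loop like this; BatchMode makes ssh fail instead of prompting for a password:
for host in `cat servers`
do
  ssh -o BatchMode=yes $host hostname || echo "cannot ssh to $host"
done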
Now with a bit of shell command line scripting you can install on all your servers at once:
for host in `cat servers`
do
  echo $host
  ssh $host 'apt-get update; apt-get upgrade; apt-get install sun-java6-jdk'
  scp ~/hadoop-0.20.0.tar.gz ~/hbase-0.20.0.tar.gz $host:
  ssh $host 'mkdir -p /mnt/pkgs; cd /mnt/pkgs; tar xzf ~/hadoop-0.20.0.tar.gz; tar xzf ~/hbase-0.20.0.tar.gz; ln -s /mnt/pkgs/hadoop-0.20.0 /mnt/hadoop; ln -s /mnt/pkgs/hbase-0.20.0 /mnt/hbase'
done
Use Amazon Private DNS Names in Config files
So far I have found that it’s best to use the Amazon private DNS names in the Hadoop and HBase config files. It looks like HBase uses the system hostname to determine various things at runtime, and this is always the private DNS name. It also means that it’s difficult to use the web GUI interfaces to HBase from outside of the Amazon cloud. I set up a “desktop” version of Ubuntu running in the Amazon cloud that I VNC (or NX) into and use its browser to view the web interface.
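If you want to double-check an instance’s private DNS name, the system hostname and the EC2 metadata service should agree:
hostname -f
curl http://169.254.169.254/latest/meta-data/local-hostname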
In any case, Amazon instances normally allow only limited TCP/UDP access from the outside world due to the default security group settings. You would have to add the various ports used by HBase and Hadoop to the security group to allow outside access.
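For example, with the EC2 API tools you could open just the web UI ports to a single outside address (the source address below is a placeholder; 60010 is the default HBase master web UI port and 50070 the default Hadoop NameNode web UI port):
ec2-authorize default -p 60010 -s 203.0.113.10/32
ec2-authorize default -p 50070 -s 203.0.113.10/32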
If you do use the Amazon public DNS names in the config files, there will be startup errors like the following for each instance assigned to the ZooKeeper quorum (there may be other errors as well, but these are the most obvious):
ec2-75-101-104-121.compute-1.amazonaws.com: java.io.IOException: Could not find my address: domU-12-31-39-06-9D-51.compute-1.internal in list of ZooKeeper quorum servers
ec2-75-101-104-121.compute-1.amazonaws.com: at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.writeMyID(HQuorumPeer.java:128)
ec2-75-101-104-121.compute-1.amazonaws.com: at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.main(HQuorumPeer.java:67)
Configuring Hadoop
Now you have to configure Hadoop on the master in /mnt/hadoop/conf:
hadoop-env.sh:
The minimal things to change are:
Set your JAVA_HOME to where the java package is installed. On Ubuntu:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Add the HBase jars and conf directory to the HADOOP_CLASSPATH:
export HADOOP_CLASSPATH=/mnt/hbase/hbase-0.20.0.jar:/mnt/hbase/hbase-0.20.0-test.jar:/mnt/hbase/conf
core-site.xml:
Here is what we used. The main settings are where the Hadoop files live (hadoop.tmp.dir) and the NameNode host and port (fs.default.name):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/hadoop</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001</value>
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>
  </property>
</configuration>
mapred-site.xml:
Even though we are not currently using Map/Reduce, this is a basic config:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>domU-12-31-39-06-9D-51.compute-1.internal:50002</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
hdfs-site.xml:
The main thing to change based on your setup is dfs.replication. It should be no greater than the number of data-nodes/region-servers you have.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.client.block.write.retries</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
Put the fully qualified domain name of your master in the file masters and the names of the data-nodes in the file slaves.
masters:
domU-12-31-39-06-9D-51.compute-1.internal
slaves:
domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal
We did not change any of the other files so far.
Now copy these files to the data-nodes:
for host in `cat slaves`
do
  echo $host
  scp slaves masters hdfs-site.xml hadoop-env.sh core-site.xml ${host}:/mnt/hadoop/conf
  ssh $host '/mnt/hadoop/bin/hadoop namenode -format'
done
And also format HDFS on the master:
/mnt/hadoop/bin/hadoop namenode -format
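If the format succeeded, the NameNode metadata directory should now exist under hadoop.tmp.dir, which with the core-site.xml above is /mnt/hadoop:
ls /mnt/hadoop/dfs/name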
Configuring HBase
hbase-env.sh:
As in hadoop-env.sh, you must set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
and add the hadoop conf directory to the HBASE_CLASSPATH:
export HBASE_CLASSPATH=/mnt/hadoop/conf
And on the master, to have HBase manage the ZooKeeper quorum itself, set:
export HBASE_MANAGES_ZK=true
hbase-site.xml:
You mainly need to define the HBase master, the HBase rootdir and the list of ZooKeeper servers. We also had to bump hbase.zookeeper.property.maxClientCnxns up from the default of 30 to 300.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.master</name>
    <value>domU-12-31-39-06-9D-51.compute-1.internal:60000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>domU-12-31-39-06-9D-51.compute-1.internal,domU-12-31-39-06-9D-C1.compute-1.internal,domU-12-31-39-06-9D-51.compute-1.internal</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.maxClientCnxns</name>
    <value>300</value>
  </property>
</configuration>
You will also need a file called regionservers. Normally it contains the same hostnames as the Hadoop slaves file:
regionservers:
domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal
Copy the files to the region-servers:
for host in `cat regionservers`
do
  echo $host
  scp hbase-env.sh hbase-site.xml regionservers ${host}:/mnt/hbase/conf
done
Starting Hadoop and HBase
On the master:
(This just starts the Hadoop File System services, not Map/Reduce services)
/mnt/hadoop/bin/start-dfs.sh
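Before starting HBase it is worth confirming that the data nodes have registered with the NameNode:
/mnt/hadoop/bin/hadoop dfsadmin -report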
Then start hbase:
/mnt/hbase/bin/start-hbase.sh
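Once the regionservers have had a few seconds to check in, a quick sanity check is to ask the HBase shell for the cluster status:
echo "status" | /mnt/hbase/bin/hbase shell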
You can shut things down by doing the reverse:
/mnt/hbase/bin/stop-hbase.sh
/mnt/hadoop/bin/stop-dfs.sh
It is advisable to set up init scripts. This is described in the Ubuntu /etc/init.d-style startup scripts section of the earlier blog post Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard.
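As a rough illustration only (not the script from that post), an /etc/init.d-style wrapper for the master could be as simple as:
#!/bin/sh
# Minimal sketch: start and stop HDFS and HBase in the right order on the master.
case "$1" in
  start)
    /mnt/hadoop/bin/start-dfs.sh
    /mnt/hbase/bin/start-hbase.sh
    ;;
  stop)
    /mnt/hbase/bin/stop-hbase.sh
    /mnt/hadoop/bin/stop-dfs.sh
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac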