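Why use HBase?
HBase, also called the Hadoop database, takes its design from Google's Bigtable paper (Bigtable is a NoSQL database built on top of GFS). HDFS can store massive amounts of data, but it does not support record-level modification, and it does not support random access to that data. If random reads and writes over massive data are needed and latency is not a concern, MapReduce can be used to run ETL over the data, but that is slow. HBase is a NoSQL database built on HDFS that provides random reads and writes over data stored in HDFS.
How do HBase and HDFS relate?
Introduction to HBase
HBase is a distributed, scalable, column-oriented open-source database. The technology comes from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang and colleagues. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase started as a subproject of Apache Hadoop and is now a top-level Apache project. Unlike a typical relational database, HBase is suited to storing unstructured data. Another difference is that HBase uses a column-based model rather than a row-based one. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.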
When should you use HBase?
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.).
Third, make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
What is column-oriented storage?
1. Problems with row-oriented storage
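RDBMS: 1. No support for sparse storage (empty columns still take up disk space).
test:t_user
id | name | pwd | sex  | info | …
1  | zs   | *** | TRUE |      |
2  | ls   | *** |      |      |
3  | ww   | *** |      | XXX  |
test:t_user_base
id | name | pwd
1  | zs   | ***
2  | ls   | ***
3  | ww   | ***
test:t_user_info
id | sex  | info | …
1  | TRUE |      |
3  |      | XXX  |
Splitting test:t_user into test:t_user_base and test:t_user_info improves disk and I/O utilization, but it increases the number of joins. On the unsplit table, a query such as select id,name from t_user where id = 1 has low I/O utilization, because whole rows are read just to return two columns.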
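To make the sparse-storage point concrete, here is a minimal Python sketch. It is not part of the original notes: it simply models the wide test:t_user table as fixed-width rows and counts the slots that hold no value. The helper name count_empty_slots is made up for illustration, and the "…" column is left out.

```python
# Toy model of the wide t_user table: every row reserves a slot for every
# column, whether or not it holds a value (None marks an empty slot).
COLUMNS = ["id", "name", "pwd", "sex", "info"]

rows = [
    (1, "zs", "***", True, None),
    (2, "ls", "***", None, None),
    (3, "ww", "***", None, "XXX"),
]


def count_empty_slots(table):
    """Return (empty, total) slot counts for a list of fixed-width rows."""
    total = sum(len(row) for row in table)
    empty = sum(1 for row in table for value in row if value is None)
    return empty, total


empty, total = count_empty_slots(rows)
print(f"{empty} of {total} slots are empty")
```

In a real table with many optional columns the share of empty slots grows quickly, which is the motivation for the column-oriented layout described next.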
2. Column-oriented storage
HBase: 1. A column family places columns with similar I/O characteristics in the same physical file.
test:t_user
rowkey | column (family:qualifier) | value | timestamp
1      | cf1:name                  | zs    | 1
1      | cf1:name                  | 张三  | 2
1      | cf1:pwd                   | ***   | 1
1      | cf2:sex                   | TRUE  | 1
2      | cf1:name                  | ls    | 1
2      | cf1:pwd                   | ***   | 1
3      | cf1:name                  | ww    | 1
3      | cf1:pwd                   | ***   | 1
3      | cf2:info                  | XXX   | 1
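The cell layout in the table above can be sketched in a few lines of Python. This is only an illustrative model, not how HBase is implemented: each cell is a (rowkey, column, value, timestamp) tuple, and the hypothetical helper group_by_family stands in for "one physical store per column family".

```python
from collections import defaultdict

# Each cell is (rowkey, "family:qualifier", value, timestamp), copied from
# the test:t_user example. Columns without a value simply have no cell.
cells = [
    ("1", "cf1:name", "zs", 1),
    ("1", "cf1:name", "张三", 2),
    ("1", "cf1:pwd", "***", 1),
    ("1", "cf2:sex", "TRUE", 1),
    ("2", "cf1:name", "ls", 1),
    ("2", "cf1:pwd", "***", 1),
    ("3", "cf1:name", "ww", 1),
    ("3", "cf1:pwd", "***", 1),
    ("3", "cf2:info", "XXX", 1),
]


def group_by_family(all_cells):
    """Group cells by column family, standing in for one store per family."""
    stores = defaultdict(list)
    for rowkey, column, value, timestamp in all_cells:
        family = column.split(":", 1)[0]
        stores[family].append((rowkey, column, value, timestamp))
    return dict(stores)


stores = group_by_family(cells)
# A read that only needs cf1 columns never has to touch the cf2 store.
print({family: len(store) for family, store in stores.items()})
```

Note that nothing is stored for missing values (row 2 has no cf2 cells at all), which is how this layout avoids the empty slots of the row-oriented table.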
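Terminology
Setting up the HBase environment
1. Make sure Hadoop HDFS is running normally (steps omitted)
2. Start ZooKeeper (steps omitted)
3. Upload the HBase package hbase-1.2.4-bin.tar.gz, extract it under /usr, and configure HBASE_HOME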
[root@CentOS ~]# tar -zxf hbase-1.2.4-bin.tar.gz -C /usr/
[root@CentOS ~]# vim .bashrc
HBASE_HOME=/usr/hbase-1.2.4
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export HBASE_HOME
[root@CentOS ~]# source .bashrc
4. Edit the HBase configuration file hbase-site.xml
[root@CentOS ~]# vim /usr/hbase-1.2.4/conf/hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<!-- must match the HDFS configuration -->
<value>hdfs://CentOS:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>CentOS</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
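As a quick way to double-check the file before starting HBase, the properties can be read back with a short Python sketch. This is an optional aid, not part of the original setup: the XML is inlined so the snippet runs anywhere, and both helper names (load_properties, rootdir_is_on_hdfs) are made up for illustration. The HDFS check reflects this particular setup only.

```python
import xml.etree.ElementTree as ET

# Same properties as the hbase-site.xml above, inlined so the sketch runs
# without an HBase installation.
HBASE_SITE = """
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://CentOS:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>CentOS</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Parse Hadoop-style <property> entries into a {name: value} dict."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def rootdir_is_on_hdfs(props):
    """Hypothetical sanity check for this setup: hbase.rootdir points at HDFS."""
    return props.get("hbase.rootdir", "").startswith("hdfs://")


props = load_properties(HBASE_SITE)
print(props["hbase.zookeeper.quorum"], props["hbase.zookeeper.property.clientPort"])
print("rootdir on HDFS:", rootdir_is_on_hdfs(props))
```

To check a real installation, the inlined string could be replaced with the contents of /usr/hbase-1.2.4/conf/hbase-site.xml.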
5. Edit the regionservers text file (make sure it matches Hadoop's slaves file)
[root@CentOS ~]# vim /usr/hbase-1.2.4/conf/regionservers
CentOS
6. Edit hbase-env.sh
Use an external ZooKeeper to manage cluster metadata
[root@CentOS ~]# vim /usr/hbase-1.2.4/conf/hbase-env.sh
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false
7. Start or stop HBase (start-hbase.sh to start, stop-hbase.sh to stop)
[root@CentOS ~]# start-hbase.sh
starting master, logging to /usr/hbase-1.2.4/logs/hbase-root-master-CentOS.out
CentOS: starting regionserver, logging to /usr/hbase-1.2.4/logs/hbase-root-regionserver-CentOS.out
[root@CentOS ~]# jps
11971 NameNode
12750 Jps
1425 QuorumPeerMain
12536 HMaster
12659 HRegionServer
12054 DataNode
12255 SecondaryNameNode
You can then open: http://centos:16010/master-status#userTables
Basic HBase shell commands
1) Namespace operations (comparable to database operations)
2) Table operations
3) HBase DML: create, retrieve, update, delete (CRUD)
-- Enter the HBase shell
[root@CentOS ~]# hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.2.4, rUnknown, Wed Feb 15 18:58:00 CST 2017
hbase(main):001:0>
Note: in the HBase shell, use Ctrl+Backspace to delete characters.
- namespace (comparable to database operations):
1) list_namespace (list all namespaces)
hbase(main):003:0> list_namespace
NAMESPACE
default
hbase
2 row(s) in 1.1400 seconds
2) create_namespace (create a namespace)
create_namespace 'test',{'key'=>'value'}
Example:
hbase(main):007:0> create_namespace 'test',{'creator'=>'jeffery'}
0 row(s) in 0.2650 seconds
3) list_namespace_tables (list the tables in a namespace)
hbase(main):004:0> list_namespace_tables 'test'
TABLE
t_people
t_user
2 row(s) in 1.7320 seconds
4) describe_namespace (show information about a namespace)
hbase(main):008:0> describe_namespace 'test'
DESCRIPTION
{NAME => 'test', creator => 'jeffery'}
1 row(s) in 0.1760 seconds
5) alter_namespace (modify namespace properties)
Add a property:
hbase> alter_namespace 'namespace', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
Remove a property:
hbase> alter_namespace 'namespace', {METHOD => 'unset', NAME=>'PROPERTY_NAME'}
Example:
Modify a property
hbase(main):009:0> alter_namespace 'test',{METHOD=>'set','creator'=>'tom'}
0 row(s) in 2.0060 seconds
hbase(main):010:0> describe_namespace 'test'
DESCRIPTION
{NAME => 'test', creator => 'tom'}
1 row(s) in 0.0420 seconds
Add a property
hbase(main):012:0> alter_namespace 'test',{METHOD=>'set','time'=>'2018-07-09'}
0 row(s) in 0.4710 seconds
hbase(main):013:0> describe_namespace 'test'
DESCRIPTION
{NAME => 'test', creator => 'tom', time => '2018-07-09'}
1 row(s) in 0.0230 seconds
Remove a property
hbase(main):015:0> alter_namespace 'test',{METHOD=>'unset',NAME=>'time'}
0 row(s) in 0.2440 seconds
hbase(main):016:0> describe_namespace 'test'
DESCRIPTION
{NAME => 'test', creator => 'tom'}
1 row(s) in 0.0340 seconds
6) drop_namespace (delete a namespace; only an empty namespace can be dropped)
hbase(main):017:0> drop_namespace 'test'
0 row(s) in 0.7570 seconds
- table (table operations):
1) create (create a table)
Specifying VERSIONS (the maximum number of versions to keep)
hbase(main):024:0> create 'test:t_user',{NAME=>'cf1',VERSIONS=>3},{NAME=>'cf2',VERSIONS=>3}
0 row(s) in 6.6200 seconds
=> Hbase::Table - test:t_user
Without specifying VERSIONS (by default, 1 version is kept)
hbase(main):025:0> create 'test:t_user2','cf1','cf2'
0 row(s) in 5.3320 seconds
=> Hbase::Table - test:t_user2
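The effect of VERSIONS can be illustrated with a small Python model. This is a sketch under simplifying assumptions, not HBase code: the class name VersionedCell is invented here, and the model discards old versions immediately on every put, which is a simplification rather than a description of HBase internals.

```python
class VersionedCell:
    """Toy model of one cell that keeps at most max_versions values."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, value, timestamp):
        self.versions.append((timestamp, value))
        # Sort newest first, then drop anything beyond the version limit.
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]

    def get(self, versions=1):
        """Return up to `versions` values, newest first."""
        return [value for _, value in self.versions[:versions]]


cf1_name = VersionedCell(max_versions=3)  # like VERSIONS => 3
for ts, name in enumerate(["zhangsan", "lisi", "wangwu", "zhaoliu"], start=1):
    cf1_name.put(name, ts)

print(cf1_name.get())             # latest value only
print(cf1_name.get(versions=10))  # no more than 3 values are retained
```

With max_versions=1, the default in the second create command, each put would leave only the newest value readable.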
2) list (show all user tables)
hbase(main):026:0> list
TABLE
test:t_user
test:t_user2
2 row(s) in 0.5430 seconds
=> ["test:t_user", "test:t_user2"]
3) describe (show a table's details)
hbase(main):027:0> describe 'test:t_user'
Table test:t_user is ENABLED
test:t_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCK
CACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCK
CACHE => 'true'}
2 row(s) in 1.8370 seconds
hbase(main):028:0> describe 'test:t_user2'
Table test:t_user2 is ENABLED
test:t_user2
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => '
NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCA
CHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => '
NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCA
CHE => 'true'}
2 row(s) in 0.0940 seconds
4) drop (delete a table; in HBase a table cannot be dropped directly and must be disabled first)
hbase(main):030:0> disable 'test:t_user2'
0 row(s) in 14.1030 seconds
hbase(main):031:0> drop 'test:t_user2'
0 row(s) in 2.8550 seconds
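The disable-then-drop rule can be pictured as a tiny state check. The following Python sketch is purely illustrative: ToyTable is a made-up class, and the error message is not the one HBase prints.

```python
class ToyTable:
    """Toy stand-in for a table's enabled/disabled state."""

    def __init__(self, name):
        self.name = name
        self.enabled = True  # a newly created table starts out enabled
        self.exists = True

    def disable(self):
        self.enabled = False

    def drop(self):
        # Mirrors the rule above: an enabled table cannot be dropped.
        if self.enabled:
            raise RuntimeError(f"{self.name} is enabled; disable it first")
        self.exists = False


table = ToyTable("test:t_user2")
try:
    table.drop()
except RuntimeError as err:
    print("drop refused:", err)

table.disable()
table.drop()
print("exists:", table.exists)
```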
5) enable (enable a disabled table)
hbase(main):032:0> enable 'test:t_user'
0 row(s) in 0.0470 seconds
hbase(main):033:0> is_enabled 'test:t_user'
true
0 row(s) in 0.0400 seconds
hbase(main):034:0> is_disabled 'test:t_user'
false
0 row(s) in 0.0520 seconds
6) exists (check whether a table exists)
hbase(main):035:0> exists 'test:t_user'
Table test:t_user does exist
0 row(s) in 0.0660 seconds
hbase(main):036:0> exists 'test:t_user2'
Table test:t_user2 does not exist
0 row(s) in 0.0520 seconds
- data management
1) put (insert or update)
Syntax: put 'ns:table','rowkey','cf:column','value'[,ts]
If the cell does not exist, it is inserted
hbase(main):043:0> put 'test:t_user','user:001','cf1:name','zhangsan'
0 row(s) in 0.0610 seconds
hbase(main):042:0> get 'test:t_user','user:001'
COLUMN CELL
cf1:name timestamp=1527399295854, value=zhangsan
1 row(s) in 0.4010 seconds
If the cell already exists, the new value supersedes the old one
hbase(main):043:0> put 'test:t_user','user:001','cf1:name','lisi'
0 row(s) in 0.0610 seconds
hbase(main):044:0> get 'test:t_user','user:001'
COLUMN CELL
cf1:name timestamp=1527399460905, value=lisi
1 row(s) in 0.0850 seconds
2) get (read column data from a single row)
Syntax: get 'ns:table','rowkey' ...
① Get the latest version of a column
hbase(main):045:0> get 'test:t_user','user:001'
COLUMN CELL
cf1:name timestamp=1527399460905, value=lisi
1 row(s) in 0.9440 seconds
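② TIMERANGE: get data whose timestamp falls within a range (a half-open interval: the left bound is the earlier timestamp and is included, the right bound is the later timestamp and is excluded)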
get 'test:t_user','user:001',{COLUMN=>'cf1',TIMERANGE=>[1527401460239,1527401478060],VERSIONS=>10}
hbase(main):092:0> get 'test:t_user','user:001',{COLUMN=>'cf1',TIMERANGE=>[1527401460239,1527401478060],VERSIONS=>10}
COLUMN CELL
cf1:name timestamp=1527401471106, value=wangwu
cf1:name timestamp=1527401466085, value=lisi
cf1:name timestamp=1527401460239, value=zhangsan
3 row(s) in 0.0630 seconds
Note: use TIMERANGE together with VERSIONS; otherwise only the newest cell within the range is returned.
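The half-open interval and its interaction with VERSIONS can be checked with a small Python model. This is a sketch of the filtering rule only: get_in_timerange is a hypothetical helper, and the data is the set of cf1:name versions for user:001 that appears in the transcripts of this section.

```python
# Versions of cf1:name for row user:001, as (timestamp, value) pairs.
VERSIONS_OF_NAME = [
    (1527401460239, "zhangsan"),
    (1527401466085, "lisi"),
    (1527401471106, "wangwu"),
    (1527401478060, "zhaoliu"),
]


def get_in_timerange(cells, min_ts, max_ts, versions=1):
    """Return up to `versions` values with min_ts <= timestamp < max_ts, newest first."""
    in_range = [(ts, value) for ts, value in cells if min_ts <= ts < max_ts]
    in_range.sort(reverse=True)  # newest timestamp first
    return [value for _, value in in_range[:versions]]


# With VERSIONS => 10: everything in the range, the right bound excluded.
print(get_in_timerange(VERSIONS_OF_NAME, 1527401460239, 1527401478060, versions=10))
# Without VERSIONS: only the newest cell inside the range.
print(get_in_timerange(VERSIONS_OF_NAME, 1527401460239, 1527401478060))
```

The zhaoliu version sits exactly on the right bound, so it is left out, while zhangsan on the left bound is included.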
③ Get the two latest versions of a column
hbase(main):095:0> get 'test:t_user','user:001',{COLUMN=>'cf1',VERSIONS=>2}
COLUMN CELL
cf1:name timestamp=1527401478060, value=zhaoliu
cf1:name timestamp=1527401471106, value=wangwu
2 row(s) in 0.2140 seconds
④ Get the column data at a specific timestamp
hbase(main):096:0> get 'test:t_user','user:001',{COLUMN=>'cf1',TIMESTAMP=>1527401471106}
COLUMN CELL
cf1:name timestamp=1527401471106, value=wangwu
1 row(s) in 0.8810 seconds
⑤ Get multiple columns
Prepare the data
hbase(main):099:0> put 'test:t_user','user:001','cf1:age','10'
0 row(s) in 0.4190 seconds
hbase(main):102:0> get 'test:t_user','user:001',{COLUMN => ['cf1:age','cf1:name'],VERSIONS=>1}
COLUMN CELL
cf1:age timestamp=1527403480511, value=10
cf1:name timestamp=1527401478060, value=zhaoliu
2 row(s) in 0.1850 seconds
3) scan (read a batch of rows)
① Scan the latest version of every column
hbase(main):105:0> scan 'test:t_user'
ROW COLUMN+CELL
user:001 column=cf1:age, timestamp=1527403480511, value=10
user:001 column=cf1:name, timestamp=1527401478060, value=zhaoliu
1 row(s) in 0.6840 seconds
② Paged scan (LIMIT is the number of rowkeys to return; STARTROW is the rowkey to start from)
Prepare the data
hbase(main):106:0> put 'test:t_user','user:002','cf1:name','jeffery'
0 row(s) in 0.3810 seconds
hbase(main):107:0> scan 'test:t_user', {COLUMNS => ['cf1'], LIMIT =>1, STARTROW => 'user:002'}
ROW COLUMN+CELL
user:002 column=cf1:name, timestamp=1527404037898, value=jeffery
1 row(s) in 0.3700 seconds
③ Paged scan in reverse order
hbase(main):111:0> scan 'test:t_user', {COLUMNS => ['cf1'], LIMIT =>11, STARTROW => 'user:002',REVERSED => true}
ROW COLUMN+CELL
user:002 column=cf1:name, timestamp=1527404037898, value=jeffery
user:001 column=cf1:age, timestamp=1527403480511, value=10
user:001 column=cf1:name, timestamp=1527401478060, value=zhaoliu
2 row(s) in 0.5790 seconds
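How STARTROW, LIMIT and REVERSED combine can be sketched over a sorted list of rowkeys. This Python model is an approximation for illustration: the function scan and its parameters are made up, values are omitted, and plain string ordering is used, which assumes ASCII rowkeys such as the ones in this section.

```python
import bisect

# Rowkeys are kept in sorted order; cell values are omitted for brevity.
ROWKEYS = sorted(["user:001", "user:002"])


def scan(rowkeys, startrow=None, limit=None, is_reversed=False):
    """Return rowkeys from startrow (inclusive), optionally reversed and limited."""
    if is_reversed:
        # Walk backwards from startrow, or from the last row if none is given.
        end = len(rowkeys) if startrow is None else bisect.bisect_right(rowkeys, startrow)
        selected = rowkeys[:end][::-1]
    else:
        begin = 0 if startrow is None else bisect.bisect_left(rowkeys, startrow)
        selected = rowkeys[begin:]
    return selected if limit is None else selected[:limit]


print(scan(ROWKEYS))                                                  # full scan
print(scan(ROWKEYS, startrow="user:002", limit=1))                    # one page
print(scan(ROWKEYS, startrow="user:002", limit=11, is_reversed=True)) # reverse
```

As in the transcripts, LIMIT counts rowkeys rather than cells, so a row with several columns still counts once. For forward paging, a common approach is to use the last rowkey of one page to derive the STARTROW of the next.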
4) delete
Data before the test
hbase(main):113:0> get 'test:t_user','user:001',{COLUMN=>'cf1',VERSIONS=>10}
COLUMN CELL
cf1:age timestamp=1527403480511, value=10
cf1:name timestamp=1527401478060, value=zhaoliu
cf1:name timestamp=1527401471106, value=wangwu
cf1:name timestamp=1527401466085, value=lisi
cf1:name timestamp=1527401460239, value=zhangsan
5 row(s) in 0.1440 seconds
① Delete with a timestamp (removes one version of the column)
hbase(main):114:0> delete 'test:t_user','user:001','cf1:name',1527401460239
0 row(s) in 0.5630 seconds
hbase(main):115:0> get 'test:t_user','user:001',{COLUMN=>'cf1',VERSIONS=>10}
COLUMN CELL
cf1:age timestamp=1527403480511, value=10
cf1:name timestamp=1527401478060, value=zhaoliu
cf1:name timestamp=1527401471106, value=wangwu
cf1:name timestamp=1527401466085, value=lisi
4 row(s) in 4.9890 seconds
② Delete without a timestamp (removes all versions of the column)
hbase(main):116:0> delete 'test:t_user','user:001','cf1:name'
0 row(s) in 0.4970 seconds
hbase(main):117:0> get 'test:t_user','user:001',{COLUMN=>'cf1',VERSIONS=>10}
COLUMN CELL
cf1:age timestamp=1527403480511, value=10
1 row(s) in 1.2470 seconds
5) truncate (similar to the RDBMS TRUNCATE statement: quickly removes all data while keeping the table structure)
hbase(main):118:0> truncate
truncate truncate_preserve
hbase(main):118:0> truncate 'test:t_user'
Truncating 'test:t_user' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 16.7690 seconds