HBase (Overview)

luoyunfan6

Article link: https://blog.csdn.net/luoyunfan6/article/details/102740266

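This article covers HBase's origins, architecture, cluster setup, shell connection, DDL and DML operations, Java API usage, filters, read/write flow, storage mechanism, addressing mechanism, integration with MapReduce and Hive, and enterprise-level tuning, from basic concepts to advanced applications.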

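1 HBase Documentation

1.1 Origins of HBase

1. HBase is an open-source, distributed, multi-version, scalable, non-relational database.
2. HBase is the open-source Java implementation of Google's Bigtable. It is built on top of HDFS and provides a highly reliable, high-performance, column-family-oriented, scalable NoSQL database system with near-real-time reads and writes.
3. As data volumes keep growing, traditional relational databases can no longer meet storage and query requirements. Hive can hold the data, but its queries run as MapReduce jobs underneath, so reads and writes are slow. Hive is also a poor fit for unstructured and semi-structured data; it is mainly a tool for analysis and statistics, not a storage system for random access.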

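1.2 HBase Architecture

1.2.1 Concepts

HBase is a database that provides real-time random reads and writes.
Unlike relational databases such as MySQL, Oracle, DB2, and SQL Server, HBase is a NoSQL (non-relational) database.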

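The HBase table model differs from the relational table model:
- An HBase table has no fixed column definitions.
- Each row of an HBase table stores a set of key-value pairs.
- An HBase table is divided into column families; the user chooses which column family each key-value pair is inserted into.
- Physically, an HBase table is split by column family: data from different column families is always stored in different files.
- Every row of an HBase table has a row key, and row keys must be unique within the table.
- All data in HBase, including the row key, the key, and the value, is of type byte[]; HBase does not track data types for the user.
- HBase has very limited transaction support (atomicity is guaranteed only for operations on a single row).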

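What distinguishes HBase from other NoSQL databases (MongoDB, Redis, Cassandra, Hazelcast):

HBase table data is stored in the HDFS file system.
As a result, storage capacity scales linearly, and the stored data is highly durable and reliable because HDFS replicates it.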

1.2.2 The HBase Table Model

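- The HBase table model is very different from that of relational databases such as MySQL.
- It has the concept of rows, but no concept of fixed fields.
- Each row holds key-value pairs; the keys can differ from row to row, and so can the number of pairs.

Table model characteristics

1. A table has a name.
2. A table can be divided into multiple column families (data from different column families is stored in different files).
3. Every row has a row key (rowkey), which must be unique within the table.
4. Each key-value pair in a table is called a cell.
5. HBase can keep multiple historical versions of a value (the number of versions is configurable).
6. When a table grows too large, it is split horizontally into multiple regions (each identified by a rowkey range); data from different regions is also stored in different files.
7. HBase stores inserted data in sorted order:
	- Point 1: rows are sorted by row key first.
	- Point 2: within a row, key-value pairs are sorted by column family, then by key (column qualifier).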
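
The ordering rule in point 7 can be illustrated with a small sketch. This is a toy model in plain Python, not HBase code: cells are tuples, and the "newest timestamp first" tie-breaker is my understanding of HBase's behavior rather than something stated above. The sample rows are taken from the scan output later in this article.

```python
from typing import List, Tuple

# (rowkey, column family, qualifier, timestamp, value) -- a simplified cell
Cell = Tuple[bytes, bytes, bytes, int, bytes]


def sort_cells(cells: List[Cell]) -> List[Cell]:
    """Sort by rowkey, then family, then qualifier (bytewise), newest timestamp first."""
    return sorted(cells, key=lambda c: (c[0], c[1], c[2], -c[3]))


cells = [
    (b"002", b"base_info", b"username", 1538898168513, b"lishiming"),
    (b"001", b"base_info", b"username", 1538897633942, b"lixi"),
    (b"001", b"base_info", b"love", 1538897913186, b"basketball"),
]

ordered = sort_cells(cells)
for row, family, qualifier, ts, value in ordered:
    print(row.decode(), f"{family.decode()}:{qualifier.decode()}", ts, value.decode())
```

Row 001 comes before row 002, and within row 001 the qualifier `love` sorts before `username`, which matches the order that `scan` prints.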

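Data types stored in HBase

- HBase supports only byte[].
- This covers the rowkey, key (column qualifier), value, column family name, and table name.

Relationship to Hadoop

HBase is built on Hadoop: its storage layer depends on HDFS.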
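
Because everything is a byte[], rowkeys are compared byte by byte, not numerically. The sketch below, plain Python with a hypothetical helper `encode_int_key`, shows one consequence: numbers written as strings sort "wrongly", while a fixed-width big-endian encoding keeps numeric order for non-negative integers.

```python
import struct


def encode_int_key(n: int) -> bytes:
    """Encode a non-negative integer as 8 big-endian bytes so byte order equals numeric order."""
    return struct.pack(">Q", n)


# Keys stored as decimal strings: compared bytewise, "10" sorts before "2".
string_keys = sorted(str(n).encode() for n in (2, 10, 1))

# Keys stored as fixed-width big-endian integers: byte order matches numeric order.
int_keys = sorted(encode_int_key(n) for n in (2, 10, 1))
decoded = [struct.unpack(">Q", k)[0] for k in int_keys]

print(string_keys)
print(decoded)
```

This is one reason string rowkeys are often zero-padded (as in '001', '002' in the examples below).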

1.2.3 Introduction from the Official Site

Suitable scenarios

1. Massive amounts of unstructured data need to be stored.
2. The data needs random, near-real-time reads and writes.

1.2.4 HBase Architecture

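- Client: the HBase client.
	1. Provides the interfaces for accessing HBase, such as the Linux shell and the Java API.
	2. Maintains caches, such as region location information, to speed up access to HBase.
- ZooKeeper:
	1. Monitors the state of the HMaster and ensures that exactly one HMaster is active at any time, which provides high availability.
	2. Stores the addressing entry point for all regions, that is, which server hosts the hbase:meta table (the -ROOT- table in versions before 0.96).
	3. Monitors HRegionServer state in real time, detects HRegionServers going online or offline, and notifies the HMaster.
	4. Stores part of HBase's metadata.
- HMaster:
	1. Assigns regions to HRegionServers (for example, when a table is created).
	2. Handles load balancing across HRegionServers.
	3. Reassigns regions (after an HRegionServer fails, and for the daughter regions produced when an oversized region splits).
	4. Garbage-collects obsolete files on HDFS.
	5. Handles schema update requests.
- HRegionServer:
	1. Maintains the regions assigned to it by the HMaster (manages the regions on its own host).
	2. Handles client read and write requests for those regions and interacts with HDFS.
	3. Splits regions that gradually grow too large at runtime.
- HLog:
	1. Records operations on HBase as a write-ahead log (WAL): a put is written to the log first and then to the MemStore, so edits that were still in memory when a server failed can be recovered by replaying the log.
- HRegion:
	1. The smallest unit of distributed storage and load balancing in HBase; it is a table or part of a table.
- Store:
	1. Corresponds to one column family (within a region).
- MemStore:
	1. An in-memory write buffer whose contents are flushed to HDFS in batches; the default flush size is 128 MB.
- HStoreFile:
	1. Conceptually the same as an HFile, but a logical concept. HBase data is stored on HDFS as HFiles.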
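
The interplay of HLog, MemStore, and store files described above can be sketched as a toy write path. This is a simplified model in plain Python, not HBase code: the class `ToyStore`, its byte accounting, and the tiny flush threshold are all assumptions made for illustration (the real default threshold is 128 MB, and real log cleanup is more involved).

```python
class ToyStore:
    """Toy write path: append to the log first, then buffer in memory, flush at a threshold."""

    def __init__(self, flush_size_bytes: int):
        self.flush_size_bytes = flush_size_bytes
        self.wal = []            # stands in for the HLog (write-ahead log)
        self.memstore = {}       # in-memory buffer, lost on a crash
        self.memstore_bytes = 0  # rough size estimate; overwrites are counted again
        self.store_files = []    # each flush produces one sorted, immutable "file"

    def put(self, key: bytes, value: bytes) -> None:
        self.wal.append((key, value))   # 1. write the log entry first
        self.memstore[key] = value      # 2. then update the in-memory buffer
        self.memstore_bytes += len(key) + len(value)
        if self.memstore_bytes >= self.flush_size_bytes:
            self.flush()

    def flush(self) -> None:
        if not self.memstore:
            return
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.memstore_bytes = 0
        self.wal.clear()  # flushed edits no longer need the log in this toy model

    def crash_and_recover(self) -> None:
        """Simulate losing the MemStore, then rebuild it by replaying the log."""
        self.memstore = {}
        self.memstore_bytes = 0
        for key, value in self.wal:
            self.memstore[key] = value
            self.memstore_bytes += len(key) + len(value)


store = ToyStore(flush_size_bytes=16)
store.put(b"r1", b"aaaaaa")   # 8 bytes buffered
store.put(b"r2", b"bbbbbb")   # reaches 16 bytes, triggers a flush
store.put(b"r3", b"cc")       # buffered only, not yet flushed
store.crash_and_recover()     # r3 survives because it was logged first
print(store.store_files, store.memstore)
```

The point of the ordering in `put` is that anything acknowledged to the client is already in the log, so a crash before the next flush loses nothing.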

1.2.5 Relationships Between Components

hmaster:hregionserver=1:*
hregionserver:hregion=1:*
hregionserver:hlog=1:1
hregion:hstore=1:*
store:memstore=1:1
store:storefile=1:*
storefile:hfile=1:1

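1.2.6 Summary

rowkey: the row key, similar to a primary key in MySQL; duplicates are not allowed.
columnfamily: a column family, that is, a group of columns.
column: a column.
timestamp: the timestamp that distinguishes multiple versions of the value for the same key; queries return the version with the newest timestamp by default.
version: the version number of a stored value.
cell: a cell, that is, a single key-value pair.

Schema: none
Data types: only byte[] is stored
Multi-version: every value can have multiple versions
Column-oriented storage: each column family is stored in its own directory
Sparse storage: a key-value pair that is null takes up no storage space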
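
The multi-version and sparse-storage properties can be sketched with a small in-memory model. This is an illustration in plain Python, not HBase code: `ToyTable` is a hypothetical class, and it trims old versions on every put, whereas HBase, as far as I know, removes excess versions later during compaction. The sample values come from the raw scan output later in this article.

```python
from collections import defaultdict


class ToyTable:
    """Toy cell store: (row, column) -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions: int = 1):
        self.max_versions = max_versions
        self.cells = defaultdict(list)

    def put(self, row: str, column: str, ts: int, value: str) -> None:
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest timestamp first
        del versions[self.max_versions:]                 # keep at most max_versions

    def get(self, row: str, column: str, versions: int = 1):
        # .get() does not create an entry, so absent cells stay absent (sparse).
        return self.cells.get((row, column), [])[:versions]


table = ToyTable(max_versions=2)
table.put("001", "base_info:name", 1546922810789, "lixi")
table.put("001", "base_info:name", 1546923712904, "rock")

latest = table.get("001", "base_info:name")
both = table.get("001", "base_info:name", versions=2)
missing = table.get("001", "base_info:age")
print(latest, both, missing, len(table.cells))
```

A read returns the newest version unless more are requested, and a column that was never written occupies no entry at all.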

1.3 HBase Cluster Setup

1.3.1 Standalone Installation

Extract the archive

[root@centos1 home]# tar -zxvf hbase-1.2.1-bin.tar.gz -C /usr/local/

Configure environment variables

export HBASE_HOME=/usr/local/hbase-1.2.1
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin

hbase-env.sh

# The java implementation to use.  Java 1.7+ required.
export JAVA_HOME=/usr/local/java/jdk1.8.0_45

# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=true

hbase-site.xml

<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>file:///usr/local/hbase-1.2.1/hbasedata</value>
        </property>
        <property>
                <name>hbase.zookeeper.property.dataDir</name>
                <value>/usr/local/hbase-1.2.1/zkdata</value>
        </property>
</configuration>

Start the HBase service

[root@centos1 conf]# start-hbase.sh

[root@centos1 hbase-1.2.1]# jps
4593 HMaster
3272 ResourceManager
4666 Jps
2923 NameNode
3116 SecondaryNameNode

Connect with the HBase client

[root@centos1 logs]# hbase shell
hbase(main):002:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load
----------------------------------------------------------------------------------------
hbase(main):003:0> version
1.2.1, r8d8a7107dc4ccbf36a92f64675dc60392f85c015, Wed Mar 30 11:19:21 CDT 2016
----------------------------------------------------------------------------------------
hbase(main):004:0> whoami
root (auth:SIMPLE)
    groups: root
----------------------------------------------------------------------------------------  
hbase(main):005:0> help 'table_help'
Help for table-reference commands.

You can either create a table via 'create' and then manipulate the table via commands like 'put', 'get', etc.
See the standard help information for how to use each of these commands.

However, as of 0.96, you can also get a reference to a table, on which you can invoke commands.
For instance, you can get create a table and keep around a reference to it via:

   hbase> t = create 't', 'cf'

Or, if you have already created the table, you can get a reference to it:

   hbase> t = get_table 't'

You can do things like call 'put' on the table:

  hbase> t.put 'r', 'cf:q', 'v'

which puts a row 'r' with column family 'cf', qualifier 'q' and value 'v' into table t.

To read the data out, you can scan the table:

  hbase> t.scan

which will read all the rows in table 't'.

Essentially, any command that takes a table name can also be done via table reference.
Other commands include things like: get, delete, deleteall,
get_all_columns, get_counter, count, incr. These functions, along with
the standard JRuby object methods are also available via tab completion.

For more information on how to use each of these commands, you can also just type:

   hbase> t.help 'scan'

which will output more information on how to use that command.

You can also do general admin actions directly on a table; things like enable, disable,
flush and drop just by typing:

   hbase> t.enable
   hbase> t.flush
   hbase> t.disable
   hbase> t.drop

Note that after dropping a table, your reference to it becomes useless and further usage
is undefined (and not recommended).
----------------------------------------------------------------------------------------
hbase(main):017:0> help 'ddl'
----------------------------------------------------------------------------------------
hbase(main):026:0> create 't1', {NAME => 'f1', VERSIONS => 5}
hbase(main):026:0> list
hbase(main):026:0> disable 't1'
hbase(main):026:0> drop 't1'

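1.3.2 Pseudo-Distributed Installation

Purpose

	After working through the standalone quickstart, you can reconfigure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs entirely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process, whereas in standalone mode all daemons run in a single JVM process. By default, unless you configure the hbase.rootdir property as described in the quickstart, your data is still stored in /tmp/. In this walkthrough the data is stored in HDFS, assuming you have HDFS available. You can skip the HDFS configuration and continue storing data in the local filesystem.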

hbase-site.xml

<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://centos1:9000/hbase</value>
        </property>
        <property>
                <name>hbase.zookeeper.property.dataDir</name>
                <value>/usr/local/hbase-1.2.1/zkdata</value>
        </property>
        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>
</configuration>

Start

[root@centos1 conf]# start-hbase.sh
[root@centos1 conf]# jps
6352 HRegionServer
6232 HMaster
3272 ResourceManager
6169 HQuorumPeer
2923 NameNode
3116 SecondaryNameNode
6445 Jps

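1.3.3 Fully Distributed Installation

Purpose

HBase is a distributed system.
It has one management role: HMaster (typically two machines, one active and one backup).
The data nodes run the other role: HRegionServer (many machines, depending on data volume).

In practice, you need a fully distributed configuration to test HBase thoroughly and to use it in real-world scenarios. In a distributed configuration, the cluster contains multiple nodes, each running one or more HBase daemons. These include primary and backup Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.

Role assignment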

centos1:namenode hmaster 
centos2:datanode regionserver zookeeper backup master
centos3:datanode regionserver zookeeper
centos4:datanode regionserver zookeeper

Install ZooKeeper
hbase-env.sh

# The java implementation to use.  Java 1.7+ required.
export JAVA_HOME=/usr/local/java/jdk1.8.0_45
export HBASE_MANAGES_ZK=false

hbase-site.xml

<configuration>
		<!-- Path where HBase stores its data on HDFS -->
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://centos1:9000/hbase</value>
        </property>
		<!-- Run HBase in distributed mode -->
        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>
		<!-- ZooKeeper quorum addresses; separate multiple entries with "," -->
        <property>
                <name>hbase.zookeeper.quorum</name>
                <value>mini1:2181,mini2:2181,mini3:2181</value>
        </property>
        <!-- Alternatively, list only the hosts and set the client port separately:
        <property>
                <name>hbase.zookeeper.quorum</name>
                <value>mini1,mini2,mini3</value>
        </property>
        <property>
                <name>hbase.zookeeper.property.clientPort</name>
                <value>2181</value>
        </property>
        -->
</configuration>
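
As a rough illustration of how the two quorum notations relate, the sketch below parses an hbase-site.xml string and expands `hbase.zookeeper.quorum` into (host, port) pairs. It is plain Python using only the standard library, not part of HBase; the helper names `load_site` and `quorum_hosts` are hypothetical, and the fallback values (`localhost`, port 2181) are assumptions about the defaults.

```python
import xml.etree.ElementTree as ET

SITE_XML = """<configuration>
  <property><name>hbase.rootdir</name><value>hdfs://centos1:9000/hbase</value></property>
  <property><name>hbase.cluster.distributed</name><value>true</value></property>
  <property><name>hbase.zookeeper.quorum</name><value>mini1:2181,mini2:2181,mini3:2181</value></property>
</configuration>"""


def load_site(xml_text: str) -> dict:
    """Read <property><name>/<value> pairs into a dict."""
    root = ET.fromstring(xml_text)
    return {
        p.findtext("name").strip(): p.findtext("value").strip()
        for p in root.findall("property")
    }


def quorum_hosts(props: dict) -> list:
    """Expand the quorum setting into (host, port) pairs, filling in the client port if omitted."""
    default_port = int(props.get("hbase.zookeeper.property.clientPort", "2181"))
    hosts = []
    for entry in props.get("hbase.zookeeper.quorum", "localhost").split(","):
        host, _, port = entry.strip().partition(":")
        hosts.append((host, int(port) if port else default_port))
    return hosts


props = load_site(SITE_XML)
print(props["hbase.rootdir"])
print(quorum_hosts(props))
```

Both notations (`mini1:2181,...` and `mini1,...` plus a separate client port) expand to the same list of pairs.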

regionservers

centos2
centos3
centos4

In HBase's conf directory, create a file named backup-masters and add the hostname centos2 to it

centos2

Distribute to the other nodes

scp -r hbase-1.2.1/ centos2:/usr/local/
scp -r hbase-1.2.1/ centos3:/usr/local/
scp -r hbase-1.2.1/ centos4:/usr/local/

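Start HBase

1. Start Hadoop and ZooKeeper first.
2. start-hbase.sh

Fix clock skew (a RegionServer whose clock differs too much from the Master's is rejected at startup)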

yum -y install ntpdate
ntpdate ntp1.aliyun.com

Check HDFS

Check HBase

[root@centos1 apps]# jps
1888 QuorumPeerMain
2438 ResourceManager
2822 HMaster
2090 NameNode
2284 SecondaryNameNode
3373 Jps
2943 HRegionServer


[root@centos1 apps]# netstat -nltp | grep 2822
tcp        0      0 ::ffff:192.168.49.250:16000 :::*                        LISTEN      2822/java
tcp        0      0 :::16010                    :::*                        LISTEN      2822/java

Entries in ZooKeeper

[root@hive1 zookeeper-3.4.5]# bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper, hbase]

[zk: localhost:2181(CONNECTED) 1] ls /hbase
[replication, meta-region-server, rs, splitWAL, backup-masters, table-lock, flush-table-proc, region-in-transition, online-snapshot, master, running, recovering-regions, draining, namespace, hbaseid, table]

1.4 Connecting with the HBase Shell

1.4.1 Basic Connection

Start the client

[root@centos1 bin]# ./hbase shell

Help syntax

help 'command-group'
e.g.	help 'create'

1.4.2 Working with Namespaces

1. list_namespace: list all namespaces
hbase(main):008:0> list_namespace
NAMESPACE
default
hbase

2. list_namespace_tables: list the tables in a given namespace
hbase(main):014:0> list_namespace_tables 'hbase'
TABLE
meta
namespace

3. create_namespace: create a namespace
hbase(main):018:0> create_namespace 'ns1'
hbase(main):019:0> list_namespace
NAMESPACE
default
hbase
ns1

4. describe_namespace: show the definition of a given namespace
hbase(main):021:0> describe_namespace 'ns1'
DESCRIPTION
{NAME => 'ns1'}

5. alter_namespace: set or unset properties of a namespace
hbase(main):022:0>  alter_namespace 'ns1', {METHOD => 'set', 'name' => 'lixi'}

hbase(main):023:0> describe_namespace 'ns1'
DESCRIPTION
{NAME => 'ns1', name => 'lixi'}

hbase(main):022:0> alter_namespace 'ns1', {METHOD => 'unset', NAME => 'name'}
hbase(main):023:0> describe_namespace 'ns1'

6. drop_namespace: delete a namespace
hbase(main):026:0> drop_namespace 'ns1'

hbase(main):027:0> list_namespace
NAMESPACE
default
hbase

7. Create a table in the new namespace (the namespace must exist when the table is created)
hbase(main):032:0> create 'ns1:t1', 'f1', 'f2'

=> Hbase::Table - ns1:t1
hbase(main):033:0> list
TABLE
ns1:t1

=> ["ns1:t1"]
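
The `namespace:table` naming used above can be sketched as a tiny parser. This is plain Python for illustration only; the function name `split_table_name` is hypothetical, and the rule that a name without a prefix belongs to the `default` namespace reflects the `default` entry shown by list_namespace.

```python
def split_table_name(name: str) -> tuple:
    """Split 'namespace:table' into its parts; names without a prefix use 'default'."""
    namespace, separator, qualifier = name.partition(":")
    if not separator:
        return ("default", name)
    return (namespace, qualifier)


print(split_table_name("ns1:t1"))
print(split_table_name("user_info"))
```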

1.5 DDL and DML Operations

1.5.1 DDL

1.5.1.1 Creating a Table

create: create a table

hbase(main):010:0> create 'user_info','base_info','extra_info'

=> Hbase::Table - user_info

hbase(main):043:0> create 'ns1:user_info', {NAME=>'base_info', BLOOMFILTER=>'ROWCOL',VERSIONS=>'3'}

1.5.1.2 list: list all tables

hbase(main):002:0> list
TABLE
ns1:t1
ns1:user_info
2 row(s) in 0.2830 seconds

=> ["ns1:t1", "ns1:user_info"]

1.5.1.3 describe: show the table schema

hbase(main):003:0> describe 'ns1:user_info'
Table ns1:user_info is ENABLED
ns1:user_info
COLUMN FAMILIES DESCRIPTION
{NAME => 'base_info', BLOOMFILTER => 'ROWCOL', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS
 => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

1.5.1.4 create with SPLITS: pre-split a table into regions

hbase(main):007:0> create 'ns1:t2', 'f1', SPLITS => ['10', '20', '30', '40']
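
To make the effect of the split points concrete, the sketch below models which region a rowkey would fall into for the split points above. It is plain Python, not HBase code: the function names are hypothetical, and it assumes each region covers a half-open range [start key, end key) with rowkeys compared bytewise, where an empty key means "unbounded".

```python
import bisect

# Four split points produce five regions.
SPLITS = [b"10", b"20", b"30", b"40"]


def region_index(rowkey: bytes, splits=SPLITS) -> int:
    """Return the index of the region whose [start, end) range contains rowkey."""
    return bisect.bisect_right(splits, rowkey)


def region_range(index: int, splits=SPLITS) -> tuple:
    """Return (start key, end key) for a region; b'' stands for an open boundary."""
    start = splits[index - 1] if index > 0 else b""
    end = splits[index] if index < len(splits) else b""
    return (start, end)


for key in (b"05", b"10", b"100", b"25", b"99"):
    idx = region_index(key)
    print(key, "-> region", idx, region_range(idx))
```

Note that `b"100"` lands in the region starting at `b"10"`, not in the last one, because the comparison is bytewise rather than numeric.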

1.5.1.5 Altering a Table

alter: modify a table by adding or changing column family settings

hbase(main):009:0> alter 'ns1:t1', {NAME=>'lixi_info'}

hbase(main):010:0> describe 'ns1:t1'
Table ns1:t1 is ENABLED
ns1:t1
COLUMN FAMILIES DESCRIPTION
{NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', B
LOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'f2', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', B
LOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'lixi_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS =>
 '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
3 row(s) in 0.0250 seconds

Delete a column family

hbase(main):014:0> alter 'ns1:t1', 'delete' => 'lixi_info'

hbase(main):015:0> describe 'ns1:t1'
Table ns1:t1 is ENABLED
ns1:t1
COLUMN FAMILIES DESCRIPTION
{NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', B
LOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'f2', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', B
LOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0170 seconds

Drop a table (disable it first, then drop it)

hbase(main):016:0> disable 'ns1:t1'
0 row(s) in 2.2790 seconds

hbase(main):017:0> drop 'ns1:t1'
0 row(s) in 1.2900 seconds

hbase(main):018:0> list
TABLE
ns1:t2
ns1:user_info
2 row(s) in 0.0090 seconds

=> ["ns1:t2", "ns1:user_info"]
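
The "disable first, then drop" rule can be pictured as a small state machine. The sketch below is a toy model in plain Python, not the HBase Admin API; the class `ToyAdmin` and its error type are assumptions chosen for illustration.

```python
class ToyAdmin:
    """Toy model of the enable/disable/drop lifecycle of a table."""

    def __init__(self):
        self.enabled = {}  # table name -> True if the table is enabled

    def create(self, name: str) -> None:
        if name in self.enabled:
            raise ValueError(f"table {name} already exists")
        self.enabled[name] = True  # new tables start out enabled

    def disable(self, name: str) -> None:
        self.enabled[name] = False

    def drop(self, name: str) -> None:
        if self.enabled[name]:
            raise RuntimeError(f"table {name} is enabled; disable it first")
        del self.enabled[name]

    def list(self) -> list:
        return sorted(self.enabled)


admin = ToyAdmin()
for table in ("ns1:t1", "ns1:t2", "ns1:user_info"):
    admin.create(table)

try:
    admin.drop("ns1:t1")   # still enabled, so the drop is refused
    refused = False
except RuntimeError:
    refused = True

admin.disable("ns1:t1")
admin.drop("ns1:t1")       # succeeds once the table is disabled
print(refused, admin.list())
```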

1.5.2 DML

1.5.2.1 Inserting Data (put writes one cell per command; it cannot insert several at once)

hbase(main):012:0> put 'user_info','001','base_info:username','lixi'
0 row(s) in 0.9800 seconds

1.5.2.2 Scanning with scan

hbase(main):024:0> scan 'user_info'
ROW                                                  COLUMN+CELL
 001                                                 column=base_info:love, timestamp=1538897913186, value=basketball
 001                                                 column=base_info:username, timestamp=1538897633942, value=lixi
 002                                                 column=base_info:username, timestamp=1538898168513, value=lishiming
2 row(s) in 0.0520 seconds

Query by number of versions

hbase(main):024:0> scan 'user_info', {RAW => true, VERSIONS => 1}
ROW                                                 COLUMN+CELL
 001                                                column=base_info:age, timestamp=1546922817429, value=32
 001                                                column=base_info:name, timestamp=1546923712904, value=rock
 001                                                column=extra_info:feature, timestamp=1546922881922, value=shuai
 001                                                column=super_info:size, timestamp=1546922931075, value=111
1 row(s) in 0.0160 seconds

hbase(main):025:0> scan 'user_info', {RAW => true, VERSIONS => 2}
ROW                                                 COLUMN+CELL
 001                                                column=base_info:age, timestamp=1546922817429, value=32
 001                                                column=base_info:name, timestamp=1546923712904, value=rock
 001                                                column=base_info:name, timestamp=1546922810789, value=lixi
 001                                                column=extra_info:feature, timestamp=1546922881922, value=shuai
 001                                                column=super_info:size, timestamp=1546922931075, value=111
1 row(s) in 0.0180 seconds

Query a specific column

hbase(main):014:0> scan 'user_info',{COLUMNS => 'base_info:name'}
ROW                                                 COLUMN+CELL
 001                                                column=base_info:name, timestamp=1546923712904, value=rock

Paginated query

hbase(main):021:0> scan 'user_info', {COLUMNS
