HBase in Practice

sparkjvm

Article link: https://blog.csdn.net/sparkjvm/article/details/42387841

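1. The HBase data model (HBase belongs to the NoSQL family of databases)

HBase basics

1.1 Table: the unit in which data is stored and managed.

1.2 Row key: similar to a primary key in MySQL.

Every row has a row key. It is built into HBase rather than something you declare.

1.3 Column family: a collection of columns.

Column families must be specified when the table is defined; columns are added dynamically when records are inserted.

On disk, the data of each column family is stored in its own files.

1.4 Timestamp: an attribute of each value stored in a column (columns are also called labels or qualifiers).

The cell identified by a row key and a column can hold several values. Each value carries a timestamp, which means the data is versioned.

If no timestamp or version is specified, a read returns the most recent value.

1.5 All data stored in HBase is stored as byte arrays.

1.6 The rows of a table are physically stored in row key order.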
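The points above can be illustrated with a small in-memory toy. This is only a sketch of the data model, not the HBase client API: the class `ToyTable` and its methods are invented for illustration, and row keys are plain strings here (HBase compares raw bytes, which gives the same order for ASCII keys).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Toy model of the HBase data model (hypothetical class, not part of HBase).
class ToyTable {
    // row key -> "family:column" -> timestamp (newest first) -> value bytes
    private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> rows = new TreeMap<>();

    // Every write adds a new version; older versions of the cell are kept.
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, TreeMap<Long, byte[]>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, byte[]>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value.getBytes(StandardCharsets.UTF_8));
    }

    // A read without a timestamp returns the newest version, or null if the cell is empty.
    String getLatest(String rowKey, String column) {
        TreeMap<String, TreeMap<Long, byte[]>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        byte[] newest = row.get(column).firstEntry().getValue();
        return new String(newest, StandardCharsets.UTF_8);
    }

    // Rows come back sorted by row key, mirroring the physical storage order.
    List<String> rowKeysInStorageOrder() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        table.put("xiaoming", "info:age", 1L, "24");
        table.put("xiaoming", "info:age", 2L, "25");
        table.put("lisi", "info:age", 1L, "30");
        System.out.println(table.getLatest("xiaoming", "info:age"));
        System.out.println(table.rowKeysInStorageOrder());
    }
}
```

The nested sorted maps are a convenient way to think about a table: sorted by row key first, then by column, then by timestamp with the newest value first.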

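2. The HBase physical model

2.1 HBase is a database suited to simple queries over massive data sets (for example 20 PB), with response times on the order of seconds.

2.2 The records of a table are split by row key into regions.

The regions are spread across region servers (separate physical machines).

An operation on a table therefore becomes a parallel operation across several region servers, which makes parallel queries easy.

2.3 hbase-default.xml

The setting hbase.hregion.max.filesize in this file defines the maximum size of the stored data of a column family.

The default is 10 GB. Once that size is exceeded the region is split in two, and the split follows the physical row key order (one region becomes two regions; note that a column family can contain many columns).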
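As a rough sketch of how splitting by row key works, the toy below keeps a sorted map from region start key to region server. It is not HBase code: `ToyRegionMap`, the server names and the split key are assumptions made up for the example.

```java
import java.util.TreeMap;

// Toy model of regions as row-key ranges (hypothetical class, not part of HBase).
class ToyRegionMap {
    // region start key (inclusive) -> name of the region server hosting the region
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    ToyRegionMap(String firstServer) {
        // A new table starts as one region covering every row key ("" sorts before all keys).
        startKeyToServer.put("", firstServer);
    }

    // Split the region that contains splitKey; the upper half is served by newServer.
    void split(String splitKey, String newServer) {
        startKeyToServer.put(splitKey, newServer);
    }

    // The region for a row key is the one with the greatest start key <= rowKey.
    String serverFor(String rowKey) {
        return startKeyToServer.floorEntry(rowKey).getValue();
    }

    int regionCount() {
        return startKeyToServer.size();
    }

    public static void main(String[] args) {
        ToyRegionMap regions = new ToyRegionMap("rs1");
        regions.split("m", "rs2"); // one region becomes two
        System.out.println(regions.serverFor("apple"));
        System.out.println(regions.serverFor("xiaoming"));
    }
}
```

Because rows are sorted, a split only needs one new boundary key, and every lookup stays a single sorted-map search.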

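3. HBase architecture: a master/slave design with the HMaster (master) and HRegionServers (slaves)

3.1 Several HMasters may be configured, but only one is active at any time; the others are standbys for when the active HMaster goes down.

a. When the active HMaster fails, a standby takes over. ZooKeeper (zk) guarantees through its master election mechanism that one HMaster is always running. The responsibilities are divided as follows.

The HMaster:

. assigns regions to region servers

. balances the load across region servers

. notices failed region servers and reassigns the regions they hosted

ZooKeeper:

. stores the addressing entry point for regions

. guarantees that only one master is running in the cluster at any time

. monitors the state of the region servers in real time and notifies the master when a region server comes online or goes offline

. stores the HBase schema, including which tables exist and which column families each table has

b. ZooKeeper keeps its data transactionally consistent across the ensemble, and HBase stores its cluster information there. If the master machine goes down and loses its data, the other ZooKeeper nodes still hold the HBase table information (storing data in ZooKeeper means storing it on many servers, which is very safe).

c. The master is essentially responsible for making decisions.

d. The core concern of HBase is keeping the region servers safe.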

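3.2 Region server architecture

. The master makes the decisions; the region servers (slaves) carry out the operations.

(Think of the division of labor as: decisions on the master side, the actual work, such as splitting a file that has grown too large, on the region server side.)

. A region server maintains the regions the master assigns to it and handles the I/O requests for those regions.

. It splits regions that have grown too large while running.

3.3 HBase (in the 0.94 release used here) has two special tables, -ROOT- and .META.

3.3.1 .META. records the region information of user tables; .META. can have several regions.

3.3.2 -ROOT- records the region information of the .META. table; -ROOT- has only one region.

ZooKeeper records the location of the -ROOT- table.

Before a client can access user data it first contacts ZooKeeper, then the -ROOT- table, then the .META. table, and only then can it locate and access the user data.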
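The lookup chain just described can be sketched with three plain maps. This is a simplified illustration under stated assumptions, not how the HBase client is implemented: `ToyRegionLookup`, the key `root-location`, the `table,startKey` key format and the server names are all invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the three-hop lookup; plain maps stand in for ZooKeeper and the catalog tables.
class ToyRegionLookup {
    final Map<String, String> zookeeper = new HashMap<>(); // hop 1: which server hosts -ROOT-
    final TreeMap<String, String> root = new TreeMap<>();  // hop 2: .META. region start key -> server
    final TreeMap<String, String> meta = new TreeMap<>();  // hop 3: "table,startKey" -> server

    // Returns a description of each hop taken to find the region holding rowKey.
    List<String> locate(String table, String rowKey) {
        List<String> hops = new ArrayList<>();
        hops.add("ZooKeeper: -ROOT- is on " + zookeeper.get("root-location"));
        String key = table + "," + rowKey;
        // In both catalog tables the matching entry is the greatest start key <= key.
        hops.add("-ROOT-: .META. region is on " + root.floorEntry(key).getValue());
        hops.add(".META.: user region is on " + meta.floorEntry(key).getValue());
        return hops;
    }

    public static void main(String[] args) {
        ToyRegionLookup lookup = new ToyRegionLookup();
        lookup.zookeeper.put("root-location", "rs1");
        lookup.root.put("", "rs2");        // a single .META. region covering all keys
        lookup.meta.put("users,", "rs3");  // first region of table users
        lookup.meta.put("users,m", "rs4"); // second region, rows from "m" upward
        for (String hop : lookup.locate("users", "xiaoming")) {
            System.out.println(hop);
        }
    }
}
```

Only the first hop touches ZooKeeper; the other two are ordinary reads of sorted tables, which is why clients can cache the results.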

************************************************************************

4. Pseudo-distributed installation of HBase

4.1 Unpack hbase-0.94.7-security.tar.gz

[root@hadoop0 Downloads]# cp hbase-0.94.7-security.tar.gz /usr/local/ # copy to /usr/local
[root@hadoop0 Downloads]# cd /usr/local # change to that directory
[root@hadoop0 local]# ls # list the files
hadoop jdk zookeeper-3.4.5.tar.gz
hadoop-1.1.2.tar.gz jdk-6u24-linux-i586.bin
hbase-0.94.7-security.tar.gz zk
[root@hadoop0 local]# tar -zxvf hbase-0.94.7-security.tar.gz # unpack into the current directory, /usr/local
[root@hadoop0 local]# mv hbase-0.94.7-security hbase # rename

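4.2 Set the environment variables

[root@hadoop0 local]# vi /etc/profile # add the HBase environment variables (HBASE_HOME pointing at /usr/local/hbase, and $HBASE_HOME/bin on the PATH)
[root@hadoop0 local]# source /etc/profile # make the change take effect immediately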

4.3 Edit the configuration files under $HBASE_HOME/conf/ (so that HBase runs in pseudo-distributed mode)

[root@hadoop0 hbase]# cd conf # enter HBase's conf directory
[root@hadoop0 conf]# vi hbase-env.sh # edit

. Set JAVA_HOME

export JAVA_HOME=/usr/local/jdk/

. Let HBase manage its own ZooKeeper instance

# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=true # HBase manages its own ZooKeeper instance

****************************************************************************

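[root@hadoop0 conf]# vi hbase-site.xml # edit; the properties below go inside <configuration>

<property>
<name>hbase.rootdir</name> <!-- root path of HBase on HDFS -->
<value>hdfs://hadoop0:9000/hbase</value> <!-- HDFS path where HBase stores its data -->
</property>
<property>
<name>hbase.cluster.distributed</name> <!-- whether HBase is installed in a distributed environment -->
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name> <!-- the ZooKeeper instance used by HBase -->
<value>hadoop0</value> <!-- host of the ZooKeeper instance -->
</property>
<property>
<name>dfs.replication</name> <!-- replication factor 1, for pseudo-distributed mode -->
<value>1</value>
</property>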

****************************************************************

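Note (optional): the regionservers file contains the host name of each region server.

[root@hadoop0 conf]# vi regionservers # set the host that runs the region server

hadoop0

(The file originally contains localhost. My HBase runs on the local machine, so it could also be left unchanged as localhost.)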

4.4 Start HBase

. Before starting HBase, make sure Hadoop is running and that files can be written to it. HBase depends on Hadoop: its data is stored in Hadoop's HDFS.

*****************************

[root@hadoop0 bin]# jps # before starting HBase
2473 DataNode
2702 TaskTracker
3039 Jps
2590 JobTracker
2364 NameNode
[root@hadoop0 bin]# start-hbase.sh # start
# ZooKeeper starts and writes a log
hadoop0: starting zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-root-zookeeper-hadoop0.out
# the master starts
starting master, logging to /usr/local/hbase/logs/hbase-root-master-hadoop0.out
# the region server starts
hadoop0: starting regionserver, logging to /usr/local/hbase/bin/../logs/hbase-root-regionserver-hadoop0.out
[root@hadoop0 bin]# jps # verify
2473 DataNode
3688 Jps
2702 TaskTracker
3280 HQuorumPeer # new
3466 HRegionServer # new: the region server process (slave node)
2590 JobTracker
3334 HMaster # new: the master process (master node)
2364 NameNode

To check that HBase has started, open the master status page:

http://hadoop0:60010/master-status

********************************************************************

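5. HBase shell operations

5.1 Enter the HBase shell

[root@hadoop0 conf]# hbase shell # enter the HBase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.7, r1471806, Wed Apr 24 18:44:36 PDT 2013

5.2 Shell commands

Operation: command expression

Create a table: create 'table name', 'column family 1', 'column family 2', 'column family N'

Add a record: put 'table name', 'row key', 'column family:column', 'value'

Read a record: get 'table name', 'row key'

Count the records in a table: count 'table name'

Delete a record: delete 'table name', 'row key', 'column family:column'

Drop a table (a table must be disabled before it can be dropped): first disable 'table name', then drop 'table name'

Read all records: scan 'table name'

Read all data of one column of a table: scan 'table name', {COLUMNS=>'column family:column'}

Updating a record simply means writing it again, which overrides the previous value.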

Example:

hbase(main):001:0> create 'user','user_id','address','info' # create a table
0 row(s) in 5.7680 seconds
hbase(main):002:0> list # list all tables
TABLE
user
1 row(s) in 0.1300 seconds

***********************************************

hbase(main):001:0> describe 'user' # show the table description
# each pair of braces describes one column family
# VERSIONS => '3' means that only three versions of a value are stored
DESCRIPTION ENABLED
'user', {NAME => 'address', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'},
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'},
{NAME => 'user_id', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'} true
1 row(s) in 2.3210 seconds

********************************************************

hbase(main):001:0> drop 'user' # dropping an enabled table directly produces an error
ERROR: Table user is enabled. Disable it first.' # the table is enabled and has to be disabled first
Here is some help for this command:
Drop the named table. Table must first be disabled: e.g. "hbase> drop 't1'"
hbase(main):002:0> disable 'user' # disable the table
0 row(s) in 2.0870 seconds
hbase(main):003:0> drop 'user' # now the drop succeeds
0 row(s) in 1.4010 seconds

Use the list command to verify.

***************************************************************

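# the examples below use a table 'users' with the column families 'info' and 'address', created the same way as 'user' above
put 'users','xiaoming','info:age','24' # insert one value into the table
get 'users','xiaoming' # read everything stored for xiaoming
get 'users','xiaoming','address' # read everything in xiaoming's address column family
get 'users','xiaoming','address:city' # narrow address down to the city column only
put 'users','xiaoming','info:age','25' # write again: xiaoming is now 25
get 'users','xiaoming','info:age' # the age has changed and the timestamp is different
# info:age for xiaoming now holds several values, which are told apart by timestamp, that is, by version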

*****************************************************************************

Update a record

>put 'users','xiaoming','info:age','29'
>get 'users','xiaoming','info:age'
>put 'users','xiaoming','info:age','30'
>get 'users','xiaoming','info:age'

Read several versions of a cell

(Only the three newest versions are stored. This is determined by VERSIONS => '3' in the table description shown by describe.)

>get 'users','xiaoming',{COLUMN=>'info:age',VERSIONS=>1}
>get 'users','xiaoming',{COLUMN=>'info:age',VERSIONS=>2}
>get 'users','xiaoming',{COLUMN=>'info:age',VERSIONS=>3}

Read one specific version of a cell (the version is selected by its timestamp)

>get 'users','xiaoming',{COLUMN=>'info:age',TIMESTAMP=>1364874937056}
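The effect of VERSIONS => '3' on the puts and gets above can be mimicked with a small toy. This is a sketch, not HBase code: `ToyVersionedCell` is a made-up class, the timestamps 1 to 4 are placeholders, and the toy evicts the oldest version immediately on write, whereas HBase itself cleans up surplus versions later in the background.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Toy model of one cell that keeps at most maxVersions values (hypothetical class).
class ToyVersionedCell {
    private final int maxVersions;
    // timestamp -> value, sorted with the newest timestamp first
    private final TreeMap<Long, String> versions =
            new TreeMap<Long, String>(Comparator.<Long>reverseOrder());

    ToyVersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry is the oldest timestamp
        }
    }

    // Like get ... VERSIONS=>n: up to n values, newest first.
    List<String> get(int howMany) {
        List<String> result = new ArrayList<>();
        for (String value : versions.values()) {
            if (result.size() == howMany) {
                break;
            }
            result.add(value);
        }
        return result;
    }

    // Like get ... TIMESTAMP=>t: the value written at exactly that timestamp, or null.
    String getAt(long timestamp) {
        return versions.get(timestamp);
    }

    public static void main(String[] args) {
        ToyVersionedCell age = new ToyVersionedCell(3);
        age.put(1L, "24");
        age.put(2L, "25");
        age.put(3L, "29");
        age.put(4L, "30");
        System.out.println(age.get(3));
        System.out.println(age.getAt(1L));
    }
}
```

After the fourth write the oldest value ("24") is no longer readable, which is the behavior the note about three stored versions describes.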

Full table scan

>scan 'users'

******************************************

Delete the 'info:age' column of xiaoming

>delete 'users','xiaoming','info:age'
>get 'users','xiaoming'
>scan 'users' # full table scan

Delete an entire row

>deleteall 'users','xiaoming'

Count the rows of a table

>count 'users'

Empty a table

>truncate 'users'

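6. HBase Java API operations

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * HBase operations through the Java API (the same operations as in the shell),
 * written against the HBase 0.94 client API.
 * @author Andrew
 */
public class HBaseApp {

    public static String TABLE_NAME = "table1";
    public static String FAMILY_NAME = "family1";
    public static String ROW_KEY = "rowkey1";

    // Create a table, delete a table, insert a record, read one record, scan all records
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://hadoop0:9000/hbase");
        // When running from Eclipse the ZooKeeper host must be set, otherwise the cluster cannot be located
        conf.set("hbase.zookeeper.quorum", "hadoop0");

        // 1. Use HBaseAdmin to create and delete tables
        final HBaseAdmin baseAdmin = new HBaseAdmin(conf);
        createTable(baseAdmin); // creates the table only if it does not exist yet
        //deleteTable(baseAdmin);

        // 2. Use HTable to insert a record, read one record and scan all records
        HTable hTable = new HTable(conf, TABLE_NAME);
        try {
            //putRecord(hTable); // insert one record
            //getRecord(hTable); // read one record
            scanTable(hTable);
        } finally {
            hTable.close();
            baseAdmin.close();
        }
    }

    private static void scanTable(HTable hTable) throws IOException {
        Scan scan = new Scan();
        final ResultScanner scanner = hTable.getScanner(scan); // full table scan
        try {
            for (Result result : scanner) {
                // arguments: column family, column name
                final byte[] value = result.getValue(Bytes.toBytes(FAMILY_NAME), Bytes.toBytes("age"));
                // Bytes.toString returns null when the row has no such column
                System.out.println(result + "\t" + Bytes.toString(value));
            }
        } finally {
            scanner.close();
        }
    }

    private static void getRecord(HTable hTable) throws IOException {
        Get get = new Get(Bytes.toBytes(ROW_KEY)); // pass in the row key
        final Result result = hTable.get(get); // the returned result
        // arguments: column family, column name
        final byte[] value = result.getValue(Bytes.toBytes(FAMILY_NAME), Bytes.toBytes("age"));
        System.out.println(result + "\t" + Bytes.toString(value));
    }

    private static void putRecord(final HTable hTable) throws IOException {
        Put put = new Put(Bytes.toBytes(ROW_KEY));
        put.add(Bytes.toBytes(FAMILY_NAME), Bytes.toBytes("age"), Bytes.toBytes("25"));
        hTable.put(put); // the table is closed by the caller (main)
    }

    private static void deleteTable(final HBaseAdmin baseAdmin) throws IOException {
        baseAdmin.disableTable(TABLE_NAME); // a table must be disabled before it is deleted
        baseAdmin.deleteTable(TABLE_NAME); // delete the table
    }

    private static void createTable(final HBaseAdmin baseAdmin) throws IOException {
        if (!baseAdmin.tableExists(TABLE_NAME)) {
            // create the table only if it does not exist
            final HTableDescriptor tableDescriptor = new HTableDescriptor(TABLE_NAME);
            HColumnDescriptor family = new HColumnDescriptor(FAMILY_NAME);
            tableDescriptor.addFamily(family);
            baseAdmin.createTable(tableDescriptor); // create the table
        }
    }
}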

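7. Importing data from HDFS into HBase with MapReduce

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/**
 * Imports data from HDFS into HBase with MapReduce.
 * @author Andrew
 */
public class BatchImport {

    // map: prepend the row key to every input line
    static class BatchImportMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        SimpleDateFormat dateformat1 = new SimpleDateFormat("yyyyMMddHHmmss");
        Text v2 = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            final String[] splited = value.toString().split("\t");
            try {
                // the first field is a timestamp in milliseconds
                final Date date = new Date(Long.parseLong(splited[0].trim()));
                final String dateFormat = dateformat1.format(date);
                String rowKey = splited[1] + ":" + dateFormat;
                v2.set(rowKey + "\t" + value.toString());
                context.write(key, v2);
            } catch (NumberFormatException e) {
                // count malformed lines instead of failing the job
                final Counter counter = context.getCounter("BatchImport", "ErrorFormat");
                counter.increment(1L);
                System.out.println("Error: " + splited[0] + " " + e.getMessage());
            }
        }
    }

    // reduce: turn every line into a Put
    static class BatchImportReducer extends TableReducer<LongWritable, Text, NullWritable> {
        @Override
        protected void reduce(LongWritable key, java.lang.Iterable<Text> values, Context context)
                throws java.io.IOException, InterruptedException {
            for (Text text : values) {
                final String[] splited = text.toString().split("\t");
                final Put put = new Put(Bytes.toBytes(splited[0])); // splited[0] is the row key added by the mapper
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("date"), Bytes.toBytes(splited[1]));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("msisdn"), Bytes.toBytes(splited[2]));
                // other fields omitted; add them with further put.add(....) calls
                context.write(NullWritable.get(), put);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final Configuration configuration = new Configuration();
        // set the ZooKeeper host
        configuration.set("hbase.zookeeper.quorum", "hadoop0");
        // set the name of the HBase table
        configuration.set(TableOutputFormat.OUTPUT_TABLE, "wlan_log");
        // raise this value so that HBase does not time out and exit
        configuration.set("dfs.socket.timeout", "180000");

        final Job job = new Job(configuration, "HBaseBatchImport");
        job.setMapperClass(BatchImportMapper.class);
        job.setReducerClass(BatchImportReducer.class);
        // set the map output types; the reduce output types are not set
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        // no output path is set; the output format class is set instead
        job.setOutputFormatClass(TableOutputFormat.class);
        FileInputFormat.setInputPaths(job, "hdfs://hadoop0:9000/input");
        job.waitForCompletion(true);
    }
}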
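The row key built in the mapper can be tried out without a cluster. The sketch below repeats only that step with the standard library. `ToyRowKey`, the sample line and the explicit time zone parameter are assumptions added for the example; the job above uses the JVM's default time zone.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Stand-alone version of the mapper's row key logic (hypothetical helper class).
class ToyRowKey {
    // Builds "<second field>:<yyyyMMddHHmmss>" from a tab-separated line whose
    // first field is a timestamp in epoch milliseconds.
    static String rowKeyFor(String line, TimeZone zone) {
        String[] fields = line.split("\t");
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        format.setTimeZone(zone); // fixed zone so the result does not depend on the machine
        String time = format.format(new Date(Long.parseLong(fields[0].trim())));
        return fields[1] + ":" + time;
    }

    public static void main(String[] args) {
        // Placeholder line: timestamp, phone number, remaining fields
        String line = "1364874937056\t13800000000\tother-fields";
        System.out.println(rowKeyFor(line, TimeZone.getTimeZone("UTC")));
    }
}
```

Putting the phone number first keeps all records of one number adjacent in the table, and the formatted time then sorts them chronologically within that number.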
