day07hbase2

Latest recommended article published on 2024-10-10 11:10:29

lhh123lhh123

Tags: hbase, database, big data

Article link: https://blog.csdn.net/lhhaini/article/details/123113863

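I. Review

Purpose: HBase provides random, real-time reads and writes over large volumes of data. In offline workloads it is mainly used to improve storage and compute performance; in real-time workloads it stores the large volume of results produced by real-time ETL.
Basic concepts: namespace: works like a database, and every table must belong to a namespace; tables are distributed structures; rowkey: uniquely identifies a row and is the only index in HBase, every table has this column built in, and you have to design it yourself; columnFamily: a column family groups columns to improve read performance.
HBase architecture: master/slave; HMaster: the management node, which manages the slave nodes, region assignment, and metadata; HRegionServer: manages data storage, hosts the regions (partitions) of all tables, and together the regionservers form the distributed memory.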

II. HBase DML commands

scan: scans the whole table: scan 'itcast:t2'
Insert sample data:

put 'itcast:t2','20210201_001','cf1:name','laoda'
put 'itcast:t2','20210201_001','cf1:age',18
put 'itcast:t2','20210201_001','cf3:phone','110'
put 'itcast:t2','20210201_001','cf3:addr','shanghai'
put 'itcast:t2','20210201_001','cf1:id','001'
put 'itcast:t2','20210101_000','cf1:name','laoer'
put 'itcast:t2','20210101_000','cf3:addr','bejing'
put 'itcast:t2','20210901_007','cf1:name','laosan'
put 'itcast:t2','20210901_007','cf3:addr','bejing'
put 'itcast:t2','20200101_004','cf1:name','laosi'
put 'itcast:t2','20200101_004','cf3:addr','bejing'
put 'itcast:t2','20201201_005','cf1:name','laowu'
put 'itcast:t2','20201201_005','cf3:addr','bejing'

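Filters:
2.1 scan 'itcast:t2',{ROWPREFIXFILTER=>'2021'} returns the rows whose rowkey starts with 2021
2.2 Start and stop rowkey (STARTROW is inclusive, STOPROW is exclusive): scan 'itcast:t2',{STARTROW=>'20210201_001',STOPROW=>'20210901_007'}
2.3 Filters are not recommended for real-time queries
incr (counter increment)
Example: create 'itcast:NEWS_VISIT_CNT','c1'
incr 'itcast:NEWS_VISIT_CNT','2022_01','c1:cnt',12
get_counter 'itcast:NEWS_VISIT_CNT','2022_01','c1:cnt'
incr 'itcast:NEWS_VISIT_CNT','2022_01','c1:cnt' # increments by 1
count (counts rowkeys)
count 'itcast:t2'
Which way of counting the rows of an HBase table is fastest
5.1 A distributed compute job (MapReduce) that reads the HBase data and counts the rowkeys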

start-yarn.sh
hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'itcast:t2'

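5.2 The count command: medium speed
count 'itcast:t2'
5.3 A coprocessor: the fastest way
Similar to a Hive UDF: you develop your own coprocessor that observes the table and adds one to a counter whenever a row is added, so the row count can be read directly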
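
The prefix scan in 2.1 and the range scan in 2.2 both rely on rowkeys being stored in sorted order. The following is a minimal sketch of that behavior in plain Java, not HBase code: `ScanRangeSketch`, `rangeScan`, and `prefixScan` are hypothetical names, and a `TreeSet` of ASCII strings stands in for the sorted rowkeys of `itcast:t2`. It assumes a prefix scan can be treated as a range scan from the prefix up to the prefix with its last byte incremented.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: simulates shell scans over a sorted set of rowkeys.
class ScanRangeSketch {
    // Rowkeys from the sample put commands; for ASCII keys String order matches byte order.
    static final TreeSet<String> ROWKEYS = new TreeSet<>(Arrays.asList(
            "20200101_004", "20201201_005", "20210101_000", "20210201_001", "20210901_007"));

    // STARTROW is inclusive, STOPROW is exclusive.
    static List<String> rangeScan(String startRow, String stopRow) {
        return new ArrayList<>(ROWKEYS.subSet(startRow, true, stopRow, false));
    }

    // Treats a prefix scan as the range [prefix, prefix with its last byte + 1).
    // Assumes a non-empty ASCII prefix whose last byte is not 0x7F.
    static List<String> prefixScan(String prefix) {
        byte[] stop = prefix.getBytes(StandardCharsets.US_ASCII);
        stop[stop.length - 1]++;
        return rangeScan(prefix, new String(stop, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // prints [20210101_000, 20210201_001, 20210901_007]
        System.out.println(prefixScan("2021"));
        // prints [20210201_001] because the stop row itself is excluded
        System.out.println(rangeScan("20210201_001", "20210901_007"));
    }
}
```

Note that the range scan from 2.2 does not return `20210901_007`; to include it, the stop row would have to sort just after it.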

III. HBase API

Add dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>day07Hbase</artifactId>
    <version>1.0-SNAPSHOT</version>
    <repositories>
        <repository>
            <id>aliyun</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public</url>
        </repository>
    </repositories>
    <dependencies>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>2.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-mapreduce</artifactId>
        <version>2.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>3.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-auth</artifactId>
        <version>3.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.1.1</version>
    </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <!-- JUnit 4 dependency -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13</version>
        </dependency>
    </dependencies>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

</project>

DDL API:

package bigdata.itcast.cn.hbase.client;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class HbaseDDLClientTest {
    // Shared connection, opened before each test
    Connection conn=null;
    @Before
    public  void getConnection() throws IOException {
        // Build the configuration
        Configuration conf = HBaseConfiguration.create();
        // Server address: every client connects through ZooKeeper
        conf.set("hbase.zookeeper.quorum", "node1,node2,node3");
        // Create the connection instance
        conn= ConnectionFactory.createConnection(conf);
    }
    // All DDL operations go through an admin object
    public Admin getAdmin() throws IOException {
        // Get the admin from the connection; Admin is the public interface, so no cast to HBaseAdmin is needed
        return conn.getAdmin();
    }
    // Create a namespace
    @Test
    public void testCreat() throws IOException {
        Admin admin=getAdmin();
        // Build the namespace descriptor for the namespace "lhh"
        NamespaceDescriptor descriptor=NamespaceDescriptor.create("lhh").build();
        // Create it
        admin.createNamespace(descriptor);
        admin.close();
    }
    // Delete a namespace
    @Test
    public void testDel() throws IOException {
        Admin admin=getAdmin();
        admin.deleteNamespace("lhh");
        admin.close();
    }
    // Create table itcast:t1 with two column families, basic and other
    @Test
    public void testCrTab() throws IOException {
        Admin admin=getAdmin();
        TableName tableName=TableName.valueOf("itcast:t1");
        // If the table already exists, disable and delete it first
        boolean b = admin.tableExists(tableName);
        if (b){
            admin.disableTable(tableName);
            admin.deleteTable(tableName);
        }
        // Build the column family descriptors
        ColumnFamilyDescriptor  basic=ColumnFamilyDescriptorBuilder.
                newBuilder(Bytes.toBytes("basic"))
                .setMaxVersions(3)// column family property: keep up to 3 versions
                .build();
        ColumnFamilyDescriptor  other=ColumnFamilyDescriptorBuilder.
                newBuilder(Bytes.toBytes("other"))
                .build();
        // Build the table descriptor and create the table
        TableDescriptor des=TableDescriptorBuilder.newBuilder(tableName)
                .setColumnFamily(basic)
                .setColumnFamily(other)
                .build();
        admin.createTable(des);
        admin.close();
    }

    // Release the connection
    @After
    public void  closeHba() throws IOException {
        conn.close();
    }
}

DML API

package bigdata.itcast.cn.hbase.client;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class HbaseDMLClientTest {
    // Shared connection, opened before each test
    Connection conn=null;
    @Before
    public  void getConnection() throws IOException {
        // Build the configuration
        Configuration conf = HBaseConfiguration.create();
        // Server address: every client connects through ZooKeeper
        conf.set("hbase.zookeeper.quorum", "node1,node2,node3");
        // Create the connection instance
        conn= ConnectionFactory.createConnection(conf);
    }
    // DML operations need a Table object
    public Table getHbaseTa() throws IOException {
        Table table = conn.getTable(TableName.valueOf("itcast:t1"));
        return table;
    }
    // put
    @Test
    public void  putTest() throws IOException {
        Table table=getHbaseTa();
        // A Put inserts new cells or updates existing ones
        Put put = new Put(Bytes.toBytes("2022_0225"));
        put.addColumn(Bytes.toBytes("basic"),Bytes.toBytes("name"),Bytes.toBytes("laoda") );
        put.addColumn(Bytes.toBytes("basic"),Bytes.toBytes("name"),Bytes.toBytes("laosan") );
        put.addColumn(Bytes.toBytes("basic"),Bytes.toBytes("age"),Bytes.toBytes("18") );
        put.addColumn(Bytes.toBytes("other"),Bytes.toBytes("addr"),Bytes.toBytes("sh") );
        // Execute the operation on the table
        table.put(put);
        table.close();
    }
    // get
    @Test
    public void  testGet() throws IOException {
        Table table=getHbaseTa();
        Get get=new Get(Bytes.toBytes("20210101_001"));
       // get.addColumn(Bytes.toBytes("other"),Bytes.toBytes("addr"));// restrict the query to one column family and column

        Result result = table.get(get);// a Result holds the data of one rowkey
        // A Cell holds the data of one column; a rowkey can have many columns, so a Result contains an array of Cells
        for (Cell cell:result.rawCells()){
            // Print rowkey, column family, qualifier, value and timestamp
            System.out.println(
                    Bytes.toString(CellUtil.cloneRow(cell))+"\t"+
                            Bytes.toString(CellUtil.cloneFamily(cell))+"\t"+
                            Bytes.toString(CellUtil.cloneQualifier(cell))+"\t"+
                            Bytes.toString(CellUtil.cloneValue(cell))+"\t"+
                            cell.getTimestamp()
            );
        }
        table.close();
    }
    // delete
    @Test
    public  void deleteTest() throws IOException {
        Table table=getHbaseTa();
        Delete delete=new Delete(Bytes.toBytes("20210101_001"));
        delete.addColumns(Bytes.toBytes("basic"), Bytes.toBytes("name"));// addColumns deletes all versions, addColumn deletes only the latest version
        table.delete(delete);
        table.close();
    }
    // scan: full-table scan
    @Test
    public void  scanTest() throws IOException {
        Table table=getHbaseTa();
        Scan scan=new Scan();

        ResultScanner scanner = table.getScanner(scan);// a ResultScanner holds the results of many rowkeys, a Result those of one rowkey
        for (Result result:scanner){
            // Print the current rowkey
            System.out.println(Bytes.toString(result.getRow()));
            for (Cell cell:result.rawCells()){
                // Print rowkey, column family, qualifier, value and timestamp
                System.out.println(
                        Bytes.toString(CellUtil.cloneRow(cell))+"\t"+
                                Bytes.toString(CellUtil.cloneFamily(cell))+"\t"+
                                Bytes.toString(CellUtil.cloneQualifier(cell))+"\t"+
                                Bytes.toString(CellUtil.cloneValue(cell))+"\t"+
                                cell.getTimestamp()
                );
            }
            System.out.println("==================end of current rowkey===================");
        }
        scanner.close();
        table.close();
    }
    // Release the connection
    @After
    public void  closeHba() throws IOException {
        conn.close();
    }
}

Common filter API

package bigdata.itcast.cn.hbase.client;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class HbaseScanFilterTest {
    // Shared connection, opened before each test
    Connection conn=null;
    @Before
    public  void getConnection() throws IOException {
        // Build the configuration
        Configuration conf = HBaseConfiguration.create();
        // Server address: every client connects through ZooKeeper
        conf.set("hbase.zookeeper.quorum", "node1,node2,node3");
        // Create the connection instance
        conn= ConnectionFactory.createConnection(conf);
    }
    // DML operations need a Table object
    public Table getHbaseTa() throws IOException {
        Table table = conn.getTable(TableName.valueOf("itcast:t1"));
        return table;
    }

    // scan with filters
    @Test
    public void  scanTest() throws IOException {
        Table table=getHbaseTa();
        Scan scan=new Scan();
        // Rowkey range: data from January and February 2021
//        scan.withStartRow(Bytes.toBytes("202101"));
//        scan.withStopRow(Bytes.toBytes("202103"));
        // Rowkey prefix match: all data from 2021
//        Filter filter=new PrefixFilter(Bytes.toBytes("2021"));
//        scan.setFilter(filter);
        // All rows where age = 20. SingleColumnValueExcludeFilter keeps the matching rows but drops the
        // tested column (basic:age) from the result; SingleColumnValueFilter would return the whole row
        Filter singleColumnValueExcludeFilter = new SingleColumnValueExcludeFilter(Bytes.toBytes("basic"),
                Bytes.toBytes("age"), CompareOperator.EQUAL,Bytes.toBytes("20"));
//        scan.setFilter(singleColumnValueExcludeFilter);
        // Only the name and age columns of every row
        byte [][]prefixes={Bytes.toBytes("name"),Bytes.toBytes("age")};
        Filter multipleColumnPrefixFilter = new MultipleColumnPrefixFilter(prefixes);// matches by column-name prefix
//        scan.setFilter(multipleColumnPrefixFilter);
        // Combine several conditions
        FilterList filterList=new FilterList(FilterList.Operator.MUST_PASS_ALL);// MUST_PASS_ALL = AND, MUST_PASS_ONE = OR
        filterList.addFilter(singleColumnValueExcludeFilter);
        filterList.addFilter(multipleColumnPrefixFilter);
        scan.setFilter(filterList);
        ResultScanner scanner = table.getScanner(scan);// a ResultScanner holds the results of many rowkeys, a Result those of one rowkey
        for (Result result:scanner){
            // Print the current rowkey
            System.out.println(Bytes.toString(result.getRow()));
            for (Cell cell:result.rawCells()){
                // Print rowkey, column family, qualifier, value and timestamp
                System.out.println(
                        Bytes.toString(CellUtil.cloneRow(cell))+"\t"+
                                Bytes.toString(CellUtil.cloneFamily(cell))+"\t"+
                                Bytes.toString(CellUtil.cloneQualifier(cell))+"\t"+
                                Bytes.toString(CellUtil.cloneValue(cell))+"\t"+
                                cell.getTimestamp()
                );
            }
            System.out.println("==================end of current rowkey===================");
        }
        scanner.close();
        table.close();
    }
    // Release the connection
    @After
    public void  closeHba() throws IOException {
        conn.close();
    }
}

Test data
put 'itcast:t1','20210201_000','basic:name','laoda'
put 'itcast:t1','20210201_000','basic:age',18
put 'itcast:t1','20210101_001','basic:name','laoer'
put 'itcast:t1','20210101_001','basic:age',20
put 'itcast:t1','20210101_001','basic:sex','male'
put 'itcast:t1','20210228_002','basic:name','laosan'
put 'itcast:t1','20210228_002','basic:age',22
put 'itcast:t1','20210228_002','other:phone','110'
put 'itcast:t1','20210301_003','basic:name','laosi'
put 'itcast:t1','20210301_003','basic:age',20
put 'itcast:t1','20210301_003','other:phone','120'
put 'itcast:t1','20210301_003','other:addr','shanghai'

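IV. HBase read/write internals

HBase storage structure:
1.1 Table: a logical object that does not exist physically; it is a concept kept in the metadata that lets users perform logical operations (once written, a table's data is physically stored as regions; a table can have multiple regions, each stored on a different machine; by default a table has only one region)
1.2 Region: the smallest unit of load balancing in HBase (similar to a Block in HDFS, it is what makes HBase distributed; every table can be divided into multiple regions for distributed storage, one by default; each region is managed by one regionserver)
1.3 RegionServer: a physical object, an HBase process that manages the storage of one machine (similar to a DataNode in HDFS; one regionserver can manage multiple regions)
A table has multiple regions, and the regions are stored on different regionservers
1.4 Regions and data writes:
When creating a table you can specify how many regions it has and the key range of each region:
create 'itcast:t3','cf',SPLITS=>['50']
Assignment rule: a row is written to the region whose range contains its rowkey

Letters have larger ASCII codes than digits, so a rowkey that starts with a letter falls in region3;

1.5 Creating regions manually: create 'itcast:t3','cf',SPLITS=>['20','40','60','80']
Write data to verify:
put 'itcast:t3','0300000','cf:name','laoda'
put 'itcast:t3','8','cf:name','laoda'
Regions are divided according to the rowkey design;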
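
To make the assignment rule concrete, here is a minimal sketch in plain Java (not the HBase client) of how a rowkey maps to a pre-split region. `RegionRoutingSketch`, `locate`, and the labels `region1` to `region5` are hypothetical names; the split points are the ones from the `create` command in 1.5, and a `TreeMap` keyed by each region's start key stands in for the region list.

```java
import java.util.TreeMap;

// Hypothetical helper: maps a rowkey to the region whose key range contains it.
class RegionRoutingSketch {
    // Start key -> region label. SPLITS=>['20','40','60','80'] yields five regions;
    // the first region starts at the empty key.
    static final TreeMap<String, String> REGIONS = new TreeMap<>();
    static {
        REGIONS.put("", "region1");   // [ "", "20")
        REGIONS.put("20", "region2"); // ["20", "40")
        REGIONS.put("40", "region3"); // ["40", "60")
        REGIONS.put("60", "region4"); // ["60", "80")
        REGIONS.put("80", "region5"); // ["80", end of key space)
    }

    // The owning region is the one with the greatest start key <= rowkey.
    // For ASCII rowkeys, String ordering matches the byte ordering used for rowkeys.
    static String locate(String rowkey) {
        return REGIONS.floorEntry(rowkey).getValue();
    }

    public static void main(String[] args) {
        System.out.println(locate("0300000")); // prints region1
        System.out.println(locate("8"));       // prints region4, because "8" sorts before "80"
        System.out.println(locate("abc"));     // prints region5, letters sort after digits
    }
}
```

The second case is worth noting: comparison is byte by byte, not numeric, so the rowkey `8` sorts between `60` and `80` and does not land in the last region.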

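How data is stored inside a region
2.1 table/regionserver: a write names the target table and is submitted to the responsible regionserver
2.2 region: divides the data of the whole table by rowkey range, which gives distributed storage
2.3 store: divides the data of a region by column family, one store per column family (data of different column families goes to different stores, so columns are grouped by column family, and a query that names a column family can quickly read just the matching store)
2.4 memstore: every store has an in-memory area (a write returns as soon as the data is in the memstore)
2.5 storefile: every store may have zero or more storefiles (logically a store; physically HFile binary files on HDFS)
2.6 flush 'itcast:t3' flushes the in-memory data to storefiles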
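HBase data writes
3.1 When a put is executed, how does the data reach the right region
3.1.1 Get the information of all regions of the table by table name (a table can have multiple regions)
3.1.2 Use the rowkey to decide which region, and therefore which regionserver, to write to (compare the rowkey with the region ranges)
3.1.3 Submit the put request to the regionserver that hosts that region
3.1.4 The regionserver writes the data to the region and picks the store by column family
3.1.5 The data is written to the memstore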

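3.2 How does the client know a table's regions, the range of each region, and the regionserver addresses
Metadata management: in ZooKeeper
Table metadata (regions and table information) is stored in the meta table

The rowkeys of the meta table give the region ranges and the regionserver addresses
itcast:t3,80,1646016140617.c98e34c720207e6893af9ab8091c4eaf. column=info:server, timestamp=1646016141267, value=node1:16020
A rowkey prefix match on the table name returns all regions of the table; the rowkey being written is then compared with the startkey in each region name to decide which region it goes to
3.3 Location of the meta table: the location of the meta table is recorded in ZooKeeper
Writing data: the WAL is written first (the write-ahead log protects against losing in-memory data: before any data is written to memory, the operation is appended to the log, and only then is it written to the store's memstore); the WAL is stored on HDFS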
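Reading data
4.1 Get the metadata: the client asks ZooKeeper for the location of the meta table's region, then reads the meta table
4.2 Find the target region: use the metadata in the meta table to find the table's regions; compare the rowkey with the region ranges to decide which region to read; send the request to the regionserver address recorded for that region;
4.3 Read the data: query the memstore first; if caching is enabled, read the blockcache next; on a cache miss, read the storefiles and put what was read into the cache; if caching is not enabled, read the storefiles directly. memstore: the write cache, an in-memory area on the regionserver where data is written first before it eventually becomes a storefile. blockcache: the read cache, located in the regionserver's heap memory; if caching is enabled for a column family, its data is put into the blockcache after being read from a storefile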
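
The write and read paths above can be sketched as a toy model of a single store. This is plain Java and deliberately simplified, not how HBase is implemented: `StoreSketch` and its members are hypothetical names, a list of strings stands in for the WAL, sorted maps stand in for the memstore and the storefiles, and the cache holds whole rows instead of HFile blocks. Dropping a row from the cache on every put is an assumption made here only to keep the toy model from returning stale values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of one store (one column family of one region).
class StoreSketch {
    final List<String> wal = new ArrayList<>();                           // stand-in for the WAL on HDFS
    final TreeMap<String, String> memstore = new TreeMap<>();             // sorted write cache
    final List<TreeMap<String, String>> storefiles = new ArrayList<>();   // oldest first
    final Map<String, String> blockCache = new HashMap<>();               // read cache
    String lastSource = "none";                                           // where the last get was served from

    // Write path: append to the WAL first, then write to the memstore.
    void put(String rowkey, String value) {
        wal.add(rowkey + "=" + value);
        memstore.put(rowkey, value);
        blockCache.remove(rowkey); // simplification: keep the toy cache from going stale
    }

    // Flush: the sorted memstore content becomes a new storefile and the memstore is emptied.
    void flush() {
        if (memstore.isEmpty()) {
            return;
        }
        storefiles.add(new TreeMap<>(memstore));
        memstore.clear();
    }

    // Read path: memstore, then blockcache, then storefiles from newest to oldest.
    String get(String rowkey) {
        if (memstore.containsKey(rowkey)) {
            lastSource = "memstore";
            return memstore.get(rowkey);
        }
        if (blockCache.containsKey(rowkey)) {
            lastSource = "blockcache";
            return blockCache.get(rowkey);
        }
        for (int i = storefiles.size() - 1; i >= 0; i--) {
            String value = storefiles.get(i).get(rowkey);
            if (value != null) {
                blockCache.put(rowkey, value); // cache what was read from the storefile
                lastSource = "storefile";
                return value;
            }
        }
        lastSource = "none";
        return null;
    }

    public static void main(String[] args) {
        StoreSketch store = new StoreSketch();
        store.put("20210101_001", "laoer");
        System.out.println(store.get("20210101_001") + " from " + store.lastSource);
        store.flush();
        System.out.println(store.get("20210101_001") + " from " + store.lastSource);
        System.out.println(store.get("20210101_001") + " from " + store.lastSource);
    }
}
```

Running `main` serves the same row from the memstore, then from a storefile after the flush, then from the cache on the repeated read.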

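V. LSM model design

Data in the in-memory memstore is spilled to HDFS and becomes on-disk storefiles [HFile] (a flush happens automatically when the cluster is shut down, and is also triggered automatically according to configured parameters), so that memory always holds the newest data
In production flush is usually run manually: flush 'itcast:t2'; this avoids a burst of flushes consuming a lot of memory and disk IO;
Compaction: merges multiple individually sorted storefiles into one large, globally sorted file to speed up reads
Before 2.0, compaction happened on disk only: minor compaction and major compaction
Since 2.0, in-memory compaction has been added (the data currently being written is divided into segments)

major_compact 'itcast:t3'
Region split design and rules: to keep a region from holding too much data, one region is split into two; the regionserver performs the split, and the master assigns the two new regions to regionservers
Splitting spreads the load, avoids hotspots, and lowers the failure rate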
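
The compaction step described above can be illustrated with a small plain-Java sketch. `CompactionSketch` and `compact` are hypothetical names, and each sorted map stands in for one storefile. A real compaction streams a multi-way merge over HFiles and also deals with versions and deletes; this sketch only shows the end result for the simple case, one sorted file in which the newest value of each rowkey survives.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper: merges several sorted "storefiles" into one sorted file.
class CompactionSketch {
    // storefiles must be ordered oldest to newest, so later files overwrite earlier ones.
    static TreeMap<String, String> compact(List<TreeMap<String, String>> storefiles) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : storefiles) {
            merged.putAll(file);
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, String> older = new TreeMap<>();
        older.put("20200101_004", "laosi");
        older.put("20210101_000", "laoer");
        TreeMap<String, String> newer = new TreeMap<>();
        newer.put("20201201_005", "laowu");
        newer.put("20210101_000", "laoer-v2"); // newer value for an existing rowkey
        // The merged file is sorted by rowkey and keeps "laoer-v2" for 20210101_000.
        System.out.println(compact(Arrays.asList(older, newer)));
    }
}
```

After compaction a read has one file to search instead of several, which is why merging speeds up reads.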

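VI. Hotspotting

Within some period of time, a large number of read and write requests all land on one region, so the regionserver hosting it is heavily loaded while the other regions and regionservers sit mostly idle (that regionserver becomes more likely to fail, and overall performance and efficiency drop; the main cause is unevenly distributed data)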

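VII. Additional notes

Distributed design: pre-splitting
1.1 Specify at table creation time that the table has multiple regions; this enables distributed, parallel reads and writes: the unbounded key range is cut into segments and data is stored in different regions, which balances load across regions; splitting rule: by rowkey or by rowkey prefix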
1.2
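Rowkey design: the rowkey uniquely identifies a row; its prefix determines which query conditions can use the index; it also determines how regions are divided and which region each row goes to.
2.1 Goal: a high-performance read/write platform.
2.2 Design rules:
2.2.1 Business rule: the design must fit the business requirements; usually the most frequently used query condition is chosen as the rowkey prefix
2.2.2 Uniqueness rule: the rowkey must be unique and must not repeat
2.2.3 Combination rule: put more of the frequently queried columns into the rowkey, so that more query conditions can use the index
2.2.4 Hashing rule: to avoid hotspots, generate rowkeys that are spread out (choose a field whose values are not consecutive as the rowkey prefix)
2.2.5 Length rule: as long as the business requirements are met, the shorter the rowkey the better; under 100 bytes is generally recommended (longer rowkeys are slower to compare, and the rowkey is stored redundantly in the underlying storage)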
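
One common way to apply the hashing rule is to put a small hash-derived "salt" in front of a rowkey that would otherwise be sequential. The sketch below is plain Java under stated assumptions: `SaltedRowkeySketch`, `salt`, `scanPrefixes`, the bucket count of 4, and the `NN_` prefix format are all illustrative choices, not something prescribed by HBase or by the rules above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: spreads sequential rowkeys over a fixed number of buckets.
class SaltedRowkeySketch {
    static final int BUCKETS = 4; // assumption: would normally match the number of pre-split regions

    // Prefixes the rowkey with a two-digit bucket derived from its hash.
    // String.hashCode is deterministic, so the same rowkey always gets the same salt.
    static String salt(String rowkey) {
        int bucket = Math.floorMod(rowkey.hashCode(), BUCKETS);
        return String.format("%02d_%s", bucket, rowkey);
    }

    // A prefix query on the original key now needs one scan per bucket.
    static List<String> scanPrefixes(String originalPrefix) {
        List<String> prefixes = new ArrayList<>();
        for (int bucket = 0; bucket < BUCKETS; bucket++) {
            prefixes.add(String.format("%02d_%s", bucket, originalPrefix));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Consecutive rowkeys end up with different leading buckets.
        for (int i = 0; i < 4; i++) {
            System.out.println(salt("20210101_00" + i));
        }
        System.out.println(scanPrefixes("2021"));
    }
}
```

The trade-off is visible in `scanPrefixes`: writes are spread across buckets, but a range or prefix query on the original key has to fan out into one scan per bucket, so this conflicts with the business rule whenever range scans are the main access pattern.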
Introduction to bulkload:
Optimization:
4.1 Memory allocation