[Storage Engines] HBase Reads and Writes, and Code for Importing Common Data Sources into HBase

Latest recommended article published on 2024-01-23 01:28:34

孟知之

Category: Storage Engines. Tags: hbase, flink, HBase writes, hdfs

Article link: https://blog.csdn.net/weixin_42526352/article/details/106213314

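HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase you can build large-scale structured storage clusters on inexpensive commodity PC servers.
[Figure: HBase architecture]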

1. Features

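Massive storage: HBase is suited to storing PB-scale data. Even at PB scale on inexpensive commodity PCs, it can return data within tens to hundreds of milliseconds.
Column-oriented storage: HBase stores data separately per column family. A column family can contain a very large number of columns, and the column families must be specified when the table is created.
Easy to scale: scalability shows in two ways, scaling the upper processing layer (RegionServers) and scaling the storage layer (HDFS).
High concurrency: because inexpensive PCs are used, the latency of a single I/O is not small, typically tens to hundreds of milliseconds. High concurrency here means that under concurrent load the latency of a single HBase I/O does not degrade much.
Sparse: this refers to the flexibility of HBase columns. A column family can hold any number of columns, and a column whose data is empty takes up no storage space.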
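
To make the column-family and sparsity points concrete, here is a minimal sketch in plain Java that models an HBase table as nested sorted maps (row key, then "family:qualifier", then timestamp, then value). It only illustrates the logical data model and is not how HBase stores data on disk; the class `SparseTableSketch` and its methods are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of HBase's logical layout; this is not HBase code.
final class SparseTableSketch {
    // Column families are fixed when the table is created.
    private final Set<String> families;
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    SparseTableSketch(String... families) {
        this.families = new HashSet<>(Arrays.asList(families));
    }

    // Columns (qualifiers) can be added freely, but only inside a known family.
    void put(String row, String family, String qualifier, long timestamp, String value) {
        if (!families.contains(family)) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family + ":" + qualifier,
                    c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell was never written.
    String get(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(family + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Only cells that were written are stored; empty columns cost nothing.
    int storedCells() {
        int count = 0;
        for (NavigableMap<String, NavigableMap<Long, String>> columns : rows.values()) {
            for (NavigableMap<Long, String> versions : columns.values()) {
                count += versions.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch("f1");
        table.put("1", "f1", "name", 1L, "zhangsan");
        table.put("1", "f1", "age", 1L, "10");
        table.put("2", "f1", "name", 1L, "lisi"); // row 2 has no age column at all
        System.out.println(table.get("2", "f1", "age"));
        System.out.println(table.storedCells());
    }
}
```

Row "2" simply has no entry for `f1:age`, which is the sense in which an empty column occupies no space.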

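2. HBase storage formats

All HBase data files are stored on HDFS, in two main formats:

HFile: the storage format for HBase key-value data. An HFile is a Hadoop binary file. A StoreFile is just a lightweight wrapper around an HFile, so underneath every StoreFile is an HFile.
HLog File: the storage format of HBase's WAL (Write-Ahead Log). In early HBase versions it is physically a Hadoop SequenceFile; later versions use a Protobuf-based WAL format.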

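3. HBase reads and writes

3.1 Reading data

The client uses ZooKeeper together with the -ROOT- and .META. tables to find the RegionServer that hosts the target data, that is, the address of the host serving the region. (The -ROOT- table exists only in versions before 0.96; from 0.96 on, the client gets the location of the meta table, renamed hbase:meta, directly from ZooKeeper.)
The client contacts that RegionServer to query the target data.
The RegionServer locates the region that holds the target data and issues the query to it.
The region first looks in the MemStore and returns the result on a hit.
If the data is not in the MemStore, the region scans the StoreFiles. There may be many of them, so Bloom filters are used to skip StoreFiles that cannot contain the requested row.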
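
The lookup order in the last two steps can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not HBase's implementation: it ignores the BlockCache, versions and deletes, and the tiny two-hash Bloom filter, the class `ReadPathSketch` and the counter `storeFilesRead` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of a region's read path; not HBase code.
final class ReadPathSketch {

    // Tiny Bloom filter: may give false positives, never false negatives.
    static final class Bloom {
        private static final int SIZE = 1024;
        private final BitSet bits = new BitSet(SIZE);

        void add(String key) {
            bits.set(h1(key));
            bits.set(h2(key));
        }

        boolean mightContain(String key) {
            return bits.get(h1(key)) && bits.get(h2(key));
        }

        private static int h1(String key) {
            return Math.floorMod(key.hashCode(), SIZE);
        }

        private static int h2(String key) {
            return Math.floorMod(key.hashCode() * 31 + 17, SIZE);
        }
    }

    // An immutable sorted file plus the Bloom filter built over its row keys.
    static final class StoreFile {
        final TreeMap<String, String> cells;
        final Bloom bloom = new Bloom();

        StoreFile(Map<String, String> cells) {
            this.cells = new TreeMap<>(cells);
            for (String key : cells.keySet()) {
                bloom.add(key);
            }
        }
    }

    final TreeMap<String, String> memStore = new TreeMap<>();
    final List<StoreFile> storeFiles = new ArrayList<>(); // oldest first
    int storeFilesRead = 0; // how many files were actually opened

    String get(String rowKey) {
        // 1. MemStore first: it holds the most recent writes.
        String value = memStore.get(rowKey);
        if (value != null) {
            return value;
        }
        // 2. Then StoreFiles, newest first, skipping files the Bloom filter rules out.
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            StoreFile file = storeFiles.get(i);
            if (!file.bloom.mightContain(rowKey)) {
                continue;
            }
            storeFilesRead++;
            value = file.cells.get(rowKey);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ReadPathSketch region = new ReadPathSketch();
        Map<String, String> older = new TreeMap<>();
        older.put("a", "1");
        region.storeFiles.add(new StoreFile(older));
        region.memStore.put("c", "3");
        System.out.println(region.get("c")); // served from the MemStore
        System.out.println(region.get("a")); // served from a StoreFile
    }
}
```

A Bloom filter can only say "definitely not here" or "maybe here", which is why the sketch still has to read a file after a positive answer.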

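3.2 Writing data

[Figure: HBase write path]

The client first uses the rowkey to find the RegionServer hosting the corresponding region.
The client submits the write request to that RegionServer.
The RegionServer finds the target region.
The region checks that the data is consistent with the schema.
If the client did not specify a version, the current system time is used as the data version.
The update is written to the WAL.
The update is written to the MemStore.
As writes accumulate in the MemStore, a flush is triggered and the MemStore contents are written out as a StoreFile.
When too many flushed files accumulate, a compaction is triggered to merge multiple StoreFiles into one. There are two kinds, minor and major; version merging and removal of deleted data happen during major compaction.
- minor compaction: a small-scale merge, by default of 3 to 10 files. It does not remove other versions of the data.
- major compaction: merges all StoreFiles in the current directory into one. It is usually triggered manually and removes other versions of the data (cells with different timestamps).
As the data in a region grows and reaches a threshold, the region is split into two regions of roughly equal size. The original region goes offline, and HMaster assigns the two new regions to RegionServers (possibly different ones), so the load of the original region is spread across two regions.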
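
The WAL, MemStore, flush and compaction steps above can be sketched in plain Java. This is a toy model under loud assumptions: thresholds are counted in entries and files rather than bytes, the compaction merges every file (real minor compactions pick a subset), and region splits are left out. The class `WritePathSketch` and its fields are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified model of a region's write path; not HBase code.
final class WritePathSketch {
    final List<String> wal = new ArrayList<>();                          // append-only log
    TreeMap<String, String> memStore = new TreeMap<>();                  // sorted in-memory buffer
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();  // oldest first

    private final int flushThreshold;      // assumption: counted in entries, not bytes
    private final int compactionThreshold; // assumption: number of files that triggers a merge

    WritePathSketch(int flushThreshold, int compactionThreshold) {
        this.flushThreshold = flushThreshold;
        this.compactionThreshold = compactionThreshold;
    }

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value); // 1. log the update first, for recovery
        memStore.put(rowKey, value);   // 2. then apply it to the MemStore
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    // Turns the current MemStore into an immutable, sorted StoreFile.
    void flush() {
        if (memStore.isEmpty()) {
            return;
        }
        storeFiles.add(memStore);
        memStore = new TreeMap<>();
        if (storeFiles.size() >= compactionThreshold) {
            compact();
        }
    }

    // Merges all StoreFiles into one; newer files overwrite older values.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : storeFiles) { // oldest first
            merged.putAll(file);
        }
        storeFiles.clear();
        storeFiles.add(merged);
    }

    public static void main(String[] args) {
        WritePathSketch region = new WritePathSketch(2, 3);
        String[][] writes = {{"a", "1"}, {"b", "2"}, {"a", "3"}, {"c", "4"}, {"d", "5"}, {"e", "6"}};
        for (String[] w : writes) {
            region.put(w[0], w[1]);
        }
        System.out.println("store files: " + region.storeFiles.size());
        System.out.println("WAL entries: " + region.wal.size());
    }
}
```

Writing to the log before the MemStore is the design point: if the process dies before a flush, the buffered updates can be replayed from the WAL.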

4. From Kafka to Flink to HBase: code implementation

pom dependency configuration

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.1.5</version>
</dependency> 	
<dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-core</artifactId>
    <version>5.0.0-HBase-2.0</version>
</dependency>


<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.8.1</version>
</dependency>

Flink reads Kafka messages and writes them to HBase

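import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.Date;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

public class flink_hbase {

    // ZooKeeper quorum and client port used by HBase
    private static final String hbaseZookeeperQuorum = "10.25.xxx.53,10.45.xxx.164,10.45.xxx.125";
    private static final String hbaseZookeeperClientPort = "2181";
    private static final TableName tableName = TableName.valueOf("testflink");
    private static final String columnFamily = "cf1";

    public static void main(String[] args) {
        final String ZOOKEEPER_HOST = "10.25.xxx.53:2181,10.45.xxx.164:2181,10.45.xxx.125:2181";
        final String KAFKA_HOST = "10.45.xxx.125:9092,10.45.xxx.142:9092";
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000); // Important: checkpointing must be enabled (here every 1000 ms)
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("zookeeper.connect", ZOOKEEPER_HOST); // only required by the Kafka 0.8 consumer
        props.setProperty("bootstrap.servers", KAFKA_HOST);
        props.setProperty("group.id", "test");

        DataStream<String> transaction =
                env.addSource(new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(), props));

        // Write every record to HBase, then print it
        transaction.rebalance().map(new HBaseWriter()).print();

        try {
            env.execute();
        } catch (Exception ex) {
            Logger.getLogger(flink_hbase.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    /**
     * Writes each record to HBase and passes it through unchanged.
     * The connection is opened once per parallel task in open() instead of once per record,
     * because creating an HBase Connection is expensive.
     */
    public static class HBaseWriter extends RichMapFunction<String, String> {
        private transient Connection connection;
        private transient Table table;

        @Override
        public void open(Configuration parameters) throws Exception {
            // HBase configuration
            org.apache.hadoop.conf.Configuration config = HBaseConfiguration.create();
            config.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
            // Kept from the original; clients normally locate the master through ZooKeeper
            config.set("hbase.master", "10.45.xxx.26:60000");
            config.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);
            config.setInt("hbase.rpc.timeout", 20000);
            config.setInt("hbase.client.operation.timeout", 30000);
            config.setInt("hbase.client.scanner.timeout.period", 200000);

            connection = ConnectionFactory.createConnection(config);

            // Create the table if it does not exist yet. With parallelism > 1 it is safer to
            // create the table in advance, since several tasks may run this check at once.
            // HTableDescriptor/HColumnDescriptor are deprecated in HBase 2.x but still available.
            try (Admin admin = connection.getAdmin()) {
                if (!admin.tableExists(tableName)) {
                    admin.createTable(new HTableDescriptor(tableName).addFamily(new HColumnDescriptor(columnFamily)));
                }
            }
            // Get the table handle
            table = connection.getTable(tableName);
        }

        @Override
        public String map(String value) throws Exception {
            // Row key is the current date string, as in the original. Date.toString() has
            // one-second resolution, so records written in the same second share a row key.
            Put put = new Put(Bytes.toBytes(new Date().toString()));
            put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes("test_column"), Bytes.toBytes(value));
            table.put(put);
            return value;
        }

        @Override
        public void close() throws Exception {
            // Release the table and the connection
            if (table != null) {
                table.close();
            }
            if (connection != null) {
                connection.close();
            }
        }
    }
}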
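
The example derives its row key from `Date.toString()`, which only has one-second resolution, so two messages written within the same second land on the same row and the later one overwrites the cell as a new version. One possible way around this, sketched below in plain Java, is to build the key from the epoch milliseconds plus a per-task index and a counter. The class `RowKeySketch`, the key layout and the `subtaskIndex` parameter are assumptions made for illustration, not part of the original code.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; the key layout is an assumption, not the author's design.
final class RowKeySketch {
    private static final AtomicLong SEQUENCE = new AtomicLong();

    // Layout: 13-digit zero-padded millis, 3-digit task index, 6-digit counter.
    // Zero padding keeps lexicographic order equal to time order.
    static String next(long epochMillis, int subtaskIndex) {
        long seq = SEQUENCE.getAndIncrement() % 1_000_000;
        return String.format(Locale.ROOT, "%013d-%03d-%06d", epochMillis, subtaskIndex, seq);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Two keys for the same millisecond still differ in the counter part.
        System.out.println(next(now, 0));
        System.out.println(next(now, 0));
    }
}
```

Inside a Flink rich function the task index could come from `getRuntimeContext().getIndexOfThisSubtask()`. Note that keys starting with a timestamp send all new writes to the same region, which is a separate trade-off to weigh.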

5. Importing data from HDFS into HBase

Custom MapReduce job

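import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

// Note: with HBase 2.x the org.apache.hadoop.hbase.mapreduce classes ship in the
// hbase-mapreduce artifact, which must be on the classpath in addition to hbase-client.
public class HdfsToHBase {
    private static final byte[] FAMILY = Bytes.toBytes("f1");
    // Column qualifiers, in the order of the fields that follow the row key
    private static final String[] QUALIFIERS = {"name", "age", "gender", "birthday"};

    // args[0]: HDFS input path, args[1]: name of the target HBase table
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop1:2181");
        conf.set("hbase.rootdir", "hdfs://hadoop1:9000/hbase");
        conf.set(TableOutputFormat.OUTPUT_TABLE, args[1]);
        Job job = Job.getInstance(conf, HdfsToHBase.class.getSimpleName());
        TableMapReduceUtil.addDependencyJars(job);
        job.setJarByClass(HdfsToHBase.class);

        job.setMapperClass(HdfsToHBaseMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(HdfsToHBaseReducer.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setOutputFormatClass(TableOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class HdfsToHBaseMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Input format (tab-separated), first field is the row key:
            // 1    zhangsan    10    male    NULL
            // 2    lisi    NULL    NULL    NULL
            String[] splits = value.toString().split("\t");
            if (splits.length < 5) {
                return; // skip malformed lines instead of failing the task
            }
            outKey.set(splits[0]);
            outValue.set(splits[1] + "\t" + splits[2] + "\t" + splits[3] + "\t" + splits[4]);
            context.write(outKey, outValue);
        }
    }

    public static class HdfsToHBaseReducer extends TableReducer<Text, Text, NullWritable> {
        @Override
        protected void reduce(Text k2, Iterable<Text> v2s, Context context) throws IOException, InterruptedException {
            // Text.getBytes() returns the backing array, which can be longer than the
            // actual content, so convert through toString() instead.
            Put put = new Put(Bytes.toBytes(k2.toString()));
            for (Text v2 : v2s) {
                String[] fields = v2.toString().split("\t");
                for (int i = 0; i < QUALIFIERS.length && i < fields.length; i++) {
                    // "NULL" marks a missing value: write no cell for it
                    if (!"NULL".equals(fields[i])) {
                        // Put.add(byte[], byte[], byte[]) was removed in HBase 2.x; use addColumn
                        put.addColumn(FAMILY, Bytes.toBytes(QUALIFIERS[i]), Bytes.toBytes(fields[i]));
                    }
                }
            }
            // TableOutputFormat rejects a Put without any columns
            if (!put.isEmpty()) {
                context.write(NullWritable.get(), put);
            }
        }
    }
}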
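
The field handling in the mapper and reducer (tab-separated input, `NULL` meaning "no value") can be tried out without a cluster. The following plain-Java sketch mirrors that logic. The class `LineParserSketch` and its methods are hypothetical helpers, and the qualifier names are taken from the reducer above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone version of the parsing logic; not part of the job itself.
final class LineParserSketch {
    private static final String[] QUALIFIERS = {"name", "age", "gender", "birthday"};

    // First tab-separated field is the row key.
    static String rowKey(String line) {
        return line.split("\t", -1)[0];
    }

    // Maps qualifier -> value for every field that is neither empty nor "NULL".
    static Map<String, String> parseColumns(String line) {
        String[] fields = line.split("\t", -1);
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < QUALIFIERS.length && i + 1 < fields.length; i++) {
            String value = fields[i + 1];
            if (!value.isEmpty() && !"NULL".equals(value)) {
                columns.put(QUALIFIERS[i], value);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = "2\tlisi\tNULL\tNULL\tNULL";
        // Only the name column would be written for this row.
        System.out.println(rowKey(line) + " -> " + parseColumns(line));
    }
}
```

Skipping `NULL` fields rather than storing the literal string is what lets HBase's sparse storage pay off: the missing columns never become cells.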