Using HBase

Ways to Import Data

After creating a table, we usually need to insert data into it in bulk.

     -1. Call the Java API

          Put (a single row or a batch; see the sketch after this list)

     -2. Use MapReduce

          (1) The SQOOP tool, to import data from an RDBMS

          (2) The MapReduce programs that ship with HBase

          (3) A MapReduce program you write yourself
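
For option 1, here is a minimal sketch of a single Put and a batched List<Put> through the Java client API; the table ns1:orders and column family info match the source table described later in this document, while the row keys and values are made up purely for illustration.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class A_PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("ns1:orders"))) {

            // Single Put: one row, one column at a time
            Put put = new Put(Bytes.toBytes("order_0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("user_id"), Bytes.toBytes("10023"));
            table.put(put);

            // Batched Puts: collect several rows and send them together
            List<Put> puts = new ArrayList<>();
            for (int i = 2; i <= 4; i++) {
                Put p = new Put(Bytes.toBytes("order_000" + i));
                p.addColumn(Bytes.toBytes("info"), Bytes.toBytes("user_id"), Bytes.toBytes("1002" + i));
                puts.add(p);
            }
            table.put(puts);
        }
    }
}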

Discussion

     The process of inserting data into an HBase table (the normal write path):

          (1) The data is first written to the WAL (write-ahead log)

          (2) Then written to the MemStore

          (3) The MemStore is flushed to HFile files stored on HDFS

     The non-standard path (used for bulk loading):

          Write the data directly into HFile files

Using HBase's Built-in MapReduce Programs

When these MapReduce jobs run, they depend on HBase's JAR files:

     bin/hbase mapredcp

     We need to put these JARs on the job's classpath by setting HADOOP_CLASSPATH.

For the full script, see the separate file:

HADOOP_HOME=/opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6
HBASE_HOME=/opt/cdh5.7.6/hbase-1.2.0-cdh5.7.6

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf  \
${HADOOP_HOME}/bin/yarn jar  \
${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.7.6.jar  \
rowcounter  \
ns1:orders

Official explanation:

An example program must be given as the first argument.
Valid program names are:
  CellCounter: Count cells in HBase table.
  WALPlayer: Replay WAL files.
  completebulkload: Complete a bulk data load.
  copytable: Export a table from local cluster to peer cluster.
  export: Write table data to HDFS.
  exportsnapshot: Export the specific snapshot to a given FileSystem.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table.
  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed after being appended to the log.
 

 

Example:

--Use importtsv to load TSV/CSV data into an HBase table
    Namespace
        orders
    Historical sales order table
        Data is retrieved by user id and time range
        Table name: history_orders
        Column family: order
        rowkey: UserId + orderDate + orderId
        Columns: date, orderId, userId, orderAmt
    Create the table
        create_namespace 'orders'
        create "orders:history_orders1",{NAME=>'order',COMPRESSION => 'SNAPPY'},SPLITS_FILE => 'splits.txt'

 

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf  \
${HADOOP_HOME}/bin/yarn jar  \
${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.7.6.jar  \
importtsv \
-Dimporttsv.columns=order:date,order:orderId,order:userId,order:orderAmt,HBASE_ROW_KEY \
-Dimporttsv.separator=, \
orders:history_orders \
/sale_orders.csv

Method 2: write the data directly into HFile files

(1)
--Write the data directly into HFile files, without going through the MemStore
# -Dimporttsv.bulk.output: the directory where the generated HFile files are written
# Speculative execution is disabled: when a task is slow (often because of resource
# pressure), the framework may launch the same task again on another machine and keep
# whichever attempt finishes first; with HFile output we want only one attempt writing.
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf  \
${HADOOP_HOME}/bin/yarn jar  \
${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.7.6.jar  \
importtsv \
-Dimporttsv.columns=order:date,order:orderId,order:userId,order:orderAmt,HBASE_ROW_KEY \
-Dimporttsv.separator=, \
-Dimporttsv.bulk.output=/datas/hfile-output \
-Dmapreduce.map.speculative=false \
-Dmapreduce.reduce.speculative=false \
orders:history_orders1 \
/sale_orders.csv

(2)
--completebulkload: Complete a bulk data load. Loads the generated HFile files into the HBase table
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf  \
${HADOOP_HOME}/bin/yarn jar  \
${HBASE_HOME}/lib/hbase-server-1.2.0-cdh5.7.6.jar  \
completebulkload \
/datas/hfile-output  orders:history_orders1

HBase Usage Summary

Suppose we decide to use HBase to store massive amounts of data and have 10 TB of file data to load into an HBase table. The approach is as follows:

(1) Design the table properly

rowkey design (3 principles: uniqueness, prefix matching, avoiding hotspots); see the sketch after this list

(2) Create the table

Pre-split the regions, enable compression

(3) Use a MapReduce program

Convert the file data into HFile files, then load the HFiles into the table with bulk load
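
A minimal sketch of the rowkey construction those principles lead to, using the UserId + orderDate + orderId layout from this document; reversing the user id is the same trick the mapper code below uses to avoid hotspotting, and the sample values are hypothetical.

import org.apache.hadoop.hbase.util.Bytes;

public class B_RowKeyDesign {
    // rowkey = reverse(userId) + "_" + orderDate + "_" + orderId
    // - uniqueness:      orderId at the end makes every key unique
    // - prefix matching: scans by user and date range can use the key prefix
    // - hotspotting:     reversing userId scatters sequential user ids across regions
    public static byte[] historyOrderRowKey(String userId, String orderDate, String orderId) {
        String reversedUserId = new StringBuilder(userId).reverse().toString();
        return Bytes.toBytes(reversedUserId + "_" + orderDate + "_" + orderId);
    }

    public static void main(String[] args) {
        byte[] rowKey = historyOrderRowKey("10023", "2015-08-09", "900001");
        System.out.println(Bytes.toString(rowKey)); // 32001_2015-08-09_900001
    }
}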

 


Example 1: using a MapReduce program

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class F_SaleOrderMapReducer extends Configured implements Tool {

    private final static String ORDERS_TABLE_NAME="ns1:orders";
    private final static String HISTORY_ORDERS_TABLE_NAME="orders:history_orders88";

    static class ReadOrderMapper extends TableMapper<ImmutableBytesWritable, Put> {

        private final static String ORDER_COLUMN_NAME_USER_ID = "user_id";
        private final static String ORDER_COLUMN_NAME_ORDER_ID = "order_id";
        private final static String ORDER_COLUMN_NAME_DATE = "date";
        private final static String HISTORY_ROW_KEY_SEPARATOR = "_";
        private final static byte[] HISTORY_COLUMN_FAMILY = Bytes.toBytes("order");

        private ImmutableBytesWritable mapOutput = new ImmutableBytesWritable();

        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
            Put put = resultToPut(key, value);

            mapOutput.set(put.getRow());
            context.write(mapOutput, put);
        }

        private Put resultToPut(ImmutableBytesWritable key, Result result) {
            String orderId = Bytes.toString(key.get());
            HashMap<String, String> orderMap = new HashMap<>();

            for (Cell cell : result.rawCells()) {
                String field = Bytes.toString(CellUtil.cloneQualifier(cell));
                String value = Bytes.toString(CellUtil.cloneValue(cell));
                orderMap.put(field, value);
            }
            StringBuffer sb = new StringBuffer();
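            // rowkey = reverse(userId) + "_" + orderDate + "_" + orderId; reversing the user id avoids hotspotting on sequential user ids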

            sb.append(orderMap.get(ORDER_COLUMN_NAME_USER_ID)).reverse();
            sb.append(HISTORY_ROW_KEY_SEPARATOR);

            sb.append(orderMap.get(ORDER_COLUMN_NAME_DATE));
            sb.append(HISTORY_ROW_KEY_SEPARATOR);
            sb.append(orderId);

            Put put = new Put(Bytes.toBytes(sb.toString()));
            for (Map.Entry<String, String> entry : orderMap.entrySet()) {
                put.addColumn(
                        HISTORY_COLUMN_FAMILY,
                        Bytes.toBytes(entry.getKey()),
                        Bytes.toBytes(entry.getValue())
                );
            }
            put.addColumn(
                    HISTORY_COLUMN_FAMILY,
                    Bytes.toBytes(ORDER_COLUMN_NAME_ORDER_ID),
                    Bytes.toBytes(orderId)
            );
            return put;
        }
    }


    @Override
    public int run(String[] args) throws Exception {
        //Read the configuration
        Configuration conf = this.getConf();
        //Create the Job
        Job job = Job.getInstance( conf, F_SaleOrderMapReducer.class.getName() );
        job.setJarByClass( F_SaleOrderMapReducer.class );

        //.....
        Scan scan = new Scan();
        scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs
        // set other scan attrs

        TableMapReduceUtil.initTableMapperJob(
                ORDERS_TABLE_NAME,        // input HBase table name
                scan,             // Scan instance to control CF and attribute selection
                ReadOrderMapper.class,   // mapper
                ImmutableBytesWritable.class,             // mapper output key
                Put.class,             // mapper output value
                job);

        TableMapReduceUtil.initTableReducerJob(
                HISTORY_ORDERS_TABLE_NAME,      // output table
                null,             // reducer class
                job);

        job.setNumReduceTasks(0);

        boolean isSuccess = job.waitForCompletion( true );
        return isSuccess?0:1;
    }

    public static void main(String[] args) {
        //HBase configuration
        Configuration conf = HBaseConfiguration.create();
        try {
            //Run the job
            int status = ToolRunner.run( conf, new F_SaleOrderMapReducer(), args );
            //Exit with the job status
            System.exit( status );
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Example 2: convert the data into HFile files and load the HFiles into the table with bulk load

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class G_SaleOrdersMapReducer extends Configured implements Tool {

    private final static String ORDERS_TABLE_NAME="ns1:orders";
    private final static String HISTORY_ORDERS_TABLE_NAME="orders:history_orders89";

    static class ReadOrderMapper extends TableMapper<ImmutableBytesWritable, Put> {

        private final static String ORDER_COLUMN_NAME_USER_ID = "user_id";
        private final static String ORDER_COLUMN_NAME_ORDER_ID = "order_id";
        private final static String ORDER_COLUMN_NAME_DATE = "date";
        private final static String HISTORY_ROW_KEY_SEPARATOR = "_";
        private final static byte[] HISTORY_COLUMN_FAMILY = Bytes.toBytes("order");

        private ImmutableBytesWritable mapOutput = new ImmutableBytesWritable();

        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
            Put put = resultToPut(key, value);

            mapOutput.set(put.getRow());
            context.write(mapOutput, put);
        }

        private Put resultToPut(ImmutableBytesWritable key, Result result) {
            String orderId = Bytes.toString(key.get());
            HashMap<String, String> orderMap = new HashMap<>();

            for (Cell cell : result.rawCells()) {
                String field = Bytes.toString(CellUtil.cloneQualifier(cell));
                String value = Bytes.toString(CellUtil.cloneValue(cell));
                orderMap.put(field, value);
            }
            StringBuffer sb = new StringBuffer();
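            // rowkey = reverse(userId) + "_" + orderDate + "_" + orderId; reversing the user id avoids hotspotting on sequential user ids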

            sb.append(orderMap.get(ORDER_COLUMN_NAME_USER_ID)).reverse();
            sb.append(HISTORY_ROW_KEY_SEPARATOR);

            sb.append(orderMap.get(ORDER_COLUMN_NAME_DATE));
            sb.append(HISTORY_ROW_KEY_SEPARATOR);
            sb.append(orderId);

            Put put = new Put(Bytes.toBytes(sb.toString()));
            for (Map.Entry<String, String> entry : orderMap.entrySet()) {
                put.addColumn(
                        HISTORY_COLUMN_FAMILY,
                        Bytes.toBytes(entry.getKey()),
                        Bytes.toBytes(entry.getValue())
                );
            }
            put.addColumn(
                    HISTORY_COLUMN_FAMILY,
                    Bytes.toBytes(ORDER_COLUMN_NAME_ORDER_ID),
                    Bytes.toBytes(orderId)
            );
            return put;
        }
    }


    @Override
    public int run(String[] args) throws Exception {
        //Read the configuration
        Configuration conf = this.getConf();
        //Create the Job
        Job job = Job.getInstance( conf, G_SaleOrdersMapReducer.class.getName() );
        job.setJarByClass( G_SaleOrdersMapReducer.class );

        //.....
        Scan scan = new Scan();
        scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs
        // set other scan attrs

        TableMapReduceUtil.initTableMapperJob(
                ORDERS_TABLE_NAME,        // input HBase table name
                scan,             // Scan instance to control CF and attribute selection
                ReadOrderMapper.class,   // mapper
                ImmutableBytesWritable.class,             // mapper output key
                Put.class,             // mapper output value
                job);

        TableMapReduceUtil.initTableReducerJob(
                HISTORY_ORDERS_TABLE_NAME,      // output table
                null,             // reducer class
                job);

        job.setNumReduceTasks(0);

        //Set the job's output format to write HFiles
        job.setOutputFormatClass(HFileOutputFormat2.class);

        //The target table that the generated HFiles will be loaded into
        HTable table = new HTable(conf, HISTORY_ORDERS_TABLE_NAME);
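        //configureIncrementalLoad sets up the partitioner and reducer so the generated HFiles line up with the table's region boundaries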
        HFileOutputFormat2.configureIncrementalLoad(job,table,table.getRegionLocator());

        //Output directory for the generated HFile files
        Path outputPath = new Path(args[0] + System.currentTimeMillis());
        FileOutputFormat.setOutputPath(job,outputPath);

        boolean isSuccess = job.waitForCompletion( true );
        //If the MapReduce job completes successfully, load the generated HFile files into the table
        if (isSuccess){
            LoadIncrementalHFiles load = new LoadIncrementalHFiles(conf);
            load.doBulkLoad(outputPath,table);
        }
        return isSuccess?0:1;
    }

    public static void main(String[] args) {
        //HBase configuration
        Configuration conf = HBaseConfiguration.create();
        try {
            //Run the job
            int status = ToolRunner.run( conf, new G_SaleOrdersMapReducer(), args );
            //Exit with the job status
            System.exit( status );
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

 

ETL data from an HBase table into the history orders table with MapReduce

Requirement: read data from one HBase table, transform it, and write it into another HBase table (this is what Example 1 above implements)

Source table: ns1:orders

cf: info

rowkey: orderId

columns: date, user_id, order_amt

Target table:

orders:history_orders88

cf: order

rowKey: userId + orderDate + orderId

Columns: date, user_id, order_amt, orderId

 

Reference articles:

https://blog.csdn.net/yunqiinsight/article/details/80134511?tdsourcetag=s_pcqq_aiomsg

http://www.uml.org.cn/bigdata/201804131.asp

 

Region Management in HBase

Compactions fall into two categories.

Major compaction

Removes cells that have been deleted or have expired.

This improves read performance, but because a major compaction rewrites all of the HFile files, it can generate a large amount of disk I/O and network traffic in the process. This is known as write amplification.

Major compactions can be scheduled to run automatically; because of the write amplification, they are usually scheduled for weekends or nights. A major compaction also makes data files that had become remote (after a server failure or load balancing) local to the serving RegionServer again. (See the Admin API sketch after the Minor Compaction notes below.)

Minor Compaction

Simply merges several of a region's small StoreFiles into one larger StoreFile.

It reduces the number of store files by rewriting the smaller files into fewer but larger ones, performing a merge sort.
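
Both kinds of compaction can also be requested by hand through the Java Admin API, which is a common way to run major compactions during off-peak hours instead of on the automatic schedule. A minimal sketch, assuming the orders:history_orders1 table from the earlier example (both calls only submit asynchronous requests):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class C_CompactionTrigger {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("orders:history_orders1");

            // Minor compaction: merge some of the small StoreFiles into larger ones
            admin.compact(table);

            // Major compaction: rewrite all StoreFiles of each store into one,
            // dropping deleted and expired cells (heavy I/O, so run it off-peak)
            admin.majorCompact(table);
        }
    }
}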

 

Because the meta table has only one region, an offline compaction of the meta table runs as a single task and is very slow.

A single RegionServer can serve roughly 1000 regions.

HBase HMaster Responsibilities

Region assignment and DDL operations (create, delete tables) are handled by the HBase Master.

The Master's main responsibilities:

Coordinating the RegionServers

Assigning regions at startup, and reassigning regions on recovery or for load balancing

Monitoring all RegionServer instances in the cluster (by listening to ZooKeeper notifications)

Admin functions

Providing the interfaces for creating, deleting, and updating tables (see the Admin API sketch below)

Each region is about 1 GB in size (by default)
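
Because table DDL goes through the HMaster, the create-table step shown earlier in the shell (pre-split regions plus Snappy compression) can also be done from the Java Admin API. A minimal sketch; the split keys below are placeholders, since the real ones came from splits.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class D_CreateHistoryTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Assumes the 'orders' namespace already exists (create_namespace 'orders')
            HTableDescriptor desc =
                    new HTableDescriptor(TableName.valueOf("orders:history_orders1"));
            HColumnDescriptor family = new HColumnDescriptor("order");
            family.setCompressionType(Compression.Algorithm.SNAPPY);
            desc.addFamily(family);

            // Pre-split region boundaries; placeholder keys instead of splits.txt
            byte[][] splitKeys = {
                    Bytes.toBytes("2"), Bytes.toBytes("4"),
                    Bytes.toBytes("6"), Bytes.toBytes("8")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}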

