Ways to load data into HBase, and integrating HBase with Hive and other components

无敌策哥

Category: Big Data. Tags: hbase

Article link: https://blog.csdn.net/qq_39481696/article/details/82597912

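Integrating HBase with other components

HBase and MapReduce

Set the HBase and Hadoop environment variables (run from the HBase directory):
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-nn
Set the HADOOP_CLASSPATH environment variable so that Hadoop can find the HBase jars:

export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`

HBase ships with several built-in MapReduce programs. To try one:
- Start HDFS, YARN and the history server
- Run rowcounter against the people table

  HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp` $HADOOP_HOME/bin/yarn jar $HBASE_HOME/lib/hbase-server-0.98.6-hadoop2.jar rowcounter people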

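HBase data migration and the ImportTsv tool

ImportTsv is the MapReduce-based bulk import tool that ships with HBase. It is a command-line tool that loads delimiter-separated data files stored on HDFS (the default delimiter is \t) into an HBase table with a single command, which makes it practical for importing large volumes of data.
- Create the data file /opt/student.tsv (the fields on each line must be separated by a tab character)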

10001 zhangsan  35
10002 lisi  32
10003 wangwu  29
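
Because the default delimiter is a tab, it is easy to end up with a file that looks right but contains spaces. A minimal sketch of one way to generate the sample file with real tab characters and check it; the helper name make_student_tsv and the scratch path under /tmp are assumptions made for illustration, not part of the original steps:

```shell
# Hypothetical helper: write the three sample rows with real tab separators.
make_student_tsv() {
  # $1 = output path
  printf '10001\tzhangsan\t35\n10002\tlisi\t32\n10003\twangwu\t29\n' > "$1"
}

# Write to a scratch location (an assumption; the article uploads /opt/student.tsv).
make_student_tsv "${TMPDIR:-/tmp}/student.tsv"

# Print the number of tab-separated fields on each line; every line should report 3.
awk -F'\t' '{ print NF }' "${TMPDIR:-/tmp}/student.tsv"
```

If any line reports 1 instead of 3, the file is using spaces and ImportTsv will not be able to split it into columns.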

Upload it to HDFS (run from the Hadoop directory):
bin/hdfs dfs -mkdir -p /user/hbase/importtsv
bin/hdfs dfs -put /opt/student.tsv /user/hbase/importtsv
Create the student table in the HBase shell:
create 'student','info'
Run the MapReduce job:

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age \
student \
hdfs://node-1:8020/user/hbase/importtsv
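
To make the -Dimporttsv.columns mapping concrete: the N-th entry in the list names the destination of the N-th field on each line, and HBASE_ROW_KEY marks the field used as the row key. The sketch below prints the HBase shell put statements that would have the same effect for a given line. It is purely illustrative: ImportTsv itself runs as a MapReduce job and does not generate shell commands, and tsv_to_puts is a hypothetical helper.

```shell
# Hypothetical helper: turn tab-separated rows on stdin into equivalent
# HBase shell put statements, following an importtsv.columns-style spec.
tsv_to_puts() {
  # $1 = table name, $2 = comma-separated column spec
  awk -F'\t' -v table="$1" -v spec="$2" '
    BEGIN {
      n = split(spec, cols, ",")
      # Remember which field position holds the row key
      for (i = 1; i <= n; i++) if (cols[i] == "HBASE_ROW_KEY") keyidx = i
    }
    {
      # Emit one put per non-key field; \047 is a single quote
      for (i = 1; i <= n; i++)
        if (i != keyidx)
          printf "put \047%s\047,\047%s\047,\047%s\047,\047%s\047\n", table, $keyidx, cols[i], $i
    }'
}

printf '10001\tzhangsan\t35\n' |
  tsv_to_puts student HBASE_ROW_KEY,info:name,info:age
```

For the sample line this yields one put for info:name and one for info:age, both on row 10001.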

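Loading data with BulkLoad

Step by step

List the jars that the job needs:
${HBASE_HOME}/bin/hbase mapredcp
1. Set the environment variables for the current session
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-nn
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`
Note: set the environment variables in the same terminal window that you submit the job from.
2. Start the Hadoop HDFS and YARN daemons
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
3. Start the job history server
$ sbin/mr-jobhistory-daemon.sh start jobhistoryserver
4. Start ZooKeeper
$ bin/zkServer.sh start
5. Start HMaster and the HRegionServers
$ bin/start-hbase.sh


1. Convert the delimited data file into HFiles first and load them afterwards, which copes better with large data volumes
# The table is created automatically, and so is the output directory (it must not exist beforehand)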
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar \
importtsv -Dimporttsv.bulk.output=/user/hbase/exporthfile \
-Dimporttsv.columns=HBASE_ROW_KEY,t:name,t:age \
student2 /user/hbase/importtsv
2. Load the HFiles into the HBase table
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar  completebulkload  \
hdfs://node-1:8020/user/hbase/exporthfile student2
3. Check the data in the student2 table in HBase
hbase(main):002:0> scan 'student2'

HBase and Hive

1. Make the HBase jars available to Hive
    * Use symbolic links:
            ln -s <source file> <target file>
        A symbolic link only creates a reference to the file at the chosen location and takes up practically no extra space,
        much like a shortcut on Windows.
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HIVE_HOME=/opt/modules/hive-0.13.1-cdh5.3.6
ln -s $HBASE_HOME/lib/hbase-common-0.98.6-hadoop2.jar  $HIVE_HOME/lib/hbase-common-0.98.6-hadoop2.jar
ln -s $HBASE_HOME/lib/hbase-server-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-server-0.98.6-hadoop2.jar
ln -s $HBASE_HOME/lib/hbase-client-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-client-0.98.6-hadoop2.jar
ln -s $HBASE_HOME/lib/hbase-protocol-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-protocol-0.98.6-hadoop2.jar
ln -s $HBASE_HOME/lib/hbase-it-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-it-0.98.6-hadoop2.jar
ln -s $HBASE_HOME/lib/htrace-core-2.04.jar $HIVE_HOME/lib/htrace-core-2.04.jar
ln -s $HBASE_HOME/lib/hbase-hadoop2-compat-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-hadoop2-compat-0.98.6-hadoop2.jar
ln -s $HBASE_HOME/lib/hbase-hadoop-compat-0.98.6-hadoop2.jar $HIVE_HOME/lib/hbase-hadoop-compat-0.98.6-hadoop2.jar
ln -s $HBASE_HOME/lib/high-scale-lib-1.1.1.jar $HIVE_HOME/lib/high-scale-lib-1.1.1.jar

2. Edit hive-site.xml so that Hive can find the ZooKeeper quorum used by HBase
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>node-1,node-2,node-3</value>
    </property>

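3. Create a table in Hive and map it to a table in HBase
    1. Start the Hadoop HDFS and YARN daemons
    2. Start ZooKeeper
    The Hive metastore service must also be running:
    $ bin/hive --service metastore &
    3. Start HBase

Example: create an HBase table from Hive and link the two.
Because this is a managed table, dropping the Hive table also drops the HBase table.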

CREATE TABLE hive_hbase_table(
no int,
name string,
age string
)STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age")
TBLPROPERTIES ("hbase.table.name" = "student3");

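Create an external table in Hive that maps to an existing HBase table. This suits the case where the data has been cleaned with MapReduce and saved into HBase first.
Dropping the Hive table does not affect the HBase table.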

CREATE EXTERNAL TABLE hive_hbase_table1(
no int,
name string,
age string
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age")
TBLPROPERTIES ("hbase.table.name" = "student3");

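HBase and Sqoop

Create a table in MySQL and insert some data.
Edit the sqoop-env.sh configuration file to add the HBase environment variable.
Export the environment variable from the Sqoop home directory:
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
Import the MySQL data into HBase: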

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/test \
--username root \
--password 123456 \
--table my_user \
--columns "id,account,passwd" \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "hbasesqoop" \
--num-mappers 1 \
--split-by id

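HBase table to HBase table (custom MapReduce)

## Goal: copy the data of one HBase table (student) into another HBase table (student_copy)
## First create the source table student in HBase and load data into it
## Then create the empty table student_copy
## Finally run the code
Use MapReduce to operate on HBase and copy a table
1. The student table already exists in HBase
2. Create an empty student_copy table
    create 'student_copy','info'

3. Write the code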

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class User2StudentMapReduce extends Configured implements Tool{

    // Use TableMapper to read rows from the source HBase table
    public static class ReadUserMapper extends
            // The type parameters are the map output types:
            // ImmutableBytesWritable -> output key (the row key)
            // Put -> output value (the cells to write for that row)
            TableMapper<ImmutableBytesWritable, Put> {

        // Called once per row of the student table.
        // key is the row key; row is the Result holding that row's cells
        @Override
        protected void map(ImmutableBytesWritable key, Result row,
                Context context) throws IOException, InterruptedException {
            // Build a Put for the same row key
            Put put = new Put(key.get());
            // rawCells() returns every cell in this row
            Cell[] rawCells = row.rawCells();
            for (Cell cell : rawCells) {
                // Only copy cells that belong to the info column family
                if ("info".equals(Bytes.toString(CellUtil.cloneFamily(cell)))) {
                    if ("name".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                        put.add(cell); // copy info:name
                        // Alternatively, rebuild the cell from its parts:
                        // put.add(family, qualifier, CellUtil.cloneValue(cell));
                    }
                    else if ("age".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                        put.add(cell); // copy info:age
                    }else if("address".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))){
                        put.add(cell); // copy info:address
                    }
                }
            }
            // Mapper output; skip rows without matching cells, because an empty Put is rejected on write
            if (!put.isEmpty()) {
                context.write(key, put);
            }
        }
    }

    public static class WriteStudentReducer extends
            TableReducer<ImmutableBytesWritable, Put, NullWritable> {

        @Override
        protected void reduce(ImmutableBytesWritable key, Iterable<Put> puts,
                Context context) throws IOException, InterruptedException {
            for (Put put : puts) {
                // reducer output
                context.write(NullWritable.get(), put);
            }
        }
    }

    // Configure and run the job
    @Override
    public int run(String[] args) throws Exception {

        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf, this.getClass().getSimpleName()); // any job name will do
        job.setJarByClass(User2StudentMapReduce.class);
        job.setNumReduceTasks(1); // number of reducers

        Scan scan = new Scan();
        scan.setCacheBlocks(false); // a full MR scan does not read hot data, so skip the block cache
        scan.setCaching(500); // number of rows fetched from the server per request

        TableMapReduceUtil.initTableMapperJob("student", // input table
                scan,
                ReadUserMapper.class, // mapper class
                ImmutableBytesWritable.class, // mapper output key
                Put.class, // mapper output value
                job);

        TableMapReduceUtil.initTableReducerJob("student_copy", // output table
                WriteStudentReducer.class, // reducer class
                job);

        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        int status = ToolRunner.run(conf, new User2StudentMapReduce(), args);
        System.exit(status);
    }
}
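4. Put the HBase jars on the classpath
 export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
 export HADOOP_HOME=/opt/modules/hadoop-2.5.0-cdh5.3.6
 export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`
5. Run the job
Export the project as a jar file (here /opt/h.jar), then run it. If the jar's manifest does not declare a main class, append the fully qualified name of User2StudentMapReduce to the command.
$HADOOP_HOME/bin/yarn jar /opt/h.jar
6. Check the result
bin/hbase shell
    scan 'student_copy'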

HBase and Hue

1. Configure the [hbase] section of Hue's hue.ini

[hbase]
  # Comma-separated list of HBase Thrift servers for clusters in the format of '(name|host:port)'.
  # Use full hostname with security.
  hbase_clusters=(Cluster|bigdata.com:9090)

  # HBase configuration directory, where hbase-site.xml is located.
  hbase_conf_dir=/opt/modules/hbase-0.98.6-hadoop2/conf
2. Start the HBase Thrift server
$ bin/hbase-daemon.sh start thrift

3. Start Hue and open it in a browser
$ build/env/bin/supervisor
http://bigdata-hive:8888

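namespace

A namespace is similar to a database: each namespace holds tables.
- Create a namespace
create_namespace 'ns2'
- List the tables in the ns2 namespace
list_namespace_tables 'ns2'
- Show its description
describe_namespace 'ns2'
- Drop the namespace; a namespace can only be dropped when it contains no tables
drop_namespace 'ns2'

Create a table in a specific namespace

create 'ns2:t11','f1'
Create table t2 in the default namespace with column family f1, keeping 1 version, a TTL of 2592000 seconds (30 days), and the block cache enabled
create 't2', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
Drop a table (it must be disabled first)
disable 'ns2:t11'
drop 'ns2:t11'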

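Pre-splitting regions

Create a pre-split table with explicit split points
create 't1','f1',SPLITS=>['10','20','30','40']
Generate the split points from hexadecimal strings
create 't2', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
Insert data into the pre-split table.
After inserting, you can see how the rows are spread across regions in the master web UI:
http://node-1:60010/master-status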
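
With split points '10','20','30','40' the table has five regions, and a row lands in a region according to a byte-wise comparison of its row key with the split points. The sketch below works that out for a given key; region_index is a hypothetical helper written for illustration, and the 0-based numbering is an assumption of this sketch rather than anything HBase reports.

```shell
# Hypothetical helper: print the 0-based index of the region a row key falls
# into, given the (sorted) split points passed as the remaining arguments.
region_index() {
  key=$1; shift
  printf '%s\n' "$@" | LC_ALL=C awk -v key="$key" '
    # Appending "" forces a string (byte-wise) comparison instead of a numeric one
    (key "") >= ($0 "") { idx++ }
    END { print idx + 0 }'
}

region_index 25abc 10 20 30 40   # prints 2: the key sorts between '20' and '30'
region_index 9 10 20 30 40       # prints 4: '9' sorts after '40' byte-wise
```

The second call shows why numeric-looking row keys are usually zero-padded to a fixed width: without padding, '9' lands in the last region rather than the first.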

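Extended scans: range queries

Start at a given row key and return a given number of rows; if fewer rows exist, all of them are returned
scan 'student', {LIMIT => 10, STARTROW => '10001'}
Return the first few rows in a range; the row at STOPROW itself is excluded
scan 'student', {LIMIT => 2, STARTROW => '10001',STOPROW=>'10004'}
Return only specific columns
scan 'user', {COLUMNS => ['info:name', 'info:age'], LIMIT => 10, STARTROW => '100010005'}

Using * in the range bounds

scan 'use', {LIMIT => 100, STARTROW => '201801*',STOPROW=>'201812*'}
Returns up to 100 rows from January through November 2018. The * here is not a pattern: STARTROW and STOPROW are compared byte by byte as literal strings, and * happens to sort before the digits, so keys that begin with 201812 followed by a digit already fall past STOPROW.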
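
The range behaviour of STARTROW => '201801*' and STOPROW => '201812*' can be checked offline, since it comes down to plain byte-wise string comparison over the half-open interval [STARTROW, STOPROW). A small sketch, assuming made-up date-prefixed row keys; scan_range is a hypothetical helper, not an HBase command:

```shell
# Hypothetical helper: keep the row keys on stdin that fall in [start, stop),
# compared byte by byte the way HBase orders row keys.
scan_range() {
  LC_ALL=C awk -v start="$1" -v stop="$2" '
    # Appending "" forces string comparison even for numeric-looking keys
    ($0 "") >= (start "") && ($0 "") < (stop "") { print }'
}

printf '%s\n' 201801 20180115 20181130 20181201 20190101 |
  scan_range '201801*' '201812*'
```

Only 20180115 and 20181130 pass. The bare key 201801 is shorter than the start bound and sorts before it, and 20181201 sorts after '201812*' because '0' is a larger byte than '*'.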

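Scans with filters

binary: the value must be equal to the given bytes
substring: the value only needs to contain the given string
- In the use table, keep the cells whose value is zhangsan
scan 'use', FILTER=>"ValueFilter(=,'binary:zhangsan')"
- Keep the cells whose value contains 32
scan 'use', FILTER=>"ValueFilter(=,'substring:32')"
- Restrict the scan to columns whose qualifier starts with name, and keep the cells whose value contains zhangsan
scan 'use', FILTER=>"ColumnPrefixFilter('name') AND ValueFilter(=,'substring:zhangsan')"
- Restrict the scan to columns whose qualifier starts with name, and keep the cells whose value contains 321 or 232
scan 'use', FILTER=>"ColumnPrefixFilter('name') AND (ValueFilter(=,'substring:321') OR ValueFilter(=,'substring:232'))"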
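
The difference between the binary: and substring: comparators is easy to mix up, so here is a rough offline imitation applied to a single cell value. value_matches is a hypothetical helper; it compares case-sensitively, which is a simplification made for this sketch and may differ from how HBase's substring comparator treats case.

```shell
# Hypothetical helper mimicking the two comparators on a single cell value.
value_matches() {
  # $1 = comparator spec such as binary:zhangsan or substring:32, $2 = cell value
  kind=${1%%:*}
  operand=${1#*:}
  case $kind in
    binary)    [ "$2" = "$operand" ] ;;                                    # whole value must be equal
    substring) case $2 in *"$operand"*) return 0 ;; *) return 1 ;; esac ;; # value must contain operand
    *)         return 2 ;;                                                 # unknown comparator
  esac
}

value_matches binary:zhangsan zhangsan && echo "binary: equal values match"
value_matches substring:32 10321 && echo "substring: 10321 contains 32"
value_matches binary:32 10321 || echo "binary: 32 does not equal 10321"
```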

count

Count the rows in a table
count 'use'
count 'ns1:use'

A worked HBase and Hive example

Create the HBase table
create 'userTelphone','info'
Load some data; here a few rows are added by hand

put 'userTelphone','182600937646_20151001082013','info:area','shanghai'
put 'userTelphone','182600937646_20151001082053','info:area','shanghai'
put 'userTelphone','182600937646_20151001082013','info:active','zhujiao'
put 'userTelphone','182600937646_20151024092018','info:area','shanghai'
put 'userTelphone','182600937646_20151024092018','info:active','zhujiao'
put 'userTelphone','182600937646_20151227092018','info:area','shanghai'
put 'userTelphone','182600937646_20151227092018','info:active','zhujiao'
put 'userTelphone','182600937648_20151124092018','info:area','shanghai'
put 'userTelphone','182600937648_20151124092018','info:active','zhujiao'

Create a Hive table to hold the final result

CREATE  TABLE hive_hbase_res(
telphone string,
teltime string,
area string,
active string,
phone string,
talktime string,
mode string,
price string)
row format delimited fields terminated by '\t';

Create an external table that maps to the HBase table

CREATE EXTERNAL TABLE Tel_hbase_ext(
telphone_teltime string ,
area string ,
active string
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:area,info:active")
TBLPROPERTIES ("hbase.table.name" = "userTelphone");

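From here the steps are much the same as processing logs in Hive: split the row key into phone number and timestamp and write the result into hive_hbase_res.

-- phone, talktime, mode and price are not available in Tel_hbase_ext, so they are left NULL here
insert overwrite table hive_hbase_res
select split(telphone_teltime,"_")[0], split(telphone_teltime,"_")[1], area, active,
       cast(null as string), cast(null as string), cast(null as string), cast(null as string)
from Tel_hbase_ext;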
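
The split(telphone_teltime,"_") calls rely on the row key layout <phone>_<timestamp> used in the put statements above. The same split can be tried outside Hive on a sample key; split_rowkey is a hypothetical helper, shown only to illustrate what elements [0] and [1] contain:

```shell
# Hypothetical helper: split a row key of the form <phone>_<timestamp>
# and print the two parts separated by a tab.
split_rowkey() {
  # ${1%%_*} keeps the text before the first underscore,
  # ${1#*_} keeps the text after it
  printf '%s\t%s\n' "${1%%_*}" "${1#*_}"
}

split_rowkey 182600937646_20151001082013
```

The first part corresponds to the telphone column and the second to teltime in the result table.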
