"Spark Series": Mapping Hive to HBase, and Importing Hive Table Data into ClickHouse with Spark

This article walks through creating a table in HBase and mapping it from Hive, so that data added on either side can be queried from both. It then shows how to import the Hive data into ClickHouse with Spark, including the full project configuration and code. Finally, it verifies that the data in Hive and ClickHouse are consistent.


HBase+Hive+Spark+ClickHouse

Create a table in HBase and map it from Hive, so that after data is added on either side it can be queried from both.

Then use Spark to read the data from Hive, process it, and save it into ClickHouse.

I HBase

1 HBase table operations

1.1 Create a namespace

hbase(main):008:0> create_namespace 'zxy',{'hbasename'=>'hadoop'}
0 row(s) in 0.0420 seconds

1.2 Create a table with a column family

hbase(main):012:0> create 'zxy:t1',{NAME=>'f1',VERSIONS=>5}
0 row(s) in 2.4850 seconds


hbase(main):014:0> list 'zxy:.*'
TABLE
zxy:t1
1 row(s) in 0.0200 seconds

=> ["zxy:t1"]
hbase(main):015:0> describe 'zxy:t1'
Table zxy:t1 is ENABLED
zxy:t1
COLUMN FAMILIES DESCRIPTION
{NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_C
ELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', M
IN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

1.3 Insert data row by row

hbase(main):016:0> put 'zxy:t1','r1','f1:name','zxy'
0 row(s) in 0.1080 seconds
hbase(main):028:0> append 'zxy:t1','r1','f1:id','001'
0 row(s) in 0.0400 seconds

hbase(main):029:0> scan 'zxy:t1'
ROW                     COLUMN+CELL
 r1                     column=f1:id, timestamp=1627714724257, value=001
 r1                     column=f1:name, timestamp=1627714469210, value=zxy
1 row(s) in 0.0120 seconds

hbase(main):030:0> append 'zxy:t1','r2','f1:id','002'
0 row(s) in 0.0060 seconds

hbase(main):031:0> append 'zxy:t1','r2','f1:name','bx'
0 row(s) in 0.0080 seconds

hbase(main):032:0> append 'zxy:t1','r3','f1:id','003'
0 row(s) in 0.0040 seconds

hbase(main):033:0> append 'zxy:t1','r3','f1:name','zhg'
0 row(s) in 0.0040 seconds

hbase(main):034:0> scan 'zxy:t1'
ROW                     COLUMN+CELL
 r1                     column=f1:id, timestamp=1627714724257, value=001
 r1                     column=f1:name, timestamp=1627714469210, value=zxy
 r2                     column=f1:id, timestamp=1627714739647, value=002
 r2                     column=f1:name, timestamp=1627714754108, value=bx
 r3                     column=f1:id, timestamp=1627714768018, value=003
 r3                     column=f1:name, timestamp=1627714778121, value=zhg
3 row(s) in 0.0190 seconds

II Hive

1 Create the HBase-mapped table

hive (zxy)> create external table if not exists t1(
          > uid string,
          > id int,
          > name string
          > )
          > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
          > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:id,f1:name")
          > TBLPROPERTIES ("hbase.table.name" = "zxy:t1");
OK
Time taken: 0.306 seconds
hive (zxy)> select * from t1
          > ;
OK
r1      1       zxy
r2      2       bx
r3      3       zhg
Time taken: 0.438 seconds, Fetched: 3 row(s)

2 Add data in HBase

  • Add data on the HBase side

hbase(main):002:0> append 'zxy:t1','r4','f1:id','004'
0 row(s) in 0.1120 seconds

hbase(main):003:0> append 'zxy:t1','r4','f1:name','hyy'
0 row(s) in 0.0220 seconds

hbase(main):004:0> scan 'zxy:t1'
ROW                                      COLUMN+CELL
 r1                                      column=f1:id, timestamp=1627714724257, value=001
 r1                                      column=f1:name, timestamp=1627714469210, value=zxy
 r2                                      column=f1:id, timestamp=1627714739647, value=002
 r2                                      column=f1:name, timestamp=1627714754108, value=bx
 r3                                      column=f1:id, timestamp=1627714768018, value=003
 r3                                      column=f1:name, timestamp=1627714778121, value=zhg
 r4                                      column=f1:id, timestamp=1627716660482, value=004
 r4                                      column=f1:name, timestamp=1627716670546, value=hyy
  • The new data shows up in Hive
hive (zxy)> select * from t1;
OK
r1      1       zxy
r2      2       bx
r3      3       zhg
r4      4       hyy

3 Add data in Hive

Hive cannot load data directly into an HBase-backed table, so a staging table is used here to import the data.

  • user.txt
r5 5 tzk
r6 6 fyj
  • Create the staging table
hive (zxy)> create table if not exists t2 (uid string,id int,name string) row format delimited fields terminated by ' '
          > ;
OK
Time taken: 0.283 seconds
  • Load the data
hive (zxy)> load data local inpath '/data/data/user.txt' into table t2;
Loading data to table zxy.t2
Table zxy.t2 stats: [numFiles=1, totalSize=18]
OK
  • Insert the staging-table data into t1
hive (zxy)> insert into table t1 select * from t2;
Query ID = root_20210731154037_e8019cc0-38bb-42fc-9674-a9de2be9dba6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1627713883513_0001, Tracking URL = http://hadoop:8088/proxy/application_1627713883513_0001/
Kill Command = /data/apps/hadoop-2.8.1/bin/hadoop job  -kill job_1627713883513_0001
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2021-07-31 15:41:23,373 Stage-0 map = 0%,  reduce = 0%
2021-07-31 15:41:34,585 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 3.45 sec
MapReduce Total cumulative CPU time: 3 seconds 450 msec
Ended Job = job_1627713883513_0001
MapReduce Jobs Launched:
Stage-Stage-0: Map: 1   Cumulative CPU: 3.45 sec   HDFS Read: 3659 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 450 msec
OK
Time taken: 60.406 seconds

  • Query the data from Hive
hive (zxy)> select * from t1;
OK
r1      1       zxy
r2      2       bx
r3      3       zhg
r4      4       hyy
r5      5       tzk
r6      6       fyj
Time taken: 0.335 seconds, Fetched: 6 row(s)
hive (zxy)>
  • The data is visible from HBase (note that the ids written through Hive are stored as the plain string form of the int, 5 and 6, unlike the manually appended 001)
hbase(main):001:0> scan 'zxy:t1'
ROW                                      COLUMN+CELL
 r1                                      column=f1:id, timestamp=1627714724257, value=001
 r1                                      column=f1:name, timestamp=1627714469210, value=zxy
 r2                                      column=f1:id, timestamp=1627714739647, value=002
 r2                                      column=f1:name, timestamp=1627714754108, value=bx
 r3                                      column=f1:id, timestamp=1627714768018, value=003
 r3                                      column=f1:name, timestamp=1627714778121, value=zhg
 r4                                      column=f1:id, timestamp=1627716660482, value=004
 r4                                      column=f1:name, timestamp=1627716670546, value=hyy
 r5                                      column=f1:id, timestamp=1627717294053, value=5
 r5                                      column=f1:name, timestamp=1627717294053, value=tzk
 r6                                      column=f1:id, timestamp=1627717294053, value=6
 r6                                      column=f1:name, timestamp=1627717294053, value=fyj
6 row(s) in 0.4660 seconds

III Hive2ClickHouse

Full project link

1 pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.zxy</groupId>
    <artifactId>hive2ch</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <scala.version>2.11.12</scala.version>
        <play-json.version>2.3.9</play-json.version>
        <maven-scala-plugin.version>2.10.1</maven-scala-plugin.version>
        <scala-maven-plugin.version>3.2.0</scala-maven-plugin.version>
        <maven-assembly-plugin.version>2.6</maven-assembly-plugin.version>
        <spark.version>2.4.5</spark.version>
        <scope.type>compile</scope.type>
        <json.version>1.2.3</json.version>
        <!--compile provided-->
    </properties>

    <dependencies>

        <!--json 包-->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${json.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>${scope.type}</scope>
        </dependency>

        <!-- remaining dependencies and build configuration truncated in the source -->
    </dependencies>
</project>
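The job described above can be sketched as a small Spark program: read `zxy.t1` through the Hive metastore, then append the resulting DataFrame to ClickHouse over JDBC. This is a minimal sketch, not the original project's code — the ClickHouse URL, database, target table name, and the `ru.yandex.clickhouse.ClickHouseDriver` driver class are assumptions:

```scala
import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}

object Hive2ClickHouse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive2ch")
      .enableHiveSupport() // read Hive tables through the metastore
      .getOrCreate()

    // Read the HBase-backed Hive table populated in the steps above
    val df = spark.sql("select uid, id, name from zxy.t1")

    // ClickHouse JDBC target -- host, port, database and table are placeholders
    val url = "jdbc:clickhouse://hadoop:8123/default"
    val props = new Properties()
    props.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")

    // Append the rows into ClickHouse; the target table must already exist
    // with matching column types
    df.write
      .mode(SaveMode.Append)
      .jdbc(url, "t1", props)

    spark.stop()
  }
}
```

Submit with `spark-submit`, keeping the ClickHouse JDBC jar on the driver and executor classpath.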