Data Migration for Hive and HBase Tables

Preface:

The company is migrating data at the moment, so I ran a small test to verify the commands involved.

Part 1: Hive


##############################################
Create the source text tables

create table log_text_1(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
partitioned by (part string)
row format delimited fields terminated by '\t'
stored as textfile ;

create table log_text_2(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
partitioned by (part string)
row format delimited fields terminated by '\t'
stored as textfile ;
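
For reference, each input file must contain tab-separated rows in the column order above. A sample line (all values invented for illustration):

2014-03-18 10:05:17	http://www.example.com/item/123	s0001	http://www.example.com/	120.196.100.99	u1001	c010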

##############################################
Load the data

load data local inpath '/opt/module/datas/part1_1.txt' into table default.log_text_1 partition(part='part1_1');

load data local inpath '/opt/module/datas/part1_2.txt' into table default.log_text_1 partition(part='part1_2');

load data local inpath '/opt/module/datas/part1_3.txt' into table default.log_text_1 partition(part='part1_3');

load data local inpath '/opt/module/datas/part2_1.txt' into table default.log_text_2 partition(part='part2_1');

load data local inpath '/opt/module/datas/part2_2.txt' into table default.log_text_2 partition(part='part2_2');

load data local inpath '/opt/module/datas/part2_3.txt' into table default.log_text_2 partition(part='part2_3');

Drop a mistakenly created partition:
alter table log_text_1 drop partition (part='part2_1');
Check the partitions that remain:
show partitions log_text_1;
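
Assuming only the three part1_* loads succeeded (part2_1 having been dropped), the expected output is:

part=part1_1
part=part1_2
part=part1_3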

##############################################
Create the ORC tables

create table log_orc_1(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
partitioned by (part string)
row format delimited fields terminated by '\t'
stored as orc ;

create table log_orc_2(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
partitioned by (part string)
row format delimited fields terminated by '\t'
stored as orc ;

##############################################
Load data into the ORC tables

First attempt: LOAD DATA
load data local inpath '/opt/module/datas/part1_1.txt' into table default.log_orc_1 partition(part='part1_1');
load data local inpath '/opt/module/datas/part1_2.txt' into table default.log_orc_1 partition(part='part1_2');
load data local inpath '/opt/module/datas/part1_3.txt' into table default.log_orc_1 partition(part='part1_3');
load data local inpath '/opt/module/datas/part2_1.txt' into table default.log_orc_2 partition(part='part2_1');
load data local inpath '/opt/module/datas/part2_2.txt' into table default.log_orc_2 partition(part='part2_2');
load data local inpath '/opt/module/datas/part2_3.txt' into table default.log_orc_2 partition(part='part2_3');
This fails!
Reason: LOAD DATA only moves the files into the partition directory without any format conversion, so plain text files loaded into an ORC table cannot be read back; SELECTs on those partitions error out.
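
Before re-populating with INSERT, drop the unreadable partitions the failed LOADs left behind. A cleanup sketch (assumes all six LOADs above registered their partitions):

alter table log_orc_1 drop if exists partition (part='part1_1'), partition (part='part1_2'), partition (part='part1_3');
alter table log_orc_2 drop if exists partition (part='part2_1'), partition (part='part2_2'), partition (part='part2_3');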

Test a single-partition INSERT instead:
insert overwrite table log_orc_1 partition(part='part1_1') select * from log_text_1 where part = 'part1_1';
It succeeds; drop the test partition so everything can be repopulated in one pass:
alter table log_orc_1 drop partition (part='part1_1');

Hive's multi-insert syntax then repopulates each table's partitions while scanning its source table only once:

from log_text_1
insert overwrite table log_orc_1 partition(part='part1_1') 
select track_time,url,session_id,referer,ip,end_user_id,city_id where part='part1_1'
insert overwrite table log_orc_1 partition(part='part1_2') 
select track_time,url,session_id,referer,ip,end_user_id,city_id where part='part1_2'
insert overwrite table log_orc_1 partition(part='part1_3') 
select track_time,url,session_id,referer,ip,end_user_id,city_id where part='part1_3';

from log_text_2
insert overwrite table log_orc_2 partition(part='part2_1') 
select track_time,url,session_id,referer,ip,end_user_id,city_id where part='part2_1'
insert overwrite table log_orc_2 partition(part='part2_2') 
select track_time,url,session_id,referer,ip,end_user_id,city_id where part='part2_2'
insert overwrite table log_orc_2 partition(part='part2_3') 
select track_time,url,session_id,referer,ip,end_user_id,city_id where part='part2_3';
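
An alternative is Hive's dynamic partition insert, which derives the partition value from the last selected column and avoids enumerating partitions by hand. A sketch (the two settings relax the default strict mode, which otherwise requires at least one static partition):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table log_orc_1 partition(part)
select track_time,url,session_id,referer,ip,end_user_id,city_id,part from log_text_1;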

##############################################
Export the data (fast; essentially just file I/O)

Option 1: export the entire table
export table default.log_orc_1 to  '/migrate/log_orc_all';

Option 2: export individual partitions
First attempt: export every partition into the same directory
export table default.log_orc_1 partition(part='part1_1') to  '/migrate/log_orc_part';
The statement above succeeds.
export table default.log_orc_1 partition(part='part1_2') to  '/migrate/log_orc_part';
This fails: FAILED: SemanticException Line 1:13 Invalid path ''/migrate/log_orc_part'': Target is not an empty directory : hdfs://hadoop211:9000/migrate/log_orc_part
EXPORT only writes into an empty directory, so remove the metadata and data left by the first export:
hadoop dfs -rm /migrate/log_orc_part/_metadata
hadoop dfs -rm -r /migrate/log_orc_part/part=part1_1

Better approach: export each partition to its own target directory; EXPORT creates the directory itself:
export table default.log_orc_1 partition(part='part1_1') to  '/migrate/log_orc_part/part1_1';
export table default.log_orc_1 partition(part='part1_2') to  '/migrate/log_orc_part/part1_2';
export table default.log_orc_1 partition(part='part1_3') to  '/migrate/log_orc_part/part1_3';
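
To confirm the layout, list the export tree; each subdirectory should hold a _metadata file plus the partition's data directory:

hadoop dfs -ls -R /migrate/log_orc_part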

##############################################
Import the data (fast; essentially just file I/O)

Option 1: import the whole table; IMPORT creates log_orc_1_in1 from the exported metadata since the table does not already exist
import table log_orc_1_in1 from  '/migrate/log_orc_all';

Option 2: import individual partitions
import table log_orc_1_in2 partition(part='part1_1') from  '/migrate/log_orc_part/part1_1';
import table log_orc_1_in2 partition(part='part1_2') from  '/migrate/log_orc_part/part1_2';
import table log_orc_1_in2 partition(part='part1_3') from  '/migrate/log_orc_part/part1_3';

Attempt a single-statement import of all partitions, to avoid writing one statement per partition:
import table log_orc_1_in3 from  '/migrate/log_orc_part/';
This fails: FAILED: SemanticException [Error 10027]: Invalid path
IMPORT expects the path to be the output of a single EXPORT (with a _metadata file at its root), so a directory that merely contains several exports is rejected.
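
A workaround sketch: drive the per-partition imports from the shell so the statements do not have to be written out by hand (assumes the hive CLI is on the PATH):

for p in part1_1 part1_2 part1_3; do
  hive -e "import table log_orc_1_in3 partition(part='${p}') from '/migrate/log_orc_part/${p}';"
done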

##############################################

Part 2: HBase Tables

##############################################

Create the HBase table

First, a plain Hive staging table used to load the data:
CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';
Load data into the staging table:
load data local inpath '/opt/module/datas/emp.txt' into table emp;
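
A sample line of emp.txt for reference (tab-separated, matching the column order; the empty field is a NULL comm; values follow the classic emp dataset but are illustrative):

7369	SMITH	CLERK	7902	1980-12-17	800.00		20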

Create the HBase-backed Hive table; in hbase.columns.mapping, :key maps empno to the HBase row key and the remaining columns go into the info column family:
CREATE TABLE hive_hbase_emp_table(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table");

Check the resulting table definition:
hive> show create table hive_hbase_emp_table;
OK
CREATE TABLE `hive_hbase_emp_table`(
  `empno` int COMMENT 'from deserializer', 
  `ename` string COMMENT 'from deserializer', 
  `job` string COMMENT 'from deserializer', 
  `mgr` int COMMENT 'from deserializer', 
  `hiredate` string COMMENT 'from deserializer', 
  `sal` double COMMENT 'from deserializer', 
  `comm` double COMMENT 'from deserializer', 
  `deptno` int COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_emp_table', 
  'transient_lastDdlTime'='1565098184')

Populate the HBase-backed table from the staging table:
insert into table hive_hbase_emp_table select * from emp;
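
The rows should now be visible from the HBase shell as well, e.g. (a quick check, run from the HBase bin directory):

./hbase shell
scan 'hbase_emp_table', {LIMIT => 2}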

##############################################
Export the data from HBase

Check the usage by running the command with no arguments:
./hbase org.apache.hadoop.hbase.mapreduce.Export
./hbase org.apache.hadoop.hbase.mapreduce.Export hbase_emp_table /migrate/hbase_emp_table_all
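
For reference, the usage it prints is roughly (from memory; check the actual help output):

Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

so an export can also be limited to a number of cell versions or a time range.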

##############################################
Import the data into HBase

Option 1: import directly into a table that has not been created yet
hbase org.apache.hadoop.hbase.mapreduce.Import hbase_emp_table_in1 /migrate/hbase_emp_table_all
This fails: the target table does not exist.
2019-08-06 22:43:02,690 INFO  [main] mapreduce.Job: Task Id : attempt_1564995760485_0007_m_000000_1, Status : FAILED
Error: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 10 actions: Table 'hbase_emp_table_in1' was not found, got: hbase_emp_table.: 10 times, servers with issues: null, 
        at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:260)
        at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$2400(AsyncProcess.java:240)
        at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1889)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:247)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.close(BufferedMutatorImpl.java:170)
        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:120)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:670)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Create a corresponding HBase-backed Hive table first (this also creates the HBase table hbase_emp_table_in1), then import again:
CREATE TABLE hive_hbase_emp_table_in1(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table_in1");

./hbase org.apache.hadoop.hbase.mapreduce.Import  hbase_emp_table_in1 /migrate/hbase_emp_table_all
This time it succeeds!
select * from hive_hbase_emp_table_in1;
The data is there.
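
To double-check the migration, the row counts of the source and imported tables can be compared with the bundled RowCounter job (a sketch):

./hbase org.apache.hadoop.hbase.mapreduce.RowCounter hbase_emp_table
./hbase org.apache.hadoop.hbase.mapreduce.RowCounter hbase_emp_table_in1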

##############################################

Option 2: a repair-based approach

Save the table's raw data out of HDFS:
hadoop dfs -get /hbase/data/default/hbase_emp_table

drop table hive_hbase_emp_table;

Because hive_hbase_emp_table is a managed table, dropping it also removes the underlying HBase table. Put the saved data back:
hadoop dfs -put hbase_emp_table /hbase/data/default/hbase_emp_table

./hbase hbck -fixTableOrphans hbase_emp_table
./hbase hbck -fixMeta hbase_emp_table
./hbase hbck -fixAssignments hbase_emp_table
./hbase hbck -repair hbase_emp_table
hbck restores the table in HBase, but queries against it return no data.

Try to recreate the corresponding Hive table:
CREATE TABLE hive_hbase_emp_table(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table");
This fails:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:MetaException(message:Table hbase_emp_table already exists within HBase; use CREATE EXTERNAL TABLE instead to register it in Hive.)
So this repair approach is not suitable for HBase tables that were created through Hive.
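
As the error message itself suggests, an HBase table that already exists has to be registered in Hive as an external table. A sketch of the fix:

CREATE EXTERNAL TABLE hive_hbase_emp_table(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table");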

##############################################

 
