hive3.0.0 新建orc表以便支持 delete 问题及sql语句调优

最新推荐文章于 2024-07-24 09:15:39 发布

卢子墨

最新推荐文章于 2024-07-24 09:15:39 发布

阅读量1.3k

点赞数

分类专栏： Hive 文章标签： hive orc hive表调优

本文链接：https://blog.csdn.net/lukabruce/article/details/87874879

版权

Hive 专栏收录该内容

14 篇文章 1 订阅

订阅专栏

说明：之前查了资料得到hive3.0 及以上版本是支持ACID的，但是在实际操作中并没有实现delete功能，为了节省时间之间将原来存储格式为textfile格式的内部表修改为存储格式为orcfile的orc表，经过实操，发现实现了delete功能，且性能提升一倍左右。

另：测试的内存配置为垃圾8G，执行引擎为yarn。

# 老思路 cst_bsc_inf_dplt 全量表 按客户ID分桶
create table if not exists cst_bsc_inf_dplt(
  cst_id string,
  ip_id string,
  .......,
  rmrk_1 string)
  comment 'this is the custormer_bastic_information_copy view table'
  clustered by (cst_id) into 8 buckets
  row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' with serdeproperties ('field.delim'='|@|')
  store as textfile tblproperties ('serialization.encoding'='utf-8');

# 新思路 将cst_bsc_inf_dplt建立成ORC表 按客户ID分桶 
create table if not exists cst_bsc_inf_dplt(
  cst_id string,
  ip_id string,
  .......,
  rmrk_1 string)
  comment 'this is the custormer_bastic_information_copy view table'
  clustered by (cst_id) into 8 buckets
  store as orcfile tblproperties ('serialization.encoding'='utf-8','transactional'='true');



=======调优前=======
(老思路 textfile表：匹配全量表与增量表相同的cst_id,然后进行全表覆写  千万级别全量数据5.6G + 增量数据 2.5M(1100条左右)   7分钟)

sql脚本：
use cst_lbl;
insert overwrite table cat_bsc_inf_dplt select * from cat_bsc_inf_dplt_mid union all(select a.* from cat_bsc_inf_dplt a left join cat_bsc_inf_dplt_mid b on 1=1 and a.cst_id=b.cst_id where b.cst_id is null);
exit;

=======调优后=======
(新思路 orc表:先查出增量表中的cst_id，全量表与之相同的cst_id整条数据删除；然后将增量表全部数据整体插入到增量表中。千万级别全量数据5.6G + 增量数据 2.5M(1100条左右)  3分钟)

sql脚本：
use cst_lbl;
delete from cat_bsc_inf_dplt where cst_id in (select cst_id from cat_bsc_inf_dplt_mid);
insert into cat_bsc_inf_dplt select * from cat_bsc_inf_dplt_mid；
exit;

卢子墨

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
打赏
0
评论
hive3.0.0 新建orc表以便支持 delete 问题及sql语句调优

说明：之前查了资料得到hive3.0 及以上版本是支持ACID的，但是在实际操作中并没有实现delete功能，为了节省时间之间将原来存储格式为textfile格式的内部表修改为存储格式为orcfile的orc表，经过实操，发现实现了delete功能，且性能提升一倍左右。另：测试的内存配置为垃圾8G，执行引擎为yarn。# 老思路 cst_bsc_inf_dplt 全量表按客户ID分桶...
复制链接

扫一扫