hive-sql 性能优化

cg6

已于 2023-01-08 12:36:34 修改

阅读量269

点赞数

分类专栏：大数据 hive 文章标签： hive sql hadoop

于 2022-06-13 11:21:00 首次发布

本文链接：https://blog.csdn.net/a123147abc/article/details/125255675

版权

大数据同时被 2 个专栏收录

29 篇文章 0 订阅

订阅专栏

hive

18 篇文章 1 订阅

订阅专栏

1、distribute by [sort by] 或cluster by 优化数据倾斜

数据倾斜排查过程：
1、原始sql 如下：

select *
from jour
join spc on spc.id=jour.parent_id 
left join spcadmin on spc.rect_id=spcadmin.rect_id and  to_date(jour.create_time) =to_date(spcadmin.dt_date) 
left join person on person.id = jour.modify_user_id  
where 1 = 1
	and jour.deleted='false'
;

2、map 端出现数据倾斜
在这里插入图片描述
3、大表数据文件分布情况，数据文件分布不均匀，如下图所示
4、spc表是事务表，严重影响关联查询性能，删表重新创建为非事务表

注意：不需要表为事务表才能修改事务表为内部或外部表

5、创建临时表并对数据进行随机分发

set mapred.reduce.tasks=10;   -- 输出10和分布均匀的数据文件
drop table if exists A_tmp;
create table if not exists A_tmp as
select
    id,
    create_user_id ,
    create_time ,
    modify_user_id,
    modify_time,
    tenant_id,
    parent_id,
    bind_id ,
    operation_type,
    `type`,
    object_type
from A
where deleted='false'
distribute by rand(123)
;

数据重新分发后的数据分布，如下图
在这里插入图片描述

优化sql如下:

set mapred.reduce.tasks=10;   -- 输出10和分布均匀的数据文件
drop table if exists A_tmp;
create table if not exists A_tmp as
select
    id,
    create_user_id ,
    create_time ,
    modify_user_id,
    modify_time,
    tenant_id,
    parent_id,
    bind_id ,
    operation_type,
    `type`,
    object_type
from jour 
where deleted='false'
distribute by rand(123)
;

select * 
from spc
join jour on spc.id=jour.parent_id  
left join spcadmin on spc.rect_id=spcadmin.rect_id and  to_date(jour.create_time) =to_date(spcadmin.dt_date) 
left join person on person.id = jour.modify_user_id  
;

或优化sql如下：

set mapred.reduce.tasks=10; 
select * 
from spc
join (
	select *
	from jour 
	where deleted='false'
	distribute by id
)jour on spc.id=jour.parent_id  
left join spcadmin on spc.rect_id=spcadmin.rect_id and  to_date(jour.create_time) =to_date(spcadmin.dt_date) 
left join person on person.id = jour.modify_user_id  
;