Hive: Optimization and Common Problems

若叶时代

Last modified: 2024-06-24 13:19:06

First published: 2020-09-23 15:10:42

Article link: https://blog.csdn.net/weixin_43875878/article/details/108302163

0 References

Alibaba Cloud: MaxCompute optimization series, how to use `MAPJOIN`? https://developer.aliyun.com/article/425415?spm=a2c6h.14164896.0.0.500f2b94FCMkLB

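1 Statement optimization

1.1 Filter optimization

① Partition pruning: read only the partitions that are needed.

② Column pruning: when using a table, do not read columns that are not needed, which reduces I/O. Columns are read by select, where, join, group by, sort by, and similar clauses.

③ In the where clause, filter on partitions first and on ordinary fields second, and evaluate the most selective fields first.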
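A minimal sketch of the three rules above, reusing the dwd_log_vst_di table and its ds partition column from the UV example later in this article. The page_type column and its value are hypothetical and appear only for illustration.

```sql
-- Avoid: select * reads every column of every partition.
-- select * from dwd_log_vst_di;

-- Prefer: name only the columns you need and filter on the partition column first.
select
         sku_code
        ,user_id
from dwd_log_vst_di
where ds = '${cur_date}'          -- partition filter: only this partition is read
  and page_type = 'detail'        -- hypothetical column: ordinary field filter
;
```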

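1.2 join optimization

① When joining several tables, join the tables that reduce the data volume first.

② In the join clause, put the most selective join keys first.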
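As a rough illustration of rule ①, the sketch below joins the table that discards the most rows first. All table and column names here (fact_order, dim_promo, dim_user, order_id, user_name) are hypothetical, and a cost-based optimizer may reorder the joins on its own, so treat this as a way of expressing intent rather than a guarantee.

```sql
select
         o.order_id
        ,u.user_name
from fact_order o
join dim_promo p                     -- joined first: keeps only SKUs that are on promotion
  on  o.sku_code = p.sku_code
join dim_user u                      -- joined second, against the already reduced rows
  on  o.user_id = u.user_id
where o.ds = '${cur_date}'
;
```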

1.3 group by optimization

① When the data volume is large, use group by instead of distinct.

② When using window functions, put the more selective sort keys first.
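Rule ① can be sketched as follows, again using the dwd_log_vst_di table from the UV example in section 3.1. Both statements return the same set of user_id values.

```sql
-- Instead of:
-- select distinct user_id from dwd_log_vst_di where ds = '${cur_date}';

-- Deduplicate with group by:
select
         user_id
from dwd_log_vst_di
where ds = '${cur_date}'
group by user_id
;
```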

2 Parameter optimization

2.1 Map optimization

① Set the memory of each MapTask, in MB: mapreduce.map.memory.mb = 1024.

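2.2 shuffle optimization

2.2.1 join

2.2.1.1 mapjoin

(1) Principle

During a join, mapjoin loads the specified small table into every Map task and performs the join there. This removes the reduce stage and makes the computation faster.

(2) Restrictions

① For a left join the left table must be the large table, and for a right join the right table must be the large table. An inner join has no such requirement, and a full join cannot use mapjoin.
② When the mapjoin hint refers to a small table or a subquery, it must refer to it by its alias.

(3) Usage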

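-- Enable automatic conversion to mapjoin
set hive.auto.convert.join = true;
-- Maximum file size of the small table for mapjoin; the default is 25000000 bytes (about 25 MB)
set hive.mapjoin.smalltable.filesize = 25000000;

-- Note: the hint is ignored while hive.ignore.mapjoin.hint is true
select
        /*+ mapjoin(t2),mapjoin(t3) */
        -- alternative form: /*+ mapjoin(t2,t3) */
        columns
        ...
from t1
[join...] t2
[join...] t3
;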

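(4) Problem

mapjoin causes an out-of-memory error.

Solutions: ① do not use mapjoin; ② lower the maximum small-table file size for mapjoin (hive.mapjoin.smalltable.filesize) so that oversized tables are no longer loaded into memory, or give each Map task more memory.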
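A sketch of session settings that are commonly tried when a mapjoin runs out of memory. The value 2048 is illustrative, not a recommendation, and which option helps depends on where the failure actually occurs.

```sql
-- Option 1: turn off automatic mapjoin conversion for this session
set hive.auto.convert.join = false;

-- Option 2: keep mapjoin but give each Map task more memory
-- (2048 is an illustrative value)
set hive.auto.convert.join = true;
set mapreduce.map.memory.mb = 2048;
```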

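2.3 Reduce optimization

① Set the number of reducers: mapred.reduce.tasks = 1 (named mapreduce.job.reduces in newer Hadoop releases).
② Set the memory of each ReduceTask, in MB: mapreduce.reduce.memory.mb = 1024.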
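The two settings above can be applied per session as sketched below. The hive.exec.reducers.* properties are not mentioned in the original text; they are standard Hive properties added here as an assumption about how the reducer count is usually left to Hive, and all values are illustrative.

```sql
-- Let Hive estimate the reducer count from the input size
-- (-1 means "decide automatically")
set mapred.reduce.tasks = -1;
set hive.exec.reducers.bytes.per.reducer = 256000000;   -- input bytes handled per reducer
set hive.exec.reducers.max = 1009;                      -- upper bound on the reducer count

-- Or fix the count explicitly, as in the text
-- set mapred.reduce.tasks = 1;

-- Memory of each ReduceTask, in MB
set mapreduce.reduce.memory.mb = 1024;
```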

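3 Problems

3.1 Data skew

(1) The join key contains nulls: exclude the nulls from the join, or replace each null with a random value.

(2) A large table is joined with a small table: change the join to a mapjoin.

(3) The group by field holds values of different data types: make the data types consistent.

(4) count distinct with a few fixed special values that occur very often: count the distinct records for the special values and for the remaining values separately, then add the results.

(5) count distinct over a very large data volume: use sum ... group by instead.

-- UV (unique visitors) per product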
select
         sku_code
        ,sum(uv) as vst_uv
from (
        select
                 sku_code
                ,user_id
                ,1 as uv
        from dwd_log_vst_di
        where ds = '${cur_date}'
        group by     sku_code
                    ,user_id
    ) as t1
group by sku_code
;
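For case (1) in the list above, the two ways of handling null join keys might look like the sketch below. The dim_user table and its user_name column are hypothetical, and user_id is assumed to be a string column.

```sql
-- Option A: keep null keys out of the join entirely
select
         l.sku_code
        ,u.user_name
from dwd_log_vst_di l
join dim_user u
  on  l.user_id = u.user_id
where l.ds = '${cur_date}'
  and l.user_id is not null
;

-- Option B: keep the rows, but scatter the null keys across reducers
select
         l.sku_code
        ,u.user_name
from dwd_log_vst_di l
left join dim_user u
  on  case when l.user_id is null
           then concat('null_', cast(rand() as string))   -- never matches a real user_id
           else l.user_id
      end = u.user_id
where l.ds = '${cur_date}'
;
```

Option B only makes sense with an outer join, since the randomized keys are meant to match nothing and exist purely to spread the rows out.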

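3.2 Small files

3.2.1 Impact

① Small files slow down Map task startup: each small file is handled by its own instance, which wastes resources and hurts overall performance.

② Too many small files generate a large amount of metadata, which consumes resources.

3.2.2 Causes and solutions

(1) A dynamically partitioned table contains a large number of partitions

Solutions:

① Check whether the choice of partition field is reasonable;

② Set a lifecycle on the partitions so that expired, unused data is cleaned up automatically.

(2) A data integration tool uploads small files frequently

Symptom: the task spends too long in the "SQLTask is splitting data sources" step.

Solutions:

① Avoid uploading small files frequently; accumulate the data and upload it in one larger batch;

② Merge small files on a regular schedule.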

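-- MaxCompute syntax: maximum number of files merged per job; the default is 50000 and the maximum is 1000000
set odps.merge.max.filenumber.per.job = 50000;
alter table <table_name> [partition (<partition_spec>)] merge smallfiles;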
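On open-source Hive, small-file merging is usually controlled through the hive.merge.* properties. The sketch below uses illustrative values, and the concatenate statement assumes that dwd_log_vst_di is stored as ORC or RCFile, which the original text does not state.

```sql
set hive.merge.mapfiles = true;                  -- merge the outputs of map-only jobs
set hive.merge.mapredfiles = true;               -- merge the outputs of map-reduce jobs
set hive.merge.smallfiles.avgsize = 16000000;    -- merge when the average output file is smaller than this (bytes)
set hive.merge.size.per.task = 256000000;        -- target size of the merged files (bytes)

-- For ORC or RCFile tables, an existing partition can be compacted in place
alter table dwd_log_vst_di partition (ds = '${cur_date}') concatenate;
```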

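3.3 Slow joins

(1) Cause: when a large table is joined with a small table on the reduce side, the shuffle stage takes a long time.

Solution: use mapjoin.

(2) Cause: the join processes a large volume of data

Solutions:

① Increase the number of join workers with odps.stage.joiner.num (a MaxCompute parameter).

② Increase the memory of each join worker with odps.stage.joiner.mem (a MaxCompute parameter); the range is 256 MB to 12288 MB and the default is 1024 MB.

(3) Cause: the join condition is complex, for example too many join fields, or join fields that are transformed during the join

Solutions:

① Remove unnecessary join fields;

② Prepare the join fields in advance.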
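Solution ② above, preparing the join fields in advance, can be sketched like this. The dim_sku table and sku_name column are hypothetical, and lower(trim(...)) stands in for whatever cleanup the join key really needs.

```sql
-- Avoid: a function applied to the join key inside the join condition
-- select ... from t1 join t2 on lower(trim(t1.sku_code)) = t2.sku_code;

select
         a.sku_code
        ,b.sku_name
from (
        select
                 lower(trim(sku_code)) as sku_code   -- normalize once, before the join
        from dwd_log_vst_di
        where ds = '${cur_date}'
    ) a
join dim_sku b
  on  a.sku_code = b.sku_code
;
```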

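(4) Cause: a fact table is joined with a large dimension table

Solution: extract the most frequently used fields from the large dimension table into a mini dimension table and join with that instead.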

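4 Statement errors

4.1 group by

Semantic analysis exception - column reference <table_name>.<column_name> should appear in GROUP BY key
Cause: a selected column is neither a grouping column nor wrapped in an aggregate function.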
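A small sketch of the failing pattern and two ways to fix it, using the dwd_log_vst_di table from section 3.1. The aliases pv and any_user_id are made up for the example.

```sql
-- Fails: user_id is neither in GROUP BY nor aggregated
-- select sku_code, user_id, count(1) from dwd_log_vst_di where ds = '${cur_date}' group by sku_code;

-- Fix 1: add the column to GROUP BY
select sku_code, user_id, count(1) as pv
from dwd_log_vst_di
where ds = '${cur_date}'
group by sku_code, user_id
;

-- Fix 2: aggregate the column
select sku_code, max(user_id) as any_user_id, count(1) as pv
from dwd_log_vst_di
where ds = '${cur_date}'
group by sku_code
;
```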

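4.2 union

(1) type mismatch for UNION, left has 2 columns while right has 1 columns

Cause: the two sides of the union return a different number of columns.

(2) Illegal union operation - type mismatch for column 0 of UNION, left is BIGINT while right is STRING
Cause: the corresponding columns on the two sides of the union have different data types.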
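Both errors can be avoided by padding missing columns and casting explicitly, as in the sketch below. The dim_user table is hypothetical, and its user_id is assumed to be a STRING while the log table's user_id is assumed to be a BIGINT.

```sql
-- Both branches return two columns, with matching types in each position
select
         cast(user_id as string) as user_id   -- BIGINT cast to STRING to match the other branch
        ,sku_code
from dwd_log_vst_di
where ds = '${cur_date}'
union all
select
         user_id                              -- assumed to be STRING in this table
        ,cast(null as string) as sku_code     -- placeholder for a column this branch lacks
from dim_user
;
```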

4.3 insert

(1) wrong columns count 1 in data source, requires 23 columns (includes dynamic partitions if any)
Cause: the number of columns in the result set that follows insert into/overwrite does not match the number of columns in the target table.
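A sketch of an insert whose select list matches the target table column for column. The target table dws_sku_uv_di is hypothetical and is assumed to have the columns (sku_code string, vst_uv bigint) and a partition column ds. Because the partition is given statically, the select list must contain exactly the two non-partition columns.

```sql
-- Target table (hypothetical): dws_sku_uv_di(sku_code string, vst_uv bigint) partitioned by (ds string)
insert overwrite table dws_sku_uv_di partition (ds = '${cur_date}')
select
         sku_code
        ,count(distinct user_id) as vst_uv
from dwd_log_vst_di
where ds = '${cur_date}'
group by sku_code
;
```

With a dynamic partition (partition (ds)), the partition column would have to be added as the last column of the select list.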

5 Software problems

(1) hive.metastore.api.MetaException: User root is not allowed to perform this API call.

See CSDN: beeline connection error in Hive, "User: *** is not allowed to impersonate", https://blog.csdn.net/qq_42982169/article/details/83317596

(2) HiveAccessControlException Permission denied: Principal [name=hive, type=USE]does not have following.

Add the following configuration to hive-site.xml:

<property> 
    <name>hive.users.in.admin.role</name> 
    <value>root</value> 
    <description>enable or disable the hive client authorization</description> 
</property>

(3) Chinese characters in Hive comments are garbled
See CSDN: detailed Hive notes, garbled Chinese characters in Hive comments, https://blog.csdn.net/qq_37933018/article/details/106944191
