hive rand函数_Hive实践（高阶篇）

最新推荐文章于 2024-07-11 15:27:16 发布

weixin_39945915

最新推荐文章于 2024-07-11 15:27:16 发布

阅读量1.3k

点赞数

文章标签： hive rand函数

本文链接：https://blog.csdn.net/weixin_39945915/article/details/113076667

版权

本文深入探讨了Hive SQL的执行原理，包括MapReduce作业的各个阶段，以及如何通过业务场景优化、语句本身优化和参数配置优化来提升性能。特别强调了列裁剪、分区裁剪、join优化等策略，并提供了Hive rand函数在解决数据倾斜问题上的应用示例。

摘要由CSDN通过智能技术生成

1.1、Hive SQL执行原理

简单来说HIVE就是一个查询引擎，通过语法分析、语法解析、语法优化等一系列操作将SQL转化为MapReduce作业，MR作业一般会有以下几个过程：

输入：根据输入数据的大小进行分片
MAP阶段：Map任务的个数由输入的分片个数决定；在Map阶段对分片文件中的每行进行检查过滤，并按指定的列保存到bending文件。
Combine阶段：将Map阶段的输出文件进行一定程度的合并。
Shuffle阶段：Map任务的输出必须经过一个名叫Shuffle的阶段才能交给Reduce任务去处理。Shuffle过程包含了分区、排序、分隔、复制和合并等过程
Reduce阶段：对于group by句，这里需要调用reduce函数逻辑将数据按照group by的字段进行汇总，并保留文件到bending中。
输出文件

1.2、Hive SQL优化总结

常用SQL优化手段，分几个方面：

1.2.1 业务场景优化

好的模型设计事半功倍
了解数据分布，动手通过业务逻辑精确有效解决数据倾斜

1.2.2 语句本身优化

列裁剪和分区裁剪
把握整体减少 job 数，单个作业最优不如整体最优
慎用distinct，尽量用group by替换
利用with as语句减少重复查询
join优化

。优先过滤后再进行Join操作,减少参与join的数据量

。union all代替join并行执行

。多表join时，尽量用相同key

。空值、无意义值过滤或者用随机数打散避免倾斜

1.2.3 参数配置优化

合理 MapReduce 的task数
设置合理文件数，合并小文件
利用 mapjoin特性
map端聚合减少shuffle数据量
group by、join倾斜均衡参数
启用压缩减少中间结果数据和输出数据
合理限制资源使用量避免任务互相影响,尤其tez引擎资源

1.3、Hive SQL优化示例

1.3.1 参数设置优化

---普通join转为mapjoin优化set hive.auto.convert.join=true;set hive.mapjoin.smalltable.filesize=2500000;--group by 优化set hive.map.aggr=true;--map端聚合set hive.groupby.skewindata=true;--小文件合并set hive.merge.mapfiles = true;set hive.merge.mapredfiles= true;set hive.merge.tezfiles=true;set hive.merge.size.per.task = 256000000;set hive.merge.smallfiles.avgsize=256000000;--增加并行度，不相互依赖任务并行执行set hive.exec.parallel=true; set hive.exec.parallel.thread.number=8; --适当设置也不是越多越好。tez有stage并行优化--reduce个数设置，根据任务合理设置set hive.exec.reducers.max = 100; --reduce数set hive.exec.reducers.bytes.per.reducer=500000000; --每个reduce处理的数据量---内存调试， AppMaster的container以及AppMaster向RM申请的container大小set tez.am.resource.memory.mb = 2048;set hive.tez.container.size = 2048;--map数设置，减少提高并发，增大提高吞吐量，一般建议1Gset tez.grouping.min-size = 1024000000;set tez.grouping.max-size = 1024000000;--tez容器资源重用，避免重复申请set tez.am.container.reuse.enabled=true;--大部分是有reduce阶段，建议开启中间过程压缩set hive.exec.compress.intermediate=true;set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

1.3.2 逻辑优化

1) 按需取数据，指定获取的列、指定取数据的分区

--按需取数据，指定获取的列、指定取数据的分区select    ptdate, pay_type  --获取需要的字段，不需要不from  tmp.dualWHERE ptdate = '2020-09-21' --指定查询数据的分区，一般是日期 and type = 'xxx'    --过滤需要的数据

2)cube 以及 distinct的优化

--优化前2-3小时--优化后45min左右select id    , nvl(type, "合计") type    , nvl(plat, "合计")plat    , nvl(is_new, "合计")is_new    , nvl(flag, "合计") flag    , nvl(name, "合计")name    , scene    , count(user_id) exposure_product_uv    , sum(exposure_product_pv) exposure_product_pv    , sum(unique_product_exposure_pv) unique_product_exposure_pvfrom(select     id    , nvl(type, "合计") type    , nvl(plat, "合计")plat    , nvl(is_new, "合计")is_new    , nvl(flag, "合计")flag    , nvl(name, "合计")name    , scene    , user_id    , sum(pv) exposure_product_pv    , count(1) unique_product_exposure_pvfrom (    select id        , nvl(type, "合计") type        , nvl(plat, "合计")plat        , nvl(is_new, "合计")is_new        , nvl(flag, "合计")flag        , nvl(name, "合计")name        , scene        , user_id        , request_id        , count(1) pv    from tmp_exp_log    where event = 'exposure_product'    group by id        , flag        , name        , is_new        , plat        , scene        , type        , user_id        , request_id    grouping sets(        (id, plat, is_new, flag, name, scene, user_id, request_id)        , (id, plat, is_new, scene, user_id, request_id)        , (id, plat, scene, user_id, request_id)        , (id, is_new, scene, user_id, request_id)        , (id, name, scene, user_id, request_id)        , (id, flag, scene, user_id, request_id)        , (id, scene, user_id, request_id)        , (id, plat, is_new, flag, name, scene, type, user_id, request_id)        , (id, plat, is_new, scene, type, user_id, request_id)        , (id, plat, scene, type, user_id, request_id)        , (id, is_new, scene, type, user_id, request_id)        , (id, name, scene, type, user_id, request_id)        , (id, flag, scene, type, user_id, request_id)        , (id, scene, type, user_id, request_id)    )  )bgroup by   id   , flag   , name   , is_new   , plat   , scene   , type   , user_id) agroup by id    , flag    , name    , is_new    , plat    , scene    , type;

3)关联字段加随机数，避免特殊值导致的笛卡尔积

--优化前1个多小时跑不完--优化后20min出结果select  l.*,   ......from    tmp.dual1  lleft join tmp.dual2  u   on (l.id = u.id and u.date = '2020-11-01')left join tmp.dual3 p   on (if(l.pcode='-',rand(10),l.pcode) = p.pcode)left join tmp.dual4 w   on (if(l.wcode='-',rand(10),l.wcode) = w.wcode)left tmp.dual5  el   on (if(l.eid='-',rand(10),l.eid) = el.eid)where l.date = '2020-11-01'

4)join的子查询维度一致可以改为union all+group by 减少关联，并行执行

select   date,   sum(amt) as amt,   sum(uv) as uv from   (    select       f.date,       sum(f.amt) as amt,       0 as uv     from       tmp.dual1 f     where       1 = 1       and f.date = '2020-10-11'     group by       f.date     union all     select       f.date,       0 as amt,       count(distinct f.user_id) as uv     from       tmp.dual2 f     where       1 = 1       and f.date = '2020-10-11'     group by       f.date  ) unionTable group by   date

总结：Hive SQL的优化就是通过各种方式避免数据倾斜、数据冗余、job或IO过多，高效利用集群的并发特性。

weixin_39945915

关注

0
点赞
踩
7

收藏

觉得还不错? 一键收藏
0
评论
hive rand函数_Hive实践（高阶篇）

1.1、Hive SQL执行原理简单来说HIVE就是一个查询引擎，通过语法分析、语法解析、语法优化等一系列操作将SQL转化为MapReduce作业，MR作业一般会有以下几个过程：输入：根据输入数据的大小进行分片MAP阶段：Map任务的个数由输入的分片个数决定；在Map阶段对分片文件中的每行进行检查过滤，并按指定的列保存到bending文件。Combine阶段：将Map阶段的输出文件进行...
复制链接

扫一扫