Hive Engineering Practice

scying1

Article link: https://blog.csdn.net/scying1/article/details/81020897

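I have recently been working on a toB project in which data is computed offline and then pushed to the online business database, with Hive used for the offline analysis. This post summarizes the common problems and lessons learned.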

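I. Engineering scope: the data-statistics work broadly breaks down into four steps: metric computation, batch scripts, data format, and exception handling.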

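step1. Metric computation: create a table to store the value of each metric; for example, the Hive table loan_apply_rate stores the application approval rate. The complexity lies in the large number of metric values and in metric definitions that may be ambiguous.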
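A minimal sketch of step1, for illustration only. The table name loan_apply_rate comes from the text above; the source table loan_apply, the column apply_status, its value 'PASS', and the dt partition layout are all assumptions and not part of the original project.

```sql
-- Hypothetical metric table: one partition per statistics date.
CREATE TABLE IF NOT EXISTS loan_apply_rate (
  apply_cnt BIGINT,   -- number of applications (assumed definition)
  pass_cnt  BIGINT,   -- number of approved applications (assumed definition)
  pass_rate DOUBLE    -- pass_cnt / apply_cnt
)
PARTITIONED BY (dt STRING);

-- INSERT OVERWRITE replaces the whole partition, so re-running the
-- statement for the same date does not duplicate rows.
INSERT OVERWRITE TABLE loan_apply_rate PARTITION (dt = '2018-12-19')
SELECT
  COUNT(1) AS apply_cnt,
  SUM(CASE WHEN apply_status = 'PASS' THEN 1 ELSE 0 END) AS pass_cnt,
  SUM(CASE WHEN apply_status = 'PASS' THEN 1 ELSE 0 END) / COUNT(1) AS pass_rate
FROM loan_apply            -- hypothetical source table
WHERE dt = '2018-12-19';
```

Writing the numerator and denominator next to the rate makes an ambiguous metric definition easier to review with the business side.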

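step2. Batch scripts: combine the tables created in step1 into Perl scripts that run as a batch. The complexity is that a long run time affects the business side, so test to find a reasonable script size. Large scripts can be split vertically (for example, one script for application metrics and one for withdrawal metrics) or horizontally (for example, the vintage metric depends on a lot of intermediate logic, so part of that logic can be extracted into an intermediate table that the final vintage metric then depends on).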

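step3. Data format: create one summary table that stores all metric values, and convert the tables produced in step2 into the data format the business side expects (a step2 metric can be converted into several expected formats, so metrics are reused). (The example figure from the original post is not available.)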
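One possible shape for such a summary table is a long format with one row per metric. This is only a sketch: metric_summary and its columns are hypothetical names, and loan_apply_rate is assumed to have the columns apply_cnt and pass_rate and a dt partition.

```sql
-- Hypothetical summary table: one row per metric per statistics date.
CREATE TABLE IF NOT EXISTS metric_summary (
  metric_name  STRING,
  metric_value DOUBLE
)
PARTITIONED BY (dt STRING);

-- Each per-metric table contributes one or more rows; the UNION ALL is
-- wrapped in a subquery so it also works on older Hive versions.
INSERT OVERWRITE TABLE metric_summary PARTITION (dt = '2018-12-19')
SELECT u.metric_name, u.metric_value
FROM (
  SELECT 'loan_apply_rate' AS metric_name, pass_rate AS metric_value
  FROM loan_apply_rate
  WHERE dt = '2018-12-19'
  UNION ALL
  SELECT 'loan_apply_cnt' AS metric_name, CAST(apply_cnt AS DOUBLE) AS metric_value
  FROM loan_apply_rate
  WHERE dt = '2018-12-19'
) u;
```

With a long format, a new metric is a new row rather than a new column, so the table pushed to the business database does not need a schema change for each metric.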

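step4. Exception handling: this covers parent and child tasks in the batch scripts running out of order, rolling back or recomputing when the current day's statistics are abnormal, and data deduplication and data backup.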
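A hedged sketch of the backup and rollback part, assuming a hypothetical table metric_summary with columns metric_name and metric_value, partitioned by dt. The backup table name is also an assumption.

```sql
-- Keep a copy of the partition before it is recomputed.
-- CREATE TABLE ... LIKE copies the schema, including the partition column.
CREATE TABLE IF NOT EXISTS metric_summary_bak LIKE metric_summary;

INSERT OVERWRITE TABLE metric_summary_bak PARTITION (dt = '2018-12-19')
SELECT metric_name, metric_value
FROM metric_summary
WHERE dt = '2018-12-19';

-- Roll back the abnormal day by dropping its partition; the daily
-- INSERT OVERWRITE job can then be re-run for that date.
ALTER TABLE metric_summary DROP IF EXISTS PARTITION (dt = '2018-12-19');
```

Because every step writes with INSERT OVERWRITE into a date partition, recomputing a day replaces the data instead of appending to it, which also avoids duplicates.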

II. Hive scope

1. Common optimizations

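1.1 Rule: if you only filter on rn=1, that is, you only need the maximum or minimum, there is no need for row_number. Example: find the order with the largest credit amount in the application table.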

case1: select * from a where dt='2018-12-19' order by loan_amount desc limit 1; (map 70s, reduce 400s) (common but inefficient)

case2: select * from (select *, max(loan_amount) over () la from a where dt='2018-12-19') a where la=loan_amount; (map 70s, reduce 1300s) (common but inefficient)

case3: select * from (select *, row_number() over(sort by loan_amount desc) rn from a where dt='2018-12-19') a where rn=1; (map 70s, reduce 9000s, timeout)

case4: select * from (select * from a where dt='2018-12-19') a join (select max(loan_amount) la from a where dt='2018-12-19') b on a.loan_amount=b.la; (map 70s, map 70s, reduce 2s)

case5: select * from (select max(struct(loan_amount,apply_no)) la from a where dt='2018-12-19') b; (map 70s, reduce 2s) Note that loan_amount must be the first field of the struct: structs are compared field by field from left to right, so the first field decides which row is the maximum.
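The struct approach returns a single struct value, so the fields still have to be unpacked. A minimal sketch, assuming the table a has the columns loan_amount and apply_no and that the amount is placed first in the struct so that it drives the comparison:

```sql
-- struct() names its fields col1, col2, ... in argument order.
SELECT
  t.la.col1 AS loan_amount,   -- the largest credit amount
  t.la.col2 AS apply_no       -- the order that carries it
FROM (
  SELECT MAX(STRUCT(loan_amount, apply_no)) AS la
  FROM a
  WHERE dt = '2018-12-19'
) t;
```

If several orders share the largest amount, this returns only the one with the largest apply_no, whereas the join in case4 returns all of them.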

1.2 Rule: replace distinct

case1: select count(distinct(user_jrid)) from user where dt='2018-12-19'; (completion time: 800s) (common but inefficient: the deduplication is O(n log n) and runs in a single reducer)

case2: select 1,count(1) from (select user_jrid from a where dt='2018-12-31' group by user_jrid) a; (deduplicates in parallel via group by; completion time: 80s) (still O(n log n), but multiple reducers can run in parallel)

1.3 Complexity of each stage: (the figure from the original post is not available)

2. UDF

Setting a date to the end of the month (statistics_date is a year-month string such as 2018-12):

2.1 case when cast(split(statistics_date,'-')[1] as int) in (1,3,5,7,8,10,12) then concat(statistics_date,'-31')
when cast(split(statistics_date,'-')[1] as int) in (4,6,9,11) then concat(statistics_date,'-30')
when cast(split(statistics_date,'-')[1] as int)=2 and (cast(split(statistics_date,'-')[0] as int)%400=0 or (cast(split(statistics_date,'-')[0] as int)%4=0 and cast(split(statistics_date,'-')[0] as int)%100!=0)) then concat(statistics_date,'-29')
when cast(split(statistics_date,'-')[1] as int)=2 then concat(statistics_date,'-28') end as new_statistics_date
(The month is cast to int so that both '3' and '03' match, and the leap-year test includes the century rule.)

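2.2 date_sub(concat(substr(created_date, 1, 7), '-01'), 1)
(This takes the first day of the month of created_date and subtracts one day, so it returns the last day of the previous month.)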
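On Hive 1.1.0 and later, the built-in functions last_day and add_months cover both cases without hand-written month tables. A short sketch with literal dates; whether the project's Hive version had these functions is not known from the post.

```sql
-- Last day of the month for a 'yyyy-MM' string (equivalent of 2.1).
SELECT last_day(concat('2018-02', '-01'));        -- 2018-02-28
SELECT last_day(concat('2016-02', '-01'));        -- 2016-02-29 (leap year)

-- Last day of the previous month for a full date (equivalent of 2.2).
SELECT last_day(add_months('2018-12-19', -1));    -- 2018-11-30
```

The hand-written expressions remain useful on older Hive versions where these functions are missing.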

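3. Common functions

3.1 Rows to columns: collect_set/collect_list (the result is an array, array<string> for a string column); concat_ws can join the result of collect_set, e.g. concat_ws(',', collect_set(col)).

case1: with the products in their default order, aggregate the products into a single row.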
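A minimal sketch of this case. The table user_product and its columns user_id and product_name are hypothetical names used only for illustration.

```sql
-- One output row per user, with that user's products joined by commas.
-- collect_set removes duplicates; collect_list would keep them.
SELECT
  user_id,
  concat_ws(',', collect_set(product_name)) AS products
FROM user_product
GROUP BY user_id;

-- The element order of collect_set is not guaranteed; wrap it in
-- sort_array when a stable order is needed.
SELECT
  user_id,
  concat_ws(',', sort_array(collect_set(product_name))) AS products
FROM user_product
GROUP BY user_id;
```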

3.2 Columns to rows: lateral view explode/posexplode

case1: select v from (select split('1 2 3 4 5 6 7 8 9 0',' ') v1 ) t1 lateral view explode(v1) t2 as v;

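case2: select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'), t.pos + 1) as biz_date from (select posexplode(split(space(29),' ')) as (pos, val)) t; This generates one row for each of the past 30 days (splitting 29 spaces yields 30 elements) and was used to compute the daily application and withdrawal metrics over the past 30 days; the figure from the original post is not available. (If group by were used instead, every selected field would need to go through collect_set; this statement selects many fields, so that would be cumbersome.)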
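The generated dates are typically used as a date spine that a daily aggregate is joined to, so that days without data still appear. A sketch under assumptions: the table cash_apply is mentioned elsewhere in this post, but its created_date column and the metric computed here are hypothetical, and current_date requires Hive 1.2 or later.

```sql
SELECT
  d.biz_date,
  COALESCE(c.cash_cnt, 0) AS cash_cnt       -- 0 for days with no rows
FROM (
  -- 30 rows: yesterday back to 30 days ago.
  SELECT date_sub(current_date, t.pos + 1) AS biz_date
  FROM (SELECT split(space(29), ' ') AS arr) s
  LATERAL VIEW posexplode(arr) t AS pos, val
) d
LEFT JOIN (
  SELECT to_date(created_date) AS biz_date, COUNT(1) AS cash_cnt
  FROM cash_apply
  GROUP BY to_date(created_date)
) c
ON d.biz_date = c.biz_date;
```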

3.3 select * from (select *,row_number() over(partition by cash_id order by modified_date desc) as rn from cash_apply) a where rn=1; The withdrawal table is an incremental table, and the statement above retrieves the latest record for each cash_id.

3.4 Others: instr; months_between.
order by, sort by, distribute by, cluster by: see https://blog.csdn.net/zhanglh046/article/details/78572939
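A few short illustrations of the items listed above, as a sketch only. The literals are arbitrary, and cash_apply with the columns cash_id and modified_date is reused from 3.3.

```sql
-- instr returns the 1-based position of the first match, 0 if absent.
SELECT instr('loan_apply', 'apply');                  -- 6

-- months_between(date1, date2) returns date1 - date2 in months.
SELECT months_between('2018-12-19', '2018-11-19');    -- 1.0

-- ORDER BY: total order over the whole result, done by a single reducer.
SELECT * FROM cash_apply ORDER BY modified_date DESC LIMIT 100;

-- DISTRIBUTE BY + SORT BY: rows with the same cash_id go to the same
-- reducer, and each reducer sorts only its own rows.
SELECT * FROM cash_apply
DISTRIBUTE BY cash_id
SORT BY cash_id, modified_date DESC;

-- CLUSTER BY x is shorthand for DISTRIBUTE BY x SORT BY x (ascending).
SELECT * FROM cash_apply CLUSTER BY cash_id;
```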