Hive SQL pitfalls and notes

Q010910

Last modified 2024-07-24 22:24:20


Article tags: hive sql hadoop

First published 2024-07-23 20:53:44

Article link: https://blog.csdn.net/m0_63190465/article/details/140593582


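Hive's GROUP BY does not recognize SELECT aliases: put the full expression behind the alias into the GROUP BY clause instead.
Every non-aggregated column in the SELECT list must appear in GROUP BY.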

select col1, col2, collect_set(col3),
sum(is_drawback) as order_cnt,
count(1) as xxx
FROM xxx.table
group by col1, col2
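
To make the alias rule concrete, here is a minimal sketch. The table orders and its columns order_time and amount are hypothetical, not from the original notes; order_time is assumed to be a string such as '2024-07-23 20:53:44'.

```sql
-- Hypothetical table: orders(order_id, order_time, amount)
SELECT substr(order_time, 1, 7) AS order_month,   -- e.g. '2024-07'
       sum(amount)              AS total_amount
FROM orders
GROUP BY substr(order_time, 1, 7);   -- repeat the expression; GROUP BY order_month would not resolve
```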

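Both insert into and insert overwrite write data into a Hive table. insert into appends the new rows to the existing data, while insert overwrite rewrites the data, that is, it deletes the old data first and then writes the new data. If the table is partitioned, insert overwrite rewrites only the partitions being written to.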

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)
SELECT ..., se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
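
The difference between the two statements can be sketched on a single static partition. The tables sales_daily (partitioned by dt) and staging_sales, and their columns, are assumptions made up for this illustration.

```sql
-- Hypothetical tables:
--   sales_daily(order_id, amount) PARTITIONED BY (dt STRING)
--   staging_sales(order_id, amount, dt)

-- Appends rows to partition dt='2024-07-23'; running it twice duplicates the rows.
INSERT INTO TABLE sales_daily PARTITION (dt = '2024-07-23')
SELECT order_id, amount
FROM staging_sales
WHERE dt = '2024-07-23';

-- Replaces the contents of partition dt='2024-07-23' only;
-- other partitions of sales_daily are left untouched.
INSERT OVERWRITE TABLE sales_daily PARTITION (dt = '2024-07-23')
SELECT order_id, amount
FROM staging_sales
WHERE dt = '2024-07-23';
```

Because the overwrite is repeatable, it is usually the safer choice for jobs that may be rerun for the same date.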

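For complex SQL such as select * from a join b join c, prefer create temporary table when the result is used more than once. A join like this is expensive, so it is better to run it only once.

If the goal is only to simplify the SQL, for example a simple query on selected columns such as select a,b,c,d,e,f,g,h from t, the with tmp as () syntax is still the better choice. Hive already runs such queries quickly, and materializing the data first would only add time.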
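
A rough sketch of both options follows. The join keys and the columns val_b and val_c are assumptions added for illustration; the original only names the tables a, b, c and t.

```sql
-- Option 1: materialize an expensive join once, then reuse it within the session.
-- Assumes a, b and c share a join key named id (hypothetical).
CREATE TEMPORARY TABLE tmp_joined AS
SELECT a.id, b.val_b, c.val_c
FROM a
JOIN b ON a.id = b.id
JOIN c ON a.id = c.id;

SELECT count(*) FROM tmp_joined;
SELECT val_b, count(*) AS cnt FROM tmp_joined GROUP BY val_b;

-- Option 2: a CTE only names a subquery for readability; nothing is written to storage.
WITH tmp AS (
    SELECT a, b, c
    FROM t
)
SELECT a, count(*) AS cnt
FROM tmp
GROUP BY a;
```

A temporary table is visible only to the session that created it and is dropped when the session ends.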

case when

select
dname ,
sum(case when gender='男' then 1 else 0 end) as m_cnts ,
sum(case when gender='女' then 1 else 0 end) as f_cnts ,
case when dname='A' then '教学部' else '后勤部' end   as ch
from
tb_case_when_demo
group by dname ;
 
+--------+---------+---------+------+
| dname  | m_cnts  | f_cnts  |  ch  |
+--------+---------+---------+------+
| A      | 2       | 1       | 教学部  |
| B      | 1       | 2       | 后勤部  |
+--------+---------+---------+------+

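Median
Use percentile or percentile_approx. These functions compute quantiles, and the 0.5 quantile is the median.
For integer columns (int, bigint, etc.) use percentile, e.g. select percentile(item_a, 0.5) from table_a;
For floating-point columns (float, double, etc.) use percentile_approx, e.g. select percentile_approx(item_a, 0.5) from table_a;
approx is short for approximate.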
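
As a slightly fuller sketch, both functions also accept an array of quantiles, and percentile_approx takes an optional third argument that trades memory for accuracy. The columns item_int and item_dbl are hypothetical stand-ins for an integer and a double column of table_a.

```sql
-- Hypothetical columns: item_int (BIGINT) and item_dbl (DOUBLE) in table_a
SELECT percentile(item_int, 0.5)                       AS median_exact,
       percentile(item_int, array(0.25, 0.5, 0.75))    AS quartiles_exact,
       percentile_approx(item_dbl, 0.5)                AS median_approx,
       percentile_approx(item_dbl, 0.5, 100000)        AS median_approx_finer  -- larger value, better accuracy, more memory
FROM table_a;
```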

Month-over-month and year-over-year comparison

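-- Aggregate per calendar month first, then compare each month with earlier rows.
-- LAG by row offset assumes every month has at least one row.
SELECT
    transdt_m,
    active_users,
    avg_tx_amount,
    transaction_cnt,
    LAG(active_users, 12) OVER (ORDER BY transdt_m) AS lag_active_users,       -- same month last year
    LAG(transaction_cnt, 1) OVER (ORDER BY transdt_m) AS lag_transaction_cnt,  -- previous month
    active_users / LAG(active_users, 12) OVER (ORDER BY transdt_m) - 1 AS active_users_rate,         -- year-over-year
    transaction_cnt / LAG(transaction_cnt, 1) OVER (ORDER BY transdt_m) - 1 AS transaction_cnt_rate  -- month-over-month
FROM (
    SELECT
        date_format(transdt, 'yyyy-MM') AS transdt_m,  -- year plus month; MONTH() alone would merge different years
        COUNT(DISTINCT card_no) AS active_users,
        AVG(txamountrmb) AS avg_tx_amount,
        COUNT(*) AS transaction_cnt
    FROM tablename
    WHERE channel = 'Outbound'
    GROUP BY date_format(transdt, 'yyyy-MM')
) monthly
ORDER BY transdt_m;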

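having filters the data after group by has formed the groups, so it can only filter on grouping columns or aggregate functions.
where filters directly on the columns of the source table, so it cannot come after group by and cannot use aggregate functions.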
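
The two filters can be seen side by side in one query. The table orders and its columns user_id, status and amount are hypothetical.

```sql
-- Hypothetical table: orders(user_id, status, amount)
SELECT user_id,
       sum(amount) AS total_amount
FROM orders
WHERE status = 'paid'          -- row-level filter, applied before grouping
GROUP BY user_id
HAVING sum(amount) > 1000;     -- group-level filter, applied after aggregation
```
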
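With large data volumes, use count(distinct), group by and join sparingly, because they can cause data skew. Filter first and join afterwards to minimize the amount of data that takes part in the join; ideally, join a small table with a large one.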
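
One way to apply the filter-then-join advice is to push the filters into subqueries, as in this sketch. The tables big_orders and small_users and their columns are assumptions made up for illustration; hive.auto.convert.join is a standard Hive setting that lets the optimizer run a join against a small table on the map side.

```sql
-- Hypothetical tables: big_orders(user_id, dt, amount), small_users(user_id, region)
SET hive.auto.convert.join=true;   -- allow Hive to convert a small-table join into a map-side join

SELECT u.region,
       sum(o.amount) AS total_amount
FROM (
    SELECT user_id, amount
    FROM big_orders
    WHERE dt = '2024-07-23'        -- cut the large table down before it reaches the join
) o
JOIN (
    SELECT user_id, region
    FROM small_users
    WHERE region IS NOT NULL
) u
ON o.user_id = u.user_id
GROUP BY u.region;
```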
