HIVE SQL 高频(持续更新)

整理自数据分析常考面试题100题

1. 连续日期

1、为日期排序
 row_number() over (partition by use_id order by date) as rank

2、求日期和排序的差值

3、求 diff 出现最多的次数 max(count(diff))

例1:

- 每个用户一周活跃天数 -
SELECT user_id,
       COUNT(1) AS active_day
FROM (SELECT user_id, date 
      FROM table
      WHERE TIMESTAMPDIFF(DAY,date,CURRENT_DATE) <= 7
      GROUP BY user_id, date) a  -- 这里还要加GROUP BY date是为了去重 --
GROUP BY user_id;

- 每个用户一周内最大连续活跃天数 -
WITH recent_day AS(
SELECT user_id, date
FROM table
WHERE TIMESTAMPDIFF(DAY, date, CURRENT_DATE) <= 7
GROUP BY user_id, date
),
diff_table AS(
SELECT user_id,
       ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY date) AS rank,
       DATE_SUB(date, ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY transdt)) AS gap
       -- 连续活跃的话,gap会一样 -- 
       FROM recent_day
),
gap_cnt AS(
SELECT user_id, count(gap) AS count
FROM diff_table
GROUP BY user_id, gap
)
SELECT user_id, MAX(count)
FROM gap_cnt
GROUP BY user_id

* TIMESTAMPDIFF(unit, start, end) ; DATEDIFF(end, start)

例2:

- 截至当前,每个用户已经连续签到的天数 - 
SELECT user_id,
DATEDIFF(CURRENT_DATE, no_sign_in) AS consecutive_days
FROM (SELECT user_id, MAX(date) AS no_sign_in
      FROM table
      WHERE is_sign_in = 0
      GROUP BY user_id
      ) 

思路:找到最近的“中断”点:找到最近一次没有签到的日期,那么在这之后的所有天数都可以被视为连续签到的天数。

2.时间间隔问题

1. 为日期排序

row number() over(partition by id order by date) as rank

2、错位相减,来实现“相邻” 要求, 求日期和排序的差值( diff)

3、根据题目要求,求出相应指标· max(时间差)· count(*) · min(时间差)

- 每个用户相邻两次浏览时间之差小于 3min 的次数 - 
WITH rank_table AS(
SELECT user_id, user_time,
       ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY user_time) AS rk
FROM table 
),
misaligned_sub AS(
SELECT user_id, a.user_time, b.user_time
FROM rank_table a
JOIN rank_table b ON a.user_id = b.user_id AND a.rk = b.rk - 1
)
SELECT user_id, COUNT(1) AS times
FROM misaligned_sub
WHERE TIMESTAMPDIFF(MINUTE, a.user_time, b.user_time) <= 3
GROUP BY user_id

延申:同一个用户购买同一个商品相邻两次的时间间隔

-同一个用户购买同一个商品相邻两次的时间间隔-
WITH ranked_purchases AS (
    SELECT user_id, 
           product_id, 
           purchase_time, 
           ROW_NUMBER() OVER (PARTITION BY user_id, product_id ORDER BY purchase_time) AS rk
    FROM purchases_table
)
SELECT a.user_id, 
       a.product_id, 
       a.purchase_time AS current_purchase_time, 
       b.purchase_time AS next_purchase_time,
       TIMESTAMPDIFF(MINUTE, b.purchase_time, a.purchase_time) AS time_interval_minutes
FROM ranked_purchases a
JOIN ranked_purchases b 
ON a.user_id = b.user_id 
   AND a.product_id = b.product_id 
   AND a.rk = b.rk - 1;

3. 求累加

SELECT user_id, 
       date, 
       amount, 
       SUM(amount) OVER (PARTITION BY user_id ORDER BY date) AS cumulative_amount
FROM table
  • ROWS BETWEEN: 基于行的位置来定义窗口帧,精确匹配行。
  • RANGE BETWEEN: 基于值范围来定义窗口帧,可能包含具有相同排序值的多行。range表示的是 具体的值,比这个值小n的行,比这个值大n的行
  • sum(close) range between 100 preceding and 200 following: 如当前行的 close 字段值是 200,那么这个窗口大小的定义就会选择分区中 close 字段值落在 100 至 400 区间的记录(行)。

unbounded preceding 前面所有行

unbounded following 后面所有行

current row 当前行

n following 后面n行

n preceding 前面n行

4. LEAD and LAG

lag :用于统计窗口内往上第 n 行值

lead :用于统计窗口内往下第 n 行值

lag 和 lead 有三个参数,第一个参数是列名,第二个参数是偏移量,第三个参数是超出记录窗口时的默认值

例1:有三列user,time和url,获取用户在某个页面停留时长

SELECT userid, url,
       UNIX_TIMESTAMP(LEAD(time,1) OVER(PARTITION BY userid ORDER BY time)) - UNIX_TIMESTAMP(time) AS period
FROM user_log

例2:寻找至少连续出现 3 次的数字

SELECT DISTINCT num
FROM (SELECT 
        num,
        LAG(num,1) OVER() AS lag1,
        LAG(num,2) OVER() AS lag2
      FROM table
      ) a
WHERE num = lag1 AND num = lag2

5.行/列转换

多行转一行:

SELECT Product,
CONCAT_WS(',',COLLECT_SET(Supplier)) AS Supplier
FROM mytable
GROUP BY Product

*collect_set 函数,有两个作用,第一个是去重,去除 group by 后的重复元素,第二个是形成一个集合,将 group by 后属于同一组的集合起来成为一个集合。

一行转多行:

SELECT Product, Supplier
FROM mytable
LATERAL VIEW EXPLODE(SPLIT(Supplier,',')) tb1 AS Supplier

6.去除最高最低的平均

计算除去部门最高工资,和最低工资的平均工资(字节跳动面试)

SELECT a.Department, AVG(a.Salary) AS avg_salary
FROM (
    SELECT 
        Department,
        Salary,
        RANK() OVER(PARTITION BY Department ORDER BY Salary ASC) AS asc_rk,
        RANK() OVER(PARTITION BY Department ORDER BY Salary DESC) AS desc_rk
    FROM Salary_Info
) a
WHERE a.asc_rk > 1  -- 去除最低的薪资
  AND a.desc_rk > 1 -- 去除最高的薪资
GROUP BY a.Department;

核心是使用窗口函数 RANK 分别对工资 salary 进行升序和降序排列,就获得了该分组内最低和最高的工资,过滤掉这 2 条记录

rank()排序相同时会重复,总数不变,即会出现1、1、3这样的排序结果;

dense_rank()排序相同时会重复,总数会减少,即会出现1、1、2这样的排序结果;

  • 10
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值