Hive 求得所有用户和活跃用户的总数及平均年龄

有日志如下,请写出代码
求得所有用户和活跃用户的总数及平均年龄。(活跃用户指连续两天都有访问记录的用户)日期 用户 年龄
数据集

2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19

1)建表

create table user_age
(
    dt      string,
    user_id string,
    age     int
) row format delimited fields terminated by ',';

2)按照日期以及用户分组,按照日期排序并给出排名

select dt,
       user_id,
       min(age) age,
       rank() over (distribute by user_id sort by dt)
from user_age
group by dt, user_id;

3)计算日期及排名的差值

with a as (select dt,
                  user_id,
                  min(age)                                       age,
                  rank() over (distribute by user_id sort by dt) `rn`
           from user_age
           group by dt, user_id)
select user_id,
       age,
       date_sub(dt, rn) `flag`
from a;

+-------+---+----------+
|user_id|age|flag      |
+-------+---+----------+
|test_1 |23 |2019-02-10|
|test_1 |23 |2019-02-11|
|test_2 |19 |2019-02-10|
|test_2 |19 |2019-02-10|
|test_2 |19 |2019-02-12|
|test_2 |19 |2019-02-12|
|test_3 |39 |2019-02-10|
+-------+---+----------+

4)过滤出差值大于等于2的,即为连续两天活跃的用户

with a as (select dt,
                  user_id,
                  min(age)                                       age,
                  rank() over (distribute by user_id sort by dt) `rn`
           from user_age
           group by dt, user_id),
     b as (select user_id,
                  age,
                  date_sub(dt, rn) `flag`
           from a)
select user_id, age
from b
group by user_id, age
having count(*) >= 2
;
+-------+---+
|user_id|age|
+-------+---+
|test_1 |23 |
|test_2 |19 |
+-------+---+

5)对数据进行去重处理(一个用户可以在两个不同的时间点连续登录),例如:a用户在1月10号1月11号以及1月20号和1月21号4天登录。

with a as (select dt,
                  user_id,
                  min(age)                                       age,
                  rank() over (distribute by user_id sort by dt) `rn`
           from user_age
           group by dt, user_id),
     b as (select user_id,
                  age,
                  date_sub(dt, rn) `flag`
           from a),
     c as (select user_id, age
           from b
           group by user_id, age
           having count(*) >= 2)
select user_id,
       min(age) age
from c
group by user_id;
;
+-------+---+
|user_id|age|
+-------+---+
|test_1 |23 |
|test_2 |19 |
+-------+---+

6)计算活跃用户(两天连续有访问)的人数以及平均年龄

with a as (select dt,
                  user_id,
                  min(age)                                       age,
                  rank() over (distribute by user_id sort by dt) `rn`
           from user_age
           group by dt, user_id),
     b as (select user_id,
                  age,
                  date_sub(dt, rn) `flag`
           from a),
     c as (select user_id, age
           from b
           group by user_id, age
           having count(*) >= 2),
     d as (select user_id,
                  min(age) age
           from c
           group by user_id)
select count(*),
       cast(sum(age) / count(*) as decimal(10, 2))
from d;
+--+-----+
|c0|c1   |
+--+-----+
|2 |21.00|
+--+-----+

7)对全量数据集进行按照用户去重

select user_id,
       min(age) age
from user_age
group by user_id;
+-------+---+
|user_id|age|
+-------+---+
|test_1 |23 |
|test_2 |19 |
|test_3 |39 |
+-------+---+

8)计算所有用户的数量以及平均年龄

select count(*) `user_count,`,
        cast(sum(age) / count(*) as decimal(10, 2))
 from (select user_id,
              min(age) age
       from user_age
       group by user_id) g
;
+-----------+-----+
|user_count,|c1   |
+-----------+-----+
|3          |27.00|
+-----------+-----+

9)将第5步以及第7步两个数据集进行union all操作

10)求和并拼接为最终SQL

with a as (select dt,
                  user_id,
                  min(age)                                       age,
                  rank() over (distribute by user_id sort by dt) `rn`
           from user_age
           group by dt, user_id),
     b as (select user_id,
                  age,
                  date_sub(dt, rn) `flag`
           from a),
     c as (select user_id, age
           from b
           group by user_id, age
           having count(*) >= 2),
     d as (select user_id,
                  min(age) age
           from c
           group by user_id),
     e as (select count(*)                                    `activity_user_count`,
                  cast(sum(age) / count(*) as decimal(10, 2)) `activity_avg_age`
           from d),
     f as (select user_id,
                  min(age) age
           from user_age
           group by user_id),
     g as (select count(*)                                    `twice_count`,
                  cast(sum(age) / count(*) as decimal(10, 2)) `twice_count_age`
           from f),
     h as (select activity_user_count,
                  activity_avg_age,
                  0 twice_count,
                  0 twice_count_age
           from e
           union all
           select 0 user_total_count,
                  0 user_total_avg_age,
                  twice_count,
                  twice_count_age
           from g)
select sum(activity_user_count) `activity_user_count`,
       sum(activity_avg_age)    `activity_avg_age`,
       sum(twice_count)         `twice_count`,
       sum(twice_count_age)     `twice_count_age`
from h
;
+-------------------+----------------+-----------+---------------+
|activity_user_count|activity_avg_age|twice_count|twice_count_age|
+-------------------+----------------+-----------+---------------+
|2                  |21.00           |3          |27.00          |
+-------------------+----------------+-----------+---------------+

总结

思路就是分别求所有用户和活跃用户的总数及平均年龄,之后进行union即可

评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值