有一个5000
万的用户文件(user_id,name,age)
,一个2
亿记录的用户看电影的记录文件(user_id,url)
,根据年龄段观看电影的次数进行排序。
用户信息表
Log表(看电影的记录文件)
- 对
log
表进行数据清洗:每个用户看了多少次电影
with t1 as (
select user_id,
count(url) as cnt
from test4log
group by user_id
)
2. 对user
表进行数据清洗
t2 as (
select *
case when age>=0 and age<=10 then '0-10'
when age>10 and age<=20 then '10-20'
when age>20 and age<=30 then '20-30'
when age>30 and age<=40 then '30-40'
when age>40 and age<=50 then '40-50'
when age>50 and age<=60 then '50-60'
when age>60 and age<=70 then '60-70'
end age_phase
from test4user
)
3. 将t1、t2
两表关联合并
t3 as (
select *
from t1 join t2 on t1.user_id=t2.user_id
)
select * from t3
4. 排序合并生成题目所需要的结果
t3 as (
select age_phase,
sum(cnt) sum1
from t1 join t2 on t1.user_id=t2.user_id
group by t2.age_phase
)
select * from t3