Data Analysis Series: SQL Written Test Notes on the Magic of Window Functions

What is a window function?

A window function is similar to a GROUP BY aggregate, but not quite the same. An aggregate function collapses the rows of a group into a single value; a window function can compute that same aggregated value while still keeping every original row.

For example, a windowed average is written as avg() over(partition by ** order by **).

The over() clause is what makes the aggregate result show up on every individual record.

partition by works much like group by: inside over() it specifies which column to partition on, and the aggregate is then computed within each partition.

order by defines the ordering inside each partition, turning the aggregate into a running (ordered) aggregate.
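
A minimal sketch of the difference, using the orders table (user_id, order_pay) that appears in the exercises further down:

-- aggregate: one row per user
select user_id,
       avg(order_pay) as avg_pay
from orders
group by user_id;

-- window function: every order row is kept, each carrying its user's average
select user_id,
       order_pay,
       avg(order_pay) over(partition by user_id) as avg_pay
from orders;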

A quick walk-through of the common sequence functions:

The ntile function:

This is a bucketing (slicing) function. For example, ntile(3) over(partition by classroom order by student_id)

partitions by class, orders by student id within each class, and splits each class's students into three buckets.
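
As a sketch, on a hypothetical students table with classroom and student_id columns:

select classroom,
       student_id,
       ntile(3) over(partition by classroom order by student_id) as bucket   -- 1, 2 or 3
from students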

The row_number, rank and dense_rank functions:

row_number: plain sequential numbering; ties are ignored and rows are simply numbered 1,2,3,4,5,6.

dense_rank: dense ranking; equal values share the same rank and no rank is skipped, e.g. 1,2,2,3,4,5.

rank: gapped ranking; equal values still share a rank, but the following rank is skipped, e.g. 1,2,2,4,5,6.
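
The three side by side, on a hypothetical scores table with a score column:

select score,
       row_number() over(order by score desc) as rn,    -- 1,2,3,4,5,6
       dense_rank() over(order by score desc) as drk,   -- 1,2,2,3,4,5
       rank()       over(order by score desc) as rk     -- 1,2,2,4,5,6
from scores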

The lag() and lead() functions:

These look backward (lag) and forward (lead) within the window.

For example, lag(student_id, 1) over(partition by classroom order by student_id)

partitions by class and gives each row the previous student's id within that class, i.e. the student_id column shifted down by one position.
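
A sketch on the same hypothetical students table; lag fetches the previous row's value and lead the next row's:

select classroom,
       student_id,
       lag(student_id, 1)  over(partition by classroom order by student_id) as prev_id,   -- null on the first row of each class
       lead(student_id, 1) over(partition by classroom order by student_id) as next_id    -- null on the last row of each class
from students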

The first_value() and last_value() functions:

As the names say, these return the first and the last value in the window.

For example, first_value(sale_date) over(partition by shop_name order by sale_date)

partitions by shop, orders each shop's rows by sale date, and first_value then gives each shop's earliest sale date.
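
One caveat that is easy to trip over: in most engines, once an order by is present the default window frame only reaches up to the current row, so last_value returns the current row's value rather than the true last value; spelling out the frame fixes that. A sketch on a hypothetical sales table (shop_name, sale_date):

select shop_name,
       sale_date,
       first_value(sale_date) over(partition by shop_name order by sale_date) as first_sale,
       last_value(sale_date)  over(partition by shop_name order by sale_date
                                   rows between unbounded preceding and unbounded following) as last_sale
from sales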

Below are a number of examples; almost all of them call for window functions.

Part 1: Product order data

Data table:

Order table orders,

with fields (order_id: order number,

user_id: user id,

order_pay: order amount,

order_time: order time,

commodity_level_1: first-level product category,

commodity_level_2: second-level product category)

1. For the last 7 days, find the top 3 second-level categories by total sales under each first-level category:

Approach: first get, for the last 7 days, every first-level/second-level category pair with its total sales; on top of that, rank with row_number and keep the top 3. Mind the indentation.

select commodity_level_1,
       commodity_level_2,
       total_pay
from
       (select commodity_level_1,
               commodity_level_2,
               total_pay,
               row_number() over(partition by commodity_level_1 order by total_pay desc) as rk
        from
               (select commodity_level_1,
                       commodity_level_2,
                       sum(order_pay) as total_pay
                from orders
                where datediff(now(), order_time) <= 7
                group by commodity_level_1, commodity_level_2
               ) a
       ) b
where rk <= 3

2. For each day between 8.1 and 8.10, extract the user_id whose daily spend ranks 101st to 195th.

Approach: first get each user's total spend per day in 8.1-8.10, then rank with row_number within each day, and finally keep the required rank range. Remember to convert the timestamp to a date, and mind the indentation.

select user_id,
       order_date,
       total_pay
from
       (select user_id,
               order_date,
               total_pay,
               row_number() over(partition by order_date order by total_pay desc) as rk
        from
               (select user_id,
                       sum(order_pay) as total_pay,
                       convert(order_time, date) as order_date
                from orders
                where convert(order_time, date) between '20180801' and '20180810'
                group by convert(order_time, date), user_id
               ) a
       ) b
where rk between 101 and 195
Part 2: Campaign operations analysis

Data tables:

Table 1: order table orders,

with fields (user_id: user id,

order_pay: order amount,

order_time: order time)

Table 2: campaign sign-up table act_apply,

with fields (act_id: campaign id,

user_id: user who signed up,

act_time: sign-up time)

1. For each campaign, compute the total order amount and total number of orders generated by all its users after they signed up.

Approach: not hard; think it through yourself.

select act_id,
       count(*) as total_order_num,
       sum(order_pay) as total_order_pay
from act_apply
left join orders
on orders.user_id = act_apply.user_id
where order_time >= act_time
group by act_id

2. For each campaign, compute the average number of orders produced per day from the campaign start up to today (the test day). The campaign start time is defined as the earliest sign-up time. (All time-related columns are of type datetime.)

Approach: first find each campaign's start time, then take the number of days from that start to now, and divide the total order count by it.

select act_id,
       count(*) / datediff(now(), first_act) as avg_order_num
from orders a
left join
       (select act_id,
               user_id,
               act_time,
               min(act_time) over(partition by act_id) as first_act
        from act_apply
       ) b
on a.user_id = b.user_id
where order_time >= act_time
group by act_id, first_act
Part 3: User behaviour path analysis

Table 1: user behaviour table tracking_log,

with fields (user_id: user id,

opr_id: operation id,

log_time: operation time)

1. For each day, count the users who performed operation A immediately followed by operation B (A and B must be adjacent).

Approach: use a window function to pull each record's next operation into a new column, then compare.

select log_date,
       count(distinct user_id) as user_num
from
       (select user_id,
               opr_id as cur_opr,
               convert(log_time, date) as log_date,
               lead(opr_id, 1) over(partition by user_id, convert(log_time, date) order by log_time) as next_opr
        from tracking_log
       ) a
where a.cur_opr = 'A' and a.next_opr = 'B'
group by log_date

2. Count the users whose behaviour sequence is A-B-D, where any other operations (C, E, ...) may appear between A and B, and anything other than C may appear between B and D.

Approach: use like with % wildcards, plus the group_concat function.

select count(*) as result
from
       (select user_id,
               group_concat(opr_id order by log_time) as user_behavior_path
        from tracking_log
        group by user_id
        having (user_behavior_path like '%A%B%D')
           and (user_behavior_path not like '%A%B%C%D')
       ) a
Part 4: User retention analysis

Table: user login table user_log,

with fields (user_id: user id,

log_date: login time)

1. For each day, find the number of new users and their day-2 and day-30 retention rates.

Approach 1: no window function, just case when.

select t.first_date as first_date,
       count(distinct l.user_id) as new_users,
       round(count(distinct case when datediff(l.log_date, t.first_date) = 1 then l.user_id else null end)
             / count(distinct l.user_id), 2) as day2_retention,
       round(count(distinct case when datediff(l.log_date, t.first_date) = 29 then l.user_id else null end)
             / count(distinct l.user_id), 2) as day30_retention
from user_log l
left join
       (select user_id,
               min(log_date) as first_date
        from user_log
        group by user_id
       ) t
on t.user_id = l.user_id
group by t.first_date

Approach 2: use a window function to find the first-day, day-2 and day-30 logins, then divide.

That is, first find each user's first login date, group by that first login date, and count distinct users; that gives the daily new users.

For the day-2 returning users, compute the difference between the login date and the first login date and keep the rows where it equals 1.

Do the same for the day-30 returning users.

select a.first_date,
       a.new_users,
       b.day2_back / a.new_users as day2_retention,
       c.day30_back / a.new_users as day30_retention
from
       (select first_date,
               count(distinct user_id) as new_users
        from
               (select user_id,
                       min(log_date) over(partition by user_id) as first_date
                from user_log
               ) aa
        group by first_date
       ) a
left join
       (select first_date,
               count(distinct user_id) as day2_back
        from
               (select user_id,
                       log_date,
                       min(log_date) over(partition by user_id) as first_date
                from user_log
               ) bb
        where datediff(log_date, first_date) = 1
        group by first_date
       ) b
on a.first_date = b.first_date
left join
       (select first_date,
               count(distinct user_id) as day30_back
        from
               (select user_id,
                       log_date,
                       min(log_date) over(partition by user_id) as first_date
                from user_log
               ) cc
        where datediff(log_date, first_date) = 29
        group by first_date
       ) c
on a.first_date = c.first_date

2. Find the numbers of users who logged in within the last 90, 30 and 7 days.

Approach: after the above you should already have the idea; just use case when.

select count(distinct case when datediff(now(), log_date) <= 90 then user_id else null end) as log_users_90d,
       count(distinct case when datediff(now(), log_date) <= 30 then user_id else null end) as log_users_30d,
       count(distinct case when datediff(now(), log_date) <= 7  then user_id else null end) as log_users_7d
from user_log

3. Find each user's average login interval (in days) over the last month.

Approach: use lead to add a column with the next login time, take the difference, and average it; remember this is done per user.

select user_id,
       avg(diff) as avg_interval
from
       (select user_id,
               datediff(lead(log_time, 1) over(partition by user_id order by log_time), log_time) as diff
        from user_log
        where datediff(now(), log_time) <= 30
       ) t
group by user_id
Part 5: Statistics (median, quartiles, mode, etc.)

Fields: shop id (shop_id),

sales volume (sale),

product id (commodity_id).

1. For each shop, find the median of its product sales.

Approach 1: let cnt be the length of each shop's sales sequence.

When cnt is even, the median sits at positions cnt/2 and cnt/2+1;

when cnt is odd, the median sits at position ceiling(cnt/2).

select shop_id,
       avg(sale) as result
from
       (select shop_id,
               sale,
               count(1) over(partition by shop_id) as total,
               cast(count(1) over(partition by shop_id) as decimal) / 2 as even_mid,
               ceiling(cast(count(1) over(partition by shop_id) as decimal) / 2) as odd_mid,
               row_number() over(partition by shop_id order by sale) as sale_rank
        from orders
       ) t
where (total % 2 = 0 and t.sale_rank in (even_mid, even_mid + 1))
   or (total % 2 = 1 and t.sale_rank = odd_mid)
group by shop_id

Approach 2: abs(rn - (cnt + 1) / 2) < 1

To unpack the formula: rn is the position inside the sequence of length cnt, i.e. the row_number.

If cnt is odd, say 1,2,3,4,5, the median is the 3rd value and |3 - (5+1)/2| = 0;

if cnt is even, say 1,2,3,4, the median sits at positions 2 and 3,

and 2 - (4+1)/2 = -0.5, 3 - (4+1)/2 = 0.5.

So whether cnt is odd or even, the median positions are exactly those with abs(rn - (cnt+1)/2) < 1.

select shop_id,
       avg(sale) as median
from
       (select shop_id,
               sale,
               row_number() over(partition by shop_id order by sale) as rn,
               count(1) over(partition by shop_id) as cnt
        from orders
       ) a
where abs(rn - (cnt + 1) / 2) < 1
group by shop_id

Approach 3: no window function and no sorting; use the definition of the median directly.

It is a bit twisty, so bear with it:

When the array length cnt is odd, the number of values greater than the median equals the number of values smaller than it.

When cnt is even, the median is the average of the two middle values after sorting. If every value in the array is unique, then for each of those two middle values the difference between the count of values above it and the count of values below it is 1 in absolute value, which is exactly how often each of them occurs. If the values are not all unique, you will find the property still holds.

To sum up: whatever the length, a value belongs to the median exactly when its frequency is greater than or equal to the absolute difference between the count of values greater than it and the count of values smaller than it.

The steps are:

1. Self-join.

2. Compute each value's frequency: sum(case when t1.sale = t2.sale then 1 else 0 end).

3. Compute the absolute difference between the counts of larger and smaller values, used for the comparison: abs(sum(sign(t1.sale - t2.sale))).

select shop_id,
       avg(sale) as median
from
       (select t1.shop_id,
               t1.sale
        from orders t1
        left join orders t2
        on t1.shop_id = t2.shop_id
        group by t1.shop_id, t1.sale
        having sum(case when t1.sale = t2.sale then 1 else 0 end) >=
               abs(sum(sign(t1.sale - t2.sale)))
       ) t
group by t.shop_id

2. For each shop, find the mode of the products ordered.

Approach: use a window function to count each product's orders per shop, then pick the products that reach the per-shop maximum; those are the modes.

select shop_id,
       commodity_id,
       order_num
from
       (select shop_id,
               commodity_id,
               order_num,
               max(order_num) over(partition by shop_id) as max_order_num
        from
               (select shop_id,
                       commodity_id,
                       count(*) as order_num
                from orders
                group by shop_id, commodity_id
               ) t1
       ) t2
where order_num = max_order_num

3. Quartiles.

Three tables:

T1: good_id, cate_id (category)

T2: mall_id (shop), good_id

T3: mall_id, credit_score (credit score)

Question: within each category, find the shops whose credit_score is in the top 25%.

Approach: join the three tables, rank with row_number, use count to get the number of shops in each category, and finally keep the top 25%.

select cate_id,
       mall_id,
       credit_score
from
       (select cate_id,
               mall_id,
               credit_score,
               row_number() over(partition by cate_id order by credit_score desc) as score_rn,
               count(*) over(partition by cate_id) as mall_total
        from
               (select t1.cate_id,
                       t2.mall_id,
                       t3.credit_score
                from t1
                left join t2 on t1.good_id = t2.good_id
                left join t3 on t2.mall_id = t3.mall_id
                group by t1.cate_id, t2.mall_id
               ) t
       ) tt
where score_rn <= mall_total * 0.25
Part 6: Week-over-week GMV comparison

Fields: date (sale_date),

shop category (cate_id),

number of shops (mall_num), gmv

1. For Pinduoduo's 618 sale, compute each shop category's average daily GMV lift, in absolute terms and as a ratio, between the week before and the week after 618.

Note: the solution below only works when the data are continuous. If days are missing, e.g. the previous week has no Saturday or Sunday rows, a fixed shift such as lead(gmv, 7) is bound to pair up the wrong days.

Approach: the lead window function does the job.

select cate_id,
       avg(gmv_diff) as avg_gmv_diff,
       avg(gmv_rate) as avg_gmv_rate
from
       (select sale_date,
               cate_id,
               mall_num,
               gmv,
               (lead(gmv, 7) over(partition by cate_id order by sale_date) - gmv) as gmv_diff,
               (lead(gmv, 7) over(partition by cate_id order by sale_date) - gmv) / gmv as gmv_rate
        from T
        where convert(sale_date, date) between '20190611' and '20190624'
       ) t
group by cate_id

2. Split the shops into five groups by their GMV in the week before 618 (top 20%, 20-40%, and so on) and compute each group's average daily GMV lift, absolute and relative, in the week after 618.

Approach: 1. compute each shop's sum(gmv) for the week before;

2. label each shop's gmv_quantile with case when;

3. as above, compute gmv_diff and gmv_rate with lead, using gmv_quantile to tell which bucket a shop falls into;

4. the full query is long and I was too lazy to write it out, haha (a rough sketch is given below).
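
A rough sketch of what such a query could look like. It assumes a shop-level table, here called T_mall with columns mall_id, sale_date and gmv, which is not part of the field list above, and it buckets the shops with ntile(5) instead of the case when labelling mentioned in step 2; treat it as an illustration only:

select q.gmv_quantile,
       avg(d.gmv_diff) as avg_gmv_diff,
       avg(d.gmv_rate) as avg_gmv_rate
from
       (select mall_id,
               ntile(5) over(order by sum(gmv) desc) as gmv_quantile   -- 1 = top 20%, ..., 5 = bottom 20%
        from T_mall
        where convert(sale_date, date) between '20190611' and '20190617'
        group by mall_id
       ) q
left join
       (select mall_id,
               (lead(gmv, 7) over(partition by mall_id order by sale_date) - gmv) as gmv_diff,
               (lead(gmv, 7) over(partition by mall_id order by sale_date) - gmv) / gmv as gmv_rate
        from T_mall
        where convert(sale_date, date) between '20190611' and '20190624'
       ) d
on q.mall_id = d.mall_id
where d.gmv_diff is not null
group by q.gmv_quantile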

Part 7: Consecutive interval problems

For this one I once came across an excellent example that I'd like to share:

Problem: there is a user sign-in table user_attendence that marks, for every user and every working day, whether the user signed in that day (the table contains attendance records for all users on all working days).

It has three fields: date (date),

user id (user_id),

whether the user signed in that day (user_sign_in: 0 = no, 1 = yes).

Question 1: for every user who signed in today, compute how many consecutive days they have signed in up to and including today (the output only contains the users who signed in today).

Output table t_user_consecutive_days: user id (fuser_id), consecutive sign-in days (fconsecutive_days).

Approach: find each user's most recent day without a sign-in, then subtract that date from today.

create table t_user_consecutive_days as
select user_id as fuser_id,
       datediff(now(), max_sign_date_0) as fconsecutive_days
from
       (select user_id,
               max(date) as max_sign_date_0
        from user_attendence
        where user_sign_in = 0
        group by user_id
       ) t

Question 2: for every user that appears in the sign-in table, compute their longest run of consecutive sign-in days in history.

Output table user_max_days: user id (user_id), longest consecutive sign-in days (max_days).

Approach: turn each user's sign-ins into a 0/1 string and take the length of the longest run of consecutive 1s.

create table user_max_days as
select user_id,
       max(length(cut_record)) as max_days
from
       (select user_id,
               record,
               cut_record
        from
               (select user_id,
                       group_concat(user_sign_in order by date separator '') as record
                from user_attendence
                group by user_id
               ) t1
        lateral view explode(split(record, '0')) t as cut_record
       ) t2
where cut_record <> ''
group by user_id
Part 8: Student grade analysis

Table: Enrollments

Fields: student_id, course_id, grade

1. For each student, find the highest grade they obtained and the course it belongs to; if several courses tie on the grade, take the smallest course_id. Sort the result by student_id ascending.

Approach 1: use dense_rank to rank grade, then row_number to rank course_id.

select a.student_id,
       a.course_id,
       a.grade
from
       (select student_id,
               course_id,
               grade,
               row_number() over(partition by student_id order by course_id) as course_rank
        from
               (select student_id,
                       course_id,
                       grade,
                       dense_rank() over(partition by student_id order by grade desc) as grade_rank
                from Enrollments
               ) t
        where t.grade_rank = 1
       ) a
where a.course_rank = 1
order by student_id

Approach 2: use in; quicker and simpler.

select student_id,
       min(course_id) as course_id
from Enrollments
where (student_id, grade) in
       (select student_id,
               max(grade)
        from Enrollments
        group by student_id
       )
group by student_id
order by student_id
Part 9: Student exercise analysis

Table T: exercise date (time), student id (stu_id), exercise id (exer_id)

Count the number of students who worked on a new exercise every single day from 10.1 to 10.10; the key word is every day.

Approach: an exercise counts as new for a student if the attempt is the first time that student has done it, i.e. the row where the attempt time equals the first-attempt time. Then, after deduplicating, check whether the student has new-exercise days on all ten days.

select count(1) as result
from
       (select stu_id,
               count(distinct convert(time, date)) as cnt
        from
               (select stu_id,
                       exer_id,
                       time,
                       min(time) over(partition by stu_id, exer_id) as first_time
                from T
                where convert(time, date) between '20191001' and '20191010'
               ) a
        where a.time = a.first_time
        group by stu_id
        having cnt = 10
       ) b

References:

https://zhuanlan.zhihu.com/p/116012207

https://zhuanlan.zhihu.com/p/117498021
