前言:
在做监控ETL任务的dashboard的时候,有这样一个需求——记录近三天调度耗时与历史执行中位数耗时比值TOP20。起初,感觉很简单,但在实际的开发过程中,sql底子弱的问题被暴露的一览无余,来来回回居然折腾了一天。哎,太菜,记录一下心路历程。
单一中位数的求取
中位数是啥
一串数字,按从小到大排列,当总数是奇数时,取最中间的数;当总数是偶数时,取最中间两个数的平均数。
单一中位数的参考
何为单一呢?就是你只要从一组数据中,找出一个即可,假如是这样的需求,你可以参考以下两种写法:
参考一:select group_concat(id), avg(value) from ( #最外层开始 select id, value from ( #第二层开始 select id, @index:=@index+1 as myindex, value from student, (select @index:=0) AS initvar order by value #最内层 ) as t where floor(@index/2+1)=myindex or ceil(@index/2)=myindex #第二层结束) as x #最外层结束参考二:SELECT AVG (DISTINCT income)FROM (SELECT t1.income FROM Graduates t1,Graduates t2 GROUP BY t1.income --s1 的条件 HAVING SUM(CASE WHEN t2.income >= t1.income THEN 1 ELSE 0 END) >= COUNT(*) / 2.0 --s2的条件 AND SUM(CASE WHEN t2.income <= t1.income THEN 1 ELSE 0 END) >= COUNT(*) / 2.0) TMP;
分组中位数的求取
但往往在实际开发过程中,不会那么简单,涉及的数据量会特别庞大,需要你求的也是一堆数据中的去取一堆中位数,怎么做呢?提供一种解题思路:
第一步:对数据进行分组排序,并给每一组的数据进行分组编号
(由于我的mysql版本5.7不支持 row_number函数,所以,需要自行row_number函数功能的实现,干货如下:)
select @row_number:=CASE WHEN @customer_no = x.project_id THEN @row_number +1 ELSE 1 END AS num1, @customer_no:=x.project_id AS project_id, //customer_no是一个临时变量,每次查询都被赋值为x.project_id。而case中判断条件在customer_no赋值之前,其实就是判断当前行x.project_id值是否与上一行x.project_id值相同。当不相同时重新编号(输出1),从而实现了分组顺序编号的功能 x.overtime from( select (ef.end_time - ef.start_time) as overtime, ef.project_id, st.task_name, st.task_desc, st.create_user from azkaban.execution_flows ef LEFT JOIN azkaban.projects pj ON ef.project_id = pj.id Where ef.start_time > UNIX_TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 10 day)) * 1000 AND ef.STATUS != '70' order by ef.project_id ) x,( SELECT @row_number:=0, @customer_no:=0 ) as it order by x.project_id,x.overtime;
第二步:获取各分组的中位数编号
由于我的数据量很大,所以,我这边就直接进行(count(*) DIV 2 +1)作为我的分组中位数的编号。
select (count(*) DIV 2 +1) as num, ef.project_id from azkaban.execution_flows ef where ef.start_time > UNIX_TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 10 day)) * 1000 AND ef.STATUS != '70' group by ef.project_id)xxxx on xxx.project_id= xxxx.project_id
第三步:进行left join 即可
select xxx.overtime as avgOverTime, xxx.task_name as taskName from( 第一步 )xxx left join( 第二步 )xxxx on xxx.project_id= xxxx.project_id where xxx.num1=xxxx.num
总结:
sql功力还是欠佳,加油,菜鸡共勉!