基尼系数通常衡量一个国家或地区居民收入差距的常用指标。同样基尼系数也可以衡量影响因素差距的目标。
基尼系数衡量一个某些因素差距的常用指标。基尼系数介于0-1之间,基尼系数越大,表示不平等程度越高。
第一步:计算出差值、影响程度、前期值占比、斜率、环比波幅
差值=7月份的值-6月份的值
影响程度=某个差值/差值的总和
前期值占比=6月份的值/6月份的总和
斜率=影响程度/前期值占比
环比波幅=差值/6月份的值
2020/06 | 2020/07 | 差值 | 影响程度 | 前期值占比 | 斜率 | 环比波幅 | |
北京 | 700 | 200 | -500 | 1.25 | 0.583333333 | 2.142857143 | -71.43% |
上海 | 100 | 400 | 300 | -0.75 | 0.083333333 | -9 | 300.00% |
广州 | 400 | 200 | -200 | 0.5 | 0.333333333 | 1.5 | -50.00% |
杭州 | 0 | 0 | 0 | 0 | 0 | #DIV/0! | #DIV/0! |
深圳 | 0 | 0 | 0 | 0 | 0 | #DIV/0! | #DIV/0! |
长沙 | 0 | 0 | 0 | 0 | 0 | #DIV/0! | #DIV/0! |
南京 | 0 | 0 | 0 | 0 | 0 | #DIV/0! | #DIV/0! |
天津 | 0 | 0 | 0 | 0 | 0 | #DIV/0! | #DIV/0! |
总和 | 1200 | 800 | -400 | 1 | 1 | 1 | -33.33% |
第二步:计算出影响程度累计值、前期值占比累计值。
纵坐标 | 横坐标 | |||
影响程度 | 影响程度累计值 | 前期值占比 | 前期值占比累计值 | |
上海 | -0.75 | -0.75 | 0.083333333 | 0.083333333 |
广州 | 0.5 | -0.25 | 0.333333333 | 0.416666667 |
北京 | 1.25 | 1 | 0.583333333 | 1 |
第三步:根据影响程度、前期值占比、斜率计算影响系数,计算方法如下:
使用presto sql计算基尼系数,把sql稍微修改一下,同样也可以用hive计算基尼系数:
SELECT 1-sum(s) AS jini
FROM
(SELECT city,
earlier_degree_sum/2*(influence_degree_sum+last_value(influence_degree_sum) OVER(
ORDER BY slope ROWS BETWEEN CURRENT ROW AND 1 following)+2) AS s
FROM
(SELECT city,
slope,influence_degree,earlier_degree,
sum(influence_degree) over(
ORDER BY influence_degree ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS influence_degree_sum, --影响程度累计值
sum(earlier_degree) over(
ORDER BY influence_degree ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS earlier_degree_sum --前期值占比累计值
FROM (
(SELECT city,
if(d_sum!=0,d_value*1.00/d_sum,NULL) AS influence_degree, --影响程度
if(last_sum!=0,last_month_value*1.00/last_sum,NULL) AS earlier_degree, --前期值占比
if(last_month_value!=0
AND d_sum!=0
AND last_sum!=0,(d_value*1.000000/d_sum)/(last_month_value*1.000000/last_sum)) AS slope --斜率
FROM
(SELECT city, --计算6月和7月的每个城市的值
count(CASE
WHEN create_time<TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END) AS last_month_value,
count(CASE
WHEN create_time>=TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END) AS current_month_value,
(count(CASE
WHEN create_time<TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END)-count(CASE
WHEN create_time>=TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END)) AS d_value
FROM dby_ods.ods_bi_sop_hourly
WHERE create_time<=TIMESTAMP '2020-07-31 00:00:00'
AND create_time>TIMESTAMP '2020-05-31 00:00:00'
GROUP BY city) t1
JOIN
(SELECT count(CASE --计算6月和7月的总的值
WHEN create_time<TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END) AS last_sum,
count(CASE
WHEN create_time>=TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END) AS current_sum,
(count(CASE
WHEN create_time>=TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END)-count(CASE
WHEN create_time<TIMESTAMP '2020-07-01 00:00:00' THEN user_id
ELSE NULL
END)) AS d_sum
FROM dby_ods.ods_bi_sop_hourly
WHERE create_time<=TIMESTAMP '2020-07-31 00:00:00'
AND create_time>TIMESTAMP '2020-05-31 00:00:00') t2 ON 1=1) t3) t4
ORDER BY influence_degree) t5
ORDER BY slope) t6;