Postgres2015全国用户大会将于11月20至21日在北京丽亭华苑酒店召开。本次大会嘉宾阵容强大,国内顶级PostgreSQL数据库专家将悉数到场,并特邀欧洲、俄罗斯、日本、美国等国家和地区的数据库方面专家助阵:
- Postgres-XC项目的发起人铃木市一(SUZUKI Koichi)
- Postgres-XL的项目发起人Mason Sharp
- pgpool的作者石井达夫(Tatsuo Ishii)
- PG-Strom的作者海外浩平(Kaigai Kohei)
- Greenplum研发总监姚延栋
- 周正中(德哥), PostgreSQL中国用户会创始人之一
- 汪洋,平安科技数据库技术部经理
- ……
|
|
PostgreSQL自带了一些常用的统计学聚合函数, 非常好用.
本文介绍一下方差和标准差的一些聚合函数.
总体方差 : population covariance
总体标准差 :
population standard deviation
样本方差 : sample covariance
样本标准差 : sample standard deviation
均值 : mean
样本均值和样本方差的介绍 :
均值介绍 :
对方差, 标准差, 均值不了解的话, 建议参考网易公开课, 统计学.
浅显易懂.
PostgreSQL计算方差, 标准差的聚合函数如下 :
stddev(expression) | smallint, int, bigint, real, double precision, or numeric | double precision for floating-point arguments, otherwise numeric | historical alias for stddev_samp |
stddev_pop(expression) | smallint, int, bigint, real, double precision, or numeric | double precision for floating-point arguments, otherwise numeric | population standard deviation of the input values |
stddev_samp(expression) | smallint, int, bigint, real, double precision, or numeric | double precision for floating-point arguments, otherwise numeric | sample standard deviation of the input values |
variance (expression) | smallint, int, bigint, real, double precision, or numeric | double precision for floating-point arguments, otherwise numeric | historical alias for var_samp |
var_pop (expression) | smallint, int, bigint, real, double precision, or numeric | double precision for floating-point arguments, otherwise numeric | population variance of the input values (square of the population standard deviation) |
var_samp (expression) | smallint, int, bigint, real, double precision, or numeric | double precision for floating-point arguments, otherwise numeric | sample variance of the input values (square of the sample standard deviation) |
其中stddev和variance是stddev_samp和var_samp的别名.
这些函数用于计算数据集的总体/样本 方差,总体/样本 标准差.
例如 :
1,2,3,100 这组数据共4个值, 总体均值和样本均值分别为 :
(1+2+3+100)/4 = 26.5
总体方差 : ((1-26.5)^2 +
(2-26.5)^2 +
(3-26.5)^2 +
(100-26.5)^2)/4 = 1801.25
样本方差 :
((1-26.5)^2 +
(2-26.5)^2 +
(3-26.5)^2 +
(100-26.5)^2)/(4-1) = 2401.6666....
总体标准差 : 平方根(
总体方差
) =
42.4411357058220109
样本标准差 :
平方根(样本
方差
) =
49.0068022489395513
使用PostgreSQL计算如下 :
postgres=# select variance(id) from (values(1),(2),(3),(100)) as t(id);variance-----------------------2401.6666666666666667(1 row)postgres=# select var_pop(id) from (values(1),(2),(3),(100)) as t(id);var_pop-----------------------1801.2500000000000000(1 row)postgres=# select var_samp(id) from (values(1),(2),(3),(100)) as t(id);var_samp-----------------------2401.6666666666666667(1 row)postgres=# select stddev(id) from (values(1),(2),(3),(100)) as t(id);stddev---------------------49.0068022489395513(1 row)postgres=# select stddev_pop(id) from (values(1),(2),(3),(100)) as t(id);stddev_pop---------------------42.4411357058220109(1 row)postgres=# select stddev_samp(id) from (values(1),(2),(3),(100)) as t(id);stddev_samp---------------------49.0068022489395513(1 row)
[参考]
1. src/backend/utils/adt/float.c
/** =========================* FLOAT AGGREGATE OPERATORS* =========================** float8_accum - accumulate for AVG(), variance aggregates, etc.* float4_accum - same, but input data is float4* float8_avg - produce final result for float AVG()* float8_var_samp - produce final result for float VAR_SAMP()* float8_var_pop - produce final result for float VAR_POP()* float8_stddev_samp - produce final result for float STDDEV_SAMP()* float8_stddev_pop - produce final result for float STDDEV_POP()** The transition datatype for all these aggregates is a 3-element array* of float8, holding the values N, sum(X), sum(X*X) in that order.** Note that we represent N as a float to avoid having to build a special* datatype. Given a reasonable floating-point implementation, there should* be no accuracy loss unless N exceeds 2 ^ 52 or so (by which time the* user will have doubtless lost interest anyway...)*/..................Datumfloat8_var_pop(PG_FUNCTION_ARGS){ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);float8 *transvalues;float8 N,sumX,sumX2,numerator;
transvalues = check_float8_array(transarray, "float8_var_pop", 3);N = transvalues[0];sumX = transvalues[1];sumX2 = transvalues[2];
/* Population variance is undefined when N is 0, so return NULL */if (N == 0.0)PG_RETURN_NULL();
numerator = N * sumX2 - sumX * sumX;CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);
/* Watch out for roundoff error producing a negative numerator */if (numerator <= 0.0)PG_RETURN_FLOAT8(0.0);
PG_RETURN_FLOAT8(numerator / (N * N));}
Datumfloat8_var_samp(PG_FUNCTION_ARGS){ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);float8 *transvalues;float8 N,sumX,sumX2,numerator;
transvalues = check_float8_array(transarray, "float8_var_samp", 3);N = transvalues[0];sumX = transvalues[1];sumX2 = transvalues[2];
/* Sample variance is undefined when N is 0 or 1, so return NULL */if (N <= 1.0)PG_RETURN_NULL();
numerator = N * sumX2 - sumX * sumX;CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);
/* Watch out for roundoff error producing a negative numerator */if (numerator <= 0.0)PG_RETURN_FLOAT8(0.0);
PG_RETURN_FLOAT8(numerator / (N * (N - 1.0)));}
Datumfloat8_stddev_pop(PG_FUNCTION_ARGS){ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);float8 *transvalues;float8 N,sumX,sumX2,numerator;
transvalues = check_float8_array(transarray, "float8_stddev_pop", 3);N = transvalues[0];sumX = transvalues[1];sumX2 = transvalues[2];
/* Population stddev is undefined when N is 0, so return NULL */if (N == 0.0)PG_RETURN_NULL();
numerator = N * sumX2 - sumX * sumX;CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);
/* Watch out for roundoff error producing a negative numerator */if (numerator <= 0.0)PG_RETURN_FLOAT8(0.0);
PG_RETURN_FLOAT8(sqrt(numerator / (N * N)));}
Datumfloat8_stddev_samp(PG_FUNCTION_ARGS){ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);float8 *transvalues;float8 N,sumX,sumX2,numerator;
transvalues = check_float8_array(transarray, "float8_stddev_samp", 3);N = transvalues[0];sumX = transvalues[1];sumX2 = transvalues[2];
/* Sample stddev is undefined when N is 0 or 1, so return NULL */if (N <= 1.0)PG_RETURN_NULL();
numerator = N * sumX2 - sumX * sumX;CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);
/* Watch out for roundoff error producing a negative numerator */if (numerator <= 0.0)PG_RETURN_FLOAT8(0.0);
PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0))));}