pandas cum扩展之cumvar:增量方差算法
累计方差计算问题
在处理时间序列问题时,通常会遇到累计计算问题,即:对于每个时间 t = 1 , 2 , . . . . . . t=1, 2, ...... t=1,2,......,计算从开始到时间 t t t的样本 x 0 , x 1 , . . . , x t x_0, x_1, ..., x_t x0,x1,...,xt的统计量值。pandas中有现成的cumsum、cumprod、cummax和cummin分别用于计算累计求和、累计乘积、累计最大值和最小值。不过没有cumvar函数用来计算累计方差。当然可以通过简单循环来计算每一个时间 t t t的累计方差,不过在数据量大的情况下,这样效率太低了。
增量方差算法
有变量
X
X
X的一组长度为
M
M
M的历史样本(记为
H
H
H):
h
1
,
h
2
,
h
3
,
.
.
.
,
h
M
h_1, h_2, h_3, ..., h_M
h1,h2,h3,...,hM
其均值为:
H
‾
=
1
M
∑
i
=
1
M
h
i
\overline{H} = \frac{1}{M}\sum_{i=1}^Mh_i
H=M1i=1∑Mhi
方差为:
δ
H
2
(
d
)
=
1
M
−
d
∑
i
=
1
M
(
h
i
−
H
‾
)
2
\delta_H^2(d) = \frac{1}{M-d}\sum_{i=1}^M(h_i-\overline{H})^2
δH2(d)=M−d1i=1∑M(hi−H)2
其中
d
d
d为自由度,一般计算样本方差时
d
=
1
d=1
d=1(无偏估计),计算总体方差时
d
=
0
d=0
d=0(有偏估计)。
现在有一组长度为
N
N
N的增量样本(记为
A
A
A):
a
1
,
a
2
,
a
3
,
.
.
.
,
a
N
a_1, a_2, a_3, ..., a_N
a1,a2,a3,...,aN
其均值和方差分别为:
A
‾
=
1
N
∑
j
=
1
N
a
j
δ
A
2
(
d
)
=
1
N
−
d
∑
j
=
1
N
(
a
j
−
A
‾
)
2
\overline{A} = \frac{1}{N}\sum_{j=1}^Na_j \\ \\ \delta_A^2(d) = \frac{1}{N-d}\sum_{j=1}^N(a_j-\overline{A})^2
A=N1j=1∑NajδA2(d)=N−d1j=1∑N(aj−A)2
我们要通过上面给出的两组样本分别的均值和方差公式来计算两组样本合并在一起之后的全样本:
h
1
,
h
2
,
h
3
,
.
.
.
,
h
M
,
a
1
,
a
2
,
a
3
,
.
.
.
,
a
N
h_1, h_2, h_3, ..., h_M, a_1, a_2, a_3, ..., a_N
h1,h2,h3,...,hM,a1,a2,a3,...,aN
的均值和方差。
全样本均值为:
X
‾
=
1
M
+
N
[
∑
i
=
1
M
h
i
+
∑
j
=
1
N
a
j
]
=
M
H
‾
+
N
A
‾
M
+
N
\begin{aligned} \overline{X} &= \frac{1}{M+N}\left[\sum_{i=1}^Mh_i + \sum_{j=1}^Na_j\right] \\ \\ &=\frac{M\overline{H} + N\overline{A}}{M+N} \end{aligned}
X=M+N1[i=1∑Mhi+j=1∑Naj]=M+NMH+NA
全样本方差为:
δ
X
2
(
d
)
=
1
M
+
N
−
d
[
∑
i
=
1
M
(
h
i
−
X
‾
)
2
+
∑
j
=
1
N
(
a
j
−
X
‾
)
2
]
=
1
M
+
N
−
d
[
∑
i
=
1
M
(
(
h
i
−
H
‾
)
−
(
X
‾
−
H
‾
)
)
2
+
∑
j
=
1
N
(
(
a
j
−
A
‾
)
−
(
X
‾
−
A
‾
)
)
2
]
=
1
M
+
N
−
d
[
∑
i
=
1
M
(
(
h
i
−
H
‾
)
2
−
2
(
h
i
−
H
‾
)
(
X
‾
−
H
‾
)
+
(
X
‾
−
H
‾
)
2
)
+
∑
j
=
1
N
(
(
a
j
−
A
‾
)
2
−
2
(
a
j
−
A
‾
)
(
X
‾
−
A
‾
)
+
(
X
‾
−
A
‾
)
2
)
]
=
1
M
+
N
−
d
[
(
M
−
d
)
δ
H
2
(
d
)
+
M
(
X
‾
−
H
‾
)
2
+
(
N
−
d
)
δ
A
2
(
d
)
+
N
(
X
‾
−
A
‾
)
2
]
\begin{aligned} \delta_X^2(d) &= \frac{1}{M+N-d}\left[\sum_{i=1}^M(h_i-\overline{X})^2 + \sum_{j=1}^N(a_j-\overline{X})^2 \right] \\ \\ &= \frac{1}{M+N-d} \left[\sum_{i=1}^{M}\left( (h_i-\overline{H})-(\overline{X}-\overline{H})\right)^2 + \sum_{j=1}^{N}\left( (a_j-\overline{A})-(\overline{X}-\overline{A})\right)^2\right] \\ \\ &= \frac{1}{M+N-d} ~[~ \sum_{i=1}^{M}\left((h_i-\overline{H})^2-2(h_i-\overline{H})(\overline{X}-\overline{H}) + (\overline{X}-\overline{H})^2 \right) \\ \\ &+ \sum_{j=1}^{N}\left((a_j-\overline{A})^2-2(a_j-\overline{A})(\overline{X}-\overline{A}) + (\overline{X}-\overline{A})^2 \right) ~] \\ \\ &= \frac{1}{M+N-d} \left[(M-d)\delta_H^2(d)+M(\overline{X}-\overline{H})^2 + (N-d)\delta_A^2(d)+N(\overline{X}-\overline{A})^2 \right] \end{aligned}
δX2(d)=M+N−d1[i=1∑M(hi−X)2+j=1∑N(aj−X)2]=M+N−d1[i=1∑M((hi−H)−(X−H))2+j=1∑N((aj−A)−(X−A))2]=M+N−d1 [ i=1∑M((hi−H)2−2(hi−H)(X−H)+(X−H)2)+j=1∑N((aj−A)2−2(aj−A)(X−A)+(X−A)2) ]=M+N−d1[(M−d)δH2(d)+M(X−H)2+(N−d)δA2(d)+N(X−A)2]
Python实现
通过Python实现增量方差算法。
以10万个样本作为测试,迭代算法用时13.93秒,增量算法用时3.42秒。
以100万个样本作为测试,迭代算法用时81.68秒,增量算法用时6.81秒。
迭代算法用时呈指数递增,增量算法使用是线性递增的,效率提升十分明显。
# -*- coding: utf-8 -*-
import time
import numpy as np
def cumvar_iter(series, ddof=1):
'''累计方差计算——迭代'''
cumvar = np.nan * np.zeros(len(series),)
for k in range(len(series)):
cumvar[k] = np.var(series[:k+1], ddof=ddof)
return cumvar
def cumvar_delta(series, ddof=1):
'''累计方差计算——增量算法'''
def delta_var(n0, mean0, var0, n1, mean1, var1, ddof=1):
'''
增量方差算法
'''
n = n0+n1
if n0 <= ddof or n1 <= ddof or n <= ddof: # 样本量必须大于自由度
return np.nan
fm = n - ddof
mean = (n0 * mean0 + n1 * mean1) / n
fz1 = (n0-ddof) * var0 + n0 * (mean - mean0) ** 2
fz2 = (n1-ddof) * var1 + n1 * (mean - mean1) ** 2
var = (fz1 + fz2) / fm
return var
# 累计均值
cummean = np.cumsum(series) / np.arange(1, len(series)+1)
# 累计方差
cumvar = np.nan * np.ones(len(series),)
if ddof == 0:
cumvar[0] = 0
else:
for k in range(ddof, ddof+ddof+1):
cumvar[k] = np.var(series[:k+1], ddof=ddof)
for k in range(ddof+ddof+1, len(series)):
var0, mean0, n0 = cumvar[k-ddof-1], cummean[k-ddof-1], k-ddof
var1 = np.var(series[k-ddof:k+1], ddof=ddof)
mean1 = np.mean(series[k-ddof:k+1])
# 增量方差
cumvar[k] = delta_var(n0, mean0, var0, ddof+1, mean1, var1,
ddof=ddof)
return cumvar
if __name__ == '__main__':
# 生成一个长度为十万的测试序列
series = np.random.randint(10, 1000, (100000,))
start_time = time.time()
cumvar1 = cumvar_iter(series, ddof=1)
print(f'迭代算法用时: {round(time.time()-start_time, 6)}s.')
start_time = time.time()
cumvar2 = cumvar_delta(series, ddof=1)
print(f'增量算法用时: {round(time.time()-start_time, 6)}s.')
# 结果:
# 迭代算法用时: 13.930764s.
# 增量算法用时: 3.415858s.
参考:
欢迎关注公众号:一本正经d胡说