移动标准差以及移动平均值(movstd、movmean)
最近在工作中遇到这样一个问题:
有一个序列长度为
n
n
n 的序列
T
=
[
t
0
,
t
1
,
…
,
t
n
−
1
]
T=[t_0, t_1, \dots, t_{n-1}]
T=[t0,t1,…,tn−1],给定一个窗大小
m
(
m
<
=
n
)
m (m <= n)
m(m<=n),下标从0开始,计算窗大小的均值和标准差,即计算T[0:m-1]、T[1:m]、T[2:m+1]…T[n-m+1:n] 的平均值和标准差
暴力解法
最简单的无脑的方法就是暴力循环了,很明显这种方法特别慢,时间复杂度为 O ( n ∗ m ) O(n*m) O(n∗m)
下面为你呈现暴力代码
import numpy as np
import time
# generate time sequence
n = 1000 * 1000
m = 1000
T = np.random.rand(n)
# brute force
means = np.zeros(n - m + 1)
stds = np.zeros(n - m + 1)
start_time = time.time()
for i in range(n - m + 1):
means[i] = np.mean( T[i:i+m] )
end_time = time.time()
print('Running time of brute force for mean is {}s'.format((end_time - start_time)))
start_time = time.time()
for i in range(n - m + 1):
stds[i] = np.std( T[i:i+m] )
end_time = time.time()
print('Running time of brute force for std is {}s'.format((end_time - start_time)))
Running time of brute force for mean is 5.774143934249878s
Running time of brute force for std is 20.721827030181885s
movmean
有没有办法进行优化呢?这里介绍移动标准差(movstd)和移动平均值(movmean)
先从移动平均值(movmean)开始,它很简单并且符合直觉:在滑动的过程中,有很多重叠部分,我们可以利用重叠的部分,从而节约计算时间
如上图所示,计算时可以利用前一个均值,这样就避免了不必要的加法操作,平均值的计算复杂度降低为
O
(
n
)
O(n)
O(n)
μ
i
∗
m
=
(
t
i
+
t
i
+
1
+
⋯
+
t
i
+
m
−
1
)
\mu_i*m = (t_i + t_{i+1} + \dots + t_{i+m-1})
μi∗m=(ti+ti+1+⋯+ti+m−1)
μ i + 1 ∗ m = μ i ∗ m − t i + t m \mu_{i+1}*m = \mu_i*m - t_i + t_m μi+1∗m=μi∗m−ti+tm
下面代码显示了如何实现 movmean
def movmean(T, m):
assert(m <= T.shape[0])
n = T.shape[0]
sums = np.zeros(n - m + 1)
sums[0] = np.sum(T[0:m])
cumsum = np.cumsum(T)
cumsum = np.insert(cumsum, 0, 0) # 在数组开头插入一个0
sums = cumsum[m:] - cumsum[:-m]
return sums/m
start_time = time.time()
means_2 = movmean(T, m)
end_time = time.time()
print('Running time of movmean is {}s'.format((end_time - start_time)))
Running time of movmean is 0.009449005126953125s
movstd
在介绍移动标准差之前,我们先回顾下标准差计算公式:
σ
=
1
m
∑
i
=
1
m
(
t
i
−
u
)
2
\sigma = \sqrt{\frac{1}{m} \sum_{i=1}^m(t_i - u)^2}
σ=m1i=1∑m(ti−u)2
假设有一个长度为 3 的序列 [a, b, c],我们来计算一下它的标准差
首先计算均值:
μ
=
1
m
(
a
+
b
+
c
)
\mu = \frac{1}{m}(a+b+c)
μ=m1(a+b+c)
然后标准差:
σ
=
1
3
(
(
a
−
μ
)
2
+
(
b
−
μ
)
2
+
(
c
−
μ
)
2
)
=
1
3
(
a
2
+
b
2
+
c
2
−
2
a
μ
−
2
b
μ
−
2
c
μ
+
μ
2
)
=
1
3
(
a
2
+
b
2
+
c
2
)
−
(
1
3
(
a
+
b
+
c
)
)
2
\begin{array}{l} \sigma &= \sqrt{ \frac{1}{3} ((a-\mu)^2 + (b-\mu)^2 + (c-\mu)^2)} \\ &= \sqrt{ \frac{1}{3} (a^2 + b^2 + c^2 -2a\mu - 2b\mu - 2c\mu + \mu^2)} \\ &= \sqrt{ \frac{1}{3}(a^2+b^2+c^2) - (\frac{1}{3}(a+b+c))^2 } \\ \end{array}
σ=31((a−μ)2+(b−μ)2+(c−μ)2)=31(a2+b2+c2−2aμ−2bμ−2cμ+μ2)=31(a2+b2+c2)−(31(a+b+c))2
我们可以发现,标准差的计算可以用累计和来表示,而累加和是可以在 O ( n ) O(n) O(n)时间内完成,这就是 movstd
下面的代码展示了如何计算 movstd
def movstd(T, m):
n = T.shape[0]
cumsum = np.cumsum(T)
cumsum_square = np.cumsum(T**2)
cumsum = np.insert(cumsum, 0, 0) # 在数组开头插入一个0
cumsum_square = np.insert(cumsum_square, 0, 0) # 在数组开头插入一个0
seg_sum = cumsum[m:] - cumsum[:-m]
seg_sum_square = cumsum_square[m:] - cumsum_square[:-m]
return np.sqrt( seg_sum_square/m - (seg_sum/m)**2 )
start_time = time.time()
stds_2 = movstd(T, m)
end_time = time.time()
print('Running time of movstd is {}s'.format((end_time - start_time)))
Running time of movstd is 0.03198814392089844s
总结
通过提前计算好累计和,移动平均和移动标准差以空间换时间,算法速度比起暴力方法提升了几个数量级