Update: The best performing algorithm so far is this one.
This question seeks to explore the available robust methods or algorithms for detecting sudden peaks in real-time timeseries data.
I do not seek fast and obvious answers. I would like every answer to provide a different approach to the problem, complemented with the advantages and disadvantages of the proposed method.
Consider the following dataset
p = [1 1 1.1 1 0.9 1 1 1.1 1 0.9 1 1.1 1 1 0.9 1 1 1.1 1 1,...
1 1 1.1 0.9 1 1.1 1 1 0.9 1 1.1 1 1 1.1 1 0.8 0.9 1 1.2 0.9 1,...
1 1.1 1.2 1 1.5 1 3 2 5 3 2 1 1 1 0.9 1,...
1 3 2.6 4 3 3.2 2 1 1 0.8 4 4 2 2.5 1 1 1];
(Matlab format but it's not about the language but about the algorithm)
You can clearly see that there are three large peaks and some small peaks.
Specification
This dataset is a specific example of the class of timeseries datasets that the question is about. This class of datasets has two general features:
There is basic noise with a general mean
There are large 'peaks' or 'higher data points' that significantly deviate from the noise.
Let's also assume the following:
the width of the peaks cannot be determined beforehand
the height of the peaks clearly deviates from the other values
the used algorithm must calculate realtime (so change with each new datapoint)
Clearly, for such a situation, a boundary value needs to be constructed which triggers signals. However, in reality, this boundary value cannot be static and must be determined realtime based on an algorithm.
My Question: How can I calculate such boundary values realtime? What are well-known and applicable algorithms for such situations? (not this data set specifically)
Additionally, I would like to know the following:
Is it possible to detect peaks without delay given only what is known before that time?
What needs to be known about a peak to detect it (with some given delay)?
Well-founded/ creative alternatives and useful insights are all highly appreciated. (answers in any language are fine: it's about the algorithm)
解决方案
New (very) robust algorithm that uses the standard deviation
I have constructed a very well performing algorithm that signals when the data points are a specified number of standard deviations away from the moving mean. However, when a signal is detected, subsequent data points that are also a signal (so significantly away from the moving mean), will not corrupt the signal threshold. That is, the algorithm creates a 'new mean' and 'new st.dev.' in which the data points that are signals are not used. Therefore, the threshold remains uncorrupted and is able to correctly identify future signals too, without loss of performance. This works extremely well!
In order to display the power of this robust algorithm, I have prepared a demo in which the user can specify its own data. This little demo displays both how the algorithm works and why it is so useful.
The full working Matlab code for this demo:
function [] = RobustDetectionDemo()
%% SPECIFICATIONS
LAG = 10; % lag for the moving mean and moving st. dev.
DIFF = 3.5; % number of st. dev. from the mean to signal
INFLUENCE = 0.0; % when signal: how much is mean/st.dev. influenced?
% or e.g. 0.05/0.1 for influencing
DIRECTION = 'both'; % signal when 'up'/'down'/'both' from the mean
%%
figure(1);
subplot(2,2,1);
title('Draw 30 data points');
ylim([0 5]); xlim([0 50]);
[x,y] = ginputExtra_realtime(30, true, LAG, DIFF, INFLUENCE, DIRECTION);
end
function [x y] = ginputExtra_realtime(n,booText, LAG, DIFF, INFLUENCE, DIRECTION)
if booText == true
bText = booText;
else
bText = false;
end
H = gca;
set(H, 'YLimMode', 'manual'); set(H, 'XLimMode', 'manual');
set(H, 'YLim', get(H,'YLim')); set(H, 'XLim', get(H,'XLim'));
numPoints = n; xg = []; yg = [];
for i=1:numPoints
[xi yi] = ginput(1);
xg = [xg xi]; yg = [yg yi];
if i == 1
hold on;
plot(H, xg(i),yg(i),'ro');
if bText text(xg(i),yg(i),num2str(i),'FontSize',12); end
else
plot(xg([i-1:i]),yg([i-1:i]),'r');
if bText text(xg(i),yg(i),num2str(i),'FontSize',12); end
if length(xg) > LAG
robustMA(xg, yg, LAG, DIFF, INFLUENCE, DIRECTION);
end
end
end
hold off;
x = xg; y = yg;
end
function [] = robustMA( x, y, lag, diff, influence, direction)
% robustMA :: Signal detection algorithm ::
% Author: Jean-Paul van Brakel
% ************************************************************ %
% TO BE USED FOR: *determining significant and sudden changes*
% ************************************************************ %
% x = x-axis data
% y = y-axis data
% lag = lag of moving mean and moving st.dev.
% diff = number of st.dev. away from the mean in order to give a signal
% influence = number between 0 and 1 that indicates influence of signals
% direction = 'up'/'down'/'both' which means the following:
% - 'up' : only signal for deviations ABOVE the mean
% - 'down': only signal for deviations BELOW the mean
% - 'both': signal for deviations ABOVE and BELOW the mean
p = y;
outputmean = tsmovavg(y,'s',lag,2);
outputstdev = movingstd(y,lag,'backward');
newMean = zeros(1, length(outputmean));
newStdev = zeros(1, length(outputmean));
signals = ones(1, length(outputmean));
newMean(lag-1) = outputmean(lag);
newStdev(lag-1) = outputstdev(lag);
for i = lag:length(outputmean)
if strcmp(direction, 'up')
if (p(i) > newMean(i-1)+diff*newStdev(i-1))
newMean(i) = (newMean(i-1) + influence*p(i))/(1+influence);
newStdev(i) = (newStdev(i-1) + influence*sqrt((p(i)-newMean(i-1))^2))/(1+influence);
signals(i) = 2;
else
newMean(i) = (newMean(i-1)+p(i))/2;
newStdev(i) = (newStdev(i-1) + sqrt((p(i)-newMean(i-1))^2))/2;
signals(i) = 1;
end
elseif strcmp(direction, 'down')
if (p(i) < newMean(i-1)-diff*newStdev(i-1))
newMean(i) = (newMean(i-1) + influence*p(i))/(1+influence);
newStdev(i) = (newStdev(i-1) + influence*sqrt((p(i)-newMean(i-1))^2))/(1+influence);
signals(i) = 2;
else
newMean(i) = (newMean(i-1)+p(i))/2;
newStdev(i) = (newStdev(i-1) + sqrt((p(i)-newMean(i-1))^2))/2;
signals(i) = 1;
end
elseif strcmp(direction, 'both')
if (p(i) > newMean(i-1)+diff*newStdev(i-1) || ...
p(i) < newMean(i-1)-diff*newStdev(i-1))
newMean(i) = (newMean(i-1) + influence*p(i))/(1+influence);
newStdev(i) = (newStdev(i-1) + influence*sqrt((p(i)-newMean(i-1))^2))/(1+influence);
signals(i) = 2;
else
newMean(i) = (newMean(i-1)+p(i))/2;
newStdev(i) = (newStdev(i-1) + sqrt((p(i)-newMean(i-1))^2))/2;
signals(i) = 1;
end
end
end
figure(1);
subplot(2,2,2);
hold on;
title('Algorithm output');
area(x, newMean+diff*newStdev, 'FaceColor', [0.9 0.9 0.9], 'EdgeColor', 'none');
area(x, newMean, 'FaceColor', [1 1 1], 'EdgeColor', 'none');
area(x, newMean, 'FaceColor', [0.9 0.9 0.9], 'EdgeColor', 'none');
area(x, newMean-diff*newStdev, 'FaceColor', [1 1 1], 'EdgeColor', 'none');
plot(x, p, ':r', 'LineWidth', 1, 'Color', 'black');
plot(x, newMean, 'LineWidth', 2, 'Color', 'red');
plot(x, newMean+newStdev, 'LineWidth', 2, 'Color', 'green');
plot(x, newMean-newStdev, 'LineWidth', 2, 'Color', 'green');
xlim([0 50]); ylim([0 5])
hold off;
subplot(2,2,4);
hold on;
title('Signal output');
stairs(x, signals, 'LineWidth', 2, 'Color', 'blue');
ylim([0 3]); xlim([0 50]);
hold off;
end
function s = movingstd(x,k,windowmode)
% movingstd: efficient windowed standard deviation of a time series
% usage: s = movingstd(x,k,windowmode)
%
% Movingstd uses filter to compute the standard deviation, using
% the trick of std = sqrt((sum(x.^2) - n*xbar.^2)/(n-1)).
% Beware that this formula can suffer from numerical problems for
% data which is large in magnitude.
% check for a windowmode
if (nargin<3) || isempty(windowmode)
% supply the default:
windowmode = 'central';
elseif ~ischar(windowmode)
error 'If supplied, windowmode must be a character flag.'
end
% check for a valid shortening.
valid = {'central' 'forward' 'backward'};
windowmode = lower(windowmode);
ind = strmatch(windowmode,valid);
if isempty(ind)
error 'Windowmode must be a character flag: ''c'', ''b'', or ''f''.'
else
windowmode = valid{ind};
end
% length of the time series
n = length(x);
% check for valid k
if (nargin<2) || isempty(k) || (rem(k,1)~=0)
error 'k was not provided or not an integer.'
end
switch windowmode
case 'central'
if k<1
error 'k must be at least 1 for windowmode = ''central''.'
end
if n
error 'k is too large for this short of a series and this windowmode.'
end
otherwise
if k<2
error 'k must be at least 2 for windowmode = ''forward'' or ''backward''.'
end
if (n
error 'k is too large for this short of a series.'
end
end
% Improve the numerical analysis by subtracting off the series mean
% this has no effect on the standard deviation.
x = x - mean(x);
% we will need the squared elements
x2 = x.^2;
% split into the three windowmode cases for simplicity
A = 1;
switch windowmode
case 'central'
B = ones(1,2*k+1);
s = sqrt((filter(B,A,x2) - (filter(B,A,x).^2)*(1/(2*k+1)))/(2*k));
s(k:(n-k)) = s((2*k):end);
case 'forward'
B = ones(1,k);
s = sqrt((filter(B,A,x2) - (filter(B,A,x).^2)*(1/k))/(k-1));
s(1:(n-k+1)) = s(k:end);
case 'backward'
B = ones(1,k);
s = sqrt((filter(B,A,x2) - (filter(B,A,x).^2)*(1/k))/(k-1));
end
% special case the ends as appropriate
switch windowmode
case 'central'
% repairs are needed at both ends
for i = 1:k
s(i) = std(x(1:(k+i)));
s(n-k+i) = std(x((n-2*k+i):n));
end
case 'forward'
% the last k elements must be repaired
for i = (k-1):-1:1
s(n-i+1) = std(x((n-i+1):n));
end
case 'backward'
% the first k elements must be repaired
for i = 1:(k-1)
s(i) = std(x(1:i));
end
end
end
The necessary parameters are:
LAG: lag for the moving mean and moving st. dev.
DIFF: number of st. dev. away from the mean to generate a signal
INFLUENCE: when there is a signal, how much is mean/st.dev. influenced? (number between 0-1)
DIRECTION: signal when deviation is 'up'/'down'/'both' away from the mean?
As you can see, I used the settings LAG=10; DIFF=3.5; INFLUENCE=0; for this demo. Feel free to fiddle around with these parameters and study the differences in performance of the algorithm.
Original explanation
I have constructed a new (very well performing) algorithm in which the number of standard deviations away from the mean is used as threshold.
The basic idea behind it is the following:
if (price(t) deviates more than mean(t-1) + k * std(t-1) )
{
// construct signal
new Mean : mean(t) = (mean(t-1)+influence*price(t))/(1+influence);
new Std : std(t) = (std(t-1)+influence*sqrt((price(t)-newMean(t-1))^2))/(1+influence)
} else
{
new Mean : mean(t) = (mean(t-1) + price(t)) / 2
new Std : std(t) = std(t-1) + sqrt((price(t) - mean(t-1))^2) /2
}
This small algorithm performs surprisingly well!
Especially because it doesn't use the real mean and standard deviation but construct a new one such that signals do not influence/ corrupt the signal threshold.
Advantages
Performs surprisingly well (no matter the length of time)
Very robust: adjusts for noise but still remains objective towards high peaks
Only needs 2 input parameters
Allows for different moving average structures (Simple, Exponential, Weighted etc.)
Disadvantages
Need to estimate the 2 input parameters, need to specify the influence parameter
For small general variation in the noise, the algorithm is not very useful
Examples for the sample data:
lag = 30, diff = 3, influence = 0
lag = 30, diff = 3.5, influence = 0
lag = 30, diff = 5, influence = 0
Code
Code for replicating in Matlab:
p = [1 1 1.1 1 0.9 1 1 1.1 1 0.9 1 1.1 1 1 0.9 1 1 1.1 1 1,...
1 1 1.1 0.9 1 1.1 1 1 0.9 1 1.1 1 1 1.1 1 0.8 0.9 1 1.2 0.9 1,...
1 1.1 1.2 1 1.5 1 3 2 5 3 2 1 1 1 0.9 1,...
1 3 2.6 4 3 3.2 2 1 1 0.8 4 4 2 2.5 1 1 1];
% SPECS
lag = 30; % lag of the algorithm
diff = 3.5; % number of standard deviations from the mean
influence = 0; % influence parameter (convergence to signals)
% SPECS
outputmean = tsmovavg(p,'s',lag,2);
outputstdev = movingstd(p,lag,'backward');
newMean = zeros(1, length(outputmean));
newStdev = zeros(1, length(outputmean));
signals = zeros(1, length(outputmean));
newMean(lag-1) = outputmean(lag);
newStdev(lag-1) = outputstdev(lag);
for i=lag:length(outputmean)
if (p(i) > newMean(i-1)+diff*newStdev(i-1))
newMean(i) = (newMean(i-1) + influence*p(i))/(1+influence);
newStdev(i) = (newStdev(i-1) + influence*sqrt((p(i)-newMean(i-1))^2))/(1+influence);
signals(i) = 0.3;
else
newMean(i) = (newMean(i-1)+p(i))/2;
newStdev(i) = (newStdev(i-1) + sqrt((p(i)-newMean(i-1))^2))/2;
signals(i) = 0;
end
end
figure;
hold all;
plot(p, ':r', 'LineWidth', 1, 'Color', 'black');
plot(signals, 'LineWidth', 2, 'Color', 'blue');
plot(newMean, 'LineWidth', 2, 'Color', 'red');
plot(newMean+newStdev, 'LineWidth', 2, 'Color', 'green');
Extra function movingstd: code