Anomaly detection - Multivariate Gaussian distribution

Abstract: This article is the transcript of video 129, "Multivariate Gaussian Distribution", from Chapter 16, "Anomaly Detection", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it for conciseness and readability, so that it can be consulted later. I'm sharing it here in the hope that it helps others; corrections are welcome and sincerely appreciated.

————————————————

Let's talk about one possible extension to the anomaly detection algorithm: Multivariate Gaussian distribution.

Why we need the multivariate Gaussian distribution

We'll start with an example of monitoring machines in a data center. If we model the two features x_{1} (CPU load) and x_{2} (memory use) with separate Gaussians, they look like the two figures on the right. Let's say we have an example shown as the green cross (x_{1}\approx 0.4; x_{2}\approx 1.5). If we look at the data, shown as red crosses, most of them lie in the blue ellipse region, and the two features grow roughly linearly with each other. For the green cross, however, the CPU load is very low while the memory usage is very high, so it should be flagged as an anomaly.

But the original anomaly detection algorithm would fail to flag the green cross as an anomaly, because p(x_{1})=p(0.4) and p(x_{2})=p(1.5) are both reasonably high. The algorithm does not realize that the blue ellipse marks the high-probability region. Instead, it "sees" the concentric circles and thinks inner circles have higher probability than outer ones, so it assigns the green cross a fairly high probability. In fact, it treats every point on, say, the 2nd circle as having about equal probability, and fails to realize that the probabilities of the two blue crosses are actually very different.
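To make this concrete, here is a minimal Octave sketch of the per-feature model p(x_{1})\cdot p(x_{2}). The means, variances, and the inlier point below are assumptions chosen to roughly match the figure, not values from the lecture:

% Per-feature Gaussian model; all parameter values are illustrative assumptions
mu1 = 0.5;  sigma1 = 0.2;    % assumed marginal for x1 (CPU load)
mu2 = 1.5;  sigma2 = 0.5;    % assumed marginal for x2 (memory use)

% Univariate Gaussian density
p = @(x, mu, sigma) exp(-(x - mu).^2 ./ (2*sigma^2)) / (sqrt(2*pi)*sigma);

p_green  = p(0.4, mu1, sigma1) * p(1.5,  mu2, sigma2)   % anomalous green cross
p_inlier = p(0.5, mu1, sigma1) * p(1.25, mu2, sigma2)   % hypothetical point on the trend

Both products come out nearly equal (about 1.4 with these assumed parameters), so a threshold on p(x_{1})p(x_{2}) cannot separate the green cross from the inliers.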

Definition of multivariate Gaussian distribution

To fix this, we're going to develop a modified version of the anomaly detection algorithm using the multivariate Gaussian (normal) distribution.

Instead of modeling p(x_{1}), p(x_{2}), ..., separately, we're going to model p(x) all in one go. Here is the formula:

p(x;\mu ,\Sigma )=\frac{1}{(2\pi)^{n/2}\left | \Sigma \right |^{1/2}}\exp\left ( -\frac{1}{2}(x-\mu )^{T}\Sigma ^{-1}(x-\mu ) \right )

where \mu \in \mathbb{R}^{n} is the mean vector, \Sigma \in \mathbb{R}^{n\times n} is the covariance matrix, and \left | \Sigma \right | is the determinant of \Sigma. In Octave, you can compute it with det(Sigma).
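For example, a quick check in Octave (the matrix here is just an illustrative one):

Sigma = [1, 0.5; 0.5, 1];
det(Sigma)          % 1*1 - 0.5*0.5 = 0.75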

Examples

Then what does this p(x;\mu ,\Sigma ) look like?

Example-1

Suppose we have:

Two features x_{1}, x_{2}, thus n=2

\mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix} (identity matrix)

The p(x) will look like figure-1.a, with its highest value at about x_{1}=x_{2}=0. Down below is a contour plot.

The following is an experiment I did; just copy & paste the code into Octave to take a look. Its plot looks like figure-1.b.

function p = multivariate_normal(x, n, mean, covariance)
    % Density of the multivariate Gaussian p(x; mu, Sigma)
    x_m = x - mean;
    p = 1 / ((2*pi)^(n/2) * det(covariance)^0.5) * exp(-0.5 * x_m' * inv(covariance) * x_m);
endfunction

n = 2;
u = [0; 0];
covariance = [1, 0; 0, 1];

% Evaluate p(x) on a 100x100 grid over [-4, 4] x [-4, 4]
tx1 = tx2 = linspace(-4, 4, 100);
pplot = zeros(length(tx2), length(tx1));

for i = 1:length(tx1)
    for j = 1:length(tx2)
        x = [tx1(i); tx2(j)];
        pplot(j, i) = multivariate_normal(x, n, u, covariance);
    end
end

subplot(2,2,1)
surf(tx1, tx2, pplot)        % 3D surface of the density
axis([-4 4 -4 4 0 0.4])

subplot(2,2,3)
contourf(tx1, tx2, pplot)    % filled contour plot
axis([-4 4 -4 4])

Example-2

Let's check more examples by varying some of the parameters.

Suppose we change \Sigma as follows:

Two features x_{1}, x_{2}, thus n=2

\mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 0.6 & 0\\ 0 & 0.6 \end{bmatrix}

Then it looks like figure-2.a. By shrinking \Sigma, the width of the bump diminishes and the height increases a bit, because the volume under the surface must integrate to 1. Similarly, figure-2.b is the plot produced by my code above in Octave.
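As a quick sanity check on the peak height (a worked calculation, not from the lecture): at x=\mu the exponent vanishes, so for n=2

p(\mu ;\mu ,\Sigma )=\frac{1}{2\pi \left | \Sigma \right |^{1/2}},\qquad \left | 0.6I \right |=0.36\Rightarrow p_{\max }=\frac{1}{2\pi \cdot 0.6}\approx 0.265>\frac{1}{2\pi }\approx 0.159

which matches the taller bump in figure-2.a.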

Example-3

Suppose we change \Sigma as follows:

Two features x_{1}, x_{2}, thus n=2

\mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 2 & 0\\ 0 & 2 \end{bmatrix}

Then you'll get a much wider and flatter Gaussian, as shown in figures 3.a, 3.b & 3.c. Again, 3.b is the output of my Octave code above.

Example-4

Suppose we change just one of the diagonal elements of \Sigma:

Two features x_{1}, x_{2}, thus n=2

\mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 0.6 & 0\\ 0 & 1\end{bmatrix}

This reduces the variance of feature x_{1}. The plot looks like figure-4.a, and 4.b is the output of my Octave code above.

Example-5

Suppose we change \Sigma as follows:

Two features x_{1}, x_{2}, thus n=2

\mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 2 & 0\\ 0 & 1\end{bmatrix}

The plot looks like figures 5.a & 5.b. The distribution falls off more slowly as x_{1} moves away from 0, and falls off very rapidly as x_{2} moves away from 0. Figure-5.b is the output of my Octave code above.
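Incidentally, here is why the contours in examples 1–5 are all axis-aligned ellipses (a short derivation, not from the lecture): for a diagonal \Sigma =\mathrm{diag}(\sigma _{1}^{2},\sigma _{2}^{2}), the exponent reduces to

-\frac{1}{2}(x-\mu )^{T}\Sigma ^{-1}(x-\mu )=-\frac{1}{2}\left ( \frac{(x_{1}-\mu _{1})^{2}}{\sigma _{1}^{2}}+\frac{(x_{2}-\mu _{2})^{2}}{\sigma _{2}^{2}} \right )

so each contour p(x)=c is an ellipse whose axes align with x_{1} and x_{2}, stretched along whichever feature has the larger variance.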

Example-6

One of the cool things about the multivariate Gaussian distribution is that you can also use it to model correlations between the data; that is, we can use it to model the fact that x_{1} and x_{2} tend to be highly correlated with each other. Let's change the off-diagonal entries of the covariance matrix. Figure-6.a is for \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix}. And figure-6.b is for \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 1 & 0.5\\ 0.5 & 1\end{bmatrix}.

The above plots can be generated with the following code:

	function p = multivariate_normal(x, n, mean, covariance)
		% Density of the multivariate Gaussian p(x; mu, Sigma)
		x_m = x - mean;
		p = 1 / ((2*pi)^(n/2) * det(covariance)^0.5) * exp(-0.5 * x_m' * inv(covariance) * x_m);
	endfunction

	n = 2;
	u = [0; 0];

	% Evaluate p(x) on a 100x100 grid over [-4, 4] x [-4, 4]
	tx1 = tx2 = linspace(-4, 4, 100);
	pplot = zeros(length(tx2), length(tx1));

	% Left column of the figure: no correlation (identity covariance)
	covariance = [1, 0; 0, 1];
	for i = 1:length(tx1)
		for j = 1:length(tx2)
			pplot(j, i) = multivariate_normal([tx1(i); tx2(j)], n, u, covariance);
		end
	end

	subplot(2,2,1)
	surf(tx1, tx2, pplot)
	axis([-4 4 -4 4 0 0.4])

	subplot(2,2,3)
	contourf(tx1, tx2, pplot)
	axis([-4 4 -4 4])

	% Right column: positive correlation between x1 and x2
	covariance = [1, 0.5; 0.5, 1];
	for i = 1:length(tx1)
		for j = 1:length(tx2)
			pplot(j, i) = multivariate_normal([tx1(i); tx2(j)], n, u, covariance);
		end
	end

	subplot(2,2,2)
	surf(tx1, tx2, pplot)
	axis([-4 4 -4 4 0 0.4])

	subplot(2,2,4)
	contourf(tx1, tx2, pplot)
	axis([-4 4 -4 4])

Example-7

Figure-7.a is for \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix}. And figure-7.b is for \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 1 & 0.8\\ 0.8 & 1\end{bmatrix}. Comparing with example-6, as we increase the off-diagonal entries from 0.5 to 0.8, we get a distribution that is more and more thinly peaked along the line x_{1}=x_{2}.

Example-8

If we set the off-diagonal entries to negative values, as shown in figure-8.b where \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}  \Sigma =\begin{bmatrix} 1 & -0.5\\ -0.5 & 1\end{bmatrix}, then most of the probability lies around the line x_{1}=-x_{2}; that is, the two features are negatively correlated. Figure-8.a is for \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix} for comparison.

Similarly, for \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix} and \Sigma =\begin{bmatrix} 1 & -0.8\\ -0.8 & 1\end{bmatrix}, the distribution becomes even more thinly peaked along the line x_{1}=-x_{2}.
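To reproduce these negatively correlated plots, re-use the function and grid from the script above and just swap in the new covariance matrix:

covariance = [1, -0.8; -0.8, 1];   % strong negative correlation
for i = 1:length(tx1)
    for j = 1:length(tx2)
        pplot(j, i) = multivariate_normal([tx1(i); tx2(j)], n, u, covariance);
    end
end
surf(tx1, tx2, pplot)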

Example-9

So far we have been changing \Sigma; now let's change \mu.

Figure-10.a is for \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}   \Sigma =\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix}; its distribution is centered at x_{1}=x_{2}=0. If we vary the mean to \mu =\begin{bmatrix} 0\\ 0.5 \end{bmatrix}, the peak of the distribution moves accordingly, as in figure-10.b.

If \mu =\begin{bmatrix} 1.5\\ -0.5 \end{bmatrix}, it looks like figure-11.b. Again, figure-11.a is just for comparison, with \mu =\begin{bmatrix} 0\\ 0 \end{bmatrix}.
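In the scripts above, shifting the mean is the same kind of one-line change; set the new mean, then re-run the evaluation loop and the plotting commands:

u = [1.5; -0.5];   % the peak of p(x) moves to (x1, x2) = (1.5, -0.5)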

Hopefully looking at all these pictures gives you a sense of the sorts of probability distributions the multivariate Gaussian distribution allows you to capture. Its key advantage is that it lets you capture cases where you'd expect two different features to be positively or negatively correlated.

<end>