Abstract: This article is the transcript of video 129, "Multivariate Gaussian Distribution", from Chapter 16, "Anomaly Detection", of Andrew Ng's Machine Learning course. I transcribed it while watching the videos and edited it for concision and readability, so that it can be consulted later; I am sharing it here. If there are mistakes, corrections are welcome and sincerely appreciated. I hope it also helps others with their study.
————————————————
Let's talk about one possible extension to the anomaly detection algorithm: Multivariate Gaussian distribution.
Why we need multivariate Gaussian distribution
We'll start with an example of monitoring machines in a data center. If we model the two features $x_1$ (CPU load) and $x_2$ (memory use) each with its own Gaussian, they look like the two figures on the right. Say we have an example shown as the green cross. Looking at the data, shown as red crosses, most of it lies in the blue ellipse region, and the two features grow roughly linearly with each other. For the green cross, however, the CPU load is very low while the memory use is very high. So the green cross should be flagged as an anomaly.
But this anomaly detection algorithm would fail to flag the green cross as an anomaly: both $p(x_1)$ and $p(x_2)$ are reasonably high. It turns out the algorithm does not realize that the blue ellipse marks the high-probability region. Instead, it "sees" concentric circles and thinks inner circles have higher probability than outer ones, so it assigns the green cross fairly high probability. It also tends to think that every point on, say, the 2nd circle has about equal probability, and doesn't realize that the probabilities of the two blue crosses are actually very different.
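To make this failure mode concrete, here is a small stdlib-only Python sketch (the feature values and the covariance matrix are made-up toy numbers, not from the lecture): a model that multiplies independent per-feature Gaussians assigns the off-axis point the same density as a typical on-axis point, while a multivariate model with correlated features penalizes it heavily.

```python
import math

def gauss1d(x, mu, sigma2):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gauss2d(x, mu, Sigma):
    """Bivariate Gaussian density, using the closed-form 2x2 determinant/inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1]) + \
        dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1])
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

anomaly = [-1.0, 1.0]    # low CPU load, high memory use (like the green cross)
typical = [1.0, 1.0]     # both high together: consistent with the red crosses

# Independent model: product of per-feature densities, no correlation.
p_indep_anomaly = gauss1d(anomaly[0], 0, 1) * gauss1d(anomaly[1], 0, 1)
p_indep_typical = gauss1d(typical[0], 0, 1) * gauss1d(typical[1], 0, 1)

# Multivariate model with strong positive correlation between the features.
Sigma = [[1.0, 0.8], [0.8, 1.0]]
p_multi_anomaly = gauss2d(anomaly, [0, 0], Sigma)
p_multi_typical = gauss2d(typical, [0, 0], Sigma)

# The independent model rates both points equally (by symmetry), while the
# correlated model gives the anomalous point a much lower density.
print(p_indep_anomaly, p_indep_typical)
print(p_multi_anomaly, p_multi_typical)
```

With these toy numbers the correlated model assigns the anomaly roughly 1% of the typical point's density, while the independent model cannot tell them apart.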
Definition of multivariate Gaussian distribution
To fix this, we're going to develop a modified version of the anomaly detection algorithm using the multivariate Gaussian (normal) distribution. We have:
- features $x \in \mathbb{R}^n$
- parameters $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix)
Instead of modeling $p(x_1), p(x_2), \dots, p(x_n)$ separately, we're going to model $p(x)$ all in one go. And here comes the formula:
$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Where $|\Sigma|$ is called the determinant of $\Sigma$. In Octave, you can compute it with det(Sigma).
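As a sanity check on the density formula, here is a stdlib-only Python sketch for the $n = 2$ case (not from the lecture; the lecture's own code is in Octave below). It evaluates the formula with the 2x2 determinant and inverse written out by hand, then verifies the two properties a density must have: its peak for $\Sigma = I$ is $\frac{1}{2\pi}$, and it integrates to 1.

```python
import math

def p_mvn2(x, mu, Sigma):
    """p(x; mu, Sigma) for n = 2, using the closed-form 2x2 determinant/inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1]) + \
        dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1])
    return math.exp(-0.5 * q) / ((2 * math.pi) ** (2 / 2) * math.sqrt(det))

mu, Sigma = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
peak = p_mvn2(mu, mu, Sigma)
print(round(peak, 4))            # 0.1592, i.e. 1/(2*pi), at x = mu

# Midpoint-rule check that the density integrates to ~1 over [-6, 6]^2.
h, total = 0.1, 0.0
steps = int(12 / h)
for i in range(steps):
    for j in range(steps):
        x = [-6 + (i + 0.5) * h, -6 + (j + 0.5) * h]
        total += p_mvn2(x, mu, Sigma) * h * h
print(round(total, 3))           # ~1.0
```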
Examples
What does this look like?
Example-1
Suppose we have:
Two features $x_1$, $x_2$, thus $n = 2$
$\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ (identity matrix)
Then $p(x; \mu, \Sigma)$ will look like figure-1.a, with its highest value of about $\frac{1}{2\pi} \approx 0.159$ at $x = \mu$. Down below is a contour plot.
Following is an experiment I did. Just copy and paste the code into Octave to take a look. Its plot looks like figure-1.b.
1;  # mark this as a script file so the function definition below is allowed

function p = multivariate_normal(x, n, mu, covariance)
  # Density of the n-dimensional Gaussian at column vector x.
  x_m = x - mu;
  p = 1 / ((2*pi)^(n/2) * det(covariance)^0.5) ...
      * e^(-0.5 * x_m' * inv(covariance) * x_m);
endfunction

n = 2;
u = [0; 0];
covariance = [1, 0; 0, 1];
tx1 = tx2 = linspace(-4, 4, 100);
pplot = zeros(length(tx2), length(tx1));  # preallocate the density grid
for i = 1:length(tx1)
  for j = 1:length(tx2)
    x = [tx1(i); tx2(j)];
    pplot(j, i) = multivariate_normal(x, n, u, covariance);
  end
end

subplot(2, 2, 1)
surf(tx1, tx2, pplot)       # meshc(tx1, tx2, pplot) also works
axis([-4 4 -4 4 0 0.4])
subplot(2, 2, 3)
contourf(tx1, tx2, pplot)
axis([-4 4 -4 4])
Example-2
Let's check more examples by varying some of the parameters.
Suppose we change $\Sigma$ as follows:
Two features $x_1$, $x_2$, thus $n = 2$
$\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 0.6 & 0 \\ 0 & 0.6 \end{bmatrix}$
Then it looks like figure-2.a. By shrinking $\Sigma$, the width of the bump diminishes and its height increases a bit, because the total volume under the surface must remain equal to 1. Similarly, figure-2.b is the plot produced by my Octave code above.
Example-3
Suppose we change $\Sigma$ as follows:
Two features $x_1$, $x_2$, thus $n = 2$
$\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$
Then you'll get a much wider and much flatter Gaussian, as in figures 3.a, 3.b & 3.c. Again, 3.b is the output of my Octave code above.
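The height changes in these examples follow directly from the normalization constant: for $\Sigma = sI$ with $n = 2$ we have $|\Sigma| = s^2$, so the peak value is $\frac{1}{2\pi\sqrt{|\Sigma|}} = \frac{1}{2\pi s}$. A quick stdlib-Python check (a sketch of my own, not from the lecture; the values 1, 0.6 and 2 are illustrative):

```python
import math

def peak_height(s):
    """Peak density of a 2-D Gaussian with Sigma = s * I: 1 / (2*pi*s)."""
    return 1.0 / (2 * math.pi * s)

print(round(peak_height(1.0), 3))  # 0.159: Sigma = I
print(round(peak_height(0.6), 3))  # 0.265: shrunk Sigma, taller and narrower bump
print(round(peak_height(2.0), 3))  # 0.08:  widened Sigma, flatter bump
```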
Example-4
Suppose we change just one of the diagonal elements of $\Sigma$:
Two features $x_1$, $x_2$, thus $n = 2$
$\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 0.6 & 0 \\ 0 & 1 \end{bmatrix}$
This reduces the variance of feature $x_1$. The plot looks like figure-4.a, and 4.b is the output of my Octave code above.
Example-5
Suppose we change $\Sigma$ as follows:
Two features $x_1$, $x_2$, thus $n = 2$
$\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$
The plot looks like figures 5.a & 5.b. The distribution falls off more slowly as $x_1$ moves away from 0, and falls off very rapidly as $x_2$ moves away from 0. Figure-5.b is the output of my Octave code above.
Example-6
One of the cool things about the multivariate Gaussian distribution is that you can also use it to model correlations between the data; that is, we can model the fact that $x_1$ and $x_2$ tend to be highly correlated with each other. Let's change the off-diagonal entries of the covariance matrix. Figure-6.a is for $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, and figure-6.b is for $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$.
The above plots can be generated with the following code:
1;  # mark this as a script file so the function definition below is allowed

function p = multivariate_normal(x, n, mu, covariance)
  # Density of the n-dimensional Gaussian at column vector x.
  x_m = x - mu;
  p = 1 / ((2*pi)^(n/2) * det(covariance)^0.5) ...
      * e^(-0.5 * x_m' * inv(covariance) * x_m);
endfunction

n = 2;
u = [0; 0];
tx1 = tx2 = linspace(-4, 4, 100);
pplot = zeros(length(tx2), length(tx1));  # preallocate the density grid

# Left column of the figure: no correlation.
covariance = [1, 0; 0, 1];
for i = 1:length(tx1)
  for j = 1:length(tx2)
    pplot(j, i) = multivariate_normal([tx1(i); tx2(j)], n, u, covariance);
  end
end
subplot(2, 2, 1)
surf(tx1, tx2, pplot)       # meshc(tx1, tx2, pplot) also works
axis([-4 4 -4 4 0 0.4])
subplot(2, 2, 3)
contourf(tx1, tx2, pplot)
axis([-4 4 -4 4])

# Right column of the figure: positive correlation between x1 and x2.
covariance = [1, 0.5; 0.5, 1];
for i = 1:length(tx1)
  for j = 1:length(tx2)
    pplot(j, i) = multivariate_normal([tx1(i); tx2(j)], n, u, covariance);
  end
end
subplot(2, 2, 2)
surf(tx1, tx2, pplot)
axis([-4 4 -4 4 0 0.4])
subplot(2, 2, 4)
contourf(tx1, tx2, pplot)
axis([-4 4 -4 4])
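The correlation effect can also be seen numerically, without plotting. The stdlib-Python sketch below (my own check, not from the lecture) compares the density at a point on the $x_1 = x_2$ line with a point across it, for growing off-diagonal entries: the larger the correlation, the more the mass concentrates along the line.

```python
import math

def p_mvn2(x, Sigma):
    """Zero-mean bivariate Gaussian density via closed-form 2x2 algebra."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    # Quadratic form x' * inv(Sigma) * x, with the 2x2 inverse expanded.
    q = (d * x[0] * x[0] - (b + c) * x[0] * x[1] + a * x[1] * x[1]) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

on_line, off_line = [1.0, 1.0], [1.0, -1.0]   # along vs. across x1 = x2

ratios = []
for rho in (0.0, 0.5, 0.8):
    Sigma = [[1.0, rho], [rho, 1.0]]
    ratio = p_mvn2(on_line, Sigma) / p_mvn2(off_line, Sigma)
    ratios.append(ratio)
    print(rho, round(ratio, 2))   # ratio grows with rho: 1.0, ~3.79, ~85
```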
Example-7
Figure-7.a is for $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, and figure-7.b is for $\Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$. Compared with example-6, as we increase the off-diagonal entries from 0.5 to 0.8, we get a distribution that is more and more thinly peaked along the $x_1 = x_2$ line.
Example-8
If we set the off-diagonal entries to negative values, as shown in figure-8.b where $\Sigma = \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}$, then most of the probability now lies in the region where $x_1 \approx -x_2$. Figure-8.a is for $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, for comparison.
Similarly, the following is for $\Sigma = \begin{bmatrix} 1 & -0.8 \\ -0.8 & 1 \end{bmatrix}$:
Example-9
So far we have been changing $\Sigma$; now let's change $\mu$.
Figure-10.a is for $\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$; its distribution is centered at $x_1 = 0$, $x_2 = 0$. If we vary the mean to $\mu = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}$, the peak of the distribution moves, as in figure-10.b.
If $\mu = \begin{bmatrix} 1.5 \\ -0.5 \end{bmatrix}$, it looks like figure-11.b. Again, figure-11.a is just for comparison, with $\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.
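A quick stdlib-Python check of this last point (my own sketch, using $\mu$ values like those in this example): shifting $\mu$ slides the bump around but never reshapes it; the density evaluated at $x = \mu$ is the same maximum every time.

```python
import math

def p_mvn2(x, mu, Sigma=((1.0, 0.0), (0.0, 1.0))):
    """Bivariate Gaussian density via closed-form 2x2 determinant/inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    q = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

peaks = []
for mu in ([0.0, 0.0], [0.0, 0.5], [1.5, -0.5]):
    # The peak value at x = mu is independent of mu: always 1/(2*pi) here.
    peaks.append(p_mvn2(mu, mu))
    print(mu, round(peaks[-1], 4))   # always 0.1592
```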
Hopefully looking at all these pictures gives you a sense of the range of probability distributions that the multivariate Gaussian distribution allows you to capture. Its key advantage is that it lets you capture situations where you'd expect two different features to be positively correlated, or maybe negatively correlated.
<end>