Study notes for Anomaly Detection

Felix_夜雨

Category: Machine Learning. Tags: machine learning, study notes

Article link: https://blog.csdn.net/u010693617/article/details/9131023

Introduction

Assume we have a data set containing a large number of normal examples and only a few anomalies. To detect these anomalies, we train a probabilistic model that decides whether a given example is anomalous or not.
- $p(x)$ is the model's estimate of how likely the example x is under normal behaviour; a low value means the example is unusual.
- if $p(x_{test})< \epsilon \rightarrow$ flag this as an anomaly
- if $p(x_{test})\ge \epsilon \rightarrow$ this is OK.
- where $\epsilon$ is a threshold, chosen to maximize performance (e.g. the F-measure) on a labeled cross-validation set.
- Note that only normal examples (negative labels) are used to train the model.
- Thus it can be regarded as an unsupervised learning algorithm.
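The threshold choice described above can be sketched as follows. This is a minimal illustration, not a definitive procedure: the function names `f1_score` and `select_threshold`, the grid of 1000 candidate thresholds, and the sample values in `p_cv` and `y_cv` are all assumptions made for the example.

```python
def f1_score(predicted, actual):
    """F1 for the anomaly class; both arguments are lists of 0/1 flags (1 = anomaly)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(p_cv, y_cv, steps=1000):
    """Scan thresholds between min(p_cv) and max(p_cv); keep the first one with the best F1."""
    lo, hi = min(p_cv), max(p_cv)
    best_eps, best_f1 = lo, 0.0
    for k in range(1, steps + 1):
        eps = lo + (hi - lo) * k / steps
        predicted = [1 if p < eps else 0 for p in p_cv]  # flag when p(x) < epsilon
        score = f1_score(predicted, y_cv)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1


# Hypothetical cross-validation densities p(x) and labels (1 = anomaly).
p_cv = [0.20, 0.18, 0.25, 0.001, 0.22, 0.0005, 0.19]
y_cv = [0, 0, 0, 1, 0, 1, 0]
epsilon, f1 = select_threshold(p_cv, y_cv)
```

Only the cross-validation set needs labels here; the model that produces `p_cv` is still trained on normal examples alone.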
Applications
- Fraud detection
  - Some features can be used to describe a user's activity, such as length of online-time, login location, frequency, etc.
  - Identify unusual users by checking anything that looks a bit weird
- Manufacturing
  - Detect if a product looks good or not
- Monitoring computers in data center
  - For a cluster of machines, we can model the behaviour of each machine in terms of memory use, number of disk accesses/sec and CPU load, or define our own composite features such as CPU load/network traffic
  - Detect whether a machine is about to fail or is doing something abnormal.
The differences between anomaly detection and classification (or regression) are:
- The data set contains only a small number (typically 2-50) of anomalous examples. This may not be enough to "learn" a classifier or regressor, which requires a reasonably large number of both positive and negative examples.
- There may be no suitable features (or no consistent pattern) that describe the anomalies. For example, there may be many different "types" of anomalies, and future anomalies may look nothing like the ones seen so far.

The Anomaly Detection Algorithm

Given an unlabeled training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, assume each example is n-dimensional, i.e. has n features.
Calculate the parameters of all features $\mu_1, \ldots, \mu_n, \sigma_1^2, \ldots, \sigma_n^2$ by
$\left\{\begin{array}{lll}\mu_j &=& \frac{1}{m}\sum_{i=1}^m x_j^{(i)}\\ \sigma_j^2 &=& \frac{1}{m}\sum_{i=1}^m (x_j^{(i)}-\mu_j)^2 \end{array}\right.$
Model p(x) as follows:
$p(x)=p(x_1; \mu_1, \sigma_1^2)*p(x_2; \mu_2, \sigma_2^2)*\ldots *p(x_n; \mu_n, \sigma_n^2)=\prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2)$
- It is assumed that each feature is distributed according to a Gaussian distribution
- The features are assumed to be independent of one another, so there is no correlation between any two features.
- Note that the algorithm may still work in practice if the features are correlated.
- In addition, dimensionality reduction (e.g. PCA) can be used to decorrelate the features.
Compute p(x) for a test example x and flag it as an anomaly if $p(x)<\epsilon$
Performance measure: precision/recall, F-measure
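The fitting and scoring steps of this section can be sketched in plain Python as follows. The helper names (`fit_gaussians`, `gaussian_pdf`, `density`), the training data and the value of `epsilon` are assumptions made for illustration; in practice a numerical library would normally be used, and the threshold would be chosen on a cross-validation set.

```python
import math


def fit_gaussians(X):
    """Return per-feature means and variances (1/m normalisation) for a list of rows."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    var = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, var


def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def density(x, mu, var):
    """p(x) as the product of the univariate densities of each feature."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= gaussian_pdf(xj, mj, vj)
    return p


# Hypothetical training data: five normal examples with two features each.
X_train = [[1.0, 10.0], [1.2, 9.5], [0.9, 10.4], [1.1, 10.1], [0.8, 10.0]]
mu, var = fit_gaussians(X_train)

epsilon = 1e-3  # illustrative threshold only
p_normal = density([1.0, 10.0], mu, var)  # close to the training data
p_odd = density([3.0, 6.0], mu, var)      # far from the training data
is_anomaly = p_odd < epsilon
```

For many features the product of densities can underflow, so a common variant sums log-densities instead and compares against $\log\epsilon$.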

Choose Features to Use

Plot a histogram of the data to check whether it looks Gaussian, although the algorithm may still work if the data is non-Gaussian.
Non-Gaussian data may look like this:

We can apply a transformation to make the data look more Gaussian; for example, a log(x) transformation gives:

Hence, we can use log(x) rather than x as a feature. Other possible transformations include log(x+c) or x^0.5.
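As a rough numerical check to go with the histogram, one can compare the skewness of a feature before and after a transformation; a value closer to zero suggests a more symmetric distribution. The sketch below uses made-up data and a hypothetical helper `skewness`; it illustrates the idea and is not a test of Gaussianity.

```python
import math


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    m = len(values)
    mean = sum(values) / m
    m2 = sum((v - mean) ** 2 for v in values) / m
    m3 = sum((v - mean) ** 3 for v in values) / m
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature with a long tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40, 120]
logged = [math.log(v) for v in raw]  # log(x); use log(x + c) if zeros can occur
rooted = [v ** 0.5 for v in raw]     # x^0.5

skew_raw = skewness(raw)
skew_log = skewness(logged)
skew_root = skewness(rooted)
```

For this sample the log transform reduces the skew more than the square root does, which is the kind of comparison that helps pick a transformation.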
Use error analysis: examine the examples on which p(x) gives the wrong answer, and come up with new features that account for those errors. For example, the new feature CPU load/network traffic may be useful.

Anomaly Detection with Multivariate Gaussian Distribution

The new model based on multivariate Gaussian distribution is as follows:
$p(x;\mu, \Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
It can be written as $x\sim \mathcal{N}(\mu, \Sigma)$, or, making the dimension n explicit, $x\sim \mathcal{N}_n(\mu, \Sigma)$.
The new parameters:
- $\mu$ is an n-dimensional mean vector, where n is the number of features.
- $\Sigma$ is an [n x n] covariance matrix, which includes the correlations between different features.
- $|\Sigma|$ is the determinant of $\Sigma$. It can be computed in Matlab using det(sigma).
Calculate the parameters of features:
$\left\{\begin{array}{lll}\mu=[E[x_1], E[x_2], \ldots, E[x_n]]^T & =&\frac{1}{m}\sum_{i=1}^m x^{(i)} \\ \Sigma = [\mathrm{Cov}[x_j, x_k]] &= & \frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu)(x^{(i)}-\mu)^T\end{array}\right.$
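The two estimates above translate directly into code. The following is a minimal pure-Python sketch; the function names `mean_vector` and `covariance_matrix` and the sample data are assumptions for illustration, and a matrix library would normally be used instead of nested lists.

```python
def mean_vector(X):
    """mu = (1/m) * sum of all examples, as a list of n feature means."""
    m, n = len(X), len(X[0])
    return [sum(row[j] for row in X) / m for j in range(n)]


def covariance_matrix(X, mu):
    """Sigma = (1/m) * sum over examples of (x - mu)(x - mu)^T, as an n x n list of lists."""
    m, n = len(X), len(mu)
    sigma = [[0.0] * n for _ in range(n)]
    for row in X:
        d = [row[j] - mu[j] for j in range(n)]
        for a in range(n):
            for b in range(n):
                sigma[a][b] += d[a] * d[b] / m
    return sigma


# Hypothetical data: the second feature grows with the first.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
mu = mean_vector(X)
sigma = covariance_matrix(X, mu)
```

The diagonal entries of `sigma` are the per-feature variances $\sigma_j^2$ of the simple model; the positive off-diagonal entry records that the two features move together.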

Multivariate Gaussian Distribution

If the features are not independent but correlated with each other to some extent, the simple anomaly detection algorithm may fail to work.
For example, assume the data set is shown as follows:

In this case, the test example may be regarded as "normal" by the previous anomaly detection algorithm (its regions of equal probability are the circles). This is because that model assigns probability in concentric circles (more generally, axis-aligned ellipses) around the means of the two features. However, data inside the blue ellipse are more likely to be normal, and the test example is far from that region.
The multivariate Gaussian distribution can be represented by:
- It can be seen that the final example gives a very tall thin distribution, showing a strong positive correlation.
- We can also make the off-diagonal values negative to show a negative correlation.
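The effect of the off-diagonal entries can be checked numerically. The sketch below is restricted to two features so that the determinant and inverse have a closed form; the function name `density_2d`, the covariance values and the two test points are assumptions chosen for illustration.

```python
import math


def density_2d(x, mu, sigma):
    """Bivariate Gaussian density using the closed-form 2x2 determinant and inverse."""
    a, b = sigma[0]
    c, d = sigma[1]
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T Sigma^{-1} (x - mu), with Sigma^{-1} = [[d, -b], [-c, a]] / det
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


mu = [0.0, 0.0]
full = [[1.0, 0.9], [0.9, 1.0]]      # hypothetical strong positive correlation
diagonal = [[1.0, 0.0], [0.0, 1.0]]  # same variances, correlation ignored

on_trend = [1.5, 1.5]    # follows the correlation
off_trend = [1.5, -1.5]  # same distance from the mean, against the correlation

ratio_full = density_2d(off_trend, mu, full) / density_2d(on_trend, mu, full)
ratio_diag = density_2d(off_trend, mu, diagonal) / density_2d(on_trend, mu, diagonal)
```

With the diagonal covariance both points get the same density, so the off-trend point cannot be singled out; with the full covariance its density is many orders of magnitude lower.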

Simple vs. Multivariate Gaussian

Simple anomaly detection can be regarded as a special case of multivariate anomaly detection when features are independent, i.e., the covariance between different features is zero.
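A short derivation makes this explicit (a standard identity, written with the symbols used above). If $\Sigma=\mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, then $|\Sigma|=\prod_{j=1}^n \sigma_j^2$ and $(x-\mu)^T\Sigma^{-1}(x-\mu)=\sum_{j=1}^n \frac{(x_j-\mu_j)^2}{\sigma_j^2}$, so
$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}\prod_{j=1}^n \sigma_j}\exp\left(-\sum_{j=1}^n \frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)=\prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2)$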
The simple Gaussian model is used more often because
- It is cheaper to compute.
- It scales much better to very large feature vectors.
- It works well even with a small training set.
With the simple Gaussian model, new features that capture correlations between features have to be created manually, which may be difficult in some cases.
The multivariate model captures feature correlations automatically via the covariance matrix. However:
- It is more computationally expensive, since it works with an [n x n] matrix and its inverse.
- It needs m>n, otherwise the covariance matrix is not invertible.
- It needs features that are not redundant (i.e. not linearly dependent), since redundant features also make the covariance matrix non-invertible.
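Both invertibility conditions can be seen on tiny examples by checking whether the determinant of the estimated covariance matrix is zero. This is a sketch with made-up data and hypothetical helper names (`covariance_matrix`, `det2`), limited to two features so the determinant has a closed form.

```python
def covariance_matrix(X):
    """Sigma with 1/m normalisation, as an n x n list of lists."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma = [[0.0] * n for _ in range(n)]
    for row in X:
        d = [row[j] - mu[j] for j in range(n)]
        for a in range(n):
            for b in range(n):
                sigma[a][b] += d[a] * d[b] / m
    return sigma


def det2(s):
    """Determinant of a 2 x 2 matrix."""
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]


# Case 1: m = 2 examples, n = 2 features (m is not greater than n).
too_few = [[1.0, 2.0], [3.0, 5.0]]
# Case 2: enough examples, but the second feature is exactly twice the first.
redundant = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [7.0, 14.0]]
# Case 3: enough examples and no exact linear dependence.
healthy = [[1.0, 2.0], [2.0, 4.5], [4.0, 7.0], [7.0, 15.0]]

det_too_few = det2(covariance_matrix(too_few))
det_redundant = det2(covariance_matrix(redundant))
det_healthy = det2(covariance_matrix(healthy))
```

In the first two cases the determinant is zero (up to rounding), so $\Sigma^{-1}$ does not exist and p(x) cannot be evaluated; only the third case yields a usable model.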

References

Anomaly Detection: http://www.holehouse.org/mlclass/15_Anomaly_Detection.html
