15 Anomaly detection
15-1 Problem motivation
Anomaly detection example
Aircraft engine features:
$x_1$ = heat generated
$x_2$ = vibration intensity
$\cdots$
Dataset: $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
New engine: $x_{test}$
Density estimation
Dataset: $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
Is $x_{test}$ anomalous?
Example
Fraud detection:
$x^{(i)}$ = features of user $i$'s activities
Model $p(x)$ from data.
Identify unusual users by checking which have $p(x)<\epsilon$.
Manufacturing
Monitoring computers in a data center
$x^{(i)}$ = features of machine $i$
$x_1$ = memory use, $x_2$ = number of disk accesses/sec, $x_3$ = CPU load, $x_4$ = CPU load/network traffic
15-2 Gaussian distribution
Gaussian (Normal) distribution
Say $x\in\mathbb{R}$. If $x$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$:
$x \sim N(\mu,\sigma^2)$
$p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
The larger $\sigma$ is, the wider and flatter the curve.
Parameter estimation
Dataset: $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$, with $x^{(i)}\in\mathbb{R}$
Maximum-likelihood estimates: $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)^2$
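A minimal numpy sketch of these maximum-likelihood estimates and the Gaussian density (the dataset here is synthetic, chosen only for illustration):

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood estimates of mu and sigma^2 for a 1-D dataset."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()  # note: divides by m, not m - 1
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) for a univariate Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Synthetic dataset (assumed, for illustration): true mu = 5, sigma^2 = 4.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)
mu, sigma2 = fit_gaussian(x)
```

With 10,000 samples the estimates land close to the true $\mu=5$ and $\sigma^2=4$.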
15-3 Algorithm
Density estimation
Training set: $x^{(1)},\cdots,x^{(m)}$
Each example is $x\in\mathbb{R}^n$
$p(x) = p(x_1;\mu_1,\sigma_1^2)\,p(x_2;\mu_2,\sigma_2^2)\cdots p(x_n;\mu_n,\sigma_n^2) = \prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2)$
Anomaly detection algorithm
- Choose features $x_i$ that you think might be indicative of anomalous examples.
- Fit parameters $\mu_1,\cdots,\mu_n,\sigma_1^2,\cdots,\sigma_n^2$:
  $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
  $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)} - \mu_j)^2$
- Given a new example $x$, compute $p(x)$:
  $p(x) = \prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$
- Flag an anomaly if $p(x)<\epsilon$.
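The steps above can be sketched in numpy (the training data, test points, and threshold here are synthetic assumptions, not from the lecture):

```python
import numpy as np

def fit_params(X):
    """Per-feature Gaussian parameters; X has shape (m, n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # maximum-likelihood estimate (divides by m)
    return mu, sigma2

def p(X, mu, sigma2):
    """Product over features of the per-feature Gaussian densities."""
    d = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return d.prod(axis=1)

def detect(X, mu, sigma2, epsilon):
    """True where p(x) < epsilon, i.e. the example is flagged as an anomaly."""
    return p(X, mu, sigma2) < epsilon

# Synthetic training data (assumed, for illustration): two well-behaved features.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(5000, 2))
mu, sigma2 = fit_params(X_train)

X_new = np.array([[0.0, 0.1],    # typical point
                  [8.0, -7.0]])  # far outside the training distribution
flags = detect(X_new, mu, sigma2, epsilon=1e-6)  # flags the second point only
```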
Anomaly detection example
15-4 Developing and evaluating an anomaly detection system
The importance of real-number evaluation
When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating the algorithm with a single real number.
Assume we have some labeled data of anomalous and non-anomalous examples ($y=0$ if normal, $y=1$ if anomalous).
Training set: $x^{(1)}, x^{(2)}, \cdots, x^{(m)}$ (assume these are normal, non-anomalous examples)
Cross validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), \cdots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$
Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), \cdots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$
Aircraft engines motivation example
10000 good (normal) engines
20 flawed engines (anomalous)
Training set: 6000 good engines
CV: 2000 good engines ($y=0$), 10 anomalous ($y=1$)
Test: 2000 good engines ($y=0$), 10 anomalous ($y=1$)
(Alternatively, the same examples are sometimes reused for both the CV and test sets, but this is not recommended.)
Algorithm evaluation
Fit model $p(x)$ on the training set $\{x^{(1)},\cdots,x^{(m)}\}$
On a cross validation/test example $x$, predict
$y=\begin{cases}1 & \text{if } p(x)<\epsilon\ \text{(anomaly)}\\ 0 & \text{if } p(x)\ge\epsilon\ \text{(normal)}\end{cases}$
Possible evaluation metrics:
- True positive, false positive, false negative, true negative
- Precision/Recall
- $F_1$-score
Can also use the cross validation set to choose the parameter $\epsilon$.
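One way to pick $\epsilon$ on the CV set is to scan candidate thresholds and keep the one with the best $F_1$-score. A sketch (the `select_epsilon` helper and the tiny CV arrays below are illustrative assumptions):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds, keeping the one with the best F1 on the CV set.
    p_cv: model densities on CV examples; y_cv: 1 = anomaly, 0 = normal."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue  # precision/recall zero or undefined; skip this threshold
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical CV densities: the two anomalies have much smaller p(x).
p_cv = np.array([1e-8, 2e-8, 0.3, 0.4, 0.5])
y_cv = np.array([1, 1, 0, 0, 0])
eps, f1 = select_epsilon(p_cv, y_cv)
```

Accuracy would be a poor metric here because the classes are heavily skewed, which is why $F_1$ (or precision/recall) is used instead.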
15-5 Anomaly detection vs. supervised learning
15-6 Choosing what features to use
Non-Gaussian features
If a feature's histogram is clearly non-Gaussian, a transformation such as $\log(x)$, $\log(x+c)$, or $x^{1/2}$ can make the data look a bit more Gaussian.
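A small sketch of why a log transform helps, using sample skewness as a rough symmetry check (the synthetic lognormal feature and the `skew` helper are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed feature (synthetic)

def skew(v):
    """Sample skewness (Fisher definition, no bias correction); ~0 for symmetric data."""
    z = (v - v.mean()) / v.std()
    return (z ** 3).mean()

print(skew(x), skew(np.log(x)))  # the log-transformed feature is far closer to 0
```

In practice one would eyeball histograms of $x$ and of the candidate transforms rather than compute a statistic, but the idea is the same.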
Error analysis for anomaly detection
Want $p(x)$ large for normal examples $x$, and $p(x)$ small for anomalous examples.
Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples.
**Monitoring computers in a data center**
Choose features that might take on unusually large or small values in the event of an anomaly
$x_1$ = memory use of computer
$x_2$ = number of disk accesses/sec
$x_3$ = CPU load
$x_4$ = network traffic
A ratio feature such as $x_5 = \frac{\text{CPU load}}{\text{network traffic}}$ can take on an unusually large value for, say, a machine stuck in an infinite loop (high CPU load, little network traffic).
15-7 Multivariate Gaussian distribution
Motivating example: Monitoring machines in a data center
Multivariate Gaussian (Normal) distribution
$x\in\mathbb{R}^n$. Don't model $p(x_1), p(x_2), \cdots$, etc. separately. Model $p(x)$ all in one go.
Parameters: $\mu\in\mathbb{R}^n$, $\Sigma\in\mathbb{R}^{n\times n}$ (covariance matrix)
Multivariate Gaussian (Normal) examples
15-8 Anomaly detection using the multivariate Gaussian distribution
Multivariate Gaussian (Normal) distribution
Parameters $\mu, \Sigma$
$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
Parameter fitting:
Given training set $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$
$\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T$
Anomaly detection with the multivariate Gaussian
- Fit model $p(x)$ by setting
  $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$
  $\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T$
- Given a new example $x$, compute
  $p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
- Flag an anomaly if $p(x)<\epsilon$.
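A numpy sketch of the multivariate version (the correlated synthetic features and test points are assumptions for illustration):

```python
import numpy as np

def fit_multivariate(X):
    """ML estimates of mu and the full covariance matrix Sigma; X is (m, n)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / X.shape[0]
    return mu, Sigma

def multivariate_pdf(X, mu, Sigma):
    """p(x; mu, Sigma) evaluated at each row of X."""
    n = mu.size
    Xc = X - mu
    Sinv = np.linalg.inv(Sigma)
    quad = np.sum(Xc @ Sinv * Xc, axis=1)  # (x - mu)^T Sigma^{-1} (x - mu) per row
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Two strongly correlated features (synthetic, assumed data).
rng = np.random.default_rng(3)
x1 = rng.normal(size=5000)
X_train = np.column_stack([x1, x1 + 0.1 * rng.normal(size=5000)])
mu, Sigma = fit_multivariate(X_train)

# [2, 2] follows the correlation; [2, -2] is normal per-feature but violates it.
p_on, p_off = multivariate_pdf(np.array([[2.0, 2.0], [2.0, -2.0]]), mu, Sigma)
```

The per-feature model of 15-3 would assign these two test points nearly identical densities; capturing the off-diagonal covariance is exactly what lets the multivariate model flag the second one.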
Relationship to original model
Original model: $p(x) = p(x_1;\mu_1,\sigma_1^2)\times p(x_2;\mu_2,\sigma_2^2)\times\cdots\times p(x_n;\mu_n,\sigma_n^2)$
Corresponds to a multivariate Gaussian
$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
where $\Sigma$ is diagonal: $\Sigma=\mathrm{diag}(\sigma_1^2,\cdots,\sigma_n^2)$, i.e. all off-diagonal covariances are zero.