3.1 Finding unusual events
Is a new example $x_{test}$ anomalous? Introduce a model $p(x)$ for the probability of an event occurring, and compare it against a threshold $\varepsilon$:
- $p\left( x_{test} \right) < \varepsilon$: flag as an anomaly
- $p\left( x_{test} \right) \geqslant \varepsilon$: "OK"
3.2 Gaussian / Normal Distribution
The Gaussian distribution:
$$p\left( x \right) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\left( x - \mu \right)^2}{2\sigma^2}}$$
where:
$$\mu = \frac{1}{m}\sum\limits_{i = 1}^m x^{(i)}$$
$$\sigma^2 = \frac{1}{m}\sum\limits_{i = 1}^m \left( x^{(i)} - \mu \right)^2$$
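As a minimal sketch of the two estimates and the density formula, using only the Python standard library: the function names `fit_gaussian` and `gaussian_pdf` and the sample values are assumptions made for illustration, not part of the original notes.

```python
import math


def fit_gaussian(samples):
    """Estimate mu and sigma^2 with the 1/m formulas from this section."""
    m = len(samples)
    mu = sum(samples) / m
    sigma2 = sum((x - mu) ** 2 for x in samples) / m
    return mu, sigma2


def gaussian_pdf(x, mu, sigma2):
    """Density of a univariate Gaussian with mean mu and variance sigma2."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))


# Hypothetical measurements, chosen only for illustration.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = fit_gaussian(samples)
print(mu, sigma2)                    # 5.0 4.0
print(gaussian_pdf(mu, mu, sigma2))  # peak density, 1 / (sqrt(2*pi) * sigma)
```

Note that the variance here divides by $m$, matching the formula above, rather than by $m - 1$ as some statistics libraries do by default.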
3.3 Anomaly detection algorithm (Algorithm)
The probability of an example is computed feature by feature and multiplied together:
$$p\left( x \right) = \prod\limits_{j = 1}^n p\left( x_j;\mu_j,\sigma_j^2 \right)$$
Steps of the anomaly detection algorithm:
- Choose $n$ features $x_j$ that you think might be indicative of anomalous examples.
- Fit the parameters $\mu_j, \sigma_j^2$ for each feature:
$$\mu_j = \frac{1}{m}\sum\limits_{i = 1}^m x_j^{(i)}$$
$$\sigma_j^2 = \frac{1}{m}\sum\limits_{i = 1}^m \left( x_j^{(i)} - \mu_j \right)^2$$
- Given a new example $x$, compute $p(x)$ and flag an anomaly if $p(x) < \varepsilon$:
$$p\left( x \right) = \prod\limits_{j = 1}^n p\left( x_j;\mu_j,\sigma_j^2 \right) = \prod\limits_{j = 1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left( -\frac{\left( x_j - \mu_j \right)^2}{2\sigma_j^2} \right)$$
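The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the helper names (`fit_parameters`, `density`, `is_anomaly`), the toy training rows, and the threshold value are all hypothetical.

```python
import math


def fit_parameters(X):
    """Per-feature mean and variance (1/m estimator) from training rows X."""
    m, n = len(X), len(X[0])
    mus = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2s = [sum((row[j] - mus[j]) ** 2 for row in X) / m for j in range(n)]
    return mus, sigma2s


def density(x, mus, sigma2s):
    """p(x) as the product of the per-feature Gaussian densities."""
    p = 1.0
    for xj, mu, s2 in zip(x, mus, sigma2s):
        p *= math.exp(-((xj - mu) ** 2) / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
    return p


def is_anomaly(x, mus, sigma2s, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return density(x, mus, sigma2s) < epsilon


# Hypothetical training data: two features per example, all assumed normal.
X_train = [[1.0, 10.0], [1.2, 9.5], [0.8, 10.5], [1.1, 10.2], [0.9, 9.8]]
mus, sigma2s = fit_parameters(X_train)
epsilon = 1e-3  # hypothetical threshold; in practice it is tuned on labeled data

print(is_anomaly([1.0, 10.0], mus, sigma2s, epsilon))  # False
print(is_anomaly([3.0, 14.0], mus, sigma2s, epsilon))  # True
```

With many features the product can underflow to zero in floating point; summing log-densities and comparing against $\log \varepsilon$ is a common alternative.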
3.4 Developing and evaluating an anomaly detection system
Assume we have some labeled data containing both anomalous and non-anomalous examples ($y = 0$ if normal, $y = 1$ if anomalous).
Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ (assume normal examples / not anomalous)
Cross-validation set (CV): $\left( x_{cv}^{(1)}, y_{cv}^{(1)} \right), \ldots, \left( x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})} \right)$
Test set: $\left( x_{test}^{(1)}, y_{test}^{(1)} \right), \ldots, \left( x_{test}^{(m_{test})}, y_{test}^{(m_{test})} \right)$
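Example: an aircraft engine dataset contains 10000 normal engines (good / normal) and 20 anomalous engines (flawed / anomalous).
Option 1:
Training set: 6000 normal engines
Cross-validation set (CV): 2000 normal engines ($y = 0$), 10 anomalous engines ($y = 1$)
Test set: 2000 normal engines ($y = 0$), 10 anomalous engines ($y = 1$)
Option 2 (uses the data more efficiently, but may overfit):
Training set: 6000 normal engines
Cross-validation set (CV): 4000 normal engines ($y = 0$), 20 anomalous engines ($y = 1$)
Test set: none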
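A rough sketch of the first split (60/20/20 of the normal engines, anomalies divided evenly between CV and test) might look like this. The function name `split_option_one`, the string IDs standing in for engines, and the random shuffling are assumptions made for illustration.

```python
import random


def split_option_one(normal, anomalous, seed=0):
    """Split normals 60/20/20; split anomalies evenly between CV and test."""
    rng = random.Random(seed)
    normal, anomalous = list(normal), list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = len(normal) * 6 // 10
    n_cv = len(normal) * 2 // 10
    half = len(anomalous) // 2

    train = normal[:n_train]  # unlabeled, assumed normal
    cv = [(x, 0) for x in normal[n_train:n_train + n_cv]]
    cv += [(x, 1) for x in anomalous[:half]]
    test = [(x, 0) for x in normal[n_train + n_cv:]]
    test += [(x, 1) for x in anomalous[half:]]
    return train, cv, test


# Stand-in IDs for the 10000 normal and 20 anomalous engines.
normal_engines = [f"normal-{i}" for i in range(10000)]
flawed_engines = [f"flawed-{i}" for i in range(20)]
train, cv, test = split_option_one(normal_engines, flawed_engines)
print(len(train), len(cv), len(test))  # 6000 2010 2010
```

The second split would simply fold the test portion into the CV set, leaving no held-out data for a final estimate.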
Evaluation metrics (see Part2-Week3 (second half), 4.1 and 4.2): for highly skewed data, precision and recall can be used as evaluation metrics.
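To make the metrics concrete, here is a small sketch that computes precision and recall on a labeled cross-validation set for several candidate thresholds. The function name `precision_recall`, the $p(x)$ values, the labels, and the convention of returning 0.0 when a denominator is zero are all assumptions for illustration.

```python
def precision_recall(p_values, labels, epsilon):
    """Precision and recall when predicting y = 1 wherever p(x) < epsilon."""
    tp = fp = fn = 0
    for p, y in zip(p_values, labels):
        predicted = 1 if p < epsilon else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    # Convention (assumed here): report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical p(x) values on a cross-validation set, with their true labels.
p_cv = [0.30, 0.25, 0.002, 0.40, 0.0005, 0.01, 0.35, 0.0001]
y_cv = [0, 0, 0, 0, 1, 1, 0, 1]

for epsilon in (0.001, 0.005, 0.05):
    print(epsilon, precision_recall(p_cv, y_cv, epsilon))
```

On this toy data, raising $\varepsilon$ from 0.001 to 0.05 catches all three anomalies (recall rises to 1.0) at the cost of one false alarm (precision drops to 0.75), which is the trade-off the threshold controls.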
3.5 Anomaly detection vs. supervised learning
Anomaly detection | Supervised learning |
---|---|
Highly skewed data: a very small number of positive examples ($y = 1$) and a large number of negative examples ($y = 0$) | Large numbers of both positive and negative examples |
Many different types of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set. |
Application: fraud detection | Application: email spam classification |
Application: manufacturing, finding new, previously unseen defects (e.g. aircraft engines) | Application: manufacturing, finding known, previously seen defects |
Application: monitoring machines in a data center | Application: weather prediction (sunny/rainy/etc.) |
… | Application: disease classification |
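3.6 Choosing what features to use
For features that do not follow a Gaussian distribution (non-Gaussian features), apply a mathematical transformation to the feature (for example, take its logarithm or its $n$-th root) so that the new feature is approximately Gaussian.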
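One way to check whether a transformation helps is to compare a symmetry measure before and after. The sketch below uses sample skewness (0 for a symmetric distribution) on a synthetic right-skewed feature; the `skewness` helper, the log-normal test data, and the choice of skewness as the check are assumptions for illustration (plotting a histogram is an equally valid check).

```python
import math
import random


def skewness(values):
    """Sample skewness: the third standardized moment of the values."""
    m = len(values)
    mu = sum(values) / m
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / m)
    return sum(((v - mu) / sigma) ** 3 for v in values) / m


rng = random.Random(0)
# Hypothetical right-skewed, strictly positive feature (log-normal draws).
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

log_feature = [math.log(v) for v in raw]    # x -> log(x)
sqrt_feature = [math.sqrt(v) for v in raw]  # x -> x^(1/2)

print("raw :", round(skewness(raw), 2))
print("sqrt:", round(skewness(sqrt_feature), 2))
print("log :", round(skewness(log_feature), 2))
```

The logarithm requires strictly positive values; for features that can be zero, a shifted form such as $\log(x + c)$ with a small constant $c$ is a common choice.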