[Andrew Ng Machine Learning Notes] Part3-Week1 (Part 2): Anomaly Detection

3.1 Finding unusual events

Is $x_{test}$ anomalous? To decide, introduce the probability $p(x)$ of observing the event:
$p(x_{test}) < \varepsilon$: flag as anomaly
$p(x_{test}) \geqslant \varepsilon$: "OK"

3.2 Gaussian / Normal Distribution

Gaussian distribution:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where the parameters are estimated from the $m$ training examples:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$$
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$
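
The estimates above translate almost directly into code. The following is a minimal sketch using only the Python standard library; the function names `fit_gaussian` and `gaussian_pdf` and the sample values are made up for illustration and are not part of the course material.

```python
import math

def fit_gaussian(xs):
    """Return (mu, sigma2) using the 1/m estimates from the formulas above."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Made-up one-dimensional sample, for illustration only.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = fit_gaussian(samples)
print(mu, sigma2)                     # mean and variance of the sample
print(gaussian_pdf(5.0, mu, sigma2))  # density at the mean
print(gaussian_pdf(15.0, mu, sigma2)) # much smaller density far from the mean
```

Note that the variance estimate divides by $m$, as in the notes, rather than by $m - 1$.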

3.3 Anomaly detection algorithm

The probability of an example is computed feature by feature, as the product of the per-feature Gaussian densities:
$$p(x) = \prod_{j=1}^{n} p\left(x_j; \mu_j, \sigma_j^2\right)$$

Steps of the anomaly detection algorithm:

  1. Choose n features x_i that you think might be indicative of anomalous examples.
  2. Fit parameters μ_j, σ_j² for each feature j = 1, …, n: $$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$
  3. Given new example x, compute p(x): $$p(x) = \prod_{j=1}^{n} p\left(x_j; \mu_j, \sigma_j^2\right) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$ Flag an anomaly if $p(x) < \varepsilon$.
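
The three steps can be sketched as follows. This is an illustrative pure-Python version, not the course's reference implementation: the helper names (`fit_params`, `density`, `is_anomaly`), the tiny training set, and the threshold value are all hypothetical.

```python
import math

def fit_params(X):
    """Step 2: per-feature mean and variance (1/m estimates) from the rows of X."""
    m, n = len(X), len(X[0])
    mus = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2s = [sum((row[j] - mus[j]) ** 2 for row in X) / m for j in range(n)]
    return mus, sigma2s

def density(x, mus, sigma2s):
    """Step 3: p(x) as the product of the per-feature Gaussian densities."""
    prob = 1.0
    for xj, mu, s2 in zip(x, mus, sigma2s):
        prob *= math.exp(-((xj - mu) ** 2) / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return prob

def is_anomaly(x, mus, sigma2s, epsilon):
    """Flag x as anomalous when p(x) falls below the threshold epsilon."""
    return density(x, mus, sigma2s) < epsilon

# Made-up training set with n = 2 features (assumed to contain only normal examples).
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.0, 13.0]]
mus, sigma2s = fit_params(X_train)
epsilon = 0.01  # hypothetical threshold; in practice it is tuned on a CV set

print(is_anomaly([2.0, 11.0], mus, sigma2s, epsilon))  # close to the training data
print(is_anomaly([6.0, 20.0], mus, sigma2s, epsilon))  # far from the training data
```

With many features the product of small densities can underflow to zero; summing the logarithms of the per-feature densities and comparing against $\log \varepsilon$ is a common workaround.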

3.4 Developing and evaluating an anomaly detection system

Suppose we have some labeled data containing both anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous):
Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ (assume normal examples / not anomalous)
Cross-validation set (CV): $\left(x_{cv}^{(1)}, y_{cv}^{(1)}\right), \ldots, \left(x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})}\right)$
Test set: $\left(x_{test}^{(1)}, y_{test}^{(1)}\right), \ldots, \left(x_{test}^{(m_{test})}, y_{test}^{(m_{test})}\right)$

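Example: an aircraft engine dataset contains 10000 normal engines (good / normal) and 20 anomalous engines (flawed / anomalous).
Option 1:

Training set: 6000 normal engines
Cross-validation set (CV): 2000 normal engines (y = 0), 10 anomalous engines (y = 1)
Test set: 2000 normal engines (y = 0), 10 anomalous engines (y = 1)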

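Option 2 (uses the data more efficiently, but may overfit to the cross-validation set because there is no separate test set):

Training set: 6000 normal engines
Cross-validation set (CV): 4000 normal engines (y = 0), 20 anomalous engines (y = 1)
Test set: none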
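
As a rough illustration of Option 1, the split could be produced like this. The engine records here are placeholder tuples rather than real feature vectors, and the shuffling and fixed seed are assumptions made only for the sketch.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Placeholder records standing in for real engine feature vectors.
normal = [("normal", i) for i in range(10000)]
anomalous = [("anomalous", i) for i in range(20)]
random.shuffle(normal)
random.shuffle(anomalous)

# Option 1: the training set holds normal engines only and carries no labels.
train = normal[:6000]
cv = [(e, 0) for e in normal[6000:8000]] + [(e, 1) for e in anomalous[:10]]
test = [(e, 0) for e in normal[8000:]] + [(e, 1) for e in anomalous[10:]]

print(len(train), len(cv), len(test))
```

Option 2 would instead put all 4000 remaining normal engines and all 20 anomalous engines into `cv` and leave `test` empty.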

Evaluation metrics (see Part2-Week3 (Part 2), sections 4.1 and 4.2): for highly skewed data, precision and recall can be used as evaluation metrics.
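
A sketch of how precision and recall could be computed on the cross-validation set and used to pick the threshold ε. Combining them into an F1 score and choosing the ε with the highest F1 is one common approach and is an assumption here, not something these notes prescribe; the function names, the `p_cv` values, and the candidate thresholds are all made up.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (anomalous, y = 1) class."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, candidates):
    """Return the candidate threshold with the highest F1 on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]  # anomaly when p(x) < eps
        f1 = f1_score(*precision_recall(y_cv, y_pred))
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Made-up densities p(x) for eight CV examples and their labels.
p_cv = [0.20, 0.15, 0.18, 0.001, 0.12, 0.0005, 0.09, 0.02]
y_cv = [0, 0, 0, 1, 0, 1, 0, 0]
candidates = [0.0001, 0.005, 0.05, 0.1]

best_eps, best_f1 = select_epsilon(p_cv, y_cv, candidates)
print(best_eps, best_f1)
```

Plain accuracy would be misleading on such data: predicting y = 0 for every example already scores very high when anomalies are rare.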

3.5 Anomaly detection vs. supervised learning

| Anomaly detection | Supervised learning |
| --- | --- |
| Highly skewed data: very small number of positive examples (y = 1), large number of negative examples (y = 0). | Large number of positive and negative examples. |
| Many different types of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set. |
| Applications: fraud detection | Applications: email spam classification |
| Manufacturing: finding new, previously unseen defects (e.g. aircraft engines) | Manufacturing: finding known, previously seen defects |
| Monitoring machines in a data center | Weather prediction (sunny/rainy/etc.) |
| | Disease classification |
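
3.6 Choosing what features to use

For non-Gaussian features, apply a mathematical transformation to the feature (taking the logarithm, taking the n-th root, etc.) so that the new feature is closer to a Gaussian distribution.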
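
To illustrate the effect of such transformations, the sketch below compares the sample skewness of a made-up right-skewed feature before and after taking the square root and the logarithm. Using skewness as a quick check of "closeness to Gaussian" is an assumption of this example (a Gaussian has skewness 0); in practice one would typically also inspect a histogram.

```python
import math

def skewness(xs):
    """Sample skewness: third central moment divided by variance**1.5."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    third = sum((x - mu) ** 3 for x in xs) / m
    return third / var ** 1.5

# Made-up, strongly right-skewed feature values (all positive).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
rooted = [x ** 0.5 for x in raw]     # square-root transform
logged = [math.log(x) for x in raw]  # log transform (requires x > 0)

print(skewness(raw), skewness(rooted), skewness(logged))
```

Whatever transformation is fitted on the training set must be applied identically to the cross-validation and test examples before computing p(x).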
