[Andrew Ng Machine Learning Notes] Part3-Week1 (Part 2): Anomaly Detection

hotpants

Published 2023-01-23 14:15:26

Column: Andrew Ng Machine Learning Notes. Tags: artificial intelligence, deep learning

Article link: https://blog.csdn.net/hotpants/article/details/128749914

3.1 Finding unusual events

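Is ${x_{test}}$ anomalous? Model the probability $p\left( x \right)$ of an example and compare it with a threshold $\varepsilon$:
$p\left( {{x_{test}}} \right) < \varepsilon$: flag as an anomaly
$p\left( {{x_{test}}} \right) \geqslant \varepsilon$: "OK"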

3.2 Gaussian / Normal Distribution

Gaussian distribution:
$p\left( x \right) = \frac{1}{{\sqrt {2\pi } \sigma }}{e^{\frac{{ - {{\left( {x - \mu } \right)}^2}}}{{2{\sigma ^2}}}}}$
where the parameters are estimated from the $m$ training examples:
$\mu = \frac{1}{m}\sum\limits_{i = 1}^m {{x^{\left( i \right)}}}$
${\sigma ^2} = \frac{1}{m}{\sum\limits_{i = 1}^m {\left( {{x^{\left( i \right)}} - \mu } \right)} ^2}$
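
The two estimates and the density above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the course: the function names `fit_gaussian` and `gaussian_pdf` and the sample values are made up for the example.

```python
import math

def fit_gaussian(xs):
    """Estimate mu and sigma^2 with the 1/m formulas from the notes."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Made-up one-dimensional training data (illustrative only).
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = fit_gaussian(samples)
density_at_mean = gaussian_pdf(mu, mu, sigma2)
```

Note that the variance uses the $\frac{1}{m}$ normalization from the formula above rather than $\frac{1}{m-1}$; for large $m$ the difference is negligible.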

3.3 Anomaly detection algorithm

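Compute the probability of an example as the product of the per-feature densities (each feature is modeled with its own Gaussian, treating the features as independent):
$p\left( x \right) = \prod\limits_{j = 1}^n {p\left( {{x_j};{\mu _j},\sigma _j^2} \right)}$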

Steps of the anomaly detection algorithm:

1. Choose $n$ features $x_j$ that you think might be indicative of anomalous examples.
2. Fit the parameters $\mu_j$, $\sigma_j^2$ for each feature: ${\mu _j} = \frac{1}{m}\sum\limits_{i = 1}^m {{x_j}^{\left( i \right)}}$ ${\sigma _j}^2 = \frac{1}{m}{\sum\limits_{i = 1}^m {\left( {{x_j}^{\left( i \right)} - {\mu _j}} \right)} ^2}$
3. Given a new example $x$, compute $p(x)$: $p\left( x \right) = \prod\limits_{j = 1}^n {p\left( {{x_j};{\mu _j},\sigma _j^2} \right)} = \prod\limits_{j = 1}^n {\frac{1}{{\sqrt {2\pi } {\sigma _j}}}} \exp \left( { - \frac{{{{\left( {{x_j} - {\mu _j}} \right)}^2}}}{{2{\sigma _j}^2}}} \right)$ and flag an anomaly if $p\left( x \right) < \varepsilon$.
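
The steps above can be sketched as follows. This is a rough illustration under stated assumptions: the names `fit`, `p`, and `is_anomaly`, the tiny two-feature training set, and the threshold `epsilon = 0.02` are all hypothetical and chosen only to make the example run.

```python
import math

def fit(X):
    """Step 2: per-feature mean and variance (1/m) from the rows of X."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def p(x, mu, sigma2):
    """Step 3: product of the per-feature Gaussian densities.

    Assumes every variance is non-zero; a constant feature would divide by zero.
    """
    prob = 1.0
    for xj, mj, s2j in zip(x, mu, sigma2):
        prob *= math.exp(-((xj - mj) ** 2) / (2 * s2j)) / math.sqrt(2 * math.pi * s2j)
    return prob

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag the example when p(x) falls below the threshold epsilon."""
    return p(x, mu, sigma2) < epsilon

# Made-up training set of normal examples with n = 2 features.
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [2.0, 12.0]]
mu, sigma2 = fit(X_train)
epsilon = 0.02  # hypothetical threshold; in practice it is tuned on labeled data

typical_flag = is_anomaly([2.0, 12.0], mu, sigma2, epsilon)   # example at the mean
unusual_flag = is_anomaly([5.0, 20.0], mu, sigma2, epsilon)   # example far from the mean
```

With many features the product can underflow to zero; summing log-densities and comparing against $\log \varepsilon$ is a common workaround, though it is not part of the notes.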

3.4 Developing and evaluating an anomaly detection system

Suppose we have some labeled data containing both anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous).
Training set: ${x^{\left( 1 \right)}},{x^{\left( 2 \right)}}, \ldots ,{x^{\left( m \right)}}$ (assumed to be normal, i.e. not anomalous)
Cross-validation set (CV): $\left( {x_{cv}^{(1)},y_{cv}^{(1)}} \right), \ldots ,\left( {x_{cv}^{({m_{cv}})},y_{cv}^{({m_{cv}})}} \right)$
Test set: $\left( {x_{test}^{(1)},y_{test}^{(1)}} \right), \ldots ,\left( {x_{test}^{({m_{test}})},y_{test}^{({m_{test}})}} \right)$

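Example: an aircraft engine dataset contains 10000 good (normal) engines and 20 flawed (anomalous) engines.
Scheme 1:

Training set: 6000 good engines
Cross-validation set (CV): 2000 good engines (y = 0), 10 anomalous engines (y = 1)
Test set: 2000 good engines (y = 0), 10 anomalous engines (y = 1)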
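
As a sketch of how the Scheme 1 split might be produced, the helper below shuffles the engines and slices them into the three sets. The function name `split_engines`, the fixed seed, and the use of integer ids in place of real feature vectors are assumptions made for illustration.

```python
import random

def split_engines(normal, anomalous, seed=0):
    """Split per Scheme 1: 6000 / 2000 / 2000 normal, 10 / 10 anomalous."""
    rng = random.Random(seed)
    normal, anomalous = list(normal), list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)
    # The training set is unlabeled and contains only normal engines.
    train = normal[:6000]
    # CV and test sets carry labels: 0 = normal, 1 = anomalous.
    cv = [(x, 0) for x in normal[6000:8000]] + [(x, 1) for x in anomalous[:10]]
    test = [(x, 0) for x in normal[8000:10000]] + [(x, 1) for x in anomalous[10:20]]
    return train, cv, test

# Placeholder ids stand in for engine feature vectors (hypothetical data).
normal_engines = list(range(10000))
anomalous_engines = list(range(10000, 10020))
train, cv, test = split_engines(normal_engines, anomalous_engines)
```

No anomalous engine reaches the training set, which matches the assumption that the parameters are fit on normal examples only.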

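Scheme 2 (makes fuller use of the scarce anomalous examples, but with no separate test set the choices tuned on the CV set may overfit it):

Training set: 6000 good engines
Cross-validation set (CV): 4000 good engines (y = 0), 20 anomalous engines (y = 1)
Test set: none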

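Evaluation metrics (see Part2-Week3 (Part 2), sections 4.1 and 4.2): for highly skewed data, plain accuracy is misleading because always predicting y = 0 already scores very high, so precision and recall can be used as evaluation metrics instead.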
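
One way these metrics might be used is to pick the threshold $\varepsilon$ on the cross-validation set. The sketch below computes precision and recall and, as an assumption not stated in these notes, combines them into an F1 score to rank candidate thresholds. The names `precision_recall`, `f1_score`, `select_epsilon`, and all the numbers are illustrative.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (anomalous, y = 1) class."""
    tp = sum(1 for t, pred in zip(y_true, y_pred) if t == 1 and pred == 1)
    fp = sum(1 for t, pred in zip(y_true, y_pred) if t == 0 and pred == 1)
    fn = sum(1 for t, pred in zip(y_true, y_pred) if t == 1 and pred == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def select_epsilon(p_cv, y_cv, candidates):
    """Return the candidate threshold with the highest F1 on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if prob < eps else 0 for prob in p_cv]
        f1 = f1_score(*precision_recall(y_cv, y_pred))
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Made-up p(x) values and labels for a tiny CV set (illustrative only).
p_cv = [0.30, 0.25, 0.20, 0.001, 0.15, 0.0005, 0.04]
y_cv = [0, 0, 0, 1, 0, 1, 0]
best_eps, best_f1 = select_epsilon(p_cv, y_cv, [0.0001, 0.01, 0.05, 0.5])
```

A threshold that is too small flags nothing (recall drops to zero), while one that is too large flags normal examples as well (precision drops), which is why a combined score is convenient for the comparison.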

3.5 Anomaly detection vs. supervised learning

| Anomaly detection | Supervised learning |
| --- | --- |
| Highly skewed data: very small number of positive examples (`y = 1`) and a large number of negative examples (`y = 0`). | Large number of both positive and negative examples. |
| Many different types of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set. |
| Application: fraud detection | Application: email spam classification |
| Manufacturing: finding new, previously unseen defects (e.g. aircraft engines) | Manufacturing: finding known, previously seen defects |
| Monitoring machines in a data center | Weather prediction (sunny/rainy/etc.) |
| … | Disease classification |