Anomaly detection - problem motivation

王彩旗 edwardwangcq.com

Categories: Artificial Intelligence, Machine Learning

Article link: https://blog.csdn.net/edward_wang1/article/details/111504129

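Abstract: This article is the original video transcript of Lesson 123, "Problem Motivation", from Chapter 16, "Anomaly Detection", of Andrew Ng's Machine Learning course. I wrote it down while studying the videos and edited it to make it more concise and easier to read, so that I can refer back to it later. I'm now sharing it with everyone. If you find any mistakes, corrections are welcome, and I thank you sincerely for them. I also hope it helps with your own studies.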
————————————————

In this next set of videos, we'll talk about a problem called anomaly detection. This is a reasonably commonly used type of machine learning. One of its interesting aspects is that it's mainly an unsupervised learning problem, but some aspects of it are also very similar to supervised learning. So what is anomaly detection?

Imagine that you're a manufacturer of aircraft engines. As your aircraft engines roll off the assembly line, you're doing QA testing. You measure features of your aircraft engines, like the heat generated, the vibrations and so on. You now have a dataset $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$. If you plot the data, it may look like the figure above. Each cross is one of your unlabeled examples. Let's say that the next day a new aircraft engine rolls off the assembly line, and it has some set of features $x_{test}$. The anomaly detection problem is this: we want to know whether the new aircraft engine is anomalous in any way. If it looks like a point over there (the upper green cross), then it looks like the aircraft engines we've seen before, so maybe we'll say that it looks okay. Whereas if the new aircraft engine were a point here (the bottom green cross), then we'd call that an anomaly.

More formally, in the anomaly detection problem we're given some dataset $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, and we usually assume these examples are normal, that is, non-anomalous. We want an algorithm to tell us whether some new example $x_{test}$ is anomalous. For this, we're going to build a model $p(x)$ from this unlabeled training set, where $x$ are the features of, say, aircraft engines. For the new engine $x_{test}$ and some threshold $\epsilon$,

If $p(x_{test}) < \epsilon$ , then we flag it as an anomaly

If $p(x_{test}) \geqslant \epsilon$ , then we flag it as OK
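The thresholding rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text has not yet chosen a form for $p(x)$, so the sketch assumes an independent Gaussian per feature, and the engine data, the function names (`fit_gaussian`, `p`, `is_anomaly`) and the value of `epsilon` are all hypothetical.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate a per-feature mean and variance from the unlabeled training set."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)
    return mu, sigma2

def p(x, mu, sigma2):
    """Density of x under independent per-feature Gaussians (an assumed model)."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((x - mu) ** 2) / (2.0 * sigma2))
    return float(np.prod(coef * expo))

def is_anomaly(x_test, mu, sigma2, epsilon):
    """Flag x_test as an anomaly when p(x_test) < epsilon, otherwise OK."""
    return p(x_test, mu, sigma2) < epsilon

# Hypothetical engine measurements: columns are heat and vibration.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 3.0], scale=[1.0, 0.5], size=(200, 2))

mu, sigma2 = fit_gaussian(X_train)
epsilon = 1e-3  # hypothetical threshold, chosen by hand for this sketch

x_ok = np.array([5.1, 3.1])   # close to the bulk of the training data
x_far = np.array([9.0, 0.5])  # far away from the training data

print(is_anomaly(x_ok, mu, sigma2, epsilon))
print(is_anomaly(x_far, mu, sigma2, epsilon))
```

The only part taken from the text is the comparison of $p(x_{test})$ against $\epsilon$; any other density model could be substituted for `p` without changing `is_anomaly`.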

And, given the training set plotted above, if you build a model $p(x)$, hopefully the model will say that points that lie somewhere in the middle have pretty high probability, whereas points a little bit further out have lower probability. That point way out here (the green cross at the bottom) would be an anomaly. The point way in there (the green cross in the middle) would be OK.

Here are some examples of anomaly detection.

Perhaps the most common application of anomaly detection is fraud detection. If you have many users, and each user engages in different activities, maybe on a website or in a physical plant or something, you can define features $x^{(i)}$ of the different users' activities and build a model $p(x)$ of how likely it is for users to behave in different ways, that is, the probability of a particular vector of features of a user's behavior. Examples of features of a user's activity on the website may be things like how often the user logs in ($x_{1}$), the number of web pages visited or the number of transactions ($x_{2}$), the number of posts by the user on the forum ($x_{3}$), or the typing speed of the user ($x_{4}$). Then you can model $p(x)$ based on such data, and try to identify users who are behaving very strangely on your website by checking which ones have $p(x) < \epsilon$, and maybe send the profiles of those users for further review.

Another example is manufacturing. We've already talked about the aircraft engine example.

The third example is monitoring computers in a data center. If you have lots of machines in a computer cluster or data center, you can compute features of each machine: how much memory is used ($x_{1}$), the number of disk accesses ($x_{2}$), the CPU load ($x_{3}$), and the ratio of CPU load to network traffic ($x_{4}$). Then you can model $p(x)$, the probability of these machines having different amounts of memory use, different numbers of disk accesses, different CPU loads and so on. If you ever have a machine with $p(x) < \epsilon$, then you know that machine is behaving unusually, and maybe further review by a system administrator is needed.
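As a rough sketch of how the four data-center features could be assembled into vectors $x$, consider the Python below. The metric names, units and numbers are hypothetical and made up for illustration; only the choice of the four features follows the text.

```python
import numpy as np

# Hypothetical raw metrics for three machines (all values are invented).
machines = [
    {"memory_used_gb": 12.0, "disk_accesses": 340, "cpu_load": 0.55, "network_mbps": 110.0},
    {"memory_used_gb": 11.5, "disk_accesses": 310, "cpu_load": 0.50, "network_mbps": 100.0},
    {"memory_used_gb": 12.3, "disk_accesses": 355, "cpu_load": 0.95, "network_mbps": 2.0},
]

def to_features(m, tiny=1e-9):
    """Map one machine's raw metrics to the feature vector (x1, x2, x3, x4)."""
    return np.array([
        m["memory_used_gb"],                       # x1: memory used
        m["disk_accesses"],                        # x2: number of disk accesses
        m["cpu_load"],                             # x3: CPU load
        m["cpu_load"] / (m["network_mbps"] + tiny),  # x4: CPU load / network traffic
    ])

# One row per machine, one column per feature.
X = np.stack([to_features(m) for m in machines])
print(X.shape)
print(X[:, 3])
```

In this made-up data the third machine has a high CPU load but very little network traffic, so its $x_{4}$ is far larger than that of the other two machines. The small constant `tiny` is an assumption added here only to avoid dividing by zero.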

Next, I'll talk a bit about the Gaussian distribution and review the properties of the Gaussian probability distribution, and then we'll develop an anomaly detection algorithm.

<end>