Anomaly detection - problem motivation

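Abstract: This post is the transcript, taken from the video subtitles, of Lesson 123, "Problem Motivation", in Chapter 16, "Anomaly Detection", of Andrew Ng's Machine Learning course. I wrote it down while studying the videos and edited it to be more concise and easier to read, so that I can refer back to it later. I'm sharing it here. If you find any mistakes, corrections are welcome and sincerely appreciated. I also hope it helps with your own study.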

In this next set of videos, we'll talk about a problem called anomaly detection. This is a reasonably commonly used type of machine learning. One of its interesting aspects is that it's mainly an unsupervised learning problem, but some aspects of it are also very similar to supervised learning. So what is anomaly detection?

Imagine that you're a manufacturer of aircraft engines. As your aircraft engines roll off the assembly line, you're doing QA testing. You measure features of your aircraft engines, like the heat generated, the vibrations and so on. You now have a dataset {x^{(1)}, x^{(2)},...,x^{(m)}}. If you plot the data, it may look like the figure above. Each cross is one of your unlabeled examples. Let's say that the next day, a new aircraft engine rolls off the assembly line, and it has some set of features x_{test}. The anomaly detection problem is this: we want to know whether this new aircraft engine is anomalous in any way. If it looks like a point over there (the upper green cross), then it looks like the aircraft engines we've seen before, so maybe we'll say that it looks okay. Whereas if the new aircraft engine were a point here (the bottom green cross), then we'll call that an anomaly.

More formally, in the anomaly detection problem, we're given some dataset {x^{(1)}, x^{(2)},...,x^{(m)}}, and we usually assume these examples are normal, or non-anomalous. We want an algorithm to tell us if some new example x_{test} is anomalous. For this, we're going to build a model for p(x) from this unlabeled training set, where x is the feature vector of, say, an aircraft engine. For the new engine x_{test}:

If p(x_{test}) < \epsilon, then we flag it as an anomaly

If p(x_{test}) \geqslant \epsilon, then we flag it as OK

And, given the training set plotted above, if you build a model p(x), hopefully the model will say that points lying somewhere in the middle have pretty high probability, whereas points a little bit further out have lower probability. That point way out here (the green cross at the bottom) would be an anomaly. The point way in there (the green cross in the middle) would be OK.
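The thresholding rule above can be sketched in a few lines of Python. The lecture has not yet said what form p(x) takes, so this sketch assumes one independent Gaussian per feature purely as a stand-in. The training data is synthetic, and the names `fit_density`, `p`, `flag` and the value of `EPSILON` are illustrative choices, not something taken from the course.

```python
import math
import random


def fit_density(examples):
    """Estimate a mean and variance per feature from the (assumed normal) examples."""
    m = len(examples)
    n = len(examples[0])
    mu = [sum(x[j] for x in examples) / m for j in range(n)]
    var = [sum((x[j] - mu[j]) ** 2 for x in examples) / m for j in range(n)]
    return mu, var


def p(x, mu, var):
    """Density of x under the assumed model: a product of per-feature Gaussians."""
    density = 1.0
    for xj, mj, vj in zip(x, mu, var):
        density *= math.exp(-((xj - mj) ** 2) / (2.0 * vj)) / math.sqrt(2.0 * math.pi * vj)
    return density


def flag(x_test, mu, var, epsilon):
    """Apply the rule from the text: anomaly if p(x_test) < epsilon, otherwise OK."""
    return "anomaly" if p(x_test, mu, var) < epsilon else "OK"


# Synthetic stand-in for engine measurements (heat, vibration); not real data.
random.seed(0)
train = [(random.gauss(5.0, 1.0), random.gauss(3.0, 0.5)) for _ in range(500)]
mu, var = fit_density(train)

EPSILON = 1e-3  # hypothetical threshold chosen only for this illustration

print(flag((5.0, 3.0), mu, var, EPSILON))   # a point in the middle of the data
print(flag((12.0, 0.0), mu, var, EPSILON))  # a point far from every training example
```

In this sketch the first test point sits near the center of the training data and the second lies many standard deviations away, so the two calls land on opposite sides of the threshold.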

Here are some examples of anomaly detection.

Perhaps the most common application of anomaly detection is fraud detection. If you have many users, and each user carries out different activities, maybe on a website or in a physical plant or something, you can define features x^{(i)} of the different users' activities, and build a model of how probable different kinds of user behavior are, that is, the probability of a particular vector of features of a user's behavior. Examples of features of a user's activity on the website may be things like how often the user logs in (x_{1}), the number of web pages visited or the number of transactions (x_{2}), the number of posts by the user on the forum (x_{3}), or the typing speed of the user (x_{4}). Then you can model p(x) based on such data, and try to identify users who are behaving very strangely on your website by checking which ones have p(x) < \epsilon, and maybe send the profiles of those users for further review.

Another example is manufacturing. We've already talked about the aircraft engine example.

The third example is monitoring computers in a data center. If you have lots of machines in a computer cluster or data center, you can compute features of each machine: how much memory is used (x_{1}), the number of disk accesses (x_{2}), the CPU load (x_{3}), and CPU load / network traffic (x_{4}). Then you can model p(x) to capture the probability of these machines having different amounts of memory use, different numbers of disk accesses, different CPU loads and so on. If you ever have a machine whose p(x) < \epsilon, then you know that machine is behaving unusually, and maybe further review by a system administrator is needed.
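As a rough illustration of the data center example, the four features can be assembled from raw machine measurements as below. The `MachineStats` record, its field names and units, and the `TRAFFIC_FLOOR` guard against division by zero are all assumptions made for this sketch; the lecture only names the four quantities.

```python
from dataclasses import dataclass

TRAFFIC_FLOOR = 1e-9  # hypothetical guard so an idle network link does not divide by zero


@dataclass
class MachineStats:
    """Hypothetical raw measurements for one machine (field names are illustrative)."""
    memory_used_gb: float
    disk_accesses: int
    cpu_load: float
    network_traffic_mbps: float


def to_features(stats):
    """Build the feature vector [x1, x2, x3, x4] described in the text."""
    x1 = stats.memory_used_gb
    x2 = float(stats.disk_accesses)
    x3 = stats.cpu_load
    # x4 is the derived feature: CPU load divided by network traffic.
    x4 = stats.cpu_load / max(stats.network_traffic_mbps, TRAFFIC_FLOOR)
    return [x1, x2, x3, x4]


sample = MachineStats(memory_used_gb=8.0, disk_accesses=120,
                      cpu_load=0.5, network_traffic_mbps=50.0)
print(to_features(sample))
```

Unlike the first three features, x_{4} is a ratio of two raw measurements rather than a raw measurement itself, so it has to be computed before the vector is passed to p(x).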

Next, I'll talk a bit about the Gaussian distribution and review the properties of the Gaussian probability distribution, and then go on to develop an anomaly detection algorithm.

<end>
