Abstract: This is a transcript of lecture 128, "Choosing what features to use", from Chapter 16 "Anomaly Detection" of Andrew Ng's Machine Learning course. I took these notes down while watching the videos and lightly edited them to be more concise and easier to read, for later reference, and am sharing them here. If you find any mistakes, corrections are welcome and sincerely appreciated; I also hope the notes are helpful to others' study.
————————————————
It turns out that when applying anomaly detection, one of the things that has a huge effect on how well it does is which features you use.
- Choose features that might take on unusually large or small values in the event of an anomaly
Take the example of monitoring the computers in a data center. We might have thousands or tens of thousands of machines, and we want to know whether any of them is doing something strange. The following are features we might choose:
x1 = memory use of computer
x2 = number of disk accesses/sec
x3 = CPU load
x4 = network traffic
Suppose there is a bunch of web servers, and if one of my servers is serving a lot of users, we'll see very high CPU load and very high network traffic. One failure case could be that one of my servers' code gets stuck in some infinite loop, so the CPU load grows but the network traffic doesn't. To detect that type of anomaly, I might define a new feature x5 = (CPU load)/(network traffic), and/or x6 = (CPU load)²/(network traffic). Both of them could help capture anomalies where one of your machines has a very high CPU load but doesn't have a commensurately large network traffic.
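As a minimal sketch of this idea (the function name and all numbers below are illustrative, not the lecture's data center data), a ratio of CPU load to network traffic blows up exactly for the stuck-in-a-loop machine:

```python
# Sketch: derive ratio features that flag "high CPU load but low network
# traffic" machines. Names and values are hypothetical.

def make_ratio_features(cpu_load, network_traffic, eps=1e-8):
    """Ratio features: CPU/network and CPU^2/network (eps guards divide-by-zero)."""
    r1 = cpu_load / (network_traffic + eps)
    r2 = cpu_load ** 2 / (network_traffic + eps)
    return r1, r2

# A busy but healthy web server: high CPU *and* high network traffic.
healthy = make_ratio_features(cpu_load=0.9, network_traffic=0.8)
# A server stuck in an infinite loop: high CPU, almost no traffic.
stuck = make_ratio_features(cpu_load=0.9, network_traffic=0.01)
print(healthy, stuck)
```

The raw CPU-load feature alone is the same (0.9) for both machines; only the ratio separates them.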
- Play with transformations of the data so that it looks more Gaussian
In anomaly detection, one of the things we do is model each feature x_j with a Gaussian distribution, x_j ~ N(μ_j, σ_j²). We often need to plot the histogram of the feature to make sure it actually looks vaguely Gaussian before feeding it to our anomaly detection algorithm.
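A minimal sketch of that modeling step (the sample values are made up): fit μ and σ² to one feature by maximum likelihood and evaluate the Gaussian density.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood estimates of mu and sigma^2 for one feature."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

def gaussian_pdf(v, mu, var):
    """p(v; mu, sigma^2) for a univariate Gaussian."""
    return math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu, var = fit_gaussian([1.0, 1.2, 0.9, 1.1, 0.8])
print(mu, var)  # density is highest at mu and falls off with distance
```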
For example, if the histogram of feature x1 is a very asymmetric distribution with a peak way off to one side, what we often do is play with different transformations of the data in order to make it look more Gaussian. We might take a log transformation, replacing x1 with log(x1); its histogram may then look much more like the classic bell-shaped curve that we can fit with some mean and variance parameters μ1, σ1². Rather than just a log transform, there are other things we can do. Say we have a different feature x2: maybe we can replace it with log(x2 + 1), or more generally with log(x2 + c). This constant c is something we can play with to try to make the result look as Gaussian as possible. For a different feature x3, maybe we can replace it with its square root, x3^(1/2); the exponent here is another example of a parameter we can play with. Or another feature x4 could be replaced with x4^(1/3).
In the video, one example is demoed live on a feature x with 1000 values. Its histogram looks like figure-1 with the default 10 bins; figure-2 is the histogram with 50 bins for a finer grid. Neither looks Gaussian. If we take the square root of the data (xNew = x.^0.5), the histogram looks like figure-3. Playing with different exponents, 0.2 and 0.1 give the histograms in figure-4 and figure-5; with 0.05, it looks pretty Gaussian, as in figure-6. Then I can define x1 = x.^0.05 and feed it into my anomaly detection algorithm. Of course, there are other transformations you can use, like log(x), which is also very Gaussian.
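The lecture's 1000-point dataset isn't available here, so the sketch below uses a synthetic right-skewed sample, and uses sample skewness as a rough numeric stand-in for eyeballing the histogram (skewness near 0 suggests a symmetric, bell-ish shape):

```python
import numpy as np

# Hypothetical stand-in for the lecture's feature: a heavily right-skewed sample.
rng = np.random.default_rng(0)
x = rng.gamma(shape=0.5, scale=2.0, size=1000)

def skewness(v):
    """Sample skewness; roughly 0 for a symmetric, Gaussian-looking sample."""
    z = (v - v.mean()) / v.std()
    return float((z ** 3).mean())

# Try the exponents from the demo; smaller |skewness| = more Gaussian-looking.
for q in (1.0, 0.5, 0.2, 0.1, 0.05):
    print(q, round(skewness(x ** q), 3))

x_new = x ** 0.05  # the transformed feature we'd feed to the algorithm
```

In practice you would still plot the histogram, as in the video; the skewness loop just automates trying the candidate exponents.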
One note is that the anomaly detection algorithm will usually work OK even if you don't perform such transformations, but if you use these transformations to make the data more Gaussian, it might work better.
- Come up with features via an error analysis procedure
We would train a complete algorithm, run it on the cross validation set, and look at the examples it gets wrong, to see if we can come up with extra features that help the algorithm do better on the examples it got wrong in the cross validation set. For example, suppose the blue curve in the figure is the Gaussian fitted to my feature x1, and we have an anomalous example, shown as a green cross, that is buried in the middle of a bunch of normal examples. We're hoping that p(x) will be large for normal examples and small for anomalous ones. But a common problem is that p(x) is comparable, maybe both large, for the normal and the anomalous examples, so the algorithm fails to flag this example as anomalous. Then we would look at that particular training example (say, an aircraft engine), figure out what went wrong with it, and see if that inspires us to come up with a new feature x2 that helps distinguish this bad example from the rest of my red examples.
If I manage to do so, and I re-plot my data with the new feature x2, all my training set examples are still those red crosses, and hopefully I find that the anomalous example takes on a very unusual value of x2. If I now model p(x) = p(x1; μ1, σ1²)·p(x2; μ2, σ2²), my anomaly detection algorithm gives high probability to the data in the central region and much lower probability to that green-cross anomalous example.
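A self-contained sketch of that payoff (all numbers are hypothetical, not the lecture's data): each feature gets its own fitted Gaussian, p(x) is the product of the per-feature densities, and an example that looks normal on x1 but takes an unusual value of the new feature x2 gets a very low p(x).

```python
import math

def fit(values):
    """Maximum-likelihood mu and sigma^2 for one feature."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

def pdf(v, mu, var):
    return math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical normal training examples for features x1 and the new x2.
x1_train = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
x2_train = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
params = [fit(x1_train), fit(x2_train)]

def p(example):
    """p(x) = product of per-feature Gaussian densities."""
    return math.prod(pdf(v, mu, var) for v, (mu, var) in zip(example, params))

normal_example = (5.0, 1.0)
anomaly = (5.0, 3.0)  # x1 looks perfectly normal; x2 is way off
print(p(normal_example), p(anomaly))
```

On x1 alone the two examples are indistinguishable; the new feature x2 is what drives p(anomaly) down so the example gets flagged.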
<end>