[Andrew Ng] Machine Learning, Chapter 16: Anomaly Detection, with Part of the ex8 Programming Exercise

D.Guan

Category: Machine Learning

Article link: https://blog.csdn.net/BRAVE_NO1/article/details/82937988

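1. Anomaly detection

1.1 Overview

The idea is to model p(x) from unlabeled data, that is, fit the data to obtain a density p(x) that reflects its distribution, and then compare p(x) for a new example against a chosen threshold to decide whether the example is anomalous.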

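1.2 Steps (assuming p(x) is Gaussian)

In Octave, we can use hist to plot a histogram and check whether a feature looks Gaussian. If it does not, a transform such as log(x) can give data that is approximately Gaussian.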
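As a rough illustration of this check, the sketch below builds a synthetic right-skewed feature (hypothetical data, not from the exercise), bins it with hist, and compares the sample skewness before and after the log transform. The variable names and the choice of skewness as a numeric summary are assumptions made for this example only.

```octave
% Synthetic right-skewed feature (hypothetical data, for illustration only)
randn("state", 42);
x = exp(randn(1000, 1));

% With output arguments, hist returns bin counts instead of drawing a figure;
% call hist(x, 20) without outputs to see the plot itself
[counts_raw, centers_raw] = hist(x, 20);
[counts_log, centers_log] = hist(log(x), 20);

% Sample skewness: close to 0 for a symmetric, Gaussian-like distribution
skew = @(v) mean((v - mean(v)).^3) / mean((v - mean(v)).^2)^1.5;
skew_raw = skew(x);
skew_log = skew(log(x));
printf("skewness before log: %.2f, after log: %.2f\n", skew_raw, skew_log);
```

A skewness much closer to zero after the transform suggests that log(x) is the better feature to model with a Gaussian.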

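step1: Choose the features. Pick informative features, ideally ones that take unusually large or unusually small values when an anomaly occurs; you can also create new features designed to capture anomalies.

step2: Compute the mean $\mu_{j}$ and the variance $\sigma^2_{j}$ of each feature.

step3: Compute the probability from the Gaussian expression for p(x).

step4: Compare p(x) with the threshold $\varepsilon$; if p(x) is smaller than $\varepsilon$, the example is flagged as an anomaly.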
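The steps above can be sketched as follows, assuming the features are treated as independent so that $p(x)=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$. The data, the names x_new and is_anomaly, and the value of epsilon are hypothetical; in practice the threshold is chosen on a cross-validation set.

```octave
% Tiny hypothetical training set: 5 examples, 2 features
X = [1.0 2.0; 1.2 1.9; 0.9 2.1; 1.1 2.0; 0.8 2.0];

% step2: per-feature mean and variance (1/m normalization)
mu = mean(X);                  % 1 x n
sigma2 = mean((X - mu).^2);    % 1 x n, relies on broadcasting

% step3: p(x) as a product of per-feature Gaussian densities
x_new = [1.0 2.0; 3.0 0.5];    % two examples to score
dens = exp(-(x_new - mu).^2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2);
p = prod(dens, 2);             % one probability density value per row

% step4: flag examples whose density falls below the threshold
epsilon = 1e-3;                % hypothetical value
is_anomaly = p < epsilon;
```

The first example sits at the training mean and gets a high density, while the second lies far from the data in both features and is flagged.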

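1.3 Evaluating the algorithm

step1: Split the dataset into three parts, assuming it contains a small number of anomalous examples and a large number of normal ones.

Assign the normal examples to the training, cross-validation, and test sets in a 6:2:2 ratio. Then split the anomalous examples 1:1 between the cross-validation and test sets.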
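A minimal sketch of this split, working on example indices only. The counts m_normal and m_anom are made-up numbers, and shuffling with randperm before splitting is an assumption of this example rather than something the post prescribes.

```octave
m_normal = 10000;   % hypothetical number of normal examples
m_anom   = 20;      % hypothetical number of anomalous examples

% Shuffle the indices so each split is a random sample
idx_n = randperm(m_normal);
idx_a = randperm(m_anom);

% Normal examples: 60% train, 20% cross-validation, 20% test
n_train = round(0.6 * m_normal);
n_cv    = round(0.2 * m_normal);
train_n = idx_n(1:n_train);
cv_n    = idx_n(n_train + 1 : n_train + n_cv);
test_n  = idx_n(n_train + n_cv + 1 : end);

% Anomalous examples: half to cross-validation, half to test, none to training
half   = floor(m_anom / 2);
cv_a   = idx_a(1:half);
test_a = idx_a(half + 1 : end);

printf("train: %d normal; cv: %d normal + %d anomalous; test: %d normal + %d anomalous\n", ...
       numel(train_n), numel(cv_n), numel(cv_a), numel(test_n), numel(test_a));
```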

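step2: Fit p(x) on the training set, that is, choose the probability model and estimate its parameters from the training examples.

step3: Choose the threshold $\varepsilon$ on the cross-validation set. For example, take the largest and the smallest value of p(x) on the cross-validation set, pick a suitable step size, and sweep $\varepsilon$ upward from the smallest value, computing the F1 score at each step and recording the largest F1 together with its threshold (see the programming exercise for details). F1 is used because the classes are skewed: with many normal examples and only a few anomalies, classification accuracy is not a suitable metric.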
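To see why accuracy misleads on skewed data, consider a hypothetical validation set with 990 normal examples and 10 anomalies, and a detector that never flags anything. The sketch uses the form F1 = 2·tp / (2·tp + fp + fn), which is algebraically the same as 2·prec·rec / (prec + rec) but avoids a 0/0 when nothing is flagged.

```octave
% Hypothetical skewed validation labels: 1 = anomaly, 0 = normal
yval = [zeros(990, 1); ones(10, 1)];

% A useless detector that predicts "normal" for every example
predictions = zeros(1000, 1);

accuracy = mean(predictions == yval);

tp = sum((predictions == 1) & (yval == 1));   % true positives
fp = sum((predictions == 1) & (yval == 0));   % false positives
fn = sum((predictions == 0) & (yval == 1));   % false negatives
F1 = 2 * tp / (2 * tp + fp + fn);

printf("accuracy = %.2f, F1 = %.2f\n", accuracy, F1);
```

The detector reaches 99% accuracy while missing every anomaly; its F1 score of 0 exposes that.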

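2. Anomaly detection with the multivariate Gaussian

The steps are basically the same as for the original model.

The multivariate model takes the joint effect of the features into account. Sometimes two features each fall within their own normal range, yet because the features are correlated, the combination is anomalous. The original model cannot detect this case. An example is the point marked X in the original post's figure.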
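A small numeric sketch of this effect, using the multivariate density $p(x)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$. The mean, covariance, and test point are hypothetical values picked so that the two features are strongly positively correlated while the test point moves them in opposite directions.

```octave
mu    = [0 0];               % hypothetical mean
Sigma = [1 0.9; 0.9 1];      % hypothetical covariance, strong positive correlation
x     = [1.5 -1.5];          % each coordinate is only 1.5 standard deviations out

n = numel(mu);
d = x - mu;

% Multivariate Gaussian density; d / Sigma computes d * inv(Sigma)
p_multi = exp(-0.5 * (d / Sigma) * d') / ((2 * pi)^(n / 2) * sqrt(det(Sigma)));

% Original model: product of per-feature Gaussians (uses only the diagonal of Sigma)
sigma2  = diag(Sigma)';
p_indep = prod(exp(-d.^2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2));

printf("multivariate p(x) = %.3g, per-feature p(x) = %.3g\n", p_multi, p_indep);
```

The per-feature model assigns this point a fairly ordinary density, while the multivariate model assigns it a density many orders of magnitude smaller, so only the latter would flag it for a reasonable threshold.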

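Comparing the two: the original model requires manually creating new features to capture correlations, while the multivariate model does not. The original model is computationally cheap, while the multivariate model has to compute the inverse of the covariance matrix, which is expensive. The original model works with small training sets, but the multivariate model does not; it is appropriate when the number of examples m is much larger than the number of features n (m>>n). When m ≤ n, the covariance matrix is not invertible, and it is also not invertible when there are redundant (linearly dependent) features.

Geometrically, the contours of the original model are axis-aligned, symmetric about the feature axes, while those of the multivariate model need not be. Changing the off-diagonal entries of the covariance matrix $\Sigma$ changes the orientation of the contours.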
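The two non-invertible cases can be checked numerically with rank. The sketch below uses random synthetic data; the matrix sizes and the duplicated feature are hypothetical choices for illustration.

```octave
randn("state", 0);

% Case 1: fewer examples than features (m = 3, n = 5)
X1 = randn(3, 5);
Xc1 = X1 - mean(X1);                      % center each feature
Sigma1 = Xc1' * Xc1 / size(X1, 1);        % 5 x 5 covariance estimate
rank1 = rank(Sigma1);                     % at most m - 1, so Sigma1 is singular

% Case 2: plenty of examples, but feature 3 is a multiple of feature 1
X2 = randn(100, 2);
X2(:, 3) = 2 * X2(:, 1);                  % redundant feature
Xc2 = X2 - mean(X2);
Sigma2 = Xc2' * Xc2 / size(X2, 1);        % 3 x 3 covariance estimate
rank2 = rank(Sigma2);

printf("rank(Sigma1) = %d of 5, rank(Sigma2) = %d of 3\n", rank1, rank2);
```

In both cases the rank is below the matrix dimension, so $\Sigma^{-1}$ does not exist and the multivariate density cannot be evaluated directly.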

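notes: supervised learning versus anomaly detection. Supervised learning suits problems with many examples of both classes (many positive and many negative examples). Anomaly detection suits problems with very few positive (anomalous) examples and many negative (normal) examples, especially when the anomalies come in many different types. The smaller p(x) is, the more likely the example is anomalous.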

3. Programming exercise

function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a 
%Gaussian distribution using the data in X
%   [mu sigma2] = estimateGaussian(X), 
%   The input X is the dataset with each n-dimensional data point in one row
%   The output is an n-dimensional vector mu, the mean of the data set
%   and the variances sigma^2, an n x 1 vector
% 

% Useful variables
[m, n] = size(X);

% You should return these values correctly
mu = zeros(n, 1);
sigma2 = zeros(n, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the mean of the data and the variances
%               In particular, mu(i) should contain the mean of
%               the data for the i-th feature and sigma2(i)
%               should contain variance of the i-th feature.
%
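mu = (1/m) * sum(X)';                  % mean of each feature (mu(i) is for the i-th feature), n x 1
sigma2 = (1/m) * sum((X - mu').^2)';   % variance of each feature, n x 1 (X - mu' relies on broadcasting)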

% =============================================================

end

function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
%   [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
%   threshold to use for selecting outliers based on the results from a
%   validation set (pval) and the ground truth (yval).
%

bestEpsilon = 0;
bestF1 = 0;
F1 = 0;

stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
    
    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the F1 score of choosing epsilon as the
    %               threshold and place the value in F1. The code at the
    %               end of the loop will compare the F1 score for this
    %               choice of epsilon and set it to be the best epsilon if
    %               it is better than the current choice of epsilon.
    %               
    % Note: You can use predictions = (pval < epsilon) to get a binary vector
    %       of 0's and 1's of the outlier predictions
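    predictions = (pval < epsilon);              % 1 = flagged as an anomaly

    % Count the outcomes with vectorized logical operations
    tp = sum((predictions == 1) & (yval == 1));  % true positives
    fp = sum((predictions == 1) & (yval == 0));  % false positives
    fn = sum((predictions == 0) & (yval == 1));  % false negatives

    prec = tp / (tp + fp);                       % precision (NaN when nothing is flagged)
    rec  = tp / (tp + fn);                       % recall
    F1 = 2 * prec * rec / (prec + rec);          % a NaN F1 never passes the F1 > bestF1 test below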

    % =============================================================

    if F1 > bestF1,
       bestF1 = F1;
       bestEpsilon = epsilon;
    end
end

end
