Andrew Ng Machine Learning EX8, Part 1: Anomaly Detection

lsnow8624

Category: Andrew Ng Machine Learning assignments

Article link: https://blog.csdn.net/lsnow8624/article/details/89496080

1. Anomaly detection

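Anomaly detection is an unsupervised algorithm: an example x is flagged as anomalous when its probability p(x) falls below a threshold ε, where p(x) is given by the Gaussian density
$$p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$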
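
As a minimal sketch of this rule (not the assignment's code), the snippet below gives each feature its own Gaussian, multiplies the per-feature densities to get p(x), and flags x when p(x) is below a threshold. The helper names `gaussian_pdf` and `density`, and the values of `mu`, `sigma2` and `epsilon`, are made up for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def density(x, mu, sigma2):
    """p(x) as the product of independent per-feature Gaussian densities."""
    p = 1.0
    for xj, mj, sj in zip(x, mu, sigma2):
        p *= gaussian_pdf(xj, mj, sj)
    return p

# Hypothetical parameters, roughly on the scale of the 2D dataset used later.
mu = [14.0, 15.0]
sigma2 = [1.8, 1.7]
epsilon = 1e-4  # illustrative threshold, not a tuned value

p_typical = density([14.2, 14.8], mu, sigma2)  # close to the mean
p_far = density([22.0, 8.0], mu, sigma2)       # far from the mean

is_anomaly_typical = p_typical < epsilon
is_anomaly_far = p_far < epsilon
```

With these assumed numbers the point near the mean stays above the threshold and the distant point falls below it.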

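1.1 Assignment overview

In this exercise you implement an anomaly detection algorithm to detect anomalous behavior in server computers. The features measure the throughput (mb/s) and response latency (ms) of each server. While the servers were running, m = 307 examples of their behavior were collected, giving an unlabeled dataset {x(1), …, x(m)}. The vast majority of these examples are suspected to be "normal" (non-anomalous) examples of servers operating normally, but the dataset may also contain some examples of servers behaving anomalously.

A Gaussian model is used to detect the anomalous examples in the dataset. You start with a 2D dataset, which lets you visualize what the algorithm is doing. On that dataset you fit a Gaussian distribution and then find the values with very low probability, which can be considered anomalies. After that, you apply the anomaly detection algorithm to a larger dataset with many dimensions.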

1.2 Importing modules and data

Import the modules

import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio

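import estimateGaussian as eg # estimates the Gaussian parameters (mu, sigma2)
import multivariateGaussian as mvg # multivariate Gaussian density
import visualizeFit as vf
import selectThreshold as st # selects the threshold

import importlib # reloads a module after editing it (imp was removed in Python 3.12)
importlib.reload(st)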

plt.ion()
# np.set_printoptions(formatter={'float': '{: 0.6f}'.format})

Load the data, which consists of a training set and a cross-validation set

# ===================== Part 1: Load Example Dataset =====================
data = scio.loadmat('ex8data1.mat')
X = data['X']
Xval = data['Xval']
yval = data['yval'].flatten()

The training set has 307 examples with two features

X.shape

(307, 2)

The cross-validation set comes with labels (yval) that mark which examples are anomalous

print('Xval.shape: ', Xval.shape, '\nyval.shape: ', yval.shape)

Xval.shape:  (307, 2) 
yval.shape:  (307,)

Scatter plot of the training set: most of the points are tightly clustered, while a few are scattered further out and may be anomalies

# Visualize the example dataset
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c='b', marker='x', s=15, linewidth=1)
plt.axis([0, 28, 0, 28])
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')

Text(0,0.5,'Throughput (mb/s)')

[Figure: scatter plot of the training set, latency (ms) against throughput (mb/s)]

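1.2 Estimating the Gaussian parameters (estimateGaussian.py)

Mean:
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$$
Variance:
$$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2$$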

import numpy as np


def estimate_gaussian(X):
    # Useful variables
    m, n = X.shape

    # You should return these values correctly
    mu = np.zeros(n)
    sigma2 = np.zeros(n)

    # ==========================================================
    mu = np.mean(X, 0) # mean of each feature
    sigma2 = np.sum(np.power((X - mu), 2), 0) / m # variance of each feature (divides by m)
    # sigma2 = np.var(X, axis=0) # gives the same result
    return mu, sigma2

Call the function to compute the mean and variance

# Estimate mu and sigma2
mu, sigma2 = eg.estimate_gaussian(X)

Mean

mu

array([14.11222578, 14.99771051])

Variance

print(sigma2)
np.var(X, axis=0)

[1.83263141 1.70974533]
array([1.83263141, 1.70974533])
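
The two results agree because `np.var` defaults to `ddof=0`, so it divides by m just like the hand-written sum in `estimate_gaussian`. The sketch below uses made-up toy data to show how this differs from the unbiased estimator (`ddof=1`), which divides by m - 1; the names `X_toy`, `by_hand`, `var_ddof0` and `var_ddof1` are illustrative only.

```python
import numpy as np

# Toy data (made up): 4 examples, 2 features.
X_toy = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0],
                  [4.0, 8.0]])
m = X_toy.shape[0]
mu_toy = X_toy.mean(axis=0)

by_hand = np.sum((X_toy - mu_toy) ** 2, axis=0) / m  # divide by m
var_ddof0 = np.var(X_toy, axis=0)                     # NumPy default, ddof=0
var_ddof1 = np.var(X_toy, axis=0, ddof=1)             # divide by m - 1
```

For a dataset with hundreds of examples the two estimators differ only slightly, so either choice leads to nearly the same fitted density.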

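1.3 Gaussian density function (multivariateGaussian.py)

Gaussian density, where k is the number of features and Σ is the covariance matrix (here a diagonal matrix with the variances sigma2 on its diagonal):
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^{T}\,\Sigma^{-1}\,(x - \mu)\right)$$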

import numpy as np
def multivariate_gaussian(X, mu, sigma2):
    k = mu.size
    if sigma2.ndim == 1 or (sigma2.ndim == 2 and (sigma2.shape[1] == 1 or sigma2.shape[0] == 1)):
        sigma2 = np.diag(sigma2)

    x = X - mu
    p = (2 * np.pi) ** (-k / 2) * np.linalg.det(sigma2) ** (-0.5) * np.exp(-0.5*np.sum(np.dot(x, np.linalg.pinv(sigma2)) * x, axis=1))

    return p
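
Because the variance vector is turned into a diagonal covariance matrix, the multivariate density reduces to the product of independent per-feature Gaussians. The sketch below checks this numerically on made-up values; `product_of_univariate` and `X_demo` are hypothetical names, and `multivariate_gaussian` is re-declared here (taking a full covariance matrix) only so that the snippet is self-contained.

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    """Density of N(mu, Sigma) at each row of X; Sigma is a full covariance matrix."""
    k = mu.size
    d = X - mu
    quad = np.sum((d @ np.linalg.pinv(Sigma)) * d, axis=1)  # (x-mu)^T Sigma^-1 (x-mu)
    return (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

def product_of_univariate(X, mu, sigma2):
    """Product over features of independent 1-D Gaussian densities."""
    per_feature = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return per_feature.prod(axis=1)

# Made-up parameters and points.
mu = np.array([14.0, 15.0])
sigma2 = np.array([1.8, 1.7])
X_demo = np.array([[14.0, 15.0], [13.0, 16.5], [20.0, 10.0]])

p_multi = multivariate_gaussian(X_demo, mu, np.diag(sigma2))
p_prod = product_of_univariate(X_demo, mu, sigma2)
```

The two arrays match up to floating-point error, and the density drops as the points move away from the mean.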

1.4 Contour-plot function (visualizeFit.py)

import matplotlib.pyplot as plt
import numpy as np
import multivariateGaussian as mvg

def visualize_fit(X, mu, sigma2):
    grid = np.arange(0, 35.5, 0.5)
    x1, x2 = np.meshgrid(grid, grid)

    Z = mvg.multivariate_gaussian(np.c_[x1.flatten('F'), x2.flatten('F')], mu, sigma2)
    Z = Z.reshape(x1.shape, order='F')

    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], marker='x', c='b', s=15, linewidth=1)
    # Do not plot if there are infinities
    if np.sum(np.isinf(X)) == 0:
        lvls = 10.0 ** np.arange(-20, 0, 3)  # float base; the np.float alias was removed in NumPy 1.24
        plt.contour(x1, x2, Z, levels=lvls, colors='r', linewidths=0.7)
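
The contour levels are powers of ten spaced three decades apart, which suits a density that falls off exponentially. A small sketch of what they evaluate to, and of why the base (or the exponents) must be floating point; the names `exponents`, `levels` and `int_base_ok` are illustrative.

```python
import numpy as np

exponents = np.arange(-20, 0, 3)  # -20, -17, ..., -2
levels = 10.0 ** exponents        # float base: 1e-20, 1e-17, ..., 1e-2

# NumPy rejects an integer base raised to negative integer exponents.
try:
    10 ** exponents
    int_base_ok = True
except ValueError:
    int_base_ok = False
```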

Call the functions to evaluate the Gaussian density and plot the fit

# Returns the density of the multivariate normal at each data point(row) of X
p = mvg.multivariate_gaussian(X, mu, sigma2)

# Visualize the fit
vf.visualize_fit(X, mu, sigma2)
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')

Text(0,0.5,'Throughput (mb/s)')

[Figure: training set with the contours of the fitted Gaussian]

p.shape

(307,)

1.5 Tuning the threshold for prediction

Evaluate the Gaussian density on the cross-validation set

# ===================== Part 3: Find Outliers =====================
pval = mvg.multivariate_gaussian(Xval, mu, sigma2)
pval.shape

(307,)

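1.5.1 Threshold-selection function (selectThreshold.py)

The F1 score is computed from precision (prec) and recall (rec):
$$F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}$$
Precision and recall, where tp, fp and fn are the numbers of true positives, false positives and false negatives:
$$prec = \frac{tp}{tp + fp}, \qquad rec = \frac{tp}{tp + fn}$$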
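
A small worked example of these quantities, written in plain Python with made-up labels. The function name `f1_score` and the convention of returning 0 when there are no true positives are assumptions of this sketch, not part of the assignment.

```python
def f1_score(y_true, y_pred):
    """Precision, recall and F1 for binary labels where 1 marks an anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0, 0.0, 0.0  # convention chosen here: avoids dividing by zero
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)

# Made-up labels: 3 true anomalies, 3 predicted anomalies, 2 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
prec, rec, f1 = f1_score(y_true, y_pred)
```

Here tp = 2, fp = 1 and fn = 1, so precision, recall and F1 all equal 2/3.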

import numpy as np

def select_threshold(yval, pval):
    f1 = 0

    # You have to return these values correctly
    best_eps = 0
    best_f1 = 0

    for epsilon in np.linspace(np.min(pval), np.max(pval), num=1001):
        # ===================== Your Code Here =====================
        predictions = np.where(pval < epsilon, 1, 0)  # below epsilon: predict anomaly
        tp = np.sum((predictions == 1) & (yval == 1))  # predicted anomaly, actually anomaly
        fp = np.sum((predictions == 1) & (yval == 0))  # predicted anomaly, actually normal
        fn = np.sum((predictions == 0) & (yval == 1))  # predicted normal, actually anomaly

        if tp == 0:
            continue  # F1 would be 0 or 0/0 here, so it cannot beat best_f1

        prec = tp / (tp + fp)  # precision
        rec = tp / (tp + fn)  # recall

        f1 = 2 * prec * rec / (prec + rec)  # F1 score
        # ==========================================================
        if f1 > best_f1:  # keep the best F1 score and its epsilon
            best_f1 = f1
            best_eps = epsilon

    return best_eps, best_f1

Call the function to compute the best F1 score and epsilon

epsilon, f1 = st.select_threshold(yval, pval)
print('Best epsilon found using cross-validation: {:0.4e}'.format(epsilon))
print('Best F1 on Cross Validation Set: {:0.6f}'.format(f1))
print('(you should see a value epsilon of about 8.99e-05 and F1 of about 0.875)')

Best epsilon found using cross-validation: 8.9909e-05
Best F1 on Cross Validation Set: 0.875000
(you should see a value epsilon of about 8.99e-05 and F1 of about 0.875)

1.5.2 Plotting the contours and the anomalous points

Plot the contours

vf.visualize_fit(X, mu, sigma2)
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')

Mark the points whose density is below epsilon, i.e. the anomalies

# Find outliers in the training set and plot
outliers = np.where(p < epsilon)
plt.scatter(X[outliers, 0], X[outliers, 1], marker='o', facecolors='none', edgecolors='r')

<matplotlib.collections.PathCollection at 0x8136ba8>

[Figure: fitted contours with the detected anomalies circled in red]

1.6 Multidimensional example

Load the data

# ===================== Part 4: Multidimensional Outliers =====================
# Loads the second dataset.
data = scio.loadmat('ex8data2.mat')
X = data['X'] # training set
Xval = data['Xval'] # cross-validation set
yval = data['yval'].flatten() # cross-validation labels

Training set dimensions: 1000 × 11

X.shape

(1000, 11)

Cross-validation set dimensions: 100 × 11

Xval.shape

(100, 11)

Cross-validation label vector

yval.shape

(100,)

1.6.1 Computing the mean and variance of the training set

Call the function to compute the mean and variance of the training set

# Apply the same steps to the larger dataset
mu, sigma2 = eg.estimate_gaussian(X)

Mean:

mu

array([  4.93940034,  -9.63726819,  13.81470749, -10.4644888 ,
        -7.95622922,  10.19950372,  -6.01940755,   7.96982896,
        -6.2531819 ,   2.32451289,   8.47372252])

Variance:

sigma2

array([60.97489373, 53.20572186, 58.51546272, 84.20403725, 65.26859177,
       89.57487757, 55.63349911, 87.16233783, 29.62926829, 70.7852052 ,
       50.50356719])

1.6.2 Evaluating the Gaussian density on the multidimensional data

Evaluate the Gaussian density on the training set

# Training set
p = mvg.multivariate_gaussian(X, mu, sigma2)

Shape of the result

p.shape

(1000,)

Evaluate the Gaussian density on the cross-validation set

# Cross Validation set
pval = mvg.multivariate_gaussian(Xval, mu, sigma2)

Shape of the cross-validation result

pval.shape

(100,)

1.6.3 Computing the best F1 score and epsilon

Call the function to compute the best F1 score and epsilon on the cross-validation set

# Find the best threshold
epsilon, f1 = st.select_threshold(yval, pval)

print('Best epsilon found using cross-validation: {:0.4e}'.format(epsilon)) # best epsilon on the cross-validation set, OK
print('Best F1 on Cross Validation Set: {:0.6f}'.format(f1)) # best F1 score on the cross-validation set, OK
print('# Outliers found: {}'.format(np.sum(np.less(p, epsilon)))) # anomalies in the training set, OK
print('(you should see a value epsilon of about 1.38e-18, F1 of about 0.615, and 117 outliers)')

Best epsilon found using cross-validation: 1.3772e-18
Best F1 on Cross Validation Set: 0.615385
# Outliers found: 117
(you should see a value epsilon of about 1.38e-18, F1 of about 0.615, and 117 outliers)
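
With 11 features the densities are already around 1e-18, and with more features or more extreme points the product can underflow to zero. One optional variation, which is not part of the assignment, is to compare log-densities against log(epsilon) instead. The sketch below assumes a diagonal covariance; `log_density` and the demo values are hypothetical.

```python
import numpy as np

def log_density(X, mu, sigma2):
    """Log of the product of per-feature Gaussian densities (diagonal covariance)."""
    k = mu.size
    return -0.5 * (k * np.log(2 * np.pi)
                   + np.sum(np.log(sigma2))
                   + np.sum((X - mu) ** 2 / sigma2, axis=1))

# Made-up 11-dimensional parameters, similar in scale to the values above.
mu = np.zeros(11)
sigma2 = np.full(11, 60.0)
X_demo = np.vstack([np.zeros((1, 11)),          # a point exactly at the mean
                    np.full((1, 11), 500.0)])   # a point very far from the mean

logp = log_density(X_demo, mu, sigma2)
p_direct = np.exp(logp)         # the far point underflows to 0.0
log_epsilon = np.log(1.38e-18)  # threshold from the output above, in log space
flags = logp < log_epsilon      # True marks an anomaly
```

The log-density of the far point stays finite even though its density underflows, so the comparison against log(epsilon) remains well defined.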