sklearn documentation: outlier detection on a real data set (MCD), study notes (1)

对方0222

Last modified: 2022-07-07 14:55:19

Tags: sklearn python

First published: 2022-07-07 14:48:20

Article link: https://blog.csdn.net/weixin_61769687/article/details/125653450

First example:

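import numpy as np  # NumPy
from sklearn.covariance import EllipticEnvelope  # class that implements MCD-based outlier detection [1]
from sklearn.svm import OneClassSVM  # class that implements One-Class SVM (OCSVM) [2]
import matplotlib.pyplot as plt  # MATLAB-like plotting interface [3]
import matplotlib.font_manager  # module for finding, managing, and using fonts across platforms [4]
from sklearn.datasets import load_wine  # loader for the wine data set bundled with sklearn [5]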

classifiers = {
    "Empirical Covariance":
        EllipticEnvelope(support_fraction=1.0, contamination=0.25),  # Mahalanobis distance (all samples used)
    "Robust Covariance (Minimum Covariance Determinant)":
        EllipticEnvelope(contamination=0.25),  # robust Mahalanobis distance (MCD)
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),  # One-Class SVM
}  # a dict mapping display names to estimators
colors = ["m", "g", "b"]  # a list with one contour color per estimator
legend1 = {}  # start with an empty dict
legend2 = {}

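X1 = load_wine()["data"][:, [1, 2]]  # n=178 samples; keep two features (p=2): malic_acid and ash
xx1, yy1 = np.meshgrid(np.linspace(0, 6, 500), np.linspace(1, 4.5, 500))
# np.meshgrid pairs every value of the first 1-D array with every value of the second to form a grid of points;
# xx1 holds the x coordinates and yy1 the y coordinates, and matching positions describe the same point [6]
# np.linspace creates evenly spaced values (an arithmetic sequence) [7]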

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # enumerate yields (index, element) pairs one at a time.
    # dict.items() returns an iterable view of the (key, value) pairs:
    # classifiers.items() => dict_items([('Empirical Covariance', EllipticEnvelope(contamination=0.25, support_fraction=1.0)),
    #                                     ('Robust Covariance (Minimum Covariance Determinant)', EllipticEnvelope(contamination=0.25)),
    #                                     ('OCSVM', OneClassSVM(gamma=0.35, nu=0.25))])
    plt.figure(1)  # select figure 1 so that all contours are drawn on the same figure [8]
    clf.fit(X1)  # fit the model (an elliptic envelope for the first two entries, a One-Class SVM for the third)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    # ravel() flattens an array to 1-D.
    # np.c_[a, b, ...] stacks 1-D arrays as columns; the arrays must have the same length.
    # decision_function() returns a signed score per sample (for OneClassSVM, the signed distance to the
    # separating hyperplane); negative scores mark outliers, so the level-0 contour is the decision boundary.
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])  # plt.contour() draws contour lines [9]
    # print(legend1) => {'Empirical Covariance': <matplotlib.contour.QuadContourSet object at 0x000001929240DFD0>,
    #                    'Robust Covariance (Minimum Covariance Determinant)': <matplotlib.contour.QuadContourSet object at 0x0000019291DFCA90>,
    #                    'OCSVM': <matplotlib.contour.QuadContourSet object at 0x0000019291DFCD00>}

legend1_values_list = list(legend1.values())
'''
print(legend1_values_list) => [<matplotlib.contour.QuadContourSet object at 0x000002B09073EFD0>, 
                                  <matplotlib.contour.QuadContourSet object at 0x000002B09074DA90>, 
                                  <matplotlib.contour.QuadContourSet object at 0x000002B09074DD00>]
'''
legend1_keys_list = list(legend1.keys())
'''
print(legend1_keys_list) => ['Empirical Covariance', 
                                'Robust Covariance (Minimum Covariance Determinant)', 
                                'OCSVM']
'''

plt.figure(1)  # make figure 1 the current figure again [8]
plt.title("Outlier detection on a real data set (wine recognition)")
plt.scatter(X1[:, 0], X1[:, 1], color="black")
bbox_args = dict(boxstyle="round", fc="0.8")  # dict() builds a dictionary
# boxstyle: shape of the box around the text (rounded corners here); fc: face (background) color
arrow_args = dict(arrowstyle="->")  # '->' defaults: head_length=0.4, head_width=0.2
plt.annotate(  # plt.annotate adds a text annotation [10]
    "outlying points",  # annotation text
    xy=(4, 2),  # the point being annotated
    xycoords="data",  # 'data': use the data coordinate system of the annotated axes (default)
    textcoords="data",  # 'data': same coordinate system as xycoords
    xytext=(3, 1.25),  # position of the annotation text
    bbox=bbox_args,
    arrowprops=arrow_args,
)

plt.xlim((xx1.min(), xx1.max()))  # get or set the x-axis limits of the current axes
plt.ylim((yy1.min(), yy1.max()))
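# Note: ContourSet.collections was deprecated in Matplotlib 3.8 and removed in later releases;
# on newer versions, legend1_values_list[k].legend_elements()[0][0] can supply the legend handle instead.
plt.legend(  # add a legend [11]
    (
        legend1_values_list[0].collections[0],
        legend1_values_list[1].collections[0],
        legend1_values_list[2].collections[0],
    ),
    (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
    loc="upper center",  # place the legend at the top center
    prop=matplotlib.font_manager.FontProperties(size=11),  # prop: font properties; FontProperties stores and manipulates them [12]
)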
'''
print(legend1_values_list[0]) => <matplotlib.contour.QuadContourSet object at 0x000001758F0CEFD0> 
print(legend1_values_list[0].collections[0]) => <matplotlib.collections.LineCollection object at 0x000001758F0DC310>
'''
plt.ylabel("ash")
plt.xlabel("malic_acid")
plt.show()
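
The grid-evaluation steps used in the script above (np.meshgrid, ravel(), np.c_, then reshape) can be checked on a tiny grid. The following is a minimal sketch, not part of the original example: the 3 x 2 grid and the stand-in score function are assumptions chosen only so the shapes are easy to follow by hand.

```python
import numpy as np

# Tiny grid: 3 x values and 2 y values (the script uses 500 of each).
xs = np.linspace(0, 2, 3)       # 0, 1, 2
ys = np.linspace(10, 11, 2)     # 10, 11
xx, yy = np.meshgrid(xs, ys)    # both arrays have shape (2, 3)

# Flatten both grids and stack them as columns: one row per grid point.
points = np.c_[xx.ravel(), yy.ravel()]   # shape (6, 2)

# Hypothetical stand-in for clf.decision_function: any function that
# returns one score per row would do here.
scores = points[:, 0] + points[:, 1]     # shape (6,)

# Reshape the scores back to the grid shape, as plt.contour expects.
Z = scores.reshape(xx.shape)             # shape (2, 3)

print(points)
print(Z)
```

Each row of points is one (x, y) location, which is the layout decision_function expects; reshaping back to xx.shape is what lets plt.contour(xx1, yy1, Z1, ...) line the scores up with the grid.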

[1] sklearn.covariance.EllipticEnvelope

Documentation: sklearn.covariance.EllipticEnvelope (scikit-learn 1.1.1 documentation)
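
To make the "Mahalanobis distance" label in the script more concrete, here is a rough NumPy-only sketch of the idea behind the "Empirical Covariance" case: estimate the mean and covariance from all samples, compute each sample's squared Mahalanobis distance, and flag the most distant fraction. This is an illustration under stated assumptions, not EllipticEnvelope's actual implementation; the toy data, the 0.05 fraction, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 95 points near the origin plus 5 far-away points.
inliers = rng.normal(size=(95, 2))
far_points = rng.normal(loc=8.0, size=(5, 2))
X = np.vstack([inliers, far_points])

# Empirical location and covariance, computed from ALL samples.
location = X.mean(axis=0)
covariance = np.cov(X, rowvar=False)
precision = np.linalg.inv(covariance)

# Squared Mahalanobis distance of every sample to the empirical mean:
# d2[i] = (x_i - mu)^T  Sigma^{-1}  (x_i - mu)
centered = X - location
d2 = np.einsum("ij,jk,ik->i", centered, precision, centered)

# Flag the fraction of samples with the largest distances
# (this plays the role that `contamination` plays in the script).
contamination = 0.05
threshold = np.quantile(d2, 1 - contamination)
is_outlier = d2 > threshold

print(f"{int(is_outlier.sum())} of {len(X)} samples flagged")
```

Because the mean and covariance here are computed from every sample, the far-away points also influence the estimate; the robust (MCD) variant in the script is the one designed to limit that influence.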

[2] sklearn.svm.OneClassSVM

Documentation: sklearn.svm.OneClassSVM (scikit-learn 1.1.1 documentation)
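
As a small companion to the script, the sketch below fits the same OneClassSVM(nu=0.25, gamma=0.35) on the same two wine features and compares decision_function with predict. It is a minimal illustration, assuming scikit-learn is installed; the variable names are hypothetical and no particular count of flagged points is claimed.

```python
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

# Same two features as in the script: malic_acid and ash.
X = load_wine()["data"][:, [1, 2]]

# Same hyperparameters as the "OCSVM" entry in the classifiers dict.
clf = OneClassSVM(nu=0.25, gamma=0.35).fit(X)

# Signed scores: negative values fall outside the learned boundary.
scores = clf.decision_function(X)

# predict() returns +1 for inliers and -1 for outliers.
labels = clf.predict(X)

n_flagged = int((labels == -1).sum())
print(f"{n_flagged} of {len(X)} training points flagged as outliers")
```

The level-0 contour drawn in the script is exactly the set of grid points where this score changes sign.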

[3] matplotlib.pyplot

Documentation: pyplot (Matplotlib 2.0.2 documentation)

[4] matplotlib.font_manager

Documentation: font_manager (Matplotlib 2.0.2 documentation)

[5] from sklearn.datasets import load_wine: