GitHub: link
1 Linear-model-based methods
- Linear regression
- Principal component analysis (PCA)
Underlying assumptions:
- Features are approximately linearly correlated
- Subspace assumption: the data lie close to a low-dimensional subspace
The idea is that, because the data are produced by the same underlying generative model, the features are correlated with one another. By fitting a model of this relationship, we can flag as anomalies the points that violate the assumed (correlation) structure.
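The linear-relation idea can be sketched directly before any modeling: fit a least-squares line to two correlated features and flag the points with large residuals. A minimal illustration on synthetic data (not part of the original notebook):

```python
import numpy as np

# If two features are (approximately) linearly related, points that
# violate the fitted relation can be flagged via their residuals.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3 * x + 1 + rng.normal(0, 0.5, 200)
y[:5] += 15  # inject 5 anomalies that break the linear relation

# Least-squares fit of y = a*x + b
a, b = np.polyfit(x, y, 1)
resid = np.abs(y - (a * x + b))

# Flag points whose residual exceeds 3 standard deviations
flags = resid > 3 * resid.std()
print(flags[:5])  # the injected anomalies should be flagged
```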
2 Data visualization
To check whether the dataset satisfies the assumed model, we first explore it visually.
2.1 Importing the dataset
- Introduction to the missingno package: link
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# missingno is used to inspect missing values
Train_data = pd.read_csv('D:/code/Github/data/anomalyDetection/breast-cancer-unsupervised-ad.csv')
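missingno is imported above but never called. A minimal missingness check could look like the following (a synthetic frame with injected NaNs stands in for `Train_data`; the `msno.matrix` call is the graphical equivalent):

```python
import numpy as np
import pandas as pd

# Synthetic frame with a few injected NaNs, standing in for Train_data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df.loc[::10, "b"] = np.nan  # 10 missing values in column "b"

# Non-graphical missingness summary with pandas alone
print(df.isna().sum())

# Graphical view of the same information (requires missingno):
# import missingno as msno
# msno.matrix(df)
```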
2.2 Summary statistics
# Summary statistics
Train_data.describe()
 | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f20 | f21 | f22 | f23 | f24 | f25 | f26 | f27 | f28 | f29
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | ... | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 |
mean | 12.251060 | 17.934768 | 78.842343 | 472.806267 | 0.093072 | 0.082832 | 0.049710 | 0.027601 | 0.175206 | 0.063105 | ... | 13.553049 | 23.583869 | 88.226540 | 577.790463 | 0.125974 | 0.191583 | 0.176194 | 0.078041 | 0.273496 | 0.080501 |
std | 1.951637 | 3.994254 | 13.055722 | 156.964788 | 0.013993 | 0.038650 | 0.049282 | 0.019776 | 0.025584 | 0.007118 | ... | 2.320620 | 5.538491 | 15.995488 | 216.381599 | 0.021036 | 0.114597 | 0.155937 | 0.041798 | 0.049390 | 0.016395 |
min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.051850 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156600 | 0.055210 |
25% | 11.135000 | 15.150000 | 71.095000 | 380.700000 | 0.083325 | 0.056235 | 0.020540 | 0.015120 | 0.158550 | 0.058540 | ... | 12.125000 | 19.585000 | 78.610000 | 452.900000 | 0.110800 | 0.114750 | 0.079245 | 0.052595 | 0.240800 | 0.070160 |
50% | 12.230000 | 17.460000 | 78.310000 | 461.400000 | 0.091380 | 0.076080 | 0.038000 | 0.023770 | 0.172000 | 0.061550 | ... | 13.450000 | 22.910000 | 87.240000 | 550.600000 | 0.125600 | 0.172400 | 0.144900 | 0.076320 | 0.269100 | 0.077320 |
75% | 13.455000 | 19.875000 | 86.735000 | 554.300000 | 0.101250 | 0.101450 | 0.063610 | 0.033770 | 0.190250 | 0.065940 | ... | 14.910000 | 26.655000 | 97.455000 | 679.250000 | 0.138700 | 0.236200 | 0.230050 | 0.099515 | 0.301500 | 0.086830 |
max | 20.570000 | 33.810000 | 135.100000 | 1326.000000 | 0.163400 | 0.283900 | 0.410800 | 0.147100 | 0.274300 | 0.097440 | ... | 25.380000 | 41.780000 | 184.600000 | 2019.000000 | 0.209800 | 1.058000 | 1.252000 | 0.265400 | 0.663800 | 0.207500 |
8 rows × 30 columns
# Check for missing values and data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f0 367 non-null float64
1 f1 367 non-null float64
2 f2 367 non-null float64
3 f3 367 non-null float64
4 f4 367 non-null float64
5 f5 367 non-null float64
6 f6 367 non-null float64
7 f7 367 non-null float64
8 f8 367 non-null float64
9 f9 367 non-null float64
10 f10 367 non-null float64
11 f11 367 non-null float64
12 f12 367 non-null float64
13 f13 367 non-null float64
14 f14 367 non-null float64
15 f15 367 non-null float64
16 f16 367 non-null float64
17 f17 367 non-null float64
18 f18 367 non-null float64
19 f19 367 non-null float64
20 f20 367 non-null float64
21 f21 367 non-null float64
22 f22 367 non-null float64
23 f23 367 non-null float64
24 f24 367 non-null float64
25 f25 367 non-null float64
26 f26 367 non-null float64
27 f27 367 non-null float64
28 f28 367 non-null float64
29 f29 367 non-null float64
30 label 367 non-null object
dtypes: float64(30), object(1)
memory usage: 89.0+ KB
2.3 Correlation analysis
- The absolute value of the correlation coefficient is used, to avoid confusing positive with negative correlation.
- The heatmap below shows that some feature pairs are strongly correlated (light cells) while others are only weakly correlated (dark cells).
fig = plt.figure(figsize=(14,14))
sns.heatmap(Train_data.iloc[:,:-1].corr().abs(),square=True)
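Besides the heatmap, the most correlated feature pairs can be ranked numerically. A sketch with a small synthetic frame standing in for `Train_data`:

```python
import numpy as np
import pandas as pd

# Rank feature pairs by absolute correlation (synthetic stand-in data)
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f0", "f1", "f2"])
df["f3"] = 2 * df["f0"] + rng.normal(0, 0.1, 300)  # strongly tied to f0

corr = df.corr().abs()
# Keep the upper triangle only, then sort pairs by correlation strength
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs.head(3))  # (f0, f3) should come out on top
```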
2.4 Distribution of each variable
# Reshape from wide to long format
f = pd.melt(Train_data, value_vars=Train_data.columns.values[:-1], var_name='features', value_name='value')
g = sns.FacetGrid(f, col='features', col_wrap=6,
                  sharex=False, sharey=False)
# Note: distplot is deprecated in seaborn >= 0.11; histplot/kdeplot are the replacements
g = g.map(sns.distplot, 'value', rug=True)
2.5 Pairwise relationships between variables
# Note: the `size` argument was renamed to `height` in newer seaborn versions
sns.pairplot(Train_data.iloc[:,:-1], height=2, kind='scatter', diag_kind='kde')
2.6 Dimensionality reduction
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, init='pca', random_state=0)
result = tsne.fit_transform(Train_data.iloc[:,:-1])
# Rescale the embedding to [0, 1]
x_min, x_max = np.min(result, 0), np.max(result, 0)
result = (result - x_min) / (x_max - x_min)
label = Train_data['label']
fig = plt.figure(figsize=(7, 7))
sns.scatterplot(x=result[:,0], y=result[:,1], hue=label)
3 The PCA detector in the pyod package
3.1 How PCA detects outliers
- PCA eigendecomposes the data's covariance structure, so that each sample can be expressed as a combination of eigenvectors weighted by their eigenvalues.
- After projection, normal samples are explained mainly by the eigenvectors associated with the large eigenvalues (the principal subspace).
- Anomalous samples, by contrast, have large components along the eigenvectors associated with the small eigenvalues (the minor subspace).
- The size of a sample's component outside the principal subspace can therefore be used to separate anomalies from normal points.
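The intuition above can be sketched with plain NumPy: project the data onto the top principal component, reconstruct, and use the reconstruction error as the anomaly score. This is an illustration on synthetic 2-D data, not pyod's exact scoring formula:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 inliers near a 1-D subspace of R^2 (the line y = 2x) ...
t = rng.normal(size=200)
X = np.c_[t, 2 * t] + rng.normal(0, 0.05, (200, 2))
# ... plus 5 anomalies placed far from that subspace
u = np.linspace(2, 3, 5)
X = np.vstack([X, np.c_[u, -2 * u]])

Xc = X - X.mean(axis=0)
# Eigendecomposition of the covariance matrix (eigenvalues ascending)
w, V = np.linalg.eigh(np.cov(Xc.T))
top = V[:, [-1]]            # eigenvector with the largest eigenvalue
recon = Xc @ top @ top.T    # reconstruction from the principal subspace
score = np.linalg.norm(Xc - recon, axis=1)

# The injected anomalies (indices 200-204) should receive the highest scores
print(np.argsort(score)[-5:])
```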
3.2 The pyod.models.pca.PCA class
pyod.models.pca.PCA(n_components=None, n_selected_components=None, contamination=0.1, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, weighted=True, standardization=True)
- Key parameters
- n_components: number of components to keep
- n_selected_components: number of components used when computing the outlier scores
- contamination: expected proportion of outliers in the data
- svd_solver: string, one of {'auto', 'full', 'arpack', 'randomized'}
3.3 A pyod.models.pca example
Generating data
from pyod.utils.data import generate_data, get_outliers_inliers
# Generate random 5-dimensional data
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=5)
# Split into outliers and inliers
x_outliers, x_inliers = get_outliers_inliers(X_train, Y_train)
# Plot the first two dimensions of the generated data
df_train = pd.DataFrame(X_train)
df_train['y'] = Y_train
sns.scatterplot(x=0, y=1, hue='y', data=df_train);
plt.title('Ground Truth');
Training the model
from pyod.models.pca import PCA
outlier_fraction = 0.1  # expected proportion of outliers
pca = PCA(n_components=2, contamination=outlier_fraction)  # build the model
pca.fit(X_train)  # fit the model
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=2,
n_selected_components=None, random_state=None, standardization=True,
svd_solver='auto', tol=0.0, weighted=True, whiten=False)
Results
y_pred = pca.predict(X_train)  # predicted labels for the training samples
from sklearn.metrics import classification_report
print(classification_report(y_true=Y_train, y_pred=y_pred))
precision recall f1-score support
0.0 1.00 1.00 1.00 180
1.0 1.00 1.00 1.00 20
accuracy 1.00 200
macro avg 1.00 1.00 1.00 200
weighted avg 1.00 1.00 1.00 200
y_train_pred = pca.labels_             # binary labels (0 = inlier, 1 = outlier)
y_train_scores = pca.decision_scores_  # raw anomaly scores
sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette='RdBu_r');
plt.title('Anomaly Scores by PCA');
4 Anomaly detection on the breast-cancer dataset
pca = PCA(n_components=20)  # build the model
pca.fit(Train_data.iloc[:,:-1])
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=20,
n_selected_components=None, random_state=None, standardization=True,
svd_solver='auto', tol=0.0, weighted=True, whiten=False)
# Note: pyod's labels_ convention is 0 = inlier and 1 = outlier, so this map
# is inverted; the poor scores below are an artifact of the flipped labels.
y_true = Train_data.label.map({'n':1,'o':0})
print(classification_report(y_true=y_true, y_pred=pca.labels_))
precision recall f1-score support
0 0.00 0.00 0.00 10
1 0.73 0.08 0.14 357
accuracy 0.07 367
macro avg 0.36 0.04 0.07 367
weighted avg 0.71 0.07 0.13 367