Anomaly Detection: Linear Detection Models

GitHub: link

1 Linear Model Methods

  1. Linear regression
  2. Principal component analysis (PCA)

Underlying assumptions:

  • Approximate linear correlation between features
  • Subspace assumption (the data lie near a lower-dimensional subspace)

The idea is that because the data are generated by the same underlying process, the features are correlated with one another. By fitting a model of these relationships, we can identify anomalies as the points that violate the assumed (correlation) structure, as the sketch below illustrates.
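As a concrete illustration of the residual idea (a minimal sketch on synthetic data, not from the original post): assume one feature depends linearly on another, fit a regression, and flag points with unusually large residuals as candidate anomalies.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=200)
y[:5] += 5  # inject a few points that break the linear relationship

model = LinearRegression().fit(x, y)
residuals = np.abs(y - model.predict(x))
threshold = np.percentile(residuals, 97.5)  # flag the top 2.5% of residuals
print(np.where(residuals > threshold)[0])   # indices of candidate anomalies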

2 Data Visualization

To judge whether the dataset satisfies the assumed model, we first explore it visually.

2.1 Importing the Dataset

  • Introduction and usage of the missingno package: link
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# missingno is used to inspect missing values in the data

# raw string avoids backslash-escape issues in the Windows path
Train_data = pd.read_csv(r'D:\code\Github\data/anomalyDetection/breast-cancer-unsupervised-ad.csv')
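Although the code above never calls it, a typical missingno usage would look like the following sketch (this dataset turns out to have no missing values, as info() confirms below):

# visualize per-column missingness as a matrix
# (all bars are solid here, since nothing is missing)
msno.matrix(Train_data)
plt.show()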

2.2 Examining Summary Statistics

# summary statistics
Train_data.describe()
               f0          f1          f2           f3  ...         f28         f29
count  367.000000  367.000000  367.000000   367.000000  ...  367.000000  367.000000
mean    12.251060   17.934768   78.842343   472.806267  ...    0.273496    0.080501
std      1.951637    3.994254   13.055722   156.964788  ...    0.049390    0.016395
min      6.981000    9.710000   43.790000   143.500000  ...    0.156600    0.055210
25%     11.135000   15.150000   71.095000   380.700000  ...    0.240800    0.070160
50%     12.230000   17.460000   78.310000   461.400000  ...    0.269100    0.077320
75%     13.455000   19.875000   86.735000   554.300000  ...    0.301500    0.086830
max     20.570000   33.810000  135.100000  1326.000000  ...    0.663800    0.207500

[8 rows x 30 columns]

# check missing values and dtypes
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   f0      367 non-null    float64
 1   f1      367 non-null    float64
 2   f2      367 non-null    float64
 3   f3      367 non-null    float64
 4   f4      367 non-null    float64
 5   f5      367 non-null    float64
 6   f6      367 non-null    float64
 7   f7      367 non-null    float64
 8   f8      367 non-null    float64
 9   f9      367 non-null    float64
 10  f10     367 non-null    float64
 11  f11     367 non-null    float64
 12  f12     367 non-null    float64
 13  f13     367 non-null    float64
 14  f14     367 non-null    float64
 15  f15     367 non-null    float64
 16  f16     367 non-null    float64
 17  f17     367 non-null    float64
 18  f18     367 non-null    float64
 19  f19     367 non-null    float64
 20  f20     367 non-null    float64
 21  f21     367 non-null    float64
 22  f22     367 non-null    float64
 23  f23     367 non-null    float64
 24  f24     367 non-null    float64
 25  f25     367 non-null    float64
 26  f26     367 non-null    float64
 27  f27     367 non-null    float64
 28  f28     367 non-null    float64
 29  f29     367 non-null    float64
 30  label   367 non-null    object 
dtypes: float64(30), object(1)
memory usage: 89.0+ KB

2.3 Correlation Analysis

  • To avoid confusing positive with negative correlation, we take the absolute value of the correlation coefficients
  • The heatmap below shows that some feature pairs are strongly correlated (light cells) while others are only weakly correlated (dark cells)
fig = plt.figure(figsize=(14,14))
sns.heatmap(Train_data.iloc[:,:-1].corr().abs(),square=True)
<matplotlib.axes._subplots.AxesSubplot at 0x22c675dc550>

[Figure: absolute-correlation heatmap of the 30 features]

2.4 Plotting Each Variable's Distribution

# reshape from wide to long format
f = pd.melt(Train_data, value_vars=Train_data.columns.values[:-1],
            var_name='features', value_name='value')
g = sns.FacetGrid(f, col='features', col_wrap=6,
                  sharex=False, sharey=False)
g = g.map(sns.distplot, 'value', rug=True)

[Figure: distribution plot of each feature, with rug marks]

2.5 Pairwise Relationships Between Variables

sns.pairplot(Train_data.iloc[:,:-1], height=2, kind='scatter', diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x22c6b8c61f0>

[Figure: pairwise scatter matrix with KDE plots on the diagonal]

2.6 Dimensionality Reduction

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, init='pca', random_state=0)
result = tsne.fit_transform(Train_data.iloc[:,:-1])
# min-max scale the embedding to [0, 1] for plotting
x_min, x_max = np.min(result, 0), np.max(result, 0)
result = (result - x_min) / (x_max - x_min)
label = Train_data['label']
fig = plt.figure(figsize=(7, 7))
sns.scatterplot(x=result[:,0], y=result[:,1], hue=label)
<matplotlib.axes._subplots.AxesSubplot at 0x22c09f4a2e0>

[Figure: 2-D t-SNE embedding colored by label]

3 The pca Detector in the pyod Package

3.1 How PCA-Based Outlier Detection Works

  • PCA eigendecomposes the data's covariance matrix, so each sample can be written as a combination of eigenvectors weighted by its projections onto them
  • After projection, normal samples are dominated by the eigenvectors corresponding to large eigenvalues (the high-variance directions)
  • Anomalous samples, in contrast, have unusually large components along the eigenvectors corresponding to small eigenvalues
  • Scoring samples by their deviation along these directions therefore separates anomalies from normal points (see the sketch after this list)
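A minimal sketch of this scoring idea (not pyod's exact implementation): project the centered data onto all principal axes and weight each squared coordinate by the inverse of its eigenvalue, so that deviations along low-variance directions dominate the score.

import numpy as np
from sklearn.decomposition import PCA as SkPCA

def pca_anomaly_scores(X):
    """Variance-weighted distance of each sample along the principal axes."""
    pca = SkPCA()                     # keep all components
    projected = pca.fit_transform(X)  # centered projections onto the axes
    # dividing by the explained variance up-weights deviations along
    # small-eigenvalue (minor) directions, where anomalies stand out
    return (projected ** 2 / pca.explained_variance_).sum(axis=1)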

3.2 The pyod.models.pca.PCA Class

pyod.models.pca.PCA(n_components=None, n_selected_components=None, contamination=0.1, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, weighted=True, standardization=True)
  • Parameters
    • n_components: number of principal components to keep
    • n_selected_components: number of components used when computing the outlier scores
    • svd_solver: string, one of {'auto', 'full', 'arpack', 'randomized'}

3.3 pyod PCA Example

Generating data

from pyod.utils.data import generate_data, get_outliers_inliers

# generate random training data with 5 features
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=5)

# split into outliers and inliers
x_outliers, x_inliers = get_outliers_inliers(X_train, Y_train)

# plot the generated data (first two features only)
df_train = pd.DataFrame(X_train)
df_train['y'] = Y_train
sns.scatterplot(x=0, y=1, hue='y', data=df_train);
plt.title('Ground Truth');

[Figure: ground-truth scatter of the generated data]

Training the model

from pyod.models.pca import PCA

outlier_fraction = 0.1  # proportion of outliers
pca = PCA(n_components=2, contamination=outlier_fraction)  # instantiate the model
pca.fit(X_train)  # fit the model
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=2,
  n_selected_components=None, random_state=None, standardization=True,
  svd_solver='auto', tol=0.0, weighted=True, whiten=False)

Training results

y_pred = pca.predict(X_train)  # predict labels for the training samples
from sklearn.metrics import classification_report
print(classification_report(y_true=Y_train,y_pred=y_pred))
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       180
         1.0       1.00      1.00      1.00        20

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
y_train_pred = pca.labels_              # binary labels assigned during fit (0 = inlier, 1 = outlier)
y_train_scores = pca.decision_scores_   # raw anomaly scores assigned during fit
sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette='RdBu_r');
plt.title('Anomaly Scores by PCA');

[Figure: scatter of the training data colored by anomaly score]
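Under the hood, labels_ is simply decision_scores_ thresholded at the fitted threshold_, which pyod sets so that a contamination fraction of the training points is flagged. A quick check of that convention:

import numpy as np
# labels_ == 1 exactly where the anomaly score exceeds the fitted threshold
assert np.array_equal(pca.labels_,
                      (pca.decision_scores_ > pca.threshold_).astype(int))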

4 Anomaly Analysis on the breast-cancer Dataset

pca = PCA(n_components=20)  # instantiate the model
pca.fit(Train_data.iloc[:,:-1])
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=20,
  n_selected_components=None, random_state=None, standardization=True,
  svd_solver='auto', tol=0.0, weighted=True, whiten=False)
# pyod uses 0 for inliers and 1 for outliers, so map 'n' (normal) to 0
# and 'o' (outlier) to 1 before comparing against pca.labels_
y_true = Train_data.label.map({'n': 0, 'o': 1})
print(classification_report(y_true=y_true, y_pred=pca.labels_))
              precision    recall  f1-score   support

           0       1.00      0.92      0.96       357
           1       0.27      1.00      0.43        10

    accuracy                           0.93       367
   macro avg       0.64      0.96      0.69       367
weighted avg       0.98      0.93      0.95       367

The detector recovers all 10 true anomalies, at the cost of flagging 27 normal samples (with the default contamination of 0.1, roughly a tenth of the 367 points are flagged).
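Because this label-based report depends heavily on the contamination threshold, a threshold-free check using the raw scores is also worth running (a sketch; output not shown):

from sklearn.metrics import roc_auc_score
# rank-based evaluation of the raw anomaly scores, independent of threshold
print('ROC AUC:', roc_auc_score(y_true, pca.decision_scores_))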