GitHub: link
1 Linear-model-based methods
- Linear regression
- Principal component analysis (PCA)
Underlying assumptions:
- Features are approximately linearly correlated
- Subspace assumption: the data lie close to a low-dimensional subspace
The idea is that, because the data are produced by the same underlying generative model, the features are correlated with one another. By fitting a model of this relationship, we can flag as anomalies the points that violate the assumed (correlation) structure.
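The linear-relation idea can be sketched directly before any modeling: fit a least-squares line to two correlated features and flag the points with large residuals. A minimal illustration on synthetic data (not part of the original notebook):

```python
import numpy as np

# If two features are (approximately) linearly related, points that
# violate the fitted relation can be flagged via their residuals.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3 * x + 1 + rng.normal(0, 0.5, 200)
y[:5] += 15  # inject 5 anomalies that break the linear relation

# Least-squares fit of y = a*x + b
a, b = np.polyfit(x, y, 1)
resid = np.abs(y - (a * x + b))

# Flag points whose residual exceeds 3 standard deviations
flags = resid > 3 * resid.std()
print(flags[:5])  # the injected anomalies should be flagged
```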
2 Data visualization
To check whether the dataset satisfies the assumed model, we first explore it visually.
2.1 Importing the dataset
- Introduction to the missingno package: link
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# missingno is used to inspect missing values
Train_data = pd.read_csv('D:/code/Github/data/anomalyDetection/breast-cancer-unsupervised-ad.csv')
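missingno is imported above but never called. A minimal missingness check could look like the following (a synthetic frame with injected NaNs stands in for `Train_data`; the `msno.matrix` call is the graphical equivalent):

```python
import numpy as np
import pandas as pd

# Synthetic frame with a few injected NaNs, standing in for Train_data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df.loc[::10, "b"] = np.nan  # 10 missing values in column "b"

# Non-graphical missingness summary with pandas alone
print(df.isna().sum())

# Graphical view of the same information (requires missingno):
# import missingno as msno
# msno.matrix(df)
```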
2.2 Summary statistics
# Summary statistics
Train_data.describe()
 | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f20 | f21 | f22 | f23 | f24 | f25 | f26 | f27 | f28 | f29
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | ... | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 |
mean | 12.251060 | 17.934768 | 78.842343 | 472.806267 | 0.093072 | 0.082832 | 0.049710 | 0.027601 | 0.175206 | 0.063105 | ... | 13.553049 | 23.583869 | 88.226540 | 577.790463 | 0.125974 | 0.191583 | 0.176194 | 0.078041 | 0.273496 | 0.080501 |
std | 1.951637 | 3.994254 | 13.055722 | 156.964788 | 0.013993 | 0.038650 | 0.049282 | 0.019776 | 0.025584 | 0.007118 | ... | 2.320620 | 5.538491 | 15.995488 | 216.381599 | 0.021036 | 0.114597 | 0.155937 | 0.041798 | 0.049390 | 0.016395 |
min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.051850 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156600 | 0.055210 |
25% | 11.135000 | 15.150000 | 71.095000 | 380.700000 | 0.083325 | 0.056235 | 0.020540 | 0.015120 | 0.158550 | 0.058540 | ... | 12.125000 | 19.585000 | 78.610000 | 452.900000 | 0.110800 | 0.114750 | 0.079245 | 0.052595 | 0.240800 | 0.070160 |
50% | 12.230000 | 17.460000 | 78.310000 | 461.400000 | 0.091380 | 0.076080 | 0.038000 | 0.023770 | 0.172000 | 0.061550 | ... | 13.450000 | 22.910000 | 87.240000 | 550.600000 | 0.125600 | 0.172400 | 0.144900 | 0.076320 | 0.269100 | 0.077320 |
75% | 13.455000 | 19.875000 | 86.735000 | 554.300000 | 0.101250 | 0.101450 | 0.063610 | 0.033770 | 0.190250 | 0.065940 | ... | 14.910000 | 26.655000 | 97.455000 | 679.250000 | 0.138700 | 0.236200 | 0.230050 | 0.099515 | 0.301500 | 0.086830 |
max | 20.570000 | 33.810000 | 135.100000 | 1326.000000 | 0.163400 | 0.283900 | 0.410800 | 0.147100 | 0.274300 | 0.097440 | ... | 25.380000 | 41.780000 | 184.600000 | 2019.000000 | 0.209800 | 1.058000 | 1.252000 | 0.265400 | 0.663800 | 0.207500 |
8 rows × 30 columns
# Check for missing values and data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f0 367 non-null float64
1 f1 367 non-null float64
2 f2 367 non-null float64
3 f3 367 non-null float64
4 f4 367 non-null float64
5 f5 367 non-null float64
6 f6 367 non-null float64
7 f7 367 non-null float64
8 f8 367 non-null float64
9 f9 367 non-null float64
10 f10 367 non-null float64
11 f11 367 non-null float64
12 f12 367 non-null float64
13 f13 367 non-null float64
14 f14 367 non-null float64
15 f15 367 non-null float64
16 f16 367 non-null float64
17 f17 367 non-null float64
18 f18 367 non-null float64
19 f19 367 non-null float64
20 f20 367 non-null float64
21 f21 367 non-null float64
22 f22 367 non-null float64
23 f23 367 non-null float64
24 f24 367 non-null float64
25 f25 367 non-null float64
26 f26 367 non-null float64
27 f27 367 non-null float64
28 f28 367 non-null float64
29 f29 367 non-null float64
30 label 367 non-null object
dtypes: float64(30), object(1)
memory usage: 89.0+ KB
2.3 Correlation analysis
- The absolute value of the correlation coefficient is used, to avoid confusing positive with negative correlation.
- The heatmap below shows that some feature pairs are strongly correlated (light cells) while others are only weakly correlated (dark cells).
fig = plt.figure(figsize=(14,14))
sns.heatmap(Train_data.iloc[:,:-1].corr().abs(),square=True)
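Besides the heatmap, the most correlated feature pairs can be ranked numerically. A sketch with a small synthetic frame standing in for `Train_data`:

```python
import numpy as np
import pandas as pd

# Rank feature pairs by absolute correlation (synthetic stand-in data)
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f0", "f1", "f2"])
df["f3"] = 2 * df["f0"] + rng.normal(0, 0.1, 300)  # strongly tied to f0

corr = df.corr().abs()
# Keep the upper triangle only, then sort pairs by correlation strength
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs.head(3))  # (f0, f3) should come out on top
```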
2.4 Distribution of each variable
# Reshape from wide to long format
f = pd.melt(Train_data, value_vars=Train_data.columns.values[:-1], var_name='features', value_name='value')
g = sns.FacetGrid(f, col='features', col_wrap=6,
                  sharex=False, sharey=False)
# Note: distplot is deprecated in seaborn >= 0.11; histplot/kdeplot are the replacements
g = g.map(sns.distplot, 'value', rug=True)
2.5 Pairwise relationships between variables
# Note: the `size` argument was renamed to `height` in newer seaborn versions
sns.pairplot(Train_data.iloc[:,:-1], height=2, kind='scatter', diag_kind='kde')
2.6 Dimensionality reduction
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, init='pca', random_state=0)
result = tsne.fit_transform(Train_data.iloc[:,:-1])
# Rescale the embedding to [0, 1]
x_min, x_max = np.min(result, 0), np.max(result, 0)
result = (result - x_min) / (x_max - x_min)
label = Train_data['label']
fig = plt.figure(figsize=(7, 7))
sns.scatterplot(x=result[:,0], y=result[:,1], hue=label)
3 The PCA detector in the pyod package
3.1 How PCA detects outliers
- PCA eigendecomposes the data's covariance structure, so that each sample can be expressed as a combination of eigenvectors weighted by their eigenvalues.
- After projection, normal samples are explained mainly by the eigenvectors associated with the large eigenvalues (the principal subspace).
- Anomalous samples, by contrast, have large components along the eigenvectors associated with the small eigenvalues (the minor subspace).
- The size of a sample's component outside the principal subspace can therefore be used to separate anomalies from normal points.
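The intuition above can be sketched with plain NumPy: project the data onto the top principal component, reconstruct, and use the reconstruction error as the anomaly score. This is an illustration on synthetic 2-D data, not pyod's exact scoring formula:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 inliers near a 1-D subspace of R^2 (the line y = 2x) ...
t = rng.normal(size=200)
X = np.c_[t, 2 * t] + rng.normal(0, 0.05, (200, 2))
# ... plus 5 anomalies placed far from that subspace
u = np.linspace(2, 3, 5)
X = np.vstack([X, np.c_[u, -2 * u]])

Xc = X - X.mean(axis=0)
# Eigendecomposition of the covariance matrix (eigenvalues ascending)
w, V = np.linalg.eigh(np.cov(Xc.T))
top = V[:, [-1]]            # eigenvector with the largest eigenvalue
recon = Xc @ top @ top.T    # reconstruction from the principal subspace
score = np.linalg.norm(Xc - recon, axis=1)

# The injected anomalies (indices 200-204) should receive the highest scores
print(np.argsort(score)[-5:])
```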
3.2 The pyod.models.pca.PCA class
pyod.models.pca.PCA(n_components=None, n_selected_components=None, contamination=0.1, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, weighted=True, standardization=True)
- Key parameters
- n_components: number of components to keep
- n_selected_components: number of components used when computing the outlier scores
- contamination: expected proportion of outliers in the data
- svd_solver: string, one of {'auto', 'full', 'arpack', 'randomized'}
3.3 A pyod.models.pca example
Generating data
from pyod.utils.data import generate_data, get_outliers_inliers
# Generate random 5-dimensional data
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=5)
# Split into outliers and inliers
x_outliers, x_inliers = get_outliers_inliers(X_train, Y_train)
# Plot the first two dimensions of the generated data
df_train = pd.DataFrame(X_train)
df_train['y'] = Y_train
sns.scatterplot(x=0, y=1, hue='y', data=df_train);
plt.title('Ground Truth');
Training the model
from pyod.models.pca import PCA
outlier_fraction = 0.1  # expected proportion of outliers
pca = PCA(n_components=2, contamination=outlier_fraction)  # build the model
pca.fit(X_train)  # fit the model
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=2,
n_selected_components=None, random_state=None, standardization=True,
svd_solver='auto', tol=0.0, weighted=True, whiten=False)
Results
y_pred = pca.predict(X_train)  # predicted labels for the training samples
from sklearn.metrics import classification_report
print(classification_report(y_true=Y_train, y_pred=y_pred))
precision recall f1-score support
0.0 1.00 1.00 1.00 180
1.0 1.00 1.00 1.00 20
accuracy 1.00 200
macro avg 1.00 1.00 1.00 200
weighted avg 1.00 1.00 1.00 200
y_train_pred = pca.labels_             # binary labels (0 = inlier, 1 = outlier)
y_train_scores = pca.decision_scores_  # raw anomaly scores
sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette='RdBu_r');
plt.title('Anomaly Scores by PCA');
4 Anomaly detection on the breast-cancer dataset
pca = PCA(n_components=20)  # build the model
pca.fit(Train_data.iloc[:,:-1])
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=20,
n_selected_components=None, random_state=None, standardization=True,
svd_solver='auto', tol=0.0, weighted=True, whiten=False)
# Note: pyod's labels_ convention is 0 = inlier and 1 = outlier, so this map
# is inverted; the poor scores below are an artifact of the flipped labels.
y_true = Train_data.label.map({'n':1,'o':0})
print(classification_report(y_true=y_true, y_pred=pca.labels_))
precision recall f1-score support
0 0.00 0.00 0.00 10
1 0.73 0.08 0.14 357
accuracy 0.07 367
macro avg 0.36 0.04 0.07 367
weighted avg 0.71 0.07 0.13 367