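Dirty data is hard to avoid in data analysis and data mining, and outliers are a part of data cleaning that can't be ignored. Outliers arise for many different reasons, which this post won't go into.
To deal with them, you can use EllipticEnvelope from sklearn to identify the outliers and KNNImputer to handle them quickly.
Without further ado, here's the code: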
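# Use a regression dataset as the example
from sklearn.datasets import make_regression
# Only the X variables are needed for this demo, so the target is discarded
data, _ = make_regression(n_samples=10,
                          n_features=3,
                          n_targets=1)
data
# Output of data (no random_state is set, so the values differ from run to run)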
array([[ 0.88054648, -0.01786369, -0.74498508],
[ 0.30095811, -0.96697815, 1.05205028],
[ 0.48414616, -0.07507472, -0.10621494],
[ 0.8468415 , -0.10985183, 1.32810419],
[-0.80338906, 0.53577874, -0.74502406],
[-1.75876019, 1.87751452, 0.32122004],
[ 0.08488059, 0.53519016, 0.65906643],
[ 2.85527912, -0.49698752, -0.14953991],
[-0.88683056, 0.84816828, 0.51564286],
[-1.22034335, 0.71323155, -0.35030481]])
# Set the whole first row (the first observation) to 100, making it an outlier
data[0, :] = 100
data
# Output of data after the modification
array([[ 1.00000000e+02, 1.00000000e+02, 1.00000000e+02],
[ 3.00958109e-01, -9.66978155e-01, 1.05205028e+00],
[ 4.84146160e-01, -7.50747213e-02, -1.06214940e-01],
[ 8.46841502e-01, -1.09851829e-01, 1.32810419e+00],
[-8.03389064e-01, 5.35778740e-01, -7.45024056e-01],
[-1.75876019e+00, 1.87751452e+00, 3.21220038e-01],
[ 8.48805931e-02, 5.35190165e-01, 6.59066426e-01],
[ 2.85527912e+00, -4.96987523e-01, -1.49539907e-01],
[-8.86830561e-01, 8.48168280e-01, 5.15642865e-01],
[-1.22034335e+00, 7.13231553e-01, -3.50304809e-01]])
# Next, import EllipticEnvelope
from sklearn.covariance import EllipticEnvelope
detector = EllipticEnvelope()  # build the outlier detector (contamination defaults to 0.1)
detector.fit(data)  # fit the detector
detector.predict(data)  # predict outliers
# Outlier prediction result
array([-1, 1, 1, 1, 1, 1, 1, 1, 1, 1])  # -1 marks an outlier
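The detection step can also be sketched in a reproducible form. This is a minimal sketch, not the original run: `random_state=0` is an assumption added only so the result is repeatable, and `contamination=0.1` simply spells out the default. Because `contamination` sets the share of training rows treated as outliers, a 10-row sample will have roughly one row flagged whether or not a true outlier is present, so it helps to look at the Mahalanobis distances as well as the labels.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_regression

# Assumption: random_state=0 is added here only to make the sketch reproducible.
X, _ = make_regression(n_samples=10, n_features=3, random_state=0)
X[0, :] = 100  # plant one obvious outlier, as in the walkthrough

# contamination=0.1 is the default, written out to make the threshold explicit.
detector = EllipticEnvelope(contamination=0.1, random_state=0)
labels = detector.fit_predict(X)     # -1 = outlier, 1 = inlier
distances = detector.mahalanobis(X)  # squared Mahalanobis distance of each row

print(labels)
print(int(np.argmax(distances)))     # index of the most extreme row
```

If the largest distance is not dramatically bigger than the rest, the flagged row may just be the most extreme ordinary observation rather than a genuine outlier.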
Once the outliers are identified, replace them with missing values, then fill those in with another sklearn class, KNNImputer.
# Replace the values of the first observation with missing values
import numpy as np
data[0, :] = np.nan
data
# Output of data
array([[ nan, nan, nan],
[ 0.30095811, -0.96697815, 1.05205028],
[ 0.48414616, -0.07507472, -0.10621494],
[ 0.8468415 , -0.10985183, 1.32810419],
[-0.80338906, 0.53577874, -0.74502406],
[-1.75876019, 1.87751452, 0.32122004],
[ 0.08488059, 0.53519016, 0.65906643],
[ 2.85527912, -0.49698752, -0.14953991],
[-0.88683056, 0.84816828, 0.51564286],
[-1.22034335, 0.71323155, -0.35030481]])
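Hard-coding the row index works for a demo, but on real data the flagged rows are not known in advance. One possible approach is to turn the predicted labels into a boolean mask. In the sketch below, `labels` and `X` are hypothetical stand-ins for the detector output and the data array, and the seed is an assumption made for reproducibility.

```python
import numpy as np

# Hypothetical stand-ins for the arrays used in the walkthrough.
labels = np.array([-1, 1, 1, 1, 1, 1, 1, 1, 1, 1])  # shaped like detector.predict(data)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
X[0, :] = 100  # the planted outlier

outlier_mask = labels == -1      # True for every row flagged as an outlier
cleaned = X.copy()               # leave the original array untouched
cleaned[outlier_mask] = np.nan   # blank out all flagged rows at once

print(np.flatnonzero(outlier_mask))  # indices of the flagged rows
```

Working on a copy keeps the raw values around in case the flagged rows need to be inspected later.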
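# Filling in the missing values
# Import KNNImputer
from sklearn.impute import KNNImputer
imputer = KNNImputer()  # n_neighbors can be set here; the default is 5
imputer.fit_transform(data)  # returns a new filled array; data itself is unchanged
# Output after imputation (row 0 has no observed values, so it is filled with each column's mean)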
array([[-0.01080196, 0.31788789, 0.28055557],
[ 0.30095811, -0.96697815, 1.05205028],
[ 0.48414616, -0.07507472, -0.10621494],
[ 0.8468415 , -0.10985183, 1.32810419],
[-0.80338906, 0.53577874, -0.74502406],
[-1.75876019, 1.87751452, 0.32122004],
[ 0.08488059, 0.53519016, 0.65906643],
[ 2.85527912, -0.49698752, -0.14953991],
[-0.88683056, 0.84816828, 0.51564286],
[-1.22034335, 0.71323155, -0.35030481]])
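One detail of this result is worth checking: when every feature of a row is missing, KNNImputer has nothing to measure distances with, so it falls back to the column means instead of averaging nearest neighbours. The sketch below reuses the nine intact rows shown above to confirm this, then contrasts it with a small made-up array (`toy`, purely illustrative) where only one feature is missing and the neighbours really are used.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows 1-9 are copied from the walkthrough output; row 0 is entirely missing.
data = np.array([
    [np.nan, np.nan, np.nan],
    [0.30095811, -0.96697815, 1.05205028],
    [0.48414616, -0.07507472, -0.10621494],
    [0.8468415, -0.10985183, 1.32810419],
    [-0.80338906, 0.53577874, -0.74502406],
    [-1.75876019, 1.87751452, 0.32122004],
    [0.08488059, 0.53519016, 0.65906643],
    [2.85527912, -0.49698752, -0.14953991],
    [-0.88683056, 0.84816828, 0.51564286],
    [-1.22034335, 0.71323155, -0.35030481],
])

filled = KNNImputer().fit_transform(data)
col_means = np.nanmean(data, axis=0)        # per-column mean, ignoring NaN
print(np.allclose(filled[0], col_means))    # True: row 0 is just the column means

# Made-up data: row 0 misses only its last feature, so distances can be computed.
toy = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 10.0],
    [1.0, 2.0, 20.0],
    [50.0, 60.0, 1000.0],
])
toy_filled = KNNImputer(n_neighbors=2).fit_transform(toy)
print(toy_filled[0, 2])                     # 15.0, the mean of the two nearest rows
```

So blanking out an entire flagged row amounts to mean imputation for that row. If only some of its features look suspicious, masking just those features lets KNNImputer use the remaining ones to find neighbours.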
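To sum up, EllipticEnvelope and KNNImputer have their own drawbacks and cases they don't fit (EllipticEnvelope, for example, assumes the data is roughly Gaussian). Still, when you don't know much about the features of your data, especially with anonymized data, these two sklearn classes are worth considering for quick data processing.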