Regression Imputation
First, import the required packages.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import missingno as mno
import warnings
warnings.filterwarnings('ignore')
Next, load the data.
data=np.loadtxt('data/Magic.txt')
tmp_columns=list('abcdefghij')
tmp_columns.append('class')
magic=pd.DataFrame(data=data,columns=tmp_columns)
Randomly sample 10 rows to take a look at the data.
magic.sample(10)
Check the dataset for missing values.
magic.isnull().sum()
We find that there are no missing values.
We plot a heatmap of the features to examine the correlations between them.
%matplotlib inline
# Correlated pairs visible in the heatmap: a-b, a-c, b-c, d-e, j-a, j-b, j-c
complete_features=magic.loc[:,magic.columns.difference(['class'])]
# Draw the heatmap
plt.figure(figsize=(10,10))
sns.heatmap(complete_features.corr(),annot=True)
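Reading the correlated pairs off a heatmap by eye is error-prone, so it can help to list them programmatically. The following is a minimal sketch, not part of the original workflow: the helper name `correlated_pairs`, the threshold of 0.5, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return (feature_1, feature_2, corr) for pairs with |corr| >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so that each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    strong = pairs[pairs.abs() >= threshold]
    return sorted(
        ((a, b, round(float(r), 3)) for (a, b), r in strong.items()),
        key=lambda t: -abs(t[2]),
    )

# Toy data (not the Magic dataset): y is built from x, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    'x': x,
    'y': 2 * x + rng.normal(scale=0.1, size=500),
    'z': rng.normal(size=500),
})
toy_pairs = correlated_pairs(toy, threshold=0.5)
print(toy_pairs)
```

On the real data, the same helper could be called as `correlated_pairs(complete_features)` to compare against the pairs noted above.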
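Next, based on the correlations between features, we choose a, b, and c as the columns that will contain missing values.
We randomly select 10% of the values in a, b, and c and blank them out.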
prob_missing = 0.1
col_incomplete=['a','b','c']
ind_incomplete=[magic.columns.get_loc(i) for i in col_incomplete]
df_incomplete = magic.copy()
ix = [(row, col) for row in range(magic.shape[0]) for col in ind_incomplete]
for row, col in random.sample(ix, int(round(prob_missing * len(ix)))):
    df_incomplete.iat[row, col] = np.nan
# Original (complete) feature columns
df_complete=magic[col_incomplete]
df_incomplete_copy=df_incomplete.copy()
df_incomplete.isna().sum()
mno.matrix(df_incomplete, figsize = (20, 6))
The processed data table, once visualized, looks like the figure below.
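Because the code above uses `random.sample` without a seed, the set of blanked cells changes on every run. A seeded, vectorized variant is sketched below for reproducibility. The function name `inject_missing`, the `seed` parameter, and the toy frame are hypothetical, and the sketch assumes every column is numeric.

```python
import numpy as np
import pandas as pd

def inject_missing(df, columns, prob, seed=0):
    """Return a copy of df with a fraction `prob` of the cells in `columns` set to NaN."""
    rng = np.random.default_rng(seed)
    col_idx = np.asarray([df.columns.get_loc(c) for c in columns])
    n_cells = len(df) * len(col_idx)
    n_missing = int(round(prob * n_cells))
    # Draw distinct flat cell positions, then map them back to (row, column).
    flat = rng.choice(n_cells, size=n_missing, replace=False)
    rows, cols = np.divmod(flat, len(col_idx))
    values = df.to_numpy(dtype=float, copy=True)  # assumes numeric columns
    values[rows, col_idx[cols]] = np.nan
    return pd.DataFrame(values, index=df.index, columns=df.columns)

# Toy frame standing in for the real data.
toy = pd.DataFrame(np.arange(400, dtype=float).reshape(100, 4), columns=list('abcd'))
toy_missing = inject_missing(toy, ['a', 'b', 'c'], prob=0.1, seed=42)
print(toy_missing.isna().sum())
```

Sampling cells without replacement gives exactly `round(prob * n_cells)` missing entries in total, as in the original loop, though the count per column still varies.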
random imputation
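Next, we fill columns a, b, and c by regression imputation. I plan to use a KNN regression model whose training features are all the features other than the one being predicted. However, more than one column contains missing values, so we cannot predict directly: the predictor columns would themselves contain gaps. We therefore first fill the gaps in a, b, and c with values drawn at random from each column's observed values, which gives three new features a_imp, b_imp, and c_imp. When predicting a, for example, b_imp and c_imp can then be used as features in training.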
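To make the problem concrete: as far as I know, scikit-learn's `KNeighborsRegressor` with its default distance metric refuses input that contains NaN. The small check below is only a sketch on made-up numbers, and the exact error message may differ between versions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up feature matrix with one gap in the second column.
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 4.0], [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    KNeighborsRegressor(n_neighbors=3).fit(X, y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print('fit rejected the input:', err)
```

This is why the other incomplete columns need a provisional fill before they can serve as predictors.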
missing_columns=col_incomplete
def random_imputation(df,feature):
    # Fill the gaps in feature+'_imp' with values sampled from the observed entries
    num_missing=df[feature].isnull().sum()
    observed_values=df.loc[df[feature].notnull(),feature]
    df.loc[df[feature].isnull(),feature+'_imp']=np.random.choice(
        observed_values,num_missing,replace=True
    )
    return df
for feature in missing_columns:
    df_incomplete[feature+'_imp']=df_incomplete[feature]
    df_incomplete=random_imputation(df_incomplete,feature)
mno.matrix(df_incomplete,figsize=[20,6])
The data table after adding the filled-in new features is shown in the figure below.
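Two properties of random imputation are worth checking: observed values stay untouched, and every filled value comes from the observed pool. The sketch below verifies both on a tiny Series. `random_fill` is a hypothetical Series-level variant of the function above, written with a seeded generator so that it is reproducible.

```python
import numpy as np
import pandas as pd

def random_fill(series, seed=0):
    """Return a copy of `series` with NaNs replaced by randomly drawn observed values."""
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    # Sample with replacement, one draw per missing entry.
    filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
s_imp = random_fill(s, seed=1)
print(s_imp)
```

Random filling keeps each column's marginal distribution roughly intact but ignores relationships between columns, which is why it serves here only as a provisional step before regression.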
deterministic regression imputation
Next, we use a KNN model (n_neighbors=3) to predict and fill in the missing values of each incomplete feature in turn.
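Before the article's own code, here is the idea on a toy problem: fit the regressor on rows where the target is observed, then predict the rows where it is missing. This is a minimal sketch on synthetic data, not the article's implementation. The column names `x` and `target` and the linear relationship between them are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: target depends almost linearly on x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
toy = pd.DataFrame({'x': x, 'target': 3 * x + rng.normal(scale=0.1, size=200)})
truth = toy['target'].copy()

# Blank out 20 target values to play the role of missing entries.
missing_idx = rng.choice(toy.index, size=20, replace=False)
toy.loc[missing_idx, 'target'] = np.nan

# Train on observed rows only, then predict the missing ones.
observed = toy['target'].notna()
model = KNeighborsRegressor(n_neighbors=3)
model.fit(toy.loc[observed, ['x']], toy.loc[observed, 'target'])
toy.loc[~observed, 'target'] = model.predict(toy.loc[~observed, ['x']])

mae = (toy.loc[missing_idx, 'target'] - truth.loc[missing_idx]).abs().mean()
print(f'mean absolute error on imputed cells: {mae:.3f}')
```

The imputation is called deterministic because each missing cell receives the model's point prediction with no added noise.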
from sklearn.neighbors import KNeighborsRegressor
deter_data=pd.DataFrame(columns=['Det'+name for name in missing_columns])
for feature in missing_columns:
    deter_data['Det'+feature]=df_incomplete[feature+'_imp']
    para=list(