Handling Missing Values in Data: Imputing Missing Values

无风听海

Last modified: 2023-11-28 06:06:17

Category: Data Science. Tags: python, machine learning, artificial intelligence, data science, missing values, imputing missing values

First published: 2023-11-28 06:03:32

Article link: https://blog.csdn.net/hou478410969/article/details/134658350

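Imputation means using existing knowledge or data to estimate the missing numeric values and fill them in. We have several options; the most common is to fill the missing values with the mean of the rest of the column.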

In the original dataset, five rows have a missing value in plasma_glucose_concentration:

print(pima['plasma_glucose_concentration'].isnull().sum())
# 5

Let's first look at these five rows in detail:

plasma_empty_index = pima[pima['plasma_glucose_concentration'].isnull()].index
print(pima.loc[plasma_empty_index])
#      times_pregnant plasma_glucose_concentration diastolic_blood_pressure triceps_thickness serum_insulin   bmi  pedigree_function  age  onset_diabetes
# 75                1                         None                       48                20          None  24.7              0.140   22               0
# 182               1                         None                       74                20            23  27.7              0.299   21               0
# 342               1                         None                       68                35          None  32.0              0.389   22               0
# 349               5                         None                       80                32          None  41.0              0.346   37               1
# 502               6                         None                       68                41          None  39.0              0.727   41               1

After filling with the column mean, the number of missing values drops to 0, and the field is 121.686763 in all five rows:

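# Assign the result back rather than calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about
pima['plasma_glucose_concentration'] = pima['plasma_glucose_concentration'].fillna(pima['plasma_glucose_concentration'].mean())
print(pima['plasma_glucose_concentration'].isnull().sum())
print(pima.loc[plasma_empty_index])
# 0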
#      times_pregnant  plasma_glucose_concentration diastolic_blood_pressure triceps_thickness serum_insulin   bmi  pedigree_function  age  onset_diabetes
# 75                1                    121.686763                       48                20          None  24.7              0.140   22               0
# 182               1                    121.686763                       74                20            23  27.7              0.299   21               0
# 342               1                    121.686763                       68                35          None  32.0              0.389   22               0
# 349               5                    121.686763                       80                32          None  41.0              0.346   37               1
# 502               6                    121.686763                       68                41          None  39.0              0.727   41               1

We can also use sklearn's SimpleImputer to fill the data directly:

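import pandas as pd
from sklearn.impute import SimpleImputer

# Learn each column's mean and fill every missing value with it
imputer = SimpleImputer(strategy='mean')
pima_imputed = imputer.fit_transform(pima)
# fit_transform returns a NumPy array, so rebuild the DataFrame with the original labels
pima_imputed = pd.DataFrame(pima_imputed, columns=pima.columns, index=pima.index)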
print(pima_imputed.loc[plasma_empty_index])
#      times_pregnant  plasma_glucose_concentration  diastolic_blood_pressure  triceps_thickness  serum_insulin   bmi  pedigree_function   age  onset_diabetes
# 75              1.0                    121.686763                      48.0               20.0     155.548223  24.7              0.140  22.0             0.0
# 182             1.0                    121.686763                      74.0               20.0      23.000000  27.7              0.299  21.0             0.0
# 342             1.0                    121.686763                      68.0               35.0     155.548223  32.0              0.389  22.0             0.0
# 349             5.0                    121.686763                      80.0               32.0     155.548223  41.0              0.346  37.0             1.0
# 502             6.0                    121.686763                      68.0               41.0     155.548223  39.0              0.727  41.0             1.0

Every column has now been filled; sklearn's imputer class really does take the tedium out of imputation:

print(pima_imputed.isnull().sum())
# times_pregnant                  0
# plasma_glucose_concentration    0
# diastolic_blood_pressure        0
# triceps_thickness               0
# serum_insulin                   0
# bmi                             0
# pedigree_function               0
# age                             0
# onset_diabetes                  0
# dtype: int64
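
To make it clearer what SimpleImputer learns during fit, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from the pima dataset. After fitting, the `statistics_` attribute holds the fill value chosen for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical values (not the pima dataset)
toy = pd.DataFrame({
    'glucose': [100.0, np.nan, 140.0],
    'insulin': [np.nan, 20.0, 40.0],
})

toy_imputer = SimpleImputer(strategy='mean')
toy_filled = pd.DataFrame(
    toy_imputer.fit_transform(toy),  # returns a NumPy array
    columns=toy.columns,
    index=toy.index,
)

# statistics_ holds the per-column fill value learned during fit
print(toy_imputer.statistics_)
# [120.  30.]
print(toy_filled)
```

Each missing cell receives the mean of the observed values in its own column, which is why one call can fill every column at once.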

If we simply fill the missing values with 0 instead, the same KNN model reaches an accuracy of 0.7357185298361768, which is somewhat lower:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

pima_zero = pima.fillna(0)
X_zero = pima_zero.drop('onset_diabetes', axis=1)
print('learning from {} rows'.format(X_zero.shape[0]))
y_zero = pima_zero['onset_diabetes']

knn_params = {'n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, knn_params)
grid.fit(X_zero, y_zero)
print(grid.best_score_, grid.best_params_)
# learning from 768 rows
# 0.7357185298361768 {'n_neighbors': 7}
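
One plausible reason zero filling can hurt, offered here as an illustration rather than a diagnosis of the result above: a 0 sits far from the real measurements and drags the column's statistics down, whereas mean filling leaves the column mean unchanged. A small sketch with made-up readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value
readings = pd.Series([100.0, 120.0, np.nan, 140.0])

zero_filled = readings.fillna(0)
mean_filled = readings.fillna(readings.mean())  # mean() skips NaN by default

print(readings.mean(), zero_filled.mean(), mean_filled.mean())
# 120.0 90.0 120.0
```

For a distance-based model such as KNN, a filled-in 0 also places that row far away from rows with realistic values in that feature.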

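scikit-learn ships a built-in module for building pipelines, which apply preprocessing steps to the raw input data before the model sees it. In real training we have to split the dataset first. If we impute values over the whole dataset before applying the algorithm, we are cheating: statistics from the test set leak into training, so the model is not really learning any pattern it could generalize. In the example that follows, the model's accuracy is about 66% (not good, but that is not the point). The point is that the training set and the test set are both filled with the mean of the entire X matrix, which violates a core principle of the machine learning workflow. When predicting the responses for the test set, we cannot assume that we already know the mean of the whole dataset.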

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = pima[['serum_insulin']].copy()
y = pima['onset_diabetes'].copy()
entire_data_set_mean = X.mean()
X = X.fillna(entire_data_set_mean)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)
print(score)
# 0.6666666666666666

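Here we instead use the mean of the training samples to fill the missing values in both the training samples and the test samples, and then train: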

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = pima[['serum_insulin']].copy()
y = pima['onset_diabetes'].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)
X_train_mean = X_train.mean()
X_train = X_train.fillna(X_train_mean)
X_test = X_test.fillna(X_train_mean)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)
print(score)
# 0.6822916666666666
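
The same train-only rule can be expressed with SimpleImputer by calling fit_transform on the training split and only transform on the test split. The sketch below uses synthetic data (the `feature` column, the random values, and the `_demo` names are assumptions for illustration, not the pima data) and checks that the learned fill value equals the training mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic single-feature data; every 5th value is made missing
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({'feature': rng.normal(100, 15, size=40)})
X_demo.iloc[::5, 0] = np.nan
y_demo = pd.Series(rng.integers(0, 2, size=40))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

split_imputer = SimpleImputer(strategy='mean')
X_tr_filled = split_imputer.fit_transform(X_tr)  # learns the mean from training rows only
X_te_filled = split_imputer.transform(X_te)      # reuses the training mean on the test rows

# The learned fill value matches the training-set mean, not the full-data mean
print(split_imputer.statistics_[0], X_tr['feature'].mean())
```

Keeping fit and transform separate is what lets a Pipeline repeat this step correctly inside every cross-validation fold.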

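Next we combine Pipeline with SimpleImputer to fill the data with the mean and check the training result. Because the imputer is a step in the pipeline, GridSearchCV refits it on the training portion of each cross-validation fold, so the fill values never come from the held-out rows: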

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_para = {'classify__n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()
mean_impute = Pipeline([('imputer', SimpleImputer(strategy='mean')),('classify', knn)])
X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']
grid = GridSearchCV(mean_impute, knn_para)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)
# 0.7305407011289364 {'classify__n_neighbors': 7}

Finally we combine Pipeline with SimpleImputer to fill the data with the median and check the training result:

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_para = {'classify__n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()
# Named median_impute because this pipeline fills with the median, not the mean
median_impute = Pipeline([('imputer', SimpleImputer(strategy='median')),('classify', knn)])
X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']
grid = GridSearchCV(median_impute, knn_para)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)
# 0.7292589763177999 {'classify__n_neighbors': 7}
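
Since the mean and median runs differ only in the imputer's strategy, the two searches could be folded into one by putting the strategy into the parameter grid. This is a sketch on synthetic data generated with make_classification, with roughly 10% of the cells blanked out at random; the data, the variable names, and the reduced n_neighbors list are assumptions for illustration, so the scores will not match the pima results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic classification data with about 10% of cells set to NaN
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

impute_knn = Pipeline([
    ('imputer', SimpleImputer()),
    ('classify', KNeighborsClassifier()),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'classify__n_neighbors': [1, 3, 5, 7],
}

search = GridSearchCV(impute_knn, param_grid)
search.fit(X_syn, y_syn)
print(search.best_score_, search.best_params_)
```

The search then reports which imputation strategy and which n_neighbors value scored best together, instead of requiring two separate runs to be compared by hand.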
