Handling Missing Values

hjmbt

Categories: kaggle, Machine Learning, Python. Tags: Machine Learning

Article link: https://blog.csdn.net/qq_43039301/article/details/102893009

Ways to handle missing data:

When a column is missing a large share of its values, drop the column:

data_without_missing_values = original_data.dropna(axis=1)
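
Note that `dropna(axis=1)` removes every column that contains at least one missing value, not only the heavily affected ones. The following is a minimal sketch of dropping only the mostly empty columns through the `thresh` parameter. The toy DataFrame and the 50% cutoff are assumptions made for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' is complete, 'b' has one gap, 'c' is mostly missing
original_data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Drops every column that has at least one NaN, so only 'a' survives
dropped_any = original_data.dropna(axis=1)

# Keep columns with at least 50% non-missing values (the cutoff is an assumption)
min_non_missing = int(np.ceil(0.5 * len(original_data)))
dropped_sparse = original_data.dropna(axis=1, thresh=min_non_missing)

print(list(dropped_any.columns))
print(list(dropped_sparse.columns))
```

A suitable cutoff depends on the dataset; a column with only a few gaps is often better imputed than dropped.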

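We usually need to process the training data and the test data in the same way, so whenever columns are dropped from the training data, the same columns must be dropped from the test data.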

cols_with_missing = [col for col in original_data.columns if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
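
To illustrate why the column list is derived from the training data and then reused, here is a small sketch with two made-up frames (`train` and `test` are hypothetical names and values). The test set happens to have a gap in a different column, which this approach does not remove.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; in practice these come from the train and test files
train = pd.DataFrame({"x": [1.0, 2.0], "y": [np.nan, 5.0], "z": [7.0, 8.0]})
test = pd.DataFrame({"x": [3.0, 4.0], "y": [6.0, 9.0], "z": [1.0, np.nan]})

# Decide which columns to drop by looking at the training data only
cols_with_missing = [col for col in train.columns if train[col].isnull().any()]

reduced_train = train.drop(cols_with_missing, axis=1)
reduced_test = test.drop(cols_with_missing, axis=1)

# Both frames keep the same columns, so a model fit on one can score the other
print(list(reduced_train.columns), list(reduced_test.columns))

# 'z' still has a missing value in the test set and needs separate handling
print(reduced_test.isnull().sum().to_dict())
```

Gaps that appear only in the test data are one reason imputation is often preferred over dropping columns.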

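Impute (fill in) the missing values. The filled-in values are not necessarily accurate, but this usually gives better results than dropping the columns. By default, SimpleImputer replaces each missing value with the mean of its column.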

from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
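
Since `fit_transform` returns a NumPy array, the column names are lost, and the test data should be filled with statistics learned from the training data. A minimal sketch of both points follows, using made-up frames (`train`, `test`, and their values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames
train = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"x": [np.nan, 5.0], "y": [np.nan, 40.0]})

imputer = SimpleImputer()  # default strategy is "mean"

# Learn the column means from the training data and fill its gaps
imputed_train = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training means for the test data (transform only, no refit)
imputed_test = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(imputed_train)
print(imputed_test)
```

Calling `transform` rather than `fit_transform` on the test set keeps test-set statistics out of the preprocessing step.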

Impute the missing values and also record which values were originally missing:

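import pandas as pd
from sklearn.impute import SimpleImputer

# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
# (a list, so the column names are fixed before new columns are added)
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation
my_imputer = SimpleImputer()
# keep the extended column list; original_data.columns lacks the *_was_missing columns
imputed_columns = new_data.columns
new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                        columns=imputed_columns, index=new_data.index)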
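
As an alternative sketch of the same idea, `SimpleImputer` can append the indicator columns itself through its `add_indicator` parameter. The toy frame and the `_was_missing` naming below are assumptions that mirror the code above and are not part of the library.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only 'x' has a missing value
data = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, 30.0]})

# add_indicator=True appends one 0/1 column per feature that had gaps during fit
imputer = SimpleImputer(add_indicator=True)
values = imputer.fit_transform(data)

# indicator_.features_ holds the positions of the features that got an indicator
missing_cols = [data.columns[i] + "_was_missing"
                for i in imputer.indicator_.features_]
result = pd.DataFrame(values, columns=list(data.columns) + missing_cols,
                      index=data.index)

print(result)
```

Whether the indicator columns help depends on the model and on whether missingness itself carries information, so it is worth comparing validation scores with and without them.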