Handling Missing Values
Ways to handle missing data:
-
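When a column is missing a large share of its values, drop the columns that contain missing values. Note that dropna(axis=1) removes every column with at least one missing value.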
data_without_missing_values = original_data.dropna(axis=1)
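Usually the training data and the test data must be processed in the same way, so whenever columns are changed in the training data, the same change has to be applied to the test data.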
cols_with_missing = [col for col in original_data.columns if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
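To see why the column list should come from the training data, here is a minimal sketch on toy data. The frames `train` and `test` and their column names are hypothetical, made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
train = pd.DataFrame({
    "rooms": [3, 2, 4],
    "area": [70.0, np.nan, 120.0],  # has a missing value in the training set
    "price": [100, 80, 150],
})
test = pd.DataFrame({
    "rooms": [2, 5],
    "area": [55.0, 140.0],          # complete in the test set
    "price": [75, 200],
})

# Decide which columns to drop by looking at the training data only.
cols_with_missing = [col for col in train.columns if train[col].isnull().any()]

# Drop the same columns from both sets so their layouts stay identical.
reduced_train = train.drop(cols_with_missing, axis=1)
reduced_test = test.drop(cols_with_missing, axis=1)

print(cols_with_missing)
print(list(reduced_train.columns) == list(reduced_test.columns))
```

Calling dropna(axis=1) on each set separately would keep "area" in the test set here, because that column happens to be complete there, and the two sets would no longer share the same columns.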
-
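Impute (fill in) the missing values. The filled-in values are not necessarily accurate, but this usually gives better results than dropping the columns. By default, SimpleImputer replaces each missing value with the mean of its column.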
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
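As with dropping columns, the test data should get the same treatment as the training data. One common way to do this, sketched here on hypothetical toy frames `train` and `test`, is to fit the imputer on the training data and reuse the learned column means on the test data. Wrapping the result in a DataFrame restores the column names, since fit_transform returns a plain NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, for illustration only.
train = pd.DataFrame({"rooms": [3.0, 2.0, 4.0], "area": [70.0, np.nan, 120.0]})
test = pd.DataFrame({"rooms": [np.nan, 5.0], "area": [55.0, np.nan]})

imputer = SimpleImputer()  # default strategy is "mean"

# Learn the column means from the training data and fill its gaps.
imputed_train = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training means on the test data (transform only, no refit).
imputed_test = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(imputed_train)
print(imputed_test)
```

Fitting only on the training data keeps information from the test set out of the preprocessing step.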
-
Impute the missing values and also add indicator columns that record which values were imputed
import pandas as pd
from sklearn.impute import SimpleImputer

# Make a copy to avoid changing the original data when imputing
new_data = original_data.copy()

# Add new columns indicating which values will be imputed
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation; keep the column names of new_data, which now include the
# indicator columns (original_data.columns would be too short)
my_imputer = SimpleImputer()
imputed_columns = new_data.columns
new_data = pd.DataFrame(my_imputer.fit_transform(new_data), columns=imputed_columns)
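As an alternative sketch, not the approach used above, SimpleImputer can append the indicator columns itself through its add_indicator parameter. The toy frame `data` and the "_was_missing" suffix are assumptions chosen to mirror the naming used in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, for illustration only.
data = pd.DataFrame({"rooms": [3.0, 2.0, 4.0], "area": [70.0, np.nan, 120.0]})

# add_indicator=True appends one indicator column per feature that had
# missing values during fit, after the imputed columns.
imputer = SimpleImputer(add_indicator=True)
values = imputer.fit_transform(data)

# Rebuild the column names: original columns first, then the indicators.
missing_cols = [col for col in data.columns if data[col].isnull().any()]
columns = list(data.columns) + [col + "_was_missing" for col in missing_cols]
result = pd.DataFrame(values, columns=columns, index=data.index)

print(result)
```

The indicator values come back as numbers (1 where the value was imputed, 0 otherwise) rather than booleans, because the imputer returns a single numeric array.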