I. Handling missing values
1. Drop the columns that contain missing values.
# Drop every column that has at least one missing value
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
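Result: simple and blunt. Any column with even one missing value is discarded, along with whatever useful information it carried.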
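To make the column-dropping step concrete, here is a minimal sketch on a made-up DataFrame. The column names and values are illustrative assumptions, not the dataset used above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): column "b" has a missing value in the training set.
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})
toy_test = pd.DataFrame({"a": [7.0, 8.0], "b": [9.0, 10.0]})

# Pick the columns to drop from the training data only, then drop the same
# columns from both sets so that their columns stay identical.
toy_cols_with_missing = [col for col in toy_train.columns
                         if toy_train[col].isnull().any()]
toy_reduced_train = toy_train.drop(toy_cols_with_missing, axis=1)
toy_reduced_test = toy_test.drop(toy_cols_with_missing, axis=1)

print(toy_cols_with_missing)            # ['b']
print(list(toy_reduced_train.columns))  # ['a']
print(list(toy_reduced_test.columns))   # ['a']
```

A column with gaps only in the test set would survive this step, so it is worth checking toy_test.isnull().any() as well.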
2. Fill in (impute) the missing values with the mean of the column.
# Replace each missing value with the mean of its column
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
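# SimpleImputer fills with the column mean by default. Call fit_transform on the
# training data first, then transform (not fit_transform) on the test data.
imputed_X_train = my_imputer.fit_transform(X_train)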
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
Result: scores better than method 1 on this data, and is only moderately harder to apply.
For what fit_transform and transform each do, see https://blog.csdn.net/weixin_38278334/article/details/82971752
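The difference between fit_transform and transform is easiest to see on a tiny array. The following is a sketch with made-up numbers: the imputer learns the column mean from the training data and reuses that same mean on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: a single column with a gap in both the training and test sets.
toy_train = np.array([[1.0], [np.nan], [3.0]])
toy_test = np.array([[np.nan], [10.0]])

toy_imputer = SimpleImputer()  # strategy="mean" is the default

# fit_transform: learn the mean of each training column, then fill the gaps.
filled_train = toy_imputer.fit_transform(toy_train)
# transform: fill the test gaps with the means learned from the training data.
filled_test = toy_imputer.transform(toy_test)

print(toy_imputer.statistics_)  # the learned training mean, 2.0
print(filled_train.ravel())     # the gap in the training column becomes 2.0
print(filled_test.ravel())      # the gap in the test column also becomes 2.0
```

Calling fit_transform on the test set instead would fill its gap with the test mean (10.0 here), which lets test data influence the preprocessing.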
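3. Add an extra indicator column for each column with missing values, flagging which entries were imputed.
# Add indicator columns marking the missing values; in this example the gain was small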
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
print("Mean Absolute Error from Imputation while Track What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
Result: works well on some datasets, but the improvement is not consistent.
Full code example: https://github.com/firdameng/kaggle_ml/blob/master/handl_missing_value.py
Reference: https://www.kaggle.com/dansbecker/handling-missing-values
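scikit-learn can also build the indicator columns itself through the add_indicator parameter of SimpleImputer. The sketch below, on made-up data, is an alternative to the manual loop and not the code this post used.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: the second column has a gap in both sets.
toy_train = np.array([[1.0, np.nan],
                      [3.0, 4.0],
                      [5.0, 6.0]])
toy_test = np.array([[7.0, np.nan]])

# add_indicator=True appends one 0/1 column for every feature that had
# missing values when the imputer was fitted.
toy_imputer = SimpleImputer(add_indicator=True)
train_plus = toy_imputer.fit_transform(toy_train)
test_plus = toy_imputer.transform(toy_test)

print(train_plus.shape)  # (3, 3): two imputed columns plus one indicator
print(train_plus[0])     # first row: 1.0, the imputed mean 5.0, indicator 1.0
print(test_plus[0])      # test row: 7.0, the training mean 5.0, indicator 1.0
```

The indicator columns come after the imputed features, so the output has more columns than the input.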
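The snippets in this section call score_dataset without defining it. A plausible minimal version, assuming it fits a random forest regressor and reports the mean absolute error, might look like the sketch below. The model choice, its parameters, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_test, y_train, y_test):
    """Fit a model on the training split and return the MAE on the test split."""
    # Assumed model: a random forest, with random_state fixed for repeatable scores.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)


# Synthetic data, only to show the call: the target is a noisy sum of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)
mae = score_dataset(X[:150], X[150:], y[:150], y[150:])
print(f"MAE: {mae:.3f}")
```

Because every method is scored by the same function, the reported errors can be compared directly.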
II. One-hot encoding categorical data
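Suppose the original column holds the values red, yellow, and green. We create a separate column for each possible value. When the original value is red, we put a 1 in the red column and a 0 in the other two.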
import pandas as pd
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
pandas.get_dummies one-hot encodes the categorical columns of a DataFrame, as in the step from Figure 1 to Figure 2.
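A minimal sketch of that transformation on a made-up color column (the DataFrame and its name toy_predictors are illustrative assumptions):

```python
import pandas as pd

# Made-up column holding the three colors from the example.
toy_predictors = pd.DataFrame({"color": ["red", "yellow", "green", "red"]})

# One new column per distinct value; dtype=int asks for 0/1 instead of booleans.
toy_encoded = pd.get_dummies(toy_predictors, dtype=int)

print(list(toy_encoded.columns))          # ['color_green', 'color_red', 'color_yellow']
print(toy_encoded["color_red"].tolist())  # [1, 0, 0, 1]
```

Each row has exactly one 1 across the three new columns, and the original color column is replaced by them.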
Full code: https://github.com/firdameng/kaggle_ml/blob/master/one_hot.py