Kaggle Machine Learning Level 2 Review, Parts 1 and 2

I. Handling Missing Values

  1. Drop the columns that contain missing values.
# Drop every column that has at least one missing value
cols_with_missing = [col for col in X_train.columns
                                 if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

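Result: simple but blunt; any useful information in the dropped columns is lost.
  2. Impute the missing values with the mean of each column.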

# Replace missing values with the mean of the column they appear in
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)   # default strategy is the mean; fit_transform on the training set first, then transform on the test set
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Result: better than approach 1, with moderate effort.

For what fit_transform and transform each do, see https://blog.csdn.net/weixin_38278334/article/details/82971752
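
The difference between the two calls can be seen on a toy example. This is a minimal sketch, not part of the original post; the `toy_train` and `toy_test` frames are hypothetical data made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric column with a gap in each split.
toy_train = pd.DataFrame({"a": [1.0, 3.0, np.nan]})
toy_test = pd.DataFrame({"a": [np.nan, 10.0]})

imputer = SimpleImputer()  # default strategy="mean"

# fit_transform learns the column mean from the training data, then fills the gaps with it.
filled_train = imputer.fit_transform(toy_train)

# transform reuses the stored training mean; nothing is recomputed from the test data.
filled_test = imputer.transform(toy_test)

print(imputer.statistics_)  # the per-column means learned from toy_train
print(filled_test)          # the test gap is filled with the training mean
```

Fitting only on the training split keeps information from the test split out of the preprocessing step.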

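  3. Add extra indicator columns that flag which values were missing.
# Add indicator columns for missing values; in this example the gain is small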
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

# Use a list rather than a generator so the column names can be reused
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Track What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Result: works well on some datasets, but the gain is not consistent.
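
As an aside, scikit-learn's SimpleImputer also accepts an `add_indicator=True` argument that appends the indicator columns itself. The sketch below swaps in that option in place of the manual loop above; it is not the original post's method, and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only column "b" has missing values.
toy_train = pd.DataFrame({"a": [1.0, 3.0, 5.0], "b": [np.nan, 4.0, 6.0]})
toy_test = pd.DataFrame({"a": [2.0], "b": [np.nan]})

# add_indicator=True appends one flag column per feature that had
# missing values when the imputer was fitted.
indicator_imputer = SimpleImputer(add_indicator=True)
train_out = indicator_imputer.fit_transform(toy_train)
test_out = indicator_imputer.transform(toy_test)

print(train_out)  # imputed "a", imputed "b", then the flag column for "b"
print(test_out)
```

Note that with the default settings, transform raises an error if a column has missing values in the test data but had none in the training data.
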
Full code example: https://github.com/firdameng/kaggle_ml/blob/master/handl_missing_value.py
Reference: https://www.kaggle.com/dansbecker/handling-missing-values
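
Every snippet in this section calls score_dataset, which is not defined in this post. The following is a minimal sketch of such a helper, assuming it fits a random forest regressor and reports mean absolute error; the model choice, `n_estimators`, and `random_state` are assumptions, and the helper in the linked code may differ.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_test, y_train, y_test):
    """Fit a model on the training split and return the MAE on the test split."""
    # Assumption: a random forest with a fixed seed so repeated runs are comparable.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
```

A lower score is better, so the three approaches can be compared by calling the helper on each preprocessed version of the same train/test split.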

II. One-Hot Encoding Categorical Data

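One-hot encoding
Suppose the original data takes the values red, yellow, and green. We create a separate column for each possible value; wherever the original value is red, we put a 1 in the red column and a 0 in the others.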

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

pandas.get_dummies one-hot encodes categorical columns, for example turning the table in Figure 1 into the one in Figure 2.
Figure 1
Figure 2
Full code: https://github.com/firdameng/kaggle_ml/blob/master/one_hot.py
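
The red, yellow, and green example can be reproduced on a small frame. This is a minimal sketch; the `toy` frame and its column names are hypothetical and are not taken from the original figures.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
toy = pd.DataFrame({
    "color": ["red", "yellow", "green", "red"],
    "size": [1, 2, 3, 4],
})

# get_dummies leaves numeric columns alone and replaces each object column
# with one indicator column per distinct value.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
# ['size', 'color_green', 'color_red', 'color_yellow']
print(encoded)
```

Rows whose original value was red have a true (or 1, depending on the pandas version) entry in color_red and false (or 0) in the other two indicator columns.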
