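This is the second part of an introductory machine learning tutorial series. Click here to jump to the first part; readers comfortable with English can click here.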
part4
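Let's return to the data. Ideally, a dataset can be used as it is. In practice that is rarely the case, and the raw data has to be processed first. Here we discuss one such case: how to handle missing values.
Handling missing values
1. Drop the columns that contain missing values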
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_x_train = X_train.drop(cols_with_missing, axis=1)
reduced_x_test = X_test.drop(cols_with_missing, axis=1)
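To make the column-dropping step concrete, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins for `X_train` and `X_test`, not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train / X_test.
toy_train = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [150.0, np.nan, 300.0],  # missing value in the training set
    "Distance": [2.5, 5.0, 7.5],
})
toy_test = pd.DataFrame({
    "Rooms": [3, 1],
    "Landsize": [200.0, 120.0],
    "Distance": [np.nan, 4.0],           # missing only in the test set
})

# The columns to drop are chosen from the training set, then removed from both sets.
toy_cols_with_missing = [c for c in toy_train.columns if toy_train[c].isnull().any()]
toy_reduced_train = toy_train.drop(toy_cols_with_missing, axis=1)
toy_reduced_test = toy_test.drop(toy_cols_with_missing, axis=1)

print(toy_cols_with_missing)                  # ['Landsize']
print(list(toy_reduced_train.columns))        # ['Rooms', 'Distance']
print(toy_reduced_test.isnull().any().any())  # True: 'Distance' is still incomplete
```

The last line shows a limitation worth keeping in mind: columns that are complete in the training set but incomplete in the test set are not caught by this approach.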
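2. Imputation
# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_x_train = my_imputer.fit_transform(X_train)
# Reuse the statistics learned from the training set; do not refit on the test set
imputed_x_test = my_imputer.transform(X_test)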
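As an illustration of what imputation does, the sketch below fills missing values with the column means learned from a training frame and applies the same means to a test frame. The toy frames and the name `toy_imputer` are hypothetical; `strategy="mean"` is scikit-learn's default and is written out only for clarity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are only illustrative.
toy_train = pd.DataFrame({
    "Rooms": [2.0, 4.0, np.nan],
    "Landsize": [100.0, np.nan, 300.0],
})
toy_test = pd.DataFrame({"Rooms": [np.nan], "Landsize": [np.nan]})

toy_imputer = SimpleImputer(strategy="mean")
toy_train_filled = toy_imputer.fit_transform(toy_train)  # learns the column means
toy_test_filled = toy_imputer.transform(toy_test)        # reuses the training means

print(toy_imputer.statistics_)  # [  3. 200.]
print(toy_test_filled)          # [[  3. 200.]]
```

Fitting only on the training data keeps information from the test set out of the preprocessing step.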
3. An extension to imputation
imputed_x_train_plus = X_train.copy()
imputed_x_test_plus = X_test.copy()
# Record which values were missing before they are filled in
for col in cols_with_missing:
    imputed_x_train_plus[col + "_was_missing"] = imputed_x_train_plus[col].isnull()
    imputed_x_test_plus[col + "_was_missing"] = imputed_x_test_plus[col].isnull()
imputed_x_train_plus = my_imputer.fit_transform(imputed_x_train_plus)
imputed_x_test_plus = my_imputer.transform(imputed_x_test_plus)
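The effect of the extra indicator columns is easiest to see on a small example. The following is a minimal sketch with a hypothetical frame `toy_train`; the indicator is cast to integers here purely so the printed result is easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one incomplete column.
toy_train = pd.DataFrame({
    "Rooms": [2.0, 3.0, 4.0],
    "Landsize": [100.0, np.nan, 300.0],
})

toy_plus = toy_train.copy()
for col in ["Landsize"]:
    # 1 where the value was originally missing, 0 otherwise
    toy_plus[col + "_was_missing"] = toy_plus[col].isnull().astype(int)

# Impute after the indicator has been recorded, then restore the column names.
toy_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(toy_plus),
    columns=toy_plus.columns,
)
print(toy_filled)
```

The model then sees both the filled-in value (200.0 for the second row) and a flag saying that the value was estimated. Recent scikit-learn versions also accept `SimpleImputer(add_indicator=True)`, which appends similar indicator columns automatically.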
Complete code
1. Part one: preparing the data and the scoring function
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

FILE_PATH = "C:\\Users\\Administrator\\Desktop\\kaggle\\data\\"
FILE_INDEX = "melb_data.csv"
data = pd.read_csv(FILE_PATH + FILE_INDEX)

target = data.Price
# Drop the target first so that Price does not leak into the predictors
predictors = data.drop(['Price'], axis=1)
# Keep only the non-"object" (numeric) columns
numeric_predictors = predictors.select_dtypes(exclude=['object'])
X_train, X_test, y_train, y_test = train_test_split(numeric_predictors,
                                                    target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)

def socre_dataset(X_train, X_test, y_train, y_test):
    # Fit a random forest and return its mean absolute error on the test split
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
2. Part two: comparing the three ways of handling missing values
# Get Model Score from Dropping Columns with Missing Values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_x_train = X_train.drop(cols_with_missing, axis=1)
reduced_x_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from Dropping Columns with Missing Values:")
print(socre_dataset(reduced_x_train, reduced_x_test, y_train, y_test))
# Get Model Score from Imputation
# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_x_train = my_imputer.fit_transform(X_train)
# Apply the statistics learned from the training set to the test set
imputed_x_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(socre_dataset(imputed_x_train, imputed_x_test, y_train, y_test))
# Get Score from Imputation with Extra Columns Showing What Was Imputed
imputed_x_train_plus = X_train.copy()
imputed_x_test_plus = X_test.copy()
for col in cols_with_missing:
    imputed_x_train_plus[col + "_was_missing"] = imputed_x_train_plus[col].isnull()
    imputed_x_test_plus[col + "_was_missing"] = imputed_x_test_plus[col].isnull()
imputed_x_train_plus = my_imputer.fit_transform(imputed_x_train_plus)
imputed_x_test_plus = my_imputer.transform(imputed_x_test_plus)
print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(socre_dataset(imputed_x_train_plus, imputed_x_test_plus, y_train, y_test))
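In the code, attentive readers may have noticed the variable numeric_predictors, which holds the non-"object" columns selected with the select_dtypes method. Without this step, the program may fail with:
ValueError: could not convert string to float: 'Western Metropolitan'
In practice, however, such columns can matter a great deal for the prediction and cannot simply be discarded. We discuss this case next. Before starting, download the required data here (extraction code: bn9c).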
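The filtering done by select_dtypes, and the kind of error it avoids, can be reproduced on a tiny frame. This is only a sketch: `toy_data` and its values are hypothetical, and the call to `float()` stands in for the string-to-number conversion a model attempts internally.

```python
import pandas as pd

# Hypothetical toy data; the string column is stored with the "object" dtype.
toy_data = pd.DataFrame({
    "Rooms": [2, 3],
    "Price": [850000.0, 1200000.0],
    "Regionname": pd.Series(
        ["Western Metropolitan", "Northern Metropolitan"], dtype=object
    ),
})

# Only the non-"object" columns survive the filter.
toy_numeric = toy_data.select_dtypes(exclude=["object"])
print(list(toy_numeric.columns))  # ['Rooms', 'Price']

# Converting a region name to a number fails in the same way as in the message above.
try:
    float(toy_data["Regionname"].iloc[0])
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)  # could not convert string to float: 'Western Metropolitan'
```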
part5
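In this part we discuss categorical data, that is, data describing which category something belongs to. A standard way to handle categorical data is one-hot encoding.
One-hot encoding
In this context, a one-hot code uses as many bits as there are categories; exactly one bit is 1 and all the others are 0, for example: 0000010.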
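The description above can be sketched with pandas on a single categorical column. The frame `toy` and its `color` column are hypothetical; `dtype=int` is passed only so the result prints as 0/1 rather than True/False.

```python
import pandas as pd

# Hypothetical categorical column with three distinct categories.
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new column per category; each row has exactly one 1.
toy_encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
print(toy_encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```

Three categories produce three columns, and every row contains a single 1, which matches the "one bit per category" picture.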
Complete code
1. Part one: preparing the data
import pandas as pd

FILE_PATH = "C:\\Users\\Administrator\\Desktop\\kaggle\\data\\"
TEST_INDEX = "test.csv"
TRAIN_INDEX = "train.csv"
train_data = pd.read_csv(FILE_PATH + TRAIN_INDEX)
test_data = pd.read_csv(FILE_PATH + TEST_INDEX)
# Rows without a target value cannot be used for training
train_data.dropna(axis=0, subset=["SalePrice"], inplace=True)
target = train_data.SalePrice
# To keep things simple, drop every column with missing values in the training data
cols_with_missing = [col for col in train_data.columns
                     if train_data[col].isnull().any()]
condidate_train_p = train_data.drop(['Id', "SalePrice"] + cols_with_missing, axis=1)
condidate_test_p = test_data.drop(['Id'] + cols_with_missing, axis=1)
# Keep categorical columns with fewer than 10 distinct values, plus the numeric columns
low_cardnality_cols = [cname for cname in condidate_train_p.columns if
                       condidate_train_p[cname].nunique() < 10 and
                       condidate_train_p[cname].dtype == 'object']
numeric_cols = [cname for cname in condidate_train_p.columns if
                condidate_train_p[cname].dtype in ["int64", "float64"]]
my_cols = low_cardnality_cols + numeric_cols
train_predictors = condidate_train_p[my_cols]
test_predictors = condidate_test_p[my_cols]
# print(train_predictors.dtypes.sample(10))
2. Part two: comparing one-hot encoding with simply dropping the categorical data
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # The scorer returns negative MAE, so multiply by -1 to get a positive error
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50), X, y,
                                scoring='neg_mean_absolute_error').mean()

# core
one_hot_encoded_training_p = pd.get_dummies(train_predictors)
predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])
mae_wihtout_c = get_mae(predictors_without_categoricals, target)
mae_one_hot_encoding = get_mae(one_hot_encoded_training_p, target)
print("Mean Absolute Error When Dropping Categoricals: ", str(int(mae_wihtout_c)))
print("Mean Absolute Error With One-hot Encoding: ", str(int(mae_one_hot_encoding)))
- The get_mae() function uses cross-validation, which is explained later.
- The one-hot encoding itself is done by the pd.get_dummies() method.
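The code above encodes only the training predictors. If the test predictors are encoded separately, a category that never appears in the test set produces no column, so the two frames can end up with different columns. The following is a minimal sketch of one way to line them up, assuming hypothetical frames `toy_train` and `toy_test`; it is not taken from the original code.

```python
import pandas as pd

# Hypothetical toy data: "green" never appears in the test set.
toy_train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
toy_test = pd.DataFrame({"color": ["red", "blue"], "size": [4, 5]})

toy_train_enc = pd.get_dummies(toy_train, columns=["color"], dtype=int)
toy_test_enc = pd.get_dummies(toy_test, columns=["color"], dtype=int)

# Give the test frame exactly the training columns, in the training order.
# Categories missing from the test set become all-zero columns.
toy_test_enc = toy_test_enc.reindex(columns=toy_train_enc.columns, fill_value=0)

print(list(toy_test_enc.columns))
print(toy_test_enc["color_green"].tolist())  # [0, 0]
```

Reindexing against the training columns also discards any category that appears only in the test set, which is usually what a model trained on the training columns expects.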
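Results
Summary
In this chapter we learned three ways of handling missing values and the standard way of handling categorical data: one-hot encoding. So far the model has been a black box to us, with its inner workings entirely unknown. That may feel unsettling, so next we will get to know the model from a different angle.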