Learn: Model Validation

Mean Absolute Error (MAE)

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Broadly, the error for each sample is:
error = actual - predicted

MAE: mean absolute error (the mean of the absolute values of the errors)

MSE: mean squared error (the mean of the squared errors)
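Both metrics can be computed straight from their definitions. A minimal sketch with toy numbers (not the Melbourne data):

```python
# Toy actual vs. predicted values to illustrate the two metrics
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

# error = actual - predicted, per sample
errors = [a - p for a, p in zip(actual, predicted)]

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# MSE: mean of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)

print(mae)  # 0.5
print(mse)  # 0.375
```

Note that MSE penalizes large errors much more heavily than MAE does, which is why a single bad prediction moves MSE more than MAE.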

Model

# Model
import pandas as pd

melbourne_data = pd.read_csv(r'G:\kaggle\melb_data.csv')
filtered_melbourne_data = melbourne_data.dropna(axis=0)
y = filtered_melbourne_data.Price
# Note: 'Lattitude' and 'Longtitude' are the dataset's own (misspelled) column names
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')
filtered_melbourne_data[['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']].head()
# Note the double brackets: column selection here takes a single argument, a list of labels
   Rooms  Bathroom  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude
1      2       1.0     156.0          79.0     1900.0   -37.8079    144.9934
2      3       2.0     134.0         150.0     1900.0   -37.8093    144.9944
4      4       1.0     120.0         142.0     2014.0   -37.8072    144.9941
6      3       2.0     245.0         210.0     1910.0   -37.8024    144.9993
7      2       1.0     256.0         107.0     1890.0   -37.8060    144.9954
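The double brackets above matter: indexing a DataFrame with a single label returns a Series, while indexing with a list of labels (even a one-element list) returns a DataFrame, and selecting several columns requires the list form. A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Rooms': [2, 3], 'Bathroom': [1.0, 2.0]})

# Single brackets with one label -> 1-D Series
s = df['Rooms']
print(type(s).__name__)   # Series

# Double brackets (a list of labels) -> DataFrame, even for one column
d = df[['Rooms']]
print(type(d).__name__)   # DataFrame

# Selecting several columns requires the list form
sub = df[['Rooms', 'Bathroom']]
print(sub.shape)          # (2, 2)
```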
help(DecisionTreeRegressor)

DecisionTreeRegressor():
Parameters:
criterion, default 'mse': mean squared error

Computing MAE (mean absolute error)

The training error, measured on the training data itself:

from sklearn.metrics import mean_absolute_error

predicted_home_price = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_price)  # returns the loss (MAE)
434.71594577146544

The problem with "in-sample" scores

We used all of the data to train the model, and then computed the error on that same training data.
But the point of a model is to predict on new data. The error may be tiny on data the model has already seen, where it fits extremely well, yet terrible on data it has not. The fix: hold out a validation set.

Fixing the "in-sample" score problem

The scikit-learn library has a function train_test_split to break up the data into two pieces.
We'll use some of that data as training data to fit the model,
and we'll use the other data as validation data to calculate mean_absolute_error.

Splitting the data into training and validation sets (train_test_split())

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.33, random_state=0)  # default test_size=0.25
help(train_test_split)
len(train_X),len(train_y)
(4151, 4151)
len(val_X),len(val_y)
(2045, 2045)
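With test_size=0.33, the 6196 filtered rows split roughly two thirds vs one third (4151 vs 2045 above), and fixing random_state makes the split reproducible from run to run. A sketch on toy data:

```python
from sklearn.model_selection import train_test_split

data = list(range(100))
labels = [x * 2 for x in data]

# Same random_state -> identical split on repeated calls
a_train, a_val, a_train_y, a_val_y = train_test_split(data, labels, test_size=0.33, random_state=0)
b_train, b_val, b_train_y, b_val_y = train_test_split(data, labels, test_size=0.33, random_state=0)

print(len(a_train), len(a_val))  # 67 33
print(a_train == b_train)        # True: the shuffle is deterministic
```

Omitting random_state gives a different shuffle every call, so error numbers would change each time the notebook is rerun.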

Training the model on the training set

# melbourne_model = DecisionTreeRegressor()  # define the model
melbourne_model.fit(train_X, train_y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

Evaluating the model on the validation set

val_prediction_y = melbourne_model.predict(val_X)
mean_absolute_error(val_y, val_prediction_y)
254577.2400977995
val_y.mean()
1088835.6136919316

The validation MAE is about 254,577 against a mean price of about 1,088,836, roughly a quarter of the average home value, so the model needs improvement.
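The jump from an in-sample MAE of about 435 to a validation MAE of about 254,577 is the in-sample problem in miniature: an unconstrained decision tree can memorize its training data. A synthetic sketch (toy data, not the Melbourne set) reproducing the effect:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: y is x plus noise that no model can truly learn
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = X.ravel() + rng.normal(0, 1.0, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# An unconstrained tree splits down to single-sample leaves,
# so it reproduces its training targets exactly...
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
# ...but the noise it memorized does not generalize
validation_mae = mean_absolute_error(val_y, model.predict(val_X))

print(in_sample_mae)   # 0.0 (perfect memorization)
print(validation_mae)  # substantially larger than the training error
```

The same pattern appears above: near-zero training error, large validation error. The next step is to control tree depth or leaf size so the model generalizes.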
