1---Model Validation-MAE
I will use model validation to measure the quality of my model. Measuring model quality is the key to iteratively improving a model.
In most applications, the relevant measure of model quality is predictive accuracy. A tempting first step is to make predictions with the training data and compare them, one by one, to the target values in that same data. But the result would be a long mix of good and bad predictions, and reading through them one at a time would be pointless.
So we first need to summarize this into a single metric.
Here I'll start with one called Mean Absolute Error (also called MAE). (See also the article "Evaluation metrics and loss functions: the Error series, Mean Absolute Error (MAE)".)
MAE converts each prediction error (error = actual value minus predicted value) to a positive number by taking its absolute value. We then take the average of those absolute errors. This is our measure of model quality.
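As a minimal illustration with made-up numbers (the prices below are invented for this example), MAE can be computed by hand:

```python
# Mean Absolute Error by hand, using made-up example values
actual =    [200000, 350000, 120000]
predicted = [210000, 330000, 125000]

# error = actual - predicted; take the absolute value, then average
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(abs_errors) / len(abs_errors)
print(mae)  # (10000 + 20000 + 5000) / 3
```

This matches what `sklearn.metrics.mean_absolute_error` computes from the same two lists.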
2---How to calculate MAE
To calculate MAE, we first need a model.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

melbourne_file_path = '/Users/mac/Desktop/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

# Drop rows with any missing values (not just missing prices)
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Define and fit model
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)

# Calculate MAE on the training data
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
Out: 434.71594577146544
3---The problem with "In-Sample" Scores
The value we just calculated can be called an "in-sample" score, because we used the same sample of homes both to build the model and to evaluate it.
Imagine that, in the large real estate market, door🚪 color is unrelated to home🏠 price. But in the sample of data we used to build the model, all homes with green🍏 doors happened to be very expensive. The model's job is to find patterns that predict home prices, so it will pick up on this one.
Since this pattern was derived from the training data, the model will appear accurate on the training data. But when the model encounters new data where the pattern does not hold, are the predictions still accurate?🤷♀️ If not, the model's predictions in practice will be very inaccurate.
A model's practical value comes from making predictions on data it has never seen before, so we need to measure its performance on data that was not used to build it. The most straightforward way to do this is to exclude some data from the model-building process, and then use that held-out data to test the model. The held-out data is called validation data.
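The door-color story above can be demonstrated with a hypothetical toy dataset (all names and numbers below are my own, not from the course): a decision tree fit on purely random "features" and "prices" looks near-perfect in-sample, yet is no better than guessing on held-out rows, because there was never a real pattern to learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
X = rng.rand(500, 5)           # 5 random "features" (like door color: no real signal)
y = rng.rand(500) * 1_000_000  # "prices" unrelated to the features

# Hold out the last 100 rows by simple slicing
train_X, val_X = X[:400], X[400:]
train_y, val_y = y[:400], y[400:]

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(train_mae)  # near zero: the tree memorized the training sample
print(val_mae)    # large: the "pattern" does not hold on unseen data
```

The in-sample MAE is essentially zero because an unrestricted tree can memorize every training row; the validation MAE exposes that nothing generalizes.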
4---"train_test_split"
The scikit-learn library has a function, "train_test_split", that breaks the data into two pieces: 1. Training data, used to fit the model; 2. Validation data, used to calculate MAE.
We give the random_state argument a numeric value (it seeds the random number generator) to make sure we get the same split every time we run the script.
from sklearn.model_selection import train_test_split
#Split data into training and validation data
#The split is based on a random number generator.
#Supplying a numeric value to the random_state argument
#guarantees we get the same split every time we run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)
# Get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
Out: 261764.75790832794
5---Analysis
My MAE on the "in-sample" data was about 500 dollars. Out-of-sample, it is more than 250,000 dollars.
As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.
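A quick arithmetic check of that "quarter" claim, using the validation MAE printed above and the stated average home value of 1.1 million dollars (the latter is taken on faith from this section):

```python
# Ratio of out-of-sample error to the average home value quoted above
validation_mae = 261764.76
average_price = 1_100_000
print(validation_mae / average_price)  # roughly 0.24, i.e. about a quarter
```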