Public bicycle usage forecast
1. Introduction
Public bicycles are low-carbon, environmentally friendly, and healthy, and they solve the “last mile” problem in urban transportation, so they are becoming more and more popular in cities across the country. The data for this project were collected from several public bicycle parking lots on one street in each of two cities. The goal is to predict the number of public bicycles borrowed in the neighborhood within one hour, based on time, weather, and other information.
Data source:
http://sofasofa.io/competition.php?id=1#c1
# Import packages
import numpy as np
import pandas as pd
import math
from scipy.stats import norm, skew
from scipy import stats
# Modelling Algorithms :
# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
# Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, ElasticNet
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
# Modelling Helpers :
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.preprocessing import Normalizer, scale
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
# preprocessing :
# (note: Imputer is from older scikit-learn; it was replaced by SimpleImputer in 0.22+)
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Imputer, LabelEncoder
#evaluation metrics :
# Regression
from sklearn.metrics import mean_squared_log_error, mean_squared_error, r2_score, mean_absolute_error
# Classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
# Configure visualisations
%matplotlib inline
mpl.style.use('ggplot')
plt.style.use('fivethirtyeight')
sns.set(context="notebook", palette="dark", style='whitegrid', color_codes=True)
params = {
    'axes.labelsize': "large",
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25, 7]
}
plt.rcParams.update(params)
2. Explore the dataset
# Import data
df = pd.read_csv("train.csv")
# How the data looks
df.head()
| | id | city | hour | is_workday | weather | temp_1 | temp_2 | wind | y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 22 | 1 | 2 | 3.0 | 0.7 | 0 | 15 |
| 1 | 2 | 0 | 10 | 1 | 1 | 21.0 | 24.9 | 3 | 48 |
| 2 | 3 | 0 | 0 | 1 | 1 | 25.3 | 27.4 | 0 | 21 |
| 3 | 4 | 0 | 7 | 0 | 1 | 15.7 | 16.2 | 0 | 11 |
| 4 | 5 | 1 | 10 | 1 | 1 | 21.1 | 25.0 | 2 | 39 |
The dataset has 9 variables:
- id
- y: the number of public bicycles borrowed in the neighborhood within one hour
- city: which of the two cities the record comes from (0 or 1)
- hour: local time of day, 24-hour clock
- is_workday: 1 indicates a workday, 0 indicates a holiday or weekend
- temp_1: local air temperature, in degrees Celsius
- temp_2: “feels-like” (apparent) temperature, in degrees Celsius
- weather: 1 = sunny, 2 = partly cloudy, 3 = light precipitation, 4 = heavy precipitation
- wind: wind speed; the larger the value, the stronger the wind
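Before plotting, it is worth a quick sanity check of the shape, column types, and missing values (a minimal sketch, assuming `df` was loaded as above):
# Quick integrity check: dimensions, column dtypes, and missing values per column
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())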
3. Visualization of the data
# Correlation of entire dataset
corr = df.corr()
sns.heatmap(data=corr, square=True , annot=True, cbar=True)
# Histogram of variable y
plt.hist('y' , data=df , bins=25)
(histogram output: 25 bins spanning y = 0 to 249; the first bin alone contains 2372 observations, with counts tapering to single digits in the top bins)
We can see from the graph that y is extremely right-skewed. We need to apply a log transform before using it in the models.
# Histogram of variable temp_1
plt.hist('temp_1' , data=df , bins=25)
(histogram output: 25 bins spanning temp_1 = −7.6 to 38.6 °C, with counts rising and falling in a roughly bell-shaped pattern)
temp_1 is approximately normally distributed, and it is a numeric variable, so I leave it in the dataset as-is.
# Boxplot between y and city
sns.factorplot(x="city",y="y",data=df,kind="box",aspect=1)
We can see that y is slightly higher in city 0.
# Boxplot between y and hours
sns.factorplot(x="hour",y="y",data=df,kind="box",aspect=2.5)
We can see that borrowing is concentrated around 7, 8, 9 and 17, 18, 19 o'clock, which is exactly the commuting time of day.
Borrowing is very low from 0 to 5 o'clock.
Borrowing decreases steadily from 17 to 23 o'clock.
Since hour is a categorical variable, we need to change it into dummy variables before using it in the model.
# Boxplot between y and is_workday
sns.factorplot(x="is_workday",y="y",data=df,kind="box",aspect=1)
There is no big difference in borrowing volume between workdays and non-workdays.
# Boxplot between y and weather
sns.factorplot(x="weather",y="y",data=df,kind="box",aspect=1.5)
Borrowing volume is higher when the weather is sunny or partly cloudy.
Borrowing is very low when it rains.
Since weather is also a categorical variable, we need to convert it into dummy variables.
# Boxplot between y and wind
sns.factorplot(x="wind",y="y",data=df,kind="box",aspect=2.5)
4. Data Pre-processing
a. Drop the “id” column, since the DataFrame already has an index
df.drop(["id"],axis=1,inplace=True)
df.head()
| | city | hour | is_workday | weather | temp_1 | temp_2 | wind | y |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22 | 1 | 2 | 3.0 | 0.7 | 0 | 15 |
| 1 | 0 | 10 | 1 | 1 | 21.0 | 24.9 | 3 | 48 |
| 2 | 0 | 0 | 1 | 1 | 25.3 | 27.4 | 0 | 21 |
| 3 | 0 | 7 | 0 | 1 | 15.7 | 16.2 | 0 | 11 |
| 4 | 1 | 10 | 1 | 1 | 21.1 | 25.0 | 2 | 39 |
b. Log-transform y
df["y"] = np.log(df["y"]+1)
c. Separate y from the features before the dummy encoding
X = df.drop("y", axis=1)
y = df["y"]
X.head()
| | city | hour | is_workday | weather | temp_1 | temp_2 | wind |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 22 | 1 | 2 | 3.0 | 0.7 | 0 |
| 1 | 0 | 10 | 1 | 1 | 21.0 | 24.9 | 3 |
| 2 | 0 | 0 | 1 | 1 | 25.3 | 27.4 | 0 |
| 3 | 0 | 7 | 0 | 1 | 15.7 | 16.2 | 0 |
| 4 | 1 | 10 | 1 | 1 | 21.1 | 25.0 | 2 |
d. Convert the categorical variables into dummy variables
todummy_list = ["hour", "weather"]
def dummy_df(df, todummy_list):
    # One-hot encode each listed column, then drop the original column
    for x in todummy_list:
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        df = df.drop(x, axis=1)
        df = pd.concat([df, dummies], axis=1)
    return df
X = dummy_df(X, todummy_list)
print(X.head(5))
city is_workday temp_1 temp_2 wind hour_0 hour_1 hour_2 hour_3 \
0 0 1 3.0 0.7 0 0 0 0 0
1 0 1 21.0 24.9 3 0 0 0 0
2 0 1 25.3 27.4 0 1 0 0 0
3 0 0 15.7 16.2 0 0 0 0 0
4 1 1 21.1 25.0 2 0 0 0 0
hour_4 ... hour_18 hour_19 hour_20 hour_21 hour_22 hour_23 \
0 0 ... 0 0 0 0 1 0
1 0 ... 0 0 0 0 0 0
2 0 ... 0 0 0 0 0 0
3 0 ... 0 0 0 0 0 0
4 0 ... 0 0 0 0 0 0
weather_1 weather_2 weather_3 weather_4
0 0 1 0 0
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
[5 rows x 33 columns]
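For reference, pandas can produce the same encoding in one call; this equivalent one-liner (my addition, not the route taken above) yields identical columns:
# Equivalent one-liner to dummy_df: encode both columns at once
X_alt = pd.get_dummies(df.drop("y", axis=1), columns=["hour", "weather"])
assert list(X_alt.columns) == list(X.columns)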
5. Model Construction
I run eight models in this part and collect their R2 scores in a comparison table at the end.
a. Split the dataset into two parts: the training set gets 80% of the data and the test set gets the remaining 20%.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=66)
b. Scale the data using the StandardScaler() function, fitting on the training set only and then applying the same transform to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
c. Model construction
# Collect all R2 Scores.
R2_Scores = []
models = ['Linear Regression' , 'knn' , 'AdaBoost Regression' ,'GradientBoosting Regression',
'RandomForest Regression' ,"SVM", "Neural Network", "Stacked model"]
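Each model below repeats the same evaluation block; a small helper (a sketch of my own, not used in the original cells) would condense it:
def report_metrics(name, model, X_test, y_test):
    # Print the standard regression metrics for a fitted model and return its R2
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print('####### %s #######' % name)
    print('MSE  : %0.2f' % mse)
    print('MAE  : %0.2f' % mean_absolute_error(y_test, y_pred))
    print('RMSE : %0.2f' % (mse ** 0.5))
    print('R2   : %0.2f' % r2)
    return r2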
# Linear Regression
clf_lr = LinearRegression()
clf_lr.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_lr, X = X_train, y = y_train, cv = 10,verbose = 1)
y_pred = clf_lr.predict(X_test)
print('')
print('####### Linear Regression #######')
print('Score : %.4f' % clf_lr.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
####### Linear Regression #######
Score : 0.8011
[ 7.78652865e-01 -2.88601860e+25 7.96945499e-01 8.06271181e-01
8.09304528e-01 7.78931963e-01 7.92729190e-01 8.13661958e-01
8.10548852e-01 8.13604200e-01]
MSE : 0.39
MAE : 0.47
RMSE : 0.62
R2 : 0.80
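The enormous negative score in the second fold is a symptom of the dummy-variable trap: the hour dummies (and likewise the weather dummies) sum to a constant column, so with an intercept the design matrix is exactly collinear and plain least squares can become numerically unstable on some folds. One common remedy (a hypothetical alternative, not what was run above) is to drop one level per category:
# Hypothetical re-encoding that removes the exact collinearity
X_nc = pd.get_dummies(df.drop("y", axis=1), columns=["hour", "weather"], drop_first=True)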
# k nearest neighbor
clf_knn = KNeighborsRegressor()
clf_knn.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_knn, X = X_train, y = y_train, cv = 10,verbose = 1)
y_pred = clf_knn.predict(X_test)
print('')
print('###### KNeighbours Regression ######')
print('Score : %.4f' % clf_knn.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
###### KNeighbours Regression ######
Score : 0.9050
[0.8890756 0.89151628 0.90407412 0.8992113 0.91491348 0.90014117
0.90287772 0.91229941 0.91322939 0.9018737 ]
MSE : 0.18
MAE : 0.30
RMSE : 0.43
R2 : 0.91
# AdaBoost
clf_ada = AdaBoostRegressor(n_estimators=1000)
clf_ada.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_ada, X = X_train, y = y_train, cv = 5,verbose = 1)
y_pred = clf_ada.predict(X_test)
print('')
print('###### AdaBoost Regression ######')
print('Score : %.4f' % clf_ada.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
###### AdaBoost Regression ######
Score : 0.4474
[0.40772985 0.45865976 0.46406953 0.42710509 0.46842508]
MSE : 1.07
MAE : 0.88
RMSE : 1.04
R2 : 0.45
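AdaBoost scores far below the other models here. Its default base learner is a shallow tree (DecisionTreeRegressor with max_depth=3), which cannot capture the hour-by-hour structure well; a deeper base learner is a natural thing to try (an untested sketch, not part of the original run):
from sklearn.tree import DecisionTreeRegressor
# Sketch: boost deeper trees instead of the default depth-3 ones
clf_ada_deep = AdaBoostRegressor(DecisionTreeRegressor(max_depth=8), n_estimators=200)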
# GradientBoosting
clf_gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,max_depth=1, random_state=0, loss='ls',verbose = 1)
clf_gbr.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_gbr, X = X_train, y = y_train, cv = 10,verbose = 1)
y_pred = clf_gbr.predict(X_test)
print('')
print('###### Gradient Boosting Regression #######')
print('Score : %.4f' % clf_gbr.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
(verbose training output from the initial fit and the ten cross-validation fits trimmed: in every fit the train loss falls from about 1.9 at iteration 1 to about 0.64 at iteration 100)
###### Gradient Boosting Regression #######
Score : 0.6816
[0.65609682 0.66126518 0.67487789 0.67301113 0.66640737 0.65136067
0.65592692 0.69187255 0.67727584 0.68215423]
MSE : 0.62
MAE : 0.63
RMSE : 0.79
R2 : 0.68
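The gradient boosting model was run with max_depth=1, i.e. decision stumps, and only 100 trees, which explains why it trails the other ensembles: stumps cannot model interactions such as hour with is_workday. A stronger configuration to try (an assumption, not re-run here):
# Sketch: deeper trees and more boosting rounds
clf_gbr2 = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, max_depth=3, random_state=0)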
# Random Forest
clf_rf = RandomForestRegressor()
clf_rf.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_rf, X = X_train, y = y_train, cv = 10,verbose = 1)
y_pred = clf_rf.predict(X_test)
print('')
print('###### Random Forest ######')
print('Score : %.4f' % clf_rf.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
###### Random Forest ######
Score : 0.9017
[0.89311041 0.88891544 0.90256345 0.89246356 0.90829249 0.89037486
0.89787445 0.91075113 0.90245791 0.91223694]
MSE : 0.19
MAE : 0.31
RMSE : 0.44
R2 : 0.90
# SVM
clf_svr = SVR()
clf_svr.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_svr, X = X_train, y = y_train, cv = 10,verbose = 1)
y_pred = clf_svr.predict(X_test)
print('')
print('###### SVM ######')
print('Score : %.4f' % clf_svr.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
###### SVM ######
Score : 0.9209
[0.91375922 0.91446739 0.91658973 0.9213643 0.93426432 0.92555275
0.91839594 0.93011461 0.92602424 0.92820973]
MSE : 0.15
MAE : 0.27
RMSE : 0.39
R2 : 0.92
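SVR with default hyperparameters is already the strongest single model. Since GridSearchCV is imported above, tuning C and gamma is the natural next step (a sketch; the grid values are my own guesses):
# Sketch: small grid search over the RBF-kernel hyperparameters
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 'auto']}
grid = GridSearchCV(SVR(), param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)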
# Neural Network
clf_nn = MLPRegressor()
clf_nn.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_nn, X = X_train, y = y_train, cv = 10,verbose = 1)
y_pred = clf_nn.predict(X_test)
print('')
print('###### Neural Network ######')
print('Score : %.4f' % clf_nn.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
###### Neural Network ######
Score : 0.9219
[0.91211438 0.89218197 0.91710665 0.92285755 0.93468093 0.92464651
0.91431442 0.92979532 0.92579826 0.93059589]
MSE : 0.15
MAE : 0.27
RMSE : 0.39
R2 : 0.92
# Stacked model
from mlxtend.regressor import StackingRegressor
stregr = StackingRegressor(regressors=[clf_nn],
                           meta_regressor=clf_svr)
stregr.fit(X_train , y_train)
accuracies = cross_val_score(estimator = stregr, X = X_train, y = y_train, cv = 10,verbose = 1)
y_pred = stregr.predict(X_test)
print('')
print('###### Stacked Model ######')
print('Score : %.4f' % stregr.score(X_test, y_test))
print(accuracies)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.2f ' % r2)
R2_Scores.append(r2)
###### Stacked Model ######
Score : 0.9168
[0.91291617 0.91417198 0.91740103 0.9216346 0.9348423 0.92472979
0.91767926 0.93272922 0.92369016 0.92903444]
MSE : 0.16
MAE : 0.27
RMSE : 0.40
R2 : 0.92
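Note that this stack has only one base regressor (the MLP) feeding the SVR meta-model, so it behaves much like an SVR applied to the MLP's predictions, and it is unsurprising that it does not beat the plain SVR. Stacking usually pays off with several diverse base learners (a sketch under that assumption, reusing the models fitted above):
# Sketch: stack several diverse base models instead of one
stregr_multi = StackingRegressor(regressors=[clf_knn, clf_rf, clf_nn],
                                 meta_regressor=clf_svr)
stregr_multi.fit(X_train, y_train)
print('Score : %.4f' % stregr_multi.score(X_test, y_test))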
compare = pd.DataFrame({'Algorithms' : models , 'R2-Scores' : R2_Scores})
compare.sort_values(by='R2-Scores' ,ascending=False)
| | Algorithms | R2-Scores |
|---|---|---|
| 6 | Neural Network | 0.921914 |
| 5 | SVM | 0.920919 |
| 7 | Stacked model | 0.916799 |
| 1 | knn | 0.905014 |
| 4 | RandomForest Regression | 0.901718 |
| 0 | Linear Regression | 0.801112 |
| 3 | GradientBoosting Regression | 0.681629 |
| 2 | AdaBoost Regression | 0.447422 |
6. Feature importance
# Feature importance based on the random forest
feature_labels = np.array(X.columns)  # same order as the columns fed to the model (city ... weather_4)
importance = clf_rf.feature_importances_
feature_indexes_by_importance = importance.argsort()
for index in feature_indexes_by_importance:
    print('{} - {:.2f}%'.format(feature_labels[index], importance[index] * 100.0))
weather_4 - 0.00%
hour_12 - 0.07%
hour_15 - 0.07%
hour_13 - 0.07%
hour_14 - 0.09%
hour_11 - 0.13%
hour_16 - 0.19%
weather_2 - 0.24%
hour_20 - 0.25%
hour_10 - 0.26%
weather_1 - 0.31%
hour_21 - 0.32%
hour_9 - 0.32%
hour_19 - 0.33%
hour_22 - 0.68%
hour_18 - 0.84%
hour_17 - 0.86%
wind - 1.01%
hour_8 - 1.15%
city - 1.19%
hour_7 - 1.33%
hour_23 - 1.65%
weather_3 - 1.76%
hour_6 - 3.28%
hour_0 - 4.16%
temp_2 - 4.77%
is_workday - 7.65%
hour_1 - 8.02%
temp_1 - 9.47%
hour_5 - 10.00%
hour_2 - 10.80%
hour_3 - 14.25%
hour_4 - 14.50%
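The same ranking is easier to read as a bar chart (a sketch of my own, reusing the arrays computed above):
# Bar chart of feature importances, largest first
order = importance.argsort()[::-1]
plt.figure(figsize=(12, 6))
plt.bar(range(len(order)), importance[order])
plt.xticks(range(len(order)), feature_labels[order], rotation=90)
plt.ylabel('importance')
plt.show()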