Kaggle Rohlik Orders Forecasting Challenge: reaching a score of 0.0333 (top 2%) with xgb.XGBRegressor and LinearRegression

First, import the required libraries.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import math
from math import pi
import warnings
from scipy.stats import chi2_contingency
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statlearning import plot_regressions, plot_boxes, barplots, plot_dist, plot_dists, plot_stripplot  # statlearning: local helper module with plotting utilities
from sklearn.model_selection import RepeatedKFold, GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from statlearning import forward  # forward: stepwise-selection model helper from statlearning
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV, LinearRegression
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score,  mean_absolute_error, mean_absolute_percentage_error

warnings.filterwarnings('ignore') 

# Plot settings
sns.set_context('notebook') # optimises figures for notebook display
sns.set_style('ticks') # set default plot style
tableau10 = ['#4E79A7','#F28E2C','#E15759','#76B7B2','#59A14F', 
          '#EDC949','#AF7AA1','#FF9DA7','#9C755F','#BAB0AB']
sns.set_palette(tableau10) # set custom color scheme
%matplotlib inline
plt.rcParams['figure.figsize'] = (9, 6)

Import the data:

train = pd.read_csv("train.csv", index_col = "id")
train_calendar = pd.read_csv("train_calendar.csv")
test = pd.read_csv("test.csv", index_col = "id")
test_calendar = pd.read_csv("test_calendar.csv")
solution_df = pd.read_csv("solution_example.csv", index_col = "id")
solution = np.array(solution_df).reshape(-1, 1)

With the data and libraries loaded, we can move on to the main workflow: exploratory data analysis, feature engineering, modeling, hyperparameter tuning, and model evaluation. Because valuable new information keeps surfacing during the analysis, the process below does not follow a rigid template.

Exploratory Data Analysis (EDA)

train.info()

Output:

Three columns in the training set have missing values: holiday_name, precipitation, and snow, and the latter two are missing in exactly the same number of rows. The missingness of holiday_name is non-random (a missing name simply means the day is not a holiday), so it is filled with the placeholder "no feature". Since precipitation and snow are not involved in predicting the test set, they are simply mean-filled, purely for plotting purposes.

train["holiday_name"] = train["holiday_name"].fillna("no feature")
train["precipitation"] = train["precipitation"].fillna(train["precipitation"].mean())
train["snow"] = train["snow"].fillna(train["snow"].mean())
test.info()

Output:

The test set has fewer columns and is likewise missing holiday_name.
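The same non-random missingness applies to holiday_name in the test set. The original steps do not show a fill for it, but mirroring the train-set treatment keeps the dummy categories aligned between train and test; a small assumed addition:

# Mirror the train-set fill so get_dummies later produces matching categories.
test["holiday_name"] = test["holiday_name"].fillna("no feature")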

Next, visualize the data to surface relationships between variables, inspect their distributions, and check for outliers.

To make plotting easier, the variables are first grouped by measurement type: continuous, discrete, nominal, and ordinal. The response variable is defined as "orders".

response = "orders"
predictors = list(test.columns.difference([response]))
predictors.remove("date")
print(train.columns)
continuous = ["precipitation", "snow", "user_activity_1", "user_activity_2"]
discrete = ["mov_change"]
ordinal = []
nominal = ["warehouse", "holiday_name", "holiday", "shutdown",
       "mini_shutdown", "shops_closed", "winter_school_holidays",
       "school_holidays", "blackout", "frankfurt_shutdown"]

Compute the Pearson correlations between the continuous variables and the response. The variables are rescaled with MinMaxScaler first; strictly speaking, Pearson correlation is scale-invariant, so the scaling mainly keeps the transformed data on a common scale for the later plots.

scaler = MinMaxScaler()
train_tran = train.copy()
train_tran[continuous + [response]] = scaler.fit_transform(train_tran[continuous + [response]])
train_tran[continuous + [response]].corr()

Output:

Scatter plots against the response:

plot_regressions(train[continuous], train[response])
plt.show()

Bar plots:

barplots(train[nominal])
plt.show()

Box plots:

plot_boxes(train[nominal], train[response])
plt.show()

Check the rough distribution of the nominal variables:

for i in nominal:
    print(train[i].value_counts())
    print("\n")

warehouse
Prague_1       1193
Brno_1         1193
Prague_2       1193
Prague_3       1193
Budapest_1     1154
Munich_1        785
Frankfurt_1     629
Name: count, dtype: int64


holiday_name
no feature                                                     7122
International womens day                                         26
Christmas Eve                                                    23
2nd Christmas Day                                                16
New Years Day                                                    16
Cyrila a Metodej                                                 12
Den boje za svobodu a demokracii                                 12
Den ceske statnosti                                              12
Jan Hus                                                          12
Den vzniku samostatneho ceskoslovenskeho statu                   12
Den osvobozeni                                                   12
Labour Day                                                       12
Easter Monday                                                    12
Good Friday                                                      12
Memorial Day of the Republic                                      4
Memorial Day for the Victims of the Communist Dictatorships       4
Memorial Day for the Victims of the Holocaust                     3
National Defense Day                                              3
Day of National Unity                                             3
Independent Hungary Day                                           3
Memorial Day for the Martyrs of Arad                              3
Peace Festival in Augsburg                                        2
Reformation Day                                                   2
1848 Revolution Memorial Day (Extra holiday)                      1
All Saints' Day Holiday                                           1
Name: count, dtype: int64


holiday
0    7140
1     200
Name: count, dtype: int64


shutdown
0    7339
1       1
Name: count, dtype: int64


mini_shutdown
0    7336
1       4
Name: count, dtype: int64


shops_closed
0    7260
1      80
Name: count, dtype: int64


winter_school_holidays
0    7120
1     220
Name: count, dtype: int64


school_holidays
0    7288
1      52
Name: count, dtype: int64


blackout
0    7333
1       7
Name: count, dtype: int64


frankfurt_shutdown
0    7338
1       2
Name: count, dtype: int64

Distributions of the continuous variables:

plot_dists(train[continuous])
plt.show()


Check the distribution of the response:

fig, ax = plot_dist(train[response])
ax.set_title('Distribution plot for orders', fontsize = 14)
plt.show()


The response is clearly right-skewed and contains outliers, so try a log transform.

fig, ax = plot_dist(np.log(train[response]))
ax.set_title('Distribution plot for log(orders)', fontsize = 14)
plt.show()

The log-transformed distribution looks as follows:

Data Preprocessing

First remove response outliers: compute z-scores and drop the rows whose absolute z-score exceeds the critical value of 3.

z = stats.zscore(train["orders"])
z_outlier = (z > 3) | (z < -3)
outliers = train[z_outlier].index
train.drop(outliers, axis = 0, inplace = True)

 删除"2023-03-14"之前的数据,仅用近一年的数据进行预测。

outliers = train.loc[(train["date"] <= "2023-3-14")].index
train.drop(outliers, axis = 0, inplace = True)

Feature Engineering

train.groupby("warehouse")["orders"].median().sort_values(ascending = False)
warehouse_orders_rank_ordinalmap = {"Frankfurt_1": 7,"Munich_1": 6,"Prague_3": 5,"Prague_2": 4,"Budapest_1": 3, "Brno_1":2, "Prague_1":1}
def feature_engineering(data):
    
    data["warehouse_orders_rank"] = data["warehouse"].map(warehouse_orders_rank_ordinalmap)

    data["date"] = pd.to_datetime(data["date"])
    data["year"] = data["date"].dt.year
    data["month"] = data["date"].dt.month
    data["day"] = data["date"].dt.day
    data["week"] = data["date"].dt.isocalendar().week.astype(int)
    data["weekday"] = data["date"].dt.weekday
    data["quarter"] = data["date"].dt.quarter

    data.loc[data["weekday"] == 7, "is_month_7"] = 1
    data.loc[data["weekday"] != 7, "is_month_7"] = 0
    data["is_month_7"] = data["is_month_7"].astype("object")

    data.loc[data["month"] < 6, "before_month_seven"] = 0
    data.loc[(data["month"] > 6) & (data["month"] < 8), "before_month_seven"] = 1
    data.loc[data["month"] >= 8, "before_month_seven"] = 2
    data["before_month_seven"] = data["before_month_seven"].astype("object")

    data.loc[data["year"] <= 2022, "before_year_2022"] = 0
    data.loc[data["year"] > 2022, "before_year_2022"] = 1
    data["before_year_2022"] = data["before_year_2022"].astype("object")

    data.loc[data["weekday"] == 4, "is_weekday_4"] = 1
    data.loc[data["weekday"] != 4, "is_weekday_4"] = 0
    data["is_weekday_4"] = data["is_weekday_4"].astype(str)

    data.loc[data["weekday"] == 27, "is_week_27"] = 1
    data.loc[data["weekday"] != 27, "is_week_27"] = 0
    data["is_week_27"] = data["is_week_27"].astype("object")


    data.loc[(data["weekday"] == 1) | (data["weekday"] == 2), "is_weekday1and2"] = 1
    data.loc[(data["weekday"] != 1) & (data["weekday"] != 2), "is_weekday1and2"] = 0
    data["is_weekday1and2"] = data["is_weekday1and2"].astype(str)

    data["compo"] = (data["month"] * (data["year"] - 2020) * data["weekday"]).apply(lambda x: math.pow(x, 1/3))


    # Cyclical encodings; dividing by the period gives a seasonal signal
    # (cos(2*pi*integer) on its own is always 1).
    data['month_cos'] = np.cos(2 * pi * data['month'] / 12)
    data['day_cos'] = np.cos(2 * pi * data['day'] / 31)



    # Flag days adjacent to holidays. The shifts run over the whole frame, so
    # rows are assumed to be sorted by warehouse and then date.
    data['holiday_before'] = data['holiday'].shift(1).fillna(0).astype(int)
    data['holiday_after'] = data['holiday'].shift(-1).fillna(0).astype(int)

    data['holiday_before_2d'] = data['holiday'].shift(2).fillna(0).astype(int)
    data['holiday_after_2d'] = data['holiday'].shift(-2).fillna(0).astype(int)

    data['holiday_before_3d'] = data['holiday'].shift(3).fillna(0).astype(int)
    data['holiday_after_3d'] = data['holiday'].shift(-3).fillna(0).astype(int)

    data['holiday_before_4d'] = data['holiday'].shift(4).fillna(0).astype(int)
    data['holiday_after_4d'] = data['holiday'].shift(-4).fillna(0).astype(int)

    data['shops_closed_before'] = data['shops_closed'].shift(1).fillna(0).astype(int)
    data['shops_closed_after'] = data['shops_closed'].shift(-1).fillna(0).astype(int)


    data["month_str"] = data["month"].astype("str")
    data["weekday_str"] = data["weekday"].astype("str")


    return  # all features are added to `data` in place



feature_engineering(train)

Feature engineering added many new features. Visualize each against the response to check for a relationship before deciding whether it joins the final model.

x = train.groupby("month")[["orders"]].median()
fig, ax= plt.subplots()
ax.plot(x)
ax.set_xlabel("month")
ax.set_ylabel("median of orders")
sns.despine()
plt.show()

Month and orders are strongly related over time: orders decrease month over month in the first half of the year and increase month over month in the second half, with July being the lowest month.

x = train.groupby("year")[["orders"]].median()
fig, ax= plt.subplots()
ax.plot(x)
ax.set_xlabel("year")
ax.set_ylabel("median of orders")
sns.despine()
plt.show()

 

x = train.groupby("day")[["orders"]].median()
fig, ax= plt.subplots()
ax.plot(x)
ax.set_xlabel("day")
ax.set_ylabel("median of orders")
sns.despine()
plt.show()

The first day of each month has the lowest order volume.

x = train.groupby("weekday")[["orders"]].median()
fig, ax= plt.subplots()
ax.plot(x)
ax.set_xlabel("weekday")
ax.set_ylabel("median of orders")
sns.despine()
plt.show()

 

Friday has the highest order volume, even higher than the weekend, while Monday through Thursday are all substantially lower.

x = train.groupby("week")[["orders"]].median()
fig, ax= plt.subplots()
ax.plot(x)
ax.set_xlabel("week")
ax.set_ylabel("median of orders")
sns.despine()
plt.show()

The week-number variable shows a pattern similar to month. Since "week" and "month" are strongly related, only one of the two is used (see the quick check below).
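As a quick check of that claim, the correlation between the two calendar features can be printed directly; a minimal sketch:

# week and month track each other almost one-for-one within a year, so keeping
# both would add little information.
print(train[["week", "month"]].corr().iloc[0, 1])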

Based on the visualizations above and common sense, the following variables are added. feature_engineering() has already created them in the train DataFrame, so this step only registers them in predictors, since only features in predictors take part in training:

predictors.append("year")
predictors.append("day")
predictors.append("week")
predictors.append("quarter")
predictors.append("is_weekday_4")
predictors.append("is_weekday1and2")
predictors.append("is_week_27")
predictors.append("is_month_7")
predictors.append("month_str")
predictors.append("weekday_str")
predictors.append("warehouse_orders_rank")
predictors.append("holiday_before")
predictors.append("holiday_after")
predictors.append("holiday_before_2d")
predictors.append("holiday_after_2d")
predictors.append("holiday_before_3d")
predictors.append("holiday_after_3d")

Training and Fitting

Split train with train_test_split, using 70% of the rows to fit the models and the remaining 30% to evaluate them. The response y is log-transformed.

index_val_train, index_val  = train_test_split(np.array(train.index), train_size = 0.7, random_state = 12)
val_train = train.loc[index_val_train]
val = train.loc[index_val]
dummies = pd.get_dummies(train[predictors], drop_first = True)
# predictors_2 is a second feature list used by the error-refit step further
# below; its definition was omitted in the original post.
dummies_2 = pd.get_dummies(train[predictors_2], drop_first = True)
x_val_train = dummies.loc[index_val_train].to_numpy()
x_val = dummies.loc[index_val].to_numpy()

y_val_train = train.loc[index_val_train, response].to_numpy().reshape(-1, 1)
y_val = train.loc[index_val, response].to_numpy().reshape(-1, 1)
log_y_val_train = np.log(train.loc[index_val_train, response].to_numpy().reshape(-1, 1))
log_y_val = np.log(train.loc[index_val, response].to_numpy().reshape(-1, 1))

Six models are trained for an initial comparison: LinearRegression, forward, LassoCV, RidgeCV, ElasticNetCV, and xgb.XGBRegressor, plus an "ensemble" that averages the LinearRegression and forward predictions. The evaluation metrics are MAPE, RMSE, R2, and MAE on the validation split, along with the training MAPE:

columns = ["Val MAPE", "Train MAPE", "Val RMSE", "Val R2", "Val MAE"]
methods = [LinearRegression, forward, LassoCV, RidgeCV, ElasticNetCV, xgb.XGBRegressor, "ensemble"]
results = pd.DataFrame(0.0, columns = columns, index = [method.__name__ for method in methods if method != "ensemble"] + ["ensemble"])
y_preds = []
y_train_preds = []
errors = []
train_errors = []
for i, method in enumerate(methods):
    if method == "ensemble":
        continue  # the ensemble row is computed after the loop
    if method == xgb.XGBRegressor:
        model = method(n_estimators = 1500, learning_rate = 0.1, max_depth = 5)
    else:
        model = method()
    model.fit(x_val_train, log_y_val_train)
    log_y_pred = model.predict(x_val).reshape(-1, 1)
    log_y_train_pred = model.predict(x_val_train).reshape(-1, 1)
    y_pred = np.exp(log_y_pred)          # back-transform from the log scale
    y_train_pred = np.exp(log_y_train_pred)

    error = ((y_val - y_pred) / y_val) * 100                          # % error
    train_error = ((y_val_train - y_train_pred) / y_val_train) * 100
    results.iloc[i, 0] = mean_absolute_percentage_error(y_val, y_pred)
    results.iloc[i, 1] = mean_absolute_percentage_error(y_val_train, y_train_pred)
    results.iloc[i, 2] = np.sqrt(mean_squared_error(y_val, y_pred))
    results.iloc[i, 3] = r2_score(y_val, y_pred)
    results.iloc[i, 4] = mean_absolute_error(y_val, y_pred)
    y_preds.append(y_pred)
    y_train_preds.append(y_train_pred)
    errors.append(error)
    train_errors.append(train_error)

y_pred = np.mean(np.array(y_preds[:2]), axis = 0)  # "ensemble": mean of the LinearRegression and forward predictions
y_train_pred = np.mean(np.array(y_train_preds[:2]), axis = 0)
error = ((y_val - y_pred) / y_val) * 100
train_error = ((y_val_train - y_train_pred) / y_val_train) * 100
y_preds.append(y_pred)
y_train_preds.append(y_train_pred)
errors.append(error)
train_errors.append(train_error)
results.iloc[-1,0] = mean_absolute_percentage_error(y_val, y_pred)
results.iloc[-1,1] = mean_absolute_percentage_error(y_val_train, y_train_pred)
results.iloc[-1,2] = np.sqrt(mean_squared_error(y_val, y_pred))
results.iloc[-1,3] = r2_score(y_val, y_pred)
results.iloc[-1,4] = mean_absolute_error(y_val, y_pred)

results.round(5)

Output:

The results show that XGBRegressor, the most complex model, achieves the best MAPE on the training split, while the other models look mediocre there. Intuition says to submit XGBRegressor as the final model, yet after actually submitting, XGBRegressor performed very poorly on the test set, and the relatively simple models did better. This points to substantial overfitting by XGBRegressor on this dataset, so another model should be used instead. Among the rest, LinearRegression performed best, so it serves as the base model in the subsequent steps.
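One way to catch this kind of overfitting before submitting is to validate on the most recent dates rather than a random split, since the leaderboard period lies strictly after the training period. A minimal sketch, not part of the original pipeline (the 85% cutoff is illustrative):

# Chronological holdout: the latest dates form the validation set, which
# mimics the leaderboard setup more closely than a random split does.
cutoff = train["date"].quantile(0.85)
chrono_train = train[train["date"] <= cutoff]
chrono_val = train[train["date"] > cutoff]
print(len(chrono_train), len(chrono_val))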

Refitting the Training Errors

Plot errors against predictions to evaluate the model, check whether the model assumptions hold, and spot any problems:

x = "y_pred_LinearRegression"
y = "error_LinearRegression"
hue = "warehouse"
plot_stripplot(x, y, df_pred, hue)
plt.show()

Output:

The error-versus-prediction plot immediately shows a clear correlation between errors and predictions, with differences across warehouses; for example, the error variance of Munich_1 is markedly larger than the others. Taking Munich_1 as an example:

x = "y_pred_LinearRegression"
y = "error_LinearRegression"
hue = "year"
plot_stripplot(x, y, df_pred.loc[df_pred["warehouse"] == "Munich_1"], hue)
plt.show()

The positive correlation between Munich_1's predictions and errors is even clearer here.

Quantify this further with correlation coefficients:

print("Prague_1", round(ye_Prague_1[["y_pred_LinearRegression", "error_LinearRegression"]].corr().values[0][1], 2))
print("Brno_1", round(ye_Brno_1[["y_pred_LinearRegression", "error_LinearRegression"]].corr().values[0][1], 2))
print("Prague_2", round(ye_Prague_2[["y_pred_LinearRegression", "error_LinearRegression"]].corr().values[0][1], 2))
print("Prague_3", round(ye_Prague_3[["y_pred_LinearRegression", "error_LinearRegression"]].corr().values[0][1], 2))
print("Munich_1", round(ye_Munich_1[["y_pred_LinearRegression", "error_LinearRegression"]].corr().values[0][1], 2))
print("Frankfurt_1", round(ye_Frankfurt_1[["y_pred_LinearRegression", "error_LinearRegression"]].corr().values[0][1], 2))
print("Budapest_1", round(ye_Budapest_1[["y_pred_LinearRegression", "error_LinearRegression"]].corr().values[0][1], 2))

Output:

The predictions and errors are substantially correlated within each warehouse, positively for some and negatively for others. Fitting a second model to these errors can further improve the predictions.

First, collect the training-split errors as new columns alongside the predictions, and do the same for the validation split:

df_train = pd.DataFrame(
    {
        "id":index_val_train,
        "y_val_train":y_val_train.reshape(-1), 
        "y_pred_LinearRegression":y_train_preds[0].reshape(-1), 
        "y_pred_forward":y_train_preds[1].reshape(-1), 
        "y_pred_LassoCV":y_train_preds[2].reshape(-1), 
        "y_pred_RidgeCV":y_train_preds[3].reshape(-1), 
        "y_pred_ElasticNetCV":y_train_preds[4].reshape(-1), 
        "y_pred_XGBRegressor":y_train_preds[5].reshape(-1),
        "y_pred_ensemble":y_train_preds[6].reshape(-1),
        "error_LinearRegression":train_errors[0].reshape(-1), 
        "error_forward":train_errors[1].reshape(-1), 
        "error_LassoCV":train_errors[2].reshape(-1), 
        "error_RidgeCV":train_errors[3].reshape(-1), 
        "error_ElasticNetCV":train_errors[4].reshape(-1), 
        "error_XGBRegressor":train_errors[5].reshape(-1),
        "error_ensemble":train_errors[6].reshape(-1),
        }
        )
df_train = df_train.set_index("id")
df_train = val_train.merge(df_train, how = "inner", left_on = "id", right_on = "id")

df_pred = pd.DataFrame(
    {
        "id":index_val,
        "y_val":y_val.reshape(-1), 
        "y_pred_LinearRegression":y_preds[0].reshape(-1), 
        "y_pred_forward":y_preds[1].reshape(-1), 
        "y_pred_LassoCV":y_preds[2].reshape(-1), 
        "y_pred_RidgeCV":y_preds[3].reshape(-1), 
        "y_pred_ElasticNetCV":y_preds[4].reshape(-1), 
        "y_pred_XGBRegressor":y_preds[5].reshape(-1),
        "y_pred_ensemble":y_preds[6].reshape(-1),
        "error_LinearRegression":errors[0].reshape(-1), 
        "error_forward":errors[1].reshape(-1), 
        "error_LassoCV":errors[2].reshape(-1), 
        "error_RidgeCV":errors[3].reshape(-1), 
        "error_ElasticNetCV":errors[4].reshape(-1), 
        "error_XGBRegressor":errors[5].reshape(-1),
        "error_ensemble":errors[6].reshape(-1),
        }
        )
# df_pred["error_LinearRegression_b1"] = df_pred.groupby("warehouse")["error_LinearRegression"].shift(1)
# df_pred[f"error_LinearRegression_b1"] = df_pred.groupby("warehouse")[f"error_LinearRegression_b1"].fillna(method = "bfill")
df_pred = df_pred.set_index("id")
df_pred = val.merge(df_pred, how = "inner", left_on = "id", right_on = "id")

pred_result = dummies_2.loc[index_val].merge(df_pred, how = "inner", left_on = "id", right_on = "id", suffixes = ("", "_drop"))
df_train = dummies_2.loc[index_val_train].merge(df_train, how = "inner", left_on = "id", right_on = "id", suffixes = ("", "_drop"))
cat = dummies_2.loc[index_val_train].columns.to_list()

Next, adjust the predicted y using a model fitted to the training errors, yielding adjusted_y.

The formula follows from the definition of the percentage error: since error = (y − ŷ) / y × 100, solving for y gives

adjusted_y = ŷ / (1 − error_pred / 100)

For example, if ŷ = 100 and the predicted error is +5%, the adjustment gives 100 / 0.95 ≈ 105.26.

class Adjust_y():
    """Fit a second model to the percentage errors of a base model, then
    invert the error definition to adjust the base predictions."""

    def __init__(self, df_pred):
        self.df_pred = df_pred

    def fit(self, x, y, model, para = None):
        x = self.df_pred[x].to_numpy()
        y = self.df_pred[y].to_numpy().reshape(-1, 1)
        if para is not None:
            self.model = model(**para)
        else:
            self.model = model()
        self.model.fit(x, y)
        return self

    def predict(self, x, target_y, data):
        x_train = data[x].to_numpy()
        error_pred = self.model.predict(x_train).reshape(-1, 1)
        # Invert error = (y - y_hat) / y * 100  =>  y = y_hat / (1 - error / 100)
        adjusted_y = data[target_y].to_numpy().reshape(-1, 1) / (1 - (error_pred / 100))
        return adjusted_y

    def add_to_solution(self, x, target_y, data):
        adjusted_y = self.predict(x, target_y, data)
        data["orders"] = adjusted_y.ravel()
        return data

Run the adjustment:

para = {
    "n_estimators": 1500,
    "max_depth": 9,
    "learning_rate": 0.09,
    "reg_lambda": 14,
    "colsample_bytree": 0.8,
}
ensemble = Adjust_y(df_train)
ensemble.fit(cat + ["y_pred_ensemble"], "error_ensemble", xgb.XGBRegressor, para = para)
adjusted_y_xgb = ensemble.predict(cat + ["y_pred_ensemble"], "y_pred_ensemble", pred_result.copy())
t_error = ((y_val - adjusted_y_xgb) / y_val) * 100   # adjusted percentage error
df_pred["t_error"] = t_error
df_pred["adjusted_y"] = adjusted_y_xgb
mape = mean_absolute_percentage_error(y_val, adjusted_y_xgb)
print(mape)

After this step, performance can be improved further by removing data points that are unhelpful for training. These are found by flagging points whose adjusted error is excessively large, for example:

df_pred.loc[np.abs(df_pred["t_error"])>20].index

Output:

Index(['Munich_1_2023-12-02', 'Munich_1_2023-07-16'], dtype='object', name='id')

Drop the rows with large errors:

outliers = ['Munich_1_2023-12-02', 'Munich_1_2023-07-16']
train.drop(outliers, axis = 0, inplace = True)

Drop them, retrain the model, and repeat the whole process about three times, at which point the MAPE stops improving; a sketch of the loop follows.
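A sketch of that loop, assuming the training-and-adjustment steps above are wrapped in a hypothetical fit_and_adjust(train) helper that returns the updated df_pred:

# Hypothetical outer loop for the repeat-about-three-times procedure.
for _ in range(3):
    df_pred = fit_and_adjust(train)                            # retrain + adjust
    bad = df_pred.loc[np.abs(df_pred["t_error"]) > 20].index   # large-error rows
    if len(bad) == 0:
        break                                                  # nothing left to drop
    train.drop(bad, axis = 0, inplace = True)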

My result was as follows:

MAPE = 0.028254926163139207

Adjusting the predictions with a model fitted to the training errors improves the MAPE considerably.

Evaluating the Adjusted Errors

x = df_pred["adjusted_y"]
y = df_pred["t_error"]
fig, ax = plt.subplots()
sns.regplot(x = x, y = y, ci = None, scatter_kws = {'s':5}, 
            line_kws = {'color':'black', 'alpha':0.6})
ax.set_xlabel(x.name)
ax.set_ylabel(y.name)
sns.despine()
plt.show()

In this error-versus-prediction plot, the errors no longer show any clear correlation with the predictions, and they are approximately normally distributed with constant variance.
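Two quick numeric checks of those claims, using the scipy.stats module already imported as stats (a sketch; Shapiro-Wilk is just one of several reasonable normality checks):

# The correlation should now be near zero, and the Shapiro-Wilk statistic
# gives a rough read on how close the adjusted errors are to normal.
print(df_pred[["adjusted_y", "t_error"]].corr().iloc[0, 1])
print(stats.shapiro(df_pred["t_error"]))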

Prediction:

feature_engineering(test)

train_index = train.index
test_index = test.index
# Concatenate train and test before get_dummies so both get identical columns.
pp = pd.concat([train[predictors], test[predictors]], axis = 0)
predict_dummies = pd.get_dummies(pp, drop_first = True)
x_train = predict_dummies.loc[train_index].to_numpy()
x_test = predict_dummies.loc[test_index].to_numpy()
y_train = train[response].to_numpy().reshape(-1, 1)
log_y_train = np.log(train[response].to_numpy().reshape(-1, 1))


columns = ["Val MAPE", "Train MAPE", "Val RMSE", "Val R2", "Val MAE"]
methods = [LinearRegression, forward, LassoCV, RidgeCV, ElasticNetCV, xgb.XGBRegressor, "ensemble"]
results = pd.DataFrame(0.0, columns = columns, index = [method.__name__ for method in methods if method != "ensemble"] + ["ensemble"]) 
y_preds = []
y_train_preds = []
errors = []
train_errors = []
for i, method in enumerate(methods):
        if method == xgb.XGBRegressor:
            model = method(n_estimators = 1500, learning_rate = 0.1, max_depth = 5) 
        elif method == "ensemble":
            results.iloc[i,0] = 0
            results.iloc[i,1] = 0
            results.iloc[i,2] = 0
            results.iloc[i,3] = 0
        else:
            model = method()
    model.fit(x_train, log_y_train)
    log_y_pred = model.predict(x_test).reshape(-1, 1)
    log_y_train_pred = model.predict(x_train).reshape(-1, 1)
    y_pred = np.exp(log_y_pred)
    y_train_pred = np.exp(log_y_train_pred)

    train_error = ((y_train - y_train_pred) / y_train) * 100
    results.iloc[i, 0] = mean_absolute_percentage_error(y_train, y_train_pred)
    results.iloc[i, 1] = np.sqrt(mean_squared_error(y_train, y_train_pred))
    results.iloc[i, 2] = r2_score(y_train, y_train_pred)
    results.iloc[i, 3] = mean_absolute_error(y_train, y_train_pred)
    y_preds.append(y_pred)
    y_train_preds.append(y_train_pred)
    train_errors.append(train_error)

y_pred = np.mean(np.array(y_preds[:2]), axis = 0)  # ensemble of LinearRegression and forward
y_train_pred = np.mean(np.array(y_train_preds[:2]), axis = 0)
train_error = ((y_train - y_train_pred) / y_train) * 100
y_preds.append(y_pred)
y_train_preds.append(y_train_pred)
train_errors.append(train_error)
results.iloc[-1, 0] = mean_absolute_percentage_error(y_train, y_train_pred)
results.iloc[-1, 1] = np.sqrt(mean_squared_error(y_train, y_train_pred))
results.iloc[-1, 2] = r2_score(y_train, y_train_pred)
results.iloc[-1, 3] = mean_absolute_error(y_train, y_train_pred)

results.round(5)

pred_result = pd.DataFrame({"id":test_index, "orders":y_preds[0].ravel()})
pred_result = pred_result.set_index("id")
pred_result = predict_dummies.loc[test_index].merge(pred_result, how = "inner", left_on = "id", right_on = "id")

df_train_pred = pd.DataFrame(
    {
        "id":train_index,
        "y_train":y_train.reshape(-1), 
        "y_pred_LinearRegression":y_train_preds[0].reshape(-1), 
        "y_pred_forward":y_train_preds[1].reshape(-1), 
        "y_pred_LassoCV":y_train_preds[2].reshape(-1), 
        "y_pred_RidgeCV":y_train_preds[3].reshape(-1), 
        "y_pred_ElasticNetCV":y_train_preds[4].reshape(-1), 
        "y_pred_XGBRegressor":y_train_preds[5].reshape(-1),
        "y_pred_ensemble":y_train_preds[6].reshape(-1),
        "error_LinearRegression":train_errors[0].reshape(-1), 
        "error_forward":train_errors[1].reshape(-1), 
        "error_LassoCV":train_errors[2].reshape(-1), 
        "error_RidgeCV":train_errors[3].reshape(-1), 
        "error_ElasticNetCV":train_errors[4].reshape(-1), 
        "error_XGBRegressor":train_errors[5].reshape(-1),
        "error_ensemble":train_errors[6].reshape(-1),
        }
        )
df_train_pred = df_train_pred.set_index("id")
# Merge the feature dummies in once; a second identical merge would only
# duplicate columns (hence the "_drop" suffixes as a safeguard).
df_train_pred = predict_dummies.loc[train_index].merge(df_train_pred, how = "inner", left_on = "id", right_on = "id", suffixes = ("", "_drop"))
cat = predict_dummies.loc[train_index].columns.to_list()




para = {
    "n_estimators": 1500,
    "max_depth": 9,
    "learning_rate": 0.09,
    "reg_lambda": 14,
    "colsample_bytree": 0.8,
}
ensemble = Adjust_y(df_train_pred)
ensemble.fit(cat + ["y_pred_LinearRegression"], "error_LinearRegression", xgb.XGBRegressor, para = para)
# In pred_result the LinearRegression predictions live in the "orders" column,
# so it stands in for "y_pred_LinearRegression" at prediction time.
solu = ensemble.add_to_solution(cat + ["orders"], "orders", pred_result.copy())

solution = solu["orders"]
solution.to_csv("results.csv")
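Before submitting, it is worth checking the file against the provided example layout; a small sketch, assuming solution_example.csv covers the same ids:

# Compare the submission's ids and shape with the sample submission.
sample = pd.read_csv("solution_example.csv", index_col = "id")
print(solution.shape, sample.shape)
print(solution.index.difference(sample.index))  # should be empty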

Submitting results.csv scored 0.0333 on the leaderboard, within the top 2%.
