Used-Car Price Prediction, Task 4: Modeling and Parameter Tuning


Preface

With the features screened earlier and the data prepared for modeling, we can start building models. There are many to choose from; the usual approach is to train several single models, compare them, keep the best-performing ones, and then tune those.
Reference material for the relevant algorithms:

  • Linear regression
    -https://zhuanlan.zhihu.com/p/49480391

  • Decision tree
    -https://zhuanlan.zhihu.com/p/65304798

  • GBDT
    -https://zhuanlan.zhihu.com/p/45145899

  • XGBoost
    -https://zhuanlan.zhihu.com/p/86816771

  • LightGBM
    -https://zhuanlan.zhihu.com/p/89360721


I. Code examples

1. Reading the data

Code (example):

# 1. read the data
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# reduce_mem_usage shrinks a DataFrame's memory footprint by downcasting each column's dtype

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
sample_feature = reduce_mem_usage(pd.read_csv('C:\\Users\\TINKPAD\\Desktop\\python_work\\kaggle\\二手车交易价格预测\\data_for_tree.csv'))

Memory usage of dataframe is 59.22 MB
Memory usage after optimization is: 15.76 MB
Decreased by 73.4%
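
One caveat: the float16 branch can silently lose precision, since half-precision floats keep only about three significant decimal digits. If a feature needs more precision, it is safer to stop the downcast at float32; a quick illustration:

import numpy as np

x = 0.123456789
print(np.float16(x))   # ~0.1235: only about 3 significant digits survive
print(np.float32(x))   # ~0.12345679: usually plenty for tree models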

2. Linear regression, five-fold cross-validation, and simulating the real business setting

Code (example):

continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model = model.fit(train_X, train_y)
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:-110670.68277255078
Out[30]: 
[('v_6', 3367064.341641925),
 ('v_8', 700675.5609399723),
 ('v_9', 170630.2772322865),
 ('v_7', 32322.661932090483),
 ('v_12', 20473.67079697052),
 ('v_3', 17868.079541493866),
 ('v_11', 11474.938996734494),
 ('v_13', 11261.764560011092),
 ('v_10', 2683.9200905845105),
 ('gearbox', 881.8225039248697),
 ('fuelType', 363.90425072173207),
 ('bodyType', 189.60271012074244),
 ('city', 44.94975120525054),
 ('power', 28.553901616755937),
 ('brand_price_median', 0.5103728134078707),
 ('brand_price_std', 0.45036347092629764),
 ('brand_amount', 0.14881120395066877),
 ('brand_price_max', 0.003191018670314537),
 ('SaleID', 5.355989919863198e-05),
 ('offerType', 4.333909600973129e-06),
 ('seller', 3.433728124946356e-06),
 ('train', 3.91155481338501e-08),
 ('brand_price_sum', -2.175006868187838e-05),
 ('name', -0.0002980012713055598),
 ('used_time', -0.0025158943328461103),
 ('brand_price_average', -0.4049048451010788),
 ('brand_price_min', -2.2467753486901927),
 ('power_bin', -34.42064411729223),
 ('v_14', -274.7841180773193),
 ('kilometer', -372.8975266607293),
 ('notRepairedDamage', -495.1903844629135),
 ('v_0', -2045.0549573516316),
 ('v_5', -11022.986240203032),
 ('v_4', -15121.731109853234),
 ('v_2', -26098.29992055256),
 ('v_1', -45556.18929729469)]
from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

# scatter plot of feature v_9 against the label: the predictions (blue) differ markedly from the true labels (black), and some predictions fall below zero, so the model clearly has problems
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

(Figure: scatter plot of v_9 vs. price)
Plotting the label (price) shows that it follows a long-tailed distribution, which is bad for our modeling: many models assume the error term is normally distributed, and a long-tailed label violates that assumption. Reference:
https://blog.csdn.net/Noob_daniel/article/details/76087829
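
The long tail can also be quantified numerically; a quick check (a small addition, assuming scipy is available) of the label's skewness and kurtosis, both far from the 0 expected of a normal distribution:

from scipy import stats

print('skewness: {:.2f}'.format(stats.skew(train_y)))
print('kurtosis: {:.2f}'.format(stats.kurtosis(train_y)))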

import seaborn as sns
print('It is clear that the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])

(Figure: the long-tailed distribution of price)
Here we apply a log(x+1) transform to the label to bring it closer to a normal distribution:

train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price looks close to a normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])

(Figure: distribution of the log-transformed price)
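
A side note: NumPy's log1p/expm1 pair is equivalent to the log(x+1) transform used here and its inverse, and is slightly more robust for values near zero; either form works:

train_y_ln = np.log1p(train_y)     # same as np.log(train_y + 1)
price_back = np.expm1(train_y_ln)  # inverse transform, back to the price scale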

model = model.fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:18.750745460089753
Out[34]: 
[('v_9', 8.052411927759717),
 ('v_5', 5.764248502270255),
 ('v_12', 1.6182066744745203),
 ('v_1', 1.4798302934386072),
 ('v_11', 1.1669014496988603),
 ('v_13', 0.9404706038653075),
 ('v_7', 0.7137295307890688),
 ('v_3', 0.6837865320359081),
 ('v_0', 0.00850052523853878),
 ('power_bin', 0.008497967226210152),
 ('gearbox', 0.007922377819952147),
 ('fuelType', 0.006684768278647269),
 ('bodyType', 0.004523520659139563),
 ('power', 0.0007161896117536551),
 ('brand_price_min', 3.3343530827535034e-05),
 ('brand_amount', 2.897880010254035e-06),
 ('brand_price_median', 1.2571119996582382e-06),
 ('brand_price_std', 6.65913427853293e-07),
 ('brand_price_max', 6.194957240892813e-07),
 ('brand_price_average', 5.99942948919983e-07),
 ('SaleID', 2.1194162066547266e-08),
 ('seller', 5.018723214789134e-10),
 ('offerType', 1.2497025636548642e-10),
 ('train', -5.4569682106375694e-12),
 ('brand_price_sum', -1.51265104458101e-10),
 ('name', -7.015510649965327e-08),
 ('used_time', -4.122477171058617e-06),
 ('city', -0.0022187835425505065),
 ('v_14', -0.0042341869054340845),
 ('kilometer', -0.01383586688757622),
 ('notRepairedDamage', -0.27027942062483196),
 ('v_4', -0.831569687754648),
 ('v_2', -0.9470831015207516),
 ('v_10', -1.6261473673129143),
 ('v_8', -40.34300698770074),
 ('v_6', -238.7903582804579)]
# visualize again: the predictions now track the true values closely, with no anomalies
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # expm1 inverts the log(x+1) transform
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price looks reasonable after the np.log transform')
plt.show()

(Figure: v_9 vs. price after the log transform)

## 2) Five-fold cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer
def log_transfer(func):
    # wrap a metric so it is computed on log-scale targets; this makes the
    # raw-label model's CV score comparable with the log-label model's score
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
print('AVG:', np.mean(scores))
AVG: 1.3658024042408266

Five-fold cross-validation of the linear regression model on the untransformed label (MAE 1.36).

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))
AVG: 0.19325301535176903

Five-fold cross-validation of the linear regression model on the log-transformed label (MAE 0.19).

scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
          cv1       cv2       cv3       cv4       cv5
MAE  0.190792  0.193758  0.194132  0.191825  0.195758
## 3) Simulating the real business setting
# the data has a time dimension, so instead of shuffling we train on the
# earlier 4/5 of the rows and validate on the most recent 1/5
import datetime
sample_feature = sample_feature.reset_index(drop=True)
split_point = len(sample_feature) // 5 * 4
train = sample_feature.loc[:split_point - 1].dropna()  # .loc is inclusive on both ends, so stop one row before the split point
val = sample_feature.loc[split_point:].dropna()

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))
 0.19577667040507446
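
If you want cross-validation that respects time order rather than a single cut, scikit-learn's TimeSeriesSplit implements the same idea with expanding windows; a minimal sketch, assuming the rows are already sorted chronologically:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X=train_X, y=train_y_ln,
                         cv=tscv, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))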
## 4) Plotting learning and validation curves
from sklearn.model_selection import learning_curve, validation_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training examples')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()  # draw the background grid  
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt  
plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1) 

(Figure: learning curve of the linear model)
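
The heading also promises validation curves: validation_curve (imported above but not used) scans a single hyperparameter instead of the training-set size. A minimal sketch for Ridge's alpha, with illustrative candidate values:

from sklearn.linear_model import Ridge

param_range = [0.01, 0.1, 1, 10, 100]
train_scores, test_scores = validation_curve(
    Ridge(), train_X[:1000], train_y_ln[:1000],
    param_name='alpha', param_range=param_range,
    cv=5, scoring=make_scorer(mean_absolute_error))
plt.plot(param_range, test_scores.mean(axis=1), 'o-')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('MAE')
plt.show()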

3. Comparing multiple models

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)
## 1) Linear models & embedded feature selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
Ridge is finished
Lasso is finished
## compare the performance of the three methods
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
     LinearRegression     Ridge     Lasso
cv1          0.190792  0.194832  0.383899
cv2          0.193758  0.197632  0.381893
cv3          0.194132  0.198123  0.384090
cv4          0.191825  0.195670  0.380526
cv5          0.195758  0.199676  0.383611
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:18.750720806549992

(Figure: absolute coefficients of LinearRegression)

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:4.67171085713921

(Figure: absolute coefficients of Ridge)

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
intercept:8.672182455497687

(Figure: absolute coefficients of Lasso)
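
This is why the heading calls L1 regularization "embedded feature selection": the L1 penalty drives many coefficients exactly to zero, so fitting the model selects features as a by-product. A one-line check of which features the fitted Lasso keeps:

kept = [name for name, w in zip(continuous_feature_names, model.coef_) if w != 0]
print('{} features kept by Lasso'.format(len(kept)))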

## 2) Nonlinear models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished

The middle three models (random forest, gradient boosting, and the MLP) take an especially long time to cross-validate, about two hours in total!
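
If the wait is painful, both the ensemble models and cross_val_score accept an n_jobs argument that parallelizes the work across CPU cores; a sketch, assuming enough cores and memory:

scores = cross_val_score(RandomForestRegressor(n_jobs=-1), X=train_X, y=train_y_ln,
                         verbose=0, cv=5, n_jobs=5,
                         scoring=make_scorer(mean_absolute_error))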

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
     LinearRegression  DecisionTreeRegressor  ...  XGBRegressor  LGBMRegressor
cv1          0.190792               0.199829  ...      0.142378       0.141544
cv2          0.193758               0.193254  ...      0.140922       0.145501
cv3          0.194132               0.189478  ...      0.139393       0.143887
cv4          0.191825               0.190465  ...      0.137492       0.142497
cv5          0.195758               0.204864  ...      0.143733       0.144852

[5 rows x 7 columns]

4. Model tuning

## candidate parameter pools for LightGBM:
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
# the pools below are left empty in this demo; fill in candidate values to tune them too
bagging_fraction = []
feature_fraction = []
drop_rate = []
## 1) Greedy tuning
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
sns.lineplot(x=['0_initial', '1_tuning_obj', '2_tuning_leaves', '3_tuning_depth'], 
             y=[0.143, min(best_obj.values()), min(best_leaves.values()), 
                min(best_depth.values())])

(Figure: CV MAE after tuning objective, num_leaves, and max_depth in turn)
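
The loops above repeat the same pattern, so greedy tuning is easy to generalize; a sketch of a small helper (not in the original notebook) that tunes one parameter pool at a time:

def greedy_tune(param_pools, base_params=None):
    # tune one parameter at a time, freezing each best value before moving on
    best_params = dict(base_params or {})
    for name, pool in param_pools.items():
        scores = {}
        for value in pool:
            model = LGBMRegressor(**best_params, **{name: value})
            scores[value] = np.mean(cross_val_score(
                model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                scoring=make_scorer(mean_absolute_error)))
        best_params[name] = min(scores, key=scores.get)  # keep the lowest MAE
    return best_params

# e.g. greedy_tune({'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth})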

## 2) Grid search tuning
from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y_ln)  # search on the log-transformed label, consistent with the CV above
clf.best_params_

model = LGBMRegressor(objective='regression',
                      num_leaves=55,
                      max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

 0.13754832925624427

Grid search over pools this size takes a very long time (around one to two hours here), so it is not recommended.
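
A cheaper middle ground is RandomizedSearchCV, which samples a fixed number of combinations from the same pools instead of trying them all; a minimal sketch:

from sklearn.model_selection import RandomizedSearchCV

clf = RandomizedSearchCV(LGBMRegressor(), parameters, n_iter=30, cv=5,
                         scoring=make_scorer(mean_absolute_error, greater_is_better=False))
clf = clf.fit(train_X, train_y_ln)
print(clf.best_params_)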

## 3) Bayesian tuning
pip install bayesian-optimization  # run this first (in a shell, or prefix with "!" in a notebook) if the package is missing

from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val  # BayesianOptimization maximizes, so return 1 - MAE
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)
rf_bo.maximize()
|   iter    |  target   | max_depth | min_ch... | num_le... | subsample |
-------------------------------------------------------------------------
|  1        |  0.8688   |  37.88    |  72.99    |  98.01    |  0.5618   |
|  2        |  0.8657   |  24.36    |  60.07    |  63.62    |  0.8379   |
|  3        |  0.8641   |  17.35    |  24.97    |  53.32    |  0.6477   |
|  4        |  0.8609   |  63.2     |  7.641    |  40.95    |  0.1872   |
|  5        |  0.8594   |  31.16    |  90.25    |  35.92    |  0.9085   |
|  6        |  0.8693   |  26.29    |  55.65    |  98.07    |  0.1431   |
|  7        |  0.8508   |  5.2      |  89.66    |  99.65    |  0.4294   |
|  8        |  0.8693   |  37.14    |  75.04    |  98.33    |  0.3255   |
|  9        |  0.8683   |  31.73    |  40.65    |  85.85    |  0.1572   |
|  10       |  0.8692   |  53.77    |  90.16    |  98.3     |  0.476    |
|  11       |  0.8686   |  11.37    |  35.25    |  100.0    |  0.1      |
|  12       |  0.869    |  36.21    |  11.03    |  99.62    |  0.555    |
|  13       |  0.867    |  9.852    |  2.0      |  93.34    |  0.1      |
|  14       |  0.8692   |  61.01    |  31.03    |  99.84    |  0.383    |
|  15       |  0.869    |  65.93    |  3.765    |  99.35    |  0.3418   |
|  16       |  0.8694   |  91.08    |  25.56    |  99.47    |  0.1372   |
|  17       |  0.8693   |  83.79    |  56.89    |  99.76    |  0.7144   |
|  18       |  0.8674   |  97.87    |  43.42    |  77.56    |  0.1484   |
|  19       |  0.8692   |  95.71    |  87.94    |  99.71    |  0.7504   |
|  20       |  0.8672   |  80.85    |  80.57    |  75.43    |  0.6457   |
|  21       |  0.8678   |  99.83    |  2.11     |  82.42    |  0.6184   |
|  22       |  0.802    |  99.71    |  99.35    |  3.206    |  0.4641   |
|  23       |  0.7719   |  2.0      |  2.0      |  2.0      |  0.1      |
|  24       |  0.7719   |  100.0    |  2.0      |  2.0      |  0.1      |
|  25       |  0.8665   |  50.53    |  2.0      |  68.26    |  0.1      |
|  26       |  0.7719   |  2.0      |  100.0    |  2.0      |  0.1      |
|  27       |  0.8655   |  45.85    |  100.0    |  63.7     |  0.1      |
|  28       |  0.8625   |  56.5     |  49.53    |  46.87    |  0.1      |
|  29       |  0.8663   |  75.07    |  20.53    |  68.51    |  1.0      |
|  30       |  0.8659   |  98.75    |  99.59    |  66.68    |  0.5644   |
=========================================================================
print(1 - rf_bo.max['target'])
0.13060198121645017

For the theory behind Bayesian optimization, see:
https://blog.csdn.net/weixin_44052055/article/details/107974298?utm_source=app&app_version=4.5.5
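
To actually use the result, read the best parameters back from the optimizer and retrain; note the float suggestions have to be cast back to int. A sketch:

best = rf_bo.max['params']
model = LGBMRegressor(objective='regression_l1',
                      num_leaves=int(best['num_leaves']),
                      max_depth=int(best['max_depth']),
                      subsample=best['subsample'],
                      min_child_samples=int(best['min_child_samples']))
model = model.fit(train_X, train_y_ln)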

plt.figure(figsize=(13,5))
sns.lineplot(x=['0_origin','1_log_transfer','2_L1_&_L2','3_change_model','4_parameter_tuning'], y=[1.36, 0.19, 0.19, 0.14, 0.13])

In this section we built models, tuned them, and validated the results. The basic techniques above steadily improved prediction accuracy, as shown below:
(Figure: MAE after each step, 1.36 → 0.19 → 0.19 → 0.14 → 0.13)


Summary

The above covers the modeling and tuning stage of a typical data-mining workflow: train a handful of standard models, compare their performance, keep the best two or three, and then tune those. For tuning, I would suggest greedy tuning or Bayesian optimization. Each method has its strengths and weaknesses, so analyze the specific problem and choose accordingly.
