Used Car Price Prediction Task 04: Modeling and Hyperparameter Tuning

DDxuexi

Category column: Used Car Price Prediction

Original link: https://blog.csdn.net/sinat_38069794/article/details/116021623

The code examples follow.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

1. Reading the data

Define a function that reduces memory usage.

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # memory_usage() reports bytes, so convert to MB before printing
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the smallest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # note: float16 keeps only about 3 significant decimal digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object columns are stored as categories
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv('task03_data_for_tree.csv'))

Memory usage of dataframe is 109.39 MB
Memory usage after optimization is: 29.06 MB
Decreased by 73.4%
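
As a complement to the hand-written function, pandas also ships a built-in downcasting helper. The sketch below applies `pd.to_numeric(..., downcast=...)` to a small made-up DataFrame (the column names and values are illustrative, not taken from the competition data). This helper never goes below float32, so it avoids the precision loss that float16 can introduce.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are made up for illustration.
toy = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),         # fits in int8
    "big_int": np.arange(100, dtype=np.int64) * 1000,    # needs int32
    "ratio": np.linspace(0.0, 1.0, 100),                 # float64
})

before = toy.memory_usage(deep=True).sum()

compact = toy.copy()
for col in compact.columns:
    if pd.api.types.is_integer_dtype(compact[col]):
        compact[col] = pd.to_numeric(compact[col], downcast="integer")
    elif pd.api.types.is_float_dtype(compact[col]):
        compact[col] = pd.to_numeric(compact[col], downcast="float")

after = compact.memory_usage(deep=True).sum()
print(compact.dtypes)
print("bytes before: {}, after: {}".format(before, after))
```

Whether float32 is precise enough still depends on the feature, so it is worth checking the columns that matter most after any downcast.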

sample_feature.shape

(298710, 48)

continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]

# continuous_feature_names

2. Linear regression, five-fold cross-validation, and simulating a real business scenario

1. Linear regression

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
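# The next line converts a categorical object column that holds numbers to float; it is not needed here because the column is already numeric
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

# Build the linear regression inputs from the numeric features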
train_X = train[continuous_feature_names]
train_y = train['price']

from sklearn.linear_model import LinearRegression

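# The normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this line needs an older scikit-learn version to run as written
model = LinearRegression(normalize=True)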

model = model.fit(train_X, train_y)

Inspect the intercept and the weights (coef) of the trained linear regression model.

print('intercept:'+ str(model.intercept_))

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:-489230.9248715933





[('v_10', 5636839.734320901),
 ('v_12', 4902516.854836175),
 ('v_11', 1697334.587547197),
 ('v_14', 1463625.437577964),
 ('v_15', 278664.3642238767),
 ('v_20', 71282.44674469868),
 ('v_2', 55982.92799075289),
 ('v_4', 38267.28637228996),
 ('v_18', 38082.76781164496),
 ('v_1', 22887.32173232099),
 ('v_22', 5002.381963558255),
 ('v_6', 3753.5741668999203),
 ('offerType', 1049.23881005371),
 ('v_0', 910.1437626611769),
 ('power', 29.5159225315999),
 ('fuelType', 25.299609226607554),
 ('v_23', 3.245172141201384),
 ('power_bin', 2.0203143408343274),
 ('brand_price_min', 1.8499808234363577),
 ('brand_price_std', 0.2531165329655573),
 ('used_time', 0.13527874322030056),
 ('brand_price_median', 0.1330657460878217),
 ('brand_price_average', 0.1315855365748243),
 ('brand_amount', 0.10355654793376057),
 ('regionCode', 0.03450227722323108),
 ('name', 0.0027497995293785642),
 ('train', -1.1175870895385742e-08),
 ('brand_price_sum', -1.6427359316569683e-05),
 ('SaleID', -0.0009815602590379285),
 ('brand_price_max', -0.005517713184601633),
 ('bodyType', -161.13254845951934),
 ('kilometer', -309.6087187322273),
 ('v_7', -2340.6530332991388),
 ('seller', -4459.549388181375),
 ('v_21', -4624.9115978103655),
 ('v_17', -7423.7297744074685),
 ('v_3', -13301.882052954807),
 ('v_16', -13399.802485426297),
 ('gearbox', -23843.482371478265),
 ('notRepairedDamage', -34060.4108633693),
 ('v_19', -44481.928749009145),
 ('v_5', -72478.33016016654),
 ('v_9', -274938.4606576606),
 ('v_8', -533328.1944080114),
 ('v_13', -2529428.705515671)]

from matplotlib import pyplot as plt

# Build 50 random indices used to sample points for plotting
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

The predicted price is obviously different from the true price

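[Figure: scatter plot of v_9 against true price (black) and predicted price (blue)]

Plotting feature v_9 against the label shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predictions fall below 0, which indicates that the model has problems.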

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
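sns.distplot(train_y)  # distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the newer API
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)]) # the 0.9 quantile of price is only about 15000, while the raw data exceeds 100000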

It is clear to see the price shows a typical exponential distribution





<AxesSubplot:xlabel='price', ylabel='Density'>

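[Figure: distribution of price, full range (left) and values below the 0.9 quantile (right)]

The plots show that the label (price) follows a long-tailed distribution, which is unfavorable for modeling. The reason is that many models assume the error term is normally distributed, and long-tailed data violates this assumption. Reference blog: https://blog.csdn.net/Noob_daniel/article/details/76087829

Here we apply a $\log(x + 1)$ transform to the label so that it is closer to a normal distribution.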

train_y_ln = np.log(train_y + 1)
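
A transform of this kind has to be inverted with the matching function when predictions are mapped back to prices. The short sketch below, on made-up prices, shows that `np.log1p` and `np.expm1` are the NumPy equivalents of $\log(x + 1)$ and its inverse $e^{x} - 1$, and what happens when the final subtraction is forgotten.

```python
import numpy as np

# Made-up prices, for illustration only.
prices = np.array([0.0, 150.0, 3200.0, 15000.0, 99999.0])

log_prices = np.log1p(prices)       # same as np.log(prices + 1)
recovered = np.expm1(log_prices)    # same as np.exp(log_prices) - 1

print(np.allclose(log_prices, np.log(prices + 1)))
print(np.allclose(recovered, prices))

# Using np.exp alone on the way back leaves every price too high by 1.
offset = np.exp(log_prices) - prices
print(offset)
```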

print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])

The transformed price seems like normal distribution





<AxesSubplot:xlabel='price', ylabel='Density'>

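[Figure: distribution of the log-transformed price, full range (left) and values below the 0.9 quantile (right)]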

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:-41.390128156306965





[('v_12', 1479.937176591679),
 ('v_10', 263.8860957474769),
 ('v_14', 188.64202712829922),
 ('v_13', 98.35678052238535),
 ('v_19', 27.41703634258705),
 ('v_2', 14.801674452065448),
 ('v_20', 12.004690194651229),
 ('v_1', 10.461875872806372),
 ('v_18', 4.333754407039257),
 ('v_11', 3.8951678934350342),
 ('v_22', 1.6015312781012703),
 ('v_6', 1.580044457114316),
 ('v_3', 0.9138385740100368),
 ('offerType', 0.0671835633947501),
 ('fuelType', 0.01050092480772997),
 ('v_8', 0.009587908138661881),
 ('v_0', 0.009437028031935945),
 ('power', 0.00020190199561082603),
 ('kilometer', 0.00020062442315623685),
 ('brand_price_min', 9.970622670295317e-05),
 ('bodyType', 6.305404255790717e-05),
 ('brand_price_average', 3.699097172665082e-05),
 ('used_time', 1.6205716931555507e-06),
 ('name', 1.245790216871555e-07),
 ('brand_price_sum', 2.2915114440637423e-10),
 ('train', -9.094947017729282e-13),
 ('brand_price_max', -1.8026988054389394e-08),
 ('SaleID', -4.054520711096229e-08),
 ('regionCode', -9.57241565932528e-07),
 ('brand_amount', -1.0417851297716551e-06),
 ('brand_price_std', -1.4925350412634868e-05),
 ('brand_price_median', -2.975384834913128e-05),
 ('power_bin', -0.0008864684425239622),
 ('v_23', -0.012860910466005836),
 ('seller', -0.07314739535722348),
 ('v_16', -0.10817857454314908),
 ('gearbox', -0.145743376967758),
 ('v_7', -1.2171851982208803),
 ('v_21', -1.59798911562427),
 ('notRepairedDamage', -3.7889632520279513),
 ('v_5', -12.805727735975681),
 ('v_17', -13.67826666008227),
 ('v_4', -24.025293773546313),
 ('v_15', -24.54190449070495),
 ('v_9', -234.70077086474745)]

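Visualizing again, the predictions are now fairly close to the true values and no anomalies appear.

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# invert log(x + 1) with expm1, i.e. exp(x) - 1
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')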
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

The predicted price seems normal after np.log transforming

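[Figure: scatter plot of v_9 against true price (black) and predicted price (blue) after the log transform]

2. Five-fold cross-validation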

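from sklearn.model_selection import cross_val_score # cross-validation lives in model_selection
from sklearn.metrics import mean_absolute_error,  make_scorer

# Wraps a metric so that it is evaluated in log space.
# It is defined here but not used by the cross_val_score calls that follow.
def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper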

scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.4s finished

Five-fold cross-validation of the linear regression model on the data with the untransformed label:

print('AVG:', np.mean(scores))

AVG: 2324.3864212351245

scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores

	cv1	cv2	cv3	cv4	cv5
MAE	2343.234585	2313.131291	2311.801086	2347.464498	2306.300646

Five-fold cross-validation of the linear regression model on the data with the transformed label:

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.2s finished

print('AVG:', np.mean(scores))

AVG: 0.09217568560497175

scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores

	cv1	cv2	cv3	cv4	cv5
MAE	0.092562	0.091428	0.092415	0.092054	0.092418
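
The two averages are not directly comparable: the first is an MAE in price units, the second an MAE in log units. One way to train on the log-transformed label while still scoring in price units is scikit-learn's `TransformedTargetRegressor`, which applies the transform before fitting and inverts it at prediction time. The sketch below uses synthetic data (the data-generating formula is an assumption made for illustration), so its numbers say nothing about the competition data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Synthetic long-tailed target that is linear in log space.
log_target = 8.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.1, size=500)
y = np.expm1(log_target)

log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to the regressor's predictions
)

mae = make_scorer(mean_absolute_error)
raw_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring=mae)
log_scores = cross_val_score(log_model, X, y, cv=5, scoring=mae)

print("MAE, raw target:", raw_scores.mean())
print("MAE, log target (scored in original units):", log_scores.mean())
```

Because both scores are now in the same units, they can be compared fold by fold.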

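3. Simulating a real business scenario

In practice we cannot foresee the future, so on some time-dependent datasets five-fold cross-validation gives an unrealistic picture. Using 2018 used-car prices to predict 2017 prices is clearly unreasonable, so we can also split the dataset in time order. In this example, we take the earliest 4/5 of the samples as the training set and the latest 1/5 as the validation set. The final result differs little from five-fold cross-validation.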

import datetime

sample_feature = sample_feature.reset_index(drop=True)

split_point = len(sample_feature) // 5 * 4

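# Split into training and validation sets.
# .iloc is position-based and end-exclusive, so row split_point is not in both sets
# (a label-based .loc slice is inclusive at both ends).
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()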

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_ln)

mean_absolute_error(val_y_ln, model.predict(val_X))

0.09242880401383756
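
The split in this section relies on the rows already being in time order. When that is not guaranteed, the data can be sorted explicitly first. The following sketch assumes a hypothetical `sale_date` column (not a column confirmed to exist in `task03_data_for_tree.csv`) and uses a small synthetic frame.

```python
import numpy as np
import pandas as pd

n = 20
# Synthetic frame with a hypothetical date column; rows are shuffled on purpose.
toy = pd.DataFrame({
    "sale_date": pd.date_range("2016-01-01", periods=n, freq="D"),
    "price": np.arange(n) * 100 + 500,
}).sample(frac=1.0, random_state=0).reset_index(drop=True)

# Restore time order before splitting.
ordered = toy.sort_values("sale_date").reset_index(drop=True)
split_point = len(ordered) // 5 * 4

# Position-based, end-exclusive slices: the two parts do not overlap.
train_part = ordered.iloc[:split_point]
val_part = ordered.iloc[split_point:]

print(len(train_part), len(val_part))
print(train_part["sale_date"].max() < val_part["sale_date"].min())
```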

4. Plotting learning curves and validation curves

from sklearn.model_selection import learning_curve, validation_curve

? learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()  # draw the background grid
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)

<module 'matplotlib.pyplot' from 'e:\\app\\anaconda\\envs\\python3.7\\lib\\site-packages\\matplotlib\\pyplot.py'>

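[Figure: learning curve of the linear model, training and cross-validation MAE against the number of training examples]

1) Linear models & embedded feature selection

In filter and wrapper feature selection methods, the feature selection process is clearly separate from the training of the learner. Embedded feature selection instead performs feature selection automatically while the learner is trained. The most common embedded approaches are L1 and L2 regularization. Adding them to a linear regression model gives Lasso regression (L1) and ridge regression (L2), respectively.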

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

LinearRegression is finished
Ridge is finished
Lasso is finished

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

	LinearRegression	Ridge	Lasso
cv1	0.092762	0.116542	0.316228
cv2	0.091228	0.114048	0.314371
cv3	0.092115	0.116265	0.314932
cv4	0.091450	0.114239	0.310759
cv5	0.092392	0.115881	0.318162

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
# keyword arguments work across seaborn versions; newer releases reject positional x and y
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)

intercept:-39.073530970796035





<AxesSubplot:>

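[Figure: bar plot of the absolute values of the LinearRegression coefficients for each feature]

L2 regularization tends to keep the weights as small as possible during fitting, producing a model in which all parameters are fairly small. Models with small parameters are generally considered simpler, adapt better to different datasets, and avoid overfitting to some extent. Consider a linear regression equation: if the parameters are very large, a slight shift in the data has a large effect on the result; if the parameters are small enough, even a larger shift has little effect. The technical term for this is strong robustness to perturbation.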

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
# keyword arguments work across seaborn versions; newer releases reject positional x and y
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)

intercept:18.037255935852176





<AxesSubplot:>

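[Figure: bar plot of the absolute values of the Ridge coefficients for each feature]

L1 regularization helps produce a sparse weight matrix, which can then be used for feature selection. In the figure below, the power and used_time features turn out to be very important.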

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
# keyword arguments work across seaborn versions; newer releases reject positional x and y
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)

intercept:7.975987324277401





<AxesSubplot:>

[Figure: bar plot of the absolute values of the Lasso coefficients for each feature]
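
To make the contrast between the two penalties concrete, the sketch below fits `Ridge` and `Lasso` on synthetic data in which only two of ten features influence the target. The data and the `alpha` values are illustrative assumptions, not settings from this article.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 carry signal; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
kept = np.flatnonzero(lasso.coef_)

print("exact zeros, Ridge:", ridge_zeros)
print("exact zeros, Lasso:", lasso_zeros)
print("features kept by Lasso:", kept)
```

Ridge shrinks the noise coefficients toward zero without removing them, while Lasso sets them to exactly zero, which is what makes it usable as a feature selector.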

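Beyond this, when a decision tree chooses split nodes by information entropy or the Gini index, the features chosen first for splitting are also the more important ones, which is likewise a form of feature selection. The feature importance scores of XGBoost and LightGBM models (exposed as `feature_importances_` in their scikit-learn wrappers) are computed on this basis.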
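
As a small illustration of tree-based importance, the sketch below uses scikit-learn's `DecisionTreeRegressor` on synthetic data and reads its `feature_importances_` attribute, the same attribute name that the scikit-learn wrappers of XGBoost and LightGBM provide. The data is made up, and note that a regression tree splits on error reduction rather than on entropy or Gini.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
# Feature 2 dominates the target, feature 0 has a weak effect, 1 and 3 are noise.
y = 10.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

importances = tree.feature_importances_     # normalized so that they sum to 1
ranking = np.argsort(importances)[::-1]     # feature indices, most important first

print(np.round(importances, 3))
print("most important feature index:", ranking[0])
```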

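2) Nonlinear models

Besides linear models, there are many commonly used nonlinear models, listed below. Space is limited, so their principles are not explained one by one here. We selected some of the commonly used models and compared them with the linear model.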

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR  # SVR is the regression counterpart of SVC (imported but not used below)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
#           RandomForestRegressor(),
#           GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

LinearRegression is finished
DecisionTreeRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

	LinearRegression	DecisionTreeRegressor	MLPRegressor	XGBRegressor	LGBMRegressor
cv1	0.092762	0.109677	1169.113918	0.083229	0.086999
cv2	0.091228	0.109357	900.424733	0.082707	0.084173
cv3	0.092115	0.110854	431.206292	0.082923	0.086228
cv4	0.091450	0.109760	490.508170	0.082975	0.086228
cv5	0.092392	0.109964	1214.548615	0.083191	0.087028

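The XGBRegressor and LGBMRegressor models achieve a lower MAE than the other models in every fold.

3. Hyperparameter tuning

Here we introduce three commonly used tuning methods:

Greedy search https://www.jianshu.com/p/ab89df9759c8
Grid search https://blog.csdn.net/weixin_43172660/article/details/83032029
Bayesian optimization https://blog.csdn.net/linxid/article/details/81189154

## Parameter sets for LGB: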

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

1) Greedy tuning

best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score

sns.lineplot(x=['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])

<AxesSubplot:>

[Figure: line plot of the best MAE after each greedy tuning step]
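
The three loops in this section follow one pattern: tune a single parameter at a time and freeze its best value before moving on to the next. A compact sketch of that pattern is shown here on synthetic data, with scikit-learn's `DecisionTreeRegressor` standing in for `LGBMRegressor`. The helper name `greedy_search` and the candidate values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

def greedy_search(make_model, grid, X, y):
    """Tune parameters one at a time, keeping the best value found so far."""
    scorer = make_scorer(mean_absolute_error)
    best_params, history = {}, []
    for name, candidates in grid.items():        # dict order is the tuning order
        scores = {}
        for value in candidates:
            model = make_model(**best_params, **{name: value})
            scores[value] = cross_val_score(model, X, y, cv=5, scoring=scorer).mean()
        best_value = min(scores, key=scores.get)  # lower MAE is better
        best_params[name] = best_value
        history.append(scores[best_value])
    return best_params, history

grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
best_params, history = greedy_search(
    lambda **kw: DecisionTreeRegressor(random_state=0, **kw), grid, X, y
)
print(best_params)
print(history)
```

Greedy tuning evaluates the sum of the candidate counts rather than their product, which keeps it cheap, but it can miss combinations in which parameters interact.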

2) Grid search tuning

from sklearn.model_selection import GridSearchCV

# parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
# model = LGBMRegressor()
# clf = GridSearchCV(model, parameters, cv=5, scoring='neg_mean_absolute_error')
# clf = clf.fit(train_X, train_y_ln)

# clf.best_params_

model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)

np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

0.08052112555416666
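
For reference, a runnable miniature of a grid search is sketched below on synthetic data, with `DecisionTreeRegressor` standing in for `LGBMRegressor` (an assumption made to keep the example light). `GridSearchCV` maximizes its score, so MAE is passed as the built-in `'neg_mean_absolute_error'` scorer and the sign is flipped when reading the result.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",  # higher is better, hence the negation
)
search.fit(X, y)

best_mae = -search.best_score_
n_candidates = len(search.cv_results_["params"])
print(search.best_params_)
print("best CV MAE:", best_mae, "over", n_candidates, "candidates")
```

Every combination in the grid is evaluated, so the cost grows with the product of the list lengths.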

3) Bayesian optimization tuning

from bayes_opt import BayesianOptimization

# !pip install bayesian-optimization

def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val

rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)

rf_bo.maximize()

1 - rf_bo.max['target']
