Task04 Study Log for the "Zero-Based Introduction to Data Mining - Used Car Transaction Price Prediction" Learning Competition

SummerT1996

Tags: python, machine learning, data mining

Article link: https://blog.csdn.net/summert1996/article/details/116020253

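Preface

This article is my Task04 study log for the Tianchi learning competition "Zero-Based Introduction to Data Mining - Used Car Transaction Price Prediction". It aims to understand how predictive models are built, to build a linear regression model suited to the characteristics of the used-car data, to check how well that model fits, and to tune its parameters accordingly.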

I. Mind Map of Modeling and Parameter Tuning

[Figure: mind map of modeling and parameter tuning]

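II. Common Regression Models

1. Decision Tree Model:

Definition and types: A decision tree is a basic method for classification and regression. It is called a classification tree when used for classification and a regression tree when used for regression.

Characteristics: A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes, each representing a feature or attribute, and leaf nodes, each representing a class (or, in a regression tree, a predicted value). In general, a tree contains one root node, several internal nodes, and several leaf nodes. Leaf nodes correspond to decision results, and every other node corresponds to an attribute test. The samples in each node are divided among its child nodes according to the result of the attribute test; the root node contains the full sample set, and the path from the root to each leaf corresponds to a sequence of tests.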
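The structure described above can be sketched with a small hand-built regression tree. This is only an illustration: the feature names are borrowed from the used-car data, but the thresholds and leaf values are made up, and the nested-dict layout is an assumption chosen for readability rather than how any library stores its trees.

```python
# A hand-built regression tree stored as nested dicts (all numbers are made up).
# Internal nodes test one feature; leaf nodes hold the predicted value.
tree = {
    "feature": "kilometer", "threshold": 10.0,
    "left": {                                   # kilometer <= 10.0
        "feature": "power", "threshold": 150.0,
        "left": {"value": 6000.0},              # power <= 150.0
        "right": {"value": 12000.0},            # power > 150.0
    },
    "right": {"value": 3000.0},                 # kilometer > 10.0
}

def predict(node, sample, path=None):
    """Walk from the root to a leaf, recording each attribute test on the way."""
    path = [] if path is None else path
    if "value" in node:                         # reached a leaf: return its result
        return node["value"], path
    go_left = sample[node["feature"]] <= node["threshold"]
    path.append((node["feature"], "<=" if go_left else ">", node["threshold"]))
    return predict(node["left"] if go_left else node["right"], sample, path)

value, path = predict(tree, {"kilometer": 5.0, "power": 200.0})
print(value, path)
```

The returned `path` is the sequence of tests from the root to the leaf that produced the prediction.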

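2. GBDT Model:

Definition: GBDT is a decision tree model based on ensemble learning; in essence it learns from residuals. It can handle many types of data, achieves high accuracy, and is robust to outliers. Its drawback is that the trees are built one after another, so training cannot be parallelized across trees.

Procedure: GBDT uses an additive model and performs regression or classification by repeatedly reducing the residual produced during training. It runs for multiple rounds, and each round produces one weak learner (a CART regression tree) that is trained on the residual left by the previous rounds.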
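The residual-fitting loop can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions, not the algorithm as implemented in any library: the data are synthetic, the loss is squared error (so the residual is exactly what each round fits), and each weak learner is a one-split stump written by hand (`fit_stump` is a hypothetical helper).

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that minimises the squared error."""
    best = None
    for t in np.unique(x)[:-1]:                 # candidate split points
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)                     # synthetic feature
y = np.sin(x) + rng.normal(0, 0.1, 200)         # synthetic target

learning_rate = 0.3
prediction = np.full_like(y, y.mean())          # round 0: a constant model
mse_history = [np.mean((y - prediction) ** 2)]
for _ in range(50):
    residual = y - prediction                   # what the ensemble still gets wrong
    stump = fit_stump(x, residual)              # weak learner trained on the residual
    prediction = prediction + learning_rate * stump(x)   # additive update
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Each round depends on the predictions of all earlier rounds, which is why the trees have to be built sequentially.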

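3. XGBoost Model:

Purpose: XGBoost is mainly used for supervised learning problems, in which training data with multiple features is used to predict a target variable.

Basic idea: The same as GBDT, with several optimizations: second-order derivatives give a more accurate approximation of the loss function, a regularization term keeps the trees from overfitting, and block storage allows parallel computation. XGBoost is efficient, flexible, and portable.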
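For reference, the second-order approximation and the regularization term mentioned above are usually written as follows. The notation is an assumption taken from the standard presentation of XGBoost, not from this post: $f_t$ is the tree added in round $t$, $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous round's prediction for sample $i$, $T$ is the number of leaves, and $w_j$ is the weight of leaf $j$.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

Plain GBDT uses only the first-order term $g_i$; the $h_i$ term and the penalty $\Omega$ are the two refinements named in the text.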

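4. LightGBM Model:

LightGBM was proposed by Microsoft in 2017 as a more powerful and faster model than XGBoost, with a large improvement in performance. Compared with traditional algorithms, its advantages are low memory usage, higher accuracy, support for parallel learning, the ability to handle large-scale data, and native support for categorical features, which no longer need to be one-hot (0-1) encoded.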

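5. Linear Regression Model:

5.1 Basic Assumptions

1. The dependent variable has a linear relationship with the independent variables;
2. In repeated experiments the values of the independent variables are fixed, that is, x is assumed to be non-random and unrelated to the random error term;
3. The error term is a random variable with expectation 0;
4. The variance of the error term is the same for all x;
5. The error term follows a normal distribution.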
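Written as formulas, the five assumptions above amount to the following model. The symbols are assumptions introduced here for illustration: $p$ features $x_{i1}, \dots, x_{ip}$ for sample $i$, coefficients $\beta_0, \dots, \beta_p$, and error term $\varepsilon_i$ with common variance $\sigma^2$.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad
\mathbb{E}(\varepsilon_i) = 0,
\quad
\operatorname{Var}(\varepsilon_i) = \sigma^2,
\quad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```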

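5.2 Least Squares

Definition: Estimate the coefficients by minimizing the sum of squared deviations between the observed and predicted values of the dependent variable.
Desirable properties of the least squares estimator:
1. It minimizes the sum of squared deviations;
2. The sampling distribution of the parameter estimators is known;
3. Compared with other estimators, the parameters it produces have a smaller standard deviation under sampling.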
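As a small illustration of the definition above, the least squares coefficients can be computed directly with NumPy. This is a sketch on synthetic data; the coefficient values, the noise level, and the variable names are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))                        # two synthetic features
true_coef, true_intercept = np.array([3.0, -2.0]), 5.0
y = true_intercept + X @ true_coef + rng.normal(0, 0.1, 500)

A = np.column_stack([np.ones(len(X)), X])            # column of ones for the intercept
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)      # normal equations: (A^T A) beta = A^T y
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)   # numerically safer solver

print(beta_normal)   # close to [5, 3, -2]
print(beta_lstsq)    # agrees with the normal-equation solution
```

Both routes minimize the same sum of squared deviations; `np.linalg.lstsq` is generally the safer choice when features are strongly correlated.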

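5.3 Fitting a Linear Model to the Used-Car Data

Step 1: Read the data and use the reduce_mem_usage function to adjust the data types, which reduces the memory the data occupies. The code is as follows: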

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def reduce_mem_usage(df):
    """Reduce the memory a DataFrame occupies by downcasting each column
    to the smallest dtype that can hold its values.
    """
    # memory_usage() reports bytes; convert to MB so the printed unit is correct
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Integer column: pick the narrowest integer type that covers [c_min, c_max]
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Float column: pick the narrowest float type that covers [c_min, c_max]
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns are stored as categories
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
# 'data_for_tree.csv' is the file produced by the feature engineering step

The result is as follows (the original run printed 60507328 and 15724107 bytes, converted here to MB):

Memory usage of dataframe is 57.70 MB
Memory usage after optimization is: 15.00 MB
Decreased by 74.0%

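Step 2: Build a simple linear regression model and inspect its weights and intercept. The code is as follows:

# Every column except the target and the brand/model identifiers is used as a feature
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

# Separate the price from the other features so that true and predicted values can be compared later
train_X = train[continuous_feature_names]
train_y = train['price']

# Import the linear regression estimator
from sklearn.linear_model import LinearRegression
# The `normalize` argument was removed in scikit-learn 1.2, so it is no longer passed here
model = LinearRegression()
model = model.fit(train_X, train_y)

# Inspect the intercept and the weights (coef_) of the fitted model
print('intercept:' + str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

The result is as follows: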

[('v_6', 3342612.384537345),
 ('v_8', 684205.534533214),
 ('v_9', 178967.94192530424),
 ('v_7', 35223.07319016895),
 ('v_5', 21917.550249749802),
 ('v_3', 12782.03250792227),
 ('v_12', 11654.925634146672),
 ('v_13', 9884.194615297649),
 ('v_11', 5519.182176035517),
 ('v_10', 3765.6101415594258),
 ('gearbox', 900.3205339198406),
 ('fuelType', 353.5206495542567),
 ('bodyType', 186.51797317460046),
 ('city', 45.17354204168846),
 ('power', 31.163045441455335),
 ('brand_price_median', 0.535967111869784),
 ('brand_price_std', 0.4346788365040235),
 ('brand_amount', 0.15308295553300566),
 ('brand_price_max', 0.003891831020467389),
 ('seller', -1.2684613466262817e-06),
 ('offerType', -4.759058356285095e-06),
 ('brand_price_sum', -2.2430642281682917e-05),
 ('name', -0.00042591632723759166),
 ('used_time', -0.012574429533889028),
 ('brand_price_average', -0.414105722833381),
 ('brand_price_min', -2.3163823428971835),
 ('train', -5.392535065078232),
 ('power_bin', -59.24591853031839),
 ('v_14', -233.1604256172217),
 ('kilometer', -372.96600915402496),
 ('notRepairedDamage', -449.29703564695365),
 ('v_0', -1490.6790578168238),
 ('v_4', -14219.648899108111),
 ('v_2', -16528.55239086934),
 ('v_1', -42869.43976200439)]

Step 3: Check how well the model fits (here with a scatter plot, using feature v_9 as the example). The code is as follows:

from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
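print('The predicted price is obviously different from the true price')
plt.show()
# Predicted values are blue, true values are black

The result is as follows:
[Figure: scatter plot of true price (black) and predicted price (blue) against v_9]
Analysis: The predicted values are distributed very differently from the true values, so the model fits poorly and needs adjusting. The earlier EDA and feature engineering both showed that most variables are long-tailed, while the linear model assumes that the random error term is normally distributed. Handling the skewed (long-tailed) distribution is therefore the starting point for improving the model.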

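Step 4: Handle the long-tailed distribution.
The code for viewing the distribution of price is as follows:

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# histplot(kde=True) replaces distplot, which has been deprecated since seaborn 0.11
sns.histplot(train_y, kde=True)
plt.subplot(1,2,2)
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True)
plt.show()

The result is as follows:
[Figure: distribution of price (left) and of price below its 0.9 quantile (right)]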

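The label is then transformed with $\log(x + 1)$ to bring it closer to a normal distribution. The code for the transform and for viewing the new distribution is as follows:

train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# histplot(kde=True) replaces distplot, which has been deprecated since seaborn 0.11
sns.histplot(train_y_ln, kde=True)
plt.subplot(1,2,2)
sns.histplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)], kde=True)
plt.show()

The result is as follows:
[Figure: distribution of the log-transformed price (left) and of its values below the 0.9 quantile (right)]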
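The model is then refit on the transformed label and the scatter plot is drawn again. The code is as follows:

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# The label was transformed with log(x + 1), so predictions are mapped back with exp(x) - 1
plt.scatter(train_X['v_9'][subsample_index], np.exp(model.predict(train_X.loc[subsample_index])) - 1, color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

The result is as follows:

[Figure: scatter plot of true price (black) and predicted price (blue) against v_9 after the log transform]
Analysis: After the log transform the label is close to normally distributed. Predicting again shows that the predicted values are fairly close to the true values, with no anomalies.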
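One detail worth checking when working with a $\log(x + 1)$ label is the inverse transform: it is $\exp(x) - 1$, not plain $\exp(x)$. The following sketch uses made-up prices to show the round trip with NumPy's `log1p` and `expm1`.

```python
import numpy as np

price = np.array([0.0, 500.0, 3000.0, 99999.0])   # made-up prices
price_ln = np.log1p(price)                        # same as np.log(price + 1)
restored = np.expm1(price_ln)                     # same as np.exp(price_ln) - 1

print(np.allclose(restored, price))               # the round trip recovers the input
print(np.exp(price_ln) - price)                   # plain exp() overshoots by 1 everywhere
```

An offset of 1 is small relative to typical prices, but it is a systematic bias that is easy to avoid.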

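III. Model Performance Evaluation

Why performance evaluation is needed:
1. What we want to compare is generalization performance, but experimental evaluation only yields performance on a test set, and comparisons based on the two need not agree (I have yet to fully understand this point);
2. Performance on a test set depends heavily on how the test set was chosen: test sets of different sizes give different results, and so do different test sets of the same size;
3. The algorithms themselves involve randomness, so experiments need to be repeated.
Evaluation methods include hypothesis tests, the cross-validated t-test, and McNemar's test. This task mainly performs five-fold cross validation, simulates a real business scenario, and plots learning curves and validation curves.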

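1. Five-Fold Cross Validation

During training, the data is usually divided into three parts: a training set (train_set), a validation set (valid_set), and a test set (test_set). The split is designed to ensure and verify that the results hold up: a model may fit its own training data well yet fail to generalize to other data, so part of the collected samples has to be held out for testing. Holding out each part in turn is cross validation.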
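To make the procedure concrete, here is a minimal sketch of five-fold cross validation written by hand with NumPy. It is an illustration on synthetic data, not the scikit-learn implementation: `k_fold_indices` is a hypothetical helper, it shuffles the rows first (a choice made here), and the model is an ordinary least squares fit.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the row indices and yield (train_idx, valid_idx) for each fold."""
    indices = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                                  # synthetic features
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, 100)   # synthetic target

fold_mae = []
for train_idx, valid_idx in k_fold_indices(len(X), k=5):
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)    # fit on four folds
    A_valid = np.column_stack([np.ones(len(valid_idx)), X[valid_idx]])
    fold_mae.append(np.mean(np.abs(y[valid_idx] - A_valid @ beta)))  # score on the fifth

print(fold_mae)
print('AVG:', np.mean(fold_mae))
```

Every row is used for validation exactly once, so the average of the five scores uses all of the data without ever scoring a model on rows it was trained on.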

Five-fold cross validation of the linear regression model on the untransformed label is done with the following code:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer

def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
print('AVG:', np.mean(scores))

The result is as follows:

AVG: 1.3641908155886227
# The mean absolute error is about 1.36

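Five-fold cross validation of the linear regression model on the log-transformed label is done with the following code:

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))

The result is as follows:

AVG: 0.19382863663604424
# After the transform, the model generalizes better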

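2. Simulating a Real Business Scenario

Why simulate: five-fold cross validation is less reliable for time-series data, where the data set can instead be split in time order. Here the data is taken in time order, with the first 4/5 as the training set and the rest as the validation set. The code is as follows: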

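sample_feature = sample_feature.reset_index(drop=True)
split_point = len(sample_feature) // 5 * 4

# .loc slices include both endpoints, which would put row split_point in both sets;
# positional .iloc slices keep the training and validation rows disjoint
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()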

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_ln)

mean_absolute_error(val_y_ln, model.predict(val_X))

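The result is as follows:

0.19443858353490887
# Close to the five-fold cross-validation result, so the time ordering of the data probably has little effect on the model

3. Plotting Learning Curves and Validation Curves

The code is as follows: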

from sklearn.model_selection import learning_curve, validation_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()  # draw grid lines
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt  

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)
plt.show()

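The result is as follows:

[Figure: learning curve of the linear model, showing the training score and the cross-validation score against the number of training examples]
Analysis: The curves indicate low accuracy and high bias: the model does not fit the data well. This can be addressed by increasing the number of model parameters or by reducing the degree of regularization.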

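IV. Comparison of Multiple Models

1. Linear Model vs. Embedded Feature Selection

Embedded feature selection performs feature selection automatically while the learner is being trained. The most common embedded methods are L1 regularization and L2 regularization. Adding L1 regularization to a linear regression model gives Lasso regression, and adding L2 regularization gives ridge regression.

The code is as follows: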

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

# Compare the results of the three methods
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

The result is as follows:

     LinearRegression     Ridge     Lasso
cv1          0.191642  0.195665  0.382708
cv2          0.194986  0.198841  0.383916
cv3          0.192737  0.196629  0.380754
cv4          0.195329  0.199255  0.385683
cv5          0.194450  0.198173  0.383555

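The three fitted models are visualized with the following code:

# Linear regression
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
# seaborn 0.12 and later require x and y to be passed as keyword arguments
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
plt.show()

# Ridge regression
model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
plt.show()

# Lasso regression
model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
plt.show()

The result is as follows:
[Figure: bar charts of the absolute coefficient values for linear regression, ridge regression, and Lasso regression]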

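Analysis: L2 regularization tends to make the weights as small as possible during fitting, producing a model in which all parameters are fairly small. A model with small parameter values is generally considered simpler and better able to adapt to different data sets, which also helps to avoid overfitting to some extent. L1 regularization, by contrast, tends to produce a sparse weight vector and can therefore be used for feature selection (in the last chart, power and used_time stand out).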
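The difference between the two penalties is easiest to see in the special case of an orthonormal design matrix, where both solutions have a closed form. The sketch below assumes that special case and uses made-up coefficients; it is not what `Ridge` or `Lasso` compute on the used-car data, and the objective scalings are noted in the comments.

```python
import numpy as np

w_ols = np.array([2.5, -0.3, 0.05, -1.2, 0.4])   # made-up unregularised coefficients
alpha = 0.5                                      # regularisation strength

# Ridge (L2), minimising 0.5*||y - Xw||^2 + 0.5*alpha*||w||^2: uniform shrinkage
w_ridge = w_ols / (1.0 + alpha)
# Lasso (L1), minimising 0.5*||y - Xw||^2 + alpha*||w||_1: soft-thresholding
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - alpha, 0.0)

print(w_ridge)   # every coefficient is smaller, none is exactly zero
print(w_lasso)   # coefficients with |w| <= alpha become exactly zero
```

Ridge scales every weight by the same factor, while Lasso subtracts a fixed amount and clips at zero, which is where the sparsity comes from.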

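2. Comparison of Nonlinear Models (code as follows):

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

models = [LinearRegression(),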
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

The result is as follows:

     LinearRegression  DecisionTreeRegressor  RandomForestRegressor  GradientBoostingRegressor  MLPRegressor  XGBRegressor  LGBMRegressor
cv1          0.191642               0.184566               0.136266                   0.168626    124.299426      0.168698       0.141159
cv2          0.194986               0.187029               0.139693                   0.171905    257.886236      0.172258       0.143363
cv3          0.192737               0.184839               0.136871                   0.169553    236.829589      0.168604       0.142137
cv4          0.195329               0.182605               0.138689                   0.172299    130.197264      0.172474       0.143461
cv5          0.194450               0.186626               0.137420                   0.171206    268.090236      0.170898       0.141921

# The random forest model gives the lowest error here

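V. Model Tuning

1. Greedy Tuning

Overview: A greedy algorithm always makes the choice that looks best at the current step. It does not consider the overall optimum, so what it produces is only a locally optimal solution in some sense.
Approach: Split the problem into subproblems, find the optimal solution of each subproblem, and then combine those solutions into a solution of the original problem.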
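Applied to hyperparameters, the greedy idea means tuning one parameter at a time and freezing each choice before moving on. The sketch below illustrates this with a hypothetical scoring function, `toy_score`, standing in for a cross-validated error; the candidate values and the starting point are made up.

```python
# Hypothetical stand-in for a cross-validated MAE (lower is better).
def toy_score(num_leaves, max_depth):
    return abs(num_leaves - 40) / 100 + abs(max_depth - 10) / 50 + 0.13

candidates = {"num_leaves": [3, 10, 20, 40, 55], "max_depth": [3, 5, 10, 15, 20]}
best = {"num_leaves": 31, "max_depth": 5}        # arbitrary starting point

evaluations = 0
for name, values in candidates.items():          # tune one parameter at a time
    scores = {}
    for v in values:
        trial = dict(best, **{name: v})          # earlier choices stay frozen
        scores[v] = toy_score(**trial)
        evaluations += 1
    best[name] = min(scores, key=scores.get)     # keep the locally best value

print(best, evaluations)
```

Only 5 + 5 = 10 evaluations are needed here. The toy score has no interaction between the two parameters, so the greedy result is also the global optimum; when parameters interact, the greedy result can be only locally optimal.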

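The code is as follows:

# The candidate lists are not given in this post; the values below are
# illustrative assumptions, chosen to be consistent with the parameters used later
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]

best_obj = dict()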
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
sns.lineplot(x=['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
plt.show()

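The result is as follows:

[Figure: line plot of the best cross-validated MAE after each greedy tuning stage]

2. Grid Search Tuning

Overview: When a model performs poorly, Grid Search can be used to loop over every combination of candidate parameter values and return the combination with the best score.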
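The exhaustive loop can be sketched with `itertools.product`. As before, `toy_score` is a hypothetical stand-in for a cross-validated error and the candidate values are made up; the sketch only shows how the search enumerates combinations, not how `GridSearchCV` is implemented.

```python
from itertools import product

# Hypothetical stand-in for a cross-validated MAE (lower is better).
def toy_score(num_leaves, max_depth):
    return abs(num_leaves - 40) / 100 + abs(max_depth - 10) / 50 + 0.13

grid = {"num_leaves": [3, 10, 20, 40, 55], "max_depth": [3, 5, 10, 15, 20]}

results = []
for combo in product(*grid.values()):            # every combination: 5 * 5 = 25
    params = dict(zip(grid.keys(), combo))
    results.append((toy_score(**params), params))

best_score, best_params = min(results, key=lambda item: item[0])
print(len(results), best_params)
```

The number of evaluations is the product of the list lengths, so the cost grows quickly as parameters or candidate values are added.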

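The code is as follows:

from sklearn.model_selection import GridSearchCV

parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
# Note: the search is fitted on the raw price, whereas the score below uses the log-transformed label
clf = clf.fit(train_X, train_y)

# Re-evaluate a model with fixed parameters (presumably those selected by the search)
model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

The result is as follows: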

0.13626164479243302

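3. Bayesian Tuning

Overview: Bayesian optimization builds a surrogate function (a probabilistic model) from past evaluations of the objective function and uses it to find the value that minimizes the objective. It differs from random search and grid search in that it takes the earlier evaluation results into account when choosing the next set of hyperparameters, which saves a lot of wasted effort.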

The code is as follows:

from bayes_opt import BayesianOptimization

def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)
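rf_bo.maximize()
# rf_cv returns 1 - MAE because the optimizer maximizes its target, so the best MAE is 1 - target
1 - rf_bo.max['target']

The result is as follows: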

0.1296693644053145
