Used-car-Task-4

Latest recommended article published on 2022-04-05 18:15:05

花下和风

Category: Used-car. Tags: machine learning, data analysis, python, data mining

Article link: https://blog.csdn.net/weixin_45242930/article/details/105163007


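Modeling and hyperparameter tuning

Learning objectives

Understand the commonly used machine learning models
Master the machine learning modeling workflow
Master the tuning workflow

Contents

Linear regression model:

What linear regression requires of its features;
Handling long-tailed distributions;
Understanding the linear regression model;

Model performance validation:

Evaluation functions and objective functions;
Cross-validation;
Leave-one-out validation;
Validation for time-series problems;
Plotting learning curves;
Plotting validation curves;

Embedded feature selection:

Lasso regression;
Ridge regression;
Decision trees;

Model comparison:

Common linear models;
Common nonlinear models;

Model tuning:

Greedy tuning;
Grid search;
Bayesian optimization;

Code

Import modules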

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

The reduce_mem_usage function downcasts column data types to reduce the memory the data occupies.

def reduce_mem_usage(df):
    """
    Downcast column dtypes to reduce the DataFrame's memory footprint.
    """
    start_mem = df.memory_usage().sum()  # in bytes
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem / 1024**2))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem / 1024**2))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
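
To make the downcasting idea concrete, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical, and the loop only handles integer columns; it illustrates the principle rather than reproducing reduce_mem_usage exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),         # values 0..99 fit in int8
    "big_int": np.arange(100, dtype=np.int64) * 1000,    # values up to 99000 need int32
})

before = df.memory_usage(index=False).sum()  # bytes used by the two columns

for col in df.columns:
    c_min, c_max = df[col].min(), df[col].max()
    # Pick the narrowest signed integer type whose range contains the column.
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= c_min and c_max <= info.max:
            df[col] = df[col].astype(dtype)
            break

after = df.memory_usage(index=False).sum()
print(df.dtypes.to_dict())
print(f"{before} bytes -> {after} bytes")
```

The saving comes entirely from storing each value in fewer bytes, so it depends on the value range of each column. Downcasting floats to float16, as the function above does, also trades away precision.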

How to optimize memory and speed up data loading in pandas (with annotated code)
https://blog.csdn.net/wlx19970505/article/details/102920112?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522158538403619724845039284%2522%252C%2522scm%2522%253A%252220140713.130056874…%2522%257D&request_id=158538403619724845039284&biz_id=0&utm_source=distribute.pc_search_result.none-task

Load the data

sample_feature = reduce_mem_usage(pd.read_csv(r'E:\Train_Test_data\second_hand_car\data_for_tree.csv'))

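Check the result of the optimization: the data's memory usage is compressed to 73.1%.
[Screenshot of the output]
Select the continuous features and collect them in a list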


continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]

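Linear regression & five-fold cross-validation & simulating a real business scenario

Clean the features and build the training and test sets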

sample_feature = sample_feature.dropna().replace('-',0).reset_index(drop=True)
# Replace '-' with 0; the data contains many of them
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
# Cast the column to float32
train = sample_feature[continuous_feature_names + ['price']]
train_X = train[continuous_feature_names]
train_y = train['price']

df.reset_index(drop=True): resets the index and does not keep the old index as a column

Simple modeling: linear regression

Import modules

from sklearn.linear_model import LinearRegression
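# normalize was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions, drop this argument
model = LinearRegression(normalize =True)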
model = model.fit(train_X, train_y)

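Linear regression parameters:
fit_intercept: bool, default True
Whether to calculate the intercept for the model. If False, no intercept is fitted and the input data is expected to be centered already.
normalize: bool, default False
Whether to normalize the features before fitting. Ignored when fit_intercept is False. (Deprecated in scikit-learn 1.0 and removed in 1.2.)
copy_X: bool, default True
Whether to copy X. If False, X may be overwritten in place by the centering and normalization steps.
n_jobs: int, default None (1 in older scikit-learn versions)
The number of jobs used for the computation; -1 means all CPUs. It only speeds things up for sufficiently large problems with more than one target (n_targets > 1).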

Inspect the intercept and weights (coef) of the fitted linear regression model

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

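zip() - pairs up the elements of two iterables and returns an iterator of tuples
dict() - builds a dictionary from those (key, value) tuples
items() - returns a view of the dictionary's (key, value) pairs that can be iterated over
sorted(iterable, key=None, reverse=False) - the sorting function (Python 3 no longer accepts cmp)
iterable - the list or other iterable to sort
key - a function that picks the value to sort by; here x[1] sorts by the second element of each tuple, i.e. the coefficient
reverse - a bool that selects ascending or descending order; the default False means ascending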
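
As a small illustration of how these pieces combine, the sketch below ranks a few made-up feature weights. The names and numbers are hypothetical, not values from the fitted model.

```python
names = ["power", "kilometer", "v_0"]   # hypothetical feature names
weights = [0.5, -1.2, 0.03]             # hypothetical coefficients

# zip pairs each name with its weight; dict turns the pairs into a mapping.
pairs = dict(zip(names, weights))

# Sort the (name, weight) tuples by weight, largest first.
ranked = sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)

# Sorting by absolute value instead ranks features by the size of their effect.
by_magnitude = sorted(pairs.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(by_magnitude)
```

Sorting by the signed value puts large negative coefficients last, so the absolute-value variant is often the more useful ranking when judging feature influence.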

Plot a scatter of the true and predicted price against v_9

from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index],color='black')
plt.scatter(train_X['v_9'][subsample_index],model.predict(train_X.loc[subsample_index]),color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price' , 'Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

Output
In the figure above, some predicted prices are negative, which is unrealistic.
Next, check the distribution of price.

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
# np.quantile(train_y, 0.9) gives the 90% quantile, a simple long-tail truncation; compared with the left panel, the right panel is spread over a better range
sns.distplot(train_y[train_y < np.quantile(train_y,0.9)])

Output
Apply a log(x+1) transform to bring the data closer to a normal distribution

train_y_ln = np.log(train_y + 1)
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln,0.9)])

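Output
The left panel is closer to normal than the right panel, which has the long-tail truncation applied; this shows that truncating the tail is unnecessary after the log transform.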
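
A label transformed with log(x+1) has to be mapped back with exp(x) - 1 before it can be compared with real prices. The following sketch, using made-up prices, shows the round trip with NumPy's log1p and expm1, which compute the same two functions.

```python
import numpy as np

price = np.array([0.0, 99.0, 999.0, 99999.0])  # hypothetical prices

y_ln = np.log1p(price)        # same as np.log(price + 1)
restored = np.expm1(y_ln)     # same as np.exp(y_ln) - 1

# Forgetting the "- 1" leaves every value off by one.
off_by_one = np.exp(y_ln)

print(restored)
print(off_by_one - price)
```

log1p is also defined at a price of 0, which is the reason for adding 1 before taking the logarithm.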
Train again

model = model.fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

# Visualize again
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# Invert log(price + 1) with exp(x) - 1 to get back to the price scale
plt.scatter(train_X['v_9'][subsample_index], np.exp(model.predict(train_X.loc[subsample_index])) - 1, color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

Output
Better than the first attempt; the predictions match more closely.

Five-fold cross-validation

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,make_scorer
def log_transfer(func):
    def wrapper(y,yhat):
        result = func(np.log(y),np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
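# np.nan_to_num(x) replaces NaN values in x with 0
# Five-fold cross-validation of the linear regression model on the untransformed label (Error 1.36)
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
# verbose=1 prints progress; make_scorer() is a factory function that builds a custom scorer
# log_transfer() here applies log to both the true and the predicted labels before scoring
# Mean of the five MAE values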
print('AVG:',np.mean(scores))

Output
Five-fold cross-validation of the linear regression model on the log-transformed label (Error 0.19)

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))

Output

scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv'+str(x) for x in range(1,6)]
scores.index = ['MAE']
scores
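
To show what five-fold cross-validation does under the hood, here is a hand-rolled sketch that uses only NumPy. The data is synthetic and the helper kfold_mae is a hypothetical name; it mirrors the idea behind cross_val_score with cv=5 (contiguous folds, no shuffling) rather than its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

def kfold_mae(X, y, k=5):
    """Return the validation MAE of a least-squares fit for each of k folds."""
    all_idx = np.arange(len(y))
    folds = np.array_split(all_idx, k)          # contiguous folds, no shuffling
    scores = []
    for val_idx in folds:
        train_idx = np.setdiff1d(all_idx, val_idx)
        # Append a column of ones so the solution includes an intercept.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(np.mean(np.abs(y[val_idx] - A_val @ coef)))
    return np.array(scores)

scores_demo = kfold_mae(X, y)
print(scores_demo, scores_demo.mean())
```

Each sample is used for validation exactly once, and the mean of the five scores is the figure reported as AVG above.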

Output

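Simulating a real business scenario: taking time into account

In practice, however, we cannot see into the future, so five-fold cross-validation can paint an unrealistic picture on time-dependent datasets. Using 2018 used-car prices to predict 2017 prices is clearly unreasonable, so we can also split the dataset in time order. In this example, the earliest 4/5 of the samples form the training set and the latest 1/5 form the validation set; the final result differs little from five-fold cross-validation.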

# Reset the index
sample_feature = sample_feature.reset_index(drop=True)
# Split the dataset; iloc excludes the end position, so the row at split_point is only in the validation set
split_point = len(sample_feature)//5*4
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()
# Training set
train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price']+1)
# Validation set
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price']+1)
model = model.fit(train_X , train_y_ln)
mean_absolute_error(val_y_ln , model.predict(val_X))

Output
This result is not much different from the earlier one
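
The code above uses a single time-ordered split. As a related option that the original post does not use, scikit-learn's TimeSeriesSplit produces several such splits in which the training rows always precede the validation rows. A minimal sketch on eight placeholder rows, assumed to be sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)   # 8 placeholder rows, assumed sorted by time

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in splits:
    # Every training index is earlier than every validation index.
    print("train:", train_idx, "val:", val_idx)
```

The splitter can be passed as cv=TimeSeriesSplit(n_splits=...) to cross_val_score. It only makes sense if the rows really are in chronological order.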

Plotting learning curves and validation curves

from sklearn.model_selection import learning_curve, validation_curve
# ? learning_curve  (IPython help syntax; run it in a notebook to view the documentation)
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, 
                        train_size=np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, 
                                                            n_jobs=n_jobs, train_sizes=train_size, 
                                                            scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()  # grid lines; the shaded bands drawn below show one standard deviation
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt

Output
Something is wrong with my data here
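
Since the plot could not be produced on the real data, the sketch below shows what learning_curve returns on synthetic data; the data and the coefficient values are assumptions made for illustration. It prints the mean training and validation MAE for each training-set size instead of plotting them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                   # synthetic features
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=200)

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring=make_scorer(mean_absolute_error),   # the score is MAE, so lower is better
)

# Each row holds one training-set size, each column one of the 5 folds.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={int(n):3d}  train MAE={tr:.3f}  validation MAE={va:.3f}")
```

When the two curves converge to a similar value as the sample count grows, more data is unlikely to help much; a persistent gap between them points to overfitting.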

Comparing multiple models

train = sample_feature[continuous_feature_names + ['price']].dropna()
train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y+1)

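Linear models & embedded feature selection

Explaining overfitting in plain language: https://www.zhihu.com/question/32246256/answer/55320482
Model complexity and generalization ability: http://yangyingming.com/article/434/
An intuitive understanding of regularization: https://blog.csdn.net/jinping_shi/article/details/52433975
In filter and wrapper feature selection, the selection step is clearly separate from training the learner, whereas embedded feature selection performs the selection automatically while the learner is trained. The most common embedded methods are L1 and L2 regularization. Adding them to a linear regression model gives Lasso regression (L1) and ridge regression (L2), respectively.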

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),Ridge(),Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model,X=train_X , y=train_y_ln ,
                             verbose=0,cv=5 , scoring = make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name+' is finished')

result = pd.DataFrame(result)
result.index=['cv'+str(x) for x in range(1,6)]
result

Output
Model interpretability

model = LinearRegression().fit(train_X,train_y_ln)
print('intercept:'+str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)

Output
A few of the features show abnormal coefficient values

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
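sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
# L2 regularization tends to keep the weights as small as possible during fitting, producing a model in which all parameters are fairly small. Models with small parameters are generally considered simpler, adapt to different datasets, and avoid overfitting to some extent. Consider a linear regression equation: if a parameter is large, a small shift in the data changes the result a lot; if the parameters are small enough, even a larger shift has little effect on the result. The technical term is "robust to perturbation".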

Output

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
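sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
# L1 regularization helps produce a sparse weight vector, which can then be used for feature selection. In the figure below, the power and used_time features turn out to be very important.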

Output
These two features clearly have the larger influence on the model
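
The different behavior of the two penalties can be checked on synthetic data where only two of ten features carry signal. This is a minimal sketch; the data, the coefficients, and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Only the first two features carry signal; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```

Lasso sets coefficients exactly to zero, which is why it doubles as a feature selector, while ridge only shrinks them toward zero.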

Nonlinear models

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC  # for classification problems; not used below
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
# List of models
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

Output

Model tuning

Greedy algorithm https://www.jianshu.com/p/ab89df9759c8
Grid search https://blog.csdn.net/weixin_43172660/article/details/83032029
Bayesian optimization https://blog.csdn.net/linxid/article/details/81189154

## LightGBM parameter candidates:

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

Greedy tuning

best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
# Compare the best score reached at each tuning step
sns.lineplot(x=['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])

Compare the MAE
Output
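
The three loops above follow one pattern: tune a parameter, freeze its best value, then move on to the next. A generic sketch of that pattern is shown here; greedy_search and toy_score are hypothetical names, and toy_score is a stand-in for a cross-validated MAE, not a real model evaluation.

```python
def greedy_search(score_fn, grid):
    """Tune one parameter at a time, keeping earlier choices fixed (lower score is better)."""
    best = {}
    history = []
    for name, candidates in grid.items():
        # Score every candidate for this parameter with the earlier winners held fixed.
        scored = {value: score_fn({**best, name: value}) for value in candidates}
        best[name] = min(scored, key=scored.get)
        history.append((name, best[name], scored[best[name]]))
    return best, history

# Hypothetical stand-in for a cross-validated MAE; a real run would call cross_val_score here.
def toy_score(params):
    leaves = params.get("num_leaves", 31)
    depth = params.get("max_depth", 10)
    return abs(leaves - 40) / 100 + abs(depth - 15) / 100

grid = {
    "num_leaves": [3, 5, 10, 15, 20, 40, 55],
    "max_depth": [3, 5, 10, 15, 20, 40, 55],
}
best, history = greedy_search(toy_score, grid)
print(best)
print(history)
```

Greedy tuning is cheap because the number of evaluations grows with the sum of the candidate counts, but it can miss the best combination when parameters interact.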

GridSearch tuning

Grid search

from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
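# Note: this search fits the raw price (train_y) with the estimator's default R^2 score, while the evaluation below uses the log price and MAE
clf = clf.fit(train_X, train_y)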

clf.best_params_
model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

The MAE is
Output
The best parameters found are
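
Grid search is exhaustive, so its cost is worth estimating before running it. The arithmetic below uses the candidate lists defined earlier in this post and cv=5; it counts model fits only and ignores the single final refit that GridSearchCV performs by default.

```python
from itertools import product

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
cv = 5

# Grid search evaluates every combination of the three lists.
grid_combinations = len(list(product(objective, num_leaves, max_depth)))
grid_fits = grid_combinations * cv

# Greedy tuning evaluates each list once, one parameter after another.
greedy_settings = len(objective) + len(num_leaves) + len(max_depth)
greedy_fits = greedy_settings * cv

print(grid_combinations, grid_fits, greedy_fits)
```

The grid needs 245 combinations (1225 fits) against 19 settings (95 fits) for the greedy approach, which is the price paid for covering parameter interactions.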

Bayesian tuning

from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val

rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)

rf_bo.maximize()

Output
Some of the numbers here are shown in purple, which means a higher target value

1 - rf_bo.max['target']
plt.figure(figsize=(13,5))
sns.lineplot(x=['0_origin','1_log_transfer','2_L1_&_L2','3_change_model','4_parameter_tuning'], y=[1.36 ,0.19, 0.19, 0.14, 0.13])

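Output

Summary

Models
Linear regression: https://zhuanlan.zhihu.com/p/49480391
Decision trees: https://zhuanlan.zhihu.com/p/65304798
GBDT: https://zhuanlan.zhihu.com/p/45145899
XGBoost: https://zhuanlan.zhihu.com/p/86816771
XGBRegressor - gradient boosted regression trees, also called gradient boosting machines
Trees are built sequentially, each one trying to correct the errors of the one before it. Unlike random forests, gradient boosted regression trees use no randomization by default and rely on strong pre-pruning instead, so the trees are usually very shallow; the model therefore uses little memory and predicts quickly.
LightGBM: https://zhuanlan.zhihu.com/p/89360721
Recommended books
《机器学习》 (Machine Learning) https://book.douban.com/subject/26708119/
《统计学习方法》 (Statistical Learning Methods) https://book.douban.com/subject/10590856/
《Python大战机器学习》 (Python machine learning) https://book.douban.com/subject/26987890/
《面向机器学习的特征工程》 (Feature Engineering for Machine Learning) https://book.douban.com/subject/26826639/
《数据科学家访谈录》 (Interviews with Data Scientists) https://book.douban.com/subject/30129410/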