Check-in 4: Modeling and Hyperparameter Tuning

椰汁黑糯米

Published 2020-04-01 19:55:11

Article link: https://blog.csdn.net/lianqi1020/article/details/105236982

Modeling and Hyperparameter Tuning

Model Overview

Linear regression
Model: $f(x) = w^\top x + b$
Loss function: $(f(x) - y)^2$
Usually optimized with least squares or gradient descent.
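
To make the two optimization routes concrete, here is a minimal NumPy sketch that fits the same linear model once with a direct least-squares solve and once with gradient descent on the mean squared error. The synthetic data, the step size `lr`, and the iteration count are assumptions chosen for this toy problem, not values from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 1 + noise
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the bias b is learned as the last weight
Xb = np.hstack([X, np.ones((len(X), 1))])

# Least squares: solve min ||Xb w - y||^2 directly
w_ls, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Gradient descent on the mean squared error
w_gd = np.zeros(Xb.shape[1])
lr = 0.1  # step size, picked by hand for this toy problem
for _ in range(500):
    grad = 2.0 / len(y) * Xb.T @ (Xb @ w_gd - y)
    w_gd -= lr * grad

print("least squares:   ", w_ls)
print("gradient descent:", w_gd)
```

Both routes should land on essentially the same weights here; gradient descent becomes the more practical choice when the data are too large for a direct solve.
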
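Decision tree models
GBDT: gradient boosting decision trees
An ensemble model that can be viewed as the additive combination of many base models, where each base model is a CART regression tree. A CART tree is a binary decision tree: every internal node asks a yes/no question about a feature value.
A decision tree may also need pruning, with the amount of pruning adjusted to the accuracy the task requires.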
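
The additive view of GBDT can be sketched in a few lines for the squared-error case, where each new tree is fitted to the current residuals. This is a simplified illustration on synthetic data, not the implementation used by any library; `learning_rate`, `n_trees`, and `max_depth=2` are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

learning_rate = 0.1
n_trees = 100

# Start from a constant prediction, then add shallow trees one at a time
init = y.mean()
pred = np.full_like(y, init)
trees = []
for _ in range(n_trees):
    residual = y - pred  # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

def predict(X_new):
    """Sum the constant and the shrunken output of every tree."""
    return init + learning_rate * sum(t.predict(X_new) for t in trees)

print("training MSE of the constant model:", np.mean((y - init) ** 2))
print("training MSE after boosting:       ", np.mean((y - predict(X)) ** 2))
```
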
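XGBoost
An engineering implementation of the GBDT algorithm with several optimizations. It adds a shrinkage factor that reduces the influence of each newly added tree on the existing model; adds a regularization term to GBDT that constrains the leaf weights of each tree; adopts the column subsampling strategy commonly used in random forests; fits the loss function with a second-order Taylor expansion, which speeds up optimization; and proposes an approximate split-finding algorithm so that not every candidate split point has to be enumerated.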
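
The regularization term and the second-order expansion mentioned above can be written out as follows. This is the standard form of the XGBoost objective at boosting round $t$; the symbols ($g_i$, $h_i$, $T$, $w_j$, $\gamma$, $\lambda$, $\eta$) are notation introduced here and do not appear in the original post.

```latex
% g_i and h_i are the first and second derivatives of the loss l(y_i, \hat{y}_i^{(t-1)})
% with respect to the previous prediction \hat{y}_i^{(t-1)}.
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% With G_j and H_j the sums of g_i and h_i over the samples in leaf j,
% the optimal leaf weight and the shrunken update are
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i)
```

Here $T$ is the number of leaves, $w_j$ the weight of leaf $j$, and $\eta$ the shrinkage factor that scales down each new tree.
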
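LightGBM
LightGBM does not need every sample to compute information gain: it estimates the gain from a small subset of the samples (Gradient-based One-Side Sampling, GOSS). It also has a built-in feature dimensionality-reduction technique (Exclusive Feature Bundling, EFB).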
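
The idea of computing gain from a subset of samples can be illustrated with a simplified GOSS-style sampler: keep the rows with the largest gradient magnitudes, randomly sample some of the rest, and up-weight the sampled rows to compensate. This is a sketch of the idea only, not LightGBM's actual code; the function name `goss_sample` and the rates `top_rate` and `other_rate` are hypothetical.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of a GOSS-style subsample.

    top_rate and other_rate are illustrative values, not tuned defaults.
    """
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort by absolute gradient, largest first
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]  # always keep the large-gradient samples
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows so gain estimates stay roughly unbiased
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), "of", len(grads), "samples kept")
```
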

Hands-on Code

First fit a simple model on the prepared data to see how it performs.

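import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# normalize=True was removed in scikit-learn 1.2; plain least squares gives the same fit without it
model = LinearRegression()
model = model.fit(train_X, train_y)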

# Inspect the intercept and the weights (coef) of the fitted linear regression model
print('intercept:' + str(model.intercept_))
print(sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x: x[1], reverse=True))

from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

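The scatter plot of feature v_9 against the label shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predictions are below 0, which indicates that the model has problems.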

The Five Basic Assumptions of Regression Analysis

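import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# histplot replaces distplot, which has been deprecated since seaborn 0.11
sns.histplot(train_y, kde=True, stat='density')
plt.subplot(1,2,2)
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True, stat='density')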

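The plots show that the label (price) has a long-tailed distribution, which is bad for modeling. Many models assume that the error term is normally distributed, and a long-tailed label violates that assumption. The label therefore needs a transformation that brings it closer to a normal distribution before the next round of prediction.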

train_y_ln = np.log(train_y + 1)  # apply the log(x + 1) transform
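
A model trained on the log(x + 1) label predicts on that scale, so its output has to be mapped back with the matching inverse, exp(x) - 1, before it is compared with real prices. The short sketch below checks the round trip with NumPy's `log1p` and `expm1`; the `price` array is made-up illustration data, not the competition data.

```python
import numpy as np

# Hypothetical long-tailed prices, used only for illustration
price = np.array([0.0, 150.0, 3200.0, 99999.0])

price_ln = np.log1p(price)     # same as np.log(price + 1), more accurate near 0
restored = np.expm1(price_ln)  # the matching inverse: exp(x) - 1

print(price_ln)
print(restored)
# Using np.exp alone would leave every value too large by 1
print(np.exp(price_ln) - restored)
```
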


model = model.fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # expm1 inverts log(x + 1)
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()
# Visualizing again, the predictions are fairly close to the true values and nothing abnormal appears

Five-Fold Cross-Validation

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer

def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
# Five-fold cross-validation of the linear regression model on the untransformed label (error 1.36)
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
print('AVG:', np.mean(scores))

# Five-fold cross-validation of the linear regression model on the log-transformed label (error 0.19)
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))
# Inspect the per-fold scores
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
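
To show what `cross_val_score` does with `cv=5`, the sketch below runs the five folds by hand with `KFold` and compares the result with the built-in call. The synthetic `X` and `y` are assumptions made so the example runs on its own; with the real data you would pass `train_X` and `train_y_ln` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=500)

# Manual five-fold loop: train on four folds, score on the held-out fold
manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X):
    fold_model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_pred = fold_model.predict(X[valid_idx])
    manual_scores.append(mean_absolute_error(y[valid_idx], fold_pred))

# Built-in equivalent; this scorer returns negated MAE, so flip the sign
auto_scores = -cross_val_score(LinearRegression(), X, y, cv=5,
                               scoring="neg_mean_absolute_error")

print("manual:", np.round(manual_scores, 4))
print("auto:  ", np.round(auto_scores, 4))
```
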

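Cross-validation also has its limitations. For example, when the data are time-ordered, random folds can let the model train on records later than the ones it is validated on, so the dataset can instead be split in chronological order.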
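
One way to split chronologically is scikit-learn's `TimeSeriesSplit`, which always validates on rows that come after the training rows. The sketch below assumes the rows are already sorted from oldest to newest and uses a tiny placeholder array in place of the real features.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assume the rows are already sorted from oldest to newest
n_samples = 12
X = np.arange(n_samples).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
splits = list(splitter.split(X))
for fold, (train_idx, valid_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

The splitter can be passed directly as the `cv` argument of `cross_val_score` in place of the integer 5.
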

# Plot the learning curve and the validation curve
from sklearn.model_selection import learning_curve, validation_curve
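
As a rough illustration of how `learning_curve` is used, the sketch below computes training and validation MAE at several training-set sizes. It runs on synthetic data so that it is self-contained; the data, the five training sizes, and the choice of MAE as the score are assumptions, and the plotting step is left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=600)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_mean_absolute_error",
)

# Scores are negated MAE, so flip the sign and average over the folds
train_mae = -train_scores.mean(axis=1)
valid_mae = -valid_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mae, valid_mae):
    print(f"n={int(n):4d}  train MAE={tr:.3f}  valid MAE={va:.3f}")
```

Plotting `train_mae` and `valid_mae` against `train_sizes` gives the learning curve; a large gap between the two lines suggests overfitting, while two high, close lines suggest underfitting.
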

Model Comparison

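Linear models:
In filter and wrapper feature selection, the selection step is clearly separate from training the learner. Embedded feature selection instead performs selection automatically while the learner is trained. The most common embedded methods are L1 and L2 regularization. Adding them to linear regression gives Lasso regression (L1) and ridge regression (L2), respectively.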

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # change the model above to inspect each one in turn

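Compare the three models. L2 regularization generally pushes the weights to be as small as possible during fitting, producing a model whose parameters are all fairly small and that is robust to perturbations; L1 regularization tends to produce a sparse weight vector, which can then be used for feature selection.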
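
The difference between the two penalties is easy to see on synthetic data where only a few features matter. In the sketch below, the data, `true_coef`, and the `alpha` values are assumptions chosen for illustration; the point is only that Lasso drives irrelevant coefficients exactly to zero while ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only the first three features carry signal; the other seven are noise
true_coef = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=300)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```
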

Nonlinear models:

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR  # SVR is the regression counterpart of the SVC classifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor