Datawhale Introduction to Data Mining from Scratch, Task 4: Modeling and Hyperparameter Tuning Notes

Latest recommended article published on 2022-11-01 16:01:55

l_yiyu

Article link: https://blog.csdn.net/l_yiyu/article/details/105256493

Task 4: Modeling and Hyperparameter Tuning

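1. Common Models

1.1 Linear regression model
Reference blog: Linear Regression
1.2 Decision tree model
Reference blog: Decision Tree Model
1.3 GBDT model
Reference blog: GBDT
1.4 XGBoost model
Reference blog: XGBoost Model
1.5 LightGBM model
Reference blog: LightGBM Model
Recommended books:
《机器学习》 (Machine Learning)
《统计学习方法》 (Statistical Learning Methods)
《Python大战机器学习》
《面向机器学习的特征工程》 (Feature Engineering for Machine Learning)
《数据科学家访谈录》 (Interviews with Data Scientists)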

2. Simple Modeling

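from sklearn.linear_model import LinearRegression

# Linear regression model. The original call passed normalize=True, a parameter that was
# removed in scikit-learn 1.2; ordinary least squares predictions do not depend on feature scaling.
model = LinearRegression()
model = model.fit(train_X, train_y)

# Inspect the intercept and weights of the fitted linear regression model
print('intercept:' + str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x: x[1], reverse=True)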

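Draw a scatter plot of feature v_9 against the label to compare the distribution of the predictions (blue) with that of the true labels (black).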

import numpy as np
from matplotlib import pyplot as plt

# Draw 50 random row labels (with replacement) so the plot stays readable;
# this assumes train_X and train_y share a default 0..n-1 index
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

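Figure 1. Predictions vs. true data
The figure shows that the predictions are distributed quite differently from the labels, and some predicted values are below 0, which indicates a problem with the model.
The label distribution is plotted below (Figure 2).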

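import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(train_y, kde=True, stat='density')
plt.subplot(1, 2, 2)
# Same plot, restricted to labels below the 90th percentile
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True, stat='density')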

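Figure 2 shows that the labels follow a long-tailed distribution. Such data is hard to model, because many models assume that the error term is normally distributed, and a long-tailed label violates that assumption. We therefore apply a log(x + 1) transform to the labels to bring them closer to a normal distribution, as shown in Figure 3.

The log(x + 1) transform is: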

train_y_ln = np.log(train_y + 1)
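
To put a number on how much the transform helps, here is a minimal sketch that compares skewness before and after. It runs on a synthetic log-normal sample, which is an assumption standing in for the competition's price column; the names `prices` and `log_prices` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a long-tailed price column (an assumption, not the real data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

# np.log1p(x) computes log(x + 1), the same transform as np.log(x + 1)
log_prices = np.log1p(prices)

# Sample skewness: near 0 for a symmetric distribution, large and positive for a long right tail
skew_before = pd.Series(prices).skew()
skew_after = pd.Series(log_prices).skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# np.expm1 inverts np.log1p, so values in log space can be mapped back to prices
restored = np.expm1(log_prices)
```

A skewness close to zero after the transform is what "closer to a normal distribution" means in practice, although it does not by itself prove normality.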

Code to plot the transformed, approximately normal labels:

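import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(train_y_ln, kde=True, stat='density')
plt.subplot(1, 2, 2)
# Same plot, restricted to transformed labels below the 90th percentile
sns.histplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)], kde=True, stat='density')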

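Figure 3. Labels after the log transform (approximately normal)
Fit the model again on the transformed labels and predict: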

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# The model was fit on log(price + 1), so invert with expm1 (exp(x) - 1) rather than plain exp
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

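Figure 4. Predictions after refitting
As Figure 4 shows, the new predictions are fairly close to the true values, and no abnormal values appear.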
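
The transform-fit-invert pattern above can also be packaged so that the inverse step is never forgotten. The following is a sketch, not the original author's method: it uses scikit-learn's `TransformedTargetRegressor` on synthetic data, and the names `X_demo`, `y_demo`, and `wrapped` are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic positive, long-tailed target (an assumption standing in for price)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = np.exp(3.0 + X_demo @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=500))

# The wrapper applies log1p to the target before fitting and expm1 to the predictions
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
wrapped.fit(X_demo, y_demo)
pred = wrapped.predict(X_demo)

# Manual equivalent of the same steps, as done in the text above
manual = LinearRegression().fit(X_demo, np.log1p(y_demo))
manual_pred = np.expm1(manual.predict(X_demo))

print("wrapper and manual predictions agree:", np.allclose(pred, manual_pred))
```

Because the inverse is `expm1`, predictions on the price scale can never fall below -1, which is why the negative prices seen in Figure 1 disappear.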

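3. Comparing Multiple Models

3.1 Linear Models & Embedded Feature Selection

This section assumes the reader already understands overfitting, model complexity, and regularization. If not, consult other material or the following references:
Overfitting explained in plain language
Model complexity and a model's generalization ability
An intuitive understanding of regularization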

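In filter and wrapper feature selection, the selection step is clearly separate from training the learner. Embedded feature selection instead performs the selection automatically while the learner is being trained.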
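
To make the three families concrete, here is a minimal sketch using scikit-learn selectors on a synthetic regression problem. The dataset, the choice of `k=3`, and `alpha=1.0` are all assumptions for illustration and are not taken from the original notes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which carry signal (an assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)

# Filter: score every feature independently of any learner, then keep the top k
filter_sel = SelectKBest(score_func=f_regression, k=3).fit(X_demo, y_demo)

# Wrapper: repeatedly train a learner and drop the weakest feature until 3 remain
wrapper_sel = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)

# Embedded: selection falls out of training itself (the L1 penalty zeroes some coefficients)
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X_demo, y_demo)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, "keeps feature indices", np.flatnonzero(sel.get_support()))
```

Note that the filter and wrapper selectors need the number of features to keep as an input, while the embedded selector's count is decided by the regularization strength.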
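The most common embedded approaches are L1 and L2 regularization. Adding them to a linear regression model yields Lasso regression (L1) and ridge regression (L2), respectively.
Comparison of the three models: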

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    # 5-fold cross-validated MAE, measured on the log-transformed label
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

# One column per model, one row per fold
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

Figure 5. Comparison of the three models
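
The MAE values in this comparison are computed on log(price + 1), so they are not in price units. If an error on the original scale is wanted, one option is a custom scorer that undoes the transform first. The sketch below uses synthetic data, and the helper `mae_on_price_scale` is a hypothetical name, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score


def mae_on_price_scale(y_true_ln, y_pred_ln):
    """MAE after undoing the log(x + 1) transform on both labels and predictions."""
    return mean_absolute_error(np.expm1(y_true_ln), np.expm1(y_pred_ln))


# Synthetic stand-in for the training data (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
price_demo = np.exp(8.0 + X_demo @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.2, size=400))
y_demo_ln = np.log1p(price_demo)

# Same model and folds, scored once in log space and once in price units
log_scores = cross_val_score(LinearRegression(), X_demo, y_demo_ln, cv=5,
                             scoring=make_scorer(mean_absolute_error))
price_scores = cross_val_score(LinearRegression(), X_demo, y_demo_ln, cv=5,
                               scoring=make_scorer(mae_on_price_scale))

print("MAE in log space:  ", log_scores.round(3))
print("MAE in price units:", price_scores.round(1))
```

Both scores rank a single model's folds similarly, but only the second one is directly comparable to a leaderboard metric defined on raw prices.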

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
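
As a companion to the coefficient plot, the following sketch shows why L1 regularization acts as feature selection while L2 does not: it counts coefficients that are exactly zero for each of the three models. The synthetic dataset and `alpha=1.0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 10 features, only 3 of which carry signal (an assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)

zero_counts = {}
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X_demo, y_demo)
    name = type(model).__name__
    # Count coefficients that the fit set to exactly zero
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name:>16}: {zero_counts[name]} of {model.coef_.size} coefficients are exactly zero")
```

Ridge shrinks every coefficient toward zero without removing any, whereas Lasso drops uninformative features outright, so a bar plot of Lasso coefficients typically has many empty bars.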