Datawale3: Modeling and Parameter Tuning

Article link: https://blog.csdn.net/qq_19917979/article/details/105253132

General workflow for training a model and making predictions:

Points to watch when making predictions with the model:

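Model selection:

1. Choose the model by its performance on the validation set.

2. Look beyond the mean score and also check robustness, i.e. how much the score varies across folds.

3. Also consider online performance; the online score can be treated as one more fold of data.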
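The three points above can be sketched as a comparison of per-fold validation scores, reporting both the mean and the spread. This is only an illustration under assumptions: the data is synthetic (not the used-car dataset), the helper name `summarize_cv` is hypothetical, and the two candidate models are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the dataset used in this post).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)


def summarize_cv(model, X, y, n_splits=5):
    """Return (mean, std) of the per-fold mean absolute error."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    # scikit-learn reports negated MAE so that higher is better; flip the sign back.
    fold_mae = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    return fold_mae.mean(), fold_mae.std()


results = {
    "LinearRegression": summarize_cv(LinearRegression(), X, y),
    "Ridge(alpha=1.0)": summarize_cv(Ridge(alpha=1.0), X, y),
}
for name, (mean_mae, std_mae) in results.items():
    print(f"{name}: MAE mean={mean_mae:.3f}, std={std_mae:.3f}")
```

A model with a slightly worse mean but a much smaller standard deviation may be the safer choice. An online score, when available, could be appended to the per-fold scores as one more data point.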

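Parameter selection:

1. Spending much effort on parameter tuning is not recommended;

2. Heavy tuning overfits easily, so set the parameters roughly and put most of the effort into feature engineering;

3. The next priority after that is model ensembling.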

TODO: Thanks to 小雨姑娘 for the hand-me-down code that reduces memory consumption.
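The memory-saving code itself is not reproduced here. As a rough idea of what such a helper usually does, the sketch below downcasts numeric columns to smaller dtypes with `pd.to_numeric`. It is not the code credited above; the function name `reduce_mem_usage` and the example columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def reduce_mem_usage(df):
    """Return a copy of df with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Pick the smallest signed integer dtype that holds every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 halves the memory but loses some precision.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical example frame with 64-bit columns.
df = pd.DataFrame({
    "kilometer": np.arange(1000, dtype=np.int64),
    "v_9": np.linspace(0.0, 1.0, 1000),
})
small = reduce_mem_usage(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"memory: {before} bytes -> {after} bytes")
```

Because float downcasting trades precision for memory, it is worth checking that downstream metrics do not change noticeably.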

1.1 Simple modeling

Data:

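import numpy as np

# Drop rows with missing values, replace the '-' placeholder with 0, and renumber the rows
train_feature = train_feature.dropna().replace('-', 0).reset_index(drop=True)
train_feature['notRepairedDamage'] = train_feature['notRepairedDamage'].astype(np.float32)
train = train_feature[continuous_feature_names + ['price']]

# Split into the feature matrix and the target
train_X = train[continuous_feature_names]
train_y = train['price']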

Inspect the model's intercept and weights:
print('intercept:' + str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x: x[1], reverse=True)
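The lines above use `model` without showing how it is created. The attributes `coef_` and `intercept_` suggest a scikit-learn linear model, so a minimal sketch, assuming `LinearRegression` and synthetic stand-in data with hypothetical feature names, could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["v_0", "v_9", "power"]  # hypothetical feature names
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_names)

# Synthetic target: known weights plus a known intercept and a little noise.
true_w = np.array([2.0, -1.0, 0.5])
y = X.to_numpy() @ true_w + 3.0 + rng.normal(scale=0.01, size=200)

# Assumption: the estimator is LinearRegression; the post's exact model is not shown.
model = LinearRegression().fit(X, y)

print("intercept:", model.intercept_)
# Pair each feature with its learned weight, largest weight first.
weights = sorted(zip(feature_names, model.coef_), key=lambda kv: kv[1], reverse=True)
print(weights)
```

On this toy data the recovered weights stay close to the ones used to generate the target, which makes a useful sanity check before reading weights off real data.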

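Many of the weights above are very large. Taking the feature v_9 as an example and drawing a scatter plot of v_9 against the label, the figure shows that the distribution of the model's predictions (blue points) differs considerably from that of the true labels (black points), and some predictions are even below 0. This indicates that the model has problems.

Plotting the label (price) shows that it follows a long-tailed distribution, which is unfavorable for modeling and prediction. The reason is that many models assume the error term is normally distributed, and long-tailed data violates this assumption.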

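import matplotlib.pyplot as plt
import seaborn as sns

# Left: full price distribution; right: prices below the 90th percentile
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.histplot(train_y, kde=True)
plt.subplot(1, 2, 2)
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True)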

We therefore apply a log transform to y, which is a common remedy for a long-tailed target.

# Log-transform the target; adding 1 keeps a price of 0 finite
train_y_ln = np.log(train_y + 1)
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.histplot(train_y_ln, kde=True)
plt.subplot(1, 2, 2)
sns.histplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)], kde=True)
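Besides eyeballing the plots, the effect of the transform can be checked numerically through the skewness. The sketch below is an illustration on synthetic log-normal values standing in for a long-tailed price column; the data and variable names are assumptions, not the post's dataset.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Log-normal values as a stand-in for a long-tailed price column (hypothetical data).
price = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

# np.log1p(x) computes log(1 + x), the same transform as np.log(x + 1).
price_ln = np.log1p(price)

skew_raw = skew(price)
skew_ln = skew(price_ln)
print(f"skewness before: {skew_raw:.2f}, after: {skew_ln:.2f}")

# np.expm1 is the exact inverse, so predictions can be mapped back to prices.
restored = np.expm1(price_ln)
```

A skewness near 0 after the transform is consistent with the roughly symmetric, bell-shaped histogram described above.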

# Refit the model on the log-transformed target
model = model.fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x: x[1], reverse=True)

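As a result, the intercept and the weights W change.

Visualizing again shows that the predictions are now fairly close to the true values: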

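# subsample_index holds the row indices sampled for the earlier v_9 scatter plot
# True prices (black) versus predictions mapped back from log space (blue)
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.exp(model.predict(train_X.loc[subsample_index])) - 1, color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True price', 'Predicted Price'], loc='upper right')
plt.show()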
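One reason the refit behaves better is that predictions made in log space and mapped back with exp(.) - 1 can never fall below -1, whereas a line fitted to the raw price can go negative. The sketch below illustrates this on synthetic data with a single feature; the data, the `LinearRegression` estimator, and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical stand-in data: one feature and a long-tailed, strictly positive price.
x = rng.normal(size=(2000, 1))
price = np.exp(8.0 + x[:, 0] + rng.normal(scale=0.3, size=2000))

# Fit on the raw price: a straight line can dip below zero for small x.
pred_raw = LinearRegression().fit(x, price).predict(x)

# Fit on log(price + 1), then map predictions back with exp(.) - 1.
log_model = LinearRegression().fit(x, np.log1p(price))
pred_log = np.expm1(log_model.predict(x))

print("negative predictions, raw target:", int((pred_raw < 0).sum()))
print("negative predictions, log target:", int((pred_log < 0).sum()))
```

Evaluation metrics should be computed after mapping predictions back to the original price scale, so that scores from the two fits are comparable.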