Ensemble Learning (Part 2)

The Blending Ensemble Algorithm

1 Algorithm workflow:
(1) Split the data into a training set and a test set (test_set); the training set is then split again into a training set (train_set) and a validation set (val_set);
(2) Build several first-layer models, which may be homogeneous or heterogeneous;
(3) Train the models from step (2) on train_set, then use the trained models to predict val_set and test_set, obtaining val_predict and test_predict1;
(4) Build a second-layer model and train it with val_predict as its training data;
(5) Use the trained second-layer model to predict on test_predict1 (the first layer's predictions for the test set); this is the final result for the whole test set.
2 Advantages: simple and easy to implement
Disadvantages: the data are not used to the full (the second layer only ever sees the validation split)
3 Concrete examples
(1) Prediction on a synthetic dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
data, target = make_blobs(n_samples=10000, centers=2, random_state=1, cluster_std=1.0)
# Split into training and test sets
X_train1, X_test, y_train1, y_test = train_test_split(data, target, test_size=0.2, random_state=1)
# Further split the training portion into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train1, y_train1, test_size=0.3, random_state=1)
# print("The shape of training X:",X_train.shape)
# print("The shape of training y:",y_train.shape)
# print("The shape of test X:",X_test.shape)
# print("The shape of test y:",y_test.shape)
# print("The shape of validation X:",X_val.shape)
# print("The shape of validation y:",y_val.shape)


from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# First-layer (base) classifiers
clfs = [SVC(probability=True), RandomForestClassifier(n_estimators=5, n_jobs=-1), KNeighborsClassifier()]

from sklearn.linear_model import LinearRegression

# Second-layer (meta) model; note that LinearRegression is a regressor, fitted here on the base models' predicted probabilities
lr = LinearRegression()

# First-layer outputs: positive-class probabilities on the validation and test sets
val_features = np.zeros((X_val.shape[0], len(clfs)))
test_features = np.zeros((X_test.shape[0], len(clfs)))

for i, clf in enumerate(clfs):
    clf.fit(X_train, y_train)
    val_feature = clf.predict_proba(X_val)[:, -1]
    test_feature = clf.predict_proba(X_test)[:, -1]
    val_features[:, i] = val_feature
    test_features[:, i] = test_feature

# Second layer: fit the meta model on the first layer's validation predictions
lr.fit(val_features, y_val)
from sklearn.model_selection import cross_val_score

print(cross_val_score(lr, test_features, y_test, cv=5))

The final output is [1. 1. 1. 1. 1.]. Because the meta model is LinearRegression, cross_val_score reports the R² score here, and every fold fits this linearly separable synthetic dataset perfectly.
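Since the second-layer target is a discrete class label, a classifier is arguably a more natural meta learner than a linear regressor. A minimal sketch, assuming the val_features, test_features, y_val and y_test arrays built above (my own variant, not part of the original example):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hedged alternative: a classifier meta model, so the scores below are accuracies rather than R^2.
meta_clf = LogisticRegression()
meta_clf.fit(val_features, y_val)                               # train on the first layer's validation outputs
print(meta_clf.score(test_features, y_test))                    # accuracy on the held-out test set
print(cross_val_score(meta_clf, test_features, y_test, cv=5))   # per-fold accuracy on the test features
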
(2) Prediction on the real iris dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test sets
X_train1, X_test, y_train1, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Further split the training portion into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train1, y_train1, test_size=0.3, random_state=1)

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# First-layer (base) classifiers
clfs = [SVC(probability=True), RandomForestClassifier(n_estimators=5, n_jobs=-1), KNeighborsClassifier()]

from sklearn.linear_model import LinearRegression

# Second-layer (meta) model (again a linear regressor)
lr = LinearRegression()

# First-layer outputs on the validation and test sets
val_features = np.zeros((X_val.shape[0], len(clfs)))
test_features = np.zeros((X_test.shape[0], len(clfs)))

for i, clf in enumerate(clfs):
    clf.fit(X_train, y_train)
    val_feature = clf.predict_proba(X_val)[:, -1]
    test_feature = clf.predict_proba(X_test)[:, -1]
    val_features[:, i] = val_feature
    test_features[:, i] = test_feature

# Second layer: fit the meta model on the first layer's validation predictions
lr.fit(val_features, y_val)
from sklearn.model_selection import cross_val_score

print(cross_val_score(lr, test_features, y_test, cv=5))

The final output is [0.67117829 0.83095025 0.36879004 0.06180187 0.82367824] (per-fold R² scores), so blending looks much weaker on the real dataset. Part of the reason lies in the setup itself: iris has three classes, yet the first layer keeps only predict_proba[:, -1], the probability of the last class, and the meta model is again a linear regressor evaluated with R².
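A sketch of a multiclass-friendly variant (an assumption of mine, not part of the original post) feeds every class probability of every base model into a classifier meta model instead:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed alternative: use the full probability vector of each base model as meta features.
n_classes = len(np.unique(y_train))
val_features = np.zeros((X_val.shape[0], len(clfs) * n_classes))
test_features = np.zeros((X_test.shape[0], len(clfs) * n_classes))

for i, clf in enumerate(clfs):
    clf.fit(X_train, y_train)
    val_features[:, i * n_classes:(i + 1) * n_classes] = clf.predict_proba(X_val)
    test_features[:, i * n_classes:(i + 1) * n_classes] = clf.predict_proba(X_test)

meta_clf = LogisticRegression(max_iter=1000)
meta_clf.fit(val_features, y_val)
print(meta_clf.score(test_features, y_test))   # test-set accuracy of the blended model
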

The Stacking Ensemble Algorithm

1 Algorithm workflow:
(1) First split all the data into a training set and a test set (say 10,000 training rows and 2,500 test rows). The first layer then runs 5-fold cross-validation: each fold fits on 8,000 of the training rows and uses the remaining 2,000 rows as the validation fold.
(2) Each fold therefore trains a model on 8,000 rows, predicts the 2,000 validation rows, and also predicts the 2,500 test rows. After 5 folds we have 5 × 2,000 validation predictions (one prediction for every training row) and 5 × 2,500 test predictions.
(3) The 5 × 2,000 validation predictions are concatenated into a single 10,000-row column, denoted A1, and the 5 × 2,500 test predictions are (weighted-)averaged into a single 2,500-row column, denoted B1.
(4) That is one base model's result, A1 and B1; stacking three base models therefore yields six matrices: A1, A2, A3 and B1, B2, B3.
(5) A1, A2, A3 are placed side by side to form a 10,000 × 3 matrix used as the training data, and B1, B2, B3 form a 2,500 × 3 matrix used as the testing data, on which the lower-layer learner is retrained.
(6) This retraining uses each base model's predictions as features (three features here); the meta learner learns how to weight the base learners' predictions (a weight vector w) so that the final prediction is as accurate as possible.
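Before turning to mlxtend's StackingCVClassifier below, the workflow above can be reproduced by hand with cross_val_predict, which returns exactly the out-of-fold predictions (the A matrices). This is a simplified sketch under my own assumptions: the test-set features (the B matrices) come from base models refit on the full training data rather than from averaging the per-fold models.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

base_models = [KNeighborsClassifier(), RandomForestClassifier(random_state=1), GaussianNB()]

# A_i: 5-fold out-of-fold class probabilities on the training set
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method='predict_proba')
    for m in base_models
])
# B_i: test-set probabilities from base models refit on the whole training set (simplification)
test_meta = np.column_stack([m.fit(X_train, y_train).predict_proba(X_test) for m in base_models])

# Second layer: a simple classifier learns how to combine the base predictions
meta = LogisticRegression(max_iter=1000).fit(train_meta, y_train)
print(meta.score(test_meta, y_test))
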
2 Concrete applications
(1) Stacking with 3-fold CV classification

from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier

RANDOM_SEED = 42

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
# Gaussian Naive Bayes: its only major parameter is the class prior priors, i.e. P(Y=Ck) for each class.
# If it is not given, P(Y=Ck) = mk/m, where m is the total number of training samples and mk the number belonging to class k.
clf3 = GaussianNB()
lr = LogisticRegression()

sclf=StackingCVClassifier(classifiers=[clf1,clf2,clf3],
                          meta_classifier=lr,
                          random_state=RANDOM_SEED)
print('3-fold cross validation:\n')

for clf,label in zip([clf1,clf2,clf3,sclf],['KNN', 'Random Forest', 'Naive Bayes','StackingClassifier']):
    scores=cross_val_score(clf,X,y,cv=3,scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

# Plot the decision regions
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import itertools

gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))
for clf, lab, grd in zip([clf1, clf2, clf3, sclf],
                         ['KNN',
                          'Random Forest',
                          'Naive Bayes',
                          'StackingCVClassifier'],
                          itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)
plt.show()

The decision regions are:
[Figure: decision regions of KNN, Random Forest, Naive Bayes and the StackingCVClassifier on the two iris features]
(2) Hyperparameter tuning with grid search

# Stacking with 5-fold CV classification plus grid search (hyperparameter tuning)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingCVClassifier
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

RANDOM_SEED=41

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()

sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=42)

params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}

grid=GridSearchCV(estimator=sclf,
                  param_grid=params,
                  cv=5,
                  refit=True)
grid.fit(X,y)

cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r,_ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))


print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

The result is:
Best parameters: {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Accuracy: 0.95
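Because refit=True, GridSearchCV refits the stacking classifier on the full data with the best parameter combination, so the fitted grid object (or grid.best_estimator_) can be used directly for prediction; a small usage note, assuming the grid object from above:

# The refitted best model can be used straight away.
best_sclf = grid.best_estimator_
print(best_sclf.predict(X[:5]))   # predictions from the tuned stacking classifier
print(grid.predict(X[:5]))        # equivalent: GridSearchCV delegates to best_estimator_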

(3) Plotting the ROC curve

# ROC curve
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

iris = datasets.load_iris()
X, y = iris.data[:, [0, 1]], iris.target

# Binarize y (one indicator column per class)
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]    # 3

RANDOM_SEED = 42

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=RANDOM_SEED)

clf1 = LogisticRegression()
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = SVC(random_state=RANDOM_SEED)
lr = LogisticRegression()

sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=RANDOM_SEED)

# Learn to predict each class against the others (one-vs-rest)
classifier = OneVsRestClassifier(sclf)
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curve for class 2 of the one-vs-rest stacking classifier]
3 Blending vs. Stacking:
Advantages of blending:
It is simpler than stacking (no k-fold cross-validation is needed to build the stacker features).
Disadvantages:
It uses only a small part of the data for the second layer.
The blender may overfit.
Stacking, with its repeated CV, tends to be more robust.

Ensemble Learning Case Study: Happiness Prediction

1 Data
139 features and a little over 8,000 samples are used to predict personal happiness (the target takes the values 1, 2, 3, 4 or 5, where 1 is the lowest happiness and 5 the highest).
2 Evaluation metric
The final evaluation metric is the mean squared error (MSE).
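For reference, the MSE used throughout the code below is sklearn's mean_squared_error; a tiny worked example with made-up numbers:

# MSE = (1/n) * sum((y_true - y_pred)^2)
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([5, 4, 3])
y_pred = np.array([4.8, 3.5, 3.2])
print(mean_squared_error(y_true, y_pred))  # (0.2**2 + 0.5**2 + 0.2**2) / 3 = 0.11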
3 Data processing
(1) Preprocessing
Missing values are filled in with fillna.
For continuous variables, the mean is used to fill the missing values.
(2) Feature augmentation
The relationships between the features are analysed further to create new features. The following were added: age at first marriage, age at the most recent marriage, whether remarried, spouse's age, age gap with the spouse, various income ratios (income relative to the spouse, expected income in ten years relative to current income, and so on), income-to-floor-area ratios (again including the ten-year expected-income variants), social class (class in ten years, class at age 14, and so on), a leisure index, a satisfaction index and a trust index, each normalised within the same province, city and county. Comparisons with people of the same age (relative income, relative health, and so on) were added as well.
(3) Feature deletion
Features with very few valid samples, e.g. those with too many negative values or too many missing values, should be removed.
(4) Constructing new feature sets
The 49 most important features are selected as a second feature set alongside the 263-dimensional set above;
the discrete variables that require one-hot encoding are one-hot encoded and combined into a third feature set of 383 dimensions (a small sketch of this step follows this list).
(5) Modelling each feature set
The original 263-dimensional features are fed to lightGBM, xgboost, RandomForestRegressor, GradientBoostingRegressor and ExtraTreesRegressor, each with 5-fold cross-validation, giving the out-of-fold predictions of these five models together with their architectures and parameters. Within each feature set, the out-of-fold predictions are then stacked with 5-fold cross-validation repeated twice, using Kernel Ridge Regression as the meta model, to obtain one stacked result per feature set. The 49-dimensional and 383-dimensional data go through the same procedure as the 263-dimensional data. Because the 49 features are the most important ones, extra models are added when building on the 49-dimensional data.
(6) Model fusion
The four stacked ensembles above are combined once more, this time simply with LinearRegression.
(7) Saving the results
The predicted values are continuous values between 1 and 5 while the ground truth is an integer, so to refine the result the predictions near the boundaries are pushed to the nearest integer, and everything is saved to a csv file.
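As mentioned in step (4), the categorical columns are expanded with sklearn's OneHotEncoder and concatenated with the numeric columns (the real column list appears later in the code as cat_fea); a minimal sketch with hypothetical toy data:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one categorical column and one numeric column (hypothetical values)
df = pd.DataFrame({'gender': [1, 2, 2, 1], 'income': [3.0, 1.5, 2.2, 0.7]})
enc = OneHotEncoder(categories='auto')
onehot = enc.fit_transform(df[['gender']]).toarray()          # shape (4, 2)
features = np.column_stack([df[['income']].values, onehot])   # numeric columns + one-hot columns
print(features.shape)                                         # (4, 3)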

import os
import time
import pandas as pd
import numpy as np
# import seaborn as sns
from pandas import DataFrame
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error, mean_absolute_error, f1_score
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.ensemble import ExtraTreesRegressor as etr
from sklearn.linear_model import BayesianRidge as br
from sklearn.ensemble import GradientBoostingRegressor as gbr
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression as lr
from sklearn.linear_model import ElasticNet as en
from sklearn.kernel_ridge import KernelRidge as kr
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold, RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
import logging
import warnings

warnings.filterwarnings('ignore')  # 消除warning

train = pd.read_csv("train.csv", parse_dates=['survey_time'], encoding='latin-1')
test = pd.read_csv("test.csv", parse_dates=['survey_time'], encoding='latin-1')  # latin-1向下兼容ASCII
train = train[train["happiness"] != -8].reset_index(drop=True)
train_data_copy = train.copy()  # 删去"happiness" 为-8的行
target_col = "happiness"  # 目标列
target = train_data_copy[target_col]
del train_data_copy[target_col]  # 去除目标列

data = pd.concat([train_data_copy, test], axis=0, ignore_index=True)


# 数据预处理
# make feature +5
# csv中有复数值:-1、-2、-3、-8,将他们视为有问题的特征,但是不删去
def getres1(row):
    return len([x for x in row.values if type(x) == int and x < 0])


def getres2(row):
    return len([x for x in row.values if type(x) == int and x == -8])


def getres3(row):
    return len([x for x in row.values if type(x) == int and x == -1])


def getres4(row):
    return len([x for x in row.values if type(x) == int and x == -2])


def getres5(row):
    return len([x for x in row.values if type(x) == int and x == -3])


# 检查数据
data['neg1'] = data[data.columns].apply(lambda row: getres1(row), axis=1)
data.loc[data['neg1'] > 20, 'neg1'] = 20  # 平滑处理,最多出现20次
data['neg2'] = data[data.columns].apply(lambda row: getres2(row), axis=1)
# print(data['neg2']) # 每一行里值为-8的数的个数
data['neg3'] = data[data.columns].apply(lambda row: getres3(row), axis=1)
data['neg4'] = data[data.columns].apply(lambda row: getres4(row), axis=1)
data['neg5'] = data[data.columns].apply(lambda row: getres5(row), axis=1)

# 填充缺失值 共25列 去掉4列 填充21列
# 以下的列都是缺省的,视情况填补
data['work_status'] = data['work_status'].fillna(0)
data['work_yr'] = data['work_yr'].fillna(0)
data['work_manage'] = data['work_manage'].fillna(0)
data['work_type'] = data['work_type'].fillna(0)

data['edu_yr'] = data['edu_yr'].fillna(0)
data['edu_status'] = data['edu_status'].fillna(0)

data['s_work_type'] = data['s_work_type'].fillna(0)
data['s_work_status'] = data['s_work_status'].fillna(0)
data['s_political'] = data['s_political'].fillna(0)
data['s_hukou'] = data['s_hukou'].fillna(0)
data['s_income'] = data['s_income'].fillna(0)
data['s_birth'] = data['s_birth'].fillna(0)
data['s_edu'] = data['s_edu'].fillna(0)
data['s_work_exper'] = data['s_work_exper'].fillna(0)

data['minor_child'] = data['minor_child'].fillna(0)
data['marital_now'] = data['marital_now'].fillna(0)
data['marital_1st'] = data['marital_1st'].fillna(0)
data['social_neighbor'] = data['social_neighbor'].fillna(0)
data['social_friend'] = data['social_friend'].fillna(0)
data['hukou_loc'] = data['hukou_loc'].fillna(1)  # 最少为1,表示户口
data['family_income'] = data['family_income'].fillna(66365)  # 删除问题值后的平均值

# 继续进行特殊的列进行数据处理
# 读happiness_index.xlsx
data['survey_time'] = pd.to_datetime(data['survey_time'], format='%Y-%m-%d',
                                     errors='coerce')  # errors='coerce' prevents failures when the date format differs
data['survey_time'] = data['survey_time'].dt.year  # 仅仅是year,方便计算年龄
data['age'] = data['survey_time'] - data['birth']
# print(data['age'], data['survey_time'], data['birth'])
# 年龄分层 145+1=146
bins = [0, 17, 26, 34, 50, 63, 100]
data['age_bin'] = pd.cut(data['age'], bins, labels=[0, 1, 2, 3, 4, 5])

# loc为Selection by Label函数,即为按标签取数据,标签是什么,例如第一个参数选择index,第二个参数选择column
# 对‘宗教’处理
data.loc[data['religion'] < 0, 'religion'] = 1  # 1为不信仰宗教
data.loc[data['religion_freq'] < 0, 'religion_freq'] = 1  # 1为从来没有参加过
# 对‘教育程度’处理
data.loc[data['edu'] < 0, 'edu'] = 4  # 初中
data.loc[data['edu_status'] < 0, 'edu_status'] = 0
data.loc[data['edu_yr'] < 0, 'edu_yr'] = 0
# 对‘个人收入’处理
data.loc[data['income'] < 0, 'income'] = 0  # 认为无收入
# 对‘政治面貌’处理
data.loc[data['political'] < 0, 'political'] = 1  # 认为是群众
# 对体重处理
data.loc[(data['weight_jin'] <= 80) & (data['height_cm'] >= 160), 'weight_jin'] = data['weight_jin'] * 2
data.loc[data['weight_jin'] <= 60, 'weight_jin'] = data['weight_jin'] * 2  # 个人的想法,哈哈哈,没有60斤的成年人吧
# 对身高处理
data.loc[data['height_cm'] < 150, 'height_cm'] = 150  # 成年人的实际情况
# 对‘健康’处理
data.loc[data['health'] < 0, 'health'] = 4  # 认为是比较健康
data.loc[data['health_problem'] < 0, 'health_problem'] = 4
# 对‘沮丧’处理
data.loc[data['depression'] < 0, 'depression'] = 4
# 对‘媒体’处理
data.loc[data['media_1'] < 0, 'media_1'] = 1  # 都是从不
data.loc[data['media_2'] < 0, 'media_2'] = 1
data.loc[data['media_3'] < 0, 'media_3'] = 1
data.loc[data['media_4'] < 0, 'media_4'] = 1
data.loc[data['media_5'] < 0, 'media_5'] = 1
data.loc[data['media_6'] < 0, 'media_6'] = 1
# 对‘空闲活动’处理
data.loc[data['leisure_1'] < 0, 'leisure_1'] = 1
data.loc[data['leisure_2'] < 0, 'leisure_2'] = 5
data.loc[data['leisure_3'] < 0, 'leisure_3'] = 3
# 使用众数(代码中使用mode()来实现异常值的修正),mode函数寻找数组或者矩阵每行/每列中最常出现成员以及出现的次数
data.loc[data['leisure_4'] < 0, 'leisure_4'] = data['leisure_4'].mode()  # 取众数
# print(data['leisure_4'].mode())   # 众数为5.0
data.loc[data['leisure_5'] < 0, 'leisure_5'] = data['leisure_5'].mode()
data.loc[data['leisure_6'] < 0, 'leisure_6'] = data['leisure_6'].mode()
data.loc[data['leisure_7'] < 0, 'leisure_7'] = data['leisure_7'].mode()
data.loc[data['leisure_8'] < 0, 'leisure_8'] = data['leisure_8'].mode()
data.loc[data['leisure_9'] < 0, 'leisure_9'] = data['leisure_9'].mode()
data.loc[data['leisure_10'] < 0, 'leisure_10'] = data['leisure_10'].mode()
data.loc[data['leisure_11'] < 0, 'leisure_11'] = data['leisure_11'].mode()
data.loc[data['leisure_12'] < 0, 'leisure_12'] = data['leisure_12'].mode()
data.loc[data['socialize'] < 0, 'socialize'] = 2  # 很少
data.loc[data['relax'] < 0, 'relax'] = 4  # 经常
data.loc[data['learn'] < 0, 'learn'] = 1  # 从不
# 对‘社交’处理
data.loc[data['social_neighbor'] < 0, 'social_neighbor'] = 0
data.loc[data['social_friend'] < 0, 'social_friend'] = 0
data.loc[data['socia_outing'] < 0, 'socia_outing'] = 1
data.loc[data['neighbor_familiarity'] < 0, 'social_neighbor'] = 4
# 对‘社会公平性’处理
data.loc[data['equity'] < 0, 'equity'] = 4
# 对‘社会等级’处理
data.loc[data['class_10_before'] < 0, 'class_10_before'] = 3
data.loc[data['class'] < 0, 'class'] = 5
data.loc[data['class_10_after'] < 0, 'class_10_after'] = 5
data.loc[data['class_14'] < 0, 'class_14'] = 2
# 对‘工作情况’处理
data.loc[data['work_status'] < 0, 'work_status'] = 0
data.loc[data['work_yr'] < 0, 'work_yr'] = 0
data.loc[data['work_manage'] < 0, 'work_manage'] = 0
data.loc[data['work_type'] < 0, 'work_type'] = 0
# 对‘社会保障’处理
data.loc[data['insur_1'] < 0, 'insur_1'] = 1
data.loc[data['insur_2'] < 0, 'insur_2'] = 1
data.loc[data['insur_3'] < 0, 'insur_3'] = 1
data.loc[data['insur_4'] < 0, 'insur_4'] = 1
data.loc[data['insur_1'] == 0, 'insur_1'] = 0
data.loc[data['insur_2'] == 0, 'insur_2'] = 0
data.loc[data['insur_3'] == 0, 'insur_3'] = 0
data.loc[data['insur_4'] == 0, 'insur_4'] = 0

# 取均值进行缺失值的补全(代码实现为means()),因为家庭的收入是连续值,所以不能再使用取众数的方法进行处理,这里就直接使用了均值进行缺失值的补全
# 对家庭情况处理
family_income_mean = data['family_income'].mean()
data.loc[data['family_income'] < 0, 'family_income'] = family_income_mean
data.loc[data['family_m'] < 0, 'family_m'] = 2
data.loc[data['family_status'] < 0, 'family_status'] = 3
data.loc[data['house'] < 0, 'house'] = 1
data.loc[data['car'] < 0, 'car'] = 0
data.loc[data['car'] == 2, 'car'] = 0
data.loc[data['son'] < 0, 'son'] = 1
data.loc[data['daughter'] < 0, 'daughter'] = 0
data.loc[data['minor_child'] < 0, 'minor_child'] = 0
# 对‘婚姻’处理
data.loc[data['marital_1st'] < 0, 'marital_1st'] = 0
data.loc[data['marital_now'] < 0, 'marital_now'] = 0
# 对‘配偶’处理
data.loc[data['s_birth'] < 0, 's_birth'] = 0
data.loc[data['s_edu'] < 0, 's_edu'] = 0
data.loc[data['s_political'] < 0, 's_political'] = 0
data.loc[data['s_hukou'] < 0, 's_hukou'] = 0
data.loc[data['s_income'] < 0, 's_income'] = 0
data.loc[data['s_work_type'] < 0, 's_work_type'] = 0
data.loc[data['s_work_status'] < 0, 's_work_status'] = 0
data.loc[data['s_work_exper'] < 0, 's_work_exper'] = 0
# 对‘父母情况’处理
data.loc[data['f_birth'] < 0, 'f_birth'] = 1945
data.loc[data['f_edu'] < 0, 'f_edu'] = 1
data.loc[data['f_political'] < 0, 'f_political'] = 1
data.loc[data['f_work_14'] < 0, 'f_work_14'] = 2
data.loc[data['m_birth'] < 0, 'm_birth'] = 1940
data.loc[data['m_edu'] < 0, 'm_edu'] = 1
data.loc[data['m_political'] < 0, 'm_political'] = 1
data.loc[data['m_work_14'] < 0, 'm_work_14'] = 2
# 和同龄人相比社会经济地位
data.loc[data['status_peer'] < 0, 'status_peer'] = 2
# 和3年前比社会经济地位
data.loc[data['status_3_before'] < 0, 'status_3_before'] = 2
# 对‘观点’处理
data.loc[data['view'] < 0, 'view'] = 4
# 对期望年收入处理
data.loc[data['inc_ability'] <= 0, 'inc_ability'] = 2
inc_exp_mean = data['inc_exp'].mean()
data.loc[data['inc_exp'] <= 0, 'inc_exp'] = inc_exp_mean  # 取均值

# 部分特征处理,取众数
for i in range(1, 9 + 1):
    data.loc[data['public_service_' + str(i)] < 0, 'public_service_' + str(i)] = data[
        'public_service_' + str(i)].dropna().mode().values
for i in range(1, 13 + 1):
    data.loc[data['trust_' + str(i)] < 0, 'trust_' + str(i)] = data['trust_' + str(i)].dropna().mode().values
# 数据增广,即进一步分析每一个特征之间的关系,从而进行数据增广。
# 第一次结婚年龄 147
data['marital_1stbir'] = data['marital_1st'] - data['birth']
# 最近结婚年龄 148
data['marital_nowtbir'] = data['marital_now'] - data['birth']
# 是否再婚 149
data['mar'] = data['marital_nowtbir'] - data['marital_1stbir']
# 配偶年龄 150
data['marital_sbir'] = data['marital_now'] - data['s_birth']
# 配偶年龄差 151
data['age_'] = data['marital_nowtbir'] - data['marital_sbir']

# 收入比 151+7 =158
data['income/s_income'] = data['income'] / (data['s_income'] + 1)
data['income+s_income'] = data['income'] + (data['s_income'] + 1)
data['income/family_income'] = data['income'] / (data['family_income'] + 1)
data['all_income/family_income'] = (data['income'] + data['s_income']) / (data['family_income'] + 1)
data['income/inc_exp'] = data['income'] / (data['inc_exp'] + 1)
data['family_income/m'] = data['family_income'] / (data['family_m'] + 0.01)
data['income/m'] = data['income'] / (data['family_m'] + 0.01)

# 收入/面积比 158+4=162
data['income/floor_area'] = data['income'] / (data['floor_area'] + 0.01)
data['all_income/floor_area'] = (data['income'] + data['s_income']) / (data['floor_area'] + 0.01)
data['family_income/floor_area'] = data['family_income'] / (data['floor_area'] + 0.01)
data['floor_area/m'] = data['floor_area'] / (data['family_m'] + 0.01)

# class 162+3=165
data['class_10_diff'] = (data['class_10_after'] - data['class'])
data['class_diff'] = data['class'] - data['class_10_before']
data['class_14_diff'] = data['class'] - data['class_14']
# 悠闲指数 166
leisure_fea_lis = ['leisure_' + str(i) for i in range(1, 13)]
data['leisure_sum'] = data[leisure_fea_lis].sum(axis=1)  # skew
# 满意指数 167
public_service_fea_lis = ['public_service_' + str(i) for i in range(1, 10)]
data['public_service_sum'] = data[public_service_fea_lis].sum(axis=1)  # skew

# 信任指数 168
trust_fea_lis = ['trust_' + str(i) for i in range(1, 14)]
data['trust_sum'] = data[trust_fea_lis].sum(axis=1)  # skew

# province mean 168+13=181
data['province_income_mean'] = data.groupby(['province'])['income'].transform('mean').values
data['province_family_income_mean'] = data.groupby(['province'])['family_income'].transform('mean').values
data['province_equity_mean'] = data.groupby(['province'])['equity'].transform('mean').values
data['province_depression_mean'] = data.groupby(['province'])['depression'].transform('mean').values
data['province_floor_area_mean'] = data.groupby(['province'])['floor_area'].transform('mean').values
data['province_health_mean'] = data.groupby(['province'])['health'].transform('mean').values
data['province_class_10_diff_mean'] = data.groupby(['province'])['class_10_diff'].transform('mean').values
data['province_class_mean'] = data.groupby(['province'])['class'].transform('mean').values
data['province_health_problem_mean'] = data.groupby(['province'])['health_problem'].transform('mean').values
data['province_family_status_mean'] = data.groupby(['province'])['family_status'].transform('mean').values
data['province_leisure_sum_mean'] = data.groupby(['province'])['leisure_sum'].transform('mean').values
data['province_public_service_sum_mean'] = data.groupby(['province'])['public_service_sum'].transform('mean').values
data['province_trust_sum_mean'] = data.groupby(['province'])['trust_sum'].transform('mean').values

# city   mean 181+13=194
data['city_income_mean'] = data.groupby(['city'])['income'].transform('mean').values
data['city_family_income_mean'] = data.groupby(['city'])['family_income'].transform('mean').values
data['city_equity_mean'] = data.groupby(['city'])['equity'].transform('mean').values
data['city_depression_mean'] = data.groupby(['city'])['depression'].transform('mean').values
data['city_floor_area_mean'] = data.groupby(['city'])['floor_area'].transform('mean').values
data['city_health_mean'] = data.groupby(['city'])['health'].transform('mean').values
data['city_class_10_diff_mean'] = data.groupby(['city'])['class_10_diff'].transform('mean').values
data['city_class_mean'] = data.groupby(['city'])['class'].transform('mean').values
data['city_health_problem_mean'] = data.groupby(['city'])['health_problem'].transform('mean').values
data['city_family_status_mean'] = data.groupby(['city'])['family_status'].transform('mean').values
data['city_leisure_sum_mean'] = data.groupby(['city'])['leisure_sum'].transform('mean').values
data['city_public_service_sum_mean'] = data.groupby(['city'])['public_service_sum'].transform('mean').values
data['city_trust_sum_mean'] = data.groupby(['city'])['trust_sum'].transform('mean').values

# county  mean 194 + 13 = 207
data['county_income_mean'] = data.groupby(['county'])['income'].transform('mean').values
data['county_family_income_mean'] = data.groupby(['county'])['family_income'].transform('mean').values
data['county_equity_mean'] = data.groupby(['county'])['equity'].transform('mean').values
data['county_depression_mean'] = data.groupby(['county'])['depression'].transform('mean').values
data['county_floor_area_mean'] = data.groupby(['county'])['floor_area'].transform('mean').values
data['county_health_mean'] = data.groupby(['county'])['health'].transform('mean').values
data['county_class_10_diff_mean'] = data.groupby(['county'])['class_10_diff'].transform('mean').values
data['county_class_mean'] = data.groupby(['county'])['class'].transform('mean').values
data['county_health_problem_mean'] = data.groupby(['county'])['health_problem'].transform('mean').values
data['county_family_status_mean'] = data.groupby(['county'])['family_status'].transform('mean').values
data['county_leisure_sum_mean'] = data.groupby(['county'])['leisure_sum'].transform('mean').values
data['county_public_service_sum_mean'] = data.groupby(['county'])['public_service_sum'].transform('mean').values
data['county_trust_sum_mean'] = data.groupby(['county'])['trust_sum'].transform('mean').values

# ratio 相比同省 207 + 13 =220
data['income/province'] = data['income'] / (data['province_income_mean'])
data['family_income/province'] = data['family_income'] / (data['province_family_income_mean'])
data['equity/province'] = data['equity'] / (data['province_equity_mean'])
data['depression/province'] = data['depression'] / (data['province_depression_mean'])
data['floor_area/province'] = data['floor_area'] / (data['province_floor_area_mean'])
data['health/province'] = data['health'] / (data['province_health_mean'])
data['class_10_diff/province'] = data['class_10_diff'] / (data['province_class_10_diff_mean'])
data['class/province'] = data['class'] / (data['province_class_mean'])
data['health_problem/province'] = data['health_problem'] / (data['province_health_problem_mean'])
data['family_status/province'] = data['family_status'] / (data['province_family_status_mean'])
data['leisure_sum/province'] = data['leisure_sum'] / (data['province_leisure_sum_mean'])
data['public_service_sum/province'] = data['public_service_sum'] / (data['province_public_service_sum_mean'])
data['trust_sum/province'] = data['trust_sum'] / (data['province_trust_sum_mean'] + 1)

# ratio 相比同市 220 + 13 =233
data['income/city'] = data['income'] / (data['city_income_mean'])
data['family_income/city'] = data['family_income'] / (data['city_family_income_mean'])
data['equity/city'] = data['equity'] / (data['city_equity_mean'])
data['depression/city'] = data['depression'] / (data['city_depression_mean'])
data['floor_area/city'] = data['floor_area'] / (data['city_floor_area_mean'])
data['health/city'] = data['health'] / (data['city_health_mean'])
data['class_10_diff/city'] = data['class_10_diff'] / (data['city_class_10_diff_mean'])
data['class/city'] = data['class'] / (data['city_class_mean'])
data['health_problem/city'] = data['health_problem'] / (data['city_health_problem_mean'])
data['family_status/city'] = data['family_status'] / (data['city_family_status_mean'])
data['leisure_sum/city'] = data['leisure_sum'] / (data['city_leisure_sum_mean'])
data['public_service_sum/city'] = data['public_service_sum'] / (data['city_public_service_sum_mean'])
data['trust_sum/city'] = data['trust_sum'] / (data['city_trust_sum_mean'])

# ratio 相比同个地区 233 + 13 =246
data['income/county'] = data['income'] / (data['county_income_mean'])
data['family_income/county'] = data['family_income'] / (data['county_family_income_mean'])
data['equity/county'] = data['equity'] / (data['county_equity_mean'])
data['depression/county'] = data['depression'] / (data['county_depression_mean'])
data['floor_area/county'] = data['floor_area'] / (data['county_floor_area_mean'])
data['health/county'] = data['health'] / (data['county_health_mean'])
data['class_10_diff/county'] = data['class_10_diff'] / (data['county_class_10_diff_mean'])
data['class/county'] = data['class'] / (data['county_class_mean'])
data['health_problem/county'] = data['health_problem'] / (data['county_health_problem_mean'])
data['family_status/county'] = data['family_status'] / (data['county_family_status_mean'])
data['leisure_sum/county'] = data['leisure_sum'] / (data['county_leisure_sum_mean'])
data['public_service_sum/county'] = data['public_service_sum'] / (data['county_public_service_sum_mean'])
data['trust_sum/county'] = data['trust_sum'] / (data['county_trust_sum_mean'])

# age   mean 246+ 13 =259
data['age_income_mean'] = data.groupby(['age'])['income'].transform('mean').values
data['age_family_income_mean'] = data.groupby(['age'])['family_income'].transform('mean').values
data['age_equity_mean'] = data.groupby(['age'])['equity'].transform('mean').values
data['age_depression_mean'] = data.groupby(['age'])['depression'].transform('mean').values
data['age_floor_area_mean'] = data.groupby(['age'])['floor_area'].transform('mean').values
data['age_health_mean'] = data.groupby(['age'])['health'].transform('mean').values
data['age_class_10_diff_mean'] = data.groupby(['age'])['class_10_diff'].transform('mean').values
data['age_class_mean'] = data.groupby(['age'])['class'].transform('mean').values
data['age_health_problem_mean'] = data.groupby(['age'])['health_problem'].transform('mean').values
data['age_family_status_mean'] = data.groupby(['age'])['family_status'].transform('mean').values
data['age_leisure_sum_mean'] = data.groupby(['age'])['leisure_sum'].transform('mean').values
data['age_public_service_sum_mean'] = data.groupby(['age'])['public_service_sum'].transform('mean').values
data['age_trust_sum_mean'] = data.groupby(['age'])['trust_sum'].transform('mean').values

# 和同龄人相比259 + 13 =272维
data['income/age'] = data['income'] / (data['age_income_mean'])
data['family_income/age'] = data['family_income'] / (data['age_family_income_mean'])
data['equity/age'] = data['equity'] / (data['age_equity_mean'])
data['depression/age'] = data['depression'] / (data['age_depression_mean'])
data['floor_area/age'] = data['floor_area'] / (data['age_floor_area_mean'])
data['health/age'] = data['health'] / (data['age_health_mean'])
data['class_10_diff/age'] = data['class_10_diff'] / (data['age_class_10_diff_mean'])
data['class/age'] = data['class'] / (data['age_class_mean'])
data['health_problem/age'] = data['health_problem'] / (data['age_health_problem_mean'])
data['family_status/age'] = data['family_status'] / (data['age_family_status_mean'])
data['leisure_sum/age'] = data['leisure_sum'] / (data['age_leisure_sum_mean'])
data['public_service_sum/age'] = data['public_service_sum'] / (data['age_public_service_sum_mean'])
data['trust_sum/age'] = data['trust_sum'] / (data['age_trust_sum_mean'])
# print('shape', data.shape)
# print(data.head())

# 272-9=263个特征
# 删除数值特别少的和之前用过的特征
del_list = ['id', 'survey_time', 'edu_other', 'invest_other', 'property_other', 'join_party', 'province', 'city',
            'county']
use_feature = [clo for clo in data.columns if clo not in del_list]
data.fillna(0, inplace=True)
train_shape = train.shape[0]
features = data[use_feature].columns  # 删除后所有的特征
X_train_263 = data[:train_shape][use_feature].values
# print(data[:train_shape][use_feature])
y_train = target
X_test_263 = data[train_shape:][use_feature].values
# print(X_train_263.shape)  (7988, 263)

imp_fea_49 = ['equity', 'depression', 'health', 'class', 'family_status', 'health_problem', 'class_10_after',
              'equity/province', 'equity/city', 'equity/county',
              'depression/province', 'depression/city', 'depression/county',
              'health/province', 'health/city', 'health/county',
              'class/province', 'class/city', 'class/county',
              'family_status/province', 'family_status/city', 'family_status/county',
              'family_income/province', 'family_income/city', 'family_income/county',
              'floor_area/province', 'floor_area/city', 'floor_area/county',
              'leisure_sum/province', 'leisure_sum/city', 'leisure_sum/county',
              'public_service_sum/province', 'public_service_sum/city', 'public_service_sum/county',
              'trust_sum/province', 'trust_sum/city', 'trust_sum/county',
              'income/m', 'public_service_sum', 'class_diff', 'status_3_before', 'age_income_mean',
              'age_floor_area_mean',
              'weight_jin', 'height_cm',
              'health/age', 'depression/age', 'equity/age', 'leisure_sum/age'
              ]
train_shape = train.shape[0]
X_train_49 = data[:train_shape][imp_fea_49].values
X_test_49 = data[train_shape:][imp_fea_49].values
# print(X_train_49.shape) #最重要的49个特征(7988, 49)

# 选择需要进行onehot编码的离散变量进行one-hot编码,再合成为第三类特征,共383维。
# 使用One-hot进行编码的原因在于,有部分特征为分离值,例如性别中男女,男为1,女为2,我们想使用One-hot将其变为男为0,女为1,来增强机器学习算法的鲁棒性能;
cat_fea = ['survey_type', 'gender', 'nationality', 'edu_status', 'political', 'hukou', 'hukou_loc', 'work_exper',
           'work_status', 'work_type',
           'work_manage', 'marital', 's_political', 's_hukou', 's_work_exper', 's_work_status', 's_work_type',
           'f_political', 'f_work_14',
           'm_political', 'm_work_14']
noc_fea = [clo for clo in use_feature if clo not in cat_fea]

onehot_data = data[cat_fea].values
# print(onehot_data.shape)      #(10956, 21)
enc = preprocessing.OneHotEncoder(categories='auto')
oh_data = enc.fit_transform(onehot_data).toarray()
# print(oh_data)  # 变为onehot编码格式(10956, 141)

X_train_oh = oh_data[:train_shape, :]
X_test_oh = oh_data[train_shape:, :]

# print(X_train_oh.shape)                              (7988, 141)
# print(data[:train_shape][noc_fea].values.shape)      (7988, 242)
X_train_383 = np.column_stack([data[:train_shape][noc_fea].values, X_train_oh])  # 先是noc,再是cat_fea
X_test_383 = np.column_stack([data[train_shape:][noc_fea].values, X_test_oh])
# df=DataFrame(X_train_383)
# df.to_csv('result.csv')
# print(X_train_383.shape)  #(7988, 383)

# 特征建模
##### lgb_263 #
# lightGBM决策树
lgb_263_param = {
    'num_leaves': 7,
    'min_data_in_leaf': 20,  # 叶子可能具有的最小记录数
    'objective': 'regression',
    'max_depth': -1,
    'learning_rate': 0.003,
    "boosting": "gbdt",  # 用gbdt算法
    "feature_fraction": 0.18,  # 例如 0.18时,意味着在每次迭代中随机选择18%的参数来建树
    "bagging_freq": 1,
    "bagging_fraction": 0.55,  # 每次迭代时用的数据比例
    "bagging_seed": 14,
    "metric": 'mse',
    "lambda_l1": 0.1005,
    "lambda_l2": 0.1996,
    "verbosity": -1}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)  # 交叉切分:5
oof_lgb_263 = np.zeros(len(X_train_263))
predictions_lgb_263 = np.zeros(len(X_test_263))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train_263[trn_idx], y_train[trn_idx])
    # print(trn_idx)
    # print('--',X_train_263[trn_idx])
    val_data = lgb.Dataset(X_train_263[val_idx], y_train[val_idx])  # train:val=4:1

    num_round = 10000
    lgb_263 = lgb.train(lgb_263_param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=500,
                        early_stopping_rounds=800)
    oof_lgb_263[val_idx] = lgb_263.predict(X_train_263[val_idx], num_iteration=lgb_263.best_iteration)
    # print(oof_lgb_263[val_idx])           # [4.5536726  4.0173519  3.05126319 ... 4.52023072 4.35002348 3.7669993 ]
    # print(folds.n_splits)      5
    predictions_lgb_263 += lgb_263.predict(X_test_263, num_iteration=lgb_263.best_iteration) / folds.n_splits
    # print(X_train_263.shape)                      # (7988, 263)
    # print(X_test_263.shape)                       # (2968, 263)
    # print('+++',predictions_lgb_263.shape)        # (2968,)
print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb_263, target)))
# print(oof_lgb_263)     # -- [3.6687421  3.944796   4.24840619 ... 3.7669993  3.69435363 4.04505082]

##### xgb_263
# xgboost
xgb_263_params = {'eta': 0.02,  # lr
                  'max_depth': 6,
                  'min_child_weight': 3,  # 最小叶子节点样本权重和
                  'gamma': 0,  # 指定节点分裂所需的最小损失函数下降值。
                  'subsample': 0.7,  # 控制对于每棵树,随机采样的比例
                  'colsample_bytree': 0.3,  # 用来控制每棵随机采样的列数的占比 (每一列是一个特征)。
                  'lambda': 2,
                  'objective': 'reg:linear',
                  'eval_metric': 'rmse',
                  'silent': True,
                  'nthread': -1}

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof_xgb_263 = np.zeros(len(X_train_263))
predictions_xgb_263 = np.zeros(len(X_test_263))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = xgb.DMatrix(X_train_263[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train_263[val_idx], y_train[val_idx])

    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    xgb_263 = xgb.train(dtrain=trn_data, num_boost_round=3000, evals=watchlist, early_stopping_rounds=600,
                        verbose_eval=500, params=xgb_263_params)
    oof_xgb_263[val_idx] = xgb_263.predict(xgb.DMatrix(X_train_263[val_idx]), ntree_limit=xgb_263.best_ntree_limit)
    predictions_xgb_263 += xgb_263.predict(xgb.DMatrix(X_test_263),
                                           ntree_limit=xgb_263.best_ntree_limit) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb_263, target)))

# RandomForestRegressor随机森林
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof_rfr_263 = np.zeros(len(X_train_263))
predictions_rfr_263 = np.zeros(len(X_test_263))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_263[trn_idx]
    tr_y = y_train[trn_idx]
    rfr_263 = rfr(n_estimators=1600, max_depth=9, min_samples_leaf=9, min_weight_fraction_leaf=0.0,
                  max_features=0.25, verbose=1, n_jobs=-1)
    # verbose = 0 为不在标准输出流输出日志信息
    # verbose = 1 为输出进度条记录
    # verbose = 2 为每个epoch输出一行记录
    rfr_263.fit(tr_x, tr_y)
    oof_rfr_263[val_idx] = rfr_263.predict(X_train_263[val_idx])

    predictions_rfr_263 += rfr_263.predict(X_test_263) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_rfr_263, target)))

# GradientBoostingRegressor梯度提升决策树
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
oof_gbr_263 = np.zeros(train_shape)
predictions_gbr_263 = np.zeros(len(X_test_263))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_263[trn_idx]
    tr_y = y_train[trn_idx]
    gbr_263 = gbr(n_estimators=400, learning_rate=0.01, subsample=0.65, max_depth=7, min_samples_leaf=20,
                  max_features=0.22, verbose=1)
    gbr_263.fit(tr_x, tr_y)
    oof_gbr_263[val_idx] = gbr_263.predict(X_train_263[val_idx])

    predictions_gbr_263 += gbr_263.predict(X_test_263) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_gbr_263, target)))

# ExtraTreesRegressor 极端随机森林回归
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_etr_263 = np.zeros(train_shape)
predictions_etr_263 = np.zeros(len(X_test_263))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_263[trn_idx]
    tr_y = y_train[trn_idx]
    etr_263 = etr(n_estimators=1000, max_depth=8, min_samples_leaf=12, min_weight_fraction_leaf=0.0,
                  max_features=0.4, verbose=1, n_jobs=-1)
    etr_263.fit(tr_x, tr_y)
    oof_etr_263[val_idx] = etr_263.predict(X_train_263[val_idx])

    predictions_etr_263 += etr_263.predict(X_test_263) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_etr_263, target)))

train_stack2 = np.vstack([oof_lgb_263, oof_xgb_263, oof_gbr_263, oof_rfr_263, oof_etr_263]).transpose()
# transpose()函数的作用就是调换x,y,z的位置,也就是数组的索引值
test_stack2 = np.vstack([predictions_lgb_263, predictions_xgb_263, predictions_gbr_263, predictions_rfr_263,
                         predictions_etr_263]).transpose()

# 交叉验证:5折,重复2次
folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
oof_stack2 = np.zeros(train_stack2.shape[0])
predictions_lr2 = np.zeros(test_stack2.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack2, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack2[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack2[val_idx], target.iloc[val_idx].values
    # Kernel Ridge Regression
    lr2 = kr()
    lr2.fit(trn_data, trn_y)

    oof_stack2[val_idx] = lr2.predict(val_data)
    predictions_lr2 += lr2.predict(test_stack2) / 10

print(mean_squared_error(target.values, oof_stack2))

# 对于49维的数据进行与上述263维数据相同的操作
lgb_49_param = {
    'num_leaves': 9,
    'min_data_in_leaf': 23,
    'objective': 'regression',
    'max_depth': -1,
    'learning_rate': 0.002,
    "boosting": "gbdt",
    "feature_fraction": 0.45,
    "bagging_freq": 1,
    "bagging_fraction": 0.65,
    "bagging_seed": 15,
    "metric": 'mse',
    "lambda_l2": 0.2,
    "verbosity": -1}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
oof_lgb_49 = np.zeros(len(X_train_49))
predictions_lgb_49 = np.zeros(len(X_test_49))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train_49[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train_49[val_idx], y_train[val_idx])

    num_round = 12000
    lgb_49 = lgb.train(lgb_49_param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=1000,
                       early_stopping_rounds=1000)
    oof_lgb_49[val_idx] = lgb_49.predict(X_train_49[val_idx], num_iteration=lgb_49.best_iteration)
    predictions_lgb_49 += lgb_49.predict(X_test_49, num_iteration=lgb_49.best_iteration) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb_49, target)))

# xgb_49
xgb_49_params = {'eta': 0.02,
                 'max_depth': 5,
                 'min_child_weight': 3,
                 'gamma': 0,
                 'subsample': 0.7,
                 'colsample_bytree': 0.35,
                 'lambda': 2,
                 'objective': 'reg:linear',
                 'eval_metric': 'rmse',
                 'silent': True,
                 'nthread': -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=43)
oof_xgb_49 = np.zeros(len(X_train_49))
predictions_xgb_49 = np.zeros(len(X_test_49))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = xgb.DMatrix(X_train_49[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train_49[val_idx], y_train[val_idx])

    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    xgb_49 = xgb.train(dtrain=trn_data, num_boost_round=3000, evals=watchlist, early_stopping_rounds=600,
                       verbose_eval=500, params=xgb_49_params)

    oof_xgb_49[val_idx] = xgb_49.predict(xgb.DMatrix(X_train_49[val_idx]), ntree_limit=xgb_49.best_ntree_limit)
    predictions_xgb_49 += xgb_49.predict(xgb.DMatrix(X_test_49), ntree_limit=xgb_49.best_ntree_limit) / folds.n_splits


print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb_49, target)))

# 梯度提升决策树
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
oof_gbr_49 = np.zeros(train_shape)
predictions_gbr_49 = np.zeros(len(X_test_49))
# GradientBoostingRegressor梯度提升决策树
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_49[trn_idx]
    tr_y = y_train[trn_idx]
    gbr_49 = gbr(n_estimators=600, learning_rate=0.01, subsample=0.65, max_depth=6, min_samples_leaf=20,
                 max_features=0.35, verbose=1)
    gbr_49.fit(tr_x, tr_y)
    oof_gbr_49[val_idx] = gbr_49.predict(X_train_49[val_idx])
    predictions_gbr_49 += gbr_49.predict(X_test_49) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_gbr_49, target)))

# 在每一种特征工程中,进行5折的交叉验证,并重复两次
train_stack3 = np.vstack([oof_lgb_49, oof_xgb_49, oof_gbr_49]).transpose()
test_stack3 = np.vstack([predictions_lgb_49, predictions_xgb_49, predictions_gbr_49]).transpose()
folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
oof_stack3 = np.zeros(train_stack3.shape[0])
predictions_lr3 = np.zeros(test_stack3.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack3, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack3[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack3[val_idx], target.iloc[val_idx].values
    # Kernel Ridge Regression
    lr3 = kr()
    lr3.fit(trn_data, trn_y)

    oof_stack3[val_idx] = lr3.predict(val_data)
    predictions_lr3 += lr3.predict(test_stack3) / 10

mean_squared_error(target.values, oof_stack3)

# 对于383维的数据进行与上述263以及49维数据相同的操作
# Kernel Ridge Regression 基于核的岭回归
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_kr_383 = np.zeros(train_shape)
predictions_kr_383 = np.zeros(len(X_test_383))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_x = X_train_383[trn_idx]
    trn_y = y_train[trn_idx]

    kr_383 = kr()
    kr_383.fit(trn_x, trn_y)
    oof_kr_383[val_idx] = kr_383.predict(X_train_383[val_idx])
    predictions_kr_383 += kr_383.predict(X_test_383) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_kr_383, target)))

# 普通岭回归
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_ridge_383 = np.zeros(train_shape)
predictions_ridge_383 = np.zeros(len(X_test_383))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_383[trn_idx]
    tr_y = y_train[trn_idx]

    ridge_383 = Ridge(alpha=1200)
    ridge_383.fit(tr_x, tr_y)
    oof_ridge_383[val_idx] = ridge_383.predict(X_train_383[val_idx])

    predictions_ridge_383 += ridge_383.predict(X_test_383) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_ridge_383, target)))

# ElasticNet 弹性网络
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_en_383 = np.zeros(train_shape)
predictions_en_383 = np.zeros(len(X_test_383))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_383[trn_idx]
    tr_y = y_train[trn_idx]
    # ElasticNet 弹性网络
    en_383 = en(alpha=1.0, l1_ratio=0.06)
    en_383.fit(tr_x, tr_y)
    oof_en_383[val_idx] = en_383.predict(X_train_383[val_idx])

    predictions_en_383 += en_383.predict(X_test_383) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_en_383, target)))

# BayesianRidge 贝叶斯岭回归
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_br_383 = np.zeros(train_shape)
predictions_br_383 = np.zeros(len(X_test_383))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_383[trn_idx]
    tr_y = y_train[trn_idx]
    # BayesianRidge 贝叶斯回归
    br_383 = br()
    br_383.fit(tr_x, tr_y)
    oof_br_383[val_idx] = br_383.predict(X_train_383[val_idx])

    predictions_br_383 += br_383.predict(X_test_383) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_br_383, target)))

# 在每一种特征工程中,进行5折的交叉验证,并重复两次
train_stack1 = np.vstack([oof_br_383, oof_kr_383, oof_en_383, oof_ridge_383]).transpose()
test_stack1 = np.vstack([predictions_br_383, predictions_kr_383, predictions_en_383, predictions_ridge_383]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
oof_stack1 = np.zeros(train_stack1.shape[0])
predictions_lr1 = np.zeros(test_stack1.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack1, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack1[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack1[val_idx], target.iloc[val_idx].values
    # LinearRegression简单的线性回归
    lr1 = lr()
    lr1.fit(trn_data, trn_y)

    oof_stack1[val_idx] = lr1.predict(val_data)
    predictions_lr1 += lr1.predict(test_stack1) / 10

mean_squared_error(target.values, oof_stack1)

# 增加更多的模型进行49维特征的数据的构建
# KernelRidge 核岭回归
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_kr_49 = np.zeros(train_shape)
predictions_kr_49 = np.zeros(len(X_test_49))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_49[trn_idx]
    tr_y = y_train[trn_idx]
    kr_49 = kr()
    kr_49.fit(tr_x, tr_y)
    oof_kr_49[val_idx] = kr_49.predict(X_train_49[val_idx])

    predictions_kr_49 += kr_49.predict(X_test_49) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_kr_49, target)))

# Ridge 岭回归
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_ridge_49 = np.zeros(train_shape)
predictions_ridge_49 = np.zeros(len(X_test_49))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_49[trn_idx]
    tr_y = y_train[trn_idx]
    ridge_49 = Ridge(alpha=6)
    ridge_49.fit(tr_x, tr_y)
    oof_ridge_49[val_idx] = ridge_49.predict(X_train_49[val_idx])

    predictions_ridge_49 += ridge_49.predict(X_test_49) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_ridge_49, target)))

# BayesianRidge 贝叶斯岭回归
folds = KFold(n_splits=5, shuffle=True, random_state=13)
oof_br_49 = np.zeros(train_shape)
predictions_br_49 = np.zeros(len(X_test_49))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_49[trn_idx]
    tr_y = y_train[trn_idx]
    br_49 = br()
    br_49.fit(tr_x, tr_y)
    oof_br_49[val_idx] = br_49.predict(X_train_49[val_idx])

    predictions_br_49 += br_49.predict(X_test_49) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_br_49, target)))

# ElasticNet 弹性网络
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
oof_en_49 = np.zeros(train_shape)
predictions_en_49 = np.zeros(len(X_test_49))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = X_train_49[trn_idx]
    tr_y = y_train[trn_idx]
    en_49 = en(alpha=1.0, l1_ratio=0.05)
    en_49.fit(tr_x, tr_y)
    oof_en_49[val_idx] = en_49.predict(X_train_49[val_idx])

    predictions_en_49 += en_49.predict(X_test_49) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_en_49, target)))

# 在每一种特征工程中,进行5折的交叉验证,并重复两次
train_stack4 = np.vstack([oof_br_49, oof_kr_49, oof_en_49, oof_ridge_49]).transpose()
test_stack4 = np.vstack([predictions_br_49, predictions_kr_49, predictions_en_49, predictions_ridge_49]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
oof_stack4 = np.zeros(train_stack4.shape[0])
predictions_lr4 = np.zeros(test_stack4.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack4, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack4[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack4[val_idx], target.iloc[val_idx].values
    # LinearRegression
    lr4 = lr()
    lr4.fit(trn_data, trn_y)

    oof_stack4[val_idx] = lr4.predict(val_data)
    predictions_lr4 += lr4.predict(test_stack4) / 10  # fixed: predict on test_stack4, not test_stack1

mean_squared_error(target.values, oof_stack4)

# 模型融合
# 将以上的4中集成学习模型再次进行集成学习的训练,直接使用LinearRegression简单线性回归的进行集成
train_stack5 = np.vstack([oof_stack1, oof_stack2, oof_stack3, oof_stack4]).transpose()
test_stack5 = np.vstack([predictions_lr1, predictions_lr2, predictions_lr3, predictions_lr4]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
oof_stack5 = np.zeros(train_stack5.shape[0])
predictions_lr5 = np.zeros(test_stack5.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack5, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack5[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack5[val_idx], target.iloc[val_idx].values
    # LinearRegression
    lr5 = lr()
    lr5.fit(trn_data, trn_y)

    oof_stack5[val_idx] = lr5.predict(val_data)
    predictions_lr5 += lr5.predict(test_stack5) / 10

mean_squared_error(target.values, oof_stack5)

# 结果保存
submit_example = pd.read_csv('submit_example.csv', sep=',', encoding='latin-1')
submit_example['happiness'] = predictions_lr5
submit_example.happiness.describe()

submit_example.loc[submit_example['happiness']>4.96,'happiness']= 5
submit_example.loc[submit_example['happiness']<=1.04,'happiness']= 1
submit_example.loc[(submit_example['happiness']>1.96)&(submit_example['happiness']<2.04),'happiness']= 2

submit_example.to_csv("submision.csv",index=False)
submit_example.happiness.describe()

Ensemble Learning Case Study: Steam Volume Prediction

1 Data
The data are split into training data (train.txt) and test data (test.txt). The 38 fields "V0"-"V37" are the feature variables and "target" is the target variable; the task is to train a model on the training data and predict the target variable of the test data.
2 Evaluation metric
The final evaluation metric is the mean squared error (MSE).
3 Data processing
(1) Exploring the data distribution
For the continuously distributed sensor readings, kdeplot (kernel density estimation plots) is used for an initial exploratory data analysis (EDA). Kernel density estimation is a non-parametric method from probability theory for estimating an unknown density function, and the plots give an intuitive picture of how the samples are distributed.
(2) A heat map (sns.heatmap) is drawn to inspect the correlations between features, followed by dimensionality reduction (dropping features whose absolute correlation with the target is below a threshold) and normalisation.
(3) Feature engineering
Plots illustrate the effect of the Box-Cox transform on the data distributions. Box-Cox is used when a continuous response variable does not follow a normal distribution; after the transform, the correlation between the unobservable errors and the predictors is reduced to some extent. In addition, a transform of the target values (described as a log-style transform, implemented below as np.power(1.5, target)) is used to improve their normality.
(4) Model building and ensembling
The training and test sets are built, and outliers are found and removed.
(5) Model prediction and saving of the results
4 Code

import warnings

from sklearn.kernel_ridge import KernelRidge

warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score, cross_val_predict, KFold
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler

# 加载数据
data_train = pd.read_csv('train.txt', sep='\t')
data_test = pd.read_csv('test.txt', sep='\t')
# 合并训练数据和测试数据
data_train["oringin"] = "train"
data_test["oringin"] = "test"
data_all = pd.concat([data_train, data_test], axis=0, ignore_index=True)
# 显示前5条数据
# print(data_all.head())

# 使用 kdeplot(核密度估计图) 进行数据的初步分析
# for column in data_all.columns[0:-2]:
#     #核密度估计(kernel density estimation)是在概率论中用来估计未知的密度函数,属于非参数检验方法之一。通过核密度估计图可以比较直观的看出数据样本本身的分布特征。
#     g = sns.kdeplot(data_all[column][(data_all["oringin"] == "train")], color="Red", shade = True)
#     g = sns.kdeplot(data_all[column][(data_all["oringin"] == "test")], ax =g, color="Blue", shade= True)
#     g.set_xlabel(column)
#     g.set_ylabel("Frequency")
#     g = g.legend(["train","test"])
#     plt.show()

# 通过观察发现特征"V5","V9","V11","V17","V22","V28"中训练集数据分布和测试集数据分布不均,所以删除这些特征数据
data_all.drop(["V5", "V9", "V11", "V17", "V22", "V28"], axis=1, inplace=True)

# Inspect the correlations between features
data_train1 = data_all[data_all["oringin"] == "train"].drop("oringin", axis=1)
# plt.figure(figsize=(20, 16))  # figure width and height
# colnm = data_train1.columns.tolist()  # column names
# mcorr = data_train1[colnm].corr(method="spearman")  # correlation matrix (pairwise correlation coefficients)
# mask = np.zeros_like(mcorr, dtype=bool)  # boolean matrix with the same shape as mcorr (np.bool was removed in newer numpy)
# mask[np.triu_indices_from(mask)] = True  # True on and above the diagonal
# cmap = sns.diverging_palette(220, 10, as_cmap=True)  # diverging matplotlib colormap
# g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')  # heat map of pairwise correlations
# plt.show()

# Dimensionality reduction: drop features whose absolute correlation with the target is below the threshold
threshold = 0.1
corr_matrix = data_train1.corr().abs()
drop_col = corr_matrix[corr_matrix["target"] < threshold].index
# print(drop_col)  # Index(['V14', 'V21', 'V25', 'V26', 'V32', 'V33', 'V34'])
data_all.drop(drop_col, axis=1, inplace=True)

# Min-max normalization
cols_numeric = list(data_all.columns)
cols_numeric.remove("oringin")


def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())


scale_cols = [col for col in cols_numeric if col != 'target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax, axis=0)
# print(data_all[scale_cols].describe())


# Feature engineering
# Plot how the Box-Cox transform changes each feature's distribution; Box-Cox is used when a continuous response
# variable is not normally distributed, and it can reduce the correlation between the unobservable error and the predictors to some extent
fcols = 6
frows = len(cols_numeric) - 1
plt.figure(figsize=(4 * fcols, 4 * frows))
i = 0
'''
for var in cols_numeric:
    if var != 'target':
        dat = data_all[[var, 'target']].dropna()

        i += 1
        plt.subplot(frows, fcols, i)
        sns.distplot(dat[var], fit=stats.norm)
        plt.title(var + ' Original')
        plt.xlabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        _ = stats.probplot(dat[var], plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(dat[var])))
        plt.xlabel('')
        plt.ylabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(dat[var], dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))

        i += 1
        plt.subplot(frows, fcols, i)
        trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)
        trans_var = scale_minmax(trans_var)
        sns.distplot(trans_var, fit=stats.norm);
        plt.title(var + ' Transformed')
        plt.xlabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        _ = stats.probplot(trans_var, plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
        plt.xlabel('')
        plt.ylabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(trans_var, dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, dat['target'])[0][1]))
'''

# Apply the Box-Cox transform to the feature columns
cols_transform = data_all.columns[0:-2]
for col in cols_transform:
    # transform column
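    # stats.boxcox requires strictly positive inputs; the features were min-max scaled to [0, 1], so +1 shifts them into [1, 2]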
    data_all.loc[:, col], _ = stats.boxcox(data_all.loc[:, col] + 1)
# print(data_all.target.describe())
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(data_all.target.dropna(), fit=stats.norm)
plt.subplot(1, 2, 2)
_ = stats.probplot(data_all.target.dropna(), plot=plt)
# plt.show()

# Transform the target to bring its distribution closer to normal (an exponential transform here)
sp = data_train.target
data_train['target1'] = np.power(1.5, sp)  # 1.5 ** sp
# print(data_train.target1.describe())

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(data_train.target1.dropna(), fit=stats.norm)
plt.subplot(1, 2, 2)
_ = stats.probplot(data_train.target1.dropna(), plot=plt)


# Model building and ensemble learning
def get_training_data():
    # extract training samples
    df_train = data_all[data_all['oringin'] == 'train'].copy()
    df_train['label'] = data_train.target1

    y = df_train.target
    X = df_train.drop(["oringin", "target", "label"], axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=100)
    return X_train, X_valid, y_train, y_valid


# extract test data (without the target)
def get_test_data():
    df_test = data_all[data_all["oringin"] == "test"].reset_index(drop=True)
    return df_test.drop(["oringin", "target"], axis=1)


# rmse and mse evaluation functions
from sklearn.metrics import make_scorer


# metric for evaluation
def rmse(y_true, y_pred):
    diff = y_pred - y_true
    sum_sq = sum(diff ** 2)
    n = len(y_pred)
    return np.sqrt(sum_sq / n)


def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)


# scorer to be used in sklearn model fitting
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# greater_is_better is True (the default) when score_func is a score function, and False when it is a loss function
mse_scorer = make_scorer(mse, greater_is_better=False)


# Find outliers (they are removed from the training set further below)
def find_outliers(model, X, y, sigma=3):
    # predict y values using model
    model.fit(X, y)
    y_pred = pd.Series(model.predict(X), index=y.index)

    # calculate residuals between the model prediction and true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()

    # calculate z statistic, define outliers to be where |z|>sigma
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index

    # print and plot the results
    print('R2=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print("mse=", mean_squared_error(y, y_pred))
    print('---------------------------------------')

    print('mean of residuals:', mean_resid)
    print('std of residuals:', std_resid)
    print('---------------------------------------')

    print(len(outliers), 'outliers:')
    print(outliers.tolist())

    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y_pred');

    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred');

    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('z')

    return outliers


# get training data
X_train, X_valid, y_train, y_valid = get_training_data()
test = get_test_data()

# find and remove outliers using a Ridge model
outliers = find_outliers(Ridge(), X_train, y_train)
# X_outliers = X_train.loc[outliers]
# y_outliers = y_train.loc[outliers]
X_t = X_train.drop(outliers)
y_t = y_train.drop(outliers)


# Model training
def get_trainning_data_omitoutliers():
    # training data with the outliers removed
    y = y_t.copy()
    X = X_t.copy()
    return X, y


def train_model(model, param_grid=[], X=[], y=[],
                splits=5, repeats=5):
    # get the data
    if len(y) == 0:
        X, y = get_trainning_data_omitoutliers()

    # cross-validation splitter
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)
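    # with the function defaults (splits=5, repeats=5) this means 25 fits per parameter setting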

    # grid search for the best hyperparameters
    if len(param_grid) > 0:
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)

        # fit
        gsearch.fit(X, y)

        # best model
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_

        # cross-validation metrics for the best parameter setting
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx, 'mean_test_score'])
        cv_std = grid_results.loc[best_idx, 'std_test_score']

    # no grid search: score with cross-validation, then fit on the full data
    else:
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)
        model.fit(X, y)  # cross_val_score does not fit `model` in place, so fit it before predicting below

    # collect the CV score
    cv_score = pd.Series({'mean': cv_mean, 'std': cv_std})

    # predictions on the training data
    y_pred = model.predict(X)

    # model performance statistics
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print('mse=', mse(y, y_pred))
    print('cross_val: mean=', cv_mean, ', std=', cv_std)

    # residual analysis
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid) / std_resid
    n_outliers = sum(abs(z) > 3)
    outliers = z[abs(z) > 3].index
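    # residual-based outlier statistics, computed here only for inspection (not used further in this function)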

    return model, cv_score, grid_results


# containers for the trained models and their CV scores
opt_models = dict()
score_models = pd.DataFrame(columns=['mean', 'std'])
splits = 5
repeats = 5

# model = 'KernelRidge'  # an alternative model that can be swapped in
# opt_models[model] = KernelRidge()
model = 'Ridge'
opt_models[model] = Ridge()
alph_range = np.arange(0.25, 6, 0.25)
param_grid = {'alpha': alph_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=repeats)
# print(opt_models[model], cv_score, grid_results)
cv_score.name = model
score_models = pd.concat([score_models, cv_score.to_frame().T])  # DataFrame.append was removed in pandas 2.x

plt.figure()
plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
             abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
plt.xlabel('alpha')
plt.ylabel('score')


# prediction function: average the predictions of the trained models
def model_predict(test_data, test_y=[]):
    i = 0
    y_predict_total = np.zeros((test_data.shape[0],))
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
            if len(test_y) > 0:
                print("{}_mse:".format(model), mean_squared_error(y_predict, test_y))
    y_predict_mean = np.round(y_predict_total / i, 6)
    if len(test_y) > 0:
        print("mean_mse:", mean_squared_error(y_predict_mean, test_y))
    else:
        y_predict_mean = pd.Series(y_predict_mean)
        return y_predict_mean


# Predict on the test data and save the results
y_ = model_predict(test)
y_.to_csv('predict.txt', header=False, index=False)
