[Case Study] Brand Prediction

1. Imports and Data Composition

1.1 Importing libraries and related attributes

1.1.1 Core libraries for the main workflow
# Core libraries
%matplotlib inline
import numpy as np 
from numpy.random import seed
import pandas as pd 
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Scientific computing
from scipy import stats
from scipy.stats import norm

# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import sklearn
1.1.2 Imports for the models and result evaluation
# SVC 
from sklearn.svm import SVC

# ensemble
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier

# Logistic
from sklearn.linear_model import LogisticRegression

# SGD
from sklearn.linear_model import SGDClassifier 

# KNN
from sklearn.neighbors import KNeighborsClassifier

# model selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

# Vote
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
  • Eight model families are covered: Random Forest, Support Vector Machine (SVM), Logistic Regression, K-Nearest Neighbors (KNN), AdaBoost, Extra Trees (ETC), Gradient Boosting (GBC), and Stochastic Gradient Descent (SGD).
    The more reliable of these are later combined through a voting mechanism so that the final result has a degree of robustness.

1.2 Import the provided datasets and examine their composition

1.2.1 Load the data and add an index column
# Load the data
train_data = pd.read_csv(r'train.csv')
test_data = pd.read_csv(r'test.csv')
final_data = pd.read_csv(r'35.csv')

# Drop the brand column so that its empty values cannot raise errors
final_data = final_data.drop('brand',axis = 1)

# Examine the shape of each dataset
print('Training data shape:',train_data.shape)
print('Prediction data shape:',test_data.shape)

# Add a 1-based row-number column named aindex
x1 = np.arange(1 , train_data.shape[0]+1)
x2 = np.arange(1 , test_data.shape[0]+1)
train_data.insert(0,'aindex',x1)
test_data.insert(0,'aindex',x2)
Training data shape: (9398, 7)
Prediction data shape: (4500, 7)
  • The training set train has 9398 rows of 7 attributes each.
  • The prediction set test has 4500 rows of 7 attributes each.
1.2.2 Inspect the training and prediction datasets
# Back up the training data
data_Age = train_data.copy()

# Label each set and combine them
aindex = test_data['aindex']
test_data['brand'] = -1
train_data['Set'] = "Train"
test_data['Set'] = "Test"
DATA = pd.concat([train_data, test_data])  # DataFrame.append was removed in pandas 2.0
DATA.reset_index(inplace=True)
1.2.3 Print the first five rows of each dataset to see its structure
display(train_data.head())
test_data.head()
   aindex        salary  age  elevel  car  zipcode        credit  brand    Set
0       1  119806.54480   45       0   14        4  442037.71130      0  Train
1       2   78020.75094   23       0   15        2   48795.32279      0  Train
2       3   50873.61880   20       3   14        4  352951.49770      0  Train
3       4   72298.80402   29       4   17        0  276298.69520      0  Train
4       5  128999.93560   52       1    6        0  152232.50980      0  Train

   aindex        salary  age  elevel  car  zipcode       credit  brand   Set
0       1   28662.39571   64       4   13        8  118241.0303     -1  Test
1       2   68256.01678   51       3   11        7  307741.8081     -1  Test
2       3  130235.44560   76       1    3        0   27372.1500     -1  Test
3       4   88149.88200   66       4   11        4  440103.6174     -1  Test
4       5   90778.18681   34       4   14        3       0.0000     -1  Test
  • The data now carries the Train/Test labels and the aindex row number.

  • There are six base attributes (salary, age, elevel, car, zipcode, credit) that together drive the final brand choice.

  • In the Test set, brand is set to -1; after the models are tested it is compared against the backed-up data.

1.3 Data Inspection

  • So we first take a quick look at how the six attributes above are distributed;
  • visualizing them as charts makes observation and reasoning easier.
1.3.1 Visualizing the attributes
import matplotlib.pyplot as plt

# The six blocks below visualize the six base attributes.
# ↓↓ Each block follows the pattern shown here ↓↓

# plt.subplot2grid((11,5),(0,0),colspan=2,rowspan=2)
# Lay the figure out on an 11x5 grid holding six plots,
# each spanning two rows and two columns.

# sns.histplot(train_data['salary'], kde=True, stat='density')
# Histogram plus kernel density of the named column
# (histplot replaces the distplot removed from recent seaborn releases).

# train_data.elevel.value_counts().plot(kind='bar')
# Bar chart of the value counts of the named column.

# plt.xlabel(u"car")
# Set the x-axis label.

plt.subplot2grid((11,5),(0,0),colspan=2,rowspan=2)
sns.histplot(train_data['salary'], kde=True, stat='density')
plt.xlabel(u"salary")

plt.subplot2grid((11,5),(0,3),colspan=2,rowspan=2)
sns.histplot(train_data['age'], kde=True, stat='density')
plt.xlabel(u"age")

plt.subplot2grid((11,5),(4,0),colspan=2,rowspan=2)
train_data.elevel.value_counts().plot(kind='bar')
plt.xlabel(u"elevel")

plt.subplot2grid((11,5),(4,3),colspan=2,rowspan=2)
sns.histplot(train_data['car'], kde=True, stat='density')
plt.xlabel(u"car")

plt.subplot2grid((11,5),(8,0),colspan=2,rowspan=2)
train_data.zipcode.value_counts().plot(kind='bar')
plt.xlabel(u"zipcode")

plt.subplot2grid((11,5),(8,3),colspan=2,rowspan=2)
sns.histplot(train_data['credit'], kde=True, stat='density')
plt.xlabel(u"credit")

plt.show()

[Figure: distribution plots of the six attributes: salary, age, elevel, car, zipcode, credit]

  • None of the attributes shows a striking distributional difference; each is spread fairly evenly.
  • salary: the density holds steady around 1e-5 across 20,000-150,000;
  • age: the density dips to roughly 0.01 near 30, 50, 60 and 70 and sits close to 0.02 elsewhere, a small difference overall;
  • elevel: every level occurs close to 2,000 times, with little variation;
  • car: every value has a density close to 0.05, with little variation;
  • zipcode: every value occurs close to 1,000 times, with little variation;
  • credit: the density holds steady around 2e-6 across 0-500,000.

A separate check in Excel showed that salary and credit have pronounced density spikes at both ends of their ranges.

train_data.brand.value_counts().plot(kind='bar')
plt.xlabel(u"brand")
Text(0.5, 0, 'brand')

[Figure: bar chart of brand value counts in the training set]

  • Brand choice in the training set:
  • close to 6,000 customers chose brand 1;
  • close to 4,000 chose brand 0.
display(train_data.head())
test_data.head()
   aindex        salary  age  elevel  car  zipcode        credit  brand    Set
0       1  119806.54480   45       0   14        4  442037.71130      0  Train
1       2   78020.75094   23       0   15        2   48795.32279      0  Train
2       3   50873.61880   20       3   14        4  352951.49770      0  Train
3       4   72298.80402   29       4   17        0  276298.69520      0  Train
4       5  128999.93560   52       1    6        0  152232.50980      0  Train

   aindex        salary  age  elevel  car  zipcode       credit  brand   Set
0       1   28662.39571   64       4   13        8  118241.0303     -1  Test
1       2   68256.01678   51       3   11        7  307741.8081     -1  Test
2       3  130235.44560   76       1    3        0   27372.1500     -1  Test
3       4   88149.88200   66       4   11        4  440103.6174     -1  Test
4       5   90778.18681   34       4   14        3       0.0000     -1  Test

The tables yield some basic information:

  • 1. salary and credit vary widely in magnitude from customer to customer.
  • 2. age, elevel, car and zipcode also differ with each customer's personal background.

2. Missing Values and Feature Overview

  • First check whether the given data contain missing values; if any exist, fill them to improve the accuracy of the feature analysis.
  • Then analyze the overall characteristics of the data and target the analysis accordingly.

2.1 Check whether the data are complete

print(DATA.isnull().sum())
print(final_data.isnull().sum())
index      0
aindex     0
salary     0
age        0
elevel     0
car        0
zipcode    0
credit     0
brand      0
Set        0
dtype: int64
salary     0
age        0
elevel     0
car        0
zipcode    0
credit     0
dtype: int64
  • DATA, the union of train and test, shows no missing values.
  • final_data, the set whose brand column must be filled in, shows no missing values either; had any been missing, an imputation step like the sketch below would have been the fix.
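
SimpleImputer was imported in 1.1.1 but never needed, since nothing is missing. For completeness, a minimal sketch of how the gaps could have been filled; the median strategy and the column list are assumptions, not something the original specifies.

# Hypothetical: fill numeric gaps with each column's median.
# Not executed here, because the data contain no missing values.
from sklearn.impute import SimpleImputer

num_cols = ['salary', 'age', 'elevel', 'car', 'zipcode', 'credit']
imputer = SimpleImputer(strategy='median')
DATA[num_cols] = imputer.fit_transform(DATA[num_cols])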

2.2 Other data characteristics

ax = sns.heatmap(DATA[DATA.Set == 'Train'][['salary','age','elevel','car','zipcode','credit','brand']].corr(),annot=True, fmt = '.3f', cmap = 'coolwarm');
ax.set_title("brand data features");

[Figure: correlation heatmap of the training-set features and brand]

  • In the correlation heatmap it is easy to see that only salary shows a weak positive correlation with brand; the other features show no clear correlation.
  • The concrete conclusions are left to the full feature analysis that follows.

3. Dropping and Encoding Features

3.1 Save the features for later review

# Training-set attribute columns
_salary = DATA[DATA.Set == 'Train'].salary
_age = DATA[DATA.Set == 'Train'].age
_elevel = DATA[DATA.Set == 'Train'].elevel
_car = DATA[DATA.Set == 'Train'].car
_zipcode = DATA[DATA.Set == 'Train'].zipcode
_credit = DATA[DATA.Set == 'Train'].credit

# Final-prediction-set attribute columns
T_salary = final_data.salary
T_age = final_data.age
T_elevel = final_data.elevel
T_car = final_data.car
T_zipcode = final_data.zipcode
T_credit = final_data.credit

3.2 Split the training and test sets

TRAIN = DATA[DATA.Set == 'Train']# carve out the training set
TEST = DATA[DATA.Set == 'Test']# carve out the prediction set
aindex = TEST.aindex.to_list()# convert to a list

# Keep only the six feature attributes
TEST = TEST.drop(['aindex','Set','index','brand'], axis = 1)
X = TRAIN.drop(['aindex','brand','Set','index'], axis=1)

# Split the labelled data into training and test portions
y = TRAIN.brand
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1, stratify=y)

4. Model Implementations and Feature Importance

4.1 Random Forest

#Build a Random Forest model rf_model
rf_model = RandomForestClassifier(max_depth=6, n_estimators= 50, max_features='sqrt')# 'auto' (equivalent to 'sqrt' here) was removed in sklearn 1.3
#Fit the model to the training data
rf_model.fit(X_train, y_train)

rf_train_score = rf_model.score(X_train, y_train)#training accuracy
rf_accuracy = rf_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(rf_train_score * 100))
print("Test: {:.2f} %".format(rf_accuracy*100))
print('Overfit: {:.2f} %'.format((rf_train_score-rf_accuracy)*100))
Train: 88.63 %
Test: 85.43 %
Overfit: 3.20 %
#Random Forest feature-importance visualization
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"RF":features})#build a DataFrame
importances.sort_values("RF", ascending = False, inplace=True)#sort
RF_best_features = list(importances[importances.RF > 0.03].index)
importances.plot.bar()#plot
print("RF_best_features:",RF_best_features, len(RF_best_features))#print the selected features

plt.show()
RF_best_features: ['salary', 'age', 'credit'] 3

[Figure: bar chart of the Random Forest feature importances]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance.
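
The hyperparameters used in 4.1 (max_depth=6, n_estimators=50), like those of the later models, read like the product of a tuning run, but the search itself is not shown; GridSearchCV was imported in 1.1.2 and never used. A minimal sketch of how such values could be found, with a hypothetical grid:

# Hypothetical grid; the notebook does not record the search it actually ran.
param_grid = {
    'max_depth': [4, 6, 8],
    'n_estimators': [50, 100, 200],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)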

4.2 Support Vector Machine

#Build a Support Vector Machine model SVM_model
SVM_model = SVC(C = 100, gamma= 0.001, kernel='rbf')
#Fit the model to the training data
SVM_model.fit(X_train, y_train)

svm_train_score = SVM_model.score(X_train, y_train)#training accuracy
SVM_accuracy = SVM_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(svm_train_score*100))
print("Test: {:.2f} %".format(SVM_accuracy*100))
print('Overfit: {:.2f} %'.format((svm_train_score-SVM_accuracy)*100))
Train: 100.00 %
Test: 60.28 %
Overfit: 39.72 %
#"SVM" feature-importance visualization
#NOTE: an RBF-kernel SVC has no feature_importances_ attribute, so this loop
#reuses rf_model.feature_importances_; the ranking shown here (and in the
#corresponding blocks of 4.3-4.8) is really the Random Forest's.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"SVM":features})#build a DataFrame
importances.sort_values("SVM", ascending = False, inplace=True)#sort
importances
SVM_best_features = list(importances[importances.SVM > 0.03].index)
importances.plot.bar()#plot
print("SVM_best_features:",SVM_best_features, len(SVM_best_features))#print the selected features
SVM_best_features: ['salary', 'age', 'credit'] 3

[Figure: feature-importance bar chart (the reused Random Forest values, labelled SVM)]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance. (The large train/test gap above is taken up in the scaling sketch below.)
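
A 100.00% train score against a 60.28% test score is severe overfitting, which is typical when an RBF-kernel SVM sees raw features on wildly different scales (salary and credit run into the hundreds of thousands while elevel and zipcode are single digits). StandardScaler was imported in 1.1.2 but never applied; a sketch of the scaled variant, as an assumption rather than part of the original run:

# Hypothetical: standardize the features before the SVM.
from sklearn.pipeline import make_pipeline

scaled_svm = make_pipeline(StandardScaler(), SVC(C = 100, gamma= 0.001, kernel='rbf'))
scaled_svm.fit(X_train, y_train)
print("Scaled SVM test accuracy: {:.2f} %".format(scaled_svm.score(X_test, y_test)*100))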

4.3 Logistic Regression

#Build a Logistic Regression model LR_model
LR_model = LogisticRegression(solver='liblinear', C=2.78, penalty='l2')
#Fit the model to the training data
LR_model.fit(X_train, y_train)

LR_train_score = LR_model.score(X_train, y_train.astype('int'))#training accuracy
LR_accuracy = LR_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(LR_train_score*100))
print("Test: {:.2f} %".format(LR_accuracy*100))
print('Overfit: {:.2f} %'.format((LR_train_score-LR_accuracy)*100))
Train: 57.43 %
Test: 57.62 %
Overfit: -0.19 %
#"LR" feature-importance visualization
#NOTE: reuses rf_model.feature_importances_ (see the note in 4.2);
#Logistic Regression exposes coefficients, not feature_importances_.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"LR":features})#build a DataFrame
importances.sort_values("LR", ascending = False, inplace=True)#sort
importances
LR_best_features = list(importances[importances.LR > 0.03].index)
importances.plot.bar()#plot
print("LR_best_features:",LR_best_features, len(LR_best_features))#print the selected features
LR_best_features: ['salary', 'age', 'credit'] 3

[Figure: feature-importance bar chart (the reused Random Forest values, labelled LR)]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance.

4.4 K-Nearest Neighbors

#Build a K-Nearest Neighbors model KNN_model
KNN_model = KNeighborsClassifier(n_neighbors=11,metric='euclidean',weights='uniform')
#Fit the model to the training data
KNN_model.fit(X_train, y_train)

KNN_train_score = KNN_model.score(X_train, y_train)#training accuracy
KNN_accuracy = KNN_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(KNN_train_score*100))
print("Test: {:.2f} %".format(KNN_accuracy*100))
print('Overfit: {:.2f} %'.format((KNN_train_score-KNN_accuracy)*100))
Train: 75.49 %
Test: 69.61 %
Overfit: 5.88 %
#"KNN" feature-importance visualization
#NOTE: reuses rf_model.feature_importances_ (see the note in 4.2);
#KNN has no feature_importances_ of its own.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"KNN":features})#build a DataFrame
importances.sort_values("KNN", ascending = False, inplace=True)#sort
importances
KNN_best_features = list(importances[importances.KNN > 0.03].index)
importances.plot.bar()#plot
print("KNN_best_features:",KNN_best_features, len(KNN_best_features))#print the selected features
KNN_best_features: ['salary', 'age', 'credit'] 3

[Figure: feature-importance bar chart (the reused Random Forest values, labelled KNN)]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance. (A ranking KNN itself can support is sketched below.)
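
Since KNN, like the SVM, has no native importance attribute, a model-agnostic alternative is permutation importance: shuffle one column at a time and measure how much the test accuracy drops. A sketch, not part of the original notebook:

# Hypothetical: model-agnostic importances for KNN via permutation.
from sklearn.inspection import permutation_importance

result = permutation_importance(KNN_model, X_test, y_test, n_repeats=10, random_state=1)
for name, drop in sorted(zip(X_test.columns, result.importances_mean), key=lambda t: -t[1]):
    print("{}: {:.4f}".format(name, drop))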

4.5 AdaBoost

#Build an AdaBoost model ADA_model
ADA_model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=5, min_samples_leaf=10), n_estimators=200, learning_rate = 0.001)# base_estimator was renamed to estimator in sklearn 1.2
#Fit the model to the training data
ADA_model.fit(X_train,y_train)

ADA_train_score = ADA_model.score(X_train, y_train)#training accuracy
ADA_accuracy = ADA_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(ADA_train_score*100))
print("Test: {:.2f} %".format(ADA_accuracy*100))
print('Overfit: {:.2f} %'.format((ADA_train_score - ADA_accuracy)*100))
Train: 92.47 %
Test: 91.60 %
Overfit: 0.88 %
#"ADA" feature-importance visualization
#NOTE: reuses rf_model.feature_importances_ (see the note in 4.2), even though
#AdaBoost has feature_importances_ of its own that are never shown.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"ADA":features})#build a DataFrame
importances.sort_values("ADA", ascending = False, inplace=True)#sort
importances
ADA_best_features = list(importances[importances.ADA > 0.03].index)
importances.plot.bar()#plot
print("ADA_best_features:",ADA_best_features, len(ADA_best_features))#print the selected features
ADA_best_features: ['salary', 'age', 'credit'] 3

[Figure: feature-importance bar chart (the reused Random Forest values, labelled ADA)]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance.

4.6 Extra Trees

#Build an Extra Trees model ETC_model
ETC_model = ExtraTreesClassifier(max_features=4, min_samples_leaf=10, n_estimators=300, min_samples_split=3)
#Fit the model to the training data
ETC_model.fit(X_train, y_train)

ETC_train_score = ETC_model.score(X_train, y_train)#training accuracy
ETC_accuracy = ETC_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(ETC_train_score*100))
print("Test: {:.2f} %".format(ETC_accuracy*100))
print('Overfit: {:.2f} %'.format((ETC_train_score-ETC_accuracy)*100))
Train: 94.22 %
Test: 92.80 %
Overfit: 1.42 %
#"ETC" feature-importance visualization
#NOTE: reuses rf_model.feature_importances_ (see the note in 4.2).
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"ETC":features})#build a DataFrame
importances.sort_values("ETC", ascending = False, inplace=True)#sort
importances
ETC_best_features = list(importances[importances.ETC > 0.03].index)
importances.plot.bar()#plot
print("ETC_best_features:",ETC_best_features, len(ETC_best_features))#print the selected features
ETC_best_features: ['salary', 'age', 'credit'] 3

[Figure: feature-importance bar chart (the reused Random Forest values, labelled ETC)]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance.

4.7 Gradient Boosting

#Build a Gradient Boosting model GBC_model
GBC_model = GradientBoostingClassifier(max_depth=4, max_features=0.3, min_samples_leaf=100, n_estimators=300)
#Fit the model to the training data
GBC_model.fit(X_train, y_train)

GBC_train_score = GBC_model.score(X_train, y_train)#training accuracy
GBC_accuracy = GBC_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(GBC_train_score*100))
print("Test: {:.2f} %".format(GBC_accuracy*100))
print('Overfit: {:.2f} %'.format((GBC_train_score-GBC_accuracy)*100))
Train: 94.47 %
Test: 91.10 %
Overfit: 3.37 %
#"GBC" feature-importance visualization
#NOTE: reuses rf_model.feature_importances_ (see the note in 4.2).
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"GBC":features})#build a DataFrame
importances.sort_values("GBC", ascending = False, inplace=True)#sort
GBC_best_features = list(importances[importances.GBC > 0.03].index)
importances.plot.bar()#plot
print("GBC_best_features:",GBC_best_features, len(GBC_best_features))#print the selected features
GBC_best_features: ['salary', 'age', 'credit'] 3

[Figure: feature-importance bar chart (the reused Random Forest values, labelled GBC)]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance.

4.8 Stochastic Gradient Descent

#Build a Stochastic Gradient Descent model SGD_model
SGD_model = SGDClassifier(alpha=0.01, penalty='elasticnet', loss='hinge')
#Fit the model to the training data
SGD_model.fit(X_train, y_train)

SGD_train_score = SGD_model.score(X_train, y_train)#training accuracy
SGD_accuracy = SGD_model.score(X_test, y_test)#test accuracy

#Print the results
print("Train: {:.2f} %".format(SGD_train_score*100))
print("Test: {:.2f} %".format(SGD_accuracy*100))
print('Overfit: {:.2f} %'.format((SGD_train_score-SGD_accuracy)*100))
Train: 39.84 %
Test: 39.82 %
Overfit: 0.02 %
#"SGD" feature-importance visualization
#NOTE: reuses rf_model.feature_importances_ (see the note in 4.2).
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance

importances = pd.DataFrame({"SGD":features})#build a DataFrame
importances.sort_values("SGD", ascending = False, inplace=True)#sort
SGD_best_features = list(importances[importances.SGD > 0.03].index)
importances.plot.bar()#plot
print("SGD_best_features:",SGD_best_features, len(SGD_best_features))#print the selected features
SGD_best_features: ['salary', 'age', 'credit'] 3

[Figure: feature-importance bar chart (the reused Random Forest values, labelled SGD)]

  • From the importance analysis and the bar chart:
    'salary' and 'age' stand out as important; no other feature shows notable importance.

4.9 The features with the greatest influence in each model

L = min(len(RF_best_features), len(ADA_best_features), len(KNN_best_features), len(LR_best_features), len(SVM_best_features), 
        len(ETC_best_features), len(GBC_best_features), len(SGD_best_features))

TF = pd.DataFrame({"ADA":ADA_best_features[:L], "KNN": KNN_best_features[:L], "LR": LR_best_features[:L],
                  "SVM":SVM_best_features[:L],  "RF":RF_best_features[:L],
                  "ETC":ETC_best_features[:L], "GBC":GBC_best_features[:L], "SGD":SGD_best_features[:L]} )
TF
      ADA     KNN      LR     SVM      RF     ETC     GBC     SGD
0  salary  salary  salary  salary  salary  salary  salary  salary
1     age     age     age     age     age     age     age     age
2  credit  credit  credit  credit  credit  credit  credit  credit
  • In summary: the two features 'salary' and 'age' carry strong importance and heavily influence the result. (The columns are identical because, as noted in 4.2, every section reused the Random Forest ranking.)

5. Accuracy Summary

print("Accuracy Scores:")
print("==========================================================")
print("RandomForest: {:.3f}".format(rf_accuracy))
print("SVM classifier: {:.3f}".format(SVM_accuracy))
print("LR classifier: {:.3f}".format(LR_accuracy))
print("KNN classifier: {:.3f}".format(KNN_accuracy))
print("ADA Boost classifier: {:.3f}".format(ADA_accuracy))
print("Extra Tree classifier: {:.3f}".format(ETC_accuracy))
print("Gradient Boosting classifier: {:.3f}".format(GBC_accuracy))
print("Stochastic Gradient descent: {:.3f}".format(SGD_accuracy))
print("==========================================================")
Accuracy Scores:
==========================================================
RandomForest: 0.854
SVM classifier: 0.603
LR classifier: 0.576
KNN classifier: 0.696
ADA Boost classifier: 0.916
Extra Tree classifier: 0.928
Gradient Boosting classifier: 0.911
Stochastic Gradient descent: 0.398
==========================================================

To keep the final result reliable, the four models with accuracy above 0.8 are selected for the final vote. Note that all of these scores come from the single 70/30 split made in 3.2; a cross-validated check is sketched below.
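
cross_val_score was imported in 1.1.2 but never used; as a sketch (the 5-fold choice is an assumption), it would give a more stable accuracy estimate for a chosen model than the single split above:

# Hypothetical: 5-fold cross-validated accuracy for the Extra Trees model.
cv_scores = cross_val_score(ETC_model, X, y, cv=5, scoring='accuracy')
print("ETC CV accuracy: {:.3f} +/- {:.3f}".format(cv_scores.mean(), cv_scores.std()))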

6. Predicting the Final Assigned Dataset

6.1 Predict the assigned final_data set with the models

rf_predictions = rf_model.predict(final_data) 
ada_predictions = ADA_model.predict(final_data)
etc_predictions = ETC_model.predict(final_data)
gbc_predictions = GBC_model.predict(final_data)

6.2 Equal-weight multi-model voting

def vote(votes):
    weight_dict = {'RF':1,"ADA":1,"ETC":1,"GBC":1,}# every weight is 1
    weights = np.array(list(weight_dict.values()))# weight vector taken from the dict values
    sw = weights.sum()# total weight
    v = [v * weights[i] for i,v in enumerate(votes)]
    return sum(v)/ sw
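
A quick check of the function on hypothetical inputs: with four equal weights, vote([1, 0, 1, 1]) returns 0.75, which np.rint rounds to brand 1; a 2-2 split returns exactly 0.5, and because np.rint rounds halves to the nearest even integer it becomes brand 0.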

#Store the model predictions and the attribute columns in one DataFrame
ALL_PREDICTIONS = pd.DataFrame({'RF':rf_predictions,"ADA":ada_predictions,"ETC":etc_predictions, 
                                "GBC":gbc_predictions,
                               'salary':T_salary,'age':T_age,'elevel':T_elevel,
                                'car':T_car,'zipcode':T_zipcode,'credit':T_credit})
clfs = ['RF',"ADA","ETC","GBC"]

# Combine the predictions through the voting mechanism
ALL_PREDICTIONS['Vote'] = ALL_PREDICTIONS[clfs].apply(lambda row: vote(row), axis = 1)
ALL_PREDICTIONS['Predict'] = ALL_PREDICTIONS.Vote.apply(lambda row: int(np.rint(row)))

# The final voted result is named vc_predictions
vc_predictions = ALL_PREDICTIONS.Predict
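
The hand-rolled vote above does the same job as scikit-learn's VotingClassifier, which was imported in 1.1.2 and never used. A sketch of the built-in equivalent, assuming the same four model configurations:

# Hypothetical: the built-in counterpart of the manual equal-weight vote.
vc = VotingClassifier(estimators=[('rf', rf_model), ('ada', ADA_model),
                                  ('etc', ETC_model), ('gbc', GBC_model)],
                      voting='hard')
vc.fit(X_train, y_train)
print("VotingClassifier test accuracy: {:.3f}".format(vc.score(X_test, y_test)))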

6.3 Check the models on the training data

#Predict on the training features
rf_train = rf_model.predict(X)  
ada_train = ADA_model.predict(X)
etc_train = ETC_model.predict(X)
gbc_train = GBC_model.predict(X)

#Vote on the training predictions
TRAIN_PREDICTIONS = pd.DataFrame({'brand':train_data.brand,'salary':_salary,'age':_age,
                                  'elevel': _elevel,'car':_car,'zipcode':_zipcode,'credit':_credit,
                                  'RF':rf_train,"ADA":ada_train,"ETC":etc_train, "GBC":gbc_train})
TRAIN_PREDICTIONS['Vote'] = TRAIN_PREDICTIONS[clfs].apply(lambda row: vote(row), axis = 1)
TRAIN_PREDICTIONS['VC'] = TRAIN_PREDICTIONS.Vote.apply(lambda row: int(np.rint(row+0.01)))# the +0.01 breaks 2-2 ties upward, toward brand 1
clfs = ['RF',"ADA","ETC","GBC"]

#Collect the rows the vote got wrong
wrong = TRAIN_PREDICTIONS[TRAIN_PREDICTIONS.brand != TRAIN_PREDICTIONS.VC]
print(len(wrong))
593
#For each model, count how many of the mis-voted rows it classified correctly
scores = {}
for c in clfs:
    scores[c] = 0

for i in wrong.index:
    s = TRAIN_PREDICTIONS.loc[i,'brand']# true label; the index is the default RangeIndex
    for c in clfs:
        if TRAIN_PREDICTIONS.loc[i,c] == s:
            scores[c] += 1
    
scores
{'RF': 107, 'ADA': 89, 'ETC': 52, 'GBC': 88}

On the 593 rows the vote misses, no single model dominates: RF alone is right on 107 of them, ADA on 89, GBC on 88, and ETC on 52.

6.4 Score the training results and plot them

# Accuracy of each model, and of the vote, on the training set
train_scores = {}
for clf in [*clfs, 'VC']:
    train_scores[clf] = [len(TRAIN_PREDICTIONS[TRAIN_PREDICTIONS.brand == TRAIN_PREDICTIONS[clf]]) / TRAIN_PREDICTIONS.shape[0]]

TRAIN_SCORES = pd.DataFrame(train_scores)
TRAIN_SCORES
         RF       ADA       ETC       GBC        VC
0  0.876676  0.922111  0.937966  0.934561  0.936901
TRAIN_SCORES.plot.bar() # bar chart of the four models and the voted result
plt.xlabel(u"Each model and its overall voting results")
Text(0.5, 0, 'Each model and its overall voting results')

[Figure: bar chart of training accuracy for each model and for the vote]

Voting performs better than the average of the individual models.

7. Output

csv_data = pd.read_csv(r'35.csv', low_memory = False)# low_memory=False avoids the mixed-dtype warning
csv_df = pd.DataFrame(csv_data)# build a new DataFrame
csv_df['brand'] = vc_predictions # fill in the predicted brand
csv_df.to_csv('my_submission.csv',index = None) # write out my_submission.csv
  • The predictions for 35.csv are successfully written to my_submission.csv.

8. References

  • https://www.kaggle.com/lovroselic/titanic-ls
  • https://www.kaggle.com/madivens/code (my original Titanic notebook)
  • https://www.kaggle.com/c/titanic/discussion/285454