1. Imports and Component Overview
1.1 Importing Libraries and Related Attributes
1.1.1 Core libraries for the main pipeline
# Base libraries
%matplotlib inline
import numpy as np
from numpy.random import seed
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Scientific computing
from scipy import stats
from scipy.stats import norm
# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import sklearn
1.1.2 Importing the models and evaluation tools
# SVC
from sklearn.svm import SVC
# ensemble
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
# Logistic
from sklearn.linear_model import LogisticRegression
# SGD
from sklearn.linear_model import SGDClassifier
# KNN
from sklearn.neighbors import KNeighborsClassifier
# model selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
# Vote
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
- Eight model families are covered: Random Forest, SVM (support vector machine), Logistic regression, KNN (k-nearest neighbors), AdaBoost, Extra Trees, Gradient Boosting, and SGD (stochastic gradient descent).
- The more reliable of these models are then combined through a voting scheme so that the final result carries some robustness.
1.2 Loading the Provided Datasets and Understanding Their Composition
1.2.1 Load the data and add an index column
# Load the data
train_data = pd.read_csv(r'train.csv')
test_data = pd.read_csv(r'test.csv')
final_data = pd.read_csv(r'35.csv')
# Drop the brand column so its empty cells cannot raise errors later
final_data = final_data.drop('brand', axis=1)
# Inspect the shape of each dataset
print('Training data shape:', train_data.shape)
print('Prediction data shape:', test_data.shape)
# Add a 1-based index column named aindex to each set
x1 = np.arange(1, train_data.shape[0] + 1)
x2 = np.arange(1, test_data.shape[0] + 1)
train_data.insert(0, 'aindex', x1)
test_data.insert(0, 'aindex', x2)
Training data shape: (9398, 7)
Prediction data shape: (4500, 7)
- The training set train has 9398 rows, each with 7 attributes.
- The prediction set test has 4500 rows, each with 7 attributes.
1.2.2 Inspect the training and prediction datasets
# Back up train_data for later comparison
data_Age = train_data.copy()
# Label the two sets and stack them into one frame
aindex = test_data['aindex']
test_data['brand'] = -1
train_data['Set'] = "Train"
test_data['Set'] = "Test"
DATA = pd.concat([train_data, test_data])  # DataFrame.append was removed in pandas 2.0
DATA.reset_index(inplace=True)
1.2.3 Show the first five rows of each dataset
display(train_data.head())
test_data.head()
aindex | salary | age | elevel | car | zipcode | credit | brand | Set | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 119806.54480 | 45 | 0 | 14 | 4 | 442037.71130 | 0 | Train |
1 | 2 | 78020.75094 | 23 | 0 | 15 | 2 | 48795.32279 | 0 | Train |
2 | 3 | 50873.61880 | 20 | 3 | 14 | 4 | 352951.49770 | 0 | Train |
3 | 4 | 72298.80402 | 29 | 4 | 17 | 0 | 276298.69520 | 0 | Train |
4 | 5 | 128999.93560 | 52 | 1 | 6 | 0 | 152232.50980 | 0 | Train |
aindex | salary | age | elevel | car | zipcode | credit | brand | Set | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 28662.39571 | 64 | 4 | 13 | 8 | 118241.0303 | -1 | Test |
1 | 2 | 68256.01678 | 51 | 3 | 11 | 7 | 307741.8081 | -1 | Test |
2 | 3 | 130235.44560 | 76 | 1 | 3 | 0 | 27372.1500 | -1 | Test |
3 | 4 | 88149.88200 | 66 | 4 | 11 | 4 | 440103.6174 | -1 | Test |
4 | 5 | 90778.18681 | 34 | 4 | 14 | 3 | 0.0000 | -1 | Test |
- The rows now carry Train/Test labels in the Set column and include the aindex column.
- There are six basic attributes (salary, age, elevel, car, zipcode, credit) that ultimately drive the brand choice.
- In the Test set, brand is set to -1; after modeling, the predictions are compared against the backed-up copy.
1.3 A First Look at the Data
- We start with a quick look at how the six attributes above are distributed.
- Charting the data makes it easier to inspect and reason about.
1.3.1 Visualizing the attributes
import matplotlib.pyplot as plt
# The six blocks below visualize the six basic attributes.
# ↓↓ Each block follows the same pattern ↓↓
# plt.subplot2grid((11,5),(0,0),colspan=2,rowspan=2)
#   lay the six plots out on an 11x5 grid, each spanning a 2x2 block of cells
# sns.distplot(train_data['salary'])
#   histogram plus kernel density of the named column
#   (distplot is deprecated in recent seaborn; histplot(..., kde=True) replaces it)
# train_data.elevel.value_counts().plot(kind='bar')
#   bar chart of the value counts of the named column
# plt.xlabel(u"car")
#   set the x-axis label
plt.subplot2grid((11,5),(0,0),colspan=2,rowspan=2)
sns.distplot(train_data['salary'])
plt.xlabel(u"salary")
plt.subplot2grid((11,5),(0,3),colspan=2,rowspan=2)
sns.distplot(train_data['age'])
plt.xlabel(u"age")
plt.subplot2grid((11,5),(4,0),colspan=2,rowspan=2)
train_data.elevel.value_counts().plot(kind='bar')
plt.xlabel(u"elevel")
plt.subplot2grid((11,5),(4,3),colspan=2,rowspan=2)
sns.distplot(train_data['car'])
plt.xlabel(u"car")
plt.subplot2grid((11,5),(8,0),colspan=2,rowspan=2)
train_data.zipcode.value_counts().plot(kind='bar')
plt.xlabel(u"zipcode")
plt.subplot2grid((11,5),(8,3),colspan=2,rowspan=2)
sns.distplot(train_data['credit'])
plt.xlabel(u"credit")
plt.show()
- Most attributes show no pronounced distribution differences; each variable is spread fairly evenly.
- salary: kernel density roughly stable around 1e-5 over 20000-150000;
- age: density dips to about 0.01 near 30, 50, 60 and 70 and sits near 0.02 elsewhere, so the overall spread is small;
- elevel: every level has a count close to 2000, with little variation;
- car: every value has a density near 0.05, with little variation;
- zipcode: every value has a count close to 1000, with little variation;
- credit: kernel density roughly stable around 2e-6 over 0-500000.
- A separate check in Excel showed that both salary and credit have unusually high value density at the two ends of their ranges.
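The Excel observation about pile-ups at the ends of salary and credit can also be checked directly in pandas. A minimal sketch on synthetic data (the real train.csv is not bundled here, so the series below is a hypothetical stand-in):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_data['credit']: mostly uniform on (0, 500000)
# with extra mass pinned at both boundary values.
rng = np.random.default_rng(0)
credit = pd.Series(np.concatenate([
    rng.uniform(0, 500_000, 9_000),
    np.zeros(200),              # pile-up at the lower end
    np.full(200, 500_000.0),    # pile-up at the upper end
]))

# Fraction of rows sitting exactly on the boundaries -- far more than a
# purely uniform distribution would produce.
edge_share = ((credit == 0) | (credit == 500_000)).mean()
print(f"boundary share: {edge_share:.4f}")  # 400 / 9400 ≈ 0.0426
```

The same boundary test on the real columns would confirm or refute the Excel finding.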
train_data.brand.value_counts().plot(kind='bar')
plt.xlabel(u"brand")
- brand choices in the training set:
- close to 6000 rows choose 1;
- close to 4000 rows choose 0.
Some basic facts emerge from the tables and charts:
- 1. Judged by magnitude, salary and credit vary widely across individuals.
- 2. age, elevel, car, and zipcode also differ with each person's background.
2. Checking Completeness and Overall Features
- First check whether the given data contain missing values; if so, fill them to keep the feature analysis accurate.
- Then analyze the overall characteristics of the data.
2.1 Checking whether the data are complete
print(DATA.isnull().sum())
print(final_data.isnull().sum())
index 0
aindex 0
salary 0
age 0
elevel 0
car 0
zipcode 0
credit 0
brand 0
Set 0
dtype: int64
salary 0
age 0
elevel 0
car 0
zipcode 0
credit 0
dtype: int64
- DATA is the combined train and test set; it reports no missing values.
- final_data is the set to be filled in later; it also reports no missing values.
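Since both frames are complete, no filling is needed here. For reference, a minimal sketch of the fallback that the imported (but unused) SimpleImputer would provide; the small frame below is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with gaps, standing in for a CSV that did have holes.
df = pd.DataFrame({"salary": [50_000.0, np.nan, 70_000.0],
                   "age": [30.0, 40.0, np.nan]})

imputer = SimpleImputer(strategy="median")  # column-wise median fill
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.isnull().sum().sum())  # 0 gaps remain
```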
2.2 Other data characteristics
ax = sns.heatmap(DATA[DATA.Set == 'Train'][['salary','age','elevel','car','zipcode','credit','brand']].corr(),annot=True, fmt = '.3f', cmap = 'coolwarm');
ax.set_title("brand data features");
- In the feature correlation table, only salary shows a weak positive correlation with brand; no other feature correlates noticeably.
- The detailed picture is left to the full feature analysis that follows.
3. Dropping and Encoding Features
3.1 Saving the features for later review
# Training-set attribute columns
_salary = DATA[DATA.Set == 'Train'].salary
_age = DATA[DATA.Set == 'Train'].age
_elevel = DATA[DATA.Set == 'Train'].elevel
_car = DATA[DATA.Set == 'Train'].car
_zipcode = DATA[DATA.Set == 'Train'].zipcode
_credit = DATA[DATA.Set == 'Train'].credit
# Prediction-set attribute columns
T_salary = final_data.salary
T_age = final_data.age
T_elevel = final_data.elevel
T_car = final_data.car
T_zipcode = final_data.zipcode
T_credit = final_data.credit
3.2 Splitting training and test sets
TRAIN = DATA[DATA.Set == 'Train']  # training portion
TEST = DATA[DATA.Set == 'Test']  # prediction portion
aindex = TEST.aindex.to_list()  # keep the row index as a list
# Keep only the six feature columns
TEST = TEST.drop(['aindex','Set','index','brand'], axis = 1)
X = TRAIN.drop(['aindex','brand','Set','index'], axis=1)
# Split the labeled data into training and held-out test sets
y = TRAIN.brand
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
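The stratify=y argument keeps the 0/1 ratio of brand identical in both halves of the split; a quick check on a toy label vector:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 40% class 0, 60% class 1, mimicking an imbalanced target.
y_toy = np.array([0] * 40 + [1] * 60)
X_toy = np.arange(100).reshape(-1, 1)

_, _, ytr_toy, yte_toy = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=1, stratify=y_toy)
print(ytr_toy.mean(), yte_toy.mean())  # both 0.6: the class ratio survives the split
```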
4. Model Implementations and Important Features
4.1 Random Forest
# Build a random forest model rf_model
# (max_features='auto' means 'sqrt' here; the alias was removed in scikit-learn 1.3)
rf_model = RandomForestClassifier(max_depth=6, n_estimators=50, max_features='auto')
# Fit on the training data
rf_model.fit(X_train, y_train)
rf_train_score = rf_model.score(X_train, y_train)  # training accuracy
rf_accuracy = rf_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(rf_train_score * 100))
print("Test: {:.2f} %".format(rf_accuracy * 100))
print('Overfit: {:.2f} %'.format((rf_train_score - rf_accuracy) * 100))
Train: 88.63 %
Test: 85.43 %
Overfit: 3.20 %
# Visualize the Random Forest feature importances
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"RF": features})  # build a DataFrame
importances.sort_values("RF", ascending=False, inplace=True)  # sort by importance
RF_best_features = list(importances[importances.RF > 0.03].index)
importances.plot.bar()  # plot
print("RF_best_features:", RF_best_features, len(RF_best_features))
plt.show()
RF_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart:
'salary' and 'age' show strong importance; 'credit' only just clears the 0.03 cutoff, and no other feature stands out clearly.
4.2 Support Vector Machine
# Build a support vector machine model SVM_model
SVM_model = SVC(C=100, gamma=0.001, kernel='rbf')
# Fit on the training data
SVM_model.fit(X_train, y_train)
svm_train_score = SVM_model.score(X_train, y_train)  # training accuracy
SVM_accuracy = SVM_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(svm_train_score * 100))
print("Test: {:.2f} %".format(SVM_accuracy * 100))
print('Overfit: {:.2f} %'.format((svm_train_score - SVM_accuracy) * 100))
Train: 100.00 %
Test: 60.28 %
Overfit: 39.72 %
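The 100% train / 60% test gap suggests the unscaled features are the culprit: with an RBF kernel, the 1e5-scale salary and credit columns dominate every distance computation. A hedged sketch of the fix using the StandardScaler already imported above (the dataset here is synthetic, not train.csv):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Six synthetic features; blow two of them up to salary/credit-like scales.
X_s, y_s = make_classification(n_samples=1000, n_features=6, random_state=1)
X_s[:, 0] *= 100_000
X_s[:, 1] *= 500_000
Xtr, Xte, ytr, yte = train_test_split(X_s, y_s, random_state=1, stratify=y_s)

raw = SVC(C=100, gamma=0.001, kernel='rbf').fit(Xtr, ytr)
scaled = make_pipeline(StandardScaler(),
                       SVC(C=100, gamma=0.001, kernel='rbf')).fit(Xtr, ytr)
print(raw.score(Xte, yte), scaled.score(Xte, yte))
```

On such data the scaled pipeline recovers most of the accuracy the raw SVC loses; the same standardization would likely help the SVM (and SGD) results above.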
# Feature importance chart for the SVM section
# Note: SVC exposes no feature_importances_, so the random-forest importances
# are reused here as a proxy -- which is why every section reports the same ranking.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"SVM": features})  # build a DataFrame
importances.sort_values("SVM", ascending=False, inplace=True)  # sort by importance
SVM_best_features = list(importances[importances.SVM > 0.03].index)
importances.plot.bar()  # plot
print("SVM_best_features:", SVM_best_features, len(SVM_best_features))
SVM_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart (random-forest importances reused as a proxy):
'salary' and 'age' show strong importance; no other feature stands out clearly.
4.3 Logistic Regression
# Build a logistic regression model LR_model
LR_model = LogisticRegression(solver='liblinear', C=2.78, penalty='l2')
# Fit on the training data
LR_model.fit(X_train, y_train)
LR_train_score = LR_model.score(X_train, y_train.astype('int'))  # training accuracy
LR_accuracy = LR_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(LR_train_score * 100))
print("Test: {:.2f} %".format(LR_accuracy * 100))
print('Overfit: {:.2f} %'.format((LR_train_score - LR_accuracy) * 100))
Train: 57.43 %
Test: 57.62 %
Overfit: -0.19 %
# Feature importance chart for the logistic regression section
# Note: the random-forest importances are reused here as a proxy (see 4.2).
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"LR": features})  # build a DataFrame
importances.sort_values("LR", ascending=False, inplace=True)  # sort by importance
LR_best_features = list(importances[importances.LR > 0.03].index)
importances.plot.bar()  # plot
print("LR_best_features:", LR_best_features, len(LR_best_features))
LR_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart (random-forest importances reused as a proxy):
'salary' and 'age' show strong importance; no other feature stands out clearly.
4.4 K-Nearest Neighbors
# Build a k-nearest-neighbors model KNN_model
KNN_model = KNeighborsClassifier(n_neighbors=11, metric='euclidean', weights='uniform')
# Fit on the training data
KNN_model.fit(X_train, y_train)
KNN_train_score = KNN_model.score(X_train, y_train)  # training accuracy
KNN_accuracy = KNN_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(KNN_train_score * 100))
print("Test: {:.2f} %".format(KNN_accuracy * 100))
print('Overfit: {:.2f} %'.format((KNN_train_score - KNN_accuracy) * 100))
Train: 75.49 %
Test: 69.61 %
Overfit: 5.88 %
# Feature importance chart for the KNN section
# Note: the random-forest importances are reused here as a proxy (see 4.2).
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"KNN": features})  # build a DataFrame
importances.sort_values("KNN", ascending=False, inplace=True)  # sort by importance
KNN_best_features = list(importances[importances.KNN > 0.03].index)
importances.plot.bar()  # plot
print("KNN_best_features:", KNN_best_features, len(KNN_best_features))
KNN_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart (random-forest importances reused as a proxy):
'salary' and 'age' show strong importance; no other feature stands out clearly.
4.5 AdaBoost
# Build an AdaBoost model ADA_model
# (base_estimator was renamed to estimator in scikit-learn 1.2 and removed in 1.4)
ADA_model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5, min_samples_leaf=10), n_estimators=200, learning_rate=0.001)
# Fit on the training data
ADA_model.fit(X_train, y_train)
ADA_train_score = ADA_model.score(X_train, y_train)  # training accuracy
ADA_accuracy = ADA_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(ADA_train_score * 100))
print("Test: {:.2f} %".format(ADA_accuracy * 100))
print('Overfit: {:.2f} %'.format((ADA_train_score - ADA_accuracy) * 100))
Train: 92.47 %
Test: 91.60 %
Overfit: 0.88 %
# Feature importance chart for the AdaBoost section
# Note: rf_model's importances are reused here (see 4.2);
# ADA_model.feature_importances_ would give this model's own ranking.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"ADA": features})  # build a DataFrame
importances.sort_values("ADA", ascending=False, inplace=True)  # sort by importance
ADA_best_features = list(importances[importances.ADA > 0.03].index)
importances.plot.bar()  # plot
print("ADA_best_features:", ADA_best_features, len(ADA_best_features))
ADA_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart (random-forest importances reused as a proxy):
'salary' and 'age' show strong importance; no other feature stands out clearly.
4.6 Extra Trees
# Build an extra-trees model ETC_model
ETC_model = ExtraTreesClassifier(max_features=4, min_samples_leaf=10, n_estimators=300, min_samples_split=3)
# Fit on the training data
ETC_model.fit(X_train, y_train)
ETC_train_score = ETC_model.score(X_train, y_train)  # training accuracy
ETC_accuracy = ETC_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(ETC_train_score * 100))
print("Test: {:.2f} %".format(ETC_accuracy * 100))
print('Overfit: {:.2f} %'.format((ETC_train_score - ETC_accuracy) * 100))
Train: 94.22 %
Test: 92.80 %
Overfit: 1.42 %
# Feature importance chart for the Extra Trees section
# Note: rf_model's importances are reused here (see 4.2);
# ETC_model.feature_importances_ would give this model's own ranking.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"ETC": features})  # build a DataFrame
importances.sort_values("ETC", ascending=False, inplace=True)  # sort by importance
ETC_best_features = list(importances[importances.ETC > 0.03].index)
importances.plot.bar()  # plot
print("ETC_best_features:", ETC_best_features, len(ETC_best_features))
ETC_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart (random-forest importances reused as a proxy):
'salary' and 'age' show strong importance; no other feature stands out clearly.
4.7 Gradient Boosting
# Build a gradient boosting model GBC_model
GBC_model = GradientBoostingClassifier(max_depth=4, max_features=0.3, min_samples_leaf=100, n_estimators=300)
# Fit on the training data
GBC_model.fit(X_train, y_train)
GBC_train_score = GBC_model.score(X_train, y_train)  # training accuracy
GBC_accuracy = GBC_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(GBC_train_score * 100))
print("Test: {:.2f} %".format(GBC_accuracy * 100))
print('Overfit: {:.2f} %'.format((GBC_train_score - GBC_accuracy) * 100))
Train: 94.47 %
Test: 91.10 %
Overfit: 3.37 %
# Feature importance chart for the gradient boosting section
# Note: rf_model's importances are reused here (see 4.2);
# GBC_model.feature_importances_ would give this model's own ranking.
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"GBC": features})  # build a DataFrame
importances.sort_values("GBC", ascending=False, inplace=True)  # sort by importance
GBC_best_features = list(importances[importances.GBC > 0.03].index)
importances.plot.bar()  # plot
print("GBC_best_features:", GBC_best_features, len(GBC_best_features))
GBC_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart (random-forest importances reused as a proxy):
'salary' and 'age' show strong importance; no other feature stands out clearly.
4.8 Stochastic Gradient Descent
# Build a stochastic gradient descent model SGD_model
SGD_model = SGDClassifier(alpha=0.01, penalty='elasticnet', loss='hinge')
# Fit on the training data
SGD_model.fit(X_train, y_train)
SGD_train_score = SGD_model.score(X_train, y_train)  # training accuracy
SGD_accuracy = SGD_model.score(X_test, y_test)  # held-out accuracy
# Print the results
print("Train: {:.2f} %".format(SGD_train_score * 100))
print("Test: {:.2f} %".format(SGD_accuracy * 100))
print('Overfit: {:.2f} %'.format((SGD_train_score - SGD_accuracy) * 100))
Train: 39.84 %
Test: 39.82 %
Overfit: 0.02 %
# Feature importance chart for the SGD section
# Note: the random-forest importances are reused here as a proxy (see 4.2).
features = {}
for feature, importance in zip(X_train.columns, rf_model.feature_importances_):
    features[feature] = importance
importances = pd.DataFrame({"SGD": features})  # build a DataFrame
importances.sort_values("SGD", ascending=False, inplace=True)  # sort by importance
SGD_best_features = list(importances[importances.SGD > 0.03].index)
importances.plot.bar()  # plot
print("SGD_best_features:", SGD_best_features, len(SGD_best_features))
SGD_best_features: ['salary', 'age', 'credit'] 3
- From the importance analysis and bar chart (random-forest importances reused as a proxy):
'salary' and 'age' show strong importance; no other feature stands out clearly.
4.9 The most influential features across the models
L = min(len(RF_best_features), len(ADA_best_features), len(KNN_best_features), len(LR_best_features), len(SVM_best_features),
        len(ETC_best_features), len(GBC_best_features), len(SGD_best_features))
TF = pd.DataFrame({"ADA": ADA_best_features[:L], "KNN": KNN_best_features[:L], "LR": LR_best_features[:L],
                   "SVM": SVM_best_features[:L], "RF": RF_best_features[:L],
                   "ETC": ETC_best_features[:L], "GBC": GBC_best_features[:L], "SGD": SGD_best_features[:L]})
TF
ADA | KNN | LR | SVM | RF | ETC | GBC | SGD | |
---|---|---|---|---|---|---|---|---|
0 | salary | salary | salary | salary | salary | salary | salary | salary |
1 | age | age | age | age | age | age | age | age |
2 | credit | credit | credit | credit | credit | credit | credit | credit |
- In summary: the two features 'salary' and 'age' carry strong importance and heavily influence the outcome.
5. Summary of Model Accuracies
print("Accuracy Scores:")
print("==========================================================")
print("RandomForest: {:.3f}".format(rf_accuracy))
print("SVM classifier: {:.3f}".format(SVM_accuracy))
print("LR classifier: {:.3f}".format(LR_accuracy))
print("KNN classifier: {:.3f}".format(KNN_accuracy))
print("ADA Boost classifier: {:.3f}".format(ADA_accuracy))
print("Extra Tree classifier: {:.3f}".format(ETC_accuracy))
print("Gradient Boosting classifier: {:.3f}".format(GBC_accuracy))
print("Stochastic Gradient descent: {:.3f}".format(SGD_accuracy))
print("==========================================================")
Accuracy Scores:
==========================================================
RandomForest: 0.854
SVM classifier: 0.603
LR classifier: 0.576
KNN classifier: 0.696
ADA Boost classifier: 0.916
Extra Tree classifier: 0.928
Gradient Boosting classifier: 0.911
Stochastic Gradient descent: 0.398
==========================================================
To keep the final result reliable, the four models with accuracy above 0.8 are selected for the final vote.
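The same idea can also be expressed with the VotingClassifier imported in 1.1.2. A minimal hard-voting sketch over the four selected model families, on synthetic data and near-default hyperparameters (not the tuned ones above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

X_v, y_v = make_classification(n_samples=600, n_features=6, random_state=1)
Xtr_v, Xte_v, ytr_v, yte_v = train_test_split(X_v, y_v, random_state=1, stratify=y_v)

# voting='hard' takes the majority class over the member predictions,
# mirroring the equal-weight manual vote used in section 6.
vc = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=1)),
                ('ada', AdaBoostClassifier(n_estimators=50, random_state=1)),
                ('etc', ExtraTreesClassifier(n_estimators=50, random_state=1)),
                ('gbc', GradientBoostingClassifier(n_estimators=50, random_state=1))],
    voting='hard').fit(Xtr_v, ytr_v)
print(f"{vc.score(Xte_v, yte_v):.3f}")
```

One reason to keep the manual vote below instead: it exposes the raw vote fraction per row, which the hard-voting classifier hides.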
6. Predicting the Final ("Group 35") Data
6.1 Predicting the final_data set with the selected models
rf_predictions = rf_model.predict(final_data)
ada_predictions = ADA_model.predict(final_data)
etc_predictions = ETC_model.predict(final_data)
gbc_predictions = GBC_model.predict(final_data)
6.2 Equal-weight multi-model voting
def vote(votes):
    weight_dict = {'RF': 1, "ADA": 1, "ETC": 1, "GBC": 1}  # every model gets weight 1
    weights = np.array(list(weight_dict.values()))  # weight vector
    sw = weights.sum()  # total weight
    v = [p * weights[i] for i, p in enumerate(votes)]  # weighted predictions
    return sum(v) / sw
# DataFrame holding each model's predictions alongside the feature columns
ALL_PREDICTIONS = pd.DataFrame({'RF':rf_predictions,"ADA":ada_predictions,"ETC":etc_predictions,
                                "GBC":gbc_predictions,
                                'salary':T_salary,'age':T_age,'elevel':T_elevel,
                                'car':T_car,'zipcode':T_zipcode,'credit':T_credit})
clfs = ['RF',"ADA","ETC","GBC"]
# Combine the predictions through the voting scheme
ALL_PREDICTIONS['Vote'] = ALL_PREDICTIONS[clfs].apply(lambda row: vote(row), axis = 1)
ALL_PREDICTIONS['Predict'] = ALL_PREDICTIONS.Vote.apply(lambda row: int(np.rint(row)))
# The final voted result is vc_predictions
vc_predictions = ALL_PREDICTIONS.Predict
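With all weights equal to 1, vote reduces to a plain mean of the four 0/1 predictions. A self-contained check (the helper is restated so the block runs on its own):

```python
import numpy as np

def vote(votes):
    # Same logic as above: weighted mean of the member predictions.
    weight_dict = {'RF': 1, "ADA": 1, "ETC": 1, "GBC": 1}
    weights = np.array(list(weight_dict.values()))
    return sum(p * weights[i] for i, p in enumerate(votes)) / weights.sum()

print(vote([1, 0, 1, 1]))  # 0.75 -> np.rint gives 1 (3-of-4 majority)
print(vote([1, 1, 0, 0]))  # 0.5  -> np.rint gives 0 (halves round to even)
```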
6.3 Checking the voted model on the training data
# Predictions on the training features
rf_train = rf_model.predict(X)
ada_train = ADA_model.predict(X)
etc_train = ETC_model.predict(X)
gbc_train = GBC_model.predict(X)
# Vote over the training predictions
TRAIN_PREDICTIONS = pd.DataFrame({'brand':train_data.brand,'salary':_salary,'age':_age,
                                  'elevel': _elevel,'car':_car,'zipcode':_zipcode,'credit':_credit,
                                  'RF':rf_train,"ADA":ada_train,"ETC":etc_train, "GBC":gbc_train})
clfs = ['RF',"ADA","ETC","GBC"]
TRAIN_PREDICTIONS['Vote'] = TRAIN_PREDICTIONS[clfs].apply(lambda row: vote(row), axis = 1)
TRAIN_PREDICTIONS['VC'] = TRAIN_PREDICTIONS.Vote.apply(lambda row: int(np.rint(row + 0.01)))  # +0.01 breaks 2-2 ties toward brand 1
# Rows where the vote disagrees with the true label
wrong = TRAIN_PREDICTIONS[TRAIN_PREDICTIONS.brand != TRAIN_PREDICTIONS.VC]
print(len(wrong))
593
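The +0.01 nudge in the VC column above exists because np.rint rounds halves to the nearest even integer, so an untreated 2-2 tie (vote 0.5) would always land on brand 0. A quick demonstration:

```python
import numpy as np

# np.rint uses round-half-to-even, so 0.5 -> 0 and 1.5 -> 2.
print(np.rint(0.5), np.rint(1.5))   # 0.0 2.0
# The small nudge pushes exact ties up to brand 1 instead.
print(np.rint(0.5 + 0.01))          # 1.0
```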
# For the rows the vote got wrong, count how often each single model was right
scores = {}
for c in clfs:
    scores[c] = 0
for i in wrong.index:
    s = TRAIN_PREDICTIONS.loc[i, 'brand']  # the true label
    for c in clfs:
        if TRAIN_PREDICTIONS.loc[i, c] == s:
            scores[c] += 1
scores
{'RF': 107, 'ADA': 89, 'ETC': 52, 'GBC': 88}
6.4 Scoring the training results and plotting
# Training-set accuracy of each model and of the vote
train_scores = {}
for clf in [*clfs, 'VC']:
    train_scores[clf] = [len(TRAIN_PREDICTIONS[TRAIN_PREDICTIONS.brand == TRAIN_PREDICTIONS[clf]]) / TRAIN_PREDICTIONS.shape[0]]
TRAIN_SCORES = pd.DataFrame(train_scores)
TRAIN_SCORES
RF | ADA | ETC | GBC | VC | |
---|---|---|---|---|---|
0 | 0.876676 | 0.922111 | 0.937966 | 0.934561 | 0.936901 |
TRAIN_SCORES.plot.bar()  # bar chart of the four models and the voted result
plt.xlabel(u"Each model and its overall voting results")
Voting (0.9369) beats the average of the four individual models (about 0.918), though Extra Trees alone (0.9380) is marginally higher.
7. Writing the Output
csv_data = pd.read_csv(r'35.csv', low_memory = False)  # low_memory=False avoids a mixed-dtype warning
csv_df = pd.DataFrame(csv_data)  # wrap in a fresh DataFrame
csv_df['brand'] = vc_predictions  # fill in the predicted brand values
csv_df.to_csv('my_submission.csv', index = None)  # write the result file
- The predictions for 35.csv are written out to my_submission.csv.
8. References
- https://www.kaggle.com/lovroselic/titanic-ls
- https://www.kaggle.com/madivens/code (my original Titanic notebook)
- https://www.kaggle.com/c/titanic/discussion/285454