Assignment:
In the second ML assignment you have to compare the performance of three different classification algorithms, namely Naive Bayes, SVM, and Random Forest.
For this assignment you need to generate a random binary classification problem, and then train and test (using 10-fold cross validation) the three algorithms. For some algorithms inner cross validation (5-fold) for choosing the parameters is needed. Then, show the classification performance (per-fold and averaged) in the report, and briefly discuss the results.
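In short:
Compare three classifiers (Naive Bayes, SVM, and random forest) on a randomly generated binary classification problem.
Evaluate each with 10-fold cross validation, use an inner 5-fold cross validation to choose parameters where an algorithm needs it, then report the per-fold and average scores and briefly discuss them.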
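Notes before starting: first install scikit-learn (sklearn) with pip. The code in the course slides is very important for getting started (it is basically a skeleton to build on); for everything else, looking up the relevant functions and tutorials online should be enough.
The code is as follows: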
from sklearn import datasets
from sklearn import metrics
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
# KFold now lives in sklearn.model_selection.
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np


# Train Gaussian Naive Bayes on the training data and predict labels for X_test.
def NB(X_train, y_train, X_test):
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    return clf.predict(X_test)


# Train an RBF-kernel SVM with the given C and predict labels for X_test.
def rbf_svm(X_train, y_train, X_test, C):
    clf = SVC(C=C, kernel='rbf')
    clf.fit(X_train, y_train)
    return clf.predict(X_test)


# Train a random forest with n_estimator trees and predict labels for X_test.
def RFC(X_train, y_train, X_test, n_estimator):
    clf = RandomForestClassifier(n_estimators=n_estimator)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)


acc_for_NB = []   # accuracy per outer fold
acc_for_SVC = []
acc_for_RFC = []
f1_for_NB = []    # F1 score per outer fold
f1_for_SVC = []
f1_for_RFC = []
auc_for_NB = []   # ROC AUC per outer fold
auc_for_SVC = []
auc_for_RFC = []

# Generate synthetic data X, y: 1000 samples with 10 features, of which
# 2 are informative, 2 are redundant (linear combinations of the informative
# ones) and none are repeated. The values follow the teacher's slides.
# No random_state is set, so every run produces a different dataset.
X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=2, n_redundant=2,
                                    n_repeated=0, n_classes=2)

# Most of what follows can be adapted directly from the course material.
kf = KFold(n_splits=10, shuffle=True, random_state=1234)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    # Naive Bayes has no parameter to tune, so evaluate it directly
    # with all three metrics.
    ipred = NB(X_train, y_train, X_test)
    acc_for_NB.append(metrics.accuracy_score(y_test, ipred))
    f1_for_NB.append(metrics.f1_score(y_test, ipred))
    # Note: roc_auc_score receives hard 0/1 predictions here, not scores.
    auc_for_NB.append(metrics.roc_auc_score(y_test, ipred))

    # SVM: pick the best C with an inner 5-fold cross validation (by mean F1).
    Cvalues = [1e-2, 1e-1, 1e0, 1e1, 1e2]
    innerscore = []
    for C in Cvalues:
        ikf = KFold(n_splits=5, shuffle=True, random_state=5678)
        innerf1 = []
        for t_index, v_index in ikf.split(X_train):
            X_t, X_v = X_train[t_index], X_train[v_index]
            y_t, y_v = y_train[t_index], y_train[v_index]
            ipred = rbf_svm(X_t, y_t, X_v, C)
            innerf1.append(metrics.f1_score(y_v, ipred))
        innerscore.append(sum(innerf1) / len(innerf1))
    bestC = Cvalues[np.argmax(innerscore)]
    SVCpred = rbf_svm(X_train, y_train, X_test, bestC)
    print("The bestC is:", bestC)
    acc_for_SVC.append(metrics.accuracy_score(y_test, SVCpred))
    f1_for_SVC.append(metrics.f1_score(y_test, SVCpred))
    auc_for_SVC.append(metrics.roc_auc_score(y_test, SVCpred))

    # Random forest: pick the best n_estimators the same way.
    n_estimators_values = [1, 10, 100, 1000]
    innerscore = []
    for estimators_value in n_estimators_values:
        ikf = KFold(n_splits=5, shuffle=True, random_state=5678)
        innerf1 = []
        for t_index, v_index in ikf.split(X_train):
            X_t, X_v = X_train[t_index], X_train[v_index]
            y_t, y_v = y_train[t_index], y_train[v_index]
            ipred = RFC(X_t, y_t, X_v, estimators_value)
            innerf1.append(metrics.f1_score(y_v, ipred))
        innerscore.append(sum(innerf1) / len(innerf1))
    best_n_estimators_values = n_estimators_values[np.argmax(innerscore)]
    print("The best_n_estimators_values is:", best_n_estimators_values)
    RFCpred = RFC(X_train, y_train, X_test, best_n_estimators_values)
    acc_for_RFC.append(metrics.accuracy_score(y_test, RFCpred))
    f1_for_RFC.append(metrics.f1_score(y_test, RFCpred))
    auc_for_RFC.append(metrics.roc_auc_score(y_test, RFCpred))

print("Naive Bayes:")
print("Evaluated by accuracy score:")
print(acc_for_NB)
print("Average:", sum(acc_for_NB) / len(acc_for_NB))
print()
print("Evaluated by f1 score:")
print(f1_for_NB)
print("Average:", sum(f1_for_NB) / len(f1_for_NB))
print()
print("Evaluated by roc auc score:")
print(auc_for_NB)
print("Average:", sum(auc_for_NB) / len(auc_for_NB))
print()
print("SVC:")
print("Evaluated by accuracy score:")
print(acc_for_SVC)
print("Average:", sum(acc_for_SVC) / len(acc_for_SVC))
print()
print("Evaluated by f1 score:")
print(f1_for_SVC)
print("Average:", sum(f1_for_SVC) / len(f1_for_SVC))
print()
print("Evaluated by roc auc score:")
print(auc_for_SVC)
print("Average:", sum(auc_for_SVC) / len(auc_for_SVC))
print()
print("Random Forest:")
print("Evaluated by accuracy score:")
print(acc_for_RFC)
print("Average:", sum(acc_for_RFC) / len(acc_for_RFC))
print()
print("Evaluated by f1 score:")
print(f1_for_RFC)
print("Average:", sum(f1_for_RFC) / len(f1_for_RFC))
print()
print("Evaluated by roc auc score:")
print(auc_for_RFC)
print("Average:", sum(auc_for_RFC) / len(auc_for_RFC))
print()
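The hand-written inner loops in the code above can also be expressed with scikit-learn's own helpers. The following is only a sketch, not the code that produced the results in this post: it assumes a scikit-learn version that ships `sklearn.model_selection` (0.18 or later), uses a smaller dataset (300 samples) with a fixed `random_state` so that it runs quickly, and tunes only the SVC.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Smaller, seeded dataset (an assumption made for speed and repeatability).
X, y = make_classification(n_samples=300, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)

outer_cv = KFold(n_splits=10, shuffle=True, random_state=1234)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=5678)

# GridSearchCV runs the inner 5-fold search over C (scored by F1) and
# refits the best setting on the whole training part of each outer fold.
tuned_svc = GridSearchCV(SVC(kernel="rbf"),
                         param_grid={"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]},
                         scoring="f1", cv=inner_cv)

# cross_val_score runs the outer 10-fold loop around the tuned estimator.
outer_f1 = cross_val_score(tuned_svc, X, y, cv=outer_cv, scoring="f1")
print("Per-fold F1:", outer_f1)
print("Average F1:", outer_f1.mean())
```

The same pattern should work for the random forest by swapping in `RandomForestClassifier()` and a grid over `n_estimators`.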
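Brief analysis:
Start with the outer KFold split (10 folds, shuffle=True, random_state=1234):
The data is divided into 10 folds, as the fold count of 10 passed to KFold shows. According to the
official documentation, in each round 10-1=9 folds are used for training and the remaining fold is held out for testing (the validation set). The inner KFold split, ikf (5 folds, shuffle=True, random_state=5678), works the same way on the training part, so I won't go into detail.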
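To make the fold sizes concrete, here is a small pure-Python sketch of k-fold splitting. `kfold_indices` is a hypothetical helper written for this illustration; it mimics the idea behind KFold with shuffling, but it will not reproduce scikit-learn's exact shuffled order.

```python
import random


def kfold_indices(n_samples, n_folds, seed):
    """Return a list of (train_idx, test_idx) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, like shuffle=True
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    splits = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


# Outer split: 1000 samples, 10 folds.
outer = kfold_indices(1000, 10, seed=1234)
train_idx, test_idx = outer[0]
print(len(train_idx), len(test_idx))        # 900 100

# Inner split: the 900 training samples, 5 folds (positions into train_idx).
inner = kfold_indices(len(train_idx), 5, seed=5678)
print(len(inner[0][0]), len(inner[0][1]))   # 720 180
```

So each candidate parameter value is fitted on 720 samples and scored on 180, five times, before the chosen value is refitted on all 900 training samples and tested on the held-out 100.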
Experimental results:
The bestC is: 1.0
The best_n_estimators_values is: 1000
The bestC is: 1.0
The best_n_estimators_values is: 100
The bestC is: 1.0
The best_n_estimators_values is: 100
The bestC is: 1.0
The best_n_estimators_values is: 1000
The bestC is: 1.0
The best_n_estimators_values is: 100
The bestC is: 1.0
The best_n_estimators_values is: 100
The bestC is: 1.0
The best_n_estimators_values is: 1000
The bestC is: 1.0
The best_n_estimators_values is: 1000
The bestC is: 1.0
The best_n_estimators_values is: 1000
The bestC is: 1.0
The best_n_estimators_values is: 1000
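Brief analysis:
The parameter tuned for SVC is C and the one tuned for RFC is n_estimators. Across the 10 outer folds the best C was always 1.0, whereas two different n_estimators values were selected as best: 1000 was chosen 6 times and 100 was chosen 4 times, so 1000 may be slightly better.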
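The tally can be checked with a few lines of Python. The list below is copied by hand from the ten "best_n_estimators_values" log lines above, in order; the variable names are only illustrative.

```python
from collections import Counter

# n_estimators value selected in each of the 10 outer folds (from the log).
chosen = [1000, 100, 100, 1000, 100, 100, 1000, 1000, 1000, 1000]

counts = Counter(chosen)
print(counts[1000], counts[100])   # 6 4
print(counts.most_common(1)[0])    # (1000, 6)
```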
Below are the results for Naive Bayes:
Naive Bayes:
Evaluated by accuracy score:
[0.83, 0.84, 0.95, 0.91, 0.91, 0.93, 0.92, 0.93, 0.87, 0.85]
Average: 0.8939999999999999
Evaluated by f1 score:
[0.8282828282828282, 0.8, 0.9411764705882353, 0.9010989010989011, 0.896551724137931, 0.9213483146067417, 0.923076923076923, 0.9292929292929293, 0.8737864077669903, 0.8543689320388349]
Average: 0.8868983430890314
Evaluated by roc auc score:
[0.8433441558441559, 0.8282828282828284, 0.9444444444444444, 0.9078525641025641, 0.9060606060606061, 0.927133655394525, 0.9252525252525251, 0.9305722288915566, 0.8697478991596639, 0.8627090983272134]
Average: 0.8945400005760084
Naive Bayes scores higher on accuracy and ROC AUC (both about 0.894) than on F1 (about 0.887).
Next, the SVC results:
SVC:
Evaluated by accuracy score:
[0.94, 0.91, 0.98, 0.96, 0.95, 0.93, 0.96, 0.96, 0.92, 0.93]
Average: 0.944
Evaluated by f1 score:
[0.9464285714285714, 0.8988764044943819, 0.9772727272727273, 0.9583333333333334, 0.945054945054945, 0.9230769230769231, 0.9629629629629629, 0.9615384615384616, 0.9245283018867925, 0.9369369369369369]
Average: 0.9435009567986035
Evaluated by roc auc score:
[0.9391233766233766, 0.908080808080808, 0.9777777777777779, 0.9599358974358976, 0.9505050505050505, 0.928743961352657, 0.9616161616161615, 0.9595838335334135, 0.9191676670668267, 0.9328845369237047]
Average: 0.9437419070915676
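Like Naive Bayes, SVC scores higher on accuracy and ROC AUC than on F1, although here the gap is tiny (0.9440 and 0.9437 versus 0.9435).
Finally, random forest: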
Random Forest:
Evaluated by accuracy score:
[0.98, 0.96, 0.99, 0.97, 0.98, 0.98, 0.95, 0.99, 0.94, 0.95]
Average: 0.969
Evaluated by f1 score:
[0.9818181818181818, 0.9555555555555556, 0.9887640449438202, 0.9690721649484536, 0.9777777777777777, 0.9777777777777777, 0.9532710280373831, 0.99009900990099, 0.9444444444444444, 0.9557522123893805]
Average: 0.9694332197593765
Evaluated by roc auc score:
[0.9821428571428572, 0.9595959595959597, 0.9888888888888889, 0.9703525641025641, 0.9797979797979798, 0.9782608695652174, 0.9525252525252524, 0.9901960784313726, 0.9387755102040816, 0.9504283965728273]
Average: 0.9690964356827001
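For random forest all three averages are at or just above 0.969. Unlike the other two classifiers, F1 is this time the highest of the three (0.9694).
Overall, the average scores in this assignment rank, from lowest to highest, Naive Bayes, SVC, random forest, which may indicate that random forest fits this particular dataset best. For the first two classifiers F1 comes out slightly lower than accuracy and ROC AUC, which may reflect that F1 is a stricter measure in some respects.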
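One possible reason the metrics differ is how each one is built from the confusion matrix. F1 ignores true negatives, while accuracy counts them. Because the code above passes hard 0/1 predictions to roc_auc_score rather than scores or probabilities, that ROC AUC reduces to the average of the true positive rate and the true negative rate. The sketch below uses made-up predictions (not data from this experiment) and hypothetical helper names to show the three formulas side by side.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1(tp, fp, fn, tn):
    # Harmonic mean of precision and recall; tn does not appear at all.
    return 2 * tp / (2 * tp + fp + fn)


def auc_from_hard_labels(tp, fp, fn, tn):
    # With a single 0/1 prediction per sample the ROC curve has one
    # interior point, so the area is (TPR + TNR) / 2.
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


# Made-up example: 50 positives and 50 negatives,
# with 10 missed positives and 5 false alarms.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [1] * 5 + [0] * 45

counts = confusion_counts(y_true, y_pred)
print(counts)                                   # (40, 5, 10, 45)
print(round(accuracy(*counts), 4))              # 0.85
print(round(f1(*counts), 4))                    # 0.8421
print(round(auc_from_hard_labels(*counts), 4))  # 0.85
```

In this example the classifier misses more positives than it raises false alarms, so F1 falls below accuracy. With the error pattern reversed, F1 can come out higher, so F1 is not lower in general.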