Assignment:
1. Create a classification dataset (n_samples = 1000, n_features = 10)
2. Split the dataset using 10-fold cross-validation
3. Train the algorithms:
   - GaussianNB
   - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
   - RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4. Evaluate the cross-validated performance:
   - Accuracy
   - F1-score
   - AUC ROC
5. Write a short report summarizing the methodology and the results
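Step 3 lists candidate values for C and n_estimators, which implies some form of hyperparameter selection. One common way to do this is nested cross-validation: an inner grid search picks the parameter, and the outer 10 folds measure performance. The following is a minimal sketch for the SVC only, not the assignment's required method. The 3 inner folds and `random_state=0` are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

# Same dataset shape as the assignment; random_state is an assumption for repeatability
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Inner loop: pick C from the candidate list by grid search (3 inner folds is an assumption)
svc_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]},
    cv=3,
)

# Outer loop: 10-fold cross-validation of the whole "search, then refit" procedure
outer = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    svc_search, X, y, cv=outer, scoring=["accuracy", "f1", "roc_auc"]
)

for name in ("accuracy", "f1", "roc_auc"):
    print(name, scores[f"test_{name}"].mean())
```

The random forest could be handled the same way by swapping in `RandomForestClassifier()` with `param_grid={"n_estimators": [10, 100, 1000]}`, at a noticeably higher run time.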
from sklearn import datasets, model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
X, Y = datasets.make_classification(n_samples=1000, n_features=10)
kf = model_selection.KFold(n_splits=10, shuffle=True)

acc_for_NB = []  # per-fold accuracy for the three algorithms
acc_for_SVC = []
acc_for_RFC = []
f1_for_NB = []  # per-fold F1-score for the three algorithms
f1_for_SVC = []
f1_for_RFC = []
auc_for_NB = []  # per-fold AUC ROC for the three algorithms
auc_for_SVC = []
auc_for_RFC = []

for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], Y[train_index]
    X_test, y_test = X[test_index], Y[test_index]

    # Gaussian naive Bayes
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_for_NB.append(metrics.accuracy_score(y_test, pred))
    f1_for_NB.append(metrics.f1_score(y_test, pred))
    # Note: roc_auc_score receives hard labels here, not probabilities or decision scores
    auc_for_NB.append(metrics.roc_auc_score(y_test, pred))

    # SVC with an RBF kernel; C and gamma are fixed here rather than searched
    clf = SVC(C=1e00, kernel='rbf', gamma=0.1)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_for_SVC.append(metrics.accuracy_score(y_test, pred))
    f1_for_SVC.append(metrics.f1_score(y_test, pred))
    auc_for_SVC.append(metrics.roc_auc_score(y_test, pred))

    # Random forest; n_estimators is fixed at 100 rather than searched
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_for_RFC.append(metrics.accuracy_score(y_test, pred))
    f1_for_RFC.append(metrics.f1_score(y_test, pred))
    auc_for_RFC.append(metrics.roc_auc_score(y_test, pred))

print("Naive Bayes:")
print("Evaluated by accuracy score:")
print(acc_for_NB)
print("Average:", sum(acc_for_NB) / len(acc_for_NB))
print()
print("Evaluated by f1 score:")
print(f1_for_NB)
print("Average:", sum(f1_for_NB) / len(f1_for_NB))
print()
print("Evaluated by roc auc score:")
print(auc_for_NB)
print("Average:", sum(auc_for_NB) / len(auc_for_NB))
print()
print("SVC:")
print("Evaluated by accuracy score:")
print(acc_for_SVC)
print("Average:", sum(acc_for_SVC) / len(acc_for_SVC))
print()
print("Evaluated by f1 score:")
print(f1_for_SVC)
print("Average:", sum(f1_for_SVC) / len(f1_for_SVC))
print()
print("Evaluated by roc auc score:")
print(auc_for_SVC)
print("Average:", sum(auc_for_SVC) / len(auc_for_SVC))
print()
print("Random Forest:")
print("Evaluated by accuracy score:")
print(acc_for_RFC)
print("Average:", sum(acc_for_RFC) / len(acc_for_RFC))
print()
print("Evaluated by f1 score:")
print(f1_for_RFC)
print("Average:", sum(f1_for_RFC) / len(f1_for_RFC))
print()
print("Evaluated by roc auc score:")
print(auc_for_RFC)
print("Average:", sum(auc_for_RFC) / len(auc_for_RFC))
print()
Results:
Naive Bayes:
Evaluated by accuracy score:
[0.94, 0.92, 0.88, 0.86, 0.91, 0.91, 0.89, 0.9, 0.83, 0.94]
Average: 0.8979999999999999
Evaluated by f1 score:
[0.9491525423728813, 0.9245283018867925, 0.8604651162790697, 0.8653846153846154, 0.8988764044943819, 0.9108910891089109, 0.8952380952380952, 0.8913043478260869, 0.8089887640449439, 0.9387755102040817]
Average: 0.8943604786839858
Evaluated by roc auc score:
[0.9461958806221101, 0.9190705128205129, 0.8747474747474747, 0.8606985146527499, 0.9099025974025975, 0.91, 0.8917069243156199, 0.8993558776167472, 0.8263749498193497, 0.9407051282051283]
Average: 0.897875786020229
SVC:
Evaluated by accuracy score:
[0.96, 0.9, 0.88, 0.88, 0.89, 0.9, 0.89, 0.9, 0.83, 0.93]
Average: 0.8959999999999999
Evaluated by f1 score:
[0.9661016949152543, 0.9038461538461539, 0.8604651162790697, 0.8823529411764707, 0.8735632183908046, 0.9, 0.8932038834951458, 0.8913043478260869, 0.8045977011494252, 0.9278350515463918]
Average: 0.8903270108624802
Evaluated by roc auc score:
[0.9672131147540983, 0.8998397435897435, 0.8747474747474747, 0.8819751103974307, 0.8871753246753247, 0.9, 0.8933172302737521, 0.8993558776167472, 0.8251706142111602, 0.9302884615384616]
Average: 0.8959082951804194
Random Forest:
Evaluated by accuracy score:
[0.96, 0.93, 0.93, 0.92, 0.94, 0.9, 0.92, 0.94, 0.94, 0.95]
Average: 0.9329999999999998
Evaluated by f1 score:
[0.9661016949152543, 0.9345794392523366, 0.9195402298850575, 0.923076923076923, 0.9333333333333332, 0.9, 0.9259259259259259, 0.9333333333333332, 0.9318181818181819, 0.9484536082474228]
Average: 0.9316162669787769
Evaluated by roc auc score:
[0.9672131147540983, 0.9286858974358975, 0.9262626262626262, 0.920915295062224, 0.9415584415584416, 0.9, 0.9194847020933978, 0.9380032206119163, 0.9361702127659575, 0.9503205128205129]
Average: 0.9328614023365072
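The averages alone do not show how much the scores vary from fold to fold. As a rough aid for the report, the sketch below recomputes the mean and sample standard deviation from the per-fold accuracies printed above, plus the per-fold gap between random forest and naive Bayes (a paired comparison is reasonable because all three models were scored on the same folds). The variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the results above
acc = {
    "GaussianNB": [0.94, 0.92, 0.88, 0.86, 0.91, 0.91, 0.89, 0.9, 0.83, 0.94],
    "SVC": [0.96, 0.9, 0.88, 0.88, 0.89, 0.9, 0.89, 0.9, 0.83, 0.93],
    "RandomForest": [0.96, 0.93, 0.93, 0.92, 0.94, 0.9, 0.92, 0.94, 0.94, 0.95],
}

for name, folds in acc.items():
    print(f"{name}: mean={mean(folds):.3f}, std={stdev(folds):.3f}")

# Paired per-fold difference: random forest minus naive Bayes
diff_rf_nb = [rf - nb for rf, nb in zip(acc["RandomForest"], acc["GaussianNB"])]
print(f"RandomForest - GaussianNB: mean gap={mean(diff_rf_nb):.3f}")
```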
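Random forest clearly scores best on all three metrics (about 0.93 on average), while GaussianNB and SVC are roughly tied (about 0.90).
Accuracy gives the highest average for every model, which suggests that F1 and AUC are slightly stricter measures here, although the gaps are all below 0.01. Note that the AUC values were computed from predicted labels rather than from probabilities or decision scores.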
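One likely reason the AUC averages track accuracy so closely is that `roc_auc_score` was given hard 0/1 predictions. For binary labels, the ROC curve then has a single interior point and the area reduces to the balanced accuracy, (TPR + TNR) / 2. The sketch below illustrates this on one held-out split and compares it with the AUC computed from predicted probabilities. The single 90/10 split and `random_state=0` are assumptions made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

clf = GaussianNB().fit(X_train, y_train)
labels = clf.predict(X_test)              # hard 0/1 predictions
probs = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_test, labels)
auc_from_scores = roc_auc_score(y_test, probs)
bal_acc = balanced_accuracy_score(y_test, labels)

print("AUC from labels:  ", auc_from_labels)
print("Balanced accuracy:", bal_acc)
print("AUC from scores:  ", auc_from_scores)
```

Passing probabilities (or `decision_function` output for SVC) lets the metric reflect how well the model ranks samples, which is what AUC ROC is normally meant to measure.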