Exercise:
For this assignment you need to generate a random binary classification
problem, and then train and test (using 10-fold cross validation) the three
algorithms. For some algorithms, inner cross validation (5-fold) is needed to
choose the parameters. Then show the classification performance
(per-fold and averaged) in the report, and briefly discuss the results.
Steps:
1 Create a classification dataset (n_samples >= 1000, n_features >= 10)
2 Split the dataset using 10-fold cross validation
3 Train the algorithms
    GaussianNB
    SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
    RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4 Evaluate the cross-validated performance
    Accuracy
    F1-score
    AUC ROC
5 Write a short report summarizing the methodology and the results
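The steps above can be sketched with scikit-learn's model-selection utilities. This is a minimal sketch, not a reference solution: it assumes stratified folds and fixed seeds, and the helper names `build_models` and `evaluate` are made up for illustration. Wrapping SVC and RandomForestClassifier in `GridSearchCV` provides the inner 5-fold parameter search (selecting by accuracy, the default), while `cross_validate` runs the outer 10-fold loop. The `roc_auc` scorer ranks test samples by continuous scores (decision function or class probabilities) rather than by hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def build_models(c_values=(1e-2, 1e-1, 1e0, 1e1, 1e2),
                 n_estimators_values=(10, 100, 1000),
                 inner_folds=5, seed=0):
    """Return the three estimators; SVC and the forest tune one parameter by inner CV."""
    inner = StratifiedKFold(n_splits=inner_folds, shuffle=True, random_state=seed)
    return {
        "GaussianNB": GaussianNB(),  # nothing to tune
        "SVC": GridSearchCV(SVC(kernel="rbf"), {"C": list(c_values)}, cv=inner),
        "RandomForestClassifier": GridSearchCV(
            RandomForestClassifier(random_state=seed),
            {"n_estimators": list(n_estimators_values)},
            cv=inner,
        ),
    }


def evaluate(models, X, y, outer_folds=10, seed=0):
    """Run the outer cross-validation and collect per-fold scores for each model."""
    outer = StratifiedKFold(n_splits=outer_folds, shuffle=True, random_state=seed)
    scoring = ["accuracy", "f1", "roc_auc"]
    results = {}
    for name, model in models.items():
        cv_out = cross_validate(model, X, y, cv=outer, scoring=scoring)
        results[name] = {m: cv_out["test_" + m] for m in scoring}
    return results
```

A call such as `evaluate(build_models(), *make_classification(n_samples=1000, n_features=10))` returns, for each model, one array of per-fold scores per metric; averaging each array gives the reported mean. With the full grids this fits many 1000-tree forests, so it can take a while.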
Code:
import numpy as np
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed from scikit-learn; KFold now lives here.
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

N_FOLDS = 10
# performance[fold, algorithm, metric]
performance = np.zeros((N_FOLDS, 3, 3))


def metric(y_test, pred):
    """Return (accuracy, F1, ROC AUC) for one fold."""
    acc = metrics.accuracy_score(y_test, pred)
    f1 = metrics.f1_score(y_test, pred)
    # Note: AUC is computed from hard 0/1 predictions, not from scores.
    auc = metrics.roc_auc_score(y_test, pred)
    return acc, f1, auc


def Gaussian_naive_Bayes(X_train, y_train, X_test, y_test):
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    return metric(y_test, pred)


def SVM(X_train, y_train, X_test, y_test):
    # C and gamma are fixed here; no inner cross-validation is performed.
    clf = SVC(C=1e-01, kernel='rbf', gamma=0.1)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    return metric(y_test, pred)


def Random_Forest(X_train, y_train, X_test, y_test):
    # n_estimators is fixed here; no inner cross-validation is performed.
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    return metric(y_test, pred)


X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=2, n_redundant=2,
                                    n_repeated=0, n_classes=2)
kf = KFold(n_splits=N_FOLDS, shuffle=True)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    performance[i, 0, :] = Gaussian_naive_Bayes(X_train, y_train, X_test, y_test)
    performance[i, 1, :] = SVM(X_train, y_train, X_test, y_test)
    performance[i, 2, :] = Random_Forest(X_train, y_train, X_test, y_test)

name = ['GaussianNB', 'SVC', 'RandomForestClassifier']
mean = np.mean(performance, axis=0)  # average over the folds
for i in range(3):
    print(name[i])
    print(' Accuracy: ', performance[:, i, 0], ' Averaged: ', mean[i, 0])
    print(' F1-score: ', performance[:, i, 1], ' Averaged: ', mean[i, 1])
    print(' AUC ROC: ', performance[:, i, 2], ' Averaged: ', mean[i, 2], '\n')
Output:
GaussianNB
Accuracy: [0.92 0.91 0.91 0.88 0.93 0.91 0.9 0.95 0.92 0.97] Averaged: 0.9200000000000002
F1-score: [0.90909091 0.90322581 0.90909091 0.88235294 0.93457944 0.91428571
0.89583333 0.94736842 0.92 0.96470588] Averaged: 0.918053335608686
AUC ROC: [0.91883117 0.90865385 0.91185897 0.88405797 0.93232323 0.91414141
0.90084303 0.94937975 0.92036815 0.96797226] Averaged: 0.9208429797130474
SVC
Accuracy: [0.95 0.97 0.96 0.94 0.97 0.94 0.94 0.94 0.98 0.97] Averaged: 0.9560000000000001
F1-score: [0.94382022 0.96907216 0.96078431 0.94117647 0.97247706 0.94339623
0.9375 0.93617021 0.98039216 0.96470588] Averaged: 0.9549494716598202
AUC ROC: [0.95048701 0.97035256 0.96073718 0.94444444 0.97070707 0.94343434
0.94098756 0.93917567 0.979992 0.96797226] Averaged: 0.9568290093650107
RandomForestClassifier
Accuracy: [0.98 0.96 0.97 0.99 0.99 0.96 0.97 0.96 0.99 0.99] Averaged: 0.976
F1-score: [0.97777778 0.95918367 0.97087379 0.99065421 0.99099099 0.96296296
0.96907216 0.95833333 0.99029126 0.98850575] Averaged: 0.975864590476051
AUC ROC: [0.98214286 0.96073718 0.97035256 0.99074074 0.98888889 0.96161616
0.97169811 0.95958383 0.98979592 0.99122807] Averaged: 0.9766784327262139
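The per-fold accuracies printed above can be condensed into a few summary numbers per algorithm. The following is a small sketch that copies the accuracy values from the output into a dictionary; the `summarize` helper is a made-up name, and the sample standard deviation (`ddof=1`) is one possible choice of spread measure.

```python
import numpy as np

# Per-fold accuracies copied from the printed output.
accuracy = {
    "GaussianNB": [0.92, 0.91, 0.91, 0.88, 0.93, 0.91, 0.90, 0.95, 0.92, 0.97],
    "SVC": [0.95, 0.97, 0.96, 0.94, 0.97, 0.94, 0.94, 0.94, 0.98, 0.97],
    "RandomForestClassifier": [0.98, 0.96, 0.97, 0.99, 0.99, 0.96, 0.97, 0.96, 0.99, 0.99],
}


def summarize(scores):
    """Return mean, sample standard deviation, minimum and maximum of the fold scores."""
    arr = np.asarray(scores, dtype=float)
    return {"mean": arr.mean(), "std": arr.std(ddof=1), "min": arr.min(), "max": arr.max()}


summary = {name: summarize(scores) for name, scores in accuracy.items()}
for name, s in summary.items():
    print(f"{name}: mean={s['mean']:.3f} std={s['std']:.3f} "
          f"min={s['min']:.2f} max={s['max']:.2f}")
```

Looking at the minimum alongside the mean shows how far the worst fold falls below the average for each algorithm.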
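Analysis:
In 10-fold cross-validation on a dataset of 1000 samples, the random forest performed best (mean accuracy 0.976), SVC came second (0.956), and GaussianNB was worst (0.920).
All three algorithms kept their accuracy above 80% in every fold; the lowest single-fold accuracy was 0.88 (GaussianNB).
SVC was run with fixed C=0.1 and gamma=0.1 and the random forest with n_estimators=100, so the ranking applies to these settings only.
Algorithm performance depends heavily on the dataset itself.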
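One way to probe how much the results depend on the data is to regenerate the dataset with different difficulty settings and repeat the evaluation. The sketch below is an illustration under assumptions, not part of the original experiment: it varies the `class_sep` parameter of `make_classification` (default 1.0), uses GaussianNB only, fixes the random seed, and the function name `accuracy_by_class_sep` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def accuracy_by_class_sep(class_seps=(0.5, 1.0, 2.0), n_splits=10, seed=0):
    """Return the mean cross-validated GaussianNB accuracy for each class_sep value."""
    results = {}
    for sep in class_seps:
        X, y = make_classification(n_samples=1000, n_features=10,
                                   n_informative=2, n_redundant=2,
                                   n_repeated=0, n_classes=2,
                                   class_sep=sep, random_state=seed)
        scores = cross_val_score(GaussianNB(), X, y, cv=n_splits, scoring="accuracy")
        results[sep] = scores.mean()
    return results


for sep, acc in accuracy_by_class_sep().items():
    print(f"class_sep={sep}: mean accuracy={acc:.3f}")
```

Swapping in another classifier or varying `random_state` instead gives a similar picture of how sensitive each score is to the generated data.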