Scikit-Learn Exercise
Assignment
In the second ML assignment you have to compare the performance of three different classification algorithms: Naive Bayes, SVM, and Random Forest. You need to generate a random binary classification problem, and then train and test the three algorithms using 10-fold cross-validation.
For some algorithms an inner 5-fold cross-validation is needed to choose the hyperparameters. Finally, report the classification performance (per fold and averaged) and briefly discuss the results.
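For reference, the same protocol can be written much more compactly with scikit-learn's `model_selection` helpers. The sketch below only illustrates the nested-CV idea (outer 10-fold evaluation, inner 5-fold parameter search for the SVM); the parameter grid and the `gamma` value are illustrative choices, not prescribed by the assignment, and this is not the code used in the rest of this write-up.

```python
from sklearn import datasets, model_selection, svm

# Random binary classification problem, as required by the assignment.
X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=2, n_redundant=2,
                                    n_repeated=0, n_classes=2)

# Inner 5-fold CV picks C for the SVM; the outer 10-fold CV measures performance.
inner_search = model_selection.GridSearchCV(
    svm.SVC(kernel="rbf", gamma=0.1),
    param_grid={"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]},
    cv=5,
)
outer_scores = model_selection.cross_val_score(inner_search, X, y,
                                               cv=10, scoring="accuracy")
print("Per-fold accuracy:", outer_scores)
print("Mean accuracy:", outer_scores.mean())
```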
Code
from sklearn import datasets
from sklearn import model_selection  # replaces the removed sklearn.cross_validation module
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
Create a Classification Dataset
# Create a Classification Dataset
dataset = datasets.make_classification(n_samples=1000, n_features=10, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2)
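`make_classification` returns a tuple `(X, y)` holding the feature matrix and the label vector. A quick sanity check (purely illustrative, not part of the assignment code) confirms the shapes and the roughly balanced classes:

```python
import numpy as np

X, y = dataset                  # feature matrix (1000, 10) and label vector (1000,)
print(X.shape, y.shape)
print(np.bincount(y))           # roughly 500 samples per class
```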
Split the Dataset Using 10-fold Cross Validation
# Split the Dataset Using 10-fold Cross Validation
kf = model_selection.KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(dataset[0]):
    X_train, y_train = dataset[0][train_index], dataset[1][train_index]
    X_test, y_test = dataset[0][test_index], dataset[1][test_index]
    print("X_train:\n", X_train)
    print("y_train:\n", y_train)
    print("X_test:\n", X_test)
    print("y_test:\n", y_test)
Train the Algorithms
Gaussian NB
# Predict using Naive Bayes
NB_clf = GaussianNB()
NB_clf.fit(X_train, y_train)
NB_pred = NB_clf.predict(X_test)
print("Algorithm:\tGaussianNB")
print("Predict:\n", NB_pred)
print("y_test:\n", y_test)
NB_acc = metrics.accuracy_score(y_test, NB_pred)
print("Accuracy:\t", NB_acc)
NB_f1 = metrics.f1_score(y_test, NB_pred)
print("F1 Score:\t", NB_f1)
NB_auc = metrics.roc_auc_score(y_test, NB_pred)
print("AUC ROC:\t", NB_auc)
SVC
# Calculate Best C with an inner 5-fold cross-validation
c_args = [1e-2, 1e-1, 1e0, 1e1, 1e2]
c_scores = {c_arg: 0.0 for c_arg in c_args}
inn_kf = model_selection.KFold(n_splits=5, shuffle=True)
for inn_train_index, inn_test_index in inn_kf.split(X_train):
    inn_X_train, inn_X_test = X_train[inn_train_index], X_train[inn_test_index]
    inn_y_train, inn_y_test = y_train[inn_train_index], y_train[inn_test_index]
    for c_arg in c_args:
        inn_SVC_clf = SVC(C=c_arg, kernel="rbf", gamma=0.1)
        inn_SVC_clf.fit(inn_X_train, inn_y_train)
        inn_SVC_pred = inn_SVC_clf.predict(inn_X_test)
        c_scores[c_arg] += metrics.accuracy_score(inn_y_test, inn_SVC_pred)
# Choose the C with the highest total (equivalently, mean) inner-fold accuracy
c_best = max(c_scores, key=c_scores.get)
# Predict using SVC
SVC_clf = SVC(C=c_best, kernel='rbf', gamma=0.1)
SVC_clf.fit(X_train, y_train)
SVC_pred = SVC_clf.predict(X_test)
print("Algorithm:\tSVC")
print("Best C:\t", c_best)
print("Predict:\n", SVC_pred)
print("y_test:\n", y_test)
SVC_acc = metrics.accuracy_score(y_test, SVC_pred)
print("Accuracy:\t", SVC_acc)
SVC_f1 = metrics.f1_score(y_test, SVC_pred)
print("F1 Score:\t", SVC_f1)
SVC_auc = metrics.roc_auc_score(y_test, SVC_pred)
print("AUC ROC:\t", SVC_auc)
Random Forest Classifier
# Calculate Best n_estimators with an inner 5-fold cross-validation
n_args = [10, 100, 1000]
n_scores = {n_arg: 0.0 for n_arg in n_args}
inn_kf = model_selection.KFold(n_splits=5, shuffle=True)
for inn_train_index, inn_test_index in inn_kf.split(X_train):
    inn_X_train, inn_X_test = X_train[inn_train_index], X_train[inn_test_index]
    inn_y_train, inn_y_test = y_train[inn_train_index], y_train[inn_test_index]
    for n_arg in n_args:
        inn_RFC_clf = RandomForestClassifier(n_estimators=n_arg)
        inn_RFC_clf.fit(inn_X_train, inn_y_train)
        inn_RFC_pred = inn_RFC_clf.predict(inn_X_test)
        n_scores[n_arg] += metrics.accuracy_score(inn_y_test, inn_RFC_pred)
# Choose the n_estimators with the highest total (equivalently, mean) inner-fold accuracy
n_best = max(n_scores, key=n_scores.get)
# Predict using RFC
RFC_clf = RandomForestClassifier(n_estimators=n_best)
RFC_clf.fit(X_train, y_train)
RFC_pred = RFC_clf.predict(X_test)
print("Algorithm:\tRFC")
print("Best n_estimator:\t", n_best)
print("Predict:\n", RFC_pred)
print("y_test:\n", y_test)
RFC_acc = metrics.accuracy_score(y_test, RFC_pred)
print("Accuracy:\t", RFC_acc)
RFC_f1 = metrics.f1_score(y_test, RFC_pred)
print("F1 Score:\t", RFC_f1)
RFC_auc = metrics.roc_auc_score(y_test, RFC_pred)
print("AUC ROC:\t", RFC_auc)
Results
Results of a single run
Using the Naive Bayes algorithm:
Using the SVC algorithm (best C = 0.1):
Using the RFC algorithm (best n_estimators = 10):
Combined results of three runs
Naive Bayes
Index | Accuracy | F1 Score | AUC ROC |
---|---|---|---|
1st | 0.93 | 0.9213 | 0.9364 |
2nd | 0.95 | 0.9351 | 0.9417 |
3rd | 0.86 | 0.86 | 0.8614 |
Average | 0.9133 | 0.9055 | 0.9132 |
SVC
Index | Accuracy | F1 Score | AUC ROC |
---|---|---|---|
1st | 0.99 | 0.9880 | 0.9881 |
2nd | 0.95 | 0.9383 | 0.95 |
3rd | 0.87 | 0.8660 | 0.8702 |
Average | 0.9367 | 0.9308 | 0.9361 |
Random Forest Classifier
Index | Accuracy | F1 Score | AUC ROC |
---|---|---|---|
1st | 0.99 | 0.9880 | 0.9881 |
2nd | 0.95 | 0.9383 | 0.95 |
3rd | 0.92 | 0.9167 | 0.9199 |
Average | 0.9533 | 0.9477 | 0.9527 |
Analysis
Looking at the overall training and prediction results, the accuracy of the three algorithms decreases in the order RFC > SVC > NB. We can therefore conclude that the Random Forest classifier performs best on this task, followed by SVC, with Naive Bayes performing worst.