Scikit-Learn Exercise

Assignment

In the second ML assignment you have to compare the performance of three different classification algorithms: Naive Bayes, SVM, and Random Forest. For this assignment you need to generate a random binary classification problem, and then train and test the three algorithms using 10-fold cross validation.
For some algorithms, an inner 5-fold cross validation is needed to choose the parameters. Then show the classification performance (per fold and averaged) in the report, and briefly discuss the results.
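The protocol described above (10 outer folds for testing, 5 inner folds for choosing parameters) is usually called nested cross validation. A minimal sketch of it follows, assuming a recent scikit-learn where `KFold`, `GridSearchCV`, and `cross_val_score` live in `sklearn.model_selection`. The `random_state` values are assumptions added for repeatability, and the grid of `C` values simply mirrors the one used later in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic binary problem; random_state is an assumption added for repeatability
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)

outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)  # evaluation folds
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # tuning folds

# GridSearchCV repeats the inner 5-fold search on every outer training split
tuned_svc = GridSearchCV(SVC(kernel="rbf", gamma=0.1),
                         param_grid={"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]},
                         cv=inner_cv, scoring="accuracy")

# One accuracy value per outer fold
fold_scores = cross_val_score(tuned_svc, X, y, cv=outer_cv, scoring="accuracy")
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```

Because the search object is passed to `cross_val_score` as if it were a plain estimator, the test fold of each outer split is never seen during parameter selection.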

Code

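from sklearn import datasets
# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide KFold in sklearn.model_selection instead.
from sklearn import cross_validation
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier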

Create a Classification Dataset

# Create a Classification Dataset 
dataset = datasets.make_classification(n_samples=1000, n_features=10, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2)
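`make_classification` returns a tuple `(X, y)`, which is why the code in this post indexes `dataset[0]` and `dataset[1]`. As a small sketch, the tuple can be unpacked and inspected before splitting. The `random_state=0` argument is an assumption added here so that repeated runs produce the same data; the original call does not set it.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same arguments as above; random_state=0 is an assumption added for repeatability
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)

print("X shape:", X.shape)              # feature matrix: (n_samples, n_features)
print("y shape:", y.shape)              # label vector: (n_samples,)
print("class counts:", np.bincount(y))  # number of samples in each class
```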

Split the Dataset Using 10-fold Cross Validation

# Split the Dataset Using 10-fold Cross Validation
kf = cross_validation.KFold(len(dataset[0]), n_folds=10, shuffle=True)
for train_index, test_index in kf:
    X_train, y_train = dataset[0][train_index], dataset[1][train_index]
    X_test, y_test = dataset[0][test_index], dataset[1][test_index]
# Note: each iteration overwrites the previous split, so only the last of the
# 10 folds remains in X_train/X_test for the steps that follow.

print("X_train:\n", X_train) 
print("y_train:\n", y_train) 
print("X_test:\n", X_test) 
print("y_test:\n", y_test)
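To report per-fold and averaged performance, the training and scoring steps need to run inside the fold loop. The following sketch shows one way to do that, assuming the newer `sklearn.model_selection.KFold` interface (`n_splits` plus `.split(X)`) rather than the `cross_validation` module used in this post. `GaussianNB` stands in for any of the three classifiers, and the `random_state` values are added assumptions.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)  # random_state is an added assumption

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_acc, fold_f1 = [], []
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    # Train and score inside the loop so every fold contributes a result
    clf = GaussianNB().fit(X_train, y_train)
    pred = clf.predict(X_test)
    fold_acc.append(metrics.accuracy_score(y_test, pred))
    fold_f1.append(metrics.f1_score(y_test, pred))
    print("fold %d: accuracy=%.4f f1=%.4f" % (fold, fold_acc[-1], fold_f1[-1]))

print("mean accuracy:", np.mean(fold_acc))
print("mean F1:", np.mean(fold_f1))
```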

Train the Algorithms

Gaussian NB

# Predict using Naive Bayes
NB_clf = GaussianNB() 
NB_clf.fit(X_train, y_train) 
NB_pred = NB_clf.predict(X_test)

print("Algorithm:\tGaussianNB")
print("Predict:\n", NB_pred)
print("y_test:\n", y_test)

NB_acc = metrics.accuracy_score(y_test, NB_pred) 
print("Accuracy:\t", NB_acc)
NB_f1 = metrics.f1_score(y_test, NB_pred) 
print("F1 Score:\t", NB_f1) 
NB_auc = metrics.roc_auc_score(y_test, NB_pred) 
print("AUC ROC:\t", NB_auc)
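One detail about the AUC value: `roc_auc_score` is called here with hard 0/1 predictions. That is valid, but the ROC curve then has a single operating point and the result equals the balanced accuracy. Passing class probabilities instead gives a threshold-independent AUC. The sketch below compares the two, assuming a single 90/10 split made with `train_test_split` as a stand-in for one fold; the `random_state` values are also assumptions.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)
# A single 90/10 split stands in for one cross-validation fold (assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
pred = clf.predict(X_test)                 # hard 0/1 labels
proba = clf.predict_proba(X_test)[:, 1]    # probability of class 1

auc_from_labels = metrics.roc_auc_score(y_test, pred)
auc_from_scores = metrics.roc_auc_score(y_test, proba)
print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```

The tables in this post were produced with the hard-label variant, so numbers from the probability-based variant are not directly comparable with them.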
SVC

# Calculate Best C with 5-fold inner cross validation on the training split
c_args = [1e-2, 1e-1, 1e0, 1e1, 1e2]
c_best = c_args[0]
c_eva = 0
inn_kf = cross_validation.KFold(len(X_train), n_folds=5, shuffle=True)
for c_arg in c_args:
    inn_accs = []
    for inn_train_index, inn_test_index in inn_kf:
        inn_X_train, inn_X_test = X_train[inn_train_index], X_train[inn_test_index]
        inn_y_train, inn_y_test = y_train[inn_train_index], y_train[inn_test_index]
        inn_SVC_clf = SVC(C=c_arg, kernel="rbf", gamma=0.1)
        inn_SVC_clf.fit(inn_X_train, inn_y_train)
        inn_SVC_pred = inn_SVC_clf.predict(inn_X_test)
        inn_accs.append(metrics.accuracy_score(inn_y_test, inn_SVC_pred))
    # Average the accuracy over all inner folds for this value of C
    inn_SVC_acc = sum(inn_accs) / len(inn_accs)
    if inn_SVC_acc > c_eva:
        c_eva = inn_SVC_acc
        c_best = c_arg

# Predict using SVC
SVC_clf = SVC(C=c_best, kernel='rbf', gamma=0.1) 
SVC_clf.fit(X_train, y_train) 
SVC_pred = SVC_clf.predict(X_test)

print("Algorithm:\tSVC")
print("Best C:\t", c_best)
print("Predict:\n", SVC_pred)
print("y_test:\n", y_test)

SVC_acc = metrics.accuracy_score(y_test, SVC_pred) 
print("Accuracy:\t", SVC_acc)
SVC_f1 = metrics.f1_score(y_test, SVC_pred) 
print("F1 Score:\t", SVC_f1) 
SVC_auc = metrics.roc_auc_score(y_test, SVC_pred) 
print("AUC ROC:\t", SVC_auc)
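The select-then-refit pattern used for SVC (score each `C` on inner folds, then retrain on the full training split with the winner) can also be expressed with `GridSearchCV`. This is a sketch under assumptions, not the code that produced the results in this post: it uses the newer `sklearn.model_selection` module, a single 90/10 split as a stand-in for one outer fold, and fixed `random_state` values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)
# A single 90/10 split stands in for one outer fold (assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# cv=5 runs a 5-fold inner search (stratified folds for classifiers);
# refit=True (the default) retrains on all of X_train with the best C
search = GridSearchCV(SVC(kernel="rbf", gamma=0.1),
                      param_grid={"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Mean inner-fold accuracy for every candidate value of C
for params, mean_acc in zip(search.cv_results_["params"],
                            search.cv_results_["mean_test_score"]):
    print("C=%g\tmean inner accuracy=%.4f" % (params["C"], mean_acc))

print("Best C:", search.best_params_["C"])
print("Test accuracy:", search.score(X_test, y_test))
```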
Random Forest Classifier

# Calculate Best n_estimators with 5-fold inner cross validation on the training split
n_args = [10, 100, 1000]
n_best = n_args[0]
n_eva = 0
inn_kf = cross_validation.KFold(len(X_train), n_folds=5, shuffle=True)
for n_arg in n_args:
    inn_accs = []
    for inn_train_index, inn_test_index in inn_kf:
        inn_X_train, inn_X_test = X_train[inn_train_index], X_train[inn_test_index]
        inn_y_train, inn_y_test = y_train[inn_train_index], y_train[inn_test_index]
        inn_RFC_clf = RandomForestClassifier(n_estimators=n_arg)
        inn_RFC_clf.fit(inn_X_train, inn_y_train)
        inn_RFC_pred = inn_RFC_clf.predict(inn_X_test)
        inn_accs.append(metrics.accuracy_score(inn_y_test, inn_RFC_pred))
    # Average the accuracy over all inner folds for this value of n_estimators
    inn_RFC_acc = sum(inn_accs) / len(inn_accs)
    if inn_RFC_acc > n_eva:
        n_eva = inn_RFC_acc
        n_best = n_arg

# Predict using RFC
RFC_clf = RandomForestClassifier(n_estimators=n_best) 
RFC_clf.fit(X_train, y_train) 
RFC_pred = RFC_clf.predict(X_test)

print("Algorithm:\tRFC")
print("Best n_estimator:\t", n_best)
print("Predict:\n", RFC_pred)
print("y_test:\n", y_test)

RFC_acc = metrics.accuracy_score(y_test, RFC_pred) 
print("Accuracy:\t", RFC_acc)
RFC_f1 = metrics.f1_score(y_test, RFC_pred) 
print("F1 Score:\t", RFC_f1) 
RFC_auc = metrics.roc_auc_score(y_test, RFC_pred) 
print("AUC ROC:\t", RFC_auc)
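Since the same three metrics are computed for each classifier, the comparison can be collected in one loop. The sketch below uses `cross_validate` from `sklearn.model_selection` with several scorers at once. It rests on assumptions: the hyperparameters `C=1.0` and `n_estimators=100` are fixed placeholders rather than tuned values, the `random_state` values are added, and the built-in `"roc_auc"` scorer works from decision scores or probabilities, so its values differ from the hard-label AUC computed in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)

# Hyperparameters are fixed placeholders (assumption); tuning is omitted for brevity
models = {
    "GaussianNB": GaussianNB(),
    "SVC": SVC(C=1.0, kernel="rbf", gamma=0.1),
    "RFC": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]

summary = {}
for name, model in models.items():
    # scores["test_<metric>"] holds one value per fold
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    summary[name] = {m: scores["test_" + m].mean() for m in scoring}
    print(name, {m: round(v, 4) for m, v in summary[name].items()})
```

Using the same `cv` object for all three models means every classifier is scored on identical folds, which makes the averages easier to compare.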

Result

Single-Run Results

Using Naive Bayes:
[output screenshot]
Using SVC, best C = 0.1:
[output screenshot]
Using RFC, best n_estimators = 10:
[output screenshot]

Combined Results of 3 Runs

Naive Bayes
Index     Accuracy   F1 Score   AUC ROC
1st       0.93       0.9213     0.9364
2nd       0.95       0.9351     0.9417
3rd       0.86       0.86       0.8614
Average   0.9133     0.9055     0.9132

SVC
Index     Accuracy   F1 Score   AUC ROC
1st       0.99       0.9880     0.9881
2nd       0.95       0.9383     0.95
3rd       0.87       0.8660     0.8702
Average   0.9367     0.9308     0.9361

Random Forest Classifier
Index     Accuracy   F1 Score   AUC ROC
1st       0.99       0.9880     0.9881
2nd       0.95       0.9383     0.95
3rd       0.92       0.9167     0.9199
Average   0.9533     0.9477     0.9527
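The Average rows can be reproduced from the three runs with a few lines of plain Python. In this short check, the numbers are copied from the tables above; the dictionary layout and variable names are illustrative only.

```python
# Values copied from the three tables above: (accuracy, F1, AUC ROC) per run
runs = {
    "Naive Bayes": [(0.93, 0.9213, 0.9364), (0.95, 0.9351, 0.9417), (0.86, 0.86, 0.8614)],
    "SVC": [(0.99, 0.9880, 0.9881), (0.95, 0.9383, 0.95), (0.87, 0.8660, 0.8702)],
    "Random Forest": [(0.99, 0.9880, 0.9881), (0.95, 0.9383, 0.95), (0.92, 0.9167, 0.9199)],
}

averages = {}
for name, rows in runs.items():
    # zip(*rows) regroups the rows by metric so each column can be averaged
    averages[name] = tuple(round(sum(col) / len(col), 4) for col in zip(*rows))
    print(name, averages[name])
```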

Analysis

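Judging from the overall training and prediction results, the accuracy of the three algorithms decreases in the order RFC > SVC > NB (average accuracy 0.9533, 0.9367, and 0.9133 over the three runs). On this dataset, RFC therefore performs best, SVC comes second, and NB performs worst.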
