sklearn Exercise

mori644

Article link: https://blog.csdn.net/mori644/article/details/80715591

Exercise:

For this assignment you need to generate a random binary classification
problem, and then train and test (using 10-fold cross validation) the three
algorithms. For some algorithms, inner cross validation (5-fold) for choosing
the parameters is needed. Then, show the classification performance
(per-fold and averaged) in the report, and briefly discuss the results.

Steps:
1 Create a classification dataset (n_samples >= 1000, n_features >= 10)
2 Split the dataset using 10-fold cross validation
3 Train the algorithms
     GaussianNB
     SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
     RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4 Evaluate the cross-validated performance
     Accuracy
     F1-score
     AUC ROC
5 Write a short report summarizing the methodology and the results
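
The inner 5-fold parameter search mentioned in the assignment can be sketched with nested cross-validation. This is only a minimal sketch for SVC, not the code used in this post: the fixed random_state values, the variable names (inner_cv, outer_cv, search, svc_scores) and the choice of accuracy as the selection metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic data in the shape the assignment asks for.
# random_state is an assumption, added only to make the sketch reproducible.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

# Inner loop: 5-fold grid search over the candidate C values from the assignment.
param_grid = {"C": [1e-02, 1e-01, 1e00, 1e01, 1e02]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv,
                      scoring="accuracy")

# Outer loop: 10-fold CV. For each outer fold, GridSearchCV picks C using
# only the outer training split, then the refit model is scored on the
# held-out outer test split.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=1)
svc_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print("per-fold accuracy:", svc_scores)
print("averaged accuracy:", svc_scores.mean())
```

The same pattern should carry over to RandomForestClassifier by swapping the estimator and using a grid such as {"n_estimators": [10, 100, 1000]}, at a noticeably higher runtime.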

Code:

from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20;
# KFold now lives in sklearn.model_selection
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import numpy as np

# performance[fold, algorithm, metric], metrics ordered (accuracy, F1, AUC ROC)
performance = np.zeros(shape=(10, 3, 3))


def Gaussian_naive_Bayes(X_train, y_train, X_test, y_test):
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    return metric(y_test, pred)


def SVM(X_train, y_train, X_test, y_test):
    # C and gamma are fixed here; no inner 5-fold search over C is performed
    clf = SVC(C=1e-01, kernel='rbf', gamma=0.1)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    return metric(y_test, pred)


def Random_Forest(X_train, y_train, X_test, y_test):
    # n_estimators is fixed here as well
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    return metric(y_test, pred)


def metric(y_test, pred):
    # All three metrics, including AUC ROC, are computed from hard class labels
    acc = metrics.accuracy_score(y_test, pred)
    f1 = metrics.f1_score(y_test, pred)
    auc = metrics.roc_auc_score(y_test, pred)

    return acc, f1, auc


# No random_state is set, so every run generates a different dataset
X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=2, n_redundant=2, n_repeated=0, n_classes=2)

# Outer 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    performance[i, 0, :] = Gaussian_naive_Bayes(X_train, y_train, X_test, y_test)
    performance[i, 1, :] = SVM(X_train, y_train, X_test, y_test)
    performance[i, 2, :] = Random_Forest(X_train, y_train, X_test, y_test)

# Per-fold values and the average over the 10 folds for each algorithm
name = ['GaussianNB', 'SVC', 'RandomForestClassifier']
mean = np.mean(performance, axis=0)
for i in range(3):
    print(name[i])
    print('  Accuracy: ', performance[:, i, 0], ' Averaged: ', mean[i, 0])
    print('  F1-score: ', performance[:, i, 1], ' Averaged: ', mean[i, 1])
    print('  AUC ROC:  ', performance[:, i, 2], ' Averaged: ', mean[i, 2], '\n')

Output:

GaussianNB
  Accuracy:  [0.92 0.91 0.91 0.88 0.93 0.91 0.9  0.95 0.92 0.97]  Averaged:  0.9200000000000002
  F1-score:  [0.90909091 0.90322581 0.90909091 0.88235294 0.93457944 0.91428571
 0.89583333 0.94736842 0.92       0.96470588]  Averaged:  0.918053335608686
  AUC ROC:   [0.91883117 0.90865385 0.91185897 0.88405797 0.93232323 0.91414141
 0.90084303 0.94937975 0.92036815 0.96797226]  Averaged:  0.9208429797130474

SVC
  Accuracy:  [0.95 0.97 0.96 0.94 0.97 0.94 0.94 0.94 0.98 0.97]  Averaged:  0.9560000000000001
  F1-score:  [0.94382022 0.96907216 0.96078431 0.94117647 0.97247706 0.94339623
 0.9375     0.93617021 0.98039216 0.96470588]  Averaged:  0.9549494716598202
  AUC ROC:   [0.95048701 0.97035256 0.96073718 0.94444444 0.97070707 0.94343434
 0.94098756 0.93917567 0.979992   0.96797226]  Averaged:  0.9568290093650107

RandomForestClassifier
  Accuracy:  [0.98 0.96 0.97 0.99 0.99 0.96 0.97 0.96 0.99 0.99]  Averaged:  0.976
  F1-score:  [0.97777778 0.95918367 0.97087379 0.99065421 0.99099099 0.96296296
 0.96907216 0.95833333 0.99029126 0.98850575]  Averaged:  0.975864590476051
  AUC ROC:   [0.98214286 0.96073718 0.97035256 0.99074074 0.98888889 0.96161616
 0.97169811 0.95958383 0.98979592 0.99122807]  Averaged:  0.9766784327262139
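
The AUC ROC values above come from hard class predictions, which reduces the ROC curve to a single operating point. A common alternative is to pass continuous scores to roc_auc_score. The sketch below illustrates the difference on a single hold-out split rather than the full 10-fold loop; the random_state values and the names auc_from_labels and auc_from_scores are assumptions for illustration, and the numbers it prints are not results from this post.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# random_state is an assumption, added only for reproducibility.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Same SVC settings as in the code of this post.
clf = SVC(C=1e-01, kernel="rbf", gamma=0.1).fit(X_train, y_train)

# AUC from hard 0/1 predictions (what the post's metric() function does).
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))

# AUC from continuous decision values, which ranks all test samples.
auc_from_scores = roc_auc_score(y_test, clf.decision_function(X_test))

print("AUC from labels:", auc_from_labels)
print("AUC from scores:", auc_from_scores)
```

For GaussianNB and RandomForestClassifier, which have no decision_function, the positive-class column of predict_proba (clf.predict_proba(X_test)[:, 1]) can play the same role.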

Analysis:

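With 10-fold cross-validation on a dataset of 1000 samples:

Random forest performs best (average accuracy 0.976), SVC comes second (0.956), and GaussianNB is the weakest (0.920).

All three algorithms keep their per-fold accuracy above 80%; the lowest value observed is 0.88, for GaussianNB.

The performance of each algorithm depends heavily on the dataset itself.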
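
One way to probe how much the dataset matters is to regenerate it with several seeds and compare the averaged accuracy. This is a rough sketch under assumptions: the seed list, the restriction to GaussianNB (chosen for speed) and the name mean_accuracy are illustrative, and cv=10 here uses scikit-learn's default stratified, unshuffled folds rather than the shuffled KFold from the post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical seeds; each one yields a different random classification problem.
seeds = [0, 1, 2, 3, 4]
mean_accuracy = []

for seed in seeds:
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                               n_redundant=2, n_classes=2, random_state=seed)
    # 10-fold cross-validated accuracy on this particular dataset.
    scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="accuracy")
    mean_accuracy.append(scores.mean())

for seed, acc in zip(seeds, mean_accuracy):
    print("seed", seed, "averaged accuracy:", acc)
```

The spread of the printed averages gives a rough idea of how far the numbers reported here could move on a different random dataset.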
