This week takes a quick look at the sklearn library. In short, sklearn is a machine learning library built on top of fundamental numerical libraries such as numpy and scipy, and it provides implementations of a number of machine learning algorithms.
Assignment

- Create a classification dataset (n_samples >= 1000, n_features >= 10)
- Split the dataset using 10-fold cross validation
- Train the algorithms
  - GaussianNB
  - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
  - RandomForestClassifier (possible n_estimators values [10, 100, 1000])
- Evaluate the cross-validated performance
  - Accuracy
  - F1-score
  - AUC ROC
- Write a short report summarizing the methodology and the results
In short, the task is to apply three sklearn models to a classification dataset and make predictions. Working with sklearn follows the same steps as the assignment above:

1. Build the dataset
2. Split the dataset for cross validation
3. Train the models
4. Apply the models
5. Evaluate the models
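The five steps above can be sketched compactly with sklearn's built-in helpers; this is a minimal sketch (the dataset sizes and `random_state` are illustrative), where `cross_val_score` performs the split/train/apply/evaluate steps internally:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Step 1: build a toy binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Steps 2-5: 10-fold split, training, prediction and accuracy scoring in one call
scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="accuracy")
print("mean accuracy: %.3f" % scores.mean())
```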
The implementation is as follows:
from sklearn import datasets
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in 0.20
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Build a binary classification dataset
X, y = datasets.make_classification(
    n_samples=2000, n_features=15, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2
)

# 10-fold split; each iteration overwrites the variables,
# so the evaluation below runs on the last fold only
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
clf = []
pred = []
acc = []
f1 = []
auc = []
algorithm = [
'GaussianNB',
'SVC[C=1e-02]',
'SVC[C=1e-01]',
'SVC[C=1e00]',
'SVC[C=1e01]',
'SVC[C=1e02]',
'RandomForestClassifier[n_estimators=10]',
'RandomForestClassifier[n_estimators=100]',
'RandomForestClassifier[n_estimators=1000]'
]
clf.append(GaussianNB())
clf.append(SVC(C=1e-02, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e-01, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e00, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e01, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e02, kernel='rbf', gamma=0.1))
clf.append(RandomForestClassifier(n_estimators=10))
clf.append(RandomForestClassifier(n_estimators=100))
clf.append(RandomForestClassifier(n_estimators=1000))
# Train and evaluate each classifier on the held-out fold
for i in range(len(clf)):
    clf[i].fit(X_train, y_train)
    pred.append(clf[i].predict(X_test))
    acc.append(metrics.accuracy_score(y_test, pred[i]))
    f1.append(metrics.f1_score(y_test, pred[i]))
    auc.append(metrics.roc_auc_score(y_test, pred[i]))
    print("Evaluate of {}:".format(algorithm[i]))
    print("Accuracy:{}".format(acc[i]))
    print("F1-score:{}".format(f1[i]))
    print("AUC ROC:{}".format(auc[i]))
The output is as follows:
Evaluate of GaussianNB:
Accuracy:0.835
F1-score:0.8374384236453202
AUC ROC:0.8350000000000001
Evaluate of SVC[C=1e-02]:
Accuracy:0.825
F1-score:0.8372093023255814
AUC ROC:0.825
Evaluate of SVC[C=1e-01]:
Accuracy:0.875
F1-score:0.8756218905472637
AUC ROC:0.875
Evaluate of SVC[C=1e00]:
Accuracy:0.895
F1-score:0.8985507246376813
AUC ROC:0.8950000000000001
Evaluate of SVC[C=1e01]:
Accuracy:0.875
F1-score:0.8756218905472637
AUC ROC:0.875
Evaluate of SVC[C=1e02]:
Accuracy:0.86
F1-score:0.8599999999999999
AUC ROC:0.86
Evaluate of RandomForestClassifier[n_estimators=10]:
Accuracy:0.925
F1-score:0.9238578680203046
AUC ROC:0.925
Evaluate of RandomForestClassifier[n_estimators=100]:
Accuracy:0.92
F1-score:0.9215686274509804
AUC ROC:0.92
Evaluate of RandomForestClassifier[n_estimators=1000]:
Accuracy:0.92
F1-score:0.9215686274509804
AUC ROC:0.92
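Since the fold loop overwrites the split on each iteration, the numbers above reflect a single fold rather than an average over all ten. To report genuinely cross-validated performance, the three assignment metrics can be averaged across folds; a minimal sketch using `cross_validate` (the classifier and its parameters are just one of the configurations above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, n_features=15, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

# One call fits and scores the model on every fold with all three metrics
res = cross_validate(RandomForestClassifier(n_estimators=100), X, y, cv=10,
                     scoring=["accuracy", "f1", "roc_auc"])
for name in ("accuracy", "f1", "roc_auc"):
    print("mean %s: %.3f" % (name, res["test_" + name].mean()))
```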
With 2000 samples and 15 features, the results show:

- GaussianNB performs moderately well.
- SVC is even less accurate than GaussianNB when C is too small or too large, but delivers good accuracy when C is chosen correctly.
- RandomForestClassifier achieves the highest accuracy (and the longest running time), although n_estimators does not affect the accuracy much.
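One caveat about the AUC numbers: they were computed from hard 0/1 predictions, which on a roughly balanced dataset reduces ROC AUC to something close to accuracy; ROC AUC is normally computed from continuous scores such as class probabilities. A minimal sketch of the difference (the classifier and split here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

clf = GaussianNB().fit(X_tr, y_tr)
# AUC from the predicted probability of the positive class vs. from hard labels
proba_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
label_auc = roc_auc_score(y_te, clf.predict(X_te))
print("AUC from probabilities: %.3f, from labels: %.3f" % (proba_auc, label_auc))
```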