This week takes a quick look at the sklearn library. In short, sklearn is a machine learning library built on top of fundamental numerical libraries such as numpy and scipy, and it provides implementations of a number of machine learning algorithms.
Assignment

- Create a classification dataset (n_samples >= 1000, n_features >= 10)
- Split the dataset using 10-fold cross validation
- Train the algorithms
  - GaussianNB
  - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
  - RandomForestClassifier (possible n_estimators values [10, 100, 1000])
- Evaluate the cross-validated performance
  - Accuracy
  - F1-score
  - AUC ROC
- Write a short report summarizing the methodology and the results
In short, the task is to apply three sklearn models to a classification dataset and make predictions. Working with sklearn follows the same steps as the assignment above:

1. Build the dataset
2. Split the dataset for cross validation
3. Train the models
4. Apply the models
5. Evaluate the models
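The five steps above can be sketched compactly with sklearn's built-in helpers; this is a minimal sketch (the dataset sizes and `random_state` are illustrative), where `cross_val_score` performs the split/train/apply/evaluate steps internally:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Step 1: build a toy binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Steps 2-5: 10-fold split, training, prediction and accuracy scoring in one call
scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="accuracy")
print("mean accuracy: %.3f" % scores.mean())
```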
The implementation is as follows:
from sklearn import datasets
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in 0.20
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Build a binary classification dataset
X, y = datasets.make_classification(
    n_samples=2000, n_features=15, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2
)

# 10-fold split; each iteration overwrites the variables,
# so the evaluation below runs on the last fold only
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
clf = []
pred = []
acc = []
f1 = []
auc = []
algorithm = [
'GaussianNB',
'SVC[C=1e-02]',
'SVC[C=1e-01]',
'SVC[C=1e00]',
'SVC[C=1e01]',
'SVC[C=1e02]',
'RandomForestClassifier[n_estimators=10]',
'RandomForestClassifier[n_estimators=100]',
'RandomForestClassifier[n_estimators=1000]'
]
clf.append(GaussianNB())
clf.append(SVC(C=1e-02, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e-01, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e00, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e01, kernel='rbf', gamma=0.1))
clf.append(SVC(C=1e02, kernel='rbf', gamma=0.1))
clf.append(RandomForestClassifier(n_estimators=10))
clf.append(RandomForestClassifier(n_estimators=100))
clf.append(RandomForestClassifier(n_estimators=1000))
# Train and evaluate each classifier on the held-out fold
for i in range(len(clf)):
    clf[i].fit(X_train, y_train)
    pred.append(clf[i].predict(X_test))
    acc.append(metrics.accuracy_score(y_test, pred[i]))
    f1.append(metrics.f1_score(y_test, pred[i]))
    auc.append(metrics.roc_auc_score(y_test, pred[i]))
    print("Evaluate of {}:".format(algorithm[i]))
    print("Accuracy:{}".format(acc[i]))
    print("F1-score:{}".format(f1[i]))
    print("AUC ROC:{}".format(auc[i]))
The output is as follows:
Evaluate of GaussianNB:
Accuracy:0.835
F1-score:0.8374384236453202
AUC ROC:0.8350000000000001
Evaluate of SVC[C=1e-02]:
Accuracy:0.825
F1-score:0.8372093023255814
AUC ROC:0.825
Evaluate of SVC[C=1e-01]:
Accuracy:0.875
F1-score:0.8756218905472637
AUC ROC:0.875
Evaluate of SVC[C=1e00]:
Accuracy:0.895
F1-score:0.8985507246376813
AUC ROC:0.8950000000000001
Evaluate of SVC[C=1e01]:
Accuracy:0.875
F1-score:0.8756218905472637
AUC ROC:0.875
Evaluate of SVC[C=1e02]:
Accuracy:0.86
F1-score:0.8599999999999999
AUC ROC:0.86
Evaluate of RandomForestClassifier[n_estimators=10]:
Accuracy:0.925
F1-score:0.9238578680203046
AUC ROC:0.925
Evaluate of RandomForestClassifier[n_estimators=100]:
Accuracy:0.92
F1-score:0.9215686274509804
AUC ROC:0.92
Evaluate of RandomForestClassifier[n_estimators=1000]:
Accuracy:0.92
F1-score:0.9215686274509804
AUC ROC:0.92
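Since the fold loop overwrites the split on each iteration, the numbers above reflect a single fold rather than an average over all ten. To report genuinely cross-validated performance, the three assignment metrics can be averaged across folds; a minimal sketch using `cross_validate` (the classifier and its parameters are just one of the configurations above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, n_features=15, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

# One call fits and scores the model on every fold with all three metrics
res = cross_validate(RandomForestClassifier(n_estimators=100), X, y, cv=10,
                     scoring=["accuracy", "f1", "roc_auc"])
for name in ("accuracy", "f1", "roc_auc"):
    print("mean %s: %.3f" % (name, res["test_" + name].mean()))
```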
With 2000 samples and 15 features, the results show:

- GaussianNB performs moderately well.
- SVC is even less accurate than GaussianNB when C is too small or too large, but delivers good accuracy when C is chosen correctly.
- RandomForestClassifier achieves the highest accuracy (and the longest running time), although n_estimators does not affect the accuracy much.
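One caveat about the AUC numbers: they were computed from hard 0/1 predictions, which on a roughly balanced dataset reduces ROC AUC to something close to accuracy; ROC AUC is normally computed from continuous scores such as class probabilities. A minimal sketch of the difference (the classifier and split here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

clf = GaussianNB().fit(X_tr, y_tr)
# AUC from the predicted probability of the positive class vs. from hard labels
proba_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
label_auc = roc_auc_score(y_te, clf.predict(X_te))
print("AUC from probabilities: %.3f, from labels: %.3f" % (proba_auc, label_auc))
```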