Assignment
First, import the required libraries:
from sklearn import datasets,cross_validation
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
c:\users\sunyy\appdata\local\programs\python\python36-32\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
There is a deprecation warning here; it can be ignored for now.
Generate the dataset. The requirements are at least 1000 samples (we use 2000) and at least 10 features (we use 15), split into 10 folds.
dataset = datasets.make_classification(n_samples=2000, n_features=15)  # binary classification; n_classes defaults to 2, so it is omitted
data, target = dataset[0], dataset[1]  # dataset is a pair of arrays: the first holds the sample inputs, the second the labels
# split the data into 10 folds, n_folds = 10
kf = cross_validation.KFold(len(data), n_folds=10, shuffle=True)  # kf yields the pre-split index arrays for each fold
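Since the cross_validation module is slated for removal, the same split can be written against the newer model_selection API (assuming scikit-learn >= 0.18). Note that KFold there takes n_splits instead of n_folds, and split() is called on the data itself rather than being constructed from len(data). A minimal sketch on toy data (the 10 samples and 5 folds here are illustrative, not the assignment's numbers):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
kf_new = KFold(n_splits=5, shuffle=True, random_state=0)

# split() yields (train_index, test_index) pairs, just like the old API
for train_index, test_index in kf_new.split(X):
    # each of the 5 test folds holds 10 / 5 = 2 samples
    print(len(train_index), len(test_index))
```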
Then run cross-validation with each classification method in turn; the code follows the PDF handout.
To evaluate each method, each of the three metrics is averaged over the 10 validation folds (though I am not sure this is the right approach).
First, naive Bayes:
avr_acc = 0
avr_f1 = 0
avr_auc = 0
for train_index, test_index in kf:
    X_train, y_train = data[train_index], target[train_index]
    X_test, y_test = data[test_index], target[test_index]
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, pred)
    avr_acc += acc
    f1 = metrics.f1_score(y_test, pred)
    avr_f1 += f1
    auc = metrics.roc_auc_score(y_test, pred)  # AUC computed on the hard 0/1 predictions
    avr_auc += auc
avr_acc /= 10
avr_f1 /= 10
avr_auc /= 10
print("Naive Bayes:")
print("Accuracy: %f" % avr_acc)
print("F1-score: %f" % avr_f1)
print("AUC ROC : %f" % avr_auc)
Naive Bayes:
Accuracy: 0.885000
F1-score: 0.885972
AUC ROC : 0.884704
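One caveat about the loop above: roc_auc_score is given the hard 0/1 predictions, but ROC AUC is normally computed from continuous scores such as class probabilities. A hedged sketch of the score-based variant, on a small toy dataset rather than the 2000-sample one above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
# predict_proba column 1 is the estimated probability of the positive class
scores = clf.predict_proba(X_test)[:, 1]
auc_from_scores = metrics.roc_auc_score(y_test, scores)
auc_from_labels = metrics.roc_auc_score(y_test, clf.predict(X_test))
```

The score-based AUC uses the full ranking of test samples, while the label-based one collapses that ranking to a single threshold.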
SVM, using an RBF kernel, with the parameter C taken from [1e-02, 1e-01, 1e00, 1e01, 1e02]:
for cc in [1e-02, 1e-01, 1e00, 1e01, 1e02]:
    avr_acc = 0
    avr_f1 = 0
    avr_auc = 0
    for train_index, test_index in kf:
        X_train, y_train = data[train_index], target[train_index]
        X_test, y_test = data[test_index], target[test_index]
        clf = SVC(C=cc, kernel='rbf', gamma=0.1)
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        acc = metrics.accuracy_score(y_test, pred)
        avr_acc += acc
        f1 = metrics.f1_score(y_test, pred)
        avr_f1 += f1
        auc = metrics.roc_auc_score(y_test, pred)
        avr_auc += auc
    avr_acc /= 10
    avr_f1 /= 10
    avr_auc /= 10
    print("SVM: C = %f" % cc)
    print("Accuracy: %f" % avr_acc)
    print("F1-score: %f" % avr_f1)
    print("AUC ROC : %f" % avr_auc)
SVM: C = 0.010000
Accuracy: 0.791500
F1-score: 0.796085
AUC ROC : 0.804599
SVM: C = 0.100000
Accuracy: 0.891000
F1-score: 0.894124
AUC ROC : 0.891249
SVM: C = 1.000000
Accuracy: 0.889500
F1-score: 0.889850
AUC ROC : 0.889226
SVM: C = 10.000000
Accuracy: 0.864500
F1-score: 0.862406
AUC ROC : 0.864076
SVM: C = 100.000000
Accuracy: 0.848000
F1-score: 0.846445
AUC ROC : 0.847830
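The hand-written loop over C can also be expressed with GridSearchCV from the model_selection module (assumed available in scikit-learn >= 0.18): it runs the same per-value cross-validation and stores the mean test score for each candidate. A sketch on a smaller toy dataset to keep the runtime short:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

param_grid = {"C": [1e-02, 1e-01, 1e00, 1e01, 1e02]}
search = GridSearchCV(SVC(kernel='rbf', gamma=0.1), param_grid,
                      cv=10, scoring="accuracy")
search.fit(X, y)

# best_params_["C"] is the value with the highest mean 10-fold accuracy;
# cv_results_["mean_test_score"] holds the per-value means, one per C
```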
Random forest, with n_estimators taken from [10, 100, 1000] (the 1000-tree runs take a long time):
for nn in [10, 100, 1000]:
    avr_acc = 0
    avr_f1 = 0
    avr_auc = 0
    for train_index, test_index in kf:
        X_train, y_train = data[train_index], target[train_index]
        X_test, y_test = data[test_index], target[test_index]
        clf = RandomForestClassifier(n_estimators=nn)
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        acc = metrics.accuracy_score(y_test, pred)
        avr_acc += acc
        f1 = metrics.f1_score(y_test, pred)
        avr_f1 += f1
        auc = metrics.roc_auc_score(y_test, pred)
        avr_auc += auc
    avr_acc /= 10
    avr_f1 /= 10
    avr_auc /= 10
    print("Random forest: n_estimators = %d" % nn)
    print("Accuracy: %f" % avr_acc)
    print("F1-score: %f" % avr_f1)
    print("AUC ROC : %f" % avr_auc)
Random forest: n_estimators = 10
Accuracy: 0.906500
F1-score: 0.904558
AUC ROC : 0.906317
Random forest: n_estimators = 100
Accuracy: 0.919000
F1-score: 0.918893
AUC ROC : 0.919083
Random forest: n_estimators = 1000
Accuracy: 0.916500
F1-score: 0.916685
AUC ROC : 0.916653
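The per-fold averaging done by hand in each loop above can also be had in one call with cross_val_score from the model_selection module (again assuming scikit-learn >= 0.18); it returns one score per fold, whose mean matches the averages computed above. A sketch on a small toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# one accuracy score per fold; .mean() gives the 10-fold average
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
mean_acc = scores.mean()
```

Separate calls with scoring="f1" and scoring="roc_auc" would cover the other two metrics (the latter using probability scores internally).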
Comparing the three: the random forest achieves the highest accuracy; the SVM is also quite accurate when C is chosen well; and naive Bayes outperforms an SVM with a poorly chosen C.