Python Homework: the sklearn Part

Assignment:

In the second ML assignment you have to compare the performance of three different classification algorithms, namely Naive Bayes, SVM, and Random Forest.

For this assignment you need to generate a random binary classification problem, and then train and test (using 10-fold cross-validation) the three algorithms. For some algorithms, inner cross-validation (5-fold) is needed for choosing the parameters. Then, show the classification performance (per-fold and averaged) in the report, and briefly discuss the results.


 

Brief analysis before starting: first install scikit-learn via pip (note that the PyPI package is named scikit-learn, so the command is pip install scikit-learn). The code from the lecture slides is key for getting started, since it essentially provides the skeleton; the remaining functions can be found in the online documentation and tutorials.
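As a quick sanity check that the install works (a minimal sketch; the model_selection module used below requires scikit-learn >= 0.18):

import sklearn
print(sklearn.__version__)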

 

The code is as follows:

from sklearn import datasets
from sklearn.model_selection import KFold   # sklearn.cross_validation is deprecated/removed; model_selection replaces it
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.svm import SVC
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Train Gaussian Naive Bayes on (X_train, y_train) and return its predictions on X_test
def NB(X_train, y_train, X_test):
	clf = GaussianNB()
	clf.fit(X_train, y_train)
	return clf.predict(X_test)
	
# Given training data X, y, a test set, and the penalty parameter C, return the RBF-SVM predictions
def rbf_svm(X_train, y_train, X_test, C):
	clf = SVC(C=C, kernel='rbf')
	clf.fit(X_train, y_train)
	return clf.predict(X_test)

# Given training data X, y, a test set, and the n_estimators parameter, return the random forest predictions
def RFC(X_train, y_train, X_test, n_estimator):
	clf = RandomForestClassifier(n_estimators=n_estimator)
	clf.fit(X_train, y_train)
	return clf.predict(X_test)

acc_for_NB = []             # evaluate the three algorithms by accuracy
acc_for_SVC = []
acc_for_RFC = []

f1_for_NB = []              # evaluate the three algorithms by F1-score
f1_for_SVC = []
f1_for_RFC = []

auc_for_NB = []             # evaluate the three algorithms by ROC AUC
auc_for_SVC = []
auc_for_RFC = []

# Create a synthetic dataset X, y: n_samples examples with n_features features, of which
# n_informative are informative, n_redundant are linear combinations of the informative ones,
# n_repeated are duplicates, and n_classes is the number of classes (per the sklearn docs)
X, y = datasets.make_classification(n_samples=1000, n_features=10,
									n_informative=2, n_redundant=2,
									n_repeated=0, n_classes=2)

# Most of what follows can be adapted directly from the course material
kf = KFold(n_splits=10, shuffle=True, random_state=1234)
for train_index, test_index in kf.split(X):
	X_train, y_train = X[train_index], y[train_index]
	X_test, y_test = X[test_index], y[test_index]
	
	ipred = NB(X_train,y_train,X_test)
	
	# Evaluate the Naive Bayes predictions with all three metrics
	acc_for_NB.append(metrics.accuracy_score(y_test, ipred))  
	f1_for_NB.append(metrics.f1_score(y_test, ipred))  
	auc_for_NB.append(metrics.roc_auc_score(y_test, ipred))
	
	# Next the SVM: first pick the best C by inner 5-fold cross-validation
	bestC = None
	Cvalues = [1e-2, 1e-1, 1e0, 1e1, 1e2]
	innerscore = []
	for C in Cvalues:
		ikf = KFold(n_splits=5, shuffle=True, random_state=5678)
		innerf1 = []
		for t_index, v_index in ikf.split(X_train):
			X_t,X_v = X_train[t_index],X_train[v_index]
			y_t,y_v = y_train[t_index],y_train[v_index]
			
			ipred = rbf_svm(X_t,y_t,X_v,C)
			
			innerf1.append(metrics.f1_score(y_v,ipred))
		innerscore.append(sum(innerf1)/len(innerf1))
	bestC = Cvalues[np.argmax(innerscore)]
	
	SVCpred = rbf_svm(X_train,y_train,X_test,bestC)
	print("The bestC is:", bestC)  
	acc_for_SVC.append(metrics.accuracy_score(y_test,SVCpred))
	f1_for_SVC.append(metrics.f1_score(y_test,SVCpred))
	auc_for_SVC.append(metrics.roc_auc_score(y_test,SVCpred))
	
	# Finally the random forest; the inner 5-fold loop tunes n_estimators
	best_n_estimators_values = None
	n_estimators_values = [1, 10, 100, 1000]
	innerscore = []
	
	for estimators_value in n_estimators_values:
		ikf = KFold(n_splits=5, shuffle=True, random_state=5678)
		innerf1 = []
		for t_index, v_index in ikf.split(X_train):
			X_t,X_v = X_train[t_index],X_train[v_index]
			y_t,y_v = y_train[t_index],y_train[v_index]
			
			ipred = RFC(X_t,y_t,X_v,estimators_value)
			
			innerf1.append(metrics.f1_score(y_v,ipred))
		innerscore.append(sum(innerf1)/len(innerf1))
	best_n_estimators_values = n_estimators_values[np.argmax(innerscore)]
	print("The best_n_estimators_values is:", best_n_estimators_values)  
	RFCpred = RFC(X_train,y_train,X_test,best_n_estimators_values)
	acc_for_RFC.append(metrics.accuracy_score(y_test,RFCpred))
	f1_for_RFC.append(metrics.f1_score(y_test,RFCpred))
	auc_for_RFC.append(metrics.roc_auc_score(y_test,RFCpred))
	
print("Naive Bayes:")  
  
print("Evaluated by accuracy score:")  
print(acc_for_NB)  
print("Average:", sum(acc_for_NB) / len(acc_for_NB))  
print()  
  
print("Evaluated by f1 score:")  
print(f1_for_NB)  
print("Average:", sum(f1_for_NB) / len(f1_for_NB))  
print()  
  
print("Evaluated by roc auc score:")  
print(auc_for_NB)  
print("Average:", sum(auc_for_NB) / len(auc_for_NB))  
print()  
  
print("SVC:")  
  
print("Evaluated by accuracy score:")  
print(acc_for_SVC)  
print("Average:", sum(acc_for_SVC) / len(acc_for_SVC))  
print()  
  
print("Evaluated by f1 score:")  
print(f1_for_SVC)  
print("Average:", sum(f1_for_SVC) / len(f1_for_SVC))  
print()  
  
print("Evaluated by roc auc score:")  
print(auc_for_SVC)  
print("Average:", sum(auc_for_SVC) / len(auc_for_SVC))  
print()  
  
print("Random Forest:")  
  
print("Evaluated by accuracy score:")  
print(acc_for_RFC)  
print("Average:", sum(acc_for_RFC) / len(acc_for_RFC))  
print()  
  
print("Evaluated by f1 score:")  
print(f1_for_RFC)  
print("Average:", sum(f1_for_RFC) / len(f1_for_RFC))  
print()  
  
print("Evaluated by roc auc score:")  
print(auc_for_RFC)  
print("Average:", sum(auc_for_RFC) / len(auc_for_RFC))  
print()  

Brief analysis:

Starting from kf = KFold(n_splits=10, shuffle=True, random_state=1234):

We split the data into 10 folds, which is exactly what n_splits=10 specifies. As the official documentation explains, each iteration uses 10 - 1 = 9 folds for training and holds out the remaining fold as the validation set. The inner ikf = KFold(n_splits=5, shuffle=True, random_state=5678) works the same way with 5 folds, so it needs no separate discussion.
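To make the splitting concrete, here is a minimal standalone sketch (using a made-up 20-sample array, not the assignment data) showing that each of the 10 folds holds out one tenth of the samples:

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(20, 1)   # hypothetical toy data: 20 samples, 1 feature
kf_demo = KFold(n_splits=10, shuffle=True, random_state=1234)
for train_index, test_index in kf_demo.split(X_demo):
	print(len(train_index), len(test_index))   # prints "18 2" ten times: 9 folds train, 1 fold validates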

 

Experimental results:

The bestC is: 1.0

The best_n_estimators_values is: 1000

The bestC is: 1.0

The best_n_estimators_values is: 100

The bestC is: 1.0

The best_n_estimators_values is: 100

The bestC is: 1.0

The best_n_estimators_values is: 1000

The bestC is: 1.0

The best_n_estimators_values is: 100

The bestC is: 1.0

The best_n_estimators_values is: 100

The bestC is: 1.0

The best_n_estimators_values is: 1000

The bestC is: 1.0

The best_n_estimators_values is: 1000

The bestC is: 1.0

The best_n_estimators_values is: 1000

The bestC is: 1.0

The best_n_estimators_values is: 1000

Brief analysis:

The parameter to tune for the SVC is C, and for the random forest it is n_estimators. Across the 10 outer folds, the best C was always 1.0, while two values of n_estimators competed for the best choice (1000 was selected 6 times and 100 was selected 4 times, so 1000 may be marginally better).
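As a side note, the manual inner loop above can be written more compactly with sklearn's GridSearchCV, which runs the inner cross-validation and refits on the full training fold automatically. This is only a sketch of the alternative (same C grid and F1 scoring as above), not the code that produced the results in this post; X_train, y_train, and X_test refer to one outer fold:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

search = GridSearchCV(SVC(kernel='rbf'), {'C': [1e-2, 1e-1, 1e0, 1e1, 1e2]},
                      cv=5, scoring='f1')
search.fit(X_train, y_train)          # inner 5-fold CV over C, then refit on all of X_train
print(search.best_params_['C'])       # the inner-CV choice of C
SVCpred = search.predict(X_test)      # predictions from the refitted best model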

 

Below is the evaluation of the Naive Bayes classifier:

Naive Bayes:

Evaluated by accuracy score:

[0.83, 0.84, 0.95, 0.91, 0.91, 0.93, 0.92, 0.93, 0.87, 0.85]

Average: 0.8939999999999999

 

Evaluated by f1 score:

[0.8282828282828282, 0.8, 0.9411764705882353, 0.9010989010989011, 0.896551724137931, 0.9213483146067417, 0.923076923076923, 0.9292929292929293, 0.8737864077669903, 0.8543689320388349]

Average: 0.8868983430890314

 

Evaluated by roc auc score:

[0.8433441558441559, 0.8282828282828284, 0.9444444444444444, 0.9078525641025641, 0.9060606060606061, 0.927133655394525, 0.9252525252525251, 0.9305722288915566, 0.8697478991596639, 0.8627090983272134]

Average: 0.8945400005760084

 

We can see that Naive Bayes scores somewhat higher on accuracy and ROC AUC than on F1.

 

Next, the SVC results:

SVC:

Evaluated by accuracy score:

[0.94, 0.91, 0.98, 0.96, 0.95, 0.93, 0.96, 0.96, 0.92, 0.93]

Average: 0.944

 

Evaluated by f1 score:

[0.9464285714285714, 0.8988764044943819, 0.9772727272727273, 0.9583333333333334, 0.945054945054945, 0.9230769230769231, 0.9629629629629629, 0.9615384615384616, 0.9245283018867925, 0.9369369369369369]

Average: 0.9435009567986035

 

Evaluated by roc auc score:

[0.9391233766233766, 0.908080808080808, 0.9777777777777779, 0.9599358974358976, 0.9505050505050505, 0.928743961352657, 0.9616161616161615, 0.9595838335334135, 0.9191676670668267, 0.9328845369237047]

Average: 0.9437419070915676

 

 

As with Naive Bayes, the SVC scores slightly higher on accuracy and ROC AUC than on F1.

 

Finally, the random forest:

Random Forest:

Evaluated by accuracy score:

[0.98, 0.96, 0.99, 0.97, 0.98, 0.98, 0.95, 0.99, 0.94, 0.95]

Average: 0.969

 

Evaluated by f1 score:

[0.9818181818181818, 0.9555555555555556, 0.9887640449438202, 0.9690721649484536, 0.9777777777777777, 0.9777777777777777, 0.9532710280373831, 0.99009900990099, 0.9444444444444444, 0.9557522123893805]

Average: 0.9694332197593765

 

Evaluated by roc auc score:

[0.9821428571428572, 0.9595959595959597, 0.9888888888888889, 0.9703525641025641, 0.9797979797979798, 0.9782608695652174, 0.9525252525252524, 0.9901960784313726, 0.9387755102040816, 0.9504283965728273]

Average: 0.9690964356827001

 

All three metrics for the random forest come out around 0.969; unlike the previous two classifiers, this time F1 is actually the (marginally) highest of the three averages.

 

Overall, the average scores in this assignment rank, from lowest to highest: Naive Bayes, SVC, random forest, which suggests that the random forest fits this data best. Moreover, for the first two classifiers the F1 averages are slightly lower than accuracy and ROC AUC, which may indicate that F1 is the stricter of the three evaluations here.
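To see why F1 can sit below accuracy, consider a small made-up confusion matrix (purely illustrative, unrelated to the experiment above): 90 true negatives, 5 true positives, 3 false positives, 2 false negatives. Accuracy counts all correct predictions, while F1 = 2TP / (2TP + FP + FN) is dragged down by every error on the positive class:

from sklearn import metrics

# Hypothetical labels realizing TN=90, TP=5, FP=3, FN=2
y_true = [0] * 90 + [1] * 5 + [0] * 3 + [1] * 2
y_pred = [0] * 90 + [1] * 5 + [1] * 3 + [0] * 2
print(metrics.accuracy_score(y_true, y_pred))   # (90 + 5) / 100 = 0.95
print(metrics.f1_score(y_true, y_pred))         # 2*5 / (2*5 + 3 + 2) = 10/15 ≈ 0.667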
