1. Evaluation Metrics for Classification Algorithms
1.1 Classification accuracy: higher is better
1.2 Log loss: lower is better
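As a quick illustration of log loss (using `sklearn.metrics.log_loss`; the labels and probabilities below are made-up toy values):

```python
from sklearn.metrics import log_loss

# True binary labels and predicted probabilities of the positive class (toy values)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.8, 0.9]

# Log loss heavily penalizes confident wrong predictions; lower is better
print("log loss: %.3f" % log_loss(y_true, y_prob))  # → log loss: 0.198
```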
1.3 AUC
The ROC curve is a composite indicator that reflects the trade-off between sensitivity and specificity as a continuous variable; AUC is the area under the ROC curve. The larger the AUC, the higher the diagnostic accuracy.
Sensitivity, also called the true positive rate: sensitivity = TP / (TP + FN)
False positive rate: FPR = FP / (FP + TN)
Specificity: specificity = TN / (FP + TN) = 1 - FPR
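Plugging hypothetical counts into these formulas (the numbers below are invented for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FN, FP, TN = 40, 10, 5, 45

sensitivity = TP / (TP + FN)   # true positive rate
FPR = FP / (FP + TN)           # false positive rate
specificity = TN / (FP + TN)   # equals 1 - FPR

print(sensitivity, FPR, specificity)  # → 0.8 0.1 0.9
```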
1.4 Confusion Matrix
The confusion matrix compares predicted classes against actual values, displaying the accuracy of the classification results in a single matrix. It is a visualization tool used mainly in supervised learning; in unsupervised learning the analogous structure is usually called a matching matrix.
1.5 Classification Report
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
F1 score (the harmonic mean of precision and recall): F1 = 2PR / (P + R)
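A worked example with made-up counts shows why the harmonic mean is used: it pulls F1 toward the smaller of the two values.

```python
# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives
TP, FP, FN = 30, 10, 20

P = TP / (TP + FP)         # precision = 0.75
R = TP / (TP + FN)         # recall = 0.6
F1 = 2 * P * R / (P + R)   # harmonic mean ≈ 0.667 (the arithmetic mean would be 0.675)

print("P=%.3f R=%.3f F1=%.3f" % (P, R, F1))  # → P=0.750 R=0.600 F1=0.667
```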
# Classification accuracy (X and Y are assumed to be loaded earlier)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
kfold=KFold(n_splits=10,random_state=7,shuffle=True)
model=LogisticRegression()
result=cross_val_score(model, X,Y,cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)"%(result.mean()*100,result.std()*100))
# Log loss (scoring='neg_log_loss' returns the negated loss, so larger values are better)
result=cross_val_score(model,X,Y,cv=kfold,scoring='neg_log_loss')
print("Logloss %.3f(%.3f)"%(result.mean(),result.std()))
#AUC
result=cross_val_score(model,X,Y,cv=kfold,scoring='roc_auc')
print("AUC %.3f(%.3f)"%(result.mean(),result.std()))
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report
# Fixing random_state guarantees the same split on every run, which makes results from different algorithms directly comparable
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.33,random_state=4)
model=LogisticRegression()
model.fit(X_train,Y_train)
pre_Y=model.predict(X_test)
# Confusion matrix
matrix=confusion_matrix(Y_test,pre_Y)
classes=['0','1']
dataframe=pd.DataFrame(data=matrix,index=classes,columns=classes)
print(dataframe)
# Classification report
report=classification_report(Y_test,pre_Y)
print(report)
2. Evaluation Metrics for Regression Algorithms
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Coefficient of determination (R2)
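Before the cross-validated versions below, these three metrics can be computed directly from `sklearn.metrics` (the toy targets and predictions are invented for illustration):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy regression targets and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MAE: %.3f" % mean_absolute_error(y_true, y_pred))  # → MAE: 0.500
print("MSE: %.3f" % mean_squared_error(y_true, y_pred))   # → MSE: 0.375
print("R2 : %.3f" % r2_score(y_true, y_pred))             # → R2 : 0.949
```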
from sklearn.linear_model import LinearRegression
names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
data=pd.read_csv("housing.csv",names=names,delim_whitespace=True)
X=data.values[:,:13]
Y=data.values[:,13]
kfold=KFold(n_splits=10,random_state=7,shuffle=True)
model=LinearRegression()
result=cross_val_score(model, X,Y,cv=kfold,scoring='neg_mean_absolute_error')
print("MAE %.3f(%.3f)"%(result.mean(),result.std()))
result=cross_val_score(model, X,Y,cv=kfold,scoring='neg_mean_squared_error')
print("MSE %.3f(%.3f)"%(result.mean(),result.std()))
result=cross_val_score(model, X,Y,cv=kfold,scoring='r2')
print("R2 %.3f(%.3f)"%(result.mean(),result.std()))
3. Spot-Checking Classification Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
model={}
model['LR']=LogisticRegression()
model['LDA']=LinearDiscriminantAnalysis()
model['DTC']=DecisionTreeClassifier()
model['KNN']= KNeighborsClassifier()
model['GNB']=GaussianNB()
model['SVM']=SVC()
results=[]
for key in model:
    kfold=KFold(n_splits=10,random_state=7,shuffle=True)
    result=cross_val_score(model[key], X,Y,cv=kfold)
    results.append(result)
    print('%s: %f (%f)'%(key,result.mean(),result.std()))
Plot the results to compare the algorithms:
# Box plot comparison
import matplotlib.pyplot as plt
fig=plt.figure()
fig.suptitle('Algorithm Comparison')
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(model.keys())
plt.show()
4. Spot-Checking Regression Algorithms
4.1 Linear Regression
4.2 Ridge Regression
An improved least-squares estimator: by giving up the unbiasedness of ordinary least squares, it sacrifices some information and precision to obtain regression coefficients that are more realistic and reliable. It fits ill-conditioned (e.g. collinear) data better than ordinary least squares.
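A minimal sketch of this effect on synthetic data with two nearly collinear columns (all numbers below are invented; `alpha` is the strength of the L2 penalty):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic ill-conditioned data: the second feature is an almost exact copy of the first
rng = np.random.RandomState(7)
x = rng.rand(50)
X = np.c_[x, x + rng.normal(scale=1e-4, size=50)]
y = 3 * x + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the strength of the L2 penalty

# OLS coefficients tend to explode on collinear data; ridge keeps them bounded
print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```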
4.3 Lasso Regression
4.4 Elastic Net Regression
4.5 K-Nearest Neighbors
4.6 Classification and Regression Trees (CART)
4.7 Support Vector Machines
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
data=pd.read_csv("housing.csv",names=names,delim_whitespace=True)
X=data.values[:,:13]
Y=data.values[:,13]
models={}
models['LR']=LinearRegression()
models['Ridge']=Ridge()
models['Lasso']=Lasso()
models['EN']=ElasticNet()
models['KNN']=KNeighborsRegressor()
models['DTR']=DecisionTreeRegressor()
models['SVR']=SVR()
results=[]
for key in models:
    kfold=KFold(n_splits=10,random_state=7,shuffle=True)
    result=cross_val_score(models[key], X,Y, cv=kfold,scoring='neg_mean_squared_error')
    results.append(result)
    print('%s: %f'%(key,result.mean()))
Parameters of the cross_val_score function
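Its main parameters, sketched on the iris dataset (the estimator and scoring choices here are just examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

scores = cross_val_score(
    LogisticRegression(max_iter=1000),                   # estimator: the model to evaluate
    X, y,                                                # the data and targets
    scoring='accuracy',                                  # scoring: a predefined metric string
    cv=KFold(n_splits=5, shuffle=True, random_state=7),  # cv: an integer or a splitter object
    n_jobs=1,                                            # n_jobs: folds evaluated in parallel
)
print("accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

It returns one score per fold as a NumPy array, so the mean and standard deviation summarize both performance and its variability across folds.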