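In this article, we discuss the following concepts, all of which are used to evaluate the performance of machine learning classification models:
- Cross-validating a model.
- The confusion matrix.
- The ROC curve.
- Cohen's κ score.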
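As a quick orientation before the real dataset, here is a minimal hand-worked sketch of two of the concepts above: the confusion matrix and Cohen's κ. The ten labels in `y_true` and `y_pred` are made up purely for illustration, and κ is computed with the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
# Hypothetical toy labels, chosen only to illustrate the arithmetic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
n = len(pairs)

# Confusion matrix cells for a binary problem.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

# Observed agreement: fraction of samples where prediction equals truth.
p_o = (tp + tn) / n

# Chance agreement: both say 1 by chance, plus both say 0 by chance.
p_true_1 = (tp + fn) / n
p_pred_1 = (tp + fp) / n
p_e = p_true_1 * p_pred_1 + (1 - p_true_1) * (1 - p_pred_1)

kappa = (p_o - p_e) / (1 - p_e)

print("TN, FP, FN, TP =", (tn, fp, fn, tp))
print("accuracy =", p_o, "kappa =", round(kappa, 3))
```

On this toy data the accuracy is 0.7 but κ is only 0.4, because half of the agreement would be expected by chance alone.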
Importing Python libraries
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import warnings

    warnings.filterwarnings('ignore')
We start by creating a simple machine learning dataset with three features and a binary label. The Python code is as follows:
    from sklearn.model_selection import train_test_split

    # Creating the dataset
    N = 1000  # number of samples
    data = {'A': np.random.normal(100, 8, N),
            'B': np.random.normal(60, 5, N),
            'C': np.random.choice([1, 2, 3], size=N, p=[0.2, 0.3, 0.5])}
    df = pd.DataFrame(data=data)

    # Labeling
    def get_label(A, B, C):
        if A < 95:
            return 1
        elif C == 1:
            return 1
        elif B > 68 or B < 52:
            return 1
        return 0

    df['label'] = df.apply(lambda row: get_label(row['A'], row['B'], row['C']), axis=1)

    # Dividing to train and test set
    X = np.asarray(df[['A', 'B', 'C']])
    y = np.asarray(df['label'])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
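Since several of the metrics discussed here depend on class balance, it can help to know roughly what fraction of samples the labeling rule marks as 1. The sketch below derives that fraction analytically from the sampling distributions used above, relying on the fact that A, B and C are drawn independently. The helper `normal_cdf` is introduced here for illustration and is not part of the original code.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# Probability that each condition of the labeling rule fires.
p_a = normal_cdf(95, 100, 8)                                  # A < 95
p_c = 0.2                                                     # C == 1
p_b = normal_cdf(52, 60, 5) + (1.0 - normal_cdf(68, 60, 5))   # B < 52 or B > 68

# The label is 0 only when none of the three conditions fires.
p_label_0 = (1.0 - p_a) * (1.0 - p_c) * (1.0 - p_b)
p_label_1 = 1.0 - p_label_0

print(f"Expected fraction of label 1: {p_label_1:.3f}")
```

The two classes come out close to balanced (a little under half of the samples are expected to be labeled 1), so a trivial classifier that always predicts one class should score near 50% accuracy. The empirical fraction in a given run, for example `df['label'].mean()`, will fluctuate around this value because the data is sampled without a fixed random seed.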
Let's use a simple logistic regression for the demonstration.
    from sklearn import linear_model
    from sklearn.model_selection import cross_val_score

    clf = linear_model.LogisticRegression()
    clf.fit(X_train, y_train)
    print(">> Score of the classifier on the train set is: