Develop a Model for the Imbalanced Classification of Good and Bad Credit in Python

背包客研究

Published 2023-09-28 10:43:30

Article link: https://blog.csdn.net/weixin_39168167/article/details/133377716

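In some imbalanced classification tasks, misclassifying a minority class example is more costly than other types of prediction error.

One example is classifying whether a bank customer should be granted a loan. Giving a loan to a bad customer marked as a good customer costs the bank more than denying a loan to a good customer marked as a bad customer.

This requires careful selection of a performance metric that both promotes minimizing misclassification errors in general and favors minimizing one type of misclassification error over another.

The German credit dataset is a standard imbalanced classification dataset that has this property of differing costs for misclassification errors. Models evaluated on this dataset can be scored with the Fbeta-Measure, which quantifies model performance generally while capturing the requirement that one type of misclassification error is more costly than another.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced German credit classification dataset.

After completing this tutorial, you will know:

How to load and explore the dataset and generate ideas for data preparation and model selection.
How to evaluate a suite of machine learning models and improve their performance with data undersampling techniques.
How to fit a final model and use it to predict class labels for specific cases.

Tutorial Overview

This tutorial is divided into five parts; they are:

German Credit Dataset
Explore the Dataset
Model Test and Baseline Result
Evaluate Models
1. Evaluate Machine Learning Algorithms
2. Evaluate Undersampling
3. Further Model Improvements
Make Prediction on New Data
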
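German Credit Dataset

In this project, we will use a standard imbalanced machine learning dataset referred to as the "German Credit" dataset, or simply "German."

The dataset was used as part of the Statlog project, a European-based initiative in the 1990s to evaluate and compare a large number (at the time) of machine learning algorithms on a range of different classification tasks. The dataset is credited to Hans Hofmann.

The fragmentation amongst different disciplines has almost certainly hindered communication and progress. The StatLog project was designed to break down these divisions by selecting classification procedures regardless of historical pedigree, testing them on large-scale and commercially important problems, and hence to determine to what extent the various techniques met the needs of industry.

Page 4, Machine Learning, Neural and Statistical Classification, 1994.

The German credit dataset describes financial and banking details for customers, and the task is to determine whether the customer is good or bad. The assumption is that the task involves predicting whether a customer will pay back a loan or credit.

The dataset includes 1,000 examples and 20 input variables, 7 of which are numerical (integer) and 13 are categorical.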

Status of existing checking account
Duration in month
Credit history
Purpose
Credit amount
Savings account
Present employment since
Installment rate in percentage of disposable income
Personal status and sex
Other debtors
Present residence since
Property
Age in years
Other installment plans
Housing
Number of existing credits at this bank
Job
Number of dependents
Telephone
Foreign worker

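Some of the categorical variables have an ordinal relationship, such as "Savings account," although most do not.

There are two classes: 1 for good customers and 2 for bad customers. Good customers are the default or negative class, whereas bad customers are the exception or positive class. A total of 70 percent of the examples are good customers, and the remaining 30 percent are bad customers.

Good Customers: negative or majority class (70%).
Bad Customers: positive or minority class (30%).

A cost matrix is provided with the dataset that gives a different penalty to each misclassification error for the positive class. Specifically, a cost of 5 is applied to a false negative (marking a bad customer as good) and a cost of 1 is assigned to a false positive (marking a good customer as bad).

Cost for False Negative: 5
Cost for False Positive: 1

This suggests that the positive class is the focus of the prediction task and that it is more costly to the bank or financial institution to give money to a bad customer than to not give money to a good customer. This must be taken into account when selecting a performance metric.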
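The effect of the cost matrix can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper name total_cost and the error counts for the two models are hypothetical and are not results from this tutorial. It shows that two models making the same number of mistakes can differ a lot in cost.

```python
# costs taken from the cost matrix provided with the dataset
COST_FALSE_NEGATIVE = 5  # bad customer marked as good
COST_FALSE_POSITIVE = 1  # good customer marked as bad

def total_cost(false_negatives, false_positives):
    """Return the total misclassification cost for the given error counts."""
    return false_negatives * COST_FALSE_NEGATIVE + false_positives * COST_FALSE_POSITIVE

# hypothetical error counts: both models make 60 mistakes in total
cost_a = total_cost(false_negatives=40, false_positives=20)
cost_b = total_cost(false_negatives=20, false_positives=40)
print('Model A cost: %d' % cost_a)
print('Model B cost: %d' % cost_b)
```

Model B makes the same number of errors as Model A but is cheaper, because more of its errors are the less expensive false positives.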

Next, let's take a closer look at the data.

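Explore the Dataset

First, download the dataset and save it in your current working directory with the name "german.csv".

Download German Credit Dataset (german.csv)

Review the contents of the file.

The first few lines of the file should look as follows: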

A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201,1
A12,48,A32,A43,5951,A61,A73,2,A92,A101,2,A121,22,A143,A152,1,A173,1,A191,A201,2
A14,12,A34,A46,2096,A61,A74,2,A93,A101,3,A121,49,A143,A152,1,A172,2,A191,A201,1
A11,42,A32,A42,7882,A61,A74,2,A93,A103,4,A122,45,A143,A153,1,A173,2,A191,A201,1
A11,24,A33,A40,4870,A61,A73,3,A93,A101,4,A124,53,A143,A153,2,A173,2,A191,A201,2
...

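We can see that the categorical columns are encoded in an Axxx format, where "x" stands for the integer digits that identify each label. The categorical variables will need to be one-hot encoded.

We can also see that the numerical variables have different scales, e.g. 6, 48, and 12 in column 2, and 1169, 5951, etc. in column 5. This suggests that scaling the integer columns will be needed for algorithms that are sensitive to scale.

The target variable or class is the last column and contains values of 1 and 2. These will need to be label encoded to 0 and 1, respectively, to meet the general expectation for imbalanced binary classification tasks, where 0 represents the negative case and 1 represents the positive case.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.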

...
# define the dataset location
filename = 'german.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

...
# summarize the shape of the dataset
print(dataframe.shape)

We can also summarize the number of examples in each class using a Counter object.

...
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'german.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

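Running the example first loads the dataset and confirms the number of rows and columns, that is 1,000 rows with 20 input variables and 1 target variable.

The class distribution is then summarized, confirming the number of good and bad customers and the percentage of cases in the minority and majority classes.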

(1000, 21)
Class=1, Count=700, Percentage=70.000%
Class=2, Count=300, Percentage=30.000%

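We can also look at the distribution of the seven numerical input variables by creating a histogram for each.

First, we can select the columns with numeric variables by calling the select_dtypes() function on the DataFrame. We can then select just those columns from the DataFrame. We would expect there to be seven, plus the numerical class label.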

...
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]
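As a quick check of which columns this selection picks up, the sketch below applies the same select_dtypes() call to the first three rows of the file shown earlier, read from an in-memory string rather than from german.csv (the in-memory sample is an assumption made to keep the example self-contained).

```python
from io import StringIO
from pandas import read_csv

# first three rows of the file, copied from the listing earlier in this tutorial
sample = StringIO(
    "A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201,1\n"
    "A12,48,A32,A43,5951,A61,A73,2,A92,A101,2,A121,22,A143,A152,1,A173,1,A191,A201,2\n"
    "A14,12,A34,A46,2096,A61,A74,2,A93,A101,3,A121,49,A143,A152,1,A172,2,A191,A201,1\n"
)
df = read_csv(sample, header=None)
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
print(list(num_ix))
print(len(num_ix))
```

Eight columns are selected: the seven numerical inputs plus the last column (index 20), which is the class label.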

We can then create histograms of each numeric input variable. The complete example is listed below.

# create histograms of numeric input variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'german.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]
# create a histogram plot of each numeric variable
ax = subset.hist()
# disable axis labels to avoid the clutter
for axis in ax.flatten():
	axis.set_xticklabels([])
	axis.set_yticklabels([])
# show the plot
pyplot.show()

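Running the example creates a figure with one histogram subplot for each of the seven numerical input variables and the class label. The title of each subplot indicates the column number in the DataFrame (zero-offset, from 0 to 20).

We can see many different distributions, some Gaussian-like, others seemingly exponential or discrete.

Depending on the choice of modeling algorithm, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.

Figure: Histogram of Numeric Variables in the German Credit Dataset

Now that we have reviewed the dataset, let's look at developing a test harness for evaluating candidate models.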

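Model Test and Baseline Result

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 1000/10 or 100 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 70 percent good customers to 30 percent bad customers. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 10 * 3 or 30 times, and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.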
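To see what this procedure produces, the sketch below runs RepeatedStratifiedKFold on synthetic labels with the same 700/300 class split as the dataset. The labels and the placeholder inputs are assumptions for illustration only; the real data is not loaded.

```python
from numpy import zeros, ones, concatenate
from sklearn.model_selection import RepeatedStratifiedKFold

# synthetic labels with the same 70/30 class split as the dataset (not the real data)
y = concatenate([zeros(700, dtype=int), ones(300, dtype=int)])
# placeholder inputs; only the labels determine how the folds are stratified
X = zeros((len(y), 1))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
fold_sizes, fold_positives = list(), list()
for train_ix, test_ix in cv.split(X, y):
    fold_sizes.append(len(test_ix))
    fold_positives.append(int(y[test_ix].sum()))
print('Evaluations: %d' % len(fold_sizes))
print('Test fold sizes: %s' % set(fold_sizes))
print('Positives per test fold: %s' % set(fold_positives))
```

Each of the 30 test folds holds 100 examples, 30 of which belong to the positive class, so every evaluation sees the same class mixture.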

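We will predict class labels of whether a customer is good or not. Therefore, we need a measure that is appropriate for evaluating the predicted class labels.

The focus of the task is on the positive class (bad customers). Precision and recall are a good place to start. Maximizing precision will minimize the false positives, and maximizing recall will minimize the false negatives in the predictions made by a model.

Precision = TruePositives / (TruePositives + FalsePositives)
Recall = TruePositives / (TruePositives + FalseNegatives)

Using the F-Measure will calculate the harmonic mean between precision and recall. This is a good single number that can be used to compare and select a model on this problem. The issue is that false negatives are more damaging than false positives.

F-Measure = (2 * Precision * Recall) / (Precision + Recall)

Remember that false negatives on this dataset are cases of a bad customer being marked as a good customer and being given a loan. False positives are cases of a good customer being marked as a bad customer and not being given a loan.

False Negative: bad customer (class 1) predicted as a good customer (class 0).
False Positive: good customer (class 0) predicted as a bad customer (class 1).

False negatives are more costly to the bank than false positives.

Cost(False Negatives) > Cost(False Positives)

Put another way, we are interested in the F-Measure, which summarizes a model's ability to minimize misclassification errors for the positive class, but we want to favor models that are better at minimizing false negatives over false positives.

This can be achieved by using a version of the F-Measure that calculates a weighted harmonic mean of precision and recall but favors higher recall scores over precision scores. This is called the Fbeta-Measure, a generalization of the F-Measure, where "beta" is a parameter that defines the weighting of the two scores.

Fbeta-Measure = ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall)

A beta value of 2 weights recall more heavily than precision and is referred to as the F2-Measure.

F2-Measure = ((1 + 2^2) * Precision * Recall) / (2^2 * Precision + Recall)

We will use this measure to evaluate models on the German credit dataset. This can be achieved using the fbeta_score() scikit-learn function.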
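The weighting effect of beta is easy to check by hand. The sketch below implements the Fbeta formula directly from precision and recall; the helper name fbeta and the two example models with mirrored scores are hypothetical, chosen only to show how the F2-Measure separates them.

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return ((1 + b2) * precision * recall) / (b2 * precision + recall)

# two hypothetical models with mirrored precision and recall
high_recall = fbeta(precision=0.5, recall=0.9, beta=2)
high_precision = fbeta(precision=0.9, recall=0.5, beta=2)
print('F2, high-recall model: %.3f' % high_recall)
print('F2, high-precision model: %.3f' % high_precision)
# with beta=1 the two models receive the same score
print('F1, either model: %.3f' % fbeta(0.5, 0.9, 1))
```

With beta set to 2 the high-recall model scores clearly higher, whereas the plain F-Measure cannot tell the two apart.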

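We can define a function to load the dataset and split the columns into input and output variables. We will one-hot encode the categorical variables and label encode the target variable. You might recall that a one-hot encoding replaces the categorical variable with one new column for each value of the variable and marks values with a 1 in the column for that value.

First, we must split the DataFrame into input and output variables.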

...
# split into inputs and outputs
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]

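Next, we need to select all input variables that are categorical, then apply a one-hot encoding and leave the numerical variables untouched.

This can be achieved using a ColumnTransformer and defining the transform as a OneHotEncoder applied only to the column indices for categorical variables.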

...
# select categorical features
cat_ix = X.select_dtypes(include=['object', 'bool']).columns
# one hot encode cat features only
ct = ColumnTransformer([('o',OneHotEncoder(),cat_ix)], remainder='passthrough')
X = ct.fit_transform(X)
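To make the effect of a one-hot encoding concrete, the sketch below encodes a single categorical column by hand in plain Python. It is only an illustration of the idea and not how OneHotEncoder is implemented; the values are those of the first column in the first five rows of the file shown earlier.

```python
# values of the first column in the first five rows of the file
column = ['A11', 'A12', 'A14', 'A11', 'A11']
# one new column per distinct value, in sorted order
categories = sorted(set(column))
# mark a 1 in the column that matches the value, 0 elsewhere
encoded = [[1 if value == cat else 0 for cat in categories] for value in column]
print(categories)
for value, row in zip(column, encoded):
    print(value, row)
```

One input column with three distinct values becomes three binary columns, which is why the number of input columns grows after encoding.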

We can then label encode the target variable.

...
# label encode the target variable to have the classes 0 and 1
y = LabelEncoder().fit_transform(y)
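As a small check of the mapping, the sketch below applies LabelEncoder to the target values of the first five rows of the file shown earlier (1, 2, 1, 1, 2). The short list is used here only to keep the example self-contained.

```python
from sklearn.preprocessing import LabelEncoder

# target values from the first five rows of the file
y_sample = [1, 2, 1, 1, 2]
encoder = LabelEncoder()
encoded_sample = encoder.fit_transform(y_sample)
# classes_ lists the original labels in the order of their new integer codes
print(list(encoder.classes_))
print(list(encoded_sample))
```

The original label 1 (good customer) becomes 0 and the original label 2 (bad customer) becomes 1.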

The load_dataset() function below ties all of this together and loads and prepares the dataset for modeling.

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	# one hot encode cat features only
	ct = ColumnTransformer([('o',OneHotEncoder(),cat_ix)], remainder='passthrough')
	X = ct.fit_transform(X)
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X, y

Next, we need a function that will evaluate a set of predictions using the fbeta_score() function with beta set to 2.

# calculate f2 score
def f2(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

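We can then define a function that will evaluate a given model on the dataset and return a list of F2-Measure scores for each fold and repeat.

The evaluate_model() function below implements this, taking the dataset and model as arguments and returning the list of scores.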

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

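Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the minority class for all examples will achieve a maximum recall score and a baseline precision score. This provides a baseline in model performance on this problem by which all other models can be compared.

This can be achieved using the DummyClassifier class from the scikit-learn library, setting the "strategy" argument to "constant" and the "constant" argument to "1" for the minority class.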

...
# define the reference model
model = DummyClassifier(strategy='constant', constant=1)

Once the model is evaluated, we can report the mean and standard deviation of the F2-Measure scores directly.

...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean F2: %.3f (%.3f)' % (mean(scores), std(scores)))

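Tying this together, the complete example of loading the German credit dataset, evaluating a baseline model, and reporting the performance is listed below.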

# test harness and baseline model evaluation for the german credit dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	# one hot encode cat features only
	ct = ColumnTransformer([('o',OneHotEncoder(),cat_ix)], remainder='passthrough')
	X = ct.fit_transform(X)
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X, y

# calculate f2 score
def f2(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='constant', constant=1)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean F2: %.3f (%.3f)' % (mean(scores), std(scores)))

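Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded, and that the one-hot encoding of the categorical input variables has increased the number of input variables from 20 to 61. This suggests that the 13 categorical variables were encoded into a total of 54 columns.

Importantly, we can see that the class labels have the correct mapping to integers, with 0 for the majority class and 1 for the minority class, as is customary for imbalanced binary classification datasets.

Next, the average of the F2-Measure scores is reported.

In this case, we can see that the baseline algorithm achieves an F2-Measure of about 0.682. This score provides a lower limit on model skill; any model that achieves an average F2-Measure above about 0.682 has skill, whereas models that achieve a score below this value do not have skill on this dataset.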

(1000, 61) (1000,) Counter({0: 700, 1: 300})
Mean F2: 0.682 (0.000)
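The baseline score can be reproduced by hand from the class counts, which is a useful sanity check on the test harness. The sketch below works through the arithmetic for a model that predicts class 1 for every example, using the counts on the whole dataset.

```python
# class counts in the dataset
n_good, n_bad = 700, 300
# outcomes when class 1 (bad customer) is predicted for every example
true_positives = n_bad    # every bad customer is flagged
false_positives = n_good  # every good customer is flagged as well
false_negatives = 0       # no bad customer is missed
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
beta = 2
f2 = ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
print('Precision: %.3f, Recall: %.3f, F2: %.3f' % (precision, recall, f2))
```

Because the folds are stratified, each test fold has the same 30 percent share of bad customers, so the score is the same in every fold and the reported standard deviation is 0.000.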

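Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

Evaluate Models

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The goal is both to demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

Can you do better? If you can achieve better F2-Measure performance using the same test harness, I'd love to hear about it. Let me know in the comments below.

Evaluate Machine Learning Algorithms

Let's start by evaluating a mixture of probabilistic machine learning models on the dataset.

It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly find out what works well and deserves further attention, and what doesn't.

We will evaluate the following machine learning models on the German credit dataset:

Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
Naive Bayes (NB)
Gaussian Process Classifier (GPC)
Support Vector Machine (SVM)

We will use mostly default model hyperparameters.

We will define each model in turn and add them to a list so that we can evaluate them sequentially. The get_models() function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.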

# define models to test
def get_models():
	models, names = list(), list()
	# LR
	models.append(LogisticRegression(solver='liblinear'))
	names.append('LR')
	# LDA
	models.append(LinearDiscriminantAnalysis())
	names.append('LDA')
	# NB
	models.append(GaussianNB())
	names.append('NB')
	# GPC
	models.append(GaussianProcessClassifier())
	names.append('GPC')
	# SVM
	models.append(SVC(gamma='scale'))
	names.append('SVM')
	return models, names

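We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

We will one-hot encode the categorical input variables as we did in the previous section, and in this case, we will normalize the numerical input variables. This is best performed using the MinMaxScaler within each fold of the cross-validation evaluation process.

An easy way to implement this is to use a Pipeline where the first step is a ColumnTransformer that applies a OneHotEncoder to just the categorical variables and a MinMaxScaler to just the numerical input variables. To achieve this, we need a list of the column indices for categorical and numerical input variables.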
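The reason for scaling inside each fold is that the minimum and maximum should be learned from the training portion only and then reused on the held-out portion. The plain-Python sketch below illustrates that idea; the helper names and the loan durations are hypothetical, and it is not the MinMaxScaler implementation.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values using previously learned bounds."""
    return [(v - lo) / (hi - lo) for v in values]

# hypothetical loan durations in months, split into a train and a test fold
train = [6, 12, 24, 48]
test = [9, 60]
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
print(train_scaled)
print(test_scaled)
```

The training values land in the range 0 to 1, while a test value larger than anything seen in training scales above 1. That is the expected behavior when no information from the test fold leaks into the transform, and it is what wrapping the scaler in a Pipeline arranges for each fold.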

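We can update load_dataset() to return the column indexes as well as the input and output elements of the dataset. The updated version of this function is listed below.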

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

We can then call this function to get the data and the lists of categorical and numerical variables.

...
# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)

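This can be used to prepare a Pipeline to wrap each model prior to evaluating it.

First, the ColumnTransformer is defined, which specifies what transform to apply to each type of column. It is then used as the first step in a Pipeline that ends with the specific model that will be fit and evaluated.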

...
# evaluate each model
for i in range(len(models)):
	# one hot encode categorical, normalize numerical
	ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
	# wrap the model in a pipeline
	pipeline = Pipeline(steps=[('t',ct),('m',models[i])])
	# evaluate the model and store results
	scores = evaluate_model(X, y, pipeline)

We can summarize the mean F2-Measure for each algorithm; this will help to directly compare algorithms.

...
# summarize and store
print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))

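At the end of the run, we will create a separate box and whisker plot for each algorithm's sample of results.

These plots will use the same y-axis scale so we can compare the distribution of results directly.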

...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

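Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the German credit dataset is listed below.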

# spot check machine learning algorithms on the german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.svm import SVC

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2_measure)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define models to test
def get_models():
	models, names = list(), list()
	# LR
	models.append(LogisticRegression(solver='liblinear'))
	names.append('LR')
	# LDA
	models.append(LinearDiscriminantAnalysis())
	names.append('LDA')
	# NB
	models.append(GaussianNB())
	names.append('NB')
	# GPC
	models.append(GaussianProcessClassifier())
	names.append('GPC')
	# SVM
	models.append(SVC(gamma='scale'))
	names.append('SVM')
	return models, names

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# one hot encode categorical, normalize numerical
	ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
	# wrap the model in a pipeline
	pipeline = Pipeline(steps=[('t',ct),('m',models[i])])
	# evaluate the model and store results
	scores = evaluate_model(X, y, pipeline)
	results.append(scores)
	# summarize and store
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

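Running the example evaluates each algorithm in turn and reports the mean and standard deviation of the F2-Measure.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that none of the tested models have an F2-Measure above the baseline of predicting the minority class in all cases (0.682). None of the models are skillful. This is surprising, although it suggests that perhaps the decision boundary between the two classes is noisy.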

>LR 0.497 (0.072)
>LDA 0.519 (0.072)
>NB 0.639 (0.049)
>GPC 0.219 (0.061)
>SVM 0.436 (0.077)

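A figure is created showing one box and whisker plot for each algorithm's sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.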
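The quantities drawn in each box can also be computed directly, which helps when reading the plot. The sketch below uses the standard library on a hypothetical sample of ten F2 scores; these are not results from this tutorial.

```python
from statistics import mean, median, quantiles

# hypothetical F2 scores from ten evaluations (not results from the tutorial)
scores = [0.41, 0.45, 0.47, 0.49, 0.50, 0.52, 0.53, 0.55, 0.58, 0.62]
# quartiles of the sample: the box spans the first to the third quartile
q1, q2, q3 = quantiles(scores, n=4, method='inclusive')
print('Box spans %.3f to %.3f' % (q1, q3))
print('Median (orange line): %.3f' % median(scores))
print('Mean (green triangle): %.3f' % mean(scores))
```

When the mean sits noticeably below or above the median, the sample is skewed by a few unusually low or high evaluations.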

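Now that we have some results, let's see if we can improve them with some undersampling.

Evaluate Undersampling

Undersampling is perhaps the least widely used technique when addressing an imbalanced classification task, as most of the focus is put on oversampling the minority class with SMOTE.

Undersampling can help to remove examples from the majority class along the decision boundary that make the problem challenging for classification algorithms.

In this experiment we will test the following undersampling algorithms: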

Tomek Links (TL)
Edited Nearest Neighbors (ENN)
Repeated Edited Nearest Neighbors (RENN)
One Sided Selection (OSS)
Neighborhood Cleaning Rule (NCR)

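The Tomek Links and ENN methods select examples from the majority class to delete, whereas OSS and NCR both select examples to keep and examples to delete. To keep things simple, we will use the balanced version of the logistic regression algorithm to test each undersampling method.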
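To give a feel for how one of these methods chooses examples to delete, the sketch below finds Tomek links by hand: pairs of examples from different classes that are each other's nearest neighbor. It is a minimal plain-Python illustration on hypothetical 2D points, not the imbalanced-learn implementation, and the helper names are made up for this example.

```python
def nearest_neighbor(index, points):
    """Return the index of the point closest to points[index]."""
    best, best_dist = None, float('inf')
    for j, other in enumerate(points):
        if j == index:
            continue
        # squared Euclidean distance is enough for ranking neighbors
        dist = sum((a - b) ** 2 for a, b in zip(points[index], other))
        if dist < best_dist:
            best, best_dist = j, dist
    return best

def tomek_links(points, labels):
    """Return pairs (i, j) of mutual nearest neighbors with different labels."""
    links = list()
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and nearest_neighbor(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# hypothetical 2D points: 0 is the majority class, 1 is the minority class
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.2, 3.1)]
labels = [0, 0, 0, 1, 1, 1]
links = tomek_links(points, labels)
# delete only the majority-class member of each link
to_remove = {i if labels[i] == 0 else j for i, j in links}
kept = [p for k, p in enumerate(points) if k not in to_remove]
print(links)
print(kept)
```

Only the majority-class point that sits right next to a minority-class point is removed, which cleans up the region around the class boundary while leaving the rest of the data untouched.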

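The get_models() function from the previous section can be updated to return a list of undersampling techniques to test with the logistic regression algorithm. We use the implementations of these algorithms from the imbalanced-learn library.

The updated version of the get_models() function defining the undersampling methods is listed below.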

# define undersampling models to test
def get_models():
	models, names = list(), list()
	# TL
	models.append(TomekLinks())
	names.append('TL')
	# ENN
	models.append(EditedNearestNeighbours())
	names.append('ENN')
	# RENN
	models.append(RepeatedEditedNearestNeighbours())
	names.append('RENN')
	# OSS
	models.append(OneSidedSelection())
	names.append('OSS')
	# NCR
	models.append(NeighbourhoodCleaningRule())
	names.append('NCR')
	return models, names

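The Pipeline provided by scikit-learn does not know about undersampling algorithms. Therefore, we must use the Pipeline implementation provided by the imbalanced-learn library.

As in the previous section, the first step of the pipeline will be one-hot encoding of categorical variables and normalization of numerical variables, and the final step will be fitting the model. Here, the middle step will be the undersampling technique, correctly applied within the cross-validation evaluation on the training dataset only.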

...
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then undersample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', models[i]), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)

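We would expect the undersampling to result in a lift in the skill of logistic regression, ideally above the baseline performance of predicting the minority class in all cases.

Tying this together, the complete example of evaluating logistic regression with different undersampling methods on the German credit dataset is listed below.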

# evaluate undersampling with logistic regression on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import OneSidedSelection

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2_measure)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define undersampling models to test
def get_models():
	models, names = list(), list()
	# TL
	models.append(TomekLinks())
	names.append('TL')
	# ENN
	models.append(EditedNearestNeighbours())
	names.append('ENN')
	# RENN
	models.append(RepeatedEditedNearestNeighbours())
	names.append('RENN')
	# OSS
	models.append(OneSidedSelection())
	names.append('OSS')
	# NCR
	models.append(NeighbourhoodCleaningRule())
	names.append('NCR')
	return models, names

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# define model to evaluate
	model = LogisticRegression(solver='liblinear', class_weight='balanced')
	# one hot encode categorical, normalize numerical
	ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
	# scale, then undersample, then fit model
	pipeline = Pipeline(steps=[('t',ct), ('s', models[i]), ('m',model)])
	# evaluate the model and store results
	scores = evaluate_model(X, y, pipeline)
	results.append(scores)
	# summarize and store
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

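Running the example evaluates the logistic regression algorithm with five different undersampling techniques.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that three of the five undersampling techniques resulted in an F2-Measure that improves on the baseline of 0.682: ENN, RENN, and NCR. Repeated Edited Nearest Neighbors (RENN) gives the best performance, with an F2-Measure of about 0.714.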

>TL 0.669 (0.057)
>ENN 0.706 (0.048)
>RENN 0.714 (0.041)
>OSS 0.670 (0.054)
>NCR 0.693 (0.052)

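Box and whisker plots are created for each of the evaluated undersampling techniques, showing that they generally have the same spread.

It is encouraging to see that for the well-performing methods, the boxes spread up to around 0.8, and the mean and median for all three methods are around 0.7. This highlights that the distributions are skewing high and are let down on occasion by a few bad evaluations.

Next, let's look at some further model improvements before using a final model to make predictions on new data.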

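Further Model Improvements

This is a new section that provides a minor departure from the section above. Here, we will test specific models that result in a further lift in F2-Measure performance, and I will update this section as new models are reported or discovered.

Improvement #1: InstanceHardnessThreshold

An F2-Measure of about 0.727 can be achieved using balanced logistic regression with InstanceHardnessThreshold undersampling.

The complete example is listed below.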

# improve performance on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import InstanceHardnessThreshold

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2_measure)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# define the data sampling
sampling = InstanceHardnessThreshold()
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then sample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
print('%.3f (%.3f)' % (mean(scores), std(scores)))

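Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example gives the following results.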

0.727 (0.033)

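Improvement #2: SMOTEENN

An F2-Measure of about 0.730 can be achieved using LDA with SMOTEENN, where the enn argument is set to an EditedNearestNeighbours instance with sampling_strategy set to 'majority'.

The complete example is listed below.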

# improve performance on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2_measure)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = LinearDiscriminantAnalysis()
# define the data sampling
sampling = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='majority'))
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then sample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
print('%.3f (%.3f)' % (mean(scores), std(scores)))

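Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example gives the following results.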

0.730 (0.046)

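Improvement #3: SMOTEENN with StandardScaler and RidgeClassifier

A further improvement over the SMOTEENN result, to an F2-Measure of about 0.741, can be achieved by using a RidgeClassifier instead of LDA and a StandardScaler for the numeric inputs instead of a MinMaxScaler.

The complete example is listed below.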

# improve performance on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import RidgeClassifier
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2_measure)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = RidgeClassifier()
# define the data sampling
sampling = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='majority'))
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',StandardScaler(),num_ix)])
# scale, then sample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
print('%.3f (%.3f)' % (mean(scores), std(scores)))

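Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example gives the following results.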

0.741 (0.034)

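Make Prediction on New Data

Given the variance in results, a selection of any of the undersampling methods is probably sufficient. In this case, we will select logistic regression with Repeated ENN.

This model had an F2-Measure of about 0.714 on our test harness.

We will use this as our final model and use it to make predictions on new data.

First, we can define the model as a pipeline.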

...
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then undersample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', RepeatedEditedNearestNeighbours()), ('m',model)])

Once defined, we can fit it on the entire training dataset.

...
# fit the model
pipeline.fit(X, y)

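Once fit, we can use it to make predictions for new data by calling the predict() function. This will return the class label of 0 for "good customer", or 1 for "bad customer".

Importantly, we must use the ColumnTransformer that was fit on the training dataset in the Pipeline to correctly prepare new data using the same transforms.

For example: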

...
# define a row of data
row = [...]
# make prediction
yhat = pipeline.predict([row])

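To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know whether the case is a good customer or a bad customer.

The complete example is listed below.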

# fit a model and make predictions for the german credit dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RepeatedEditedNearestNeighbours

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	dataframe = read_csv(full_path, header=None)
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then undersample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', RepeatedEditedNearestNeighbours()), ('m',model)])
# fit the model
pipeline.fit(X, y)
# evaluate on some good customers cases (known class 0)
print('Good Customers:')
data = [['A11', 6, 'A34', 'A43', 1169, 'A65', 'A75', 4, 'A93', 'A101', 4, 'A121', 67, 'A143', 'A152', 2, 'A173', 1, 'A192', 'A201'],
	['A14', 12, 'A34', 'A46', 2096, 'A61', 'A74', 2, 'A93', 'A101', 3, 'A121', 49, 'A143', 'A152', 1, 'A172', 2, 'A191', 'A201'],
	['A11', 42, 'A32', 'A42', 7882, 'A61', 'A74', 2, 'A93', 'A103', 4, 'A122', 45, 'A143', 'A153', 1, 'A173', 2, 'A191', 'A201']]
for row in data:
	# make prediction
	yhat = pipeline.predict([row])
	# get the label
	label = yhat[0]
	# summarize
	print('>Predicted=%d (expected 0)' % (label))
# evaluate on some bad customers (known class 1)
print('Bad Customers:')
data = [['A13', 18, 'A32', 'A43', 2100, 'A61', 'A73', 4, 'A93', 'A102', 2, 'A121', 37, 'A142', 'A152', 1, 'A173', 1, 'A191', 'A201'],
	['A11', 24, 'A33', 'A40', 4870, 'A61', 'A73', 3, 'A93', 'A101', 4, 'A124', 53, 'A143', 'A153', 2, 'A173', 2, 'A191', 'A201'],
	['A11', 24, 'A32', 'A43', 1282, 'A62', 'A73', 4, 'A92', 'A101', 2, 'A123', 32, 'A143', 'A152', 1, 'A172', 1, 'A191', 'A201']]
for row in data:
	# make prediction
	yhat = pipeline.predict([row])
	# get the label
	label = yhat[0]
	# summarize
	print('>Predicted=%d (expected 1)' % (label))

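Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label for good-customer cases chosen from the dataset file. We can see that all three cases are correctly predicted.

Then some cases of actual bad customers are used as input to the model and the label is predicted. Two of the three cases are predicted correctly, and one bad customer is predicted as good. This highlights that although we chose a good model, it is not perfect.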

Good Customers:
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
Bad Customers:
>Predicted=0 (expected 1)
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)