KNN: Selecting the Number of Neighbors k
This article contains in-depth algorithm overviews of the K-Nearest Neighbors algorithm (Classification and Regression) as well as the following Model Validation techniques: Traditional Train/Test Split and Repeated K-Fold Cross Validation. The algorithm overviews include detailed descriptions of the methodologies and mathematics that occur internally with accompanying concrete examples. Also included are custom, fully functional/flexible frameworks of the above algorithms built from scratch using primarily NumPy. Finally, there is a fully integrated Case Study which deploys several of the custom frameworks (KNN-Classification, Repeated K-Fold Cross Validation) through a full Machine Learning workflow alongside the Iris Flowers dataset to find the optimal KNN model.
GitHub: https://github.com/Amitg4/KNN_ModelValidation
Please use the imports below to run any included code within your own notebook or coding environment.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from statistics import mean,stdev
from itertools import combinations
import math
%matplotlib inline
KNN Algorithm Overview
K-Nearest Neighbors is a popular pattern recognition algorithm used in Supervised Machine Learning to handle both classification and regression-based tasks. At a high level, this algorithm operates according to an intuitive methodology:
A new point's target value is predicted according to the target values of the K most similar points stored in the model's training data.
Traditionally, ‘similar’ is interpreted as some form of a distance calculation. Therefore, another way to interpret the KNN prediction methodology is that predictions are based off the K closest points within the training data, hence the name K-Nearest Neighbors. With the concept of distance introduced, a good initial question to answer is how distance will be computed. While there are several different mathematical metrics that are viewed as a form of computing distance, this study will highlight the 3 following distance metrics: Euclidean, Manhattan, and Chebyshev.
KNN Algorithm Overview — Distance Metrics
Euclidean distance, based in the Pythagorean Theorem, finds the straight line distance between two points in space. In other words, this is equivalent to finding the shortest distance between two points by drawing a single line between Point A and Point B. Manhattan distance, based in taxicab geometry, is the sum of all N distances between Point A and Point B in N dimensional feature space. For example, in 2D space the Manhattan distance between Point A and Point B would be the sum of the vertical and horizontal distance. Chebyshev distance is the maximum distance between Point A and Point B in N dimensional feature space. For example, in 2D space the Chebyshev distance between Point A and Point B would be max(horizontal distance, vertical distance), in other words whichever distance is greater between the two distances.
Consider Point A = (A_1, A_2, … , A_N) and Point B = (B_1, B_2, … , B_N), both of which exist in N-dimensional feature space. The distance between these two points can be described by the following formulas:
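The formulas were originally shown as an image; written out (and matching the distance calculations inside the code frameworks later in this article), the three metrics are:

Euclidean: d(A, B) = \sqrt{\sum_{i=1}^{N} (A_i - B_i)^2}

Manhattan: d(A, B) = \sum_{i=1}^{N} |A_i - B_i|

Chebyshev: d(A, B) = \max_{i = 1, \dots, N} |A_i - B_i|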
As with most mathematical concepts, distance metrics are often easier to understand with a concrete example to visualize. Consider Point A = (0,0) and Point B = (3,4) in 2D feature space.
sns.set_style('darkgrid')
fig = plt.figure()
axes = fig.add_axes([0,0,1,1])
axes.scatter(x=[0,3],y=[0,4],s = 50)
axes.plot([0,3],[0,4],c='blue')
axes.annotate('X',[1.5,1.8],fontsize = 14,fontweight = 'bold')
axes.plot([0,0],[0,4],c='blue')
axes.annotate('Y',[-0.1,1.8],fontsize = 14,fontweight = 'bold')
axes.plot([0,3],[4,4],c='blue')
axes.annotate('Z',[1.5,4.1],fontsize = 14,fontweight = 'bold')
axes.annotate('(0,0)',[0.1,0.0],fontsize = 14,fontweight = 'bold')
axes.annotate('(3,4)',[2.85,3.7],fontsize = 14,fontweight = 'bold')
axes.grid(lw=1)
plt.show()
Segment X is the straight line distance between Point A and Point B. Segment Y is the vertical distance between Point A and Point B. Segment Z is the horizontal distance between Point A and Point B. Using the distance metrics as detailed above, let us compute the distance between Point A and Point B:
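The computations were originally shown as an image; reproducing the arithmetic for Point A = (0,0) and Point B = (3,4):

Euclidean: \sqrt{(3-0)^2 + (4-0)^2} = \sqrt{25} = 5

Manhattan: |3-0| + |4-0| = 7

Chebyshev: \max(|3-0|, |4-0|) = 4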
Important Note: Because KNN utilizes distance measurements as a key component, feature scaling is especially important to ensure one feature does not mask the impact of other features.
KNN Algorithm Overview — Classification
Classification is a supervised machine learning concept which categorizes data into classes.
Suppose a dataset is split into training data, data which allows the model to learn, and test data, data that allows us to validate model performance. The KNN Algorithm supports classification tasks through the following general algorithm structure:
Step 1: Select a K Value, Distance Metric, and Weighting Schema (covered later!) for the model
Step 2: Train the model with training data
Step 3: Compute distance using the selected distance metric between the ith test/new observation and every training observation, while storing distances and training labels in a data structure
Step 4: Sort the data structure in ascending order by distance
Step 5: Use the K closest points and the selected weighting schema to vote on the class of the test/new observation, where the class with the majority vote is selected as the prediction
Step 6: Repeat Steps 3–5 until a prediction has been made for every test observation
Step 7: Return predictions
Step 8: If predictions are for test observations, then compare test labels to predictions in order to assess model accuracy
Classification Weighting Schema
Mentioned above in the classification algorithm steps was selecting and utilizing a ‘Weighting Schema’. This subsection will briefly walk through what this means alongside a concrete example.
A weighting schema dictates how much influence each of the K closest points will have towards making a final prediction. While there are many weighting schemas out there used alongside KNN, this study will focus on two:
Schema 1 — Traditional or 'standard': This voting schema assumes all points have equal influence over the final prediction. In a classification setting this means each of the K closest points will cast 1 vote.
Schema 2 — Custom Weighted or 'weighted': This voting schema assumes that the closer a point is, the more influence it has over the final prediction. In a classification setting this means the closest point will cast K votes, the 2nd closest will cast K - 1 votes, and so on and so forth until the Kth closest point casts 1 vote. Formulaically, this is shown by number of votes = K + 1 - position, where position indicates how close the point is to the new point (position 1 being the closest).
See below for a concrete example:
sns.set_style('darkgrid')
fig = plt.figure()
axes = fig.add_axes([0,0,1,1])
axes.scatter(x=[1.3,1.7],y=[0.5,1],s = 50)
axes.scatter(x=[1.5,1,1.2],y=[0.8,1.2,1.5],s = 50)
axes.scatter(x=[1.5],y=[0],s = 50)
axes.annotate('1',[1.3,0.6],fontsize = 14)
axes.annotate('2',[1.5,0.85],fontsize = 14)
axes.annotate('3',[1.7,1.05],fontsize = 14)
axes.grid(lw=1)
plt.show()
The example above shows 5 total points in the training data, where every point is either class blue or class orange. The green data point is the new point which we are trying to classify based off the training data. Assuming a Euclidean distance metric and a 3-Nearest Neighbors framework, Points 1,2 and 3 are the closest points to the new observation in that order.
The ‘standard’ weighting schema would vote in the following format:
- Point 1 would have 1 vote towards class blue
- Point 2 would have 1 vote towards class orange
- Point 3 would have 1 vote towards class blue
With two votes for class blue and one vote for class orange, the new observation would be classified as class blue under the ‘standard’ schema.
The ‘weighted’ weighting schema would vote in the following format:
- Point 1 would have 3 votes towards class blue
- Point 2 would have 2 votes towards class orange
- Point 3 would have 1 vote towards class blue
With four votes for class blue and two votes for class orange, the new observation would be classified as class blue under the ‘weighted’ schema.
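As a minimal sketch of the two tallies for this example (the neighbor order and classes are taken from the plot above, and the K + 1 - position rule from the schema description):

#MINIMAL SKETCH OF THE TWO VOTING SCHEMAS FOR THE 3-NN EXAMPLE ABOVE
#NEIGHBORS ARE LISTED CLOSEST-FIRST AS (POSITION, CLASS)
neighbors = [(1, 'blue'), (2, 'orange'), (3, 'blue')]
k = 3
standard_votes, weighted_votes = {}, {}
for position, label in neighbors:
    standard_votes[label] = standard_votes.get(label, 0) + 1                   #EACH POINT CASTS 1 VOTE
    weighted_votes[label] = weighted_votes.get(label, 0) + (k + 1 - position)  #CLOSER POINTS CAST MORE VOTES
print(standard_votes)  #{'blue': 2, 'orange': 1} -> PREDICT BLUE
print(weighted_votes)  #{'blue': 4, 'orange': 2} -> PREDICT BLUE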
The Algorithm
Below is our construction of the KNN Algorithm for Classification built entirely from scratch using NumPy. The framework is fully flexible for any sized data set (and any number of classes!), and supports the following methods:
First the model must be instantiated. Generalized syntax will be model = KNNClass(k, weighting = 'standard', distance_metric = 'euclidean'). The user must pass in a positive integer for k, and may optionally pass in values for weighting ('standard', 'weighted') and distance_metric ('euclidean', 'manhattan', 'chebyshev').
After the model has been instantiated, it must be fit with training data. Generalized syntax will be model.fit(xtrain,ytrain)
After the model has been instantiated and fit with training data, the model can begin predicting values. Generalized syntax will be predictions = model.predict(xtest) where predictions for the test data will be returned in a NumPy array
After the model has been instantiated and fit with training data, the score method can be used to return an accuracy score. Generalized syntax will be model.score(X,Y). The method will then make predictions for X, compare them to Y, and return the proportion of correct predictions.
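As a quick illustrative sketch of how the four methods above chain together (xtrain, ytrain, xtest, and ytest are assumed to already exist):

#ILLUSTRATIVE SKETCH ONLY; XTRAIN/YTRAIN/XTEST/YTEST ARE ASSUMED TO ALREADY EXIST
model = KNNClass(k = 3, weighting = 'weighted', distance_metric = 'manhattan')
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)    #NUMPY ARRAY OF PREDICTED CLASS LABELS
accuracy = model.score(xtest, ytest)  #PROPORTION OF CORRECT PREDICTIONS (0-1)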
For more information on the inner workings of the custom framework below, please refer to the comments and general code within the code block.
#KNN CLASSIFICATION FRAMEWORK
class KNNClass:
def __init__(self,k,weighting = 'standard',distance_metric = 'euclidean'):
#WHEN INSTANTIATING A MODEL, SPECIFY A MANDATORY K - VALUE SIGNIFYING THE NUMBER OF NEAREST NEIGHBORS THE MODEL
#WILL UTILIZE TO MAKE EACH PREDICTIONS. OPTIONALLY, THE USER MAY ALSO SPECIFY A WEIGHTING SCHEMA (STANDARD/WEIGHTING)
#AND DISTANCE_METRIC (EUCLIDEAN/MANHATTAN/CHEBYSHEV)
self.k = k
self.weighting = weighting
self.distance_metric = distance_metric
def fit(self,Xtrain,Ytrain):
#BEFORE MAKING PREDICTIONS, THE MODEL MUST BE FIT WITH THE TRAINING DATA. THIS MEANS THE FEATURE DATA AS XTRAIN AND
#LABELS AS YTRAIN.
self.xtrainmatrix = np.matrix(Xtrain)
self.ytrainmatrix = np.matrix(Ytrain)
if self.xtrainmatrix.shape[0] == 1:
self.xtrainmatrix = self.xtrainmatrix.transpose()
if self.ytrainmatrix.shape[0] == 1:
self.ytrainmatrix = self.ytrainmatrix.transpose()
#IN ADDITION TO STORING THE TRAINING DATA, THE FIT METHOD WILL ALSO STORE A LIST OF THE LABELS AND THE TOTAL NUMBER OF
#FEATURES
self.labellist = list(set(np.array(self.ytrainmatrix).squeeze()))
self.numfeatures = self.xtrainmatrix.shape[1]
def predict(self,Xtest):
#AFTER THE MODEL HAS BEEN INSTANTIATED AND FIT WITH TRAINING DATA, THE PREDICT METHOD CAN BE USED TO RETURN A NUMPY
#ARRAY OF PREDICTED VALUES. TO PROPERLY EXECUTE THIS METHOD, PASS IN THE TEST FEATURE DATA (UNLABELED) FOR XTEST.
self.xtestmatrix = np.matrix(Xtest)
if self.xtestmatrix.shape[0] == 1:
self.xtestmatrix = self.xtestmatrix.transpose()
#BEGIN COMPARING TEST OBS TO TRAINING OBS
preds = []
for testobs in self.xtestmatrix:
distance = []
for trainobs in self.xtrainmatrix:
#CALCULATE DISTANCE BETWEEN TEST OBS AND TRAINING OBS
#DISTANCE_METRIC = 'EUCLIDEAN' CALCULATES EUCLIDEAN DISTANCE: SQRT(SUM((X-Y)**2))
#DISTANCE_METRIC = 'MANHATTAN' CALCULATES MANHATTAN DISTANCE: SUM(ABS(X-Y))
#DISTANCE_METRIC = 'CHEBYSHEV' CALCULATES CHEBYSHEV DISTANCE: MAX(ABS(X-Y))
if self.distance_metric == 'euclidean':
sums = 0
for num in range(0,self.numfeatures):
sums = sums + (testobs[0,num]-trainobs[0,num])**2
distance.append(np.sqrt(sums))
elif self.distance_metric == 'manhattan':
sums = 0
for num in range(0,self.numfeatures):
sums = sums + abs(testobs[0,num]-trainobs[0,num])
distance.append(sums)
elif self.distance_metric == 'chebyshev':
sums = []
for num in range(0,self.numfeatures):
sums.append(abs(testobs[0,num]-trainobs[0,num]))
distance.append(max(sums))
#CREATE A MATRIX WITH YTRAIN AND DISTANCE COLUMN, REPRESENTING THE DISTANCE BETWEEN TEST/TRAINING OBS AND THE TRAINING OUTPUT
#SORT MATRIX BY DISTANCE
distancecol = np.matrix(distance).transpose()
ytrainmatrix2 = np.hstack((self.ytrainmatrix,distancecol))
ytrainmatrix2 = np.matrix(sorted(np.array(ytrainmatrix2),key=lambda x:x[1]))
#CREATE VOTING FRAMEWORK, TRACK NUMBER OF VOTES PER LABEL ACROSS N NEAREST TRAINING POINTS FOR TEST OBS
#IF WEIGHTING = 'STANDARD', A STANDARD VOTING SCHEMA IS FOLLOWED WHERE EACH OF THE K NEAREST POINTS ARE GIVEN THE SAME WEIGHT
#IF WEIGHTING = 'WEIGHTED', A WEIGHTED VOTING SCHEMA IS FOLLOWED WHERE CLOSER POINTS WILL HAVE A HEAVIER WEIGHT BY MEANS OF INCREASED NUMBER OF VOTES
if self.weighting == 'standard':
labelcount = []
for labels in self.labellist:
count = 0
for num in range(0,self.k):
if ytrainmatrix2[num,0] == labels:
count = count + 1
labelcount.append(count)
elif self.weighting == 'weighted':
labelcount = []
for labels in self.labellist:
count = 0
for num in range(0,self.k):
if ytrainmatrix2[num,0] == labels:
count = count + self.k - num
labelcount.append(count)
#CREATE AN ADDED VOTING LAYER FOR SCENARIOS IN WHICH THERE IS A TIE FOR MOST VOTES
predlabel = []
for num in range(0,len(labelcount)):
if labelcount[num] == max(labelcount):
predlabel.append(self.labellist[num])
if len(predlabel) == 1:
preds.append(predlabel[0])
else:
preds.append(predlabel[np.random.randint(0,len(predlabel))])
#RETURN A NUMPY ARRAY OF PREDICTED VALUES
return np.array(preds)
def score(self,X,Y):
#AFTER THE MODEL HAS BEEN INSTANTIATED AND FIT, THE SCORE METHOD MAY BY USED TO RETURN AN ACCURACY SCORE FOR THE MODEL
#GIVEN TEST FEATURES AND TEST LABELS.
#THIS METHOD WORKS BY PASSING IN THE TEST FEATURES (X), UPON WHICH THE PREDICT
#METHOD ABOVE WILL CALLED TO RETURN A NUMPY ARRAY OF PREDICTED VALUES.
preds = self.predict(X)
Y = np.array(Y)
totalpreds = len(preds)
#THESE VALUES WILL BE COMPARED TO THE TEST LABELS(Y) AND AN ACCURACY SCORE WILL BE
#RETURNED AS A VALUE BETWEEN 0-1 COMPUTED BY (CORRECTED PREDS)/(TOTAL PREDS).
comparelist = []
count = 0
for num in range(0,totalpreds):
if preds[num] == Y[num]:
comparelist.append(1)
else:
comparelist.append(0)
return sum(comparelist)/len(comparelist)
KNN Algorithm Overview — Regression
Regression is a supervised machine learning task which predicts continuous values for data.
Suppose a dataset is split into training data, data which allows the model to learn, and test data, data that allows us to validate model performance. The KNN Algorithm supports regression tasks through the following general algorithm structure:
Step 1: Select a K Value, Distance Metric, and Weighting Schema for the model
Step 2: Train the model with training data
Step 3: Compute distance using the selected distance metric between the ith test/new observation and every training observation, while storing distances and training labels in a data structure
Step 4: Sort the data structure in ascending order by distance
Step 5: Use the K closest points and the selected weighting schema to predict the value of the test/new observation
Step 6: Repeat Steps 3–5 until a prediction has been made for every test observation
Step 7: Return predictions
Step 8: If predictions are for test observations, then compare test labels to predictions in order to assess model accuracy
Regression Weighting Schema
Similar to the classification example, a weighting schema can also be used in KNN Regression to dictate how much influence each of the K closest points will have towards making a final prediction. While there are many weighting schemas out there used alongside KNN, this study will focus on two:
Schema 1 — Traditional or 'standard': This voting schema assumes all points have equal influence over the final prediction. In a regression setting this means each of the K closest points will be weighted equally or, in other words, the predicted value is the average of the K closest points.
Schema 2 — Custom Weighted or 'weighted': This voting schema assumes that the closer a point is, the more influence it has over the final prediction. In a regression setting this means that each of the K closest points will be assigned a weight according to its position, where position indicates how close it is to the new point. The final prediction is then the weighted average of the K closest points.
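The weight formula was originally shown as an image; based on the weighting applied in the regression framework below, the weight for the point in position p (p = 1 being the closest of the K neighbors) works out to:

w_p = \frac{K + 1 - p}{\sum_{j=1}^{K} j} = \frac{2(K + 1 - p)}{K(K + 1)}, \qquad \hat{y} = \sum_{p=1}^{K} w_p \, y_{(p)}

where y_{(p)} is the target value of the p-th closest training point.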
See below for a concrete example:
sns.set_style('darkgrid')
fig = plt.figure()
axes = fig.add_axes([0,0,1,1])
axes.scatter(x=[1.3,1.7],y=[0.5,1],s = 50,color = 'red')
axes.scatter(x=[1.5,1,1.2],y=[0.8,1.2,1.5],s = 50,color = 'red')
axes.scatter(x=[1.5],y=[0],s = 50)
axes.annotate('1',[1.3,0.6],fontsize = 14)
axes.annotate('2',[1.5,0.85],fontsize = 14)
axes.annotate('3',[1.7,1.05],fontsize = 14)
axes.grid(lw=1)
plt.show()
The example above shows 5 total points in the training data, marked by the color red. The blue data point is the new point which we are trying to predict based off the training data. Assuming a Euclidean distance metric and a 3-Nearest Neighbors framework, Points 1, 2 and 3 are the closest points to the new observation in that order. Suppose Points 1, 2, and 3 have values 50, 55, and 60 units respectively.
The ‘standard’ weighting schema would calculate in the following format:
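The calculation was originally shown as an image; with the values above, the 'standard' schema simply averages the 3 nearest values:

\hat{y} = \frac{50 + 55 + 60}{3} = 55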
The ‘weighted’ weighting schema would calculate in the following format:
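Again reconstructing the arithmetic, the 'weighted' schema with K = 3 assigns weights of 3/6, 2/6, and 1/6 to Points 1, 2, and 3 respectively:

\hat{y} = 50 \cdot \frac{3}{6} + 55 \cdot \frac{2}{6} + 60 \cdot \frac{1}{6} \approx 53.33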
The Algorithm
Below is our construction of the KNN Algorithm for Regression built entirely from scratch using NumPy. Similar to the classification framework, the regression framework is fully flexible for any sized data set, and supports the following methods:
First the model must be instantiated. Generalized syntax will be model = KNNReg(k, weighting = 'standard', distance_metric = 'euclidean'). The user must pass in a positive integer for k, and may optionally pass in values for weighting ('standard', 'weighted') and distance_metric ('euclidean', 'manhattan', 'chebyshev').
After the model has been instantiated, it must be fit with training data. Generalized syntax will be model.fit(xtrain,ytrain)
After the model has been instantiated and fit with training data, the model can begin predicting values. Generalized syntax will be predictions = model.predict(xtest) where predictions for the test data will be returned in a NumPy array
After the model has been instantiated and fit with training data, the score method can be used to return an accuracy score. Generalized syntax will be model.score(X, Y, error_metric = 'rmse'). The method will then make predictions for X, compare them to Y, and return an error score according to the optional error_metric ('rmse', 'mse', 'mae') passed in. The error_metric values indicate scoring based off Root Mean Square Error, Mean Square Error, and Mean Absolute Error respectively.
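For reference, these are the standard definitions (they match the formulas in the code comments below), where n is the number of predictions, \hat{y}_i a predicted value, and y_i the corresponding actual value:

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}, \qquad MSE = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \qquad MAE = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|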
For more information on the inner workings of the custom framework below, please refer to the comments and general code within the code block.
#KNN REGRESSION FRAMEWORK
class KNNReg:
def __init__(self,k,weighting = 'standard',distance_metric = 'euclidean'):
#WHEN INSTANTIATING A MODEL, SPECIFY A MANDATORY K - VALUE SIGNIFYING THE NUMBER OF NEAREST NEIGHBORS THE MODEL
#WILL UTILIZE TO MAKE EACH PREDICTIONS. OPTIONALLY, THE USER MAY ALSO SPECIFY A WEIGHTING SCHEMA (STANDARD/WEIGHTING)
#AND DISTANCE_METRIC (EUCLIDEAN/MANHATTAN/CHEBYSHEV)
self.k = k
self.weighting = weighting
self.distance_metric = distance_metric
def fit(self,Xtrain,Ytrain):
#BEFORE MAKING PREDICTIONS, THE MODEL MUST BE FIT WITH THE TRAINING DATA. THIS MEANS THE FEATURE DATA AS XTRAIN AND
#LABELS AS YTRAIN.
self.xtrainmatrix = np.matrix(Xtrain)
self.ytrainmatrix = np.matrix(Ytrain)
if self.xtrainmatrix.shape[0] == 1:
self.xtrainmatrix = self.xtrainmatrix.transpose()
if self.ytrainmatrix.shape[0] == 1:
self.ytrainmatrix = self.ytrainmatrix.transpose()
#IN ADDITION TO STORING THE TRAINING DATA, THE FIT METHOD WILL ALSO STORE THE TOTAL NUMBER OF FEATURES
self.numfeatures = self.xtrainmatrix.shape[1]
def predict(self,Xtest):
#AFTER THE MODEL HAS BEEN INSTANTIATED AND FIT WITH TRAINING DATA, THE PREDICT METHOD CAN BE USED TO RETURN A NUMPY
#ARRAY OF PREDICTED VALUES. TO PROPERLY EXECUTE THIS METHOD, PASS IN THE TEST FEATURE DATA (UNLABELED) FOR XTEST.
self.xtestmatrix = np.matrix(Xtest)
if self.xtestmatrix.shape[0] == 1:
self.xtestmatrix = self.xtestmatrix.transpose()
#BEGIN COMPARING TEST OBS TO TRAINING OBS
preds = []
for testobs in self.xtestmatrix:
distance = []
for trainobs in self.xtrainmatrix:
#CALCULATE DISTANCE BETWEEN TEST OBS AND TRAINING OBS
#DISTANCE_METRIC = 'EUCLIDEAN' CALCULATES EUCLIDEAN DISTANCE: SQRT(SUM((X-Y)**2))
#DISTANCE_METRIC = 'MANHATTAN' CALCULATES MANHATTAN DISTANCE: SUM(ABS(X-Y))
#DISTANCE_METRIC = 'CHEBYSHEV' CALCULATES CHEBYSHEV DISTANCE: MAX(ABS(X-Y))
if self.distance_metric == 'euclidean':
sums = 0
for num in range(0,self.numfeatures):
sums = sums + (testobs[0,num]-trainobs[0,num])**2
distance.append(np.sqrt(sums))
elif self.distance_metric == 'manhattan':
sums = 0
for num in range(0,self.numfeatures):
sums = sums + abs(testobs[0,num]-trainobs[0,num])
distance.append(sums)
elif self.distance_metric == 'chebyshev':
sums = []
for num in range(0,self.numfeatures):
sums.append(abs(testobs[0,num]-trainobs[0,num]))
distance.append(max(sums))
#CREATE A MATRIX WITH YTRAIN AND DISTANCE COLUMN, REPRESENTING THE DISTANCE BETWEEN TEST/TRAINING OBS AND THE TRAINING OUTPUT
#SORT MATRIX BY DISTANCE
distancecol = np.matrix(distance).transpose()
ytrainmatrix2 = np.hstack((self.ytrainmatrix,distancecol))
ytrainmatrix2 = np.matrix(sorted(np.array(ytrainmatrix2),key=lambda x:x[1]))
#CREATE VOTING SCHEMA AMONG N CLOSEST POINTS.
#IF INDEX = 0 THEN PREDICTED VALUE WILL BE THE AVERAGE OF THE N CLOSEST POINTS
#IF INDEX = 1 THEN PREDICTED VALUE WILL BE A WEIGHTED AVERAGE OF N CLOSEST POINTS, WHERE CLOSER POINTS ARE WEIGHTED MORE STRONGLY TOWARDS THE PREDICTED VALUE
if self.weighting == 'standard':
sums = 0
for num in range(0,self.k):
sums = sums + ytrainmatrix2[num,0]
preds.append(sums/self.k)
elif self.weighting == 'weighted':
sums = 0
arr = np.arange(1,self.k + 1)
arrsum = np.sum(arr)
for num in range(0,self.k):
sums = sums + ytrainmatrix2[num,0]*((self.k - num)/arrsum)
preds.append(sums)
#RETURN A NUMPY ARRAY OF PREDICTED VALUES
return np.array(preds)
def score(self,X,Y,error_metric = 'rmse'):
#AFTER THE MODEL HAS BEEN INSTANTIATED AND FIT, THE SCORE METHOD MAY BY USED TO RETURN AN ACCURACY SCORE FOR THE MODEL
#GIVEN TEST FEATURES AND TEST LABELS.
#THIS METHOD WORKS BY PASSING IN THE TEST FEATURES (X), UPON WHICH THE PREDICT
#METHOD ABOVE WILL CALLED TO RETURN A NUMPY ARRAY OF PREDICTED VALUES.
preds = self.predict(X)
Y = np.array(Y)
totalpreds = len(preds)
#THESE VALUES WILL BE COMPARED TO THE TEST LABELS(Y) AND AN ACCURACY SCORE WILL BE
#RETURNED. THIS VALUE WILL FOLLOW EITHER A ROOT MEAN SQUARE ERROR(RMSE), MEAN SQUARE ERROR(MSE), OR
#MEAN ABSOLUTE ERROR(MAE) METHODOLOGY DEPENDING ON THE VALUE PASSED IN FOR ERROR_METRIC.
#RMSE = SQRT(SUM((PREDICTED - ACTUAL)**2)/(TOTAL NUMBER OF PREDICTIONS))
#MSE = SUM((PREDICTED - ACTUAL)**2)/(TOTAL NUMBER OF PREDICTIONS).
#MAE = SUM(ABS(PREDICTED-ACTUAL))/(TOTAL NUMBER OF PREDICTIONS)
if error_metric == 'rmse':
error = 0
for num in range(0,totalpreds):
error = error + (preds[num] - Y[num])**2
return np.sqrt(error/totalpreds)
elif error_metric == 'mse':
error = 0
for num in range(0,totalpreds):
error = error + (preds[num] - Y[num])**2
return error/totalpreds
elif error_metric == 'mae':
error = 0
for num in range(0,totalpreds):
error = error + abs(preds[num] - Y[num])
return error/totalpreds
Popular Model Validation Techniques
This section will explore 2 popular model validation techniques: Traditional Train/Test Split, and Repeated K-Fold Cross Validation.
Traditional Train/Test Split
The model validation process essentially evaluates the model’s ability to accurately predict new data. This is generally done by first splitting labeled data into a training set and testing set. There is no accepted rule of thumb for what ratio of the data to include in training vs test sets. Some recommendations say 70/30, some say 80/20, but in reality the user can fully customize their ratio. The model is then intuitively trained on the training set before being tested on the test set. Testing in this circumstance refers to the model predicting values for the test set, comparing them to the real values for the test set, and using an error metric to evaluate the model’s predictions.
Traditional Train/Test Split follows this exact methodology through for one iteration. Below is our construction of the Traditional Train/Test Split built entirely from scratch using primarily NumPy (there is some pandas for data export purposes as needed). The function is flexible for any sized data set and generally the syntax will be as follows:
xtrain, xtest, ytrain, ytest = traintestsplit(X, Y, test_size)
The user will pass in all their feature data (X), all their labels (Y), and a test_size decimal/fraction between 0 and 1. This value will dictate the size of the test set. For example, a test_size = 0.3 indicates a 70/30 split (70% of the data is training data, while 30% of the data is test data). The function will then return four data structures:
- xtrain, feature data for model training, contains a 1 - test_size proportion of the data
- xtest, feature data for model testing, contains a test_size proportion of the data
- ytrain, labels for xtrain
- ytest, labels for xtest
These structures can be passed directly into the KNN Frameworks mentioned above. For more information on the inner workings of the custom function below, please refer to the comments and general code within the code block.
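As an illustrative sketch of how the function pairs with the classification framework above (the feature data X and labels Y are assumed to already be loaded):

#ILLUSTRATIVE SKETCH ONLY; X (FEATURES) AND Y (LABELS) ARE ASSUMED TO ALREADY EXIST
xtrain, xtest, ytrain, ytest = traintestsplit(X, Y, test_size = 0.3)  #70/30 SPLIT
model = KNNClass(k = 5, weighting = 'standard', distance_metric = 'euclidean')
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))  #ACCURACY ON THE HELD-OUT 30%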
#TRAIN TEST SPLIT FUNCTION
def traintestsplit(X,Y,test_size):
"""Pass into features (X), labels (Y), and test_size (decimal between 0 and 1). The function will return 4 objects: xtrain,xtest,ytrain,ytest. xtest (features) and ytest (labels) should be used for model testing purposes and contains test_size proportion of the total amount of data passed in. xtrain and ytrain contain the remaining data and are used to initially train the model"""
#CONVERT TO X AND Y TO MATRICES
Xnew = np.matrix(X)
Ynew = np.matrix(Y)
if Xnew.shape[0] == 1:
Xnew = Xnew.transpose()
if Ynew.shape[0] == 1:
Ynew = Ynew.transpose()
#IDENTIFY UNIQUE INDEXES FOR TRAINING DATA AND TEST DATA, AS WELL AS THE SIZE OF THE TEST SIZE
idlist = [] #TEST SET INDICES
idlist2 = [] #TRAINING SET INDICES
#IDENTIFY TEST INDICES THROUGH THE FOLLOWING PROCESS:
#STEP 1: CALCULATE SIZE OF TEST SET
testsize = np.round(test_size*len(X))
#STEP 2: CREATE A LIST OF SIZE TESTSIZE CONTAINING UNIQUE INTEGERS. INTEGERS CANNOT BE LESS THAN 0 OR GREATER THAN OR EQUAL TO LEN(X)
for num in range(0,int(testsize)):
#STEP 2A: FOR THE FIRST ITERATION, APPEND A RANDOM INTEGER BETWEEN 0 AND LEN(X) TO IDLIST
if num == 0:
idlist.append(np.random.randint(0,len(X)))
else:
#STEP 2B: FOR ALL OTHER ITERATIONS, GENERATE A RANDOM INTEGER BETWEEN 0 AND LEN(X). THEN COMPARE IT TO ALL VALUES
#CURRENTLY IN IDLIST. IF IT IS EQUAL TO ANY OF THE VALUES THEN DO NOTHING. IF IT IS NOT EQUAL TO ALL OF THE VALUES
#THEN APPEND IT TO THE LIST. THIS PROCESS WILL REPEAT UNTIL A VALUE HAS BEEN APPENDED, MEANING A UNIQUE VALUE WAS
#GENERATED AND ADDED TO THE LIST.
count = 0
while count == 0:
newloc = np.random.randint(0,len(X))
for locations in idlist:
if newloc == locations:
count = 1
break
if count == 1:
count = 0
else:
count = 1
idlist.append(newloc)
#IDENTIFY TRAINING INDICES THROUGH THE FOLLOWING PROCESS:
#STEP 1: LOOP THROUGH ALL NUMBERS BETWEEN 0 AND LEN(X).
for num in range(0,len(X)):
count = 0
#STEP 2: COMPARE NUMBER TO INDEXES HELD IN IDLIST. IF THE NUMBER DOES NOT EXIST IN IDLIST, THEN APPEND TO IDLIST 2
#MEANING IT HAS BEEN ADDED AS A TRAINING INDEX
for locations in idlist:
if num == locations:
count = 1
break
if count == 0:
idlist2.append(num)
#SORT INDEXES, IDLIST2 IS TRAINING DATA INDEXES WHILE IDLIST IS TESTSET INDEXES
idlist.sort()
idlist2.sort()
#STORE TRAINING AND TEST DATA IN LISTS BASED OFF OF IDENTIFIED INDEXES
xtrain = []
xtest = []
ytrain = []
ytest = []
for items in idlist2:
xtrain.append(np.array(Xnew[items])[0])
ytrain.append(Ynew[items,0])
for items in idlist:
xtest.append(np.array(Xnew[items])[0])
ytest.append(Ynew[items,0])
#RECONVERT TRAINING AND TEST DATA BACK INTO THE FORMAT/DATA TYPE THAT THE DATA WAS ORIGINALLY PASSED IN AS
#ONLY EXCEPTION IS NUMPY NDARRAYS WILL BE RETURNED AS NUMPY MATRIX AND 1 X N DIM MATRICES WILL BE RETURNED AS N X 1 DIM
#CONVERSION FOR PANDAS DF - FEATURES
if type(X) == pd.core.frame.DataFrame:
xtrain = pd.DataFrame(xtrain)
xtest = pd.DataFrame(xtest)
xtrain.index = idlist2
xtest.index = idlist
#CONVERSION FOR PANDAS SERIES - FEATURES
if type(X) == pd.core.series.Series:
xtrain = pd.Series(xtrain)
xtest = pd.Series(xtest)
xtrain.index = idlist2
xtest.index = idlist
#CONVERSION FOR NUMPY MATRIX/ARRAY - FEATURES
if type(X) == np.matrix or type(X) == np.ndarray:
xtrain = np.matrix(xtrain)
xtest = np.matrix(xtest)
if xtrain.shape[0] == 1:
xtrain = xtrain.transpose()
if xtest.shape[0] == 1:
xtest = xtest.transpose()
#CONVERSION FOR PANDAS DF - LABELS
if type(Y) == pd.core.frame.DataFrame:
ytrain = pd.DataFrame(ytrain)
ytest = pd.DataFrame(ytest)
ytrain.index = idlist2
ytest.index = idlist
#CONVERSION FOR PANDAS SERIES - LABELS
if type(Y) == pd.core.series.Series:
ytrain = pd.Series(ytrain)
ytest = pd.Series(ytest)
ytrain.index = idlist2
ytest.index = idlist
#CONVERSION FOR NUMPY MATRIX/ARRAY - LABELS
if type(Y) == np.matrix or type(Y) == np.ndarray:
ytrain = np.matrix(ytrain)
ytest = np.matrix(ytest)
if ytrain.shape[0] == 1:
ytrain = ytrain.transpose()
if ytest.shape[0] == 1:
ytest = ytest.transpose()
#RETURN TRAINING AND TEST DATA SPLITS
return xtrain,xtest,ytrain,ytest
Repeated K-Fold Cross Validation
Repeated K-Fold Cross Validation addresses three significant problems that can occur within a Traditional Train/Test Split methodology. The first problem is sampling bias, as there is no guarantee that the user made a good, comprehensive split of the data. The second problem is that the methodology does not use all of the data for training and testing, so the model's predictive capability often still cannot be appropriately generalized. Lastly, in scenarios where the initial data set is small, Repeated K-Fold Cross Validation is especially useful to allow for maximum usage of the data.
Repeated K-Fold Cross Validation attempts to combat the issues mentioned above through the following generalized steps:
Step 1: User selects the number of repeats (number of times they want K-Fold Cross Validation to occur) and number of folds (number of subsets the data is split into). Suppose number of repeats = R and number of folds = K, where R and K are positive integers.
Step 2: The data set is split into K folds
Step 3: The ith fold is used as the test set (xtest, ytest) while the remaining folds are used as the training set (xtrain, ytrain).
Step 4: Train the model on the training data, while using the test data to evaluate the iteration's prediction accuracy. Store the prediction accuracy value.
Step 5: Repeat Steps 3–4 until each fold has been used as the test set (K - 1 repeats). This helps to reduce sampling bias (by ensuring model validation is not subjected to just one train/test split) while also ensuring all the data is used for training and testing at some point in the process.
Step 6: Repeat Steps 2–5 R - 1 times. This further helps to minimize sampling bias by splitting the data into K different folds for each iteration.
Step 7: Take the average of all R*K prediction accuracy values to get a more generalized score for model performance and accuracy
Below is our construction of the Repeated K-Fold Cross Validation Framework built entirely from scratch using NumPy. The framework is flexible for any sized data set and supports the following methods:
- First, the framework must be instantiated. The general syntax will be rfk = Repkfoldcv(n_repeats, n_splits, shuffle = False). n_repeats and n_splits refer to the number of repeats (1 = standard K-Fold Cross Validation) and the number of folds respectively, while shuffle (True/False) optionally specifies whether the user needs the data shuffled.
- After the framework has been instantiated, the user simply passes in the feature data (X) through the split method. General syntax will be rfk.split(X), which will return a nested list of n_repeats*n_splits lists, each of which contains a set of train and test indices.
This framework can be integrated directly with the KNN frameworks detailed above. A brief Python example of the syntax required to utilize this framework with the KNN frameworks is as follows:
scores = []
model = KNNClass(k = 5, weighting = 'standard', distance_metric = 'euclidean')
rfk = Repkfoldcv(n_repeats = 5, n_splits = 5, shuffle = True)
for train,test in rfk.split(X):
xtrain,xtest,ytrain,ytest = X[train], X[test], Y[train], Y[test] #if X or Y is a pandas structure, use X.iloc[train/test] or Y.iloc[train/test]
model.fit(xtrain,ytrain)
scores.append(model.score(xtest,ytest))
print(mean(scores))
For more information on the inner workings of the custom function below, please refer to the comments and general code within the code block.
#REPEATED K FOLD CV FRAMEWORK
class Repkfoldcv:
def __init__(self,n_repeats,n_splits,shuffle = False):
#WHEN INSTANTIATING THE OBJECT, SPECIFY N_REPEATS (1 SIGNIFIES STANDARD K FOLD CV), N_SPLITS (SPECIFIES THE NUMBER
#FOLDS THE DATA WILL BE BROKEN INTO) PER ITERATION, AND OPTIONALLY WHETHER THE USER WANTS THE DATA SHUFFLED
self.n_repeats = n_repeats
self.n_splits = n_splits
self.shuffle = shuffle
def split(self,X):
#CREATE LIST TO STORE INDEXES FOR HOLD OUT FOLDS AND NOT HOLD OUT FOLDS
indexes = []
#IDENTIFY BASELINE FOLD SIZE
foldsize = math.floor(len(X)/self.n_splits)
#OUTER LOOP FOR NUMBER OF REPEATS
for repeats in range(0,self.n_repeats):
#CREATE LIST OF SIZE LEN(X) CONTAINING NATURAL NUMBERS BETWEEN 1-N_SPLITS. EACH NATURAL NUMBER WILL OCCUR AT MINIMUM
#FOLDSIZE TIMES IN THE LIST
assignfold = []
for num in range(0,self.n_splits):
for num1 in range(0,foldsize):
assignfold.append(num+1)
while len(assignfold) < len(X):
assignfold.append(np.random.randint(1,self.n_splits+1))
#SHUFFLE LIST IF USER DECLARES SHUFFLE = TRUE
if self.shuffle == True:
np.random.shuffle(assignfold)
#ITERATE THROUGH THE LIST N_SPLITS TIMES. FOR EACH ITERATION, APPEND TO TWO LISTS WHERE LIST 1 STORES THE INDEX POSITION
#WHENEVER THE NATURAL NUMBER IS EQUAL TO THE NUM AND LIST 2 STORES WHENEVER THE NATURAL NUMBER IS NOT EQUAL TO THE NUM.
#THIS CREATES N_SPLITS SETS OF A HOLD OUT (TEST) FOLD AND NOT HOLD OUT (TRAIN) FOLD.
for num in range(0,self.n_splits):
infold = []
notinfold = []
count = 0
for assignments in assignfold:
if assignments == num + 1:
infold.append(count)
else:
notinfold.append(count)
count = count + 1
#APPEND TRAIN FOLD AND TEST FOLD INDEXES TO THE INDEXES LIST
indexes.append([notinfold,infold])
#RETURN INDEXES AS A NESTED LIST
return indexes
Case Study — Finding an Optimal KNN Model for the Iris Flowers Data Set
This section will conduct an in-depth case study utilizing the methodologies and custom frameworks mentioned above with the Iris Flowers Dataset. Specifically, this case study will include some initial data exploration, feature engineering/scaling, the KNN Classification framework, Repeated K-Fold Cross Validation, and a custom grid search framework. The goal of the study is to find the best mix of features, a K value, a distance metric, and a weighting schema which when used in combination maximizes prediction accuracy, creating an optimal KNN model.
This case study will use NumPy, Pandas, Matplotlib, and Seaborn.
Iris Flowers Data Set
The Iris Flowers Data Set contains 150 rows and 5 columns. Each row contains data pertaining to a single flower. 4 of the columns include the following numerical feature data: sepal_length_cm, sepal_width_cm, petal_length_cm, and petal_width_cm. The final column contains the target/label: setosa/versicolor/virginica. During model validation and general model prediction, this is the value we will be attempting to classify.
#LOAD IN IRIS DATASET FOR KNN CLASSIFICATION
data = load_iris()
flowers = []
for items in data['target']:
if items == 0:
flowers.append('setosa')
elif items == 1:
flowers.append('versicolor')
else:
flowers.append('virginica')
iris = pd.DataFrame(data['data'])
iris.columns = ['sepal_length_cm','sepal_width_cm','petal_length_cm','petal_width_cm']
iris['target'] = flowers
#FIRST 5 ROWS OF THE DATASET
iris.head()
Data Exploration
Before delving into any model development, we will first undergo some initial data exploration and visualization to get a better understanding of how the data looks.
2D Plots
#PLOTTING THE NUMBER OF LABELS PER CLASS
fig = plt.figure()
axes = fig.add_axes([0,0,1,1])
sns.countplot(x=iris['target'])
axes.set_title('Number of Labels Per Class',fontsize = 18, fontweight = 'bold')
axes.set_xlabel('Class',fontsize = 14, fontweight = 'bold')
axes.set_ylabel('Number of Labels',fontsize = 14, fontweight = 'bold')
plt.show()
#PLOTTING 2D RELATIONSHIPS BETWEEN THE NUMERICAL FEATURES ON A SCATTER PLOT GRID USING SEABORN
sns.pairplot(iris,hue='target')
plt.show()
Feature Engineering
Using the 4 feature columns, we will create some additional features of unique pairwise ratios between the columns.
#CREATE A COLUMN FOR EVERY UNIQUE PAIRWISE RATIO AMONG THE 4 EXISTING FEATURES
for items in list(combinations(iris.columns[:-1],2)):
iris[items[0] + '/' + items[1]] = iris[items[0]]/iris[items[1]]
#SHOW FIRST 5 ROWS OF UPDATED IRIS DATASET
iris = iris[['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm',
'petal_width_cm', 'sepal_length_cm/sepal_width_cm',
'sepal_length_cm/petal_length_cm', 'sepal_length_cm/petal_width_cm',
'sepal_width_cm/petal_length_cm', 'sepal_width_cm/petal_width_cm',
'petal_length_cm/petal_width_cm','target']]
iris.head()
Feature Scaling
As stated earlier, because KNN is a distance-based algorithm, feature scaling is especially important to prevent the impact of one feature from masking the input of other features. Feature scaling addresses this problem by ensuring each feature column has the same mean and standard deviation. A common way to execute this practice is to subtract each column by its column mean, before dividing the difference by the column standard deviation (resulting in a column with an approximate mean of 0 and approximate standard deviation of 1). This is the method we will execute within this study.
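Written as a formula, each value x in a feature column is rescaled as

x_{scaled} = \frac{x - \bar{x}}{s}

where \bar{x} is the column mean and s is the column standard deviation (this is the same standardization the code below applies).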
Before feature scaling the data set, we will view descriptive statistics for each of the feature columns. This is important information to know in case new data is being streamed into live models for predictions, or if the presence of new labeled data results in the model being retrained with updated data.
iris.describe().iloc[[0,1,2,3,7]].transpose()
#FEATURE SCALING THE DATA
for cols in iris.columns[:-1]:
iris[cols] = (iris[cols] - iris[cols].mean())/(iris[cols].std())
#SHOW FIRST 5 ROWS OF FEATURE SCALED IRIS DATA SET
iris.head()
Following the feature scaling process, by again using the .describe() method we can see that the descriptive statistics of each feature has changed to reflect a mean of ~ 0 and a standard deviation of ~ 1.
iris.describe().iloc[[0,1,2,3,7]].transpose()
Model Development, Validation, and Selection
With the data properly inspected and scaled, we are ready to undergo the model development, validation, and selection process. This process will include a grid search methodology along with the custom KNN Classification and Repeated K-Fold Cross Validation frameworks above to find an optimal KNN Classification model. The search will find a subset of features, a K value, a distance metric, and a weighting schema which, when used together, maximize prediction accuracy.
First we will create a list which holds all possible feature subsets:
featuresubsets = []
for num in range(1,len(iris.columns[:-1])+1):
#FOR SUBSETS OF SIZE 1, APPEND EACH FEATURE COLUMN NAME TO THE FEATURE SUBSETS LIST
if num == 1:
for cols in iris.columns[:-1]:
featuresubsets.append(cols)
#FOR SUBSETS OF SIZE > 1, APPEND ALL UNIQUE SUBSETS OF THAT SIZE TO THE FEATURE SUBSETS LIST
else:
for subsets in list(combinations(iris.columns[:-1],num)):
featuresubsets.append(subsets)
len(featuresubsets)  #OUTPUT: 1023
We have now created a list holding all 1023 possible unique subsets of the feature data. With these subsets identified, we can build our grid search. Our grid search will test each feature subset with a K Value between 1–20. In addition, for each feature subset/K Value combination 6 models will be generated where each model holds selects a different weighting schema/distance metric combination. For model validation, each model will be subjected to a Repeated K-Fold Cross Validation Process with 5 folds (80/20 split for each iteration) and 5 repeats. In summary, our grid search algorithm will build 1023 * 20 * 6 = 122760 total models where each model is validated 5 * 5 = 25 times to provide a generalized accuracy score depicting model performance:
#LISTS TO STORE THE RESULTS OF THE GRID SEARCH
featureNames = []
numberOfFeatures = []
numberOfNeighbors = []
weightingSchema = []
distanceMetric = []
trainingAccuracy = []
testingAccuracy = []
#STORING THE FEATURE AND LABEL DATA IN SEPARATE STRUCTURES
X = iris.iloc[:,:-1]
Y = iris['target']
#GRID SEARCH ALGORITHM
#lOOP THROUGH ALL SUBSET POSSIBILITIES
for num in range(0,len(featuresubsets)):
#CREATE XNEW, MODIFIED VERSION OF X THAT ONLY HOLDS THE FEATURE SUBSET
if type(featuresubsets[num]) == str:
XNew = X[[featuresubsets[num]]]
else:
XNew = X[list(featuresubsets[num])]
#CREATE MODELS FOR K VALUES BETWEEN 1 AND 20 FOR EACH FEATURE SUBSET
for K in range(1,21):
#FOR EACH K VALUE/SUBSET COMBINATION, CREATE 6 TOTAL MODELS WITH EVERY WEIGHTING SCHEMA/DISTANCE METRIC COMBINATION
for weight in ['standard','weighted']:
for distance in ['euclidean','manhattan','chebyshev']:
model = KNNClass(K,weighting = weight,distance_metric=distance)
#STORE MODEL TRAIN/TEST ACCURACY SCORES FOR EACH ITERATION OF REPEATED K-FOLD CV MODEL VALIDATION PROCESS
modeltrain = []
modeltest = []
#RUN EACH MODEL THROUGH A 5X5 REPEATED K-FOLD CV PROCESS, WHILE STORING ITS TRAIN/TEST ACCURACY SCORES FOR EVERY ITERATION
rfk = Repkfoldcv(n_repeats = 5,n_splits = 5,shuffle=True)
for train,test in rfk.split(XNew):
xtrain,xtest,ytrain,ytest = XNew.iloc[train],XNew.iloc[test],Y.iloc[train],Y.iloc[test]
model.fit(xtrain,ytrain)
modeltrain.append(model.score(xtrain,ytrain))
modeltest.append(model.score(xtest,ytest))
#STORE FEATURE NAMES OF MODEL
featureNames.append(featuresubsets[num])
#STORE NUMBER OF FEATURES OF MODEL
numberOfFeatures.append(len(XNew.columns))
#STORE K VALUE OF MODEL
numberOfNeighbors.append(K)
#STORE WEIGHTING SCHEMA OF MODEL
weightingSchema.append(weight)
#STORE DISTANCE METRIC OF MODEL
distanceMetric.append(distance)
#STORE AVERAGE TRAIN/TEST ACCURACY SCORES THROUGH MODEL VALIDATION PROCESS, PROVIDING GENERALIZED SCORE OF MODEL PERFORMANCE
trainingAccuracy.append(mean(modeltrain))
testingAccuracy.append(mean(modeltest))
#STORE GRID SEARCH RESULTS IN PANDAS DATAFRAME AND RETURN 10 BEST PERFORMING MODELS ACCORDING TO TESTING ACCURACY
SummaryReport = pd.DataFrame()
SummaryReport['Feature Names'] = featureNames
SummaryReport['Number of Features'] = numberOfFeatures
SummaryReport['Number of Neighbors'] = numberOfNeighbors
SummaryReport['Weighting Schema'] = weightingSchema
SummaryReport['Distance Metric'] = distanceMetric
SummaryReport['Training Accuracy'] = trainingAccuracy
SummaryReport['Testing Accuracy'] = testingAccuracy
SummaryReport.sort_values('Testing Accuracy', ascending=False,inplace=True)
SummaryReport = SummaryReport.head(10)
SummaryReport.set_index('Feature Names',inplace = True)
SummaryReport
Based off the summary report of the grid search results, a 3-NN model using 3 ratio features (sepal_length_cm/petal_length_cm, sepal_width_cm/petal_length_cm, sepal_width_cm/petal_width_cm) alongside a standard weighting schema and euclidean distance metric was the best performing model across a Repeated K-Fold CV process with 5 repeats and 5 folds (25 model validation iterations). This best performing model also has near identical training/testing accuracies meaning that the model is neither overfitting or underfitting the data. Finally, with a relatively low number of features and low number of neighbors the model has certainly achieved model parsimony in being a fairly simple model with extremely high predictive power.
Closing Remarks
The KNN Algorithm is one of the most widely used models for both classification and regression-based tasks. In closing, we wanted to provide some additional insight on the benefits and drawbacks of the model to help identify the best settings for deploying this model.
Benefits of KNN
- Intuitive distance-based methodology
- Minimal prior assumptions of the data unlike other algorithms (linear/multiple regression)
- Easy model training as the model simply stores the training data
- Highly adaptable to new training data and multi-class frameworks
- Very flexible/customizable framework (adapts to classification/regression, linear/non-linear problems, and can be used with a variety of distance metrics/weighting schemas)
Drawbacks of KNN
- Computationally expensive algorithm, as distance needs to be computed between every test observation and every training observation before being sorted
- Larger datasets can severely slow down the algorithm as a result of the above
- Curse of Dimensionality: As the number of features increases, it becomes harder to find data points that are truly close together in higher dimensional space
- Feature scaling is necessary to ensure the scale of one feature does not mask the impact of others
- Finding the optimal K-value can be difficult and time consuming
- Imbalanced data and outliers can cause the model to be hypersensitive
Luckily, Python's scikit-learn library provides these models and algorithms premade and ready to implement through a single line of code. Scikit-learn provides source code for each of these options, and while we refrained from viewing outside references for the sake of building truly custom frameworks from scratch, we highly recommend utilizing the full power of the scikit-learn library. Please see below for relevant packages that coincide with the content discussed throughout the article.
#DISTANCE METRICS
from sklearn.neighbors import DistanceMetric
#KNN CLASSIFICATION
from sklearn.neighbors import KNeighborsClassifier
#KNN REGRESSION
from sklearn.neighbors import KNeighborsRegressor
#KNN REGRESSION ERROR METRICS
from sklearn.metrics import mean_squared_error,mean_absolute_error #RMSE IS SQUARE ROOT OF MEAN_SQUARED_ERROR
#TRADITIONAL TRAIN/TEST SPLIT
from sklearn.model_selection import train_test_split
#REPEATED K-FOLD CROSS VALIDATION
from sklearn.model_selection import RepeatedKFold
About the Author
Amit Gattadahalli is a Consultant at Boulevard with a focus in data science. He recently graduated from the University of Maryland College Park with a B.S. in Mathematics and Statistics. His work within the consulting industry has concentrated on supervised/unsupervised machine learning, custom data science-centric algorithm development, data visualization, and general software development.
About Boulevard
The Boulevard Consulting Group, a Native American-owned, Small Business Administration (SBA) 8(a) small business, is a modern management and technology consulting firm that helps clients leverage the power of data science, operational transformation, and cutting-edge technology to build a better future. We develop highly scalable, impactful solutions by combining the latest technology trends with time-tested improvement methodologies, and with the personalized attention of a smaller firm. For more information about Boulevard and its services, visit our website at www.boulevardcg.com.