Build kNN from Scratch in Python

In this article, we shall understand how the k-Nearest Neighbors (kNN) algorithm works and build the kNN algorithm from the ground up. We shall also evaluate our algorithm using k-Fold cross-validation, which is likewise developed from scratch.

After completing this tutorial, you will know:

  • How to code the k-Nearest Neighbors algorithm step by step

  • How to use k-Nearest Neighbors to make a prediction for new data

  • How to code k-Fold Cross Validation step by step

  • How to evaluate k-Nearest Neighbors on a real dataset using k-Fold Cross Validation

Prerequisites: A basic understanding of Python and of classes and objects from object-oriented programming (OOP)

k-Nearest Neighbors

k-Nearest Neighbors, kNN for short, is a very simple but powerful technique used for making predictions. The principle behind kNN is to use the “most similar historical examples to the new data.”

‘k’ is the number used to identify similar neighbors for the new data point.

The entire training dataset is stored up front. When predictions are required for new data, kNN considers the k most similar neighbors (records) to decide, based on feature similarity, which class the new data point belongs to.

Once we find the distance or similarity, we choose the top k closest records. After discovering the k closest records, we make the prediction by returning the most common outcome or by taking the average. As such, kNN can be used for classification or regression problems.

The kNN algorithm doesn’t have a training phase. The model just holds the data until a prediction is required and does no work before then. For this reason, kNN is often referred to as a “lazy learning” method.

k-Nearest Neighbors in 4 easy steps

  1. Choose a value for k

  2. Find the distance of the new point to each record of the training data

  3. Get the k-Nearest Neighbors

  4. For a classification problem, the new data point belongs to the class that most of the neighbors belong to. For a regression problem, the prediction can be the average or a weighted average of the labels of the k-Nearest Neighbors.

Building kNN from scratch using Python

You can follow along using the code available in my GitHub.

Step 1: Choosing a k value

The choice of k has a drastic impact on the results we obtain from kNN. It is better to choose an odd number, so that for binary classification the majority vote among the neighbors cannot end in a tie.
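
To see why ties matter, consider an illustrative snippet of our own (not from the original post): with an even k, the neighbor vote can split evenly, and the majority-vote rule used later in predict() then breaks the tie arbitrarily.

votes = ['a', 'a', 'b', 'b']             # labels of k = 4 nearest neighbors: a tie
print(max(set(votes), key=votes.count))  # 'a' or 'b', depending on set ordering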

Step 2: Calculating Distance

The next step is to calculate the distance between two rows in a dataset.

Problem- or data-specific methods are used to calculate the distance or similarity between two records. In general, for tabular or vector data, Euclidean distance is considered the starting point. There are several other similarity or distance metrics, such as Manhattan distance, Hamming distance, etc.

Euclidean distance is defined as the square root of the sum of the squared differences between two points. It is also known as the L2 norm.

Euclidean distance between two vectors ‘x’ and ‘y’: d(x, y) = √( Σ (xᵢ - yᵢ)² )

Manhattan distance is the sum of the absolute values of the differences between two points.

Manhattan distance between two vectors ‘x’ and ‘y’: d(x, y) = Σ |xᵢ - yᵢ|

Hamming distance is used for categorical variables. In simple terms, it tells us whether the two categorical variables are the same or not.

Hamming distance between two vectors ‘x’ and ‘y’: d(x, y) = Σ δ(xᵢ, yᵢ)

where ‘δ’ checks the equality of the two elements: δ(xᵢ, yᵢ) is 1 when they differ and 0 when they are equal.

In Python, we create a separate class that holds the methods to calculate the distance between two vectors.

import numpy as np


class distanceMetrics:
    '''
    Description:
        This class contains methods to calculate various distance metrics
    '''
    def __init__(self):
        '''
        Description:
            Initialization/Constructor function
        '''
        pass
        
    def euclideanDistance(self, vector1, vector2):
        '''
        Description:
            Function to calculate Euclidean Distance
                
        Inputs:
            vector1, vector2: input vectors for which the distance is to be calculated
        Output:
            Calculated euclidean distance of two vectors
        '''
        self.vectorA, self.vectorB = vector1, vector2
        if len(self.vectorA) != len(self.vectorB):
            raise ValueError("Undefined for sequences of unequal length.")
        distance = 0.0
        for i in range(len(self.vectorA)):  # iterate over every feature
            distance += (self.vectorA[i] - self.vectorB[i])**2
        return (distance)**0.5
    
    def manhattanDistance(self, vector1, vector2):
        """
        Description:
            Takes 2 vectors a, b and returns the manhattan distance
        Inputs:
            vector1, vector2: two vectors for which the distance is to be calculated
        Output:
            Manhattan Distance of two input vectors
        """
        self.vectorA, self.vectorB = vector1, vector2
        if len(self.vectorA) != len(self.vectorB):
            raise ValueError("Undefined for sequences of unequal length.")
        return np.abs(np.array(self.vectorA) - np.array(self.vectorB)).sum()
    
    def hammingDistance(self, vector1, vector2):
        """
        Description:
            Takes 2 vectors a, b and returns the hamming distance
            Hamming distance is meant for discrete-valued vectors, though it is a 
            valid metric for real-valued vectors.
        Inputs:
            vector1, vector2: two vectors for which the distance is to be calculated
        Output:
           Hamming Distance of two input vectors 
        """
        self.vectorA, self.vectorB = vector1, vector2
        if len(self.vectorA) != len(self.vectorB):
            raise ValueError("Undefined for sequences of unequal length.")
        return sum(el1 != el2 for el1, el2 in zip(self.vectorA, self.vectorB))
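
As a quick sanity check, we can run these metrics on a pair of small illustrative vectors (the values here are ours, not from the original post):

dm = distanceMetrics()
print(dm.euclideanDistance([1, 2, 3], [4, 5, 6]))            # sqrt(27) ≈ 5.196
print(dm.manhattanDistance([1, 2, 3], [4, 5, 6]))            # |1-4| + |2-5| + |3-6| = 9
print(dm.hammingDistance(['a', 'b', 'c'], ['a', 'x', 'c']))  # 1 (one position differs)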

We shall utilize this class to find the nearest neighbors in the next step.

Step 3: Get Nearest Neighbors

Neighbors for a piece of new data in the dataset are the top k closest instances that we obtain using the distance metrics defined above.

To locate the neighbors for a new piece of data within a dataset, we must first calculate the distance between each record in the dataset and the new piece of data. We can do this by creating an object of the distanceMetrics class that we defined above.

Once the distances are calculated, we must sort all of the records in the training dataset by their distance to the new data. We can then select the top k to return as the most similar neighbors.

We can do this by keeping track of the distance for each record in the dataset as a list, sorting the list of lists by distance, and then retrieving the neighbors.

def getNeighbors(self, testRow):
        '''
        Description:
            Find the k-nearest neighbors of a test row among the training data
        Input:
            testRow: testing data with coordinates
        Output:
            k-nearest neighbors to the test data
        '''
        
        calcDM = distanceMetrics()
        distances = []
        for i, trainRow in enumerate(self.trainData):
            if self.distanceMetric == 'euclidean':
                distances.append([trainRow, calcDM.euclideanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'manhattan':
                distances.append([trainRow, calcDM.manhattanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'hamming':
                distances.append([trainRow, calcDM.hammingDistance(testRow, trainRow), self.trainLabels[i]])
        # sort once, after all distances are collected (requires `import operator`)
        distances.sort(key=operator.itemgetter(1))

        neighbors = []
        for index in range(self.k):
            neighbors.append(distances[index])
        return neighbors

Now that we know how to get the top k neighbors from the dataset, we will use them to make predictions.

Step 4: Predictions

In this step, we shall use the top k similar neighbors collected from the training dataset to make predictions.

def predict(self, xTest, k, distanceMetric):
        '''
        Description:
            Apply kNN model on test data
        Input:
            xTest: testing data with coordinates
            k: number of neighbors
            distanceMetric: technique to calculate distance metric
        Output:
            predicted labels
        '''
        self.testData = xTest
        self.k = k
        self.distanceMetric = distanceMetric
        predictions = []
        
        for i, testCase in enumerate(self.testData):
            neighbors = self.getNeighbors(testCase)
            output = [row[-1] for row in neighbors]
            # majority vote: the most common label among the neighbors
            prediction = max(set(output), key=output.count)
            predictions.append(prediction)
        
        return predictions

In the case of classification, we can return the most represented class among the neighbors.

We can achieve this by applying the max() function to the list of output values from the neighbors. Given the list of class values observed among the neighbors, max() takes the set of unique class values and, for each value in that set, counts how often it occurs in the list; the value with the highest count is returned.
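
For instance (an illustrative snippet of our own, not from the original post):

output = ['cat', 'dog', 'cat']                   # labels of the k = 3 neighbors
prediction = max(set(output), key=output.count)  # 'cat' appears twice, 'dog' once
print(prediction)                                # cat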

Below is the complete kNN class:

import operator


class kNNClassifier:
    '''
    Description:
        This class implements the kNN classifier using the distance metrics above
    '''
    def __init__(self, k = 3, distanceMetric = 'euclidean'):
        '''
        Description:
            KNearestNeighbors constructor
        Input    
            k: total of neighbors. Defaulted to 3
            distanceMetric: type of distance metric to be used. Defaulted to euclidean distance.
        '''
        self.k = k
        self.distanceMetric = distanceMetric
    
    def fit(self, xTrain, yTrain):
        '''
        Description:
            Train kNN model with x data
        Input:
            xTrain: training data with coordinates
            yTrain: labels of training data set
        Output:
            None
        '''
        assert len(xTrain) == len(yTrain)
        self.trainData = xTrain
        self.trainLabels = yTrain

    def getNeighbors(self, testRow):
        '''
        Description:
            Find the k-nearest neighbors of a test row among the training data
        Input:
            testRow: testing data with coordinates
        Output:
            k-nearest neighbors to the test data
        '''
        
        calcDM = distanceMetrics()
        distances = []
        for i, trainRow in enumerate(self.trainData):
            if self.distanceMetric == 'euclidean':
                distances.append([trainRow, calcDM.euclideanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'manhattan':
                distances.append([trainRow, calcDM.manhattanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'hamming':
                distances.append([trainRow, calcDM.hammingDistance(testRow, trainRow), self.trainLabels[i]])
        # sort once by distance, after all distances are collected
        distances.sort(key=operator.itemgetter(1))

        neighbors = []
        for index in range(self.k):
            neighbors.append(distances[index])
        return neighbors
        
    def predict(self, xTest, k, distanceMetric):
        '''
        Description:
            Apply kNN model on test data
        Input:
            xTest: testing data with coordinates
            k: number of neighbors
            distanceMetric: technique to calculate distance metric
        Output:
            predicted labels
        '''
        self.testData = xTest
        self.k = k
        self.distanceMetric = distanceMetric
        predictions = []
        
        for i, testCase in enumerate(self.testData):
            neighbors = self.getNeighbors(testCase)
            output = [row[-1] for row in neighbors]
            # majority vote: the most common label among the neighbors
            prediction = max(set(output), key=output.count)
            predictions.append(prediction)
        
        return predictions
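
A minimal usage sketch on toy data (the values are ours, for illustration only):

xTrain = [[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.8, 8.1]]
yTrain = ['a', 'a', 'b', 'b']

knn = kNNClassifier()
knn.fit(xTrain, yTrain)
print(knn.predict([[1.1, 1.0], [8.1, 8.0]], 3, 'euclidean'))  # expected: ['a', 'b']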

Now that we have our predictions, we need to evaluate the performance of our model. For this, we shall use k-Fold Cross Validation, which is defined in the next part.

k-Fold Cross Validation

This technique involves randomly dividing the dataset into k groups, or folds, of approximately equal size. The first fold is kept for testing and the model is trained on the remaining k-1 folds; this is repeated so that each fold serves as the test set exactly once.

(Diagram: k-fold cross-validation, from the sklearn documentation.)

There are many variants of k-Fold Cross Validation. You can read more about them here.

In our approach, after each fold we calculate the accuracy, and the accuracy of k-Fold CV is then computed by averaging the accuracies over the k folds. For example, fold accuracies of 90%, 92%, and 94% over three folds give a mean accuracy of 92%.

Building kFCV from scratch using Python

As a first step, we divide the dataset into k folds.

Then, for each of the k folds, we run the kNN algorithm, get predictions, and evaluate the performance using accuracy as the evaluation metric.

The method to split the data into k folds:

from random import randrange


class kFoldCV:
    '''
    This class is to perform k-Fold Cross validation on a given dataset
    '''
    def __init__(self):
        pass
    
    def crossValSplit(self, dataset, numFolds):
        '''
        Description:
            Function to split the data into number of folds specified
        Input:
            dataset: data that is to be split
            numFolds: integer - number of folds into which the data is to be split
        Output:
            split data
        '''
        dataSplit = list()
        dataCopy = list(dataset)
        foldSize = int(len(dataset) / numFolds)
        for _ in range(numFolds):
            fold = list()
            while len(fold) < foldSize:
                # draw rows at random (without replacement) into the current fold
                index = randrange(len(dataCopy))
                fold.append(dataCopy.pop(index))
            dataSplit.append(fold)
        return dataSplit

The method for evaluation:

    def kFCVEvaluate(self, dataset, numFolds, *args):
        '''
        Description:
            Driver function for k-Fold cross validation 
        '''
        knn = kNNClassifier()
        folds = self.crossValSplit(dataset, numFolds)
        scores = list()
        for fold in folds:
            # train on every fold except the current one, test on the current fold
            trainSet = list(folds)
            trainSet.remove(fold)
            trainSet = sum(trainSet, [])
            testSet = list()
            for row in fold:
                rowCopy = list(row)
                testSet.append(rowCopy)
                
            # the label is assumed to be the last column of each row
            trainLabels = [row[-1] for row in trainSet]
            trainSet = [train[:-1] for train in trainSet]
            knn.fit(trainSet, trainLabels)
            
            actual = [row[-1] for row in testSet]
            testSet = [test[:-1] for test in testSet]
            
            predicted = knn.predict(testSet, *args)
            
            accuracy = printMetrics(actual, predicted)
            scores.append(accuracy)

        print('*'*20)
        print('Scores: %s' % scores)
        print('*'*20)        
        print('\nMaximum Accuracy: %.3f%%' % max(scores))
        print('\nMean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))
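
The printMetrics helper referenced above lives in the author's GitHub repository; a minimal stand-in that returns percentage accuracy could look like this (our sketch, not the author's exact code):

def printMetrics(actual, predicted):
    '''
    Description:
        Minimal accuracy helper: percentage of exact matches between
        actual and predicted labels
    '''
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0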

Both methods combined into a single class:

from random import randrange


class kFoldCV:
    '''
    This class is to perform k-Fold Cross validation on a given dataset
    '''
    def __init__(self):
        pass
    
    def crossValSplit(self, dataset, numFolds):
        '''
        Description:
            Function to split the data into number of folds specified
        Input:
            dataset: data that is to be split
            numFolds: integer - number of folds into which the data is to be split
        Output:
            split data
        '''
        dataSplit = list()
        dataCopy = list(dataset)
        foldSize = int(len(dataset) / numFolds)
        for _ in range(numFolds):
            fold = list()
            while len(fold) < foldSize:
                index = randrange(len(dataCopy))
                fold.append(dataCopy.pop(index))
            dataSplit.append(fold)
        return dataSplit
    
    
    def kFCVEvaluate(self, dataset, numFolds, *args):
        '''
        Description:
            Driver function for k-Fold cross validation 
        '''
        knn = kNNClassifier()
        folds = self.crossValSplit(dataset, numFolds)
        scores = list()
        for fold in folds:
            trainSet = list(folds)
            trainSet.remove(fold)
            trainSet = sum(trainSet, [])
            testSet = list()
            for row in fold:
                rowCopy = list(row)
                testSet.append(rowCopy)
                
            trainLabels = [row[-1] for row in trainSet]
            trainSet = [train[:-1] for train in trainSet]
            knn.fit(trainSet, trainLabels)
            
            actual = [row[-1] for row in testSet]
            testSet = [test[:-1] for test in testSet]
            
            predicted = knn.predict(testSet, *args)
            
            accuracy = printMetrics(actual, predicted)
            scores.append(accuracy)

        print('*'*20)
        print('Scores: %s' % scores)
        print('*'*20)        
        print('\nMaximum Accuracy: %.3f%%' % max(scores))
        print('\nMean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

We can execute this by creating an object of the k-Fold cross validation class and calling the evaluate method, as shown below.

kfcv = kFoldCV()
kfcv.kFCVEvaluate(data, foldCount, neighborCount, distanceMetric)

kfcv.kFCVEvaluate() then splits the data into the specified number of folds and evaluates the kNN algorithm by considering the top k neighbors, using the distanceMetric specified.
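
For example, on a small synthetic dataset (the values below are our own illustration; the real experiments live in the repository):

# toy dataset: two features per row, class label in the last column
data = [
    [2.7, 2.5, 0], [1.4, 2.3, 0], [3.3, 4.4, 0], [1.3, 1.8, 0], [3.0, 3.0, 0],
    [7.6, 2.7, 1], [5.3, 2.0, 1], [6.9, 1.7, 1], [8.6, 0.2, 1], [7.6, 3.5, 1],
]

kfcv = kFoldCV()
kfcv.kFCVEvaluate(data, 5, 3, 'euclidean')  # 5 folds, k = 3, Euclidean distance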

Examples and implementation can be seen in my GitHub repository.

Conclusion

In this blog, we have seen:

  • The kNN algorithm

  • Some distance metrics used in the kNN algorithm

  • Making predictions using the kNN algorithm

  • Evaluating the kNN algorithm using k-Fold Cross Validation

Hope you gained some knowledge reading this article. Please remember that this article is just an overview of my understanding of the kNN algorithm and the k-Fold Cross Validation technique, gathered from various online sources.

Original article: https://medium.com/@chaitanya_kasaraneni/build-knn-from-scratch-python-7b714c47631a
