k-Nearest Neighbor (KNN) Algorithm

This article introduces the basic idea behind the k-nearest-neighbor (kNN) algorithm, a commonly used method for data classification and prediction: the class of a new point is decided by looking up the k nearest neighbors in a labeled dataset. kNN is widely used in data mining and machine learning.

Data science is hot and it seems that everybody is working on some sort of project involving the latest state-of-the-art (SOTA) algorithm. Of course, with good reason: in many cases we can use data to make very reasonable predictions in almost any field. While there is a lot of focus lately on SOTA algorithms, the simpler methods are sometimes forgotten.

Recently, I played around with a k-nearest-neighbor (KNN) algorithm and I was amazed at how powerful it can be. The technique itself is used in many other fields. For example, I used it to identify the same particles in consecutive frames of a high-speed recording for one of my research projects during my Ph.D. The coordinates of a particle are known and we look in the next frame at the closest particles around that position. Of course, when there are multiple particles very close by, you can get into trouble. For that, you can make use of higher-order information from multiple frames, such as the velocity or acceleration vector. For KNN in machine learning, we generally do not have temporal data; therefore, we only use the first-order form, which is the simplest.

When we want to use KNN to classify new data, i.e. make a prediction, we use the already known data (or labeled data) as a kind of look-up table. We select data that is similar to the new data we want to predict and pick the most prominent class from that selection. So we compare an unknown example to an already known dataset. There is no training, no layers, no weights. The only parameter is k, which specifies the number of neighbors to take into consideration when predicting the class. For example, to classify a kind of fruit, we can select the five most similar examples from the dataset. Now we say that the most prominent class of those five selected examples is probably also the class we want to predict. If we have found three apples and two pears, we would predict an apple as well.
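
A minimal sketch of that majority vote, with made-up fruit labels standing in for the five selected neighbors (the real implementation for the Titanic data follows later in this post):

from collections import Counter

# hypothetical classes of the five most similar examples from the dataset
nearest_labels = ['apple', 'apple', 'pear', 'apple', 'pear']

# the most prominent class among the neighbors becomes the prediction
prediction = Counter(nearest_labels).most_common(1)[0][0]
print(prediction)  # 'apple': three apples beat two pears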

Now we come to another problem: how do we select the most similar examples when each example is described by a list of features? When we have a single feature, e.g. height, this would be very easy. We simply calculate the difference and select the k closest matches. But what to do when we also have weight and width? We have to quantify the difference for each feature and aggregate the result into a single value. Fortunately, there are many ways to do this. One of the most common is the Euclidean distance, which can be seen as the shortest straight line between two points.

[Image: Euclidean distance, the magic introduced by Pythagoras! (my own diagram skills)]
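
A small sketch comparing the Euclidean distance with another common choice, the Manhattan (city-block) distance; the two feature vectors (height, weight, width) are made up purely for illustration:

import numpy as np

# two hypothetical objects described by height, weight and width
object_a = np.array([10.0, 150.0, 7.0])
object_b = np.array([12.0, 130.0, 6.5])

# Euclidean: straight line between the two points in feature space
euclidean = np.sqrt(np.sum((object_a - object_b) ** 2))

# Manhattan: sum of the absolute per-feature differences
manhattan = np.sum(np.abs(object_a - object_b))

print(euclidean, manhattan)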

The Euclidean distance works for many features (or dimensions); however, there is one shortcoming that applies to KNN in general. The features have to be numeric, and for a distance to make sense, the numbers have to represent meaning. The height feature has meaning attached to the number: the larger the number, the larger the object; the smaller the number, the smaller the object. This makes the distance, i.e. the difference between two heights, meaningful, as it is the difference in height. Now let's take color as a feature. Color is not numeric (at least we use names, not wavelengths) and therefore we need to encode it as a categorical variable to create numbers. Now, a value of green is represented by 0, red by 1, etc. While the numbers indicate which color is attached to the object, the value itself has no real meaning. If we have an object that has color value 100 and another object with color value 50, these numbers do not mean anything other than that the colors are different. Therefore, a difference (or distance) between these numbers is meaningless and useless for KNN.

While real categorical variables are not usable, not everything is lost. We can, for example, use binary indicators such as is_green to indicate whether an object is green. Still, I would only use these if I am sure that the feature adds to the prediction.
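
A minimal sketch of how such indicators could be created with pandas; the color column and its values are made up for illustration:

import pandas as pd

# hypothetical dataset with a non-numeric color feature
fruits = pd.DataFrame({'color': ['green', 'red', 'green', 'yellow']})

# one binary indicator per color; is_green is 1 for green objects and 0 otherwise
indicators = pd.get_dummies(fruits['color'], prefix='is').astype(int)
print(indicators)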

Create the algorithm

We will create a KNN algorithm, use it on the famous Titanic dataset, and predict who survived the tragic disaster. Therefore, all the functions will have some kind of link with that dataset. To create a KNN prediction algorithm we have to do the following steps:

1. Calculate the distance between the unknown point and the known dataset.
2. Select the k nearest neighbors from that dataset.
3. Make a prediction.

[GIF: Simple animation showing how KNN works (created myself; code available on Github)]

As seen in the previous figure, we only have to apply Pythagoras to calculate the Euclidean distance:

import numpy as np


def euclidean_distance(vector1, vector2):
    """Shortest straight-line (Euclidean) distance between two feature vectors."""
    return np.sqrt(np.sum((vector1 - vector2)**2))


# test function
vec1 = np.array([3, 0])
vec2 = np.array([0, 4])


# this is the 3:4:5 triangle and therefore, it should return 5 (Long live Pythagoras)
euclidean_distance(vec1, vec2)

Next, we need to loop over the dataset to calculate and compare all the distances to the point we want to predict:

import pandas as pd


# Find the nearest neighbors in the dataset using Pandas
def get_nearest_neighbor(vector, dataset, number_of_vectors=1, ignore_cols=['Survived'], not_count_duplicates=False):
    ds = dataset.copy()
    ds['distance'] = ds.loc[:, ~ds.columns.isin(ignore_cols)].apply(
        lambda x: euclidean_distance(x, vector), axis=1)
    if not_count_duplicates:
        distances = sorted(ds.distance.unique())[:number_of_vectors]
        return ds.loc[ds.distance <= max(distances)].drop('distance', axis=1)
    return ds.sort_values('distance', ascending=True).head(number_of_vectors).drop('distance', axis=1)
        
# test function
dataset = pd.DataFrame([
    {'a': 1, 'b': 1, 'Survived': 1},
    {'a': 2, 'b': 2, 'Survived': 1},
    {'a': 3, 'b': 3, 'Survived': 0},
    {'a': 4, 'b': 4, 'Survived': 0},
    {'a': 5, 'b': 5, 'Survived': 0},
])
vector = pd.Series({'a': 2.5, 'b': 2.5})


# should be (2,2) and (3,3) (if keeping track of duplicates)
get_nearest_neighbor(vector, dataset)

Now we have a function that gives us a list of the k nearest neighbors. The final step is to take the most prominent class from that list and use it as the prediction. Because Survived is either 0 or 1, this is easily solved by taking the mean and rounding it: the mean is the fraction of surviving neighbors, and rounding it gives the majority class. Feel free to check if this is true.

def predict(vector, dataset, number_of_neighbors=1, y='Survived'):
    neighbors = get_nearest_neighbor(vector, dataset, number_of_neighbors)
    return round(neighbors[y].mean())


# test function
print(predict(vector, dataset))
print(predict(pd.Series({'a': 4.5, 'b': 4.5}), dataset))

Prepare the dataset

Before we can test the algorithm we need to prepare the data. This means reading in the data and filling in the missing values in a smart way. We will not be using the Embarked and Cabin features, so we do not have to bother with those. A single Fare value is missing, which we will fill in using the median Fare of that passenger's Pclass. Quite a lot of Age values are missing; we will use the title extracted from the name to make a good guess for each missing age.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


train = pd.read_csv('../datasets/titanic/train.csv')
test = pd.read_csv('../datasets/titanic/test.csv')


train['set'], test['set'] = 'train', 'test'
combined = pd.concat([train, test])


# Fill in missing fare
pclass = combined.loc[combined.Fare.isnull(), 'Pclass'].values[0]
median_fare = combined.loc[combined.Pclass== pclass, 'Fare'].median()
combined.loc[combined.Fare.isnull(), 'Fare'] = median_fare


## Fill in missing ages
# Select everything before the . as title
combined['Title'] = combined['Name'].str.extract(r'([A-Za-z]+)\.', expand=True)
title_reduction = {'Mr': 'Mr', 'Mrs': 'Mrs', 'Miss': 'Miss', 
                   'Master': 'Master', 'Don': 'Mr', 'Rev': 'Mr',
                   'Dr': 'Mr', 'Mme': 'Miss', 'Ms': 'Miss',
                   'Major': 'Mr', 'Lady': 'Mrs', 'Sir': 'Mr',
                   'Mlle': 'Miss', 'Col': 'Mr', 'Capt': 'Mr',
                   'Countess': 'Mrs','Jonkheer': 'Mr',
                   'Dona': 'Mrs'}
combined['Title'] = combined['Title'].map(title_reduction)
for title, age in combined.groupby('Title')['Age'].median().items():
    combined.loc[(combined['Title']==title) & (combined['Age'].isnull()), 'Age'] = age

Next we create a couple of additional features. One that I recently saw on Kaggle is family_survivability. It assumes that families help each other and that if others in your family survive, you are more likely to have survived as well. The idea comes from Shunjiang Xu / Blood is thicker than water and I think it is pretty amazing. Furthermore, we will add the family size, change Sex to a binary indicator, scale all variables, and split the data back into train/test.

def other_family_members_survived(dataset, label='family_survival'):
    """
    Check if other family members survived
      -> 0: no other family member survived
      -> 1: at least one other family member survived
      -> 0.5: unknown whether others survived, or the person was alone
    
    Parameters
    ----------
    dataset : DataFrame
      The sub-dataframe containing the family
    """
    ds = dataset.copy()
    if len(dataset) == 1:
        ds[label] = 0.5
        return ds
    result = []
    for ix, row in dataset.iterrows():
        survived_fraction = dataset.drop(ix)['Survived'].mean()
        if np.isnan(survived_fraction):
            result.append(0.5)
        elif survived_fraction == 0:
            result.append(0)
        else:
            result.append(1)
    ds[label] = result
    return ds
 
# Get family_survival from surname
combined['surname'] = combined['Name'].apply(lambda x: x.split(",")[0])
combined = combined.groupby(['surname', 'Fare']).apply(other_family_members_survived).reset_index(drop=True)
# Fill in missing from Tickets
combined = combined.groupby(['Ticket']).apply(lambda x: other_family_members_survived(x, label='family_survival_ticket')).reset_index(drop=True)
combined.loc[combined['family_survival'] == 0.5, 'family_survival'] = combined.loc[combined['family_survival'] == 0.5, 'family_survival_ticket']


# Get family size
combined['family_size'] = combined['Parch'] + combined['SibSp']


# Make binary
combined['Sex'] = LabelEncoder().fit_transform(combined['Sex'])


# Bin Age and Fare
combined['Age'] = pd.qcut(combined['Age'], 4, labels=False)
combined['Fare'] = pd.qcut(combined['Fare'], 5, labels=False)


# Scale classes
selected = ['Pclass', 'Sex', 'Age', 'Fare', 'family_size', 'family_survival']
scaler  = StandardScaler()
scaler.fit(combined[selected])
combined[selected] = scaler.transform(combined[selected])


# Split back to train/test
train = combined.loc[combined['set'] == 'train'].drop('set', axis=1).reset_index(drop=True)
test = combined.loc[combined['set'] == 'test'].drop(['set', 'Survived'], axis=1).reset_index(drop=True)

Make predictions

The algorithm is ready, the data is prepared, we are ready to make some predictions!

def predict_dataset(dataset, number_of_neighbors=1):
    ds = dataset.copy()
    def predict_row(vector, dataset):
        # leave-one-out: exclude the row itself from its own neighbors
        subset = dataset.loc[~(dataset.index == vector.name)]
        # crude progress indicator
        if vector.name % 100 == 0:
            print(vector.name)
        return int(predict(vector, subset, number_of_neighbors))
    ds['predicted'] = ds.loc[:, ds.columns.isin(selected)].apply(
        lambda x: predict_row(x, ds), axis=1)
    return ds


ds = predict_dataset(train, number_of_neighbors=10)


print('Accuracy:', sum(ds['Survived'] == ds['predicted']) / len(ds))

It takes pretty long, because our algorithm is quite slow. But if all is correct, you should get an accuracy of 83.5%. The only parameter we can tweak is the number of neighbors to take into consideration. I played around a bit and 10 seems to give the best results; a sketch of such a sweep is shown below. After that, let's make a Kaggle submission dataset!
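
A minimal sketch of that sweep over the number of neighbors, reusing predict_dataset from above; the candidate values of k are an arbitrary choice, and this is slow:

# try a few values of k and keep the one with the highest leave-one-out accuracy
accuracies = {}
for k in [1, 3, 5, 10, 15, 20]:
    predicted = predict_dataset(train, number_of_neighbors=k)
    accuracies[k] = sum(predicted['Survived'] == predicted['predicted']) / len(predicted)

best_k = max(accuracies, key=accuracies.get)
print(accuracies)
print('best k:', best_k)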

def predict_testset(test_dataset, train_dataset, number_of_neighbors=1):
    ds = test_dataset.copy()
    select = selected + ['Survived']
    
    def predict_row(vector, dataset):
        if vector.name % 100 == 0:
            print(vector.name)
        return int(predict(vector, dataset[select], number_of_neighbors))


    ds['Survived'] = ds.loc[:, ds.columns.isin(selected)].apply(
        lambda x: predict_row(x, train_dataset), axis=1)
    
    return ds


final_test = predict_testset(test, train, number_of_neighbors=10)
result = final_test[['PassengerId', 'Survived']].copy()
result.to_csv('results.csv', index=False)

Submitting this to Kaggle will result in a score of 81.3%. Slightly lower than on the training set, but still quite high. And we managed this using only a simple KNN algorithm.

Some last thoughts

The algorithm we created was powerful and very effective. The only downside is that our implementation is very slow. For each vector, we need to calculate and compare the distance to every point in the complete dataset. Scikit-learn comes with smarter implementations that use tree-based approaches (such as KD-trees and ball trees) to minimize the amount of calculation required, making them tremendously faster.
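
A minimal sketch of the scikit-learn equivalent on the same selected columns; algorithm='ball_tree' is just one of the available options, and note that scoring on the training data itself is not the same as the leave-one-out evaluation above:

from sklearn.neighbors import KNeighborsClassifier

# tree-based neighbor search instead of brute-force distance computation
knn = KNeighborsClassifier(n_neighbors=10, algorithm='ball_tree')
knn.fit(train[selected], train['Survived'])
print('Accuracy:', knn.score(train[selected], train['Survived']))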

Here we used KNN as a classifier; however, it works in a very similar way for regression. With regression, the result is continuous, so it makes sense to, for example, average the target values of the nearest neighbors when making a prediction.
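
In our own implementation this would only mean dropping the rounding; a minimal sketch, assuming a continuous target column named target:

def predict_regression(vector, dataset, number_of_neighbors=1, y='target'):
    # same neighbor lookup as before, but return the continuous mean instead of a rounded class
    neighbors = get_nearest_neighbor(vector, dataset, number_of_neighbors, ignore_cols=[y])
    return neighbors[y].mean()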

I hope you had as much fun as I had. If you have any questions, feel free to contact me through LinkedIn.

All code is available on my Github and on Kaggle.

I would very much appreciate any upvote of my Kaggle notebooks.

Translated from: https://towardsdatascience.com/k-nearest-neighbor-knn-algorithm-3d16dc5c45ef
