Implementing the KNN Algorithm in Python
While recently experimenting with imbalanced datasets, I noticed that imbalanced-learn relies on the KNN algorithm, so I decided to study KNN carefully. The algorithm's principle is simple and easy to grasp. In implementing it, I followed sklearn's code style; sklearn really is clean, elegant, and practical.
The k-nearest neighbour algorithm (k-Nearest Neighbour algorithm), abbreviated KNN, is one of the simplest algorithms in data mining. Its working principle: given a training set whose class labels are known, when a new, unlabeled sample arrives, find the k instances in the training set closest to it; if the majority of those k instances belong to some class, the new sample is assigned to that class. Put simply: the k points nearest to X vote on which class X belongs to.
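Before the full implementation below, here is a minimal sketch of that voting idea on made-up toy data (the point coordinates and labels are assumptions for illustration): the whole algorithm is just "sort by distance, take the k nearest labels, count votes".

import numpy as np
from collections import Counter

# toy training data: four 2D points with known labels (made up for illustration)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
x = np.array([0.2, 0.1])  # new, unlabeled point

# Euclidean distance from x to every training point
distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
k = 3
topk_labels = y_train[np.argsort(distances)[:k]]  # labels of the k nearest points
print(Counter(topk_labels).most_common(1)[0][0])  # prints 0: the nearby cluster wins the vote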
import numpy as np
from math import sqrt
from collections import Counter
import pandas as pd

class KNN:
    def __init__(self, k):
        """Initialize KNN with the number of neighbours k."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        # KNN is a lazy learner: "fitting" just stores the training set
        assert X_train.shape[0] == y_train.shape[0]
        assert self.k <= X_train.shape[0]
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_test):
        assert self._X_train is not None and self._y_train is not None
        assert X_test.shape[1] == self._X_train.shape[1]
        y_predict = [self._predict(x) for x in X_test]
        return np.array(y_predict)
class KNNRegressor(KNN):
    def __init__(self, weights=True, k=5):
        # weights=True: plain average of the k neighbours' targets;
        # weights=False: average weighted by inverse distance
        super().__init__(k)
        self.weights = weights

    def _predict(self, x):
        assert x.shape[0] == self._X_train.shape[1]
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = np.array([self._y_train[i] for i in nearest[:self.k]])
        if self.weights is True:
            return np.mean(topK_y)
        else:
            # use the distances of the k *nearest* neighbours
            # (not the first k entries of the unsorted list)
            nearest_distances = np.array(distances)[nearest[:self.k]]
            ratio = 1 / nearest_distances
            ratio = ratio / np.sum(ratio)
            return np.sum(ratio * topK_y)

    def score(self, X_test, y_test):
        # R^2 = 1 - RSS / TSS
        y_predict = self.predict(X_test)
        RSS = np.sum((y_test - y_predict) ** 2)
        TSS = np.sum((y_test - np.mean(y_test)) ** 2)
        return 1 - RSS / TSS

    def __repr__(self):
        return "KNNRegressor(k = %d)" % self.k
class KNNClassifier(KNN):
    def __init__(self, weights=True, k=5):
        # weights=True: plain majority vote;
        # weights=False: votes weighted by inverse distance
        super().__init__(k)
        self.weights = weights

    def _predict(self, x):
        assert x.shape[0] == self._X_train.shape[1]
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        if self.weights is True:
            votes = Counter(topK_y)
            return votes.most_common(1)[0][0]
        else:
            # sum the inverse-distance weights per class and pick the largest
            distances_ = np.array(distances)[nearest[:self.k]]
            d = pd.Series(1 / distances_)
            df = pd.DataFrame({"topK_y": topK_y, "d": d})
            return df.groupby("topK_y")["d"].sum().idxmax()

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        accuracy = np.sum(y_test == y_predict) / len(y_test)
        return accuracy

    def __repr__(self):
        return "KNNClassifier(k = %d)" % self.k
A brief word on the code: it uses inheritance, with KNN as the parent class and KNNRegressor and KNNClassifier as subclasses.
Although KNN defaults to majority voting, that rule assumes all samples carry equal weight. If we instead take into account how far each training sample is from the test sample, the prediction can change: a test point whose single nearest neighbour is very close may be pulled toward that neighbour's class even when the remaining, farther neighbours form a numerical majority for another class. The weight is usually taken as the inverse of the distance, as the short example below shows.
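A minimal numeric sketch (toy numbers, assumed for illustration) of how inverse-distance weighting can flip a vote:

import numpy as np

# 3 nearest neighbours: two of class "blue" but far away, one "red" very close
labels = np.array(["blue", "blue", "red"])
distances = np.array([2.0, 2.5, 0.5])

weights = 1 / distances                         # [0.5, 0.4, 2.0]
blue_score = weights[labels == "blue"].sum()    # 0.9
red_score = weights[labels == "red"].sum()      # 2.0
print(red_score > blue_score)                   # True: the weighted vote picks "red",
                                                # although the plain majority vote says "blue"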
Below is a comparison of the hand-written algorithms against sklearn.
1. Classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3)

KNC = KNNClassifier(k=5)
KNC.fit(X_train, Y_train)
y_predict = KNC.predict(X_test)
print(KNC.score(X_test, Y_test))

# sklearn's implementation on the same split, for comparison
KNC_ = KNeighborsClassifier(n_jobs=-1).fit(X_train, Y_train)
print(KNC_.score(X_test, Y_test))
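One caveat worth noting (my addition, not in the original comparison): train_test_split draws a fresh random split each run, so a single pair of scores will fluctuate. Fixing random_state, as sketched below, makes the two numbers directly comparable across runs:

# seeded variant of the split above; 42 is an arbitrary choice
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)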
2. Regression
# load_boston was removed in scikit-learn 1.2; this example follows the original post
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

data = load_boston()
X = data.data
y = data.target

score1 = []
score2 = []
for i in range(100):
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3)
    # fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    KNR = KNNRegressor(k=5, weights=False)  # weights=False: inverse-distance weighting
    KNR.fit(X_train, Y_train)
    score1.append(KNR.score(X_test, Y_test))

    KNR_ = KNeighborsRegressor(n_neighbors=5, n_jobs=-1, weights='distance').fit(X_train, Y_train)
    score2.append(KNR_.score(X_test, Y_test))

print("my_code r2_score", np.mean(score1))
print("sklearn r2_score", np.mean(score2))
Summary
On classification, the results match sklearn's. On regression the scores differ by about 0.1, which suggests the regression code still has some issue, or that sklearn uses a better optimization scheme. I'll look into this part when I have time.
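To chase the gap down, one option (a hypothetical diagnostic, not part of the original post) is to compare the two regressors' per-sample predictions on a single fixed split; any test point where they diverge can then be inspected neighbour by neighbour:

# hypothetical diagnostic: compare per-sample predictions on one fixed split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

mine = KNNRegressor(k=5, weights=False).fit(X_train, Y_train).predict(X_test)
theirs = KNeighborsRegressor(n_neighbors=5, weights='distance').fit(X_train, Y_train).predict(X_test)
print(np.abs(mine - theirs).max())  # close to 0 if the two implementations agree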