Implementing the KNN Algorithm in Python

I was recently experimenting with imbalanced data and noticed that imbalanced-learn makes use of the KNN algorithm, so I decided to study KNN carefully. The principle behind KNN is simple and easy to grasp. While implementing it, I followed sklearn's code style; sklearn really is clean, elegant, and practical.

The k-nearest neighbour algorithm (KNN) is, in principle, the simplest algorithm in data mining. It works like this: given a training set with known class labels, when a new, unlabeled sample arrives, find the k instances in the training set closest to it; if the majority of those k instances belong to some class, the new sample is assigned to that class. Put simply: the k points nearest to X vote on which class X belongs to.

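Before the full implementation, here is a minimal sketch of that voting idea; the points, labels, and query below are made up purely for illustration:

import numpy as np
from collections import Counter

# toy training set: 2-D points labeled "A" or "B" (made-up data)
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
y_train = np.array(["A", "A", "A", "B", "B"])
x = np.array([1.1, 1.0])   # new, unlabeled point
k = 3

# Euclidean distance from x to every training point
distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
# labels of the k nearest points vote on x's class
nearest = np.argsort(distances)[:k]
print(Counter(y_train[nearest]).most_common(1)[0][0])   # -> "A"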

import numpy as np
from math import sqrt
from collections import Counter
import pandas as pd

class KNN:
    def __init__(self, k):
        """Initialize KNN with the number of neighbors k."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """KNN is a lazy learner: fitting just stores the training set."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must equal the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_test):
        """Predict every row of X_test via the subclass's _predict."""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        assert X_test.shape[1] == self._X_train.shape[1], \
            "the feature number of X_test must equal X_train"
        y_predict = [self._predict(x) for x in X_test]
        return np.array(y_predict)
    
    
class KNNRegressor(KNN):
    def __init__(self, weights=True, k=5):
        # weights=True: uniform average; weights=False: inverse-distance weighting
        super().__init__(k)
        self.weights = weights

    def _predict(self, x):
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must equal X_train"
        # Euclidean distance from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        if self.weights is True:
            # uniform weights: plain average of the k nearest targets
            return np.mean(topK_y)
        else:
            # inverse-distance weights; take the distances of the k *nearest*
            # samples (distances[:self.k], the first k as stored, would be a bug)
            distances_ = np.array(distances)[nearest[:self.k]]
            ratio = 1 / distances_
            ratio = ratio / np.sum(ratio)
            return np.sum(ratio * np.array(topK_y))

    def score(self, X_test, y_test):
        # R^2 = 1 - RSS / TSS
        y_predict = self.predict(X_test)
        RSS = np.sum((y_test - y_predict) ** 2)        # residual sum of squares
        TSS = np.sum((y_test - np.mean(y_test)) ** 2)  # total sum of squares
        return 1 - RSS / TSS

    def __repr__(self):
        return "KNNRegressor(k = %d)" % self.k
    


class KNNClassifier(KNN):
    def __init__(self, weights=True, k=5):
        # weights=True: plain majority vote; weights=False: inverse-distance-weighted vote
        super().__init__(k)
        self.weights = weights

    def _predict(self, x):
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must equal X_train"
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        if self.weights is True:
            # majority vote among the k nearest labels
            votes = Counter(topK_y)
            return votes.most_common(1)[0][0]
        else:
            # weight each neighbor's vote by 1/distance and return the label
            # with the largest total weight
            distances_ = np.array(distances)[nearest[:self.k]]
            d = pd.Series(1 / distances_)
            df = pd.DataFrame({"topK_y": topK_y, "d": d})
            return df.groupby("topK_y")["d"].sum().idxmax()

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        accuracy = np.sum(y_test == y_predict) / len(y_test)
        return accuracy

    def __repr__(self):
        return "KNN(k = %d)" % self.k

A brief note on the code: it uses inheritance, with KNN as the parent class and KNNRegressor and KNNClassifier as subclasses.
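A quick usage sketch of that structure (the toy arrays below are made up): the parent class supplies fit and predict, and predict dispatches to whichever _predict the subclass defines:

import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# both subclasses reuse the inherited fit/predict; only _predict differs
clf = KNNClassifier(k=3).fit(X, np.array([0, 0, 1, 1]))
reg = KNNRegressor(k=3).fit(X, np.array([0.1, 0.2, 0.9, 1.0]))
print(clf.predict(np.array([[0.9, 0.9]])))   # majority vote among 3 nearest -> [1]
print(reg.predict(np.array([[0.9, 0.9]])))   # mean of the 3 nearest targets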

(Figure: a green query point whose k nearest neighbors contain more blue points overall, but whose closest neighbor is red.)
Although KNN defaults to simple majority voting, that rule assumes all samples carry equal weight. Under majority voting, the green sample in the figure above would be predicted as blue; but if we weight each training sample by its distance to the query, the green sample would clearly be predicted as red instead. The weight is usually taken as the inverse of the distance.
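A small numeric sketch of how inverse-distance weights can flip the vote (the distances are made up): suppose k = 3 and the green point's neighbors are one red point at distance 0.5 and two blue points at distance 2.0:

import numpy as np

labels = np.array(["red", "blue", "blue"])   # labels of the 3 nearest neighbors
dists = np.array([0.5, 2.0, 2.0])            # their (made-up) distances

# plain majority vote: blue wins 2-to-1
values, counts = np.unique(labels, return_counts=True)
print(values[np.argmax(counts)])              # -> "blue"

# inverse-distance vote: red's one close vote (1/0.5 = 2.0) outweighs
# blue's two far votes (1/2.0 + 1/2.0 = 1.0)
w = 1 / dists
totals = {lab: w[labels == lab].sum() for lab in set(labels)}
print(max(totals, key=totals.get))            # -> "red"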

Below, my implementation is compared against sklearn's.

1. Classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3)
KNC = KNNClassifier(k=5)
KNC.fit(X_train, Y_train)
y_predict = KNC.predict(X_test)
print(KNC.score(X_test, Y_test))

KNC_ = KNeighborsClassifier(n_jobs=-1).fit(X_train, Y_train)
print(KNC_.score(X_test, Y_test))

2. Regression

from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

data = load_boston()
X = data.data
y = data.target

score1 = []
score2 = []
for i in range(100):
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3)
    # fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    KNR = KNNRegressor(k=5, weights=False)
    KNR.fit(X_train, Y_train)
    y_predict = KNR.predict(X_test)
    score1.append(KNR.score(X_test, Y_test))

    KNR_ = KNeighborsRegressor(n_neighbors=5, n_jobs=-1, weights='distance').fit(X_train, Y_train)
    score2.append(KNR_.score(X_test, Y_test))

print("my_code r2_score", np.mean(score1))
print("sklearn r2_score", np.mean(score2))
   


Summary

On classification, the results match sklearn's; on regression, they differed by about 0.1, which pointed to a problem in the regression code. The most likely culprit is the bug fixed above, where the distance-weighted branch sliced the first k distances instead of the distances of the k nearest neighbors; sklearn may also apply further optimizations, which is worth studying when time allows.
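For reference, a tiny sketch of that slicing bug (the distance values are made up): the first k entries of an unsorted distance list are generally not the k smallest:

import numpy as np

distances = np.array([3.0, 0.2, 2.5, 0.1, 0.3])   # made-up distances
nearest = np.argsort(distances)
k = 3
print(distances[:k])           # first k as stored: [3.  0.2 2.5]  (wrong weights)
print(distances[nearest[:k]])  # k smallest:        [0.1 0.2 0.3]  (correct weights)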
