knn自己实现(python)

前言:

对于knn来说,有两个hyperparameters(超参数:choices about the algorithm that we set rather than learn. Very problem-dependent, must try them all out and see what works best.),其一是怎么选取distance metric,其二是怎么选取k。

这儿说两种distance metric,一种是Manhattan metric, 也叫L1,另一种是Euclidean metric, 即L2。

这两种距离各有应用的方面,目前自己也在学习阶段,并不是特别清楚,但老师说,L1是coordinate dependence。

比如说你有一个关于员工的vector,里面的元素描述的是员工的工资,年龄,性别等等不同的方面,那么,此时

就可以考虑L1距离。

 

代码实现:

    步骤:

                1、导入数据集

                2、数据集分类(训练、测试)

                3、计算test instance与training set的distance

                4、根据distance的大小,选取k个neighbors   

                5、k个邻居进行投票(这儿只讨论binary knn)

    实现:

from sklearn.datasets import load_iris
from sklearn import cross_validation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter
from operator import itemgetter
import numpy as np
import math


# 1) given two data points, calculate the euclidean distance between them
def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))

 
# 2) given a training set and a test instance, use getDistance to calculate all pairwise distances
def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
 
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
 
    # extract only training instances without distance
    sorted_training_instances = [tuple[0] for tuple in sorted_distances]
 
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    return (training_instance, get_distance(test_instance, training_instance[0]))



def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common()[0][0]


def main():
 
    # load the data and create the training and test sets
    # random_state = 1 is just a seed to permit reproducibility of the train/test split,即设置种子,使得随机结果能够在以后的实验再次出现
    iris = load_iris()
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)
 
    # reformat train/test datasets for convenience
    train = np.array(list(zip(X_train,y_train)))
    test = np.array(list(zip(X_test, y_test)))
    
    #上面的train和test的数据结构如下
    """
        array([[array([5.8, 2.8, 5.1, 2.4]), 2],
       [array([6. , 2.2, 4. , 1. ]), 1],
       [array([5.5, 4.2, 1.4, 0.2]), 0],.....])
    """

 
    # generate predictions
    predictions = []
 
    # let's arbitrarily set k equal to 5, meaning that to predict the class of new instances,
    k = 5
 
    # for each instance in the test set, get nearest neighbours and majority vote on predicted class
    for x in range(len(X_test)):
 
            print('Classifying test instance number ',str(x),":")
            neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=5)
            majority_vote = get_majority_vote(neighbours)
            predictions.append(majority_vote)
            print('Predicted label=',str(majority_vote),', Actual label=',str(test[x][1]))
            
    # summarize performance of the classification
    y_test_new=y_test.reshape(-1,1)
    print('The overall accuracy of the model is: ',accuracy_score(y_test_new, predictions))
    report = classification_report(y_test, predictions, target_names = iris.target_names)
    print('A detailed classification report: \n\n',report)
 
if __name__ == "__main__":
    main()

 

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值