Machine Learning: KNN


Author: 数据科学家corten


Article link: https://blog.csdn.net/qq_37634812/article/details/78646920


1. KNN Classification Algorithm

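KNN classification (K-Nearest-Neighbors Classification), also known as the K-nearest-neighbor algorithm, is a conceptually very simple classifier that nevertheless often performs well.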

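Its core idea: to decide which class a test sample belongs to, find the K training samples closest to it in "distance", see which class most of those K samples belong to, and assign the test sample to that class. Put simply, the K most similar samples vote on the answer.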

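KNN can be used for regression as well as classification. To estimate an attribute of a sample, find its k nearest neighbors and assign the average of their values for that attribute to the sample. A more useful variant gives neighbors at different distances different weights, for example weights inversely proportional to distance.

One main weakness in classification is class imbalance. When one class has far more samples than the others, the K neighbors of a new sample may be dominated by the large class simply because it is more numerous. Since the algorithm only looks at the nearest neighbors, what should decide the result is how close a class's samples are to the target sample, not how many samples the class has. Weighting the votes, so that neighbors at a smaller distance get a larger weight, mitigates this problem.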
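The distance-weighted variant described above can be sketched as follows. This is a minimal illustration rather than the code from this post: the function name `weighted_knn`, the `mode` and `eps` parameters, and the toy data are assumptions made for the example, and the weight is taken to be 1 / (distance + eps). It requires Python 3.8+ for `math.dist`.

```python
import math
from collections import defaultdict


def weighted_knn(train, query, k, mode="classify", eps=1e-9):
    """train: list of (features, target) pairs; query: a feature tuple."""
    # Keep the k training points with the smallest Euclidean distance to the query.
    nearest = sorted(
        ((math.dist(features, query), target) for features, target in train),
        key=lambda pair: pair[0],
    )[:k]
    # Inverse-distance weights: closer neighbors count more (eps avoids division by zero).
    weights = [1.0 / (d + eps) for d, _ in nearest]

    if mode == "regress":
        # Weighted average of the neighbors' target values.
        return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)

    # Weighted vote: each neighbor adds its weight to its own class.
    votes = defaultdict(float)
    for w, (_, label) in zip(weights, nearest):
        votes[label] += w
    return max(votes, key=votes.get)


# Imbalanced toy data: one close "A" sample, two farther "B" samples.
toy = [((0.0,), "A"), ((1.1,), "B"), ((-0.9,), "B")]
print(weighted_knn(toy, (0.1,), k=3))  # the single close "A" outweighs the two "B" votes
```

With a plain majority vote the same query would be labeled "B" (two votes against one), which is the imbalance effect described above.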

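Another weakness is the computational cost: for every sample to be classified, the distance to all known samples must be computed in order to find its K nearest neighbors. A common remedy is to edit (prune) the known samples in advance, removing those that contribute little to classification. The algorithm is better suited to automatic classification of classes with relatively large sample sizes; classes with small sample sizes are more prone to misclassification with this method.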

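When implementing K-nearest neighbors, the main concern is how to search the training data for the K nearest neighbors quickly. This matters most when the feature space has many dimensions and the training set is large.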
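One common way to speed up the search is a k-d tree, which lets the search skip whole regions of the training set. The post does not name a specific structure, so the following is only a sketch of that technique under stated assumptions: `build_kdtree`, `knn_search`, and `nearest_neighbors` are hypothetical helper names, points are tuples of numbers, and Python 3.8+ is assumed for `math.dist`.

```python
import heapq
import math


def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def knn_search(node, query, k, heap=None):
    """Collect the k nearest points in a max-heap of (-distance, point)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    dist = math.dist(node["point"], query)
    if len(heap) < k:
        heapq.heappush(heap, (-dist, node["point"]))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, node["point"]))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn_search(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the current k-th best.
    if len(heap) < k or abs(diff) < -heap[0][0]:
        knn_search(far, query, k, heap)
    return heap


def nearest_neighbors(tree, query, k):
    """Convenience wrapper: list of (distance, point), closest first."""
    return sorted((-neg_d, p) for neg_d, p in knn_search(tree, query, k))


tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest_neighbors(tree, (9, 2), 2))  # the two closest points are (8, 1) and (7, 2)
```

In scikit-learn, `KNeighborsClassifier` accepts `algorithm="kd_tree"` (or `"ball_tree"`) for the same purpose. Tree-based pruning tends to help less as the number of dimensions grows.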

2. Dataset Description

Source: machine-learning-databases/iris (UCI Machine Learning Repository)

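Dataset information:

This is perhaps the best-known dataset in the pattern recognition literature. Fisher's paper is a classic in the field and is still cited frequently (see Duda & Hart, for example). The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter two are not linearly separable from each other.

Predicted attribute: class of iris plant.

This is the Iris dataset from UCI. It contains 150 sample records taken from the flowers of three iris species, setosa, versicolor, and virginica, with 50 records per class. Each record has 4 attributes: sepal length, sepal width, petal length, and petal width.

This is an exceedingly simple domain.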
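To make the record layout concrete, here is a small sketch that parses a few records in the comma-separated format of the UCI file (four numeric attributes followed by the class label). The three inline rows are the first record of each class; the helper name `parse_iris` and the attribute names are assumptions chosen for the example.

```python
import csv
import io

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Three records in the UCI file format, one per class.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_iris(text):
    """Return a list of (attribute dict, class label) pairs."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        values = [float(v) for v in row[:4]]
        records.append((dict(zip(FEATURES, values)), row[4]))
    return records


for attributes, label in parse_iris(SAMPLE):
    print(label, attributes)
```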

3. Complete Source Code

<pre code_snippet_id="1766654" snippet_file_name="blog_20160717_1_8836404" name="code" class="python"># -*- coding: UTF-8 -*-
'''
Created on 2016/7/17

@author: chen
'''
import csv       # read the CSV data file
import random    # random train/test split
import math      # square root for the Euclidean distance
import operator  # itemgetter, used to sort by distance and by vote count


# Load the dataset and split it randomly into a training set and a test set
def loadDataset(filename, split, trainingSet, testSet):
    with open(filename, "r", newline="") as csvfile:
        # Skip blank lines (the data file may end with an empty line)
        dataset = [row for row in csv.reader(csvfile) if row]
    for row in dataset:
        # The first four columns are numeric features; the last is the class label
        for y in range(4):
            row[y] = float(row[y])
        if random.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)


# Euclidean distance over the first `length` features
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)


# Return the k nearest neighbors of a test instance
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1  # exclude the class label
    # Distance from the test instance to every training instance
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, closest first
    distances.sort(key=operator.itemgetter(1))
    # Keep the k closest training instances
    neighbors = []
    for x in range(min(k, len(distances))):
        neighbors.append(distances[x][0])
    return neighbors


# Count the votes of the k neighbors and return the class with the most votes
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # Sort classes by vote count, highest first
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]


# Accuracy in percent
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0


def main():
    trainingSet = []  # training data
    testSet = []      # test data
    split = 0.67      # probability that a record goes into the training set
    loadDataset(r"../data/iris.txt", split, trainingSet, testSet)
    print("Train set: " + repr(len(trainingSet)))
    print("Test set: " + repr(len(testSet)))

    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print("> predicted = " + repr(result) + ", actual = " + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print("Accuracy: " + repr(accuracy) + "%")


if __name__ == "__main__":
    main()
</pre>
2. Using sklearn's KNN with cross-validation

2.1 Preparing the dataset

Dataset source: https://archive.ics.uci.edu/ml/datasets/Iris
The code and the dataset are available on the author's GitHub.
 
 
<pre class="python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split, cross_val_score


def load_data():
    names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
    # load the data (the file has no header row)
    path = '../dataset/knn/iris_data.txt'
    df = pd.read_csv(path, header=None, names=names)
    # print(df.head())
    # the first four columns are the features (DataFrame.ix was removed; use iloc)
    x = np.array(df.iloc[:, 0:4])
    y = np.array(df['class'])

    print(x.shape, y.shape)
    # returns x_train, x_test, y_train, y_test
    return train_test_split(x, y, test_size=0.33, random_state=40)
</pre>
     
 
2.2 Verifying prediction accuracy

<pre class="python">def predict():
    x_train, x_test, y_train, y_test = load_data()
    k = 3
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    pred = knn.predict(x_test)
    # fraction of test samples classified correctly
    print(accuracy_score(y_test, pred))
</pre>
     
2.3 Cross-validation

<pre class="python">def cross_validation():
    x_train, x_test, y_train, y_test = load_data()
    k_lst = list(range(1, 30))
    lst_scores = []

    for k in k_lst:
        knn = KNeighborsClassifier(n_neighbors=k)
        # mean accuracy over 10 folds of the training set
        scores = cross_val_score(knn, x_train, y_train, cv=10, scoring='accuracy')
        lst_scores.append(scores.mean())

    # convert accuracy to misclassification error (1 - accuracy)
    errors = [1 - x for x in lst_scores]
    optimal_k = k_lst[errors.index(min(errors))]
    print("The optimal number of neighbors is %d" % optimal_k)
    # plot misclassification error vs k
    # plt.plot(k_lst, errors)
    # plt.ylabel('Misclassification Error')
    plt.plot(k_lst, lst_scores)
    plt.xlabel('Number of Neighbors K')
    plt.ylabel('correct classification rate')
    plt.show()
</pre>
     
 
 
     
 
    
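Ways to improve KNN

Weighting: give the K neighbors different weights instead of counting every vote equally.
Distance metric: choose a distance measure that suits the actual problem.
Feature normalization: for example, with height and weight x = [180, 70], height has the larger numeric scale and therefore influences the distance more, so compute the mean of each feature separately and then normalize.
Dimensionality reduction: if the number of dimensions is too large, apply PCA first.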
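The feature-normalization step from the list above can be sketched with the standard library alone. Z-score standardization is one common choice and is an assumption here, since the post only says to compute each feature's mean and then normalize; the helper name `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [
        [(v - m) / s for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical [height_cm, weight_kg] samples
people = [[180, 70], [160, 50], [170, 90]]
scaled = standardize(people)
for original, row in zip(people, scaled):
    print(original, [round(v, 3) for v in row])
```

After scaling, both features have mean 0 and unit spread, so neither dominates the Euclidean distance. With scikit-learn, the same step is usually done with `StandardScaler`, which can be chained with `PCA` and `KNeighborsClassifier` through `make_pipeline`.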
