K-Nearest Neighbors (KNN)

1. Overview

The k-nearest neighbors algorithm classifies a sample by measuring the distances between its feature values and those of samples with known labels.

Characteristics of k-NN:

Pros: high accuracy; insensitive to outliers.

Cons: high computational complexity and high space complexity.

Applicable data: numeric and nominal values.

2. How It Works

  • There is a set of samples, also called the training set. Every sample in it carries a label, i.e. we know which class each sample in the set belongs to. After a new, unlabeled sample is input, each of its features is compared with the corresponding features of the samples in the training set, and the class labels of the most similar (nearest) samples are extracted.

3. Pseudocode

To classify a query point against a data set whose class labels are known:
1. Compute the distance between every point in the known-label data set and the query point.
2. Sort the distances in ascending order.
3. Select the K points closest to the query point.
4. Count how often each class appears among those K points.
5. Return the most frequent class among the K points as the predicted class of the query point.
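The five steps above can be sketched directly in a few lines of NumPy; this is a minimal illustration (function and variable names are my own, not the book's classify0), using the same four-point data set as the first example below:

```python
from collections import Counter
import numpy as np

def knn_predict(inX, dataSet, labels, k):
    # Step 1: distance from the query point to every known point
    dists = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    # Steps 2-3: indices of the k nearest points
    nearest = dists.argsort()[:k]
    # Steps 4-5: majority vote among their labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_predict(np.array([0, 0.2]), group, labels, 3))  # 'B'
```

The query point (0, 0.2) lies next to the two 'B' samples, so two of its three nearest neighbors vote 'B'.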

4. Examples

4.1 Example 1: a simple k-NN implementation (labeling a 2-D point)

1. Preparation: build the data with Python
from numpy import *
import operator

def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels

Call createDataSet() to package the data:

group,labels = createDataSet()

Output:

group

array([[1. , 1.1],
       [1. , 1. ],
       [0. , 0. ],
       [0. , 0.1]])

labels

    ['A', 'A', 'B', 'B']
2. The k-NN classifier (classify0)

# Four input parameters:
#    inX: the input vector to classify
#    dataSet: the training sample set
#    labels: the label vector
#    k: the number of nearest neighbors to use
# Requirement:
#    labels must contain as many elements as dataSet has rows
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet  # tile repeats inX dataSetSize times (it can also tile n rows by m columns), then subtract element-wise
    sqDiffMat = diffMat**2  # square each difference
    sqDistances = sqDiffMat.sum(axis=1)   # sum across each row (per pair of vectors)
    distances = sqDistances**0.5     # square root -> Euclidean distances
    sortedDistIndicies = distances.argsort()    # indices that would sort the distances in ascending order
    classCount={}          
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
  • The distance between any two vectors is the Euclidean distance (the square root of the sum of squared differences). Once the distances to all points are computed, the data can be sorted in ascending order. Then the classes of the k closest elements are determined (k is always a positive integer). Finally, the classCount dictionary is decomposed into a list of tuples, which is sorted by its second element using the itemgetter method of the operator module imported at the top of the program. The sort is in reverse, i.e. from largest to smallest, and the label with the highest frequency is returned.
result = classify0([1,1.3],group,labels,3)
result
  • Output: the point belongs to class A

    'A'

4.2 Example 2: classifying movies (romance vs. action)

  • Sample data set: the training set, with N samples.
  • One data set of unknown type (the target we need to classify): compute its distance to each of the N training samples, sort the distances, select the nearest K, and pick the class (movie genre) that occurs most often among those K. By this rule we decide which class our data belongs to.
1. How is the distance computed?
  • Euclidean distance
    • The distance between two vector points a1 and a2 is the ordinary straight-line distance between the points.
    • For example, the distance between two inputs with 4 features each, (1,2,4,6) and (2,3,5,4), is
      • sqrt((1-2)² + (2-3)² + (4-5)² + (6-4)²)
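The example distance can be checked in a couple of lines of Python:

```python
import math

a = (1, 2, 4, 6)
b = (2, 3, 5, 4)
# sum of squared per-feature differences, then the square root
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(dist)  # sqrt(1 + 1 + 1 + 4) = sqrt(7) ≈ 2.6458
```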
2. Implementation
  • Here we build the data set as a Python dict and then convert it to a DataFrame:
import pandas as pd

rowdata={'电影名称':['无问西东','后来的我们','前任3','红海行动','唐人街探案','战狼2'],
'打斗镜头':[1,5,12,108,112,115],
'接吻镜头':[101,89,97,5,9,8],
'电影类型':['爱情片','爱情片','爱情片','动作片','动作片','动作片']}
movie_data= pd.DataFrame(rowdata)
movie_data
    电影名称  打斗镜头  接吻镜头  电影类型
0   无问西东      1    101   爱情片
1  后来的我们      5     89   爱情片
2    前任3      12     97   爱情片
3   红海行动    108      5   动作片
4  唐人街探案    112      9   动作片
5    战狼2     115      8   动作片
# Compute the distance between every point in the known-label data set and the current point
new_data = [24,67]  # coordinates of the new movie
dist = list((((movie_data.iloc[:6,1:3]-new_data)**2).sum(1))**0.5)
movie_data.iloc[:6,1:3]
    打斗镜头  接吻镜头
0      1    101
1      5     89
2     12     97
3    108      5
4    112      9
5    115      8
(movie_data.iloc[:6,1:3]-new_data)
    打斗镜头  接吻镜头
0    -23     34
1    -19     22
2    -12     30
3     84    -62
4     88    -58
5     91    -59
(((movie_data.iloc[:6,1:3]-new_data)**2).sum(1))  # sum of squares across each row
0     1685
1      845
2     1044
3    10900
4    11108
5    11762
dtype: int64
list((((movie_data.iloc[:6,1:3]-new_data)**2).sum(1))**0.5)  # take the square root, then convert to a list
[41.048751503547585,
 29.068883707497267,
 32.31098884280702,
 104.4030650891055,
 105.39449701004318,
 108.45275469069469]
# Sort the distances in ascending order, then select the nearest K points
k = 4
dist_l = pd.DataFrame({'dist': dist, 'labels': (movie_data.iloc[:6, 3])})
dr = dist_l.sort_values(by = 'dist')[: k]
pd.DataFrame({'dist': dist, 'labels': (movie_data.iloc[:6, 3])})   # convert the dict-shaped data into a DataFrame
         dist labels
0   41.048752   爱情片
1   29.068884   爱情片
2   32.310989   爱情片
3  104.403065   动作片
4  105.394497   动作片
5  108.452755   动作片
dist_l.sort_values(by = 'dist')[: k]   # sort dist_l by the dist column (ascending by default); slice the first four rows
         dist labels
1   29.068884   爱情片
2   32.310989   爱情片
0   41.048752   爱情片
3  104.403065   动作片
# Count how often each class appears among the first k points
re = dr.loc[:,'labels'].value_counts()  # loc indexes by label name, i.e. the column names we defined; value_counts tallies each class
re.index[0]

Output:

'爱情片'
result = []
result.append(re.index[0])
result
['爱情片']
* Wrapping it into a function
import pandas as pd
"""
函数功能:KNN分类器
参数说明:
    inX:需要预测分类的数据集
    dataSet:已知分类标签的数据集(训练集)
    k:k-近邻算法参数,选择距离最小的k个点
返回:
    result:分类结果
"""

def classify0(inX,dataSet,k):
    result=[]
    dist = list((((movie_data.iloc[:6,1:3]-new_data)**2).sum(1))**0.5)
    dist_l = pd.DataFrame({'dist': dist, 'labels': (movie_data.iloc[:6, 3])})
    dr = dist_l.sort_values(by = 'dist')[: k]
    re = dr.loc[:,'labels'].value_counts()
    result.append(re.index[0])
    return result
* Load the data
inX = new_data
dataSet = movie_data
k= 4
* Run the classifier
classify0(inX,dataSet,k)

Output:

['爱情片']

4.3 Example 3: judging dating-site match quality

# Load the data
datingTest = pd.read_table('datingTestSet.txt',header=None)
datingTest.head()
       0          1         2           3
0  40920   8.326976  0.953952  largeDoses
1  14488   7.153469  1.673904  smallDoses
2  26052   1.441871  0.805124   didntLike
3  75136  13.147394  0.428964   didntLike
4  38344   1.669788  0.134296   didntLike
datingTest.shape     # shape of the data
(1000, 4)
datingTest.info()   # summary of the data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
0    1000 non-null int64
1    1000 non-null float64
2    1000 non-null float64
3    1000 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 31.4+ KB
Analyzing the data
# %matplotlib inline makes figures produced by matplotlib.pyplot's plot() (or any new figure canvas) render directly in the console/notebook output.
%matplotlib inline  
import matplotlib as mpl
import matplotlib.pyplot as plt
datingTest.shape[0]   # number of rows
1000
# iloc: select rows by integer position (e.g. take the second row)
# loc: select rows by the values in the index
datingTest.iloc[1,-1]    # the last value of the second row
'smallDoses'
datingTest.iloc[:,1]   # the second column, all rows
0       8.326976
1       7.153469
2       1.441871
3      13.147394
4       1.669788
         ...    
995     3.410627
996     9.974715
997    10.650102
998     9.134528
999     7.882601
Name: 1, Length: 1000, dtype: float64
# Distinguish the labels by color
Colors = []
for i in range(datingTest.shape[0]):
    m = datingTest.iloc[i,-1]
    if m=='didntLike':
        Colors.append('black')
    if m=='smallDoses':
        Colors.append('orange')
    if m=='largeDoses':
        Colors.append('red')
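The if-chain above can also be written as a dictionary lookup over the three label values; a small self-contained sketch (the sample label list here is illustrative):

```python
# map each class label to a plot color, same mapping as the if-chain
color_map = {'didntLike': 'black', 'smallDoses': 'orange', 'largeDoses': 'red'}
sample_labels = ['didntLike', 'smallDoses', 'largeDoses', 'smallDoses']
colors = [color_map[m] for m in sample_labels]
print(colors)  # ['black', 'orange', 'red', 'orange']
```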

# Scatter plots of every pair of features
plt.rcParams['font.sans-serif']=['Simhei'] # use the SimHei font so the Chinese axis labels render
pl=plt.figure(figsize=(12,8))

fig1=pl.add_subplot(221)
plt.scatter(datingTest.iloc[:,1],datingTest.iloc[:,2],marker='.',c=Colors)
plt.xlabel('玩游戏视频所占时间比')
plt.ylabel('每周消费冰淇淋公升数')

fig2=pl.add_subplot(222)
plt.scatter(datingTest.iloc[:,0],datingTest.iloc[:,1],marker='.',c=Colors)
plt.xlabel('每年飞行常客里程')
plt.ylabel('玩游戏视频所占时间比')

fig3=pl.add_subplot(223)
plt.scatter(datingTest.iloc[:,0],datingTest.iloc[:,2],marker='.',c=Colors)
plt.xlabel('每年飞行常客里程')
plt.ylabel('每周消费冰淇淋公升数')
plt.show()

[Figure: the three pairwise scatter plots, points colored by class]

# Min-max normalization
def minmax(dataSet):
    minDf = dataSet.min()
    maxDf = dataSet.max()
    normSet = (dataSet - minDf )/(maxDf - minDf)
    return normSet
#  Re-attach the labels to the normalized features
datingT = pd.concat([minmax(datingTest.iloc[:, :3]), datingTest.iloc[:,3]], axis=1)  
datingT.head()
          0         1         2           3
0  0.448325  0.398051  0.562334  largeDoses
1  0.158733  0.341955  0.987244  smallDoses
2  0.285429  0.068925  0.474496   didntLike
3  0.823201  0.628480  0.252489   didntLike
4  0.420102  0.079820  0.078578   didntLike
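To see what minmax does on its own, here is a tiny self-contained example (the column names and values are made up for illustration):

```python
import pandas as pd

def minmax(dataSet):
    # scale each column to [0, 1]: (x - min) / (max - min)
    minDf = dataSet.min()
    maxDf = dataSet.max()
    return (dataSet - minDf) / (maxDf - minDf)

toy = pd.DataFrame({'miles': [0, 500, 1000], 'hours': [2.0, 3.0, 4.0]})
print(minmax(toy))
# both columns become 0.0, 0.5, 1.0
```

Scaling matters here because the flight-miles column is several orders of magnitude larger than the other two features and would otherwise dominate the Euclidean distance.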
"""
    函数功能:切分训练集与数据集
    参数说明:
        dataSet:原始数据集
        rate:训练集所占的比例
    返回:切分好的训练集和数据集
"""
def randSplit(dataSet,rate=0.9):
    n = dataSet.shape[0]
    m = int(n*rate)
    train = dataSet.iloc[:m,:]
    test = dataSet.iloc[m:,:]   # the rows are in no particular order, so no random selection is needed
    test.index = range(test.shape[0])  # re-index the test set from 0
    return train,test
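randSplit simply takes the first 90% of rows because this data set happens to be unordered. If the rows were sorted (e.g. by label), a safer variant would shuffle first; a sketch under that assumption (the function name is my own):

```python
import pandas as pd

def randSplitShuffled(dataSet, rate=0.9, seed=0):
    # shuffle all rows first, then split, so the split is random even for ordered data
    shuffled = dataSet.sample(frac=1, random_state=seed).reset_index(drop=True)
    m = int(shuffled.shape[0] * rate)
    return (shuffled.iloc[:m, :].reset_index(drop=True),
            shuffled.iloc[m:, :].reset_index(drop=True))

demo = pd.DataFrame({'a': range(10), 'b': range(10, 20)})
tr, te = randSplitShuffled(demo)
print(tr.shape, te.shape)  # (9, 2) (1, 2)
```

Fixing random_state keeps the split reproducible between runs.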
train,test = randSplit(datingT)
train
            0         1         2           3
0    0.448325  0.398051  0.562334  largeDoses
1    0.158733  0.341955  0.987244  smallDoses
2    0.285429  0.068925  0.474496   didntLike
3    0.823201  0.628480  0.252489   didntLike
4    0.420102  0.079820  0.078578   didntLike
..        ...       ...       ...         ...
895  0.243665  0.486131  0.979099  largeDoses
896  0.165350  0.000000  0.808206  smallDoses
897  0.054967  0.359158  0.080380  smallDoses
898  0.111106  0.393932  0.058181  smallDoses
899  0.389710  0.698530  0.735519  largeDoses

900 rows × 4 columns

test
           0         1         2           3
0   0.513766  0.170320  0.262181   didntLike
1   0.089599  0.154426  0.785277  smallDoses
2   0.611167  0.172689  0.915245   didntLike
3   0.012578  0.000000  0.195477  smallDoses
4   0.110241  0.187926  0.287082  smallDoses
..       ...       ...       ...         ...
95  0.122106  0.163037  0.372224  smallDoses
96  0.754287  0.476818  0.394621   didntLike
97  0.291159  0.509103  0.510795  largeDoses
98  0.527111  0.436655  0.429005  largeDoses
99  0.479408  0.376809  0.785718  largeDoses

100 rows × 4 columns

"""
    函数功能:分类器
    
"""
def datingClass(train,test,k):
      = train.shape[1] - 1   # 取出训练集(原始数据)标签外的所有列
    m = test.shape[0]        # 获取测试集的个数
    result = []    # 存放结果
    for i in range(m):
        dist = list((((train.iloc[:, :n] - test.iloc[i, :n]) ** 2).sum(1))**5)
        dist_l = pd.DataFrame({'dist': dist, 'labels': (train.iloc[:, n])})
        dr = dist_l.sort_values(by = 'dist')[: k]
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])  # 加入result
    result = pd.Series(result)     # 吧列表 result 结果转换格式为 Series格式
    test['predict'] = result    # 追加到测试集 变成 DataFrame 格式增加新的一列
    acc = (test.iloc[:,-1]==test.iloc[:,-2]).mean()   # 确认准确率 
    print(f'模型预测准确率为{acc}')
    return test
datingClass(train,test,5)
模型预测准确率为0.95

           0         1         2           3     predict
0   0.513766  0.170320  0.262181   didntLike   didntLike
1   0.089599  0.154426  0.785277  smallDoses  smallDoses
2   0.611167  0.172689  0.915245   didntLike   didntLike
3   0.012578  0.000000  0.195477  smallDoses  smallDoses
4   0.110241  0.187926  0.287082  smallDoses  smallDoses
..       ...       ...       ...         ...         ...
95  0.122106  0.163037  0.372224  smallDoses  smallDoses
96  0.754287  0.476818  0.394621   didntLike   didntLike
97  0.291159  0.509103  0.510795  largeDoses  largeDoses
98  0.527111  0.436655  0.429005  largeDoses  largeDoses
99  0.479408  0.376809  0.785718  largeDoses  largeDoses

100 rows × 5 columns
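The run above fixes k = 5; in practice it is worth comparing several values of k on held-out data. A self-contained sketch on synthetic two-cluster data (the data set and all names here are made up for illustration, not the dating set), following the same distance-plus-majority-vote scheme as datingClass:

```python
import numpy as np
import pandas as pd

def knn_classify(train, test, k):
    # Euclidean distance + majority vote, as in datingClass above
    n = train.shape[1] - 1                     # last column holds the label
    preds = []
    for i in range(test.shape[0]):
        dist = (((train.iloc[:, :n] - test.iloc[i, :n]) ** 2).sum(1)) ** 0.5
        dr = pd.DataFrame({'dist': dist.values,
                           'labels': train.iloc[:, n].values}).sort_values(by='dist')[:k]
        preds.append(dr['labels'].value_counts().index[0])
    return pd.Series(preds)

# two well-separated clusters around (0,0) and (1,1)
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(1.0, 0.1, (20, 2))])
data = pd.DataFrame({'x': pts[:, 0], 'y': pts[:, 1],
                     'label': ['A'] * 20 + ['B'] * 20})
train = data.iloc[::2].reset_index(drop=True)   # even rows for training
test = data.iloc[1::2].reset_index(drop=True)   # odd rows for testing

for k in (1, 3, 5):
    acc = (knn_classify(train, test, k).values == test['label'].values).mean()
    print(k, acc)
```

On such clearly separated clusters every k gives perfect accuracy; on real data like the dating set, too small a k is noise-sensitive and too large a k blurs class boundaries.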
