Shocking! A Programmer Actually Used KNN to Screen Blind Dates for His Goddess & KNN Principles & Code Implementation


Shocking

Xiaomei, the goddess next door, has reached marrying age, but every blind date she goes on turns out to be a weirdo. So she has decided to ask her childhood friend and male bestie, a single programmer...

You!!!!!

...to write a program that pre-screens her blind dates.
You: ???????


Shocking

A programmer actually used k-nearest neighbors [(K-NearestNeighbor), better known as KNN, one of the simplest classification techniques in data mining. "K nearest neighbors" means exactly what it says: the K closest neighbors, the idea being that every sample can be represented by its K nearest neighbors. The algorithm classifies each record in a dataset by looking at those neighbors] to find blind dates for his goddess!


Let's start with the pros and cons; if KNN doesn't meet your needs, you can stop reading here and save yourself some time.
Pros:

  • The theory is simple, easy to understand, and straightforward to implement.
  • It can be used for classification as well as regression.
  • It works remarkably well for classifying rare events.
  • It handles both numerical and nominal data.

Cons:

  • High time and space complexity; the computation is heavy, so it does not scale to large datasets.
  • It suffers from class-imbalance problems.
  • The model offers little interpretability and gives no insight into the internal meaning of the data.

Theory of the Algorithm

k-nearest neighbors (KNN):
Given a training set whose class labels are known, take an unlabeled test sample and find the k instances in the training set nearest to it. If the majority of those k instances belong to class A, the test sample is assigned to class A as well.

We use the k points nearest to Y to decide which class Y belongs to.

As shown in the figure, we already have red points and blue points; the task is to decide which class the green point in the center belongs to.

if   k == 3:  because blue = 1 and red = 2, and 2 > 1, therefore green = red

elif k == 5:  because blue = 3 and red = 2, and 3 > 2, therefore green = blue
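That "count the labels and take the majority" step is the voting at the heart of KNN. A minimal sketch of it in Python (the neighbor labels below are made up for illustration):

from collections import Counter

def majority_vote(neighbor_labels):
    # return the most frequent label among the k nearest neighbors
    return Counter(neighbor_labels).most_common(1)[0][0]

# hypothetical labels of the 5 nearest neighbors: three blue, two red
print(majority_vote(['blue', 'blue', 'blue', 'red', 'red']))  # -> blue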

Oh~~
Understood. But how do we measure distance?
A:
Distance metrics. In a two-dimensional plane, the distance between two points is
$|AB| = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
When we move to more dimensions, we simply keep appending coordinates (z, q, w, p, and so on) in the same way.
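In other words, for n-dimensional points you sum the squared differences over every coordinate before taking the square root. A quick sketch in plain Python (standard library only):

def euclidean(a, b):
    # Euclidean distance between two equal-length coordinate sequences
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

print(euclidean((1, 101), (24, 67)))        # 2-D: ~41.05
print(euclidean((1, 101, 7), (24, 67, 3)))  # same formula in 3-D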

Example

import pandas as pd

# Columns: 电影名称 = movie title, 打斗镜头 = fight scenes,
# 接吻镜头 = kissing scenes, 电影类型 = genre (爱情片 = romance, 动作片 = action)
data = {'电影名称':['无问西东','后来的我们','前任三','红海行动','唐人街探案','战狼2'],
        '打斗镜头':[1,5,12,108,112,115],
        '接吻镜头':[101,89,97,5,9,8],
        '电影类型':['爱情片','爱情片','爱情片','动作片','动作片','动作片']
        }
data = pd.DataFrame(data)
data
    电影名称  打斗镜头  接吻镜头  电影类型
0   无问西东      1    101   爱情片
1  后来的我们      5     89   爱情片
2    前任三     12     97   爱情片
3   红海行动    108      5   动作片
4  唐人街探案    112      9   动作片
5    战狼2    115      8   动作片

OK

Next, feed in one new record:

['巴啦啦小魔仙全身变', 24, 67]

and write a program to decide whether it is a romance film or an action film.

# First, compute the distance from "小魔仙" to each of the other points
xian_movie = [24,67]
xian_movie = list((((data.iloc[:,1:3]-xian_movie)**2).sum(1))**0.5)
xian_movie
[41.048751503547585,
 29.068883707497267,
 32.31098884280702,
 104.4030650891055,
 105.39449701004318,
 108.45275469069469]

These are the distances from our "小魔仙" to all the other points.
Next we pick the k points with the smallest distances; here we set k to 4.

# pair each distance with its label, sort ascending, keep the 4 nearest
sorted_data = pd.DataFrame({'movie':xian_movie,'labels':(data.iloc[:6,3])})
sorted_data = sorted_data.sort_values(by = 'movie')[:4]
sorted_data
        movie labels
1   29.068884    爱情片
2   32.310989    爱情片
0   41.048752    爱情片
3  104.403065    动作片

With the top k collected, we count how often each class appears among them and take the most frequent class as our prediction.

freq = sorted_data.loc[:,'labels'].value_counts()
freq
爱情片    3
动作片    1
Name: labels, dtype: int64
print('巴啦啦小魔仙全身变 belongs to {}'.format(freq.index[0]))
巴啦啦小魔仙全身变 belongs to 爱情片

However... does "巴啦啦小魔仙" really belong to the romance genre? There are plenty of problems hiding in this approach, and solving them will take effort from all of you. Keep at it!

Life is hard; the kitty sighs.

Real Example

Background:
Xiaomei has reached marrying age, but not everyone appeals to her, and some candidates are clearly not her type. So she asks her childhood friend and male bestie, the single programmer Xiaobai, to write a program to pre-screen them.

Xiaobai's screening rules are as follows:

Based on

  • frequent-flyer miles earned per year
  • fraction of time spent on games and videos
  • liters of ice cream consumed per week

Xiaobai sorts blind-date candidates into

  • didntLike -------- people she really dislikes
  • smallDoses ------- people she dislikes
  • largeDoses ------- people she only slightly dislikes

import pandas as pd

# datingTestSet.txt: 1000 samples, 3 feature columns + 1 label column
data = pd.read_table('./data/data45246/datingTestSet.txt',header=None)
print(data.shape)
data.head()

(1000, 4)
       0          1         2           3
0  40920   8.326976  0.953952  largeDoses
1  14488   7.153469  1.673904  smallDoses
2  26052   1.441871  0.805124   didntLike
3  75136  13.147394  0.428964   didntLike
4  38344   1.669788  0.134296   didntLike
import matplotlib as mpl
import matplotlib.pyplot as plot 

# color each point by its label
Colors = []
for i in range(data.shape[0]):
    m = data.iloc[i,-1]
    if m == 'didntLike':
        Colors.append('black')
    elif m == 'smallDoses':
        Colors.append('orange')
    elif m == 'largeDoses':
        Colors.append('red')

plot.rcParams['font.sans-serif'] = ['SimHei']  # only needed for Chinese labels
pl = plot.figure(figsize = (12,8))

# scatter plots of each pair of features
fig1 = pl.add_subplot(221)
plot.scatter(data.iloc[:,1],data.iloc[:,2],marker='.',c = Colors)
plot.xlabel('game & video time ratio')
plot.ylabel('liters of ice cream per week')

fig2 = pl.add_subplot(222)
plot.scatter(data.iloc[:,0],data.iloc[:,1],marker = '.',c = Colors)
plot.xlabel('plane')
plot.ylabel('game & video')

fig3 = pl.add_subplot(223)
plot.scatter(data.iloc[:,0],data.iloc[:,2],marker = '.',c = Colors)
plot.xlabel('plane')
plot.ylabel('icecream')
plot.show()


[Figure: scatter plots of the three feature pairs, points colored by label]

The third plot isn't especially revealing, but the second one gives me some ideas; put yours in the comments.
In this article, though, we use the plots only to inspect the data.

While inspecting the data we noticed that flight miles are on the order of tens of thousands, while weekly ice-cream consumption rarely exceeds 1.

The consequence: classification accuracy would be determined almost entirely by the large-valued flight miles, while the small-valued ice-cream feature would barely get a say.

So

we need to normalize, so that the three features carry equal weight.

There are many ways to normalize (Sigmoid normalization, Z-score normalization, ...); here we use the simplest, "0-1 normalization":
0-1 normalization: x' = (x - min) / (max - min)

def minmax(dataset):
    # 0-1 normalization: squeeze every column into [0, 1]
    min_d = dataset.min()
    max_d = dataset.max()
    normdataset = (dataset-min_d)/(max_d-min_d)
    return normdataset
data_t = pd.concat([minmax(data.iloc[:,:3]),data.iloc[:,3]],axis=1)
data_t.head()
          0         1         2           3
0  0.448325  0.398051  0.562334  largeDoses
1  0.158733  0.341955  0.987244  smallDoses
2  0.285429  0.068925  0.474496   didntLike
3  0.823201  0.628480  0.252489   didntLike
4  0.420102  0.079820  0.078578   didntLike
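For comparison, the Z-score normalization mentioned above standardizes each column to mean 0 and standard deviation 1; a minimal sketch (not used in the rest of this article):

def zscore(dataset):
    # standardize every column: subtract the mean, divide by the std
    return (dataset - dataset.mean()) / dataset.std()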

Now it's time to split the data into a training set and a test set.

def randSplit(data,rate = 0.8):
    # NOTE: despite the name, this is a sequential split - the first 80%
    # of rows become the training set. It implicitly assumes the rows are
    # already in random order, which happens to hold for this file.
    n = data.shape[0]
    m = int(n*rate)
    train = data.iloc[:m,:]
    test = data.iloc[m:,:]
    test.index = range(test.shape[0])
    return train,test
train,test = randSplit(data_t)
print(train.head())
test.head()
          0         1         2           3
0  0.448325  0.398051  0.562334  largeDoses
1  0.158733  0.341955  0.987244  smallDoses
2  0.285429  0.068925  0.474496   didntLike
3  0.823201  0.628480  0.252489   didntLike
4  0.420102  0.079820  0.078578   didntLike
          0         1         2           3
0  0.565951  0.080003  0.431779   didntLike
1  0.033197  0.225597  0.412308  smallDoses
2  0.391080  0.492596  0.929750  largeDoses
3  0.016840  0.099760  0.707143  smallDoses
4  0.099515  0.302984  0.667006  smallDoses
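If you can't count on the rows arriving pre-shuffled, a genuinely random split is easy with pandas. A minimal sketch (random_split and the seed value are mine, not part of the original code):

def random_split(data, rate=0.8, seed=42):
    # shuffle all rows, then cut at the `rate` mark
    shuffled = data.sample(frac=1, random_state=seed).reset_index(drop=True)
    m = int(shuffled.shape[0] * rate)
    return shuffled.iloc[:m, :], shuffled.iloc[m:, :].reset_index(drop=True)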

Building the Classifier

def data_classing(train,test,k):
    # number of feature columns (the last column holds the label)
    n = train.shape[1]-1
    m = test.shape[0]
    result = []
    for i in range(m):
        # Euclidean distance from test sample i to every training sample
        dist = list((((train.iloc[:,:n]-test.iloc[i,:n])**2).sum(1))**0.5)
        dist1 = pd.DataFrame({'dist':dist,'labels':(train.iloc[:,n])})
        # keep the k nearest neighbors and let them vote
        dr = dist1.sort_values(by = 'dist')[:k]
        re = dr.loc[:,'labels'].value_counts()
        result.append(re.index[0])
    result = pd.Series(result)
    test['predict'] = result
    acc = (test.iloc[:,-1]==test.iloc[:,-2]).mean()
    print('Model accuracy: {}'.format(acc))
    return test
data_classing(train,test,5)
Model accuracy: 0.945
            0         1         2           3     predict
0    0.565951  0.080003  0.431779   didntLike   didntLike
1    0.033197  0.225597  0.412308  smallDoses  smallDoses
2    0.391080  0.492596  0.929750  largeDoses  largeDoses
3    0.016840  0.099760  0.707143  smallDoses  smallDoses
4    0.099515  0.302984  0.667006  smallDoses  smallDoses
5    0.523090  0.394607  0.418764  largeDoses  largeDoses
6    0.781962  0.768050  0.574250   didntLike   didntLike
7    0.416114  0.082374  0.182566   didntLike   didntLike
8    0.464376  0.180935  0.516860   didntLike   didntLike
9    0.248562  0.122258  0.072347   didntLike  smallDoses
10   0.432899  0.470962  0.645680  largeDoses  largeDoses
11   0.130214  0.175873  0.918369  smallDoses  smallDoses
12   0.054167  0.467973  0.502735  smallDoses  smallDoses
13   0.802318  0.715080  0.310177   didntLike   didntLike
14   0.192664  0.534536  0.759757  largeDoses  largeDoses
15   0.753092  0.359884  0.977555   didntLike   didntLike
16   0.151392  0.251146  0.812960  smallDoses  smallDoses
17   0.346893  0.666691  0.841321  largeDoses  largeDoses
18   0.949744  0.743678  0.843311   didntLike   didntLike
19   0.473459  0.596747  0.405673  largeDoses  largeDoses
20   0.263944  0.110773  0.832161   didntLike   didntLike
21   0.575680  0.481359  0.451476  largeDoses  largeDoses
22   0.677758  0.276884  0.952764   didntLike   didntLike
23   0.524832  0.197828  0.280246   didntLike   didntLike
24   0.407459  0.618065  0.178960  largeDoses  largeDoses
25   0.065879  0.448304  0.180738  smallDoses  smallDoses
26   0.298259  0.399695  0.969125  largeDoses  largeDoses
27   0.756270  0.379525  0.781348   didntLike   didntLike
28   0.861613  0.513180  0.415869   didntLike   didntLike
29   0.331467  0.554163  0.168224  largeDoses  largeDoses
..        ...       ...       ...         ...         ...
170  0.588465  0.580790  0.819148  largeDoses  largeDoses
171  0.705258  0.437379  0.515681   didntLike   didntLike
172  0.101772  0.462088  0.808077  smallDoses  largeDoses
173  0.664085  0.173051  0.169156   didntLike   didntLike
174  0.200914  0.250428  0.739211  smallDoses  smallDoses
175  0.250293  0.703453  0.886825  largeDoses  largeDoses
176  0.818161  0.690544  0.714136   didntLike   didntLike
177  0.374076  0.650571  0.214290  largeDoses  largeDoses
178  0.155062  0.150176  0.249725  smallDoses  smallDoses
179  0.102188  0.000000  0.070700  smallDoses  smallDoses
180  0.208068  0.021738  0.609152  smallDoses  smallDoses
181  0.100720  0.024394  0.008994  smallDoses  smallDoses
182  0.025035  0.184718  0.363083  smallDoses  smallDoses
183  0.104007  0.321426  0.331622  smallDoses  smallDoses
184  0.025977  0.205043  0.006732  smallDoses  smallDoses
185  0.152981  0.000000  0.847443  smallDoses  smallDoses
186  0.025188  0.178477  0.411431  smallDoses  smallDoses
187  0.057651  0.095729  0.813893  smallDoses  smallDoses
188  0.051045  0.119632  0.108045  smallDoses  smallDoses
189  0.192631  0.305083  0.516670  smallDoses  smallDoses
190  0.304033  0.408557  0.075279  largeDoses  largeDoses
191  0.108115  0.128827  0.254764  smallDoses  smallDoses
192  0.200859  0.188880  0.196029  smallDoses  smallDoses
193  0.041414  0.471152  0.193598  smallDoses  smallDoses
194  0.199292  0.098902  0.253058  smallDoses  smallDoses
195  0.122106  0.163037  0.372224  smallDoses  smallDoses
196  0.754287  0.476818  0.394621   didntLike   didntLike
197  0.291159  0.509103  0.510795  largeDoses  largeDoses
198  0.527111  0.436655  0.429005  largeDoses  largeDoses
199  0.479408  0.376809  0.785718  largeDoses  largeDoses

[200 rows × 5 columns]

The accuracy hit 0.945! Is KNN really that strong?

Just kidding: this is a one-in-a-hundred dataset that happens to suit KNN perfectly. Now go and find out whether your own dataset gets along with KNN!
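If you do, a quick sanity check for a hand-rolled classifier is to compare it with scikit-learn. A sketch, assuming scikit-learn is installed (results may differ slightly from data_classing because of tie-breaking):

from sklearn.neighbors import KNeighborsClassifier

# same 80/20 split as above; features are columns 0-2, labels column 3
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train.iloc[:, :3], train.iloc[:, 3])
print('sklearn accuracy:', knn.score(test.iloc[:, :3], test.iloc[:, 3]))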


Don't forget: no freeloading, leave a like!
