K-Nearest Neighbors (KNN)
0. Introduction
How do we classify movies?
As everyone knows, movies can be grouped by genre, but how is a genre itself defined? Who decides which genre a particular film belongs to? In other words, what features do films of the same genre have in common? All of these questions must be considered when classifying movies. No filmmaker will admit that their film resembles an earlier one, yet we do know that every film's style may well be close to other films of the same genre. So what shared traits make action films so similar to each other, yet clearly different from romance films? Action films may contain kissing scenes and romance films may contain fight scenes, so we cannot judge a film's genre merely by whether a fight or a kiss appears. But romance films contain more kissing scenes and action films contain more fight scenes, so the number of times such scenes occur in a film can be used to classify it.
This chapter introduces our first machine learning algorithm, k-nearest neighbors. It is highly effective and easy to master.
1. How the k-Nearest Neighbors Algorithm Works
Simply put, k-NN classifies a sample by measuring the distances between feature values.
- Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
- Cons: high time complexity, high space complexity.
- Applicable data: numeric and nominal values.
How it works
There is a set of samples, called the training set, in which every sample carries a label; that is, we know which class each sample in the set belongs to. When a new, unlabeled sample arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). Usually we consider only the k most similar samples (this is where the k in k-nearest neighbors comes from), and k is typically an integer no greater than 20. Finally, the most frequent class among those k samples is assigned to the new sample.
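The voting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later; the `knn_predict` helper and the toy data are made up for this example:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify one sample by majority vote among its k nearest neighbours."""
    # Euclidean distance from x to every training sample
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # majority vote among the corresponding labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy data: [fight scenes, kiss scenes] per film
X_train = np.array([[36, 1], [43, 2], [0, 10], [59, 1], [1, 15], [2, 19]])
y_train = np.array(['action', 'action', 'romance', 'action', 'romance', 'romance'])

print(knn_predict(X_train, y_train, np.array([50, 1])))  # the 3 nearest films are all action
```

With k=3, the three closest training films to (50, 1) are all action films, so the vote is unanimous.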
Returning to the movie example, we can use k-NN to classify romance and action films. Someone once counted the fight and kiss scenes in many movies; the data below shows those counts for six films. Given a movie we have not yet seen, how do we decide whether it is a romance or an action film? We can use k-NN to solve this problem.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
Project 1: Classifying movies by genre
In [2]:
# Load the movie data
# Read the second sheet of the Excel file
movie = pd.read_excel('../data/movies.xlsx',sheet_name=1)
movie
Out[2]:
| | 电影名称 | 武打镜头 | 接吻镜头 | 分类情况 |
|---|---|---|---|---|
| 0 | 大话西游 | 36 | 1 | 动作片 |
| 1 | 杀破狼 | 43 | 2 | 动作片 |
| 2 | 前任3 | 0 | 10 | 爱情片 |
| 3 | 战狼2 | 59 | 1 | 动作片 |
| 4 | 泰坦尼克号 | 1 | 15 | 爱情片 |
| 5 | 星语心愿 | 2 | 19 | 爱情片 |
In [3]:
# Slice the data by position: features and labels
X = movie.iloc[:,1:3]
y = movie['分类情况']
# X is a DataFrame, y is a Series
display(X,y)
| | 武打镜头 | 接吻镜头 |
|---|---|---|
| 0 | 36 | 1 |
| 1 | 43 | 2 |
| 2 | 0 | 10 |
| 3 | 59 | 1 |
| 4 | 1 | 15 |
| 5 | 2 | 19 |

0    动作片
1    动作片
2    爱情片
3    动作片
4    爱情片
5    爱情片
Name: 分类情况, dtype: object
In [4]:
# n_neighbors: the number of neighbours must not exceed the number of samples
knn = KNeighborsClassifier(n_neighbors = 6,weights='distance')
# Train the model
knn.fit(X,y)
Out[4]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=6, p=2, weights='distance')
In [5]:
# Predict
# Data passed for training or prediction must be 2-D
X_test = np.array([[50,1],[30,25],[15,80]])
knn.predict(X_test)
Out[5]:
array(['动作片', '动作片', '爱情片'], dtype=object)
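As noted in the comment above, scikit-learn expects a 2-D array of shape (n_samples, n_features) even for a single sample. A minimal sketch of the fix with `reshape(1, -1)`; the toy data here mirrors the movie table above, with made-up English labels:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data mirroring the movie table: [fight scenes, kiss scenes]
X = np.array([[36, 1], [43, 2], [0, 10], [59, 1], [1, 15], [2, 19]])
y = np.array(['action', 'action', 'romance', 'action', 'romance', 'romance'])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

sample = np.array([50, 1])         # shape (2,): 1-D, predict() would reject this
sample_2d = sample.reshape(1, -1)  # shape (1, 2): one sample with two features
print(knn.predict(sample_2d))      # an array holding one predicted label
```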
Project 2: Iris flower classification
In [6]:
# sklearn ships with built-in datasets; just import the module
from sklearn import datasets
In [7]:
iris = datasets.load_iris()
iris
. . .
In [8]:
X = iris['data']
y = iris['target']
display(X,y)
. . .
In [9]:
# Shuffle the data via a shuffled index array
index = np.arange(150)
np.random.shuffle(index)
index
Out[9]:
array([ 13, 11, 15, 7, 51, 91, 81, 24, 31, 76, 112, 47, 124, 6, 71, 109, 93, 36, 149, 113, 56, 46, 73, 110, 66, 138, 75, 148, 131, 16, 99, 120, 23, 49, 103, 135, 86, 1, 19, 33, 18, 114, 67, 97, 58, 82, 144, 21, 17, 137, 106, 136, 87, 30, 126, 108, 43, 78, 107, 74, 102, 79, 95, 62, 123, 140, 147, 142, 143, 55, 59, 48, 44, 72, 129, 32, 39, 8, 4, 60, 61, 57, 146, 116, 35, 41, 92, 3, 85, 89, 9, 132, 121, 28, 42, 40, 53, 27, 94, 96, 128, 10, 22, 100, 12, 88, 115, 141, 105, 84, 65, 122, 37, 52, 101, 125, 118, 77, 69, 119, 68, 54, 127, 104, 117, 64, 83, 130, 20, 45, 70, 0, 139, 63, 50, 80, 145, 90, 34, 98, 134, 14, 2, 5, 25, 133, 26, 29, 111, 38])
In [10]:
X = X[index]
y = y[index]
y
Out[10]:
array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 2, 0, 2, 0, 1, 2, 1, 0, 2, 2, 1, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 2, 0, 0, 2, 2, 1, 0, 0, 0, 0, 2, 1, 1, 1, 1, 2, 0, 0, 2, 2, 2, 1, 0, 2, 2, 0, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 0, 0, 1, 2, 0, 0, 0, 0, 1, 1, 1, 2, 2, 0, 0, 1, 0, 1, 1, 0, 2, 2, 0, 0, 0, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1, 2, 2, 2, 1, 1, 2, 0, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 0, 0, 1, 0, 2, 1, 1, 1, 2, 1, 0, 1, 2, 0, 0, 0, 0, 2, 0, 0, 2, 0])
In [11]:
# Split the data in two and use one part as the training set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X[:120],y[:120])
# Predict the classes of the remaining samples
y_ = knn.predict(X[-30:])
y_
Out[11]:
array([1, 1, 2, 2, 2, 1, 2, 2, 0, 0, 2, 0, 2, 1, 1, 1, 2, 1, 0, 1, 2, 0, 0, 0, 0, 2, 0, 0, 2, 0])
In [12]:
# Compare the true labels with the predictions
y[-30:]
Out[12]:
array([1, 1, 2, 2, 2, 1, 1, 2, 0, 0, 1, 0, 2, 1, 1, 1, 2, 1, 0, 1, 2, 0, 0, 0, 0, 2, 0, 0, 2, 0])
In [13]:
# Compute the accuracy
(y_ == y[-30:]).sum()/30.0
Out[13]:
0.93333333333333335
In [14]:
# The same algorithm with distance weighting added
knn = KNeighborsClassifier(n_neighbors=5,weights='distance')
knn.fit(X[:120],y[:120])
y_ = knn.predict(X[-30:])
# Compute the accuracy
knn.score(X[-30:],y[-30:])
Out[14]:
0.93333333333333335
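`weights='distance'` makes each neighbour's vote proportional to 1/distance, instead of counting every neighbour equally as the default `weights='uniform'` does. A contrived sketch where the two settings disagree; the three data points are made up specifically to force the disagreement:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# one very close 'A' point and two distant 'B' points
X = np.array([[0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.array(['A', 'B', 'B'])
query = np.array([[0.0, 0.0]])

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

print(uniform.predict(query))   # plain majority of the 3 neighbours: 'B'
print(weighted.predict(query))  # 1/d weights favour the very close 'A'
```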
In [15]:
# Reshuffle the dataset and re-validate the model
scores = []
for i in range(100):
    np.random.shuffle(index)
    X = X[index]
    y = y[index]
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[:120],y[:120])
    score_ = knn.score(X[-30:],y[-30:])
    scores.append(score_)
In [16]:
# After running the test one hundred times, average the accuracy
np.mean(scores)
Out[16]:
0.96233333333333337
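The shuffle-and-score loop above is hand-rolled cross-validation; scikit-learn's `cross_val_score` performs the splitting, fitting, and scoring in one call. A sketch swapping in `cross_val_score` rather than the manual loop:

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: 5 different train/test splits, one score each
scores = cross_val_score(knn, iris['data'], iris['target'], cv=5)
print(scores.mean())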
In [17]:
X[:10]
Out[17]:
array([[ 7.7,  3.8,  6.7,  2.2],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 5.1,  3.8,  1.6,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.5,  2.6,  4.4,  1.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 5.6,  2.8,  4.9,  2. ]])
Project 3: Handwritten digit recognition
In [18]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
In [19]:
digit = plt.imread('../exercise/data/0/0_1.bmp')
In [20]:
digit.shape
Out[20]:
(28, 28)
In [21]:
# Set the figure size
plt.figure(figsize=(1,1))
plt.imshow(digit,cmap = 'gray')
Out[21]:
<matplotlib.image.AxesImage at 0xb24fa90>
In [22]:
digit = plt.imread('../exercise/data/1/1_102.bmp')
plt.figure(figsize=(1,1))
plt.imshow(digit,cmap = 'gray')
Out[22]:
<matplotlib.image.AxesImage at 0xb2c0860>
In [23]:
# Read all 5000 image files
data = []
target = []
for i in range(10):
    for j in range(1,501):
        digit = plt.imread('../exercise/data/%d/%d_%d.bmp'%(i,i,j))
        label = i
        data.append(digit)
        target.append(label)
In [24]:
data
. . .
In [25]:
target
. . .
In [26]:
X = np.array(data)
y = np.array(target)
In [27]:
# The algorithm requires 2-D data; ours is currently 3-D, so it must be reshaped
X.shape
Out[27]:
(5000, 28, 28)
In [28]:
y.shape
Out[28]:
(5000,)
In [29]:
# Flatten X
X = X.reshape(5000,-1)
X.shape
Out[29]:
(5000, 784)
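`reshape(5000, -1)` flattens each 28×28 image into a 784-feature row; the `-1` tells NumPy to infer that dimension from the array's size. A tiny sketch with fake 2×2 "images":

```python
import numpy as np

# three fake 2x2 "images": shape (3, 2, 2)
imgs = np.arange(12).reshape(3, 2, 2)

# -1 lets NumPy infer the remaining dimension: 2*2 = 4 features per image
flat = imgs.reshape(3, -1)
print(flat.shape)                   # (3, 4)
print(flat.reshape(3, 2, 2).shape)  # round trip back to image shape
```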
In [30]:
# Split the data into two parts for training and testing
In [31]:
a = np.arange(10)
a
Out[31]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [32]:
b = np.arange(20,30)
b
Out[32]:
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
In [33]:
# Shuffle and split in two; all arrays are split the same way
train_test_split(a,b,test_size = 3)
Out[33]:
[array([9, 4, 7, 1, 2, 5, 0]), array([3, 8, 6]), array([29, 24, 27, 21, 22, 25, 20]), array([23, 28, 26])]
In [34]:
train_test_split(a,b,test_size = 0.4)
Out[34]:
[array([9, 6, 7, 0, 4, 5]), array([8, 1, 2, 3]), array([29, 26, 27, 20, 24, 25]), array([28, 21, 22, 23])]
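`train_test_split` shuffles randomly, so each call produces a different split. Passing its optional `random_state` parameter fixes the shuffle and makes the split reproducible. A small sketch reusing arrays like the `a` and `b` above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(10)
b = np.arange(20, 30)

# the same random_state gives identical splits on repeated calls
a_tr1, a_te1, b_tr1, b_te1 = train_test_split(a, b, test_size=0.3, random_state=42)
a_tr2, a_te2, b_tr2, b_te2 = train_test_split(a, b, test_size=0.3, random_state=42)
print((a_te1 == a_te2).all())  # the two test sets match
```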
In [35]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.15)
In [41]:
display(X_train.shape,X_test.shape)
(4250, 784)
(750, 784)
In [38]:
display(y_train.shape,y_test.shape)
(4250,)
(750,)
In [40]:
y_test
Out[40]:
array([4, 7, 1, 6, 2, 8, 2, 1, 2, 9, 4, 6, 8, 3, 2, 8, 4, 3, 7, 7, 1, 9, 3, 0, 6, 3, 2, 7, 8, 0, 8, 6, 8, 7, 2, 5, 6, 9, 0, 9, 9, 6, 4, 9, 9, 0, 2, 6, 9, 1, 1, 8, 7, 9, 8, 4, 5, 5, 8, 6, 6, 0, 0, 6, 3, 1, 4, 5, 8, 1, 1, 8, 0, 0, 1, 0, 1, 4, 3, 3, 2, 2, 7, 5, 3, 2, 2, 7, 9, 2, 8, 0, 7, 6, 5, 5, 4, 1, 8, 1, 1, 3, 9, 0, 4, 1, 6, 0, 6, 8, 6, 3, 9, 2, 0, 4, 8, 9, 4, 4, 2, 0, 6, 2, 2, 4, 3, 6, 6, 6, 3, 6, 1, 2, 3, 9, 9, 1, 6, 0, 7, 6, 9, 5, 0, 5, 9, 5, 4, 0, 4, 1, 7, 5, 0, 1, 1, 9, 6, 1, 4, 8, 4, 2, 8, 2, 4, 9, 8, 7, 7, 1, 0, 3, 4, 0, 1, 4, 8, 2, 4, 6, 7, 6, 5, 0, 9, 6, 7, 4, 8, 6, 6, 6, 6, 6, 4, 5, 0, 9, 7, 5, 1, 9, 1, 6, 8, 0, 7, 0, 6, 1, 6, 3, 1, 7, 6, 5, 0, 0, 6, 7, 1, 9, 7, 7, 5, 9, 1, 9, 2, 7, 6, 6, 1, 2, 7, 0, 6, 1, 5, 7, 5, 8, 7, 8, 5, 0, 5, 5, 4, 3, 7, 9, 7, 4, 5, 7, 7, 6, 4, 4, 6, 5, 0, 5, 6, 2, 7, 5, 4, 9, 0, 9, 4, 5, 2, 9, 9, 8, 9, 2, 4, 2, 1, 3, 8, 5, 4, 1, 6, 3, 3, 8, 2, 8, 0, 4, 4, 6, 3, 7, 2, 3, 3, 2, 2, 9, 2, 3, 9, 9, 4, 4, 8, 8, 9, 9, 8, 6, 8, 2, 9, 0, 5, 1, 1, 9, 1, 5, 0, 6, 1, 4, 8, 2, 1, 4, 5, 8, 9, 9, 9, 1, 4, 4, 8, 7, 0, 7, 6, 1, 9, 0, 9, 8, 8, 5, 3, 1, 2, 2, 5, 6, 3, 2, 0, 8, 8, 5, 6, 7, 5, 0, 0, 7, 1, 4, 6, 7, 1, 1, 0, 8, 9, 8, 0, 1, 9, 6, 5, 2, 6, 0, 7, 2, 8, 4, 5, 9, 6, 3, 0, 9, 8, 2, 0, 7, 0, 8, 1, 8, 4, 2, 4, 8, 3, 7, 5, 2, 4, 0, 1, 2, 9, 7, 9, 6, 8, 3, 0, 2, 6, 0, 5, 3, 0, 7, 9, 7, 4, 9, 6, 7, 6, 1, 6, 5, 0, 4, 3, 0, 9, 6, 4, 4, 2, 1, 3, 0, 9, 9, 7, 4, 2, 0, 4, 1, 3, 1, 6, 7, 4, 1, 4, 1, 9, 2, 5, 6, 2, 3, 3, 5, 6, 0, 9, 0, 2, 5, 4, 5, 1, 2, 0, 9, 9, 5, 9, 5, 0, 4, 9, 3, 2, 2, 5, 4, 9, 7, 5, 1, 6, 6, 7, 4, 1, 2, 9, 4, 1, 5, 7, 2, 9, 7, 8, 9, 9, 0, 9, 9, 6, 6, 1, 7, 4, 9, 1, 0, 9, 7, 9, 4, 5, 6, 1, 6, 4, 5, 0, 0, 1, 3, 5, 5, 2, 4, 6, 4, 3, 3, 3, 1, 9, 5, 8, 9, 8, 3, 2, 0, 7, 9, 0, 5, 3, 2, 1, 4, 3, 8, 4, 1, 0, 3, 6, 3, 3, 8, 0, 7, 5, 5, 1, 7, 7, 2, 3, 5, 3, 7, 9, 4, 7, 6, 8, 2, 4, 6, 3, 1, 0, 8, 1, 0, 6, 0, 4, 2, 7, 0, 8, 9, 0, 0, 5, 1, 6, 4, 8, 2, 1, 4, 4, 3, 6, 6, 3, 5, 8, 2, 5, 1, 5, 8, 8, 9, 5, 4, 6, 3, 2, 3, 9, 3, 3, 2, 4, 9, 6, 6, 5, 2, 
9, 9, 3, 5, 8, 6, 7, 2, 5, 2, 6, 8, 0, 3, 1, 7, 7, 9, 7, 3, 4, 5, 5, 2, 1, 1, 9, 8, 8, 9, 3, 3, 7, 2, 4, 5, 6, 7, 3, 9, 3, 4, 7, 4, 0, 6, 3, 7, 1, 8, 1, 1, 4, 6, 1, 5, 2, 9, 3, 6, 4, 8, 7, 2, 9, 1, 8, 1, 7, 7, 6, 3, 0, 2, 5, 6, 4, 2, 8, 2, 5, 5, 4, 2, 8, 4])
In [43]:
knn = KNeighborsClassifier(n_neighbors=5,weights='distance')
# Train on the 4250 training samples
knn.fit(X_train,y_train)
y_ = knn.predict(X_test)
# Accuracy on the y_test samples
knn.score(X_test,y_test)
Out[43]:
0.95199999999999996
In [44]:
# Score on the original training data
knn.score(X_train,y_train)
Out[44]:
1.0
In [45]:
plt.figure(figsize=(10*1,10*1.5))
for i in range(100):
axes = plt.subplot(10,10,i+1)
axes.imshow(X_test[i].reshape(28,28))
t = y_test[i]
p = y_[i]
axes.set_title('Ture:%d\nPredict:%d'%(t,p))
axes.axis('off')