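Basic overview
Mind map
k-nearest neighbors (kNN) algorithm
Principle
Characteristics
- Advantages
  - High accuracy
  - Insensitive to outliers
  - No assumptions about the input data
- Disadvantages
  - High computational complexity
  - High space complexity
- Applicable data types
  - Numeric and nominal values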
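Working principle
- Start from a training sample set in which the class label of every sample is known.
- Given a new, unlabeled sample, compare each of its features with the corresponding features of the samples in the training set, then take the class labels of the most similar samples (the nearest neighbors).
- In general only the k most similar samples are considered, with k usually no larger than 20. The class that occurs most often among these k samples becomes the class of the new sample.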
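The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used later in these notes: the function name `knn_classify` and the toy data are made up for the example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_classify(x_new, X_train, y_train, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # most frequent label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# made-up toy data: two small clusters labelled 'A' and 'B'
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array(['A', 'A', 'B', 'B'])

print(knn_classify(np.array([0.1, 0.2]), X_train, y_train, k=3))  # B
print(knn_classify(np.array([0.9, 1.2]), X_train, y_train, k=3))  # A
```

Note that nothing is fitted in advance: all the work happens at prediction time, which is why the computational and space costs listed above are high.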
General process
- Collect data
- Prepare data
- Analyze data
- Train the algorithm (this step does not apply to kNN)
- Test the algorithm
- Use the algorithm
k-nearest neighbors model
Model
Distance metric
L_p distance
- $L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}$
Euclidean distance (p = 2)
- $L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}$
Manhattan distance (p = 1)
- $L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|$
L∞ distance (p = ∞)
- $L_\infty(x_i, x_j) = \max_{l} \left| x_i^{(l)} - x_j^{(l)} \right|$
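As a quick numeric check of these formulas, the sketch below evaluates the L_p distance between two illustrative points, (1, 1) and (4, 4), for several values of p. The helper name `lp_distance` is made up for the example; it relies on `np.linalg.norm`, whose `ord` argument on a vector computes exactly this family of norms.

```python
import numpy as np


def lp_distance(a, b, p):
    """L_p distance between two vectors; p may be np.inf."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b), ord=p)


x1 = [1.0, 1.0]   # illustrative points, not taken from the data set below
x3 = [4.0, 4.0]

for p in (1, 2, 3, np.inf):
    print(f"p = {p}: {lp_distance(x1, x3, p):.4f}")
```

The distance shrinks as p grows: 6 for p = 1, about 4.2426 for p = 2, about 3.7798 for p = 3, and 3 for p = ∞, where only the largest coordinate difference counts. Different choices of p can therefore give different nearest neighbors for the same point.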
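Choosing k
- Small k
  - Approximation error decreases, estimation error increases
  - Sensitive to noise
  - The overall model becomes more complex and tends to overfit
- Large k
  - Estimation error decreases, approximation error increases
  - The model becomes simpler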
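The notes do not say how to pick k in practice. One common approach, shown here only as a sketch, is to compare cross-validated accuracy over a few candidate values. The candidate list, the 5-fold setting, and the use of the full iris data set are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# mean 5-fold cross-validated accuracy for a few candidate values of k
cv_accuracy = {}
for k in (1, 3, 5, 7, 9, 15):
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_accuracy[k] = cross_val_score(clf, X, y, cv=5).mean()

# candidate with the highest mean accuracy
best_k = max(cv_accuracy, key=cv_accuracy.get)

for k, acc in cv_accuracy.items():
    print(f"k = {k:2d}: accuracy = {acc:.3f}")
print("best k:", best_k)
```

On a small, clean data set such as iris the scores tend to be close together, so the trade-off described above shows up more clearly on noisier data.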
Classification decision rule
- Majority voting rule
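Majority voting is simple enough to show directly. In this sketch the function name `majority_vote` and the neighbor labels are hypothetical; `collections.Counter` does the counting.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label that occurs most often among the k neighbors."""
    return Counter(labels).most_common(1)[0][0]


neighbor_labels = [1, 0, 1, 1, 0]      # hypothetical labels of k = 5 neighbors
print(majority_vote(neighbor_labels))  # 1
```

When two labels are tied, `most_common` returns the one that was encountered first, so the result depends on the order of the neighbors. For a two-class problem an odd k avoids ties altogether.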
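Implementing kNN: the kd-tree
- Speeds up the neighbor search: the tree lets the search skip whole regions of the training set instead of computing the distance to every sample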
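To make the idea concrete, here is a minimal kd-tree sketch. It is an illustration under simplifying assumptions, not a production structure: the names `Node`, `build_kdtree`, and `nearest` are made up, the distance is Euclidean, and the search returns only the single nearest neighbor (k = 1).

```python
import numpy as np


class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point    # sample stored at this node
        self.axis = axis      # coordinate used to split at this node
        self.left = left      # samples with smaller-or-equal value on `axis`
        self.right = right    # samples with larger-or-equal value on `axis`


def build_kdtree(points, depth=0):
    """Recursively split on the median, cycling through the axes."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))


def nearest(node, target, best=None):
    """Return (point, distance) of the stored sample closest to target."""
    if node is None:
        return best
    dist = np.linalg.norm(node.point - target)
    if best is None or dist < best[1]:
        best = (node.point, dist)
    # descend first into the side of the split that contains the target
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # the far side can only help if the splitting plane is closer than the best so far
    if abs(diff) < best[1]:
        best = nearest(far, target, best)
    return best


# small illustrative 2-D data set
data = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]], dtype=float)
tree = build_kdtree(data)
point, dist = nearest(tree, np.array([3.0, 4.5]))
print(point, round(dist, 4))
```

The speed-up comes from the pruning test: whenever the splitting plane is farther away than the best distance found so far, the entire subtree on the other side is skipped.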
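Code exercises
Distance metric
- p = 1: Manhattan distance
- p = 2: Euclidean distance
- p = inf: Chebyshev (L∞) distance; the general L_p form is the Minkowski distance (minkowski_distance)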
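import math

# L_p (Minkowski) distance between x and y; Euclidean (p = 2) by default
def L(x, y, p=2):
    if len(x) != len(y) or len(x) == 0:
        raise ValueError('x and y must be non-empty and of equal length')
    # absolute difference in each coordinate
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # limit p -> inf: only the largest coordinate difference remains
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)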
k-nearest neighbors (majority rule)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
Load the data
# data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
# data = np.array(df.iloc[:100, [0, 1, -1]])
plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], label='0')
plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'], label='1')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()
data = np.array(df.iloc[:100, [0, 1, -1]])
X, y = data[:,:-1], data[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Build the model
class KNN:
    def __init__(self, X_train, y_train, n_neighbors=3, p=2):
        """
        n_neighbors: number of neighbors (k)
        p: order of the L_p distance
        """
        self.n = n_neighbors
        self.p = p
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X):
        # seed the candidate list with the first n training points
        knn_list = []
        for i in range(self.n):
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            knn_list.append((dist, self.y_train[i]))

        # scan the rest; replace the farthest candidate whenever a closer point appears
        for i in range(self.n, len(self.X_train)):
            max_index = knn_list.index(max(knn_list, key=lambda x: x[0]))
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            if knn_list[max_index][0] > dist:
                knn_list[max_index] = (dist, self.y_train[i])

        # majority vote over the labels of the n nearest neighbors
        knn = [k[-1] for k in knn_list]
        count_pairs = Counter(knn)
        return count_pairs.most_common(1)[0][0]

    def score(self, X_test, y_test):
        # fraction of test points that are classified correctly
        right_count = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right_count += 1
        return right_count / len(X_test)
clf = KNN(X_train, y_train)
clf.score(X_test, y_test)
1.0
test_point = [6.0, 3.0]
print('Test Point: {}'.format(clf.predict(test_point)))
Test Point: 1.0
plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], label='0')
plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'], label='1')
plt.plot(test_point[0], test_point[1], 'bo', label='test_point')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()
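scikit-learn
sklearn.neighbors.KNeighborsClassifier
- n_neighbors: number of neighbors (k)
- p: order of the Minkowski distance (p = 1 Manhattan, p = 2 Euclidean)
- algorithm: nearest-neighbor search algorithm, one of {'auto', 'ball_tree', 'kd_tree', 'brute'}
- weights: how the neighbors are weighted in the vote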
from sklearn.neighbors import KNeighborsClassifier
clf_sk = KNeighborsClassifier()
clf_sk.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
clf_sk.score(X_test, y_test)
1.0
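The run above uses the defaults. As a sketch of how the four parameters listed earlier can be set explicitly, the example below picks Manhattan distance, a kd-tree search, and distance-weighted voting. These particular values, the use of all four iris features, and `random_state=0` are arbitrary choices for illustration, not recommendations from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# random_state is fixed only so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = KNeighborsClassifier(
    n_neighbors=5,        # k
    p=1,                  # Manhattan distance
    algorithm='kd_tree',  # search the neighbors with a kd-tree
    weights='distance',   # closer neighbors get a larger vote
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

With `weights='uniform'` every neighbor counts equally, which corresponds to the plain majority vote used by the hand-written `KNN` class.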