K-Nearest Neighbors (KNN)
1. Illustration
To classify the green point, we pick the three points closest to it: two of them are red and one is blue, so we predict that the green point belongs to the red class.
2. Implementing KNN from scratch
- X_train: the training data
- y_train: the labels
- x: the point to predict
import numpy as np
from math import sqrt
distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
Euclidean distance (used to compute the distances):
2D: $\sqrt{(x^a - x^b)^2 + (y^a - y^b)^2}$
3D: $\sqrt{(x^a - x^b)^2 + (y^a - y^b)^2 + (z^a - z^b)^2}$
n-D: $\sqrt{\sum_{i=1}^n (x^a_i - x^b_i)^2}$
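The n-dimensional formula above maps directly onto NumPy; a minimal sketch (the function name `euclidean_distance` is my own):

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two n-dimensional points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Classic 3-4-5 right triangle: distance from (0, 0) to (3, 4) is 5
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```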
k = 6
nearest = np.argsort(distances)  # indices of training points, sorted by distance to x
topK_y = [y_train[i] for i in nearest[:k]]  # labels of the k points nearest to x
# [1, 1, 1, 1, 1, 0]
from collections import Counter
votes = Counter(topK_y)  # tally the votes
# Counter({1: 5, 0: 1})
votes.most_common(1)  # the most frequent (label, count) pair
# [(1, 5)]
predict_y = votes.most_common(1)[0][0]  # the predicted label
# 1
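The steps above can be collected into a single reusable function. This is a sketch; the function name `knn_predict` and the toy data are my own:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=6):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    distances = [np.sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
    nearest = np.argsort(distances)
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]

# Toy data: class 0 clustered near the origin, class 1 near (10, 10)
X_train = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([9, 9]), k=3))  # 1
```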
3. The KNN model
For KNN, the training set itself is the model
4. Using KNN from scikit-learn
from sklearn.neighbors import KNeighborsClassifier
KNN_classifier = KNeighborsClassifier(n_neighbors=6)  # construct an instance
KNN_classifier.fit(X_train, y_train)  # fit the model
KNN_classifier.predict(x)  # passing a 1-D vector is not recommended (newer sklearn raises an error)
KNN_classifier.predict(x.reshape(1, -1))  # reshape the single sample into a 2-D array
5. Evaluating the performance of a machine-learning algorithm
Train-test split (train_test_split)
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    if seed is not None:
        np.random.seed(seed)
    shuffle_indexes = np.random.permutation(len(X))  # a random permutation of len(X) indices
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    X_train = X[train_indexes]
    X_test = X[test_indexes]
    y_train = y[train_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test
Measuring the accuracy of KNN (accuracy_score)
X_train, X_test, y_train, y_test = train_test_split(X, y)
KNN_classifier = KNeighborsClassifier(n_neighbors=3)
KNN_classifier.fit(X_train, y_train)
y_predict = KNN_classifier.predict(X_test)
accuracy_score = sum(y_predict == y_test) / len(y_test)
train_test_split and accuracy_score in sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 666)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
6. Hyperparameters
Hyperparameter: a parameter that must be chosen before the algorithm runs
Model parameter: a parameter learned during training
KNN has no model parameters; k is a typical hyperparameter
Ways to find good hyperparameters:
- domain knowledge
- empirical values
- experimental search
from sklearn.neighbors import KNeighborsClassifier
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k =", best_k)
print("best_score =", best_score)
Another hyperparameter of KNN: weights
Possible values of weights:
- "uniform": all neighbors get equal weight
- "distance": each neighbor's weight is the inverse of its distance
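To see why "distance" weighting can change the outcome, here is a minimal sketch with hypothetical 1-D toy data, where one very close neighbor outweighs two distant ones:

```python
import numpy as np
from collections import Counter

# Toy data: one label-0 point very close to x, two label-1 points far away
X_train = np.array([0.0, 2.0, 2.1])
y_train = np.array([0, 1, 1])
x = 0.3

distances = np.abs(X_train - x)  # distances: 0.3, 1.7, 1.8

# "uniform": each of the 3 neighbors casts one vote, so label 1 wins 2:1
uniform_winner = Counter(y_train).most_common(1)[0][0]

# "distance": each vote is weighted by 1/distance, so the single close
# label-0 point (weight ~3.33) outweighs the two far ones (~1.14 combined)
weights = 1.0 / distances
score = {label: weights[y_train == label].sum() for label in np.unique(y_train)}
distance_winner = max(score, key=score.get)
print(uniform_winner, distance_winner)  # 1 0
```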
7. Grid Search
Exhaustively try every combination of hyperparameters and keep the best-performing one
Define the hyperparameter search ranges:
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)]
    }
]
Import GridSearchCV and fit:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)  # n_jobs: number of CPU cores to use (-1 means all)
# verbose: how much progress information to print
grid_search.fit(X_train, y_train)
Inspect the results:
grid_search.best_estimator_  # the best model
grid_search.best_score_  # the best cross-validation score
grid_search.best_params_  # the best hyperparameters
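Put together, the search can be run end to end; a self-contained sketch, where the iris dataset and the narrowed parameter ranges are my own choices for illustration (by default, `best_estimator_` has already been refit on the whole training set):

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666)

param_grid = [{'weights': ['uniform', 'distance'],
               'n_neighbors': list(range(1, 6))}]
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid)
grid_search.fit(X_train, y_train)

best_knn = grid_search.best_estimator_  # refit on the full training set
print(grid_search.best_params_)
print(best_knn.score(X_test, y_test))
```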
8. Data normalization
When features have very different scales, some features dominate the others, e.g.:

| | Tumor size (cm) | Time since discovery (days) |
|---|---|---|
| Sample 1 | 1 | 200 |
| Sample 2 | 5 | 100 |

Here the distance between samples is dominated by the time since discovery
Solution: normalize the data
- Min-max normalization: maps all data into [0, 1]; suitable when the distribution has clear boundaries
  $X_{scale} = \frac{X - X_{min}}{X_{max} - X_{min}}$
- Mean-variance normalization (standardization): rescales all data to mean 0 and variance 1
  Suitable when the distribution has no clear boundaries or contains extreme values
  $X_{scale} = \frac{X - X_{mean}}{X_{std}}$
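Both formulas are one-liners in NumPy; a minimal sketch on a hypothetical 1-D array:

```python
import numpy as np

X = np.array([1.0, 5.0, 3.0, 9.0])

# Min-max normalization: subtract the minimum, divide by the range
X_minmax = (X - X.min()) / (X.max() - X.min())
# values become 0, 0.5, 0.25, 1 (all inside [0, 1])

# Mean-variance normalization: subtract the mean, divide by the std
X_std = (X - X.mean()) / X.std()
# the result has mean 0 and standard deviation 1
print(X_minmax)
print(X_std.mean(), X_std.std())
```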
Normalizing the test set with scikit-learn
Taking mean-variance normalization as an example, the key information retained during fit is the mean and the standard deviation
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 666)
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
standardScaler.mean_ # the per-feature means
# array([5.83416667, 3.08666667, 3.70833333, 1.17 ])
standardScaler.scale_ # the per-feature standard deviations
# array([0.81019502, 0.44327067, 1.76401924, 0.75317107])
# Normalize both the training set and the test set
X_train = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
# 对归一化效果进行检验
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors = 3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test_standard, y_test) # be sure to pass X_test_standard, not the raw X_test!
# 1.0
The accuracy is 1.0, so the normalization works well
For min-max normalization, import sklearn.preprocessing.MinMaxScaler instead
9. Further notes
- KNN for regression: use sklearn.neighbors.KNeighborsRegressor
- Drawbacks of KNN:
  - Low efficiency: with m training samples and n features, predicting each new point costs O(m*n)
  - Highly data-dependent: a few erroneous samples can badly distort predictions
  - Predictions are not interpretable
  - Curse of dimensionality: as the dimension grows, the distance between two "seemingly close" points grows larger and larger
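A minimal sketch of KNN regression with KNeighborsRegressor, on hypothetical 1-D toy data; with the default uniform weights, the prediction is simply the mean of the k nearest targets:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: y = 10 * x
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

# The two neighbors of 2.4 are 2.0 and 3.0, so the
# prediction is the mean of their targets: (20 + 30) / 2 = 25
print(reg.predict([[2.4]]))  # [25.]
```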