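Definitions
Hyperparameter: a parameter that must be set before the algorithm runs.
Model parameter: a parameter that is learned while the algorithm runs.
The value of k in the kNN algorithm is a hyperparameter, and this article focuses on it.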
Choosing k for kNN
First, let's set k to 3 and measure the accuracy:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 666)
knn_clf = KNeighborsClassifier(n_neighbors = 3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)
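The result is 0.9916666666666667.
Note: random_state is the random seed. With a fixed random_state, train_test_split produces the same split on every run, so the same model is built and the same score is obtained each time.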
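To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch in plain Python. The helper name seeded_split is hypothetical and this is not how sklearn implements train_test_split; it only shows the idea of shuffling with a seeded generator and then cutting the data in two.

```python
import random

def seeded_split(items, test_ratio=0.2, seed=None):
    """Shuffle indices with a seeded generator, then cut off the test part.

    Hypothetical helper for illustration only, not sklearn's implementation.
    """
    rng = random.Random(seed)            # private generator; global random state is untouched
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = int(len(items) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [items[i] for i in train_idx]
    test = [items[i] for i in test_idx]
    return train, test

data = list(range(10))
first = seeded_split(data, seed=666)
second = seeded_split(data, seed=666)
print(first == second)   # the same seed gives the same split
```

Without a seed, each call would most likely shuffle differently, so scores measured on the test part could change from run to run.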
Searching for the best k:
best_score = 0.0
best_k = -1
for k in range(1, 11):
    # Train and score one classifier per candidate k
    knn_clf = KNeighborsClassifier(n_neighbors = k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    # Keep the k with the highest accuracy seen so far
    if score > best_score:
        best_k = k
        best_score = score
print("best_k = ", best_k)
print("best_score = ", best_score)
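All we need to do is pick a range for k and loop over it. If the best k turns out to be 10, the upper bound of the range, we should keep searching above it: when the best value lies on the boundary, we cannot be sure that a better value does not lie just outside the range.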
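The boundary rule can be sketched as a loop that widens the range whenever the best k lands on the upper edge. This is only an illustration: search_k, its parameters, and toy_score are hypothetical names, and toy_score stands in for training and scoring a real classifier.

```python
def search_k(score_fn, low=1, high=10, step=10, max_k=100):
    """Search k in [low, high]; widen the range while the best k sits on the upper edge.

    score_fn is any function mapping k to a score (hypothetical interface).
    """
    best_k, best_score = -1, float("-inf")
    start = low
    while True:
        for k in range(start, high + 1):
            score = score_fn(k)
            if score > best_score:
                best_k, best_score = k, score
        # Stop once the best k is strictly inside the range, or the cap is reached
        if best_k < high or high >= max_k:
            return best_k, best_score
        start, high = high + 1, min(high + step, max_k)

def toy_score(k):
    # Stand-in for an accuracy curve that peaks at k = 14 (made-up numbers)
    return 1.0 - 0.01 * abs(k - 14)

best_k, best_score = search_k(toy_score)
print(best_k, best_score)
```

With the digits data, score_fn could be a small function that fits KNeighborsClassifier(n_neighbors = k) on X_train and returns its score on X_test.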
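Another kNN hyperparameter: weights
Sometimes the votes of the neighbors should be weighted by their distance, as in the figure below:
Each neighbor's weight is the reciprocal of its distance, and with these weights red wins the vote.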
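As a rough sketch of inverse-distance voting, the snippet below sums 1/distance per label and picks the largest total. The function weighted_vote and the neighbor distances are hypothetical and are not taken from the figure; the sketch also assumes every distance is greater than zero.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (label, distance) pairs for the k nearest points."""
    totals = defaultdict(float)
    for label, distance in neighbors:
        totals[label] += 1.0 / distance   # closer neighbors count more; assumes distance > 0
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# Made-up example: one close red neighbor against two farther blue neighbors
neighbors = [("red", 1.0), ("blue", 3.0), ("blue", 4.0)]
winner, totals = weighted_vote(neighbors)
print(winner, totals)
```

In this made-up example a plain majority vote would pick blue (two votes to one), while the weighted vote picks red because 1/1 is larger than 1/3 + 1/4.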
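The weights parameter has two options, uniform and distance. uniform ignores distance when the neighbors vote, while distance weights each vote by distance. The following code checks whether distance should be taken into account: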
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        # Train and score one classifier per (weights, k) combination
        knn_clf = KNeighborsClassifier(n_neighbors = k, weights = method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        # Strict ">" means a tie keeps the combination found first
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print("best_method_ = ", best_method)
print("best_k = ", best_k)
print("best_score = ", best_score)
The result is:
best_method_ = uniform
best_k = 3
best_score = 0.9916666666666667
So for this dataset and this split, there is no need to take distance into account.
Another kNN hyperparameter: p
Minkowski distance: $d(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}$
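A minimal sketch of the Minkowski distance in plain Python may help make the role of p concrete. The function name minkowski_distance and the sample points are assumptions made for illustration; the distance takes the absolute difference in each coordinate, raises it to the power p, sums the results, and takes the p-th root.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length sequences, for p >= 1."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(a, b, 1))   # p = 1 is the Manhattan distance: 7.0
print(minkowski_distance(a, b, 2))   # p = 2 is the Euclidean distance: 5.0
```

Larger values of p put more emphasis on the coordinate with the largest difference.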
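The hyperparameter p is the exponent in the Minkowski distance: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The following code searches for the best value of p: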
best_p = -1
best_score = 0.0
best_k = -1
for k in range(1, 11):
    for p in range(1, 6):
        # Train and score one classifier per (k, p) combination
        knn_clf = KNeighborsClassifier(n_neighbors = k, weights = "distance", p = p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        # Keep the combination with the highest accuracy seen so far
        if score > best_score:
            best_k = k
            best_score = score
            best_p = p
print("best_p = ", best_p)
print("best_k = ", best_k)
print("best_score = ", best_score)