In earlier posts we analyzed the iris dataset and the handwritten-digits dataset, and kNN produced good predictions even on the larger of the two. One nagging question remained: how should the algorithm's parameters be set, and which settings actually make it efficient? This post summarizes what I found.
1. algorithm
1.1 The four algorithms
When building a kNN model, the search for the k nearest neighbors can be carried out with different algorithms:
- Brute-force search (brute)
This algorithm computes the distance between the query sample and every sample in the training set, then selects the k closest. Since it must scan the whole set, it is relatively inefficient: with N sample points the time complexity is roughly k*N per query, which becomes expensive for large N.
- KD tree (kd_tree)
A separate post covers the kd-tree itself:
A study of the kd-tree
Note, however, that the curse of dimensionality makes most search algorithms look fancy on paper yet impractical in high dimensions.
The k-d tree is no exception: it cannot do efficient nearest-neighbor search in high-dimensional spaces. The usual rule of thumb is that in k dimensions (k being the number of features), the number of data points N should satisfy N >> 2^k for the k-d tree's nearest-neighbor search to pay off. Otherwise most of the points get visited anyway, and the overall efficiency ends up little better than a full brute-force scan.
- Ball tree (ball_tree)
The KD tree improves kNN search efficiency to a degree, but it handles unevenly distributed datasets poorly, which motivated the ball tree as an optimized alternative.
Ball tree construction:
① Build an enclosing hypersphere: the smallest sphere that contains all the samples;
② Pick the point farthest from the sphere's center, then pick a second point farthest from the first. Assign every point in the sphere to whichever of these two cluster centers it is closer to, then compute each cluster's center and the smallest radius that lets the cluster enclose all of its points. This yields two child hyperspheres, corresponding to the left and right subtrees of a KD tree.
③ Recursively apply step ② to each child hypersphere until the full ball tree is built.
The ball tree is similar to the KD tree; the main difference is that a ball-tree node stores the smallest hypersphere enclosing its samples, whereas a KD-tree node stores an enclosing hyperrectangle. Because the hypersphere is smaller than the corresponding hyperrectangle, the ball tree can skip more needless work during nearest-neighbor search. Both structures can also be used on their own, as the sketch after this list shows.
- Default (auto)
With this setting, the most suitable of the three algorithms above is selected automatically based on the training data.
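As a quick look at the two tree structures themselves, scikit-learn also exposes them directly as sklearn.neighbors.KDTree and sklearn.neighbors.BallTree. Below is a minimal sketch (the data shape and k=3 are arbitrary demo choices) that builds both trees and queries the 3 nearest neighbors of one point:
import numpy as np
from sklearn.neighbors import KDTree, BallTree
rng = np.random.RandomState(0)
X = rng.rand(1000, 3)  # 1000 points in 3 dimensions (arbitrary demo size)
kd = KDTree(X)    # node regions are hyperrectangles
bt = BallTree(X)  # node regions are hyperspheres
# Query the 3 nearest neighbors of the first point; both return (distances, indices)
kd_dist, kd_idx = kd.query(X[:1], k=3)
bt_dist, bt_idx = bt.query(X[:1], k=3)
print(kd_idx, bt_idx)  # the two structures agree on which neighbors are nearest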
1.2 Running speed of the four algorithms
Case 1: number of samples N >> 2^k (here N = 20000 and k = 10 features, so 2^10 = 1024)
- Brute-force search (brute)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset
data = datasets.make_classification(n_samples=20000, n_features=10, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify brute as the search algorithm
b_classify = KNeighborsClassifier(algorithm='brute')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
Explanation of the make_classification parameters:
- n_samples: total number of samples to generate
- n_features: number of features per sample
- n_classes: number of classes
- random_state: if an int, it is the seed used by the random number generator; if a RandomState instance, it is the generator itself; if None, the generator is the RandomState instance used by np.random
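Note that the snippets in this section omit random_state, so every run generates a different dataset and both the timings and the scores fluctuate. To make a benchmark repeatable, pin the seed; a minimal sketch (the seed 42 is an arbitrary choice):
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Fixing random_state makes both the generated data and the split reproducible
data = datasets.make_classification(n_samples=20000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(data[0], data[1], test_size=0.2, random_state=42)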
- KD tree (kd_tree)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset
data = datasets.make_classification(n_samples=20000, n_features=10, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify kd_tree as the search algorithm
b_classify = KNeighborsClassifier(algorithm='kd_tree')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
- Ball tree (ball_tree)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset
data = datasets.make_classification(n_samples=20000, n_features=10, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify ball_tree as the search algorithm
b_classify = KNeighborsClassifier(algorithm='ball_tree')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
- Default (auto)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset
data = datasets.make_classification(n_samples=20000, n_features=10, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify auto as the search algorithm
b_classify = KNeighborsClassifier(algorithm='auto')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
The runs above make it clear that when the number of samples satisfies N >> 2^k, the KD tree and the ball tree are noticeably faster than brute-force search, and leaving the setting unspecified (auto) automatically picks the fastest of the three. The sketch below folds the four runs into one comparison loop.
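To keep the comparison in one place, here is a compact sketch that times all four settings on a single shared train/test split (seeded for repeatability; the seed is an arbitrary choice):
from time import time
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# One shared dataset and split so the four timings are directly comparable
data = datasets.make_classification(n_samples=20000, n_features=10, n_classes=2, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(data[0], data[1], test_size=0.2, random_state=0)
for algorithm in ['brute', 'kd_tree', 'ball_tree', 'auto']:
    clf = KNeighborsClassifier(algorithm=algorithm)
    start_time = time()
    clf.fit(X_train, Y_train)
    score = clf.score(X_test, Y_test)
    print(algorithm, "runtime:", time() - start_time, "score:", score)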
Case 2: number of samples N < 2^k (here N = 20000 while k = 20 features gives 2^20 = 1048576)
- Brute-force search (brute)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset (20 features this time)
data = datasets.make_classification(n_samples=20000, n_features=20, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify brute as the search algorithm
b_classify = KNeighborsClassifier(algorithm='brute')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
- KD tree (kd_tree)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset (20 features this time)
data = datasets.make_classification(n_samples=20000, n_features=20, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify kd_tree as the search algorithm
b_classify = KNeighborsClassifier(algorithm='kd_tree')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
- Ball tree (ball_tree)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset (20 features this time)
data = datasets.make_classification(n_samples=20000, n_features=20, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify ball_tree as the search algorithm
b_classify = KNeighborsClassifier(algorithm='ball_tree')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
- Default (auto)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from time import time
# Create a random classification dataset (20 features this time)
data = datasets.make_classification(n_samples=20000, n_features=20, n_classes=2)
X = data[0]
Y = data[1]
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Specify auto as the search algorithm
b_classify = KNeighborsClassifier(algorithm='auto')
# Time the fit-and-score run via start/end timestamps
start_time = time()
b_classify.fit(X_train, Y_train)
b_score = b_classify.score(X_test, Y_test)
end_time = time()
run_time = end_time - start_time
print("Runtime:", run_time, "seconds")
print("Score:", b_score)
The runs above make it clear that when N < 2^k, both the KD tree and the ball tree slow down markedly.
In summary, for the search-algorithm setting in kNN we normally let the model pick a suitable algorithm itself rather than over-managing the choice.
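If you are curious which structure auto actually chose, a fitted estimator records it in the private attribute _fit_method. This is an implementation detail of scikit-learn rather than a stable API, so treat the following minimal sketch accordingly:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
# auto picks a structure based on the training data it sees
data = datasets.make_classification(n_samples=20000, n_features=10, n_classes=2, random_state=0)
clf = KNeighborsClassifier(algorithm='auto').fit(data[0], data[1])
print(clf._fit_method)  # e.g. 'kd_tree'; private attribute, may change between versions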
2. n_neighbors
n_neighbors is the k value we have been talking about: the number of nearest neighbors to consult. The default is 5.
Let's again use the iris dataset to examine what a reasonable k looks like:
With k = 3:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the iris dataset
irisData = datasets.load_iris()
X = irisData.data
Y = irisData.target
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
knn_classify = KNeighborsClassifier(n_neighbors=3)
knn_classify.fit(X_train, Y_train)
knn_score = knn_classify.score(X_test, Y_test)
print("Score = ", knn_score)
With the default k = 5:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the iris dataset
irisData = datasets.load_iris()
X = irisData.data
Y = irisData.target
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
knn_classify = KNeighborsClassifier(n_neighbors=5)
knn_classify.fit(X_train, Y_train)
knn_score = knn_classify.score(X_test, Y_test)
print("Score = ", knn_score)
From the results above: at k=3 the model scored 0.93, and at k=5 it scored 0.97 (the split is random, so exact numbers vary from run to run), so here the default is the better choice. But how do we know there is no k better than 5? How do we find the k with the highest score?
We can search for the best k by looping over candidate values:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the iris dataset
irisData = datasets.load_iris()
X = irisData.data
Y = irisData.target
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
best_n_neighbors = 0
prime_score = 0
# Try every k from 1 to 15 and keep the best-scoring one
for n_neighbors in range(1, 16):
    knn_classify = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_classify.fit(X_train, Y_train)
    knn_score = knn_classify.score(X_test, Y_test)
    if knn_score > prime_score:
        prime_score = knn_score
        best_n_neighbors = n_neighbors
print("best n_neighbors is:", best_n_neighbors)
print("prime score is:", prime_score)
As the output shows, among the 15 models built for k from 1 to 15, the score peaked at k=5 on this split; sweeping over k like this is a straightforward way to locate the best value.
So the best k can be found with a simple loop over candidate values; a cross-validated variant of the same idea is sketched below.
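One caveat: scoring each k on a single test split can end up tuned to that particular split. A more robust variant averages the score over cross-validation folds; here is a minimal sketch using scikit-learn's cross_val_score (the 5-fold setting and the 1-15 range are arbitrary choices):
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
irisData = datasets.load_iris()
X = irisData.data
Y = irisData.target
best_k, best_score = 0, 0
for k in range(1, 16):
    # Mean accuracy over 5 cross-validation folds instead of one fixed split
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, Y, cv=5)
    if scores.mean() > best_score:
        best_score = scores.mean()
        best_k = k
print("best n_neighbors is:", best_k)
print("mean CV score is:", best_score)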