What is KNN?
- KNN (k-nearest neighbors) is typically used for classification. To predict the label of a query sample, the rule is: compute the distance (commonly the Euclidean distance) between the query sample and every training sample, take the k training samples closest to the query, and count how many of them belong to each class; the predicted label is the class with the most samples among those k neighbors.
Parity of the k Value
- k is usually chosen to be odd so that ties are avoided. Suppose there are only two classes c1 and c2 and k=4: if, for some test sample, two of its four nearest neighbors belong to c1 and two belong to c2, the program has no way to decide the test sample's class. The sketch below illustrates such a tie.
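A minimal sketch of the tie scenario: with an even k and two classes, Counter cannot single out a majority class (the labels here are made up for illustration):

from collections import Counter

# four nearest neighbors, split evenly between the two classes
neighbor_labels = ["c1", "c1", "c2", "c2"]
count = Counter(neighbor_labels)
print(count.most_common())  # [('c1', 2), ('c2', 2)] -- no unique majority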
KNN with sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# load the iris dataset and split it into training and test sets
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2003)

# fit a 3-nearest-neighbors classifier and measure test accuracy
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
correct = np.count_nonzero(clf.predict(X_test) == y_test)
print("Accuracy is: %.3f" % (correct / len(X_test)))
Implementing KNN from Scratch
from sklearn import datasets
from collections import Counter
from sklearn.model_selection import train_test_split
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2003)

def euc_dis(instance1, instance2):
    """
    Compute the Euclidean distance between two samples.
    """
    dist = np.sqrt(np.sum((instance1 - instance2) ** 2))
    return dist

def knn_classify(X, y, test_Instance, k):
    """
    Predict the label of test_Instance with the KNN algorithm.
    X: features of the training samples
    y: labels of the training samples
    k: number of nearest neighbors to consider
    """
    # distances from the test instance to every training sample
    distances = [euc_dis(x, test_Instance) for x in X]
    # indices of the k nearest training samples
    kneighbors = np.argsort(distances)[:k]
    # majority vote among the k nearest labels
    count = Counter(y[kneighbors])
    return count.most_common()[0][0]
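A minimal usage sketch, reusing the train/test split above to check the accuracy of the hand-written classifier:

correct = np.count_nonzero(
    [knn_classify(X_train, y_train, x, 3) == label for x, label in zip(X_test, y_test)]
)
print("Accuracy is: %.3f" % (correct / len(X_test)))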
Choosing the k Value: K-Fold Cross Validation
- K-fold cross validation further splits the training set (from the existing train/test split) into a training part and a validation part. Usage: for each candidate k, train the model on the training part and evaluate its accuracy on the validation part; this yields one accuracy per candidate k, and the k with the highest accuracy is the one to use.
- Role of the test set: it is used only to check whether the model meets the requirements for deployment; it must not be used in any step of the training process.
K-Fold Cross Validation with sklearn
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X = iris.data
y = iris.target
ks = [1, 3, 5, 7, 9, 11, 13, 15]
"""
KFold yields the indices of the training data and the validation data for each fold.
Suppose the samples are: [1, 3, 5, 6, 11, 12, 43, 12, 44, 2]
Then one possible fold looks like (training indices first, validation indices second):
[0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
"""
kf = KFold(n_splits=5, random_state=2001, shuffle=True)
best_k = ks[0]
best_score = 0
for k in ks:
    cur_score = 0
    for train_index, valid_index in kf.split(X):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X[train_index], y[train_index])
        cur_score += clf.score(X[valid_index], y[valid_index])
    avg_score = cur_score / 5  # average accuracy over the 5 folds
    if avg_score > best_score:
        best_score = avg_score
        best_k = k
    print("current best score is %.2f" % best_score, "best k: %d" % best_k)
print("after cross validation, the final best k is: %d" % best_k)
K-Fold Cross Validation via Grid Search
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target

# candidate values for the n_neighbors hyperparameter
parameters = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15]}
knn = KNeighborsClassifier()

# 5-fold cross validation over every candidate parameter value
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X, y)
print(clf.best_score_, clf.best_params_)
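Note that GridSearchCV refits the best parameter combination on the whole dataset by default (refit=True), so the fitted searcher can be used for prediction directly:

print(clf.best_estimator_)  # the refit KNeighborsClassifier
print(clf.predict(X[:5]))   # predictions from the best model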
Feature Scaling
- Min-max normalization:
x_new = (x - min(x)) / (max(x) - min(x))
- Z-score normalization:
x_new = (x - mean(x)) / std(x)
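Because KNN is distance-based, features on larger scales dominate the metric, so scaling matters. A minimal sketch of both schemes using sklearn's MinMaxScaler and StandardScaler (these scalers are not part of the original examples):

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# min-max normalization: rescales each feature to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# z-score normalization: zero mean, unit standard deviation per feature
X_zscore = StandardScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_zscore.mean(axis=0).round(3), X_zscore.std(axis=0).round(3))

In a real train/test setting the scaler should be fit on the training data only and then applied to the test data, so that no test-set statistics leak into training.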