1. Workflow of the K-NN algorithm
(1) Compute the distance between every point in the labeled dataset and the current point. The Euclidean distance is commonly used:

d(x, y) = √( (x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)² )

(2) Sort the points in order of increasing distance.
(3) Take the k points closest to the current point.
(4) Count how often each class appears among those k points.
(5) Return the most frequent class among the k points as the predicted class of the current point.
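The five steps above can be sketched in plain Python. This is a toy example with made-up 2-D points (not the breast-cancer data used below); `euclidean` and `knn_predict` are illustrative names:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance: sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k):
    # (1) distance from every known point to the query point
    distances = [(euclidean(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # (2) + (3) sort by increasing distance and keep the k nearest
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # (4) frequency of each class among the k nearest
    votes = Counter(label for _, label in nearest)
    # (5) return the most frequent class
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["B", "B", "B", "M", "M", "M"]
print(knn_predict(points, labels, (2, 2), k=3))  # → B
```

The query point (2, 2) sits next to the three "B" points, so all three nearest neighbors vote "B".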
2. R implementation
Interface: knn() from the class package
setwd("D:\\")
wbcd <- read.csv("wisc_bc_data.csv",stringsAsFactors = FALSE)
wbcd
wbcd <- wbcd[-1] # the first column (id) carries no information, so drop it
wbcd$diagnosis
table(wbcd$diagnosis) # summarize the target column
# many machine-learning functions in R require the target to be a factor
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
wbcd$diagnosis
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
normalize <- function(x) {
  # the whole expression must sit inside return(); otherwise return(x - min(x))
  # exits the function before the division ever happens
  return((x - min(x)) / (max(x) - min(x)))
}
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize)) # min-max normalize the 30 feature columns (diagnosis is now column 1)
wbcd_train <- wbcd_n[1:469,]
wbcd_test <- wbcd_n[470:569,]
wbcd_train_label <- wbcd[1:469,1]
wbcd_test_label <- wbcd[470:569,1]
#knn
library("class")
wbcd_test_pred <- knn(train=wbcd_train,test=wbcd_test,cl=wbcd_train_label,k=21)
?knn
wbcd_test_pred
# evaluate model performance with a cross table
library(gmodels)
CrossTable(x=wbcd_test_label,y=wbcd_test_pred,prop.chisq = FALSE)
# if CrossTable() errors, clear the R workspace and rerun;
# the cross table shows roughly 95% of the test cases are classified correctly
rm(list = ls()) # remove all variables from the workspace
3. Python implementation
Interface:
• sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')
  ◦ n_neighbors: int, optional (default = 5); the number of neighbors used by the k-neighbors query
  ◦ algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional; the algorithm used to find the nearest neighbors. 'ball_tree' uses a BallTree, 'kd_tree' uses a KDTree, and 'auto' tries to pick the most appropriate algorithm based on the values passed to fit. (The choice affects efficiency, not the predictions.)
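As a quick illustration of this interface before the full script, here is a toy fit/predict round trip on made-up 1-D data (the values and the class boundary are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: small values are class 0, large values are class 1
X = np.array([[1], [2], [3], [8], [9], [10]])
y = np.array([0, 0, 0, 1, 1, 1])

# explicitly request a KD-tree; 'auto' would also work here
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X, y)
print(clf.predict([[2.5], [8.5]]))  # → [0 1]
```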
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package

class KNN:
    def knn_cls(self):
        wbcd = pd.read_csv("D:\\wisc_bc_data.csv")
        y_data = wbcd["diagnosis"]
        x_data = wbcd.drop(["id", "diagnosis"], axis=1)
        x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3)
        # standardize: fit the scaler on the training set only,
        # then apply that same transformation to the test set
        std = StandardScaler()
        x_train = std.fit_transform(x_train)
        x_test = std.transform(x_test)
        knn = KNeighborsClassifier(n_neighbors=21)
        knnmodel = knn.fit(x_train, y_train)
        y_predict = knn.predict(x_test)
        # accuracy on the test set
        score = knn.score(X=x_test, y=y_test)
        print(score)
        # save the model, then reload it and predict again
        joblib.dump(knnmodel, "d:\\knn.model")
        loadmodel = joblib.load("d:\\knn.model")
        print(loadmodel.predict(x_test))
if __name__ == "__main__":
knn = KNN()
knn.knn_cls()
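For parity with the R section's CrossTable step, sklearn.metrics provides confusion_matrix and accuracy_score. A minimal sketch, with hypothetical labels standing in for wbcd_test_label and wbcd_test_pred:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# hypothetical true and predicted labels (stand-ins for the real test split)
y_true = ["B", "B", "M", "M", "B", "M"]
y_pred = ["B", "B", "M", "B", "B", "M"]

# rows = true class, columns = predicted class, in the order given by labels=
print(confusion_matrix(y_true, y_pred, labels=["B", "M"]))
print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
```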