Model
The sklearn.neighbors module provides four main nearest-neighbor models. k-nearest neighbors: KNeighborsClassifier and KNeighborsRegressor; radius-based neighbors: RadiusNeighborsClassifier and RadiusNeighborsRegressor. The Classifier variants do classification, the Regressor variants regression.
This post uses KNeighborsClassifier.
Parameters
The two most important parameters are metric (the distance measure KNN uses) and n_neighbors (the value of k).
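As a quick illustration of these two parameters (on tiny synthetic points, not the dataset used in this post): metric="minkowski" with p=1 is Manhattan distance, and p=2 is Euclidean.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two tiny clusters: class 0 near the origin, class 1 near (10, 10)
X_demo = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
y_demo = [0, 0, 0, 1, 1, 1]

# metric="minkowski" with p=1 is Manhattan distance; p=2 (the default) is Euclidean
clf = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=1)
clf.fit(X_demo, y_demo)
pred = clf.predict([[0.5, 0.5], [10.5, 10.5]])  # → [0, 1]
```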
Preprocessing
import pandas as pd

path = "../Data/classify.csv"
rawdata = pd.read_csv(path)
X = rawdata.iloc[:, :13]     # first 13 columns as features
Y = rawdata.iloc[:, 14]      # label column with classes "A", "B", "C"
Y = pd.Categorical(Y).codes  # maps A/B/C to 0/1/2
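Note that `pd.Categorical(...).codes` numbers the sorted category labels starting from 0, so A/B/C become 0/1/2, regardless of the order the labels appear in:

```python
import pandas as pd

labels = pd.Series(["B", "A", "C", "A"])
codes = pd.Categorical(labels).codes  # categories sorted: A=0, B=1, C=2
print(list(codes))  # → [1, 0, 2, 0]
```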
Modeling
from sklearn.neighbors import KNeighborsClassifier

# k is the neighbor count; it is varied in the sweep below
model = KNeighborsClassifier(n_neighbors=k, metric="minkowski")
Training and evaluation
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hold out a test set (the original notes omitted this step)
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=0)

def evaluate_model(model):
    # Fit, then report train/test accuracy and macro-averaged recall
    model.fit(x_train, y_train)
    acu_train = model.score(x_train, y_train)
    acu_test = model.score(x_test, y_test)
    y_pred = model.predict(x_test)
    recall = recall_score(y_test, y_pred, average="macro")
    return acu_train, acu_test, recall

def knn_model(k):
    return KNeighborsClassifier(n_neighbors=k, metric="minkowski")

def run_knn(kmax):
    result = {"k": [], "acu_train": [], "acu_test": [], "recall": []}
    for i in range(1, kmax + 1):
        acu_train, acu_test, recall = evaluate_model(knn_model(i))
        result["k"].append(i)
        result["acu_train"].append(acu_train)
        result["acu_test"].append(acu_test)
        result["recall"].append(recall)
    return pd.DataFrame(result)

run_knn(20)
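To read off the best k from the sweep, one could take the row with the highest test accuracy. A minimal sketch, using a hypothetical result DataFrame (illustrative numbers, not the real results):

```python
import pandas as pd

# Hypothetical sweep output in the same shape run_knn returns
result = pd.DataFrame({
    "k": [1, 2, 3, 4, 5],
    "acu_test": [0.70, 0.72, 0.78, 0.76, 0.74],
})
# idxmax gives the row index of the highest test accuracy
best_k = int(result.loc[result["acu_test"].idxmax(), "k"])
print(best_k)  # → 3
```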
Then change metric in the model (e.g. to "manhattan" or "euclidean") and rerun the sweep a few times.
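The metric comparison can be sketched as a loop over metric names. Since the CSV is not available here, this uses a synthetic stand-in dataset from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class, 13-feature stand-in for the real dataset
X_s, y_s = make_classification(n_samples=300, n_features=13,
                               n_informative=5, n_classes=3, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_s, y_s, random_state=0)

# Test accuracy per distance metric
scores = {}
for metric in ["manhattan", "euclidean", "chebyshev"]:
    m = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(x_tr, y_tr)
    scores[metric] = m.score(x_te, y_te)
```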
Results
As k grows, accuracy first rises and then falls, so there is an optimal k. In this example Manhattan distance worked best, with classification accuracy peaking at 80%.
Note: RadiusNeighborsClassifier works much like KNeighborsClassifier, except you set radius instead of k. If the radius is too small, some test points have no neighbors within it and cannot be classified, so try several values. In this example the radius had to be 55 or more, and the best accuracy stayed below 70%, so it was not used and is not written up here.
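For completeness, a minimal sketch of the radius variant on tiny synthetic points (not the dataset above). The outlier_label parameter assigns a fallback label to test points with no neighbors inside the radius; without it, predict() raises an error for such points:

```python
from sklearn.neighbors import RadiusNeighborsClassifier

X_demo = [[0, 0], [1, 0], [10, 10], [11, 10]]
y_demo = [0, 0, 1, 1]

# radius replaces k; outlier_label=-1 marks unclassifiable points
clf = RadiusNeighborsClassifier(radius=2.0, outlier_label=-1)
clf.fit(X_demo, y_demo)
# The second query point has no neighbors within radius 2.0
pred = clf.predict([[0.5, 0.0], [50, 50]])
```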