KNN Algorithm
KNN (k-nearest neighbors) is extremely simple and is commonly used as an introduction to machine learning.
Pick the K nearest neighbors and let their labels vote; it works for both classification and regression.
1. The KNN workflow
- train_test_split
My own implementation of this method:
import numpy as np

def train_test_split(x_data, y_data, rate=0.8, seed=None):
    '''
    A hand-rolled train_test_split.
    :param x_data: features of the original dataset
    :param y_data: labels of the original dataset
    :return: X_train, X_test, y_train, y_test
    '''
    if seed:
        np.random.seed(seed)
    # shuffle the indices so the split is random
    index_list = np.random.permutation(len(y_data))
    x_data = x_data[index_list]
    y_data = y_data[index_list]
    split = int(len(y_data) * rate)  # size of the training set
    return x_data[:split], x_data[split:], y_data[:split], y_data[split:]
sklearn's version is used the same way; its parameter names are train_size and random_state.
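For comparison, a minimal sketch of the scikit-learn version (the toy arrays here are just for illustration; `train_size` plays the role of `rate` above, and `random_state` the role of `seed`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# 80% of the samples go to the training set, split reproducibly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```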
2. Data normalization / preprocessing (important)
When to normalize: not always.
Normalizing at the wrong time can hurt accuracy; normalize when one feature's numeric range is far larger than the others', so that it would dominate the distance computation.
We usually use StandardScaler:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
standard_scaler.fit(X_train)                   # learn mean/std from the training set only
X_train = standard_scaler.transform(X_train)
X_test = standard_scaler.transform(X_test)     # reuse the training statistics on the test set
- Tuning (choosing hyperparameters):
KNN's hyperparameters: k, p, weights
weights: "uniform" | "distance"
p: the exponent of the Minkowski distance
We usually use sklearn's GridSearchCV for grid search; it evaluates each candidate with cross-validation and is easy to extend.
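The Minkowski distance generalizes the familiar metrics: p=1 gives Manhattan distance and p=2 gives Euclidean. A quick sketch (the `minkowski` helper is my own, for illustration):

```python
import numpy as np

def minkowski(a, b, p):
    # (sum_i |a_i - b_i|^p) ** (1/p)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # 7.0 -> Manhattan distance
print(minkowski(a, b, 2))  # 5.0 -> Euclidean distance
```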
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = [
    {
        "weights": ['uniform'],
        "n_neighbors": [i for i in range(1, 11)],
    },
    {
        "weights": ['distance'],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 11)],
    },
]
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid=param_grid, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_)
print(grid_search.best_score_)
In KNN we can also tune by measuring plain accuracy on a held-out test set ourselves, but as a criterion this is weaker than cross-validation:
weights_options = ['distance', 'uniform']
best_k = 0
best_p = 0
best_score = 0
best_weights = ''
for weight in weights_options:
    for k in range(1, 11):
        for p in range(1, 11):
            if weight == 'distance':
                knn = KNeighborsClassifier(n_neighbors=k, weights=weight, n_jobs=-1, p=p)
            else:
                # p is not searched for uniform weighting here, so this inner
                # loop just refits the same model
                knn = KNeighborsClassifier(n_neighbors=k, weights=weight, n_jobs=-1)
            knn.fit(X_train, y_train)
            now_score = knn.score(X_test, y_test)
            if now_score > best_score:
                best_score = now_score
                best_p = p
                best_weights = weight
                best_k = k
print(best_score, best_p, best_k, best_weights)
- fit the model, then predict
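Putting the whole flow together, a minimal end-to-end sketch (the iris dataset is my choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)    # "fit" essentially just stores the training data
pred = knn.predict(X_test)   # each prediction is a vote among the 3 nearest neighbors
print(knn.score(X_test, y_test))
```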