4-5 Hyperparameters (05-Hyper-Parameters)
random_state=666 is the random seed; it guarantees the same result on every run.
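A minimal sketch of what a fixed seed does, assuming the seed is passed to train_test_split (the notes do not show the call) and using the iris data as a stand-in dataset:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Stand-in data; the notes do not say which dataset the seed was used with.
X, y = datasets.load_iris(return_X_y=True)

# Two splits with the same seed select exactly the same samples.
split_a = train_test_split(X, y, test_size=0.2, random_state=666)
split_b = train_test_split(X, y, test_size=0.2, random_state=666)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("same split:", same_split)
```

Without random_state, each call would shuffle differently and the scores below would change from run to run.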
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k =", best_k)
print("best_score =", best_score)
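If the best value lies on the boundary of the search range, a better value may lie outside it. For example, if best_k is 10, repeat the search over some values above 10.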
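One possible way to automate the boundary check, as a sketch only. The helper search_k, the step of 10, the cap of 40 and the iris dataset are all assumptions for illustration, not part of the notes:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

def search_k(k_values):
    """Hypothetical helper: return (best_k, best_score) over the candidates."""
    best_k, best_score = -1, 0.0
    for k in k_values:
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

low, high = 1, 10
best_k, best_score = search_k(range(low, high + 1))
# If the winner sits on the upper edge, widen the range and search again.
while best_k == high and high < 40:  # 40 is an arbitrary cap
    high += 10
    best_k, best_score = search_k(range(low, high + 1))

print("best_k =", best_k, "searched up to", high)
```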
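Plain kNN only counts votes, with no weights. Giving nearer neighbors a larger weight is more reasonable.
The weight is the reciprocal of the distance.
If each class gets one vote, the result is a tie; distance weighting also resolves ties.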
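The weighting rule above can be sketched in a few lines. The function weighted_vote and the sample distances are made up for illustration, and the sketch assumes every distance is greater than zero:

```python
from collections import defaultdict

def weighted_vote(distances, labels):
    """Sum 1/distance per label and return the label with the largest total."""
    votes = defaultdict(float)
    for d, label in zip(distances, labels):
        votes[label] += 1.0 / d  # assumes d > 0
    return max(votes, key=votes.get)

# One vote per class: a tie under plain voting, resolved by the weights.
tie_winner = weighted_vote([1.0, 3.0, 4.0], ["red", "blue", "green"])

# Two far "blue" votes (1/3 + 1/4) lose to one near "red" vote (1/1).
majority_winner = weighted_vote([1.0, 3.0, 4.0], ["red", "blue", "blue"])

print(tie_winner, majority_winner)
```

In both cases the nearest neighbor decides the result, which is the behavior the notes argue is more reasonable.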
sklearn.neighbors.KNeighborsClassifier — scikit-learn 1.0 documentation
See the official documentation for details.
best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)
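The Manhattan distance and the Euclidean distance have the same mathematical form, (sum_i |x_i - y_i|^p)^(1/p), and generalizing over p gives the Minkowski distance.
p = 1 is the Manhattan distance and p = 2 is the Euclidean distance, so p is another hyperparameter.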
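A small sketch of the Minkowski distance shows how p = 1 and p = 2 recover the two familiar distances. The function name and the sample points are illustrative only:

```python
import numpy as np

def minkowski_distance(a, b, p):
    """(sum_i |a_i - b_i|^p)^(1/p) for two points of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0, 0], [3, 4]
d_manhattan = minkowski_distance(a, b, 1)  # |3| + |4| = 7
d_euclidean = minkowski_distance(a, b, 2)  # sqrt(9 + 16) = 5
print(d_manhattan, d_euclidean)
```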
best_score = 0.0
best_k = -1
best_p = -1
for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score
print("best_k =", best_k)
print("best_p =", best_p)
print("best_score =", best_score)
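p is searched only together with weights="distance", where the vote weights are computed from distances that depend on p. With weights="uniform" the votes are unweighted, so these notes treat p as irrelevant there (strictly, p still determines which neighbors are found).
4-6 Grid search and more hyperparameters in the k-nearest neighbors algorithm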
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
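uniform: 10 combinations
distance: 10*5 = 50 combinations
param_grid is a list of dictionaries; each dictionary defines one set of parameters to search.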
knn_clf = KNeighborsClassifier()
10 + 50 = 60 different combinations in total
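The count can be checked with scikit-learn's ParameterGrid, which expands a grid into its individual combinations. This is a sketch for verification; the notes themselves only state the arithmetic:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

n_uniform = len(ParameterGrid(param_grid[0]))   # 10
n_distance = len(ParameterGrid(param_grid[1]))  # 10 * 5 = 50
n_total = len(ParameterGrid(param_grid))        # 10 + 50 = 60
print(n_uniform, n_distance, n_total)
```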
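Two searches can select different weights, because GridSearchCV scores each candidate with cross-validation (CV) rather than on a single test split; the result depends on the evaluation procedure.
n_jobs sets the number of CPU cores used for parallel computation; -1 uses all cores.
By default the search prints nothing. The larger verbose is, the more detailed the output; showing that progress information is the purpose of verbose.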
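A minimal end-to-end sketch of the search described above. The notes do not show the dataset or the GridSearchCV call for this section, so the iris data, the split and the value verbose=1 are assumptions:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; any classification dataset would do.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

param_grid = [
    {'weights': ['uniform'],
     'n_neighbors': list(range(1, 11))},
    {'weights': ['distance'],
     'n_neighbors': list(range(1, 11)),
     'p': list(range(1, 6))},
]

knn_clf = KNeighborsClassifier()
# n_jobs=-1 uses all cores; verbose=1 prints a short progress summary.
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)  # best combination under cross-validation
print(grid_search.best_score_)   # its mean cross-validated accuracy
test_score = grid_search.best_estimator_.score(X_test, y_test)
print(test_score)                # accuracy on the held-out test set
```

best_score_ is a cross-validation average over the training data, so it generally differs from the test-set score of the same estimator.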
Iris classification example
import seaborn as sns
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
X = iris.data[:,:2]
# X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.15, random_state = 6)
# Create color maps
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ['darkorange', 'c', 'darkblue']
h = .02 # step size in the mesh
def drawBoundary(knn_clf, n_neighbors, weights):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn_clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=cmap_light)
    # plt.contour(xx, yy, Z, cmap=cmap_light)
    # Plot also the training points
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y],
                    palette=cmap_bold, alpha=1.0, edgecolor="black")
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.show()  # Blocks: with several figures, each must be closed before the next one appears
# Hand-written grid search
best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 18):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
            # drawBoundary(knn_clf, best_k, best_method)
# If plt.show() is called inside drawBoundary and several figures are drawn,
# each figure must be closed before the next one appears.
# Calling plt.show() here instead avoids that.
# plt.show()  # Required after drawBoundary, otherwise nothing is drawn; when single-stepping only part of the figure appears, and nothing is shown once the program finishes
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)
# Grid search with scikit-learn's built-in GridSearchCV
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 18)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 18)],
        'p': [i for i in range(1, 6)]
    }
]
from sklearn.model_selection import GridSearchCV
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)  # not required: GridSearchCV clones and refits the estimator
# verbose must be non-negative in recent scikit-learn versions
grid_search = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
print(10*"-------------")
print("best:%f using %s" % (grid_search.best_score_, grid_search.best_params_))
# print(grid_search.best_params_['n_neighbors'])
# print(grid_search.best_params_['weights'])
# print(grid_search.best_estimator_)
# means = grid_search.cv_results_['mean_test_score']
# params = grid_search.cv_results_['params']
#
# for mean, param in zip(means, params):
#     print("%f with: %r" % (mean, param))
drawBoundary(grid_search.best_estimator_,
             grid_search.best_params_['n_neighbors'],
             grid_search.best_params_['weights'])
Reading data with pandas
Data that pandas reads from Excel has a different type from NumPy data:
pandas gives a DataFrame, NumPy gives an array.
Excel spreadsheet data
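A short sketch of the type difference and the conversion. The DataFrame is built in memory so the example runs without a file; the column names and values are made up. With a real spreadsheet the frame would instead come from pandas.read_excel, with a path of your own:

```python
import numpy as np
import pandas as pd

# In-memory stand-in for a spreadsheet; with a real file this would be
# df = pd.read_excel(path), where path points at your own workbook.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.2],
    "sepal_width": [3.5, 3.0, 2.9],
    "label": [0, 0, 1],
})

# scikit-learn examples in these notes work on NumPy arrays, so convert.
X = df[["sepal_length", "sepal_width"]].to_numpy()
y = df["label"].to_numpy()

print(type(df).__name__, type(X).__name__)
print(X.shape, y.shape)
```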
Other hyperparameters
sklearn.neighbors.DistanceMetric — scikit-learn 1.0 documentation
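As one example of a further hyperparameter, KNeighborsClassifier accepts a metric argument that names the distance to use. The sketch below compares three metric names on the iris data; the dataset, the split and k = 5 are assumptions for illustration:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

scores = {}
for metric in ["euclidean", "manhattan", "chebyshev"]:
    knn_clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn_clf.fit(X_train, y_train)
    scores[metric] = knn_clf.score(X_test, y_test)

for metric, score in scores.items():
    print(metric, score)
```

The metric name could be added to param_grid as another key in the same way as weights and p.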