Error models: overfitting, cross-validation, and the bias-variance trade-off

from: http://www.voidcn.com/blog/Mark_LQ/article/p-5983585.html

Introduction

In this post you will get to grips with what is perhaps the most essential concept in machine learning: the bias-variance trade-off. The main idea here is that you want to create models that are as good at prediction as possible but that are still applicable to new data (i.e. they are generalizable). The danger is that you can easily create models that overfit to the local noise in your specific dataset, which isn’t too helpful and leads to poor generalizability, since the noise is random and therefore different in each dataset. Essentially, you want to create models that capture only the useful components of a dataset. On the other hand, models that generalize very well but are too inflexible to generate good predictions are the other extreme you want to avoid (this is called underfitting).

We discuss and demonstrate these concepts using the k-nearest neighbors algorithm, which has a simple parameter k that can be varied to cleanly demonstrate these ideas of underfitting, overfitting and generalization (generalization: a model's ability to adapt to previously unseen samples). Together, this bundle of concepts related to the balance between underfitting and overfitting is referred to as the bias-variance trade-off. Here is a table summarizing some different but related characteristics of models which either underfit or overfit, which you can refer to throughout this post:

[Table: characteristics of underfitting (high bias, low variance) vs. overfitting (low bias, high variance) models]

We will explain what all of these terms mean and how they are interrelated. We will also discuss cross-validation, which is a good way of estimating the accuracy and generalizability of your models.
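
As an aside for readers who like formulas, the trade-off has a standard mathematical form. For squared-error loss in a regression setting (the classification case behaves analogously but is messier to write down), the expected prediction error at a point x decomposes into squared bias, variance, and irreducible noise:

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \sigma^2
\]

Here f is the true underlying function, \(\hat{f}\) is the model fitted on a random training sample, and \(\sigma^2\) is the noise you can never get rid of. Flexible models (like kNN with k=1) shrink the bias term but inflate the variance term; rigid models (like kNN with k=99) do the opposite.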

You will encounter all of these concepts in future blog posts, which will cover model optimization, random forests, Naive Bayes, logistic regression and how to combine different models into an ensemble meta-model.

Generating the dataset

Let’s start off by building an artificial dataset to play with. You can do this easily with the dataset generators in sklearn.datasets; here you will use the make_moons() function. Specifically, you will generate a relatively simple binary classification problem. To make it a bit more interesting, let’s make the data crescent-shaped and add some random noise. This should make it more realistic and increase the difficulty of classifying observations.

<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># Creating the dataset</span>
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># e.g. make_moons generates crescent-shaped data</span>
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># Check out make_classification, which generates linearly-separable data</span>
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">from</span> sklearn.datasets <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> make_moons
 
X, y =make_moons(
    n_samples=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">500</span>, <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># the number of observations</span>
    random_state=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>,
    noise=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0.3</span>
)
 
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># Take a peek</span>
print(X[:<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">10</span>,])
print(y[:<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">10</span>])


The dataset you just generated looks a bit like this:
[[ 0.50316464  0.11135559]
 [ 1.06597837 -0.63035547]
 [ 0.95663377  0.58199637]
 [ 0.33961202  0.40713937]
 [ 2.17952333 -0.08488181]
 [ 2.00520942  0.7817976 ]
 [ 0.12531776 -0.14925731]
 [ 1.06990641  0.36447753]
 [-0.76391099 -0.6136396 ]
 [ 0.55678871  0.8810501 ]]
[1 1 0 0 1 1 1 0 0 0]
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">as</span> plt
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">from</span> matplotlib.colors <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> ListedColormap
 
%matplotlib inline <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># for the plots to appear inline in jupyter notebooks</span>
 
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># Plot the first feature against the other, color by class</span>
plt.scatter(X[y == <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], X[y==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>], color=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"#EE3D34"</span>, marker=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"x"</span>)
plt.scatter(X[y == <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], X[y==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>], color=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"#4458A7"</span>, marker=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"o"</span>)


[Figure: scatter plot of the crescent-shaped dataset, colored by class]

Next up, let’s split the dataset into a training set and a test set. The training set will be used to develop and tune our models. The test set will be completely left alone until the very end, at which point you’ll run your finished models on it. Having a test set will allow you to get a good estimate of how well your models would perform out in the wild on previously unseen data.

<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">from</span> sklearn.cross_validation <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> train_test_split
 
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># Split into training and test sets</span>
XTrain, XTest, yTrain, yTest= train_test_split(X, y, random_state=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>, test_size=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0.5</span>)


You are going to try to predict the classes in our dataset with a k Nearest Neighbor (kNN) classifier. Chapter 2 of the Introduction to Statistical Learning book provides a great intro to the theory behind kNN. We are huge fans of the ISLR book, so definitely check it out if you have the time. You could also have a look at this previous post that teaches you how to implement the algorithm from scratch in Python.

Introducing the k hyperparameter in kNN

The kNN algorithm works by using information about the k-nearest neighbors of a new data point in order to assign it a class label. It simply looks at the class of other data points most similar to it (its ‘nearest neighbors’) and assigns the new data point to the most common class of these neighbors. When using kNN, you have to set the value of k that you want the algorithm to use ahead of time, and it is not trivial to know which value to use.
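
To make the mechanics concrete, here is a minimal from-scratch sketch of how a single kNN prediction could be computed (the helper name predict_knn is made up for illustration; in the rest of this post you will use scikit-learn's KNeighborsClassifier instead):

import numpy as np
from collections import Counter

def predict_knn(X_train, y_train, x_new, k):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]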

If the value for k is high (e.g. k=99), then the model considers a large number of neighbors when making a decision about the class of an unknown datapoint. This means that the model is quite constrained, since it has to take a large amount of information into account when classifying instances. In other words, a high value of k gives rise to relatively “rigid” model behaviour.

By contrast, if the value for k is low (e.g. k=1 or k=2), then only a few neighbors are taken into account when making a classification decision. It is a very flexible model with a lot of complexity – it really fits very closely to the precise shape of the dataset. Hence, the predictions of the model are much more dependent on the local tendencies of the data (crucially, this includes the noise!).

Take a look at how the kNN algorithm separates the training cases when k=99 compared to when k=1. The green line is the decision boundary on the training data (i.e. the threshold at which the algorithm decides whether a data point belongs to the blue or red class).

[Figure: kNN decision boundaries on the training data for k=99 (left) and k=1 (right)]

At the bottom of the post you learn how to generate these plots yourself, but let’s delve into some theory first.

When k=99 (on the left), it looks like the model fit might be a bit too smooth and could stand to fit the data a bit closer. The model has low flexibility and low complexity. It paints the decision boundary with a broad brush. It has relatively high bias because one can tell it is not modelling the data as well as it could – it models the underlying generative process of the data as something too simple, and this is highly biased away from the ground truth. But, the decision boundary would probably look very similar if you redrew it on a slightly different dataset. It is a stable model that won’t vary a lot – it has low variance.

When k=1 (on the right), you can see that the model is massively overfitting to the noise. It is technically generating perfectly correct predictions on the training set (the error in the bottom right hand corner is equal to 0.00!), but hopefully you can see how this fit is way too sensitive to individual data points. Keep in mind that you added noise to the dataset – it looks like this model fit is taking the noise too seriously and is fitting very closely to it. You can say that the k=1 model has high flexibility and high complexity because it tunes very tightly to the data. It also has low bias – if nothing else, the decision boundary certainly fits the trends you observe in the data. But, the fitted boundary would drastically change on even slightly different data – it would vary significantly, i.e. the k=1 model has high variance.

But how well do these models generalize, i.e. how well would they perform on new data?

You have so far only looked at the training data, but quantifying training error isn’t that useful. You are not interested in how well models can recapitulate what they just learned on the training set. Let’s take a look at how they perform on test data, since that gives a better impression of whether your models are actually good or not. Try it yourself using a few different values of k:

<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">from</span> sklearn.neighbors <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> KNeighborsClassifier
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">from</span> sklearn <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> metrics
 
knn99 =KNeighborsClassifier(n_neighbors =<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">99</span>)
knn99.fit(XTrain, yTrain)
yPredK99 = knn99.predict(XTest)
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">print</span><span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Overall Error of k=99 Model:"</span>,<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span> - round(metrics.accuracy_score(yTest, yPredK99), <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">2</span>)
 
knn1 =KNeighborsClassifier(n_neighbors =<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>)
knn1.fit(XTrain, yTrain)
yPredK1 = knn1.predict(XTest)
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">print</span><span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Overall Error of k=1 Model:"</span>,<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span> - round(metrics.accuracy_score(yTest, yPredK1), <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">2</span>)


Overall Error of k=99 Model: 0.15
Overall Error of k=1 Model: 0.15

Actually, it looks like these models perform approximately equally well on the test data. Here are the decision boundaries you learned on the training set, applied to the test set. See if you can figure out where the two models are making their mistakes.

[Figure: the k=99 and k=1 decision boundaries applied to the training and test sets]

The two models are making mistakes for very different reasons. It seems that the k=99 model isn’t doing a good job at capturing the crescent shape of the data (it is underfitting), while the k=1 model is making mistakes by being horribly overfitted to the noise. Remember, the hallmark of overfitting is good training performance and bad testing performance, which is what you observe here.
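
If you want to look at the mistakes more closely, a quick optional check is to print the confusion matrices of the two models on the test set, reusing the yPredK99 and yPredK1 predictions computed above:

from sklearn.metrics import confusion_matrix

# rows are the true classes, columns are the predicted classes
print("k=99 confusion matrix:\n", confusion_matrix(yTest, yPredK99))
print("k=1 confusion matrix:\n", confusion_matrix(yTest, yPredK1))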

Maybe intermediate values of k are where you want to be? Try a value of k between 99 and 1, e.g.:

knn50 = KNeighborsClassifier(n_neighbors=50)
knn50.fit(XTrain, yTrain)
yPredK50 = knn50.predict(XTest)
print("Overall Error of k=50 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK50), 2))

Overall Error of k=50 Model: 0.11

Looking better! Let’s check out the decision boundary for the k=50 model.

[Figure: the k=50 decision boundary on the training and test sets]

Much better – the model fit is similar to the actual trend in the dataset and this improvement is reflected in a lower test set error.

The bias-variance trade-off: concluding comments

Hopefully you now have a good intuition for what it means for models to underfit and overfit. See if all of the terms in the table at the beginning of this post now make sense. Basically, finding the right balance between overfitting and underfitting corresponds to the bias-variance trade-off.

To recap, when you train machine learning algorithms on a data set, what you are really interested in is how your model will perform on an independent data set. It is not enough to do a good job classifying instances on the training set. Essentially, you are only interested in building models that are generalizable – getting 100% accuracy on the training set is not impressive, and is simply an indicator of overfitting. Overfitting is the situation in which you have fitted your model too closely to the data, and have tuned to the noise instead of just to the signal.

To be clear: strictly speaking, you are not trying to model the trends in the dataset. You are trying to model the real-world process that generated the data you observe. The specific dataset you happen to be working with is just a small set of instances (i.e. a sample) of the ground truth, which brings with it its own noise and peculiarities.

Here is a summary figure showing how under-fitting (high bias, low variance), properly fitting, and over-fitting (low bias, high variance) models fare on the training compared to the test sets:

[Figure: training vs. test performance for under-fitted, properly fitted, and over-fitted models]

This idea of building generalizable models is the motivation behind splitting your dataset into a training set (on which models can be trained) and a test set (which is held out until the very end of your analysis, and provides an accurate measure of model performance).

But – BIG warning! It’s also possible to overfit to the test set. If you were to try out lots of different models and keep changing them in order to chase accuracy points on the test set, then information from the test set can inadvertently leak into your model creation phase. You need a way around this.

Estimating model performance using k-fold cross validation

Enter k-fold cross-validation, which is a handy technique for measuring a model’s performance using only the training set. Say that you want to do e.g. 10-fold cross-validation. The process is as follows: you randomly partition the training set into 10 equal sections. You then train an algorithm on 9/10ths (i.e. 9 out of the 10 sections) of that training set and evaluate its performance on the remaining 1 section. This gives you some measure of the model’s performance (e.g. overall accuracy). You then train the algorithm on a different 9/10ths of the training set, and evaluate on the other (different from before) remaining 1 section. You continue the process 10 times, get 10 different measures of model performance, and average these values to get an overall measure of performance. Of course, you could have chosen some number other than 10. To keep on with the example, the process behind 10-fold CV looks like this:

[Figure: schematic of 10-fold cross-validation on the training set]
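
If you prefer to see the procedure spelled out in code rather than hidden behind a helper, here is a rough sketch of the 10-fold loop using scikit-learn's KFold splitter (the cross_val_score helper used later in this post wraps essentially this loop, and for classifiers it also stratifies the folds by default):

from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import numpy as np

kf = KFold(n_splits=10, shuffle=True, random_state=1)
fold_accuracies = []

for train_idx, val_idx in kf.split(XTrain):
    model = KNeighborsClassifier(n_neighbors=50)
    model.fit(XTrain[train_idx], yTrain[train_idx])    # train on 9/10ths of the training set
    yPredFold = model.predict(XTrain[val_idx])         # evaluate on the held-out 1/10th
    fold_accuracies.append(metrics.accuracy_score(yTrain[val_idx], yPredFold))

print("Mean cross-validated accuracy:", np.mean(fold_accuracies))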

You can use k-fold cross validation to get an estimate of model accuracy, and you can use these estimates to tweak your model until you are happy. This lets you leave the test data alone until the very end, thus side-stepping the danger of overfitting to it. In other words, cross-validation provides a way to simulate having more data than you actually have, so that you do not have to “spend” your test data until the very end of model building. k-fold cross validation, and its variants, are extremely popular and very useful, especially if you’re trying out lots and lots of different models (e.g. if you want to test how well a load of differently parameterized models perform).

Comparing training error, cross-validation error, and test error

So what would be the best value for k (here k refers to the parameter k of the kNN algorithm)? Try out different values for k when building models with the training data and see how well the resulting models fare when predicting the classes of either the training set itself or the test set. Finally, see how well k-fold cross validation does at indicating the best k.

Note: in practice, when scanning a parameter like this, it would be a bad idea to use the training set for testing the model. In equal measure, you would never scan a parameter using the test set multiple times (once for each parameter value tried). In the following plot you use these calculations just as an illustration to see what they would look like. In practice, only k-fold cross validation is a safe approach!

<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> numpy <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">as</span> np
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">from</span> sklearn.cross_validation <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">import</span> train_test_split, cross_val_score
 
knn =KNeighborsClassifier()
 
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># the range of number of neighbors you want to test</span>
n_neighbors = np.arange(<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>,<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">141</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">2</span>)
 
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># here you store the models for each dataset used</span>
train_scores = list()
test_scores = list()
cv_scores = list()
 
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># loop through possible n_neighbors and try them out</span>
<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">for</span> n <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">in</span> n_neighbors:
    knn.n_neighbors= n
    knn.fit(XTrain, yTrain)
    train_scores.append(<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>- metrics.accuracy_score(yTrain, knn.predict(XTrain))) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># this will over-estimate the accuracy</span>
    test_scores.append(<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>- metrics.accuracy_score(yTest, knn.predict(XTest)))
    cv_scores.append(<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>- cross_val_score(knn, XTrain, yTrain, cv= <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">10</span>).mean()) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># you take the mean of the CV scores</span>
So what would be the best value to pick for  k? When multiple values are giving the same prediction error, you just pick the smallest value for  k.
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># what do these different datasets think is the best value of k?</span>
print(
    <span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'The best values of k are:\n'</span>\
    <span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'{} according to the Training Set\n'</span>\
    <span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'{} according to the Test Set and\n'</span>\
    <span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'{} according to Cross-Validation'</span>.format(
        min(n_neighbors[train_scores==min(train_scores)]),
        min(n_neighbors[test_scores==min(test_scores)]),
        min(n_neighbors[cv_scores==min(cv_scores)])       
    )
)    

The best values of k are: 
1 according to the Training Set
23 according to the Test Set and
11 according to Cross-Validation

Rather than just collecting the best ks, have a peek at what the prediction error is over the range of ks tested.

<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># let's plot the error you get with different values of k</span>
plt.figure(figsize=(<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">10</span>,<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">7.5</span>))
plt.plot(n_neighbors, train_scores, c=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"black"</span>, label=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Training Set"</span>)
plt.plot(n_neighbors, test_scores, c=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"black"</span>, linestyle=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"--"</span>, label=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Test Set"</span>)
plt.plot(n_neighbors, cv_scores, c=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"green"</span>, label=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Cross-Validation"</span>)
plt.xlabel(<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'Number of K Nearest Neighbors'</span>)
plt.ylabel(<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'Classification Error'</span>)
plt.gca().invert_xaxis()
plt.legend(loc = <span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"lower left"</span>)
plt.show()


[Figure: training, test, and cross-validation error as a function of k]

Let’s talk about the classification error on the training set. The fewer neighbors you consider, the lower the prediction error when evaluating the models on the training set (black solid line). This makes sense, since you approach the scenario where each point only considers itself when making new classifications, which leads to perfect “predictions”. The test data error follows a similar trajectory, but experiences an increase after a certain point because of local overfitting. This behaviour indicates that the specific test set sample is not modelled very well by the model fit which was built on the training data.

In this plot, you see that especially for low values of k, using k-fold cross-validation highlights a region in the parameter space (i.e. very low values of k) that is very prone to overfitting. Even though cross-validation and the test set evaluation lead to somewhat different optima, they are both pretty decent and you are clearly in the right ballpark. You can also see that cross-validation is a reasonable estimator of test error. This type of plot is nice to study in order to get a good feeling for how a certain parameter influences model performance and to help build an intuition about your dataset.
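
As an aside, scikit-learn bundles this kind of cross-validated parameter scan into GridSearchCV, which is a tidy way to pick k while leaving the test set untouched; a minimal sketch, assuming the XTrain/yTrain split from above:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# try the same odd values of k as in the plot above
param_grid = {"n_neighbors": np.arange(1, 141, 2)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid.fit(XTrain, yTrain)

print("Best k according to 10-fold CV:", grid.best_params_["n_neighbors"])
print("Best cross-validated accuracy:", grid.best_score_)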

Just show me the code!

Here is the code for generating the above plots, and doing the training and testing of the different kNN algorithms. This code is an adapted version of this scikit-learn example, and most of it deals with the finicky details of calculating decision boundaries and making the plots look nice. The meaty machine learning parts of splitting the dataset, fitting the algorithm, and testing it were covered above.

<span class="hljs-function" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; background: transparent;"><span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">def</span> <span class="hljs-title" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(25, 70, 157); background: transparent;">detect_plot_dimension</span><span class="hljs-params" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(0, 0, 255); background: transparent;">(X, h=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0.02</span>, b=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0.05</span>)</span>:</span>
    x_min, x_max= X[:, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>].min()- b, X[:, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>].max()+ b
    y_min, y_max= X[:, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>].min()- b, X[:, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>].max()+ b
    xx, yy= np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    dimension= xx, yy
    returndimension
 
<span class="hljs-function" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; background: transparent;"><span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">def</span> <span class="hljs-title" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(25, 70, 157); background: transparent;">detect_decision_boundary</span><span class="hljs-params" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(0, 0, 255); background: transparent;">(dimension, model)</span>:</span>
    xx, yy= dimension <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># unpack the dimensions</span>
    boundary= model.predict(np.c_[xx.ravel(), yy.ravel()])
    boundary= boundary.reshape(xx.shape) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># Put the result into a color plot</span>
    returnboundary
 
<span class="hljs-function" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; background: transparent;"><span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">def</span> <span class="hljs-title" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(25, 70, 157); background: transparent;">plot_decision_boundary</span><span class="hljs-params" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(0, 0, 255); background: transparent;">(panel, dimension, boundary, colors=[<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'#DADDED'</span>,<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'#FBD8D8'</span>])</span>:</span>
    xx, yy= dimension <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># unpack the dimensions</span>
    panel.contourf(xx, yy, boundary, cmap=ListedColormap(colors), alpha=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>)
    panel.contour(xx, yy, boundary, colors=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"g"</span>, alpha=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>, linewidths=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0.5</span>) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># the decision boundary in green</span>
 
<span class="hljs-function" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; background: transparent;"><span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">def</span> <span class="hljs-title" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(25, 70, 157); background: transparent;">plot_dataset</span><span class="hljs-params" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(0, 0, 255); background: transparent;">(panel, X, y, colors=[<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"#EE3D34"</span>,<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"#4458A7"</span>], markers=[<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"x"</span>,<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"o"</span>])</span>:</span>
    panel.scatter(X[y==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], X[y==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>], color=colors[<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], marker=markers[<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>])
    panel.scatter(X[y==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], X[y==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>], color=colors[<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>], marker=markers[<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>])
 
<span class="hljs-function" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; background: transparent;"><span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">def</span> <span class="hljs-title" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(25, 70, 157); background: transparent;">calculate_prediction_error</span><span class="hljs-params" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(0, 0, 255); background: transparent;">(model, X, y)</span>:</span>
    yPred= model.predict(X)
    score= <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span> - round(metrics.accuracy_score(y, yPred),<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">2</span>)
    returnscore
 
<span class="hljs-function" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; background: transparent;"><span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">def</span> <span class="hljs-title" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(25, 70, 157); background: transparent;">plot_prediction_error</span><span class="hljs-params" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(0, 0, 255); background: transparent;">(panel, dimension, score, b=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">.3</span>)</span>:</span>
    xx, yy= dimension <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># unpack the dimensions</span>
    panel.text(xx.max()- b, yy.min()+ b, (<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'%.2f'</span>% score).lstrip(<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'0'</span>), size=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">15</span>, horizontalalignment=<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'right'</span>)
 
<span class="hljs-function" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; background: transparent;"><span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">def</span> <span class="hljs-title" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(25, 70, 157); background: transparent;">explore_fitting_boundaries</span><span class="hljs-params" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(0, 0, 255); background: transparent;">(model, n_neighbors, datasets, width)</span>:</span> 
    <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># determine the height of the plot given the aspect ration of each panel should be equal</span>
    height= float(width)/len(n_neighbors)* len(datasets.keys())
 
    nrows= len(datasets.keys())
    ncols= len(n_neighbors)
 
    <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># set up the plot</span>
    figure, axes= plt.subplots(
        nrows,
        ncols,
        figsize=(width, height),
        sharex=<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">True</span>,
        sharey=<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">True</span>
    )
 
    dimension= detect_plot_dimension(X, h=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0.02</span>) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># the dimension each subplot based on the data</span>
 
    <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># Plotting the dataset and decision boundaries</span>
    i= <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>
    forn <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">in</span> n_neighbors:
        model.n_neighbors= n
        model.fit(datasets[<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Training Set"</span>][<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], datasets[<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Training Set"</span>][<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>])
        boundary= detect_decision_boundary(dimension, model)
        j= <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>
        ford <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">in</span> datasets.keys():
            <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">try</span>:
                panel= axes[j, i]
            <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">except</span>(TypeError, IndexError):
                <span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">if</span>(nrows * ncols) ==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>:
                    panel= axes
                elifnrows ==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>:  <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># if you only have one dataset</span>
                    panel= axes[i]
                elifncols ==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>:  <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># if you only try one number of neighbors</span>
                    panel= axes[j]
            plot_decision_boundary(panel, dimension, boundary) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># plot the decision boundary</span>
            plot_dataset(panel, X=datasets[d][<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], y=datasets[d][<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>]) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># plot the observations</span>
            score= calculate_prediction_error(model, X=datasets[d][<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>], y=datasets[d][<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>])
            plot_prediction_error(panel, dimension, score, b=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0.2</span>) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># plot the score</span>
 
            <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># make compacted layout</span>
            panel.set_frame_on(<span class="hljs-keyword" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(149, 65, 33); background: transparent;">False</span>)
            panel.set_xticks([])
            panel.set_yticks([])
 
            <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># format the axis labels</span>
            ifi ==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>:
                panel.set_ylabel(d)
            ifj ==<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>:
                panel.set_title(<span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">'k={}'</span>.format(n))
            j+=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>
        i+=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>   
 
    plt.subplots_adjust(hspace=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>, wspace=<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">0</span>) <span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># make compacted layout</span>
You can then run the code like this:
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># specify the model and settings</span>
model =KNeighborsClassifier()
n_neighbors = [<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">200</span>,<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">99</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">50</span>,<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">23</span>, <span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">11</span>,<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">1</span>]
datasets = {
    <span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Training Set"</span>: [XTrain, yTrain],
    <span class="hljs-string" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(33, 145, 97); background: transparent;">"Test Set"</span>: [XTest, yTest]
}
width =<span class="hljs-number" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: inherit; font-family: inherit; vertical-align: baseline; color: rgb(64, 160, 112); background: transparent;">20</span>
 
<span class="hljs-comment" style="border: 0px; margin: 0px; padding: 0px; font-weight: inherit; font-style: italic; font-family: inherit; vertical-align: baseline; color: rgb(64, 128, 128); background: transparent;"># explore_fitting_boundaries(model, n_neighbors, datasets, width)</span>
explore_fitting_boundaries(model=model, n_neighbors=n_neighbors, datasets=datasets, width=width)


[Figure: exploration_grid — kNN decision boundaries on the training and test sets for k = 200, 99, 50, 23, 11 and 1, with the prediction score shown in each panel]
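The grid above shows the trend visually, but you can also quantify it. Below is a minimal sketch (an illustration added here, not part of the original code) that loops over the same k values and records training and test accuracy with KNeighborsClassifier.score(); it assumes the XTrain, yTrain, XTest and yTest arrays from the earlier train_test_split are still available.

# A minimal sketch: train vs. test accuracy for the same range of k values.
# Assumes XTrain, yTrain, XTest, yTest exist from the earlier train_test_split.
from sklearn.neighbors import KNeighborsClassifier

for k in [200, 99, 50, 23, 11, 1]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(XTrain, yTrain)
    train_acc = knn.score(XTrain, yTrain)  # accuracy on the data the model was fit on
    test_acc = knn.score(XTest, yTest)     # accuracy on held-out data
    print("k={:>3}  train={:.2f}  test={:.2f}".format(k, train_acc, test_acc))

You would expect very large k to give low accuracy on both sets (underfitting), while k=1 typically gives near-perfect training accuracy but noticeably lower test accuracy, which is the signature of overfitting.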

Conclusion

The bias-variance trade-off appears in many different areas of machine learning. Every algorithm can be considered to have some degree of flexibility, so this is certainly not specific to kNN. The goal of finding the sweet spot of flexibility, a model that captures the patterns in the data well but still generalizes to new data, applies to essentially all algorithms.
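As a practical way of searching for that sweet spot, the sketch below (an illustrative addition, using the same assumed XTrain and yTrain arrays as above) estimates generalization accuracy for each candidate k with 5-fold cross-validation via cross_val_score from sklearn.model_selection (sklearn.cross_validation in older versions) and picks the best one.

# A minimal sketch: choosing k by cross-validation on the training set only.
# Assumes XTrain and yTrain exist from the earlier train_test_split.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

candidate_ks = [1, 3, 5, 11, 23, 50, 99]
cv_scores = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, XTrain, yTrain, cv=5)  # 5-fold cross-validated accuracy
    cv_scores.append(scores.mean())

best_k = candidate_ks[int(np.argmax(cv_scores))]
print("Best k by 5-fold cross-validation:", best_k)

Because the test set is never touched during this search, the score you later report on XTest remains an honest estimate of how well the chosen model generalizes.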
