Core Concepts:
- K-Nearest Neighbor
- Linear classifiers: SVM
- Linear classifiers: Softmax
- Two-layer neural network
- Image features
As a core task in Computer Vision, Image Classification has two basic data-driven approaches: the K-nearest neighbor classifier and the linear classifier.
In Machine Learning, the Data-Driven approach can be summarized into three key steps:
- Collect a dataset of images and labels
- Use Machine Learning algorithms to train a classifier
- Evaluate the classifier on new images
Image Classification
Let's talk about the Nearest Neighbor Classifier first.
Nearest Neighbor Classifier
There is a set of images containing different classes of objects, and among these images we want to find the images of cats.
In the first step of the data-driven approach to machine learning, we need to label all the images.
Here, we have 5 classes for training, which are labeled.
The core idea of the Nearest Neighbor Classifier is to compute a distance between the test image and every training image, and to predict the label of the closest training image.
To calculate the distance, we compare the images pixel by pixel as matrices.
Subtracting a training image from the test image on one color channel, taking absolute values, and summing gives the distance; both the test image and the training images are matrices. The training data consists of examples and corresponding labels: the examples have shape N x D, where N is the number of training examples and D is the number of features, and the labels form a 1-dimensional array of size N.
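Written as a formula, the distance used in the code below is the L1 distance, where $I_1$ and $I_2$ are the two images compared pixel-wise:

$$d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|$$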
import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # X is (N, D), y is (N,); the classifier simply memorizes the training data
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 distance: sum of absolute differences over the feature axis (axis=1)
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # the smallest distance means the closest training example
            min_index = np.argmin(distances)
            # predict the label of the nearest example
            Ypred[i] = self.ytr[min_index]
        return Ypred
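As a usage sketch (Xtr_rows, ytr, and Xte_rows are assumed names for the flattened training images, the training labels, and the flattened test images):

classifier = NearestNeighbor()
classifier.train(Xtr_rows, ytr)             # memorize the training data
Yte_predict = classifier.predict(Xte_rows)  # predict a label for every test image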
Intuitively, let's use a diagram to explain the calculation of distance.
Here, each image has size 3x32x32 (color channels, height in pixels, width in pixels). The training data has shape N x D, as discussed above: N is the number of examples and D is the length of the flattened feature vector, which is 3x32x32 = 3072 in this example.
The diagram shows each row of self.Xtr minus X[i,:]: after the pixel-wise subtraction we take the absolute value of each element, then sum along the horizontal axis to get the distances. The argmin() method then gives the index of the smallest distance, and that index is used to look up the label in self.ytr, which becomes the predicted label of the i-th test example.
Example image classification dataset: CIFAR-10.
This dataset consists of 10 classes. The images of each class are shown below:
Suppose we are given the CIFAR-10 dataset for training and the images on the right for prediction. In the nearest neighbor classifier, we compute the L1 distance between each test image and every training image:
Then we find the smallest distance; the label of the training image with the smallest distance is the predicted label of the test image.
The choice of distance:
There are many ways to compute the distance between two vectors.
Another common choice is L2 distance.
distances = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1))
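Written as a formula, this L2 (Euclidean) distance compares the images pixel-wise and takes the square root of the sum of squared differences:

$$d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}$$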
K-Nearest Neighbor
Instead of using only the label of the single nearest image, K-Nearest Neighbor finds the top K closest images and lets them vote on the label of the test image.
Intuitively, higher values of k have a smoothing effect that makes the classifier more resistant to outliers [1].
Here's an example comparing an NN classifier with a 5-NN classifier. There are 3 classes (red, blue, green), and the colored regions show the decisions induced by the data. In the NN classifier, every data point carves out its own colored region, even the outlier points. The 5-NN classifier smooths over these irregularities: a region gets a color only when that color wins the vote among the 5 closest data points.
# The following are methods of a k-NN classifier that stores self.X_train and self.y_train.
def compute_distances_no_loops(self, X):
    """
    Compute the L2 distance between each test point in X and each training
    point in self.X_train using no explicit loops.
    Input / Output: same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    # Compute the L2 distance between all test points and all training points
    # without using any explicit loops, and store the result in dists.
    # Output: sqrt((x - y)^2), expanded as (x - y)^2 = x^2 + y^2 - 2xy
    test_sum = np.sum(np.square(X), axis=1)              # shape (num_test,)
    train_sum = np.sum(np.square(self.X_train), axis=1)  # shape (num_train,)
    inner_product = np.dot(X, self.X_train.T)            # shape (num_test, num_train)
    # reshape test_sum to a column so broadcasting produces (num_test, num_train)
    dists = np.sqrt(test_sum.reshape(-1, 1) + train_sum - 2 * inner_product)
    return dists
def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.
    Input:
    dists - A num_test x num_train array where dists[i, j] gives the distance
            between the ith test point and the jth training point.
    Output:
    y - A vector of length num_test where y[i] is the predicted label for the
        ith test point.
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # labels of the k nearest neighbors to the ith test point,
        # found by sorting the ith row of the distance matrix
        y_indices = np.argsort(dists[i, :])
        closest_y = self.y_train[y_indices[:k]]
        # majority vote: the most common label among the k nearest neighbors
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred
But what value of k is the best choice? And which distance should we use (L1 or L2)?
These are all called hyperparameters.
Validation sets for Hyperparameter tuning
In the previous section, we talked about training data and test data. The classifier is learned from the training data, and the test data is used to evaluate its predictions on new images. The test set must not be used to tweak the hyperparameters.
So we introduce new data called the validation set, which is split off from the training data but smaller.
If the training data is small, cross-validation might be the better way to tune the hyperparameters.
cross-validation
The idea of cross-validation is to split the training data into folds, use each fold in turn as the validation set, and average the results, as shown below:
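As a minimal sketch of 5-fold cross-validation for choosing k (X_train, y_train, and a KNearestNeighbor class whose predict takes a k argument are assumptions for illustration):

import numpy as np

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 20, 50, 100]

# split the training data into folds
X_folds = np.array_split(X_train, num_folds)
y_folds = np.array_split(y_train, num_folds)

k_to_accuracy = {}
for k in k_choices:
    fold_accuracies = []
    for i in range(num_folds):
        # fold i is the validation set, the remaining folds form the training set
        X_val, y_val = X_folds[i], y_folds[i]
        X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])

        classifier = KNearestNeighbor()          # hypothetical k-NN classifier
        classifier.train(X_tr, y_tr)
        y_pred = classifier.predict(X_val, k=k)  # assumed predict(X, k) interface
        fold_accuracies.append(np.mean(y_pred == y_val))
    # average the accuracy over the folds for this value of k
    k_to_accuracy[k] = np.mean(fold_accuracies)

best_k = max(k_to_accuracy, key=k_to_accuracy.get)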
Linear Classifier
In the previous section, we used distances to predict labels. KNN has two drawbacks: first, it needs space to store all of the training data for future comparison with the test data, and the bigger the dataset, the more space it costs; second, comparing each test image against every training image is computationally expensive at prediction time.
To develop a more powerful approach, two major components are needed: 1. a score function, which maps the raw data to class scores; 2. a loss function, which quantifies the difference between the predicted values and the actual values.
Parameterized mapping from images to label scores
Let's take a concrete example. In the CIFAR-10 dataset, each image has size 3x32x32, so D = 3072. We have a training dataset of images $x_i \in \mathbb{R}^D$, each associated with a label $y_i$, where $i = 1 \dots N$ and $y_i \in \{1, \dots, K\}$. In CIFAR-10, N = 50,000 and K = 10.
In the linear classifier, we introduce a weight matrix W and a bias vector b; the bias influences the output scores without interacting with the input data.
The score function can be defined as:

$$f(x_i, W, b) = W x_i + b$$

In this function, the shape of $x_i$ is (3072, 1), W is (10, 3072), and b is (10, 1), so the shape of $f(x_i, W, b)$ is (10, 1).
The following diagram shows an example of mapping an image to class scores.
In this example, we assume the image has only 4 pixels (ignoring the color channels), and we have 3 classes: red (cat), green (dog), and blue (ship). The shape of W is 3x4, b is 3x1, and $x_i$ is 4x1. A numpy sketch of this computation follows the notes below.
Notice:
1. When we calculate $W x_i$, if the stored shapes of W and $x_i$ are not aligned for matrix multiplication, we need to transpose one of them (.T) so that the dimensions match.
2. The images are stretched into high-dimensional column vectors.
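A minimal numpy sketch of this mapping (the pixel values and the random weights here are made up for illustration):

import numpy as np

x = np.array([[56], [231], [24], [2]], dtype=np.float64)  # 4-pixel image as a 4x1 column
W = np.random.randn(3, 4)   # one row of weights per class (cat, dog, ship)
b = np.random.randn(3, 1)   # one bias per class

scores = W.dot(x) + b       # shape (3, 1): one score per class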
As we saw above, each row of the weight matrix W is the classifier of one of the classes. We can think of the score function as a linear equation:
- In a linear equation, W plays the role of the slope. In the score function, where W is a matrix and b is a vector, each row of W defines a hyperplane; changing W rotates the hyperplanes in different directions.
- In a linear equation, the parameter b is the intercept (the line y = b when the slope term vanishes); changing b produces a line parallel to the original one. In the score function, the bias b likewise shifts each hyperplane without rotating it.
Bias trick
It is a little redundant to keep track of the weights and the bias separately. The common trick is to combine these two parameters into one by appending one more dimension to $x_i$ that is always the constant 1. The new score function becomes:

$$f(x_i, W) = W x_i$$

With the CIFAR-10 example, $x_i$ is now [3073 x 1] instead of [3072 x 1], and W is now [10 x 3073] instead of [10 x 3072]. The illustration and the code sketch below show this more clearly:
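A minimal sketch of the bias trick, reusing the assumed 4-pixel example from above:

import numpy as np

x = np.array([[56], [231], [24], [2]], dtype=np.float64)  # 4x1 column vector
W = np.random.randn(3, 4)
b = np.random.randn(3, 1)

# append a constant 1 to x and fold b into W as an extra column
x_ext = np.vstack([x, [[1.0]]])   # shape (5, 1)
W_ext = np.hstack([W, b])         # shape (3, 5)

assert np.allclose(W_ext.dot(x_ext), W.dot(x) + b)  # same scores as before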
Image preprocessing
In the example above, we used the raw pixel values, which range from [0, 255]. In Machine Learning, it is common to normalize the input features; here, each pixel can be thought of as a feature.
First we subtract the mean from each feature (mean centering), after which the pixel values range from approximately [-127, 127].
A further common preprocessing step is to scale each input feature so that its values range from [-1, 1], as in the sketch below.
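A minimal sketch of these two preprocessing steps (X_train and X_test are assumed arrays of flattened images with raw pixel values in [0, 255]):

import numpy as np

mean_image = np.mean(X_train, axis=0)        # per-feature mean over the training set
X_train_centered = X_train - mean_image      # values now roughly in [-127, 127]
X_train_scaled = X_train_centered / 127.5    # values now roughly in [-1, 1]

# the test data must be preprocessed with the statistics of the training set
X_test_scaled = (X_test - mean_image) / 127.5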
There are some situations that are hard for a linear classifier to handle, for example classes that cannot be separated by a single hyperplane:
Loss function
In the previous section, we defined a score function to compute the class scores. Unfortunately, we received a terrible score on the class cat.
The score of the cat is 2.9, which is smaller than the score of the frog image.
The bad cat score reminds us that we need to find a good W. The loss function helps us measure how good the scores are: the loss is high if we are doing a poor job, and low otherwise.
Let's define this loss concretely. There are N training examples, and the dataset is $\{(x_i, y_i)\}_{i=1}^N$, where $x_i$ is the image and $y_i$ is the label. The loss over the dataset is:

$$L = \frac{1}{N} \sum_i L_i\big(f(x_i, W), y_i\big)$$

First we compute the scores of $x_i$ with the score function $f(x_i, W)$. Then we compute the loss $L_i$ of each example and sum them. Finally, the average of this sum is the full loss.
Multiclass Support Vector Machine loss
The first loss we will discuss is the Multiclass Support Vector Machine loss. The SVM loss wants the correct class to have a higher score than each incorrect class by at least a margin $\Delta$.
Recall the process of obtaining the scores: the score function gives us a vector of class scores, which we save into a vector s. The score for the j-th class is the j-th element, $s_j$, and $y_i$ is the actual (correct) class. The formula is:

$$L_i = \sum_{j \neq y_i} \max\big(0,\; s_j - s_{y_i} + \Delta\big)$$
Let's take three classes, for example:
Here, we have 3 classes and 3 training examples. The scores of the first column, which is a cat image, are 3.2 for the correct class (cat), 5.1, and -1.7 for the other two classes. The margin $\Delta$ is set to 1.
From the result, we can see that the example that is classified correctly gets a smaller loss than the other two; a worked computation for the first column is shown below.
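As a worked example for the first column (correct class score 3.2, other scores 5.1 and -1.7, $\Delta = 1$):

$$L_1 = \max(0,\; 5.1 - 3.2 + 1) + \max(0,\; -1.7 - 3.2 + 1) = \max(0,\; 2.9) + \max(0,\; -3.9) = 2.9 + 0 = 2.9$$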
In the SVM loss, we want the correct class's score to be higher than the incorrect class's score with an added margin. If any other class's score is inside the red area, there will be an accumulated loss. Otherwise, the loss is 0.
In this section, we use $f(x_i; W) = W x_i$ as the score function, so the loss function can be rewritten as:

$$L_i = \sum_{j \neq y_i} \max\big(0,\; w_j^T x_i - w_{y_i}^T x_i + \Delta\big)$$

where $w_j$ is the j-th row of W reshaped as a column, as we have discussed before.
There is a problem with how we set the parameter W. Suppose we have a dataset and a W that classifies this data correctly, which means $L_i = 0$ for all i. Such a W is not unique.
For example, multiplying W by any scalar $\lambda > 1$ gives a new parameter $\lambda W$ whose loss is still zero. In other words, we want to add something to the loss to remove this ambiguity in W.
We can do so by extending the loss with a regularization penalty R(W).
1. The most common penalty is the squared L2 norm, $R(W) = \sum_k \sum_l W_{k,l}^2$, which discourages large weights.
2. If we want a sparse model, meaning more zeros in W, the L1 norm, $R(W) = \sum_k \sum_l |W_{k,l}|$, might be the better choice.
With the expressions above, the new loss with the regularization penalty becomes:

$$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$$

or, written out fully:

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\big(0,\; f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\big) + \lambda \sum_k \sum_l W_{k,l}^2$$
Let's take an example to explain this more clearly.
Suppose the input vector is $x = [1, 1, 1, 1]$ and we have two weight vectors $w_1 = [1, 0, 0, 0]$ and $w_2 = [0.25, 0.25, 0.25, 0.25]$. Then $w_1^T x = w_2^T x = 1$: both weights give the same dot product, but the L2 penalty of $w_1$ is 1.0 while that of $w_2$ is 0.25. Since its penalty is smaller, $w_2$ is the better choice. Intuitively, the values of $w_2$ are smaller and more diffuse, which encourages the classifier to take all input dimensions into account.
Hyperparameter choice of $\Delta$
We haven't discussed how to choose $\Delta$. In practice, it is safe to set $\Delta = 1.0$ in all cases.
Now that we have the formula of the loss, the next step is to find the weights W that minimize it. A minimal code sketch of the full SVM loss is shown below.
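As a sketch of the full SVM loss with L2 regularization (the function name and arguments are assumptions for illustration; X is assumed to already use the bias trick):

import numpy as np

def svm_loss(X, y, W, delta=1.0, reg=1e-3):
    # X: (N, D) inputs, y: (N,) correct class indices, W: (K, D) weights
    N = X.shape[0]
    scores = X.dot(W.T)                                 # (N, K) class scores
    correct = scores[np.arange(N), y].reshape(-1, 1)    # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + delta)   # hinge applied to every class
    margins[np.arange(N), y] = 0                        # do not count the correct class
    data_loss = np.sum(margins) / N                     # average over the examples
    reg_loss = reg * np.sum(W * W)                      # L2 regularization penalty
    return data_loss + reg_loss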
Softmax classifier
As a multiclass classifier, Softmax interprets the raw classifier scores as probabilities.
In the example above, the score function computed the class scores for the cat image. Softmax interprets these scores as probabilities using the formula below, where $s_j$ is the j-th class's score and $s_{y_i}$ is the score of the actual label:

$$P(y_i \mid x_i) = \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}$$

We first exponentiate each score we computed before, $e^{s_j}$, to obtain an unnormalized probability.
As we can see, these values are unnormalized; we need the probabilities to sum to 1, so we normalize them. The per-example loss $L_i$ is the negative log probability of the correct class:

$$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$

The full loss is the average of $L_i$ over all training examples (plus the regularization penalty):

$$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$$

The normalized probabilities then sum to 1; a minimal code sketch of this computation is shown below.
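A minimal numpy sketch of this computation for a single example (the scores reuse the 3.2 / 5.1 / -1.7 numbers from earlier purely for illustration, with class 0 taken as the correct class):

import numpy as np

s = np.array([3.2, 5.1, -1.7])   # unnormalized class scores
y = 0                            # index of the correct class

exp_s = np.exp(s)                # unnormalized probabilities
probs = exp_s / np.sum(exp_s)    # normalized probabilities, they sum to 1
loss = -np.log(probs[y])         # cross-entropy loss for this example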
Practical issues: Numeric stability
When we compute the softmax, the intermediate terms $e^{s_j}$ and $\sum_j e^{s_j}$ might be very large due to the exponentials, which is numerically unstable. We can multiply the numerator and denominator by a constant C:

$$\frac{e^{s_{y_i}}}{\sum_j e^{s_j}} = \frac{C e^{s_{y_i}}}{C \sum_j e^{s_j}} = \frac{e^{s_{y_i} + \log C}}{\sum_j e^{s_j + \log C}}$$

A common choice for C is to set $\log C = -\max_j s_j$, which in code looks like:
s = np.array([123, 456, 789])                  # the scores of 3 classes
probabilities = np.exp(s) / np.sum(np.exp(s))  # problem: the exponentials may blow up

# instead: shift the scores so the highest value is 0
s -= np.max(s)                                 # s becomes [-666, -333, 0]
probabilities = np.exp(s) / np.sum(np.exp(s))  # safe to compute, same result
Naming conventions:
1. SVM classifier uses hinge loss (max-margin loss).
2. Softmax classifier uses cross-entropy loss.
SVM vs. Softmax
Here is an example of the difference between the SVM and Softmax classifiers for one datapoint.
Briefly, the Softmax classifier computes a probability for each class; note that as the weights become smaller (for example, under stronger regularization), the normalized output probabilities become more diffuse and approach a uniform distribution.