CS231n-Lecture Note02-Optimization

Recall:

In the last lecture, we learned about image classifiers: nearest neighbor and K-NN, using cross-validation to tune hyperparameters. Because of the drawbacks of nearest neighbor and K-NN, we moved to a linear classifier, which reduces the cost of computation. To compute each class's score, we use a score function f(x_i, W) = W x_i. A loss function then compares the predicted class with the actual class. Two losses were discussed: hinge loss (SVM) and cross-entropy loss (softmax). The difference between the two is that cross-entropy loss interprets the class scores as probabilities.

To find the best W, we add a penalty term to the loss, which is called regularization. The common penalties are the L1 norm and the L2 norm.
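For example, with L2 regularization the full loss is the average data loss plus the penalty:

                        L = \frac{1}{N}\underset{i}{\sum}L_i + \lambda \underset{k}{\sum}\underset{l}{\sum}W_{k,l}^2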

Optimization

In the recall above, we used the loss function to quantify the quality of W. The next step is to minimize the loss.

Briefly:

Loss function: quantify the quality of W

Optimization: minimize the loss

The main thing we are doing is finding the best 'W'.

We often use walking downhill as a metaphor for the search for W. When we are going down the mountain, we want to take the fastest way down, so the thing to consider is which direction to step. There are two choices: one is random search, and the other is to follow the slope.

Random search is clearly a bad idea for this problem: it wastes a lot of time, since we would need many tries to stumble on a good W. It's expensive.

Let's discuss how to find W by following the slope.

Computing the gradient

Numerical gradient

In one dimension, the derivative of a function is:

                        \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}

(Figure: the definition of slope as the limit of the difference quotient.)

Here W is a vector of weights; we add h to each dimension in turn and recompute the loss, and the per-dimension slopes form the gradient dW.

In the first dimension, with h = 0.0001, the loss goes from 1.25347 to 1.25322 after adding h, so the formula gives dW[0] = (1.25322 - 1.25347) / 0.0001 = -2.5. We then loop over all dimensions to fill in dW.
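As a quick sanity check, here is that single-dimension computation in plain Python (the loss values are the ones quoted above):

fx  = 1.25347           # loss at the original W
fxh = 1.25322           # loss after adding h to the first dimension
h   = 0.0001
dW0 = (fxh - fx) / h    # finite-difference slope
print(dW0)              # approximately -2.5 (up to float rounding)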

In code, the full loop looks like this:

import numpy as np

def eval_numerical_gradient(f, x):
  """
  a naive implementation of numerical gradient of f at x
  - f should be a function that takes a single argument
  - x is the point (numpy array) to evaluate the gradient at
  """

  fx = f(x) # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x+h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h # increment by h
    fxh = f(x) # evaluate f(x + h)
    x[ix] = old_value # restore to previous value (very important!)

    # compute the partial derivative
    grad[ix] = (fxh - fx) / h # the slope
    it.iternext() # step to next dimension

  return grad

In practice, using the centered difference formula \frac{f(x+h)-f(x-h)}{2h} works better.
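Here is a minimal sketch of that variant (the name eval_numerical_gradient_centered is ours; only the inner loop changes from the code above):

def eval_numerical_gradient_centered(f, x):
  """ centered-difference version of eval_numerical_gradient """
  grad = np.zeros(x.shape)
  h = 0.00001

  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h
    fxph = f(x) # evaluate f(x + h)
    x[ix] = old_value - h
    fxmh = f(x) # evaluate f(x - h)
    x[ix] = old_value # restore to previous value

    grad[ix] = (fxph - fxmh) / (2 * h) # centered-difference slope
    it.iternext()

  return grad

Note that this costs two function evaluations per dimension instead of one.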

Let's compute the loss and its gradient on CIFAR-10 and try taking steps of different sizes.

# to use the generic code above we want a function that takes a single argument
# (the weights in our case) so we close over X_train and Y_train
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001 # random weight matrix
df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient

loss_original = CIFAR10_loss_fun(W) # the original loss
print('original loss: %f' % (loss_original, ))

# lets see the effect of multiple step sizes
for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
  step_size = 10 ** step_size_log
  W_new = W - step_size * df # new position in the weight space
  loss_new = CIFAR10_loss_fun(W_new)
  print('for step size %e new loss: %f' % (step_size, loss_new))

# prints:
# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036

In this code, we try 10 different step sizes and compare the resulting losses.

Notice:

  • The weight update is in the negative gradient direction, because we want the loss to decrease, not increase.
  • The step size (learning rate) decides the speed of descent, as the output above shows: steps that are too small make little progress, while steps that are too large overshoot and increase the loss.

Computing the gradient numerically is too slow and costly: it requires one loss evaluation per dimension of W. The other approach is the analytic gradient.

Analytic gradient

With the analytic gradient, we derive the gradient directly with calculus.

This approach involves a tradeoff:

1. It is fast and exact to compute, since calculus gives us a direct formula.

2. It is more error-prone to implement.

In practice, we therefore implement the analytic gradient and compare it against the numerical gradient, a procedure called a gradient check; see the sketch below.
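Here is a hedged sketch of a gradient check, assuming some analytic gradient function grad_analytic (a stand-in name) alongside the eval_numerical_gradient and CIFAR10_loss_fun defined above:

# grad_analytic is a hypothetical function returning dW from a formula
num_grad = eval_numerical_gradient(CIFAR10_loss_fun, W) # slow: one loss evaluation per weight
ana_grad = grad_analytic(W)

# elementwise relative error; small values (e.g. below 1e-5) suggest a match
rel_error = np.abs(num_grad - ana_grad) / np.maximum(np.abs(num_grad) + np.abs(ana_grad), 1e-8)
print('max relative error: %e' % rel_error.max())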

Let's take the SVM loss as an example:

                        L_i = \underset{j\neq y_i}{\sum} \max(0,\, w_j^T x_i - w_{y_i}^T x_i + \Delta)

Differentiating with respect to the correct class's weight row w_{y_i} gives:

                        \nabla_{w_{y_i}} L_i = -\left(\underset{j\neq y_i}{\sum} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\right) x_i

Here \mathbb{1} is the indicator function: it is one if the condition inside is true and zero otherwise.

Notice:

This gradient is only for the row of W that corresponds to the correct class y_i.

For the other rows, j \neq y_i, the gradient is

                        \nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i

Note the signs: each incorrect class that violates the margin gets gradient +x_i on its own row and contributes -x_i to the correct class's row, as the sketch below illustrates.
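As an illustration of these two formulas, here is a hedged sketch for a single training example (svm_loss_grad_single is our name; W has shape (num_classes, D) as in the code above, and delta is the margin \Delta):

def svm_loss_grad_single(W, x, y, delta=1.0):
  """ analytic SVM gradient dW for one example x with correct label y """
  scores = W.dot(x) # class scores, shape (num_classes,)
  dW = np.zeros(W.shape)
  for j in range(W.shape[0]):
    if j == y:
      continue
    margin = scores[j] - scores[y] + delta
    if margin > 0: # the indicator function fires
      dW[j] += x # +x_i for an incorrect class that violates the margin
      dW[y] -= x # -x_i accumulated on the correct class's row
  return dW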

Gradient Descent

The procedure of repeatedly evaluating the gradient and performing parameter updates is called gradient descent.

Vanilla Gradient Descent

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  # update weight with step size
  weights += - step_size * weights_grad # perform parameter update

Mini-batch Gradient Descent

With a huge dataset, it is wasteful to compute the full loss over the entire training set just to perform a single parameter update. A common approach is to compute the gradient over batches of the training data.

For example, CIFAR-10 has 50,000 training samples; we can take 256 samples as a batch:

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

There is an extreme case where the mini-batch contains only a single example. We call it Stochastic Gradient Descent (SGD); see the sketch below.
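In the same pseudocode style as above, SGD is just the mini-batch loop with a batch of one:

while True:
  data_batch = sample_training_data(data, 1) # sample a single example
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update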

I have written about other optimization methods in a separate article: Stochastic Gradient Descent (SGD) in Machine Learning.

