CS231n-Lecture Note02-Optimization

Recall:

In the last lecture, we learned about image classifiers: nearest neighbor and K-NN, using cross-validation to tune hyperparameters. Because of the drawbacks of nearest neighbor and K-NN, we moved to a linear classifier, which reduces the cost of computation. To compute each class's score, we use a score function f(x_i, W) = W x_i. A loss function then compares the predicted class with the actual class. Two losses were discussed: hinge loss (SVM) and cross-entropy loss (softmax). The difference between the two is that cross-entropy loss interprets the class scores as probabilities.

To find the best W, we add a penalty term to the loss, which is called regularization. The common penalties are the L1 norm and the L2 norm.
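For example, with L2 regularization the full loss is the average data loss plus the penalty:

                        L = \frac{1}{N}\underset{i}{\sum}L_i + \lambda \underset{k}{\sum}\underset{l}{\sum}W_{k,l}^2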

Optimization

In the recall above, we used the loss function to quantify the quality of W. The next step is to minimize the loss.

Briefly:

Loss function: quantify the quality of W

Optimization: minimize the loss

The main thing we are doing is finding the best 'W'.

We often use walking downhill as a metaphor for the search for W. When we are going down the mountain, we want to take the fastest way down, so the thing to consider is which direction to step. There are two choices: one is random search, and the other is to follow the slope.

Random search is clearly a bad idea for this problem: it wastes a lot of time, since we would need many tries to stumble on a good W. It's expensive.

Let's discuss how to find W by following the slope.

Computing the gradient

Numerical gradient

In one dimension, the derivative of a function is:

                        \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}

(Figure: the definition of slope as the limit of the difference quotient.)

Here W is a vector of weights; we add h to each dimension in turn and recompute the loss, and the per-dimension slopes form the gradient dW.

In the first dimension, with h = 0.0001, the loss goes from 1.25347 to 1.25322 after adding h, so the formula gives dW[0] = (1.25322 - 1.25347) / 0.0001 = -2.5. We then loop over all dimensions to fill in dW.
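As a quick sanity check, here is that single-dimension computation in plain Python (the loss values are the ones quoted above):

fx  = 1.25347           # loss at the original W
fxh = 1.25322           # loss after adding h to the first dimension
h   = 0.0001
dW0 = (fxh - fx) / h    # finite-difference slope
print(dW0)              # approximately -2.5 (up to float rounding)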

In code, the full loop looks like this:

import numpy as np

def eval_numerical_gradient(f, x):
  """
  a naive implementation of numerical gradient of f at x
  - f should be a function that takes a single argument
  - x is the point (numpy array) to evaluate the gradient at
  """

  fx = f(x) # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x+h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h # increment by h
    fxh = f(x) # evaluate f(x + h)
    x[ix] = old_value # restore to previous value (very important!)

    # compute the partial derivative
    grad[ix] = (fxh - fx) / h # the slope
    it.iternext() # step to next dimension

  return grad

In practice, using the centered difference formula \frac{f(x+h)-f(x-h)}{2h} works better.
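Here is a minimal sketch of that variant (the name eval_numerical_gradient_centered is ours; only the inner loop changes from the code above):

def eval_numerical_gradient_centered(f, x):
  """ centered-difference version of eval_numerical_gradient """
  grad = np.zeros(x.shape)
  h = 0.00001

  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h
    fxph = f(x) # evaluate f(x + h)
    x[ix] = old_value - h
    fxmh = f(x) # evaluate f(x - h)
    x[ix] = old_value # restore to previous value

    grad[ix] = (fxph - fxmh) / (2 * h) # centered-difference slope
    it.iternext()

  return grad

Note that this costs two function evaluations per dimension instead of one.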

Let's compute the loss and its gradient on CIFAR-10 and try taking steps of different sizes.

# to use the generic code above we want a function that takes a single argument
# (the weights in our case) so we close over X_train and Y_train
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001 # random weight matrix
df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient

loss_original = CIFAR10_loss_fun(W) # the original loss
print('original loss: %f' % (loss_original, ))

# lets see the effect of multiple step sizes
for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
  step_size = 10 ** step_size_log
  W_new = W - step_size * df # new position in the weight space
  loss_new = CIFAR10_loss_fun(W_new)
  print('for step size %e new loss: %f' % (step_size, loss_new))

# prints:
# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036

In this code, we try 10 different step sizes and compare the resulting losses.

Notice:

  • The weight update is in the negative gradient direction, because we want the loss to decrease, not increase.
  • The step size (learning rate) decides the speed of descent, as the output above shows: steps that are too small make little progress, while steps that are too large overshoot and increase the loss.

Computing the gradient numerically is too slow and costly: it requires one loss evaluation per dimension of W. The other approach is the analytic gradient.

Analytic gradient

With the analytic gradient, we derive the gradient directly with calculus.

This approach involves a tradeoff:

1. It is fast and exact to compute, since calculus gives us a direct formula.

2. It is more error-prone to implement.

In practice, we therefore implement the analytic gradient and compare it against the numerical gradient, a procedure called a gradient check; see the sketch below.
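Here is a hedged sketch of a gradient check, assuming some analytic gradient function grad_analytic (a stand-in name) alongside the eval_numerical_gradient and CIFAR10_loss_fun defined above:

# grad_analytic is a hypothetical function returning dW from a formula
num_grad = eval_numerical_gradient(CIFAR10_loss_fun, W) # slow: one loss evaluation per weight
ana_grad = grad_analytic(W)

# elementwise relative error; small values (e.g. below 1e-5) suggest a match
rel_error = np.abs(num_grad - ana_grad) / np.maximum(np.abs(num_grad) + np.abs(ana_grad), 1e-8)
print('max relative error: %e' % rel_error.max())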

Let's take the SVM loss as an example:

                        L_i = \underset{j\neq y_i}{\sum} \max(0,\, w_j^T x_i - w_{y_i}^T x_i + \Delta)

Differentiating with respect to the correct class's weight row w_{y_i} gives:

                        \nabla_{w_{y_i}} L_i = -\left(\underset{j\neq y_i}{\sum} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\right) x_i

Here \mathbb{1} is the indicator function: it is one if the condition inside is true and zero otherwise.

Notice:

This gradient is only for the row of W that corresponds to the correct class y_i.

For the other rows, j \neq y_i, the gradient is

                        \nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i

Note the signs: each incorrect class that violates the margin gets gradient +x_i on its own row and contributes -x_i to the correct class's row, as the sketch below illustrates.
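As an illustration of these two formulas, here is a hedged sketch for a single training example (svm_loss_grad_single is our name; W has shape (num_classes, D) as in the code above, and delta is the margin \Delta):

def svm_loss_grad_single(W, x, y, delta=1.0):
  """ analytic SVM gradient dW for one example x with correct label y """
  scores = W.dot(x) # class scores, shape (num_classes,)
  dW = np.zeros(W.shape)
  for j in range(W.shape[0]):
    if j == y:
      continue
    margin = scores[j] - scores[y] + delta
    if margin > 0: # the indicator function fires
      dW[j] += x # +x_i for an incorrect class that violates the margin
      dW[y] -= x # -x_i accumulated on the correct class's row
  return dW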

Gradient Descent

The procedure of repeatedly evaluating the gradient and performing parameter updates is called gradient descent.

Vanilla Gradient Descent

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  # update weight with step size
  weights += - step_size * weights_grad # perform parameter update

Mini-batch Gradient Descent

With a huge dataset, it is wasteful to compute the full loss over the entire training set just to perform a single parameter update. A common approach is to compute the gradient over batches of the training data.

For example, CIFAR-10 has 50,000 training samples; we can take 256 samples as a batch:

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

There is an extreme case where the mini-batch contains only a single example. We call it Stochastic Gradient Descent (SGD); see the sketch below.
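In the same pseudocode style as above, SGD is just the mini-batch loop with a batch of one:

while True:
  data_batch = sample_training_data(data, 1) # sample a single example
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update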

I have written about other optimization methods in a separate article: Stochastic Gradient Descent (SGD) in Machine Learning.

