Loss Functions and Optimization
Goals of this lecture
- Define a loss function
- Come up with a way of finding the parameters that minimize the loss function (optimization)
The remaining problem from the last lecture
- How to choose the weight matrix $W$?
![slide](https://s1.ax1x.com/2020/11/08/BTZxgK.png)
Loss function
A loss function tells how good our current classifier is.
We have a dataset $\{(x_i, y_i)\}_{i=1}^N$, where $x_i$ is an image and $y_i$ is its (integer) label.
The total loss is defined as follows:

$$L = \frac{1}{N}\sum_i L_i(f(x_i, W), y_i)$$

which is the average of the per-example losses.
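As a minimal sketch of this formula (the per-example loss values here are hypothetical), the total loss is just the mean of the per-example losses:

```python
import numpy as np

def total_loss(per_example_losses):
    # L = (1/N) * sum_i L_i : the mean of the per-example losses.
    return np.mean(per_example_losses)

# hypothetical per-example losses for N = 3 training examples
print(total_loss([2.9, 0.0, 12.9]))  # ≈ 5.27
```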
Multiclass SVM Loss
Given an example $(x_i, y_i)$, where $x_i$ is the image and $y_i$ is the (integer) label, and using the shorthand $s = f(x_i, W)$ for the score vector:
The SVM loss has the form:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$

If an incorrect class score is smaller than the correct class score by at least the margin, that term of the loss is 0. In this case the safety margin is set to one; the choice of margin depends on our needs.
- We loop over all the incorrect classes $j \neq y_i$ and sum the terms.
![slide](https://s1.ax1x.com/2020/11/08/BTZLNR.png)
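A worked example of the hinge loss, using hypothetical scores for a 3-class problem (only incorrect classes that violate the margin contribute):

```python
import numpy as np

def svm_loss_i(scores, y, margin=1.0):
    # Hinge loss: sum over incorrect classes of max(0, s_j - s_{y_i} + margin).
    scores = np.asarray(scores, dtype=float)
    terms = np.maximum(0, scores - scores[y] + margin)
    terms[y] = 0  # the correct class contributes nothing
    return terms.sum()

# hypothetical scores for a 3-class problem; the correct class is index 0
print(svm_loss_i([3.2, 5.1, -1.7], y=0))  # ≈ 2.9
```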
- What if we use $L = \frac{1}{N}\sum_i L_i(f(x_i, W), y_i)^2$ instead?
This squared loss is not linear in the per-example losses and behaves quite differently; it can be useful sometimes, depending on how much you care about large errors versus small ones.
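A quick sketch of the difference (the scores are made up): squaring turns a loss that grows linearly with margin violations into one that punishes large violations disproportionately:

```python
import numpy as np

def hinge(scores, y, margin=1.0):
    # Standard multiclass hinge loss for a single example.
    terms = np.maximum(0, np.asarray(scores, dtype=float) - scores[y] + margin)
    terms[y] = 0
    return terms.sum()

scores = [1.0, 4.0, -2.0]     # hypothetical scores, correct class 0
print(hinge(scores, 0))       # 4.0 : grows linearly with the violation
print(hinge(scores, 0) ** 2)  # 16.0: the squared version punishes it harder
```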
Example Code
```python
import numpy as np

def L_i_vectorized(x, y, W, margin=1.0):
    # Compute all class scores at once, then every hinge term in one shot.
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0  # the correct class contributes no loss
    loss_i = np.sum(margins)
    return loss_i
# pretty easy
```
![slide](https://s1.ax1x.com/2020/11/08/BTZO41.png)
Scaling $W$ just changes the gaps between the scores.
![slide](https://s1.ax1x.com/2020/11/08/BTZzjO.png)
![slide](https://s1.ax1x.com/2020/11/08/BTe9De.png)
In practice, L2 regularization (the squared Euclidean norm of the weights) is used most often.
![slide](https://s1.ax1x.com/2020/11/08/BTepuD.png)
In this example $w_1$ and $w_2$ produce the same score, but L1 regularization generally prefers $w_1$ because it contains more zeros (sparsity), while L2 regularization prefers $w_2$ because its weight is spread evenly across the inputs.
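A tiny numeric sketch of that preference, using hypothetical weight vectors that give the same score on $x = [1, 1, 1, 1]$:

```python
import numpy as np

# hypothetical weights: both produce the same score on x = [1, 1, 1, 1]
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])
x = np.ones(4)

print(w1 @ x, w2 @ x)                      # 1.0 1.0  (same score)
print(np.abs(w1).sum(), np.abs(w2).sum())  # 1.0 1.0  (same L1 penalty here)
print((w1 ** 2).sum(), (w2 ** 2).sum())    # 1.0 0.25 (L2 prefers the spread w2)
```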
The Multiclass SVM loss only cares about the gap between the correct class score and the incorrect ones.
Softmax Classifier
![slide](https://s1.ax1x.com/2020/11/08/BTeiEd.png)
We just want to make the probability of the true class as close to 1 as possible (the closer the better, equal is best), so the loss can be chosen as $-\log$ of that probability:

$$L_i = -\log \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}$$
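A minimal softmax-loss sketch (the scores are hypothetical); subtracting the max score first avoids overflow in `exp` and cancels out in the ratio, so the probabilities are unchanged:

```python
import numpy as np

def softmax_loss_i(scores, y):
    # Shift by the max score for numerical stability; the shift cancels
    # in the ratio, so the probabilities are unchanged.
    s = np.asarray(scores, dtype=float)
    s = s - s.max()
    p = np.exp(s) / np.exp(s).sum()
    return -np.log(p[y])

print(softmax_loss_i([3.2, 5.1, -1.7], y=0))  # ≈ 2.04
```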
![slide](https://s1.ax1x.com/2020/11/08/BTeCHH.png)
To reach exactly zero loss, the correct class score would have to go to infinity, and computers don't like that.
- Debugging tip: with a small random initialization, all scores are roughly equal, so the first loss should come out to about $\log C$ for $C$ classes.
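This sanity check can be sketched as follows (class count, input size, and scale are made up for illustration):

```python
import numpy as np

# Sanity check: with a small random W, all scores are near zero, so every
# class gets probability ~1/C and the softmax loss should be about log(C).
C, D = 10, 3072                      # e.g. 10 classes, 32*32*3 pixel inputs
rng = np.random.default_rng(0)
W = rng.normal(scale=1e-4, size=(C, D))
x = rng.normal(size=D)

s = W @ x
p = np.exp(s - s.max())
p /= p.sum()
loss = -np.log(p[3])                 # any label works; all p_j are ~1/C
print(loss, np.log(C))               # both ≈ 2.30
```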
![slide](https://s1.ax1x.com/2020/11/08/BTek4I.png)
![slide](https://s1.ax1x.com/2020/11/08/BTeECt.png)
Optimization
Random Search - The Naive but Simplest way
Really Slow !!!
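A sketch of random search on a toy loss (the target and trial count are made up); it only ever keeps the best random guess, which is why it scales so badly:

```python
import numpy as np

def random_search(loss_fn, shape, trials=1000):
    # Naive optimization: try random weight matrices, keep the best seen.
    best_loss, best_W = float("inf"), None
    for _ in range(trials):
        W = np.random.randn(*shape)
        loss = loss_fn(W)
        if loss < best_loss:
            best_loss, best_W = loss, W
    return best_W, best_loss

# toy quadratic loss (hypothetical): distance of W from an all-ones target
best_W, best_loss = random_search(lambda W: np.sum((W - 1.0) ** 2), (2, 2))
print(best_loss)  # small-ish, but needs many trials even for 4 parameters
```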
Gradient Descent
We compute the gradient of the loss with respect to $W$ and step downhill toward the bottom (possibly only a local minimum).
Code
```python
# Vanilla Gradient Descent
while True:
    weight_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weight_grad  # perform parameter update
```
The step size, also called the learning rate, is an important hyperparameter.
![slide](https://s1.ax1x.com/2020/11/08/BTeV8P.png)
Since $N$ might be very large, we sample a small subset called a minibatch and use it to estimate the true gradient.
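Minibatch gradient descent can be sketched on a toy least-squares problem (the data, sizes, and hyperparameters are made up); each step estimates the gradient from a random sample instead of the full dataset:

```python
import numpy as np

# Toy least-squares problem trained with minibatch SGD.
rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true

w = np.zeros(D)
step_size, batch_size = 0.05, 64
for _ in range(500):
    idx = rng.choice(N, batch_size, replace=False)  # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient of the MSE
    w -= step_size * grad                           # vanilla update step

print(np.allclose(w, w_true, atol=1e-2))  # True
```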
![slide](https://s1.ax1x.com/2020/11/08/BTeZgf.png)
![slide](https://s1.ax1x.com/2020/11/08/BTenKS.png)
![slide](https://s1.ax1x.com/2020/11/08/BTeuDg.png)
Color Feature
![slide](https://s1.ax1x.com/2020/11/08/BTeQEj.png)
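One way such a color feature can be sketched (the bin count and image are made up) is a per-channel histogram that throws away all spatial information:

```python
import numpy as np

def color_histogram(img, bins=8):
    # Count how many pixels fall into each intensity bucket per channel,
    # ignoring where the pixels are -- a simple global color feature.
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(img.shape[-1])]
    return np.concatenate(hists)

img = np.random.randint(0, 256, size=(32, 32, 3))  # hypothetical RGB image
feat = color_histogram(img)
print(feat.shape, feat.sum())  # (24,) and 32*32*3 = 3072 pixels counted
```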
Gradient-based features (e.g. histograms of oriented gradients) extract the edge information.
![slide](https://s1.ax1x.com/2020/11/08/BTelUs.png)
Bag of Words (an idea borrowed from NLP)
![slide](https://s1.ax1x.com/2020/11/08/BTeG80.png)
Cluster different image patches sampled from the images to build a vocabulary of visual words.
![slide](https://s1.ax1x.com/2020/11/08/BTe15n.png)
- Differences
  - The classical pipeline extracts the features first and feeds them into a linear classifier.
  - A Convolutional Neural Network learns the features automatically during the training process.