Deep learning 2023/07/08~2023/07/10
Backpropagation
- By the chain rule:
$$\frac{\partial C}{\partial w}=\frac{\partial z}{\partial w}\frac{\partial C}{\partial z}$$
- Forward pass:
$$\frac{\partial z}{\partial w} \tag{1}$$
Compute (1) for all parameters:
$$\frac{\partial z}{\partial w_1}=? \tag{1a}$$
$$\frac{\partial z}{\partial w_2}=? \tag{1b}$$
The answer is simply the value of the input connected to that weight.
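A one-line worked example, assuming a neuron with pre-activation $z = w_1x_1 + w_2x_2 + b$:
$$z = w_1x_1+w_2x_2+b \quad\Rightarrow\quad \frac{\partial z}{\partial w_1}=x_1,\qquad \frac{\partial z}{\partial w_2}=x_2$$
so the forward pass only needs to remember each weight's input.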
- Backward pass:
$$\frac{\partial C}{\partial z} \tag{2}$$
Compute (2) for every activation-function input z, starting from the output layer and moving backward.
$$\frac{\partial C}{\partial z}=\frac{\partial a}{\partial z}\frac{\partial C}{\partial a} \tag{2a}$$
$$\frac{\partial C}{\partial a}=\frac{\partial z'}{\partial a}\frac{\partial C}{\partial z'} + \frac{\partial z''}{\partial a}\frac{\partial C}{\partial z''} \tag{2b}$$
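Concretely, suppose the neuron's output $a=\sigma(z)$ feeds two neurons in the next layer through weights $w_3$ and $w_4$ (illustrative names), so $z'=w_3a+\cdots$ and $z''=w_4a+\cdots$. Then $\partial z'/\partial a=w_3$ and $\partial z''/\partial a=w_4$, and combining (2a) with (2b):
$$\frac{\partial C}{\partial z}=\sigma'(z)\left[w_3\frac{\partial C}{\partial z'}+w_4\frac{\partial C}{\partial z''}\right]$$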
- Backpropagation Summary
$$\text{Forward pass} \Rightarrow \frac{\partial z}{\partial w} = a$$
$$\text{Backward pass} \Rightarrow \frac{\partial C}{\partial z}$$
$$\text{Forward pass} \times \text{Backward pass} \Rightarrow \frac{\partial z}{\partial w}\frac{\partial C}{\partial z} = \frac{\partial C}{\partial w}$$
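A minimal numpy sketch of the two passes on a toy network (one sigmoid hidden neuron feeding a linear output, squared-error cost); all names and values are illustrative, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data and parameters (illustrative values)
x, y_hat = np.array([1.0, -2.0]), 1.0
w, b = np.array([0.5, -0.3]), 0.1      # hidden neuron weights/bias
v, c = 0.8, 0.0                        # output neuron weight/bias

# Forward pass: compute and cache each weight's input (dz/dw = a)
z = w @ x + b          # hidden pre-activation; dz/dw_i = x_i
a = sigmoid(z)         # hidden activation
z_out = v * a + c      # output pre-activation; dz_out/dv = a
y = z_out
C = 0.5 * (y - y_hat) ** 2

# Backward pass: compute dC/dz starting from the output layer
dC_dzout = y - y_hat                   # dC/dz_out
dC_da = v * dC_dzout                   # eq. (2b): sum over next-layer terms
dC_dz = a * (1 - a) * dC_da            # eq. (2a): da/dz = sigma'(z)

# Forward * Backward: dC/dw = (dz/dw) * (dC/dz)
dC_dw = x * dC_dz
dC_dv = a * dC_dzout
print(dC_dw, dC_dv)
```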
Regression
- Stock Market Forecast
- Self-driving Car
- Recommendation
- Estimating the Combat Power (CP) of a Pokémon after evolution
Estimating the Combat Power (CP) of a Pokémon after evolution
- Step 1: Model
- Step 2: Goodness of Function
- Loss function L:
- Input: a function; output: how bad it is
$$L(f)=L(w,b)$$
$$y=b+\sum_i w_i x_i$$
$$L(w,b)=\sum_n \left(\hat{y}^n-\left(b+\sum_i w_i x_i^n\right)\right)^2$$
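A short numpy sketch of this loss, assuming (as is standard for this example) a squared-error term per training pair; the data below are synthetic placeholders:

```python
import numpy as np

def loss(w, b, X, y_hat):
    """L(w, b) = sum over n of (y_hat^n - (b + w . x^n))^2."""
    y_pred = X @ w + b            # model: y = b + sum_i w_i * x_i
    return np.sum((y_hat - y_pred) ** 2)

X = np.random.rand(10, 1) * 100   # 10 examples, 1 feature (CP before evolution)
y_hat = 2.5 * X[:, 0] + 30        # synthetic targets, illustrative only
print(loss(np.array([2.0]), 10.0, X, y_hat))
```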
- Step 3: Best Function & Gradient Descent
- Best Function
$$f^*=\arg\min_f L(f)$$
$$w^*,b^*=\arg\min_{w,b}L(w,b)$$
- Gradient Descent
- Consider a loss function L(w) with one parameter w:
$$w^*=\arg\min_w L(w)$$
- Pick an initial value $w^0$
- Compute (1) and update:
$$\left.\frac{dL}{dw}\right|_{w=w^0} \tag{1}$$
$$w^1 \leftarrow w^0-\eta\left.\frac{dL}{dw}\right|_{w=w^0}$$
- η is called the “learning rate”
- Compute (2) and take the next step:
$$\left.\frac{dL}{dw}\right|_{w=w^1} \tag{2}$$
$$w^2 \leftarrow w^1-\eta\left.\frac{dL}{dw}\right|_{w=w^1}$$
- What about two parameters?
$$w^*,b^*=\arg\min_{w,b}L(w,b)$$
Compute the partial derivatives with respect to both w and b, and update the two parameters together.
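A minimal gradient-descent sketch for the one-feature case of $L(w,b)$; the learning rate and step count are arbitrary choices, not values from the notes:

```python
import numpy as np

def grad_descent(X, y_hat, eta=1e-5, steps=10_000):
    """Gradient descent on L(w, b) = sum_n (y_hat^n - (b + w * x^n))^2."""
    w, b = 0.0, 0.0
    x = X[:, 0]
    for _ in range(steps):
        err = y_hat - (b + w * x)        # residuals
        grad_w = -2.0 * np.sum(err * x)  # dL/dw
        grad_b = -2.0 * np.sum(err)      # dL/db
        w -= eta * grad_w                # w^{t+1} <- w^t - eta * dL/dw
        b -= eta * grad_b
    return w, b

X = np.random.rand(10, 1) * 100
y_hat = 2.5 * X[:, 0] + 30
print(grad_descent(X, y_hat))
```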
- Worry
- In linear regression the loss function L is convex, so there are no local optima.
- A more complex model does not always lead to better performance on testing data.
- This is overfitting.
How to do Classification
- Training data for Classification
- Classification as Regression?
- Take binary classification as an example.
- Training: Class 1 means the target is 1; Class 2 means the target is -1.
- Testing: output closer to 1 → class 1; closer to -1 → class 2 (see the sketch below).
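A small numpy sketch of this regression-based classifier on synthetic 2-D data (everything here is illustrative). One known drawback: squared error also penalizes examples that are "too correct" (outputs far beyond ±1), which motivates the ideal alternatives below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data: class 1 -> target +1, class 2 -> target -1
X1 = rng.normal([2, 2], 1, (50, 2))
X2 = rng.normal([-2, -2], 1, (50, 2))
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(50), -np.ones(50)])

# Least-squares fit of g(x) = w . x + b to the +/-1 targets
A = np.hstack([X, np.ones((100, 1))])
w_b, *_ = np.linalg.lstsq(A, t, rcond=None)

# Testing rule: output closer to +1 -> class 1, closer to -1 -> class 2
pred = np.where(A @ w_b > 0, 1, -1)
print("training accuracy:", np.mean(pred == t))
```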
- Ideal Alternatives
- Function (Model):
$$x \Rightarrow f(x)=\begin{cases} g(x)>0 & \text{Output = class 1} \\ \text{else} & \text{Output = class 2} \end{cases}$$
- Loss function:
$$L(f)=\sum_n \delta\left(f(x^n) \neq \hat{y}^n\right)$$
The number of times f gets incorrect results on the training data.
- Find the best function:
- Example: Perceptron, SVM
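A sketch of this 0-1 loss as code; because it is a count, it is not differentiable, which is why gradient descent cannot minimize it directly and methods like the Perceptron and SVM work with surrogates. The classifier `f` below is a hypothetical stand-in:

```python
import numpy as np

def zero_one_loss(f, X, y_hat):
    """L(f) = number of training examples that f classifies incorrectly."""
    return int(np.sum(f(X) != y_hat))

# Illustrative use: a trivial classifier that predicts class 1 when x_1 > 0
X = np.array([[1.0, 2.0], [-1.0, 0.5], [3.0, -1.0]])
y_hat = np.array([1, -1, -1])
f = lambda X: np.where(X[:, 0] > 0, 1, -1)
print(zero_one_loss(f, X, y_hat))   # -> 1 (the third example is wrong)
```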
- Gaussian Distribution
$$f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right\}$$
- Input: vector x; output: the probability density of sampling x
- The shape of the function is determined by the mean μ and the covariance matrix Σ
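A direct numpy transcription of the density above (D is inferred from x; the values are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """f_{mu,Sigma}(x): multivariate Gaussian density evaluated at vector x."""
    D = x.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.linalg.det(sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

x = np.array([0.5, -0.5])
mu = np.array([0.0, 0.0])
sigma = np.eye(2)
print(gaussian_pdf(x, mu, sigma))
```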
- Maximum Likelihood
$$f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right\}$$
A Gaussian with any mean μ and covariance matrix Σ could have generated these points; maximum likelihood chooses the μ and Σ under which the observed data are most probable.
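Spelling this out for N training points $x^1,\dots,x^N$ of one class, the likelihood and its standard closed-form maximizers are:
$$L(\mu,\Sigma)=\prod_{n=1}^{N}f_{\mu,\Sigma}(x^n),\qquad \mu^*=\frac{1}{N}\sum_{n=1}^{N}x^n,\qquad \Sigma^*=\frac{1}{N}\sum_{n=1}^{N}(x^n-\mu^*)(x^n-\mu^*)^T$$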
$$x \Rightarrow P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}$$
- Three Steps
- Function Set (Model):
$$x \Rightarrow \begin{cases} P(C_1|x)>0.5 & \text{output: class 1} \\ \text{otherwise} & \text{output: class 2} \end{cases}$$
- Goodness of a function
- The mean μ and covariance Σ that maximize the likelihood (the probability of generating the data)
- Find the best function: easy
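Putting the three steps together as a numpy sketch (per-class ML estimates, then Bayes' rule; scipy's `multivariate_normal` evaluates the Gaussian density, and the data are synthetic):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class(X):
    """ML estimates for one class: sample mean and (biased) sample covariance."""
    mu = X.mean(axis=0)
    diff = X - mu
    return mu, diff.T @ diff / len(X)

def posterior_c1(x, X1, X2):
    """P(C1|x) via Bayes' rule with class-conditional Gaussians."""
    (mu1, s1), (mu2, s2) = fit_class(X1), fit_class(X2)
    n1, n2 = len(X1), len(X2)       # priors N1/(N1+N2), N2/(N1+N2); shared factor cancels
    l1 = multivariate_normal.pdf(x, mean=mu1, cov=s1)
    l2 = multivariate_normal.pdf(x, mean=mu2, cov=s2)
    return l1 * n1 / (l1 * n1 + l2 * n2)

rng = np.random.default_rng(1)
X1 = rng.normal([2, 2], 1, (80, 2))     # class 1 samples (illustrative)
X2 = rng.normal([-1, 0], 1, (40, 2))    # class 2 samples
print(posterior_c1(np.array([1.5, 1.0]), X1, X2))  # > 0.5 -> class 1
```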
- Posterior Probability
$$P(C_1|x)=\sigma(z) \qquad \text{(sigmoid)} \qquad z=\ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$$
$$z=\ln\frac{P(x|C_1)}{P(x|C_2)}+\ln\frac{P(C_1)}{P(C_2)}, \qquad \frac{P(C_1)}{P(C_2)}=\frac{\frac{N_1}{N_1+N_2}}{\frac{N_2}{N_1+N_2}}=\frac{N_1}{N_2}$$
$$P(x|C_1)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^1|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^1)^T (\Sigma^{1})^{-1}(x-\mu^1)\right\} \tag{3a}$$
$$P(x|C_2)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^2|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^2)^T (\Sigma^{2})^{-1}(x-\mu^2)\right\} \tag{3b}$$
- Substituting (3a) and (3b) into the first term of z:
$$\ln\frac{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^1|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^1)^T (\Sigma^{1})^{-1}(x-\mu^1)\right\}}{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma^2|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu^2)^T (\Sigma^{2})^{-1}(x-\mu^2)\right\}}$$
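Expanding the logarithm (the next step of the derivation), the $(2\pi)^{D/2}$ factors cancel and:
$$\ln\frac{P(x|C_1)}{P(x|C_2)}=\ln\frac{|\Sigma^2|^{1/2}}{|\Sigma^1|^{1/2}}-\frac{1}{2}(x-\mu^1)^T(\Sigma^1)^{-1}(x-\mu^1)+\frac{1}{2}(x-\mu^2)^T(\Sigma^2)^{-1}(x-\mu^2)$$
If the two classes share a covariance matrix ($\Sigma^1=\Sigma^2=\Sigma$), the quadratic terms in x cancel, z becomes linear in x, and the decision boundary is therefore linear.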