Chapter 1: Introduction
1 Machine Learning Introduction
1.1 AI vs ML vs DL
- AI: Enables machines to mimic human behavior.
- ML: Uses statistical methods to enable machines to improve with experience.
- DL: A kind of ML that makes multi-layer neural networks feasible.
1.2 Machine Learning Process
Data Collection → Data Preparation → Training → Evaluation → Tuning
1.3 Machine Learning Approaches
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
1.4 Supervised Learning
The goal is to learn the mapping between a set of inputs and outputs.
1.4.1 Classification
The output is a discrete category (a class label).
1.4.2 Regression
The output is a real-valued scalar.
1.5 Unsupervised Learning
Only input data is provided; there are no labeled example outputs to aim for.
1.5.1 Clustering
The most commonly used approach; it creates groups of data points that share similar characteristics.
1.5.2 Association
Used for recommending or finding related items.
1.5.3 Anomaly Detection
Used to separate and detect strange occurrences.
1.5.4 Dimensionality Reduction
Aims to find the most important features so the original feature set can be reduced.
1.6 Semi-supervised Learning
A mix of the supervised and unsupervised approaches.
1.6.1 Generative Adversarial Networks
GANs use two neural networks, a generator and a discriminator; by competing against each other, both become increasingly skilled.
1.7 Reinforcement Learning
In this approach, occasional positive and negative feedback is used to reinforce behavior.
2 Matrix Calculus
2.1 Matrix Calculus
2.1.1 Definition of the Jacobian Matrix
$f: R^{n} \rightarrow R^{m}, \quad \vec{y}=f(\vec{x}), \quad \vec{y} \in R^{m}, \quad \vec{x} \in R^{n}$
$\frac{\partial \vec{y}}{\partial \vec{x}} \Rightarrow$ Jacobian matrix
Example:
- 2 dimensions:
$f: R^{2} \rightarrow R, \quad y=f(x_{1}, x_{2})$
$\nabla f(x_{1}, x_{2})=\left[\frac{\partial f(x_{1}, x_{2})}{\partial x_{1}}, \quad \frac{\partial f(x_{1}, x_{2})}{\partial x_{2}}\right]$
$f: R^{2} \rightarrow R^{2}, \quad \vec{y}=\begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix}=\begin{bmatrix} f_{1}(x_{1}, x_{2}) \\ f_{2}(x_{1}, x_{2}) \end{bmatrix}$
$J_{x}=\begin{bmatrix} \nabla f_{1}(x_{1}, x_{2}) \\ \nabla f_{2}(x_{1}, x_{2}) \end{bmatrix}=\begin{bmatrix} \frac{\partial f_{1}(x_{1}, x_{2})}{\partial x_{1}} & \frac{\partial f_{1}(x_{1}, x_{2})}{\partial x_{2}} \\ \frac{\partial f_{2}(x_{1}, x_{2})}{\partial x_{1}} & \frac{\partial f_{2}(x_{1}, x_{2})}{\partial x_{2}} \end{bmatrix}$
- m dimensions:
$\vec{y} \in R^{m}, \quad \vec{x} \in R^{n}$:
$J_{x}=\frac{\partial \vec{y}}{\partial \vec{x}}=\begin{bmatrix} \nabla f_{1}(\vec{x}) \\ \nabla f_{2}(\vec{x}) \\ \vdots \\ \nabla f_{m}(\vec{x}) \end{bmatrix}$
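The Jacobian above can be checked numerically. The sketch below (not from the notes; `numerical_jacobian` and the sample `f` are illustrative choices) compares an analytic $2 \times 2$ Jacobian against central finite differences:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m by central differences."""
    x = np.asarray(x, dtype=float)
    m, n = len(f(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        # column j holds the partial derivatives with respect to x_j
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

# Illustrative f: R^2 -> R^2, f(x) = [x1*x2, x1 + x2^2]
f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
x = np.array([2.0, 3.0])

# Analytic Jacobian: rows are the gradients of f1 and f2
J_exact = np.array([[x[1], x[0]],
                    [1.0, 2 * x[1]]])
assert np.allclose(numerical_jacobian(f, x), J_exact, atol=1e-4)
```

Each row of the Jacobian is the gradient of one component function, matching the definition above.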
2.1.2 Vector Sum Reduction
$y=\sum_{i=1}^{n} f_{i}(\vec{x}): R^{n} \rightarrow R, \quad \vec{x} \in R^{n}$.
In general, for $\vec{y}=f(\vec{x})$ with $\vec{x} \in R^{n}$, $\vec{y} \in R^{m}$ ($R^{n} \rightarrow R^{m}$), $J_{x}$ is $m \times n$; here $R^{n} \rightarrow R$, so $J_{x}$ is $1 \times n$:
$\frac{\partial y}{\partial \vec{x}}=\left[\frac{\partial y}{\partial x_{1}}, \frac{\partial y}{\partial x_{2}}, \cdots, \frac{\partial y}{\partial x_{n}}\right]$
$=\left[\frac{\partial}{\partial x_{1}} \sum_{i=1}^{n} f_{i}(\vec{x}), \frac{\partial}{\partial x_{2}} \sum_{i=1}^{n} f_{i}(\vec{x}), \cdots, \frac{\partial}{\partial x_{n}} \sum_{i=1}^{n} f_{i}(\vec{x})\right]$
$=\left[\sum_{i=1}^{n} \frac{\partial f_{i}(\vec{x})}{\partial x_{1}}, \sum_{i=1}^{n} \frac{\partial f_{i}(\vec{x})}{\partial x_{2}}, \cdots, \sum_{i=1}^{n} \frac{\partial f_{i}(\vec{x})}{\partial x_{n}}\right]$
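A quick numerical sketch of sum reduction (illustrative, not from the notes): with $f_i(\vec{x}) = x_i^2$, the $1 \times n$ gradient of $y = \sum_i x_i^2$ should be $[2x_1, \ldots, 2x_n]$.

```python
import numpy as np

# y = sum_i f_i(x) with f_i(x) = x_i**2, so dy/dx_i = 2*x_i.
def y(x):
    return np.sum(x ** 2)

def grad_numeric(f, x, h=1e-6):
    """1 x n gradient of a scalar-valued f by central differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grad_numeric(y, x), 2 * x, atol=1e-4)
```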
2.1.3 Vector Chain Rules
1) Single-variable Chain Rule: $\frac{dy}{dx}=\frac{dy}{du} \cdot \frac{du}{dx}$
2) Single-variable total-derivative Chain Rule:
$\frac{\partial f(x, u_{1}, u_{2}, \cdots, u_{n})}{\partial x}=\frac{\partial f}{\partial x}+\frac{\partial f}{\partial u_{1}} \frac{\partial u_{1}}{\partial x}+\frac{\partial f}{\partial u_{2}} \frac{\partial u_{2}}{\partial x}+\cdots+\frac{\partial f}{\partial u_{n}} \frac{\partial u_{n}}{\partial x}=\frac{\partial f}{\partial x}+\sum_{i=1}^{n} \frac{\partial f}{\partial u_{i}} \frac{\partial u_{i}}{\partial x}$
3) Vector Chain Rules:
$f: R \rightarrow R^{2}$:
$\vec{y}=\begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix}=\begin{bmatrix} f_{1}(x) \\ f_{2}(x) \end{bmatrix}=\begin{bmatrix} \ln(x^{2}) \\ \sin(3x) \end{bmatrix}$
Introduce the intermediate vector $\vec{g}=\begin{bmatrix} g_{1}(x) \\ g_{2}(x) \end{bmatrix}=\begin{bmatrix} x^{2} \\ 3x \end{bmatrix}$, so that
$\begin{bmatrix} f_{1}(x) \\ f_{2}(x) \end{bmatrix}=\begin{bmatrix} f_{1}(\vec{g}) \\ f_{2}(\vec{g}) \end{bmatrix}=\begin{bmatrix} \ln(g_{1}) \\ \sin(g_{2}) \end{bmatrix}$
$\frac{\partial \vec{y}}{\partial x}$: since $R \rightarrow R^{2}$, $J_{x}$ is $2 \times 1$:
$\frac{\partial \vec{y}}{\partial x}=\begin{bmatrix} \frac{\partial f_{1}(\vec{g})}{\partial x} \\ \frac{\partial f_{2}(\vec{g})}{\partial x} \end{bmatrix}=\begin{bmatrix} \frac{\partial f_{1}}{\partial g_{1}} \frac{\partial g_{1}}{\partial x}+\frac{\partial f_{1}}{\partial g_{2}} \frac{\partial g_{2}}{\partial x} \\ \frac{\partial f_{2}}{\partial g_{1}} \frac{\partial g_{1}}{\partial x}+\frac{\partial f_{2}}{\partial g_{2}} \frac{\partial g_{2}}{\partial x} \end{bmatrix}=\begin{bmatrix} \frac{1}{g_{1}} \cdot 2x+0 \\ 0+\cos(g_{2}) \cdot 3 \end{bmatrix}=\begin{bmatrix} \frac{2}{x} \\ 3\cos(3x) \end{bmatrix}$
In general: $\frac{\partial}{\partial x} \vec{f}(\vec{g}(x))=\frac{\partial \vec{f}}{\partial \vec{g}} \cdot \frac{\partial \vec{g}}{\partial x}$
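The worked chain-rule example can be verified numerically. This sketch (not part of the notes) checks that the derivative of $[\ln(x^2), \sin(3x)]$ matches $[2/x, 3\cos(3x)]$:

```python
import numpy as np

# y(x) = [ln(x^2), sin(3x)]; the vector chain rule gives [2/x, 3*cos(3x)].
def y(x):
    return np.array([np.log(x ** 2), np.sin(3 * x)])

def dy_analytic(x):
    return np.array([2 / x, 3 * np.cos(3 * x)])

x, h = 1.5, 1e-6
dy_num = (y(x + h) - y(x - h)) / (2 * h)   # central-difference derivative
assert np.allclose(dy_num, dy_analytic(x), atol=1e-4)
```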
2.1.4 Matrix Differentiation
Proposition 5:
$\vec{y}=\mathbb{A} \vec{x}, \quad \vec{y} \in R^{m}, \quad \vec{x} \in R^{n}, \quad \mathbb{A} \in R^{m \times n}$, where $\mathbb{A}$ doesn't depend on $\vec{x}$. Then:
$\frac{\partial \vec{y}}{\partial \vec{x}}=\mathbb{A}$
Proposition 6:
$\vec{y}=\mathbb{A} \vec{x}, \quad \vec{y} \in R^{m}, \quad \vec{x} \in R^{n}, \quad \mathbb{A} \in R^{m \times n}$, where $\mathbb{A}$ doesn't depend on $\vec{x}$. Suppose $\vec{x}$ is a function of $\vec{z}$ and $\mathbb{A}$ is independent of $\vec{z}$.
Then: $\frac{\partial \vec{y}}{\partial \vec{z}}=\mathbb{A} \cdot \frac{\partial \vec{x}}{\partial \vec{z}}$
Pf: $\frac{\partial \vec{y}}{\partial \vec{z}}=\frac{\partial \vec{y}}{\partial \vec{x}} \cdot \frac{\partial \vec{x}}{\partial \vec{z}}=\mathbb{A} \cdot \frac{\partial \vec{x}}{\partial \vec{z}}$ (by Proposition 5)
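Propositions 5 and 6 can both be checked numerically. In this sketch (not from the notes; the choice $x = g(z) = z^2$ elementwise is illustrative), `jac` is a generic finite-difference Jacobian:

```python
import numpy as np

def jac(f, x, h=1e-6):
    """m x n Jacobian of f by central differences."""
    m, n = len(f(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))       # A in R^{3x4}, independent of x and z

# Proposition 5: for y = A x, the Jacobian dy/dx equals A.
x0 = rng.standard_normal(4)
assert np.allclose(jac(lambda x: A @ x, x0), A, atol=1e-4)

# Proposition 6: if x = g(z), then dy/dz = A @ dx/dz (chain rule).
g = lambda z: z ** 2                  # illustrative x(z); dx/dz = diag(2z)
z0 = rng.standard_normal(4)
dxdz = np.diag(2 * z0)
assert np.allclose(jac(lambda z: A @ g(z), z0), A @ dxdz, atol=1e-4)
```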
Proposition 7:
$\alpha=\vec{y}^{\top} \mathbb{A} \vec{x}, \quad \alpha \in R, \quad \vec{y} \in R^{m}, \quad \vec{x} \in R^{n}, \quad \mathbb{A} \in R^{m \times n}$, where $\mathbb{A}$ is independent of $\vec{x}$ and $\vec{y}$.
Then: $\frac{\partial \alpha}{\partial \vec{x}}=\vec{y}^{\top} \mathbb{A}$
Pf: let $B=\vec{y}^{\top} \mathbb{A}$, so $\alpha=B \vec{x}$; by Proposition 5, $\frac{\partial \alpha}{\partial \vec{x}}=B=\vec{y}^{\top} \mathbb{A}$
Then: $\frac{\partial \alpha}{\partial \vec{y}}=\vec{x}^{\top} \mathbb{A}^{\top}$
Pf: $\alpha$ is a scalar, so $\alpha^{\top}=\alpha$, giving $\alpha=(\vec{y}^{\top} \mathbb{A} \vec{x})^{\top}=\vec{x}^{\top} \mathbb{A}^{\top} \vec{y}$; applying the first result, $\frac{\partial \alpha}{\partial \vec{y}}=\vec{x}^{\top} \mathbb{A}^{\top}$
Proposition:
Let the scalar $\alpha$ be defined by $\alpha=\vec{y}^{\top} \vec{x}, \quad \vec{y} \in R^{n}, \quad \vec{x} \in R^{n}$, where $\vec{y}$ and $\vec{x}$ are functions of the vector $\vec{z}$. Then:
$\frac{\partial \alpha}{\partial \vec{z}}=\vec{x}^{\top} \frac{\partial \vec{y}}{\partial \vec{z}}+\vec{y}^{\top} \frac{\partial \vec{x}}{\partial \vec{z}}$
Pf: $\frac{\partial \alpha}{\partial \vec{z}}=\frac{\partial \alpha}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{z}}+\frac{\partial \alpha}{\partial \vec{x}} \frac{\partial \vec{x}}{\partial \vec{z}}=\vec{x}^{\top} \frac{\partial \vec{y}}{\partial \vec{z}}+\vec{y}^{\top} \frac{\partial \vec{x}}{\partial \vec{z}}$
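A numerical sketch of the dot-product rule above (the elementwise choices $y(z)=\sin(z)$ and $x(z)=z^3$ are illustrative, not from the notes):

```python
import numpy as np

# alpha(z) = y(z)^T x(z); check d(alpha)/dz = x^T dy/dz + y^T dx/dz.
y = lambda z: np.sin(z)
x = lambda z: z ** 3

z0 = np.array([0.5, -1.0, 2.0])
dydz = np.diag(np.cos(z0))       # Jacobian of y(z) (elementwise function)
dxdz = np.diag(3 * z0 ** 2)      # Jacobian of x(z)

analytic = x(z0) @ dydz + y(z0) @ dxdz   # 1 x 3 row vector

h = 1e-6
numeric = np.array([
    (y(z0 + h * e) @ x(z0 + h * e) - y(z0 - h * e) @ x(z0 - h * e)) / (2 * h)
    for e in np.eye(3)
])
assert np.allclose(numeric, analytic, atol=1e-4)
```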
Proposition 8:
$\alpha=\vec{x}^{\top} \mathbb{A} \vec{x}, \quad \vec{x} \in R^{n}, \quad \mathbb{A} \in R^{n \times n}$, where $\mathbb{A}$ doesn't depend on $\vec{x}$. Then:
$\frac{\partial \alpha}{\partial \vec{x}}=\vec{x}^{\top}(\mathbb{A}+\mathbb{A}^{\top})$
Pf: write $\alpha=\vec{y}^{\top} \mathbb{A} \vec{x}$ with $\vec{y}=\vec{x}$. By the previous proposition and Proposition 7:
$\frac{\partial \alpha}{\partial \vec{x}}=\frac{\partial \alpha}{\partial \vec{y}} \cdot \frac{\partial \vec{y}}{\partial \vec{x}}+\vec{y}^{\top} \mathbb{A} \cdot \frac{\partial \vec{x}}{\partial \vec{x}}=(\mathbb{A} \vec{x})^{\top}+\vec{y}^{\top} \mathbb{A}=\vec{x}^{\top} \mathbb{A}^{\top}+\vec{x}^{\top} \mathbb{A}=\vec{x}^{\top}(\mathbb{A}+\mathbb{A}^{\top})$
Proposition 9:
$\mathbb{A}$ is symmetric, then $\mathbb{A}=\mathbb{A}^{\top}$.
Combining this with Proposition 8 gives Proposition 10: for symmetric $\mathbb{A}$, $\frac{\partial \alpha}{\partial \vec{x}}=2 \vec{x}^{\top} \mathbb{A}$.
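The quadratic-form gradient of Proposition 8, and its symmetric special case, can be verified numerically (sketch only; the random $\mathbb{A}$ and `grad` helper are illustrative):

```python
import numpy as np

def grad(f, x, h=1e-6):
    """Gradient of a scalar-valued f by central differences."""
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                     for e in np.eye(len(x))])

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
x0 = rng.standard_normal(4)

# Proposition 8: d(x^T A x)/dx = x^T (A + A^T)
alpha = lambda x: x @ A @ x
assert np.allclose(grad(alpha, x0), x0 @ (A + A.T), atol=1e-4)

# Proposition 10: for symmetric A the gradient collapses to 2 x^T A
S = A + A.T                        # symmetric by construction
alpha_s = lambda x: x @ S @ x
assert np.allclose(grad(alpha_s, x0), 2 * x0 @ S, atol=1e-4)
```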
Proposition 15:
$\mathbb{A}^{-1} \mathbb{A}=I$. Differentiating both sides with respect to the scalar $\alpha$:
$\mathbb{A}^{-1} \frac{\partial \mathbb{A}}{\partial \alpha}+\frac{\partial \mathbb{A}^{-1}}{\partial \alpha} \mathbb{A}=0$, hence $\frac{\partial \mathbb{A}^{-1}}{\partial \alpha}=-\mathbb{A}^{-1} \frac{\partial \mathbb{A}}{\partial \alpha} \mathbb{A}^{-1}$
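Differentiating $\mathbb{A}^{-1}\mathbb{A}=I$ yields $\frac{\partial \mathbb{A}^{-1}}{\partial \alpha}=-\mathbb{A}^{-1}\frac{\partial \mathbb{A}}{\partial \alpha}\mathbb{A}^{-1}$; the sketch below checks this on an illustrative matrix-valued $A(a)$ (the particular entries are assumptions, not from the notes):

```python
import numpy as np

# A(a): an invertible 2x2 matrix depending on a scalar a.
def A(a):
    return np.array([[1.0 + a, 2.0],
                     [0.5, 3.0 + a ** 2]])

def dA(a):
    # elementwise derivative of A with respect to a
    return np.array([[1.0, 0.0],
                     [0.0, 2 * a]])

a0, h = 0.7, 1e-6
Ainv = np.linalg.inv(A(a0))
analytic = -Ainv @ dA(a0) @ Ainv   # -A^{-1} (dA/da) A^{-1}
numeric = (np.linalg.inv(A(a0 + h)) - np.linalg.inv(A(a0 - h))) / (2 * h)
assert np.allclose(numeric, analytic, atol=1e-4)
```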