From the University of Waterloo
https://www.youtube.com/watch?v=dXxuCARJ1CY&list=PLdAoL1zKcqTW-uzoSVBNEecKHsnug_M0k&index=16
Circumstances forced me to properly brush up on Machine Learning. Since I like this professor, I am going through his course from the beginning. These notes are essentially transcriptions of his lectures, plus the University of Utah Machine Learning slides; I will reorganize them later to make the content more systematic.
1 Perceptron
The perceptron is an online learning algorithm that is very widely used and easy to implement.
Idea: mimic the brain to do computation
- The brain is made up of nuclei, synapses, dendrites, and axons. In some ways it looks like a computer: neurons play the role of gates, and signals are electrical. The brain computes in parallel, whereas a computer computes sequentially (or with a mix of sequential and parallel computation).
- However, the brain is robust while computers are fragile: if a single gate stops working, the computer crashes.
Artificial neural network
- Nodes: neurons
- Links: synapses
ANN Unit
- For each unit $j$:
  - Weight $W_{ji}$: the strength of the link from unit $i$ to unit $j$
  - Input signals $x_i$ weighted by $W_{ji}$ are linearly combined to produce a new signal $a_j$:
    $$a_j = \sum_i W_{ji} x_i + w_0 = W_j \bar{x}$$
- Activation function $h$ produces the output signal $y_j$ in a non-linear way:
  $$y_j = h(a_j)$$
  - Should be non-linear, otherwise the network is just a linear function
  - Often chosen to mimic firing in neurons: a unit should be 'active' (output near 1) when fed the 'right' inputs and 'inactive' (output near 0) when fed the 'wrong' inputs
- Common activation functions:
  - Threshold activation function
  - Sigmoid
- Can we design Boolean functions (AND, OR, NOT) using the threshold activation function? Yes; see the sketch below.
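A minimal sketch of such threshold units, assuming inputs in {0, 1}; the specific weights and biases below are one of many valid choices, not taken from the lecture:

```python
import numpy as np

def threshold_unit(weights, bias):
    """Build a unit computing h(a) where a = w.x + bias and
    h is the threshold activation: output 1 if a > 0, else 0."""
    return lambda x: int(np.dot(weights, x) + bias > 0)

# Hypothetical weight choices that realize the Boolean gates:
AND = threshold_unit([1.0, 1.0], -1.5)   # fires only when both inputs are 1
OR  = threshold_unit([1.0, 1.0], -0.5)   # fires when at least one input is 1
NOT = threshold_unit([-1.0], 0.5)        # inverts its single input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND([a, b]), "OR:", OR([a, b]))
print("NOT 0:", NOT([0]), "NOT 1:", NOT([1]))
```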
Network Structures
- Feed-forward network
  - Directed acyclic graph
  - No internal state
  - Simply computes outputs from inputs
- Recurrent network
  - Directed cyclic graph
  - Dynamical system with internal states
  - Can memorize information
  - Popular in NLP, where inputs have varied lengths: the cyclic part lets the network adapt to different lengths (see the sketch below)
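A minimal sketch of that idea, assuming a plain tanh recurrence (the weight shapes and names are illustrative, not from the lecture): the same cell is applied at every step, so sequences of any length reuse one set of weights.

```python
import numpy as np

def rnn_state(xs, W_h, W_x, h0):
    """Run a simple recurrent cell h_t = tanh(W_h h_{t-1} + W_x x_t)
    over a variable-length sequence xs and return the final state."""
    h = h0
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)  # same weights at every step
    return h

# Works unchanged for sequences of length 3 or 30:
rng = np.random.default_rng(0)
W_h, W_x, h0 = rng.normal(size=(4, 4)), rng.normal(size=(4, 2)), np.zeros(4)
print(rnn_state(rng.normal(size=(3, 2)), W_h, W_x, h0))
print(rnn_state(rng.normal(size=(30, 2)), W_h, W_x, h0))
```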
2 Feed-forward Network
Perceptron: a single-layer feed-forward network.
(Figure: shades/colors indicate different values/magnitudes; lines indicate higher weights.)
3 Supervised learning algorithms for neural networks
- Given a list of $(x, y)$ pairs
- Train a feed-forward ANN
  - to compute the proper outputs $y$ when fed with inputs $x$
  - Training consists of adjusting the weights $W_{ji}$
Threshold Perceptron Learning
- Learning is done separately for each unit $j$, since units do not share weights
- Perceptron learning for unit $j$ (a code sketch follows below):
  - For each $(x, y)$ pair, do:
    - Case 1: correct output produced: $\forall_i\ W_{ji} \leftarrow W_{ji}$
    - Case 2: output produced is 0 instead of 1: add $x_i$, i.e. $\forall_i\ W_{ji} \leftarrow W_{ji} + x_i$
    - Case 3: output produced is 1 instead of 0: subtract $x_i$, i.e. $\forall_i\ W_{ji} \leftarrow W_{ji} - x_i$
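A minimal sketch of this mistake-driven loop for a single unit, assuming binary labels in {0, 1} and a bias feature folded into each input vector (both assumptions mine):

```python
import numpy as np

def train_threshold_perceptron(X, y, epochs=10):
    """Threshold perceptron learning for one unit: add x when the unit
    outputs 0 but the target is 1, subtract x when it outputs 1 but the
    target is 0, and leave the weights unchanged when it is correct."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            out = int(w @ x_n > 0)        # threshold activation
            if out == 0 and y_n == 1:     # case 2: add the input
                w += x_n
            elif out == 1 and y_n == 0:   # case 3: subtract the input
                w -= x_n
    return w

# Learn OR from its truth table (bias feature of 1 appended):
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
w = train_threshold_perceptron(X, y)
print([int(w @ x > 0) for x in X])  # expected [0, 1, 1, 1]
```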
Sigmoid Perceptron Learning
- Represent ‘soft’ linear separators
- Same hypothesis space as logistic regression
- Possible objectives:
  - Minimum squared error:
    $$E(w) = \frac{1}{2}\sum_n E_n(w)^2 = \frac{1}{2}\sum_n \left(y_n - \sigma(w^T \bar{x}_n)\right)^2$$
  - Maximum likelihood (same algorithm as for logistic regression)
  - Maximum a posteriori hypothesis
  - Bayesian learning
- Gradient (for minimum squared error):
  $$\frac{\partial E}{\partial w_i} = \sum_n E_n(w)\frac{\partial E_n}{\partial w_i} = -\sum_n E_n(w)\,\sigma'(w^T \bar{x}_n)\,x_i = -\sum_n E_n(w)\,\sigma(w^T \bar{x}_n)\left(1-\sigma(w^T \bar{x}_n)\right)x_i$$
  since for the sigmoid function, $\sigma' = \sigma(1-\sigma)$.
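This identity can be verified in one line from the definition of the sigmoid:

$$\sigma(z) = \frac{1}{1+e^{-z}} \;\Longrightarrow\; \sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)\left(1-\sigma(z)\right)$$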
Sequential Gradient Descent in perceptron learning
- Repeat
  - For each $(x_n, y_n)$ in the examples, do:
    $$E_n \leftarrow y_n - \sigma(w^T \bar{x}_n)$$
    $$w \leftarrow w + \eta\, E_n\, \sigma(w^T \bar{x}_n)\left(1-\sigma(w^T \bar{x}_n)\right)\bar{x}_n$$
    where $\eta$ is the learning rate
- Until some stopping criterion is satisfied
- Return the learnt network
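A minimal sketch of this sequential update, assuming targets in [0, 1], a fixed number of epochs as the stopping criterion, and a bias feature folded into $\bar{x}_n$ (assumptions mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_perceptron(X, y, eta=0.5, epochs=1000):
    """Sequential gradient descent on the squared error of a sigmoid
    unit: w <- w + eta * E_n * sigma * (1 - sigma) * x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            out = sigmoid(w @ x_n)
            err = y_n - out                        # E_n
            w += eta * err * out * (1 - out) * x_n  # gradient step
    return w
```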
Notes:
- Prediction $= \mathrm{sgn}(w^T x)$
- The weights are updated only on errors, so this is a mistake-driven algorithm
Geometric Representation (from the University of Utah slides)
Convergence theorem:
If there exists a set of weights consistent with the data (i.e., the data is linearly separable), then the perceptron algorithm will converge.
Cycling theorem
If the training data is not separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop (it never converges).
Mistake Bound Theorem
- $R$: the distance from the origin to the farthest data point
- $u$ and $\gamma$: the data has margin $\gamma$ with respect to a unit weight vector $u$, so the data is separable; $\gamma$ is the complexity parameter that measures how separable the data is
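With these symbols, the classical statement of the bound (supplied here for completeness; it is the standard Novikoff result) is that the perceptron makes at most

$$\left(\frac{R}{\gamma}\right)^2$$

mistakes on linearly separable data.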
Variants of Perceptron
- Hyperparameter: the number of training epochs $T$
- Margin Perceptron: pick a positive $\eta$ and update $w$ whenever
  $$\frac{y_i w^T x_i}{||w||} < \eta$$
- Voted Perceptron: after each weight update, update the count $c_i$ of the current weight vector. Return the pairs $(w_i, c_i)$. The prediction is
  $$\mathrm{sgn}\left(\sum_{i=1}^k c_i \cdot \mathrm{sgn}(w_i^T x)\right)$$
- Average Perceptron: after each weight update, accumulate $a \leftarrow a + w$ ($a$ is initialized to $0$). Return $a$. The prediction is
  $$\mathrm{sgn}(a^T x) = \mathrm{sgn}\left(\sum_{i=1}^k c_i w_i^T x\right)$$
  (a code sketch follows below)
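A minimal sketch of the average perceptron, assuming labels in {-1, +1} and a bias feature folded into each input (assumptions mine); here $a$ is accumulated after every example, one common formulation:

```python
import numpy as np

def train_average_perceptron(X, y, T=10):
    """Average perceptron: run the usual mistake-driven updates on w,
    accumulate a <- a + w as training proceeds, and return a.
    The prediction is sgn(a^T x)."""
    w = np.zeros(X.shape[1])
    a = np.zeros(X.shape[1])
    for _ in range(T):                    # T training epochs
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:      # mistake: standard update
                w += y_n * x_n
            a += w                        # running sum of weight vectors
    return a

def predict(a, x):
    return np.sign(a @ x)
```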