Neural Network
1. Model Summary
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called “spikes”) that are channeled to outputs (axons). In our model, our dendrites are like the input features $x_1 \cdots x_n$, and the output is the result of our hypothesis function. In this model our $x_0 = 1$ input node is sometimes called the “bias unit”; it is always equal to 1. In neural networks, we use the same logistic function as in classification, $\frac{1}{1 + e^{-\theta^T x}}$, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our “theta” parameters are sometimes called “weights”.
Visually, a simplistic representation looks like:
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow [\qquad] \rightarrow h_\theta(x)$$
three layers: input layer / hidden layer / output layer
$a_i^{(j)}$ : activation of unit $i$ in layer $j$
$\Theta^{(j)}$ : matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
If layer $j$ has $s_j$ units and layer $j+1$ has $s_{j+1}$ units, then the size of $\Theta^{(j)}$ is $s_{j+1} \times (s_j + 1)$
$L$ : number of layers
$s_l$ : number of units in layer $l$
Number of inputs: the dimension of the features $x^{(i)}$
Binary Classification: 1 output unit
K-class Classification: K output units
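For example, applying the dimension rule above: if layer $j$ has $s_j = 2$ units and layer $j+1$ has $s_{j+1} = 4$ units, then $\Theta^{(j)}$ is a $4 \times 3$ matrix (the $+1$ accounts for the bias unit).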
2. Forward Propagation
- Add the bias unit $a_0^{(l)} = 1$ first
- $z^{(l+1)} = \Theta^{(l)} a^{(l)}$
- $a^{(l+1)} = g(z^{(l+1)})$, where $g$ is the sigmoid function
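A minimal numpy sketch of these three steps for a single example (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) activation function g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Forward propagation for one example.

    x      : feature vector of shape (n,)
    Thetas : list of the weight matrices Theta^{(1)}, ..., Theta^{(L-1)}

    Returns the activations a^{(1)}, ..., a^{(L)}.
    """
    a = x
    activations = [a]
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))  # add the bias unit a_0 = 1 first
        z = Theta @ a                   # z^{(l+1)} = Theta^{(l)} a^{(l)}
        a = sigmoid(z)                  # a^{(l+1)} = g(z^{(l+1)})
        activations.append(a)
    return activations
```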
3. Cost Function
The regularization term excludes the bias weights (the first column of each $\Theta^{(l)}$).
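For reference, the standard regularized cost function for a network with $K$ output units; note that the regularization sum starts at $i = 1$, skipping the bias column:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2$$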
4. Backpropagation Algorithm
Let $\delta_j^{(l)}$ denote the error of node $j$ in layer $l$. Then

$$\delta^{(L)} = a^{(L)} - y$$
$$\delta^{(l)} = (\Theta^{(l)})^T \delta^{(l+1)} .* g'(z^{(l)}) \qquad (l \neq L,\ l \neq 1)$$

where

$$g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})$$
One thing to note: the basic algorithm processes one training example at a time, accumulating the gradients over all examples.
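A sketch of these equations in numpy for a single training example, reusing `sigmoid` and the `Thetas` layout from the forward-propagation sketch above (names are illustrative):

```python
def backpropagate(x, y, Thetas):
    """Gradients of the (unregularized) cost for one example (x, y)."""
    # Forward pass, keeping the bias-augmented activation of each layer.
    a = np.concatenate(([1.0], x))
    activations = [a]
    for Theta in Thetas[:-1]:
        a = np.concatenate(([1.0], sigmoid(Theta @ a)))
        activations.append(a)
    a_L = sigmoid(Thetas[-1] @ activations[-1])  # output layer, no bias unit

    delta = a_L - y                              # delta^{(L)} = a^{(L)} - y
    grads = [None] * len(Thetas)
    for l in range(len(Thetas) - 1, -1, -1):
        # Gradient of Theta^{(l)} is the outer product of delta^{(l+1)}
        # with the bias-augmented a^{(l)}.
        grads[l] = np.outer(delta, activations[l])
        if l > 0:
            a = activations[l]
            # delta^{(l)} = (Theta^{(l)})^T delta^{(l+1)} .* a^{(l)} .* (1 - a^{(l)}),
            # then drop the bias component before moving down a layer.
            delta = ((Thetas[l].T @ delta) * a * (1.0 - a))[1:]
    return grads
```

In a full implementation these per-example gradients are summed over all $m$ examples, divided by $m$, and the regularization term is added for the non-bias weights.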
5. Unrolling Parameters
Unroll the matrices into a single long vector, then reshape to get them back:
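A numpy sketch (the helper name and `layer_sizes` argument are illustrative):

```python
# Unroll all weight matrices into one vector, e.g. for an optimizer
# that expects a 1-D parameter vector.
theta_vec = np.concatenate([Theta.ravel() for Theta in Thetas])

def reshape_params(theta_vec, layer_sizes):
    """Recover the matrices from the vector, given [s_1, ..., s_L]."""
    Thetas, start = [], 0
    for s_in, s_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        end = start + s_out * (s_in + 1)  # Theta^{(l)} is s_{l+1} x (s_l + 1)
        Thetas.append(theta_vec[start:end].reshape(s_out, s_in + 1))
        start = end
    return Thetas
```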
6. Gradient Checking
Gradient checking is only for verifying that the backpropagation gradients are correct; it is very slow, so turn it off before actually training!
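The standard check is a two-sided finite-difference approximation, compared elementwise against the backpropagation gradient. A sketch, assuming `cost` maps the unrolled parameter vector to $J(\Theta)$:

```python
def numerical_gradient(cost, theta_vec, eps=1e-4):
    """Two-sided finite-difference approximation of the gradient."""
    grad = np.zeros_like(theta_vec)
    for i in range(theta_vec.size):
        step = np.zeros_like(theta_vec)
        step[i] = eps
        # dJ/dtheta_i ~ (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)
        grad[i] = (cost(theta_vec + step) - cost(theta_vec - step)) / (2 * eps)
    return grad
```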
7. Random Initialization
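Initializing all weights to zero makes every unit in a layer compute the same function, so symmetry is never broken. Instead, initialize each entry of $\Theta^{(l)}$ to a random value in $[-\epsilon, \epsilon]$; a numpy sketch (the value of $\epsilon$ is illustrative):

```python
def random_init(s_in, s_out, eps=0.12):
    # Each weight drawn uniformly from [-eps, eps] to break symmetry.
    return np.random.rand(s_out, s_in + 1) * 2 * eps - eps
```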
8. Network Architecture
Reasonable defaults: a single hidden layer, or more than one hidden layer with the same number of units in each.