Classification
- Binary classification: $K = 1$ output unit
- Multi-class classification: $K$ output units ($K \geq 3$), with labels represented as one-hot vectors (see the sketch below)
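A minimal Octave sketch (with illustrative variable names) of turning integer class labels into one-hot vectors:

```
% Convert integer labels y in {1,...,K} into one-hot row vectors,
% so that training example t has target vector Y(t,:).
K = 4;              % number of classes (example value)
y = [3; 1; 4];      % integer labels for 3 training examples
I = eye(K);
Y = I(y, :);        % Y(t,:) is the one-hot encoding of y(t)
```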
cost function
- $L$ = total number of layers in the network
- $s_l$ = number of units (not counting the bias unit) in layer $l$
- $K$ = number of output units/classes
Logistic regression:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\big(h_\theta(x^{(i)})\big) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Neural networks:
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y^{(i)}_k \log\big((h_\Theta(x^{(i)}))_k\big) + (1-y^{(i)}_k)\log\big(1-(h_\Theta(x^{(i)}))_k\big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta^{(l)}_{j,i}\big)^2$$
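As a sketch of how this cost can be computed for a 3-layer network, assuming weight matrices Theta1/Theta2 and a one-hot label matrix Y (all variable names here are illustrative, not from the notes):

```
% Vectorized cost for a 3-layer network: X is m x n, Y is m x K (one-hot),
% Theta1 is s2 x (n+1), Theta2 is K x (s2+1).
sigmoid = @(z) 1 ./ (1 + exp(-z));
m  = size(X, 1);
A1 = [ones(m, 1) X];                       % input layer plus bias units
A2 = [ones(m, 1) sigmoid(A1 * Theta1')];   % hidden layer activations
H  = sigmoid(A2 * Theta2');                % h_Theta(x), one row per example
J  = (-1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));
% Regularization skips each Theta's bias column (the first column).
J  = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + ...
                           sum(sum(Theta2(:, 2:end).^2)));
```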
Forward Propagation Algorithm
Backpropagation Algorithm
calculation
For training example $t = 1$ to $m$:
1. Set $a^{(1)} := x^{(t)}$
2. Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \ldots, L$
3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$
4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$ using
$$\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \,.\!\ast\, g'(z^{(l)}) = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \,.\!\ast\, a^{(l)} \,.\!\ast\, (1 - a^{(l)})$$
5. $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a^{(l)}_j \delta^{(l+1)}_i$, or with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$

Hence we update our new $\Delta$ matrix.
- $D^{(l)}_{i,j} := \frac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)$, if $j \neq 0$
- $D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}$, if $j = 0$ (the bias term is not regularized)

The $D$ matrix holds the partial derivatives: $\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) = D^{(l)}_{ij}$; see the sketch below.
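As a worked sketch of the whole pass for a 3-layer network (input, one hidden layer, output), under the same assumptions as the cost sketch above (Theta1/Theta2, one-hot Y; names illustrative):

```
% One full backpropagation pass: accumulate Delta over all examples,
% then divide by m and add regularization (skipping the bias columns).
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
  a1 = [1; X(t, :)'];                        % step 1: a^(1) = x^(t), plus bias
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];                     % step 2: forward propagation
  a3 = sigmoid(Theta2 * a2);
  d3 = a3 - Y(t, :)';                        % step 3: delta^(L) = a^(L) - y^(t)
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));   % step 4: backpropagate the error
  d2 = d2(2:end);                            % drop the bias-unit component
  Delta2 = Delta2 + d3 * a2';                % step 5: accumulate gradients
  Delta1 = Delta1 + d2 * a1';
end
D1 = Delta1 / m;                             % unregularized terms (j = 0)
D2 = Delta2 / m;
D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);  % j != 0
D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);
```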
Cost function intuition
The cost for a single training example $t$ (ignoring regularization) is:
$$\mathrm{cost}(t) = y^{(t)} \log\big(h_\Theta(x^{(t)})\big) + (1 - y^{(t)}) \log\big(1 - h_\Theta(x^{(t)})\big)$$
Intuitively, $\delta^{(l)}_j$ is the "error" of node $j$ in layer $l$; formally, it is the derivative of this cost with respect to the node's weighted input:
$$\delta^{(l)}_j = \frac{\partial}{\partial z^{(l)}_j} \mathrm{cost}(t)$$
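The factor $a^{(l)} \,.\!\ast\, (1 - a^{(l)})$ in step 4 is just the sigmoid derivative evaluated elementwise; a short derivation:
$$g(z) = \frac{1}{1 + e^{-z}} \quad\Longrightarrow\quad g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = g(z)\big(1 - g(z)\big)$$
so with $a^{(l)} = g(z^{(l)})$, we get $g'(z^{(l)}) = a^{(l)} \,.\!\ast\, (1 - a^{(l)})$.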
Gradient Checking
$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$
A small value such as $\epsilon \approx 10^{-4}$ works well. With multiple parameters, the derivative with respect to $\Theta_j$ is approximated as:
$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \ldots, \Theta_j + \epsilon, \ldots, \Theta_n) - J(\Theta_1, \ldots, \Theta_j - \epsilon, \ldots, \Theta_n)}{2\epsilon}$$
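A sketch of this check in Octave, assuming the parameters have been unrolled into a single vector theta and J is a cost-function handle (both names illustrative):

```
% Approximate each partial derivative with a two-sided difference and
% compare the result against the gradient from backpropagation.
epsilon = 1e-4;
n = numel(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus  = theta;  thetaPlus(i)  = thetaPlus(i)  + epsilon;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
% gradApprox should agree with the backprop gradient to several decimal
% places; once confirmed, disable this check (it is very slow).
```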
Random Initialization
To break symmetry (if all weights start at zero, every unit in a layer computes the same function), we initialize each $\Theta^{(l)}_{ij}$ to a random value in $[-\epsilon, \epsilon]$.
If the dimensions of Theta1, Theta2, and Theta3 are 10x11, 10x11, and 1x11 respectively:
```
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```
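Since rand(m,n) returns values uniformly distributed in $[0, 1]$, each weight above ends up uniformly distributed in [-INIT_EPSILON, INIT_EPSILON], as required.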
Training a neural network
- Number of input units = dimension of features $x^{(i)}$
- Number of output units = number of classes
- Number of hidden units per layer = usually, the more the better (but balance against computational cost, which grows with the number of hidden units)

Default: 1 hidden layer. If you use more than 1 hidden layer, it is recommended to have the same number of units in every hidden layer.
- Randomly initialize the weights
- Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with respect to the weights in Theta (see the sketch below).
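A sketch of the final step using Octave's built-in fminunc; nnCostFunction is a hypothetical handle that returns both the cost and the unrolled gradient:

```
% Unroll the weight matrices into one parameter vector, minimize the cost,
% then reshape the optimum back into matrices before making predictions.
initialTheta = [Theta1(:); Theta2(:)];
options  = optimset('GradObj', 'on', 'MaxIter', 100);
costFunc = @(p) nnCostFunction(p, X, Y, lambda);   % returns [J, grad]
[optTheta, optCost] = fminunc(costFunc, initialTheta, options);
% Reshape optTheta back into Theta1 and Theta2 using their known sizes.
```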