Coursera | Mathematics for Machine Learning Specialization | Multivariate Calculus

These are my study notes, recording all of the Programming Assignment code from Course Two of the Coursera specialization Mathematics for Machine Learning by Imperial College London: Mathematics for Machine Learning: Multivariate Calculus. All of the code has passed the tests, each assignment scoring 10/10.

Backpropagation

Instructions

In this assignment, you will train a neural network to draw a curve. The curve takes one input variable, the amount travelled along the curve from 0 to 1, and returns 2 outputs, the 2D coordinates of the position of points on the curve.

To help capture the complexity of the curve, we shall use two hidden layers in our network with 6 and 7 neurons respectively.

 

You will be asked to complete functions that calculate the Jacobian of the cost function, with respect to the weights and biases of the network. Your code will form part of a stochastic steepest descent algorithm that will train your network.

Matrices in Python

Recall from assignments in the previous course in this specialisation that matrices can be multiplied together in two ways.

Element-wise: when two matrices have the same dimensions, matrix elements in the same position in each matrix are multiplied together. In Python this uses the '*' operator.

A = B * C

Matrix multiplication: when the number of columns in the first matrix is the same as the number of rows in the second. In Python this uses the '@' operator.

A = B @ C

This assignment will not test which ones to use where, but it will use both in the starter code presented to you. There is no need to change these or worry about their specifics.
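
For concreteness, here is a small, optional NumPy illustration of the two products (B and C are arbitrary example matrices, not part of the assignment):

import numpy as np

B = np.array([[1, 2],
              [3, 4]])
C = np.array([[5, 6],
              [7, 8]])

print(B * C)   # element-wise product: [[ 5 12] [21 32]]
print(B @ C)   # matrix product:       [[19 22] [43 50]]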

Feed forward

In the following cell, we will define functions to set up our neural network: an activation function, \sigma(z), its derivative, \sigma^{\prime}(z), a function to initialise weights and biases, and a function that calculates each activation of the network using feed-forward.

Recall the feed-forward equations,

a^{(n)} = \sigma(z^{(n)})

z^{(n)} = W^{(n)}a^{(n-1)} + b^{(n)}  

In this worksheet we will use the logistic function as our activation function, rather than the more familiar tanh,

\sigma(z) = \frac{1}{1+\exp(-z)}

There is no need to edit the following cells. They do not form part of the assessment. You may wish to study how it works though.

Run the following cells before continuing.

%run "readonly/BackpropModule.ipynb"
# PACKAGE
import numpy as np
import matplotlib.pyplot as plt
# PACKAGE
# First load the worksheet dependencies.
# Here is the activation function and its derivative.
sigma = lambda z : 1 / (1 + np.exp(-z))
d_sigma = lambda z : np.cosh(z/2)**(-2) / 4
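# Note: d_sigma is the derivative of sigma; it uses the identity
# sigma(z) * (1 - sigma(z)) = 1 / (4 * cosh(z/2)**2).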

# This function initialises the network with its structure; it also resets any training already done.
def reset_network (n1 = 6, n2 = 7, random=np.random) :
    global W1, W2, W3, b1, b2, b3
    W1 = random.randn(n1, 1) / 2
    W2 = random.randn(n2, n1) / 2
    W3 = random.randn(2, n2) / 2
    b1 = random.randn(n1, 1) / 2
    b2 = random.randn(n2, 1) / 2
    b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer. It returns all weighted sums and activations.
def network_function(a0) :
    z1 = W1 @ a0 + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    z3 = W3 @ a2 + b3
    a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

# This is the cost function of a neural network with respect to a training set.
def cost(x, y) :
    return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size
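
If you want to convince yourself of how the layers connect, here is a minimal, optional sketch (x_demo is a made-up input, not the assignment's training data) that feeds a few points through the network and prints the activation shapes:

# Optional shape check (not part of the assessment).
reset_network()
x_demo = np.linspace(0, 1, 5).reshape(1, -1)          # 1 input row, 5 example points along the curve
a0, z1, a1, z2, a2, z3, a3 = network_function(x_demo)
print(a1.shape, a2.shape, a3.shape)                   # expect (6, 5), (7, 5), (2, 5)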

Backpropagation

In the next cells, you will be asked to complete functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers.

We'll define our Jacobians as,

J_{W^{(3)}} = \frac{\partial C}{\partial W^{(3)}}

J_{b^{(3)}} = \frac{\partial C}{\partial b^{(3)}} 

etc., where C is the average cost function over the training set, i.e.,

C = \frac{1}{N}\sum_{k}{C_k} 

You calculated the following in the practice quizzes,

\frac{\partial C}{\partial W^{(3)}} = \frac {\partial C}{\partial a^{(3)}}\frac {\partial a^{(3)}}{\partial z^{(3)}}\frac {\partial z^{(3)}}{\partial W^{(3)}} 

for the weight, and similarly for the bias,

\frac{\partial C}{\partial b^{(3)}} = \frac {\partial C}{\partial a^{(3)}}\frac {\partial a^{(3)}}{\partial z^{(3)}}\frac {\partial z^{(3)}}{\partial b^{(3)}}  

With the partial derivatives taking the form,

\frac{\partial C}{\partial a^{(3)}} = 2(a^{(3)} - y) 

\frac{\partial a^{(3)}}{\partial z^{(3)}} = \sigma^{\prime}(z^{(3)}) 

\frac{\partial z^{(3)}}{\partial W^{(3)}} = a^{(2)} 

\frac{\partial z^{(3)}}{\partial b^{(3)}} = 1 

We'll do the J_W3 (J_{W^{(3)}}) function for you, so you can see how it works. You should then be able to adapt the J_b3 function, with help, yourself. 

# GRADED FUNCTION

# Jacobian for the third layer weights. There is no need to edit this function.
def J_W3 (x, y) :
    # First get all the activations and weighted sums at each layer of the network.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    # We'll use the variable J to store parts of our result as we go along, updating it in each line.
    # Firstly, we calculate dC/da3, using the expressions above.
    J = 2 * (a3 - y)
    # Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
    J = J * d_sigma(z3)
    # Then we take the dot product (along the axis that holds the training examples) with the final partial derivative,
    # i.e. dz3/dW3 = a2
    # and divide by the number of training examples, for the average over all training examples.
    J = J @ a2.T / x.size
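    # (J now has the same shape as W3, i.e. (2, n2), which is what we need for a gradient step on W3.)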
    # Finally return the result out of the function.
    return J

# In this function, you will implement the jacobian for the bias.
# As you will see from the partial derivatives, only the last partial derivative is different.
# The first two partial derivatives are the same as previously.
# ===YOU SHOULD EDIT THIS FUNCTION===
def J_b3 (x, y) :
    # As last time, we'll first set up the activations.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    # Next you should implement the first two partial derivatives of the Jacobian.
    # ===COPY TWO LINES FROM THE PREVIOUS FUNCTION TO SET UP THE FIRST TWO JACOBIAN TERMS===
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    # For the final line, we don't need to multiply by dz3/db3, because that is multiplying by 1.
    # We still need to sum over all training examples however.
    # There is no need to edit this line.
    J = np.sum(J, axis=1, keepdims=True) / x.size
    return J

We'll next do the Jacobian for the Layer 2. The partial derivatives for this are,

\frac{\partial C}{\partial W^{(2)}} = \frac{\partial C}{\partial a^{(3)}}(\frac{\partial a^{(3)}}{\partial a^{(2)}})\frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial W^{(2)}} 

\frac{\partial C}{\partial b^{(2)}} = \frac{\partial C}{\partial a^{(3)}}(\frac{\partial a^{(3)}}{\partial a^{(2)}})\frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial b^{(2)}} 

This is very similar to the previous layer, with two exceptions: 

  • There is a new partial derivative, in parentheses, \frac{\partial a^{(3)}}{\partial a^{(2)}}
  • The terms after the parentheses are now one layer lower.

Recall the new partial derivative takes the following form, 

\frac{\partial a^{(3)}}{\partial a^{(2)}} = \frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}} = \sigma^{\prime}(z^{(3)})W^{(3)}

To show how this changes things, we will implement the Jacobian for the weight again and ask you to implement it for the bias.

# GRADED FUNCTION

# Compare this function to J_W3 to see how it changes.
# There is no need to edit this function.
def J_W2 (x, y) :
    #The first two lines are identical to in J_W3.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)    
    J = 2 * (a3 - y)
    # the next two lines implement da3/da2, first σ' and then W3.
    J = J * d_sigma(z3)
    J = (J.T @ W3).T
    # then the final lines are the same as in J_W3 but with the layer number bumped down.
    J = J * d_sigma(z2)
    J = J @ a1.T / x.size
    return J

# As previously, fill in all the incomplete lines.
# ===YOU SHOULD EDIT THIS FUNCTION===
def J_b2 (x, y) :
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    J = (J.T @ W3).T
    J = J * d_sigma(z2)
    J = np.sum(J, axis=1, keepdims=True) / x.size
    return J

Layer 1 is very similar to Layer 2, but with an additional partial derivative term.

\frac{\partial C}{\partial W^{(1)}} = \frac{\partial C}{\partial a^{(3)}}(\frac{\partial a^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial a^{(1)}})\frac{\partial a^{(1)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial W^{(1)}}

\frac{\partial C}{\partial b^{(1)}} = \frac{\partial C}{\partial a^{(3)}}(\frac{\partial a^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial a^{(1)}})\frac{\partial a^{(1)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial b^{(1)}} 

You should be able to adapt lines from the previous cells to complete both the weight and bias Jacobian. 

# GRADED FUNCTION

# Fill in all incomplete lines.
# ===YOU SHOULD EDIT THIS FUNCTION===
def J_W1 (x, y) :
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    J = (J.T @ W3).T
    J = J * d_sigma(z2)
    J = (J.T @ W2).T
    J = J * d_sigma(z1)
    J = J @ a0.T / x.size
    return J

# Fill in all incomplete lines.
# ===YOU SHOULD EDIT THIS FUNCTION===
def J_b1 (x, y) :
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    J = (J.T @ W3).T
    J = J * d_sigma(z2)
    J = (J.T @ W2).T
    J = J * d_sigma(z1)
    J = np.sum(J, axis=1, keepdims=True) / x.size
    return J

Test your code before submission

To test the code you've written above, run all previous cells (select each cell, then press the play button [ ▶| ] or press shift-enter). You can then use the code below to test out your function. You don't need to submit these cells; you can edit and run them as much as you like.

First, we generate training data, and generate a network with randomly assigned weights and biases.

x, y = training_data()
reset_network()
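
Optionally, you can also spot-check one of your Jacobians against a finite-difference approximation of the cost before training. The sketch below is not part of the assignment (finite_difference_check_W3 and eps are my own names and choices) and is not graded.

# Optional sanity check (not graded): compare J_W3 with central finite
# differences of the cost with respect to each element of W3.
def finite_difference_check_W3(x, y, eps=1e-6):
    analytic = J_W3(x, y)
    numeric = np.zeros_like(W3)
    for i in range(W3.shape[0]):
        for j in range(W3.shape[1]):
            W3[i, j] += eps
            c_plus = cost(x, y)
            W3[i, j] -= 2 * eps
            c_minus = cost(x, y)
            W3[i, j] += eps                        # restore the original weight
            numeric[i, j] = (c_plus - c_minus) / (2 * eps)
    return np.max(np.abs(analytic - numeric))      # should be very close to zero

print(finite_difference_check_W3(x, y))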

Next, if you've implemented the assignment correctly, the following code will iterate through a steepest descent algorithm using the Jacobians you have calculated. The function will plot the training data in green, your neural network's output in pink for each intermediate iteration, and the final output in orange.

It takes about 50,000 iterations to train this network. We can split this up though - 10,000 iterations should take about a minute to run. Run the line below as many times as you like.

plot_training(x, y, iterations=10000, aggression=7, noise=1)
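
If you would rather see the update rule written out than hidden inside plot_training, the sketch below runs a plain steepest-descent loop with your Jacobians. It is an illustration only: learning_rate and the iteration count are arbitrary choices, and plot_training presumably adds noise and plotting on top of something like this.

# Illustration only: an explicit steepest-descent loop using the Jacobians above.
learning_rate = 1
for _ in range(1000):
    jW1, jW2, jW3 = J_W1(x, y), J_W2(x, y), J_W3(x, y)
    jb1, jb2, jb3 = J_b1(x, y), J_b2(x, y), J_b3(x, y)
    W1 -= learning_rate * jW1
    W2 -= learning_rate * jW2
    W3 -= learning_rate * jW3
    b1 -= learning_rate * jb1
    b2 -= learning_rate * jb2
    b3 -= learning_rate * jb3
print(cost(x, y))   # the cost should decrease each time you rerun this cell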

If you wish, you can change the parameters of the steepest descent algorithm (we'll go into more detail in future exercises): how many iterations are plotted, how aggressive the step down the Jacobian is, and how much noise to add.

You can also edit the parameters of the neural network, e.g. to give it different numbers of neurons in the hidden layers, by calling,

reset_network(n1, n2)

Play around with the parameters, and save your favourite result for the discussion prompt - I ❤️ backpropagation.

 Fitting the distribution of heights data

Instructions

In this assessment you will write code to perform a steepest descent to fit a Gaussian model to the distribution of heights data that was first introduced in Mathematics for Machine Learning: Linear Algebra.

The algorithm is the same as you encountered in Gradient descent in a sandpit, but this time, instead of descending a pre-defined function, we shall descend the \chi^2 (chi-squared) function, which is a function both of the parameters that we are to optimise and of the data that the model is to fit.

How to submit

Complete all the tasks you are asked for in the worksheet. When you have finished and are happy with your code, press the Submit Assignment button at the top of this notebook.

Get started

Run the cell below to load dependencies and generate the first figure in this worksheet.

# Run this cell first to load the dependencies for this assessment,
# and generate the first figure.
from readonly.HeightsModule import *

Background

If we have data for the heights of people in a population, it can be plotted as a histogram, i.e., a bar chart where each bar has a width representing a range of heights, and an area which is the probability of finding a person with a height in that range. We can look to model that data with a function, such as a Gaussian, which we can specify with two parameters, rather than holding all the data in the histogram.

The Gaussian function is given as,

f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The figure above shows the data in orange, the model in magenta, and where they overlap in green. This particular model has not been fit well - there is not a strong overlap.

Recall from the videos the definition of \chi^2 as the squared difference of the data and the model, i.e., \chi^2 = |y - f(x;\mu,\sigma)|^2. This is represented in the figure as the sum of the squares of the pink and orange bars.

Don't forget that x and y are represented as vectors here, as these are lists of all of the data points; the absolute-value-squared encodes squaring the residual on each bar and summing over all bars.
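
In code, \chi^2 is just the sum of squared residuals over the bins. A minimal sketch (assuming x and y are NumPy arrays of bin positions and bin heights, and using the Gaussian f defined in the cell below) would be:

# A sketch of the chi-squared measure described above (not part of the assessment).
def chi_squared(x, y, mu, sig):
    r = y - f(x, mu, sig)   # residual in each bin
    return r @ r            # square and sum the residuals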

To improve the fit, we will want to alter the parameters \mu and \sigma, and ask how that changes the \chi^2. That is, we will need to calculate the Jacobian,

J = [\frac{\partial (\chi^2)}{\partial \mu}, \frac{\partial (\chi^2)}{\partial \sigma}]

Let's look at the first term, \frac{\partial (\chi^2)}{\partial \mu}. Using the multivariate chain rule, this can be written as,

\frac{\partial (\chi^2)}{\partial \mu} = -2(y - f(x;\mu,\sigma)) \cdot \frac{\partial f}{\partial \mu}(x;\mu,\sigma)

There is a similar expression for \frac{\partial (\chi^2)}{\partial \sigma}; try to work out this expression for yourself.

The Jacobians rely on the derivatives \frac{\partial f}{\partial \mu} and \frac{\partial f}{\partial \sigma}. Write functions below for these.

# PACKAGE
import matplotlib.pyplot as plt
import numpy as np
# GRADED FUNCTION

# This is the Gaussian function.
def f (x,mu,sig) :
    return np.exp(-(x-mu)**2/(2*sig**2)) / np.sqrt(2*np.pi) / sig

# Next up, the derivative with respect to μ.
# If you wish, you may want to express this as f(x, mu, sig) multiplied by chain rule terms.
# === COMPLETE THIS FUNCTION ===
def dfdmu (x,mu,sig) :
    return f(x, mu, sig) * 1 / sig**2 * (x - mu)

# Finally in this cell, the derivative with respect to σ.
# === COMPLETE THIS FUNCTION ===
def dfdsig (x,mu,sig) :
    return f(x, mu, sig) * (-1 / sig + (x-mu)**2 / sig**3)
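
If you want to check your derivatives before moving on, a central finite difference at an arbitrary test point should agree closely with dfdmu and dfdsig. The test values below are made up, and this cell is not part of the assessment.

# Optional check (not graded): compare the analytic derivatives with central
# finite differences at an arbitrary test point.
x0, mu0, sig0, eps = 170.0, 155.0, 6.0, 1e-6
print(dfdmu(x0, mu0, sig0),  (f(x0, mu0 + eps, sig0) - f(x0, mu0 - eps, sig0)) / (2 * eps))
print(dfdsig(x0, mu0, sig0), (f(x0, mu0, sig0 + eps) - f(x0, mu0, sig0 - eps)) / (2 * eps))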

Next, recall that steepest descent moves around in parameter space in proportion to the negative of the Jacobian, i.e., \begin{bmatrix} \delta\mu \\ \delta\sigma \end{bmatrix} \propto -J, with the constant of proportionality being the aggression of the algorithm.

Modify the function below to include the \frac{\partial (\chi^2)}{\partial \sigma} term of the Jacobian; the \frac{\partial (\chi^2)}{\partial \mu} term has been included for you.

# GRADED FUNCTION

# Complete the expression for the Jacobian, the first term is done for you.
# Implement the second.
# === COMPLETE THIS FUNCTION ===
def steepest_step (x, y, mu, sig, aggression) :
    J = np.array([
        -2*(y - f(x,mu,sig)) @ dfdmu(x,mu,sig),
        -2*(y - f(x,mu,sig)) @ dfdsig(x,mu,sig) # This is the second element of the Jacobian.
    ])
    step = -J * aggression
    return step

Test your code before submission

To test the code you've written above, run all previous cells (select each cell, then press the play button [ ▶| ] or press shift-enter). You can then use the code below to test out your function. You don't need to submit these cells; you can edit and run them as much as you like.

# First get the heights data, ranges and frequencies
x,y = heights_data()

# Next we'll assign trial values for these.
mu = 155 ; sig = 6
# We'll keep track of these so we can plot their evolution.
p = np.array([[mu, sig]])

# Plot the histogram for our parameter guess
histogram(f, [mu, sig])
# Do a few rounds of steepest descent.
for i in range(50) :
    dmu, dsig = steepest_step(x, y, mu, sig, 2000)
    mu += dmu
    sig += dsig
    p = np.append(p, [[mu,sig]], axis=0)
# Plot the path through parameter space.
contour(f, p)
# Plot the final histogram.
histogram(f, [mu, sig])

Note that the path taken through parameter space is not necessarily the most direct path, as with steepest descent we always move perpendicular to the contours.
