The tutorials are generated from Python 2 IPython Notebook files, which are linked at the end of each chapter so that you can adapt and run the examples yourself. The neural networks themselves are implemented using the Python NumPy library, which offers efficient implementations of linear algebra functions such as vector and matrix multiplications. Illustrative plots are generated using Matplotlib. If you want to run these examples yourself and don’t have Python with the necessary libraries installed, I recommend downloading and installing Anaconda Python, a free Python distribution that contains all the libraries you need to run these tutorials and that was used to create them.
A version of this tutorial is also available in Chinese thanks to Mingming Chen.
Linear regression
This first part will cover:
A very simple neural network
Concepts such as target function and cost function
Gradient descent optimisation
All this will be illustrated with the help of the simplest neural network possible: a 1-input, 1-output linear regression model whose goal is to predict the target value $t$ from the input value $x$.
Image of the simple neural network
In regular neural networks, we typically have multiple layers, non-linear activation functions, and a bias for each node. In this tutorial, we only have one layer with one weight parameter $w$, no activation function on the output, and no bias. In simple linear regression the parameter $w$ is the slope of the regression line.
In this tutorial, we will approximate the targets $t$ with the model outputs $y$ by minimizing the squared error cost.
The notebook starts out with importing the libraries we need:
In [1]:
# Python imports
import numpy  # Matrix and vector computation package
import matplotlib.pyplot as plt  # Plotting library
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
numpy.random.seed(seed=1)
Define the target function
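In this example, the targets $t$ will be generated from a function $f$ and additive Gaussian noise sampled from a normal distribution with mean 0 and standard deviation 0.2, so that $t = f(x) + \text{noise}$ with $f(x) = 2x$.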
We will sample 20 input samples $x$ from the uniform distribution between 0 and 1, and then generate the target output values $t$ by the process described above. The resulting inputs $x$ and targets $t$ are plotted against each other in the figure below together with the original $f(x)$ line without the Gaussian noise. Note that $x$ is a vector of individual input samples $x_i$, and that $t$ is a corresponding vector of target values $t_i$.
In [2]:
# Define the vector of input samples as x, with 20 values sampled from a uniform
# distribution between 0 and 1
x = numpy.random.uniform(0, 1, 20)
# Generate the target values t from x with small Gaussian noise so the estimation
# won't be perfect.
# Define a function f that represents the line that generates t without noise
def f(x): return x * 2
# Create the targets t with some Gaussian noise
noise_std_dev = 0.2  # Standard deviation of the Gaussian noise (variance 0.2**2 = 0.04)
# Gaussian noise error for each sample in x
noise = numpy.random.randn(x.shape[0]) * noise_std_dev
# Create targets t
t = f(x) + noise
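One detail worth checking here: multiplying standard normal samples by a factor scales their standard deviation by that factor, so their variance scales by its square. The sketch below illustrates this with the standard library's `random.gauss` as a stand-in for `numpy.random.randn` (an assumption made only to keep the example dependency-free); the seed and sample size are arbitrary choices.

```python
import random

random.seed(1)  # arbitrary seed, only so this sketch is repeatable

scale = 0.2          # factor applied to the standard normal samples
n_samples = 100000   # hypothetical sample size, large enough for a stable estimate

# Draw standard normal samples and scale them, analogous to `randn(...) * scale`
samples = [random.gauss(0.0, 1.0) * scale for _ in range(n_samples)]

# Empirical mean, variance and standard deviation of the scaled samples
mean = sum(samples) / n_samples
variance = sum((s - mean) ** 2 for s in samples) / n_samples
std_dev = variance ** 0.5
# std_dev comes out close to 0.2, and variance close to 0.2**2 = 0.04
```

Scaling by 0.2 therefore gives noise with a standard deviation of about 0.2 and a variance of about 0.04.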
In [3]:
# Plot the target t versus the input x
plt.plot(x, t, 'o', label='t')
# Plot the initial line
plt.plot([0, 1], [f(0), f(1)], 'b-', label='f(x)')
plt.xlabel('$x$', fontsize=15)
plt.ylabel('$t$', fontsize=15)
plt.ylim([0, 2])
plt.title('inputs (x) vs targets (t)')
plt.grid()
plt.legend(loc=2)
plt.show()
Define the cost function
We will optimize the model $y = x * w$ by tuning parameter $w$ so that the squared error cost over all samples is minimized. The squared error cost is defined as
$$\xi = \sum_{i=1}^{N} (t_i - y_i)^2$$
with $N$ the number of samples.
Notice that we take the sum of errors over all samples, which is known as batch training. We could also update the parameters based upon one sample at a time, which is known as online training.
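As a small worked example of the batch cost, the sketch below computes the squared error of each sample and then sums them. The data values and the weight are hypothetical toy numbers, not the tutorial's random samples, and plain Python lists are used so the arithmetic is easy to follow by hand.

```python
# Toy data (hypothetical values, not the tutorial's random samples)
x = [0.0, 0.5, 1.0]
t = [0.0, 1.0, 2.0]  # targets lying exactly on t = 2 * x
w = 1.0              # a deliberately wrong weight

# Model outputs y_i = x_i * w
y = [x_i * w for x_i in x]
# Squared error of each individual sample
sample_costs = [(t_i - y_i) ** 2 for t_i, y_i in zip(t, y)]
# Batch cost: the sum of the per-sample squared errors
batch_cost = sum(sample_costs)
```

Here the per-sample errors are 0, 0.25 and 1, so the batch cost is 1.25. Online training would instead use one entry of `sample_costs` at a time.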
This cost function for variable $w$ is plotted in the figure below. The value $w = 2$, the slope chosen for $f(x)$, lies close to the minimum of the cost function.
The neural network model is implemented in the nn(x, w) function, and the cost function is implemented in the cost(y, t) function.
In [4]:
# Define the neural network function y = x * w
def nn(x, w): return x * w
# Define the cost function
def cost(y, t): return ((t - y)**2).sum()
In [5]:
# Plot the cost vs the given weight w
# Define a vector of weights for which we want to plot the cost
ws = numpy.linspace(0, 4, num=100)  # weight values
cost_ws = numpy.vectorize(lambda w: cost(nn(x, w), t))(ws)  # cost for each weight in ws
# Plot
plt.plot(ws, cost_ws, 'r-')
plt.xlabel('$w$', fontsize=15)
plt.ylabel(r'$\xi$', fontsize=15)
plt.title('cost vs. weight')
plt.grid()
plt.show()
Optimizing the cost function
For a simple cost function like in this example, you can see by eye what the optimal weight should be. But the error surface can be quite complex or have a high dimensionality (each parameter adds a new dimension). This is why we use optimization techniques to find the minimum of the error function.
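For this one-parameter model the minimum can also be checked directly, which is a useful sanity check before turning to iterative optimization. Setting the derivative of the squared error cost to zero gives $w^* = \sum_i x_i t_i / \sum_i x_i^2$. The sketch below is an illustration under that assumption; the helper name `closed_form_weight` and the toy data are hypothetical and not part of the original notebook.

```python
def cost(w, x, t):
    """Squared error cost summed over all samples for the model y = x * w."""
    return sum((t_i - x_i * w) ** 2 for x_i, t_i in zip(x, t))

def closed_form_weight(x, t):
    """Weight that minimizes the cost: sum(x_i * t_i) / sum(x_i ** 2)."""
    numerator = sum(x_i * t_i for x_i, t_i in zip(x, t))
    denominator = sum(x_i * x_i for x_i in x)
    return numerator / denominator

# Toy data (hypothetical), roughly following t = 2 * x with some noise
x = [0.1, 0.4, 0.7, 1.0]
t = [0.3, 0.7, 1.5, 1.9]

w_star = closed_form_weight(x, t)
# The cost at w_star should not exceed the cost at nearby weights
cost_at_minimum = cost(w_star, x, t)
cost_left = cost(w_star - 0.01, x, t)
cost_right = cost(w_star + 0.01, x, t)
```

A shortcut like this only exists for very simple models, which is why the rest of this part relies on gradient descent.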
Gradient descent
One optimization algorithm commonly used to train neural networks is the gradient descent algorithm. Gradient descent works by taking the derivative of the cost function $\xi$ with respect to the parameters at a specific position on this cost function, and then updating the parameters in the direction of the negative gradient. The parameter $w$ is iteratively updated by taking steps proportional to the negative of the gradient:
$$w(k+1) = w(k) - \Delta w(k)$$
With $w(k)$ the value of $w$ at iteration $k$ during the gradient descent. $\Delta w$ is defined as:
$$\Delta w = \mu \frac{\partial \xi}{\partial w}$$
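With $\mu$ the learning rate, which is how big of a step you take along the gradient, and $\partial \xi / \partial w$ the gradient of the cost function $\xi$ with respect to the weight $w$. For each sample $i$ this gradient can be split according to the chain rule into:
$$\frac{\partial \xi_i}{\partial w} = \frac{\partial \xi_i}{\partial y_i} \frac{\partial y_i}{\partial w}$$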
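Where $\xi_i$ is the squared error cost, so the $\partial \xi_i / \partial y_i$ term can be written as:
$$\frac{\partial \xi_i}{\partial y_i} = \frac{\partial (t_i - y_i)^2}{\partial y_i} = -2 (t_i - y_i) = 2 (y_i - t_i)$$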
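And since $y_i = x_i * w$ we can write $\partial y_i / \partial w$ as:
$$\frac{\partial y_i}{\partial w} = \frac{\partial (x_i * w)}{\partial w} = x_i$$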
So the full update function $\Delta w$ for sample $i$ will become:
$$\Delta w = \mu \frac{\partial \xi_i}{\partial w} = \mu \, 2 x_i (y_i - t_i)$$
In batch processing, we just add up the gradients of all samples:
$$\Delta w = \mu \, 2 \sum_{i} x_i (y_i - t_i)$$
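A common way to gain confidence in a derived gradient is to compare it against a finite-difference estimate of the cost. The sketch below does this for the batch gradient $2 \sum_i x_i (y_i - t_i)$. Note that this `gradient` returns the summed gradient as a single number, a variation chosen for this sketch; `numerical_gradient`, the step `eps` and the toy data are hypothetical.

```python
def cost(w, x, t):
    """Squared error cost summed over all samples for the model y = x * w."""
    return sum((t_i - x_i * w) ** 2 for x_i, t_i in zip(x, t))

def gradient(w, x, t):
    """Analytic batch gradient: sum of 2 * x_i * (y_i - t_i) with y_i = x_i * w."""
    return sum(2 * x_i * (x_i * w - t_i) for x_i, t_i in zip(x, t))

def numerical_gradient(w, x, t, eps=1e-6):
    """Central finite-difference estimate of d(cost)/dw (eps is an arbitrary small step)."""
    return (cost(w + eps, x, t) - cost(w - eps, x, t)) / (2 * eps)

# Toy data (hypothetical values)
x = [0.2, 0.5, 0.9]
t = [0.5, 0.9, 1.7]
w = 0.1

analytic = gradient(w, x, t)
numeric = numerical_gradient(w, x, t)
difference = abs(analytic - numeric)  # should be tiny if the derivation is right
```

Because the cost is quadratic in $w$, the central difference agrees with the analytic gradient up to floating-point rounding.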
To start the gradient descent algorithm, you typically pick the initial parameters at random and then keep updating them with $\Delta w$ until convergence. The learning rate needs to be tuned separately as a hyperparameter for each neural network.
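To illustrate why the learning rate needs tuning, the sketch below runs the same batch update with two different learning rates on a tiny noise-free dataset. The helper `run_gradient_descent`, the data and the two learning rates are hypothetical choices for this illustration; for this particular data the gradient is $10 (w - 2)$, so each update multiplies the distance to the minimum by $1 - 10\mu$.

```python
def gradient(w, x, t):
    """Batch gradient of the squared error cost for the model y = x * w."""
    return sum(2 * x_i * (x_i * w - t_i) for x_i, t_i in zip(x, t))

def run_gradient_descent(w, x, t, learning_rate, nb_of_iterations):
    """Apply w = w - learning_rate * gradient a fixed number of times."""
    for _ in range(nb_of_iterations):
        w = w - learning_rate * gradient(w, x, t)
    return w

# Toy data (hypothetical): t = 2 * x without noise, so the minimum is at w = 2
x = [1.0, 2.0]
t = [2.0, 4.0]

# Small learning rate: the distance to w = 2 halves at every step
w_small = run_gradient_descent(0.0, x, t, learning_rate=0.05, nb_of_iterations=20)
# Large learning rate: every step overshoots and the distance grows by a factor 1.5
w_large = run_gradient_descent(0.0, x, t, learning_rate=0.25, nb_of_iterations=20)
```

With the small learning rate the weight ends up very close to 2, while the large one moves further from the minimum with every update.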
The gradient $\partial \xi / \partial w$ is implemented by the gradient(w, x, t) function. $\Delta w$ is computed by the delta_w(w_k, x, t, learning_rate) function. The loop below performs 4 iterations of gradient descent while printing out the parameter value and current cost.
In [6]:
# Define the gradient function. Remember that y = nn(x, w) = x * w
def gradient(w, x, t):
    return 2 * x * (nn(x, w) - t)
# Define the update function delta w
def delta_w(w_k, x, t, learning_rate):
    return learning_rate * gradient(w_k, x, t).sum()
# Set the initial weight parameter
w = 0.1
# Set the learning rate
learning_rate = 0.1
# Start performing the gradient descent updates, and print the weights and cost:
nb_of_iterations = 4  # number of gradient descent updates
w_cost = [(w, cost(nn(x, w), t))]  # List to store the (weight, cost) values
for i in range(nb_of_iterations):
    dw = delta_w(w, x, t, learning_rate)  # Get the delta w update
    w = w - dw  # Update the current weight parameter
    w_cost.append((w, cost(nn(x, w), t)))  # Add (weight, cost) to list
# Print the weight and cost at every iteration
for i in range(0, len(w_cost)):
    print('w({}): {:.4f} \t cost: {:.4f}'.format(i, w_cost[i][0], w_cost[i][1]))
w(0): 0.1000 cost: 13.6197
w(1): 1.5277 cost: 1.1239
w(2): 1.8505 cost: 0.4853
w(3): 1.9234 cost: 0.4527
w(4): 1.9399 cost: 0.4510
Notice in the previous output that the gradient descent algorithm quickly converges towards the target value of around $2.0$. Let’s plot these iterations of the gradient descent algorithm to visualize them.
In [7]:
# Plot the first 2 gradient descent updates
plt.plot(ws, cost_ws, 'r-')  # Plot the error curve
# Plot the updates
for i in range(0, len(w_cost)-2):
    w1, c1 = w_cost[i]
    w2, c2 = w_cost[i+1]
    plt.plot(w1, c1, 'bo')
    plt.plot([w1, w2], [c1, c2], 'b-')
    plt.text(w1, c1+0.5, '$w({})$'.format(i))
# Show figure
plt.xlabel('$w$', fontsize=15)
plt.ylabel(r'$\xi$', fontsize=15)
plt.title('Gradient descent updates plotted on cost function')
plt.grid()
plt.show()
Gradient descent updates
The last figure shows the gradient descent updates of the weight parameter for 2 iterations. The blue dots represent the weight parameter values $w(k)$ at iteration $k$. Notice how the size of each update depends on the gradient at the current position of the weight. The first update takes a much larger step than the second update because the gradient at $w(0)$ is much larger than the gradient at $w(1)$.
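The shrinking step size can also be seen numerically. The sketch below records the absolute size of each update over a few iterations; the toy data is hypothetical (not the tutorial's random samples), while the initial weight and learning rate mirror the values used above.

```python
def gradient(w, x, t):
    """Batch gradient of the squared error cost for the model y = x * w."""
    return sum(2 * x_i * (x_i * w - t_i) for x_i, t_i in zip(x, t))

# Toy data (hypothetical): targets lying exactly on t = 2 * x
x = [0.2, 0.5, 0.9]
t = [0.4, 1.0, 1.8]

learning_rate = 0.1
w = 0.1
step_sizes = []  # absolute size of each update delta w

for _ in range(4):
    dw = learning_rate * gradient(w, x, t)  # update for the current weight
    step_sizes.append(abs(dw))
    w = w - dw  # move against the gradient
```

Each entry of `step_sizes` is smaller than the previous one, because the gradient shrinks as the weight approaches the minimum.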
The regression line fitted by gradient descent with 10 iterations is shown in the figure below. The fitted line (red) lies close to the original line (blue), which is what we tried to approximate via the noisy samples. Notice that both lines go through the point $(0, 0)$. This is because we didn’t include a bias term, which represents the intercept, so the intercept at $x = 0$ is $t = 0$.
In [8]:
w = 0
# Start performing the gradient descent updates
nb_of_iterations = 10  # number of gradient descent updates
for i in range(nb_of_iterations):
    dw = delta_w(w, x, t, learning_rate)  # get the delta w update
    w = w - dw  # update the current weight parameter
In [9]:
# Plot the fitted line against the target line
# Plot the target t versus the input x
plt.plot(x, t, 'o', label='t')
# Plot the initial line
plt.plot([0, 1], [f(0), f(1)], 'b-', label='f(x)')
# Plot the fitted line
plt.plot([0, 1], [0*w, 1*w], 'r-', label='fitted line')
plt.xlabel('input x')
plt.ylabel('target t')
plt.ylim([0, 2])
plt.title('input vs. target')
plt.grid()
plt.legend(loc=2)
plt.show()
This post at peterroelants.github.io is generated from an IPython notebook file. Link to the full IPython notebook file