MTHM506/COMM511 - Statistical Data Modelling
Topic 3 - Introduction
Preliminaries
In this session, we will start Topic 3 and introduce Generalised Additive Models, a more flexible class
of models often seen as an extension of Generalised Linear Models. These notes refer to Topics
3.1-3.8 from the lecture notes. For this session, we need the mgcv package to fit Generalised Additive
Models. We use the install.packages() function to download and install the most recent version of the
package and the library() function to load it into the R session.
# Installing required packages
install.packages("mgcv")
# Loading required packages into the library
library(mgcv)
Introduction
In Topic 2, we learned about Generalised Linear Models (GLMs), a framework for building more
general models for different types of data. In Topic 3, we extend the GLM framework so that we can
model the mean function using smooth functions of the covariates. Let's formalise this in a similar way to
GLMs. A Generalised Additive Model (GAM) has a response variable $Y_i$ which again comes from the
exponential family of distributions

$$Y_i \sim \text{EF}(\theta_i, \phi)$$

Examples of exponential family distributions that we have seen are the Normal, Binomial, Poisson, Negative
Binomial, Exponential and Gamma. Remember, $\theta_i$ is called the location parameter and $\phi$ is called the
scale/dispersion parameter. The location parameter relates to the mean of the distributions in this family
and the dispersion relates to the variance. Again we will see that the variance is not independent of the
mean (see Slides 4-5 in Topic 2 Notes): we are working with probability distributions for which there may
be a mean-variance relationship, so that the variance is a scaled function of the mean. For the Poisson
distribution, for instance, $\text{Var}(Y_i) = \mu_i$.
In GLMs we specified a function of the mean $E(Y_i) = \mu_i$ as follows:

$$g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i}$$

where $\eta_i$ is called the linear predictor (the part of the model in which we relate the response $y_i$ to the
covariates $x_i$). It relates to the mean of the distribution $\mu_i$ through a function $g(\cdot)$, the "link function".
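To make this concrete, here is a minimal sketch (not from the notes; the variable names and data are
illustrative only) of a Poisson GLM with a log link, fitted to some quickly simulated data.
# Simulating a covariate and a Poisson response with log(mu) = 0.5 + 1.2x
set.seed(1)
x <- runif(50)
y <- rpois(50, lambda = exp(0.5 + 1.2 * x))
# Fitting the GLM: g(mu_i) = log(mu_i) = beta_0 + beta_1*x_i
glm(y ~ x, family = poisson(link = "log"))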
Now, in GAMs, we want to replace this linear predictor with a sum of unknown functions of our covariates

$$g(\mu_i) = \eta_i = \sum_{p} f_p(x_{p,i})$$

where the $f_p(\cdot)$ are unknown (smooth, continuous) functions of the covariates $x_{p,i}$. The idea of GAMs
is that we want to fit these unknown functions rather than individual parameters ($\beta$). The easiest way to do
this is to express the functions $f_p(\cdot)$ in a linear way using basis functions. We have seen an example (see Poisson
GLMs) where we fit a polynomial function as our linear predictor, so a function $f_p(\cdot)$ with a polynomial basis
would look like:

$$f_p(x_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_q x_i^q$$
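As an illustration (a sketch reusing the simulated x and y from above), such a polynomial linear predictor
can be fitted directly as a GLM by including polynomial terms of the covariate.
# Cubic polynomial linear predictor in a Poisson GLM:
# log(mu) = beta_0 + beta_1*x + beta_2*x^2 + beta_3*x^3
glm(y ~ x + I(x^2) + I(x^3), family = poisson)
# Equivalently, using R's built-in (orthogonal) polynomial basis
glm(y ~ poly(x, 3), family = poisson)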
More generally we write $f_p(\cdot)$ linearly, as a sum of basis functions $b_j(\cdot)$:

$$f_p(x_i) = \beta_{p,0} + \sum_{j=1}^{q} \beta_{p,j}\, b_{p,j}(x_i)$$

where, for the polynomial basis above, $b_1(x_i) = x_i$, $b_2(x_i) = x_i^2$, $\ldots$, $b_j(x_i) = x_i^j$.
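To see what the basis representation means in practice, here is a minimal sketch (again reusing the
simulated x and y from above) that builds a polynomial basis matrix by hand; because $f_p$ is linear in the
$\beta$ coefficients, they can be estimated with the standard GLM machinery.
# Each column of B is one basis function evaluated at the data:
# b_1(x) = x, b_2(x) = x^2, b_3(x) = x^3
B <- cbind(x, x^2, x^3)
# f(x) = beta_0 + sum_j beta_j * b_j(x) is linear in the betas,
# so the betas are just GLM coefficients (the intercept is beta_0)
glm(y ~ B, family = poisson)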
There are certain questions that arise from this. Is this the most sensible way of doing it? How do we
decide what q (the number of basis functions) should be? Can we do this as part of the inference? Let's
consider these questions with an example on some simulated data:
# We will simulate some data