阅读材料:Ramsay, J.O., Silverman, B.W. (2005) Functional Data Analysis (2nd Edition) Section 1.1, 1.2, 1.3, 1.4, 1.6.1
Contents
1 Functional data
Functional Data: Multivariate data with an ordering on the dimensions. (Muller, 2006)
- Key assumption is smoothness:
y i j = x i ( t i j ) + ϵ i j y_{ij} = x_i(t_{ij}) + \epsilon_{ij} yij=xi(tij)+ϵij
with
y i j y_{ij} yij: the ith individual in the jth time piont;
t t t: in a continuum (usually time);
x i ( t ) x_i(t) xi(t): functions for different i i i. - Highest quality data from monitoring equipment. Noiser and less frequent data can also be used.
Eg. Handwriting Data: Measures of position of nib of a pen writing “fda”. 20 replications, measurements taken at 200 Hz.
Fanctional data is complex.
- Requires more thought/judgement than a t-test.
- Data needs pre-processing.
- Parametric inference is rarely available/appropriate.
- Most approaches are computationally challenging.
FDA is relatively new.
- First named in Dalzell & Ramsay, 1991.
- Relatively little penetration into applied fields.
- Several competing methodologies from different schools of thought (we focus on one).
- Limited public software/resourses - changed over the last decade.
- Initial development on data analysis rather than inference.
Necessities for Functional Data:
- must believably derive from a smooth process;
- process should not be easily parameterizable (should not be able to write down a formula);
- enough data to resolve the essential features of the process (peaks, zero-crossings, speed… will depend on application);
- some repetition in the process;
- do not need equally spaced or perfect measurements.
2 From discrete to functional data
2.1 Two methods
- Representing non-parametric continuous-time functions.
- Basis-expansion methods:
x ( t ) = ∑ i = 1 k ϕ i ( t ) c i x(t) = \sum_{i = 1}^k \phi_i (t) c_i x(t)=∑i=1kϕi(t)ci
for pre-defined ϕ i ( t ) \phi_i (t) ϕi(t) and coefficient c i c_i ci. - Several basis systems available: focus on Fourior and B-splines.
- Basis-expansion methods:
- Reducing noise in measurements.
- Smoothing penalties:
c = a r g m i n ∑ i = 1 n ( y i − x ( t i ) ) 2 + λ ∫ [ L x ( t ) ] 2 d t c = argmin \sum_{i = 1}^n (y_i - x(t_i))^2 + \lambda \int [Lx(t)]^2 dt c=argmin∑i=1n(yi−x(ti))2+λ∫[Lx(t)]2dt - L x ( t ) Lx(t) Lx(t) measures “roughness” of x x x
- λ \lambda λ: a “smoothing parameter” that trades-off fit to the y i y_i yi and roughness; must be chosen.
- Smoothing penalties:
2.2 Basis expansions
Consider only one record
y i = x ( t i ) + ϵ y_i = x(t_i) + \epsilon yi=x(ti)+ϵ
representing
x ( t ) = ∑ j = 1 k c j ϕ j ( t i ) = Φ ( t ) T c x(t) = \sum_{j=1}^k c_j \phi_j (t_i) = \Phi (t)^T \mathbf{c} x(t)=j=1∑kcjϕj(ti)=Φ(t)Tc
Φ ( t ) \Phi (t) Φ(t) is a basis system for x x x.
Eg. Terms for curvature in linear regression
y i = β 0 + x i β 1 + x i 2 β 2 + x i 3 β 3 + . . . + ϵ i y_i = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... + \epsilon_i yi=β0+xiβ1+xi2β2+xi3β3+...+ϵi
implies
x ( t ) = β 0 + x i β 1 + x i 2 β 2 + x i 3 β 3 + . . . x(t) = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... x(t)=β0+xiβ1+xi2β2+xi3β3+...
Polynomials are unstable; Fourier basis and B-splines wil be more useful.
2.2.1 The Fourier basis
Add up these basis plots.
Advantages:
- Only alternative to monomial bases until the middle of the 20th century;
- Excellent computational properties, especially if the observations are equally spaced;
- Natural for describing periodic data.
Disadvantages:
- The functions are periodc which can be a problem for data with growth curves etc.
2.2.2 B-splines
P ( u ) = ∑ i = 0 k P i B i , j ( u ) , u ∈ [ u j − 1 , u k + 1 ] , P(u) = \sum_{i=0}^k P_i B_{i, j} (u), ~~~~~ u \in [u_{j-1}, u_{k+1}], P(u)=i=0∑kPiBi,j(u), u∈[uj−1,uk+1],
Splines ( B i , j B_{i, j} Bi,j): polynomial segmetns joined end-to-end.
- Segments are constrained to be smooth at the join.
Knotes ( P i P_i Pi): the pointes at which the segments join.
The system is difined by
- the order m m m (order = degree + 1) of the polynomial segments (splines);
- the location of the knots.
Properties:
- Number of basis functions = order ( m m m) + number of interior knots.
- Derivatives up to m − 2 m - 2 m−2 are continuous.
- B-spline basis functions are positive over at most m m m adjacent intervals → \rightarrow → fast computation for even thousands of basis functions.
- Sum of all B-splines in a basis is always 1; can fit any polynomial of order m m m.
- Most popular choice is order 4, implying continuous second derivatives. Second derivatives have straight-line segments.
Choosing the number of knots and order:
- The order of the spline should be at least m + 2 m + 2 m+2 if you are interested in m m m derivatives.
- Knots are often equally spaced (a useful default).
- Place more knotes where you know there is strong curvature, and fewer where the function changes slowly.
- Be sure there is at least one data point in every interval.
2.2.3 Other basis
The fda library in R also allows the following basis:
- Constant: ϕ ( t ) = 1 \phi (t) = 1 ϕ(t)=1.
- Power: t λ 1 , t λ 2 , t λ 3 , . . . t^{\lambda_1}, t^{\lambda_2}, t^{\lambda_3}, ... tλ1,tλ2,tλ3,..., powers are distinctbut not necessarily integers or positive.
- Exponential: e λ 1 t , e λ 2 t , e λ 3 t , . . . e^{\lambda_1 t}, e^{\lambda_2 t}, e^{\lambda_3 t}, ... eλ1t,eλ2t,eλ3t,....
Other possible basis include:
- Wavelets: especially for sharp, local features.
- Empirical: functional principal components.
- Designer: dynamic models, tailoring a basis to data can be more efficient.
2.2.4 Summary
- Basis expansions: just like adding different independent variables in linear regression
- Fourier basis: classical, common in signal processing etc. Great for periodic functions. Must be infinitely differentiable.
- B-spline basis: locally polynomial. Allows control of smoothness and accuracy. Local definition ) good numerics.
- Other basis systems also exist.
- What is best depends on the data.
3 R example
3.1 New functions
We need the package "fda"
. (Package “fda”.pdf)
Functions used | Detials |
---|---|
matplot(matrixA, matrixB) | matrixA would be on the x-axis while matrixB on the y-axis. The number of rows of matrixA and matrixB should match. |
create.fourier.basis(vectorRange, numberBasis) | Create Fourier basis |
create.bspline.basis(vectorRange, numberBasis, numberOrder, breaks = numberKnots) | Create B-spline basis |
eval.basis(vectorTimepoint, basis) | Evaluate the basis. Outcome is a vector with vectorTimepoint as rows and basis as colnumns. |
library("fda")
# Plot the temperature data of the 35 cities
matplot(CanadianWeather$dailyAv[, , 1], , type = "l", xlab = "Days")
# Load the dataset "CanadianWeather"
data(CanadianWeather)
# Temperature and precipitation are contained in the dailyAV element, and we
# will extract them.
temp <- CanadianWeather$dailyAv[, , 1]
precip <- CanadianWeather$dailyAv[, , 2]
# We need corresponding time points. I'll put them half-way through a day.
# This is because the period is over 0:365 and we'd like the 365 data points
# to be about symmetric in that period.
daytime <- (1:365) - 0.5
# This is a bit fine for plotting purposes, so we'll also create a vector of
# points to use for plotting every 5 days
day5 <- seq(0, 365, 5)
# Plot these data
matplot(daytime, temp, type='l')
matplot(daytime, precip, type='l')
# We can also plot by region; Atlantic, Pacific, Central and North.
matplot(daytime, temp, type = 'l', col = as.factor(CanadianWeather$region))
matplot(daytime, precip, type = 'l',col = as.factor(CanadianWeather$region))
|
|
![]() |
![]() |
| |
![]() |
![]() |
# Perform fda
# Define a basis system
# We'll always need the range.
dayrng = c(0, 365)
3.2 Fourier basis
# 1/ Fourier Basis with 365 basis functions
fbasis = create.fourier.basis(dayrng, 365)
# Plot a basis with just 5 components
plot(create.fourier.basis(dayrng, 5))
# Try a simple linear regression of the first temperature record on the first
# 20 basis functions.
fb.values <- eval.basis(day5, fbasis)
# The 74 by 365 matrix that results has days as rows, bases as columns.
dim(fb.values)
plot(day5, fb.values[, 1])
plot(day5, fb.values[, 2])
![]() |
![]() |
# Extract the first temperature record
ex.temp <- temp[, 1]
plot(ex.temp, main = "Daily Temperature of St. Johns")
# Run a linear regression on temperature with the first 5 basis functions.
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:5] # First 5 basis functions
matplot(Xmat, type = "l")
ex.temp.mod <- lm(ex.temp ~ Xmat)
# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (5 basis functions")
# Check the residuals
plot(daytime, ex.temp.mod$resid)
There’s some clear autocorrelation. More basis functions may be warranted.
Repeat the above with different numbers of basis functions (say 20, 50, 100, 200, 365) to find the one give a reasonable smooth.
Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:20] # First 20 basis functions
ex.temp.mod <- lm(ex.temp ~ Xmat)
# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (20 basis functions)")
# Check the residuals
plot(daytime, ex.temp.mod$resid)
3.3 B-splines
# 2/ B-spline bases with knots every 5 days
# First of all define a knot sequence; this will be the same as day5.
knots <- day5
# We'll use fourth-order B-splines
norder <- 4
# This implies the number of basis functions
nbasis = length(knots) + norder - 2
# Define the basis
bbasis <- create.bspline.basis(dayrng, nbasis, norder, knots)
# If in doubt, we can obtain
bbasis$nbasis # number of basis functions
bbasis$rangeval # basis range
# Plot the basis
plot(bbasis)
# We can look at a smaller number
plot(create.bspline.basis(dayrng, nbasis = 12, norder))
# We can also look at the inner product of these
in.mat = inprod(bbasis, bbasis)
par(mfrow=c(1, 1))
image(in.mat)
It is zero outside of a diagonal band This can help computation a great deal.
Change the order of the basis and observe how the width of the support of the basis changes and how its smoothness properties change.
bbasis <- create.bspline.basis(dayrng, nbasis = 20, norder)
# Run a linear regression on temperature with the first 20 basis functions.
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, bbasis)
Xmat <- Xmat[, 1:20] # First 20 basis functions
matplot(Xmat, type = "l")
ex.temp.mod <- lm(ex.temp ~ Xmat)
# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main="Daily Temperature of St. Johns with smoothing")
# Check the residuals
plot(daytime,ex.temp.mod$resid)