Functional Data Analysis Notes 1 - Introduction and Basis Expansion
阅读材料:Ramsay, J.O., Silverman, B.W. (2005) Functional Data Analysis (2nd Edition) Section 1.1, 1.2, 1.3, 1.4, 1.6.1
目录- 1 Functional data
- 2 From discrete to functional data
- 2.1 Two methods
- 2.2 Basis expansions
- 2.2.1 The Fourier basis
- 2.2.2 B-splines
- 2.2.3 Other basis
- 2.2.4 Summary
- 3 R example
- 3.1 New functions
- 3.2 Fourier basis
- 3.3 B-splines
1 Functional data
Functional Data: Multivariate data with an ordering on the dimensions. (Muller, 2006)
-
Key assumption is smoothness:
\[y_{ij} = x_i(t_{ij}) + \epsilon_{ij} \]with
- \(y_{ij}\): the ith individual in the jth time piont;
- \(t\): in a continuum (usually time);
- \(x_i(t)\): functions for different \(i\).
-
Highest quality data from monitoring equipment. Noiser and less frequent data can also be used.
Eg. Handwriting Data: Measures of position of nib of a pen writing "fda". 20 replications, measurements taken at 200 Hz.
Fanctional data is complex.
- Requires more thought/judgement than a t-test.
- Data needs pre-processing.
- Parametric inference is rarely available/appropriate.
- Most approaches are computationally challenging.
FDA is relatively new.
- First named in Dalzell & Ramsay, 1991.
- Relatively little penetration into applied fields.
- Several competing methodologies from different schools of thought (we focus on one).
- Limited public software/resourses - changed over the last decade.
- Initial development on data analysis rather than inference.
Necessities for Functional Data:
- must believably derive from a smooth process;
- process should not be easily parameterizable (should not be able to write down a formula);
- enough data to resolve the essential features of the process (peaks, zero-crossings, speed... will depend on application);
- some repetition in the process;
- do not need equally spaced or perfect measurements.
2 From discrete to functional data
2.1 Two methods
-
Representing non-parametric continuous-time functions.
- Basis-expansion methods:
\(x(t) = \sum_{i = 1}^k \phi_i (t) c_i\)
for pre-defined \(\phi_i (t)\) and coefficient \(c_i\). - Several basis systems available: focus on Fourior and B-splines.
- Basis-expansion methods:
-
Reducing noise in measurements.
- Smoothing penalties:
- \(Lx(t)\) measures "roughness" of \(x\)
- \(\lambda\): a "smoothing parameter" that trades-off fit to the \(y_i\) and roughness; must be chosen.
2.2 Basis expansions
Consider only one record
\[y_i = x(t_i) + \epsilon \]representing
\[x(t) = \sum_{j=1}^k c_j \phi_j (t_i) = \Phi (t)^T \mathbf{c} \]\(\Phi (t)\) is a basis system for \(x\).
Eg. Terms for curvature in linear regression
\[y_i = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... + \epsilon_i \]implies
\[x(t) = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... \]Polynomials are unstable; Fourier basis and B-splines wil be more useful.
2.2.1 The Fourier basis
Add up these basis plots.
Advantages:
- Only alternative to monomial bases until the middle of the 20th century;
- Excellent computational properties, especially if the observations are equally spaced;
- Natural for describing periodic data.
Disadvantages:
- The functions are periodc which can be a problem for data with growth curves etc.
2.2.2 B-splines
\[P(u) = \sum_{i=0}^k P_i B_{i, j} (u), ~~~~~ u \in [u_{j-1}, u_{k+1}], \]Splines (\(B_{i, j}\)): polynomial segmetns joined end-to-end.
- Segments are constrained to be smooth at the join.
Knotes (\(P_i\)): the pointes at which the segments join.
The system is difined by
- the order \(m\) (order = degree + 1) of the polynomial segments (splines);
- the location of the knots.
Properties:
- Number of basis functions = order (\(m\)) + number of interior knots.
- Derivatives up to \(m - 2\) are continuous.
- B-spline basis functions are positive over at most \(m\) adjacent intervals \(\rightarrow\) fast computation for even thousands of basis functions.
- Sum of all B-splines in a basis is always 1; can fit any polynomial of order \(m\).
- Most popular choice is order 4, implying continuous second derivatives. Second derivatives have straight-line segments.
Choosing the number of knots and order:
- The order of the spline should be at least \(m + 2\) if you are interested in \(m\) derivatives.
- Knots are often equally spaced (a useful default).
- Place more knotes where you know there is strong curvature, and fewer where the function changes slowly.
- Be sure there is at least one data point in every interval.
2.2.3 Other basis
The fda library in R also allows the following basis:
- Constant: \(\phi (t) = 1\).
- Power: \(t^{\lambda_1}, t^{\lambda_2}, t^{\lambda_3}, ...\), powers are distinctbut not necessarily integers or positive.
- Exponential: \(e^{\lambda_1 t}, e^{\lambda_2 t}, e^{\lambda_3 t}, ...\).
Other possible basis include:
- Wavelets: especially for sharp, local features.
- Empirical: functional principal components.
- Designer: dynamic models, tailoring a basis to data can be more efficient.
2.2.4 Summary
- Basis expansions: just like adding different independent variables in linear regression
- Fourier basis: classical, common in signal processing etc. Great for periodic functions. Must be infinitely differentiable.
- B-spline basis: locally polynomial. Allows control of smoothness and accuracy. Local definition ) good numerics.
- Other basis systems also exist.
- What is best depends on the data.
3 R example
3.1 New functions
We need the package "fda"
. (Package "fda".pdf)
Functions used | Detials |
---|---|
matplot(matrixA, matrixB) |
matrixA would be on the x-axis while matrixB on the y-axis. The number of rows of matrixA and matrixB should match. |
create.fourier.basis(vectorRange, numberBasis) |
Create Fourier basis |
create.bspline.basis(vectorRange, numberBasis, numberOrder, breaks = numberKnots) |
Create B-spline basis |
eval.basis(vectorTimepoint, basis) |
Evaluate the basis. Outcome is a vector with vectorTimepoint as rows and basis as colnumns. |
library("fda")
# Plot the temperature data of the 35 cities
matplot(CanadianWeather$dailyAv[, , 1], , type = "l", xlab = "Days")
# Load the dataset "CanadianWeather"
data(CanadianWeather)
# Temperature and precipitation are contained in the dailyAV element, and we
# will extract them.
temp <- CanadianWeather$dailyAv[, , 1]
precip <- CanadianWeather$dailyAv[, , 2]
# We need corresponding time points. I'll put them half-way through a day.
# This is because the period is over 0:365 and we'd like the 365 data points
# to be about symmetric in that period.
daytime <- (1:365) - 0.5
# This is a bit fine for plotting purposes, so we'll also create a vector of
# points to use for plotting every 5 days
day5 <- seq(0, 365, 5)
# Plot these data
matplot(daytime, temp, type='l')
matplot(daytime, precip, type='l')
# We can also plot by region; Atlantic, Pacific, Central and North.
matplot(daytime, temp, type = 'l', col = as.factor(CanadianWeather$region))
matplot(daytime, precip, type = 'l',col = as.factor(CanadianWeather$region))
![]() |
![]() |
![]() |
![]() |
# Perform fda
# Define a basis system
# We'll always need the range.
dayrng = c(0, 365)
3.2 Fourier basis
# 1/ Fourier Basis with 365 basis functions
fbasis = create.fourier.basis(dayrng, 365)
# Plot a basis with just 5 components
plot(create.fourier.basis(dayrng, 5))
# Try a simple linear regression of the first temperature record on the first
# 20 basis functions.
fb.values <- eval.basis(day5, fbasis)
# The 74 by 365 matrix that results has days as rows, bases as columns.
dim(fb.values)
plot(day5, fb.values[, 1])
plot(day5, fb.values[, 2])
![]() |
![]() |
# Extract the first temperature record
ex.temp <- temp[, 1]
plot(ex.temp, main = "Daily Temperature of St. Johns")
# Run a linear regression on temperature with the first 5 basis functions.
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:5] # First 5 basis functions
matplot(Xmat, type = "l")
ex.temp.mod <- lm(ex.temp ~ Xmat)
# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (5 basis functions")
# Check the residuals
plot(daytime, ex.temp.mod$resid)
There's some clear autocorrelation. More basis functions may be warranted.
Repeat the above with different numbers of basis functions (say 20, 50, 100, 200, 365) to find the one give a reasonable smooth.
Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:20] # First 20 basis functions
ex.temp.mod <- lm(ex.temp ~ Xmat)
# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (20 basis functions)")
# Check the residuals
plot(daytime, ex.temp.mod$resid)
3.3 B-splines
# 2/ B-spline bases with knots every 5 days
# First of all define a knot sequence; this will be the same as day5.
knots <- day5
# We'll use fourth-order B-splines
norder <- 4
# This implies the number of basis functions
nbasis = length(knots) + norder - 2
# Define the basis
bbasis <- create.bspline.basis(dayrng, nbasis, norder, knots)
# If in doubt, we can obtain
bbasis$nbasis # number of basis functions
bbasis$rangeval # basis range
# Plot the basis
plot(bbasis)
# We can look at a smaller number
plot(create.bspline.basis(dayrng, nbasis = 12, norder))
# We can also look at the inner product of these
in.mat = inprod(bbasis, bbasis)
par(mfrow=c(1, 1))
image(in.mat)
It is zero outside of a diagonal band This can help computation a great deal.
Change the order of the basis and observe how the width of the support of the basis changes and how its smoothness properties change.
bbasis <- create.bspline.basis(dayrng, nbasis = 20, norder)
# Run a linear regression on temperature with the first 20 basis functions.
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, bbasis)
Xmat <- Xmat[, 1:20] # First 20 basis functions
matplot(Xmat, type = "l")
ex.temp.mod <- lm(ex.temp ~ Xmat)
# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main="Daily Temperature of St. Johns with smoothing")
# Check the residuals
plot(daytime,ex.temp.mod$resid)