Functional Data Analysis Notes 1 - Introduction and Basis Expansion


阅读材料:Ramsay, J.O., Silverman, B.W. (2005) Functional Data Analysis (2nd Edition) Section 1.1, 1.2, 1.3, 1.4, 1.6.1

目录
  • 1 Functional data
  • 2 From discrete to functional data
    • 2.1 Two methods
    • 2.2 Basis expansions
      • 2.2.1 The Fourier basis
      • 2.2.2 B-splines
      • 2.2.3 Other basis
      • 2.2.4 Summary
  • 3 R example
    • 3.1 New functions
    • 3.2 Fourier basis
    • 3.3 B-splines

1 Functional data

Functional Data: Multivariate data with an ordering on the dimensions. (Muller, 2006)

  • Key assumption is smoothness:

    \[y_{ij} = x_i(t_{ij}) + \epsilon_{ij} \]

    with

    • \(y_{ij}\): the ith individual in the jth time piont;
    • \(t\): in a continuum (usually time);
    • \(x_i(t)\): functions for different \(i\).
  • Highest quality data from monitoring equipment. Noiser and less frequent data can also be used.

Eg. Handwriting Data: Measures of position of nib of a pen writing "fda". 20 replications, measurements taken at 200 Hz.
Handwriting Data

Fanctional data is complex.

  • Requires more thought/judgement than a t-test.
  • Data needs pre-processing.
  • Parametric inference is rarely available/appropriate.
  • Most approaches are computationally challenging.

FDA is relatively new.

  • First named in Dalzell & Ramsay, 1991.
  • Relatively little penetration into applied fields.
  • Several competing methodologies from different schools of thought (we focus on one).
  • Limited public software/resourses - changed over the last decade.
  • Initial development on data analysis rather than inference.

Necessities for Functional Data:

  • must believably derive from a smooth process;
  • process should not be easily parameterizable (should not be able to write down a formula);
  • enough data to resolve the essential features of the process (peaks, zero-crossings, speed... will depend on application);
  • some repetition in the process;
  • do not need equally spaced or perfect measurements.

2 From discrete to functional data

2.1 Two methods

  • Representing non-parametric continuous-time functions.

    • Basis-expansion methods:
      \(x(t) = \sum_{i = 1}^k \phi_i (t) c_i\)
      for pre-defined \(\phi_i (t)\) and coefficient \(c_i\).
    • Several basis systems available: focus on Fourior and B-splines.
  • Reducing noise in measurements.

    • Smoothing penalties:

    \[c = argmin \sum_{i = 1}^n (y_i - x(t_i))^2 + \lambda \int [Lx(t)]^2 dt \]

    • \(Lx(t)\) measures "roughness" of \(x\)
    • \(\lambda\): a "smoothing parameter" that trades-off fit to the \(y_i\) and roughness; must be chosen.

2.2 Basis expansions

Consider only one record

\[y_i = x(t_i) + \epsilon \]

representing

\[x(t) = \sum_{j=1}^k c_j \phi_j (t_i) = \Phi (t)^T \mathbf{c} \]

\(\Phi (t)\) is a basis system for \(x\).

Eg. Terms for curvature in linear regression

\[y_i = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... + \epsilon_i \]

implies

\[x(t) = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... \]

Polynomials are unstable; Fourier basis and B-splines wil be more useful.

2.2.1 The Fourier basis

Add up these basis plots.

fourier-1

fourier-2

fourier-3

Advantages:

  • Only alternative to monomial bases until the middle of the 20th century;
  • Excellent computational properties, especially if the observations are equally spaced;
  • Natural for describing periodic data.

Disadvantages:

  • The functions are periodc which can be a problem for data with growth curves etc.

2.2.2 B-splines

\[P(u) = \sum_{i=0}^k P_i B_{i, j} (u), ~~~~~ u \in [u_{j-1}, u_{k+1}], \]

Splines (\(B_{i, j}\)): polynomial segmetns joined end-to-end.

  • Segments are constrained to be smooth at the join.

Knotes (\(P_i\)): the pointes at which the segments join.

The system is difined by

  • the order \(m\) (order = degree + 1) of the polynomial segments (splines);
  • the location of the knots.

b-splines-1

b-splines-2

b-splines-3

b-spline-local

Properties:

  • Number of basis functions = order (\(m\)) + number of interior knots.
  • Derivatives up to \(m - 2\) are continuous.
  • B-spline basis functions are positive over at most \(m\) adjacent intervals \(\rightarrow\) fast computation for even thousands of basis functions.
  • Sum of all B-splines in a basis is always 1; can fit any polynomial of order \(m\).
  • Most popular choice is order 4, implying continuous second derivatives. Second derivatives have straight-line segments.

Choosing the number of knots and order:

  • The order of the spline should be at least \(m + 2\) if you are interested in \(m\) derivatives.
  • Knots are often equally spaced (a useful default).
  • Place more knotes where you know there is strong curvature, and fewer where the function changes slowly.
  • Be sure there is at least one data point in every interval.

2.2.3 Other basis

The fda library in R also allows the following basis:

  • Constant: \(\phi (t) = 1\).
  • Power: \(t^{\lambda_1}, t^{\lambda_2}, t^{\lambda_3}, ...\), powers are distinctbut not necessarily integers or positive.
  • Exponential: \(e^{\lambda_1 t}, e^{\lambda_2 t}, e^{\lambda_3 t}, ...\).

Other possible basis include:

  • Wavelets: especially for sharp, local features.
  • Empirical: functional principal components.
  • Designer: dynamic models, tailoring a basis to data can be more efficient.

2.2.4 Summary

  • Basis expansions: just like adding different independent variables in linear regression
  • Fourier basis: classical, common in signal processing etc. Great for periodic functions. Must be infinitely differentiable.
  • B-spline basis: locally polynomial. Allows control of smoothness and accuracy. Local definition ) good numerics.
  • Other basis systems also exist.
  • What is best depends on the data.

3 R example

3.1 New functions

We need the package "fda". (Package "fda".pdf)

Functions used Detials
matplot(matrixA, matrixB) matrixA would be on the x-axis while matrixB on the y-axis. The number of rows of matrixA and matrixB should match.
create.fourier.basis(vectorRange, numberBasis) Create Fourier basis
create.bspline.basis(vectorRange, numberBasis, numberOrder, breaks = numberKnots) Create B-spline basis
eval.basis(vectorTimepoint, basis) Evaluate the basis. Outcome is a vector with vectorTimepoint as rows and basis as colnumns.

library("fda")

# Plot the temperature data of the 35 cities
matplot(CanadianWeather$dailyAv[, , 1], , type = "l", xlab = "Days")

matplot

# Load the dataset "CanadianWeather"
data(CanadianWeather)

# Temperature and precipitation are contained in the dailyAV element, and we
# will extract them.
temp <- CanadianWeather$dailyAv[, , 1]
precip <- CanadianWeather$dailyAv[, , 2]

# We need corresponding time points. I'll put them half-way through a day. 
# This is because the period is over 0:365 and we'd like the 365 data points 
# to be about symmetric in that period.
daytime <- (1:365) - 0.5

# This is a bit fine for plotting purposes, so we'll also create a vector of
# points to use for plotting every 5 days
day5 <- seq(0, 365, 5)


# Plot these data
matplot(daytime, temp, type='l')
matplot(daytime, precip, type='l')

# We can also plot by region; Atlantic, Pacific, Central and North.
matplot(daytime, temp, type = 'l', col = as.factor(CanadianWeather$region))
matplot(daytime, precip, type = 'l',col = as.factor(CanadianWeather$region))
Temperature of the 35 cities
Precipitation of the 35 cities
Grouped by region
# Perform fda

# Define a basis system
# We'll always need the range.
dayrng = c(0, 365)

3.2 Fourier basis

# 1/ Fourier Basis with 365 basis functions
fbasis = create.fourier.basis(dayrng, 365)

# Plot a basis with just 5 components
plot(create.fourier.basis(dayrng, 5))

在这里插入图片描述

# Try a simple linear regression of the first temperature record on the first
# 20 basis functions.
fb.values <- eval.basis(day5, fbasis)
# The 74 by 365 matrix that results has days as rows, bases as columns.
dim(fb.values)
plot(day5, fb.values[, 1])
plot(day5, fb.values[, 2])
# Extract the first temperature record
ex.temp <- temp[, 1]
plot(ex.temp, main = "Daily Temperature of St. Johns")

ex.temp

# Run a linear regression on temperature with the first 5 basis functions.
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:5]  # First 5 basis functions
matplot(Xmat, type = "l")

Xmat

ex.temp.mod <- lm(ex.temp ~ Xmat)

# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (5 basis functions")

fit_5

# Check the residuals
plot(daytime, ex.temp.mod$resid)

resid

There's some clear autocorrelation. More basis functions may be warranted.
Repeat the above with different numbers of basis functions (say 20, 50, 100, 200, 365) to find the one give a reasonable smooth.

Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:20]  # First 20 basis functions
ex.temp.mod <- lm(ex.temp ~ Xmat)

# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (20 basis functions)")

fit_20

# Check the residuals
plot(daytime, ex.temp.mod$resid)

resid_20

3.3 B-splines

# 2/ B-spline bases with knots every 5 days

# First of all define a knot sequence; this will be the same as day5.
knots <- day5

# We'll use fourth-order B-splines
norder <- 4

# This implies the number of basis functions
nbasis = length(knots) + norder - 2

# Define the basis
bbasis <- create.bspline.basis(dayrng, nbasis, norder, knots)

# If in doubt, we can obtain
bbasis$nbasis  # number of basis functions
bbasis$rangeval  # basis range

# Plot the basis
plot(bbasis)

bbasis

# We can look at a smaller number
plot(create.bspline.basis(dayrng, nbasis = 12, norder))

bbasis-12

# We can also look at the inner product of these
in.mat = inprod(bbasis, bbasis)
par(mfrow=c(1, 1))
image(in.mat)

inner_product

It is zero outside of a diagonal band This can help computation a great deal.

Change the order of the basis and observe how the width of the support of the basis changes and how its smoothness properties change.

bbasis <- create.bspline.basis(dayrng, nbasis = 20, norder)
# Run a linear regression on temperature with the first 20 basis functions. 
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, bbasis)
Xmat <- Xmat[, 1:20]  # First 20 basis functions
matplot(Xmat, type = "l")

Xmat_b

ex.temp.mod <- lm(ex.temp ~ Xmat)

# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main="Daily Temperature of St. Johns with smoothing")

b_fitted

# Check the residuals
plot(daytime,ex.temp.mod$resid)

b_residual

相关