Multivariate (and multiway) calibration methods such as principal component regression (PCR) and partial least squares (PLS) regression require the analyst to select a suitable number of components (also known as latent variables or factors). In practice, this is anything but a trivial task!
Validation-based model selection
Currently, validation-based methods appear to be the standard tool. Using a large independent test set is generally considered best ('test = best'), at least in theory. However, setting aside a separate test set is rather wasteful, so one may not always be available.
Cross-validation is the most popular alternative. However, it is a resampling method, so its correct application is, strictly speaking, limited to data that can be considered a random sample from some population. This may present a serious problem if the data are collected according to an experimental design. Cross-validation is also likely to fail when an existing model is to be updated with a few observations from another population.
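To make the cross-validation idea concrete, the following is a minimal sketch in plain Python of how a cross-validated prediction error (RMSECV) can be computed for a given number of PLS components. The function names (pls1_fit, pls1_predict, rmsecv), the choice of a single-response PLS1 algorithm with sequential deflation, and the default of five randomly assigned folds are assumptions made for illustration only. Note that the random fold assignment presumes exactly the random-sample situation discussed above.

```python
import random


def pls1_fit(X, y, n_components):
    """Fit a PLS1 model by sequential deflation (illustrative sketch).

    X is a list of rows, y a list of responses. Returns the column means,
    the response mean, and one (weights, loadings, q) tuple per component.
    """
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    comps = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # nothing left to model
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(t[i] * f[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Predict the response for one new row x."""
    x_mean, y_mean, comps = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * p[j] for j in range(len(e))]
    return y_hat


def rmsecv(X, y, n_components, n_folds=5, seed=0):
    """Root mean squared error of cross-validation for one model size."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)          # random fold assignment
    folds = [idx[k::n_folds] for k in range(n_folds)]
    press = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = pls1_fit([X[i] for i in train], [y[i] for i in train],
                         n_components)
        press += sum((y[i] - pls1_predict(model, X[i])) ** 2 for i in fold)
    return (press / len(y)) ** 0.5
```

Calling rmsecv for 0, 1, 2, ... components yields the error curve from which the number of components is then chosen. With zero components the sketch simply predicts the mean of the training responses.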

A serious weakness shared by all validation-based methods is the following. Ideally, validation yields a minimum prediction error at the optimum model complexity (see Figure COM 1).
In practice, however, a clear minimum is often not observed. Instead, one has to rely on 'visual inspection' of 'far-from-ideal' plots and apply 'soft' decision rules such as the 'first local minimum' or the 'start of a plateau'.
Which decision rule is applied depends on the data at hand.
It follows that validation-based model selection is inherently subjective in practice.
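The two 'soft' decision rules named above can be written down explicitly, which also shows how easily they disagree. The sketch below is only one possible formalization: the function names and the relative tolerance rel_tol that defines a 'plateau' are assumptions, and the error values in the usage note are invented for illustration.

```python
def first_local_minimum(errors):
    """Number of components (1-based) at the first local minimum.

    errors[k] is the validation error of the model with k + 1 components.
    If the curve never turns upward, the largest model is returned.
    """
    for k in range(len(errors) - 1):
        if errors[k] < errors[k + 1]:
            return k + 1
    return len(errors)


def start_of_plateau(errors, rel_tol=0.05):
    """Smallest number of components beyond which one extra component
    lowers the error by less than rel_tol (as a fraction of the current
    error). The threshold rel_tol is a subjective choice.
    """
    for k in range(len(errors) - 1):
        if errors[k] - errors[k + 1] < rel_tol * errors[k]:
            return k + 1
    return len(errors)
```

For the invented error curve [1.00, 0.60, 0.45, 0.44, 0.43, 0.435, 0.50], first_local_minimum returns 5 components, whereas start_of_plateau with rel_tol=0.05 returns 3. Both answers are defensible, which is precisely the subjectivity described above.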



Figure COM 1: The textbook illustration.


Randomization test


We have developed a randomization test that aims to make the decision more objective. A tutorial introduction is:
 N.M. Faber
How to avoid overfitting in PLS regression?
CAC 2006
For more practical examples, see:
 N.M. Faber and R. Rajkó
An evergreen problem in multivariate calibration
Spectroscopy Europe, 18 (2006) 24-28
Download from the Spectroscopy Europe site (=1,470 kB)
 N.M. Faber and R. Rajkó
How to avoid overfitting in multivariate calibration - the conventional validation approach and an alternative
Analytica Chimica Acta, 595 (2007) 98-106
 M.P. Gómez-Carracedo, J.M. Andrade, D.N. Rutledge and N.M. Faber
Selecting the optimum number of PLS components for the calibration of ATR-mid-IR spectra of undesigned kerosene samples
Analytica Chimica Acta, 585 (2007) 253-265
 S. Wiklund, D. Nilsson, L. Eriksson, M. Sjöström, S. Wold and K. Faber
A randomization test for PLS component selection
Journal of Chemometrics, 21 (2007) 427-439
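The general idea behind a randomization (permutation) test for component selection can be sketched as follows: for each candidate component, compare a statistic computed from the real data with the distribution of the same statistic after randomly permuting the response. This is a generic illustration and not the published procedure from the papers listed above. The test statistic (the norm of X'y on the deflated data, which equals the score-response inner product for a PLS1 component), the function names, and the default of 200 permutations are all assumptions.

```python
import random


def cov_statistic(X, y):
    """Norm of X'y; equals t'y for the PLS1 score t = Xw, w = X'y/||X'y||."""
    n, m = len(X), len(X[0])
    return sum(sum(X[i][j] * y[i] for i in range(n)) ** 2
               for j in range(m)) ** 0.5


def deflate_once(X, y):
    """Remove one PLS1 component from (already centered) X and y."""
    n, m = len(X), len(X[0])
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(m)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:
        return X, y  # nothing left to remove
    w = [v / norm for v in w]
    t = [sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
    tt = sum(v * v for v in t)
    if tt == 0.0:
        return X, y
    p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
    q = sum(t[i] * y[i] for i in range(n)) / tt
    X_res = [[X[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
    y_res = [y[i] - q * t[i] for i in range(n)]
    return X_res, y_res


def randomization_p_values(X, y, max_components, n_perm=200, seed=0):
    """Permutation p-value for each successive component (sketch)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xr = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yr = [v - y_mean for v in y]
    p_values = []
    for _component in range(max_components):
        observed = cov_statistic(Xr, yr)
        exceed = 0
        for _ in range(n_perm):
            y_perm = yr[:]
            rng.shuffle(y_perm)  # break the X-y link, keep both marginals
            if cov_statistic(Xr, y_perm) >= observed:
                exceed += 1
        # Add-one correction so the p-value is never exactly zero.
        p_values.append((exceed + 1) / (n_perm + 1))
        Xr, yr = deflate_once(Xr, yr)
    return p_values
```

A small p-value for a component suggests that it captures more than chance covariance between X and y. One would then retain components up to the first one whose p-value exceeds a chosen significance level, which itself remains a choice for the analyst.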

References & further information


For further information, please contact Róbert Rajkó:

