Component selection

Multivariate (and multiway) calibration methods such as principal component regression (PCR) and partial least squares (PLS) regression require the analyst to select a suitable number of components (also known as latent variables or factors). In practice, this is anything but a trivial task!

This page is organized as follows:

  • Validation-based model selection
  • Randomization test
  • References & further information

Validation-based model selection

Currently, validation-based methods appear to be the standard tool. Using a large independent test set is generally considered best ('test = best'), at least in theory. However, setting aside a separate test set is rather wasteful of samples, so one may not always be available.

Cross-validation is the most popular alternative. However, it is a resampling method, so its correct application is, strictly speaking, limited to data that can be considered a random sample from some population. This may present a serious problem if the data are collected according to an experimental design. Cross-validation is also likely to fail when an existing model is to be updated with a few observations from another population.
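To make the resampling character of cross-validation concrete, here is a minimal sketch of k-fold cross-validation in plain Python. The function names (`kfold_indices`, `rmsecv`, `fit_line`, `predict_line`) and the toy data are assumptions made for illustration only; a straight-line fit stands in for a PCR or PLS model with a given number of components. Note that the folds are drawn at random, which is exactly where the random-sample assumption enters.

```python
import math
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmsecv(x, y, fit, predict, k=5, seed=0):
    """Root mean squared error of cross-validation for one candidate model."""
    sq_err = 0.0
    for test in kfold_indices(len(x), k, seed):
        held_out = set(test)
        train = [i for i in range(len(x)) if i not in held_out]
        # Fit on the retained samples only, then predict the held-out ones.
        model = fit([x[i] for i in train], [y[i] for i in train])
        sq_err += sum((predict(model, x[i]) - y[i]) ** 2 for i in test)
    return math.sqrt(sq_err / len(x))

def fit_line(x, y):
    """Hypothetical stand-in model: ordinary least-squares straight line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx

def predict_line(model, xi):
    slope, intercept = model
    return slope * xi + intercept

# Illustrative, noise-free toy data (not measured data).
x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 for xi in x]
error = rmsecv(x, y, fit_line, predict_line, k=5)
```

In a component-selection setting one would presumably call `rmsecv` once per candidate number of components and compare the resulting error values.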

A serious weakness shared by all validation-based methods is the following. Ideally, validation yields a clear minimum in the prediction error at the optimum model complexity (see Figure COM 1).

In practice, however, a clear minimum is often not observed. Instead, one has to rely on 'visual inspection' of 'far-from-ideal' plots, and apply 'soft' decision rules such as the 'first local minimum' or the 'start of a plateau'.

Which decision rule is applied actually depends on the data at hand.

It follows that validation-based model selection is inherently subjective in practice.
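The subjectivity can be illustrated with a small sketch in which three plausible decision rules are applied to the same validation curve. Everything here is an assumption made for illustration: the function names, the 5% plateau tolerance, and the error values, which are hypothetical and not taken from any data set.

```python
def global_minimum(errors):
    """Number of components (1-based) with the lowest validation error."""
    return min(range(len(errors)), key=errors.__getitem__) + 1

def first_local_minimum(errors):
    """First number of components whose error is lower than the next one."""
    for i in range(len(errors) - 1):
        if errors[i] < errors[i + 1]:
            return i + 1
    return len(errors)

def start_of_plateau(errors, tol=0.05):
    """First number of components after which adding one more component
    improves the error by less than the fraction tol (assumed tolerance)."""
    for i in range(len(errors) - 1):
        if errors[i] - errors[i + 1] < tol * errors[i]:
            return i + 1
    return len(errors)

# Hypothetical validation errors for 1 to 8 components.
rmsecv_curve = [1.00, 0.62, 0.45, 0.44, 0.46, 0.41, 0.40, 0.43]

choices = {
    "global minimum": global_minimum(rmsecv_curve),
    "first local minimum": first_local_minimum(rmsecv_curve),
    "start of plateau": start_of_plateau(rmsecv_curve),
}
```

For this made-up curve the three rules select 7, 4 and 3 components respectively, so the chosen model complexity depends on which rule the analyst happens to prefer.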


Figure COM 1: The textbook illustration.

Randomization test

We have developed a randomization test that is intended to make the decision more objective. A tutorial introduction is:

  • N.M. Faber
    How to avoid over-fitting in PLS regression?
    CAC 2006
    Download (zipped PowerPoint file, 268 kB)

For more practical examples, see:

  • N.M. Faber and R. Rajkó
    An evergreen problem in multivariate calibration
    Spectroscopy Europe, 18 (2006) 24-28
    Download from the Spectroscopy Europe site (PDF, 1,470 kB)
  • N.M. Faber and R. Rajkó
    How to avoid over-fitting in multivariate calibration - the conventional validation approach and an alternative
    Analytica Chimica Acta, 595 (2007) 98-106
  • M.P. Gómez-Carracedo, J.M. Andrade, D.N. Rutledge and N.M. Faber
    Selecting the optimum number of PLS components for the calibration of ATR-mid-IR spectra of undesigned kerosene samples
    Analytica Chimica Acta, 585 (2007) 253-265
  • S. Wiklund, D. Nilsson, L. Eriksson, M. Sjöström, S. Wold and K. Faber
    A randomization test for PLS component selection
    Journal of Chemometrics, 21 (2007) 427-439
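The general idea behind a randomization (permutation) test can be sketched as follows. This is a generic illustration under stated assumptions and not necessarily the procedure described in the papers listed above: the choice of statistic (the covariance between a component's scores and the response), the function names and the toy data are all hypothetical. The response values are shuffled many times to destroy any real relationship, and the observed statistic is compared with the values obtained by chance.

```python
import random

def covariance(t, y):
    """Sample covariance between a score vector t and a response y."""
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    return sum((a - mt) * (b - my) for a, b in zip(t, y)) / (n - 1)

def randomization_p_value(t, y, n_perm=999, seed=0):
    """Fraction of permutations whose |covariance| reaches the observed one.

    Assumed test statistic: |cov(t, y)|. The observed arrangement is counted
    as one of the permutations, so the p-value can never be exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(covariance(t, y))
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any real link between t and y
        if abs(covariance(t, y_perm)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Illustrative toy data: scores that are perfectly related to the response.
scores = [float(i) for i in range(12)]
response = [3.0 * s + 0.5 for s in scores]
p = randomization_p_value(scores, response)
```

Under this sketch, a small p-value suggests that the component captures a relationship unlikely to arise by chance, whereas a large p-value would mark the component as a candidate for exclusion. The significance threshold remains a choice for the analyst.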

References & further information

Open a list of references.

For further information, please contact Róbert Rajkó.