logo_2Cy.gif
Home About us Media Research Consultancy Training Site map Contact

Home » Research » Numerical optimization

Analytical figures of merit and other reliability measures enable one to improve or optimize an analytical determination. Often, this task boils down to performing a combinatorial search. A good example is finding the best subset of wavelength channels to keep for model building.

This page is organized as follows:


Best subset selection

The number of combinations grows exponentially with the total number of channels:

OPT1.gif

Figure OPT 1: The number of combinations as a function of the number of channels. The data are plotted on a logarithmic scale (base 10) for visual clarity.


Consequently, large-scale optimization techniques are required to do the job within a reasonable time span. The excellent survey of Ronald Shaffer and Gary Small (Learning optimization from nature. Genetic algorithms and simulated annealing, Analytical Chemistry, 69 (1997) 236A-242A) shows that two techniques are commonly deployed, namely genetic algorithms (GAs) and simulated annealing (SA). Unfortunately, there have been few studies comparing the two methods, and consequently guidelines about when to use a GA or SA are far from complete. (In practice, the choice is often a matter of taste.) For example, when applying a GA and SA to best subset selection no clear winner emerged from the following studies:

  • C.B. Lucasius, M.L.M. Beckers and G. Kateman
    Genetic algorithms in wavelength selection: a comparative study
    Analytica Chimica Acta, 286 (1994) 135-153
  • U. Hörchner and J.H. Kalivas
    Further investigation on a comparative study of simulated annealing and genetic algorithm for wavelength selection
    Analytica Chimica Acta, 311 (1995) 1-13


The main problem with these methods is, that certain parameters must be fine-tuned, otherwise the search is not effective and convergence to the global optimum is unnecessarily slow. (Typical parameters are the 'mutation rate' for a GA and the 'cooling schedule' for SA.) We have found this fine-tuning to be relatively easy when applying SA to best subset selection, as illustrated by the following results.


Worked example


Top Top blue.gif

First, we have investigated the convergence behaviour for the data set studied in the articles mentioned above. This data set consists of the UV spectra of four ribonucleic acids (RNAs) measured at 36 channels:

OPT2.gif

Figure OPT 2. Normalized UV spectra of the RNAs adenine (A), cytidine (C), guanine (G) and uracyl (U). The normalization emphasizes that the spectra of adenine and uracyl are highly overlapped, while the others have distinct features.


Finding the best subset amounts to searching among 236-1»7x1010 candidates. (The rather insignificant minus 1 is for the full data set, which was already evaluated.) The global optimum for this data set was known through an exhaustive search. The SA algorithm was run 10,000 times with a different random initialization and the number of evaluations required to hit the global optimum was monitored. In most cases, less than 10,000 evaluations were required - the mode lies around 700. It can be inferred from the theory of (semi-) Markov processes that this quantity is distributed as the convolution of a normal and a geometric distribution:

OPT3.gif

Figure OPT 3: Probability density function (pdf) of the number of trial evaluations required to find the optimum subset of wavelengths (channels 1, 7, 8, 28 and 36) using the SA algorithm: observed for 10,000 independent runs divided over 100 bins (Yellow dot.gif) and best weighted non-linear fit to the convolution of a normal and a geometric distribution (Cyan line.gif). The pdf is obtained by a suitable normalization of the original frequency counts.


The quality of this fit was statistically validated by examining studentized residuals:

OPT4.gif

Figure OPT 4: Studentized residuals for the original frequency counts, calculated as the residual divided by a standard error. The required standard error follows from the multinomial distribution of binned data (independent runs).


The studentized residuals have a standard deviation 1.08, which suggests a satisfactory fit.

This exercise was repeated for a variety of data sets. To keep the computation time manageable, the distributions were generated for 100 independent runs. The performance of the algorithm was measured in terms of the median of the distributions, rather than the mean, because the distributions are heavily skewed (see Figure OPT 3). Plotting the median against the number of channels yields:

OPT5.gif

Figure OPT 5: The median of the number of trial evaluations required to find the optimum subset of wavelengths using the SA algorithm as a function of the number of channels for various data sets: Gaussian functions (Yellow asterisk.gif), RNAs (Cyan circle.gif), proteins (Green square.gif) and polycyclic aromatic hydrocarbons (Red triangle.gif). The median is calculated for 100 independent runs.


Comparing this plot with the number of combinations (see scale of Figure OPT 1) shows that the search has been very effective. Studies of John Kalivas and coworkers suggest that the current results can be representative for other selection problems (samples, factors).


References & further information


Top Top blue.gif

Open blue.gif Open a list of references. Many of them were kindly provided by John Kalivas and Riccardo Leardi.

For further information on SA, please contact John Kalivas: John Kalivas.jpg

For further information on GAs, please contact Riccardo Leardi: Riccardo Leardi.jpg