Genetic Algorithms as a tool for wavelength selection in spectral data sets
Riccardo Leardi
Department of Pharmaceutical and Food Chemistry and Technology, University of Genova, Italy
After an initial period in which full-spectrum methods such as PLS or PCR were believed not to need any feature selection, it is now widely accepted that removing noisy or non-informative regions can significantly improve both the predictive ability and the interpretability of the models.
Wavelength selection on spectral data sets has some peculiarities that differentiate it from the selection of “independent” variables: the number of variables is much higher (up to several thousand), the variables are very highly autocorrelated, and the solutions are much more interpretable if they consist of sets of contiguous wavelengths (i.e., regions).
When applied to such data sets, the “classical” feature selection techniques usually end up with a model containing single variables scattered throughout the spectrum. Such a solution is clearly unsatisfactory for spectroscopists, who are used to thinking in terms of spectroscopic features (peaks, shoulders, …).
Genetic Algorithms (GA) have already been shown to give very good results when applied to variable selection problems. A set of modifications to the “standard” algorithm has produced a very powerful algorithm, yielding models made up of a limited number of well-defined regions, often clearly linked to spectroscopic features, and with very good predictive ability.
After a short presentation of GA theory and of the algorithm used for the selection of uncorrelated variables, the algorithm for wavelength selection will be discussed and some real cases will be shown.
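As a rough illustration of the variable-selection setting described above, the sketch below shows a generic, textbook-style GA in which each chromosome is a binary mask over the wavelengths. It is not the modified algorithm discussed in this talk. The function names (`run_ga`, `toy_fitness`), the operators (truncation selection, one-point crossover, bit-flip mutation, elitism), and all parameter values are assumptions chosen for brevity. In a real application the fitness would typically be the cross-validated predictive ability of a PLS model built on the selected wavelengths; here a toy score that rewards an arbitrary "informative" region stands in for it.

```python
import random


def toy_fitness(mask):
    """Hypothetical stand-in for a cross-validated model score.

    Rewards selected variables inside an assumed informative region
    (indices 10..19) and penalizes selected variables outside it.
    """
    inside = sum(mask[10:20])
    outside = sum(mask) - inside
    return inside - outside


def run_ga(fitness, n_vars, pop_size=30, generations=50,
           p_init=0.1, p_mut=0.01, seed=0):
    """Generic GA over binary masks (1 = wavelength selected)."""
    rng = random.Random(seed)
    # Random initial population with few selected variables per chromosome.
    pop = [[1 if rng.random() < p_init else 0 for _ in range(n_vars)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]      # truncation selection
        children = [ranked[0][:]]             # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_vars)    # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=fitness)
    return best


if __name__ == "__main__":
    best = run_ga(toy_fitness, n_vars=40)
    selected = [i for i, g in enumerate(best) if g]
    print("selected variable indices:", selected)
```

Nothing in this generic scheme favors contiguous wavelengths. With a real, noisy fitness function it can return isolated variables scattered along the spectrum, which is the behavior that the modifications mentioned above are meant to address.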