GENETIC ALGORITHMS IN FEATURE SELECTION

W. Demuth and K. Varmuza

Vienna University of Technology, Laboratory for Chemometrics
Institute of Food Chemistry
Getreidemarkt 9/160, A-1060 Vienna, Austria, wilhelm.demuth@chello.at


The usefulness of selecting appropriate subsets of features is a matter of controversy in chemometrics. For some methods, such as multiple linear regression or linear discriminant analysis, feature selection or a feature reduction by transformation into a set of latent variables is usually necessary. It is also evident that features which introduce only noise may degrade the performance of a model. Other reasons for preferring a reduced set of features are economic and technical restrictions. The often large number of features in chemical problems makes an exhaustive search by testing all possible subsets of features impossible.

There are essentially three types of methods for feature selection: the univariate, the stepwise, and the multivariate approach. (1) Univariate methods select a subset of features which individually have the maximum desired quality, for instance maximum discriminating power or maximum correlation with a given property. This approach is simple and fast but suffers from the fact that interactions between the features are not considered. (2) Stepwise approaches typically start with a univariate selection and then test many or all combinations of systematically extended subsets of features. (3) A potentially powerful multivariate method uses genetic algorithms (GA). In contrast to univariate approaches, GA perform an optimization in which many combinations of features, and thus their interactions, are considered. GA have been applied successfully for feature selection, especially in multivariate calibration; a sketch of the principle is given below.
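To make the GA principle concrete, the following minimal sketch (in Python, using numpy and scikit-learn; these tools, the data set X, y, and all parameter settings are assumptions of this illustration, not part of the original work) encodes each chromosome as a binary mask over the feature columns and uses the cross-validated R^2 of a PLS regression model as fitness. Truncation selection, single-point crossover, and bit-flip mutation are illustrative choices only.

    # Illustrative GA-based feature selection (not the authors' implementation).
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    def fitness(mask, X, y):
        # A chromosome is a binary mask over the feature columns.
        if mask.sum() == 0:
            return -np.inf
        model = PLSRegression(n_components=min(3, int(mask.sum())))
        return cross_val_score(model, X[:, mask.astype(bool)], y,
                               cv=5, scoring="r2").mean()

    def ga_select(X, y, pop_size=30, generations=50, p_mut=0.01):
        n_feat = X.shape[1]
        pop = rng.random((pop_size, n_feat)) < 0.1        # sparse initial chromosomes
        for _ in range(generations):
            scores = np.array([fitness(m, X, y) for m in pop])
            parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # truncation selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = rng.integers(1, n_feat)             # single-point crossover
                child = np.concatenate([a[:cut], b[cut:]])
                flip = rng.random(n_feat) < p_mut         # bit-flip mutation
                children.append(np.where(flip, ~child, child))
            pop = np.vstack([parents, children])
        scores = np.array([fitness(m, X, y) for m in pop])
        return pop[np.argsort(scores)[::-1]]              # chromosomes, best first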

In this work, univariate approaches (based on the Fisher ratio or the correlation coefficient) and GA were compared for feature selection in multivariate classification and calibration. A considerable reduction of the number of features was often possible without loss of predictive performance. In some cases, feature subsets selected by GA gave better results than subsets selected by the univariate approach or than the full set of features. A significant improvement in prediction could often be achieved by combining several GA solutions (chromosomes) instead of using only the best one.
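As an illustration of the univariate baseline and of one possible way to combine several GA chromosomes, the sketch below computes the two-class Fisher ratio for each feature and merges GA masks by a simple vote over the top-ranked solutions. The voting scheme is only one plausible interpretation of "combining several GA solutions"; the class coding, variable names, and thresholds are hypothetical.

    # Illustrative univariate ranking and chromosome combination (assumptions only).
    import numpy as np

    def fisher_ratio(X, y):
        # F_j = (mean_1j - mean_2j)^2 / (var_1j + var_2j); assumes classes coded 0/1.
        X1, X2 = X[y == 0], X[y == 1]
        num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
        den = X1.var(axis=0) + X2.var(axis=0)
        return num / np.where(den == 0, np.finfo(float).eps, den)

    def combine_chromosomes(top_chromosomes, min_votes=3):
        # Keep a feature if it occurs in at least min_votes of the GA solutions.
        votes = np.asarray(top_chromosomes, dtype=int).sum(axis=0)
        return votes >= min_votes

    # Usage (hypothetical data X, y and the ga_select sketch above):
    # ranked = np.argsort(fisher_ratio(X, y))[::-1]      # best features first
    # merged_mask = combine_chromosomes(ga_select(X, y)[:10])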

Contributions by R. Leardi, H. Yoshida, and K. Funatsu are gratefully acknowledged. The Austrian Science Fund supported this work (project P14792-CHE).