Thomas Blenkers and Peter Zinn
Lehrstuhl für Analytische Chemie, Ruhr-Universität Bochum
D-44780 Bochum, Germany
An interesting application of genetic algorithms is genetic programming, developed by Koza. Genetic programming uses a hierarchically coded notation of mathematical expressions to find optimum solutions to given mathematical problems. In analogy to Koza, we applied genetic algorithms to a hierarchically coded chemical line notation. As the line notation, we developed an extension of our previously derived list notation, which is based on the Wiswesser line notation. By applying genetic algorithms to this hierarchical chemical notation, we aimed to find the chemical structure corresponding to an observed 13C-NMR spectrum.
This required a spectrum generator that predicts the 13C-NMR spectra of given chemical compounds. To keep the spectrum generator practicable, we chose halogenated acyclic compounds as the substance class for our study. The generator was developed using a data set of 118 compounds comprising 651 13C shifts. The resulting prediction error was 1.65 ppm over a chemical shift range of 10 to 110 ppm. The prediction model included 29 significant descriptor parameters.
The implementation of the genetic algorithm consists of five steps. In the first step, a start population of halogenated compounds is generated at random. In the second step, the 13C-NMR spectrum of each compound is generated. The third step compares each spectrum of the population with the observed spectrum of the unknown compound and calculates a fitness value for each generated compound. In the fourth step, the candidates for the next generation are selected according to their fitness values by a spinning wheel (roulette wheel) procedure. In the last step, these candidates are rearranged by genetic mutation and crossover to form the next generation. Steps 2 to 5 are repeated in the following generations to find the best candidate that fits the spectrum of the unknown compound within acceptable tolerances.
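The five steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: a candidate is reduced to a flat list of integer "genes" instead of the hierarchical line notation, and toy_spectrum, the gene range 0 to 20, the fitness formula and all parameter defaults are hypothetical stand-ins chosen so that the toy shifts fall between 10 and 110 ppm.

```python
import random

def toy_spectrum(candidate):
    # Hypothetical stand-in for the spectrum generator: one shift (ppm) per gene.
    return sorted(10.0 + 5.0 * gene for gene in candidate)

def fitness(candidate, observed):
    # Assumed fitness: 1.0 for a perfect match, approaching 0 as deviations grow.
    predicted = toy_spectrum(candidate)
    deviation = sum(abs(p - o) for p, o in zip(predicted, observed))
    return 1.0 / (1.0 + deviation)

def roulette_select(population, scores, rng):
    # Spinning wheel selection: the chance of being drawn is proportional to fitness.
    return rng.choices(population, weights=scores, k=len(population))

def crossover(a, b, rng):
    # One-point crossover (needs candidates with at least two genes).
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(candidate, rate, rng):
    # Replace each gene by a random value with probability `rate`.
    return [rng.randint(0, 20) if rng.random() < rate else g for g in candidate]

def evolve(observed, pop_size=40, generations=60, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    n = len(observed)
    # Step 1: random start population.
    population = [[rng.randint(0, 20) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=lambda c: fitness(c, observed))
    for _ in range(generations):
        # Steps 2 and 3: generate spectra and score them against the observation.
        scores = [fitness(c, observed) for c in population]
        # Step 4: fitness-proportional selection.
        parents = roulette_select(population, scores, rng)
        # Step 5: crossover and mutation form the next generation.
        population = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   mutation_rate, rng)
            for _ in range(pop_size)
        ]
        gen_best = max(population, key=lambda c: fitness(c, observed))
        if fitness(gen_best, observed) > fitness(best, observed):
            best = gen_best
    return best

if __name__ == "__main__":
    observed = [10.0, 25.0, 60.0, 105.0]  # hypothetical observed shifts
    winner = evolve(observed)
    print(winner, toy_spectrum(winner), fitness(winner, observed))
```

Because selection, crossover and mutation are all random, different seeds can give different final candidates, which matches the dependence on the start population noted for the validation runs.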
To validate the implemented procedure of structure elucidation, we used a leave-one-out method. The spectrum of each of the 118 molecules in the data set used to develop the spectrum generator was tested as the observed spectrum in the elucidation procedure under the same evolution conditions (population size, number of generations, mutation rate, crossover rate and so on). The result of each test run was assigned a malus value. The malus value allows the results to be divided into three classes. The first class included 53 correctly found substances. The second class included 43 substances for which the result had a higher malus value than the target molecule. In these cases the target molecule was not yet a member of the last generation when we stopped the evolution. We repeated some of these evolution runs with a higher number of generations and found the correct target molecules. It is therefore to be expected that, depending on the randomly generated start population and the random course of the evolution, these compounds can also be assigned correctly. The third class of 22 molecules was not correctly assigned to the expected target molecules. In these cases we observed different causes of the misassignment, which we are investigating in more detail.
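The division into three classes can be illustrated with a short Python sketch. It rests on assumptions that are not stated above: that a lower malus value is better, that structures can be compared through a unique identifier (plain strings here), and that the third class is simply every remaining case. The function names and the example values are hypothetical.

```python
from collections import Counter

def classify_run(best_structure, best_malus, target_structure, target_malus):
    # Class 1: the evolution ended on the target molecule itself.
    if best_structure == target_structure:
        return "found"
    # Class 2: the result scores worse than the target would, so the target
    # had not yet entered the population when the run was stopped.
    if best_malus > target_malus:
        return "not_converged"
    # Class 3 (assumed): any other outcome, i.e. a different structure that
    # scores at least as well as the target.
    return "misassigned"

def summarize(runs):
    # Count how many validation runs fall into each class.
    return Counter(classify_run(*run) for run in runs)

if __name__ == "__main__":
    runs = [
        ("A", 0.0, "A", 0.0),  # hypothetical run ending on the target
        ("B", 3.5, "A", 1.0),  # hypothetical run stopped too early
        ("C", 0.5, "A", 1.0),  # hypothetical misassignment
    ]
    print(summarize(runs))
```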
In summary, we could show that the new approach is a promising technique for structure elucidation, especially considering that the evolution starts at random, without any a priori structural knowledge about the unknown compound.