CHEMICAL STRUCTURE SIMILARITY IN MASS SPECTRAL SIMILARITY SEARCHES

K. Varmuza and W. Demuth

Vienna University of Technology, Laboratory for Chemometrics
Institute of Food Chemistry
Getreidemarkt 9/160, A-1060 Vienna, Austria, kvarmuza@email.tuwien.ac.at


Spectral library search methods are routinely applied in mass spectrometry laboratories, but the benefit of spectral similarity searches is questionable for spectra of compounds that are not present in the library. A few mass spectrometric database systems claim a "structural interpretative power"; however, systematic investigations that quantify the similarity between the chemical structures of hitlist compounds and the chemical structure of the tested unknown are rare.

The similarity between two mass spectra has been defined by the correlation coefficient, as used in many library search systems. It is known that the intensities of peaks at certain mass numbers are often not characteristic of substructures or functional groups; a similarity measure based on simple peak intensities therefore cannot be expected to yield hits with good structural similarity. In this work, three types of variables have been compared for the calculation of mass spectral similarities: (a) peak intensities, (b) spectral features, most of which are nonlinear transformations of the peak intensities, and (c) responses from a set of linear multivariate classification or calibration models.
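As an illustration of variable type (a), the following minimal Python sketch ranks library spectra by their correlation coefficient with an unknown spectrum. It assumes the Pearson (mean-centered) form of the correlation coefficient and a dense intensity vector indexed by integer mass number; neither detail is specified here, and the function names `spectrum_vector`, `correlation`, and `hitlist` are hypothetical. The same ranking could be applied to spectral features or model responses by replacing the input vectors.

```python
import math


def spectrum_vector(peaks, max_mass):
    """Convert a {mass number: intensity} dict into a dense vector for masses 1..max_mass."""
    return [float(peaks.get(m, 0.0)) for m in range(1, max_mass + 1)]


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length vectors (assumed form)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        # A constant vector has no defined correlation; treated as 0 here (assumption).
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def hitlist(unknown, library, size):
    """Return the names of the `size` library entries most similar to `unknown`."""
    ranked = sorted(
        library.items(),
        key=lambda item: correlation(unknown, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:size]]
```

For example, `hitlist(unknown, library, 10)` would return the ten library entries with the highest correlation to the unknown, where `library` maps compound names to vectors of the same length as `unknown`.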

Chemical structures have been characterized by a set of 135 binary molecular descriptors generated by our software SubMat. The similarity between two structures has been calculated by the Tanimoto index. The similarity between the structure of a compound used as unknown and the structures in a hitlist has been characterized by the Tanimoto index averaged over all hitlist structures.
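The structural similarity measure can be sketched as follows. For two binary descriptor vectors, the Tanimoto index is the number of descriptors present in both structures divided by the number present in at least one of them. The Python code below is a minimal illustration, not the SubMat implementation; the function names are hypothetical, and returning 0 when neither structure has any descriptor set is an assumed convention.

```python
def tanimoto(a, b):
    """Tanimoto index of two equal-length binary descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # descriptors set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # descriptors set in at least one
    if either == 0:
        # No descriptor set in either structure; 0 is an assumed convention.
        return 0.0
    return both / either


def mean_hitlist_tanimoto(unknown, hits):
    """Average Tanimoto index between the unknown and all hitlist structures."""
    if not hits:
        raise ValueError("hitlist must not be empty")
    return sum(tanimoto(unknown, h) for h in hits) / len(hits)
```

With the descriptors used here, each vector would have 135 elements; the functions work for any common length.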

The database used contained 61,000 mass spectra and structures. Random samples of 200 spectra were selected to compare the different approaches. In general, the similarities between the structure of the compound considered as unknown and the structures of the hitlist compounds were found to be higher if spectral features, rather than peak intensities, were used to calculate the spectral similarity. In some cases, a further improvement was obtained by using responses of multivariate models as variables for the calculation of spectral similarities.

The development of the software SubMat (http://www.lcm.tuwien.ac.at) by H. Scsibrany is gratefully acknowledged. This work was supported by the Austrian Science Fund (project P14792-CHE).