István Kövesdi¹, László Örfi², György Kéri²
¹ EGIS Gyógyszergyár Rt., H-1106 Budapest X., Keresztúri út 30-38.
² SOTE, Budapest VIII., POB 260, 1444
We blend established statistical and QSAR techniques with novel scoring functions for the predictivity of descriptors, together with database storage and retrieval operations, to generate QSAR models automatically. The method was developed to predict quantitatively the biological activity of novel or untested compounds and to perform systematic QSAR scout scanning of High Throughput Screening results stored in databases. It can detect low-quality test figures, incorrect structures, and spoiled or mixed samples through the drop in prediction quality of the optimal QSAR models relative to models obtained from data for which everything was done correctly. In the preferred case, the prediction quality of the automatically computed QSAR models reaches a set threshold, e.g. a true external validation predictive Q2 greater than 0.4, which indicates a QSAR model good enough for in-silico screening.
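The threshold test on the external validation Q2 can be illustrated with a minimal sketch. The abstract does not give the formula, so the code assumes one common definition, Q2 = 1 - PRESS / SS, where PRESS is the sum of squared prediction errors on the external set and SS is the sum of squared deviations of the external observations from the training-set mean. The function names `q2_external` and `passes_screening_threshold` are hypothetical.

```python
def q2_external(y_obs, y_pred, y_train_mean):
    """Predictive Q2 on an external set (assumed definition: 1 - PRESS / SS).

    y_obs        -- measured activities of the external compounds
    y_pred       -- model predictions for the same compounds
    y_train_mean -- mean measured activity of the training set
    """
    if len(y_obs) != len(y_pred) or len(y_obs) == 0:
        raise ValueError("y_obs and y_pred must be non-empty and equally long")
    # PRESS: squared prediction errors on compounds the model never saw
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    # SS: spread of the external observations about the training mean
    ss = sum((o - y_train_mean) ** 2 for o in y_obs)
    if ss == 0:
        raise ValueError("external observations show no variance about the training mean")
    return 1.0 - press / ss


def passes_screening_threshold(q2, threshold=0.4):
    """True when Q2 exceeds the threshold (0.4 is the example value from the text)."""
    return q2 > threshold


if __name__ == "__main__":
    # Illustrative numbers only, not data from the study.
    observed = [5.1, 6.3, 7.0, 4.8]
    predicted = [5.4, 6.0, 6.6, 5.2]
    q2 = q2_external(observed, predicted, y_train_mean=5.9)
    print(round(q2, 3), passes_screening_threshold(q2))
```

Under this definition, a model that predicts no better than the training mean scores a Q2 near zero, which is why a drop in Q2 can flag problematic data.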
The method rests on the recognition that measured or automatically calculated biological, physical-chemical, and structural data, linked to the corresponding molecular structures and collected in a standardized database format, permit the fast, automatic, computerized development of optimal quantitative structure-activity relationships. We use a large number of calculated descriptors and principal component analysis in each presented case. We couple intensive, repeated evaluation-set validation of the different QSAR models with an effective sequential or genetic algorithm for variable subset selection. For the genetic algorithm we developed a novel roulette-wheel bit mutation operator tailored specifically to QSAR variable subset selection. The automatic analysis applies MLR, PLS and ANN (LLM coming soon) methods simultaneously to achieve an optimal validated correlation coefficient Q2 for the given models. Applying the scoring function for the predictive ability of the descriptors proved economical in computer time and beneficial to the quality of the final optimized models. We use two levels of external validation to draw the final conclusion about the applicability of an optimized model.
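The abstract does not describe how the roulette-wheel bit mutation operator works, so the following is only one plausible reading, offered as a sketch rather than the authors' operator. It assumes each descriptor carries a non-negative predictivity score, that a candidate subset is a 0/1 mask over the descriptors, and that the bit to flip is drawn by roulette-wheel selection: excluded descriptors with high scores are more likely to be switched on, and included descriptors with low scores are more likely to be switched off. All names (`roulette_pick`, `roulette_bit_mutation`, `scores`) are hypothetical.

```python
import random


def roulette_pick(weights, rng):
    """Return index i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1  # guard against floating-point round-off


def roulette_bit_mutation(mask, scores, rng):
    """Flip exactly one bit of a descriptor mask, chosen by roulette wheel.

    Assumption (not from the source): the wheel favours switching on
    high-score descriptors and switching off low-score ones.
    """
    if len(mask) != len(scores):
        raise ValueError("mask and scores must have the same length")
    top = max(scores)
    eps = 1e-9  # keeps every bit selectable, even with a zero weight
    weights = [
        (top - s + eps) if bit else (s + eps)
        for bit, s in zip(mask, scores)
    ]
    i = roulette_pick(weights, rng)
    child = list(mask)       # copy, so the parent mask is left unchanged
    child[i] = 1 - child[i]  # flip the selected bit
    return child


if __name__ == "__main__":
    rng = random.Random(42)
    parent = [1, 0, 0, 1, 0]
    scores = [0.05, 0.80, 0.10, 0.60, 0.30]  # illustrative scores only
    print(roulette_bit_mutation(parent, scores, rng))
```

Biasing the mutation with descriptor scores, instead of flipping bits uniformly at random, is one way such a scoring function could save computer time: fewer generations are spent evaluating subsets built from uninformative descriptors.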