GENERALIZATION OF PAIR-CORRELATION METHOD (PCM)
FOR NONPARAMETRIC VARIABLE SELECTION

Róbert Rajkó1 and Károly Héberger2

1Department of Unit Operations and Environmental Engineering,
College Faculty of Food Engineering, University of Szeged,
P. O. Box 433, H-6701 Szeged (Hungary)

2Institute of Chemistry, Chemical Research Center, Hungarian Academy of Sciences
P. O. Box 17, H-1525 Budapest (Hungary)

The Pair-Correlation Method (PCM) was developed for choosing between two correlated predictor variables (factors) [1], provided that the scatter is not caused by random effects alone. The two variables can be distinguished by arranging the data into a 2x2 contingency table. Suitable test statistics [2] can then be used to decide whether the difference between the factors is significant. A macro written in MS Excel 8.0 Visual Basic for Applications (VBA) was also constructed; thanks to the spreadsheet environment, it is a user-friendly and easy-to-use application.
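The abstract does not spell out how the 2x2 table is filled. The following is a minimal Python sketch under one plausible assumption: every pair of observations is classified by whether the change in each factor has the same sign as the change in the response. The function name pcm_table, the argument names and the rule of skipping pairs with a zero difference are all illustrative assumptions, not taken from the original VBA macro.

```python
from itertools import combinations


def pcm_table(y, x1, x2):
    """Fill a 2x2 table by sign agreement of pairwise differences.

    Assumed cell meaning (hypothetical, for illustration):
      a: both x1 and x2 change in the same direction as y
      b: only x1 agrees with y
      c: only x2 agrees with y
      d: neither agrees with y
    Pairs with a zero difference in y, x1 or x2 are skipped (assumption).
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(y)), 2):
        dy = y[j] - y[i]
        d1 = x1[j] - x1[i]
        d2 = x2[j] - x2[i]
        if dy == 0 or d1 == 0 or d2 == 0:
            continue
        agree1 = (d1 > 0) == (dy > 0)
        agree2 = (d2 > 0) == (dy > 0)
        if agree1 and agree2:
            a += 1
        elif agree1:
            b += 1
        elif agree2:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Toy data: x1 follows y perfectly, x2 swaps two neighbouring points.
table = pcm_table([1, 2, 3, 4], [1, 2, 3, 4], [1, 3, 2, 4])
print(table)
```

Under this assumed construction, only the off-diagonal cells b and c carry information about which factor is superior, because the diagonal cells treat both factors alike.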

PCM can easily be generalized for variable selection with more than two variables; the Generalized Pair-Correlation Method is abbreviated GPCM. The factors are compared pair-wise in all possible combinations. If a given statistical test indicates a significant difference between two factors, the dominant and subordinate factors are termed superior and inferior, or winner and loser, respectively. Each comparison either marks a factor as superior or inferior, or yields no decision.
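The pair-wise comparison in all combinations can be sketched as follows. This is an illustration only: it uses a two-sided exact binomial test on the two discordant counts as a stand-in, whereas reference [2] discusses the conditional Fisher's exact test. The names exact_discordant_p and compare_all, the layout of the counts dictionary and the significance level alpha = 0.05 are assumptions made for the example.

```python
from itertools import combinations
from math import comb


def exact_discordant_p(b, c):
    """Two-sided exact binomial p-value for counts b vs c under a 50:50 null."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare_all(names, counts, alpha=0.05):
    """Compare every pair of factors once.

    counts[(f, g)] = (b, c), where b pairs favour f only and c pairs
    favour g only (hypothetical input layout).
    Returns a list of (f, g, winner, p); winner is None for 'no decision'.
    """
    results = []
    for f, g in combinations(names, 2):
        b, c = counts[(f, g)]
        p = exact_discordant_p(b, c)
        if p < alpha and b != c:
            winner = f if b > c else g
        else:
            winner = None
        results.append((f, g, winner, p))
    return results


# Made-up discordant counts for three factors.
demo_counts = {("x1", "x2"): (10, 0), ("x1", "x3"): (3, 2), ("x2", "x3"): (1, 9)}
for row in compare_all(["x1", "x2", "x3"], demo_counts):
    print(row)
```

With m factors this performs m(m-1)/2 comparisons, each ending in one of the three outcomes named above: superior, inferior or no decision.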

The next step is the ranking of the predictor variables. Three ways of ranking have been elaborated: (i) simple ranking, (ii) ranking based on differences and (iii) ranking according to the probability weighted differences. (Difference here means the number of wins minus the number of losses.)
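The three rankings can be illustrated with a short Python sketch. Only the definition of the difference (wins minus losses) comes from the text; the rest is assumed for the example: simple ranking is taken to order factors by their number of wins, and the probability weighting is taken to add (1 - p) to the favoured factor and subtract it from the other in every comparison, significant or not. The function name rank_factors and the input layout are hypothetical.

```python
def rank_factors(names, comparisons, alpha=0.05):
    """Rank factors in three ways from pair-wise comparison outcomes.

    comparisons: list of (favoured, other, p), one entry per pair, where
    'favoured' is the factor the test leans towards (hypothetical layout).
    A comparison counts as a win/loss only if p < alpha; the weighted
    score uses (1 - p) from every comparison (assumed weighting scheme).
    """
    wins = {n: 0 for n in names}
    losses = {n: 0 for n in names}
    weighted = {n: 0.0 for n in names}
    for favoured, other, p in comparisons:
        if p < alpha:
            wins[favoured] += 1
            losses[other] += 1
        weighted[favoured] += 1.0 - p
        weighted[other] -= 1.0 - p
    diff = {n: wins[n] - losses[n] for n in names}

    def order(score):
        # Highest score first; ties keep the input order of names.
        return sorted(names, key=lambda n: score[n], reverse=True)

    return {
        "simple": order(wins),
        "difference": order(diff),
        "weighted": order(weighted),
    }


# Made-up outcomes: x3 beats x1 only narrowly short of significance.
demo = [("x1", "x2", 0.04), ("x3", "x1", 0.06), ("x3", "x2", 0.04)]
print(rank_factors(["x1", "x2", "x3"], demo))
```

In this made-up example the first two rankings cannot separate x1 from x3, while the weighted score places x3 first because it also counts the comparison that fell just short of significance.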

Of the three ranking methods, simple ranking is the least conservative (it selects fewer descriptors). The best method, which can exploit even small differences in the statistical tests, is ranking according to the probability weighted differences. However, the practical task determines which ranking procedure should be chosen. A real data set illustrates the efficiency and operation of the method.

References

[1] Héberger K, Rajkó R. Discrimination of statistically equivalent variables in quantitative structure-activity relationships. In Quantitative Structure-Activity Relationships (QSAR) in Environmental Sciences-VII, Ed. Fei Chen & Gerrit Schüürmann, SETAC Press, Pensacola, USA, 1997, Chapter 29, 423-431.

[2] Rajkó R, Héberger K. Conditional Fisher's exact test as a selection criterion for pair-correlation method. Type I and Type II errors. Chemometrics Intell. Lab. Syst. 2001; 57(1):1-14.

This scientific research was supported by the Scientific Foundation of the Academy (KH, No. AKP 98-51 2,4) and by the Hungarian Science Foundation (RR, No. OTKA T035125).