Tibor Cserháti and Esther Forgács
Institute of Chemistry, Chemical Research Center, Hungarian Academy
of Sciences,
POB 17, 1525 Budapest, Hungary
Traditional two-way principal component analysis (PCA) is a versatile and easy-to-use multivariate mathematical-statistical method developed to help extract the maximal information from large data matrices containing numerous columns and rows. PCA makes it possible to elucidate the relationships between the columns and rows of any data matrix without designating either as the dependent variable. PCA is a so-called projection method, representing the original data in fewer dimensions: it calculates the correlations between the columns of the data matrix and classifies the variables according to the coefficients of correlation (a computational sketch of this two-way workflow is given at the end of this section). As the resulting matrices of PC loadings and PC variables are themselves multidimensional, their dimensionality can be reduced further by cluster analysis (CA), nonlinear mapping (NLM) and/or varimax rotation (VR).

Although PCA can be applied to any data matrix, its inadequate application may lead to serious misinterpretation of the results. As mentioned above, PCA assesses the similarities and dissimilarities among the variables and observations according to the differences among the coefficients of correlation, and the data points are then distributed on the two-dimensional plane determined by NLM or VR. These methods distribute the points in such a manner that they always cover the whole surface of the plane. Consequently, the distribution of points will be similar in the theoretical cases where the coefficients of correlation all lie around 0.1, 0.5 or 0.9. While the scattering of points calculated from coefficients of correlation of 0.1 does not carry any useful (significant) information, the same distribution of points marks significant relationships when it is calculated from a correlation matrix with coefficients of 0.9. Publishing a table containing each coefficient of correlation overcomes the difficulty arising from this similar scattering of points. A further pitfall is that both NLM and CA take the positive or negative sign of the coefficient of correlation into consideration and carry out the calculation accordingly; therefore, points that are highly but negatively correlated lie as far apart on the maps as points that are not correlated at all. This behaviour is appropriate only when the scientist is interested exclusively in the positive correlations among variables and observations. To evaluate the relationships between the points precisely, without regard to the positive or negative character of the correlation, it is advisable to carry out the calculations on the absolute values of the PC loadings and variables (see the corresponding sketch at the end of this section).

Traditional PCA is a typical two-way multivariate statistical method and is unsuitable for the evaluation of three- or higher-dimensional data matrices. A three-way PCA model (3D-PCA, the Tucker model) has been developed to overcome this difficulty. In contrast to 2D-PCA, the ratio of variance explained cannot be fixed in advance; the component matrices therefore have to be determined experimentally. A new method has been developed for finding the component matrices containing the minimal number of elements sufficient to explain a predetermined ratio of the variance. In the first step, 3D-PCA is carried out with component matrices of maximal dimensions (number of elements - 1) and the two-dimensional NLM of these matrices is calculated.
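As a rough illustration of this first step, the sketch below computes a Tucker-type decomposition by higher-order SVD in plain numpy. The tensor shape, the random data and the helper names (unfold, mode_dot, hosvd) are illustrative assumptions rather than the authors' implementation, and the subsequent NLM of the component matrices (for which a Sammon-type mapping or metric MDS could stand in) is omitted.

    import numpy as np

    def unfold(X, mode):
        """Matricize a three-way array along the given mode."""
        return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

    def mode_dot(X, A, mode):
        """Multiply array X by matrix A along the given mode."""
        return np.moveaxis(np.tensordot(A, X, axes=(1, mode)), 0, mode)

    def hosvd(X, ranks):
        """Tucker-type decomposition via higher-order SVD.

        Returns the core array and one orthonormal component matrix per mode.
        """
        factors = []
        for mode, r in enumerate(ranks):
            U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
            factors.append(U[:, :r])              # component matrix of this mode
        core = X
        for mode, A in enumerate(factors):
            core = mode_dot(core, A.T, mode)
        return core, factors

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 8, 6))           # hypothetical centered three-way data
    ranks = [d - 1 for d in X.shape]              # maximal dimensions: elements - 1
    core, factors = hosvd(X, ranks)
    # With orthonormal factors, the retained variance equals the squared core norm.
    print(f"variance explained: {np.sum(core**2) / np.sum(X**2):.4f}")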
In the second step, the elements of the component matrices explaining more than 0.5% of the total variance are selected as members of the next component matrices, and the two-dimensional maps are calculated again. It has been found that this reduction of the component matrices results in a negligible loss of variance explained, indicating the suitability of the method. However, the distribution of the points on the maps depends considerably on the number of elements included in the component matrices. These findings indicate that reducing the number of elements in the component matrices has a negligible effect on the total variance explained but exerts a marked impact on the apparent similarities and dissimilarities among the variables and observations of the matrix.
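Continuing the sketch above, the selection step can be expressed by attributing to each component the share of the total variance carried by its core slices; this attribution rule and the reuse of hosvd at the reduced ranks are assumptions made for illustration, while the 0.5% threshold is the one quoted in the text.

    def component_variance(core, mode, total):
        """Fraction of the total variance carried by each component of one mode."""
        sq = np.moveaxis(core, mode, 0) ** 2
        return sq.reshape(sq.shape[0], -1).sum(axis=1) / total

    THRESHOLD = 0.005                             # "more than 0.5% of the total variance"
    total = np.sum(X ** 2)
    # HOSVD orders components by decreasing singular value, so counting the
    # components above the threshold and re-truncating agree with each other.
    new_ranks = [int(np.sum(component_variance(core, m, total) > THRESHOLD))
                 for m in range(X.ndim)]
    core_red, factors_red = hosvd(X, new_ranks)   # recompute at the reduced ranks
    print("ranks after selection:", new_ranks)
    print(f"variance explained, full   : {np.sum(core ** 2) / total:.4f}")
    print(f"variance explained, reduced: {np.sum(core_red ** 2) / total:.4f}")

With unstructured random data every component may pass the threshold; with real, structured data the minor components fall below it, reproducing the small loss of variance described above.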
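For completeness, the two-way calculation described at the beginning of this section reduces to an eigendecomposition of the correlation matrix of the columns; a minimal sketch on hypothetical data (all names and sizes are illustrative):

    import numpy as np

    def pca_on_correlations(data):
        """Two-way PCA via eigendecomposition of the column correlation matrix."""
        R = np.corrcoef(data, rowvar=False)       # correlations between the columns
        evals, evecs = np.linalg.eigh(R)          # eigenvalues in ascending order
        order = np.argsort(evals)[::-1]           # re-sort: largest variance first
        evals, evecs = evals[order], evecs[:, order]
        loadings = evecs * np.sqrt(np.clip(evals, 0.0, None))
        return loadings, evals / evals.sum()      # PC loadings, variance ratios

    rng = np.random.default_rng(1)
    data = rng.standard_normal((50, 8))           # hypothetical matrix: 50 rows, 8 columns
    loadings, explained = pca_on_correlations(data)
    print("variance explained per PC:", np.round(explained, 3))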
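Finally, the absolute-value device recommended above for negatively correlated points: discarding the sign of the PC loadings before CA (or NLM) groups strongly negatively correlated variables with, rather than opposite to, their positively correlated counterparts. The sketch reuses loadings from the previous block; the average-linkage method and the two-cluster cut are arbitrary illustrative choices.

    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Distances computed on |loadings|: highly but negatively correlated
    # variables now fall close together instead of far apart.
    abs_loadings = np.abs(loadings[:, :2])        # first two PCs, sign removed
    Z = linkage(pdist(abs_loadings), method="average")
    print("cluster membership:", fcluster(Z, t=2, criterion="maxclust"))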