MULTIVARIATE STATISTICS
Overview
Multivariate methods of statistics permit the simultaneous analysis of any number of variables. Consider, for example, x-ray fluorescence measurements of a series of samples from a granitic pluton. Each sample will have percentage estimates for a whole range of major elements including Si, O, Fe, Mg, Ca, etc. We have previously considered methods for comparing individual element percentages for all the samples (univariate statistics). We have also considered methods of testing the covariance of pairs of variables (bivariate statistics). Multivariate statistics is the next general step: analysing all variables of all samples at the same time and testing for the degree of covariance among any two (or more) of the variables.
The central problem with doing multivariate statistics is that of graphing or visualization. In the statistical methods that we have already studied, we have one dimension for each variable. For our example of the elemental composition of granites, that means ten or more dimensions. There is no easy way to visualize or plot such data. One important goal of multivariate statistics is to find a more limited number of key variables that can represent all of the remaining variables and can be visualized in our normal three dimensions.
There are several standard methods of multivariate statistical analysis: (1) multiple (linear) regression, (2) discriminant analysis, (3) principal components analysis, (4) factor analysis, and (5) cluster analysis. The mathematics associated with all of these techniques is beyond what we want to consider. The important point to note is that there are standard computer programs to do all of these analyses. At USC, programs that will handle multivariate statistics are BMDP (biomedical data processing program), SPSS (Statistical Package for the Social Sciences), and SAS (Statistical Analysis System). SAS is available on EARTH.
Multiple (Linear) Regression
Multiple regression attempts to define one variable (Y) in terms of other variables (Xi) for each of n data samples. This leads to a formula of the type Y = a + bX1 + cX2 + dX3 + ... The test of each regression is its ability to fit all of the n data samples. One problem with this type of analysis is that one can determine the best-fitting formula (a, b, c, ...), but it is less easy to define the relative importance of each variable (Xi) in the equation. One way to address this is a sequential regression, in which variables are added one at a time and the change in fit is noted. Another problem is that this approach only deals with linear regression. One can imagine conditions where the relationship between two variables is non-linear. This, however, goes beyond the scope of what you will normally deal with when analyzing geological data.
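As an illustration of the formula above, the following is a minimal sketch of a multiple linear regression fitted by ordinary least squares with NumPy. The data are synthetic and the helper name fit_linear is hypothetical; neither comes from these notes. The sequential step simply adds X1, X2, X3 in that order and records R^2 each time, which is only one simple way of judging the relative importance of each variable.

```python
import numpy as np

# Synthetic data (an assumption for illustration): Y depends on X1 and X2 but not on X3.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)


def fit_linear(X, y):
    """Least-squares fit of Y = a + b*X1 + c*X2 + ...; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])  # column of ones gives the intercept a
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2


coef, r2_full = fit_linear(X, y)
print("a, b, c, d =", np.round(coef, 3))

# Sequential regression: use X1, then X1..X2, then X1..X3, and compare R^2.
r2_steps = [fit_linear(X[:, :k], y)[1] for k in range(1, 4)]
for k, r2 in enumerate(r2_steps, start=1):
    print(f"using X1..X{k}: R^2 = {r2:.4f}")
```

In this synthetic case, a variable that adds almost nothing to R^2 when it enters (here X3) contributes little to the fit. With correlated predictors the order of entry matters, so the result should be read with caution.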
Discriminant Analysis
Discriminant analysis is used to assign data to one of several established groups. For example, is a particular igneous rock a granite, a granodiorite, or a diorite? Or, is a single brachiopod shell associated with species A, B, or C? In all cases, the overall goal is to discriminate among several known data groups and determine which one is most closely related to some new object.
The key input data are the parameters (variables) defining the members of several distinct groups. These should be chosen such that the individual variables differ significantly among the data groups. The established groups then form the norm against which all other objects and their variables are compared.
It is easy to see how this type of analysis would work if there were only two groups and one variable, with the two groups having significantly different mean values of that variable. In that case, any new object could be assigned to one group or the other on the basis of its relative closeness to each group in the only defined variable. The problem becomes more complicated when there are several variables and perhaps only a combination of variables best defines the differences between two established groups.
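The several-variable case can be sketched with Fisher's linear discriminant for two groups, which finds the single combination of variables that best separates the group means. This is one common form of discriminant analysis, not necessarily the one used by the packages named earlier. The training data are synthetic, the function names are hypothetical, and the sketch assumes both groups share a similar covariance and are equally likely beforehand.

```python
import numpy as np

# Synthetic training data (an assumption): two established groups, two variables each.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
group_b = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))


def fisher_discriminant(a, b):
    """Return the weight vector w and the cut-off value for a two-group discriminant."""
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # Pooled within-group covariance (assumes both groups have similar scatter).
    pooled = ((len(a) - 1) * np.cov(a, rowvar=False)
              + (len(b) - 1) * np.cov(b, rowvar=False)) / (len(a) + len(b) - 2)
    w = np.linalg.solve(pooled, mean_b - mean_a)
    threshold = w @ (mean_a + mean_b) / 2.0  # midpoint of the projected group means
    return w, threshold


def classify(x, w, threshold):
    """Assign a new object to group 'A' or 'B' from its discriminant score."""
    return "B" if x @ w > threshold else "A"


w, threshold = fisher_discriminant(group_a, group_b)
print(classify(np.array([0.5, -0.2]), w, threshold))
print(classify(np.array([3.8, 4.1]), w, threshold))
```

With only one variable, this reduces to the simple rule described above: assign the new object to the group whose mean is closer.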
Eigenvector Methods - Principal Components Analysis
Suppose we have a scatter plot of two variables with fairly strong linear correlation and similar degrees of variance, as shown in figure 8.5 (from Swan and Sandilands, 1995). We might plot a line through the 'long axis' of the data and another perpendicular to it. We have thus identified an alternative set of coordinates (axes) in which to define the data scatter. The new axes differ from the original ones in that the first new axis lies along the direction of maximum variance, which leaves the minimum variance along the second axis. The purpose of the change in axes might be to isolate and better identify the primary source of variance in the data. In our example, the first axis might represent the relative overall size of measured organisms, while the second axis may reflect variations in organism shape. This example may seem mundane, but it illustrates a methodology that can be generalized to any number of axes (dimensions) and associated variables. This method is termed principal components analysis (PCA).
PCA is in general a method for finding linear combinations of variables that can be grouped together and specified by a single variable or component. Ideally, we might hope to take a data set with n variables and find some number of principal components p that define most of the data variability, with p much smaller than n (n >> p). One way to see how to go about doing this is to consider a set of 5 variables. For each pair of variables, do a scatter plot and calculate a correlation coefficient (-1 to 1). If one tabulates all of the coefficients, one can discern where positive and negative correlations occur (see attached table of correlations). On the basis of such calculations, one can define principal components by equations of the form:
PC1 = a1V1 + a2V2 + a3V3 + ... + anVn
where the a's are constants derived from the correlation coefficients and V1-Vn are the original variables. Each constant a is commonly referred to as a weight or loading. In principle, there are as many principal components as there are variables, but usually only a few of them need to be identified, because those few account for most of the data variance.
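The 5-variable case can be sketched as an eigenvector calculation on the correlation matrix, which is how the heading "eigenvector methods" connects to the table of correlations. The data below are synthetic, built from two hidden factors labelled size and shape purely for illustration, and the function name pca is hypothetical. In this sketch the weights are the elements of the eigenvectors of the correlation matrix.

```python
import numpy as np

# Synthetic data (an assumption): 5 measured variables driven by 2 hidden factors.
rng = np.random.default_rng(2)
n = 200
size = rng.normal(size=n)
shape = rng.normal(size=n)


def noise():
    return 0.1 * rng.normal(size=n)


data = np.column_stack([
    size + noise(),
    size + noise(),
    size + 0.3 * shape + noise(),
    shape + noise(),
    shape + noise(),
])


def pca(data):
    """PCA from the correlation matrix: eigenvalues, weights, explained fraction, scores."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # standardize each variable
    corr = np.corrcoef(data, rowvar=False)                     # table of correlations
    eigvals, eigvecs = np.linalg.eigh(corr)                    # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]                          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    scores = z @ eigvecs             # PCk = a1*V1 + a2*V2 + ... for every sample
    return eigvals, eigvecs, explained, scores


eigvals, weights, explained, scores = pca(data)
print("fraction of variance per component:", np.round(explained, 3))
print("weights for PC1:", np.round(weights[:, 0], 3))
```

Each column of weights holds the a's for one component. There are five components for five variables, but because the synthetic data were built from two factors, the first two should carry nearly all of the variance.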
Factor Analysis
Factor analysis is very similar to PCA, except that the axes may be rotated so that most of the total data variance is concentrated in only a few factors (roughly equivalent to principal components). In some forms of factor analysis, the various factors are orthogonal to each other and describe the variability of orthogonal, or independent, variables. In other methods, however, orthogonality is not critical and some factors may be oblique to one another. This may be due to one or more variables having common dependencies such as sediment grain size, water depth of a sediment deposit, amount of quartz in a sediment sample, current velocity, etc.
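The rotation step can be sketched with a varimax rotation, a widely used orthogonal rotation that tries to make each factor load strongly on a few variables and weakly on the rest. The notes do not say which rotation is intended, so this choice is an assumption. The loading matrix below is made up for illustration, and an oblique rotation would need a different procedure.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix.

    Returns the rotated loadings and the rotation matrix R.
    """
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion, projected back onto rotations via SVD.
        grad = L ** 3 - L @ np.diag(np.sum(L ** 2, axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ grad)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1.0 + tol):  # stop when the criterion no longer improves
            break
        d = d_new
    return loadings @ R, R


# Hypothetical loadings: a clean two-factor pattern, deliberately rotated by 30 degrees.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.9], [0.0, 0.8]])
theta = np.deg2rad(30.0)
spin = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta), np.cos(theta)]])
unrotated = simple @ spin

rotated, R = varimax(unrotated)
print(np.round(rotated, 3))
```

Because R is an orthogonal matrix, the rotation changes how the variance is shared among the factors but not the total variance each variable contributes.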
Cluster Analysis
Cluster analysis is similar to the above two methods of analysis in that it tries to find elements of similarity among several variables. Cluster analysis, however, does not use dimensions or orthogonal axes to calculate degrees of similarity. Rather, it uses distance or correlation coefficients to find those variables that best co-vary. It then groups the variables into clusters on the basis of their similarity or covariation. This type of analysis can be done at different levels of similarity, and the result may be shown as dendrograms or tree diagrams that measure the degree of similarity of any variable with respect to the other variables.
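A minimal sketch of this grouping, assuming single-linkage agglomerative clustering with 1 - |r| as the distance between variables. Both choices are assumptions, since the notes do not name a linkage rule or distance measure. The variables A to D are synthetic and the function name cluster_variables is hypothetical.

```python
import numpy as np

# Synthetic data (an assumption): A and B co-vary, C and D co-vary, the pairs are unrelated.
rng = np.random.default_rng(3)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)
names = ["A", "B", "C", "D"]
data = np.column_stack([
    f1 + 0.1 * rng.normal(size=n),
    f1 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
    f2 + 0.2 * rng.normal(size=n),
])


def cluster_variables(data, names):
    """Single-linkage clustering of variables; returns merges as (members, members, distance)."""
    dist = 1.0 - np.abs(np.corrcoef(data, rowvar=False))  # small distance = strong covariation
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[p, q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(([names[p] for p in clusters[i]],
                       [names[q] for q in clusters[j]], float(d)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = cluster_variables(data, names)
for left, right, d in merges:
    print(f"join {left} with {right} at distance {d:.3f}")
```

The list of merges and their distances is the information a dendrogram displays: each merge is a branch point, and its distance sets the level of similarity at which the branches join.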