## Multivariate Analysis

IDL provides a number of tools for analyzing multivariate data. These tools are broadly grouped into two categories: Cluster Analysis and Principal Components Analysis.

### Cluster Analysis

Cluster Analysis attempts to construct a sensible and informative classification of an initially unclassified sample population using a set of common variables for each individual. The clusters are constructed so as to group samples with similar features, based upon a set of variables. The samples (contained in the rows of an input array) are each assigned a cluster number based upon the values of their corresponding variables (contained in the columns of an input array).

In computing a cluster analysis, a predetermined number of cluster centers are formed and then each sample is assigned to the unique cluster which minimizes a distance criterion based upon the variables of the data. Given an m-column, n-row array, IDL's CLUST_WTS and CLUSTER functions by default compute n cluster centers and n clusters, respectively. Conceivably, some clusters will contain multiple samples while other clusters will contain none. The number of clusters is arbitrary; in general, however, the user will want to specify a number less than the default (the number of rows in the input array). The cluster number (the number that identifies the cluster group) assigned to a particular sample or group of samples is not necessarily unique.
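The assignment step can be written compactly. The following is a sketch that assumes a Euclidean distance; the text above says only "a distance criterion", so the exact measure is an assumption. Here $x_i$ is the $i$-th sample (row), $c_k$ is the $k$-th cluster center, and $m$ is the number of variables (columns):

$$
\mathrm{cluster}(i) = \arg\min_{k} \sum_{j=1}^{m} \left( x_{ij} - c_{kj} \right)^2
$$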

It is possible that not all variables play an equal role in the classification process. In this situation, greater or lesser importance may be given to each variable using the VARIABLE_WTS keyword to the CLUST_WTS function. The default behavior is to assume all variables contained in the data array are of equal importance.
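As a minimal sketch of the VARIABLE_WTS keyword, the following compares equal weighting with a weighting that reduces the influence of one variable. The data array and the weight values are hypothetical and chosen only for illustration.

```idl
; Hypothetical data: 3 variables (columns), 6 samples (rows).
array = [[ 1.0,  2.0, 10.0], $
         [ 1.1,  2.1, 50.0], $
         [ 5.0,  6.0, 10.0], $
         [ 5.1,  6.2, 50.0], $
         [ 9.0,  1.0, 10.0], $
         [ 9.2,  1.1, 50.0]]
; Default behavior: all variables are treated as equally important.
w_equal = CLUST_WTS(array, N_CLUSTERS = 3)
; Give the third variable less importance than the first two.
; VARIABLE_WTS takes one weight per variable (column); values are illustrative.
w_custom = CLUST_WTS(array, N_CLUSTERS = 3, $
                     VARIABLE_WTS = [1.0, 1.0, 0.1])
; Compare the cluster assignments produced by each set of weights:
PRINT, CLUSTER(array, w_equal,  N_CLUSTERS = 3)
PRINT, CLUSTER(array, w_custom, N_CLUSTERS = 3)
```

The assignments are not shown here because they depend on the computed cluster weights.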

Under certain circumstances, a classification of variables rather than samples may be desired. To obtain one, first transpose the m-column, n-row input array using the TRANSPOSE function, which interchanges the roles of variables and samples, and then pass the transposed array to the CLUST_WTS and CLUSTER functions.
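A short sketch of classifying variables instead of samples follows. The data array and the choice of two clusters are hypothetical.

```idl
; Hypothetical data: 3 variables (columns), 5 samples (rows).
array = [[ 2.0,  2.1, 40.0], $
         [ 4.0,  4.2, 10.0], $
         [ 6.0,  5.9, 30.0], $
         [ 8.0,  8.1, 20.0], $
         [10.0,  9.8, 50.0]]
; Interchange the roles of variables and samples:
; each original variable becomes a row of the transposed array.
t_array = TRANSPOSE(array)
weights = CLUST_WTS(t_array, N_CLUSTERS = 2)
result  = CLUSTER(t_array, weights, N_CLUSTERS = 2)
; One cluster number per original variable:
PRINT, result
```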

#### Example of Cluster Analysis

Define an array with 5 variables (columns) and 9 samples (rows):

```
array = [[ 99,  79,  63,  87, 249 ], $
         [ 67,  41,  36,  51, 114 ], $
         [ 67,  41,  36,  51, 114 ], $
         [ 94, 191, 160, 173, 124 ], $
         [ 42, 108,  37,  51,  41 ], $
         [ 67,  41,  36,  51, 114 ], $
         [ 94, 191, 160, 173, 124 ], $
         [ 99,  79,  63,  87, 249 ], $
         [ 67,  41,  36,  51, 114 ]]
; Compute the cluster weights with four cluster centers:
weights = CLUST_WTS(array, N_CLUSTERS = 4)
; Compute the cluster assignments, for each sample,
; into one of four clusters:
result  = CLUSTER(array, weights, N_CLUSTERS = 4)
; Display the cluster assignment and corresponding sample (row):
FOR k = 0, 8 DO $
   PRINT, result[k], array[*, k]
```

IDL prints:

```
1      99      79      63      87     249
3      67      41      36      51     114
3      67      41      36      51     114
0      94     191     160     173     124
2      42     108      37      51      41
3      67      41      36      51     114
0      94     191     160     173     124
1      99      79      63      87     249
3      67      41      36      51     114
```

Samples 0 and 7 contain identical data and are assigned to cluster #1. Samples 1, 2, 5, and 8 contain identical data and are assigned to cluster #3. Samples 3 and 6 contain identical data and are assigned to cluster #0. Sample 4 is unique and is assigned to cluster #2.
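The grouping described above can be checked programmatically. This sketch copies the cluster numbers from the printed output rather than recomputing them, so it runs on its own.

```idl
; Cluster numbers copied from the first column of the printed output.
result = [1, 3, 3, 0, 2, 3, 0, 1, 3]
; Count the samples assigned to each of the four clusters (0 through 3):
counts = HISTOGRAM(result, MIN = 0, MAX = 3)
PRINT, counts
; List the indices of the samples that belong to cluster 3:
PRINT, WHERE(result EQ 3)
```

The counts for clusters 0 through 3 are 2, 2, 1, and 4, and the members of cluster 3 are samples 1, 2, 5, and 8, in agreement with the description above.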

If this example is run several times, each time computing new cluster weights, it is possible that the cluster number assigned to each grouping of samples may change.
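Because the cluster numbers themselves may differ between runs, one way to compare two runs is to test whether the same pairs of samples end up together. The following sketch does this; `run2` is a hypothetical second result with the same groupings under different cluster numbers.

```idl
; Cluster numbers from the printed output above.
run1 = [1, 3, 3, 0, 2, 3, 0, 1, 3]
; Hypothetical second run: same groupings, different cluster numbers.
run2 = [0, 2, 2, 3, 1, 2, 3, 0, 2]
n = N_ELEMENTS(run1)
; same1[i, j] is 1 when samples i and j share a cluster in run 1.
same1 = REBIN(run1, n, n, /SAMPLE) EQ $
        REBIN(REFORM(run1, 1, n), n, n, /SAMPLE)
same2 = REBIN(run2, n, n, /SAMPLE) EQ $
        REBIN(REFORM(run2, 1, n), n, n, /SAMPLE)
; Prints 1 when both runs group the samples identically:
PRINT, ARRAY_EQUAL(same1, same2)
```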

### Principal Components Analysis

Principal components analysis is a mathematical technique which describes a multivariate set of data using derived variables. The derived variables are formulated using specific linear combinations of the original variables. The derived variables are uncorrelated and are computed in decreasing order of importance; the first variable accounts for as much as possible of the variation in the original data, the second variable accounts for the second largest portion of the variation in the original data, and so on. Principal components analysis attempts to construct a small set of derived variables which summarize the original data, thereby reducing the dimensionality of the original data.

The principal components of a multivariate set of data are computed from the eigenvalues and eigenvectors of either the sample correlation or sample covariance matrix. If the variables of the multivariate data are measured in widely differing units (large variations in magnitude), it is usually best to use the sample correlation matrix in computing the principal components; this is the default method used in IDL's PCOMP function.
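In standard textbook notation (a sketch, not a statement of how PCOMP scales its results internally), let $S$ be the sample correlation or covariance matrix, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ and corresponding eigenvectors $a_1, \dots, a_m$. The symbols $S$, $\lambda_j$, $a_j$, and $x$ are introduced here for illustration:

$$
S\,a_j = \lambda_j\,a_j, \qquad
z_j = a_j^{\mathsf{T}} x = \sum_{k=1}^{m} a_{jk}\,x_k, \qquad
\text{fraction of total variance for } z_j = \frac{\lambda_j}{\sum_{k=1}^{m} \lambda_k}
$$

Each derived variable $z_j$ is a linear combination of the original variables $x_1, \dots, x_m$, and the ordering of the eigenvalues gives the decreasing order of importance.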

Another alternative is to standardize the variables of the multivariate data prior to computing principal components. Standardizing the variables essentially makes them all equally important by creating new variables that each have a mean of zero and a variance of one. Proceeding in this way allows the principal components to be computed from the sample covariance matrix. IDL's PCOMP function includes COVARIANCE and STANDARDIZE keywords to provide this functionality.
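The calling variants described above might look as follows. The data array is hypothetical, with three variables on very different scales; the output variables are initialized before each call, as this document requires for PCOMP's output keywords.

```idl
; Hypothetical data: 3 variables (columns), 6 samples (rows).
data = [[ 1.2, 150.0, 0.011], $
        [ 2.3, 180.0, 0.015], $
        [ 1.9, 120.0, 0.009], $
        [ 3.1, 210.0, 0.020], $
        [ 2.8, 160.0, 0.012], $
        [ 1.5, 190.0, 0.018]]
v_corr = 1 & v_cov = 1 & v_std = 1
; Default: principal components from the sample correlation matrix.
r_corr = PCOMP(data, VARIANCES = v_corr)
; Sample covariance matrix of the unstandardized data.
r_cov = PCOMP(data, /COVARIANCE, VARIANCES = v_cov)
; Standardize the variables, then use the sample covariance matrix.
r_std = PCOMP(data, /STANDARDIZE, /COVARIANCE, VARIANCES = v_std)
; Compare the fraction of total variance per derived variable:
PRINT, v_corr
PRINT, v_cov
PRINT, v_std
```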

For example, suppose that we wish to restate the following data using its principal components. There are three variables, each consisting of five samples.

|          | Var 1 | Var 2 | Var 3 |
|----------|-------|-------|-------|
| Sample 1 | 2.0   | 1.0   | 3.0   |
| Sample 2 | 4.0   | 2.0   | 3.0   |
| Sample 3 | 4.0   | 1.0   | 0.0   |
| Sample 4 | 2.0   | 3.0   | 3.0   |
| Sample 5 | 5.0   | 1.0   | 9.0   |

We compute the principal components (the coefficients of the derived variables) to 2 decimal accuracy and store them by row in the following array.

The derived variables {z1, z2, z3} are then computed as follows:

In this example, analysis shows that the derived variable z1 accounts for 57.3% of the total variance of the original data, the derived variable z2 accounts for 28.2% of the total variance of the original data, and the derived variable z3 accounts for 14.5% of the total variance of the original data.
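A quick check of how much variance is retained when only the leading derived variables are kept, using the percentages quoted above:

```idl
; Percentages of total variance quoted in the text for z1, z2, and z3.
pct = [57.3, 28.2, 14.5]
; Running total: variance captured by the first 1, 2, and 3 derived variables.
PRINT, TOTAL(pct, /CUMULATIVE), FORMAT = '(3(F6.1))'
```

The running totals are 57.3, 85.5, and 100.0, so z1 and z2 together account for 85.5% of the total variance.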

#### Example of Derived Variables from Principal Components

The following example constructs an appropriate set of derived variables, based upon the principal components of the original data, which may be used to reduce the dimensionality of the data. The data consist of four variables, each containing twenty samples.

```
; Define an array with 4 variables and 20 samples:
data = [[19.5, 43.1, 29.1, 11.9], $
        [24.7, 49.8, 28.2, 22.8], $
        [30.7, 51.9, 37.0, 18.7], $
        [29.8, 54.3, 31.1, 20.1], $
        [19.1, 42.2, 30.9, 12.9], $
        [25.6, 53.9, 23.7, 21.7], $
        [31.4, 58.5, 27.6, 27.1], $
        [27.9, 52.1, 30.6, 25.4], $
        [22.1, 49.9, 23.2, 21.3], $
        [25.5, 53.5, 24.8, 19.3], $
        [31.1, 56.6, 30.0, 25.4], $
        [30.4, 56.7, 28.3, 27.2], $
        [18.7, 46.5, 23.0, 11.7], $
        [19.7, 44.2, 28.6, 17.8], $
        [14.6, 42.7, 21.3, 12.8], $
        [29.5, 54.4, 30.1, 23.9], $
        [27.7, 55.3, 25.7, 22.6], $
        [30.2, 58.6, 24.6, 25.4], $
        [22.7, 48.2, 27.1, 14.8], $
        [25.2, 51.0, 27.5, 21.1]]
```

The variables that will contain the values returned by the COEFFICIENTS, EIGENVALUES, and VARIANCES keywords to the PCOMP routine must be initialized as nonzero values prior to calling PCOMP.

```
coef = 1 & eval = 1 & var = 1
; Compute the derived variables based upon
; the principal components.
result = PCOMP(data, COEFFICIENTS = coef, $
               EIGENVALUES = eval, VARIANCES = var)
; Display the array of derived variables:
PRINT, result, FORMAT = '(4(f5.1, 2x))'
```

IDL prints:

```
 81.4   15.5   -5.5    0.5
102.7   11.1   -4.1    0.6
109.9   20.3   -6.2    0.5
110.5   13.8   -6.3    0.6
 81.8   17.1   -4.9    0.6
104.8    6.2   -5.4    0.6
121.3    8.1   -5.2    0.6
111.3   12.6   -4.0    0.6
 97.0    6.4   -4.4    0.6
102.5    7.8   -6.1    0.6
118.5   11.2   -5.3    0.6
118.9    9.1   -4.7    0.6
 81.5    8.8   -6.3    0.6
 88.0   13.4   -3.9    0.6
 74.3    7.5   -4.8    0.6
113.4   12.0   -5.1    0.6
109.7    7.7   -5.6    0.6
117.5    5.5   -5.7    0.6
 91.4   12.0   -6.1    0.6
102.5   10.6   -4.9    0.6
```

Display the fraction of the total variance accounted for by each derived variable:

```
PRINT, var
```

IDL prints:

```
0.712422
0.250319
0.0370950
0.000164269
```

Display the combined fraction of the total variance accounted for by the first two derived variables (the first two columns of the result array above):

```
PRINT, TOTAL(var[0:1])
```

IDL prints:

```
0.962741
```

This indicates that the first two derived variables (the first two columns of the resulting array) account for 96.3% of the total variance of the original data, and thus could be used to summarize the original data.
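The same reasoning can be automated. The sketch below finds how many derived variables are needed to reach a chosen share of the total variance; the 95% threshold is a hypothetical choice, and the variance fractions are copied from the printed output above so that the snippet runs on its own.

```idl
; Variance fractions copied from the output of PRINT, var above.
var = [0.712422, 0.250319, 0.0370950, 0.000164269]
; Hypothetical threshold: retain at least 95% of the total variance.
threshold = 0.95
; Running total of the variance fractions:
cum = TOTAL(var, /CUMULATIVE)
; Position of the first running total that reaches the threshold, plus one:
n_keep = (WHERE(cum GE threshold))[0] + 1
PRINT, n_keep
; With the result array from the example above, the reduced data set
; would then be: reduced = result[0:n_keep-1, *]
```

For these values `n_keep` is 2, matching the conclusion that the first two derived variables summarize the data.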

### Routines for Multivariate Analysis

Below is a brief description of IDL routines for multivariate analysis.

- CLUST_WTS: Computes the cluster weights of an array for cluster analysis.
- CLUSTER: Performs cluster analysis.
- Performs chi-square goodness-of-fit test.
- Performs Kruskal-Wallis H-test.
- Computes multiple correlation coefficient.
- Computes partial correlation coefficient.
- PCOMP: Computes principal components/derived variables.
- Computes standardized variables.