ELEMENTARY STATISTICS AND TESTS

 

Theory

Let us assume that we have a set of N related data values (x1, x2, ..., xN) whose values follow some unknown distribution. These data might be a time series of output from a time-varying process or different runs of a particular experiment. Parametric statistics can be used to characterize the data distribution using a small number of parameters that are termed 'moments' of the data set.

The best known of these moments is the mean value, estimated by the sample mean. The mean estimates the value around which the data cluster, the central tendency. (Other estimators of the central tendency are the median or middle value and the mode or most common value. With certain types of data distributions, the median or the mode may be better estimators of the central tendency than the sample mean.) The sample mean is estimated from

$$\langle x \rangle \approx \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

<x> represents the theoretical expectation (average value) for a data set of infinite length. Inasmuch as you will never have an infinite number of data points, you can only estimate <x> from the summation above. The same can be said for all moments estimated from a finite number of data points.

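Another parameter, the variance, estimates the dispersion or variability of the data distribution. Variance is estimated from

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

where $\bar{x}$ is the sample mean. The square root of the variance, $s^2$, is termed the standard deviation, s. Another estimate of variability is the mean-square error (mse) statistic, Y, estimated from

$$Y = \frac{1}{N-1} \sum_{i=1}^{N} x_i^2$$

where the same N-1 denominator is used as for the variance, consistent with the example program in the Algorithm Development section. The mse statistic will be the same as the variance when the data have a zero average. In general these three statistics are related by

$$Y = s^2 + \frac{N}{N-1} \bar{x}^2 \approx s^2 + \bar{x}^2 \quad (N \gg 1)$$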

Another often mentioned statistic is the standard error associated with the sample mean. The standard error of the mean is equal to $s/\sqrt{N}$, where s is the standard deviation and N is the number of data points. This number estimates how much variability you might expect among sample mean estimates of the same random process or experiment if you had several different independent data sets.
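
As a rough illustration of these four statistics, the following sketch computes them for a small made-up data set. The data values and variable names are assumptions chosen for the example, and the mse is given the N-1 denominator that the document's own example program uses.

```fortran
      program basic
C     Sketch: sample mean, variance, mse and standard error of the
C     mean for a small made-up data set (values are illustrative).
      integer n, i
      parameter (n=5)
      real x(n), xbar, var, sdev, y, se
      data x /2., 4., 4., 5., 10./
C     first pass: sample mean
      xbar = 0.
      do 10 i = 1, n
         xbar = xbar + x(i)
   10 continue
      xbar = xbar/n
C     second pass: variance (N-1 denominator) and mean-square error
      var = 0.
      y = 0.
      do 20 i = 1, n
         var = var + (x(i) - xbar)**2
         y = y + x(i)**2
   20 continue
      var = var/(n-1)
      y = y/(n-1)
      sdev = sqrt(var)
C     standard error of the mean
      se = sdev/sqrt(real(n))
      write(*,*) 'mean           =', xbar
      write(*,*) 'variance       =', var
      write(*,*) 'std deviation  =', sdev
      write(*,*) 'mse            =', y
      write(*,*) 'std error      =', se
C     check: y should equal var + n/(n-1)*xbar**2
      write(*,*) 'var+n/(n-1)*m2 =', var + real(n)/real(n-1)*xbar**2
      end
```

For these five values the mean is 5, the variance 9, the standard deviation 3, and the mse 40.25, so the last two printed lines should agree; the standard error is 3 divided by the square root of 5, about 1.34.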

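The reason that the mean and variance are termed moments is that they are calculated by raising the data to different powers (or moments). The mean is the first moment, the variance is the second moment (taken about the mean), and by the same logic, there are an infinite number of moments of any data set. In practice the first two are the most important, with the third and fourth moments being useful on occasion as well. The third moment is the skewness; the fourth moment is the kurtosis. Writing SK for the skewness and KU for the kurtosis, they are estimated from

$$SK = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{s} \right)^3, \qquad KU = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{s} \right)^4$$

where $\bar{x}$ is the sample mean and s is the standard deviation.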

These moments further describe the shape of the data distribution. The skewness indicates whether the distribution is asymmetric, with a longer tail on one side of the mean than on the other; the kurtosis describes the peakiness of the distribution. (A gaussian or normal distribution has a KU of three. So values of KU larger than three indicate a distribution peakier than 'normal', with heavier tails, and values smaller than three indicate a flatter, broader distribution than 'normal'. Statistical packages often define KU as the above summation minus three, so that the expected value for a normal distribution is zero.)
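
A minimal sketch of the skewness and kurtosis calculation follows, assuming the usual definitions as averages of the third and fourth powers of the standardized deviations. The data values and the variable names sk and ku are illustrative choices, not taken from the original.

```fortran
      program shape
C     Sketch: skewness and kurtosis of a small made-up data set,
C     using a two-pass calculation (data values are illustrative).
      integer n, i
      parameter (n=5)
      real x(n), xbar, var, sdev, z, sk, ku
      data x /2., 4., 4., 5., 10./
C     sample mean
      xbar = 0.
      do 10 i = 1, n
         xbar = xbar + x(i)
   10 continue
      xbar = xbar/n
C     variance (N-1 denominator) and standard deviation
      var = 0.
      do 20 i = 1, n
         var = var + (x(i) - xbar)**2
   20 continue
      var = var/(n-1)
      sdev = sqrt(var)
C     third and fourth moments of the standardized deviations
      sk = 0.
      ku = 0.
      do 30 i = 1, n
         z = (x(i) - xbar)/sdev
         sk = sk + z**3
         ku = ku + z**4
   30 continue
      sk = sk/n
      ku = ku/n
      write(*,*) 'skewness       =', sk
      write(*,*) 'kurtosis       =', ku
      write(*,*) 'kurtosis - 3   =', ku - 3.
      end
```

The single large value (10) gives this data set a positive skewness of roughly 0.71. With only five points the kurtosis estimate is not very meaningful; the last line simply shows the 'minus three' convention used by many packages.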

The moments defined above are one way to characterize data distributions. There are many actual data distributions that one can encounter in the real world of geology. Two of the most common, however, are the uniform distribution and the gaussian or normal distribution. The uniform distribution is one where any value has an equal chance of occurring if it is within some range, say -1 to 1. Values less than -1 or greater than 1 are not permissible. The gaussian distribution has the well-known bell-shaped curve. The most likely value is the mean, and about 95% of the data values will fall within two standard deviations of the mean (from -2s to 2s about the mean), where s is the standard deviation. The gaussian distribution is especially important because the Central Limit Theorem states that if one samples a process whose errors arise from a number of independent variables, the errors add up such that their sum tends toward a gaussian distribution, no matter what their individual distributions (provided each has a finite variance). This means that errors in your data that arise from many independent sources will often be approximately gaussian.
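
The tendency described by the Central Limit Theorem can be checked numerically. The sketch below sums uniform deviates on (-1, 1) and counts how often the sum falls within two standard deviations of zero. It assumes a compiler that provides the Fortran 90 intrinsic random_number (the document's own code is Fortran 77), and the values of nsum and ntry are arbitrary choices.

```fortran
      program clt
C     Sketch: sums of uniform deviates tend toward a gaussian.
C     random_number is a Fortran 90 intrinsic (an assumption here;
C     the document's own code is Fortran 77).
      integer nsum, ntry, i, j, nin
      parameter (nsum=12, ntry=10000)
      real u, s, sdev
C     each deviate is uniform on (-1,1) with variance 1/3, so the
C     sum of nsum deviates has standard deviation sqrt(nsum/3.)
      sdev = sqrt(nsum/3.)
      nin = 0
      do 20 i = 1, ntry
         s = 0.
         do 10 j = 1, nsum
            call random_number(u)
            s = s + (2.*u - 1.)
   10    continue
C        count the sums that fall within two standard deviations
         if (abs(s) .le. 2.*sdev) nin = nin + 1
   20 continue
      write(*,*) 'fraction within 2 sdev =', real(nin)/real(ntry)
      end
```

With nsum = 12 the printed fraction should come out close to 0.95, as expected for a gaussian. Setting nsum = 1 recovers the uniform distribution itself, for which every value lies within two standard deviations and the fraction is 1.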

Other data distributions worth noting are (1) the log-normal distribution (data are 'normally' distributed after taking logarithms of the original data values), (2) the power-law distribution (data follow a power law), and (3) fractal distributions (a variety of special data distributions found in fractal and non-linear, chaotic systems and characterized by regions of the data space with zero probability of occurrence).

The first four statistical moments go a long way toward characterizing the distribution of a single data set. But what do you do if you have two data sets and you want to compare them? Do they have the same mean, variance, or distribution? There are several statistical tests to answer such questions.

The first test we will briefly discuss is Student's t-test, which determines whether the sample means of two different data sets are significantly different. Numerical Recipes uses the subroutines TTEST (data sets have the same variance) or TUTEST (data sets have different variances) to determine this. The output variable T is the difference between the sample means of the two data sets, normalized by an estimate of the standard error of that difference. Values of T near zero indicate that the means are not significantly different. The subroutines also return the significance, PROB: the probability that a value of T this large in magnitude, or larger, would arise by chance if the true means were equal. A small value of PROB (say, less than 0.05) indicates a significant difference, while larger values of PROB are consistent with equality of the means.
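
The T statistic itself is straightforward to compute. The sketch below uses the standard pooled-variance form of Student's t for two data sets assumed to have the same variance. It is not the Numerical Recipes listing: MEANVR is a hypothetical helper written for this example, the data are made up, and the probability PROB is not computed because it requires the incomplete beta function.

```fortran
      program tstat
C     Sketch: Student's t statistic for two made-up data sets,
C     assuming equal variances (pooled estimate). MEANVR is a
C     hypothetical helper written for this example.
      integer n1, n2
      parameter (n1=5, n2=5)
      real d1(n1), d2(n2), ave1, ave2, var1, var2, svar, t
      integer idf
      data d1 /2., 4., 4., 5., 10./
      data d2 /1., 3., 3., 4., 9./
      call meanvr(d1, n1, ave1, var1)
      call meanvr(d2, n2, ave2, var2)
C     pooled variance and degrees of freedom
      idf = n1 + n2 - 2
      svar = ((n1-1)*var1 + (n2-1)*var2)/idf
C     difference of means over its standard error
      t = (ave1 - ave2)/sqrt(svar*(1./n1 + 1./n2))
      write(*,*) 't =', t, '  degrees of freedom =', idf
      end

      subroutine meanvr(x, n, ave, var)
C     two-pass sample mean and variance (N-1 denominator)
      integer n, i
      real x(n), ave, var
      ave = 0.
      do 10 i = 1, n
         ave = ave + x(i)
   10 continue
      ave = ave/n
      var = 0.
      do 20 i = 1, n
         var = var + (x(i) - ave)**2
   20 continue
      var = var/(n-1)
      return
      end
```

Here the two data sets have means of 5 and 4 and a common variance of 9, giving T of about 0.53 with 8 degrees of freedom. That is well below the tabulated 5% two-sided critical value of about 2.31, so these means would not be judged significantly different.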

A similar test, the F-test, compares the variances of two different data sets and asks whether they are significantly different. The Numerical Recipes subroutine FTEST carries out this test and returns the values F and PROB. F is the ratio of the two data set variances and should be close to 1 if the variances are equal. Large values of PROB suggest that the variances are not significantly different; small values indicate a significant difference.
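
The variance ratio can be sketched in the same way. This is an illustration rather than the FTEST listing: MEANVR is again a hypothetical helper, the data are made up, and the convention of placing the larger variance in the numerator is a choice made here so that F is never less than 1.

```fortran
      program fstat
C     Sketch: variance ratio F for two made-up data sets. MEANVR
C     is a hypothetical helper written for this example.
      integer n1, n2
      parameter (n1=5, n2=5)
      real d1(n1), d2(n2), ave1, ave2, var1, var2, f
      data d1 /2., 4., 4., 5., 10./
      data d2 /3., 4., 5., 6., 7./
      call meanvr(d1, n1, ave1, var1)
      call meanvr(d2, n2, ave2, var2)
C     put the larger variance in the numerator so that F >= 1
      if (var1 .ge. var2) then
         f = var1/var2
      else
         f = var2/var1
      endif
      write(*,*) 'var1 =', var1, '  var2 =', var2, '  F =', f
      end

      subroutine meanvr(x, n, ave, var)
C     two-pass sample mean and variance (N-1 denominator)
      integer n, i
      real x(n), ave, var
      ave = 0.
      do 10 i = 1, n
         ave = ave + x(i)
   10 continue
      ave = ave/n
      var = 0.
      do 20 i = 1, n
         var = var + (x(i) - ave)**2
   20 continue
      var = var/(n-1)
      return
      end
```

For these data the variances are 9 and 2.5, so F is 3.6. Whether that ratio is significant depends on the degrees of freedom of both data sets and must be judged against the F distribution, which is what the PROB value provides.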

More complicated, but often used, tests are the chi-square (χ²) test and the Kolmogorov-Smirnov test. They both determine whether a data set is significantly different in distribution from some known data distribution (usually the gaussian distribution). These tests require more background in statistical evaluation than we are willing to go into here, but they are adequately treated in the references.

 

Algorithm Development

The primary task in writing a computer program to calculate statistical moments is to carry out the data summation. As we noted in Topic #1, the do-loop is ideally suited for such calculations. For example

 

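      dimension a(20)
      real noise
C     accumulate the sum and the sum of squares of the 20 values of a
      average=0.
      noise=0.
      do 10 i=1,20
         average = average + a(i)
         noise = noise + a(i)**2
   10 continue
C     sample mean and mean-square error (N-1 = 19 in the denominator)
      average = average/20.
      noise = noise/19.
C     variance from the mean-square error and the mean
      variance = noise - (20./19.)*average*average
      .
      .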

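will calculate the average value and mean-square error of the array a, and then the variance from those two values. Alternatively, two do-loops in succession could calculate first the mean and then the variance directly; this two-pass approach is less sensitive to round-off error when the mean is large compared with the standard deviation.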
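
The second option, two do-loops in succession, might look like the sketch below, which also evaluates the one-pass sum-of-squares formula for comparison. The data values are made up, with a deliberately large mean, and the variable names are illustrative.

```fortran
      program twopas
C     Sketch: variance by two successive do-loops, compared with
C     the one-pass sum-of-squares formula (illustrative data).
      integer n, i
      parameter (n=5)
      real a(n), ave, var2, sumsq, var1
      data a /10002., 10004., 10004., 10005., 10010./
C     first loop: sample mean
      ave = 0.
      do 10 i = 1, n
         ave = ave + a(i)
   10 continue
      ave = ave/n
C     second loop: variance from deviations about the mean
      var2 = 0.
      do 20 i = 1, n
         var2 = var2 + (a(i) - ave)**2
   20 continue
      var2 = var2/(n-1)
C     one-pass formula for comparison: (sum of squares - n*mean**2)
      sumsq = 0.
      do 30 i = 1, n
         sumsq = sumsq + a(i)**2
   30 continue
      var1 = (sumsq - n*ave*ave)/(n-1)
      write(*,*) 'two-pass variance =', var2
      write(*,*) 'one-pass variance =', var1
      end
```

The exact variance of these values is 9, which the two-loop calculation reproduces. In single precision the one-pass result may differ noticeably, because it subtracts two nearly equal numbers of order 5 x 10^8; the size of the discrepancy depends on the compiler and machine.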

 

Canned Programs

Numerical Recipes uses the subroutine MOMENT, listed at the end of this topic, to generate these statistics. The input is a one-dimensional array DATA and the number of data points N. The subroutine then returns AVE, SDEV, SKEW, and CURT: the mean, standard deviation, skewness, and kurtosis of the DATA distribution. It also returns VAR, the variance, and ADEV, the mean absolute deviation, which is another estimate of the variability. Note that CURT as returned by MOMENT has three subtracted, so that it is zero for a normal distribution.

There are also three statistical packages on the Macintosh computer file server that can carry out these statistical calculations (plus a lot more). They are STATVIEW, STATWORKS, and SYSTAT.

 

References

(1) W.H. Press et al., 1986, Numerical Recipes, The Art of Scientific Computing, Cambridge University Press, Chapter 13.

(2) Bendat and Piersol, 1986, Random Data, Wiley Interscience.

(3) Otnes, R. K., and L. Enochson, 1978, Applied Time Series Analysis, Volume 1: Basic Techniques, Wiley Interscience.