LINEAR LEAST-SQUARES ESTIMATION

Theory

Modeling studies attempt to quantitatively reconstruct the mathematical or physical behavior of a process (or group of processes) using data generated by that process. Usually, the data will contain noise and be sparce in a statistical sense so that we only have limited information for the reconstruction. The resultant models which one might develop can vary considerably in the accuracy and physical aspect of the process reconstruction. To start with, let us consider simple 2-D data sets where the result of some process can be modeled by a straight line or simple curve.

Suppose that we have a number (N) of experimental results f(x), where x is an independent variable. One analytical task could be to summarize the data in a model that depends on a few adjustable parameters. In the presence of noise in the results f(x), the data will almost certainly not fit the model perfectly, so how do we best estimate the parameters of our model? The best approach is to design a 'figure-of-merit' (or simply merit) function that measures the agreement between the data and the model as a function of the choice of parameters. Then if we can minimize the merit function, we will have found the 'best-fit' parameters of the model. One merit function that is often used is the minimum mean-square error function or 'least-squares' approach. (The mean-square error of a process is the squared deviation of all data from their average. In the limit that the data have a '0' average, the mean-square error is equivalent to the data variance.)

Suppose that we have, for example, a data set whose points ought to lie on a straight line, but don't. Clearly, whatever straight line we draw, there will be points that do not lie on it. The vertical distances separating the measured values of f(x) from the model line are called 'residuals'. It has been found from experience that the merit function that gives the best results is to arrange that the line minimizes the sum of the squares of the residuals. Let the straight line be

g(x) = a + bx

We have two constants to determine, a and b. The residual of the first data point is

The residual to the second point is

The residual to the ith point is

Thus, the sum of the squares of the residuals is

Eq. 1

What we want are the values a and b that minimize S2. We can find this by differentiating Eq. 1

Solution of these two equations provides the parameters a and b. This is illustrated by simplifying the equations as follows:

where

The simplified equations A = aN + bB and C = aB + bD can be solved for a and b as follows:

Thus, we have calculated the two parameters of the straight lines, a and b, in terms of x and f(x). The line g(x) = a + bx is termed the 'least-quares fit' to the data.

A comprehensive least-squares analysis should also have a method for estimating the 'goodness of fit' between the data and the final model. One method that is often used is a correlation coefficient r which is defined as

where xi and f(xi) are the data and x-bar and f(x)-bar are the average values of each variable. This equation tests the covariance between x and f(x) for all x (summation in the numerator) normalized to the individual variances of all x and f(x) (summations in the denominator). This equation tests for the co-linearity of xi and f(xi) without reference to a particular model, g(x) = a + bx. If r=1, the data all fall on a line with positive slope; if r=-1, the data all fall on a line with negative slope; if r=0, the data are uncorrelated (randomly distributed). Correlation coeffiecients need to be carefully evaluated, for in the presence of noise, r will never be 0 and could easily be 0.5 or 0.6 for truly independent data. If the data are close to ±1, then the data do fit a straight line and it may be presumed that our least-squares analysis has found the 'best' straight line fit. If r<<±1, it may be that our assumption of a stright-line model for the data is incorrect!

It is often the case that the data are more complicated and will only fit a curve that includes higher order powers of a specified function (usually termed the basis function). The example above had x as the basis function and only variables to the first power in x. A more general case with x as the basis function would be

which is a polynomial of degree M-1. An even more general form of this type of equation is

where X1(x),...,XM(x) are arbitrary functions of x called the basis functions. Note that the functions Xk(x) can be wildly nonlinear functions of x. These more general equations can be solved just like we did above for a simple straight line. But, we need never do this for there are many 'canned' subroutines already available.

When doing least-squares analysis, it is critical to understand the assumptions that underlie the mathematical manipulation discussed above. The minimum mean-square error method of finding a 'best' fit between a model and the data assumes three things: (1) the noise in the data measurements, f(x), must be zero-mean gaussian, (2) there can be no error in the independent variable measurements, x, (3) the data do, in fact, fit the type of curve estimated by the model. The failure to keep these assumptions in mind has led to many classic errors in scientific model analysis!!!!

 

Algorithm Development

In the process of writing a program to carry out a least-squares linear fit, there are several key features of program development which need to be considered. The first one is data entry. Under most circumstances you will have a data set that already exists and the task is to OPEN the data file and READ in the values. But for the first problem set you will have to read in the data from a terminal. ........more..........

 

Canned Subroutines

NUMERICAL RECIPES has several subroutines for the calculation of least-squares fits to both straight lines and more general smooth curves. These algorithms are more complete than our discussion above for they allow us to include error estimates associated with the dependent variable, f(x). In particular, the subroutine FIT (see next page) will calculate the coefficients A (= a) and B (= b) of a straight line to an input data set of length NDATA. The independent variable (x) is stored in the array X and the dependent variable (f(x)) is stored in Y. The subroutines LFIT and SVDFIT are more general examples suitable for higher order polynomial fits.

 

References

(1) W.H. Press et al., 1986, Numerical Recipes, The Art of Scientific Computing, Cambridge University Press, Chapter 14.

(2) C. J. Mann, 1987, Misuses of linear regression in Earth sciences, in International Association for Mathematical Geology Studies in Mathematical Geology #1 - Use and Abuse of Statistical methods in the Earth Sciences, W. B. Size Ed., Oxford University Press, p. 74-106.

(3) B. M. Troutman and G. P. Williams, Fitting straight lines in the Earth sciences, in International Association for Mathematical Geology Studies in Mathematical Geology #1 - Use and Abuse of Statistical methods in the Earth Sciences, W. B. Size Ed., Oxford University Press, p. 107-128.