10.10  Introduction to Chemical Engineering

NMM Chapter 9: Least-Squares Fitting a Model to Data

Overview

Often one has some experimental data set {(Xi, yi)} and some mathematical function ‘f’ that we expect gives the general relationship between Y and X:
Ytheoretical = f(X,C)
so we expect each experimental data point to approximately satisfy
yi = f(Xi,C)

Xi is a vector that contains all the input variables (usually they correspond to experimental variables that the experimentalist controls) that correspond to the measurement yi. For example, in a chemical experiment Xi would typically include the temperature and the initial concentrations of the reagents.

The vector C contains all the parameters (i.e. numbers) in the theoretical model which are known imprecisely, i.e. the numbers in the model that we feel comfortable varying to try to improve the agreement between the data and the model. For example, a theoretical model for a rocket’s trajectory would include a lot of numbers: the gravitational constant, the mass of the rocket, the density and viscosity of air, the wind speed, etc. We probably do not want to vary the gravitational constant – that number is known very precisely from other experiments – but we might feel free to adjust the number we use for the wind speed a little in order to improve the fit between data on the observed trajectory and our computer model. So the wind speed would be one of the adjustable numbers in C, but the gravitational constant would not be.

Most of Chapter 9 is focused on algorithms to compute values for C to get the best possible fit. There are several conceptual issues which are much more important than the algorithms; we will discuss these first.


Models vs. Data: Fundamental Concepts

The key thing to remember is that all of the numbers have uncertainties. In addition, the model is usually only an approximation to the real physical system, and also there are usually systematic errors associated with the experimental measurements (e.g. the calibration of the experimental apparatus is not perfect.) So one never expects to find any
yi=f(Xi, C)
exactly, except perhaps by accident. (If the number of adjustable parameters C is very large, it might be possible to find an exact fit that forces the equality for all of the yi. Unfortunately, exact fits found in this way are often physically incorrect; this is called “overfitting”. These sorts of overfits often mislead people into thinking a model is better than it really is.)

Because all of the numbers have uncertainties, we don’t expect an exact fit. First we need to develop an estimate of what sorts of uncertainties (also called “error bars”) we have in our data (and in our model). Then we want to compare the discrepancies between the data and the model. If the discrepancies are about the size we expected from the error bars, we say “the model is consistent with the data”. Remember that just because some data are consistent with a model, it doesn’t mean that the model is correct! For example, there are a lot of data consistent with the model that the earth is flat.

On the other hand, if the discrepancies are much larger than the error bars, we may be able to say “the model is inconsistent with the data”. In this case, after some careful consideration and double checks to make sure the data is right, we may be able to conclude “the model is wrong”.

For a practical example, see NMM Figure 9.4. The assumption is that the model ‘f’ for this case is a straight line. There is one point (near x=4.5) that lies quite far from any reasonable straight line through the rest of the data. The crucial question is: how large is the uncertainty in that data point? If the uncertainty could be as large as 0.5 unit in y, we should be happy with the fit, and proclaim “the data is consistent with the model”. However, if all the data points are known to six significant figures we have a serious problem. After a few double checks, we would be justified in saying “the data are inconsistent with a straight line model”. (This may seem crazy, but there are a lot of physical phenomena which lead to wiggles and sharp peaks and dips in data traces, so it is definitely possible that all the data in Fig. 9.4 are correct.)  We cannot determine whether or not data are consistent with a model unless we have an estimate of the uncertainties.

The last three paragraphs are critically important for your career as a scientist/engineer, and as an informed citizen. If you don’t understand them, you MUST stop and go back, or talk with one of the course instructors.


Estimating the uncertainties

It is unfortunately quite difficult to determine the systematic uncertainties associated with an experiment, or with approximations made in a computer model, so we will not attempt it here. (Just because it is difficult does not mean it is not important; many of the discussions in science and engineering journals are focused on exactly these issues.)
On the other hand, it is quite simple to estimate the uncertainties associated with randomness in experimental measurements, and often these random errors dominate the error bars. All one has to do is to repeat the experiment several times, and see how much the measurements vary (i.e. how repeatable they are). This variance is usually expressed as the standard deviation
si = sqrt( Σ (yik − <yi>)² / (N − 1) )

where yik is the value of yi measured the kth time we did the experiment, the sum is from k=1:N, and <yi> is the average of all the N repeated measurements of yi (all N measurements made with the same fixed Xi).
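The course tools are MATLAB (backslash, Matsolve), but the same calculation is easy to sketch in Python with NumPy. Here is a minimal example with hypothetical repeat measurements; note the N − 1 in the denominator, which NumPy selects with `ddof=1`:

```python
import numpy as np

# Hypothetical repeat measurements y_ik of the same quantity (fixed X_i)
y = np.array([2.1, 1.9, 2.3, 2.0, 2.2])
N = len(y)

y_mean = y.sum() / N                                # <y_i>, the sample mean
s = np.sqrt(((y - y_mean) ** 2).sum() / (N - 1))    # standard deviation, N-1 denominator

# NumPy's built-in gives the same result when ddof=1
assert np.isclose(s, np.std(y, ddof=1))
print(f"<y> = {y_mean:.3f}, s = {s:.4f}")
```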

Many sources of randomness lead to the familiar “bell curve” Gaussian distribution (yes, the same Carl Friedrich Gauss) in the measured values. The amazing “central limit theorem” shows that even when the underlying source of the randomness is non-Gaussian, the data can be usefully expressed in a way which leads to a Gaussian distribution of the mean. (For details, see any statistics textbook.) As a consequence, we will only concern ourselves with data sets which have a Gaussian (also called “normal”) distribution, where each measurement has about a 2/3 chance of lying within one standard deviation of the mean value, and is unlikely to lie more than a few standard deviations away from the mean.

The more repeat measurements we take, the more confident we are that the average value of these measurements is close to the true mean value of the complete distribution (the average we would obtain if we repeated the measurement a very large number of times). The uncertainty in the mean value <yi> after N repeats is

smean,i = si / sqrt(N)

This uncertainty in the mean smean,i is the number that we will want to compare with the deviation between the model and the average of the repeated measurements:

| <yi> – f(Xi, C) |

If the random error is the main reason why the model and the data have discrepancies, then we expect that about 2/3 of the time the deviation will be less than smean,i, and about 99.7% of the time the deviation will be less than 3 smean,i. (Note however that one sometimes finds “outliers” in data sets, irreproducible data values which are completely weird. Normally we discard these.)

If the deviations are larger than expected, it could be because of systematic errors in the experiment or approximations made in the model – or the model could just be wrong. If the deviations are much smaller than expected from the data reproducibility, we will be surprised; overfitting is a likely explanation.

N.B. Each time we make a repeat measurement yik, we expect it will differ by roughly s from <yi> (it could be more or less, but something in this ballpark) – this fluctuation comes from a real source of randomness in our experiment, and doesn’t reduce with the number of repeats. But after a few measurements, our average value <yi> will start to ‘settle down’ and stop fluctuating so much. As the number of repeats gets very large we expect that <yi> will become nearly constant near the true mean of the distribution (and smean will become very small).
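This behavior is easy to see numerically. The sketch below (Python/NumPy, with a hypothetical “true” distribution standing in for the real experiment) draws N repeat measurements for increasing N: the per-measurement scatter s stays near the true sigma, while smean = s/sqrt(N) shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_sigma = 5.0, 0.5   # hypothetical "true" distribution of the experiment

s_means = []
for N in (5, 50, 5000):
    y = rng.normal(true_mean, true_sigma, size=N)  # N repeat measurements
    s = np.std(y, ddof=1)        # per-measurement scatter: stays near true_sigma
    s_mean = s / np.sqrt(N)      # uncertainty in the mean: shrinks like 1/sqrt(N)
    s_means.append(s_mean)
    print(f"N={N:5d}  s={s:.3f}  s_mean={s_mean:.4f}  <y>={y.mean():.3f}")
```

Running this shows <y> settling toward the true mean as N grows, even though individual measurements keep fluctuating by about s.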


Least-squares fitting

Suppose you have done a lot of repeat experiments for several different values of the inputs Xi, so you have a set of data {(Xi , <yi> , smean,i)}, i=1:Ndata. You also have a theoretical function f(Xi,C) which you think should match this data pretty well, if only you knew the correct values of the adjustable parameters C. To determine good values for the vector C, you want to minimize the sum of the squares of the correctly weighted deviations, i.e. to minimize
r(C) = Σ ( (<yi> − f(Xi,C)) / smean,i )²
This sum over i=1:Ndata is called r in the NMM text; in other books it is often called χ² (“chi-squared”). It is a measure of the “badness” of the fit. If the model is consistent with the data, and the discrepancies are mostly due to the random errors in the data, we expect the sum to have a value O(Ndata) for any good choice of C.
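As a concrete illustration (Python/NumPy sketch, with made-up data and a straight-line model), r(C) is just a weighted sum of squared residuals:

```python
import numpy as np

# Hypothetical data: inputs x, mean measurements <y_i>, and uncertainties s_mean,i
x      = np.array([1.0, 2.0, 3.0, 4.0])
y_mean = np.array([2.1, 3.9, 6.2, 7.8])
s_mean = np.array([0.2, 0.2, 0.3, 0.3])

def f(x, C):
    """Straight-line model: f = C[0] + C[1]*x."""
    return C[0] + C[1] * x

def r(C):
    """Weighted sum of squared deviations (chi-squared)."""
    return np.sum(((y_mean - f(x, C)) / s_mean) ** 2)

# For a consistent model we expect r to be O(Ndata) = O(4)
print("r =", r(np.array([0.0, 2.0])))
```

Here the guess C = (0, 2) gives r ≈ 1.4, comfortably O(Ndata), so this line is consistent with these (hypothetical) data.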

To minimize a function, you look for places where the derivatives are equal to zero. In our case we want

0 = ∂r/∂cn = −2 Σ ( (<yi> − f(Xi,C)) / smean,i² ) (∂f/∂cn)        for all n=1:Nparams

For the specific case where our fitting function is a straight line, both the slope and intercept are adjustable, and all the data points have the same uncertainty, this system of equations is the same as Eq. 9.6 in the NMM text. Equations 9.10 through 9.15 show how to write these equations in a matrix form which can be easily solved using backslash or the Matsolve program you wrote.
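For the straight-line, equal-uncertainty case, the normal equations reduce to a 2×2 linear system. A Python/NumPy sketch with hypothetical data (the course would use MATLAB backslash or your Matsolve instead of `np.linalg.solve`):

```python
import numpy as np

# Hypothetical data, all points with equal error bars
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix for f(x, C) = c1 + c2*x  (columns are the basis functions 1 and x)
A = np.column_stack([np.ones_like(x), x])

# Normal equations (A'*A) C = A'*y, the matrix form of setting dr/dc_n = 0
C = np.linalg.solve(A.T @ A, A.T @ y)
print("intercept, slope =", C)

# Same least-squares answer from a backslash-style solve of the rectangular system
C_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(C, C_lstsq)
```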


Least-squares fits to complicated functions

The formulas for straight-line fitting are available in many graphing programs and in many calculators. However, there is no reason to confine yourself to linear functions: one can easily use the same math to fit data to much more complicated functions. This is described in Sections 9.2.1 and 9.2.2.
The key idea is that as long as f is linear in C (i.e. ∂²f/∂cj∂cn = 0 for all j,n = 1:Nparams), the system of equations will also be linear in C, and so it will be easy to solve for C using Matsolve or backslash. The function f DOES NOT have to be linear in the inputs X. In the fitting procedure we are not solving for X or Y; they are given data, so the function can depend on them in any complicated way without making the solution for the best-fit C more complicated. Also, it makes no difference whether Xi is a single variable or a vector. (If you are uncomfortable with the vector equations, see Section 9.3 where all the components are written out.)
A convenient way to write all functions f which depend linearly on C is as a sum of other functions fn:

f(Xm,C) = Σn fn(Xm) cn

which is the same as saying

Ytheoretical = F(X,C) = A*C

where Amn = fn(Xm) (see NMM equations 9.25, 9.26). Note that with this form,

∂f/∂cn = fn(Xm) = Amn

See if you can convince yourself that our general equation

0 = −2 Σ ( (<ym> − f(Xm,C)) / smean,m² ) (∂f/∂cn)        for all n=1:Nparams

reduces to

0 = B’*Z – (B’*B)*C

i.e.

(B’*B) * C = (B’*Z)        (in the familiar form matrix*unknown = vector)

where Bmn = Amn/smean,m and zm = <ym>/smean,m for any function F of this form. Again, if you assume that all of the data points have equal uncertainties, you get Eq. 9.26 as in the text.
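Putting the pieces together: the sketch below (Python/NumPy, hypothetical data) fits a quadratic, which is linear in C but not in x, with per-point uncertainties. It builds A from the basis functions, weights it into B and Z, and solves the normal equations:

```python
import numpy as np

# Hypothetical data with per-point uncertainties s_mean,m
x      = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y_mean = np.array([1.3, 2.1, 3.4, 5.1, 7.2, 9.8])
s_mean = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.3])

# Basis functions f_n(x): model is f = c1*1 + c2*x + c3*x^2 (linear in C, not in x)
A = np.column_stack([np.ones_like(x), x, x**2])   # A_mn = f_n(X_m)

# Weight each row by 1/s_mean,m
B = A / s_mean[:, None]          # B_mn = A_mn / s_mean,m
Z = y_mean / s_mean              # z_m  = <y_m> / s_mean,m

# Normal equations (B'*B) C = B'*Z
C = np.linalg.solve(B.T @ B, B.T @ Z)
print("best-fit coefficients:", C)
```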


Using QR instead of the Normal Equations to solve for C

You see that in the final equation we have to solve, both sides are multiplied by B’ (A’ in the NMM text). This makes the final matrix B’*B square. Sometimes this matrix multiplication makes the equations ill-conditioned, and one would do better to work with the original rectangular matrices. This is done using the QR method, described in Section 9.2.3. The QR method is used by the Matlab backslash command when you feed it a rectangular matrix.
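A minimal sketch of the idea, again in Python/NumPy with made-up numbers: factor the rectangular weighted matrix B as Q*R, then solve the small triangular system instead of forming B’*B.

```python
import numpy as np

# Weighted rectangular system B*C ~= Z (hypothetical numbers)
B = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
Z = np.array([1.9, 4.1, 5.9, 8.2])

# QR factorization: B = Q*R, with orthonormal columns Q and square upper-triangular R
Q, R = np.linalg.qr(B)

# B*C ~= Z becomes the triangular system R*C = Q'*Z
C_qr = np.linalg.solve(R, Q.T @ Z)

# Same least-squares answer as the normal equations, but without forming B'*B,
# which squares the condition number
C_normal = np.linalg.solve(B.T @ B, B.T @ Z)
assert np.allclose(C_qr, C_normal)
print("C =", C_qr)
```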
Sometimes the rectangular matrix is rank-deficient (rank(B) < Ncolumns), which is a disaster for most methods. However, this can be dealt with using the SVD method.
Both the QR method and the SVD method have multiple applications in chemical engineering (and in other fields), just like Gaussian elimination. All three methods normally give you the same values for C, they differ only when the problem is ill-conditioned. For now you don’t need to be too concerned about this. In later courses or UROPs where you are actually fitting complicated data, you will want to understand these methods in some detail.
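To see why rank deficiency matters, here is a sketch (hypothetical data) where one basis function is a linear combination of the others, so B’*B is singular and the normal equations fail; an SVD-based pseudoinverse still returns a sensible (minimum-norm) least-squares solution:

```python
import numpy as np

# Rank-deficient design matrix: third column = first column + second column
x = np.array([0.0, 1.0, 2.0, 3.0])
B = np.column_stack([np.ones_like(x), x, 1.0 + x])   # 3 columns, but rank 2
Z = np.array([1.0, 3.1, 4.9, 7.2])

assert np.linalg.matrix_rank(B) < B.shape[1]   # confirm rank deficiency

# B'*B is singular here, so np.linalg.solve on the normal equations would fail;
# the SVD-based pseudoinverse picks the minimum-norm least-squares solution
C = np.linalg.pinv(B) @ Z
residual = Z - B @ C
print("C =", C, " residual norm =", np.linalg.norm(residual))
```

The least-squares residual is orthogonal to the columns of B, which is a good sanity check on any of these solvers.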


last modified: October 29, 2002