Linear and Nonlinear Least Squares

The objective is to minimize the weighted sum of squares of the errors (SSE) between the model f(X;P) and the data Y by varying the fitting parameters P:

min  SSE(P) = Sum { (Wi*(Yi-f(Xi;P)))^2 }     sum is over all data points (Xi,Yi)      Eq.(1)

The weights Wi might be 1/standard error in Yi. If you don't know the weights, make them all =1.
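
As a concrete (and entirely made-up) illustration of Eq.(1), the sketch below evaluates the weighted SSE in Python/NumPy for an assumed two-parameter model f(X;P) = P1*exp(-P2*X); the model, the data, and the weights are assumptions chosen only to make the example runnable.

    import numpy as np

    # Assumed model for illustration: f(X; P) = P[0] * exp(-P[1] * X)
    def model(X, P):
        return P[0] * np.exp(-P[1] * X)

    def weighted_sse(P, X, Y, W):
        # Eq.(1): SSE(P) = Sum_i ( W_i * (Y_i - f(X_i; P)) )^2
        residuals = W * (Y - model(X, P))
        return np.sum(residuals**2)

    # Made-up data; weights all = 1, as suggested when the errors are unknown
    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 5.0, 20)
    Y = 3.0 * np.exp(-0.7 * X) + 0.05 * rng.normal(size=X.size)
    W = np.ones_like(Y)

    print(weighted_sse(np.array([3.0, 0.7]), X, Y, W))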

To do the minimization with respect to the Pk, we look for places where the derivatives are zero:

dSSE/dPk = 0 = Sum { 2*(Wi*(Yi-f(Xi;P))) * (-Wi*df/dPk|Xi,P) }        Eqs.(2)

The sum is still over the data points i. Note that there are as many equations as there are fitting parameters Pk.

One can usually evaluate the partial derivatives by hand (or using a symbolic math code like Maple), so these are just algebraic functions of P and Xi; call them fk:

df/dPk|Xi,P = fk(Xi;P)            Eqs.(3)
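
For example, the derivatives in Eqs.(3) can be generated automatically. The sketch below uses SymPy (instead of Maple) on the same assumed exponential model as above; neither the package choice nor the model comes from the text.

    import sympy as sp

    X, P1, P2 = sp.symbols('X P1 P2')
    f = P1 * sp.exp(-P2 * X)      # assumed model f(X; P) = P1*exp(-P2*X)

    # Eqs.(3): one algebraic expression per fitting parameter
    f1 = sp.diff(f, P1)           # -> exp(-P2*X)
    f2 = sp.diff(f, P2)           # -> -P1*X*exp(-P2*X)
    print(f1, f2)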

So one really has to solve a system of nonlinear algebraic equations, where the unknowns are the Pk's, to find the best values for the fitting parameters:

0 = Sum { 2*(Wi*(Yi-f(Xi;P))) * (-Wi*fk(Xi;P)) }  = Fk(P)       Eqs.(4)

for all k=1..Nparameters. The sum is still over all the data points. In general this is a hard system of nonlinear equations to solve, but if forced we can try Newton's method (a minimal sketch follows).
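
Here is a minimal Newton sketch for Eqs.(4), again in Python/NumPy and again using the assumed exponential model and made-up data from the earlier sketches; it builds the Jacobian of Fk(P) by finite differences and, like any Newton iteration, it needs a reasonable starting guess.

    import numpy as np

    # Same assumed model and data as the earlier sketches (not from the text)
    def model(X, P):
        return P[0] * np.exp(-P[1] * X)

    def grad_model(X, P):
        # Eqs.(3) for this model: column k holds df/dP_k evaluated at each X_i
        return np.column_stack([np.exp(-P[1] * X), -P[0] * X * np.exp(-P[1] * X)])

    def F(P, X, Y, W):
        # Eqs.(4): F_k(P) = Sum_i 2*(W_i*(Y_i - f(X_i;P))) * (-W_i*f_k(X_i;P))
        r = W * (Y - model(X, P))
        return -2.0 * (W[:, None] * grad_model(X, P)).T @ r

    def newton_solve(P0, X, Y, W, tol=1e-10, maxit=50, h=1e-6):
        # Newton's method on the system F(P) = 0, Jacobian by finite differences
        P = np.asarray(P0, dtype=float)
        for _ in range(maxit):
            Fval = F(P, X, Y, W)
            J = np.empty((P.size, P.size))
            for j in range(P.size):
                dP = np.zeros_like(P)
                dP[j] = h
                J[:, j] = (F(P + dP, X, Y, W) - Fval) / h
            step = np.linalg.solve(J, -Fval)
            P = P + step
            if np.max(np.abs(step)) < tol:
                break
        return P

    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 5.0, 20)
    Y = 3.0 * np.exp(-0.7 * X) + 0.05 * rng.normal(size=X.size)
    W = np.ones_like(Y)
    print(newton_solve([2.0, 1.0], X, Y, W))   # should land near P = (3, 0.7)

(In practice a library routine such as scipy.optimize.least_squares, which minimizes Eq.(1) directly from the weighted residuals, is a more robust choice than hand-rolled Newton.)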

In the lucky case where f is a linear function of the parameters Pk, there is a big simplification:

IF
f(X,P) = Sum { Pk Gk(X) }      (N.B. sum here is over the fitting parameters!)    Eq.(5)
THEN
fk(Xi;P) = df/dPk|Xi,P = Gk(Xi)         Note no more summation, and no more P!     Eqs.(6)
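
To see Eqs.(6) concretely, here is the same SymPy check applied to an assumed linear-in-the-parameters model with made-up basis functions G1(X)=1, G2(X)=X, G3(X)=sin(X):

    import sympy as sp

    X, P1, P2, P3 = sp.symbols('X P1 P2 P3')
    f = P1 * 1 + P2 * X + P3 * sp.sin(X)   # Eq.(5) with assumed basis functions

    # Eqs.(6): each derivative is just the corresponding G_k(X); no P remains
    print(sp.diff(f, P1), sp.diff(f, P2), sp.diff(f, P3))   # -> 1, X, sin(X)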

Plugging Eqs.(6) into Eqs.(4) gives a system of LINEAR equations in P. For the case where Wi=1, these reduce to the familiar linear-least-squares matrix equation:

(G^T G) P = G^T Y       Eq.(7)

(This form is called the "normal equations".) In these equations the matrix elements of G are

Gik = Gk(Xi)      Eq.(8)

VERY IMPORTANT: Gk(X) does not have to be linear!! For these equations to work, the only requirement is that the model depend linearly on the fitting parameters P. Its dependence on X can be as nonlinear as you like. Also, X can be a vector without affecting this derivation.
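
To illustrate Eqs.(7)-(8) and the remark above, here is a minimal NumPy sketch that fits the same assumed linear-in-P model (basis functions 1, X, sin(X), with made-up data and Wi = 1). Note that sin(X) is very nonlinear in X, yet the fit is still an ordinary linear least-squares solve.

    import numpy as np

    def design_matrix(X):
        # Eq.(8): G[i,k] = G_k(X_i), with assumed basis functions 1, X, sin(X)
        return np.column_stack([np.ones_like(X), X, np.sin(X)])

    rng = np.random.default_rng(1)
    X = np.linspace(0.0, 10.0, 50)
    Y = 1.5 + 0.3 * X + 2.0 * np.sin(X) + 0.1 * rng.normal(size=X.size)   # made-up data

    G = design_matrix(X)
    # Eq.(7), the normal equations (weights Wi = 1): (G^T G) P = G^T Y
    P = np.linalg.solve(G.T @ G, G.T @ Y)
    print(P)   # should come out near (1.5, 0.3, 2.0)

Numerically, np.linalg.lstsq(G, Y, rcond=None) solves the same problem and is better conditioned than forming G^T G explicitly.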