Linear and Nonlinear Least Squares

The objective is to minimize the weighted sum of squares of the errors (SSE) between the model f(X;P) and the data Y by varying the fitting parameters P:

min  SSE(P) = Sum { (Wi*(Yi-f(Xi;P)))^2 }     sum is over all data points (Xi,Yi)      Eq.(1)

The weights Wi might be 1/standard error in Yi. If you don't know the weights, make them all =1.
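
As a concrete (and entirely made-up) illustration of Eq.(1), the sketch below evaluates the weighted SSE in Python/NumPy for an assumed two-parameter model f(X;P) = P1*exp(-P2*X); the model, the data, and the weights are assumptions chosen only to make the example runnable.

    import numpy as np

    # Assumed model for illustration: f(X; P) = P[0] * exp(-P[1] * X)
    def model(X, P):
        return P[0] * np.exp(-P[1] * X)

    def weighted_sse(P, X, Y, W):
        # Eq.(1): SSE(P) = Sum_i ( W_i * (Y_i - f(X_i; P)) )^2
        residuals = W * (Y - model(X, P))
        return np.sum(residuals**2)

    # Made-up data; weights all = 1, as suggested when the errors are unknown
    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 5.0, 20)
    Y = 3.0 * np.exp(-0.7 * X) + 0.05 * rng.normal(size=X.size)
    W = np.ones_like(Y)

    print(weighted_sse(np.array([3.0, 0.7]), X, Y, W))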

To do the minimization with respect to the Pk, we look for places where the derivatives are zero:

dSSE/dPk = 0 = Sum { 2*(Wi*(Yi-f(Xi;P))) * (-Wi*df/dPk|Xi,P) }        Eqs.(2)

The sum is still over the data points i. Note that there are as many equations as there are fitting parameters Pk.

One can usually evaluate the partial derivatives by hand (or using a symbolic math code like Maple), so these are just algebraic functions of P and Xi; call them fk:

df/dPk|Xi,P = fk(Xi;P)            Eqs.(3)
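
For example, the derivatives in Eqs.(3) can be generated automatically. The sketch below uses SymPy (instead of Maple) on the same assumed exponential model as above; neither the package choice nor the model comes from the text.

    import sympy as sp

    X, P1, P2 = sp.symbols('X P1 P2')
    f = P1 * sp.exp(-P2 * X)      # assumed model f(X; P) = P1*exp(-P2*X)

    # Eqs.(3): one algebraic expression per fitting parameter
    f1 = sp.diff(f, P1)           # -> exp(-P2*X)
    f2 = sp.diff(f, P2)           # -> -P1*X*exp(-P2*X)
    print(f1, f2)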

So one really has to solve a system of nonlinear algebraic equations, where the unknowns are the Pk's, to find the best values for the fitting parameters:

0 = Sum { 2*(Wi*(Yi-f(Xi;P))) * (-Wi*fk(Xi;P)) }  = Fk(P)       Eqs.(4)

for all k=1..Nparameters. The sum is still over all the data points. In general this is a hard system of nonlinear equations to solve, but if forced we can try Newton's method (a minimal sketch follows).
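
Here is a minimal Newton sketch for Eqs.(4), again in Python/NumPy and again using the assumed exponential model and made-up data from the earlier sketches; it builds the Jacobian of Fk(P) by finite differences and, like any Newton iteration, it needs a reasonable starting guess.

    import numpy as np

    # Same assumed model and data as the earlier sketches (not from the text)
    def model(X, P):
        return P[0] * np.exp(-P[1] * X)

    def grad_model(X, P):
        # Eqs.(3) for this model: column k holds df/dP_k evaluated at each X_i
        return np.column_stack([np.exp(-P[1] * X), -P[0] * X * np.exp(-P[1] * X)])

    def F(P, X, Y, W):
        # Eqs.(4): F_k(P) = Sum_i 2*(W_i*(Y_i - f(X_i;P))) * (-W_i*f_k(X_i;P))
        r = W * (Y - model(X, P))
        return -2.0 * (W[:, None] * grad_model(X, P)).T @ r

    def newton_solve(P0, X, Y, W, tol=1e-10, maxit=50, h=1e-6):
        # Newton's method on the system F(P) = 0, Jacobian by finite differences
        P = np.asarray(P0, dtype=float)
        for _ in range(maxit):
            Fval = F(P, X, Y, W)
            J = np.empty((P.size, P.size))
            for j in range(P.size):
                dP = np.zeros_like(P)
                dP[j] = h
                J[:, j] = (F(P + dP, X, Y, W) - Fval) / h
            step = np.linalg.solve(J, -Fval)
            P = P + step
            if np.max(np.abs(step)) < tol:
                break
        return P

    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 5.0, 20)
    Y = 3.0 * np.exp(-0.7 * X) + 0.05 * rng.normal(size=X.size)
    W = np.ones_like(Y)
    print(newton_solve([2.0, 1.0], X, Y, W))   # should land near P = (3, 0.7)

(In practice a library routine such as scipy.optimize.least_squares, which minimizes Eq.(1) directly from the weighted residuals, is a more robust choice than hand-rolled Newton.)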

In the lucky case where f is a linear function of the parameters Pk, there is a big simplification:

IF
f(X,P) = Sum { Pk Gk(X) }      (N.B. sum here is over the fitting parameters!)    Eq.(5)
THEN
fk(Xi;P) = df/dPk|Xi,P = Gk(Xi)         Note no more summation, and no more P!     Eqs.(6)
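
To see Eqs.(6) concretely, here is the same SymPy check applied to an assumed linear-in-the-parameters model with made-up basis functions G1(X)=1, G2(X)=X, G3(X)=sin(X):

    import sympy as sp

    X, P1, P2, P3 = sp.symbols('X P1 P2 P3')
    f = P1 * 1 + P2 * X + P3 * sp.sin(X)   # Eq.(5) with assumed basis functions

    # Eqs.(6): each derivative is just the corresponding G_k(X); no P remains
    print(sp.diff(f, P1), sp.diff(f, P2), sp.diff(f, P3))   # -> 1, X, sin(X)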

Plugging Eqs.(6) into Eqs.(4) gives a system of LINEAR equations in P. For the case where Wi=1, these reduce to the familiar linear-least-squares matrix equation:

(G^T G) P = G^T Y       Eq.(7)

(This form is called the "normal equations".) In these equations the matrix elements of G are

Gik = Gk(Xi)      Eq.(8)

VERY IMPORTANT: Gk(X) does not have to be linear!! For these equations to work, the only requirement is that the model depend linearly on the fitting parameters P. Its dependence on X can be as nonlinear as you like. Also, X can be a vector without affecting this derivation.
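
To illustrate Eqs.(7)-(8) and the remark above, here is a minimal NumPy sketch that fits the same assumed linear-in-P model (basis functions 1, X, sin(X), with made-up data and Wi = 1). Note that sin(X) is very nonlinear in X, yet the fit is still an ordinary linear least-squares solve.

    import numpy as np

    def design_matrix(X):
        # Eq.(8): G[i,k] = G_k(X_i), with assumed basis functions 1, X, sin(X)
        return np.column_stack([np.ones_like(X), X, np.sin(X)])

    rng = np.random.default_rng(1)
    X = np.linspace(0.0, 10.0, 50)
    Y = 1.5 + 0.3 * X + 2.0 * np.sin(X) + 0.1 * rng.normal(size=X.size)   # made-up data

    G = design_matrix(X)
    # Eq.(7), the normal equations (weights Wi = 1): (G^T G) P = G^T Y
    P = np.linalg.solve(G.T @ G, G.T @ Y)
    print(P)   # should come out near (1.5, 0.3, 2.0)

Numerically, np.linalg.lstsq(G, Y, rcond=None) solves the same problem and is better conditioned than forming G^T G explicitly.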