
Regression Analysis: Method of Least Squares

Once we have established that a strong correlation exists between x and y, we would like to find coefficients a and b so that y can be represented by a best fit line $\hat{y}$ = ax + b within the range of the data. The method of least squares is a common technique for this purpose. The rationale is as follows. For each pair of observations (x_i, y_i), we define the error e_i as

e_i = (ax_i + b - y_i).

(2)

We now choose a and b so that the sum of the squared errors over all n observations is minimized; that is, the quantity we want to minimize is

$\displaystyle S(a, b) = \sum_{i=1}^{n} \left[ a x_i + b - y_i \right]^2$.

(3)

We know from calculus that at the minimum we need $\partial S/\partial a = 0$ and $\partial S/\partial b = 0$. These conditions yield

$\displaystyle n b + \left( \sum_{i=1}^{n} x_i \right) a = \sum_{i=1}^{n} y_i$
$\displaystyle \left( \sum_{i=1}^{n} x_i \right) b + \left( \sum_{i=1}^{n} x_i^2 \right) a = \sum_{i=1}^{n} x_i y_i$.			(4)
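
To make the intermediate step explicit, differentiating the sum of squared errors S(a, b) of Eq. 3 term by term gives the following. This is a sketch of the standard calculation and uses only the symbols already defined above.

```latex
\begin{align*}
\frac{\partial S}{\partial a} &= 2 \sum_{i=1}^{n} \left( a x_i + b - y_i \right) x_i = 0, \\
\frac{\partial S}{\partial b} &= 2 \sum_{i=1}^{n} \left( a x_i + b - y_i \right) = 0.
\end{align*}
```

Dividing by 2 and collecting the sums that multiply a and b, the b-derivative yields the first equation of Eq. 4 (since $\sum_{i=1}^{n} b = nb$) and the a-derivative yields the second.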

Eq. 4 gives two linear equations in a and b, which can be solved to get

$\displaystyle a = \frac{n \sum_{i=1}^{n} x_i y_i - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$,

(5)

with b obtained through subsequent substitution of a in either of the two equations given by Eq. 4.
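
The calculation can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `least_squares_line` and the sample points are made up, and b is recovered from the first equation of Eq. 4 as b = (Σy_i - aΣx_i)/n.

```python
from typing import Sequence, Tuple


def least_squares_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (a, b) minimizing sum((a*x_i + b - y_i)**2), following Eq. 5."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")

    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)

    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    a = (n * sum_xy - sum_x * sum_y) / denom  # slope, Eq. 5
    b = (sum_y - a * sum_x) / n               # intercept, first equation of Eq. 4
    return a, b


# Hypothetical points lying exactly on y = 2x + 1 (illustration only)
a, b = least_squares_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

Because the made-up points lie exactly on a line, the fit should recover that line's slope and intercept; with real, noisy data the residuals e_i will not all vanish.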

In the case of the data given in Figure 1, the best fit line has a slope of 1.64 and an intercept of -0.36; in other words, $\hat{y}$ = 1.64x - 0.36. Note that this is only a best fit line, which can be used to compute the fuel consumption for a given weight within, or very close to, the range of the measurements. Its predictive power outside that range is rather limited. For instance, for x = 0 we get $\hat{y}$ = -0.36, which is non-physical: a physical model for the fuel consumption would have predicted zero consumption of fuel for zero weight.
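
One way to respect this limitation in code is to refuse to evaluate the line outside the measured range. The sketch below uses the slope and intercept quoted above; the function name `predict_within_range` and the bounds 1.0 and 3.0 are placeholders, not the actual range of Figure 1.

```python
def predict_within_range(x: float, a: float, b: float,
                         x_min: float, x_max: float) -> float:
    """Evaluate a*x + b, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} lies outside the fitted range [{x_min}, {x_max}]"
        )
    return a * x + b


# Slope and intercept from the text; the range bounds are hypothetical.
y_hat = predict_within_range(2.0, a=1.64, b=-0.36, x_min=1.0, x_max=3.0)
print(y_hat)
```

A strict bound is the simplest choice; a small tolerance on either side could be added to allow predictions "very close to" the measured range.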

How are the slope and the intercept of the best fit line related to the correlation coefficient? To examine this, we rewrite Eq. 5 as

a	=	$\displaystyle \frac{n \sum_{i=1}^{n} x_i y_i - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$
		= $\displaystyle \frac{\sum_{i=1}^{n} x_i y_i - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)/n}{\sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2/n}$
		= $\displaystyle \frac{\sum_{i=1}^{n} (x_i-\mu_x)(y_i-\mu_y)}{\sum_{i=1}^{n} (x_i-\mu_x)^2}$ (Verify this step)
		= $\displaystyle \frac{(n-1) R \sigma_x \sigma_y}{(n-1) \sigma_x^2}$ (See Eq. 1)
		= $\displaystyle R \frac{\sigma_y}{\sigma_x}$.	(6)
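
Eq. 6 can be checked numerically. The sketch below assumes that σ_x and σ_y are sample standard deviations (divisor n - 1) and that R is defined as Σ(x_i - μ_x)(y_i - μ_y) / ((n - 1) σ_x σ_y), which is what the (n - 1) factors in the derivation suggest; the data are made up purely to exercise the identity.

```python
import math
import statistics

# Hypothetical sample (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

mu_x, mu_y = statistics.mean(x), statistics.mean(y)
# statistics.stdev uses the n - 1 divisor (sample standard deviation)
sigma_x, sigma_y = statistics.stdev(x), statistics.stdev(y)

# Correlation coefficient, under the assumed definition
cov_sum = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
R = cov_sum / ((n - 1) * sigma_x * sigma_y)

# Slope from Eq. 5
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_xx = sum(xi * xi for xi in x)
a_eq5 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)

# Slope from Eq. 6
a_eq6 = R * sigma_y / sigma_x

print(a_eq5, a_eq6, math.isclose(a_eq5, a_eq6, rel_tol=1e-9))
```

The two slopes should agree up to floating-point rounding, which is why the comparison uses `math.isclose` rather than exact equality.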