Next: Frequency Distribution Revisited Up: 10.001: Data Visualization and Previous: Quantitative Description of the

Variance, Standard Deviation and Coefficient of Variation

The most commonly used measure of variation (dispersion) is the sample standard deviation, $\sigma$. The square of the sample standard deviation is called the sample variance, defined as²

$$\sigma^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^{2}. \qquad (2)$$
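A direct reading of Eq. 2 needs two passes over the data: one for $\mu$ and one for the squared deviations. The following is a minimal sketch of that reading; the function name `two_pass_variance` and the sample values are illustrative assumptions, not part of the original notes.

```python
def two_pass_variance(data):
    """Return (mean, variance) following Eq. 2: mean first, then deviations."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")

    # First pass: accumulate the sum to obtain the sample mean.
    total = 0.0
    for x in data:
        total += x
    mean = total / n

    # Second pass: sum of squared deviations from the mean (never negative).
    sum_sq_dev = 0.0
    for x in data:
        sum_sq_dev += (x - mean) ** 2

    return mean, sum_sq_dev / (n - 1)


# Illustrative (made-up) sample.
mean, var = two_pass_variance([4.0, 7.0, 13.0, 16.0])
std = var ** 0.5  # sample standard deviation
```

Because the accumulated quantity is a sum of squares, the result cannot be negative, at the cost of traversing the data twice.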

However, since $\sum_{i=1}^{n} x_i = n\mu$,

$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \mu)^{2} &= \sum_{i=1}^{n} \left(x_i^{2} - 2\mu x_i + \mu^{2}\right) \\
&= \sum_{i=1}^{n} x_i^{2} - 2\mu \left(\sum_{i=1}^{n} x_i\right) + n\mu^{2} \\
&= \left(\sum_{i=1}^{n} x_i^{2}\right) - 2n\mu^{2} + n\mu^{2} \\
&= \left(\sum_{i=1}^{n} x_i^{2}\right) - n\mu^{2}. \qquad (3)
\end{aligned}$$

So an alternate equation for computing the variance, obtained by writing $n\mu^{2} = \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^{2}$ in Eq. 3, is given by

$$\sigma^{2} = \frac{1}{n-1} \left[ \left(\sum_{i=1}^{n} x_i^{2}\right) - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^{2} \right]. \qquad (4)$$

The advantage of Eq. 4 over Eq. 2 is that $\sum x_i^{2}$ (required for $\sigma$) and $\sum x_i$ (required for $\mu$) can be accumulated in one loop, whereas Eq. 2 requires the precomputed value of $\mu$ before we can compute $\sigma$. For this reason, Eq. 4 is often used to compute the mean and variance together.
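As an illustration of the single-loop idea, the sketch below accumulates $\sum x_i$ and $\sum x_i^{2}$ together and then combines them as $\left[\sum x_i^{2} - \frac{1}{n}\left(\sum x_i\right)^{2}\right]/(n-1)$. The name `one_pass_variance` and the sample values are assumptions made for this example.

```python
def one_pass_variance(data):
    """Return (mean, variance) from running sums gathered in a single loop."""
    n = 0
    sum_x = 0.0   # running sum of x_i
    sum_x2 = 0.0  # running sum of x_i squared
    for x in data:
        n += 1
        sum_x += x
        sum_x2 += x * x

    if n < 2:
        raise ValueError("sample variance needs at least two data points")

    mean = sum_x / n
    # Difference of two accumulated terms; not a sum of squares.
    variance = (sum_x2 - sum_x * sum_x / n) / (n - 1)
    return mean, variance


# Illustrative (made-up) sample; any iterable works, including a generator.
mean, var = one_pass_variance(x for x in [4.0, 7.0, 13.0, 16.0])
```

Since the loop never needs the length in advance or a second traversal, this form also works on data that arrives as a stream.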

However, a close look at Eq. 2 and Eq. 4 reveals one important difference: Eq. 2 guarantees a non-negative variance because the variance is given there as a sum of squares. This is not necessarily true of Eq. 4, where the term involving $\left(\sum_{i=1}^{n} x_i\right)^{2}$ is subtracted from $\sum_{i=1}^{n} x_i^{2}$. From a computational perspective, subtracting two large, nearly equal numbers amplifies roundoff error; this can cause difficulties for large samples and can even yield a negative variance. So we are interested in developing an algorithm that computes (a) both the mean and the variance in the same loop and (b) the variance as a sum of squares. How can this be accomplished? We can develop recursive relations. Applying Eq. 1 to the first $p - 1$ data and to the first $p$ data and subtracting one from the other, we get

$$p\,\mu_{p} = (p - 1)\,\mu_{p-1} + x_p, \qquad (5)$$

where $\mu_{p}$ denotes the mean value of the first $p$ data of the sample. We can now compute the sample mean recursively by letting $\mu_{1} = x_1$ and subsequently applying Eq. 5 for $p = 2, 3, \ldots, n$. We can also construct a simple recursion relation for computing $\sigma^{2}$ by applying Eq. 4 to the first $p - 1$ data and to the first $p$ data in the sample. This gives the two equations

$$\begin{aligned}
(p - 2)\,\sigma_{p-1}^{2} &= x_1^{2} + x_2^{2} + \cdots + x_{p-1}^{2} - (p - 1)\,\mu_{p-1}^{2}, \\
(p - 1)\,\sigma_{p}^{2} &= x_1^{2} + x_2^{2} + \cdots + x_{p}^{2} - p\,\mu_{p}^{2}. \qquad (6)
\end{aligned}$$

Subtracting the first of Eqs. 6 from the second gives

$$(p - 1)\,\sigma_{p}^{2} = (p - 2)\,\sigma_{p-1}^{2} + (p - 1)\,\mu_{p-1}^{2} + x_p^{2} - p\,\mu_{p}^{2}, \qquad (7)$$

which can be rewritten (to get rid of the subtraction of squared terms) by substituting for $\mu_{p-1}^{2}$ from Eq. 5 as follows:

$$(p - 1)\,\sigma_{p}^{2} = (p - 2)\,\sigma_{p-1}^{2} + \frac{p\,(x_p - \mu_{p})^{2}}{p - 1}, \qquad p = 2, 3, \ldots, n. \qquad (8)$$
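The step from Eq. 7 to Eq. 8 is stated without its algebra. One way to fill it in, using only Eq. 5 in the form $\mu_{p-1} = (p\,\mu_{p} - x_p)/(p - 1)$, is sketched here:

$$\begin{aligned}
(p - 1)\,\mu_{p-1}^{2} + x_p^{2} - p\,\mu_{p}^{2}
&= \frac{(p\,\mu_{p} - x_p)^{2}}{p - 1} + x_p^{2} - p\,\mu_{p}^{2} \\
&= \frac{p^{2}\mu_{p}^{2} - 2p\,\mu_{p}x_p + x_p^{2} + (p - 1)\,x_p^{2} - p\,(p - 1)\,\mu_{p}^{2}}{p - 1} \\
&= \frac{p\,\mu_{p}^{2} - 2p\,\mu_{p}x_p + p\,x_p^{2}}{p - 1} \\
&= \frac{p\,(x_p - \mu_{p})^{2}}{p - 1}.
\end{aligned}$$

The three mean-dependent terms of Eq. 7 therefore collapse into a single squared quantity, which is the term added in Eq. 8.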

Now, once we initialize $\mu_{1} = x_1$ and $\sigma_{1} = 0$, we can compute the sample mean and variance using Eqs. 5 and 8 for $p = 2, 3, \ldots, n$ within the same loop. Note that the variance thus computed is guaranteed to be non-negative, because each application of Eq. 8 adds a non-negative term.
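The recursion can be sketched in a few lines. The example below is an illustrative implementation of Eqs. 5 and 8, not code from the original notes. The function names, the helper `one_pass_variance` (the running-sums formula, included only for comparison), and the sample values with a large common offset are all assumptions.

```python
def recursive_mean_variance(data):
    """Return (mean, variance) in one loop via the recursions of Eqs. 5 and 8."""
    it = iter(data)
    try:
        mean = float(next(it))  # mu_1 = x_1
    except StopIteration:
        raise ValueError("empty sample") from None

    scaled_var = 0.0  # holds (p - 1) * sigma_p^2, which is zero for p = 1
    p = 1
    for x in it:
        p += 1
        mean = ((p - 1) * mean + x) / p              # Eq. 5
        scaled_var += p * (x - mean) ** 2 / (p - 1)  # Eq. 8: adds a term >= 0

    if p < 2:
        raise ValueError("sample variance needs at least two data points")
    return mean, scaled_var / (p - 1)


def one_pass_variance(data):
    """Comparison only: variance from running sums of x and x squared."""
    n, sum_x, sum_x2 = 0, 0.0, 0.0
    for x in data:
        n += 1
        sum_x += x
        sum_x2 += x * x
    return (sum_x2 - sum_x * sum_x / n) / (n - 1)


# Made-up data: values 4, 7, 13, 16 (variance 30) shifted by a large offset.
# Shifting every value by a constant leaves the exact variance unchanged.
offset = 1e9
shifted = [offset + v for v in (4.0, 7.0, 13.0, 16.0)]

mean_r, var_r = recursive_mean_variance(shifted)
var_sums = one_pass_variance(shifted)
print("recursive:", var_r, " running sums:", var_sums)
```

In double precision the recursive value should stay close to 30 for this data, while the running-sums value is typically far off and may even come out negative, because it subtracts two numbers of order $10^{18}$ that agree in almost all of their digits.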

The coefficient of variation of the sample data, denoted by CV, is defined as

$$\mathrm{CV} = \frac{\sigma}{\mu}. \qquad (9)$$

Note that CV is independent of the units of measurement, since $\sigma$ and $\mu$ carry the same units.
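A quick numerical check of this unit independence is sketched below, using Python's standard `statistics` module (whose `stdev` uses the $n - 1$ divisor). The helper name `coefficient_of_variation` and the height values are made up for illustration.

```python
import statistics


def coefficient_of_variation(data):
    """Return sigma / mu for a sample; undefined when the mean is zero."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return statistics.stdev(data) / mean


# The same hypothetical measurements expressed in metres and in centimetres.
heights_m = [1.60, 1.72, 1.81, 1.68]
heights_cm = [100.0 * h for h in heights_m]

cv_m = coefficient_of_variation(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
print(cv_m, cv_cm)  # equal up to roundoff
```

Rescaling the data multiplies both $\sigma$ and $\mu$ by the same factor, so the ratio is unchanged.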


Michael Zeltkevic
1998-04-15