Next: Frequency Distribution Revisited Up: 10.001: Data Visualization and Previous: Quantitative Description of the

Variance, Standard Deviation and Coefficient of Variation

The most commonly used measure of variation (dispersion) is the sample standard deviation, $\sigma$. The square of the sample standard deviation is called the sample variance, defined as

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2. \qquad (2)$$

Expanding the squared term and using $\sum_{i=1}^{n} x_i = n\mu$ gives

$$\begin{aligned}
\sum_{i=1}^{n}(x_i - \mu)^2 &= \sum_{i=1}^{n}\left(x_i^2 - 2\mu x_i + \mu^2\right) \\
&= \sum_{i=1}^{n}x_i^2 - 2\mu\left(\sum_{i=1}^{n}x_i\right) + n\mu^2 \\
&= \left(\sum_{i=1}^{n}x_i^2\right) - 2n\mu^2 + n\mu^2 \\
&= \left(\sum_{i=1}^{n}x_i^2\right) - n\mu^2. \qquad (3)
\end{aligned}$$

So an alternate equation for computing the variance is given by

$$\sigma^2 = \frac{1}{n-1}\left[\left(\sum_{i=1}^{n}x_i^2\right) - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]. \qquad (4)$$

The advantage of Eq. 4 over Eq. 2 is that the two sums it requires, $\sum x_i^2$ (needed for $\sigma$) and $\sum x_i$ (needed for $\mu$), can be accumulated in a single loop, whereas Eq. 2 requires the precomputed value of $\mu$ before we can compute $\sigma$. For this reason, Eq. 4 is often used to compute the mean and variance together.
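As a concrete illustration of the single-loop computation described above, here is a minimal Python sketch (the function name and the use of a plain list are my own choices, not part of the notes):

```python
def mean_variance_one_loop(data):
    """Sample mean and variance via Eq. 4: accumulate sum(x_i) and
    sum(x_i^2) in a single pass, then combine them at the end."""
    n = len(data)
    s = 0.0   # running sum of x_i,   needed for the mean
    s2 = 0.0  # running sum of x_i^2, needed for the variance
    for x in data:
        s += x
        s2 += x * x
    mean = s / n
    var = (s2 - s * s / n) / (n - 1)  # Eq. 4
    return mean, var
```

Note that `s2 - s * s / n` is exactly the subtraction of two potentially large, nearly equal quantities discussed next.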

However, a close comparison of Eq. 2 and Eq. 4 reveals one important difference: Eq. 2 guarantees a non-negative variance because it expresses the variance as a sum of squares. This is not necessarily true of Eq. 4, where we subtract $\frac{1}{n}(\sum_{i=1}^{n}x_i)^2$ from $\sum_{i=1}^{n}x_i^2$. From a computational perspective, these two terms can be large and nearly equal for large samples, so roundoff error can make the computed difference inaccurate or even negative. We are therefore interested in developing an algorithm which computes (a) both the mean and the variance in the same loop, and (b) the variance as a sum of squares. How can this be accomplished? We can resort to developing recursion relations. Applying Eq. 1 to the first $p-1$ and the first $p$ data points and subtracting one result from the other, we get

$$p\,\mu_p = (p-1)\,\mu_{p-1} + x_p, \qquad (5)$$

where $\mu_p$ denotes the mean of the first $p$ data points of the sample. We can now compute the sample mean recursively by letting $\mu_1 = x_1$ and subsequently applying Eq. 5 for $p = 2, 3, \ldots, n$. We can also construct a simple recursion relation for computing $\sigma^2$ by applying Eq. 4 to the first $p-1$ and the first $p$ data points of the sample. This gives the two equations
$$\begin{aligned}
(p-2)\,\sigma_{p-1}^2 &= x_1^2 + x_2^2 + \cdots + x_{p-1}^2 - (p-1)\,\mu_{p-1}^2 \\
(p-1)\,\sigma_p^2 &= x_1^2 + x_2^2 + \cdots + x_p^2 - p\,\mu_p^2. \qquad (6)
\end{aligned}$$

Subtracting the first of Eq. 6 from the second gives

$$(p-1)\,\sigma_p^2 = (p-2)\,\sigma_{p-1}^2 + (p-1)\,\mu_{p-1}^2 + x_p^2 - p\,\mu_p^2, \qquad (7)$$

which can be rewritten (to eliminate the subtraction of nearly equal quantities) by substituting for $\mu_{p-1}$ from Eq. 5, yielding

$$(p-1)\,\sigma_p^2 = (p-2)\,\sigma_{p-1}^2 + \frac{p\,(x_p - \mu_p)^2}{p-1}, \qquad p = 2, 3, \ldots, n. \qquad (8)$$

Now, once we initialize $\mu_1 = x_1$ and $\sigma_1 = 0$, we can compute the sample mean and variance using Eqs. 5 and 8 for $p = 2, 3, \ldots, n$ within the same loop. Note that the variance thus computed is guaranteed to be non-negative.
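The recursive procedure above can be sketched in Python as follows (the function name and the accumulator variable are my own; the update itself is the one given by Eqs. 5 and 8):

```python
def mean_variance_recursive(data):
    """One-pass sample mean and variance via the recursions of
    Eqs. 5 and 8. The accumulator m holds (p-1)*sigma_p^2, which
    only ever grows by the non-negative term p*(x_p - mu_p)^2/(p-1),
    so the returned variance can never be negative."""
    mu = float(data[0])  # mu_1 = x_1
    m = 0.0              # (p-1)*sigma_p^2; sigma_1 = 0
    for p in range(2, len(data) + 1):
        x = data[p - 1]
        mu = ((p - 1) * mu + x) / p        # Eq. 5
        m += p * (x - mu) ** 2 / (p - 1)   # Eq. 8
    return mu, m / (len(data) - 1)
```

This update is essentially what the numerical literature calls Welford's algorithm for online variance.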

The coefficient of variation of the sample data, denoted by CV, is defined as

$$\mathrm{CV} = \frac{\sigma}{\mu}. \qquad (9)$$

Note that CV is dimensionless, and hence independent of the units of measurement.
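A small check of this unit independence, using Python's standard `statistics` module (the sample values and the meters-to-centimeters conversion are an invented example):

```python
import statistics

def coefficient_of_variation(data):
    """CV = sigma / mu (Eq. 9), using the sample standard deviation."""
    return statistics.stdev(data) / statistics.mean(data)

# Rescaling the data (e.g. converting meters to centimeters)
# multiplies both sigma and mu by the same factor, so CV is unchanged.
lengths_m = [1.2, 1.5, 1.1, 1.4]
lengths_cm = [100.0 * x for x in lengths_m]
```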

Michael Zeltkevic