Next: Variance, Standard Deviation and Up: 10.001: Data Visualization and Previous: Data Visualization/Graphical Analysis

Quantitative Description of the Data

There are primarily two types of measures we are interested in. The first type is known as the measures of location, or measures of central tendency. The most commonly used measure of this type is the sample mean, denoted by $\mu$ and computed as

 
$\displaystyle \mu = \frac{1}{n} \sum_{i=1}^{n} x_i.$ (1)

For the data given above, $\mu \approx 320$.

The mean value of the data set is a single number which we extract from the entire data set using Eq. 1. In that sense, it is a function which maps the sample data to a single number representing a measure of location, i.e., it tells us what the average value of the data set is. This is true of almost all quantitative statistical measures: On one hand, they allow us to represent large sets of data compactly. On the other hand, we have discarded much of the detail in the process of arriving at a compact quantitative representation. This is why it helps to do a qualitative analysis using statistical plots. Moreover, we should try to combine the various quantitative measures to get a comprehensive description of the data.

Now, what are the other measures of location? The median of the data set is defined as the value such that half of the observations in the sample have values less than it. In other words, if we sort the data in ascending order such that $x_1 \leq x_2 \leq x_3 \leq \dots \leq x_{n-1} \leq x_n$, the median is the middle value of the sorted data if $n$ is odd, or the arithmetic mean of the two middle values if $n$ is even. We will denote the median of the data set $X$ by $x_{med}$. For instance, if we sort the melting point data above in ascending order, we find that $x_{med} = (x_{25} + x_{26})/2 = 320$.
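The odd/even rule above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `median` and the two small samples are assumptions made for the example and are not part of the melting point data.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)  # ascending order: x_1 <= x_2 <= ... <= x_n
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value of the sorted data.
        return ordered[mid]
    # Even n: arithmetic mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical values, for illustration only.
odd_sample = [7, 1, 5]
even_sample = [7, 1, 5, 3]

print(median(odd_sample))   # 5
print(median(even_sample))  # 4.0, the mean of 3 and 5
```

Python indexes from zero, so for a sample of 50 observations the two values averaged are `ordered[24]` and `ordered[25]`, which correspond to $x_{25}$ and $x_{26}$ in the notation of the text.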

When do we use the median as a representative measure in preference to the mean? When a sample of relatively small size contains extreme points, the mean may not be a representative measure; in such cases we tend to use the median of the sample. As an example, consider the data (2.16, 2.37, 2.84, 3.01, 17.3). The mean value (about 5.54) is larger than 4 of the 5 observations in the sample because of the relatively large weight of the last datum. Here the median, 2.84, is a better representation of the central tendency.
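The effect of the extreme point can be checked numerically. The sketch below uses Python's standard `statistics` module on the five observations quoted above; the variable names are chosen only for this illustration.

```python
import statistics

# The five observations quoted in the text; 17.3 is the extreme point.
sample = [2.16, 2.37, 2.84, 3.01, 17.3]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)

# Count how many observations lie below the mean.
below_mean = sum(1 for x in sample if x < mean_value)

print(f"mean   = {mean_value:.3f}")    # mean   = 5.536
print(f"median = {median_value:.2f}")  # median = 2.84
print(f"observations below the mean: {below_mean} of {len(sample)}")
```

The single large value pulls the mean above four of the five observations, while the median stays among the bulk of the data.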

In the melting point data given above, the mean, the median and the mode (the most frequently occurring value) of the sample are practically the same (320). This is indicative of the symmetry of the distribution, as seen from Figure 3.

The idea of the median can be generalized to the concept of the percentile measure. Once the data have been sorted in ascending order, we can seek the observation $x_P$ such that $P$ percent of the (sorted) observations are below $x_P$. In this case, we call $x_P$ the $P$th percentile of the sample. The 25th, the 50th and the 75th percentiles are often referred to as the first, the second and the third quartiles respectively. The difference between the third and the first quartiles is called the interquartile range. For the melting point data of our example, the first and third quartiles are 316 and 325 respectively, so that the interquartile range is 9. The interquartile range is often used to describe the variation of the data. Its advantage over the range of the sample is that it is insensitive to extreme data points, which often result from observations made with relatively low confidence.
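A percentile can be estimated in several slightly different ways, and the text does not fix one. The sketch below assumes one common convention, linear interpolation between neighbouring sorted observations, so its results may differ a little from other conventions on small samples. The function names and the sample values are hypothetical and serve only as an illustration.

```python
def percentile(values, p):
    """Estimate the p-th percentile (0 <= p <= 100) of a non-empty sample.

    Assumed convention: linear interpolation between the two sorted
    observations that bracket the requested position.
    """
    if not values:
        raise ValueError("the percentile of an empty sample is undefined")
    if not 0 <= p <= 100:
        raise ValueError("p must lie between 0 and 100")
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100  # fractional index into sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def interquartile_range(values):
    """Difference between the third and the first quartiles."""
    return percentile(values, 75) - percentile(values, 25)


# Hypothetical values, for illustration only.
sample = [9, 1, 8, 2, 7, 3, 6, 4, 5]

print(percentile(sample, 25))       # first quartile: 3.0
print(percentile(sample, 50))       # second quartile (the median): 5.0
print(percentile(sample, 75))       # third quartile: 7.0
print(interquartile_range(sample))  # 4.0
```

With this convention the 50th percentile coincides with the median as defined earlier, for both odd and even sample sizes.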

The box and whisker plot, or the box plot for short, graphically represents the median, the first and the third quartiles and the extrema of the sample data. A box plot of the melting point data is given in Figure 4. In the case of multiple samples, we can use a box plot to represent each sample. In that case, the widths of the boxes can be used to represent the relative sizes of the samples.
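The five numbers that a box plot displays can be collected before any drawing is done. The sketch below is one possible way to compute them with Python's standard `statistics` module (Python 3.8 or later). The name `five_number_summary` is an assumption made for this example, and the `"inclusive"` quartile method is only one of several conventions. The sample is the five-point data set with an extreme value discussed earlier in this section.

```python
import statistics
from typing import NamedTuple


class FiveNumberSummary(NamedTuple):
    minimum: float
    first_quartile: float
    median: float
    third_quartile: float
    maximum: float


def five_number_summary(values):
    """Return the extrema, the quartiles and the median of a sample."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return FiveNumberSummary(min(values), q1, q2, q3, max(values))


# Small sample with one extreme point, taken from the text of this section.
sample = [2.16, 2.37, 2.84, 3.01, 17.3]
summary = five_number_summary(sample)

print(summary.minimum, summary.maximum)                # whisker ends: 2.16 17.3
print(summary.first_quartile, summary.third_quartile)  # box edges: 2.37 3.01
print(summary.median)                                  # line inside the box: 2.84
```

In this sample the box spans only 2.37 to 3.01 while the upper whisker reaches 17.3, which is how a box plot makes an extreme point visible at a glance.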


Michael Zeltkevic
1998-04-15