4.4. Describing the spread of the distribution of data
Using range
The spread of the data tells you about variability. This is useful
information: it enriches the summary information
provided by measures of centre. For example, suppose that your mark
in a particular subject was 65 and the mean was 60. These values
might be interpreted very differently if the marks varied from 56
to 65 (a range of 9) than if they varied from 16 to 92 (a range
of 76). The measure of spread given here is called the range: the
largest value minus the smallest. Although it is the simplest measure
of spread, it depends only on the two extreme values, so it is not very
useful for comparing variability between data sets of different sizes.
Other measures of spread, such as the interquartile range and the
standard deviation, provide more reliable information.
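As a minimal sketch of the marks example, the range can be computed as follows. The two lists of marks are made up for illustration (only their lowest and highest values and their mean of 60 come from the text), and the function name data_range is an assumption.

```python
def data_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)


# Hypothetical marks, chosen so that both sets have a mean of 60.
narrow_marks = [56, 57, 58, 59, 60, 61, 62, 62, 65]
wide_marks = [16, 45, 58, 62, 65, 70, 72, 92]

print(data_range(narrow_marks))  # marks vary from 56 to 65
print(data_range(wide_marks))    # marks vary from 16 to 92
```

Both sets have the same mean, yet a mark of 65 is the top mark in the first set and only a little above the middle in the second.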
Using standard deviation
The mean is the most commonly used measure of centre. When the mean
is used, a measure called the standard deviation
is generally used to measure the spread of the data. It measures
the overall deviation of observations from their mean by computing
the average of the squares of these deviations from the mean and
then finding the square root of that value. We can write it as:
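Written out from the description above, the population standard deviation takes the following form. The symbols are assumptions chosen here: N is the number of observations, x_i is the i-th observation and \mu is the population mean.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}
```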
Since most data comes from samples rather than populations, we
use the formula:
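The usual sample version divides by n - 1 rather than n. It is given here as the standard textbook form, with symbols that are again assumptions: n is the sample size, x_i is the i-th observation and \bar{x} is the sample mean.

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
```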
An adequate summary of the distribution of data from a single
variable requires both a measure of centre and a measure of spread.
If the mean and median are both used
to summarise 'centre', they can sometimes be used to describe the
shape of the distribution. If the median and the mean are virtually
the same, this indicates a distribution that is approximately symmetric
about the centre. If not, the difference between mean and median
is an indicator of an asymmetric shape.
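The following sketch shows one way to compute such a summary with Python's standard statistics module. The two data sets and the function name summarise are made up for illustration. Note that statistics.stdev uses the sample (n - 1) form, while statistics.pstdev uses the population form.

```python
import statistics


def summarise(values):
    """Return a measure of centre (mean, median) and of spread (sample sd)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


# Hypothetical data: the second set replaces the largest value with an outlier.
symmetric = [2, 4, 6, 8, 10]
skewed = [2, 4, 6, 8, 80]

print(summarise(symmetric))  # mean and median coincide
print(summarise(skewed))     # mean is pulled well above the median
```

In the first set the mean and median are equal, which suggests symmetry. In the second the single large value drags the mean upwards but leaves the median unchanged.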
Let's look at an example.
In 2003 the number of new private housing commencements was measured
by the US Bureau of the Census for each of the 50 states. The histogram
of the distribution follows, where the horizontal axis measures the
number of housing starts (in thousands) and the vertical (frequency)
axis measures the number of states:
(Weiers, 2005, p.85)
Notice the column labelled “more”: it contains the three outliers
143.1, 202.6 and 271.4, which were the states Texas, Florida and
California respectively. These are popular states with high rates of
internal migration, and each had a housing boom.
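The median number of starts was 18.3 (i.e., 18,300 homes) and
the mean was 34.6 (34,600 homes). The mean is almost double the
median, which points to an asymmetric distribution pulled upwards
by the few very large values.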
The lowest 10% of the data is cut off at 2.6 – this is called
the first decile. Another way of saying this is “there is
10% of the area of the distribution from 0 to 2.6”.
The lowest 20% is cut off at 5.0 – the second decile.
The lowest 30% is cut off at 8.8 – the third decile.
Each of the bands 0 to 2.6, 2.6 to 5.0 and 5.0 to 8.8 contains
10% of the area under the distribution. In total they contain the
(lowest) 30%.
Summarising the histogram with a continuous curve gives the
following shape for the distribution:
(Weiers, 2005, p.85)
Using deciles, percentiles and quartiles
Note the graphical interpretation of the median (18.3) and the deciles
in terms of area. The first decile is the point below which 10%
of the distribution lies. Areas are equal above and below the median,
and 10% of the area is below 2.6.
Deciles are a particular case of a more general measure, percentiles.
A data distribution is commonly ‘broken up’ into 100
sections. The thirtieth percentile is the data value (boundary)
below which 30% of the data lies.
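As a rough illustration, deciles and percentiles can be computed with statistics.quantiles from Python's standard library (version 3.8 or later). The data set here is hypothetical (the whole numbers 1 to 99), not the housing data. Different software uses slightly different interpolation rules, so cut points may differ a little between packages.

```python
import statistics

# Hypothetical data set: the whole numbers 1 to 99.
data = list(range(1, 100))

# n=10 returns the 9 cut points that split the data into 10 equal parts.
deciles = statistics.quantiles(data, n=10)

# n=100 returns the 99 cut points that split the data into 100 equal parts.
percentiles = statistics.quantiles(data, n=100)

third_decile = deciles[2]            # third cut point when splitting into 10
thirtieth_percentile = percentiles[29]  # thirtieth cut point when splitting into 100

print(third_decile, thirtieth_percentile)
```

The third decile and the thirtieth percentile are the same boundary, which is why deciles are described as a particular case of percentiles.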
The most commonly used percentiles are quartiles. If the median is
used as the measure of centre, then specific percentiles can be
used as a measure of variability and to show the general shape of
the distribution.
The first (or lower) quartile (25th percentile) is the data value
below which 25% of the data set lies. The second quartile is the
50th percentile, the data value below which 50% of the data set
lies (this is, of course, the median). The third (or upper) quartile
(75th percentile) is the data value below which 75% of the data set
lies. The lower quartile can be found as the median of all the values
below the overall median, and the upper quartile as the median of
the values above the overall median. If the lowest and highest
individual values are also reported, then you have what is called
the five number summary.
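The five number summary can be sketched in code using the median-of-halves rule described above. The function name five_number_summary and the sample values are assumptions for illustration. The sketch splits the sorted data by position and, when the number of values is odd, leaves the middle value out of both halves.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Quartiles are the medians of the lower and upper halves of the sorted
    data. Needs at least two values so that each half is non-empty.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, skip the middle value (the median itself).
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )


# Hypothetical data sets with an odd and an even number of values.
print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))
print(five_number_summary([8, 1, 6, 3, 2, 7, 4, 5]))
```

Statistical packages often use interpolation rules instead, so their quartiles can differ slightly from those produced by this rule.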
What is a box plot?
The five number summary of a distribution can be represented graphically
as a box plot. Individual box plots can be used to represent distributions.
For example, if you wanted to draw a box plot of the new housing
starts described above, it would look like the diagram
below (with the 3 outliers omitted for scaling reasons):
To consider another example
Box plots can be used to compare the prime-time ratings of two hypothetical
television networks. Ratings are a measurement of the television
viewing preferences of a sample of the population. The more people
who watch a television show, the more popular it is and the higher
its ratings. Usually advertisers pay more to have their products
advertised on a high-rating show than on a low-rating show.