Describing the shape of distributions is really only of interest
if you are looking at a quantitative variable. Stem-and-leaf plots
and histograms can tell you how the variable is distributed in the
population. For example, the stem-and-leaf plot of student marks
was fairly symmetric with short tails although there was also an
outlier.
"Stem and Leaf" Plot
However, the example above of new housing starts in the USA could
be described as skewed to the right because the higher values are
more spread out than the lower values. There are certain terms you
can use to describe the shape of a distribution.
What terminology can you use to describe the shape of a distribution?
Is the shape of the distribution symmetric, that is, is the shape similar on either side of a central axis (like a mirror image)? Or is the shape skewed, i.e. asymmetric?
Does the shape have long tails or short tails of data?
Does the shape have only one mode, so that it can be described as unimodal, or does it have more than one, so that it can be described as bimodal (two modes) or multimodal (three or more modes)?
Are there any outliers?
You can describe the shape of the distribution of quantitative
data using these terms. Describing the shape is a first step in
understanding the distribution of a variable. However, you cannot
describe the shape of a categorical variable in this way because
the categories of a categorical variable have no specific order
(think of hair colour).
When we describe the shapes of distributions, a common
reference shape is the normal distribution.
The normal distribution (normal curve) is also described as a “bell-shaped”
curve (because it has the same cross-sectional shape as a large
church bell). A typical normal curve is shown below:
Properties of a normal curve:
Symmetrical (identical shape on each side of the centre line)
One mode
Mean, median and mode coincide (at the centre)
Most data values are clustered around the centre (in fact about 68%
of the data lie within a band which extends from one Standard Deviation
(SD) unit below the mean to one SD unit above the mean)
A normal curve can be seen to arise from a (practical) situation
where there are a large number of histogram bars describing a data
set. The normal curve is the envelope which smooths out the shape
produced by the tops of the histogram bars.
The marks on the horizontal axis are spaced one SD unit apart.
You can see that the interval from 3 SD units below the mean to
3 SD units above the mean contains almost all (about 99.7%) of the
data. In other words, it is unusual for normally distributed data to
be more than 3 SD units away from the mean.
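The 68% and "almost 100%" figures can be checked numerically. The following is a minimal Python sketch, assuming an exactly normal distribution; the function name `normal_coverage` is illustrative only and not part of the course material. It uses the standard result that the proportion of a normal distribution within k SD units of the mean equals erf(k / sqrt(2)).

```python
import math

def normal_coverage(k):
    """Proportion of a normal distribution lying within k SD units of the mean."""
    return math.erf(k / math.sqrt(2))

# Print the coverage for 1, 2 and 3 SD units either side of the mean.
for k in (1, 2, 3):
    print(f"within {k} SD unit(s) of the mean: {normal_coverage(k):.4f}")
# within 1 SD unit(s) of the mean: 0.6827
# within 2 SD unit(s) of the mean: 0.9545
# within 3 SD unit(s) of the mean: 0.9973
```

Real data are never exactly normal, so the proportions you observe in a sample will only be close to these values.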
Consider the following practical data
The following data describe word lengths in two magazine articles.
The students who collected this data wanted to compare the nature
of articles in the popular magazine New Weekly with articles
in New Scientist. Their hypothesis was that the writing
style would reflect the type of magazine – i.e., that word
length distribution would be an indicator of the complexity of issues
discussed in each magazine.
A typical article was chosen at random from each magazine.
The first two graphs show:
(i) the frequency distribution of word lengths
(ii) the relative frequency distribution of word lengths.
word length    frequency    relative frequency
 1              10          0.040
 2              40          0.160
 3              55          0.220
 4              50          0.200
 5              25          0.100
 6              20          0.080
 7              15          0.060
 8              15          0.060
 9              10          0.040
10               5          0.020
11               3          0.012
12               2          0.008
Total          250          1.000
You can see that the two graphs above concerning New Weekly
are identical, except for the numbering on the vertical axes.
You can also see that:
the New Weekly article favours shorter words –
longer words are not common
the distribution (graph) is not symmetrical – it is skewed to the
right, because the longer word lengths are more spread out than
the shorter ones
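The relative frequency of each word length is its frequency divided by the total number of words in the article. The following is a minimal Python sketch, assuming the New Weekly frequencies from the table above; the names `new_weekly_counts` and `relative_frequencies` are illustrative only and not part of the course material.

```python
# Frequencies of each word length in the New Weekly article (from the table).
new_weekly_counts = {1: 10, 2: 40, 3: 55, 4: 50, 5: 25, 6: 20,
                     7: 15, 8: 15, 9: 10, 10: 5, 11: 3, 12: 2}

def relative_frequencies(counts):
    """Divide each frequency by the total so that the results sum to 1."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

new_weekly_rel = relative_frequencies(new_weekly_counts)

# Print word length, frequency and relative frequency side by side.
for length, rel in new_weekly_rel.items():
    print(f"{length:>2}  {new_weekly_counts[length]:>3}  {rel:.3f}")
```

Because every frequency is divided by the same total, the relative frequency graph has the same shape as the frequency graph; only the scale on the vertical axis changes.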
Now consider the New Scientist article.
The article from the New Scientist is longer – 425
words compared to 250. Hence, for meaningful comparison, we use
relative frequency on the vertical axis.
word length    frequency    relative frequency
 1               5          0.012
 2              12          0.028
 3              25          0.059
 4              35          0.082
 5              55          0.129
 6              70          0.165
 7              80          0.188
 8              53          0.125
 9              40          0.094
10              28          0.066
11              12          0.028
12              10          0.024
Total          425          1.000
This article contains more long words (no doubt because of its
scientific nature) and the distribution is roughly symmetric,
approximating a normal curve.
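One way to put numbers on the difference in shape is to compare the mean, median and mode of word length for each article. The following is a minimal Python sketch, assuming the frequencies from the two tables above; the names `new_weekly_counts`, `new_scientist_counts` and `summarise` are illustrative only and not part of the course material.

```python
import statistics

# Frequencies of each word length, taken from the two tables.
new_weekly_counts = {1: 10, 2: 40, 3: 55, 4: 50, 5: 25, 6: 20,
                     7: 15, 8: 15, 9: 10, 10: 5, 11: 3, 12: 2}
new_scientist_counts = {1: 5, 2: 12, 3: 25, 4: 35, 5: 55, 6: 70,
                        7: 80, 8: 53, 9: 40, 10: 28, 11: 12, 12: 10}

def summarise(counts):
    """Return the mean, median and mode of a frequency table."""
    # Expand the table into one value per word, e.g. {3: 2} -> [3, 3].
    values = [length for length, count in counts.items() for _ in range(count)]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

new_weekly_summary = summarise(new_weekly_counts)
new_scientist_summary = summarise(new_scientist_counts)
print("New Weekly:   ", new_weekly_summary)
print("New Scientist:", new_scientist_summary)
```

For New Weekly the mean (about 4.5) is above the median (4), which is above the mode (3), as you would expect for a distribution skewed to the right. For New Scientist the mean (about 6.6), median (7) and mode (7) are close together, which is consistent with a roughly symmetric shape.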
Now you have finished Module 2.
See your lecturer or
WebCT site to complete the quiz.