## Frequency Distributions

In statistics, a *variable* is an assignment of a
number to each element of the population. Thus, mathematically, a
variable is actually a *function* defined on the
population. If the population is a group of people, for example,
then typical variables of interest might be *height*, *weight*,
*number of cars owned*, and so on.

A *discrete* variable is one whose set of possible
values is finite or countably infinite. Discrete variables are
frequently *counting variables*, like the number of cars
owned, in the example above. By contrast, a *continuous*
variable is one whose set of possible values is an *interval*
of real numbers. Continuous variables represent quantities, such
as height and weight in the example above, that can, in theory,
be measured to any degree of accuracy. In practice, of course,
measuring devices have limited accuracy so data collected from a
continuous variable is necessarily discrete. That is, there is
only a finite (but perhaps very large) set of possible values
that can actually be measured.

A *frequency distribution* is a summary of the data set
in which the interval of possible values is divided into
subintervals, known as *classes*. For each class, the
number of data values in that class is recorded; this is the *frequency*
of the class. The *relative frequency* of the class is the
frequency of the class divided by the number of values in the
data set.

An essential requirement for a frequency distribution is that
the classes be *mutually exclusive* and *exhaustive*.
That is, each value in the data set must belong to one and only
one class. A desirable, but not essential, requirement is that the
classes have the same *width*.
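
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the applet: the function names, the sample boundaries, and the sample data are all made up for the example. Classes are taken to be half-open intervals, so a value on a shared boundary belongs to exactly one class, and a value outside every class is reported as an error rather than silently dropped.

```python
from bisect import bisect_right

def class_index(boundaries, x):
    """Return the 0-based index i with boundaries[i] <= x < boundaries[i + 1]."""
    i = bisect_right(boundaries, x) - 1
    if not 0 <= i < len(boundaries) - 1:
        # The classes are not exhaustive for this value.
        raise ValueError(f"{x} is not covered by any class")
    return i

def frequencies(boundaries, data):
    """Count the data values in each class."""
    freqs = [0] * (len(boundaries) - 1)
    for x in data:
        freqs[class_index(boundaries, x)] += 1
    return freqs

# Hypothetical example: three classes [0, 10), [10, 20), [20, 30).
boundaries = [0, 10, 20, 30]
data = [3, 10, 12, 29, 0]
freqs = frequencies(boundaries, data)
rel_freqs = [f / len(data) for f in freqs]
print(freqs, rel_freqs)
```

Because each value is assigned to exactly one class, the frequencies add up to the number of values in the data set.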

A *histogram* is simply a bar chart of a frequency
distribution. For each class, a rectangle is drawn whose base is
the class (on the horizontal axis) and whose height is the
frequency (or relative frequency).

In the frequency distribution applet, the horizontal axis
represents a continuous variable *x*. You can click anywhere on the
axis between 0.1 and 5.0 to generate a data set. We are assuming
that our measuring device, the mouse, is accurate to one decimal,
so the values that you generate are stored by the computer to
this accuracy. The frequencies and relative frequencies are
recorded in the table on the left. As you click on the axis, the
computer also draws the histogram of the frequency distribution.

You can choose any of five types of distributions:

- 50 classes of width 0.1: [0.05, 0.15), [0.15, 0.25), ..., [4.95, 5.05).
- 25 classes of width 0.2: [0.05, 0.25), [0.25, 0.45), ..., [4.85, 5.05).
- 10 classes of width 0.5: [0.05, 0.55), [0.55, 1.05), ..., [4.55, 5.05).
- 5 classes of width 1.0: [0.05, 1.05), [1.05, 2.05), ..., [4.05, 5.05).
- 1 class of width 5.0: [0.05, 5.05).
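
As a rough check on the list above, the following sketch generates the class boundaries for each of the five widths. The function name `class_boundaries` is hypothetical, and working in integer hundredths is a choice made here to sidestep floating-point drift; nothing is implied about how the applet itself computes its classes.

```python
def class_boundaries(width_tenths):
    """Boundaries of the classes of the given width (in tenths) covering [0.05, 5.05)."""
    lower, upper = 5, 505        # 0.05 and 5.05 expressed in hundredths
    step = width_tenths * 10     # class width expressed in hundredths
    return [h / 100 for h in range(lower, upper + 1, step)]

for width_tenths in (1, 2, 5, 10, 50):
    b = class_boundaries(width_tenths)
    print(f"width {width_tenths / 10}: {len(b) - 1} classes, "
          f"first [{b[0]}, {b[1]}), last [{b[-2]}, {b[-1]})")
```

Since the data values have one decimal and every boundary ends in 5 in the second decimal, no data value can ever fall exactly on a boundary.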

**1.** Click on the *x*-axis at
various points to generate a data set with 20 values. Vary the
class width over the five values from 0.1 to 5.0 and then back
again. For each choice of class width, switch between the
frequency histogram and the relative frequency histogram. Note
how the shape of the histogram changes as you perform these
operations.

As you can see, there is a tradeoff between the *number*
of classes and the *width* of the classes; these determine
the *resolution* of the frequency distribution. At one
extreme, when the class width is 0.1, each class contains a
single distinct value, because we are assuming that the original
data is recorded to one decimal accuracy. In this case, there is
no loss of information and we can recover the original data set
from the frequency distribution. On the other hand, it can be
hard to see the shape of the data when we have many classes of
small width.
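
To see why no information is lost at class width 0.1, here is a small sketch that rebuilds a data set from such a frequency distribution. The function name and the sample frequencies are invented for illustration. Only the values and their multiplicities are recovered; the order in which the points were originally generated is not part of the frequency distribution.

```python
def recover_data(freq_by_value):
    """Rebuild the sorted data set from a {value: frequency} table."""
    data = []
    for value in sorted(freq_by_value):
        data.extend([value] * freq_by_value[value])
    return data

# Hypothetical frequency table with class width 0.1 (one value per class).
freq_by_value = {1.2: 3, 0.4: 1, 3.7: 2}
print(recover_data(freq_by_value))
```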

**2.** Set the class width to 0.1.
Click on the *x*-axis to generate a data set with 10
distinct values and 20 values total. From the frequency
distribution, explicitly write down the 20 values in the data
set.

At the other extreme, when the class width is 5.0, there is only one class, which contains all of the possible values of the data set. In this case, all information is lost except the number of values in the data set.

Between these two extreme cases, when the width is 0.2, 0.5,
or 1.0, the frequency distribution gives us partial information,
but not complete information. These intermediate cases can show
the *shape* of the data in a useful way.

**3.** For the distribution in Exercise
2, increase the class width to 0.2, 0.5, 1.0, and 5.0. Note how
the histogram loses *resolution*; that is, how the
frequency distribution loses information about the original data
set.

It is important to realize that frequency data is inevitable
for a continuous variable. For example, suppose that our variable
represents the weight of a person (in pounds) and that our
measuring device (a scale) is accurate to 0.1 pound. If we
measure a person's weight as 153.2, then we are really saying
that the weight is in the interval [153.15, 153.25). Similarly,
when two persons have the same measured weight, the apparent
equality of the weights is really just an artifact of the
imprecision of the measuring device; actually the two persons
almost certainly do *not* have the exact same weight.
Thus, two persons with the same measured weight really give us a
frequency count of 2 for a certain interval. One of the main
purposes of this module is to encourage you to always think of
data in terms of distributions.
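
The weight example can be made concrete with a short sketch. It assumes that a reading stands for the half-open interval extending half the device's precision on either side of the reading; the function name is hypothetical, and `Decimal` is used only so that the endpoints print exactly.

```python
from decimal import Decimal

def measurement_interval(reading, precision):
    """Return (low, high) for the interval [low, high) that a reading stands for."""
    half = Decimal(precision) / 2
    r = Decimal(reading)
    return r - half, r + half

low, high = measurement_interval("153.2", "0.1")
print(f"[{low}, {high})")
```

Two people whose true weights both lie in this interval produce the same reading, which is why the reading is better thought of as a frequency count for the interval.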

In general, suppose that we have a frequency distribution for
a continuous variable *x* with *k* classes. We will
denote the class *boundaries* by

$$a_0, a_1, \ldots, a_k$$

so that the *i*'th *class* has lower boundary $a_{i-1}$ and upper boundary $a_i$.

The *frequency* of the *i*'th class will be
denoted *f*_{i} for *i* = 1, 2,
..., *k*. Because of the mutually exclusive and exhaustive
property of the frequency distribution we must have

$$f_1 + f_2 + \cdots + f_k = n,$$

where *n* is the number of values in the data set. The *relative
frequency* of the *i*'th class is

$$p_i = f_i / n.$$

Note that the relative frequencies must sum to 1:

$$p_1 + p_2 + \cdots + p_k = 1.$$

If we know *n*, the number of values in the data set,
then the *frequency* and the *relative frequency*
of a class are equivalent, in the sense that if we know one of
these, we can find the other.
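
A minimal sketch of this equivalence follows, using exact fractions so that the relative frequencies sum to exactly 1. The function names and the sample frequencies are illustrative only.

```python
from fractions import Fraction

def to_relative(freqs):
    """Relative frequencies p_i = f_i / n, where n is the total count."""
    n = sum(freqs)
    return [Fraction(f, n) for f in freqs]

def to_frequencies(rel_freqs, n):
    """Recover the frequencies f_i = n * p_i from relative frequencies and n."""
    return [int(p * n) for p in rel_freqs]

freqs = [2, 5, 9, 4]                  # hypothetical frequencies, n = 20
rel = to_relative(freqs)
print(rel, sum(rel))
print(to_frequencies(rel, sum(freqs)))
```

Without *n*, the relative frequencies alone determine the frequencies only up to a common factor.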

**4.** Click on the *x*-axis to
generate a data set that occupies at least 10 classes and has at least 20
values total. Vary the class width over the five values and for
each class width, switch between the frequency histogram and the
relative frequency histogram. Note that the frequency histogram
and the relative frequency histogram look the same, except for
the scale on the vertical axis.

The *width* of the *i*'th class is

$$w_i = a_i - a_{i-1}.$$

When the class widths are all the same, we will use *w*
to denote the common value.

The *class mark* of a class is the midpoint of the
class. For the *i*'th class, we will denote this by

$$x_i = \frac{a_{i-1} + a_i}{2}.$$

It is usually best to think of a frequency distribution as an approximation of the original data set in which all the values in a class have been "rounded" to the class mark.
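
The "rounding" interpretation can be sketched as follows. The example assumes equal-width classes starting at 0.05, as in the applet's class schemes; the constant `LOWER`, the function name, and the sample data are introduced here for illustration. `Decimal` keeps the one-decimal data and the boundaries exact.

```python
from decimal import Decimal

LOWER = Decimal("0.05")   # assumed lowest class boundary a_0

def class_mark(x, width):
    """Midpoint of the equal-width class [a_lo, a_lo + width) that contains x."""
    i = (x - LOWER) // width      # 0-based class index
    a_lo = LOWER + i * width      # lower boundary of that class
    return a_lo + width / 2

data = [Decimal(s) for s in ("0.4", "1.2", "1.3", "4.9")]
width = Decimal("0.5")
approx = [class_mark(x, width) for x in data]
print(approx)
```

With width 0.5, the distinct values 1.2 and 1.3 are both replaced by the same class mark, which is exactly the information the frequency distribution discards.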

In general, a frequency distribution can represent the
variable *x* over the entire population or over a sample
(subset) of the population. In either case, numerical
characteristics that capture interesting features of the
distribution are important. When the distribution represents the
entire population, such characteristics are called *parameters*;
when the distribution represents a sample of the population, such
characteristics are called *statistics*.

The distinction is not important in this module, so we will assume that the frequency distribution represents the entire population.

We define the *minimum value* of a distribution to be the
smallest class mark whose class has positive frequency and the *maximum
value* to be the largest class mark whose class has positive
frequency. These are parameters of the distribution.
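
A short sketch of these two parameters, assuming the distribution is given as parallel lists of class marks and frequencies (the names and numbers below are made up for the example):

```python
def min_max_values(marks, freqs):
    """Smallest and largest class marks among classes with positive frequency."""
    occupied = [x for x, f in zip(marks, freqs) if f > 0]
    if not occupied:
        raise ValueError("the distribution has no data values")
    return min(occupied), max(occupied)

marks = [0.3, 0.8, 1.3, 1.8, 2.3]   # class marks for width 0.5
freqs = [0, 4, 7, 0, 0]             # hypothetical frequencies
print(min_max_values(marks, freqs))
```

Empty classes at either end are ignored, so the result depends on the data and the class width, not on the full range of the axis.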

In the applet, the number of points *n* and the minimum
and maximum value are recorded in the second table.

A *uniform* distribution is one in which all the
non-empty classes have the same frequency.

A *modal class* is any class with maximum frequency. A *unimodal*
distribution is one whose histogram has a single peak, so that
the frequencies at first increase and then decrease. A *bimodal*
distribution is one whose histogram has two peaks, so that the
frequencies at first increase, then decrease, then increase
again, and finally decrease again. Similarly, there can be *trimodal*
distributions and so on.
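
These definitions can be read quite literally in code. The sketch below finds the modal classes and counts peaks, where a peak is taken to be a run of equal frequencies that is strictly higher than the frequencies on both sides; that reading of "peak", like the function names, is an assumption of this example. For small data sets, random bumps will register as extra peaks, so the count is only a rough guide.

```python
from itertools import groupby

def modal_classes(freqs):
    """Indices of all classes whose frequency equals the maximum frequency."""
    top = max(freqs)
    return [i for i, f in enumerate(freqs) if f == top]

def count_peaks(freqs):
    """Number of local maxima, counting a flat-topped peak once."""
    levels = [f for f, _ in groupby(freqs)]   # collapse runs of equal values
    padded = [-1] + levels + [-1]             # sentinels below any frequency
    return sum(1 for j in range(1, len(padded) - 1)
               if padded[j - 1] < padded[j] > padded[j + 1])

print(modal_classes([1, 4, 2, 5, 1]), count_peaks([1, 4, 2, 5, 1]))
print(modal_classes([1, 3, 5, 5, 1]), count_peaks([1, 3, 5, 5, 1]))
```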

A distribution is said to be *symmetric* if the
histogram is roughly symmetric with respect to one of the class
marks *x*_{j}, so that classes that are
the same distance to the right and to the left of *x*_{j}
have the same frequency.
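
An exact version of this condition is easy to test, as in the sketch below. The function names are hypothetical, and the check demands exactly equal frequencies, which is stricter than the "roughly symmetric" of the definition; classes beyond either end of the list are treated as having frequency 0.

```python
def is_symmetric_about(freqs, j):
    """True if classes equally far left and right of class j have equal frequency."""
    k = len(freqs)

    def f(i):
        return freqs[i] if 0 <= i < k else 0   # classes off the ends are empty

    return all(f(j - d) == f(j + d) for d in range(1, k))

def symmetry_centers(freqs):
    """All class indices about which the distribution is exactly symmetric."""
    return [j for j in range(len(freqs)) if is_symmetric_about(freqs, j)]

print(symmetry_centers([0, 1, 3, 5, 3, 1, 0, 0, 0]))
print(symmetry_centers([1, 2, 3]))
```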

A unimodal distribution is said to be *skewed right* if
the histogram has a long tail to the right of the modal class.
The distribution is said to be *skewed left* if the
histogram has a long tail to the left of the modal class. Thus,
skewed distributions are not symmetric.
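
One crude way to compare the two tails is to measure how many classes separate the modal class from the lowest and highest occupied classes. This is only a heuristic sketch under the assumption that the distribution is unimodal and nonempty; the function names are invented here, and the text above does not prescribe any particular measure of tail length.

```python
def tail_lengths(freqs):
    """Classes from the (first) modal class to the lowest and highest occupied classes."""
    occupied = [i for i, f in enumerate(freqs) if f > 0]
    mode = max(range(len(freqs)), key=lambda i: freqs[i])  # first modal class
    return mode - occupied[0], occupied[-1] - mode

def skew_direction(freqs):
    """'right', 'left', or 'neither', based only on which tail is longer."""
    left, right = tail_lengths(freqs)
    if right > left:
        return "right"
    if left > right:
        return "left"
    return "neither"

print(skew_direction([5, 3, 2, 1, 1]))
print(skew_direction([1, 1, 2, 3, 5]))
print(skew_direction([1, 3, 5, 3, 1]))
```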

A *U-distribution* is one whose histogram has the shape
of the letter *U*, with large frequencies near the minimum
and maximum values and small frequencies in the middle.

**5.** In each case below, set the
class width to 0.1 and click on the axis to generate a
distribution of the given type with 30 points. Now increase
the class width to each of the other four values and describe the
type of distribution.

- A uniform distribution.
- A symmetric unimodal distribution.
- A unimodal distribution that is skewed right.
- A unimodal distribution that is skewed left.
- A symmetric bimodal distribution.
- A *U*-distribution.

A relative frequency distribution has the mathematical
structure of a discrete probability distribution. Indeed, suppose
we perform the random experiment
of selecting a value at random from the data set and recording
the mark *X* of the class containing the value. Then *X*
is a discrete random variable
with density function

$$P(X = x_i) = p_i \quad \text{for } i = 1, 2, \ldots, k.$$
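
The random experiment can be simulated directly. In the sketch below, choosing a data value at random and recording its class mark is implemented as a weighted choice among the class marks, with the frequencies as weights; the function names and the sample distribution are assumptions made for the example.

```python
import random
from fractions import Fraction

def class_mark_pmf(marks, freqs):
    """Density of X: maps each occupied class mark x_i to p_i = f_i / n."""
    n = sum(freqs)
    return {x: Fraction(f, n) for x, f in zip(marks, freqs) if f > 0}

def draw_mark(marks, freqs, rng=random):
    """Simulate the experiment once: pick a class mark with probability p_i."""
    return rng.choices(marks, weights=freqs, k=1)[0]

marks = [0.3, 0.8, 1.3, 1.8]   # hypothetical class marks
freqs = [2, 5, 0, 3]           # hypothetical frequencies, n = 10
pmf = class_mark_pmf(marks, freqs)
print(pmf)
print([draw_mark(marks, freqs) for _ in range(5)])
```

Over many repetitions, the proportion of draws equal to each class mark should settle near the corresponding relative frequency.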

## Descriptive Statistics