


3.3 MEASURING VARIATION OR SPREAD

The two data sets below (List 1 and List 2) have the same mean, median, and mode, but the values obviously differ in another respect: the variation or spread of the values.

The values in List 1 are much more tightly clustered around the center value of 60. The values in List 2 are much more dispersed or spread out.

List 1: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65

mean = median = mode = 60

[Dot plot of List 1 on a scale from 35 to 85]

List 2: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85

mean = median = mode = 60

[Dot plot of List 2 on a scale from 35 to 85]

Range

The range is the simplest measure of variability or spread.

Range is just the difference between the largest value and the smallest value.

Range can give a distorted picture of the actual pattern of variation.

Two distributions: same range but different patterns of variation.

The first distribution has most of its values far from the center, while the second distribution has most of its values closer to the center.

[Figure: dot plots of the two distributions, each on a scale from 20 to 30]
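As a rough illustration of how two distributions can share a range yet differ in pattern, the sketch below uses two made-up lists on the same 20 to 30 scale. The data, the helper name `value_range`, and the "within 2 of the center" count are assumptions for illustration only; they are not read off the figure.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical data on the 20-30 scale (NOT taken from the original figure).
ends_heavy = [20, 20, 20, 21, 21, 25, 29, 29, 30, 30, 30]
center_heavy = [20, 23, 24, 24, 25, 25, 25, 26, 26, 27, 30]

def count_near_center(values, center=25, width=2):
    """Count the values that lie within `width` of `center`."""
    return sum(1 for x in values if abs(x - center) <= width)

for name, data in [("ends-heavy", ends_heavy), ("center-heavy", center_heavy)]:
    print(name, "range =", value_range(data),
          "| values within 2 of 25 =", count_near_center(data))
```

Both lists have a range of 10, yet only one value of the first list lies near the center, compared with nine values of the second.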

Interquartile Range

The interquartile range measures the spread of the middle 50% of the data. You first find the median (represented by Q2, the value that divides the data into two halves), and then find the median of each half. The three values that divide the data into four parts are called the quartiles, represented by Q1, Q2, and Q3. The difference between the third quartile and the first quartile is called the interquartile range, denoted by IQR = Q3 - Q1.

Example Quartiles for Age

The ages of the 20 subjects in the medical study are listed below in order.

32, 37, 39, 40, 41, 41, 41, 42, 42, 43,

44, 45, 45, 45, 46, 47, 47, 49, 50, 51

The histogram of the ages is also provided.

a) Calculate the median age.

b) Calculate the first Quartile Q1 for this age data.

c) Calculate the third Quartile Q3 for this age data.

d) Calculate the range for this age data.
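One way to check parts (a) through (d) is a short Python sketch. It assumes the "median of each half" rule described in this section; other software may use slightly different quartile rules. The function name `quartiles` is only illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-each-half rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                # observations below the median position
    upper = data[len(data) - half:]    # observations above the median position
    # When n is odd, the middle observation is in neither half.
    return median(lower), median(data), median(upper)

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

q1, q2, q3 = quartiles(ages)
age_range = max(ages) - min(ages)
print("Q1 =", q1, "median =", q2, "Q3 =", q3, "range =", age_range)
```

For the age data this gives Q1 = 41, median = 43.5, Q3 = 46.5, and a range of 19.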


[pic]

[pic]

We see that the distribution of age is approximately symmetric and that the quartiles are about the same distance from the median.

The quartiles are actually the 25th, 50th, and 75th percentiles.

[pic]

Five-Number Summary

Five-number summary:

Minimum, Q1, Median, Q3, Maximum

[pic]

To Build a Basic Boxplot

▪ List the data values in order from smallest to largest.

▪ Find the five-number summary: minimum, Q1, median, Q3, and maximum.

▪ Locate the values for Q1, the median and Q3 on the scale. These values determine the “box” part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.

▪ Draw lines (called whiskers) from the midpoints of the ends of the box out to the minimum and maximum.

Example

Five-Number Summary and Boxplot for Age

Problem

Consider the (ordered) ages of the 20 subjects in a medical study:

32, 37, 39, 40, 41, 41, 41, 42, 42, 43,

44, 45, 45, 45, 46, 47, 47, 49, 50, 51

The five-number summary for the age data is given by:

min = 32, Q1 = 41, median = 43.5, Q3 = 46.5, and max = 51.

Draw the basic boxplot.

[pic]

The distance between the median and the quartiles is roughly the same, supporting the rough symmetry of the distribution as seen previously from the histogram.

Side-by-side boxplots are helpful for comparing two or more distributions with respect to the five-number summary.

Although the median of the first process is closer to the target value of 20.000 cm, the second process produces a less variable distribution.

Using the 1.5 x IQR Rule to Identify Outliers and Build a Modified Boxplot

▪ List the data values in order from smallest to largest.

▪ Find the five-number summary: minimum, Q1, median, Q3, and maximum.

▪ Locate the values for Q1, the median and Q3 on the scale. These values determine the “box” part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.

▪ Find the IQR = Q3 – Q1.

▪ Compute the quantity STEP = 1.5 x (IQR)

▪ Find the location of the inner fences by taking 1 step out from each of the quartiles

lower inner fence = Q1 – STEP;

upper inner fence = Q3 + STEP.

▪ Draw the lines (whiskers) from the midpoints of the ends of the box out to the smallest and largest values WITHIN the inner fences.

▪ Observations that fall OUTSIDE the inner fences are considered potential outliers. If there are any outliers, plot them individually along the scale using a solid dot.

[pic]

Five-number summary:

min = 1, Q1 = 21, median = 32, Q3 = 66, max = 325
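Applying the 1.5 x IQR rule to this five-number summary can be worked as follows. This is a sketch of the arithmetic only, using the STEP and inner-fence definitions from the list of steps:

```latex
\begin{aligned}
\text{IQR} &= Q_3 - Q_1 = 66 - 21 = 45 \\
\text{STEP} &= 1.5 \times \text{IQR} = 1.5 \times 45 = 67.5 \\
\text{lower inner fence} &= Q_1 - \text{STEP} = 21 - 67.5 = -46.5 \\
\text{upper inner fence} &= Q_3 + \text{STEP} = 66 + 67.5 = 133.5
\end{aligned}
```

Since the maximum, 325, lies above 133.5, it is a potential outlier. The minimum, 1, lies inside the fences.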

Example Any Age Outlier?

Let’s apply the "rule of thumb" to our age data set to assess if there are any outliers.

(a) Construct the fences for the modified boxplot based on the 1.5 * IQR rule.

(b) Are there any outliers using the 1.5 * IQR rule?

(c) Construct the modified boxplot.
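The fence calculation for the age data can be sketched in Python as below. The quartiles are taken from the five-number summary given earlier for these ages (Q1 = 41, Q3 = 46.5); the variable names are illustrative assumptions.

```python
ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

q1, q3 = 41, 46.5          # quartiles from the five-number summary in the text
iqr = q3 - q1              # spread of the middle 50% of the data
step = 1.5 * iqr           # one "step" in the 1.5 x IQR rule

lower_fence = q1 - step
upper_fence = q3 + step

# Observations outside the inner fences are potential outliers.
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

# Whiskers extend to the smallest and largest values WITHIN the fences.
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", lower_fence, upper_fence)
print("potential outliers:", outliers)
print("whiskers end at:", whisker_low, whisker_high)
```

With these values the fences are 32.75 and 54.75, so the age 32 is flagged as a potential outlier and the lower whisker stops at 37.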

[pic]

Let's Do It! 1 (3 min)

Five-Number Summary and Outliers

Let's Do It! 2 (3 min)

Cost of Running Shoes

The prices for 12 comparable pairs of running shoes produced the following boxplot.

[pic]

(a) What was the approximate range of prices for such running shoes?

Range = ______________

(b) Twenty-five percent of the shoes cost more than approximately what amount?

$ _____________

Let's Do It! 3 (10 min)

Comparing Ages—Antibiotic Study

Variable = age for 23 children randomly assigned to one of two treatment groups.

(a) Give the five-number summary for each of the two treatment groups. Comment on your results.

Amoxicillin Group (n=11): 8 9 9 10 10 11 11 12 14 14 17

Five-number summary:

Cefadroxil Group (n=12): 7 8 9 9 9 10 10 11 12 13 14 16

Five-number summary:

(b) Make side-by-side boxplots for the antibiotic study data in part (a).

(c) Using our “rule of thumb,” are there any outliers for the Amoxicillin group? If so, modify your boxplot above.

(d) Using our “rule of thumb,” are there any outliers for the Cefadroxil group? If so, modify your boxplot above.

Standard Deviation

... a measure of the spread of the observations from the mean.

... think of the standard deviation as an “average (or standard) distance of the observations from the mean.”

Example 5.9 Standard Deviation—What Is It?

[pic]

Deviations: -4, 1, 3

Squared Deviations: 16, 1, 9

|Observation $x$ |Deviation $x - \bar{x}$ |Squared Deviation $(x - \bar{x})^2$ |

|0 |0 - 4 = -4 |16 |

|5 |5 - 4 = 1 |1 |

|7 |7 - 4 = 3 |9 |

|mean = 4 |sum always = 0 |sum = 26 |

[pic]

[pic]
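The calculation for the three observations 0, 5, and 7 can be traced step by step. The sketch below assumes the sample version of the formula (dividing by n - 1) and cross-checks the result against Python's `statistics.stdev`.

```python
from math import sqrt
from statistics import stdev

observations = [0, 5, 7]
n = len(observations)

mean = sum(observations) / n                      # 4.0
deviations = [x - mean for x in observations]     # distance of each value from the mean
squared = [d ** 2 for d in deviations]            # squaring removes the signs

variance = sum(squared) / (n - 1)                 # sample variance: divide by n - 1
std_dev = sqrt(variance)                          # back to the original units

print("deviations:", deviations, "sum =", sum(deviations))
print("squared deviations:", squared, "sum =", sum(squared))
print("variance =", variance, "standard deviation =", round(std_dev, 3))
```

The squared deviations sum to 26, so the sample variance is 26 / 2 = 13 and the standard deviation is the square root of 13, about 3.6.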

Interpretation of the Standard Deviation

Think of the standard deviation as roughly an average distance of the observations from their mean. If all of the observations are the same, then the standard deviation will be 0 (i.e. no spread). Otherwise the standard deviation is positive and the more spread out the observations are about their mean, the larger the value of the standard deviation.

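If $x_1, x_2, \ldots, x_n$ denote a sample of n observations,

the sample variance is denoted by:

$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Sample standard deviation, denoted by $s$,

is the square root of the variance: $s = \sqrt{s^2}$.

The population standard deviation, denoted by the Greek letter $\sigma$ (sigma), is the square root of the population variance and, for a population of $N$ values with mean $\mu$, is computed as:

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$.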

Remarks:

The variance is measured in squared units. By taking the square root of the variance we bring this measure of spread back into the original units.

Just as the mean is not a resistant measure of center, the standard deviation is not a resistant measure of spread, because it uses the mean in its definition. It is heavily influenced by extreme values.

There are statistical arguments that support why we divide by $n - 1$ instead of n in the denominator of the sample standard deviation.

Let's Do It! 4 (4 min) 5.13 Increasing Spread

Consider the following three data sets.

I: 20 20 20 II: 18 20 22 III: 17 20 23

(a) Which data set will have the smallest standard deviation?

(b) Which data set will have the largest standard deviation?

(c) Find the standard deviation for each data set and check your answers to (a) and (b).

Think About It (3 min)

Given that two (or more) sets of n observations yield the same standard deviation, will these sets show the same amount of variability? Just what is variability anyway?

Example

There Are Many Measures of Variability

[pic][pic] [pic][pic]

Consider the following four data sets along with their histograms:

Data Set I

2 3 3 3 4 4 4 4 5 5 5 5 5

Data Set II

3 3 3 3 3 4 4 4 4 5 5 5 6

Data Set III

2 3 3 4 4 4 4 4 4 4 5 5 6

Data Set IV

3 3 3 3 3 3 4 5 5 5 5 5 5

(a) Calculate the mean for each data set.

(b) Calculate the range for each data set.

(c) Calculate the interquartile range, IQR, for each data set.

(d) Calculate the standard deviation for each data set.

(e) Which data set is most variable? Explain.

The mean for all four distributions is $\bar{x} = 4$. The table presents three measures of variability for each of the four distributions:

|Measure of Variability |Distribution I |Distribution II |Distribution III |Distribution IV |

|Range |3 |3 |4 |2 |

|IQR |2 |2 |1 |2 |

|Std dev |1 |1 |1 |1 |

If we look at the range: Distribution III is most variable; if we look at the IQR: Distribution III is least variable; while all four distributions have the SAME standard deviation.
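The three measures for the four data sets can be recomputed with a short sketch. It assumes the median-of-each-half rule for quartiles used in this section and the sample standard deviation (n - 1 in the denominator); the helper name `iqr` is illustrative.

```python
from statistics import median, stdev

def iqr(values):
    """Interquartile range using the median-of-each-half rule."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])               # median of the lower half
    q3 = median(data[len(data) - half:])   # median of the upper half
    return q3 - q1

data_sets = {
    "I":   [2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5],
    "II":  [3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6],
    "III": [2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6],
    "IV":  [3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 5],
}

# For each data set: (range, IQR, sample standard deviation)
summary = {
    name: (max(d) - min(d), iqr(d), stdev(d))
    for name, d in data_sets.items()
}

for name, (rng, spread, s) in summary.items():
    print(f"Data Set {name}: range = {rng}, IQR = {spread}, std dev = {s:.2f}")
```

The output should agree with the table: the ranges and IQRs differ across the data sets while the standard deviation is 1 for all four.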

Some people associate variability with range while others associate variability with how values differ from the mean. There are many measures of variability, with the standard deviation being the most widely used measure. But keep in mind, a distribution with the smallest standard deviation is not necessarily the distribution that is least variable with respect to other definitions or to your own definition of variability. (Reference: A. J. Nitko, (1983), Educational Tests and Measurement: An Introduction.)

Think About It

What do you think would happen to the measures of variability if the last value in all four of the preceding data sets were changed to 16?

IQR and Standard Deviation

The interquartile range, IQR, is the distance between the first and third quartiles (Q3 - Q1), and measures the spread of the middle 50% of the data. When the median is used as a measure of center, the IQR is often used as a measure of spread. For skewed distributions, or distributions with outliers, the IQR tends to be a better measure of spread if your goal is to summarize that distribution.

Adding the minimum and maximum values to the median and quartiles results in the five-number summary. A graphical display of the five-number summary is a boxplot, and the length of the box corresponds to the IQR.

The standard deviation is roughly the average distance of the observed values from their mean. The mean and the standard deviation are most useful for approximately symmetric distributions with no outliers. In the next chapter we will discuss an important family of symmetric distributions, called the normal distributions, for which the standard deviation is a very useful summary.

Tip:

The numerical summaries presented in this chapter provide information about the center and spread of a distribution, but a graph, such as a histogram or stem-and-leaf plot, provides the best picture of the overall shape of the distribution.

Graph your data first!

Variance and Standard Deviation for Grouped Data

The procedure for finding the variance and standard deviation for grouped data is similar to that for finding the mean for grouped data, and it uses the midpoints of each class.

Example

The data represent the number of miles that 20 runners ran during one week. Find the variance and the standard deviation for the frequency distribution of the data.

Solution

Step 1 Make a table as shown, and find the midpoint of each class.

Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.

$1 \cdot 8 = 8$, $2 \cdot 13 = 26$, ..., $2 \cdot 38 = 76$

Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.

$1 \cdot 8^2 = 64$, $2 \cdot 13^2 = 338$, ..., $2 \cdot 38^2 = 2888$

Step 4 Find the sums of columns B, D, and E. The sum of column B is n, the sum of column D is $\sum f \cdot X_m$, and the sum of column E is $\sum f \cdot X_m^2$, where $f$ is the class frequency and $X_m$ is the class midpoint. The completed table is shown.

Step 5 Substitute in the formula and solve for $s^2$ to get the variance.

Step 6 Take the square root to get the standard deviation. [pic]
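Steps 1 through 6 can be sketched in Python as below. Two things here are assumptions, because the table and the formula are not reproduced in the text: the middle midpoints and frequencies (only 8, 13, 38 and the frequencies 1, 2, 2 appear above; the rest are placeholders chosen so that n = 20), and the shortcut formula $s^2 = \dfrac{n \sum f X_m^2 - (\sum f X_m)^2}{n(n - 1)}$, which is a standard computational form of the sample variance for grouped data.

```python
from math import sqrt

# Class midpoints (column C). Only 8, 13, and 38 appear in the text;
# the values in between ASSUME equally spaced classes.
midpoints = [8, 13, 18, 23, 28, 33, 38]

# Frequencies (column B). Only the first two and the last are given in the
# text; the middle four are ASSUMED placeholders that make the total 20.
frequencies = [1, 2, 3, 5, 4, 3, 2]

# Step 2: frequency times midpoint (column D).
col_d = [f * m for f, m in zip(frequencies, midpoints)]

# Step 3: frequency times squared midpoint (column E).
col_e = [f * m ** 2 for f, m in zip(frequencies, midpoints)]

# Step 4: column sums.
n = sum(frequencies)
sum_d = sum(col_d)
sum_e = sum(col_e)

# Step 5: shortcut formula for the sample variance of grouped data.
variance = (n * sum_e - sum_d ** 2) / (n * (n - 1))

# Step 6: the standard deviation is the square root of the variance.
std_dev = sqrt(variance)

print("n =", n, "| sum D =", sum_d, "| sum E =", sum_e)
print("variance =", round(variance, 1), "| standard deviation =", round(std_dev, 1))
```

The first two and last entries of columns D and E match the products shown in Steps 2 and 3; the final variance depends on the assumed middle frequencies.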

Let's Do It! 5

|Interval |Frequency |

|29.50-69.45 |5 |

|69.50-89.45 |10 |

|89.50-99.45 |11 |

|99.50-109.45 |19 |

|109.50-119.45 |17 |

|119.50-129.45 |20 |

|129.50-139.45 |12 |

|139.50-169.45 |6 |

The data show the distribution of the birth weights (in oz.) of 100 consecutive deliveries. Find the variance and the standard deviation.

Practice Exercises from Textbook for Section 3.3

Page 129: 1-7 all, 9-11 all, 16, 18-21 all

Page 157: 1-12 all, 16, 17, and 18

TI Quick Steps

Obtaining Summary Measures

Step 1 Clear data.

Step 2 Enter data to be summarized.

Step 3 Obtain the summary measures for the data in L1.

Summary measures are obtained by requesting the 1-Var Stats from under the STAT CALC menu list. The sequence of buttons is as follows:

The 1-Var Stats are now displayed in the window. Notice that both the sample standard deviation s and the population standard deviation σ are provided; which one to use depends on whether the values in L1 are a sample or the entire population. The only mean provided is $\bar{x}$, but this would be the population mean μ if the values are the entire population. To find more information, in particular the five-number summary, press the down arrow button.

Producing a Boxplot

Step 1 Clear data and plots

Step 2 Enter data to be plotted

Step 3 Setting the STAT PLOT options for a boxplot.

Finally, set the STAT PLOT options for producing a boxplot of the data in L1 as Plot 1.

The sequence of steps is as follows:

Press the ZOOM button and then “9” to have the boxplot displayed. Use the TRACE button and the right and left arrow keys to see values for the five-number summary. Note that the modified boxplot type is the 4th graph icon in the Type list.

-----------------------

Finding the Quartiles

1. Find the median of all of the observations.

2. First Quartile = Q1 = median of observations that fall below the median.

3. Third Quartile = Q3 = median of observations that fall above the median.

Notes

▪ When the number of observations is odd, the middle observation is the median. This observation is not included in either of the two halves when computing Q1 and Q3.

▪ Although different books, calculators, and computers may use slightly different ways to compute the quartiles, they are all based on the same idea.

▪ In a left-skewed distribution, the first quartile will be farther from the median than the third quartile is. If the distribution is symmetric, the quartiles should be the same distance from the median.

DEFINITION:

The pth percentile is the value such that p% of the observations fall at or below that value and (100 - p)% of the observations fall at or above that value.
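The definition can be turned into a simple check. The sketch below reads "p% of the observations" as "at least p%", which is an assumption needed when values are tied; the function name `satisfies_percentile` is illustrative. It is applied to the quartiles of the age data from the earlier example.

```python
def satisfies_percentile(values, candidate, p):
    """Check whether `candidate` meets the pth-percentile definition:
    at least p% of the values at or below it, and
    at least (100 - p)% of the values at or above it."""
    n = len(values)
    at_or_below = sum(1 for x in values if x <= candidate)
    at_or_above = sum(1 for x in values if x >= candidate)
    return (at_or_below >= p / 100 * n) and (at_or_above >= (100 - p) / 100 * n)

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

# The quartiles are the 25th, 50th, and 75th percentiles.
checks = {
    "Q1 = 41 as 25th percentile": satisfies_percentile(ages, 41, 25),
    "median = 43.5 as 50th percentile": satisfies_percentile(ages, 43.5, 50),
    "Q3 = 46.5 as 75th percentile": satisfies_percentile(ages, 46.5, 75),
    "32 as 25th percentile": satisfies_percentile(ages, 32, 25),
}
for label, ok in checks.items():
    print(label, "->", ok)
```

The three quartiles pass the check, while the minimum age of 32 does not qualify as the 25th percentile.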
