Chapter 1--Sampling and Descriptive Statistics




STATISTICS 301—APPLIED STATISTICS, Statistics for Engineers and Scientists, Walpole, Myers, Myers, and Ye, Prentice Hall

STA 301 Fall 2008

Sampling Concepts

[pic]

[pic]

To determine probabilities and expected values, we must start with some known facts. One of the known facts is how the data was collected. A sampling scheme must be used which

1. gives reasonable assurance that the sample is representative of the population

2. allows for computing probabilities about the statistic.

Kinds of Sampling Schemes or Methods

Census - a 100% enumeration of a population

Why take a sample instead of a census?

Convenience - refers to time, money, and effort

Necessity - when the act of making an observation destroys the element

Accuracy - when data collection requires highly skilled workers

Random Sample - Each element in a population is equally likely to be selected on each draw from the population and the draws are independent of one another.

Practically speaking, the elements are taken "at random with replacement."

Simple Random Sample - A simple random sample of n elements from a population of N elements is one in which every subset of n elements is equally likely to be selected.

Practically speaking, the elements are taken "at random without replacement."

As the population size gets large, simple random sampling and random sampling become essentially the same, since the chance of drawing the same element twice becomes negligible.

Since SRS is the most common method of sampling we will focus mainly on it. We will see more sampling methods later.

How to Take a Simple Random Sample (SRS)

Taking a random sample or simple random sample must be done using a randomization device (computer, calculator, or random number table) and requires a sampling frame.

Sampling Frame - List of the elements in the population from which the sample will be drawn. (Let N = number of elements in the population.)

In the simplest of cases, a SRS can be obtained as follows.

1. Assign each element of the population a unique identifier. Names can be used as could social security numbers or telephone numbers.

2. Put the unique identifier of each element on a slip of paper.

3. Put the slips in a hat.

4. Shake the hat.

5. Select your elements without looking in the hat.

More often than not, however, SRSs are selected as follows.

1. Assign each element of the population a unique numeric identifier. It is essential that each population element has a unique number associated with it.

2. Use a table of random numbers (at right is an example) or a computer program (Minitab, Excel, SPSS) to randomly generate numbers. (Repeated numbers are ignored, and each generated number must have the same number of digits as the largest identifier on the list.)

3. Your SRS is then the set of population elements whose numeric identifiers are those generated by the computer.
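The computer-based steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame size N = 500, the sample size n = 10, and the seed are made up, and Python's standard `random` module stands in for the Minitab, Excel, or SPSS generators mentioned in the notes.

```python
import random

# Hypothetical sampling frame: N = 500 elements with numeric identifiers 1..N.
N = 500
frame = list(range(1, N + 1))
n = 10

# A fixed seed is used only so that the sketch is repeatable.
rng = random.Random(301)

# Simple random sample: n distinct identifiers, drawn without replacement.
srs = rng.sample(frame, n)

# Random sample: n independent draws with replacement (repeats are possible).
with_replacement = rng.choices(frame, k=n)

print("SRS (without replacement):", sorted(srs))
print("Random sample (with replacement):", sorted(with_replacement))
```

Because `sample` never returns the same identifier twice, there are no repeated numbers to ignore; `choices` corresponds to the "at random with replacement" description of a random sample.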

Example #1:

You need to select a random sample of Butler County households for the purpose of determining where the household does its primary grocery shopping.

You have your computer generate phone numbers from those available in Butler County, such as 523-, 524-, 892-, etc. You then call these numbers and ask your questions of an adult who answers.

Example #2:

You need to select a random sample of Butler County residents for the purpose of determining whether a person is gainfully employed at the moment.

You have your computer generate phone numbers from those available in Butler County, such as 523-, 524-, 892-, etc. You then call these numbers and ask your questions of an adult who answers.

Is this a random sample? Does everyone have an equal chance of selection?

NOTES AND COMMENTS

1. The statistical inference procedures we will study in this class require that the data be collected according to a random sample (or a simple random sample from a large population).

2. How does the random sampling model relate to reality?

In some cases, data cannot be collected by an actual random sample. Before applying a statistical analysis that requires the random sampling procedure, one must decide whether the collected data may be viewed as if it could have been obtained by random sampling.

Example:

A biologist plans to study the distribution of body length in a certain population of fish in Chesapeake Bay. The sample will be collected using a fishing net.

Is this a random sample?

Can the data be viewed as if it could have been obtained by random sampling?

This is an example of sampling bias: the systematic tendency for some elements to be more readily selected than others. One wouldn’t want to apply a statistical analysis procedure that requires random sampling to such data.

3. It often happens that the population actually studied is narrower than the population that is of real interest. In such cases, one must argue that the results from the narrower population can be extrapolated to the population of interest. This extrapolation is not based on statistical grounds but on the subject matter grounds.

Example:

In determining the employment rate of Butler County, we took a SRS of phone numbers.

Have we actually sampled all adults in Butler County?

What is the population that we have sampled from?

Does it matter?

Summarizing Data

Given that we have a set of data, we now wish to summarize the information in the data. Usually, the data set will be large and simply viewing all the data will not provide any useful information. So we wish to summarize various aspects of the “distribution” of the data. Our goal will be to summarize the distribution (Center, Spread, and Shape) of the data.

When summarizing data we can take two approaches. We can summarize the data using pictures/graphs (GRAPHICAL SUMMARY) or using numbers (NUMERIC SUMMARY). Both have advantages and disadvantages, as we will see below.

However, the kind of summary one uses depends on the “Kind” of data at hand, so we first consider the different kinds of data encountered in statistics.

[pic]

Kinds of Data

There are two basic kinds of data, each of which can be further classified into two sub-classes. The basic distinction of data types is Quantitative versus Qualitative or more commonly Numeric versus Categorical. As the names imply, Quantitative/Numeric data is any measurement that results in a number. Conversely, Qualitative/Categorical data is any measurement that results in a quality, or characteristic, or non-numeric measurement of an element.

Quantitative/Numeric Data

Numeric data can be further classified into Continuous data and Discrete data. Continuous data is also referred to as Interval data since these measurements can take on ANY value in an interval or range of possible values. Discrete data can only take on certain values within some range.

Unfortunately, since we live in a discrete world (we can’t measure anything exactly), all measurements are discrete in nature. However, at least in theory, we can talk about continuous data. For example, when measuring the age of respondents, we can talk about their actual true age (resulting in a continuous measurement) or we talk about their age, rounded to the nearest year (a discrete measurement).

Qualitative/Categorical Data

Non-numeric data can also be classified into two sub-categories, depending upon the nature of the relationship of the possible values of the data. The two kinds of categorical data are Nominal and Ordinal. Nominal data is any non-numeric data whose values cannot be ordered in any “natural” way. Any ordering of the values of nominal data is strictly arbitrary. Ordinal data, however, has values that can be ordered in a non-arbitrary way. For example, ethnicity values cannot be ordered except in a subjective manner. However, degree of parental involvement in school-work (Low, Medium, and High) can be naturally ordered.

[pic]

A Data Example—Classify the variables below.

|Name |Age |Gender |IQ |Race |(unlabeled) |Parental Involvement |

|Deb |14 |F |110 |White |4 |Low |

|Mary |16 |F |111 |American Indian |0 |Low |

|Sam |15 |M |100 |White |2 |High |

|Phillipe |15 |M |120 |Hispanic |3 |Medium |

|Jean |16 |F |135 |Hispanic |1 |Medium |

|Louise |13 |F |140 |Black |1 |Medium |

|Jack |15 |M |95 |White |7 |Low |

GRAPHICAL SUMMARY OF CATEGORICAL DATA

PIE CHARTS

Goal: Summarize the distribution of categorical data.

Example: Summarize the distribution of races.

|This is a Minitab example of a pie chart. |[pic] |

|This is a JMP example. |[pic] |

Goal: Suppose we wish to determine whether the Race distribution differs by Gender.

Two categorical variables can be summarized in pie charts in one of two ways. The first is to use multiple pies: draw one pie chart of the first variable for each value of the second variable. Below is the Race variable summarized separately for each Gender.

[pic]

NOTE: Many “pies” can be reported in a single graphic.

A second way would be to take a single pie and divide it into two parts. Then in each half record the proportion of each of the other values.

BAR CHART

Goal: Summarize the distribution of categorical data.

Example: Find the distribution of Races in the small data set.

[pic]

Comment: Two categorical variables can be summarized in bar graphs using “stacked” bar graphs, in which each bar for one variable is subdivided according to the values of the second categorical variable. Alternatively, separate bar charts for each Gender can be given in the same picture.

|[pic] |[pic] |

Conclusion: How do the race distributions differ for Females and Males?

GRAPHICAL SUMMARY OF NUMERIC DATA

Our goal in summarizing numeric data is to summarize the distribution of the data. Recall that the distribution of a set of data can be summarized in three features: Center, Spread, and Shape. The following graphical methods allow one to summarize the distribution of a set of numeric values.

Below is some data on the grades from the first exam in an intro stat course (STA 261) for the Winter of 1998. There were 71 students in the course and the underlined scores represent the scores of the female students in the class. We will use this information later.

82 72 63 60 71 70 89 68 92 85 93 79 68 85 82 71 79 79 75 60 87

75 78 91 73 68 85 68 80 72 87 86 90 80 91 72 93 87 68 67 86 88

73 79 67 68 70 78 76 89 72 64 88 64 46 77 53 72 75 91 83 87 84

86 66 73 77 89 73 87 82

Our aim is to summarize the features of this distribution of exam scores. We will use a Histogram (the most common form of pictorial summary, but very awkward to rescale or change) and a Stem and Leaf Plot (a more modern summary that is very easy to use and adapt). Another very commonly used graph is a Dotplot.

DOTPLOT

Goal: Summarize the distribution of data that is Numeric.

Method

1. Draw an “X-axis” with a scale that covers the range of the data, that is, from the minimum to maximum values.

2. Record, with a dot, where each observation would occur on this axis.
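The two steps of the dotplot method can be sketched as a character-based plot. This is a rough illustration only: it assumes integer data and one character column per unit on the axis, and the function name `text_dotplot` is made up for this sketch. The data are the first 21 exam scores from the list above.

```python
from collections import Counter

def text_dotplot(values):
    """Return a character dotplot: one column per integer from min to max."""
    counts = Counter(values)
    low, high = min(values), max(values)
    tallest = max(counts.values())
    lines = []
    # Build the plot from the top row down; a dot appears in a column
    # when that value occurs at least `level` times.
    for level in range(tallest, 0, -1):
        row = "".join("." if counts.get(v, 0) >= level else " "
                      for v in range(low, high + 1))
        lines.append(row.rstrip())
    lines.append("-" * (high - low + 1))  # the axis
    return "\n".join(lines)

first_row = [82, 72, 63, 60, 71, 70, 89, 68, 92, 85, 93,
             79, 68, 85, 82, 71, 79, 79, 75, 60, 87]
print(text_dotplot(first_row))
print(f"axis runs from {min(first_row)} to {max(first_row)}")
```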

[pic]

Notes for Dotplots:

1. At times you need to compare the distributions of a variable for two different groups. For example, in the exam data, recall that the gender of each student was recorded: female exam scores were underlined and male scores were not. Suppose we wish to compare the distribution of exam scores for females and males.

To do so we could obtain separate dotplots, one for females and the other for males. However, to ensure that the two are comparable, we want the scales to be the same. This makes comparisons of the two distributions easier.

Below is the Minitab stacked dotplot of the exam data.

[pic]

What can we conclude about the two distributions of female and male exam scores?

NOTE: Since we are stacking the dotplots vertically there is no limit to the number of groups we could compare!

[pic]

[pic]

[pic]

HISTOGRAM

Goal: Summarize the distribution of data that is Numeric.

Method

1. Create a Frequency Table of the data.

2. Create a Histogram from the Frequency Table information.

Creating a Frequency Table

1. Find the smallest and largest values in the data.

2. Create between 5 and 20, equal-sized, non-overlapping intervals (also known as bins) that span the smallest to largest values. (The first interval should include the smallest value and the last interval should include the largest value.)

3. For each observation in the data set, tally it in the appropriate interval.

4. Tabulate or count the number of observations in each interval. The resulting table is the Frequency Table.

Relative Frequency Table

A Relative Frequency Table reports the percentage or proportion of observations in each interval rather than the count or frequency of observations in each interval.
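As a sketch of the tallying steps, the following Python builds a frequency table and a relative frequency table for the 71 exam scores. The choice of six intervals of width 10 starting at 40 is an assumption made for this illustration (it satisfies the 5 to 20 interval guideline), and the function name `frequency_table` is hypothetical.

```python
exam_scores = [
    82, 72, 63, 60, 71, 70, 89, 68, 92, 85, 93, 79, 68, 85, 82, 71, 79, 79, 75, 60, 87,
    75, 78, 91, 73, 68, 85, 68, 80, 72, 87, 86, 90, 80, 91, 72, 93, 87, 68, 67, 86, 88,
    73, 79, 67, 68, 70, 78, 76, 89, 72, 64, 88, 64, 46, 77, 53, 72, 75, 91, 83, 87, 84,
    86, 66, 73, 77, 89, 73, 87, 82,
]

def frequency_table(data, start, width, n_bins):
    """Tally integer data into n_bins equal-width intervals beginning at start."""
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    return [(start + i * width, start + (i + 1) * width - 1, count)
            for i, count in enumerate(counts)]

table = frequency_table(exam_scores, start=40, width=10, n_bins=6)
n = len(exam_scores)

# Each row: interval, frequency, relative frequency, and a crude bar.
for low, high, count in table:
    print(f"{low}-{high}: {count:3d}  {count / n:6.1%}  {'*' * count}")
```

The row of asterisks is a rough sideways histogram; changing `start`, `width`, and `n_bins` redoes the whole tally, which is the "start over" cost of rescaling a histogram.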

Creating a Histogram

1. For each interval determine the midpoint or center of the interval.

2. Plot the frequency (or relative frequency) for each interval on the y-axis against the interval midpoint on the x-axis.

Do not forget to label your y- and x-axes appropriately.

Example: Obtain the histogram for the exam score data.

82 72 63 60 71 70 89 68 92 85 93 79 68 85 82 71 79 79 75 60 87

75 78 91 73 68 85 68 80 72 87 86 90 80 91 72 93 87 68 67 86 88

73 79 67 68 70 78 76 89 72 64 88 64 46 77 53 72 75 91 83 87 84

86 66 73 77 89 73 87 82

Creating the Frequency Table

Find the smallest and largest values in the data.

Create between 5 and 20, non-overlapping intervals and count the number of observations in each interval.

Create the Histogram

Comment: Suppose we don’t like our histogram. How do we change it?

HISTOGRAMS IN EXCEL

Histograms are not part of the normally installed options in Excel. To create a histogram of numeric data, you need to have the “Analysis ToolPak” add-in installed.

|[pic] |[pic] |

Unfortunately the default intervals/bin sizes used by Excel are not very conducive to interpretation. With experimentation and some time you can have Excel (and any software package for that matter) provide you with a more interpretable histogram. After a bit of playing I got Excel to provide the histogram above.

HISTOGRAMS IN MINITAB

[pic]

NOTES AND COMMENTS

1. As seen earlier, if your histogram does not convey the aspects of the distribution that you need conveyed, it is not easy to expand or contract it. If the interval sizes were chosen too large, then “stretching” the histogram requires you to start over from the beginning, defining new intervals and tabulating the results into the new intervals.

Stem plots, which we will see next, do not share this problem and are very easy to stretch and compress.

2. Back-to-back Histograms

At times you need to compare the distributions of a variable for two different groups. For example, in the exam data, recall that the gender of each student was recorded: female exam scores were underlined and male scores were not.

Suppose now we would like to compare the distributions of exam scores for the two genders. Did females do better on the exam than males? Are the spreads the same? Are the shapes the same for the two genders? To answer these questions we could obtain a histogram for each gender separately. However, since the two groups will likely have different minimums and maximums, the resulting histograms may not be comparable because the intervals could be very different.

To use the same intervals for both genders, proceed as before using the entire data set, but now keep track of how many females and how many males occur in each interval. So the frequency table would have two columns of counts, one for females the other for males.

[pic]

The back-to-back histogram is then created one of three ways, top-to-bottom, side-by-side, or overlaid. Below are some examples from Minitab.

|[pic] |[pic] |

Top-to-bottom and side-by-side histograms are limited to just two groups. Any number of groups can be overlaid, but unless the graph allows for transparent histograms, most of the distributions will cover up other distributions. Hence, histogram comparisons are usually limited to two groups.

[pic] [pic]

STEM & LEAF PLOT

Goal: Summarize the distribution of data that is Numeric.

A stem and leaf plot (aka stem plot) produces a picture of numeric data like a histogram, but is much simpler to create and much easier to change once it’s done.

Method

1. Find the smallest and largest values in the data.

2. Divide each observation into a “stem” part and a “leaf” part so that there are approximately 5 – 20 different stem values between the smallest value’s stem and the largest value’s stem.

3. List the different stem values in a column.

4. Draw a line to the right of this stem column.

5. For each observation, record the leaf value of the observation on the appropriate stem line.
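The method above can be sketched in Python for two-digit data, using the tens digit as the stem and the ones digit as the leaf. The function name `stem_and_leaf` is made up for this sketch; it orders the leaves within each stem and leaves a space after every five leaves.

```python
from collections import defaultdict

exam_scores = [
    82, 72, 63, 60, 71, 70, 89, 68, 92, 85, 93, 79, 68, 85, 82, 71, 79, 79, 75, 60, 87,
    75, 78, 91, 73, 68, 85, 68, 80, 72, 87, 86, 90, 80, 91, 72, 93, 87, 68, 67, 86, 88,
    73, 79, 67, 68, 70, 78, 76, 89, 72, 64, 88, 64, 46, 77, 53, 72, 75, 91, 83, 87, 84,
    86, 66, 73, 77, 89, 73, 87, 82,
]

def stem_and_leaf(data, group=5):
    """Stem plot for non-negative integers: stem = tens, leaf = ones digit."""
    leaves = defaultdict(list)
    for x in sorted(data):                 # sorting orders the leaves
        stem, leaf = divmod(x, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # List every stem from smallest to largest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(leaves[stem])
        grouped = " ".join(digits[i:i + group]
                           for i in range(0, len(digits), group))
        lines.append(f"{stem}|{grouped}")
    return "\n".join(lines)

print(stem_and_leaf(exam_scores))
print("9|0 = 90% on Exam #1")
```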

Example: Obtain the stem plot for the exam data below.

82 72 63 60 71 70 89 68 92 85 93 79 68 85 82 71 79 79 75 60 87

75 78 91 73 68 85 68 80 72 87 86 90 80 91 72 93 87 68 67 86 88

73 79 67 68 70 78 76 89 72 64 88 64 46 77 53 72 75 91 83 87 84

86 66 73 77 89 73 87 82

-----------------------------------------------------------------------------------------

4|6

5|3

6|30880888778446

7|2109199558322390862725373

8|29552750760768983746972

9|2310131

NOTES AND COMMENTS

1. We need to provide a legend on how to read the stem plot and provide the units of measurement. In the above stem we simply need to indicate how to interpret the 9|0 in the last row of the plot. We include the legend

9|0 = 90% on exam #1 in STA 261 Winter Semester 1998.

2. It’s usual to order the leaves within a stem although not mandatory.

3. I usually leave a space between groups of five leaves. It makes it easier to count, which will come in handy for later techniques.

In the tables below we present the original stem plot and the stem plot with notes 2 and 3 applied. Notice that the same information is contained in both graphs. The enhanced version has the advantage that the data is now ordered from smallest to largest and that quantiles (e.g., percentiles) can be readily determined.

|Original Version | |Enhanced Version |

| | | |

|4|6 | |4|6 |

|5|3 | |5|3 |

|6|30880888778446 | |6|00344 67788 8888 |

|7|2109199558322390862725373 | |7|00112 22223 33355 56778 89999 |

|8|29552750760768983746972 | |8|00222 34555 66677 77788 999 |

|9|2310131 | |9|01112 33 |

| | | |

|9|0 = 90% on Exam #1 in STA 261 Winter 98 | |9|0 = 90% on Exam #1 in STA 261 Winter 98 |

4. If the data have a large number of digits in each observation (consider the superintendent data seen earlier and the salary values), it is usual to reduce the number of digits by either truncating or rounding the numbers to a smaller number of digits.

For example, using the salary data, we note that the max and min are $167,271 and $80,000. If we divide the salaries as follows:

16 | 7271 and 8 | 0000

we obtain 8 stems. However, our leaves contain too many digits. (Ideally the leaves should contain one or two, at most three digits or the stem plot gets too “busy.”)

To reduce the salaries to numbers with one leaf digit we could truncate all digits after the “thousands” digit or we could round to the nearest thousand. Either is acceptable and used in practice. If we truncate, 167271 would be recorded as 167 and 80000 as 80, in units of $1,000.

If we round to the nearest thousand, we would obtain 167 and 80. In this example, truncating and rounding yield the same values, but this will not usually be the case! In practice, truncation is usually easier since trailing digits are simply ignored. Rounding requires you to mentally round up or down and hence adds one more step to the process.
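A small sketch of the two options, in units of $1,000. The first two salaries are the maximum and minimum quoted above; the third value, 94,600, is made up purely to show a case where truncating and rounding disagree.

```python
def truncate_to_thousands(salary):
    """Drop every digit after the thousands digit."""
    return salary // 1000

def round_to_thousands(salary):
    """Round to the nearest thousand (exact ties are not exercised here)."""
    return round(salary / 1000)

salaries = [167271, 80000]   # max and min from the notes
hypothetical = 94600         # illustrative value, not from the data

for s in salaries + [hypothetical]:
    print(s, truncate_to_thousands(s), round_to_thousands(s))
```

For 167,271 and 80,000 both functions agree (167 and 80), while the hypothetical 94,600 truncates to 94 but rounds to 95.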

5. Sometimes your original choice for the stem/leaf may yield too many or too few stems, resulting in a stem plot that is too stretched or too thick. In both cases the plot is difficult to interpret with respect to the distribution.

Any given stem plot can be expanded or contracted to make more or fewer stems and hence more or less detail on the plot. To expand a stem plot any stem can be expanded by a factor of two or five. For example, taking the 80’s stem in our data

8|00222 34555 66677 77788 999

we can expand this stem into two stems or five stems as follows. To expand the stem into two stems (essentially doubling the resolution of the stem) we simply put all the leaves from 0 to 4 in one stem and all the leaves from 5 to 9 in another stem, resulting in the following:

8|00222 34

8|55566 67777 78899 9

If expanding each stem by two still does not convey the information required, we can expand by five, putting the leaves 0 and 1 in one stem, 2 and 3 in the next, 4 and 5 in the next, 6 and 7 together, and 8 and 9 in the last stem. Our 80’s stem would then look like

8|00

8|2223

8|4555

8|66677 777

8|88999

The entire plot would look like

Original stems:

4|6

5|3

6|00344 67788 8888

7|00112 22223 33355 56778 89999

8|00222 34555 66677 77788 999

9|01112 33

Each stem expanded into two stems:

4|

4|6

5|3

5|

6|00344

6|67788 8888

7|00112 22223 333

7|55567 78899 99

8|00222 34

8|55566 67777 78899 9

9|01112 33

Each stem expanded into five stems:

4|6

4|

5|

5|3

5|

5|

5|

6|00

6|3

6|44

6|677

6|88888 8

7|0011

7|22222 3333

7|555

7|677

7|88999 9

8|00

8|2223

8|4555

8|66677 777

8|88999

9|0111

9|233

9|0 = 90% on Exam #1 in STA 261 Winter 1998

It is essential that when we form our expanded or contracted stems, the “interval” sizes be the same. Otherwise we get a distorted picture of the data. For example, if we expanded each stem into four stems, putting leaves 0 and 1 in one stem, 2, 3, and 4 in another, 5, 6, and 7 in the third, and leaves 8 and 9 in the last, we would obtain

4|6

4|

5|

5|3

5|

5|

6|00

6|344

6|677

6|88888 8

7|0011

7|22222 3333

7|55567 7

7|88999 9

8|00

8|22234

8|55566 67777 7

8|88999

9|0111

9|233

9|0 = 90% on Exam #1 in STA 261 Winter 1998

In this example it didn’t make a great deal of difference, but it can, so:

Always expand stems into 2 or 5 stems ONLY!!!
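The rule can be made concrete with a short sketch that splits one stem into equal-width sub-stems. The function name `split_stem` is made up, and for brevity the output omits the space after every five leaves. It refuses any split other than 2 or 5, because only those divide the ten leaf digits into equal-sized groups.

```python
def split_stem(stem, leaves, parts):
    """Split one stem's leaf digits into 2 or 5 equal-width sub-stems."""
    if parts not in (2, 5):
        raise ValueError("expand a stem into 2 or 5 stems only")
    width = 10 // parts                      # 5 digits or 2 digits per sub-stem
    rows = [[] for _ in range(parts)]
    for leaf in sorted(leaves):
        rows[leaf // width].append(str(leaf))
    return [f"{stem}|{''.join(row)}" for row in rows]

# Leaves of the 80's stem from the exam data.
eighties = [int(d) for d in "00222345556667777788999"]

print("\n".join(split_stem(8, eighties, 2)))
print()
print("\n".join(split_stem(8, eighties, 5)))
```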

NOTES AND COMMENTS (Stem and Leaf Plots Continued)

6. One big advantage of stem plots over histograms is that the original data is readily available from a stem plot. For example, in our histogram

we only know how many observations are in an interval, and do NOT know what values they were. In the (50-59) interval we only know there was one value.

4|6

5|3

6|00344 67788 8888

7|00112 22223 33355 56778 89999

8|00222 34555 66677 77788 999

9|01112 33

9|0 = 90% on Exam #1 in STA 261 Winter 1998

With the stem and leaf plot, however, we know exactly how many are in each stem (or interval) as well as what values. From the stem plot we know that there was one value in the 5 stem (corresponding to an interval of 50-59) and the value was 53.

7. Just as with dotplots and histograms, we can also have two stem plots back-to-back. We use a common stem list, but then record the observations from the two groups on either side. For example, using the original choice of stems for the exam data, we obtain

Males | | Females

|4|6

|5|3

4 46888|6|00377 888

35|7|00112 22223 33556 77889 999

66799|8|00222 34555 67777 889

011|9|1233

9|1 = 91% on Exam #1 for a Female

91% on Exam #1 for a Male = 1|9

Note that, like back-to-back histograms, back-to-back stem plots can only compare two groups or distributions.

NOTES AND COMMENTS (Stem and Leaf Plots Continued)

8. The resulting stem plot provides the same information about the distribution of the data as a histogram. Namely, one can glean information about the Center, Spread, and Shape of the data. Below we have taken the stem plot just created above and turned it on its side. The resulting display is similar to the histogram we created earlier, WITH EXACTLY THE SAME INFORMATION ABOUT THE CENTER, SPREAD, AND SHAPE OF THE DISTRIBUTION AVAILABLE.

|Original Version |

| |

|4|6 |

|5|3 |

|6|30880888778446 |

|7|2109199558322390862725373 |

|8|29552750760768983746972 |

|9|2310131 |

| |

|9|0 = 90% on Exam #1 in STA 261 Winter 98 |

CENTER?

SPREAD?

SHAPE?

[pic]

Graphical Descriptive Statistics Summary

1. Do not forget to include titles, labels, and legends.

2. Avoid three-dimensional figures since they detract from making your point.

3. Beware of scales that give a misleading impression of the figure. See the text for examples.

4. Remember that a distribution is the entire set of observations.

5. A summary of the distribution should include mention of:

1. Center = a typical or middle value

2. Spread = notes the variability of the data

3. Shape has three characteristics

a. Modality or number of peaks

b. Symmetry/Skewness

Symmetric means the figure is a mirror image of itself (never perfectly so in real life, but approximately).

Skewed means no mirror image and the direction (positive/negative) is determined by the long tail or the extreme values (aka outliers).

c. Bell or Normal shape

Summarizing Data—numerical methods

We first provide an example to illustrate our goal of summarizing the distribution of a single set of numbers with a few well chosen numbers.

Below is a copy of a wrapper of a “vending machine size” Milky Way candy bar.

[pic]

Questions

1. What is the “population” of all such candy bars?

2. Summarize the weights of all of these candy bars using a single number.

3. Draw a picture of the entire population of Milky Way bar weights.

4. What does the Net Wt. 58.1 gm mean relative to your picture?

To determine what the 58.1gm meant several of my classes bought and weighed Milky Way candy bars over the course of two years. The data is below with the year, total weight of the candy bar in the wrapper, the package weight, the candy bar weight (by subtraction), and the code from the wrapper.

|Year Done |Total Wgt |

|[pic] |[pic] |

Below are the candy bar weights.

62.2 59.6 60.4 59.7 62.4 57.1 61.5 64.6 61.6 59.5 59.7 54.1

61.6 64.5 57.4 56.0 60.2 60.5 61.3 59.2 59.7 60.7 59.1 57.4

61.3 57.2 58.4 62.1 59.6 58.3 57.1 58.7 58.6 61.5 59.9 58.4

58.6 61.9 60.3 60.7 60.0 58.2 57.7 53.2 58.7 61.9 58.8 60.0

60.5 61.3 60.1 59.4 60.0 60.5 63.4 59.5 59.1 60.6 57.5 60.9

59.1 57.2 58.4 58.5 60.1 57.5 57.9 55.3 59.7 60.6 57.7 60.8

57.7 58.1 56.8 58.1 57.0 58.5 58.3 59.4 59.8 59.7 59.7 60.5 56.1

What can you conclude about the distribution of Milky Way candy bar weights from these numbers?

Below is the stem and leaf plot for this data. Now, what can we conclude about the distribution?

OUT LO : 53|2, 54|1

55 | 3

55 |

56 | 01

56 | 8

57 | 0112244

57 | 557779

58 | 11233444

58 | 5566778

59 | 111244

59 | 556677777789

60 | 00011234

60 | 5555667789

61 | 333

61 | 556699

62 | 124

62 |

63 | 4

OUT HI : 64|5, 64|6 63 | 4 = 63.4 grams of Milky Way candy

Can we say exactly or more precisely where the center is? What about the spread? Can we say with certainty how normal or non-normal the distribution is?

Goal: We want to summarize data using numbers. Our numbers will summarize the center, spread, and the various measures of shape. Whereas the numbers found from graphical summaries are very subjective, our numeric summaries will be very objective.

Notation: Denote the data (numbers) by $x_1, x_2, \ldots, x_n$, where $x_i$ = measurement on the $i$th element and n = number of measurements.

NUMERIC SUMMARY OF NUMERIC DATA

In this section we will compute numeric values of the center of the distribution. These measures summarize a “typical” value in the data. In statistical parlance, these measures are known as measures of center or location.

1. MEASURES OF THE CENTER: AVERAGE, MEDIAN, and MODE

AVERAGE

Aside: Recall that we use “mean” when summarizing the population and “average” when summarizing a sample. In both cases, the number is calculated the same way, although we denote the final value differently, using $\mu$ for the population mean and $\bar{x}$ for the sample average.

Defn: The average is the sum of the data values divided by the number of values.

Notation and formula: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

Example: Using the candy weights we obtain: $\bar{x} = \dfrac{5046.4}{85} \approx 59.37$ grams.

Since this data represents a sample of all vending pack size Milky Way candy bars we denote it as $\bar{x}$ and would refer to it as the sample average versus the population mean!

Interpretation: The average is an all inclusive summary of the center of the data.

The average includes all of the data in its calculation, hence all of the data are “represented” in the average.

The average is referred to as a center of the data since it represents the point where the data would balance.
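As a quick check of the definition and of the "balance point" interpretation, the sketch below computes the average of the 85 candy bar weights and confirms that the deviations from the average sum to zero (up to floating-point rounding). The variable names are chosen for this illustration only.

```python
candy_weights = [
    62.2, 59.6, 60.4, 59.7, 62.4, 57.1, 61.5, 64.6, 61.6, 59.5, 59.7, 54.1,
    61.6, 64.5, 57.4, 56.0, 60.2, 60.5, 61.3, 59.2, 59.7, 60.7, 59.1, 57.4,
    61.3, 57.2, 58.4, 62.1, 59.6, 58.3, 57.1, 58.7, 58.6, 61.5, 59.9, 58.4,
    58.6, 61.9, 60.3, 60.7, 60.0, 58.2, 57.7, 53.2, 58.7, 61.9, 58.8, 60.0,
    60.5, 61.3, 60.1, 59.4, 60.0, 60.5, 63.4, 59.5, 59.1, 60.6, 57.5, 60.9,
    59.1, 57.2, 58.4, 58.5, 60.1, 57.5, 57.9, 55.3, 59.7, 60.6, 57.7, 60.8,
    57.7, 58.1, 56.8, 58.1, 57.0, 58.5, 58.3, 59.4, 59.8, 59.7, 59.7, 60.5, 56.1,
]

n = len(candy_weights)
average = sum(candy_weights) / n          # sum of the values divided by n

# Balance point: positive and negative deviations cancel out.
deviation_total = sum(x - average for x in candy_weights)

print(f"n = {n}, average = {average:.2f} grams")
print(f"sum of deviations = {deviation_total:.10f}")
```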

MEDIAN

Defn: The median is the middle number in the “ordered” data.

Notation: There is no universally accepted notation for the median like there is for the mean and average. Often, however, M and $\tilde{x}$ are used to represent the sample median and $\tilde{\mu}$ the population median.

Simple Examples

Here are the IQ’s of the eight students from the small data set seen earlier in the Graphical Descriptive statistics section.

120 110 111 100 120 135 140 95

We first order the data

95 100 110 111 120 120 135 140

then find the middle number. In the case of an even number of observations, the median is defined to be the average of the middle two numbers. So for this example, the median would be

median = $\dfrac{111 + 120}{2} = 115.5$

To illustrate finding the median for an odd number of observations, suppose our ordered data set was

15 19 28 34 75 97 120 130 150

With nine observations there is a “middle number,” namely 75 in this example. There are an equal number of values both above and below 75. So, for this example, our median would be: median = 75.

Note: When finding the median by hand the data must be ordered. Stem and leaf plots are very useful in finding the median by hand for larger data sets. However, to do so requires you use the “enhanced” version of the stem plot where the leaves have been ordered within each stem. For extremely large data sets (and this is true for all of the numeric measures) you would use a calculator or computer to do the grunt work computations.

Example: To find the median we use the stem plot seen earlier and reproduced below.

OUT LO : 53|2, 54|1

55 | 3

55 |

56 | 01

56 | 8

57 | 0112244

57 | 557779

58 | 11233444

58 | 5566778

59 | 111244

59 | 556677777789

60 | 00011234

60 | 5555667789

61 | 333

61 | 556699

62 | 124

62 |

63 | 4

OUT HI : 64|5, 64|6 63 | 4 = 63.4 grams of Milky Way candy

Recall that there are 85 observations in the candy data, an odd number, so our median will be the value with 42 observations on one side and 42 on the other. So we need to find the 43rd value in the ordered data. Counting from the top of the stem plot we obtain

median = 59.6

Interpretation: The median represents the 50th percentile of the data, meaning that 50% of the data are above and 50% are below the median. The median also represents the “center” of the data since it has equal numbers of observations on either side of it.
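The odd and even cases can be sketched as a small function. The name `median` and its layout are choices made for this illustration (Python's standard library offers `statistics.median` as an alternative). It is applied to the IQ data, the nine-value example, and the candy weights.

```python
def median(values):
    """Middle value of the ordered data; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

iqs = [120, 110, 111, 100, 120, 135, 140, 95]
nine_values = [15, 19, 28, 34, 75, 97, 120, 130, 150]
candy_weights = [
    62.2, 59.6, 60.4, 59.7, 62.4, 57.1, 61.5, 64.6, 61.6, 59.5, 59.7, 54.1,
    61.6, 64.5, 57.4, 56.0, 60.2, 60.5, 61.3, 59.2, 59.7, 60.7, 59.1, 57.4,
    61.3, 57.2, 58.4, 62.1, 59.6, 58.3, 57.1, 58.7, 58.6, 61.5, 59.9, 58.4,
    58.6, 61.9, 60.3, 60.7, 60.0, 58.2, 57.7, 53.2, 58.7, 61.9, 58.8, 60.0,
    60.5, 61.3, 60.1, 59.4, 60.0, 60.5, 63.4, 59.5, 59.1, 60.6, 57.5, 60.9,
    59.1, 57.2, 58.4, 58.5, 60.1, 57.5, 57.9, 55.3, 59.7, 60.6, 57.7, 60.8,
    57.7, 58.1, 56.8, 58.1, 57.0, 58.5, 58.3, 59.4, 59.8, 59.7, 59.7, 60.5, 56.1,
]

print(median(iqs))            # even n: average of the middle two
print(median(nine_values))    # odd n: the single middle value
print(median(candy_weights))  # the 43rd of 85 ordered values
```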

MODE

Defn: The mode is the most frequently occurring number (or numbers, you could have more than one mode!).

Simple Examples

|[pic] |

|[pic] |
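As a brief sketch of the definition, the mode can be found with `multimode` from Python's standard `statistics` module (Python 3.8 or later), which returns every most-frequent value. The IQ values are those used in the median example; the second list is made up to show a data set with two modes.

```python
from statistics import multimode

iqs = [120, 110, 111, 100, 120, 135, 140, 95]
two_modes = [1, 2, 2, 3, 3, 4]      # hypothetical data with two modes

print(multimode(iqs))        # a single mode
print(multimode(two_modes))  # more than one mode is possible
```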

Notes and comments

1. An outlier is _______________________.

2. What effect do outliers have on Measures of Center?

Average = affected by outliers, the Average is pulled or drawn towards the outliers

Median = not affected by outliers, unless more than 50% of the data are outliers

Mode = usually not affected by outliers

Numeric measures that are not affected by outliers are called robust or resilient.

3. If the distribution is perfectly symmetric, then the mean and median will be the same.

4. In general, the mean is most often used since it has “better” properties than the median. However, if the data are skewed, then the median is more commonly used to measure the center.

5. Symmetry and averages and medians—If the average is much greater than the median, then the data is skewed. Which way? Why?
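Note 2 on the effect of outliers can be illustrated with a short sketch. It reuses the nine ordered values from the median example and then replaces the largest value, 150, with 1500; that replacement is a made-up outlier for illustration, not part of the data.

```python
from statistics import mean, median

data = [15, 19, 28, 34, 75, 97, 120, 130, 150]
with_outlier = data[:-1] + [1500]    # hypothetical outlier replaces 150

print(f"original:     average = {mean(data):.1f}, median = {median(data)}")
print(f"with outlier: average = {mean(with_outlier):.1f}, median = {median(with_outlier)}")
```

The average is pulled strongly toward the outlier while the median stays at 75, which is what it means for the median to be robust or resilient.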

2. MEASURES OF SPREAD: RANGE, VARIANCE, STANDARD DEVIATION, and IQR

In this section we will compute numeric values of the spread of the distribution. These computed values will quantify the spread or variability of the data.

• Measures of spread or variability are always non-negative (i.e., ≥ 0).

• If a measure of spread is zero, then _______________________.

• The larger the spread the larger the measure of spread or variability.

Why do we need to measure Spread? Isn’t Center enough?

The growth of 5 different Chrysanthemum flowers, in mm:

Data Set #1 Data Set #2

65, 70, 72, 76, 82 70, 72, 73, 75, 75

______________ ______________

. . . :

-------+---------+---------+---------+---------+---------Data #2

. . . . .

-------+---------+---------+---------+---------+---------Data #1

66.5 70.0 73.5 77.0 80.5 84.0

[pic][pic]
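A short Python sketch of why center alone is not enough: it computes the average and the largest-minus-smallest spread for the two Chrysanthemum data sets. The helper name `value_range` is an illustrative choice, not something from the course materials.

```python
from statistics import mean

# Growth of 5 Chrysanthemum flowers, in mm (from the notes)
data1 = [65, 70, 72, 76, 82]
data2 = [70, 72, 73, 75, 75]

def value_range(xs):
    """Largest value minus smallest value."""
    return max(xs) - min(xs)

for name, xs in [("Data Set #1", data1), ("Data Set #2", data2)]:
    print(name, "average =", mean(xs), "mm, spread =", value_range(xs), "mm")
```

Both data sets have the same average, yet the first is spread over a much wider interval than the second.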

RANGE

Defn: The range is the difference between the largest and smallest values, a single number. (While this is the most common definition some also define the range as from the smallest to largest. Actually this is a more informative definition of range!) Range has the same units as the data.

Example: The ages in the small data set have a range of 16 - 13 = 3 years. However, saying the ages range from 13 to 16 years is more informative, since it gives the endpoints, which are NOT known when you simply say the range is 3.

Note: While a nice measure of spread, the range suffers because it uses only the two most extreme values and none of the values in between. For example, consider the following two examples (and corresponding dot plots). In the first example, since the values are spread between the minimum and maximum values, the range is an informative measure of spread. However, in the second example, the range is not very descriptive of the data since most of the values fall in a much smaller range.

|[pic] |

|[pic] |

SAMPLE VARIANCE

Defn: Sample variance is the “average” of the squared distances or deviations of every sample observation from the average of the data.

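Notation and formula: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the average of the $n$ sample observations.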

Example: We’ll do one simple example ONLY for illustration purposes. In real life such computations are done using a calculator or a computer. Let’s consider the ages of the students in the small data set.

|Age |Average Age |Age - Average Age |(Age - Average Age)² |

|15 |14.875 |0.125 |0.01563 |

|14 |14.875 |-0.875 |0.76563 |

|16 |14.875 |1.125 |1.26563 |

|15 |14.875 |0.125 |0.01563 |

|15 |14.875 |0.125 |0.01563 |

|16 |14.875 |1.125 |1.26563 |

|13 |14.875 |-1.875 |3.51563 |

|15 |14.875 |0.125 |0.01563 |

|Sum | | |6.87504 |

So the variance would be $s^2 = 6.87504 / (8 - 1) \approx 0.9821$. What are the units?

Note: The denominator in the sample variance is not the sample size, n, but rather the sample size minus 1 or n – 1.

Interpretation: The variance is the “average” squared distance to the center, or average, of the distribution.

Note: Because the variance has awkward units (the units of the data, squared!), the variance is used only to obtain the standard deviation. For example, using the small data set, the variance of Age is 0.9821 years². The awkwardness is: how does one interpret “squared years”?
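The hand computation of the sample variance can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the table; the variable names are illustrative, and the standard library's `statistics.variance` (which also divides by n - 1) is used as a cross-check.

```python
from statistics import mean, variance

# Ages of the students in the small data set
ages = [15, 14, 16, 15, 15, 16, 13, 15]

xbar = mean(ages)                              # average age
squared_devs = [(x - xbar) ** 2 for x in ages] # (Age - Average Age) squared
s2 = sum(squared_devs) / (len(ages) - 1)       # divide by n - 1, not n

print("average:", xbar)
print("sum of squared deviations:", sum(squared_devs))
print("sample variance:", round(s2, 4))
print("statistics.variance:", round(variance(ages), 4))
```

The sum printed here is the unrounded total, so it differs slightly in the last digits from a sum of rounded table entries.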

SAMPLE STANDARD DEVIATION

Defn: The sample standard deviation is the positive square root of the sample variance, and we use s to denote it. You will also see the standard deviation abbreviated as STDEV or SD.

Example: Continuing the age example above, the standard deviation would be

$s = \sqrt{0.9821} \approx 0.991$, with units?

Interpretation: SD has no physical interpretation, but we will see that, for certain distributional shapes, the standard deviation can be used to determine probabilities.
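As a quick check of the age example, the standard deviation is just the square root of the sample variance. The sketch below assumes Python's standard `statistics` module; `stdev` uses the same n - 1 divisor as the sample variance.

```python
from math import sqrt
from statistics import stdev, variance

# Ages of the students in the small data set
ages = [15, 14, 16, 15, 15, 16, 13, 15]

s = sqrt(variance(ages))  # positive square root of the sample variance

print("sample SD:", round(s, 3))
print("statistics.stdev:", round(stdev(ages), 3))
```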

INTER QUARTILE RANGE (IQR)

Defn: The Inter Quartile Range is defined to be the difference between the Third Quartile (denoted by Q3) and First Quartile (Q1) of a data set. The Inter Quartile Range is denoted by IQR and equals Q3 - Q1.

Defn: The Quartiles of a distribution are simply the 75th and 25th percentiles of the distribution. The 75th percentile is the data value with at least 75% of the data less than or equal to it. Likewise, the 25th percentile is the data value with at least 25% of the data less than or equal to it.

A very simple method of finding the quartiles is as follows:

1. Find the median of the data. Recall the median is the middle value!

2. The Third Quartile (Q3) is then the median of all the observations > the median.

3. The First Quartile (Q1) is the median of those observations < the median.

Example: We return to the data sets seen earlier when we introduced the median. We see how we calculate the two quartiles once we have the median.

Example #1

The ordered data set with an even number of observations is

95 100 110 111 120 120 135 140.

↑

115.5 (the median, halfway between 111 and 120)

The median was found to be 115.5. Then Q3 = the median of the values above 115.5. There are four values, an even number, so the median will be the average of the middle two. The middle two are 120 and 135 so Q3 = 127.5.

Likewise the median of those values below the median of 115.5 is the First Quartile and would be the average of 100 and 110; so Q1 = 105.

The Inter Quartile Range is then IQR = Q3 – Q1 = 127.5 – 105 = 22.5.

Example #2

In the second example we started with an odd number of values, so our median will be the “middle value”, which is 75 for this data.

15 19 28 34 75 97 120 130 150

↑

median (75)

The quartiles would then be Q3 = 125 (the average of 120 and 130) and Q1 = 23.5 (the average of 19 and 28), with an IQR of 125 – 23.5 = 101.5.

Interpretation: The IQR of a data set represents the range of the “middle” 50% of the data. Note that since Q3 has 75% of the data less than it and Q1 has 25% of the data less than it, there is 50% between the two quartiles.

Like the range, the IQR is a single number; however, without a frame of reference this single number is not as informative as saying the middle 50% of the data ranges from Q1 to Q3. For example, in Example #2 above, the IQR is 101.5, but without knowing where the quartiles are, it is not very informative. We will see that the IQR is typically displayed in a picture that provides a frame of reference.
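The simple median-split method for the quartiles can be sketched in Python as follows. The function name `quartiles` is illustrative. Note that library routines such as `statistics.quantiles` use other interpolation conventions and may return slightly different quartiles for the same data.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) using the median-split method from the notes."""
    xs = sorted(data)
    med = median(xs)
    lower = [x for x in xs if x < med]  # observations below the median
    upper = [x for x in xs if x > med]  # observations above the median
    return median(lower), median(upper)

example1 = [95, 100, 110, 111, 120, 120, 135, 140]
example2 = [15, 19, 28, 34, 75, 97, 120, 130, 150]

for data in (example1, example2):
    q1, q3 = quartiles(data)
    print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```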

Notes and comments

1. The standard deviation is the only measure of spread that has some usefulness concerning probabilities.

2. What effect do outliers have on Measures of Spread?

Range

Variance

Standard Deviation

IQR

3. The SD is the most commonly used measure of spread, however, since outliers affect it, the IQR should be used for skewed data. In real life, both would be reported for skewed data.

3. MEASURES OF THE SYMMETRY AND NORMALITY/KURTOSIS

SYMMETRY

Formula: [pic]

Interpretation: < 0 ⇒ Negative Skew (long tail in negative direction).

= 0 ⇒ Perfectly symmetric around average.

> 0 ⇒ Positive Skew (long tail in positive direction).

NORMALITY/KURTOSIS

Formula: [pic]

Interpretation: < 0 ⇒ Platykurtic (tails too thin).

= 0 ⇒ Mesokurtic (perfectly Normal/bell-shaped).

> 0 ⇒ Leptokurtic (tails too thick).
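The sign interpretations above can be tried out numerically. The sketch below assumes the simple moment-based definitions (third and fourth central moments divided by powers of the second, with 3 subtracted for kurtosis). This is an assumption: the formulas shown in the figures, and those used by statistical software, often include small-sample adjustments and may give somewhat different values. Both small data sets are made up for illustration.

```python
def central_moment(xs, r):
    """r-th central moment, dividing by n (assumed definition)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** r for x in xs) / len(xs)

def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a Normal shape gives about 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]           # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]    # hypothetical data with a long right tail

print("symmetric skewness:", skewness(symmetric))
print("right-skewed skewness:", round(skewness(right_skewed), 3))
print("symmetric excess kurtosis:", round(excess_kurtosis(symmetric), 3))
```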

ONE FINAL GRAPHICAL DISPLAY

FIVE NUMBER SUMMARY

Defn: The five number summary of a data set is the following set of five numbers

{ Minimum, Q1, Median, Q3, Maximum }.

Example: Using the IQ data ( 95 100 110 111 120 120 135 140) we already found the median to be 115.5 and the two extremes are 95 and 140. The two quartiles are 105 and 127.5, resulting in our 5-number summary:

{ 95, 105, 115.5, 127.5, 140 }.

Interpretation: While the 5-Number summary is a useful numeric summary, it is most commonly used in creating the final graphical summary of data, the Boxplot.
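A minimal Python sketch of the five number summary for the IQ data is given below. It assumes the median-split rule for the quartiles (median of the values below, and above, the overall median); the function name is illustrative, and software packages may use a different quartile rule.

```python
from statistics import median

def five_number_summary(data):
    """Return (Minimum, Q1, Median, Q3, Maximum) using median-split quartiles."""
    xs = sorted(data)
    med = median(xs)
    q1 = median([x for x in xs if x < med])
    q3 = median([x for x in xs if x > med])
    return xs[0], q1, med, q3, xs[-1]

iq = [95, 100, 110, 111, 120, 120, 135, 140]
print(five_number_summary(iq))
```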

[pic]

BOXPLOT

Defn: A Boxplot is a graphical display of data that uses the 5-Number summary.

[pic]

Interpretation: What can a boxplot tell us about the distribution? What about:

Center?

Spread?

Shape?

OUTLIERS

Defn: An outlier is any value in the data that is much larger or smaller than most of the other data.

There are two commonly used ways of determining if observations are outliers, Z-Scores and Fences in boxplots.

Z-SCORES

Defn: A z-score is also known as a standardized score and is denoted by z. The z-score for an observation $x$ from a data set with average, $\bar{x}$, and sample standard deviation, s, is

$z = \frac{x - \bar{x}}{s}$

Interpretation: A z-score represents how many SD’s an observation is away from the center as defined by the average.

Uses: Z-scores are used to determine outliers. Observations with z-scores beyond ± 3 or 4 are usually considered outliers.

Z-scores can also be used to compare observations from two different data sets that are measured in two different units, like heights and weights.

Example: Returning briefly to the exam score data, the average was $\bar{x}$ = 77.2 with a SD of s = 10.1. We would declare the score of 43 an outlier since its z-score is $z = (43 - 77.2)/10.1 \approx -3.39$, which is beyond ± 3.
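The z-score calculation for the exam score of 43 can be sketched in Python as follows. The function name `z_score` and the cutoff of 3 are illustrative choices taken from the rule of thumb above.

```python
def z_score(x, average, sd):
    """Number of standard deviations that x lies from the average."""
    return (x - average) / sd

# Exam score example: average 77.2, SD 10.1, observation 43
z = z_score(43, 77.2, 10.1)

print("z =", round(z, 2))
print("outlier by the +/- 3 rule:", abs(z) > 3)
```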

OUTLIERS (Continued)

FENCES IN BOXPLOTS

Defn: The Upper and Lower Fences in a boxplot are the points 1.5(IQR’s) beyond the first and third quartile.

Lower fence = Q1 – 1.5IQR & Upper fence = Q3 + 1.5IQR.

Uses: The fences are used to determine if any points are outliers in a data set.

Any points beyond the fences are considered outliers.

Example: Recall for the exam score data, the five number summary was

{ 46.00 70.00 76.50 85.75 93 }.

Hence the fences are

Upper fence = 85.75 + 1.5(85.75-70.00) = 109.375

Lower fence = 70.00 - 1.5(85.75-70.00) = 46.375.

Note that from our Stem/Leaf plot, only one value falls beyond the fences.

| |

|4|6 |

|5|3 |

|6|00344 67788 8888 |

|7|00112 22223 33355 56778 89999 |

|8|00222 34555 66677 77788 999 |

|9|01112 33 |

| |

|9|0 = 90% on Exam #1 in STA 261 Winter 98 |
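The fence calculation can be checked with a short Python sketch. It takes Q1 = 70.00 and Q3 = 85.75 from the five number summary above as given (they are not recomputed here) and decodes the stem and leaf plot into individual scores. The names `stem_leaf` and `fences` are illustrative.

```python
# Stem and leaf plot above, one string of leaves per stem (9|0 = 90%)
stem_leaf = {
    4: "6",
    5: "3",
    6: "00344677888888",
    7: "0011222223333555677889999",
    8: "00222345556667777788999",
    9: "0111233",
}
scores = [10 * stem + int(leaf)
          for stem, leaves in stem_leaf.items()
          for leaf in leaves]

def fences(q1, q3, k=1.5):
    """Lower and upper fences: k IQRs beyond the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles taken from the five number summary in the notes
lower, upper = fences(70.00, 85.75)
outliers = [x for x in scores if x < lower or x > upper]

print("lower fence:", lower, "upper fence:", upper)
print("values beyond the fences:", outliers)
```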

MODIFIED BOXPLOT

Defn: A Modified Boxplot is an enhancement of an ordinary boxplot that includes information about outliers or extreme values, using the fences.

[pic]

[pic][pic]

Example 1: Using the IQ data and 5-number summary { 95, 105, 115.5, 127.5, 140 }, the resulting boxplot is:

Example 2: Using the STA 261 exam data we find minimum = 43, maximum = 93, median = 78, Q1 = 70, and Q3 = 86. The resulting Modified Boxplot is:

[pic]

Interpretation: What can a boxplot tell us about this distribution? What about:

Center?

Spread?

Shape?

Note: While boxplots do a fine job of indicating the center and spread, the only thing they can tell us about shape is whether a distribution is symmetric. They cannot tell us the modality nor whether the distribution is normal.

To illustrate consider the three Minitab examples below.

Can you determine which one is Normally shaped?

[pic]

[pic]

[pic]

AN EXAMPLE TO ILLUSTRATE WHAT NUMERIC SUMMARIES CAN TELL YOU

Below is the numeric summary of Price and G Skid Pad performance from Car and Driver, November 2002. From this information can you reproduce the pictures of the Price and G Skid Pad data?

|[pic] | |

| | |

|[pic] | |

Here are the actual histograms and boxplots of the data. How close did you come?

[pic]

[pic]

NUMERIC SUMMARY OF NUMERIC POPULATIONS

To this point, we have been talking about the measurements of center and spread in the context of a sample. Consequently, they are characteristics of the sample and hence, statistics.

The same measurements could have been computed on a population. In this case, they would be parameters.

MEAN

Defn: The mean is the average of all N values in the population.

The population mean is denoted by μ (lower case Greek letter mu).

Notation and formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.

POPULATION VARIANCE and POPULATION STANDARD DEVIATION

Defn: The POPULATION VARIANCE is the average squared distance of values in the population to the mean of the population. The POPULATION STANDARD DEVIATION is the positive square root of the variance.

The population variance is denoted by σ² (σ is the lower case Greek letter sigma), and the population standard deviation is denoted by σ.

Notation and formula: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, note the divisor is N!

Note: The reason for using n-1 in calculating the variance of a sample will be explained later in the section on sampling distributions and estimation.
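The difference between the two divisors can be seen numerically. The sketch below uses Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by n - 1. Treating the eight ages as an entire population is a hypothetical assumption made only to compare the two formulas on the same numbers.

```python
from statistics import pvariance, pstdev, variance

# Ages from the small data set, treated here (hypothetically) as a whole population
ages = [15, 14, 16, 15, 15, 16, 13, 15]

sigma2 = pvariance(ages)  # divisor N = 8
s2 = variance(ages)       # divisor n - 1 = 7

print("population variance:", round(sigma2, 4))
print("population SD:", round(pstdev(ages), 4))
print("sample variance:", round(s2, 4))
```

For the same numbers, the n - 1 divisor always gives the larger value.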

ALTERNATE USE OF AVERAGE AND STANDARD DEVIATION

While the average (mean if we are talking about a population) and standard deviation are excellent and commonly used measures of center and spread, they also serve another purpose as given in the following Theorem.

Chebyshev's Theorem: For any distribution, at least $1 - \frac{1}{k^2}$ of the measurements must lie within k standard deviations of the mean (for k > 1). That is, at least $1 - \frac{1}{k^2}$ of the measurements must lie

within $\mu \pm k\sigma$ (or $\bar{x} \pm ks$ for a sample).

| |

|4|6 |

|5|3 |

|6|00344 67788 8888 |

|7|00112 22223 33355 56778 89999 |

|8|00222 34555 66677 77788 999 |

|9|01112 33 |

| |

|9|0 = 90% on Exam #1 in STA 261 Winter 98 |

Example: For the exam score data, the average and sample stdev were 76.57 and 10.31. For k = 2 we should observe at least $1 - \frac{1}{2^2} = \frac{3}{4}$ of the exam scores between $76.57 \pm 2(10.31)$

or [ 55.95 , 97.19 ]. Inspecting the Stem/Leaf plot of the exam score data we find that all but two of the exam scores are between [ 55.95 , 97.19 ]. Hence we have 54/56 or 96.43% of the data between these two values. Note that 96.43% is greater than the ¾ or 75% that Chebyshev’s theorem guarantees!
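The Chebyshev calculation for k = 2 can be sketched in Python. The function names are illustrative, and the average (76.57) and standard deviation (10.31) are taken from the example as given. The interval is computed directly as 76.57 ± 2(10.31).

```python
def chebyshev_bound(k):
    """Minimum fraction of measurements within k SDs of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def k_sd_interval(center, sd, k):
    """Interval from k SDs below the center to k SDs above it."""
    return center - k * sd, center + k * sd

bound = chebyshev_bound(2)
low, high = k_sd_interval(76.57, 10.31, 2)

print("at least", bound, "of the data lie in",
      [round(low, 2), round(high, 2)])
```

The bound is a guaranteed minimum for any distribution, so the observed percentage in a real data set is usually much higher.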

[pic]

STATISTICS IS ALL ABOUT RELATIONSHIPS!

Here’s a final example to illustrate that relationships can be hard to see.

Background:

[pic]

[pic]

DESCRIPTIVE DATA SUMMARY

I. Single Variable or Set of Data/Information (ie, One Column in a Spreadsheet)

GOAL: Using pictures and/or numbers, summarize the Distribution (Center, Spread, and Shape) of a set of Data

A. Categorical Data

1. Graphical Summary

a. Pie Chart

b. Bar Chart

2. Numeric Summary

a. Table with % or proportions

B. Continuous Data

1. Graphical Summary

a. Dot Plot

b. Frequency (or Relative Frequency) Table

c. Histogram

d. Stem and Leaf Plot

e. Box Plot

2. Numeric Summary

a. Center: Average, Median, Mode, Trimmed Average

b. Spread: Range, Variance, Standard Deviation, Inter Quartile Range (IQR)

c. Shape: Skewness and Kurtosis (Normality)
