


Statistics Activity

A boxplot is a graphical summary of the key properties of a set of data points. It displays the median, the first and third quartiles, and the upper and lower bounds of the data (excluding outliers, or extreme values).

The following introduction to some variations of the process is copied from another source. It is useful to understand that different technologies may use different algorithms and therefore give different results.

How to Draw a Boxplot

There is a commonly accepted method of drawing the whiskers on a boxplot. However, there is a plethora of methods for drawing the box. Some of these methods were developed because they extend nicely to percentiles other than 25% and 75%, others were chosen because of theoretical considerations and some were developed for simplicity. Bob Hayden chose his method partly for metaphysical reasons! In this paper I will opt for simplicity, with a bit of metaphysics thrown in for good measure.

For an extensive discussion on drawing a boxplot, gleaned from emails to the ap-stats Internet mailing list, I suggest you read the article Ticky-Tacky Boxes.

Drawing the Box

We have to start with the box, as constructing the whiskers requires that we use the box to determine whether any data values are outliers. I will describe two methods, both simple, from which you can choose. The first is historically correct and has Bob's nice metaphysical property, while the second is the method used by the TI-82 and TI-83 graphing calculators. If your students use this technology and you want them to get the same answers as their calculators, then you should opt for the second method.

Tukey's Method

The method recommended by Tukey, who invented the boxplot, is as follows:

Find the median. Then find the median of the data values whose ranks are LESS THAN OR EQUAL TO the rank of the median. This will be a data value or it will be half way between two data values.

With a dataset with an odd number of values, include the median in each of the two halves of the dataset and then find the median of each half. This gives the first and third quartiles. If the dataset has an even number of values, just split the data into two halves, and find the median of each half.

Here is an example using a small dataset, which contains an odd number of values:

35 47 48 50 51 53 54 70 75

Split the data into two halves, each including the median:

35 47 48 50 51 and 51 53 54 70 75

Find the median of each half. In this example, the first quartile is 48 and the third quartile is 54. Hence the interquartile range is 54-48 = 6.

I’ll add a number to the above dataset to illustrate how to find the quartiles for an even number of values (what the heck, the data is bogus anyway):

35 47 48 50 51 53 54 60 70 75

Split the data into two halves:

35 47 48 50 51 and 53 54 60 70 75

Now find the median of each half. In this example, the first quartile is 48 and the third quartile is 60. Hence the IQR is 60-48 = 12.
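The two worked examples above can be checked with a short Python sketch. The function name tukey_quartiles is my own label for illustration; the only library call is median from Python's standard statistics module.

```python
from statistics import median

def tukey_quartiles(values):
    """Return (Q1, median, Q3); the median is included in both halves when n is odd."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2        # size of each half; the halves overlap at the median when n is odd
    lower = data[:half]        # values ranked at or below the median
    upper = data[n - half:]    # values ranked at or above the median
    return median(lower), median(data), median(upper)

odd = [35, 47, 48, 50, 51, 53, 54, 70, 75]
even = [35, 47, 48, 50, 51, 53, 54, 60, 70, 75]

q1, med, q3 = tukey_quartiles(odd)
print(q1, med, q3, q3 - q1)    # 48 51 54 6

q1, med, q3 = tukey_quartiles(even)
print(q1, med, q3, q3 - q1)    # 48 52.0 60 12
```

For an even number of values the two halves do not overlap, so the same function covers both cases.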

Bob Hayden prefers this method because the five number summary of five numbers gives the five numbers themselves. For example, take the dataset 1 4 78 81 345. The minimum is 1, the maximum is 345, the median is 78. Splitting the dataset into two halves each containing the median gives Q1 as 4 and Q3 as 81. Very neat. This is the metaphysical property that he has noted.
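Bob's property is easy to confirm with a small sketch. The function five_number_summary and its include_median flag are names I have made up for this illustration; setting the flag to False leaves the median out of both halves, which is the only change needed to see the property disappear.

```python
from statistics import median

def five_number_summary(values, include_median=True):
    """Return (min, Q1, median, Q3, max) using the split-into-halves approach."""
    data = sorted(values)
    n = len(data)
    # With an odd n, include_median decides whether the middle value joins both halves.
    half = (n + 1) // 2 if include_median else n // 2
    q1 = median(data[:half])
    q3 = median(data[n - half:])
    return data[0], q1, median(data), q3, data[-1]

print(five_number_summary([1, 4, 78, 81, 345]))                        # (1, 4, 78, 81, 345)
print(five_number_summary([1, 4, 78, 81, 345], include_median=False))  # (1, 2.5, 78, 213.0, 345)
```

With the median excluded from the halves, Q1 and Q3 fall between data values and the summary no longer reproduces the five numbers.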

Alternative Method, As Drawn by the TI-82 and TI-83

If you wish your students to get the same answers as the TI-82 and TI-83 graphing calculators, then this is an acceptable alternative. Note that it is non-standard in the world of statistics.

Find the median. Then find the median of the data values whose ranks are LESS THAN the rank of the median. This will be a data value or it will be half way between two data values.

Here is the same example using the above numbers, which contains an odd number of values:

35 47 48 50 51 53 54 70 75

Split the data into two halves, not including the median:

35 47 48 50 and 53 54 70 75

Find the median of each half. In this example, the first quartile is 47.5 and the third quartile is 62. Hence the interquartile range is 62-47.5 = 14.5.
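Here is the same calculation as a minimal Python sketch. It follows the rule as described above (leave the median out of both halves); the function name ti_quartiles is my own, and I have not checked it against a calculator's internal algorithm.

```python
from statistics import median

def ti_quartiles(values):
    """Return (Q1, median, Q3); the median is excluded from both halves when n is odd."""
    data = sorted(values)
    n = len(data)
    half = n // 2              # size of each half; the middle value is dropped when n is odd
    lower = data[:half]        # values ranked strictly below the median
    upper = data[n - half:]    # values ranked strictly above the median
    return median(lower), median(data), median(upper)

q1, med, q3 = ti_quartiles([35, 47, 48, 50, 51, 53, 54, 70, 75])
print(q1, med, q3, q3 - q1)    # 47.5 51 62.0 14.5
```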

Note the difference in the answers between the two methods! It is not really surprising when you consider that we are doing a 5 number summary on a set of only 9 numbers.
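Software adds further variations. As one illustration, Python's standard statistics module offers two quartile rules of its own, called "inclusive" and "exclusive". These interpolate between data values and are not the same algorithms as the two hand methods described here, although on the 9-value dataset they happen to give the same quartiles.

```python
from statistics import quantiles

odd = [35, 47, 48, 50, 51, 53, 54, 70, 75]
even = [35, 47, 48, 50, 51, 53, 54, 60, 70, 75]

for label, data in (("9 values", odd), ("10 values", even)):
    # n=4 asks for the three cut points that divide the data into quarters.
    inclusive = quantiles(data, n=4, method="inclusive")
    exclusive = quantiles(data, n=4, method="exclusive")
    print(label, "inclusive:", inclusive, "exclusive:", exclusive)

# For the 9 values:
#   inclusive -> [48.0, 51.0, 54.0]   (same quartiles as Tukey's method here)
#   exclusive -> [47.5, 51.0, 62.0]   (same quartiles as the TI method here)
```

On the 10-value dataset neither library rule reproduces the hand-computed quartiles of 48 and 60, which is a good reminder to find out which rule your technology uses.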

Drawing the Whiskers

The commonly accepted method among statisticians for drawing the whiskers is somewhat more complicated than that described in most senior secondary texts written to the Queensland syllabuses. The maximum length of each whisker is 1.5 times the interquartile range (IQR). To draw the whisker above the 3rd quartile, draw it to the largest data value that is less than or equal to the value that is 1.5 IQRs above the 3rd quartile. Any data value larger than that should be marked as an outlier. Some statisticians differentiate between ‘mild’ outliers and ‘severe’ outliers. Mild outliers lie between 1.5 and 3 IQRs above the 3rd quartile while severe outliers are more than 3 IQRs above the 3rd quartile. I don’t believe we need to make the distinction between mild and severe outliers in our courses.
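For anyone who does want the mild/severe distinction, it can be sketched as follows. The function name classify_outliers is hypothetical, and I am assuming the same 1.5 and 3 IQR rule applies below the 1st quartile as above the 3rd. The quartiles passed in are the Tukey values (48 and 54) found earlier for the 9-value dataset.

```python
def classify_outliers(values, q1, q3):
    """Split outliers into (mild, severe) lists using 1.5 IQR and 3 IQR limits."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # beyond these: at least mild
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr       # beyond these: severe
    mild, severe = [], []
    for x in sorted(values):
        if x < outer_low or x > outer_high:
            severe.append(x)
        elif x < inner_low or x > inner_high:
            mild.append(x)
    return mild, severe

data = [35, 47, 48, 50, 51, 53, 54, 70, 75]
print(classify_outliers(data, q1=48, q3=54))   # ([35, 70], [75])
```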

Here is an example using the first set of numbers above, with Q1 and Q3 determined by Tukey’s method:

35 47 48 50 51 53 54 70 75

The IQR is 6. Now 1.5 times 6 equals 9. This is the maximum length of the whisker. Subtract 9 from the first quartile: 48 - 9 = 39. Note that 35 is an outlier, and the whisker should be drawn to 47, which is the smallest value that is not an outlier.

Add 9 to the third quartile: 54 + 9 = 63. Any value larger than 63 is an outlier, so in this instance both 70 and 75 are outliers. Draw the whisker to the largest value in the dataset that is not an outlier, in this case 54. Since this value is the 3rd quartile, we draw no whisker at all! Mark 70 and 75 as outliers. The boxplot is given below:

[pic]
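The whisker arithmetic above can be reproduced with a short sketch. The function name whiskers and the parameter k are my own choices for illustration; the quartiles are passed in so that either method of drawing the box can be used.

```python
def whiskers(values, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers) for the given quartiles."""
    data = sorted(values)
    iqr = q3 - q1
    low_limit = q1 - k * iqr     # smallest value a whisker may reach
    high_limit = q3 + k * iqr    # largest value a whisker may reach
    inside = [x for x in data if low_limit <= x <= high_limit]
    outliers = [x for x in data if x < low_limit or x > high_limit]
    return min(inside), max(inside), outliers

data = [35, 47, 48, 50, 51, 53, 54, 70, 75]
print(whiskers(data, q1=48, q3=54))   # (47, 54, [35, 70, 75])
```

The upper whisker end equals the 3rd quartile, which is why no upper whisker is drawn.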

If we use the alternative method with the IQR of 14.5 then we have no outliers, and the boxplot looks like this:

[pic]
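The claim that there are no outliers under the alternative method is a quick check. This sketch simply recomputes the limits with Q1 = 47.5 and Q3 = 62; the variable names are mine.

```python
data = [35, 47, 48, 50, 51, 53, 54, 70, 75]
q1, q3 = 47.5, 62               # quartiles from the alternative (TI) method
iqr = q3 - q1                   # 14.5
low_limit = q1 - 1.5 * iqr      # 47.5 - 21.75
high_limit = q3 + 1.5 * iqr     # 62 + 21.75
outliers = [x for x in data if not low_limit <= x <= high_limit]
print(low_limit, high_limit, outliers)   # 25.75 83.75 []
```

Every value lies between 25.75 and 83.75, so the whiskers run all the way to 35 and 75.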

The message here may be: beware of constructing boxplots on a small set of numbers!

Why 1.5 IQRs?

"First of all, does anyone know why it is customary to multiply 1.5*IQR in order to find outliers? In particular, why is it 1.5 and not, for example, 2?"

From Paul Velleman: The "official" answer from John Tukey (when I asked) is: because 1 is too small and 2 is too large. There is a paper by Hoaglin and Iglewicz in which they did some simulations and showed that the outlier rule is really quite good across a pretty wide array of distributions. I don't know the full reference.
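One way to get a feel for "too small" and "too large" is to ask what fraction of a perfectly normal population would be flagged for different multipliers. The sketch below is my own illustration, not the simulation from the paper mentioned above; it uses NormalDist from Python's standard statistics module, and fraction_flagged is a made-up name.

```python
from statistics import NormalDist

def fraction_flagged(k):
    """Fraction of a standard normal population lying more than k IQRs outside the quartiles."""
    z = NormalDist()                          # mean 0, standard deviation 1
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    upper_limit = q3 + k * iqr
    # The normal curve is symmetric, so double the upper tail.
    return 2 * (1 - z.cdf(upper_limit))

for k in (1, 1.5, 2):
    print(k, f"{100 * fraction_flagged(k):.2f}%")
```

For normal data this flags roughly 4% of values with a multiplier of 1, about 0.7% with 1.5, and under 0.1% with 2.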

There are many statistical classroom activities that involve using technology to visualize the results. Often the data is collected in student activities. If each student generates 2 pieces of data, a class of 30 will give you 60 data values, which is often enough to analyze.

1. Collect a set of 40 values by rolling a pair of dice and adding the values. The bounds will be 2 and 12, and if the dice are fair you would expect a roughly bell-shaped distribution centered on 7 (strictly, the distribution of the sum of two dice is triangular rather than normal).

2. Using a graphing calculator, a spreadsheet or software program, enter the values, sort them and generate a box plot of them. Report the values, the median, the IQR and if possible include a copy of the box plot.

3. The following set of data will have a spread from 3 to 300, but it will not generate a normal bell curve. Collect 12 coins (4 pennies, 4 dimes, and 4 dollars) or 12 dollar bills – play money will do (4 each of $1’s, $10’s, and $100’s). Think of a way to randomly choose 3 of them at a time. (For example: number them and have a computer choose random numbers from 1 to 12.) Add the values of the three chosen coins/bills. Repeat this process until you have 30 sums. Generate a box plot of your results. Compare the median with 111, 3/12 of the sum of 12 coins/bills. Report the values, the median, the IQR, and if possible include a copy of the box plot.

4. Repeat with a different combination of coins/bills.
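If dice and coins are not to hand, the activities can be simulated. The following is only a sketch under my own assumptions: the seed is arbitrary, the name report is made up, and the quartiles use the include-the-median rule described under Tukey's Method. It also lists every possible draw of 3 from the 12 coins/bills to check the value of 111 quoted in activity 3.

```python
import random
from itertools import combinations
from statistics import mean, median

rng = random.Random(2024)   # arbitrary seed so the run can be repeated

# Activity 1: 40 sums of a pair of fair dice.
dice_sums = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(40)]

# Activity 3: 30 sums of 3 items drawn without replacement from 4 each of 1, 10 and 100.
money = [1] * 4 + [10] * 4 + [100] * 4
money_sums = [sum(rng.sample(money, 3)) for _ in range(30)]

def report(label, values):
    """Print the five-number summary and IQR (median included in both halves when n is odd)."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2
    q1, q3 = median(data[:half]), median(data[n - half:])
    print(label, "min", data[0], "Q1", q1, "median", median(data),
          "Q3", q3, "max", data[-1], "IQR", q3 - q1)

report("dice:", dice_sums)
report("money:", money_sums)

# Every possible draw of 3 from the 12 items, for comparison with the value 111.
all_sums = [sum(combo) for combo in combinations(money, 3)]
print(len(all_sums), mean(all_sums), median(all_sums))
```

Across all 220 possible draws both the mean and the median of the sums are 111, so a sample median far from 111 is worth discussing with the class.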
