


4.1 Presentation of Data

In this section, we give a brief overview of the most common methods for the visual presentation of data. We will close with a summary of the strengths and weaknesses of each method.

A bar graph is a graph in which numerical quantities are represented by heights (or lengths) of bars. A bar graph gives a visual representation of data that allows one to see patterns in the data quickly, without numerical analysis. The table below lists the number of calories in one slice of various types of bread.

ITEM CALORIES

Italian Bread 65

French Bread 95

Rye Bread 74

Wheat Bread 80

Sourdough Bread 64

Egg Bread 105

We used the above data to construct the bar graph below.

[pic]

Note that each variety of bread has a bar associated with it and the height of the bar shows the number of calories. For example, the height of the bar labeled ‘Wheat’ is 80 calories as measured by the vertical axis. At a glance, the bar graph shows that a slice of Egg Bread contains considerably more calories than a slice of Italian Bread and that the calorie content of a slice of Wheat Bread is about the same as that of a slice of Rye Bread. The bar graph could have been presented horizontally, as shown below.

[pic]
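As a rough illustration, the bread data can also be drawn as a horizontal bar graph made of text characters. The Python sketch below is not part of the original text: the function name `text_bar_graph` and the scale of one `#` per 5 calories are assumptions chosen only for display.

```python
# Calories per slice, taken from the bread table in this section.
bread_calories = {
    "Italian": 65,
    "French": 95,
    "Rye": 74,
    "Wheat": 80,
    "Sourdough": 64,
    "Egg": 105,
}

def text_bar_graph(data, units_per_mark=5):
    """Return one text line per category; bar length is proportional to the value.

    units_per_mark is an assumed display scale (one '#' per 5 calories).
    """
    width = max(len(name) for name in data)
    lines = []
    for name, value in data.items():
        bar = "#" * round(value / units_per_mark)
        lines.append(f"{name.ljust(width)} | {bar} {value}")
    return lines

for line in text_bar_graph(bread_calories):
    print(line)
```

Because the bar lengths are rounded to whole marks, nearby values such as 64 and 65 produce bars of the same length; the number printed after each bar keeps the exact value readable.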

A multiple bar graph is a bar graph which displays two (or more) sets of data on the same graph. Different colors or styles are used to distinguish the sets. This allows one to see trends within a data set or between two data sets. In the example below, the solid bars represent Company A while the dotted bars represent Company B.

[pic]

In the above graph, we see that Company A peaked in 1993, and that Company B has been on a steady upward trend. Company A’s profit was significantly higher than Company B’s from 1989 to 1993.

A line graph is similar to a bar graph except that a single point is plotted instead of a bar. Adjacent points are then connected by line segments, as shown in the example below.

[pic]

Line graphs are used to present data that can be presented in some natural sequence, usually time periods. This allows one to see trends easily, as in the above example. If there were no natural sequence, it would not make sense to connect the dots.

A multiple line graph is a line graph that presents more than one set of data using different styles of lines. An example is given below.

[pic]

In a pictograph, quantities are measured by the number of icons that appear in a given category. The graph should give the value of one icon, as below.

Pictographs are more eye-catching than bar graphs, but they require a little calculation on the part of the reader. In the above example, we see six and one-half money bags over 1992, so in 1992 the profit was 6.5 times $2,000.00, or $13,000.00.

A pie chart is a circular graph that shows the portions of a whole taken up by various categories.

[pic]

Pie charts are ideal for depicting relative quantities. In the above example, one can see that there are about twice as many Juniors (37%) as Seniors (18%). We also see that a little more than one-half of the population of the dorm consists of Juniors and Seniors. The entire pie represents 200 students; thus, the total number of Juniors in the dormitory is 37% of 200 students, or 74 students.

When deciding how to present data of one variable, one should consider the strengths and weaknesses of the various types of graphs.

Bar Graphs

• Good for displaying total quantities.

• Data can be read off at a glance.

• Numerical quantities are easily compared.

• Two (or more) sets of data can be analyzed using a multiple bar graph.

Line Graphs

• Excellent for presenting data that involves time.

• Trends can be easily analyzed.

• Easy to construct.

• Good at displaying total quantities.

• Data can be read off at a glance.

• Numerical quantities are easily compared.

• Two (or more) sets of data can be analyzed using a multiple line graph.

• Not ideal for presenting relative quantities.

Pictographs

• Eye-catching.

• Good for displaying total quantities.

• Data is not easy to read off at a glance.

• Not easy to construct.

• Not ideal for presenting relative quantities.

Pie Charts

• Ideal for displaying relative quantities.

• Not ideal for displaying total quantities.

Problems

1. In 1995, the Smith family had a net income of $50,000 per year. They spent $25,000 on their mortgage, $12,000 on living expenses, set aside $10,000 for their retirement, and spent $3,000 on miscellaneous items. Display these data in a bar graph, line graph, pictograph, and a pie chart. Explain which of these you think best presents the data.

2. The Health Resources and Services Administration reported the three-year patient survival rates for all U.S. organ transplants from Oct. 1, 1987 to Dec. 31, 1991.

(a) Present the following data in a graph.

|Organ |Survival Rate |

|Heart |74 % |

|Liver |67 % |

|Kidney |87 % |

|Pancreas |82 % |

|Heart & Lung |42 % |

|Lung |39 % |

(b) Construct a graph which displays the following data:

|Organ |Patients on wait-list |

| |as of 1/31/96 |

|Heart |3,448 |

|Liver |5,755 |

|Kidney |31,119 |

|Pancreas |284 |

|Heart & Lung |197 |

|Lung |1,938 |

(c) The survival rate is higher for patients who receive a heart and lung transplant than for patients who receive only a lung transplant. Does the data in (b) help explain this?

3. The Insurance Institute for Highway Safety conducted 5 mph crash tests on different types of 4-door utility vehicles. The damage repair costs are reported below:

Make Repair

Isuzu Rodeo/Honda Passport $8,200

Chevrolet Blazer/GMC $4,100

Toyota 4Runner $7,200

Ford Explorer $5,700

Jeep Grand Cherokee $5,800

Land Rover Discovery $6,600

(a) Present the above data in a graph.

(b) Suppose that you are an advertiser for Toyota. Present the data in a way which downplays the differences in repair cost.

4. This graph is a cross between a pictograph and a bar graph.

[pic]

Why is it misleading? (Hint: Think about how much milk the cartons pictured would contain.)

5. The following table shows how much money (in millions of dollars) the U.S. Justice

Department has received from asset forfeitures in criminal cases. Display the data in a

graph.

Year Assets ($)

1986 93.7

1987 177.6

1988 205.9

1989 580.8

1990 459.6

1991 643.6

1992 531.0

1993 555.7

1994 549.9

6. The following two graphs display the same data.

[pic]

[pic]

(a) What is the difference between the 2 graphs?

(b) What impression would you get from glancing at the second graph?

(c) See if you can present the same data in a way which emphasizes the growth in profit

from 1989-1995 more than the first graph.

7. Consider the line graph below.

[pic]

(a) Why is a line graph not a good choice to present the above data?

(b) Present the data in a more appropriate graph.

8. A common way to present pie charts is to picture them in three dimensions, as below.

[pic]

(a) Try to estimate what percentage of the pie is taken up by each of the three different

pieces.

(b) We took the pie chart in (a) and pictured it in the normal manner. Now guess how

much of the pie is taken up by the 3 pieces.

[pic]

(c) Explain why you think there was a difference between (a) and (b). (In actuality, the

pieces are all the same size!)

9. (a) Display the following data:

INDUSTRIAL & COMMERCIAL DEVELOPMENT

(millions of dollars)

|Year |Howard County |Prince William County |Loudoun County |

|1991 |57 |64 |91 |

|1992 |44 |69 |38 |

|1993 |36 |79 |61 |

|1994 |20 |66 |55 |

|1995 |26 |99 |57 |

SOURCE: County economic development offices

(b) Suppose that you want to open a catering service in one of the three counties. Based

on your graph, which one would you choose to locate in?

(c) What other information would you like to have about the counties before you make

your decision?

10. The following two bar graphs display the same data.

[pic]

[pic]

(a) Explain the difference between the two graphs.

(b) If you were the President of the above company from 1989-1995 and you had to

display one of the above graphs to your shareholders, which would you choose? Why?

4.2 Analytic Tools

In this section, we develop some of the standard tools for the analysis of data.

Mean (average)

Mean, also known as average, is a measure of central tendency of data. To compute the mean of a (finite) set of data, sum the data entries and divide by the number of entries. For example, suppose that a student has five exam scores of 92, 89, 90, 95, and 0 (he missed one exam). Then we find the mean as follows:

mean = (92 + 89 + 90 + 95 + 0) / 5 = 366 / 5 = 73.2
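The rule "sum the entries and divide by the number of entries" can be checked with a few lines of Python. This is only a sketch; the names `mean` and `exam_scores` are chosen here for illustration and do not come from the text.

```python
def mean(values):
    """Sum the data entries and divide by the number of entries."""
    return sum(values) / len(values)

# The five exam scores from the example (the 0 is the missed exam).
exam_scores = [92, 89, 90, 95, 0]
print(mean(exam_scores))
```

Note how the single score of 0 pulls the mean well below the other four scores, all of which are 89 or higher.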

Median

Another measure of central tendency is median. The idea behind the median is to find a number such that (roughly) half of the data entries are above this number and (roughly) half are below. The method for computing a median depends on whether the number of entries is odd or even. In either case, one starts by arranging the entries in increasing order. When the total number of entries in the list is odd, the median is the number that appears exactly in the middle of the ordered list. When the total number of entries in the list is even, the median is the average of the pair of numbers in the middle of the ordered list.

Example (odd case): Calculate the median of the numbers: 92, 89, 95, 90, 0.

Solution: We arrange the data in increasing order:

0, 89, 90, 92, 95

The third entry, 90, is the median because it occurs in the middle of the ordered list.

Example (even case): Calculate the median of the numbers:

282, 550, 621, 427, 112, 891.

Solution: We arrange the data in increasing order:

112, 282, 427, 550, 621, 891

Because there is an even number of entries, no single number appears in the middle of the list. So, we average the pair of numbers in the middle, 427 and 550:

median = (427 + 550) / 2 = 488.5

It is not so easy to pick out the middle of a large list of data by just looking at the list. Again, the method depends on whether the number of entries is odd or even. Assume that the data is arranged in order. When the number of entries in the list is odd, divide the number of entries by 2 and round up the resulting number. This is the position of the middle entry. When the number of entries in the list is even, divide the number of entries by 2. This is the position of the first of the pair of middle entries.

Example: Calculate the median of the numbers:

60, 72, 51, 30, 3, 12, 21, 42, 63, 54, 6, 15, 33, 36, 45, 75, 24, 69, 9, 18, 39, 27, 48, 57, 66.

Solution: We arrange the data in increasing order:

3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75.

There are 25 entries in the list (25 is odd). To find the position of the middle entry, divide 25 by 2. The result is 12.5. Round up to 13. The 13th entry (counting left to right), namely 39, is in the middle of the list. Thus, the median of the data is 39.

Example (even case): Calculate the median of the numbers:

8, 4, 7, 9, 3, 2, 7, 9, 2, 1, 9, 1, 9, 2, 6, 1, 0, 2, 76, 2, 7, 1, 1, 8, 2, 7, 1, 6, 7, 9.

Solution: We arrange the data in increasing order:

0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, 9, 76.

There are 30 total entries in the list. Since 30 is even, we divide 30 by 2 to get 15. The 15th entry and the 16th entry are the pair of numbers in the middle of the list. Counting left to right, these entries are 4 and 6, respectively. We average these numbers to arrive at the median, which is 5.
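The position rule for odd and even lists can be written out as a short Python sketch. The function name `median` is an illustrative choice made here, not something defined in the text; Python's standard library also offers `statistics.median`, which should give the same results.

```python
import math

def median(values):
    """Median found by the position rule: sort, then locate the middle entry or pair."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd case: divide by 2 and round up to get the (1-based) middle position.
        position = math.ceil(n / 2)
        return ordered[position - 1]
    # Even case: n / 2 is the (1-based) position of the first of the two middle entries.
    position = n // 2
    return (ordered[position - 1] + ordered[position]) / 2

print(median([92, 89, 95, 90, 0]))
print(median([282, 550, 621, 427, 112, 891]))
```

The subtraction of 1 appears only because Python lists count positions from 0, whereas the text counts from 1.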

Frequency Distributions

A frequency distribution is a table that shows how often various values occur in a set of data. For instance, suppose that one has recorded the daily temperatures (highs) in Baltimore, Maryland for the last two weeks of July:

Day Temp (high) degrees Fahrenheit

1 81

2 84

3 85

4 87

5 79

6 77

7 89

8 89

9 94

10 92

11 94

12 96

13 90

14 92

The range of the data is from 77 degrees to 96 degrees, so to create a frequency distribution we write the numbers from 77 to 96 and beside each one we record the number of times the number occurs in our data:

Frequency Distribution

Temp Frequency

77 1

78 0

79 1

80 0

81 1

82 0

83 0

84 1

85 1

86 0

87 1

88 0

89 2

90 1

91 0

92 2

93 0

94 2

95 0

96 1

The above frequency distribution is accurate but not very informative: because no temperature occurs very often, the table is filled mostly with ones and zeros. It would be better to list intervals of temperatures and the number of recordings that fell within each interval. If we choose our intervals too large, as below, then all of the data fall within one or two intervals and, again, the frequency distribution is not very revealing.

Temp range Frequency

61 - 80 2

81 - 100 12

So, instead, we choose to group the data by four-degree intervals and form the new frequency distribution below. Three- or five-degree intervals would serve us just as well; this is a judgment call. As a general rule of thumb, choose your interval lengths so that you have five to ten intervals. All intervals should have the same size. At first glance, the intervals below may appear to be three degrees wide. For instance, the difference of the end points of the first interval is 79 - 76 = 3, but this interval actually includes four values: 76, 77, 78, 79. In general, to calculate the number of values in a range of integers, take the difference of the end points and add one.

Temp Frequency

76 - 79 2

80 - 83 1

84 - 87 3

88 - 91 3

92 - 95 4

96 - 99 1

Now it is easy to see patterns in the frequency distribution. For instance, the temperature range 92 - 95 occurs more often than any other. Also, the two temperature ranges 84-87 and 88-91, occur with equal frequency. Some conclusions that we might draw from this frequency table would be: (1) One would expect the high temperature in Baltimore on a random day at this time of year to fall within the range 84-95. (2) The high temperature in Baltimore at this time of year is about as likely to fall within the range 84-87 degrees as the range 88-91. These generalizations about the weather in Baltimore are crude since the weather in Baltimore during the summer that the data was collected may have been atypical. Later we will introduce more rigorous notions of what is “likely” or “typical”. The science of drawing conclusions from samples of data lies within the realm of Statistics.
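The grouping of the Baltimore temperatures into four-degree intervals can be reproduced with a small Python sketch. The function `grouped_frequencies` and its parameters (`start`, `width`, `count`) are hypothetical names introduced here for illustration only.

```python
from collections import Counter

# Daily highs for the 14 days, taken from the table in this section.
temps = [81, 84, 85, 87, 79, 77, 89, 89, 94, 92, 94, 96, 90, 92]

def grouped_frequencies(data, start, width, count):
    """Count how many values fall in each of `count` intervals holding `width` integers."""
    # Each value is assigned the index of the interval that contains it.
    tally = Counter((value - start) // width for value in data)
    table = []
    for i in range(count):
        low = start + i * width
        high = low + width - 1  # the interval low..high holds high - low + 1 integers
        table.append((low, high, tally[i]))
    return table

for low, high, freq in grouped_frequencies(temps, start=76, width=4, count=6):
    print(f"{low} - {high}  {freq}")
```

Changing `width` to 3 or 5 (and adjusting `start` and `count`) is an easy way to experiment with the judgment call about interval size. Values outside the listed intervals are simply not reported by this sketch.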

Histograms

A histogram is a bar graph of a frequency distribution. The length of each bar on the graph represents the frequency with which data falls within the corresponding interval. The purpose of turning a frequency distribution into a histogram is to give a more visual representation of the data. This often allows one to see patterns in the data quickly, without numerical analysis. The histogram for the frequency distribution above is shown below. We have plotted the temperature ranges along the horizontal axis and the frequencies along the vertical axis.

[pic]

The histogram exhibits the same information as the frequency distribution but does so more expediently and allows for the visual recognition of patterns.

Stem-and-leaf Plots

A stem-and-leaf plot is a special type of frequency distribution. It stores all the same information as an ordinary frequency distribution but, in addition, it conveys the visual information of a histogram. Moreover, an easy way to produce a histogram is to first form a stem-and-leaf plot, as in the following example.

We will form a stem-and-leaf plot from the data below, which lists the calories in the items served at the fast-food restaurant Bumpy’s Burgers.

Item Calories

Bumpy burger 413

Bumpy burger with cheese 485

Double Bumpy burger 591

Bumpy burger deluxe 660

Chicken sandwich 439

Chicken platter 584

Taco salad 351

Bowl ‘O Chili 383

Chef’s salad 266

Fish-wish dish 356

Beer battered fish 405

Bumpy’s hot wings 429

Steak and cheese Sub 562

Tuna sub 412

BLT sandwich 290

First, we observe that the range of the calories is 266-660. The digits 2 through 6 will become the stems in our plot and we list them next to a vertical line:

[pic]

We enter the datum 413 by chopping off the 4 and writing the number 13 as below.

[pic]

Next, we enter the datum 485 by writing the number 85 next to the number 13, separating them with a comma:

[pic]

Continuing in this fashion, we have the final stem-and-leaf plot:

[pic]

A single entry to the right of the vertical bar in a stem-and-leaf plot is called a leaf. The stem-and-leaf plot above is said to be a two-leaf stem-and-leaf plot because each leaf is a two-digit number. We could have formed a one-leaf stem-and-leaf plot with the same data by writing the stems 20, 21, 22, ..., 64, 65, 66, next to the vertical bar. Then, the entry 413, for instance, would have been plotted by writing the number 3 to the right of the vertical bar in the row that begins with the stem 41. [Why would a one-leaf stem-and-leaf plot have been a bad choice in this example?]

From the stem-and-leaf plot, one can see that the calorie recordings are bulked around the 400-499 range and that there is a gradual rise and fall in frequencies as one sweeps from the lowest interval to the highest. The purpose of forming a stem-and-leaf plot is to enable one to make these types of observations.

The number of entries in each row of the stem-and-leaf plot corresponds to the frequency with which the data falls into the corresponding interval; this is what makes it so easy to turn the stem-and-leaf plot into a histogram, as we have done below. This time, however, we have drawn the bars left to right just to show that one always has the option of plotting the frequencies along the horizontal axis instead of the vertical. This is a matter of preference.

[pic]

A stem-and-leaf plot can be an extremely useful tool when calculating the median of a data set, since it (roughly) arranges the data in order.
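The construction of the two-leaf plot for the Bumpy’s Burgers data can be sketched in Python as follows. The function name `stem_and_leaf` and its `leaf_digits` parameter are assumptions made for this illustration; the text itself builds the plot by hand.

```python
from collections import defaultdict

# Calorie counts in the order they appear in the Bumpy's Burgers table.
calories = [413, 485, 591, 660, 439, 584, 351, 383, 266, 356,
            405, 429, 562, 412, 290]

def stem_and_leaf(data, leaf_digits=2):
    """Group each value by its stem; the last `leaf_digits` digits form the leaf."""
    base = 10 ** leaf_digits
    rows = defaultdict(list)
    for value in data:
        rows[value // base].append(value % base)
    # List every stem from smallest to largest, including any with no leaves.
    stems = range(min(rows), max(rows) + 1)
    return {stem: rows.get(stem, []) for stem in stems}

for stem, leaves in stem_and_leaf(calories).items():
    # Leaves are padded to two digits so that 405 is shown as the leaf 05.
    print(f"{stem} | " + ", ".join(f"{leaf:02d}" for leaf in leaves))
```

The number of leaves in each row is the frequency for that interval, which is why the printed rows already resemble a histogram turned on its side.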

Scatter plots

A scatter plot is a graph of the data points that is used to examine (visually) the relation between two variables. For instance, suppose that we have collected the heights and weights of seven individuals as listed below.

[pic]

We will let x represent a person’s height (in inches) and y represent the person’s weight. To form a scatter plot of this data, we plot the points (71,165), (72,195), and so on. This means we will let the height vary along the horizontal axis and let the weight vary along the vertical axis. The scatter plot looks like this:

[pic]

Each point on the scatter plot represents the height and weight of an individual. If the data points in a scatter plot show a general upward trend when reading from the left to the right (as in the previous scatter plot), we say that there is a positive correlation between the variables. One would expect that height and weight would be positively correlated since, typically, the taller a person is the heavier they are.

Similarly, when the points display a general downward trend from left to right, we say there is a negative correlation between the variables. An example is given below.

Example: Each day, an owner of a hot dog stand decides to vary the price of one hot dog to see what effect there will be on the number of hot dogs sold that day. Below are the prices she charged along with the number of hot dogs she sold.

[pic]

The corresponding scatter plot is below.

[pic]

As the scatter plot indicates, the higher the price per hot dog, the lower the number of hot dogs sold. Thus, there is a negative correlation between price per hot dog and the number of hot dogs sold.

As we have seen above, scatter plots are helpful in determining possible relationships between two variables in a data set. In the next section we will apply our knowledge of linear models to study (possible) linear relationships.

Problems

1. Find the mean and median for the following list of numbers:

18, 12, 27, 32, 12, 5, 8

2. Find the mean and median for the following list of numbers:

1000, 1800, 1776, 1492, 1812, 1995

3. 54 students took an elementary logic exam on which one can score a maximum of 55

points. Their scores are recorded below.

|53 |47 |42 |39 |36 |

|52 |46 |41 |39 |35 |

|52 |45 |41 |39 |35 |

|52 |44 |41 |39 |34 |

|52 |44 |40 |39 |32 |

|51 |44 |40 |38 |31 |

|50 |43 |40 |38 |31 |

|50 |43 |40 |38 |30 |

|49 |43 |40 |37 |28 |

|48 |42 |39 |37 |22 |

|47 |42 |39 |36 | |

(a) For the above data, set up a frequency distribution using intervals of 21 - 23, 24 - 26,

27 - 29, and so on.

(b) Make another distribution using intervals of 20 - 25, 26 - 31, 32 - 37, and so on.

(c) Compare the two distributions (Explain the differences you see).

(d) Which distribution is more informative? Why?

4. A two-day real estate seminar was run each week for 36 weeks. After each seminar, follow-up calls were made and the percentage of attendees who later used the techniques from the seminar was recorded:

|70 |21 |15 |75 |88 |56 |

|66 |87 |33 |82 |43 |74 |

|43 |95 |78 |64 |72 |36 |

|45 |27 |52 |87 |84 |92 |

|38 |33 |60 |73 |71 |74 |

|40 |67 |78 |82 |53 |100 |

(a) Construct a stem-and-leaf plot for these data.

(b) Find the median of the data.

5. An employee at the registrar’s office recorded how many students he serviced each hour.

8:00 9:00 10:00 11:00 12:00 1:00 2:00 3:00

---------------------------------------------------------------------------------

Mon 33 10 8 17 20 12 30 6

Tues 14 36 11 2 22 0 12 11

Wed 18 16 10 23 9 14 10 20

Thurs 9 17 17 19 12 16 15 13

Fri 22 8 14 37 10 22 12 5

(a) Create a stem-and-leaf plot for the above data.

(b) Calculate the median number of students helped each hour.

(c) Construct a histogram for the data, using the stem-and-leaf plot.

(d) Find the average number of students serviced per hour.

(e) A supervisor wants an estimate of the number of students helped per hour on a typical day. What would you tell her?

6. A union steward has recorded the annual salaries of the union members that he oversees.

|$25,167 |$68,102 |$32,766 |$19,861 |

|$45,879 |$27,923 |$53,645 |$39,781 |

|$25,567 |$22,198 |$50,132 |$13,861 |

|$27,814 |$29,100 |$69,101 |$18,181 |

|$32,161 |$37,151 |$29,003 |$28,912 |

|$27,921 |$23,291 |$22,201 |$28,103 |

|$26,191 |

(a) Create a stem-and-leaf plot to display these data with appropriate intervals.

(b) Calculate the median annual salary for these union members.

(c) Calculate the mean (average) annual salary.

(d) Which of these do you think is a better measure of a typical union member’s salary,

the median or the average? Why?

7. For each of parts (a) - (d), write down a list of five numbers with the given criteria:

(a) The median is equal to the biggest number in the list.

(b) The median and the mean (average) of the numbers differ drastically.

(c) The mean and median are equal.

(d) The mean is equal to the biggest number on the list.

8. Below are the exam scores for a student in a history class.

87, 97, 86, 0, 84, 98

(a) Compute the mean and median of the exam scores. Which would the student prefer to

be used as his grade for the course, the median or the mean?

(b) Suppose that the instructor agreed to drop the lowest exam score. Now which would

the student prefer, the median or the mean?

9.

|Year |Total Footwear produced |Non-rubber Footwear |

| |(millions of pairs) | |

|1984 |383.5 |303.2 |

|1985 |336.5 |265.1 |

|1986 |310.9 |240.9 |

|1987 |312.1 |230.0 |

|1988 |325.3 |234.8 |

|1989 |312.8 |221.9 |

|1990 |290.3 |184.6 |

|1991 |282.1 |169.0 |

|1992 |273.6 |164.8 |

|1993 |252.0 |171.7 |

(a) Calculate the median amount of non-rubber footwear and the median number of total

footwear produced over the 10 years.

(b) What is the average production of both total and non-rubber footwear over the 10 year

period?

10. Crime in 1992. [Refer to data set #1 which can be found at the following web site:

]. Create a stem-and-leaf plot displaying the violent crime rate (per 100,000 people) for the 50 states and the District of Columbia.

(a) Who are the outliers (data points that lie far away from the others)?

(b) Looking at your plot, find the median crime rate.

(c) What effect does D.C. have on the mean? On the median?

(d) Which are the 5 most violent states?

(i) Can we conclude that these states had the most total number of crimes?

(ii) What do these states have in common?

(e) Why do you think the crime rate is so low in North Dakota?

(f) Assuming that the population of the U.S. in 1992 was 248,000,000, estimate how many violent crimes there were in 1992.

11. Vital Statistics. [Refer to data set #2 which can be found at the following web site: ] Create a stem-and-leaf plot for birth rates for the fifty states.

(a) Find the median for these data.

(b) Which state is typical in terms of birth rates?

(c) Which states are extreme? Do you have any explanation for this?

12. Resident Population. [Refer to data set #3 which can be found at the following web site: ] Create a histogram for the percent increase of population in the 50 states for 1991-1993.

(a) Which state was the fastest growing? The slowest? Why do you think that is?

(b) Calculate the median for the data.

(c) Which state is “typical” in terms of its growth over this period?

(d) The population of Vermont is relatively small. Does this mean that this state is not

popular to live in?

(e) For each of the states, you have been given the 1993 population and the percent increase of the population from 1990-1993. Compute the population of the following states in 1990: Texas, New Hampshire and Pennsylvania.

(f) How do you think the forecasts for the state populations for the year 2000 were arrived at? Make your own forecasts for populations of Iowa and Florida for the year 2000 using the 1993 populations and the percent increases over the three year period 1990-1993. How do your forecasts compare with those on the data sheet (made by the Census Bureau)?

Solutions

1. mean = 16.28571429; median = 12

2. mean = 1645.833333; median = 1788

3. (a)

|21-23 |24-26 |27-29 |30-32 |33-35 |36-38 |39-41 |42-44 |45-47 |48-50 |51-53 |

|1 |0 |1 |4 |3 |7 |15 |9 |4 |4 |6 |

(b)

|20-25 |26-31 |32-37 |38-43 |44-49 |50-55 |

|1 |4 |8 |24 |9 |8 |

(c) The frequency distribution in (a) contains more information; however, the distribution in (b) seems easier to analyze.

(d) The distribution in (a) contains more information because the ranges are smaller.

4. (a)

|10 |0 |

|9 |5, 2 |

|8 |7, 2, 7, 2, 4, 8 |

|7 |0, 8, 8, 3, 5, 2, 1, 4, 4 |

|6 |6, 7, 0, 4 |

|5 |6, 2, 3 |

|4 |3, 5, 0, 3 |

|3 |8, 3, 3, 6 |

|2 |7, 1 |

|1 |5 |

|0 | |

(b) The median is the average of the 18th and 19th values, that is, the average of 67 and 70, which is 68.5.

5. (a) Stem-and-leaf plot:

|3 |3, 0, 6, 7 |

|2 | 0, 2, 3, 0, 2, 2 |

|1 | 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 7, 8, 6, 7, 7, 9, 6, 5 |

|0 | 2, 0, 8, 6, 9, 9, 8, 5 |

(b) The median is the average of the 20th and 21st numbers, which are 14 and 14, so the

median is 14.

(c) Turn the stem-and-leaf plot on its side.

(d) The average is 15.3 students serviced per hour.

(e) No one correct answer, but you must give a reasonable whole number, so 15.3 and 45 are clearly wrong but an answer like 17 would be acceptable.

6. (a)

|6 |8102, 9101 |

|5 |3645, 0132 |

|4 |5879 |

|3 |2766, 9781, 2161, 7151 |

|2 |5167, 7923, 5567, 2198, 7814, 9100, 9003, 8912, 7921, 3291, 2201, 8103, 6191 |

|1 |9861, 3861, 8181 |

(b) The median is the 13th value = $28,103.00

(c) Average = $32,960.48

(d) The median disregards the exceptionally high salaries while the mean takes each into account. There really is no “better” measure; they measure different things, and in certain circumstances one may be more appropriate than the other. For example, one might say that a “typical” worker makes around $28,103.00 (the median), while for reporting union salaries for budgeting purposes, the mean might be more appropriate.

7. Note: These are not the only correct answers.

(a) 1, 2, 3, 3, 3 (median is 3).

(b) 1, 1, 1, 500, 6997 (median = 1, mean = 1500).

(c) 1, 2, 3, 4, 5 (median = mean = 3).

(d) 1, 1, 1, 1, 1 (mean = 1).

8. (a) median = 86.5; mean = 75.3333. The student would prefer a grade based on the median.

(b) median = 87; mean = 90.4. The student would prefer a grade based on the mean.

9. (a) median Total Footwear produced = 311.5 million pairs, median Non-rubber = 225.95 million pairs.

(b) average Total Footwear produced = 307.91 million pairs, average Non-rubber = 218.6 million pairs.

4.3 Linear Regression

Linear regression is the fitting of a line to data points in such a way that the line seems to represent the general pattern of the data. Since it is rare that data points fall exactly on a straight line, there is a certain degree of subjectivity as to which line best fits the data. For this reason, there are many methods of linear regression. We will study three such methods. Each method might produce a slightly different line and, in many cases, probably none of them will run through all of the data points.

Our first method of linear regression is the “eye-ball” method in which one uses artistic judgment to draw a line that seems to fit the data points on a scatter plot well. On the scatter plot below, we have used this method to fit a line to the data points from the hot dog example of section 4.2.

[pic]

One of the problems with this method is that the line drawn is not unique: different persons might draw different lines. Later, we will introduce two methods of linear regression that are more mathematically based. For right now, we will continue to work with the line we have drawn to see how it can be used.

The purpose of linear regression is to give a simple (linear) rule for the collected data so we can use it to predict data points that were not collected. Once we have a line drawn, by whatever method, we can find its equation from any two points that lie on the line. We now review this technique. [Skip ahead, if this is familiar to you].

Finding the equation of a line given two points on the line:

Example:

Let’s find the equation of the line drawn on the previous scatter plot. The line is clearly non-vertical and is, therefore, represented by an equation of the form y = mx + b. We need to calculate m (the slope) and b (the y-intercept).

Step (1): Choose any two points that lie on the line - they do not have to be actual data points. We choose the points (0.90, 50) and (1.3,10), which we label as (x1, y1) and (x2, y2), respectively (we selected these points on the line because their coordinates were easy to read).

Step (2): Compute the slope of the line using the formula:

m = (y2 - y1) / (x2 - x1) = (10 - 50) / (1.3 - 0.90) = -40 / 0.40 = -100

Step (3): Choose either of the points and plug it into the formula y = mx + b along with the value of m calculated above. Solve for b.

Using the point (0.90, 50):

50 = (-100)(0.90) + b

50 = -90 + b

b = 140

Step (4): Write out the equation of the line:

y = -100x + 140.
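Steps (1) through (4) can be collected into a short Python sketch. The function name `line_through` is hypothetical and introduced only for this illustration; the two points are the ones chosen in Step (1).

```python
def line_through(p1, p2):
    """Return (m, b) for the non-vertical line y = m*x + b through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line cannot be written as y = mx + b")
    m = (y2 - y1) / (x2 - x1)  # Step (2): slope from the two points
    b = y1 - m * x1            # Step (3): solve y1 = m*x1 + b for b
    return m, b

# Step (1): the two points read off the eye-ball line.
m, b = line_through((0.90, 50), (1.3, 10))

# Step (4): rounding hides tiny floating-point error in the decimal arithmetic.
print(f"y = {round(m, 6)}x + {round(b, 6)}")
```

Either of the two points could be used to solve for b; both lie on the line, so both give the same intercept.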

Once we have the equation of the linear regression line, we can use it to generate or predict data points that were not actually collected. Interpolation is the generation of data points between the collected data points (i.e. within the range of collected data points). Extrapolation is the generation of data points beyond the range of collected data points.

As an example of interpolation, we return to the hot dog example above. Strictly speaking, we do not know how many hot dogs will be sold when the price is $1.25 each. However, we can use the line that we have fit to the data to get a reasonable estimate. In particular, we will find the point on the line that corresponds to a hot dog price of $1.25, then use the vertical coordinate of the point as our estimate. We could just read this point off the graph but a more accurate method is to solve for the point algebraically. We set the variable x (cost) to be 1.25 in the equation of the line and solve for y, the number of hot dogs sold.

y = -100(1.25) + 140 = -125 + 140 = 15

Thus, we would expect that she would sell about 15 hot dogs at a price of $1.25 each. This is an example of interpolation.

We now give an example of (linear) extrapolation. In the previous example, suppose that we wanted to know “At what price per hot dog would she sell 100 hot dogs?” We can predict this using the equation of the line we drew by setting the variable y (number of hot dogs) to be 100 and then solving for x.

100 = -100x + 140, so 100x = 40 and x = 0.40.

Thus, we would expect to sell about 100 hot dogs when the price per hot dog is $0.40. Note that the point we found, (0.40, 100), lies outside the range of the given data. This is an example of extrapolation, by definition. If this point had been within the range of collected data points, then this would have been an example of interpolation.

The reliability of interpolation/extrapolation heavily depends upon how well the predicting line (or curve) follows the pattern of the data. Even in the best cases, predicted data points obtained from interpolation or extrapolation are no more than predictions and the real pattern of data may vary considerably. Extrapolation is especially unreliable far away from the given data points. In the previous example, if we try to extrapolate using the equation to find how many hot dogs would be sold at a price of $10.00 per hot dog, we get a value of y = -860 which is unreasonable since it is negative.
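The interpolation and extrapolation calculations above can be reproduced with a short Python sketch. It assumes the hand-fitted line y = -100x + 140, which is consistent with the values quoted in the text, and it assumes that the collected prices run from $0.65 to $1.40, as in the hot dog table used later in this section. The function names are hypothetical.

```python
M, B = -100.0, 140.0        # assumed hand-fitted line: y = -100x + 140
PRICE_RANGE = (0.65, 1.40)  # assumed range of the collected prices


def predict_sales(price):
    """Estimated number of hot dogs sold at a given price."""
    return M * price + B


def price_for_sales(sales):
    """Price at which the line predicts a given number of sales."""
    return (sales - B) / M


def kind_of_estimate(price):
    """Label an estimate by whether the price lies within the collected data."""
    low, high = PRICE_RANGE
    return "interpolation" if low <= price <= high else "extrapolation"


print(predict_sales(1.25), kind_of_estimate(1.25))
print(price_for_sales(100), kind_of_estimate(price_for_sales(100)))
print(predict_sales(10.00), kind_of_estimate(10.00))
```

The last line reproduces the negative prediction at a price of $10.00, a reminder that the line says nothing reliable far outside the collected data.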

An outlier is a data point which lies far from the normal pattern of data.

In the example below, the points seem to follow the pattern of a line sloping downward (as you move to the right). We have circled a data point (0.95, 10) that does not seem to follow this pattern. This point is considered an outlier.

[pic]

We can only conjecture as to why this point is an outlier. For instance, perhaps the hot dog stand closed early on the day the data point was collected, or maybe rain kept customers away.
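The text identifies the outlier by eye. One informal way to check such a judgment, offered here only as a sketch, is to measure how far each point lies vertically from a fitted line and look for the point with the largest gap. The code assumes the hot dog data tabulated later in this section and the hand-fitted line y = -100x + 140; it is not a formal outlier test.

```python
# Hot dog data (price in dollars, hot dogs sold), as tabulated later in this section.
data = [(1.40, 3), (1.10, 20), (1.05, 24), (1.00, 45), (0.95, 10),
        (0.90, 54), (0.85, 50), (0.75, 62), (0.70, 67), (0.65, 71)]


def residuals(points, m, b):
    """Return (x, y, actual minus predicted) for each point, using y = m*x + b."""
    return [(x, y, y - (m * x + b)) for x, y in points]


# Assumption: the hand-fitted line y = -100x + 140 stands in for the "normal pattern".
ranked = sorted(residuals(data, -100.0, 140.0), key=lambda t: abs(t[2]), reverse=True)
worst = ranked[0]
print(f"farthest from the line: ({worst[0]}, {worst[1]}), residual {worst[2]:.1f}")
```

With these assumptions the point (0.95, 10) lies farthest from the line, which agrees with the circled point in the example.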

Linear Regression Method 2: The Median-Median Method

Another form of linear regression is the median-median method. We will use the following data to demonstrate this method:

[pic]

Step (1) : Plot the data on a scatter plot.

[pic]

Step (2) : Take the median of the x-coordinates of the points:

58, 62, 64, 68, 69, 71, 72

Since there are 7 values, the median is the 4th number in the list (68). We then divide up the data points into two categories: those whose x-coordinates appear on the list before the median and those whose x-coordinates appear on the list after the median. [In our case we discard the point (68,172) whose x-coordinate is the median.]

Step (3) : For both groups of data points, calculate the median x-coordinate and the median y-coordinate.

The first group is (58, 124), (62, 158), and (64, 137). The median of the x-coordinates is 62 and the median of the y-coordinates is 137.

The second group is (69, 201), (71, 165), and (72,195). The median of the x-coordinates is 71 and the median of the y-coordinates is 195.

Step (4) : Plot these two new points on the scatter plot (circle them, for future reference), and connect the two points with a line.

[pic]

Lastly, we can find the equation of the line and use it to predict data values as before.
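The four steps above can be sketched in Python using the standard-library `statistics.median`. The function name is hypothetical, and the code follows the two-group procedure exactly as described here (points whose x-coordinate equals the median are discarded); other textbooks describe variants that split the data into three groups.

```python
from statistics import median


def two_group_median_line(points):
    """Return the two median points and the line (m, b) through them."""
    pts = sorted(points)                      # order the points by x-coordinate
    x_med = median(x for x, _ in pts)         # Step (2): median of the x-coordinates
    lower = [p for p in pts if p[0] < x_med]
    upper = [p for p in pts if p[0] > x_med]  # points with x equal to the median are dropped
    # Step (3): median x and median y within each group
    p1 = (median(x for x, _ in lower), median(y for _, y in lower))
    p2 = (median(x for x, _ in upper), median(y for _, y in upper))
    # Step (4): the line through the two median points
    m = (p2[1] - p1[1]) / (p2[0] - p1[0])
    b = p1[1] - m * p1[0]
    return p1, p2, m, b


data = [(58, 124), (62, 158), (64, 137), (68, 172), (69, 201), (71, 165), (72, 195)]
p1, p2, m, b = two_group_median_line(data)
print(p1, p2)
print(f"y = {m:.4f}x + {b:.4f}")
```

The two median points come out as (62, 137) and (71, 195), matching Step (3), so the slope of the line is 58/9, or roughly 6.44.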

Linear Regression Method 3: The Least-Squares Method

This method of linear regression produces a line (sometimes called the ‘best-fit line’) which minimizes the sum of the squares of the vertical distances between the line and the data points. This method is by far the most used and is built into many calculators and spreadsheet packages. In the appendix, we have included both a Microsoft Excel tutorial and a tutorial for the TI-81 graphing calculator. Here, however, we demonstrate this method by hand using the data below:

[pic]

Step (1) : Create a table as below:

| |x |y |xy |x² |y² |

| |1.40 |3 |4.20 |1.9600 |9 |

| |1.10 |20 |22.00 |1.2100 |400 |

| |1.05 |24 |25.20 |1.1025 |576 |

| |1.00 |45 |45.00 |1.0000 |2025 |

| |0.95 |10 |9.50 |0.9025 |100 |

| |0.90 |54 |48.60 |0.8100 |2916 |

| |0.85 |50 |42.50 |0.7225 |2500 |

| |0.75 |62 |46.50 |0.5625 |3844 |

| |0.70 |67 |46.90 |0.4900 |4489 |

| |0.65 |71 |46.15 |0.4225 |5041 |

|Totals: |9.35 |406 |336.55 |9.1825 |21900 |

Each data point gets recorded as a row in the chart. For instance, for the data point (1.40, 3): x = 1.40, y = 3, xy = (1.40)(3) = 4.20, x² = (1.40)² = 1.9600, and y² = 3² = 9.

This table can easily be constructed on a spreadsheet. Only the first 2 columns need to be entered by hand; the rest of the table can be filled in by using cell relations.
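For readers who prefer a script to a spreadsheet, the same table can be built in Python. This is a sketch only: just the x and y columns are typed in, and the remaining columns and the totals row are computed from them.

```python
# Only the first two columns (x, y) are entered by hand.
data = [(1.40, 3), (1.10, 20), (1.05, 24), (1.00, 45), (0.95, 10),
        (0.90, 54), (0.85, 50), (0.75, 62), (0.70, 67), (0.65, 71)]

# Each row holds x, y, xy, x squared, y squared.
rows = [(x, y, x * y, x ** 2, y ** 2) for x, y in data]

# Column totals: zip(*rows) regroups the rows into columns.
totals = [sum(column) for column in zip(*rows)]

print(f"{'x':>6} {'y':>4} {'xy':>7} {'x^2':>7} {'y^2':>6}")
for x, y, xy, xx, yy in rows:
    print(f"{x:6.2f} {y:4d} {xy:7.2f} {xx:7.4f} {yy:6d}")
print(f"{totals[0]:6.2f} {totals[1]:4d} {totals[2]:7.2f} {totals[3]:7.4f} {totals[4]:6d}")
```

The last printed line should agree with the Totals row of the table, up to floating-point rounding.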

Step (2) : Use the formula below to calculate m (the slope of the least-squares line):

Slope of the Least-Squares Line

m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]

where n is the number of data points,

Σx equals the total of the x column (in the chart);

Σy equals the total in the y column;

Σxy equals the total in the xy column;

Σx² equals the total in the x² column.

Thus, the slope of the least-squares line in our example is:

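n = 10, Σx = 9.35, Σy = 406, Σxy = 336.55, Σx² = 9.1825

m = [10(336.55) - (9.35)(406)] / [10(9.1825) - (9.35)²]

m = (3365.5 - 3796.1) / (91.825 - 87.4225)

m = -430.6 / 4.4025 = -97.8080636.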

Step (3) : Use the formula below to find b (the y-intercept of the least-squares line):

The y-intercept of the least-squares line:

b = [Σy - m(Σx)] / n

b = [406 - (-97.8080636)(9.35)] / 10 = 132.0505395.

Step (4) : Write the equation of the least-squares line:

In our example,

y = -97.8080636x + 132.0505395
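Steps (2) through (4) can be checked with a short Python sketch. It computes the slope from the column totals using the standard least-squares formula m = [nΣxy - ΣxΣy] / [nΣx² - (Σx)²], and the intercept from b = (Σy - mΣx)/n, an algebraically equivalent form of the usual intercept formula. The function name is hypothetical.

```python
def least_squares_line(points):
    """Return (m, b) of the least-squares line y = m*x + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)  # Step (2)
    b = (sum_y - m * sum_x) / n                                   # Step (3)
    return m, b


data = [(1.40, 3), (1.10, 20), (1.05, 24), (1.00, 45), (0.95, 10),
        (0.90, 54), (0.85, 50), (0.75, 62), (0.70, 67), (0.65, 71)]

m, b = least_squares_line(data)
print(f"y = {m:.7f}x + {b:.7f}")  # Step (4)
```

The printed coefficients should agree with the hand calculation to the digits shown in the text.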

Step (5) : Plot the line on the scatter plot by finding any two points on the line.

Note: Choose two points that will appear on the scatter plot, not far from the actual data points.

In our example, if x = 0.70, then y = -97.8080636(0.70) + 132.0505395 = 63.58489498;

if x = 1.20, then y = -97.8080636(1.20) + 132.0505395 = 14.68086318. On the scatter plot, we mark the points (0.70, 63.58489498) and (1.20, 14.68086318) differently to distinguish them from our data points.

[pic]

It is interesting to note that the least-squares line in this example does not pass through any of the actual data points, but remember that this line is merely a simple model of the actual situation. Our model makes the simplifying assumption that the number of hot dogs sold on any given day depends entirely upon the price. A more realistic model would take into account many other factors, such as the weather that day, how long the hot dog stand was open, and the day of the week.

We can now use the least-squares line for interpolation or extrapolation. For instance, if x = $0.80, then a reasonable guess at y would be about 54 hot dogs, as seen from the graph[1]. We could also arrive at this guess algebraically by plugging x = 0.80 into the equation and solving for y.

y = -97.8080636(0.80) + 132.0505395 = 53.80408862.

Warning: One should be aware that if we switch the variables around, the least-squares method will produce a different line. (This is because the least-squares line minimizes vertical distances).
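The warning above can be illustrated numerically. The sketch below fits the hot dog data twice: once with price as x and sales as y, and once with the roles swapped. The second line is then rewritten in the form y = mx + b so that the two can be compared. The helper name `least_squares` is hypothetical.

```python
def least_squares(points):
    """Return (m, b) of the least-squares line for (horizontal, vertical) pairs."""
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxx = sum(p[0] ** 2 for p in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return m, (sy - m * sx) / n


data = [(1.40, 3), (1.10, 20), (1.05, 24), (1.00, 45), (0.95, 10),
        (0.90, 54), (0.85, 50), (0.75, 62), (0.70, 67), (0.65, 71)]

# Usual fit: sales (y) as a function of price (x).
y_on_x = least_squares(data)

# Swapped fit: price as a function of sales, x = c*y + d.
c, d = least_squares([(y, x) for x, y in data])

# Solve x = c*y + d for y so both lines read y = (slope)x + (intercept).
slope_swapped = 1 / c
intercept_swapped = -d / c

print(f"sales on price: y = {y_on_x[0]:.2f}x + {y_on_x[1]:.2f}")
print(f"price on sales: y = {slope_swapped:.2f}x + {intercept_swapped:.2f}")
```

For these data the swapped fit is noticeably steeper (a slope of roughly -125.8 compared with roughly -97.8), so the two lines are genuinely different.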

The Correlation Coefficient

One of the most beneficial aspects of the least-squares method is that it has its own self-evaluation test. More specifically, there is a number r that you can compute called the correlation coefficient that gives an idea as to how well the data is fit by the least-squares line. In the previous notation,

The correlation coefficient:

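r = [n(Σxy) - (Σx)(Σy)] / √([n(Σx²) - (Σx)²][n(Σy²) - (Σy)²])

where Σy² equals the total in the y² column.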

The correlation coefficient is a number between -1 and 1 and can be interpreted as follows:

• the sign of r tells you whether there is a positive or negative correlation between the variables x and y. (Positive r means positive correlation; negative r means negative correlation.)

• if r is actually equal to 1 or -1, then all of the data points lie exactly on the least-squares line.

• if r is close to 1 or -1, this means that the data points lie close to the line, i.e. the line seems to closely follow the pattern of the data. (By values close to 1 or -1, we mean numbers such as: 0.96, -0.99, 0.92, -0.945, etc.).

• if r is close to zero, this tells us that the data probably does not follow the pattern of a straight line.

In the above hot dog example, we get a correlation coefficient of r = -0.881797734.

From the negative sign, we see that there was a negative correlation between hot dog prices and the number of hot dogs sold. Since the value is fairly close to -1 we can conclude that the data is modeled fairly well by the least-squares line.
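The value of r quoted above can be reproduced from the same column totals used for the slope. The sketch below uses the standard formula for the correlation coefficient; the function name is hypothetical.

```python
from math import sqrt


def correlation(points):
    """Correlation coefficient r computed from the column totals."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator


data = [(1.40, 3), (1.10, 20), (1.05, 24), (1.00, 45), (0.95, 10),
        (0.90, 54), (0.85, 50), (0.75, 62), (0.70, 67), (0.65, 71)]

r = correlation(data)
print(f"r = {r:.9f}")
```

Note that the numerator is the same as the numerator of the slope formula, which is why r and the slope always have the same sign.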

Warning: Microsoft Excel calculates and displays r² instead of r. One can recover the magnitude of r by taking the square root; the sign of r is the same as the sign of the slope of the least-squares line.
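Because r always has the same sign as the slope of the least-squares line, r can be recovered from a displayed r² together with the displayed equation. A minimal Python sketch, with a hypothetical function name:

```python
from math import copysign, sqrt


def r_from_r_squared(r_squared, slope):
    """Recover r from r squared, taking its sign from the least-squares slope."""
    return copysign(sqrt(r_squared), slope)


# Hot dog example: the slope is negative, so r must be negative.
r_squared = 0.881797734 ** 2
print(r_from_r_squared(r_squared, -97.8080636))
```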

Using Microsoft Excel® to Perform Linear Regression

We will explore whether there is a linear relationship between the amount of total footwear produced in the United States and the amount of non-rubber footwear produced.

|Year |Total Footwear produced (millions of pairs) |Non-rubber Footwear produced (millions of pairs) |

|1984 |383.5 |303.2 |

|1985 |336.5 |265.1 |

|1986 |310.9 |240.9 |

|1987 |312.1 |230.0 |

|1988 |325.3 |234.8 |

|1989 |312.8 |221.9 |

|1990 |290.3 |184.6 |

|1991 |282.1 |169.0 |

|1992 |273.6 |164.8 |

|1993 |252.0 |171.7 |

SOURCE: US Census Bureau

Enter the data into a spreadsheet (in column form), as below.

[pic]

We would like to analyze total footwear versus non-rubber footwear, so select (highlight) both columns of data as shown below:

[pic]

Under the INSERT menu, select CHART, then (to the right) select AS NEW SHEET.

You will be guided through five windows of options. Make the following selections:

Step 1 : The range has already been specified by highlighting. Choose NEXT.

Step 2 : Choose XY scatter plot. Choose NEXT.

Step 3 : Choose XY scatter plot format number 1 or 3 (your choice). Choose NEXT.

Step 4 : Choose columns (not rows) - ignore the rest of this window. Choose NEXT.

Step 5 : Choose “NO” for the legend option. Add appropriate descriptions. Choose FINISH.

Your scatter plot will now appear as a new sheet. You can click back to the table of data that we input by using the tabs below the chart, and vice versa.

Adjust the scale of the x- and y-axes to center the data in the middle of the chart. To do this, double click on the chart, then double click on one of the axes (say the x-axis). A window will pop up which is titled “Format Axis”. Choose SCALE. Change the minimum x value from 0 to a more reasonable number (200 or 250 will work in our example). Similarly, change the scale of the y-axis so that the data fits nicely.

We will now use the spreadsheet to fit the least-squares line to the data. To do this,

1. Click on one of the data points until all of the data points are highlighted. (Excel may not highlight all of the points; this is a software glitch.)

2. Under the INSERT menu, select TRENDLINE. A window will pop up with two folders “TYPE” and “OPTIONS”.

3. Under the “TYPE” folder, select LINEAR. (We want to fit the data with a line).

4. Under the “OPTIONS” folder, check the two boxes: Display equation on chart and Display R-squared value on chart.

5. Click on OK.

Excel will now fit a line to your data.
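If a spreadsheet is not at hand, the trendline equation and R-squared value can be cross-checked with a short Python sketch that applies the least-squares formulas from earlier in this section to the footwear data. The variable names are hypothetical, and this is not a description of how Excel performs the calculation internally.

```python
# Footwear data, 1984-1993 (millions of pairs).
total = [383.5, 336.5, 310.9, 312.1, 325.3, 312.8, 290.3, 282.1, 273.6, 252.0]
non_rubber = [303.2, 265.1, 240.9, 230.0, 234.8, 221.9, 184.6, 169.0, 164.8, 171.7]

n = len(total)
sx, sy = sum(total), sum(non_rubber)
sxy = sum(x * y for x, y in zip(total, non_rubber))
sxx = sum(x * x for x in total)
syy = sum(y * y for y in non_rubber)

# Least-squares slope and intercept, with total production as x.
slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
intercept = (sy - slope * sx) / n

# R-squared is the square of the correlation coefficient.
r_squared = (n * sxy - sx * sy) ** 2 / ((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(f"y = {slope:.4f}x + {intercept:.3f}")
print(f"R-squared = {r_squared:.4f}")
```

The printed equation and R-squared value can be compared with the ones Excel displays on the chart; small differences in the last digits are due to rounding.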

Problems:

Q1: Does there appear to be a relationship between total production and non-rubber production? If so, is it positive, negative, linear, or related in some other way?

Q2: Use the slope of the line to answer the following questions:

(a) If total footwear production increased over one year by 50 million pairs, how much of an increase (decrease) would you expect in non-rubber footwear?

(b) If non-rubber footwear decreased by 50 million pairs, how much of an increase (decrease) would you expect for total footwear?

Q3: Recall that the x-intercept of a line is the point(s) where the line intersects the x-axis (y=0).

(a) What is the x-intercept of the line?

(b) The x-intercept is the amount of total footwear that the linear model is predicting when the amount of non-rubber footwear is equal to zero (y = 0). Why is this prediction not correct?

(c) Calculate the y-intercept of the line. This is the amount of non-rubber footwear that the linear model predicts when total footwear is equal to zero (x = 0).

(d) Why are the predictions in (b) and (c) so unrealistic?

Q4: Predict the level of non-rubber production when total production is at 350 million pairs per year.

Q5: Predict the level of total production when non-rubber production is at a level of 100 million pairs.

Problems

1. [Footwear]

|Year |Total Footwear produced (millions of pairs) |Non-rubber Footwear produced (millions of pairs) |

|1984 |383.5 |303.2 |

|1985 |336.5 |265.1 |

|1986 |310.9 |240.9 |

|1987 |312.1 |230.0 |

|1988 |325.3 |234.8 |

|1989 |312.8 |221.9 |

|1990 |290.3 |184.6 |

|1991 |282.1 |169.0 |

|1992 |273.6 |164.8 |

|1993 |252.0 |171.7 |

(a) Create a scatter plot using the above data with Total production as x and non-rubber production as y.

(b) Does it seem that these variables might be linearly related?

2. [Astronomical Data]

The period of revolution and the mean distance from the sun for each of the nine planets are given in the chart below. The period of revolution is the amount of time a planet takes to circle the sun.

| |Period of revolution |Distance from sun |

| |(years) |(× 1,000,000 km) |

|Mercury |0.244 |57.9 |

|Venus |0.624 |108.2 |

|Earth |1.000 |149.6 |

|Mars |1.882 |227.9 |

|Jupiter |11.86 |778.3 |

|Saturn |29.46 |1427.0 |

|Uranus |84.01 |2869.6 |

|Neptune |164.80 |4496.6 |

|Pluto |247.70 |5900.0 |

(a) Create a scatter plot using these data.

(b) Artistically, fit a curve to the data.

(c) Use (b) to predict the mean distance from the sun for two hypothetical planets with period of revolution of 75 years and 200 years, respectively.

3. For each scenario below: Predict whether there will be a positive or negative relationship between the variables. Also, explain whether you think the relationship might be linear (the data will be fit well by a straight line).

(a) Some money is deposited in a bank which compounds interest daily.

x = time the money is left in the bank; y = amount of money in the bank.

(b) A family is driving 55 miles per hour towards Los Angeles.

x = hours spent driving; y = the family's distance to Los Angeles in miles.

(c) x = the age of a person in years; y = that person's height in inches.

4. [Simple interest vs. compound interest]

Recall the following formulas for the future amount, A, of an account in which an initial deposit P is made at annual interest rate r for time t (in years), where n is the number of compounding periods per year.

Simple interest: A = P(1 + rt)

Compound interest: A = P(1 + r/n)^(nt)

Use a principal amount, P, of $100.00 and an interest rate, r, of 10%. Calculate the future amount earned at simple interest at the end of each year for 10 years. Use the 10 answers that you have generated to create a scatter plot with years on the horizontal axis and the future amounts on the vertical axis. There should be 10 points plotted.

Now do the same for the compound interest case with compounding period being daily (n =365), and all other variables the same as above.

(a) Which is growing faster, simple or compound? Why?

(b) Notice that the simple interest data can be “well fit” by a line, whereas the compound interest cannot. Explain.

5. [Nutritional information of baked goods & desserts]

Each gram of carbohydrate provides about 4 calories. Listed below are the calories and the grams of carbohydrate contained in a single serving of certain foods:

|Food item |Calories |Carbohydrate |

| | |(grams) |

|Corn Muffins |170 |24 |

|Cinnamon Rolls |181 |31 |

|Easy Pancakes |144 |16 |

|French Bread | 86 |18 |

|Rye Bread |114 |22 |

|Whole Wheat English Muffin |260 |45 |

|Onion Bagels |186 |39 |

|Sunflower-Nut Wheat Muffins |171 |25 |

|White Bread | 97 |19 |

|Coffee Cake |316 |48 |

|Sponge Cake |188 |36 |

|Pear-Raisin Upside-Down Cake |286 |46 |

|Banana Cake |281 |46 |

|Lemon-Poppyseed Pound Cake |273 |37 |

|Fudge Cake |311 |47 |

|Old-Time Fudge | 67 |13 |

|Nut Brittle | 72 |10 |

|Walnut Caramels | 94 |11 |

|Chocolate Chip Cookies | 99 |12 |

|Oatmeal Cookies | 82 |12 |

|Shortbread | 95 |10 |

|Buttermilk Brownies |158 |23 |

|Cheesecake Supreme |420 |29 |

|Apple Dumpling |619 |92 |

(a) For each food item, form a ratio of calories to carbohydrates.

(b) Find the average ratio of the above foods.

(c) Form a scatter plot and fit it with a line. Find the slope of the line.

(d) Is the slope of the line equal to the average ratio calculated in (b)? If not, draw a line on your scatter plot through the point (0,0) with slope being the average ratio in (b).

(e) Does this line better fit the data than the line you drew?

(f) In general, do you think that this is a good method to fit a line? Explain.

6. [Refer to the data set below - Student grades]

|Student # |Exam 1 |Final Exam |Overall percent |

|1 |91 |73.3 |85.6 |

|2 |66 |50.7 |62.3 |

|3 | 92 |81.3 |81.3 |

|4 |76 |46.7 |59.3 |

|5 |81 |60.0 |75.6 |

|6 |61 |60.0 |61.4 |

|7 |85 |80.0 |80.8 |

|8 |90 |93.3 |96.5 |

|9 |78 |80.7 |78.7 |

|10 |88 |76.7 |79.6 |

|11 |97 |85.3 |88.5 |

|12 |87 |75.3 |74.3 |

|13 |51 |32.0 |65.3 |

|14 |83 |82.6 |71.5 |

|15 |85 |87.3 |88.5 |

|16 |95 |86.0 |86.0 |

|17 |70 |73.3 |73.5 |

|18 |90 |75.3 |77.7 |

|19 |68 |60.0 |60.7 |

|20 |86 |52.6 |63.7 |

|21 |49 |78.7 |78.5 |

(a) Use the students’ grades to see if their exam 1 scores are an accurate indicator of their overall performance in the course. To do this, make a scatter plot with the students’ exam 1 scores on one axis and their final grades (by percentage) on the other.

(b) What is the least squares equation for the data?

(c) Use the equation in (b) to predict the final grade of a student who scored a 57 on exam 1.

(d) Use the equation in (b) to predict the exam 1 score of a student with a final grade of 100.0.

(e) Repeat (a) - (d) using the students’ scores on the final exam to predict the final score for the course.

7. Crime in 1992 : [Refer to data set #1 which can be found at the following web site: ] Create a scatter plot of the violent crime rate per 100,000 people versus prisoner rate per 1,000 people.

(a) Using your plot, do you see a relationship between the violent crime rate and the prisoner rate? Explain why or why not.

(b) Fit a line to your data. Find the equation of your line.

(c) Using the equation of your line, guess what the prisoner rates would be for states with violent crime rates (per 100,000 people) of:

(i) 1,000

(ii) 2,833

(iii) 375

(iv) 746

(d) Compare your answers in (c) to the actual data for Maryland, D.C., Virginia, and Tennessee, respectively. Were your predictions close? Explain.

(e) What are some reasons a state like Tennessee might be so far off from the prediction?

8. Auto Repairs : [Refer to data set #6 which can be found at the following web site: ] A consumer agency took identical cars needing identical repairs to various local auto repair shops. The resulting repair costs are in the right-hand column of the table below. The left-hand column gives a customer satisfaction rating for each repair shop. The rating is on a scale of 0 - 100, with 100 being best.

(a) Use a spreadsheet to create a scatter plot and to find the equation of the least-squares line for these data.

(b) Is the data fit well by the line?

(c) What does the equation predict that the cost of a repair that scored 100 will be?

9. Vital Statistics : [Refer to data set #2 which can be found at the following web site: ] Use a spreadsheet to create a scatter plot, plotting birth rate versus infant mortality for all fifty states.

(a) Do you see any correlation between these items?

(b) Use the spreadsheet to find the least-squares line to the data.

(c) Using your line from part (b), give a corresponding infant mortality for the following birth rates (read the corresponding point off of your line): 14.0, 18.0, 25.0

10. Use data set # 2 (see problem 9) to create a scatter plot to see if there is any correlation between heart disease and cancer.

(a) Describe the correlation that you see, if any. Is it linear or more complicated than that?

(b) What is the equation of the least-squares line given by the spreadsheet?

(c) Would the data be better fit by another curve? Use the various options in the spreadsheet to try to fit other curves to the data: What are the equations of the power and exponential models given? Do these curves fit the data better?

12. A student using a spreadsheet to analyze data ended up getting the following:

y = 9x - 1, r² = 0.1463355

(a) Use the equation above to predict y when x = 11.

(b) Use the equation above to predict x when y = 10.

(c) Can you tell whether the data is fit well by the line?

13. T-Bill rates : [Refer to data set #4 which can be found at the following web site: ] Someone hypothesizes that the 3-month T-bill rate is generally higher during certain seasons (spring, summer, fall, winter) and lower during others. Use any tools (scatter plot, etc.) you have learned to support or dispute this claim.

14. Household Income and Public Assistance : [Refer to data set #5 which can be found at the following web site: ]

(a) Would you expect there to be a relationship between median household income in a city and the percent of the city’s households receiving public assistance? What is the relationship (if any) that you expect?

(b) Use the data given to produce a scatter plot showing median household income versus percent of households receiving public assistance.

(c) Do you think you could fit a line to the scatter plot you arrived at in part (b)? Explain.

(d) What are some reasons that the median income in a given city might not be related to the percent of public assistance that the city receives?

(e) Suppose that a certain city has a median income of $100,000. What percent of the households in the city would you expect to receive public assistance?

(f) Did the data display the relationship you expected in (a)? Explain.

15. Measurement and the geometry of circles:

For this problem you will need:

a pen or pencil

a ruler or a yard stick

a long piece of string

10 circular objects having varying sizes (like cups, Frisbees, plates, etc.)

(a) Measure both the diameter and circumference of the 10 circular objects.

Organize the data you have collected into a table with 3 columns: the type of object, its diameter, and its circumference. [To measure the circumference, take a piece of string, wrap the string around the circle, and mark the string where it wraps around once. The string can then be straightened out and measured using a ruler.]

(b) Using the data you have collected, create a scatter plot with diameter on the horizontal axis and circumference on the vertical axis.

(c) Use the median-median method to run a line through this data.

(d) Calculate the equation of your line and determine its slope.

(e) Recall from high school geometry that there is a formula that relates the circumference of a circle to its diameter. What was the formula?

(f) Compare the equation of your line to the formula.

(g) Come up with some reasons why the equation you came up with might differ from the formula.

(h) Do you think your results would be the same if you had done all of your measurements in centimeters instead of inches (inches instead of centimeters)?

(i) Use both your equation and the formula in (e) to predict the circumference of an object that has diameter 1,500,000 miles. By how much do the two predictions differ? Why?

16. [Local restaurant survey]

The following data was taken from a survey of local restaurants which were ranked in 3 categories on scales of 0-30 with

0-9 = poor to fair

10-19 = good to very good

20-25 = very good to excellent

26-30 = extraordinary to perfection.

The categories are Food Quality, Decor of restaurant, and Quality of service. The cost reflects the estimated price (including tip) of a dinner with one drink.

|Restaurant |Food |Decor |Service |Cost ($) |

|The Cheesecake Factory |18 |18 |17 |19 |

|Chili’s Grill & Bar |15 |15 |16 |14 |

|Food Factory |22 |5 |11 |11 |

|Hard Times Café |19 |15 |18 |12 |

|Hay-Adams Dining Room |22 |26 |24 |40 |

|La Miche |23 |20 |22 |35 |

|Ledo Pizza |18 |9 |14 |12 |

|The Prime Rib |26 |25 |24 |42 |

|Silver Diner |15 |17 |16 |13 |

|Tako Grill |23 |15 |19 |22 |

|TGI Friday’s |15 |15 |15 |15 |

|The Place on K Street |23 |20 |21 |33 |

|Inn at Little Washington |29 |29 |28 |71 |

|Hunan Lion |21 |21 |21 |22 |

|Union Street Public House |18 |19 |17 |20 |

|West End Café |20 |19 |20 |29 |

|The Chart House |18 |1 |19 |27 |

|The Conservatory |26 |8 |25 |50 |

|Hamburger Hamlet |16 |15 |16 |15 |

SOURCE: Zagat Survey of Washington Restaurants

For each of the combinations below, do the following: Create a scatter plot and find the equation of the least-squares line. Discuss the relationship that is exhibited in the scatter plot.

(a) Decor vs. Service

(b) Cost vs. Decor

(c) Cost vs. Service

(d) Decor vs. Food

(e) Cost vs. Food

17. [Imports vs. Exports]

The following chart shows the total dollar value of goods imported and exported in 1994 through various ports of entry in the United States.

|Port of Entry |Export ($) |Import ($) |

|Maine |1,106,044,218 |2,208,649,490 |

|St. Albans |2,214,134,505 |3,637,130,387 |

|Boston |2,353,437,436 |6,463,665,681 |

|Providence, RI |47,370,275 |455,390,397 |

|Ogdensburg, NY |4,595,960,465 |7,038,298,506 |

|Buffalo, NY |15,888,221,439 |14,661,302,405 |

|NY City |30,551,747,615 |42,871,907,085 |

|Philadelphia |3,238,763,395 |9,190,459,168 |

|Baltimore |4,479,119,728 |7,425,369,754 |

|Norfolk |7,240,752,781 |4,742,908,955 |

|Wilmington, NC |2,324,720,499 |3,486,009,758 |

|Charleston |5,194,834,324 |5,373,455,712 |

|Savannah, GA |5,319,611,727 |7,193,450,304 |

|Tampa |3,411,918,692 |4,993,946,226 |

|Mobile, AL |1,719,186,193 |2,036,943,685 |

|New Orleans |13,780,098,960 |17,410,150,470 |

|Port Arthur, TX |553,030,849 |2,448,297,046 |

|Laredo, TX |10,921,040,117 |12,069,195,708 |

|El Paso |3,910,844,623 |6,586,807,808 |

|San Diego, CA |2,672,311,461 |4,138,451,899 |

|Nogales, AZ |2,059,731,775 |4,325,354,672 |

|Los Angeles |32,856,044,275 |45,951,213,996 |

|San Francisco |20,303,432,223 |26,595,044,839 |

|Col-Snk |4,683,986,731 |3,837,470,323 |

SOURCE: Dept. of Commerce

Create a scatter plot for the above data and find a relation (if any) between exports and imports. Find the least-squares line. [Note: You will probably want to round off the numbers and scale them appropriately.]

18. [The physics of a bouncing ball]

For this problem you will need:

one tennis ball (or some other ball that bounces well)

a yardstick (or ruler)

one lab partner

One trial of this experiment consists of the following:

1. Hold the tennis ball at a predetermined height (measured from the bottom of the ball).

2. Let the ball drop freely.

3. Record the height of the ball (measured from the bottom of the ball) at the peak of its first bounce.

Suggestion: Drop the ball in front of a wall and use paper or masking tape to mark the heights.

Repeat the experiment with varying starting heights (the more the better). Use at least 10 different starting heights.

Plot your results on a scatter plot with starting heights on the horizontal axis and the bounce height on the vertical axis. Fit the least-squares line to the data. Predict the bounce height of a ball dropped from (a) 6 feet (b) 12 feet (c) 1 mile (5280 ft.)

Web

19. U.S. Population. Use the web to find the number of residents in the United States for each of the years from 1900 to 1996. (Try the U.S. Census Bureau Web site.) Make a scatter plot with the year on the horizontal axis and the number of residents on the vertical axis. Use a spreadsheet to fit a least-squares line to the data.

(a) Use your least-squares line to estimate the resident population in each of the years 1950, 1970, and 1990. Note that you have data points that will tell you what the actual populations were. How close (in value) were the predicted populations to the actual values? Was your line a good estimator for these values?

(b) Use the least-squares line to estimate what the resident population was in 1781. What is wrong with this answer? How far back in time do you think your line can be used to meaningfully estimate the resident population?

(c) Do you think that the U.S. population is growing linearly? Use the linear regression you have performed to back up your argument.

20. Gross Domestic Product (GDP). On the Web, find the U.S. GDP for each of the years, 1929 to 1996. Use a spreadsheet to fit the data with a least-squares line.

(a) What is the least-squares equation of the line? What is the correlation coefficient?

(b) Is the data fit well by the line?

(c) Predict the GDP in the year 2020.

(d) The GDP is reported in 1996 dollars, i.e. inflation is taken into account by reporting dollar values in 1996 dollars. Had inflation not been taken into account, what effect would this have on the scatter plot?

Solutions

1. (a)

[pic]

(b) Yes, it appears that a line may fit the data.

2. (a)

[pic]

(b)

[pic]

(c) If x = 75, then the curve predicts y is about 2700. If x = 200, then y is about 5180.

5. (a) and (b)

[pic]

(c)

[pic]

(d) No, but it is close. The new line is drawn below (as a thin line).

[pic]

(e) They both reasonably fit the data.

(f) No, this method only works if the data pattern seems to “include” or come near (0,0).

6. (a) and (b)

[pic]

(c) Plug x = 57 into the equation y = 0.4567x + 39.383 to get y = 65.4149.

(d) Substitute y = 100 into the equation y = 0.4567x + 39.383 and solve for x to get x = 132.7283. This number is not likely to coincide with reality, which is acceptable: our model does not give actual data, especially when extrapolating.

(e) Redo the above with Final Exam versus Overall Percent.

-----------------------

[1] One should never interpolate or extrapolate using a graph drawn by hand; always use the equation of the line to avoid errors and inaccuracies.
