Arizona State University
Statistics
Never attribute to malice that which is adequately explained by stupidity.
Numbers don’t lie. And often perception is not reality. Case in point: people are always concerned, worried even, about the increasingly violent society in which they live. Statements like “what is this world coming to” are commonplace. People tend to feel increasingly unsafe; many become more and more reluctant to go out at night. Some have been known to hide behind their TV or computer instead of venturing out. Many feel that when they were young, crime was not as bad as it is now. People attribute arbitrary reasons for this new wave of perceived violence. “Music videos glorify violence, causing a violent cycle to never end...” “No wonder there is so much crime these days, look at all the violence on TV and in the movies…” “Kids have no respect for their parents, teachers or elders these days, and this contributes to more violence…” Or “the remote control teaches us to become impatient, and we are more likely to quickly pull the trigger…” Images from the OJ murders, the Columbine shootings or 9/11 fill our televisions, replaying the same isolated scenes over and over again. People are shot every night on reruns of Law and Order. So it’s natural for people to criticize the amount of violence in our society, but rarely do these same people think their claims through to their logical conclusion. Instead, many become angry about the rise of violent crime in this country and make matters worse by linking this acquired malice to other elements in society (music videos, teenagers, TV, OJ), fostering a wider net of hate. More importantly, they never once pause to check the numbers. And in a matter of moments, anyone can do just that: check the numbers. Any of us can access the FBI’s Crime Index statistics on the web. So, we did.
Below are the nationwide statistics from 1982 to 2001, showing, by year, the number of violent crimes committed nationwide. During this twenty-year span, while the nation’s population grew from 231,664,458 in 1982 to 284,796,887 in 2001, the number of violent crimes, defined here as murder, rape, robbery and assault, did not steadily increase as expected. In fact, there was a stunning decline in violent crime over the last decade. Violent crime peaked in 1992, with 1,932,274 reported instances, and since then it has dropped over 25 percent. (The homicides of September 11, 2001 were not included.)
[pic]
But, look at the age we live in. Ripped from the headlines:
• Arizona Kids Are Home Alone: a new survey says 30 percent of children in kindergarten through 12th grade take care of themselves
• Of the 85 prisoners executed in 2000, 43 white, 35 African American, 6 Hispanic, 1 American Indian
• Vietnam - 58,168 deaths, total abortions since 1973, 44,670,812 as of April 22, 2004
• Should juveniles be tried as adults? Kids are killing these days in record numbers
Statistics are tossed at us in such a deluge that the numbers alone seem almost controversial: 30 % of school-age children left alone, 35 out of 85 executed prisoners African American, 44 million abortions in the last 30 years… Certainly, each of these topics elicits emotion from within each of us: do too many parents leave their children unsupervised, is there enough funding for day care, is the death penalty, pro or con, racially biased, are too many or too few juveniles tried as adults? And if you want to clear a room with angry combatants, start with the age-old question: woman’s choice or murder of the unborn? No matter your stand on these topics, as you comb through the headlines, statistics besiege you.
Why is quantitative literacy important? When confronted with numbers associated with hotly contested issues or highly controversial ethical or moral arguments, the raw numbers themselves, such as the above-stated 44,670,812 abortions in the last 30 years, need to be examined so they may be fully understood. As always, we begin by examining the number for credibility. Is it even viable? This particular number, or numbers similar to it, appears on various websites, and we found the figure we quoted, or ones close to it, at several of them. Is it accurate? Well, we simply have no way of knowing, but it is an oft-published statistic. Is it viable? Now, that is a different question altogether. It is certainly quoted often enough.
So, following our pattern of analysis, if the number seems to be viable, then we continue: what inferences are fair to draw from it? These 44 million aborted fetuses would now be 30 or younger, so for argument’s sake, let us assume it is fair to say a large percentage would be alive today. If this assumption is reasonable, 44 million added to the roughly 290 million US citizens comes to about 330 million. We are talking about a population of 330 million people, and 44/330 is slightly larger than 13 %, or slightly greater than 1/8. What does this mean? Has society aborted 1 in 8? Don’t questions abound in your mind? Is this correct? Were these all abortions performed out of necessity? How many were medical? Or moral? Or personal choice? Does the reason for the abortion matter to you? Does the reason matter if you take into account this new “1 in 8” statistic as a measure of how often abortions occur? Certainly, one may argue that 1 in 8 could be construed as an alarming rate. But the point of view and the emotions you feel are personal to you. The point is that 44 million is the statistic we are confronted with. Our ability to perform the math tells us 1 in 8 is a logical consequence of this statistic. What you do in the subsequent interpretation is your decision. But quantitative literacy will allow you to understand the statistic in context and make that interpretation.
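As a quick check of that arithmetic, here is a short Python sketch; the figures are the rounded numbers quoted above, not an independent data source.

# Rough check of the "1 in 8" figure using the rounded numbers quoted above.
abortions = 44_000_000             # reported abortions since 1973 (rounded)
current_population = 290_000_000   # approximate US population at the time

hypothetical_total = current_population + abortions   # about 334 million
share = abortions / hypothetical_total

print(f"Hypothetical total population: {hypothetical_total:,}")
print(f"Share: {share:.3f}, or about 1 in {round(1 / share)}")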
Statistics themselves are numbers that stand alone. Honest. Raw. Naked numbers. The name of the game in statistics is to draw inferences about a population or topic. If we are using polls, we are basing the inference on a smaller random sample. When trying to then form a conclusion, we must be careful. Correlation is not causation, just because numbers correlate does not mean one causes the other. Inferring characteristics about a population based on the raw data is the immediate reaction as we scan the headlines, but should it be? Can graphs be misleading? How good are we at recognizing misleading information?
Causation and Correlation
Clearly, there exists a high correlation between a person’s blood alcohol level and the likelihood they will get into an auto accident. I do not think any rational person would dispute the added inference that drinking alcohol can cause an auto accident. The data that supports the two factors’ relationship, the higher number of drunk drivers compared to sober drivers who get into accidents, implies correlation. That drinking led to, or caused, the accident implies causation. It will be our task to determine whether a factor’s data that correlates to some other factor’s data can be interpreted to mean that one factor influences the other.
Correlation A correlation exists between two factors if a change in one of the factor’s data is associated with a rise or decline in the other factor’s data.
Causation A causation exists between two factors if one factor causes, determines or results in the rise or decline of the other factor’s data.
Correlation as a result of causation As with drinking and auto accidents, we can often infer that a correlation is tied to causation. Another equally clear case can be made by considering tobacco use and lung cancer. The numbers correlate: the amount one smokes tracks the likelihood of succumbing to lung cancer. Those who smoke more have a higher percentage of their population afflicted with lung cancer. And, for years, the Surgeon General has been telling us that smoking causes lung cancer. The more you smoke, the higher the risk of developing lung cancer.
Correlation with no causation. Hidden factors Just because two factors correlate does not mean one factor causes the other. One of the easiest examples to spotlight the difference, and to have it plainly explained, is the common correlation between divorce and death. In most states, there is a significant negative correlation between the two: the more divorces, the fewer deaths. Since the two correlate negatively, the natural question arises, does getting a divorce reduce the risk of dying; does staying married increase the chance of dying? All joking aside about the obvious hidden implication, it is a third unseen factor that causes the correlation. Death and divorce do not have a causal relationship. Age does. The older the married couple, the lower the likelihood they will get a divorce. The older the married couple, the higher the likelihood they will pass away. There is a negative correlation between divorce and age and a positive correlation between age and death. The younger you are, the more likely you are to get divorced. The older you are, the more likely you are to pass away. Since most divorces occur among younger people, and since younger people are less likely to die, the two facts chain together into a transitive relationship: more divorces go hand in hand with fewer deaths. Correlation. Causation. Very different. Yes, there is a correlation between divorce and death. No, neither causes the other. In plain English, getting a divorce will not increase or decrease the likelihood you will die.
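To see how a hidden factor can manufacture a correlation, here is a small Python simulation. It is only a sketch with invented numbers (not real vital statistics): age drives both the divorce rate and the death rate, yet neither rate causes the other.

import numpy as np

rng = np.random.default_rng(0)

# Fifty hypothetical regions; the average age of married couples in each
# region is the hidden third factor.
avg_age = rng.uniform(25, 75, size=50)

# Younger regions have more divorces; older regions have more deaths.
# (Both relationships are invented purely for illustration.)
divorce_rate = 20 - 0.2 * avg_age + rng.normal(0, 1, size=50)
death_rate = 2 + 0.15 * avg_age + rng.normal(0, 1, size=50)

# Divorce and death end up negatively correlated even though neither
# causes the other; age is doing all the work.
r = np.corrcoef(divorce_rate, death_rate)[0, 1]
print(f"Correlation between divorce rate and death rate: {r:.2f}")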
Accidental Correlations Sometimes there exist accidental correlations for which there is no hidden factor or unseen logical explanation. The winner of the Super Bowl and the party that wins the presidential race correlate highly every four years, but I do not think football predicts presidential races, or vice versa. This is an accidental correlation.
Misleading Information
Breast cancer will afflict one in eleven women. But this figure is misleading because it applies to all women up to age eighty-five, and only a small minority of women live to that age. The incidence of breast cancer rises as a woman gets older. At age forty, one in a thousand women develops breast cancer. At age sixty, one in five hundred. Is the statistic one in eleven technically correct? Yes. Should a 40-year-old woman be concerned with getting breast cancer? Certainly. Should she worry that one in eleven of her peer group will be afflicted? No. And while one in a thousand of her peer group will be afflicted, this by no means minimizes the seriousness of the issue; it simply sheds a more realistic light on it.
To draw a scatterplot:
• Arrange the data in a table.
• Decide which column represents the x–values (the data plotted along the horizontal axis). Those values need to be the perceived cause, the independent variable. Decide which column represents the y–values (the data plotted along the vertical axis). These values need to appear to be affected by the perceived cause; they form the dependent variable.
• Plot the data as points in the form of an ordered pair, (x, y).
• Analysis: We can make predictions if the points show a correlation.
* if the points appear to increase to the right, this is a positive correlation.
* if the points appear to decrease to the right, this is a negative correlation
Positive Correlation: We expect that if the values along the horizontal axis increase, so do the values associated with the vertical axis. That is, as we increase x, we increase y. The more we study, the higher we expect to score on Exam One.
[pic]
Negative Correlation: We expect that if we increase x then we decrease y. The higher the temperature, the fewer minutes we will jog.
[pic]
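A minimal matplotlib sketch of the recipe above; the study-time and exam-score numbers below are invented purely for illustration.

import matplotlib.pyplot as plt

# Hypothetical data: hours studied (the perceived cause, x) and Exam One score (y).
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [55, 62, 60, 70, 74, 80, 85, 88]

# Plot each (x, y) pair as a point; an upward drift to the right
# suggests a positive correlation.
plt.scatter(hours_studied, exam_score)
plt.xlabel("Hours studied (independent variable)")
plt.ylabel("Exam One score (dependent variable)")
plt.title("Positive correlation: more study, higher score")
plt.show()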
Problem 1
For each pair below, decide if there is a correlation between the two factors. If there is, is it a positive or a negative correlation? Then decide if the two factors have a causal relationship. If they don’t have a causal relationship but they do correlate, determine if there are hidden factors that explain the correlation, if the correlation is accidental, or if there is misleading information.
a. A child’s shoe size, a child’s ability to do math
b. Blood alcohol level and reaction time
c. A girl’s body weight, the time the girl spends playing with dolls each day
d. Price on an airline ticket, the distance traveled
Solution
a. Positive correlation. A child’s shoe size does correspond to a child’s ability to do math: the larger a child’s shoe size, the better at math the child tends to be. But the relationship is not causal; large feet do not cause a child to perform math better. There is a hidden factor: age. The older the child, the larger the child’s shoe size. As a child gets older, they take more math classes. The more math classes the child has taken, the better the child performs in math.
b. Negative correlation between blood alcohol level and the quickness of a person’s reactions: the higher the blood alcohol level, the slower the reaction (the reaction time itself gets longer). The relationship is causal.
c. Negative correlation. As a girl’s body weight increases, she plays with dolls less each day. No causal relationship here; again there is a hidden factor, and again it is age. The more a girl weighs, the older she is, and the older she is, the less time she spends playing with dolls.
d. Positive correlation. The longer the distance, generally, the more expensive the ticket. Causation.
Problem 2
Let’s examine the basic question, “Do students who do better on a placement exam perform better in a college algebra course?” Below is the data. Draw a scatter plot and answer the question.
|Placement Score |Final Average |
| |College Algebra |
|70 |91 |
|68 |89 |
|56 |71 |
|40 |62 |
|78 |95 |
|59 |65 |
|67 |85 |
|45 |66 |
|61 |70 |
Solution:
We need to examine the data visually to see if there exists a positive or a negative correlation between higher placement test scores and performance in college algebra. Below we construct a scatter plot.
Though not every point follows it exactly, the general trend is clear: an increase in placement score does translate to an increase in the final average grade.
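One way to produce the scatter plot, and to quantify the trend, is the Python sketch below. It uses the nine data pairs from the table; the correlation coefficient at the end goes a bit beyond the text and is included only as an extra check.

import matplotlib.pyplot as plt
import numpy as np

# Data from the table: placement score (x) and College Algebra final average (y).
placement = [70, 68, 56, 40, 78, 59, 67, 45, 61]
final_avg = [91, 89, 71, 62, 95, 65, 85, 66, 70]

plt.scatter(placement, final_avg)
plt.xlabel("Placement score")
plt.ylabel("Final average in College Algebra")
plt.title("Placement score vs. final average")
plt.show()

# A correlation coefficient near +1 confirms the strong upward trend.
r = np.corrcoef(placement, final_avg)[0, 1]
print(f"Correlation coefficient: {r:.2f}")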
Problem Set
For problems 1 – 4, decide if there is a correlation between the two factors. If there is, is it a positive or a negative correlation? Then decide if the two factors have a causal relationship. If they don’t have a causal relationship but they do correlate, determine if there are hidden factors that explain the correlation, if the correlation is accidental, or if there is misleading information.
1. A boy’s height, a boy’s time spent watching cartoons each day
2. Altitude, air pressure
3. Number of homes sold, realtor’s income
4. Number of abortions in US, number of people who are Pro-choice
5. From a survey of 2000 people, the table below represents averages for the number of years in school and the associated average monthly salary. Make a scatter plot, labeling the x and y axes.
|Number of Years in School |Average Monthly Salary |
|Less than 12 (approx. 10) |$ 1,500 |
|12 |$ 1,750 |
|14 |$ 2,100 |
|16 |$ 2,550 |
|18 |$ 3,000 |
|20 |$4,200 |
6. Draw a line which closely fits the scatter plot below of cumulative donations for a charity by year:
[pic]
7. From the scatter plot below, interpret the linear pattern and predict the percent of students who failed the math course in the year 2000.
8. Which data below has the greatest negative correlation?
a)
b)
c) [pic]
d)
Construct and Draw Inferences
Constructing and drawing inferences are essential to critical thinking and problem solving. When faced with statements, problems and puzzles, we do more than use common sense. We use problem solving skills, try to fit patterns and infer statements that follow logically from the statements given. We determine what is reasonable and what is not. We determine what should logically follow and what should not in order to make good decisions.
Circle Graphs
Ripped from the nation’s headlines: Should a juvenile be tried as an adult? To address this issue, we should ask ourselves many questions and look at this crucial problem from many perspectives. For many of us, the first question we ask may be “Do juveniles who murder pose a chronic problem in this country?” Well, what’s chronic? If a large percentage of all murders were done by juveniles, this could be called chronic.
So, we return to the Crime Index as defined by the FBI for 2001. Let us ask the question, “does there exist a correlation between age and those who commit murder in this country?” As long as we have the information grouped by category, which in this case is by age, we can display each category’s share of the whole in a pie chart, or circle graph.
First, let’s see how a circle graph or pie chart is made. We tend to subdivide a circle into sectors represented by their central angle in either degrees (out of 360 degrees) or the percent of the circle that is to be shaded (out of 100 %).
Below are common subdivisions of a circle:
[pic]
So, for our question, “is there a correlation between age and those who commit murder in this country?”, we examine the data taken from the Crime Index. For the 10,113 known murderers in the country in 2001, the age distribution was given as follows:
|Age, in years |Number |
| 1 to 4 |0 |
| 5 to 8 |0 |
| 9 to 12 |14 |
|13 to 16 |454 |
|17 to 19 |1,695 |
|20 to 24 |2,767 |
|25 to 29 |1,571 |
|30 to 34 |992 |
|35 to 39 |855 |
|40 to 44 |645 |
|45 to 49 |455 |
|50 to 54 |272 |
|55 to 59 |158 |
|60 to 64 |85 |
|65 to 69 |59 |
|70 to 74 |37 |
|75 and over |54 |
|Total |10,113 |
So, since the data is already organized, let’s find the relative frequency, or density, of each age group. This means we will reconstruct the table and find the percent of murderers for each category: 1 to 4, 5 to 8, 9 to 12, 13 to 16 and so on. Note that not all categories are partitioned into equal age intervals.
|Age, in years |Number |Relative frequency |Central Angle |
| 1 to 4 |0 |0 | |
| 5 to 8 |0 |0 | |
| 9 to 12 |14 |0.001 |0.5 degrees |
|13 to 16 |454 |0.045 |16.2 degrees |
|17 to 19 |1,695 |0.168 |60.3 degrees |
|20 to 24 |2,767 |0.274 |98.5 degrees |
|25 to 29 |1,571 |0.155 |55.9 degrees |
|30 to 34 |992 |0.098 |35.3 degrees |
|35 to 39 |855 |0.085 |30.4 degrees |
|40 to 44 |645 |0.064 |23 degrees |
|45 to 49 |455 |0.045 |16.2 degrees |
|50 to 54 |272 |0.027 |9.7 degrees |
|55 to 59 |158 |0.016 |5.62 degrees |
|60 to 64 |85 |0.008 |3 degrees |
|65 to 69 |59 |0.005 |2.1 degrees |
|70 to 74 |37 |0.004 |1.3 degrees |
|75 and over |54 |0.005 |2 degrees |
|Total |10,113 |1 |360 degrees, a whole circle |
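The relative frequencies and central angles in the table can be generated directly from the counts. A short Python sketch, with the FBI counts typed in from the table above:

# Age-group counts of known murder offenders, 2001, from the table above.
counts = {
    "1 to 4": 0, "5 to 8": 0, "9 to 12": 14, "13 to 16": 454, "17 to 19": 1695,
    "20 to 24": 2767, "25 to 29": 1571, "30 to 34": 992, "35 to 39": 855,
    "40 to 44": 645, "45 to 49": 455, "50 to 54": 272, "55 to 59": 158,
    "60 to 64": 85, "65 to 69": 59, "70 to 74": 37, "75 and over": 54,
}

total = sum(counts.values())   # 10,113 known offenders

for group, n in counts.items():
    rel_freq = n / total       # each group's share of the whole
    angle = rel_freq * 360     # central angle of that group's pie slice, in degrees
    print(f"{group:>12}: {rel_freq:.3f}  {angle:5.1f} degrees")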
The pie chart below is illuminating. Very quickly, by glancing at the chart, we can tell that 20 to 24 year olds commit the most murders, but close behind are the 17 to 19 year olds, as well as the 25 to 29 year olds. If a juvenile is defined to be under 18 years of age, then this appears to be a chronic problem, because the second-largest group of murderers is the 17 to 19 year olds. When we factor in the 13 to 16 year olds (454), the problem of juvenile murder seems even more acute. Within the 13 to 19 year old age group alone, we have accounted for 454 + 1,695, or 2,149, murders committed by teenagers. This comes to 2,149/10,113, or just a little over 20 percent, and this doesn’t include the children who are 12 or under.
[pic]
Now, let’s address this problem again. Numbers never lie. But rearranged, could they deceive? Could the very same numbers be used by the opposing side of the argument to make the opposing view more viable? First, we rearrange the numbers.
|1 to 19 |14+454+1695=2163 |
|20 to 39 |2767+1571+992+855=6185 |
|40 to 59 |645+455+272+158=1530 |
|60 and over |85+59+37+54=235 |
Giving us the following pie chart: [pic]
This considers only the murders where we know the age of the murderer; there were 10,113 of these murders.
But now we are trying to represent the opposing point of view, trying to show that murder by juveniles is not “chronic.” So we note that in 2001, there were an additional 5,375 murders where the age of the perpetrator was unknown. Regrouping, our table looks like:
|1 to 19 |2,163 |
|20 to 39 |6,185 |
|40 to 59 |1,530 |
|60 and over |235 |
|Unknown |5,375 |
And the pie chart now looks like: [pic]
Notice how much smaller the piece of the pie for the 1 to 19 year old segment now is compared to the whole. This is a significant difference from the pie chart above, where we did not factor in the murders committed by people of unknown ages. To further enhance our argument, we may construct the slices of the pie with a 3–dimensional representation. We then angle the segment of the pie we are ostensibly trying to hide so that it stands out less. Now, our point that juvenile crime is not a chronic problem seems more justified to the viewer’s eye.
[pic]
To add the final touch to enhance our argument, let’s re-categorize, changing the two groupings 1 to 19 and 20 to 39 into 1 to 16 and 17 to 39. Keeping the category of unknown murderers in the groupings, let’s compare the original pie chart with the final one. To the naked eye, a quick glance reveals a sliver on the left chart compared to nearly a quarter of the pie on the right.
[pic][pic]
Statistics don’t lie, but they can be rearranged to show whatever is on one’s agenda.
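Matplotlib can reproduce the two competing presentations side by side. The sketch below uses the groupings from the tables above; it is only an illustration of how adding the “unknown” slice visually shrinks the 1 to 19 segment.

import matplotlib.pyplot as plt

labels_known = ["1 to 19", "20 to 39", "40 to 59", "60 and over"]
counts_known = [2163, 6185, 1530, 235]

labels_all = labels_known + ["Unknown age"]
counts_all = counts_known + [5375]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: only the murders where the offender's age is known.
ax1.pie(counts_known, labels=labels_known, autopct="%1.1f%%")
ax1.set_title("Known ages only")

# Right: the same counts plus the 5,375 murders of unknown age;
# the 1 to 19 slice now looks much smaller relative to the whole.
ax2.pie(counts_all, labels=labels_all, autopct="%1.1f%%")
ax2.set_title("Unknown ages included")

plt.show()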
Problem 2
The graph below is shown while a TV anchorman states, “There was a sharp, dramatic increase in drunk driving convictions between the year 1999 and the year 2000.” Consider the statement and comment on its accuracy.
[pic]
Solution
According to the figure, the actual increase in drunk driving convictions between 1999 and 2000 was 12, up to 732 from 720 the year before. Though this is an increase, it cannot be considered a “sharp dramatic increase”. Evaluating the data another way, the percent increase, 12/720 ≈ 1.7 %, is not significantly sharp or particularly dramatic. The anchorman was overdramatizing the report; the words were inflammatory, bordering on misleading.
Problem 3 Drawing Inferences
A bucket has small green balls, medium blue balls, large pink balls, and very large red balls. A child picks ten balls, selecting each randomly so that each ball is equally likely to be selected. Four such trials were conducted. Which trial most closely resembles the theoretical distribution we would expect if the balls were selected randomly ten times?
[pic]
a)
|Balls |Number of Balls |
| |Selected |
|Small Green |2 |
|Medium Blue |2 |
|Large Pink |2 |
|Very Large Red |4 |
b)
|Balls |Number of Balls |
| |Selected |
|Small Green |3 |
|Medium Blue |2 |
|Large Pink |3 |
|Very Large Red |2 |
c)
|Balls |Number of Balls |
| |Selected |
|Small Green |2 |
|Medium Blue |3 |
|Large Pink |2 |
|Very Large Red |3 |
d)
|Balls |Number of Balls |
| |Selected |
|Small Green |3 |
|Medium Blue |2 |
|Large Pink |4 |
|Very Large Red |1 |
Solution
First, we need to calculate the theoretical probability for each type of ball. Recall, the probability is the number of successful outcomes divided by the total number of outcomes. The total number of balls is 20. There are 6 small green ones, 4 medium blue ones, 7 large pink ones, and 3 very large red ones. Below are the theoretical probabilities:
|Balls |Prob. |
|Small Green |6/20 |
|Medium Blue |4/20 |
|Large Pink |7/20 |
|Very Large Red |3/20 |
If ten balls were selected, we could anticipate 3 out of 10 balls to be small green ones, 2 out of 10 to be medium blue ones, 3.5 out of 10 to be large pink ones and 1.5 out of 10 to be very large red ones. Of course, no actual trial can produce 3.5 or 1.5 balls; the trial that comes closest to these expected counts is choice b).
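The expected counts come straight from the theoretical probabilities; a quick Python sketch of the calculation:

from fractions import Fraction

# Number of balls of each type in the bucket (20 balls in all).
bucket = {"small green": 6, "medium blue": 4, "large pink": 7, "very large red": 3}
total = sum(bucket.values())

draws = 10   # the child selects ten balls

for ball, count in bucket.items():
    prob = Fraction(count, total)   # theoretical probability for a single draw
    expected = draws * prob         # expected number of this ball in ten draws
    print(f"{ball:>15}: probability {prob}, expected about {float(expected):.1f} of {draws}")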
Problem Set
1. In the year 2000, a state lottery distributes its $2.1 million proceeds in the following manner:
|Proceeds |Beneficiary |
| $ 900,000 |Education |
|$ 500,000 |Cities |
|$ 200,000 |Highways |
|$ 200,000 |Senior Citizens |
|$ 160,000 |Libraries |
|$ 140,000 |Other |
Draw a circle graph to represent this data.
2. The graph below shows the company’s profits in its first four years of existence. [pic]
What’s wrong with this statement, “There was a substantial increase in the company’s profit in its first 4 years of existence.”
3. Poll your classmates as to the most important ‘hot button’ campaign issue. Create a table as you see below.
|Topic |Frequency |Relative Frequency |Density |
|Terror | | | |
|Racial Discrimination | | | |
|Abortion | | | |
|Death Penalty | | | |
|Drugs | | | |
|Education | | | |
Construct a histogram and a pie chart for the data.
For problems 4 and 5, use the following data from the US Census Bureau. In 1999, there were roughly 280,000,000 US citizens, and 35,000,000 lived in poverty. Of these 35 million, 12,100,000 were children, and 4,500,000 of these children lived in families who were under one-half of the poverty level. The poverty level was defined as $13,290 per family of three. For each problem, construct a circle graph as designated below.
4. Draw a circle graph whose population is the citizens of the United States. Section the circle graph into two sectors, one sector representing the US citizens who live above the poverty level, one sector representing the US citizens who live below the poverty level.
5. Draw a circle graph whose population is those citizens who live below the poverty level. Section the circle graph into three sectors, one sector representing the adults who live below the poverty level, one sector representing the children who lived in families who lived under one-half the poverty level and the third sector is all of the other children who lived below the poverty level.
6. Project: Circle graphs, drawing inferences
Sometimes we choose to see what we want to see. We all stretch the truth, exaggerate what we need, ignore what hurts us and to what end, personal wealth at the expense of personal worth? From the US Census Bureau, 2000: Child poverty in America dropped from 13.5 million children in 1998 to 12.1 million in 1999. With a whisper of optimism, we rationalize that this improvement was great. Was it?
Do you ever have trouble focusing on exams or concentrating on homework assignments? How hard do you think it would be to concentrate on your exams, homework, or even your instructor's lectures if your family didn't have enough money to feed you? What if you were in poor health and your family couldn't afford to take you to the doctor or provide the medicine you need? The bitter truth is that in 2000, 12,100,000 children in America were living in poverty and confronted these challenges every day.
If a family of three were living below the poverty line in 2000, they had an income below $13,290 a year. Living in poverty can translate to residing in crowded housing, having your utilities turned off, not owning a phone, or refrigerator or car, not having enough food to feed your family, not enough medicine to heal your loved ones. And the heart wrenching statistic is that 4.5 million children live in families that exist below one-half of the official poverty level.
Do we have your attention, are you gasping in proper reverence? We should. Particularly because in 2000, America was experiencing one of its greatest flashes of economic prosperity. Business was skyrocketing, and people were spending. But, was just a minute percentage of Americans benefiting from this new wealth? Ironically, in 2000, the unemployment rate in the U.S. was lower than it had been in years, but the percentage of poor children in working families was soaring. There were many possibilities to explain this phenomenon, but "Some economists (said) that if wages had kept pace with the cost of living since the 1960s, the minimum wage would (have been) between $12 and $14 dollars" ().” Instead, the minimum wage was $5.15.
Assignment Go to the US Census Bureau. Find out how many children there were in the US in 2000. Construct a circle graph with the following categories: children who lived above the poverty level and children who lived below the poverty level. Draw separate sections of the circle graph for those children who lived above $6,645 a year (half of the poverty level of $13,290 a year) and those who lived below $6,645 per year. Also, include a section of the graph for those children who lived in the upper 1 % of the income bracket, and determine what that income level was. Then tackle the following questions:
a. Do you think there is a positive, negative or no correlation between concentrating in high school and graduating from high school? Is it a causation relationship? Why?
b. Do you think there is a positive, negative or no correlation between concentrating in high school and getting into college? Is it a causation relationship? Why?
c. Do you think there is a positive, negative or no correlation between concentrating in high school and acquiring well-paying jobs? Is it a causation relationship? Why?
d. Do you think there is a positive, negative or no correlation between staying healthy and having access to doctors and medicine? Is it a causation relationship? Why?
e. Do you think there is a positive, negative or no correlation between poverty and crime? Is it a causation relationship? Why?
f. Do you think there is a positive, negative or no correlation between issues that politicians and lawmakers have as a top priority and issues that affect those under 18, who can not vote? Is it a causation relationship? Why?
For problems 7-9, use the following data. Source: In 2000, the population of California was 33,871,648 and 134,227 Californians purchased 193,489 handguns. 103,743 people purchased one hand gun, 28,453 people purchased two to five handguns totaling 71,363 handguns. 1,855 people purchased 6 to 12 handguns, totaling 14,053 handguns and 176 people purchased more than 12 handguns for a total of 4330 handguns.
7. Construct a circle graph representing the number of people who purchased handguns in California in the year 2000. Separate the sectors into those who bought one gun, two-five guns, 6-12 guns and more than 12 guns. .
8. Construct a circle graph representing the number of handguns purchased by Californians in the year 2000. Separate the sectors into those who bought one gun, two-five guns, 6-12 guns and more than 12 guns. .
9. Based on your results in problems 7 and 8, construct an argument pertaining to gun control and argue one side of the argument based on these results.
For problems 10-15, use the following data. 5,000 years ago, forests covered nearly 50% of the earth's land surface. Since the advent of humans, forests have shrunk to cover less than 20%. Forests serve as the lungs of our planet by providing the very oxygen we breathe. The rate of deforestation is increasing, and although extinction is nature’s way of selectively re-aligning our living world, this extinction, the most acute since the dinosaurs, is not nature’s way. Humans alone have caused it.
Source: RAN (Rainforest Action Network) and Myers (op. cit.). In Central and South America: Bolivia, whose land area is 1,098,581 square kilometers, once had a forest cover of 90,000 sq km and now has a forest cover of 45,000 sq km. Brazil, whose land area is 8,511,960 sq km, once had a forest cover of 2,860,000 sq km and now has a forest cover of 1,800,000 sq km. Central America has a land area of 522,915 sq km, once had a forest cover of 500,000 sq km and now has a forest cover of 55,000 sq km. Colombia has a land area of 1,138,891 sq km, once had a forest cover of 700,000 sq km and now has a forest cover of 180,000 sq km. Ecuador’s land area is 270,670 sq km; it once had a forest cover of 132,000 sq km and now has a forest cover of 44,000 sq km. Mexico’s land area is 1,967,180 sq km; at one time its forest cover was 400,000 sq km and now its forest cover is 110,000 sq km.
10. For each country, construct a circle graph where the circle represents the land area of the country. Divide each circle into two sectors, one for the country’s land area that was once covered by forests and one for the land area that was not covered by forests at that time.
11. For each country, construct a circle graph where the circle represents the land area of the country. Divide each circle into two sectors, one for the country’s land area that is currently covered by forests and one for the land area that is currently not covered by forests.
12. For each country, construct a circle graph where the circle represents the original extent of forest cover. Divide the circle into two sectors, one for the existing land area covered by forests and one for the land area lost to deforestation.
13. Construct a circle graph that represents the total land area of Bolivia, Brazil, Central America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors, two for each country, where one sector represents the land area currently covered by forests and the other the land area currently not covered by forests.
14. Construct a circle graph that represents the total land area of Bolivia, Brazil, Central America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors, two for each country, where one sector represents the land area that was once covered by forests and the other represents the land area that was not covered by forests at that time.
15. After assimilating the information and viewing the circle graphs from problems 10-14, provide an argument, either pro or con, with regard to the following statement: “Deforestation of the rain forests in Central and South America is threatening the local environment as well as the global environment. It should be a hot button issue in today’s society.”
16. Project: Below is another table taken from the FBI Crime Index for 2001. Using circle charts (pie charts), take an issue you glean from the table provided and show both sides of the argument, using pie charts to visually sway your reader. Outline the issue, show the supporting table(s) and pie chart(s). Discuss the benefits and harm of such practices.
|Murder Victims1 |
|By Race and Sex, 2001 |
|Race of victim |Total |Male |Female |Unknown |
|Total white victims | | | | |
|Total victims2 |13,752 | | | |
1 The homicides of September 11, 2001, were not included in any murder tables (Tables 2.3-2.15). See special report, Section V.
2 Total number of murder victims for whom supplemental homicide data were received.
Measures of Central Tendency
Finding a number that best represents a set of data is important to you right now, because the choice of the “representative” number that best indicates your grades can determine your course grade. Mathematicians say that the number that is going to serve as the spokesperson for the data should reflect the center, or the middle, of the data.
Usually we begin by averaging the numbers to find that representative number; we find the sum and then divide by the number of data points. But if the data consists of exam scores and you earned a 95, 95, 95, 95, 95, and a 45, then your average is found with two calculations: 95 + 95 + 95 + 95 + 95 + 45 = 520 and 520/6 = 86.7. This means the center of your data, the letter grade that best represents your work according to your average, is a B, and yet you never once earned a B. In fact, you earned only A’s, except for one failing grade. You pause, because clearly you earned 5 grades of 95 and just one grade of 45. This must count for something, right? The score that appears the most, the 95, is called the mode, and it is simply another representation of the tendency of the data.
Now that we see there is more than one way to refer to the center of the data, let’s begin with perhaps a more realistic example. Suppose we knew you had the following exam scores: 60, 80, 60, 70, 80, 80, 90, and 95. You’re thinking perhaps you deserve an A because your last two grades were A’s. Or at the very least, you deserve a B. You begin by finding your average, or mean, which is the sum of the scores divided by the number of scores. First you add the scores: 60 + 80 + 60 + 70 + 80 + 80 + 90 + 95 = 615. You had 8 exams, and the average is found by dividing 615 by 8; 615/8 = 76.9, or a C. Uh oh. You change your strategy. You argue that since you scored an 80 three times, you deserve a B. The mode is the score that occurs most frequently, and your mode is an 80. Does this help your argument? Well, one more indication of the middle of your data is the middle value when you align the numbers in order, either from top to bottom or from bottom to top. So, we arrange our data as 60, 60, 70, 80, 80, 80, 90, and 95. The value that occurs in the middle is called the median, like the median of a highway. If there is an odd number of data points, the median is the single middle value. If there is an even number of data points, there will be two numbers in the physical middle of the data, and when this occurs, you average the two middle numbers. For us, there are two 80’s smack in the middle of the data, another indication you deserve a B. Now, perhaps the last possible argument you may use to justify that you are a B student is that your last four exam grades, 80, 80, 90 and 95, showed you were more of a B student than a C student at the end of the course. So, despite having an average or mean of 76.9, your mode and median scores were an 80, and your grades in the second half of the semester were certainly not indicative of a C student. What grade should you get? What grade did you earn?
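Python’s standard statistics module computes all three measures for the exam scores above; a short sketch:

import statistics

scores = [60, 80, 60, 70, 80, 80, 90, 95]

mean = statistics.mean(scores)      # sum of the scores divided by how many there are
median = statistics.median(scores)  # middle value (the average of the two middle scores here)
mode = statistics.mode(scores)      # most frequently occurring score

print(f"mean:   {mean:.1f}")   # 76.9, a C
print(f"median: {median:g}")   # 80, a B
print(f"mode:   {mode}")       # 80, a B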
Real Estate You meet with a real estate agent and carefully explain the price range of the homes you are interested in seeing. The agent taps away on their computer and tells you they completely understand what you want: you are looking for homes in the $130,000 to $160,000 range. You nod your head in agreement. The agent informs you they have found three neighborhoods where the mean (average) values of the houses are $128,571, $136,786 and $161,429. Each subdivision is small, just like you prefer, with 14 homes. They explain the need for you to sign an exclusive right-to-buy agreement before they take you out. Impressed with both the immediacy and the detail provided, you quickly sign the agreement. The real estate agent takes you out for the day. After cozying into the front seat of their car, you sit back and enthusiastically await what should prove to be a worthwhile day of house hunting. By the end of the day, you are shaking your head sideways, not nodding up and down, and you are straining to think of ways to break the stupid exclusive right-to-buy agreement you signed earlier. What happened? Let’s see.
The three subdivisions you saw:
| |Sleepy Brook |Vista View |Meadowlands |
|House 1 |205,000 |130,000 |300,000 |
|2 |400,000 |130,000 |400,000 |
|3 |500,000 |135,000 |400,000 |
|4 |80,000 |140,000 |500,000 |
|5 |70,000 |150,000 |65,000 |
|6 |60,000 |125,000 |70,000 |
|7 |80,000 |125,000 |70,000 |
|8 |80,000 |125,000 |65,000 |
|9 |100,000 |125,000 |65,000 |
|10 |100,000 |125,000 |65,000 |
|11 |60,000 |125,000 |65,000 |
|12 |60,000 |125,000 |65,000 |
|13 |60,000 |120,000 |65,000 |
|14 |60,000 |120,000 |65,000 |
|Average Value |136,786 |128,571 |161,429 |
Which subdivision was closest to your liking? Well, clearly Vista View is the only subdivision that even had homes in your price range, with 5 of the 14 homes within it. But this seemed the least likely subdivision because its average home value was a little below your range. Visiting the other two subdivisions was useless; they had no homes in your range. The agent never checked the values of the individual homes in the three subdivisions; they only checked the average value of the homes. To cut the agent some slack, checking 3 subdivisions with 14 homes each would have been a lengthy endeavor, because each home would have needed to be accessed individually on the computer screen. Remember, the agent wanted to impress you with their quick research. Still, the oversight occurred because you did not have enough information about the data. Measures of central tendency inform us about the behavior of the middle of the data without the need to see every tedious piece of it. Since pulling up each home would have been too time consuming (42 homes), what other pieces of information could have been pertinent, so that you would have known that only Vista View was worth visiting?
Mean, Median, Mode and Range. The mean or average value for a set of data is the average most of us are familiar with, where we take the sum of the data and divide the sum by the number of pieces of data.
For Sleepy Brook: (205,000 + 400,000 + 500,000 + 80,000 + 70,000 + 60,000 + 80,000 + 80,000 + 100,000 + 100,000 + 60,000 + 60,000 + 60,000 + 60,000)/14 = 1,915,000/14 ≈ $136,786
For Vista View: 1,800,000/14 ≈ $128,571
For Meadowlands: 2,260,000/14 ≈ $161,429
But clearly, this was not enough information about the middle of the data. What else could have helped us? Well, in the Meadowlands subdivision, 8 of the 14 homes were worth $65,000, one-half of the lower limit of our price range. This would have been helpful to know. The mode is the piece of data that shows up most frequently. So, in the Meadowlands, the mode is 65,000, occurring 8 times. For Vista View, the mode is 125,000, occurring 7 times; this mode is close to our price range. And Sleepy Brook? Its mode was much lower, 60,000, occurring 5 times.
What other tendency of the data would have been helpful? Knowing how many homes were in our price range would be too easy an answer, huh. Well, if we order the data from smallest to largest (or largest to smallest), the middle of the data is called the median. We use the image of a median on a highway to remember the name, because the median on a highway physically divides the highway in half. Our median does the same thing. If we order our data, then:
| |Sleepy Brook |Vista View |Meadowlands |
|House 1 |60,000 |120,000 |65,000 |
|2 |60,000 |120,000 |65,000 |
|3 |60,000 |125,000 |65,000 |
|4 |60,000 |125,000 |65,000 |
|5 |60,000 |125,000 |65,000 |
|6 |70,000 |125,000 |65,000 |
|7 |80,000 |125,000 |65,000 |
|8 |80,000 |125,000 |65,000 |
|9 |80,000 |125,000 |70,000 |
|10 |100,000 |130,000 |70,000 |
|11 |100,000 |130,000 |300,000 |
|12 |205,000 |135,000 |400,000 |
|13 |400,000 |140,000 |400,000 |
|14 |500,000 |150,000 |500,000 |
|Median |80,000 |125,000 |65,000 |
Note that if there is an odd number of data points, the median is a single piece of data. If there is an even number of data points, the median we are seeing is the average of the middle two items. How would knowing the median have been helpful? Well, if we had known the median for Meadowlands, then we would have known that one-half of the homes in the subdivision, that is, 7 of the homes, were $65,000 or below. To keep an average of $161,429, many of the other homes would have needed to be too expensive for us, leaving at best the possibility of a few homes in our range. It turned out there were no homes in our range.
Which leads us to the dispersion of the data. Dispersion means spreading, scattering or distribution, and we can capture it with another simple measure. The range is the difference between the largest and the smallest data point. For Sleepy Brook, $500,000 - $60,000 = $440,000; more plainly, there is a large difference between the cheapest and the most expensive home in the subdivision. For Vista View, $150,000 - $120,000 = $30,000, and this tells us all the homes are at least close to our price range. Meadowlands has the same problem Sleepy Brook had; its range is $500,000 - $65,000 = $435,000.
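Here is a Python sketch that computes all four measures for the three subdivisions, using the prices from the tables above:

import statistics

subdivisions = {
    "Sleepy Brook": [205_000, 400_000, 500_000, 80_000, 70_000, 60_000, 80_000,
                     80_000, 100_000, 100_000, 60_000, 60_000, 60_000, 60_000],
    "Vista View":   [130_000, 130_000, 135_000, 140_000, 150_000, 125_000, 125_000,
                     125_000, 125_000, 125_000, 125_000, 125_000, 120_000, 120_000],
    "Meadowlands":  [300_000, 400_000, 400_000, 500_000, 65_000, 70_000, 70_000,
                     65_000, 65_000, 65_000, 65_000, 65_000, 65_000, 65_000],
}

for name, prices in subdivisions.items():
    mean = statistics.mean(prices)
    median = statistics.median(prices)
    mode = statistics.mode(prices)
    price_range = max(prices) - min(prices)
    print(f"{name:>12}: mean ${mean:,.0f}, median ${median:,.0f}, "
          f"mode ${mode:,}, range ${price_range:,}")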
To measure the scattering and the distribution of even larger samples of data, we will examine standard deviations a little later. But first, let’s look at mean, median, mode and range a little longer.
Problem One
Below are the traffic fatalities per 100 million (10^8) vehicle miles in 2001. Source: U.S. National Highway Traffic Safety Administration. Rank the states and the District of Columbia in ascending order. Then find the mode, median, mean and range. Discuss the relevance of the numbers. This means that if any two measures correspond closely, look at the data and tell why. If any state is far from the middle of the data, call it an outlier.
|Alabama |1.75 |
|Alaska |1.80 |
|Arizona |2.06 |
|Arkansas |2.08 |
|California |1.27 |
|Colorado |1.71 |
|Connecticut |1.01 |
|Delaware |1.58 |
|District of Columbia |1.81 |
|Florida |1.93 |
|Georgia |1.50 |
|Hawaii |1.61 |
|Idaho |1.84 |
|Illinois |1.37 |
|Indiana |1.27 |
|Iowa |1.49 |
|Kansas |1.75 |
|Kentucky |1.83 |
|Louisiana |2.32 |
|Maine |1.33 |
|Maryland |1.27 |
|Massachusetts |0.90 |
|Michigan |1.34 |
|Minnesota |1.06 |
|Mississippi |2.18 |
|Missouri |1.62 |
|Montana |2.30 |
|Nebraska |1.36 |
|Nevada |1.71 |
|New Hampshire |1.15 |
|New Jersey |1.09 |
|New Mexico |1.99 |
|New York |1.18 |
|North Carolina |1.67 |
|North Dakota |1.45 |
|Ohio |1.29 |
|Oklahoma |1.55 |
|Oregon |1.42 |
|Pennsylvania |1.49 |
|Rhode Island |1.01 |
|South Carolina |2.27 |
|South Dakota |2.00 |
|Tennessee |1.85 |
|Texas |1.72 |
|Utah |1.25 |
|Vermont |0.96 |
|Virginia |1.27 |
|Washington |1.21 |
|West Virginia |1.91 |
|Wisconsin |1.33 |
|Wyoming |2.16 |
Answer:
|Massachusetts |0.90 |50 |
|Vermont |0.96 |49 |
|Connecticut |1.01 |47 |
|Rhode Island |1.01 |47 |
|Minnesota |1.06 |46 |
|New Jersey |1.09 |45 |
|New Hampshire |1.15 |44 |
|New York |1.18 |43 |
|Washington |1.21 |42 |
|Utah |1.25 |41 |
|California |1.27 |37 |
|Indiana |1.27 |37 |
|Maryland |1.27 |37 |
|Virginia |1.27 |37 |
|Ohio |1.29 |36 |
|Maine |1.33 |34 |
|Wisconsin |1.33 |34 |
|Michigan |1.34 |33 |
|Nebraska |1.36 |32 |
|Illinois |1.37 |31 |
|Oregon |1.42 |30 |
|North Dakota |1.45 |29 |
|Iowa |1.49 |27 |
|Pennsylvania |1.49 |27 |
|Georgia |1.50 |26 |
|Oklahoma |1.55 |25 |
|Delaware |1.58 |24 |
|Hawaii |1.61 |23 |
|Missouri |1.62 |22 |
|North Carolina |1.67 |21 |
|Colorado |1.71 |19 |
|Nevada |1.71 |19 |
|Texas |1.72 |18 |
|Alabama |1.75 |16 |
|Kansas |1.75 |16 |
|Alaska |1.80 |15 |
|District of Columbia |1.81 |(X) |
|Kentucky |1.83 |14 |
|Idaho |1.84 |13 |
|Tennessee |1.85 |12 |
|West Virginia |1.91 |11 |
|Florida |1.93 |10 |
|New Mexico |1.99 |9 |
|South Dakota |2.00 |8 |
|Arizona |2.06 |7 |
|Arkansas |2.08 |6 |
|Wyoming |2.16 |5 |
|Mississippi |2.18 |4 |
|South Carolina |2.27 |3 |
|Montana |2.30 |2 |
|Louisiana |2.32 |1 |
|Mode |1.27 | |
|Median |1.55 | |
|Mean |1.57 | |
|Range |1.42 | |
The mean and median are close; this means the average and the number in the middle of the data are close together. The number of states that rank above and below the average and the number that rank above and below the middle value, Oklahoma’s 1.55, are nearly the same, so the data is not top or bottom heavy. Yet this doesn’t mean the data is dispersed evenly. Why?
Let’s see why.
Problem Set
For problems 1-6 below, find the mean, median and mode for the data.
1. 1, 3, 4, 4, 4, 5, 5, 6 2. 3, 3, 4, 4, 4, 5, 5, 6
3. 3, 3, 3, 4, 4, 5, 5, 6 4. 3, 3, 3, 4, 5, 5, 5, 6
5. 1, 1, 1, 1, 2, 2, 6, 6 6. 1, 1, 2, 2, 6, 6, 6, 6
7. What is the median time it took for the students to write the exam?
|Student ID Number |Time to Take Exam |
|4025 |1:25 |
|1026 |1:09 |
|8790 |0:59 |
|1029 |0:45 |
|2943 |1:01 |
|2020 |1:12 |
|2084 |1:25 |
|5091 |1:31 |
|7812 |0:49 |
|5103 |2:00 |
|6092 |1:42 |
8. Below is the year and the percent of children under the age of 4 in a city that attended Day Care.
|Year |Percent |
|1970 |15 |
|1972 |17 |
|1974 |15 |
|1976 |16 |
|1978 |18 |
|1980 |17 |
|1982 |21 |
|1984 |31 |
|1986 |12 |
|1988 |15 |
|1990 |16 |
|1992 |17 |
|1994 |15 |
|1996 |12 |
What is the mode for this set of data?
9. From the US Census Bureau, 1999, below are the state rankings of the percent of elderly persons, 65 years and over, who live below the poverty level. Rank the states and the District of Columbia in ascending order. Then find the mode, median, mean and range. Discuss the relevance of the numbers. This means that if any two measures correspond closely, look at the data and tell why. If any state is far from the middle of the data, call it an outlier.
|Alabama |15.5 |
|Alaska |6.8 |
|Arizona |8.4 |
|Arkansas |13.8 |
|California |8.1 |
|Colorado |7.4 |
|Connecticut |7.0 |
|Delaware |7.9 |
|District of Columbia |16.4 |
|Florida |9.1 |
|Georgia |13.5 |
|Hawaii |7.4 |
|Idaho |8.3 |
|Illinois |8.3 |
|Indiana |7.7 |
|Iowa |7.7 |
|Kansas |8.1 |
|Kentucky |14.2 |
|Louisiana |16.7 |
|Maine |10.2 |
|Maryland |8.5 |
|Massachusetts |8.9 |
|Michigan |8.2 |
|Minnesota |8.2 |
|Mississippi |18.8 |
|Missouri |9.9 |
|Montana |9.1 |
|Nebraska |8.0 |
|Nevada |7.1 |
|New Hampshire |7.2 |
|New Jersey |7.8 |
|New Mexico |12.8 |
|New York |11.3 |
|North Carolina |13.2 |
|North Dakota |11.1 |
|Ohio |8.1 |
|Oklahoma |11.1 |
|Oregon |7.6 |
|Pennsylvania |9.1 |
|Rhode Island |10.6 |
|South Carolina |13.9 |
|South Dakota |11.1 |
|Tennessee |13.5 |
|Texas |12.8 |
|Utah |5.8 |
|Vermont |8.5 |
|Virginia |9.5 |
|Washington |7.5 |
|West Virginia |11.9 |
|Wisconsin |7.4 |
|Wyoming |8.9 |
Standard Deviation and the Normal Distribution
According to a study done by the National Center for Health Statistics, Mean Body Weight, Height and Body Mass Index, United States 1960-2002, American men (ages 20-74) were 25 pounds heavier in 2002 than they were some 42 years earlier in 1960. In 2002 the average American male weighed 191 pounds, up from his 1960 counterpart, who weighed 166 pounds. American women in the same age group followed the same trend: the average American woman weighed 164 pounds in 2002, up 24 pounds from the average American woman of 1960, who weighed 140 pounds. This study caused quite a stir, as nutritionists and diet doctors clamored together to seek solutions. And as you can imagine, the dangers of obesity were revisited when this study was broadcast. Average heights increased over the 42-year span as well, from 5‘ 8“ in 1960 to 5‘ 9 ½“ in 2002 for men, and from 5‘ 3“ to 5‘ 4“ for women. The study was done on a smaller representation of the true population; it was performed on thousands of people, while in reality the population of American men and women totals in the hundreds of millions. Since these numbers are so large, we assume the data to be normally distributed around the mean, or average. A normal distribution for a set of data means that there is more data close to the "average" and less data farther from the average, until finally relatively few data points lie at one extreme or the other. The data is symmetrically distributed around the average.
This is common sense, or mathematical intuition. They are, after all, close to being one and the same. Let’s say you are writing a story about the height distribution of the American male in 2002 because you are trying to correlate it to ethnicity, diet or genes. First you take the population, in this case those people who participated in the study, and tally up the number of people at each given height. Like most data, if the sample or population is large enough, the heights for the population turn out to be normally distributed. This means most people will be of average height or close to average height. In other words, the average height will also be the height that occurs most frequently in our population and the height smack in the middle of the data when it is ordered. Thus, the mean will be the mode and the median too. Next, if a population is normally distributed and you plot each height in increasing order, the number of men at a given height is symmetrically distributed around the average height. In other words, there will be more people close to average height than far from the average height. In 2002, the average height of the American male was 5’ 9 ½”, and this height will occur most frequently. Then, for our normally distributed society, which we aptly call the American male, the next most common heights occur from 5’ 9” through 5’ 10”, both heights ½ inch away from the mean height. The next most common heights would be expected to occur between 5’ 8” and 5’ 11”, and so on. We expect fewer and fewer men to have a designated height as we move further from the average height. Intuitively, this fits our preconceived notion of our society: we expect to find fewer men who are 6’ 5” than men who are 5’ 11”, for instance. And continuing, this means there will be fewer men who are five feet tall than 5’ 7”, and fewer who are 6’ 6” than 5’ 11”. If the heights do fit a normal distribution, then the heights are distributed symmetrically around the average height.
In short, more people will be closer to the average height than farther from it, and usually the distribution is normally distributed around the average; hence the words normal distribution.
If you looked at normally distributed data on a graph, it would look something like this:
[pic]
The x-axis (the horizontal one) is the value in question, the population’s height for example. The y-axis (the vertical one) is the number of data points for each value on the x-axis, the number of people that are that certain height.
The standard deviation is a measure of how widely values are dispersed from the mean (average value). For populations where the data points are tightly bunched together, the bell-shaped curve is steep and the standard deviation is small. For populations where the data points spread further apart from the average, the bell curve is flatter and the standard deviation is larger.
68-95-99.7 To refine our understanding of a standard deviation, we turn our attention to a graph. In a moment we will show you the calculation for the standard deviation. Right now, we want to present a conceptual understanding for the term ‘standard deviation.’ Recall, in 2002, the American male had a mean height of 5 ‘ 9 ½ “. The standard deviation was 2 3/8 “.
[pic]
For a normal distribution, one standard deviation (in red above) away from the mean in both directions on the horizontal axis will account for approximately 68 % of the population. There are two heights that are 2 3/8 inches from 5’ 9 ½”, one smaller, 5’ 7 1/8” (5’ 9 ½” – 2 3/8”), and one larger, 5’ 11 7/8” (5’ 9 ½” + 2 3/8”). Thus, 68 % of American men in 2002 were between 5’ 7 1/8” and 5’ 11 7/8”.
All data found within two standard deviations (in red and green above) of the mean will account for roughly 95 % of a normally distributed population, here the American men. The two heights two standard deviations away from the mean are found with two predictable calculations. We first subtract two standard deviations from the mean, giving us 5’ 9 ½” – 2 3/8” – 2 3/8” = 5’ 4 ¾”. We then add two standard deviations to the mean, giving 5’ 9 ½” + 2 3/8” + 2 3/8” = 6’ 2 ¼”. So, 95 % of American men in 2002 were between 5’ 4 ¾” and 6’ 2 ¼”. Recall, the heights of these 95 % of the population are distributed symmetrically around the mean.
Data found within three standard deviations of the mean (the red, green and blue areas) account for about 99.7 % of a normally distributed population. So, in 2002, 99.7 % of American men were between 5’ 2 3/8” (5’ 9 ½” – 2 3/8” – 2 3/8” – 2 3/8”) and 6’ 4 5/8” (5’ 9 ½” + 2 3/8” + 2 3/8” + 2 3/8”). From a different perspective, one could infer that in 2002, those American men who were more than three standard deviations away from the mean were either shorter than 5’ 2 3/8” or taller than 6’ 4 5/8”. Since they represented only 0.3 % of American males, they were considered short or tall by our population’s standards.
If a curve were flatter, the standard deviation would have to be larger in order to account for that 68 percent, and if the curve were steeper, the standard deviation would have to be smaller to account for 68 percent of the population. Standard deviation tells you how spread out the data points in the population are from the mean.
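A simulation makes the 68-95-99.7 pattern concrete. The Python sketch below draws a large sample of hypothetical heights using the 2002 mean (69.5 inches) and standard deviation (2.375 inches) quoted above, then counts how many fall within one, two, and three standard deviations; the simulated data merely stands in for the real survey.

import numpy as np

rng = np.random.default_rng(1)

mean_height = 69.5   # 5' 9 1/2", in inches
std_dev = 2.375      # 2 3/8", in inches

# Simulate heights for a large, normally distributed population of men.
heights = rng.normal(loc=mean_height, scale=std_dev, size=1_000_000)

for k in (1, 2, 3):
    lower = mean_height - k * std_dev
    upper = mean_height + k * std_dev
    share = np.mean((heights >= lower) & (heights <= upper))
    print(f"within {k} standard deviation(s): {share:.1%} "
          f"(between {lower:.3f} and {upper:.3f} inches)")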
Why is this useful? Well, if you are comparing test scores for different schools, the standard deviation will tell you how diverse the test scores are at each school. Let's say Washington High School has a higher mean test score than Adams High School on the mathematics portion of the statewide AIMS test administered in the state of Arizona to measure students’ understanding of high school mathematics. Our first reaction might be to deduce that the students at Washington are either smarter or better educated by their teachers.
You analyze the data further. The standard deviation, you find out, is larger at Washington than at Adams. This means that at Washington there are relatively more kids scoring toward one extreme or the other. By asking a few follow-up questions, you might find that Washington’s average was higher because the school district sent all of the gifted education kids to Washington. Or perhaps Adams’ scores were dragged down, and thus appeared bunched together, because of all the students who had recently been "mainstreamed" from special education classes, while the gifted classes were sent out of district. In this way, looking at the standard deviation can help point you in the right direction when asking why the data is the way it is.
Example 1 You are trying to decide which teacher's class to enroll in for mathematics. You go to a website that claims to have tracked three teachers' success rates over the past five years. The final grades for Mr. Allen's students had a mean score of 76 with a standard deviation of 5, Mrs. Bennett's students had a mean score of 74 with a standard deviation of 3, and Mrs. Clyde's students had a mean score of 79 with a standard deviation of 7. Whose class would you enroll in; that is, how would you interpret the data on the website? Rank the teachers from first to third, so that if one's section is full, you would know whose class to enroll in next.
Solution: We must quantify the grades to interpret the data. For Mr. Allen's classes, 68 % of the students earned a final grade that was within 5 points of 76, so 68 % of the students earned between 71 and 81. About 95 % scored within two standard deviations of the mean, so 95 % of the students earned a final grade between 66 and 86. Finally, 99.7 % of the students earned a final grade between 61 and 91. Continuing with this thought process, Mrs. Bennett's students had a lower final grade average, 74, but 68 % of her students earned a final grade between 71 and 77, 95 % earned a final grade between 68 and 80, and 99.7 % earned a final grade between 65 and 83. Mrs. Clyde's students earned the highest average, but her 68 %, 95 % and 99.7 % ranges are spread farther apart: 72-86, 65-93 and 58-100 respectively. A quick table allows us to compare the success rates of the three teachers:
| |Mr. A |Mrs. B |Mrs. C |
|68 % |71-81 |71-77 |72-86 |
|95 % |66-86 |68-80 |65-93 |
|99.7 % |61-91 |65-83 |58-100 |
So, to answer the question of which teacher you should take: if you are a good student, you have a better chance of securing an A with Mrs. Clyde first, Mr. Allen second and Mrs. Bennett third. If you struggle at math, you would probably choose Mrs. Bennett first, because 99.7 % of her students earn a 65 or above. Mr. Allen would probably be your second choice, Mrs. Clyde your third.
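The ranges in the table can be generated with the same empirical-rule idea; here is a rough Python sketch (the abbreviated teacher names and the dictionary layout are our own choices):

teachers = {"Mr. A": (76, 5), "Mrs. B": (74, 3), "Mrs. C": (79, 7)}
for name, (mean, sd) in teachers.items():
    print(name)
    for k in (1, 2, 3):                                  # 1, 2 and 3 standard deviations
        pct = {1: "68 %", 2: "95 %", 3: "99.7 %"}[k]
        print(" ", pct, mean - k * sd, "to", mean + k * sd)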
Example 2 In Typical City, USA, the number of hours a teen watches TV has become a concern for the town's elders. They research this and find the teens watch an average of 4 ½ hours of TV a day, with a standard deviation of ½ hour. Assuming the viewing hours are normally distributed, what percent of the teens watch
a) more than 5 hours of TV per day?
b) more than 5 ½ hours of TV per day?
c) less than 5 ½ hours of TV per day?
d) less than 4 hours of TV per day?
e) less than 3 ½ hours of TV per day?
Solution:
a) Since 5 hours is 1 standard deviation above the mean (4 ½ plus ½), 68 % of the teens fall within 1 standard deviation of the mean, that is, between 4 and 5 hours. Half of the teens watch from 0 to 4 ½ hours, and another 34 % (half of the 68 %) watch between 4 ½ and 5 hours. So, 84 % watch less than 5 hours, and thus 100 % - 84 %, or 16 %, watch more than 5 hours per day.
b) Since 5 ½ hours is 2 standard deviations above the mean (4 ½ plus ½ plus ½), 95 % of the teens fall within 2 standard deviations of the mean, that is, between 3 ½ and 5 ½ hours. Half of the teens watch from 0 to 4 ½ hours, and another 47 ½ % (half of the 95 %) watch between 4 ½ and 5 ½ hours. So, 97 ½ % watch less than 5 ½ hours, and thus 100 % - 97 ½ %, or 2 ½ %, watch more than 5 ½ hours per day.
c) From part b), 100 % - 2 ½ % = 97 ½ % of the teens watch less than 5 ½ hours of TV per day.
d) Since 4 hours is 1 standard deviation below the mean (4 ½ minus ½), 68 % of the teens fall within 1 standard deviation of the mean, that is, between 4 and 5 hours. Half of the teens watch from 0 to 4 ½ hours, and 34 % of them (half of the 68 %) watch between 4 and 4 ½ hours. So, 50 % - 34 %, or 16 %, watch less than 4 hours per day.
e) Since 3 ½ hours is 2 standard deviations below the mean (4 ½ minus ½ minus ½), 95 % of the teens fall within 2 standard deviations of the mean, that is, between 3 ½ and 5 ½ hours. Half of the teens watch from 0 to 4 ½ hours, and 47 ½ % of them (half of the 95 %) watch between 3 ½ and 4 ½ hours. So, 50 % - 47 ½ %, or 2 ½ %, of the teens watch less than 3 ½ hours per day.
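The same halving argument can be checked numerically. Below is a rough Python sketch that assumes, as the example does, that viewing hours are normally distributed with mean 4.5 and standard deviation 0.5; the helper function and its name are our own.

def pct_below(x, mean=4.5, sd=0.5):
    # Empirical-rule approximation of the cumulative percent below x, where x
    # is a whole number of standard deviations from the mean.
    within = {0: 0.0, 1: 68.0, 2: 95.0, 3: 99.7}
    k = round((x - mean) / sd)            # how many standard deviations from the mean
    half = within[abs(k)] / 2
    return 50 + half if k > 0 else 50 - half

print(100 - pct_below(5.0))   # a) more than 5 hours    -> 16.0
print(100 - pct_below(5.5))   # b) more than 5 ½ hours  -> 2.5
print(pct_below(5.5))         # c) less than 5 ½ hours  -> 97.5
print(pct_below(4.0))         # d) less than 4 hours    -> 16.0
print(pct_below(3.5))         # e) less than 3 ½ hours  -> 2.5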
Standard score or z-score. If one is analyzing data within 1, 2 or 3 standard deviations of the mean, then one can expect 68 %, 95 % or 99.7 %, respectively, of the population to lie within those bounds. But what happens if we know that 90 % of the data lies between two scores? What would the standard deviation look like?
Since data is rarely, if ever, presented to us with a mean of zero and a standard deviation of 1, we use the standard normal curve to help analyze any normally distributed data. Traditionally, positions along this standard curve are referred to as standard scores or z-scores. In other words, the number of standard deviations that a data value lies above or below the mean is called its standard score or z-score. So, a data value with a z-score of 0 is equal to the mean. A data value with a z-score of -1.3 lies 1.3 standard deviations below the mean, and so forth. If you know the standard deviation and mean of your data, z-scores enable you to determine the percent of the data between any two values in the range of your data.
To find each z-score, z = (data value - mean) / (standard deviation).
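As a one-line Python sketch of that formula (the function name is our own):

def z_score(x, mean, sd):
    # Number of standard deviations x lies above (+) or below (-) the mean.
    return (x - mean) / sd

print(z_score(74.25, 69.5, 2.375))   # 6' 2 ¼" is 2 standard deviations above the 2002 mean height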
Below is a table of z-scores for the standard normal distribution. [standard normal z-score table, with column headings z, 0, 0.01, 0.02, 0.03, …, not reproduced]
| |Allan |Bill |Cindy |Deanna |Eve |
| |74 |73 |59 |69 |68 |
| |76 |75 |62 |73 |78 |
| |77 |77 |78 |78 |79 |
| |80 |79 |79 |79 |79 |
| |81 |83 |80 |82 |83 |
| |82 |83 |96 |83 |83 |
| |83 |83 |99 |89 |83 |
|Mean |79 |79 |79 |79 |79 |
All of the students want a B. Allan argues that his middle grade, his median, is a B. Bill argues that his mode, the grade that occurs most frequently, is a B. Cindy argues that she has shown great potential: two of her grades are solid A's. Deanna makes the same argument, but her grades are not quite as erratic as Cindy's; they are not as dispersed away from the average as Cindy's grades. Eve, like Bill, also argues that her mode is a B. And although Eve and Bill have the same mean, median and mode, Eve is the one with the 68 on her record. Uh oh….
Standard deviation measures just this: how the data deviates from the mean. In other words, a standard deviation is a numerical value that tells the reader how spread out the data is. Allan's grades are clumped together, so he should have a small standard deviation. Cindy's grade history is more erratic, her grades are farther spread out, so she should have a larger standard deviation.
Let's compare the standard deviations for three of the students: Bill, Cindy and Eve. We will find how much each data value deviates from the mean. But notice, when we do, if we try to sum up these deviations to get some sort of average, the sum is zero, because the deviations above the mean cancel with those below the mean. So, we square the deviations to keep them positive, sum them, and divide by one less than the number of data points. Finally, we undo the squaring by taking a square root, so that we end up with a measure of how the values deviate from the mean.
[table of deviations, squared deviations and standard deviations for Bill, Cindy and Eve not reproduced]
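Since that computation table did not survive here, the following rough Python sketch walks through the same steps for Bill's seven grades as we read them from the table above (73, 75, 77, 79, 83, 83, 83); swap in Cindy's or Eve's column to compare.

import math

def sample_standard_deviation(values):
    mean = sum(values) / len(values)
    deviations = [x - mean for x in values]        # these sum to zero
    squared = [d ** 2 for d in deviations]         # squaring keeps them positive
    variance = sum(squared) / (len(values) - 1)    # divide by one less than the number of data points
    return math.sqrt(variance)                     # undo the squaring

bill = [73, 75, 77, 79, 83, 83, 83]
print(round(sample_standard_deviation(bill), 2))   # about 4.16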
The three divisional winners and the second-place team with the best record make the playoffs. But who is the best team? The worst team? How good is good and how bad is bad? Let's calculate the standard deviation of the number of wins for the fourteen American League teams as of Labor Day, 2004 (the full standings are reprinted at the end of this example).
First, we find the mean number of wins by adding up all the wins and dividing by 14.
mean = (83 + 81 + 80 + 77 + 77 + 74 + 67 + 67 + 63 + 62 + 59 + 56 + 51 + 47) / 14 = 944 / 14 ≈ 67.4
So, the average, or mean, number of wins for the American League teams on Labor Day, 2004 was 67.4. We then create a table. The table lists the teams in order by the number of wins, from most to least. The third column holds the difference between a team's number of wins and the mean number of wins, and the final column has that difference squared. We then find the sum of the final column.
|Team |Wins |Wins - Mean |(Wins - Mean)² |
|NY YANKEES |83 |83-67.4 = 15.6 |15.62 = 243.36 |
|OAKLAND |81 |81-67.4 = 13.6 |13.62 = 184.96 |
|BOSTON |80 |12.6 |158.8 |
|MINNESOTA |77 |9.6 |92.2 |
|ANAHEIM |77 |9.6 |92.2 |
|TEXAS |74 |6.6 |43.6 |
|CHI WHITE SOX |67 |-0.4 |0.2 |
|CLEVELAND |67 |-0.4 |0.2 |
|BALTIMORE |63 |-4.4 |19.4 |
|DETROIT |62 |-5.4 |29.2 |
|TAMPA BAY |59 |-8.4 |70.6 |
|TORONTO |56 |-11.4 |129.96 |
|SEATTLE |51 |-16.4 |268.96 |
|KANSAS CITY |47 |-20.4 |416.2 |
|SUM |944 |0 |1749.84 |
To find the standard deviation, we complete the last two steps.
variance = 1749.84 / 13 ≈ 134.6 and standard deviation = √134.6 ≈ 11.6 wins
Recall, to find each z-score, we divide by the standard deviation: z = (wins - mean) / (standard deviation) = (wins - 67.4) / 11.6.
Below is the table of the American League teams on Labor Day, 2004. Look at the final column and recall, as you glance at each team's z-score, that for a normal population 68 % of the population falls within 1 standard deviation of the mean, 95 % falls within 2 and 99.7 % falls within 3. How good are the NY Yankees and how bad are the Kansas City Royals? You now have a frame of reference to answer that question.
|Team |Wins |Wins - Mean |z-score |
|NY YANKEES |83 |83-67.4 = 15.6 |15.6/11.6 = 1.3 |
|OAKLAND |81 |81-67.4 = 13.6 |13.6/11.6 = 1.2 |
|BOSTON |80 |12.6 |1.1 |
|MINNESOTA |77 |9.6 |0.8 |
|ANAHEIM |77 |9.6 |0.8 |
|TEXAS |74 |6.6 |0.6 |
|CHI WHITE SOX |67 |-0.4 |-0.03 |
|CLEVELAND |67 |-0.4 |-0.03 |
|BALTIMORE |63 |-4.4 |-0.4 |
|DETROIT |62 |-5.4 |-0.5 |
|TAMPA BAY |59 |-8.4 |-0.8 |
|TORONTO |56 |-11.4 |-0.98 |
|SEATTLE |51 |-16.4 |-1.4 |
|KANSAS CITY |47 |-20.4 |-1.8 |
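The mean, standard deviation and z-score columns above can be reproduced with a short Python sketch; the team abbreviations are our own shorthand, and the division by one less than the number of teams follows the convention used in this section.

import math

wins = {"NYY": 83, "OAK": 81, "BOS": 80, "MIN": 77, "ANA": 77, "TEX": 74,
        "CWS": 67, "CLE": 67, "BAL": 63, "DET": 62, "TB": 59, "TOR": 56,
        "SEA": 51, "KC": 47}

mean = sum(wins.values()) / len(wins)                                           # about 67.4
sd = math.sqrt(sum((w - mean) ** 2 for w in wins.values()) / (len(wins) - 1))   # about 11.6

for team, w in wins.items():
    print(team, round((w - mean) / sd, 2))                                      # each team's z-score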
Example 2. Does money buy success in baseball? (Updated: 9/5/2004 3:37:06 PM) The payrolls for the American League teams are listed below.
|Team |Payroll Salary |
|New York Yankees |$ 184,193,950 |
|Boston Red Sox |$ 127,298,500 |
|Anaheim Angels |$ 100,534,667 |
|Seattle Mariners |$ 81,515,834 |
|Chicago White Sox |$ 65,212,500 |
|Oakland Athletics |$ 59,425,667 |
|Texas Rangers |$ 55,050,417 |
|Minnesota Twins |$ 53,585,000 |
|Baltimore Orioles |$ 51,623,333 |
|Toronto Blue Jays |$ 50,017,000 |
|Kansas City Royals |$ 47,609,000 |
|Detroit Tigers |$ 46,832,000 |
|Cleveland Indians |$ 34,319,300 |
|Tampa Bay Devil Rays |$ 29,556,667 |
Let's find the number of standard deviations from the mean (the z-score) for each team's payroll and then compare the payroll rankings to the true standings on September 5th. Does there seem to be a correlation between salaries and success? Does money buy success?
The sum of the 14 American League salaries is $ 986,773,835, thus the average salary is $ 70,483,845.36, which we will round to $70,483,845.
To calculate the standard deviation, we construct the following table.
|Team |Payroll Salary |Payroll - Mean |(Payroll - Mean)² |
|New York Yankees |$ 184,193,950 |113,710,105 |12,929,987,979,111,025 |
|Boston Red Sox |$ 127,298,500 |56,814,655 |3,227,905,022,769,025 |
|Anaheim Angels |$ 100,534,667 |30,050,822 |903,051,902,875,684 |
|Seattle Mariners |$ 81,515,834 |11,031,989 |121,704,781,296,121 |
|Chicago White Sox |$ 65,212,500 |-5,271,345 |27,787,078,109,025 |
|Oakland Athletics |$ 59,425,667 |-11,058,178 |122,283,300,679,684 |
|Texas Rangers |$ 55,050,417 | | |
|Minnesota Twins |$ 53,585,000 | | |
|Baltimore Orioles |$ 51,623,333 | | |
|Toronto Blue Jays |$ 50,017,000 | | |
|Kansas City Royals |$ 47,609,000 | | |
|Detroit Tigers |$ 46,832,000 | | |
|Cleveland Indians |$ 34,319,300 | | |
|Tampa Bay Devil Rays |$ 29,556,667 | | |
We leave it as an exercise for you to complete the table above.
• The sum of the numbers in the last column, divided by 13 (one less than the number of teams, as before), is called the variance. Here the variance is about 1.746 x 10^15.
• One standard deviation from the mean is found by taking the square root of this number. The standard deviation is about $ 41,783,940.
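If you want to check your completed table, here is a rough Python sketch of the whole calculation, from the payrolls down to the z-scores (the variable names are our own, and the division by one less than the number of teams matches the convention above):

import math

payrolls = {
    "New York Yankees": 184_193_950, "Boston Red Sox": 127_298_500,
    "Anaheim Angels": 100_534_667, "Seattle Mariners": 81_515_834,
    "Chicago White Sox": 65_212_500, "Oakland Athletics": 59_425_667,
    "Texas Rangers": 55_050_417, "Minnesota Twins": 53_585_000,
    "Baltimore Orioles": 51_623_333, "Toronto Blue Jays": 50_017_000,
    "Kansas City Royals": 47_609_000, "Detroit Tigers": 46_832_000,
    "Cleveland Indians": 34_319_300, "Tampa Bay Devil Rays": 29_556_667,
}

mean = sum(payrolls.values()) / len(payrolls)                                    # about $70,483,845
variance = sum((p - mean) ** 2 for p in payrolls.values()) / (len(payrolls) - 1)
sd = math.sqrt(variance)                                                         # about $41.8 million

for team, p in payrolls.items():
    print(team, round((p - mean) / sd, 2))                                       # payroll z-scores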
So, let's reprint the table, with the number of standard deviations from the mean listed for each team, along with its ranking in the American League.
|Team |Payroll Salary |Standard Deviations from the Mean (z-score) |American League Ranking (by record) |
|New York Yankees |$ 184,193,950 |2.72 |1 |
|Boston Red Sox |$ 127,298,500 |1.36 |3 |
|Anaheim Angels |$ 100,534,667 |0.72 |Tied for 4 |
|Seattle Mariners |$ 81,515,834 |0.26 |13 |
|Chicago White Sox |$ 65,212,500 |-0.13 |7 |
|Oakland Athletics |$ 59,425,667 |-0.27 |2 |
|Texas Rangers |$ 55,050,417 |-0.37 |6 |
|Minnesota Twins |$ 53,585,000 |-0.40 |Tied for 4 |
|Baltimore Orioles |$ 51,623,333 |-0.45 |9 |
|Toronto Blue Jays |$ 50,017,000 |-0.49 |12 |
|Kansas City Royals |$ 47,609,000 |-0.55 |14 |
|Detroit Tigers |$ 46,832,000 |-0.57 |10 |
|Cleveland Indians |$ 34,319,300 |-0.87 |8 |
|Tampa Bay Devil Rays |$ 29,556,667 |-0.98 |11 |
Below are the true standings of the teams in the American League.
Updated: 9/5/2004 3:37:06 PM
|AMERICAN LEAGUE EAST |
|~~~~~~~~~~~~~~~~~~~~ |
| |
|TEAM WON LOST PCT GB HOME ROAD EAST CENT WEST STREAK |
| |
|NY YANKEES 83 52 .615 - 45-21 38-31 36-19 15-11 22-14 LOST 2 |
|BOSTON 80 54 .597 2 1/2 48-22 32-32 36-20 19-13 16-12 LOST 1 |
|BALTIMORE 63 71 .470 19 1/2 29-35 34-36 28-29 15-12 15-17 WON 6 |
|TAMPA BAY 59 75 .440 23 1/2 36-34 23-41 21-38 13-12 10-22 LOST 7 |
|TORONTO 56 80 .412 27 1/2 34-37 22-43 21-36 13-19 14-15 LOST 2 |
| |
|AMERICAN LEAGUE CENTRAL |
|~~~~~~~~~~~~~~~~~~~~~~~ |
| |
|TEAM WON LOST PCT GB HOME ROAD EAST CENT WEST STREAK |
| |
|MINNESOTA 77 58 .570 - 43-28 34-30 16-11 34-24 16-16 WON 5 |
|CHI WHITE SOX 67 67 .500 9 1/2 38-32 29-35 16-16 29-27 14-14 WON 2 |
|CLEVELAND 67 70 .489 11 40-30 27-40 17-15 26-31 14-16 LOST 4 |
|DETROIT 62 72 .463 14 1/2 32-32 30-40 10-14 28-28 15-21 WON 1 |
|KANSAS CITY 47 87 .351 29 1/2 30-37 17-50 08-19 25-32 08-24 LOST 2 |
| |
|AMERICAN LEAGUE WEST |
|~~~~~~~~~~~~~~~~~~~~ |
| |
|TEAM WON LOST PCT GB HOME ROAD EAST CENT WEST STREAK |
| |
|OAKLAND 81 54 .600 - 45-19 36-35 23-16 25-15 23-15 WON 3 |
|ANAHEIM 77 58 .570 4 38-28 39-30 24-16 25-14 21-17 WON 2 |
|TEXAS 74 60 .552 6 1/2 42-22 32-38 22-17 22-17 20-18 WON 1 |
|SEATTLE 51 84 .378 30 32-34 19-50 11-28 19-21 12-26 LOST 4 |
Example 3: Homeownership in the USA We will calculate the standard deviation for the percent of each state's population that owns a home. Below are the states' percents of homeownership in the United States (including mobile homes) in 2002. Source: US Bureau of the Census.
|Alabama |73.5 |
|Alaska |67.3 |
|Arizona |65.9 |
|Arkansas |70.2 |
|California |58.0 |
|Colorado |69.1 |
|Connecticut |71.6 |
|Delaware |75.6 |
|District of Columbia |44.1 |
|Florida |68.7 |
|Georgia |71.7 |
|Hawaii |57.4 |
|Idaho |73.0 |
|Illinois |70.2 |
|Indiana |75.0 |
|Iowa |73.9 |
|Kansas |70.2 |
|Kentucky |73.5 |
|Louisiana |67.1 |
|Maine |73.9 |
|Maryland |72.0 |
|Massachusetts |62.7 |
|Michigan |76.0 |
|Minnesota |77.3 |
|Mississippi |74.8 |
|Missouri |74.6 |
|Montana |69.3 |
|Nebraska |68.4 |
|Nevada |65.5 |
|New Hampshire |69.5 |
|New Jersey |67.2 |
|New Mexico |70.3 |
|New York |55.0 |
|North Carolina |70.0 |
|North Dakota |69.5 |
|Ohio |72.0 |
|Oklahoma |69.4 |
|Oregon |66.2 |
|Pennsylvania |74.0 |
|Rhode Island |59.6 |
|South Carolina |77.3 |
|South Dakota |71.5 |
|Tennessee |70.1 |
|Texas |63.8 |
|Utah |72.7 |
|Vermont |70.2 |
|Virginia |74.3 |
|Washington |67.0 |
|West Virginia |77.0 |
|Wisconsin |72.0 |
|Wyoming |72.8 |
Like before, we have a rather large sample. Let's begin with what we know: we will find the mean, median, mode and range, first ranking the states in ascending order.
|DC |44.1 |- |
|New York |55.0 |50 |
|Hawaii |57.4 |49 |
|California |58.0 |48 |
|Rhode Island |59.6 |47 |
|Massachusetts |62.7 |46 |
|Texas |63.8 |45 |
|Nevada |65.5 |44 |
|Arizona |65.9 |43 |
|Oregon |66.2 |42 |
|Washington |67.0 |41 |
|Louisiana |67.1 |40 |
|New Jersey |67.2 |39 |
|Alaska |67.3 |38 |
|Nebraska |68.4 |37 |
|Florida |68.7 |36 |
|Colorado |69.1 |35 |
|Montana |69.3 |34 |
|Oklahoma |69.4 |33 |
|New Hampshire |69.5 |31 |
|North Dakota |69.5 |31 |
|North Carolina |70.0 |30 |
|Tennessee |70.1 |29 |
|Arkansas |70.2 |25 |
|Illinois |70.2 |25 |
|Kansas |70.2 |25 |
|Vermont |70.2 |25 |
|New Mexico |70.3 |24 |
|South Dakota |71.5 |23 |
|Connecticut |71.6 |22 |
|Georgia |71.7 |21 |
|Maryland |72.0 |18 |
|Ohio |72.0 |18 |
|Wisconsin |72.0 |18 |
|Utah |72.7 |17 |
|Wyoming |72.8 |16 |
|Idaho |73.0 |15 |
|Alabama |73.5 |13 |
|Kentucky |73.5 |13 |
|Iowa |73.9 |11 |
|Maine |73.9 |11 |
|Pennsylvania |74.0 |10 |
|Virginia |74.3 |9 |
|Missouri |74.6 |8 |
|Mississippi |74.8 |7 |
|Indiana |75.0 |6 |
|Delaware |75.6 |5 |
|Michigan |76.0 |4 |
|West Virginia |77.0 |3 |
|Minnesota |77.3 |1 |
|South Carolina |77.3 |1 |
|Mean |69.4 |
|Mode |70.2 |
|Median |70.2 |
Let's begin to interpret the data. First, we notice the mode and median are the same, and the mean (average) is below them. The next question, "is the mean significantly below the other two?", may be partially answered by looking at the range for comparison's sake. The range, 33.2 (77.3 - 44.1), appears rather large, so by comparison, 69.4 versus 70.2 appears to be no big deal. Let's add to our repertoire the ability to examine the dispersion of the data. For a large set of data, we do not want to obsess over each individual data point. We want to see if the data follows noticeable trends and then interpret any outliers that may appear.
Let's observe a histogram, made from the data in ascending order. Notice we have so much data that we cannot label every state (only a few are labeled for reference) or make out the individual bar for each state.
[pic]
Notice how the District of Columbia is an outlier: its percent of 44.1 lies far from the center of the data. And, in a manner of speaking, NY, HI, CA and RI, with respective percents of homeownership of 55, 57.4, 58 and 59.6, all seem well below the rest of the states. This is what we mean by dispersion. We need to quantify how tightly grouped the data is, because we need to know where to draw the line between those states that are significantly below the mean and those that are significantly above it. This can be done with what we have identified as the standard deviation: the measure of how the data deviates from the middle of the data, its central tendency. In a perfect world, the mean is in the center of the data, thus it is the median too. And the mode, if we get greedy. The standard deviation, loosely speaking, measures how the data deviates from the mean, and remember, the mean is in the center of the data.
The standard deviation here is about 6.2. In the table below, each state's percent is converted to a score, (percent - 70.2) / 6.2, telling how many standard deviations the state lies from the central value of 70.2 (the median and mode; recall the mean is 69.4).
|State |Standard deviations from center |Percent |
|DC |-4.21 |44.1 |
|New York |-2.45 |55.0 |
|Hawaii |-2.06 |57.4 |
|California |-1.97 |58.0 |
|Rhode Island |-1.71 |59.6 |
|Massachusetts |-1.21 |62.7 |
|Texas |-1.03 |63.8 |
|Nevada |-0.76 |65.5 |
|Arizona |-0.69 |65.9 |
|Oregon |-0.65 |66.2 |
|Washington |-0.52 |67.0 |
|Louisiana |-0.5 |67.1 |
|New Jersey |-0.48 |67.2 |
|Alaska |-0.47 |67.3 |
|Nebraska |-0.29 |68.4 |
|Florida |-0.24 |68.7 |
|Colorado |-0.18 |69.1 |
|Montana |-0.15 |69.3 |
|Oklahoma |-0.13 |69.4 |
|New Hampshire |-0.11 |69.5 |
|North Dakota |-0.11 |69.5 |
|North Carolina |-0.03 |70.0 |
|Tennessee |-0.02 |70.1 |
|Arkansas |0 |70.2 |
|Illinois |0 |70.2 |
|Kansas |0 |70.2 |
|Vermont |0 |70.2 |
|New Mexico |0.02 |70.3 |
|South Dakota |0.21 |71.5 |
|Connecticut |0.23 |71.6 |
|Georgia |0.24 |71.7 |
|Maryland |0.29 |72.0 |
|Ohio |0.29 |72.0 |
|Wisconsin |0.29 |72.0 |
|Utah |0.4 |72.7 |
|Wyoming |0.42 |72.8 |
|Idaho |0.45 |73.0 |
|Alabama |0.53 |73.5 |
|Kentucky |0.53 |73.5 |
|Iowa |0.6 |73.9 |
|Maine |0.6 |73.9 |
|Pennsylvania |0.61 |74.0 |
|Virginia |0.66 |74.3 |
|Missouri |0.71 |74.6 |
|Mississippi |0.74 |74.8 |
|Indiana |0.77 |75.0 |
|Delaware |0.87 |75.6 |
|Michigan |0.94 |76.0 |
|West Virginia |1.1 |77.0 |
|Minnesota |1.15 |77.3 |
|South Carolina |1.15 |77.3 |
|Median and mode | |70.2 |
|Standard deviation |6.2 | |
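For anyone who wants to reproduce the summary numbers from the raw percentages, here is a rough Python sketch using Python's statistics module (sample standard deviation, dividing by one less than the number of values); flagging values more than two standard deviations from the mean picks out the low outliers discussed above.

import statistics

percents = [44.1, 55.0, 57.4, 58.0, 59.6, 62.7, 63.8, 65.5, 65.9, 66.2,
            67.0, 67.1, 67.2, 67.3, 68.4, 68.7, 69.1, 69.3, 69.4, 69.5,
            69.5, 70.0, 70.1, 70.2, 70.2, 70.2, 70.2, 70.3, 71.5, 71.6,
            71.7, 72.0, 72.0, 72.0, 72.7, 72.8, 73.0, 73.5, 73.5, 73.9,
            73.9, 74.0, 74.3, 74.6, 74.8, 75.0, 75.6, 76.0, 77.0, 77.3, 77.3]

mean = statistics.mean(percents)      # about 69.4
sd = statistics.stdev(percents)       # about 6.2
print(round(mean, 1), round(sd, 1))

# flag the percentages more than 2 standard deviations from the mean
print([p for p in percents if abs(p - mean) > 2 * sd])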
Why use standard deviation? The standard deviation can also help you evaluate the worth of all the so-called "studies" that seem to be released to the press every day. Standard deviation is commonly used in business as a measure to describe the risk of a security or portfolio of securities. If you read the history of an investment's performance, chances are that standard deviation will be used to gauge risk. The same is true for academic studies that determine the validity of exam results, or the effectiveness of educational tools. The standard deviation is also one of the most commonly used statistical tools in the sciences and social sciences. It provides a precise measure of the amount of variation in any group of numbers, be it the returns on a mutual fund, the yearly rainfall in Mexico City, or the hits per game for a major league baseball player.
Problem Set:
For problems 1 to 8, use the 2003 Final Standings of the NFL teams, as indicated below:
2003 NFL Standings. W = wins, L = losses, T = ties, % = percentage of games won, PF = Points For, that is, points the team scored, PA = Points Against, that is, points the team allowed.
AFC East W L T % PF PA
New England Patriots 14 2 0 .875 348 238
Miami Dolphins 10 6 0 .625 311 261
Buffalo Bills 6 10 0 .375 243 279
New York Jets 6 10 0 .375 283 299
NFC East W L T % PF PA
Philadelphia Eagles 12 4 0 .750 374 287
Dallas Cowboys 10 6 0 .625 289 260
Washington Redskins 5 11 0 .313 287 372
New York Giants 4 12 0 .250 243 387
AFC North W L T % PF PA
Baltimore Ravens 10 6 0 .625 391 281
Cincinnati Bengals 8 8 0 .500 346 384
Pittsburgh Steelers 6 10 0 .375 300 327
Cleveland Browns 5 11 0 .313 254 322
NFC North W L T % PF PA
Green Bay Packers 10 6 0 .625 442 307
Minnesota Vikings 9 7 0 .563 416 353
Chicago Bears 7 9 0 .438 283 346
Detroit Lions 5 11 0 .313 270 379
AFC South W L T % PF PA
Indianapolis Colts 12 4 0 .750 447 336
Tennessee Titans 12 4 0 .750 435 324
Jacksonville Jaguars 5 11 0 .313 276 331
Houston Texans 5 11 0 .313 255 380
NFC South W L T % PF PA
Carolina Panthers 11 5 0 .688 325 304
New Orleans Saints 8 8 0 .500 340 326
Tampa Bay Buccaneers 7 9 0 .438 301 264
Atlanta Falcons 5 11 0 .313 299 422
AFC West W L T % PF PA
Kansas City Chiefs 13 3 0 .813 438 332
Denver Broncos 10 6 0 .625 381 301
Oakland Raiders 4 12 0 .250 270 379
San Diego Chargers 4 12 0 .250 313 441
NFC West W L T % PF PA
St. Louis Rams 12 4 0 .750 447 328
Seattle Seahawks 10 6 0 .625 404 327
San Francisco 49ers 7 9 0 .438 384 337
Arizona Cardinals 4 12 0 .250 225 452
1. For the NFC teams, find the standard deviation for the number of wins and then find the z-score for each team.
2. For the AFC teams, find the standard deviation for the number of wins and then find the z-score for each team.
3. For all the teams, find the standard deviation for the number of wins and then find the z-score for each team.
4. For the NFC teams, find the standard deviation for PF and then find the z-score for each team.
5. For the AFC teams, find the standard deviation for PF and then find the z-score for each team.
6. For the NFC teams, find the standard deviation for PA and then find the z-score for each team.
7. For the AFC teams, find the standard deviation for PA and then find the z-score for each team.
8. Look carefully at questions 1 to 7. Which is a better predictor of a team's success: the offense, as indicated by the points the team scored (PF), or the defense, as indicated by the points the team allowed (PA)? Why?
9. According to the 2005 World Almanac for Kids, below are the 25 largest countries in the world in mid-2004, in no particular order. Find the mean, median and the standard deviation of their populations.
1,294,629,555 China
1,065,070,607 India
293,027,571 United States
238,452,952 Indonesia
184,101,109 Brazil
153,705,278 Pakistan
144,112,353 Russia
141,340,476 Bangladesh
137,253,133 Nigeria
127,333,002 Japan
104,959,594 Mexico
86,241,697 Philippines
82,689,518 Vietnam
82,424,609 Germany
76,117,421 Egypt
69,018,294 Iran
68,893,918 Turkey
67,851,281 Ethiopia
64,865,523 Thailand
60,424,213 France
60,270,708 Great Britain
58,317,930 Dem. Rep. of the Congo
58,057,477 Italy
48,598,175 South Korea
47,732,079 Ukraine
10. According to the 2005 World Almanac for Kids, below are the ten largest cities in the world in 2004, in no particular order, each followed by its population.
Tokyo, Japan 34,450,000; Kolkata (Calcutta), India 13,058,000; Mexico City, Mexico 18,066,000; Shanghai, China 12,887,000; New York City, U.S. 17,846,000; Buenos Aires, Argentina 12,583,000; São Paulo, Brazil 17,099,000; Delhi, India 12,441,000; Mumbai (Bombay), India 16,086,000; Los Angeles, U.S. 11,814,000. Find the mean and the standard deviation.
11. According to the 2005 World Almanac for Kids, below are the American League pennant winners since 1970, with the year they won preceding the name and their won-lost record following it. Remove the shortened strike season of 1981 and the year in which there was no World Series, and find the mean and the standard deviation for wins.
1970 Baltimore 108 54, 1971 Baltimore 101 57, 1972 Oakland 93 62, 1973 Oakland 94 68, 1974 Oakland 90 72, 1975 Boston 95 65, 1976 New York 97 62, 1977 New York 100 62, 1978 New York 100 63, 1979 Baltimore 102 57, 1980 Kansas City 97 65, 1981 New York 59 48, 1982 Milwaukee 95 67, 1983 Baltimore 98 64, 1984 Detroit 104 58, 1985 Kansas City 91 71, 1986 Boston 95 66, 1987 Minnesota 85 77, 1988 Oakland 104 58, 1989 Oakland 99 63, 1990 Oakland 103 59, 1991 Minnesota 95 67, 1992 Toronto 96 66, 1993 Toronto 95 67, 1994 none, 1995 Cleveland 100 44, 1996 New York 92 70, 1997 Cleveland 86 75, 1998 New York 114 48, 1999 New York 98 64, 2000 New York 87 74, 2001 New York 95 65, 2002 Anaheim 99 63, 2003 New York 101 61
12. Look up the ages of the presidents of the United States when they took office. Find the mean and standard deviation of the presidents' ages. Then repeat the process, separating the presidents who were inaugurated before the Civil War from those inaugurated after it. What do you notice when you compare the pre- and post-Civil War presidents' ages?
For questions 13 to 19: M&M's project - Some years come and go, but other years live in the hearts and minds of men and women for all eternity. Such was the year 1941, when Pearl Harbor was attacked, Joe DiMaggio hit safely in 56 straight games and M&M's were first introduced to the public. Daughters everywhere love M&M's; in particular, some love the blue pieces the most. The original M&M's had violet candies and no blue ones in 1941. Then, in 1949, tan replaced violet, and in 1995, tan was replaced by blue. M&M's were made round by taking milk chocolate centers and tumbling them to get their smooth rounded shape. We all know M&M stands for Mars and Murrie and that the different color M&M's taste the same. According to the M&M's website:
• M&M’s Milk Chocolate candies are 30 % brown, 20 % each of yellow and red, and 10 % each of orange, green and blue
• M&M’s Peanut Chocolate candies are 20 % each of brown, yellow, red and blue, and 10 % each of green and orange
• M&M's Peanut Butter and Almond Chocolate candies are 20 % each of brown, red, yellow, green and blue
• M&M’s Crispy Chocolate candies are 16.6 % each of brown, red, yellow, green, orange and blue.
Let’s perform our own test and see if our observation of the percent of each color matches the website’s prediction.
13. Buy a one-pound bag of M&M's Milk Chocolate candies for each student in your class. As a class, for each bag, tally up the number of each color M&M. Find the percent of each color for each bag.
14. Then tally up the colors for all the bags, and find the percent of each color for the classroom sample, which consists of all the bags.
15. Using each bag as an individual trial, find the mean, median and mode for each color. Then find the percent of colors based on these findings.
16. How do the results in problems 14 and 15 compare? How do the results compare to the M&M's website statistics?
17. Repeat the experiment for Peanut Chocolate, Peanut Butter and Almond Chocolate, and Crispy Chocolate.
18. Answer this question - how can you run standard deviations in this experiment to help you analyze your findings so that you may decide on the reliability of the data on the M&M’s website?
19. Run those standard deviations to determine the reliability of the data on the M&M’s website.
For problems 20 to 22, refer to Example 3 from the text.
20. Compute the standard deviation for the data from Example 3.
21. Determine which states are the friendliest to homeownership and which are the least friendly.
22. Is there a cause and effect relationship that you can argue to explain why these states are at either end of this analysis?
23. Barry Bonds or Babe Ruth. Who was the greatest baseball player of all time? To argue your point, quote statistics. Research their batting average and compare it to the batting averages of their peers at the time. How many standard deviations from the mean were their batting averages? Do the same for home runs, RBI’s and on base percentage. Factor in that Barry Bonds played in night games and that Babe Ruth won 20 games as a pitcher. Best of luck…