Teachoutcoc.org



Math 140 Notes and Activity Packet (Word)
Exploratory Data Analysis (EDA), Correlation & Regression

Go over EDA Notes 1 (PDF online only) before doing EDA Act 1

Math 140 – EDA Activity 1
Using Graphs to Explore Shape, Center, Averages and Outliers

1. Open the men and women’s health data on the website. Look under the “Data Sets” tab.
2. Look at the following columns of data for both men and women: age, height, weight, pulse and body mass index. For each column, use StatCrunch to create a histogram, dot plot and box plot and find the mean, median, mode, minimum and maximum values. Then answer the following questions. Save your graphs, sample statistics and answers in a Word document on your flash drive.
a) What are the values trying to measure? What are the units?
b) Look at the graphs and determine the shape of the data set.
c) Look at the graphs and estimate where you think the center or average should be for the data set. Which is the most accurate average: the mean, median or mode? What is the average?
d) Use the box plot to determine if there are any outliers in the data set. Remember, these are values that look far away from most of the data. Do you think these outliers were mistakes in collecting the data or just “unusual” individual values?

NOTE: To copy your graphs in StatCrunch and save them in Word, follow these instructions.
Click on Options
Click on Copy
Right click and copy the new window
Open a Word document
Press Control/Alt V and click on “Device independent Bitmap”
Save the Word document on your flash drive

Go over EDA Notes 2 (PDF online only) before doing EDA Act 2

Math 140 – EDA Activity 2
Shape, Center, Spread and Typical Values

1. Open the men and women’s health data on the website. Look under the “Data Sets” tab.
2. Look at the following columns of data for both men and women: age, height, weight, pulse and body mass index. For each column, use StatCrunch to create a histogram and find the mean, median, IQR, standard deviation, Q1 and Q3.
Then answer the following questions. Save your graphs, sample statistics and answers in a Word document on your flash drive.
a) Look at the histogram and determine the shape of the data set. (You do not need to save the graph.)
b) Based on the shape, which is the most accurate measure of center, the mean or median? What is the average?
c) Based on the shape, which is the most accurate measure of typical spread, the standard deviation or IQR?
d) Use the most accurate center and spread to find two numbers that typical values in the data set will fall in between. (Use the mean + or – the standard deviation if bell shaped. Use Q1 and Q3 if skewed.)

Go over EDA Notes 3 (PDF online only) before doing EDA Act 3&4

Math 140 EDA – Activity#3
Exploratory Data Analysis Paragraph

Directions: Use the health data on my website to write a data analysis paragraph for each of the following data sets for both men and women: age, height, weight, pulse and body mass index. Use StatCrunch to make a histogram and boxplot and to find the mean, median, min, max, IQR, Q1, Q3, and standard deviation in order to analyze the shape and outliers. You do not need to save your histogram and boxplot. Write a data analysis paragraph describing the shape of the data set, as well as outliers, measures of center and measures of spread. Also include the size of the data set and the smallest and largest values in the data set. Be sure to state which is the best measure of center and spread, and write a sentence interpreting the meaning of the center and spread in the context of the data set. Give the most accurate average for the data set and also give two numbers that typical values fall in between. Also analyze the outliers. Should the outliers be included in the data set or not, and what was their effect on the shape, center and spread?
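The typical-value rule used in these activities (mean plus or minus one standard deviation for bell-shaped data, Q1 to Q3 for skewed data) can be sketched in Python. This is only an illustration: the function name `typical_range` and the pulse values are made up, and Python's quartile method may give slightly different Q1 and Q3 values than StatCrunch does.

```python
import statistics


def typical_range(values, bell_shaped):
    """Return (low, high): two numbers that typical values fall between.

    bell_shaped=True  -> mean minus / plus one sample standard deviation
    bell_shaped=False -> first and third quartiles (Q1, Q3)
    """
    if bell_shaped:
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation
        return (mean - sd, mean + sd)
    # quantiles(n=4) returns the three cut points Q1, median, Q3
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return (q1, q3)


# Hypothetical pulse rates (beats per minute), invented for this sketch
pulses = [60, 64, 68, 68, 72, 72, 76, 80, 84, 88]

low, high = typical_range(pulses, bell_shaped=True)
print(f"Typical pulse rates fall between {low:.1f} and {high:.1f} bpm")
```

You still have to judge the shape yourself from the histogram; the code only applies the rule once you have decided whether the data are bell shaped or skewed.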
Write a data analysis paragraph for women’s ages.
Write a data analysis paragraph for men’s ages.
Write a data analysis paragraph for women’s height.
Write a data analysis paragraph for men’s height.
Write a data analysis paragraph for women’s weight.
Write a data analysis paragraph for men’s weight.
Write a data analysis paragraph for women’s pulse rates.
Write a data analysis paragraph for men’s pulse rates.
Write a data analysis paragraph for women’s body mass index.
Write a data analysis paragraph for men’s body mass index.

Math 140 EDA – Activity#4
Classifying Summary Statistics

1. For each of the following sample statistics, classify it as a measure of spread (variability), a measure of center (average), or a measure of position. Then write a sentence describing what the statistic tells us.
a) Mean
b) Standard Deviation
c) Minimum
d) Range
e) Median
f) Quartile 3 (Q3)
g) Interquartile Range (IQR)
h) Maximum
i) Quartile 1 (Q1)
j) Mode
k) Variance
2. List all the measures of center. Which is the most accurate for bell shaped (normal) data sets? Which is the most accurate for skewed data sets?
3. List all the measures of spread. Which is the most accurate for bell shaped (normal) data sets? Which is the most accurate for skewed data sets?
4. List all the measures of position.
5. A very important statistic that is not a center, spread or position is the frequency or sample size. Write a sentence describing the meaning of the sample size.
6. Use StatCrunch and the bear data to find all of the summary statistics we discussed for the bears’ weights. You need to give the name of the statistic, the number and the units.
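For readers who want to check StatCrunch output by hand, the summary statistics named in this activity can also be computed with Python's standard library. This is a rough sketch: the function `summarize` and the weights are invented for illustration and are not the class bear data. The grouping into center, spread and position is one common convention; the activity asks you to justify the classification yourself.

```python
import statistics


def summarize(values):
    """Compute the summary statistics listed in the activity, grouped by type."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # Q1, median, Q3
    return {
        "center": {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "mode": statistics.multimode(values),  # list: data can have ties
        },
        "spread": {
            "standard deviation": statistics.stdev(values),  # sample version
            "variance": statistics.variance(values),         # sample version
            "range": max(values) - min(values),
            "IQR": q3 - q1,
        },
        "position": {
            "minimum": min(values),
            "Q1": q1,
            "Q3": q3,
            "maximum": max(values),
        },
        "sample size": len(values),
    }


# Hypothetical weights in pounds, invented for this sketch
weights = [80, 150, 150, 200, 260, 344, 416]

summary = summarize(weights)
for group, stats in summary.items():
    print(group, stats)
```

Quartile conventions differ between software packages, so Q1, Q3 and the IQR from this sketch may not match StatCrunch exactly on the same data.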
EDA review problems are on the Sampling/Experiment/EDA review sheet located on the Sampling and Experiments page.

Go over Correlation Notes (PDF online only), Regression Notes (below in Word) and Residual Notes (PDF online only) before doing Regression Act 1 & 2

Correlation and Regression Notes

Relationship Hypothesis Tests

Categorical / Categorical Relationship (Chi-Squared Independence Test)
Ho: Categorical variables are independent (show the distributions of conditional probabilities are the same)
Ha: Categorical variables are dependent (show the distributions of conditional probabilities are different)

Categorical / Quantitative Relationship (ANOVA)
Ho: The categorical variable and the quantitative variable are independent (not related)
Ha: The categorical variable and the quantitative variable are dependent (related)

Quantitative / Quantitative Relationship (Correlation Hypothesis Test) “Regression”

Correlation: See if there is a linear relationship between two quantitative variables. The study of that relationship is often called “Correlation and Regression”.
Scatterplot: a graph for seeing visually whether or not there is a correlation.

I. Choosing your variables
Choose which variable will be x (explanatory variable or independent variable) and which variable will be y (response variable or dependent variable).
Is one of the variables a natural response variable?
Ex) Year (time) and unemployment rates in the U.S. Let the explanatory variable x be time (years) and let the response variable y be the unemployment rate. Unemployment responds to time, but not the other way around.
If the variables respond to each other, pick the response variable to be the one you are most interested in or may want to make predictions about.
Ex) The unemployment rate in the U.S. and the national debt in the U.S. If you are studying the national debt and factors that may be related to it, then you should make the national debt your response variable y (which means the unemployment rate would be the explanatory variable x).

II. Graphing your data (Scatterplot)
Make ordered pairs (x, y) from your x and y data and create a scatterplot.
StatCrunch: Graph → Scatterplot → pick columns for x and y → Compute
Correlation Study: see how well ordered-pair quantitative data fit a line (the regression line).
Correlation Coefficient (r): a number between -1 and +1 that measures the strength and direction of correlation. (Always look at the scatterplot together with the r value. Do not just look at the r value.)
r close to +1 (r = +0.893): Strong positive correlation (line going up from left to right (positive slope) and the points in the scatterplot are close to the line)
r close to -1 (r = -0.916): Strong negative correlation (line going down from left to right (negative slope) and the points in the scatterplot are close to the line)
r close to 0 (+0.037 or -0.009): No linear correlation (but there could still be a nonlinear relationship)

III. R-Squared (squaring the correlation coefficient r)
R-squared: Percentage of variability in y (response) that can be explained by the linear relationship with x (explanatory).

Example 1: Rainfall (inches) versus number of car accidents.
Explanatory (x): rain
Response (y): car accidents
r = +0.51 (There is a moderate positive correlation between rain and car accidents.)
Note!!! Correlation is not Causation!! We can NOT say that rain CAUSES car accidents.
R-squared = 0.51^2 = 0.2601 or 26.01%
About 26% of the variability in car accidents can be explained by the relationship with rain.
Confounding variables? Age, condition of car, road conditions, experience, texting, drinking, drugs.

IV. Standard Deviation of the residual errors (two meanings: average distance from line & prediction error)
The average distance that points are from the regression line.
If we use the regression line to make a prediction, the standard deviation of the residuals tells us how much average error we can expect in that prediction.
Residual: How far a point is above or below the regression line.

Example 2: x: nicotine, y: tar
Simple linear regression results:
Dependent Variable: tar (mg)
Independent Variable: nicotine (mg)
tar (mg) = -1.2713139 + 14.207623 nicotine (mg)
Sample size: 29
R (correlation coefficient) = 0.96136857
R-sq = 0.92422952
Estimate of error standard deviation: 1.2983543

r = 0.961: There is a strong positive correlation between nicotine and tar.
r-squared = 0.924 (92.4%): 92.4% of the variability in y (tar) can be explained by the relationship with x (nicotine).
Are there any confounding variables that might influence the amount of tar? Carbon monoxide, company, cost of cigarette.
** Important Note: “Correlation ≠ Causation”
Just because there is a relationship (correlation), that does not give you the right to say that x “causes” y to happen.
Standard deviation of residual errors = 1.298 ≈ 1.3 mg of tar (same units as y)
The points in the scatterplot are about 1.3 mg from the regression line on average. If we try to predict tar from nicotine, we could have an average error of 1.3 mg.

Regression line: y = A + Bx (OLI book)
A is the y-intercept (where the line crosses the y axis), the starting value
B is the slope (average rate of change)
Regression line: x: nicotine, y: tar
Y = -1.2713139 + 14.207623 X
Slope = 14.2. What does that mean? On average, the amount of tar increases by about 14.2 mg for every additional 1 mg of nicotine.
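To see where numbers like r, R-squared, the regression line and the standard deviation of the residuals come from, here is a minimal Python sketch using the usual least-squares formulas. The function name and the five data points are invented for illustration (they are not the class cigarette data), and the residual standard deviation is assumed to divide by n - 2, the usual convention for simple linear regression.

```python
import math


def simple_linear_regression(xs, ys):
    """Least-squares fit of y = A + B*x, plus r, r-squared and residual SD."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # B: average rate of change
    intercept = mean_y - slope * mean_x    # A: value of y when x = 0
    r = sxy / math.sqrt(sxx * syy)         # correlation coefficient

    # Residual = actual y minus the y predicted by the line
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Assumption: divide by n - 2 for the residual standard deviation
    residual_sd = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

    return {
        "intercept": intercept,
        "slope": slope,
        "r": r,
        "r_squared": r ** 2,
        "residual_sd": residual_sd,
    }


# Hypothetical nicotine (mg) and tar (mg) values, invented for this sketch
nicotine = [0.4, 0.6, 0.8, 1.0, 1.2]
tar = [4, 8, 10, 13, 16]

fit = simple_linear_regression(nicotine, tar)
print(f"tar = {fit['intercept']:.2f} + {fit['slope']:.2f} * nicotine")
print(f"r = {fit['r']:.3f}, r-squared = {fit['r_squared']:.3f}")
print(f"standard deviation of residuals = {fit['residual_sd']:.2f} mg")
```

As the notes stress, always look at the scatterplot alongside these numbers, since r only measures linear association.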
Regression Act 1 – abcd only

Example 3: Health Data
StatCrunch Directions: Stat → Regression → Simple Linear → pick x and y columns → Compute
Is a woman’s age related to her diastolic blood pressure?
Pick x and y (blood pressure responds to age, but age does not respond to blood pressure)
X: (explanatory or independent variable) Woman’s Age
Y: (response or dependent variable) Diastolic Blood Pressure

StatCrunch Printout
Women Diast BP = 47.699875 + 0.59368323 Women Age (years)
Sample size: 40
R (correlation coefficient) = 0.63594161
R-sq = 0.40442173
Estimate of error standard deviation: 9.0898362

a. The scatterplot and r-value show a moderate positive correlation. (r = 0.636)
b. r-squared = 0.404 = 40.4%
So 40.4% of the variability in a woman’s diastolic blood pressure can be explained by the relationship with the woman’s age.
Confounding variables (that influence BP)? Race, ethnicity, stress, genetics, diet.
c. Standard deviation of residual errors = 9.1 blood pressure units
Two meanings: Points in the scatterplot are 9.1 units away from the regression line on average. If we use the regression line to make a prediction (guess a woman’s diastolic blood pressure), we could have an average error of 9.1 blood pressure units.
Meaning of the slope of the regression line? 0.59 (rate of change between x and y)
On average, women’s diastolic blood pressure increases by about 0.59 units for each additional year of age.

(Example 2 continued)
Use the regression line to predict how much tar a cigarette will have if the cigarette has 1.5 mg of nicotine.
Y = -1.2713139 + 14.207623 (1.5) = 20.04 mg of tar!!
How much error is in this prediction? The standard deviation of the residual errors is the prediction error. (off by 1.3 mg on average)
The histogram of the residuals looks nearly normal (only a slight skew).
The residual plot shows a fan (V) shape, so the points are not evenly spread out. (Fails the homoscedasticity requirement.)

Math 140 Regression Activity#1
Correlation and Regression with StatCrunch

Directions: Your goal today is to explore relationships between quantitative variables using StatCrunch.
For each of the following data sets, decide which variable should be the explanatory variable and which should be the response variable. Go to the “Stat” menu, and click on “Regression”, then “Simple Linear”. Put in the columns for the explanatory (x) and the response (y). Click on Fitted line plot, Residuals versus x variable, and Histogram of the residuals. Save the scatterplot, the residuals versus x variable plot, and the histogram of the residuals in a Word document along with the r value, r-squared, standard deviation of the residuals, and the equation of the regression line.
a) Use the r value and scatterplot to interpret the strength and direction of the linear relationship. Do the variables have a weak, moderate or strong linear relationship (correlation), a non-linear (curved) relationship, or no relationship at all?
b) Write a sentence to explain the r-squared value. Were there any confounding variables that might influence the response variable other than the explanatory variable?
c) Write two sentences to explain the standard deviation of the residuals.
d) Give the regression line formula and write a sentence to explain the slope.
e) A residual is the vertical distance each point in the scatterplot is from the regression line. Look at the histogram of the residuals. We like the histogram to be nearly normal (close to bell shaped). What is the shape of the histogram of the residuals?
f) The plot of the residuals versus the x variable is often called a “residual plot”. It is often difficult to see how far points are from the regression line. Think of the residual plot as putting the points in the scatterplot under a magnifying glass. You can see the distances better. The residual plot should show dots that are evenly spaced from the line. This is called “Homoscedasticity”. It should not be “fan” shaped. If it is fan shaped, it fails the homoscedasticity requirement. Is the residual plot fan shaped or evenly spaced?

1. Open the health data.
Explore the relationship between a man’s age and cholesterol.
2. Open the health data. Explore the relationship between a man’s height and weight.
3. Open the health data. Explore the relationship between a man’s weight and BMI.
4. Open the health data. Explore the relationship between a man’s systolic blood pressure and diastolic blood pressure.
5. Open the bear data. Explore the relationship between age and weight.
6. Open the bear data. Explore the relationship between chest size and neck size.
7. Open the bear data. Explore the relationship between skull length and skull width.

Math 140 Regression Activity#2
Using Regression Lines to Make Predictions

1. Open the cigarette data. Let the explanatory variable represent the amount of nicotine and the response variable represent the amount of tar. Find the equation of the regression line and the standard error. Use the equation to predict the amount of tar if a cigarette contains 1.2 mg of nicotine. How much error might there be in that prediction?
2. Open the cigarette data. Let the explanatory variable represent the amount of nicotine and the response variable represent the amount of carbon monoxide (CO). Find the equation of the regression line and the standard error. Use the equation to predict the amount of CO if a cigarette contains 1.2 mg of nicotine. How much error might there be in that prediction?
3. Open the women’s health data. Let the explanatory variable represent the systolic blood pressure and the response variable represent the diastolic blood pressure. Find the equation of the regression line and the standard error. Use the equation to predict the diastolic blood pressure of a person who has a systolic blood pressure of 130. How much error might there be in that prediction?
4. Open the bear data. Let the explanatory variable represent the age of the bear in months and the response variable represent the length of the bear in inches.
Find the equation of the regression line and the standard error. Use the equation to predict the length of a bear that is 24 months old. How much error might there be in that prediction?
5. Open the bear data. Let the explanatory variable represent the neck circumference of the bear and the response variable represent the weight of the bear in pounds. Find the equation of the regression line and the standard error. Use the equation to predict the weight of a bear that has a neck circumference of 24 inches. How much error might there be in that prediction?
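Making a prediction from a regression line is just plugging an x value into y = A + Bx and then reporting the standard deviation of the residuals as the typical error. As a small illustration, the Python sketch below reuses the nicotine and tar printout from Example 2 in the notes (the intercept, slope and error standard deviation are copied from that printout; the function name `predict` is made up here).

```python
def predict(intercept, slope, x):
    """Plug x into the regression line y = A + B*x."""
    return intercept + slope * x


# Values copied from the Example 2 StatCrunch printout (tar vs. nicotine)
INTERCEPT = -1.2713139    # A
SLOPE = 14.207623         # B
RESIDUAL_SD = 1.2983543   # "Estimate of error standard deviation"

nicotine_mg = 1.5
predicted_tar = predict(INTERCEPT, SLOPE, nicotine_mg)

print(f"Predicted tar: {predicted_tar:.2f} mg, "
      f"give or take about {RESIDUAL_SD:.1f} mg")
# prints: Predicted tar: 20.04 mg, give or take about 1.3 mg
```

The same two steps apply to each problem in this activity: take the intercept, slope and error standard deviation from your own StatCrunch printout, and only predict for x values that fall within the range of the data.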