CORRELATION



INVESTIGATE BIVARIATE MEASUREMENT DATA

Achievement standard: 91581 (3.9) Internal (4 credits)

From a given multivariate data set, I can:

Write a purpose statement which includes

o An appropriate relationship question with sensible variables from the given measurement data. The question should be in context to the purpose of the investigation (this purpose could be research driven)

o Identify the explanatory and response variables and justify.

Identify features in the data

1. Trend (linear/non-linear)

2. Association (positive or negative). As____increases so does_____.

3. Look for groupings or other obvious features.

4. Look for outliers (unusual points).

Find an appropriate model/equation

5. Describe the strength of the relationship, and relate this to the context by discussing visual aspects of scatter about the regression line.

For higher level reflection

I. Consider alternative models, if appropriate

II. Consider improving the model by removal of outliers (as long as this is justifiable) and repeating the analysis

III. Analyse separate subsets

IV. Take account of the number of data points

V. Consider contextual reasons for features. Use phrases like “I notice that…, I think that…, I wonder if…”

VI. Discuss relevance to a wider population

VII. Acknowledge other factors which could influence the response variable (causation).

Use the model to make a prediction in context, with sensible rounding

o Justify choice of the variable to use for predictions (This can be done if different investigations use the same response variable but different explanatory variables)

o Discuss relevance to a wider population

o Discuss precision of predictions (how well the model fits the raw data across all points)

Communicate findings in a conclusion

o Answer the question with reference to context or purpose of investigation

o State any assumptions and discuss the effect on the validity of the analysis.

The Statistical Enquiry Cycle will be used to investigate the given data.

Each component of the cycle needs to be communicated referring to visual aspects and contextual knowledge.

[pic]

Problem and Plan

Write a purpose statement

1. Research the context and investigate the precise meaning of the variables. (You may find some related studies that create potential for integration of statistical and contextual knowledge.)

2. Pose an appropriate relationship question using given measurement data

What makes a good relationship question?

• It can be answered with the data available.

• The variables of interest are specified.

• The question is related to the purpose of the investigation.

• Think about the population of interest. Can the results be extended to a wider population?

• Can predictions be made for one variable from the other?

3. Decide which variable goes on the x-axis (explanatory variable) and which goes on the y-axis (response variable). Predictions are made on the response variable using values from the explanatory variable.

EXAMPLE OF PURPOSE STATEMENT

I am investigating whether the classification of North Island Hector’s dolphins as a new subspecies can be supported by looking at relationships between variables from the data set. I will investigate a possible relationship between the rostrum width at base and the rostrum width at mid-length of the skulls of Hector’s dolphins and see if this is different for North Island dolphins. I will be using the multivariate data that has been provided from the research on Hector’s dolphins in New Zealand.

The explanatory variable is rostrum width at base and the response variable is rostrum width at mid-length. This is so that I can predict the rostrum width at mid-length from the rostrum width at base.

[pic]

Select and use appropriate data displays

The scatter plot is the basic tool used to investigate relationships between two quantitative variables. We will use inZight to plot the scatter graph and give us a line of best fit (trend line) for the data.

Which variable goes on the x-axis and which goes on the y-axis?

It depends on the question and on the variables of interest.

Identify features in the data

Have a template for features (but allow flexibility). Describe the features using phrases like “I noticed that…, I think that…, I wonder if…”

• Trend (linear/non-linear)

• Association (positive or negative)

• Strength (degree of scatter from trend line)

• Groupings/clusters

• Unusual observations

• Other (e.g., variation in scatter)

[pic]

Identify the features in the data (before fitting a trend line)

1. TREND

I noticed-

From the scatter plot it appears that there is a linear trend between rostrum width at base and rostrum width at mid-length.

I think that-

This is a reasonable expectation because two different measures on the same body part of an animal could be in proportion to each other.

2. ASSOCIATION

I noticed-

The scatter plot also shows that as the rostrum width at base increases, the rostrum width at mid-length also tends to increase.

3. GROUPS

No groups were visually evident until the North and South Island dolphin data were indicated on the scatter plot. I could then see that the two groups were clustered in different areas. The North Island dolphins had bigger rostrum measurements than the South Island dolphins.

4. UNUSUAL POINTS

One dolphin with rostrum width at base of 86mm had a smaller rostrum width at mid-length compared to dolphins with the same or similar rostrum widths at base.

Find an appropriate model

Because the trend is linear, fit a linear model to the data. The line is a good model for the data because for each of the values of the rostrum width at the base, the number of data points above the line is about the same as the number of data points below the line. OR The line is a good model because the data points are close to the trend line and spread evenly above and below it for all the explanatory values.

For non-linear trends see further on in notes.

5. STRENGTH - Describing the strength of the relationship and relating this to the context by discussing visual aspects of scatter from the trend line.

The correlation coefficient(r) can also be used to back up the observation of strength and direction of relationship.

Example:

The points on the graph are reasonably close to the fitted line so the relationship between rostrum width at mid-length and rostrum width at base is reasonably strong.

[pic]

Scatter Plots

What do I look for in scatter plots?

Trend

I notice

a linear trend… or a non-linear trend?

I notice

a positive association… or a negative association?

Scatter

I notice

a strong relationship… or a weak relationship?

I notice

constant scatter… or non-constant scatter?

Anything unusual

any groupings?

any outliers?

Correlation

• Correlation measures the strength of the linear association between two quantitative variables

• Get the correlation coefficient (r) from your calculator or computer

• r has a value between -1 and +1:

r = -1 r = -0.7 r = -0.4 r = 0 r = 0.3 r = 0.8 r = 1

|r value |Comment |

|Between 0.9 and 1 |Very strong positive relationship between the variables. |

|Between 0.75 and 0.9 |Strong positive relationship between the variables. |

|Between 0.5 and 0.75 |Moderate positive relationship between the variables. |

|Between 0.25 and 0.5 |Weak positive relationship between the variables. |

|Between –0.25 and 0.25 |Little or no relationship between the variables. |

|Negative values |As above but a negative or inverse relationship. |

• The correlation coefficient has no units

• Use correlation only if you have two quantitative variables

• There is an association between gender and weight but there isn’t a correlation between gender and weight!

• Use correlation only if the relationship is linear
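
The r-value table above can be turned into a small helper. The sketch below is only an illustration: it computes r with the standard Pearson formula and then looks up the wording from the table. The data values are made up for the example (they are not from the dolphin data set), and the function names `correlation` and `describe_strength` are hypothetical. The table does not say which band a boundary value such as 0.75 belongs to, so the sketch assumes it goes in the stronger band.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe_strength(r):
    """Wording from the r-value table (assumption: boundaries go to the stronger band)."""
    size = abs(r)
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.25:
        strength = "weak"
    else:
        return "little or no relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} relationship"


# Made-up measurements, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 2), describe_strength(r))
```

A description from r should still be backed up by what the scatter plot shows, because r only measures linear association.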

Making Predictions

[pic]

From inZight we can get the equation of the trend line.

RWM = 0.77 * RWB - 8.72

RWM= Rostrum width at mid-length

RWB= Rostrum width at base

Using the model to make a prediction

I will make predictions for a dolphin rostrum width at base of 85mm because such a dolphin could have been from either Island.

All points: RWM = 0.77 x 85 – 8.72 = 56.73mm

My prediction of Rostrum width at mid-length for dolphins with Rostrum width of 85mm at base is approximately 57mm. I think this prediction is fair as the data points are fairly close to the line of best fit (trend line) giving a strong relationship and reliable model.

Using inZight we can separate out North and South Island dolphins and look at the relationship between the variables in those two situations. We can find trend lines and get equations and correlation coefficients for the models and make predictions for the two islands, comparing them with the general prediction.

[pic]

Linear trend, North Island: RWM = 0.48 * RWB + 19.19, correlation coefficient r = 0.76

Linear trend, South Island: RWM = 0.46 * RWB + 14.37, correlation coefficient r = 0.65

NI dolphins: RWM = 0.48 x 85 + 19.19 = 59.99mm

SI dolphins: RWM = 0.46 x 85 + 14.37 = 53.47mm

(All points: RWM = 0.77 x 85 – 8.72 = 56.73mm)
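
The three predictions can be checked with a few lines of code. This is a minimal sketch that uses the slopes and intercepts quoted in these notes. The function name `predict_rwm` and the dictionary layout are hypothetical, chosen only for this illustration.

```python
def predict_rwm(rwb, slope, intercept):
    """Predicted rostrum width at mid-length (mm) from rostrum width at base (mm)."""
    return slope * rwb + intercept


# Slopes and intercepts as quoted in the notes
models = {
    "All points": (0.77, -8.72),
    "NI dolphins": (0.48, 19.19),
    "SI dolphins": (0.46, 14.37),
}

rwb = 85  # rostrum width at base, in mm
predictions = {name: predict_rwm(rwb, m, c) for name, (m, c) in models.items()}

for name, rwm in predictions.items():
    # Two decimals to match the working, nearest mm for the reported answer
    print(f"{name}: RWM = {rwm:.2f} mm, about {round(rwm)} mm")
```

Rounding to the nearest millimetre matches the precision of the measurements, which is what sensible rounding means here.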

What do I notice about the predictions? What do I think? What do I wonder?

The relationships between the variables seem to show similar strength as the scatter from the trend line seems to be similar.

I also noticed that there were very few data points for the North Island. I would worry that the model for the North island dolphins could be unreliable.

Because the North Island trend line is slightly steeper, I think that the rostrum width at mid-length increases a little faster in relation to the rostrum width at base for North Island dolphins than for South Island dolphins. This would need to be investigated further with more data for the North Island. The North Island dolphins also seem to be bigger than the South Island dolphins.

Communicating findings in a conclusion

Example:

Scientific research shows that the NI dolphins can be classified as a new sub-species, but this classification is not supported by the relationships I investigated, as there were positive associations of similar strength for both Islands. The only difference was that the North Island dolphins were bigger. There were only 12 North Island dolphins, which makes it hard to be sure about any conclusions. More dolphin data would be needed, and this may not be easy to get as the researchers may already have collected all the data they could by visiting different museums. Investigations of colour differences and size may contribute to supporting this classification.

For higher level reflection consider:-

Causation

Be aware that there could be a third factor influencing results.

Two variables may be strongly associated (as measured by the correlation coefficient for linear associations) but may not have a cause and effect relationship existing between them. The explanation may be that both the variables are related to a third variable not being measured – a “lurking” or “confounding” variable.

Only talk about causation if you have well designed and carefully carried out experiments.

These variables are positively correlated:

• Number of fire trucks vs amount of fire damage

• Teacher’s salaries vs price of alcohol

• Number of storks seen vs population of Oldenburg Germany over a 6 year period

• Number of policemen vs number of crimes

For example a comparison might be made between the age of adult anglers and the number of fish that they catch. The graph shows a strong correlation between these two variables.

However no consideration has been given to the possibility that there could be another factor that is affecting results. In this case it could be "hours spent fishing". People who spent more time fishing tend to catch more fish; older people tend to spend more time fishing.

GROUPS –these may need to be analysed separately.

This example shows possible correlation between counts for the two sites.

However it can be seen that there are two distinct sub-samples in this data (it might be: males, females).

When treated separately these two sub-samples show no correlation.

In the example below the three subgroups all show a positive correlation but when put together there is a negative correlation.
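
This last situation can be checked numerically. The sketch below uses a small invented data set, chosen only to show the effect: each of the three subgroups rises from left to right, but the subgroups themselves step downwards, so the combined data has a negative correlation. The perfectly straight subgroups are a simplification and real data would show scatter.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: three subgroups, each rising from left to right
groups = {
    "A": ([1, 2, 3], [10, 11, 12]),
    "B": ([4, 5, 6], [7, 8, 9]),
    "C": ([7, 8, 9], [4, 5, 6]),
}

group_r = {name: correlation(xs, ys) for name, (xs, ys) in groups.items()}

# Pool all the points as if the subgroups had not been identified
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
combined_r = correlation(all_x, all_y)

for name, r in group_r.items():
    print(f"Group {name}: r = {r:.2f}")
print(f"All groups together: r = {combined_r:.2f}")
```

The sign of r flips once the subgroups are pooled, which is why groups visible on the scatter plot may need to be analysed separately.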

NON-LINEAR TRENDS.

If this set of data were analysed for a linear relationship there would be no correlation. However, there is a strong non-linear relationship when it is analysed for a quadratic relationship (x²). Describe visually how the quadratic model is a better fit.

Example of a non-linear relationship.

A forestry company wants to be able to estimate the volume of timber that it can get from a tree by taking measurements of its circumference. The company uses measurements from a number of trees and plots a scatter diagram.

Fitting a linear trend line we get a fairly good visual fit but notice the data points at the ends do not fit well and seem to curl up above the trend line.

[pic]

[pic]

We will try fitting a non-linear trend line using the quadratic model from inZight.

This fit is much better for the given data, with the data points sitting close to the trend line. This model will be suitable only for the range of given data.

Care must be taken when using a quadratic model as a quadratic function always has a turning point.

Extrapolation (predicting outside of the given range) is dangerous. The quadratic model given below has a turning point at x = 1.67, so it cannot be used for circumferences less than about 1.67 m: for smaller circumferences the model's predicted volume starts increasing again.

Estimates for values above 5m could be unreliable. The model suggests a continuing rise in the volume of timber but this will reach a limit after which any estimates are not sensible.
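
The turning point of a quadratic model y = ax² + bx + c is at x = -b/(2a). The equation of the fitted quadratic is not reproduced in these notes, so the coefficients in the sketch below are hypothetical, chosen only so that the turning point lands near 1.67 m. They are not the forestry company's model, and the predicted volumes are illustrative only.

```python
def turning_point(a, b):
    """x-coordinate of the turning point of y = a*x**2 + b*x + c."""
    return -b / (2 * a)


def quadratic(x, a, b, c):
    """Value of the quadratic model at x."""
    return a * x**2 + b * x + c


# Hypothetical coefficients (NOT the fitted model from the notes)
a, b, c = 0.3, -1.0, 1.0

x_turn = turning_point(a, b)
print(f"Turning point at x = {x_turn:.2f} m")

# To the left of the turning point the predicted volume rises again
for circumference in (1.0, x_turn, 2.5):
    volume = quadratic(circumference, a, b, c)
    print(f"circumference {circumference:.2f} m -> predicted volume {volume:.2f} m^3")
```

A thinner tree getting a larger predicted volume makes no sense in context, which is the reason for restricting the model to the range of the data.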

[pic]

[pic]

OTHER DEFINITIONS

Regression line

A line that summarises the linear relationship (or linear trend) between the two variables in a linear regression analysis, from the bivariate data collected.

A regression line is an estimate of the line that describes the true, but unknown, linear relationship between the two variables. The equation of the regression line is used to predict (or estimate) the value of the response variable from a given value of the explanatory variable.

Example

The actual weights and self-perceived ideal weights of a random sample of 40 female students enrolled in an introductory Statistics course at the University of Auckland are displayed on the scatter plot below. A regression line has been drawn. The equation of the regression line is

predicted y = 0.6089x + 18.661 or predicted ideal weight = 0.6089 × actual weight + 18.661

Alternatives: fitted line, line of best fit, trend line

Least-squares regression line

The most common method of choosing the line that best summarises the linear relationship (or linear trend) between the two variables in a linear regression analysis, from the bivariate data collected.

Of the many lines that could usefully summarise the linear relationship, the least-squares regression line is the one line with the smallest sum of the squares of the residuals.

Two other properties of the least-squares regression line are:

1. The sum of the residuals is zero.

2. The point with x-coordinate equal to the mean of the x-coordinates of the observations and with y-coordinate equal to the mean of the y-coordinates of the observations is always on the least-squares regression line.
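
These properties can be checked on any data set. The sketch below fits a least-squares line using the standard formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x) and then tests both properties, plus the smallest-sum-of-squares property against one slightly different line. The data values and function names are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of the squares of the residuals for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

print(f"line: y = {slope:.2f}x + {intercept:.2f}")
print("sum of residuals is zero:", abs(sum(residuals)) < 1e-9)
print("line passes through the point of means:",
      abs(slope * mean_x + intercept - mean_y) < 1e-9)

# Any other line, such as this slightly steeper one, has a larger sum of squares
best = sum_squared_residuals(xs, ys, slope, intercept)
other = sum_squared_residuals(xs, ys, slope + 0.1, intercept)
print("least-squares line has the smaller sum of squares:", best < other)
```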

Definition - Regression

Regression relationship = trend + scatter

Observed value = predicted value + prediction error

Complete the table below

|Data Point |(8, 25) |(6, 7) |(-2, -3) |(x, y) |

|Observed y-value |25 | | |y |

|Fitted line |[pic] |[pic] |[pic] |[pic] |

|Predicted value / fitted value |21 | | |ŷ |

|Prediction error / residual |4 | | |y - ŷ |
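
The first column of the table can be reproduced with a short calculation. This is a sketch only, assuming the fitted line for this example is y = 5 + 2x, which is consistent with the predicted value of 21 given for x = 8. The function names are hypothetical.

```python
def fitted_value(x):
    """Predicted y from the assumed fitted line y = 5 + 2x."""
    return 5 + 2 * x


def residual(x, y):
    """Prediction error: observed value minus predicted value."""
    return y - fitted_value(x)


x, y = 8, 25  # the worked data point from the table
print("predicted value:", fitted_value(x))   # 21
print("prediction error:", residual(x, y))   # 4
```

A positive residual means the observed point lies above the fitted line, and a negative residual means it lies below.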

Practise for assessment.

Research the context and variables.

Use the statistical enquiry cycle to carry out a statistical investigation of the data to determine if there is a relationship between at least one pair of variables.

Write a report describing the investigation.

1. Pose an appropriate relationship question that can be answered using variables in the data set. The variables you choose must be numerical, and the variable you use as your response variable must be continuous. You may choose to investigate more than one pair of variables.

2. Select appropriate display(s) to graph your data.

3. Identify features in the data, including the nature and strength of the relationship.

4. Find an appropriate model.

5. Use your model to make prediction(s).

6. Write a conclusion answering your question. For example use the results of your investigation to decide if the classification of North Island dolphins as a new subspecies from South Island dolphins can be supported.

7. Support your conclusion by referring to your analysis and/or features of the visual display(s). Include a reflection on your process, which could consider other relevant variables, or evaluate the adequacy of your model(s).

In writing your report, link your discussion to the context and support the statements you make by referring to statistical evidence.

|Bivariate Data 3.9 (4 credits) |

|Achieved |Merit also requires |Excellence also requires |

|Question that includes |Question with researched justification for |Other variables considered and reflection on |

|purpose, exp &resp. and source of data |variables, exp & resp and source of data. |which are most appropriate – at least 2 |

| | |pairs. |

| | |Research driven purpose. |

|Scatter graph | | |

|Describe features |Association justified with ref to graph. |Reflect on strength and nature with |

|Trend |Other features described with reference to |contextual comments on. |

|Strength with ref to scatter |graph and context (research) eg groups and |Causation |

|Association |investigation of relationships for different |Other factors impacting on variable |

| |subgroups |Number of data points considered |

|Equation |How well data fits trend line with ref to all|Improvements to model |

| |‘x’ values. |Linear to non-linear |

| | |Removing outliers |

| | |Models for sub-groups. |

|Prediction made – units and sensible rounding|Prediction interpreted in context and |Reflection on prediction including accuracy |

|required. |accuracy discussed. |and relevance to wider population. |

|Conclusion – answers question with reference |Conclusion – answers question with reference |Conclusion made from at least 2 pairs of |

|to the relationship. |to the relationship giving contextual support|variables. |

| |(evidence of research). |Contextual reference required. |

-----------------------

Statistical Enquiry Cycle stages (figure labels): PROBLEM & PLAN, DATA ANALYSIS, CONCLUSION

RWM = rostrum width at mid-length

RWB = rostrum width at base

RL = rostrum length

ZW = zygomatic width

CBL = condylobasal length

ML = mandible length

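Scatter plot labels:

linear trend: straight line pattern

negative association: as one variable gets bigger, the other gets smaller

positive association: as one variable gets bigger, so does the other

weak relationship: lots of scatter from the linear pattern

strong relationship: little scatter from the linear pattern

constant scatter: roughly the same amount of scatter as you look across the plot

non-constant scatter: the scatter looks like a “fan” or “funnel”

outlier: unusually far from the trend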

Correlation plot labels:

r = -1 or r = 1: points fall exactly on a straight line

r = 0: no linear relationship (uncorrelated)

A plot can show no linear relationship even though there is a relationship!

Correlation coefficient r=0.83

[Figure: Timber estimates. Horizontal axis: Circumference of tree (m). Vertical axis: Volume of timber (m³).]

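[Figure: fitted line y = 5 + 2x with the data point (8, 25). At x = 8 the observed value is 25 and the predicted value is 21; the vertical gap between them is the prediction error.]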
