Topic 5: Relations between two quantitative variables

Jeroen Claes

Contents

- Different types of data
- What are correlations?
- Exploring correlations with plots
- Pearson product-moment coefficient r
- Spearman’s rho and Kendall’s tau
- Reporting on correlations
- Exercises
- References

1. Different types of data

Data can be of different types:

- Nominal or categorical (e.g., yes vs. no)
- Quantitative:
  - Ordinal-scaled
  - Interval-scaled
  - Ratio-scaled

Ordinal-scaled data:

- The truly meaningful information is contained not in the values themselves, but in their ordering
- E.g., Likert scales: 1 and 5 have no meaning in relation to each other, other than the relative degree of ‘agree’ vs. ‘disagree’ they represent

Interval-scaled data:

- E.g., degrees Celsius
- We know the ordering of the values
- Each value is meaningful on the scale on its own, i.e., each value represents a temperature
- There is no true zero: 0 does not represent the absence of temperature; it is a measurement like any other on the scale

Ratio-scaled data:

- E.g., counts of bacteria on a surface
- We know the ordering of the values
- Each value is meaningful on the scale on its own
- There is a true zero: 0 represents the absence of bacteria on the surface

2. What are correlations?

A correlation is a relationship between two interval-scaled or ratio-scaled variables in which the two increase or decrease in parallel:

- If X increases by one unit, there will be a constant increase/decrease of N units in Y
- If X decreases by one unit, there will be a constant increase/decrease of N units in Y
- Data must be paired: for each value in X there must be a corresponding value in Y
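
To make the idea of paired values and a constant change per unit concrete, here is a minimal sketch in base R on made-up data. The vectors `x` and `y`, the intercept of 3 and the slope of 2 are illustrative assumptions, not part of the course data.

```r
# Made-up paired data: one y value for each x value (illustrative only)
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1)   # y rises by about 2 units per unit of x

# Pairing means both vectors have the same length and the same order
stopifnot(length(x) == length(y))

# The fitted slope estimates N, the constant change in y per unit of x
slope <- unname(coef(lm(y ~ x))[2])

# The correlation coefficient summarizes how tightly the pairs follow that line
r <- cor(x, y)
print(slope)
print(r)
```

Because the noise is small relative to the trend, the slope should come out close to 2 and the coefficient close to 1.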

If X increases and Y increases as well, this is called a positive correlation. E.g., age correlates positively with vocabulary size in young children: the older they get, the more words they know.

If X increases and Y decreases, this is called a negative correlation. E.g., Zipf (1935:25) found that word frequency is negatively correlated with word length: the more frequent a word is, the shorter it tends to be.

When we analyze relationships between quantitative variables, we are interested in three aspects:

- The direction of the correlation: positive or negative?
- The size of the correlation: how strong is the relationship between the two variables?
- Whether or not the relationship is statistically significant

The first two can be established with a correlation coefficient; the last one requires a hypothesis test based on the correlation coefficient.

Observe that correlations are different from paired t-tests and Wilcoxon tests:

- The t-test compares the means of the pairs
- The Wilcoxon test compares the medians of the pairs
- Correlations, in contrast, explore the strength of the association between the values of the pairs; the coefficient by itself offers no information on their means, their medians, or the statistical significance of the relationship

3. Exploring correlations with plots

We will be working again with the dataset by Balota et al.
(2007).

- Research question: What is the relationship between the Length of a word and the Mean_RT in a lexical decision task?
- Hypothesis: Shorter words will be recognized faster than longer words
- Null hypothesis: There is no difference in recognition time between short and long words

    library(readr)
    library(dplyr)
    dataSet <- read_csv("...")  # data URL missing in the source
    glimpse(dataSet)
    ## Observations: 100
    ## Variables: 4
    ## $ Length  <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...
    ## $ Freq    <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...
    ## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...
    ## $ Word    <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...

Before you do anything else, it is usually a good idea to plot the two variables and their relation in a scatterplot:

- geom_point plots a dot for each pair of X, Y values
- geom_smooth fits a line through these dots to inspect the relationship between the variables:
  - Positive correlation: the line rises from left to right
  - Negative correlation: the line falls from left to right
- geom_smooth also adds a 95% confidence interval around the line, i.e., the band within which the population regression line is expected to lie
- method="lm" tells ggplot to fit a linear relationship between them (linear regression)

    library(ggplot2)
    ggplot(dataSet, aes(x=Length, y=Mean_RT)) +
      geom_point() +
      geom_smooth(method="lm")
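
The direction that the fitted line shows on the plot can also be read numerically from the slope of the regression line. The following is a minimal sketch in base R on simulated data; the variable names `len` and `rt` and all numbers are made up for illustration and do not come from the Balota et al. data.

```r
set.seed(42)
len <- sample(3:12, 50, replace = TRUE)      # hypothetical word lengths
rt  <- 600 + 30 * len + rnorm(50, sd = 60)   # hypothetical reaction times (ms)

# The same kind of straight-line fit that method="lm" draws on the plot
fit <- lm(rt ~ len)
slope <- unname(coef(fit)["len"])

# A positive slope means the line rises from left to right
direction <- if (slope > 0) "positive" else "negative"
print(direction)

# 95% confidence interval for the slope (second row of the confint matrix)
slope_ci <- confint(fit)["len", ]
print(slope_ci)
```

The sign of the slope always agrees with the sign of the correlation coefficient, so either one can be used to report the direction.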

4. Pearson’s product-moment r

Just by inspecting the plot we can already see that there is a linear relationship between the length of the word and the time subjects take to recognize it. To calculate how strongly the two are associated, we can calculate the Pearson product-moment correlation coefficient.

4.1 Assumptions of the Pearson product-moment correlation

- The relationship between the two variables is monotonic: as X increases, Y consistently increases (or consistently decreases); the direction of the relationship never reverses
- The relationship between the two variables is linear: each increase of one unit in X goes together with a constant increase (or decrease) of N units in Y
- There are no outliers in the data (Levshina, 2015: 122)

If the data fail to meet these assumptions, the correlation coefficient will not be robust.

Non-linear relationships: try a transformation of X or Y (e.g., square, logarithm) to make the relationship linear. Compare the plot of the raw frequencies with the plot of the log-transformed frequencies:

    library(ggplot2)
    ggplot(dataSet, aes(x=Freq, y=Mean_RT)) +
      geom_point() +
      geom_smooth(method="lm")

    ggplot(dataSet, aes(x=log(1+Freq), y=Mean_RT)) +
      geom_point() +
      geom_smooth(method="lm")

If a transformation does not help:

- Non-linear relationships: use Spearman’s rho or Kendall’s tau (see below)
- Outliers: remove the outliers
- Non-monotonic relationships: a correlation coefficient is pointless here; it may be close to 0 even though the variables are related

4.2 Pearson’s product-moment r

Logic of the test:

- The values of the two variables are scaled to z-scores
- Each scaled value of Mean_RT is multiplied by the corresponding scaled value of Length
- The products are summed and divided by the sample size minus one (scale() standardizes with the sample standard deviation, which uses n - 1)

    sum(scale(dataSet$Length) * scale(dataSet$Mean_RT)) / (nrow(dataSet) - 1)
    ## [1] 0.6147456
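
A quick way to check this logic is to run it on a small made-up sample and compare the result with `cor`. The sketch below uses only base R; the vectors `x` and `y` are invented for illustration. Because `scale()` standardizes with the sample standard deviation (denominator n - 1), the sum of the z-score products has to be divided by n - 1 to reproduce r exactly; dividing by n gives a slightly smaller value.

```r
# Invented example values (not from the course data)
x <- c(3, 5, 6, 8, 11, 12)
y <- c(640, 700, 690, 820, 900, 950)
n <- length(x)

# z-scores; as.vector() drops the matrix attributes that scale() adds
zx <- as.vector(scale(x))
zy <- as.vector(scale(y))

r_manual  <- sum(zx * zy) / (n - 1)   # matches cor()
r_by_n    <- sum(zx * zy) / n         # smaller by a factor of (n - 1) / n
r_builtin <- cor(x, y)

print(c(manual = r_manual, by_n = r_by_n, builtin = r_builtin))
```

With large samples the two denominators give nearly the same number, but only the n - 1 version is identical to what `cor` reports.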

In R, you can use the cor command to calculate the Pearson product-moment correlation coefficient r:

    cor(dataSet$Length, dataSet$Mean_RT)
    ## [1] 0.6147456

The correlation coefficient is 0.6147456. This tells us that:

- There is a positive correlation: if Length increases, Mean_RT increases too (for a negative correlation, the coefficient would be negative)
- The relationship is moderately strong:
  - 0: no correlation
  - +/- 0-0.3: weak correlation
  - +/- 0.3-0.7: moderate correlation
  - +/- 0.7-1: strong correlation
  - +/- 1: perfect correlation

4.3 Dangers of ignoring the assumptions

If we remove outliers, the correlation coefficient will change, because outliers pull the line on the plots up or down:

    # keep only the rows whose Mean_RT lies within 2 standard deviations of the mean
    dataSet <- dataSet[abs(scale(dataSet$Mean_RT))<2,]
    cor(dataSet$Length, dataSet$Mean_RT)
    ## [1] 0.5723765

If we fail to recognize a non-linear relationship, the correlation coefficient may be substantially different. Compare:

    # Frequency vs Mean_RT, without log transformation of Frequency
    cor(dataSet$Freq, dataSet$Mean_RT)
    ## [1] -0.4115368

    # Frequency vs Mean_RT, with log transformation of Frequency
    cor(log(1+dataSet$Freq), dataSet$Mean_RT)
    ## [1] -0.6171241

4.4 Testing the significance of correlations

If we want to test whether a correlation is significant, a few additional assumptions should be met (Levshina, 2015: 126), besides the assumptions of the correlation coefficient. These assumptions are shared by more advanced techniques such as linear regression:

- The sample is randomly selected from the population it represents.
- Both variables are interval- or ratio-scaled.
  Ordinal variables won’t work!
- The sample size is greater than 30 and/or the Y-values that correspond to each value in X are normally distributed and vice versa (bivariate normal distribution)
- The relationship between the variables is homoskedastic: the spread of the Y-values around the line is roughly the same across the whole range of X
- The values of X and Y are completely independent: there is no autocorrelation (correlation between the values of X, or between the values of Y). E.g., temperature decreases/increases gradually over the course of a year. The temperature on Feb 26 is correlated with the temperature on Feb 25 and Feb 27, because random jumps in temperature do not occur

4.4.1 Validating the assumptions

The sample size is greater than 30 and/or the Y-values that correspond to each value in X are normally distributed and vice versa (bivariate normal distribution):

- Use mvnorm.etest from the energy package
- First argument: data.frame with our two variables
- Second argument: number of bootstrap replicates used to compute the p-value (1000 is good practice)
- Null hypothesis: the data have a bivariate normal distribution (if p < 0.05, they do NOT have a bivariate normal distribution)

    library(energy)
    mvnorm.etest(dataSet[,c("Length", "Mean_RT")], 1000)
    ## 
    ##  Energy test of multivariate normality: estimated parameters
    ## 
    ## data:  x, sample size 96, dimension 2, replicates 1000
    ## E-statistic = 0.47346, p-value = 0.921

The relationship between the variables is homoskedastic. Heteroskedasticity will show up as a funnel-like pattern on a scatterplot. At first glance, it is not an issue here:

    library(ggplot2)
    ggplot(dataSet, aes(x=Length, y=Mean_RT)) +
      geom_point() +
      geom_smooth(method="lm")

Heteroskedasticity is a serious problem for correlation analysis and its big brother, linear regression. In the package car
(“Companion to Applied Regression”) there is a function that tests for heteroskedasticity based on a linear regression model: ncvTest (non-constant variance test).

- To be able to use it, we must define a linear regression model
- The null hypothesis is that the data are homoskedastic (NOT heteroskedastic). If p > 0.05, there is no evidence of heteroskedasticity

    mod <- lm(Mean_RT~Length, data=dataSet)
    library(car)
    ncvTest(mod)
    ## Non-constant Variance Score Test 
    ## Variance formula: ~ fitted.values 
    ## Chisquare = 0.4630842    Df = 1     p = 0.4961861

The values of X and Y are completely independent, i.e., there is no autocorrelation:

- Like heteroskedasticity, autocorrelation is also a serious issue for linear regression
- The package car includes an implementation of the Durbin-Watson test
- The null hypothesis is that there is no autocorrelation. If p < 0.05, your data violate the assumption of no autocorrelation

    mod <- lm(Mean_RT~Length, data=dataSet)
    library(car)
    durbinWatsonTest(mod)
    ##  lag Autocorrelation D-W Statistic p-value
    ##    1      0.02480565      1.937535   0.732
    ##  Alternative hypothesis: rho != 0

4.4.2 Performing the correlation test

Our data has passed all of the tests.
The assumptions are all satisfied. We can now calculate our correlation coefficient and check to see if it is significant.

cor.test accepts the following arguments:

- Our two variables
- alternative:
  - ‘less’ if the correlation is expected to be negative
  - ‘greater’ if the correlation is expected to be positive
  - ‘two.sided’ if the hypothesis is that there is a correlation (default)

    cor.test(dataSet$Length, dataSet$Mean_RT, alternative="greater")
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  dataSet$Length and dataSet$Mean_RT
    ## t = 6.7676, df = 94, p-value = 5.555e-10
    ## alternative hypothesis: true correlation is greater than 0
    ## 95 percent confidence interval:
    ##  0.4466334 1.0000000
    ## sample estimates:
    ##       cor 
    ## 0.5723765

4.4.2 Interpreting the output of the correlation test

The output of cor.test tells us the following:

- p < 0.05: the null hypothesis of no correlation can be rejected
- r = 0.57: the correlation is moderately strong
- 95% confidence interval for r: 0.45 - 1. The correlation will be moderately strong to very strong at the population level (the interval runs up to 1 because the test is one-sided)

4.4.2 Calculating the variance explained (R2)

Once we have established the size and the significance of the correlation between Length and Mean_RT, we can estimate the amount of Mean_RT variance that is explained by Length by simply squaring the correlation coefficient (R2 or R-squared) (Urdan, 2010:87-88):

- variance explained = r^2
- Length explains 32.76 percent of the variance of Mean_RT
- This answers the question: how well does Length explain/model/predict Mean_RT?

    a <- cor.test(dataSet$Length, dataSet$Mean_RT, alternative="greater")
    a$estimate^2
    ##       cor 
    ## 0.3276149
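
The individual pieces of a `cor.test` result can also be extracted programmatically, which is convenient for reporting. Here is a minimal sketch on simulated data; the vectors `x` and `y` and the effect size are made up, while `estimate`, `parameter`, `p.value` and `conf.int` are components of the `htest` object that `cor.test` returns.

```r
set.seed(123)
x <- rnorm(96)                        # simulated predictor
y <- 0.6 * x + rnorm(96, sd = 0.8)    # simulated outcome, positively related to x

res <- cor.test(x, y, alternative = "greater")

r  <- unname(res$estimate)    # correlation coefficient
df <- unname(res$parameter)   # degrees of freedom: n - 2
p  <- res$p.value             # one-sided p-value
ci <- res$conf.int            # one-sided interval: upper bound is 1
r_squared <- r^2              # proportion of variance explained

cat(sprintf("r = %.2f, df = %d, p = %.3g, R2 = %.2f\n", r, df, p, r_squared))
```

Storing the result in an object, rather than only printing it, keeps the coefficient at full precision for follow-up calculations such as R2.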

4.4.2 A note of warning

Correlation does not imply causation:

- Be careful when interpreting correlations in terms of cause and effect. If two things are statistically correlated, they are not necessarily causally related
- Statistics may uncover a link between two variables; subsequent analysis and theoretical reflection have to make sense of it

5. Spearman’s rho and Kendall’s tau

- Spearman’s rho and Kendall’s tau should be used when your data do not satisfy the assumptions of Pearson’s product-moment r
- These tests can be used for ordinal, interval, and ratio data
- The only assumption is that the relationship is monotonic

5.1 Data

Data from Bates & Goodman (1997): correlation between grammatical complexity and vocabulary size for 10 children between 16 and 30 months old.

- Research question: Is there a relationship between the size of language learners’ lexicon and the complexity of their grammar?
- Hypothesis: Grammar develops on a par with vocabulary size
- Null hypothesis: There is no correlation between grammar and vocabulary size

    dataSet <- read_csv("...")  # data URL missing in the source
    glimpse(dataSet)
    ## Observations: 10
    ## Variables: 3
    ## $ subject          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    ## $ lexicon_size     <int> 47, 89, 131, 186, 245, 284, 362, 444, 553, 627
    ## $ complexity_score <int> 0, 2, 1, 3, 5, 9, 7, 16, 25, 34

5.1 Exploring the correlations between lexicon_size and complexity_score

The data here are largely monotonic: as lexicon_size increases, complexity_score generally increases as well. The relationship is not linear, but it is positive and monotonic:

    ggplot(dataSet, aes(x=lexicon_size, y=complexity_score)) +
      geom_point() +
      geom_smooth(method="loess")

The relationship may not be linear, but we could apply a log-transformation to make it linear:

- This is what we would do if we were to perform a regression
- If we transform it, we can use Pearson’s r if it satisfies the other assumptions

    ggplot(dataSet,
      aes(x=lexicon_size, y=log(1+complexity_score))) +
      geom_point() +
      geom_smooth(method="lm")

For non-linear monotonic relationships, we cannot use the parametric (and conceptually relatively simple) Pearson’s r:

- The non-parametric methods Spearman’s rho and Kendall’s tau are better suited, as these make no assumptions about linearity or the distribution of the data
- These tests can also be used for ordinal data (e.g., Likert scales)
- To use Spearman’s rho or Kendall’s tau, we simply add method="spearman" or method="kendall" to cor or cor.test
- Kendall’s tau will generally yield less extreme correlation estimates than Spearman’s rho

    cor.test(dataSet$lexicon_size, dataSet$complexity_score, method="spearman", alternative="greater")
    ## 
    ##  Spearman's rank correlation rho
    ## 
    ## data:  dataSet$lexicon_size and dataSet$complexity_score
    ## S = 4, p-value < 2.2e-16
    ## alternative hypothesis: true rho is greater than 0
    ## sample estimates:
    ##       rho 
    ## 0.9757576

    cor.test(dataSet$lexicon_size, dataSet$complexity_score, method="kendall", alternative="greater")
    ## 
    ##  Kendall's rank correlation tau
    ## 
    ## data:  dataSet$lexicon_size and dataSet$complexity_score
    ## T = 43, p-value = 1.488e-05
    ## alternative hypothesis: true tau is greater than 0
    ## sample estimates:
    ##       tau 
    ## 0.9111111
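
Both coefficients work on ranks rather than raw values, which is why only monotonicity matters. As a minimal sketch, the ten value pairs shown in the `glimpse` output of the Bates & Goodman data are typed in by hand here, and Spearman’s rho is reproduced as Pearson’s r computed on the ranks. Only base R is assumed.

```r
# Values copied by hand from the glimpse() output of the Bates & Goodman data
lexicon_size     <- c(47, 89, 131, 186, 245, 284, 362, 444, 553, 627)
complexity_score <- c(0, 2, 1, 3, 5, 9, 7, 16, 25, 34)

# Spearman's rho, computed directly and as Pearson's r on the ranks
rho            <- cor(lexicon_size, complexity_score, method = "spearman")
rho_from_ranks <- cor(rank(lexicon_size), rank(complexity_score))

# Kendall's tau is based on counting concordant and discordant pairs
tau <- cor(lexicon_size, complexity_score, method = "kendall")

# Pearson's r on the raw values, for comparison
r_raw <- cor(lexicon_size, complexity_score)

print(c(rho = rho, rho_from_ranks = rho_from_ranks, tau = tau, pearson = r_raw))
```

The two ways of computing rho agree, and tau is the less extreme of the two rank-based estimates for these data.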

6. Reporting on correlations

When reporting on a correlation, include:

- Correlation coefficient (r, rho or tau)
- Degrees of freedom (for Pearson’s r)
- p-value and test statistic (t for Pearson, S for Spearman, T for Kendall)
- Type of test (one-tailed, two-tailed)

E.g., for Length and Mean_RT: r = 0.57, df = 94, t = 6.77, p < 0.001 (one-tailed).

7. Exercises

Please perform the exercises.

8. Questions?

9. References

Balota, D.A., Yap, M.J., Cortese, M.J., et al. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459. DOI: 10.3758/BF03193014. Data taken from Levshina (2015).

Bates, E., & Goodman, J. (1997). On the inseparability of grammar and the lexicon: Evidence from acquisition, aphasia and real-time processing. Language and Cognitive Processes, 12(5/6), 507–586.

Levshina, N. (2015). How to do Linguistics with R: Data exploration and statistical analysis. Amsterdam/Philadelphia, PA: John Benjamins.

Zipf, G.K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.

