Topic 12: When things aren’t quite linear: Polynomial ...



Topic 12: When things aren’t quite linear: Polynomial regression and elements of Generalized Additive ModelingJeroen ClaesContentsWhat is polynomial regression and why is it used?How to check for polynomial effectsHow to specify polynomial regression terms in RHow to check if polynomials improve the model fitHow to interpret polynomial regression terms in RWhat is Generalized Additive Modeling and why is it used?How to specify a GAM in RHow to specify random intercepts and slopes in a GAMHow to interpret a GAMHow to assess the fit of a GAMExercisesReferences1. What is polynomial regression and why is it used?1. What is polynomial regression and why is it used? (1/6)Remember that one of the assumptions of logistic and linear regression is that the numeric independent variables are linearly related to either the dependent variable or, in logistic regression, the logit of the dependent variableOften, however, the relationship is not linear. In previous classes, we have attempted to bring these non-linear relationships to a linear form with tranformations of the independent or the dependent variablesFor certain types of non-linear relationships, however, it is much more appropriate to incoporate the non-linearity in the model specification1. What is polynomial regression and why is it used? (2/6)Let us first generate some random datadataSet <- data.frame(x=rnorm(500))dataSet$y<- rnorm(500)+ (dataSet$x+ dataSet$x^2)ggplot(dataSet, aes(x=x, y=y)) + geom_point() 1. What is polynomial regression and why is it used? (3/6)If we were to regress y ~x with a linear model, this would not be a very close fitTo solve this, we could try to find a transformation of x or y (e.g., sqrt) that would render the relationship more linearHowever, this will never be more than an approximation, which comes at the cost of losing some interpretabilityAlso transforming the dependent variable to establish a linear relationship with an independent variable will only work if there’s only one predictor needing such a transformationggplot(dataSet, aes(x=x, y=sqrt(y))) + geom_point() + geom_smooth(method="lm")1. What is polynomial regression and why is it used? (4/6)Another more valid option would be to incorporate the curvature into the model specificationTo do so, we can add a square transformation of x alongside of x in our regression equation:y = intercept + (x*effect1) + (x^2*effect1) + ... + errorIn algebra, functions of the form f(x) = 2*x + 2*x^2 are known as polynomialsPolynomials allow us to model certain non-linear relationships for what they areThis will not work for variables that need e.g., a log-transformationggplot(dataSet, aes(x=x, y=y)) + geom_point() + geom_smooth(method="lm", formula = "y ~ x + I(x^2)", color="red")1. What is polynomial regression and why is it used? (5/6)The gist of the idea is that you include the same numeric predictor a number of times:The numeric predictor as it is, this will model the bit of the curve that is actually linearThe numeric predictor raised to an exponent. If the exponent is 2, then you only include the predictor raised to the power of 2. If it is higher, than you include the predictor raised to the power of 2, the predictor raised to the power of 3, etc.1. What is polynomial regression and why is it used? (6/6)The exponent is called the order of the polynomial:Second-order polynomial: exponent 2; called quadratic. This polynomial order will turn up as a parabola shape on a plotThird-order polynomial: exponent 3, called cubic. Polynomial order will turn up as a s-shape on a plotFourth-order polynomial: exponent 4, called quartic. This polynomial will turn up as a u-shape on a plot1. What is polynomial regression and why is it used? (6/6)2. How to fit polynomial terms in R2. How to fit polynomial terms in RTo incorporate polynomials, we can include a call to the poly function in our formula specificationThis function takes two arguments:Predictor columnOrder of the polynomialHere we will take a look at how polynomials work for lm models, but they work in the same way for glm, lmer, and glmer modelsmod <- lm(y ~ poly(x, 2), dataSet)3. How to check for polynomial effects3. How to check for polynomial effects: linear modelsTo find polynomial effects in your data, in the case of linear regression, you need to plot your dependent variable vs your independent variable to see which form is most adequateggplot(dataSet, aes(x=x, y=y)) + geom_point() + geom_smooth(method="lm", formula = "y ~ x", color="red") + geom_smooth(method="lm", formula="y ~poly(x, 2)")3. How to check for polynomial effects: logistic modelsIn the logistic case, you can plot your independent variable vs.?predicted probabilities derived from a simple model that includes only your independent variableIn this case, the relationship is not linear and a second-order polynomial appears to provide the best fit# Load some datalogDataSet<-read_csv(";) %>% mutate_if(is.character, as.factor)# Specify simple logistic modellogMod<-glm(type ~ characters_before_noun, family="binomial", logDataSet)# Generate all possible values between minimum and maximum of independent variableplotData <- data.frame(characters_before_noun=min(logDataSet$characters_before_noun):max(logDataSet$characters_before_noun))# Extract predicted probabilitiesplotData$predicted<-predict(logMod, newdata = plotData, type="response")# Plotggplot(plotData, aes(x=characters_before_noun, y=predicted)) + geom_point() + geom_smooth(method="lm", formula = "y ~ x", color="red") + geom_smooth(method="lm", formula="y ~ poly(x, 2)")3. How to check for polynomial effects: the codeIn both cases, by adding a lm regression line to the plot, you can get an idea of how linear the relationships areIf you spot something that looks like it is not quite linear, you can incorporate a polynomial, by editing the formula argument of geom_smooth:e.g., formula="y ~ poly(x,2)". The mapping of x and y to columns in the dataset is made in the aes mapping of the main ggplot call4. How to check if the polynomials improve the model fit4. How to check if the polynomials improve the model fit (1/2)Which order of polynomial you include in your model is a matter of emperical adequacy.You will want to model your data as close as possible, but you want to avoid having too high a degree of polynomialityYou start with second-order polynomials. Then you plot the relationship, and you explore if a higher-order polynomial is necessaryPolynomials of orders higher than three or four are usually a bad idea4. How to check if the polynomials improve the model fit (2/2)Polynomial models are more complex than models without polynomialsModels with higher-order polynomials are more complex than models with lower-order polynomialsThis means that we can use the AIC statistic to compare if polynomials contribute to the model fitmod <- lm(y ~ x, dataSet)mod1 <- lm(y~poly(x, 2), dataSet)AIC(mod)-AIC(mod1)## [1] 528.1325mod2 <- lm(y~poly(x, 3), dataSet)AIC(mod1)-AIC(mod2)## [1] -1.6054585. How to interpret polynomial regression terms in R5. How to interpret polynomial regression terms in R (1/2)Polynomial regression terms are hard to interpret:The effect estimates are spread over two or more variables:The first order of the polynomial describes the effect of xThe second order of the polynomial describes the effect of x^2We can no longer interprete the coefficients as indicating that “for each increase in X, there’s a N-unit increase in Y”If you want to interpret the terms, it is best to plot the relationship of the independent variable to the dependent variable. This way you can interpret the shape of the relationship and describe it in plain textsummary(mod)## ## Call:## lm(formula = y ~ x, data = dataSet)## ## Residuals:## Min 1Q Median 3Q Max ## -4.2205 -1.1126 -0.2868 0.8580 8.8364 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.96545 0.07723 12.50 <2e-16 ***## x 0.99205 0.07862 12.62 <2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 1.726 on 498 degrees of freedom## Multiple R-squared: 0.2423, Adjusted R-squared: 0.2408 ## F-statistic: 159.2 on 1 and 498 DF, p-value: < 2.2e-165. How to interpret polynomial regression terms in R (2/2)For logistic models, you can plot the polynomial term vs the predicted values of the dependent variable, which can be obtained with predict(mod, type="response")logMod<-glm(type ~ poly(characters_before_noun, 2) + negation + Typical.Action.Chain.Pos + corpus + tense, family="binomial", logDataSet)# Extract predicted probabilitieslogDataSet$predicted<-predict(logMod, type="response")# Plotggplot(plotData, aes(x=characters_before_noun, y=predicted)) + geom_point() + geom_smooth(method="lm", formula="y ~ poly(x, 2)")6. What is Generalized Additive Modeling and why is it used?6. What is Generalized Additive Modeling and why is it used? (1/3)Generalized additive modeling is a technique that builds on generalized linear models (the type of model implemented in glm)Like polynomials, GAMs allow us to model non-linear relationships as non-linear relationshipsHowever, whereas polynomial regression models can only model non-linear relationships that can be described by a polynomial function, GAMs can handle any type of non-linearityThe non-linear relationships are modeled by smoothing functionsThe predictors for which we specify a linear relationship are evaluated as in a ‘normal’ regression model6. What is Generalized Additive Modeling and why is it used? (2/3)The assumptions of GAMs are the same as those that we looked at for, respectively, logistic and linear regressionThe only difference is that the linearity assumption is relaxed, i.e., as long as we specify our model in a correct way so that non-linear relationships are estimated using the smoothing functions, there is no strict linearity assumptionNote that linear relationships should be estimated using ‘normal’ regression terms6. What is Generalized Additive Modeling and why is it used? (3/3)Wieling, Nerbonne & Baayen (2011) introduced generalized additive modeling to the field of linguisticsTheir study showed that GAMs allow us to measure the non-linear influence of geography (lat, long coordinates) on a dependent variable, while controlling for other predictor variablesGAMs have mainly been used in quantitative dialectology and sociolinguistics7. How to fit a GAM in R7. How to fit a GAM in R (1/3)We will use data by Wieling et al. (2011)The dataset contains 19,446 observations of the way speakers from different Dutch localities pronounce a particular word:LD: dependent variable, which records the Levenshtein distance between the pronunciation of the speaker and standard Dutch. The Levenshtein distance measures the number of string edits that are necessary to go from one string to another. The variable was scaled and centered around zeroLocation: Data collection siteWord: Word that is being pronouncedLongitude, Latitude: Data collection site coordinatesPopCnt: Population count, centered and scaledPopAge: Mean population age of the locality, centered and scaledPopIncome: Mean population income of the locality, centered and scaledIsNoun: 1 (noun), 0 (other word)WordFreq: Word frequency, log-transformedWordLength: Word length, log-transformedlibrary(readr)library(dplyr)dataSet <- read_csv(";) %>% mutate_if(is.character, as.factor)summary(dataSet)## LD Location Word Longitude ## Min. :-0.49888 Almen Gl : 48 rijst : 422 Min. :3.438 ## 1st Qu.:-0.20164 Alphen NB : 48 schade : 421 1st Qu.:5.024 ## Median : 0.01213 Amersfoort Ut: 48 veel : 420 Median :5.735 ## Mean : 0.00000 Angerlo Gl : 48 kammen : 419 Mean :5.590 ## 3rd Qu.: 0.23721 Anloo Dr : 48 wonen : 419 3rd Qu.:6.175 ## Max. : 0.87059 Austerlitz Ut: 48 noten : 418 Max. :7.168 ## (Other) :19158 (Other):16927 ## Latitude PopCnt PopAge ## Min. :50.77 Min. :-2.6608800 Min. :-3.177204 ## 1st Qu.:51.70 1st Qu.:-0.7225091 1st Qu.:-0.601558 ## Median :52.12 Median :-0.0915865 Median :-0.044225 ## Mean :52.21 Mean : 0.0003773 Mean :-0.003588 ## 3rd Qu.:52.73 3rd Qu.: 0.6421021 3rd Qu.: 0.630994 ## Max. :53.48 Max. : 3.1117973 Max. : 3.347749 ## ## PopIncome IsNoun WordFreq ## Min. :-3.575051 Min. :0.000 Min. :-2.070331 ## 1st Qu.:-0.586729 1st Qu.:0.000 1st Qu.:-0.503609 ## Median :-0.012116 Median :0.000 Median :-0.044536 ## Mean : 0.004545 Mean :0.436 Mean :-0.009755 ## 3rd Qu.: 0.522902 3rd Qu.:1.000 3rd Qu.: 0.496171 ## Max. : 4.555094 Max. :1.000 Max. : 3.164715 ## ## WordLength ## Min. :-1.685932 ## 1st Qu.:-0.498296 ## Median : 0.422907 ## Mean : 0.005858 ## 3rd Qu.: 0.422907 ## Max. : 2.363220 ## 7. How to fit a GAM in R (2/3)GAMs can be fit in R using the mgcv packageThe main function is bamLike glm it takes three arguments:FormulaDistribution family:To fit a logistic model, we add family="binomial"To model a normally distributed dependent variable, we add family="gaussian"Data7. How to fit a GAM in R (3/3)The formula bit of a GAM is somewhat special:If you do not want to evaluate a predictor in a linear way, you wrap it with s()If you want to evaluate the joint non-linear effect of two predictors on the outcome, you include the two predictors in the brackets after s, separated by commaHere we will do this for Longitude and Latitude so we can measure the effect of geography on pronunciation distanceIf you use coordinates in a bam formula it is important that you order them as we did herelibrary(mgcv)mod <- bam(LD ~ s(Longitude, Latitude) + PopCnt + PopAge + PopIncome + IsNoun + WordFreq + WordLength, family="gaussian", dataSet)8. How to specify random intercepts and slopes in a GAM8. How to specify random intercepts and slopes in a GAMIn a GAM formula we can also specify random interceptsTo do so, we wrap the variable name with s and we add the argument bs="re" to the call to sThis way, the bam function knows that it should treat the variable as a random interceptTo include a random slope, we add another call to s and we include the intercept name and the variable name, followed by bs="re"mod <- bam(LD ~ s(Longitude, Latitude) + s(Word, bs="re") + s(Word, PopCnt, bs="re") + PopCnt + PopAge + PopIncome + IsNoun + WordFreq + WordLength, family="gaussian", dataSet)9. How to interpret a GAM9. How to interpret a GAM (1/4)The output of a bam summary has two components:Parametric coefficientsSmooth termssummary(mod)## ## Family: gaussian ## Link function: identity ## ## Formula:## LD ~ s(Longitude, Latitude) + s(Word, bs = "re") + s(Word, PopCnt, ## bs = "re") + PopCnt + PopAge + PopIncome + IsNoun + WordFreq + ## WordLength## ## Parametric coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.050416 0.035895 -1.405 0.160176 ## PopCnt -0.011451 0.002753 -4.159 3.2e-05 ***## PopAge 0.007375 0.001900 3.883 0.000104 ***## PopIncome -0.002777 0.001963 -1.415 0.157196 ## IsNoun 0.116820 0.057465 2.033 0.042079 * ## WordFreq 0.056376 0.028608 1.971 0.048783 * ## WordLength 0.030142 0.026523 1.136 0.255783 ## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Approximate significance of smooth terms:## edf Ref.df F p-value ## s(Longitude,Latitude) 26.93 28.73 122.168 < 2e-16 ***## s(Word) 43.81 44.00 233.598 < 2e-16 ***## s(Word,PopCnt) 28.31 47.00 1.545 6.98e-08 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## R-sq.(adj) = 0.446 Deviance explained = 44.9%## fREML = -427.49 Scale est. = 0.05473 n = 194469. How to interpret a GAM (2/4)The interpretation of the parametric coefficients is the same as for a glm or lm model:Estimate: regression coefficient, on the same scale as the dependent variable or on the log odds scale when we fit a logistic modelStd. Error: standard error for the regression coefficient. Large standard errors point to less certainty about the effect estimatet-value (z-value in the logistic case): test statistic value that is used to calculate the significance of the effectPr(>|t|): p-value of the effect estimate9. How to interpret a GAM (3/4)For the smooth terms we only get information on their significanceHowever, their effects on the dependent variable can be plottedOne of the main uses of GAMs in linguistics so far has been to create maps of the regions in space where you can find e.g., a more distinct pronounciation, while controlling for other variables such as population size, income, word, etc.9. How to interpret a GAM (4/4)Plotting the effects of a smooth term can be done with the vis.gam functionTo plot a map of LD over space while controlling for the other variables, we can use the following codeThe result looks like a map of The Netherlands. The red lines separate regions for which the model estimates lower/higher Levenshtein distancesvis.gam(mod, plot.type="contour", color="terrain", too.far=0.05, view=c("Longitude","Latitude"))9. How to interpret a GAM (4/4)10. How to assess the fit of a generalized additive model10. How to assess the fit of a generalized additive modelLike the summary of lm models, the summary output of bam models includes an adjusted r-squared measureThis r-squared measure is also provided for logistic modelsIn addition, for logistic models, we can calculate the c-index of concordance, using the Hmisc package#not runlibrary(Hmisc)logisticModsomers2(fitted(logisticMod), as.numeric(data$dependent)-1)11. ExercisePlease go to and perform the exercise12. ReferencesWieling, M., Nerbonne, J., & Baayen, H. R. (2011). Quantitative Social Dialectology: Explaining Linguistic Variation Geographically and Socially. Plos One. DOI: . ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download