PA 765 STUDY GUIDE FOR THE FINALS

I hope this study guide is useful, but I do not guarantee that it hits every item that will appear on the finals. Still, you should be in good shape if you feel familiar with the topics and questions listed. Also, unlike the midterm, there is a finals review session in class. Note that R topics were added and PLS was dropped starting Spring 2020.

- Best wishes, Dave Garson

COVERAGE

Topics covered in class or in the readings for the finals portion of PA 765, as per syllabus:

General linear models (the ANOVA family)
Binary logistic regression
Multinomial logistic regression
Introduction to R and RStudio
Statistical Analytics with R
Classification and Regression Trees with R
Random Forests in R

These major topics break down into subtopics. Can you explain what each is referring to?

GLM / ANOVA FAMILY

assumptions of anova
general linear model (GLM)
anova, ancova, manova, mancova
factors vs. covariates
homogeneity of variance; Levene's test
one-way anova
two-way anova
random effects models
within- and between-group designs
balanced designs
repeated measures
effect size measures
multiple comparison tests
post hoc tests
multiple range tests
contrast tests

BINARY LOGISTIC REGRESSION

assumptions
linearity in the logit
logits
odds
odds ratios, exp(b)
information theory measures, AIC, BIC
interpreting parameter estimates
likelihood ratio test
Box-Tidwell test
Hosmer-Lemeshow test
significance
maximum likelihood
classification table
pseudo R-square
reference categories
probability (marginal) analysis
residual analysis

MULTINOMIAL LOGISTIC REGRESSION

binomial vs. multinomial
number of dependent variables
significance tests in multinomial regression

INTRODUCTION TO R

CRAN
RStudio
Data formats that can be used in R
The assignment operator ("<-")
library()
require()
full versus simple (short) variable names
help()
User Library
sos package
missing value symbol in R
case sensitivity in R
the c() function
assigning value labels in R
the haven package
View()
class()
na.omit()
vectors, factors, and data frames
lapply()
quote marks in R
clearing datasets from the R environment
swirl()

STATISTICAL ANALYTICS WITH R

.csv format
Output objects
Assignment operator (<-)
Comment lines (#)
Use of the attach() command
Standardizing variables with scale()
Use of the head() command
Computing the mean or standard deviation for an object
Use of the aggregate() command
Use of the t.test() command
Use of the leveneTest() command
The DV in log-linear models
Use of the lm() command
Dendrograms
Use of the alpha() command
Use of the clusplot() command
Use of the aov() command
Use of the factor() command
Use of MANOVA
Fitted values
Deviance and its synonyms
ROC analysis
Confusion table and accuracy
The rprocessR package
R command for generalized linear models (GZLM)
Purpose of gamma regression
Use of the relevel() command
Distribution of residuals in a well-fitting model
Canonical family and link for OLS regression
Canonical family and link for binary logistic regression
Tukey test
Estimated marginal means
Type of data for poisson regression
Type of model for overdispersed count data
Leading R command for multilevel modeling
Leading R command for panel data regression (PDR)
Model= and effect= options in PDR
Fixed effect and two-way fixed effects models in PDR
Variance effects (components) table for random effects in PDR
Idiosyncratic effects in a PDR random effects model
Effect on variance effects in PDR random effects models of adding or subtracting fixed effects
Use of the gee() command in terms of PDR
Use of the Hausman test in PDR

CLASSIFICATION AND REGRESSION TREES WITH R

classification tree
regression tree
nodes
terminal node
leaves
bucket
constellations versus causal sequences
equifinality
re-use of predictors in trees
root node
revealing heterogeneity
interaction effect in decision trees
data level for decision trees
are decision trees nonparametric?
nonlinearity in decision trees
variable selection in decision trees
ensemble methods
cross-validation
sample size for decision trees
pruning
GINI index
information gain
classification error rate
recursivity in decision trees
CART
rpart()
ctree()
training and validation sets
data frames
sample()
names()
summary()
library() or require()
set.seed()
View()
rpart() syntax for listing the dv and ivs
the assignment operator ("<-")
plot() and text()
the confusion matrix
the minbucket setting
listing factor levels
help(rpart) versus library(help="rpart")
the cex setting
reading correct and incorrect classifications from the tree diagram
prp()
fancyRpartPlot()
the CP table and the meaning of its rows
rel error
xerror
rsq
variable importance list
root node error
table() in listing observations
class()
accuracy
sensitivity
specificity
ROC plot
AUC
lift plots
gains plots
rate of positive predictions (RPP)
true positive rate (TPR)
precision/recall plots
method = "anova" in rpart()
configurational analysis
equifinality
assignment of predicted values in regression trees
object elements and how to display them
residuals()
p-values as a splitting criterion in ctree()

RANDOM FORESTS WITH R

ensemble methods
classification vs. regression forests
uses of random forests
categorical variables in random forests
random forests and being nonparametric
random forests and nonlinearity
random forests and multicollinearity
random forests and equifinality
random factors in random forest solutions
graphing trees in random forest solutions
cross-validation, training sets, OOB
the mtry option in random forests
the ntree option in random forests
the minbucket option in random forests
obtaining predicted values with the randomForest package
elements in output objects in R
error rate plots in random forest solutions
confusion tables
MeanDecreaseAccuracy
MeanDecreaseGini
Mean square error
Variable importance plots for random forest solutions
Proximity coefficients in random forest solutions
Multidimensional scaling (MDS) plots for random forest solutions
MDSplot()
Class center plots
Tuning a random forest model
The uses of the randomForestExplainer package
Conditional inference trees
attach() and detach() in R

A NON-COMPREHENSIVE LIST OF QUESTIONS TO REVIEW FOR THE FINAL EXAM

What is the difference between ANOVA and ANCOVA?
What is the difference between ANOVA and MANOVA?
What is Levene's test for?
Explain the use of F tests in MANOVA.
Explain the use of post hoc tests in MANOVA.
What link function is used in logistic regression?
What is the main effect size measure in logistic regression?
In logistic regression, what is the EXP(B) function for?
In a well-fitting logistic model, should the Hosmer-Lemeshow test be significant or non-significant?
What is the Box-Tidwell test for in logistic regression?
In the SPSS world, what is the most popular pseudo-R-squared measure?
In multinomial logistic regression, what is the default reference category for the dependent variable?
What is another label for the model chi-square test in multinomial logistic regression?

BELOW ARE QUESTIONS POSED BY STUDENTS BY EMAIL IN PREVIOUS SEMESTERS FOR FINALS REVIEW FOR PA 765, WITH ANSWERS

Send your finals-related questions to garson@ncsu.edu prior to the final exam.

> 1) I have in my notes something about Probap? What is it? And when would we use it?

ANSWER: Possibly you mean probit. Probit is similar to binary logistic regression in that both are for binary outcome variables. Probit is used when the binary variable has an underlying normal distribution, such as high vs. low income. You cannot exponentiate probit b coefficients to get odds ratios, so probit output is more difficult to interpret.
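As a minimal R sketch of the contrast, both links can be fit with glm(); the data frame dat and the variables pass (coded 0/1) and income are hypothetical:

    # Both links fit a binary outcome; only logit b's exponentiate to odds ratios
    logit_mod  <- glm(pass ~ income, data = dat, family = binomial(link = "logit"))
    probit_mod <- glm(pass ~ income, data = dat, family = binomial(link = "probit"))
    exp(coef(logit_mod))   # logit b's exponentiate to odds ratios
    coef(probit_mod)       # probit b's do not; they are shifts on the standard-normal scale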
> 2) Can you help distinguish the difference between log linear and logit?

ANSWER: What is called logit regression in some packages is called logistic regression in others; they are equivalent. Log linear analysis, on the other hand, is a non-dependent logistic procedure. Non-dependent means there is no dependent variable. Rather, the purpose is to find the smallest number of predictor main and interaction effects which will explain the distribution of the cell counts in a table.

> 3) Are maximum likelihood, -2LL, deviance, and model chi-square statistics all the same thing? I remember you saying these terms were interchangeable.

ANSWER: -2LL, deviance, and model chi-square are all the same thing. They are a measure of error in a model estimated by maximum likelihood estimation (as opposed, for example, to OLS estimation). ML produces the likelihood (L); when its log is multiplied by -2 to get -2LL, the result conforms to the chi-square distribution and thus can be used to compute significance (p values).

> 4) How is the Hosmer-Lemeshow test different than the likelihood ratio test? Don't they both measure the overall significance of the model?

ANSWER: Yes, both measure the overall significance of the model, but in different ways, employing different criteria. When applied to the model and not just a particular variable, the likelihood ratio test compares the deviance in the researcher's model with the deviance in a baseline model, usually the null model (intercept-only model). If the researcher's model is significant by this test, then it is reducing error compared to the baseline model. SPSS calls this the omnibus test (look in the model row, chi-square column). The Hosmer-Lemeshow (note spelling) test divides the sample into g groups, where g is usually 10. For each group it compares observed values with expected (predicted) values. It uses the usual chi-square approach: compute O-E, square it, divide by E, sum across all groups, and look up the p value in a chi-square table using g-2 degrees of freedom to see if the model is significant. We want non-significance, indicating our predictions (E) are not very far from actual values (O). In general, the likelihood ratio test is preferred over the Hosmer-Lemeshow test, but the Hosmer-Lemeshow test is preferred over classification tables. Still, if one test finds the model significant and the other test finds it not significant, the take-away is that one may have a weak, marginal model.
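A minimal R sketch of both tests; dat, pass, income, and gender are hypothetical, and the Hosmer-Lemeshow line assumes the ResourceSelection package is installed:

    mod  <- glm(pass ~ income + gender, data = dat, family = binomial)
    null <- glm(pass ~ 1, data = dat, family = binomial)   # intercept-only baseline
    deviance(mod)                      # -2LL (deviance) of the researcher's model
    anova(null, mod, test = "Chisq")   # likelihood ratio ("omnibus") test: want significance
    library(ResourceSelection)         # assumed package providing hoslem.test()
    hoslem.test(mod$y, fitted(mod), g = 10)   # Hosmer-Lemeshow: want NON-significance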
> 5) What is the difference between "odds" and "odds ratio"? I'm getting confused about all of the different terminology. And what does effect size measure have to do with them?

ANSWER: Probability is the chance an event occurs. In binary logistic regression, it is the chance the DV = 1. The odds is the probability of an event occurring (p) divided by the probability of it not occurring (1-p). The odds ratio is the ratio of the odds for one group compared to the odds for the other group. Logistic b coefficients can be made into odds ratios: raise the natural log base e to the power of b to get the odds ratio. Consider gender as a predictor with male as the reference category, and passing a test as the binary outcome variable, where 1 = passing. If the odds ratio for gender = 1.2, then we can say that being female rather than male multiplies the odds of passing by a factor of 1.2. For any relationship we would like to know if it is significant (there is less than a .05 chance we would get a relationship that strong or stronger just by the chance of taking another sample) and also if it is important. Importance is typically gauged by some measure which varies from 0 to 1, like R-squared; in the case of odds ratios, however, 1.0 is the no-effect level, so importance is the degree to which the odds ratio is above (positive effect) or below (negative effect) 1.0. "Effect size" refers to the value on some measure of importance.

> 6) Yesterday we discussed that the highest values are by default the reference category in logistic regression. In which system or analysis does the default take the lowest value? I remember being confused about this when we talked about logistic regression a couple weeks ago.

ANSWER: Statistics packages vary in whether the highest or lowest coded category is the reference category on the DV and/or IV sides. For SPSS, see the charts on pages 24 and 33 in the "Logistic Regression" digital book. For SPSS on the IV side, the highest is the reference category for factors in either binary or multinomial logistic regression, but if variables are entered as covariates, then the lowest is the reference category. For SPSS on the DV side for binary logistic regression, the lowest (0) is the reference and the highest (1) is predicted. For SPSS on the DV side for multinomial logistic regression, however, the highest is the reference category. Most packages let you select which is to be the reference category, but SPSS does not allow this for the DV side in binary logistic regression, though it does allow change on the DV side for multinomial logistic regression. If this seems confusing, it is. The take-away is that whatever package you are using, be careful you understand what the reference category is on both the IV and DV sides. Usually it will be the highest-coded category, but not always. Stata, for instance, on the DV side for multinomial logistic regression uses the most frequent category as the default reference level.

> 7) Can we assume model specification in logistic regression, as we do in OLS regression?

ANSWER: "Model specification" refers to properly specifying the DV and the IVs. Ideally, the true causes are in the IV list and spurious causes are omitted. Ideal model specification is rarely attained. Examining the literature to see what IVs have been identified by other researchers is standard practice. Be aware that adding a previously-omitted important IV will change all the coefficients. Likewise, adding a variable which is a spurious cause will change all the coefficients. Also, dropping IVs can have the same effect. This is why I recommend having two or more models and making the research purpose to determine which model fits the data best (as opposed to seeking the one "correct" model). Model specification is an issue for most statistical procedures, including both logistic and OLS regression. It is an assumption in the sense that any procedure will be correct only to the extent that the model is properly specified. Instrumental variables regression is one method of attempting to deal with unmeasured variables (the endogeneity problem discussed in class).
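In R, a minimal sketch of setting the reference category explicitly and reading odds ratios (dat, pass, and gender are hypothetical):

    dat$gender <- relevel(factor(dat$gender), ref = "male")  # make male the reference category
    mod <- glm(pass ~ gender, data = dat, family = binomial)
    exp(coef(mod))   # e^b = odds ratio; e.g., 1.2 means female's odds of passing are 1.2 times male's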
> 8) What does Bonferroni adjustment mean?

ANSWER: When you have multiple independent significance tests, the actual alpha significance level drops. For instance, with one test at the .05 level, the significance is actually what the output says it is. But if you have a second independent test, then .05 will no longer be good enough. The output would need to report .025 to have an actual .05 level of significance. In general, assuming you are seeking an actual .05 level, the output should report .05/k, where k is the number of independent tests. For instance, with 5 independent tests, you would need .05/5 = .01 as the reported significance level to have an actual .05 significance level. This is the Bonferroni adjustment. There are other adjustments than Bonferroni which also penalize for multiple independent tests.
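A minimal R sketch of the same arithmetic (the raw p values are hypothetical); p.adjust() applies the correction by scaling the p values up rather than the cutoff down:

    p_raw <- c(0.040, 0.010, 0.030, 0.200, 0.008)   # five hypothetical raw p values
    p.adjust(p_raw, method = "bonferroni")          # each p multiplied by k = 5, capped at 1
    0.05 / length(p_raw)                            # equivalent adjusted alpha cutoff: .01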
> 9) When do you use MANOVA (or MANCOVA) over binary logistic or multinomial logistic regression? I am confused about the differences and when you would use GLM over logistic regression.

ANSWER: In general, multiple procedures might be used in connection with the same data problem. However, here are some points:

A. MANOVA and MANCOVA are used when one has multiple DVs of a continuous measurement nature which you want to analyze as a set. GLM Univariate and logistic regression are for one DV. GLM Univariate wants a DV measured at a continuous level. Logistic regression wants either a binary DV (binary logistic regression) or a categorical DV (multinomial logistic regression). If the categories are ordered, then one wants ordinal logistic regression.

B. A binary variable may be considered either a categorical variable or a continuous variable. However, GLM assumes that the DV has a normal distribution, which is impossible with a binary variable. Therefore, for a binary DV, binary logistic regression is normally used. Discriminant function analysis is an alternative which has more statistical power than binary logistic regression if all the assumptions of OLS regression are met.

C. One would not use binary logistic regression with a continuous DV. For a continuous DV, GLM might be used (indeed, OLS regression models are part of the GLM family), though other types of regression might be appropriate, such as the types found in GZLM (generalized linear models), for example gamma regression (used for skewed DVs).

> 10) Can you explain the use of post-hoc tests in MANOVA?

ANSWER: Post-hoc tests are "post" in the sense that they should only be examined after the model as a whole has been found to be significant by the F test. In SPSS output, the "Tests of Between-Subjects Effects" table provides this F test of the significance of the model overall (the "omnibus test", in the "Corrected Model" row). There are many post-hoc tests for testing the difference in means between levels of a factor (e.g., males vs. females). The Bonferroni test is one of the most common. In SPSS, one may select multiple tests. One selects a categorical variable (e.g., race = white, black, other). For the selected variable, the "Post Hoc Tests: Multiple Comparisons" table will test the differences in means for each categorical pairing (e.g., white vs. black, white vs. other, etc.) for each of the selected tests (e.g., Bonferroni, Tukey HSD) for each of the dependent variables (recall MANOVA is for two or more DVs). The Bonferroni and other tests adjust for the fact that the post hoc tests are a series of independent significance tests. There are also "pairwise comparison" tests which may also be done post hoc. These also give comparisons of the difference between levels of a categorical variable with respect to each DV. The difference from the post-hoc table is that what is tested, at least as SPSS uses the term, is the difference in estimated marginal means (EMM). Profile plots give this information in graphical form. There are also "contrast tests", which are custom univariate and multivariate tests of particular hypotheses of interest to the researcher, also done on a post hoc basis.

> 11) Can you expand on some of the differences between AIC/BIC and when we would use them?

ANSWER: Logistic regression is based on maximum likelihood (ML) estimation, which generates a measure of model error called the likelihood (L). The likelihood is transformed into -2 log likelihood, called the deviance or model chi-square, because this transformation of L conforms to a chi-square distribution and thus can be used for significance testing. The -2LL value has no intrinsic meaning. Rather, it is used to compare models, with lower being less error. The comparison is done using a likelihood ratio test. The LR test can only be used when the smaller model is nested within the larger model being compared, meaning all terms in the smaller model also appear in the larger model. The null model, for instance, is always nested within the researcher's model. However, the LR test cannot be used for non-nested comparisons. In order to compare two models which are not nested, a penalty term must be added to -2LL, making it larger (suggestive of more error). AIC adds a smaller penalty term than does the more conservative BIC coefficient. There are other penalty measures, called information criterion measures. For non-nested comparisons, one uses AIC or BIC instead of -2LL, but lower is still better.

> 12) Are there any reasons to use GLM over a standard regression (OLS)?

ANSWER: GLM is equivalent to OLS regression. One will get the same b coefficients either way. Both require a continuous DV. To run a basic regression model in GLM, just enter only covariates (continuous variables) as predictors. If one has categorical predictors (factors), in OLS regression these must be made into sets of dummy variables, leaving out one level. The GLM algorithm does this conversion behind the scenes for you, making it easier. Also, the output will differ between the OLS regression module and the GLM module. For instance, GLM will give you eta-squared, which is a nonlinear version of R-squared. The choice is largely a matter of preference and the prevailing practice in one's field.

> 13) Can you elaborate on the concept of "over-fitting" a model, and why it's a concern?

ANSWER: Fitting refers to fitted values, which are the estimates. Over-fitting refers to getting the best fit values but doing so by fitting even the noise in the data. An over-fitted model may well not generalize well to other data. One common strategy is to employ cross-validation, which is done by developing the model on a development dataset (e.g., even-numbered observations) and testing it on a validation set (e.g., odd-numbered observations). Some cross-validation procedures use a large number of validation sets (e.g., by leaving out one observation each time), then average the results.
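A minimal R sketch of the simple split-sample version of cross-validation just described; dat and the model pass ~ income are hypothetical:

    set.seed(123)                                        # reproducible split
    dev_rows <- sample(nrow(dat), floor(nrow(dat) / 2))  # development half
    dev_set  <- dat[dev_rows, ]
    val_set  <- dat[-dev_rows, ]                         # held-out validation half
    mod  <- glm(pass ~ income, data = dev_set, family = binomial)
    pred <- predict(mod, newdata = val_set, type = "response")
    mean((pred > 0.5) == val_set$pass)                   # hold-out accuracy (pass coded 0/1)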
> 14) Do we need to review log odds, or just odds ratios?

ANSWER: The important one is odds ratios, which are the main variable-level effect size measure for logistic models. The odds ratio is the factor by which the odds of getting the target value of the DV (typically 1 in binary logistic regression) are multiplied. Thus if the odds ratio for income as a predictor is .82, then the odds that the DV = 1 are multiplied by .82 when income goes up 1 unit. Odds ratios below 1.0 reduce the odds, 1.0 is no effect, and odds ratios above 1.0 increase the odds. What one is predicting in logistic regression is not the raw value of the DV but rather the log odds of the DV, called the logit. The log odds is the log of the odds, where the odds in binary logistic models are the probability the DV = 1 divided by the probability the DV = 0.

> 15) Can you elaborate on why violating the assumption of linearity in the logit is problematic, and what effects it will have?

ANSWER: Logistic regression is part of the generalized linear models (GZLM) family, implying that the predictor side of the equation is assumed to be linearly related to the logit of the outcome (dependent) variable. The Box-Tidwell test of linearity in the logit is used primarily with interval-level covariates. The logit step test is used primarily with ordinal-level covariates. Violating this assumption is a form of measurement error. Therefore the effect is to inflate standard errors, thereby making significance tests invalid. One will make more Type 2 errors (false negatives) and, equivalently, tests will lack statistical power.
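A minimal R sketch of the usual manual form of the Box-Tidwell check (dat, pass, and income are hypothetical; income must be strictly positive): add the covariate multiplied by its own natural log and see whether that term is significant:

    dat$inc_ln <- dat$income * log(dat$income)   # Box-Tidwell term; requires income > 0
    bt <- glm(pass ~ income + inc_ln, data = dat, family = binomial)
    summary(bt)   # a significant inc_ln coefficient signals nonlinearity in the logit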
> 16) Do we need to know which stats packages place which numbers as reference categories?

ANSWER: You should know at least two things: (1) the reference category may be handled differently by different packages for the IV and/or the DV, and knowing what the reference category is, is absolutely crucial to making correct inference statements; and (2) most times, the highest-coded category is the reference category (e.g., in multinomial logistic regression the default reference category of the DV is the highest-coded category).

> 17) What are the differences between the likelihood ratio test and hosmer-lemershow?

ANSWER: First, it is the Hosmer-Lemeshow test. Both the LR test and the H-L test may be used to test the overall significance of a logistic model. The LR test is based on the difference in -2LL between the researcher's model and the null model, with a finding of significance indicating a good model. The H-L test divides the sample into portions (usually 10 deciles) and compares observed and expected (predicted) values of the DV within each decile, then uses a type of averaging to get a whole-model result for which a finding of non-significance indicates a good model. Thus the L-R test would show that error in the researcher's model is significantly less than in the null model, whereas the H-L test would show that model-predicted values are not significantly different from observed values. These are two different meanings of "model significance" and conceivably the researcher might want to use both. The L-R test is much more commonly used, but some feel the H-L test is more meaningful since it looks at how well the model is working within each decile and, moreover, it may be somewhat trivial just to find that the researcher's model with predictors works better than the null model without predictors.

> 18) Are the omnibus test and the likelihood ratio test the same thing? Do they both measure the goodness of fit of the model? And how is the omnibus test different than the Hosmer-Lemeshow test?

ANSWER: See the two previous review question answers, which dealt with the L-R and H-L tests. Yes, the two are different, as explained in the answer to Question 17. The L-R test is the same as what SPSS calls the "omnibus test" in the "Model" row of the "Omnibus Tests of Model Coefficients" table.

> 19) What is the difference between the Wald test and the likelihood ratio test? Do they both measure the significance of each predictor variable?

ANSWER: The L-R test can be used at either the model or the variable level. At the model level it usually compares the researcher's model to the null model. At the variable level it compares models with and without a given predictor variable of interest. Wald tests are variable-level tests of the significance of each predictor variable. When Wald and L-R tests conflict in their significance findings, L-R tests are usually preferred due to certain biases of the Wald test. For instance, the Wald test assumes estimates are asymptotically (large sample) normally distributed, which is not always the case.

> 20) What does maximum likelihood estimation have to do with p-values?

ANSWER: ML is a method of estimating the b coefficients, just as OLS is another method. After the b coefficients are estimated under either method, one can apply a significance test to them. In logistic regression, the logistic b coefficients are estimated by ML. The Wald test, discussed in Question 19, is one method of computing the significance of these b coefficients.
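A minimal R sketch showing both variable-level tests side by side (dat, pass, income, and gender are hypothetical):

    mod <- glm(pass ~ income + gender, data = dat, family = binomial)
    summary(mod)              # Wald z tests of each b coefficient
    drop1(mod, test = "LRT")  # LR test per term: model with vs. without that term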