Modeling Observational Data - Duke University



Modeling Observational Data Workshop

Mike Babyak, PhD

The Meeting of the American Psychosomatic Society

Baltimore, MD, March 2008

Key Concepts:

• Observational studies and clinical trials probably agree more often than is believed

o Biggest threat is unmeasured variable/selection bias

• Statistics is a cumulative field, with much recent progress being made by way of simulation experiments

• Use all information from variables/data

o Use imputation techniques for missing data

o Generally a bad idea to make categories out of non-categorical variables

• Statistical adjustment assumes:

o Parallelism

o Reasonable overlap among distributions

• Mediation and confounding cannot be distinguished mathematically

o Depends on your theoretical causal model

o Beware that confounders can actually be mediators

• Multivariable models with observational data assume that all the important variables have been measured and are available

o We rarely have enough data to support such large models

o Prespecified models are the best place to start

o Special consideration is needed when there is not enough data

▪ Combine predictors

▪ Use penalization, propensity scoring

• Unless there is an enormous amount of data, better to avoid automated variable selection techniques

• Be aware that univariate prescreening also biases the model

• Better to prespecify interactions

o Poor power, but also unstable estimates

o Avoid separate within-subgroup analyses

o May want to report trends in interactions, using appropriately cautious language

• We are more often in an exploratory mode than we realize

• Use “truth in advertising”

Selected references and web resources

Mike Babyak’s e-mail

michael.babyak@duke.edu

A copy of this presentation



Observational Data and Clinical Trials





Propensity Scoring

Rubin Symposium notes



Rosenbaum, P.R. and Rubin, D.B. (1984). "Reducing bias in observational studies using sub-classification on the propensity score." Journal of the American Statistical Association, 79, pp. 516-524.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference, Cambridge University Press.

Rosenbaum, P.R. and Rubin, D.B. (1983). "The central role of the propensity score in observational studies for causal effects." Biometrika, 70, pp. 41-55.
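
The subclassification idea in Rosenbaum and Rubin (1984) can be sketched with a small simulation. This is only an illustration under stated assumptions: the data are simulated, the effect sizes are made up, and the true propensity score is known by construction. In real data the score would have to be estimated (for example with logistic regression), and unmeasured confounders would remain a threat.

```python
import math
import random

def simulate(n=20000, true_effect=1.0, seed=1):
    """Simulate a confounded observational study (all values hypothetical)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)                    # measured confounder
        ps = 1 / (1 + math.exp(-x))            # true propensity score P(treated | x)
        t = 1 if rng.random() < ps else 0      # treatment assignment depends on x
        y = 2 * x + true_effect * t + rng.gauss(0, 1)
        rows.append((ps, t, y))
    return rows

def mean(values):
    return sum(values) / len(values)

def naive_difference(rows):
    """Unadjusted treated-minus-control difference in mean outcome."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def subclassified_difference(rows, n_strata=5):
    """Average the within-stratum differences over propensity score strata."""
    ordered = sorted(rows)                     # tuples sort by propensity score first
    size = len(ordered) // n_strata
    total, weight = 0.0, 0
    for k in range(n_strata):
        start = k * size
        end = (k + 1) * size if k < n_strata - 1 else len(ordered)
        stratum = ordered[start:end]
        treated = [y for _, t, y in stratum if t == 1]
        control = [y for _, t, y in stratum if t == 0]
        if treated and control:                # skip strata with no overlap
            total += len(stratum) * (mean(treated) - mean(control))
            weight += len(stratum)
    return total / weight

rows = simulate()
naive = naive_difference(rows)
adjusted = subclassified_difference(rows)
print(f"true effect 1.00, naive {naive:.2f}, subclassified {adjusted:.2f}")
```

The subclassified estimate should land much closer to the simulated effect than the naive one, although some residual bias within strata is expected with only five subclasses.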

Missing Data and Imputation





Mediation and Confounding

MacKinnon DP, Krull JL, Lockwood CM. Equivalence of the mediation, confounding and suppression effect. Prev Sci 2000; 1: 173–81.
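
The key concept that mediation and confounding cannot be distinguished mathematically can be illustrated with a short sketch. It assumes simulated data with arbitrary coefficients and ordinary least squares with one or two predictors. The same arithmetic is applied to data in which z is generated as a mediator and to data in which z is generated as a confounder; in both cases the change in the x coefficient after adjusting for z equals the product a*b.

```python
import random

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def slope(x, y):
    """OLS slope of y on a single predictor x."""
    return cov(x, y) / cov(x, x)

def slopes2(x, z, y):
    """OLS slopes of y on x and z jointly (closed form for two predictors)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

def adjustment_change(x, z, y):
    """Return (change in x coefficient when z is added, a*b)."""
    total = slope(x, y)               # x coefficient without z in the model
    direct, b = slopes2(x, z, y)      # x and z coefficients with z in the model
    a = slope(x, z)                   # slope of z regressed on x
    return total - direct, a * b

rng = random.Random(0)
n = 2000

# Case 1: z generated as a mediator (x -> z -> y)
x1 = [rng.gauss(0, 1) for _ in range(n)]
z1 = [0.6 * xi + rng.gauss(0, 1) for xi in x1]
y1 = [0.3 * xi + 0.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x1, z1)]

# Case 2: z generated as a confounder (z -> x and z -> y)
z2 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.6 * zi + rng.gauss(0, 1) for zi in z2]
y2 = [0.3 * xi + 0.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x2, z2)]

med_change, med_ab = adjustment_change(x1, z1, y1)
conf_change, conf_ab = adjustment_change(x2, z2, y2)
print(f"mediator data:   change {med_change:.4f}, a*b {med_ab:.4f}")
print(f"confounder data: change {conf_change:.4f}, a*b {conf_ab:.4f}")
```

Nothing in the computation knows which causal story produced the data, which is why the interpretation has to come from the theoretical causal model.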

General Modeling

A nice web tutorial on some of the concepts presented today

symptomresearch.chapter_8/

Harrell FE Jr. Regression modeling strategies: with applications to linear models, logistic regression and survival analysis. New York: Springer; 2001.

Web page with many resources related to Harrell text



Sample Size in Multivariable Models

Kelley, K. & Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. Psychological Methods, 8, 305–321.

Kelley, K. & Maxwell, S. E. (In press). Power and Accuracy for Omnibus and Targeted Effects: Issues of Sample Size Planning with Applications to Multiple Regression. In J. Brannen, P. Alasuutari, and L. Bickman (Eds.), Handbook of Social Research Methods. New York, NY: Sage Publications.

Green SB. How many subjects does it take to do a regression analysis? Multivar Behav Res 1991; 26: 499–510.

Peduzzi PN, Concato J, Holford TR, Feinstein AR. The importance of events per independent variable in multivariable analysis, II: accuracy and precision of regression estimates. J Clin Epidemiol 1995; 48: 1503–10.

Peduzzi PN, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996; 49: 1373–9.
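
As a rough planning aid, the rules of thumb usually attributed to these papers can be written as simple arithmetic. The sketch below assumes the common reading of Peduzzi et al. (about 10 or more events per candidate parameter for logistic models) and of Green (N of at least 50 + 8m to test the overall model and 104 + m to test individual predictors, with m predictors). The function names and the example numbers are hypothetical, and these thresholds are guidelines rather than guarantees.

```python
def events_per_variable(n_events, n_nonevents, n_parameters):
    """Events per candidate parameter, using the less frequent outcome."""
    limiting = min(n_events, n_nonevents)
    return limiting / n_parameters

def green_minimum_n(n_predictors):
    """Rule-of-thumb minimum sample sizes for multiple regression."""
    return {
        "overall_model": 50 + 8 * n_predictors,
        "individual_predictors": 104 + n_predictors,
    }

# Hypothetical study: 60 events among 1000 patients, 12 candidate parameters
epv = events_per_variable(60, 940, 12)
print(f"events per variable: {epv:.1f}")       # well under the usual guideline of 10
print(green_minimum_n(6))
```

Candidate parameters should count everything considered during modeling, including terms that were screened out, not only those in the final model.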

Dichotomization

Harrell’s dichotomization page



Cohen, J. (1983) The cost of dichotomization. Applied Psychological Measurement, 7, 249-253.

MacCallum R.C., Zhang, S., Preacher, K.J., & Rucker, D.D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19-40.

Maxwell, S. E., & Delaney, H. D. (1993). Bivariate median splits and spurious statistical significance. Psychological Bulletin, 113, 181-190.

Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine, 25, 127-141.
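
The cost of making categories out of a continuous variable can be seen in a small simulation. This is a sketch with simulated, roughly bivariate normal data and an arbitrary effect size: the same predictor is correlated with the outcome once in its continuous form and once after a median split.

```python
import random

def pearson(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

rng = random.Random(42)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

median = sorted(x)[n // 2]
x_split = [1 if xi > median else 0 for xi in x]   # median split: "high" vs "low"

r_continuous = pearson(x, y)
r_split = pearson(x_split, y)
print(f"continuous r = {r_continuous:.3f}, median-split r = {r_split:.3f}")
print(f"ratio = {r_split / r_continuous:.2f}")
```

Under bivariate normality the ratio is expected to be near 0.8, which amounts to discarding a substantial share of the sample.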

Variable Selection

Thompson B. Stepwise regression and stepwise discriminant analysis need not apply here: a guidelines editorial. Ed Psychol Meas 1995; 55: 525–34.

Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med 1989; 8: 771–83.

Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol 1992; 45: 265–82.

Steyerberg EW, Harrell FE, Habbema JD. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making 2001; 21: 45–56.

Cohen J. Things I have learned (so far). Am Psychol 1990; 45: 1304–12.

Roecker EB. Prediction error and its estimation for subset-selected models. Technometrics 1991; 33: 459–68.

Univariate Pretesting and Transformation

Grambsch PM, O’Brien PC. The effects of preliminary tests for nonlinearity in regression. Stat Med 1991; 10: 697–709.

Faraway JJ. The cost of data analysis. J Comput Graph Stat 1992; 1: 213–29.
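
Why screening predictors one at a time misleads can be shown with pure noise. The sketch below assumes simulated data in which no predictor is related to the outcome, and it uses the Fisher z approximation for the p-value of a correlation (an approximation chosen here to keep the example dependency-free). It counts how many noise predictors pass a p < 0.05 screen.

```python
import math
import random

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def p_value(r, n):
    """Approximate two-sided p-value for a correlation (Fisher z)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def screened_noise_count(n_obs, n_noise, alpha, rng):
    """Number of pure-noise predictors that pass a univariate screen."""
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    count = 0
    for _ in range(n_noise):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]   # unrelated to y
        if p_value(pearson(x, y), n_obs) < alpha:
            count += 1
    return count

rng = random.Random(7)
counts = [screened_noise_count(50, 40, 0.05, rng) for _ in range(50)]
mean_false_hits = sum(counts) / len(counts)
share_with_false_hit = sum(c > 0 for c in counts) / len(counts)
print(f"mean noise predictors passing the screen: {mean_false_hits:.2f}")
print(f"share of datasets with at least one: {share_with_false_hit:.2f}")
```

Any model built only from the survivors of such a screen inherits their overstated effects, so its p-values and confidence intervals no longer mean what they claim.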

Validation and Penalization

Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001; 54: 774–81.

Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996; 58: 267–88.

Greenland S. When should epidemiologic regressions use random coefficients? Biometrics 2000; 56: 915–21.

Moons KGM, Donders ART, Steyerberg EW, Harrell FE. Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol 2004; 57: 1262–70.

Steyerberg EW, Eijkemans MJ, Habbema JD. Application of shrinkage techniques in logistic regression analysis: a case study. Stat Neerl 2001; 55:76-88.

Some simulation results relating to validation



Software

R software (free open source)



S-Plus software (commercial implementation of the S language, closely related to R, with a Windows GUI)



SAS Macros for spline estimation



SAS code for bootstrap


