Modeling Observational Data - Duke University
Modeling Observational Data Workshop
Mike Babyak, PhD
The Meeting of the American Psychosomatic Society
Baltimore, MD, March 2008
Key Concepts:
• Observational studies and clinical trials probably agree more often than is believed
o Biggest threat is unmeasured variable/selection bias
• Statistics is a cumulative field, with much recent progress being made by way of simulation experiments
• Use all information from variables/data
o Use imputation techniques for missing data
o Generally a bad idea to make categories out of non-categorical variables
• Statistical adjustment assumes:
o Parallelism (the covariate has the same slope in every group, i.e., no interaction with group)
o Reasonable overlap among the groups' covariate distributions
• Mediation and confounding cannot be distinguished mathematically
o Which one applies depends on your theoretical causal model
o Beware that a variable treated as a confounder may actually be a mediator
• Multivariable models with observational data assume we have all the important variables measured and available
o We rarely have enough data to support such large models
o Prespecified models are the best place to start
o Special consideration is needed when there is not enough data
▪ Combine predictors
▪ Use penalization or propensity scoring
• Unless there is an enormous amount of data, it is better to avoid automated variable selection techniques (e.g., stepwise selection)
• Be aware that univariate prescreening also biases the model
• Better to prespecify interactions
o Tests of interactions have poor power and also yield unstable estimates
o Avoid separate within-subgroup analyses
o May want to report trends in interactions, using appropriately cautious language
• We are more often in an exploratory mode than we realize
• Use “truth in advertising”: report what was actually done, including any exploratory steps
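The warning about univariate prescreening can be illustrated with a small simulation. The sketch below is not from the workshop; the sample size, number of predictors, and screening cutoff (`n`, `p`, `r_cut`) are arbitrary assumptions chosen for illustration. Every predictor is pure noise, so the true R² is zero, yet a model built from the predictors that pass the screen appears to explain variance in the sample it was selected on and does not hold up in a fresh sample.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance explained: 1 - SSE / SST."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

def one_run(rng, n=50, p=40, r_cut=0.25):
    """Screen pure-noise predictors univariately, then fit and re-test."""
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # unrelated to every column of X

    # Univariate prescreen: keep predictors whose |r| with y exceeds r_cut.
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    keep = np.abs(r) > r_cut

    # Ordinary least squares on the survivors, plus an intercept.
    design = np.column_stack([np.ones(n), X[:, keep]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fit_r2 = r_squared(y, design @ beta)

    # Apply the same fitted model to a fresh sample from the same process.
    X_new = rng.standard_normal((n, p))
    y_new = rng.standard_normal(n)
    design_new = np.column_stack([np.ones(n), X_new[:, keep]])
    new_r2 = r_squared(y_new, design_new @ beta)
    return int(keep.sum()), fit_r2, new_r2

rng = np.random.default_rng(2008)
runs = np.array([one_run(rng) for _ in range(500)])
mean_kept, mean_fit_r2, mean_new_r2 = runs.mean(axis=0)
print(f"mean predictors kept:   {mean_kept:.2f}")
print(f"mean apparent R^2:      {mean_fit_r2:.3f}")
print(f"mean R^2 in new sample: {mean_new_r2:.3f}")
```

The gap between the apparent and new-sample R² is the optimism introduced by selecting predictors on the same data used to fit the model.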
Selected references and web resources
Mike Babyak’s e-mail
michael.babyak@duke.edu
A copy of this presentation
Observational Data and Clinical Trials
Propensity Scoring
Rubin Symposium notes
Rosenbaum, P.R. and Rubin, D.B. (1984). "Reducing bias in observational studies using sub-classification on the propensity score." Journal of the American Statistical Association, 79, pp. 516-524.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference, Cambridge University Press.
Rosenbaum, P. R., and Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70, 41-55.
Missing Data and Imputation
Mediation and Confounding
MacKinnon DP, Krull JL, Lockwood CM. Equivalence of the mediation, confounding and suppression effect. Prev Sci (2000) 1:173–81
General Modeling
A nice web tutorial on some of the concepts presented today
symptomresearch.chapter_8/
Harrell FE Jr. Regression modeling strategies: with applications to linear models, logistic regression and survival analysis. New York: Springer; 2001.
Web page with many resources related to Harrell text
Sample Size in Multivariable Models
Kelley, K. & Maxwell, S. E. (2003). Sample size for Multiple Regression: Obtaining regression coefficients that are accurate, not simply significant. Psychological Methods, 8, 305–321.
Kelley, K. & Maxwell, S. E. (In press). Power and Accuracy for Omnibus and Targeted Effects: Issues of Sample Size Planning with Applications to Multiple Regression. In Handbook of Social Research Methods, J. Brannen, P. Alasuutari, and L. Bickman (Eds.). New York, NY: Sage Publications.
Green SB. How many subjects does it take to do a regression analysis? Multivar Behav Res 1991; 26: 499–510.
Peduzzi PN, Concato J, Holford TR, Feinstein AR. The importance of events per independent variable in multivariable analysis, II: accuracy and precision of regression estimates. J Clin Epidemiol 1995; 48: 1503–10
Peduzzi PN, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996; 49: 1373–9.
Dichotomization
Harrell’s dichotomization page
Cohen, J. (1983) The cost of dichotomization. Applied Psychological Measurement, 7, 249-253.
MacCallum R.C., Zhang, S., Preacher, K.J., & Rucker, D.D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19-40.
Maxwell, SE, & Delaney, HD (1993). Bivariate median splits and spurious statistical significance. Psychological Bulletin, 113, 181-190
Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine, 25, 127-141.
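The loss of information these papers describe can be checked numerically. The sketch below is an illustration, not material from the workshop: it assumes a normally distributed predictor `x`, an outcome `y` with a true correlation of 0.5 (`rho`, an arbitrary choice), and a median split. For a normal predictor, a median split is a standard case in which the correlation shrinks by a factor of sqrt(2/π), about 0.80.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 200_000, 0.5  # illustrative sample size and true correlation

# Bivariate normal data with correlation rho between x and y.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Median split: replace the continuous predictor with a 0/1 indicator.
x_split = (x > np.median(x)).astype(float)

r_full = np.corrcoef(x, y)[0, 1]
r_split = np.corrcoef(x_split, y)[0, 1]
ratio = r_split / r_full

print(f"r with continuous x:   {r_full:.3f}")
print(f"r with median split x: {r_split:.3f}")
print(f"ratio: {ratio:.3f} (theory for a normal predictor: {np.sqrt(2 / np.pi):.3f})")
```

Because the attenuation applies to the correlation, the share of variance explained falls by roughly the square of that factor, which is why dichotomizing costs so much power.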
Variable Selection
Thompson B. Stepwise regression and stepwise discriminant analysis need not apply here: a guidelines editorial. Ed Psychol Meas 1995; 55: 525–34.
Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med 1989; 8: 771–83.
Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol 1992; 45: 265–82.
Steyerberg EW, Harrell FE, Habbema JD. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making 2001; 21: 45–56.
Cohen J. Things I have learned (so far). Am Psychol 1990; 45: 1304–12.
Roecker EB. Prediction error and its estimation for subset-selected models. Technometrics 1991; 33: 459–68.
Univariate Pretesting and Transformation
Grambsch PM, O’Brien PC. The effects of preliminary tests for nonlinearity in regression. Stat Med 1991; 10: 697–709.
Faraway JJ. The cost of data analysis. J Comput Graph Stat 1992; 1: 213–29.
Validation and Penalization
Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001; 54: 774–81.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996; 58: 267–88.
Greenland S. When should epidemiologic regressions use random coefficients? Biometrics 2000; 56: 915–21.
Moons KGM, Donders ART, Steyerberg EW, Harrell FE. Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol 2004; 57: 1262–70.
Steyerberg EW, Eijkemans MJ, Habbema JD. Application of shrinkage techniques in logistic regression analysis: a case study. Stat Neerl 2001; 55:76-88.
Some simulation results relating to validation
Software
R software (free open source)
S-Plus software (commercial implementation of the S language, which R also implements, with a Windows GUI)
SAS Macros for spline estimation
SAS code for bootstrap