Advanced Data Analysis from an Elementary Point of View


Cosma Rohilla Shalizi


For my parents and in memory of my grandparents

Contents

Introduction
To the Reader
Concepts You Should Know

Part I Regression and Its Generalizations


1 Regression Basics
1.1 Statistics, Data Analysis, Regression
1.2 Guessing the Value of a Random Variable
1.3 The Regression Function
1.4 Estimating the Regression Function
1.5 Linear Smoothers
1.6 Further Reading
Exercises

2 The Truth about Linear Regression
2.1 Optimal Linear Prediction: Multiple Variables
2.2 Shifting Distributions, Omitted Variables, and Transformations
2.3 Adding Probabilistic Assumptions
2.4 Linear Regression Is Not the Philosopher's Stone
2.5 Further Reading
Exercises

3 Model Evaluation
3.1 What Are Statistical Models For?
3.2 Errors, In and Out of Sample
3.3 Over-Fitting and Model Selection
3.4 Cross-Validation
3.5 Warnings
3.6 Further Reading
Exercises

4 Smoothing in Regression
4.1 How Much Should We Smooth?


15:21 Sunday 21st March, 2021 Copyright © Cosma Rohilla Shalizi; do not distribute without permission updates at


4.2 Adapting to Unknown Roughness
4.3 Kernel Regression with Multiple Inputs
4.4 Interpreting Smoothers: Plots
4.5 Average Predictive Comparisons
4.6 Computational Advice: npreg
4.7 Further Reading
Exercises

5 Simulation
5.1 What Is a Simulation?
5.2 How Do We Simulate Stochastic Models?
5.3 Repeating Simulations
5.4 Why Simulate?
5.5 Further Reading
Exercises

6 The Bootstrap
6.1 Stochastic Models, Uncertainty, Sampling Distributions
6.2 The Bootstrap Principle
6.3 Resampling
6.4 Bootstrapping Regression Models
6.5 Bootstrap with Dependent Data
6.6 Confidence Bands for Nonparametric Regression
6.7 Things Bootstrapping Does Poorly
6.8 Which Bootstrap When?
6.9 Further Reading
Exercises

7 Splines
7.1 Smoothing by Penalizing Curve Flexibility
7.2 Computational Example: Splines for Stock Returns
7.3 Basis Functions and Degrees of Freedom
7.4 Splines in Multiple Dimensions
7.5 Smoothing Splines versus Kernel Regression
7.6 Some of the Math Behind Splines
7.7 Further Reading
Exercises

8 Additive Models
8.1 Additive Models
8.2 Partial Residuals and Back-fitting
8.3 The Curse of Dimensionality
8.4 Example: California House Prices Revisited
8.5 Interaction Terms and Expansions
8.6 Closing Modeling Advice
8.7 Further Reading
Exercises


9 Testing Regression Specifications
9.1 Testing Functional Forms
9.2 Why Use Parametric Models At All?
9.3 Further Reading

10 Weighting and Variance
10.1 Weighted Least Squares
10.2 Heteroskedasticity
10.3 Estimating Conditional Variance Functions
10.4 Re-sampling Residuals with Heteroskedasticity
10.5 Local Linear Regression
10.6 Further Reading
Exercises

11 Logistic Regression
11.1 Modeling Conditional Probabilities
11.2 Logistic Regression
11.3 Numerical Optimization of the Likelihood
11.4 Generalized Linear and Additive Models
11.5 Model Checking
11.6 A Toy Example
11.7 Weather Forecasting in Snoqualmie Falls
11.8 Logistic Regression with More Than Two Classes
Exercises

12 GLMs and GAMs
12.1 Generalized Linear Models and Iterative Least Squares
12.2 Generalized Additive Models
12.3 Further Reading
Exercises

13 Trees
13.1 Prediction Trees
13.2 Regression Trees
13.3 Classification Trees
13.4 Further Reading
Exercises

Part II Distributions and Latent Structure


14 Density Estimation
14.1 Histograms Revisited
14.2 "The Fundamental Theorem of Statistics"
14.3 Error for Density Estimates
14.4 Kernel Density Estimates
14.5 Conditional Density Estimation
14.6 More on the Expected Log-Likelihood Ratio
14.7 Simulating from Density Estimates
14.8 Further Reading
Exercises

15 Principal Components Analysis
15.1 Mathematics of Principal Components
15.2 Example 1: Cars
15.3 Example 2: The United States circa 1977
15.4 Latent Semantic Analysis
15.5 PCA for Visualization
15.6 PCA Cautions
15.7 Random Projections
15.8 Further Reading
Exercises

16 Factor Models
16.1 From PCA to Factor Models
16.2 The Graphical Model
16.3 Roots of Factor Analysis in Causal Discovery
16.4 Estimation
16.5 The Rotation Problem
16.6 Factor Analysis as a Predictive Model
16.7 Factor Models versus PCA Once More
16.8 Examples in R
16.9 Reification, and Alternatives to Factor Models
16.10 Further Reading
Exercises

17 Mixture Models
17.1 Two Routes to Mixture Models
17.2 Estimating Parametric Mixture Models
17.3 Non-parametric Mixture Modeling
17.4 Worked Computing Example
17.5 Further Reading
Exercises

18 Graphical Models
18.1 Conditional Independence and Factor Models
18.2 Directed Acyclic Graph (DAG) Models
18.3 Conditional Independence and d-Separation
18.4 Independence and Information
18.5 Examples of DAG Models and Their Uses
18.6 Non-DAG Graphical Models
18.7 Further Reading
Exercises

Part III Causal Inference



19 Graphical Causal Models
19.1 Causation and Counterfactuals
19.2 Causal Graphical Models
19.3 Conditional Independence and d-Separation Revisited
19.4 Further Reading
Exercises

20 Identifying Causal Effects
20.1 Causal Effects, Interventions and Experiments
20.2 Identification and Confounding
20.3 Identification Strategies
20.4 Summary
Exercises

21 Estimating Causal Effects
21.1 Estimators in the Back- and Front-Door Criteria
21.2 Instrumental-Variables Estimates
21.3 Uncertainty and Inference
21.4 Recommendations
21.5 Further Reading
Exercises

22 Discovering Causal Structure
22.1 Testing DAGs
22.2 Testing Conditional Independence
22.3 Faithfulness and Equivalence
22.4 Causal Discovery with Known Variables
22.5 Software and Examples
22.6 Limitations on Consistency of Causal Discovery
22.7 Pseudo-code for the SGS Algorithm
22.8 Further Reading
Exercises

Part IV Dependent Data


23 Time Series
23.1 What Time Series Are
23.2 Stationarity
23.3 Markov Models
23.4 Autoregressive Models
23.5 Bootstrapping Time Series
23.6 Cross-Validation
23.7 Trends and De-Trending
23.8 Breaks in Time Series
23.9 Time Series with Latent Variables
23.10 Longitudinal Data
23.11 Multivariate Time Series
