Advanced High School Statistics

First Edition

David M Diez

david@

Christopher D Barr

Yale School of Management chris@

Mine Çetinkaya-Rundel

Duke University mine@

Leah Dorazio

San Francisco University High School leah@

Copyright © 2015 OpenIntro, Inc. First Edition. Printing: July 27th, 2015.

This textbook is available under a Creative Commons license. Visit for a free PDF, to download the textbook's source files, or for more information about the license.

AP® is a trademark registered and owned by the College Board, which was not involved in the production of, and does not endorse, this product.

Contents

1 Data collection
    1.1 Case study
    1.2 Data basics
    1.3 Overview of data collection principles
    1.4 Observational studies and sampling strategies
    1.5 Experiments
    1.6 Exercises

2 Summarizing data
    2.1 Examining numerical data
    2.2 Numerical summaries and box plots
    2.3 Considering categorical data
    2.4 Case study: gender discrimination (special topic)
    2.5 Exercises

3 Probability
    3.1 Defining probability
    3.2 Conditional probability
    3.3 The binomial formula
    3.4 Simulations
    3.5 Random variables
    3.6 Continuous distributions
    3.7 Exercises

4 Distributions of random variables
    4.1 Normal distribution
    4.2 Sampling distribution of a sample mean
    4.3 Geometric distribution
    4.4 Binomial distribution
    4.5 Sampling distribution of a sample proportion
    4.6 Exercises

5 Foundation for inference
    5.1 Estimating unknown parameters
    5.2 Confidence intervals
    5.3 Introducing hypothesis testing
    5.4 Does it make sense?
    5.5 Exercises

6 Inference for categorical data
    6.1 Inference for a single proportion
    6.2 Difference of two proportions
    6.3 Testing for goodness of fit using chi-square
    6.4 Homogeneity and independence in two-way tables
    6.5 Exercises

7 Inference for numerical data
    7.1 Inference for a single mean with the t-distribution
    7.2 Inference for paired data
    7.3 Difference of two means using the t-distribution
    7.4 Comparing many means with ANOVA (special topic)
    7.5 Exercises

8 Introduction to linear regression
    8.1 Line fitting, residuals, and correlation
    8.2 Fitting a line by least squares regression
    8.3 Types of outliers in linear regression
    8.4 Inference for the slope of a regression line
    8.5 Transformations for nonlinear data
    8.6 Exercises

A End of chapter exercise solutions

B Distribution tables
    B.1 Random Number Table
    B.2 Normal Probability Table
    B.3 t Probability Table
    B.4 Chi-Square Probability Table

Preface

Advanced High School Statistics is ready for use with the AP® Statistics Course.1

This book may be downloaded as a free PDF at .

We hope readers will take away three ideas from this book in addition to forming a foundation of statistical thinking and methods.

(1) Statistics is an applied field with a wide range of practical applications.

(2) You don't have to be a math guru to learn from real, interesting data.

(3) Data are messy, and statistical tools are imperfect. But, when you understand the strengths and weaknesses of these tools, you can use them to learn about the real world.

Textbook overview

The chapters of this book are as follows:

1. Data collection. Data structures, variables, and basic data collection techniques.

2. Summarizing data. Data summaries and graphics.

3. Probability. The basic principles of probability.

4. Distributions of random variables. Introduction to key distributions, and how the normal model applies to the sample mean and sample proportion.

5. Foundation for inference. General ideas for statistical inference in the context of estimating the population proportion.

6. Inference for categorical data. Inference for proportions using the normal and chi-square distributions.

7. Inference for numerical data. Inference for one or two sample means using the t distribution, and comparisons of many means using ANOVA.

8. Introduction to linear regression. An introduction to regression with two variables.

Instructions are also provided in several sections for using Casio and TI calculators.

Videos

The icon indicates that a section or topic has a video overview readily available. The icons are hyperlinked in the textbook PDF, and the videos may also be found at

stat/videos.php

1 AP® is a trademark registered and owned by the College Board, which was not involved in the production of, and does not endorse, this product.

Examples, exercises, and appendices

Examples and guided practice exercises throughout the textbook may be identified by their distinctive bullets:

Example 0.1 Large filled bullets signal the start of an example.

Full solutions to examples are provided and often include an accompanying table or figure.

Guided Practice 0.2 Large empty bullets signal to readers that an exercise has been inserted into the text for additional practice and guidance. Students may find it useful to fill in the bullet after understanding or successfully completing the exercise. Solutions are provided for all within-chapter exercises in footnotes.2

There are exercises at the end of each chapter that are useful for practice or homework assignments. Many of these questions have multiple parts, and odd-numbered questions include solutions in Appendix A.

Probability tables for the normal, t, and chi-square distributions are in Appendix B, and PDF copies of these tables are also available from for anyone to download, print, share, or modify.

OpenIntro, online resources, and getting involved

OpenIntro is an organization focused on developing free and affordable education materials. OpenIntro Statistics, our first project, is intended for introductory statistics courses at the high school through university levels.

We encourage anyone learning or teaching statistics to visit and get involved. We also provide many free online resources, including free course software. Most data sets for this textbook are available on the website and through a companion R package.3 OpenIntro's resources may be used with or without this textbook as a companion.

We value your feedback. If there is a particular component of the project you especially like or think needs improvement, we want to hear from you. Provide feedback through a link provided on the textbook page:

stat/textbook.php

Acknowledgements

This project would not be possible without the dedication and volunteer hours of all those involved. No one has received any monetary compensation from this project, and we hope you will join us in extending a thank you to the project's volunteers listed at

about

and also to the many students, teachers, and other readers who have provided feedback to the project.

2 Full solutions are located down here in the footnote!

3 Diez DM, Barr CD, Çetinkaya-Rundel M. 2015. openintro: OpenIntro data sets and supplement functions. OpenIntroOrg/openintro-r-package.

Chapter 1

Data collection

Scientists seek to answer questions using rigorous methods and careful observations. These observations (collected from the likes of field notes, surveys, and experiments) form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. It is helpful to put statistics in the context of a general process of investigation:

1. Identify a question or problem.

2. Collect relevant data on the topic.

3. Analyze the data.

4. Form a conclusion.

Statistics as a subject focuses on making stages 2-4 objective, rigorous, and efficient. That is, statistics has three primary components: How best can we collect data? How should it be analyzed? And what can we infer from the analysis?

Researchers from a wide array of fields have questions or problems that require the collection and analysis of data. Let's consider three examples.

- Climate scientists: how will the global temperature change over the next 100 years?

- Psychology: can a simple reminder about saving money cause students to spend less?

- Political science: what fraction of Americans approve of the job Congress is doing?

What questions from current events or from your own life can you think of that could be answered by collecting and analyzing data? While the questions that can be posed are incredibly diverse, many of these investigations can be addressed with a small number of data collection techniques, analytic tools, and fundamental concepts in statistical inference.

This chapter focuses on collecting data. We'll discuss basic properties of data, common sources of bias that arise during data collection, and several techniques for collecting data through both sampling and experiments. After finishing this chapter, you will have the tools for identifying weaknesses and strengths in data-based conclusions, tools that are essential to be an informed citizen and a savvy consumer of information.

1.1 Case study: using stents to prevent strokes

Section 1.1 introduces a classic challenge in statistics: evaluating the efficacy of a medical treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the text. The plan for now is simply to get a sense of the role statistics can play in practice.

In this section we will consider an experiment that studies the effectiveness of stents in treating patients at risk of stroke.1 Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death. Many doctors have hoped that there would be similar benefits for patients at risk of stroke. We start by writing the principal question the researchers hope to answer:

Does the use of stents reduce the risk of stroke?

The researchers who asked this question collected data on 451 at-risk patients. Each volunteer patient was randomly assigned to one of two groups:

Treatment group. Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.

Control group. Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group.
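To make the idea of random assignment concrete, the following is a minimal Python sketch, not the procedure the researchers actually used. It assumes the 451 patients are simply numbered 1 through 451; the function name `assign_groups` and the seed value are hypothetical choices made for illustration. The sketch shuffles the patient numbers and places the first 224 in the treatment group and the remaining 227 in the control group.

```python
import random


def assign_groups(patient_ids, n_treatment, seed=None):
    """Randomly split patient_ids into a treatment and a control group.

    Hypothetical helper for illustration only; the study's real
    randomization procedure is not described in this text.
    """
    rng = random.Random(seed)      # local generator so results are repeatable
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)          # every ordering is equally likely
    treatment = sorted(shuffled[:n_treatment])
    control = sorted(shuffled[n_treatment:])
    return treatment, control


patients = range(1, 452)           # assumed labels: patients 1 through 451
treatment, control = assign_groups(patients, n_treatment=224, seed=2011)
print(len(treatment), len(control))  # prints: 224 227
```

Because chance alone decides which group each patient joins, neither the patients nor the researchers can steer particular kinds of patients toward the stent.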

Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment. The results for five patients are summarized in Table 1.1. Patient outcomes are recorded as "stroke" or "no event", indicating whether or not the patient had a stroke by the end of that time period.

Patient   group       0-30 days   0-365 days
1         treatment   no event    no event
2         treatment   stroke      stroke
3         treatment   no event    no event
...       ...         ...         ...
450       control     no event    no event
451       control     no event    no event

Table 1.1: Results for five patients from the stent study.
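As a small illustration of how patient-level records like these can be counted by group and outcome, here is a hedged Python sketch. It uses only the five rows shown in Table 1.1, so the counts it produces describe those five patients and not the full study. The tuple layout and the helper name `tally` are assumptions made for this example.

```python
from collections import Counter

# The five rows shown in Table 1.1, stored as
# (patient, group, outcome at 0-30 days, outcome at 0-365 days).
rows = [
    (1, "treatment", "no event", "no event"),
    (2, "treatment", "stroke", "stroke"),
    (3, "treatment", "no event", "no event"),
    (450, "control", "no event", "no event"),
    (451, "control", "no event", "no event"),
]


def tally(rows, outcome_index):
    """Count patients for each (group, outcome) pair.

    outcome_index is 2 for the 0-30 day outcome, 3 for the 0-365 day outcome.
    """
    return Counter((row[1], row[outcome_index]) for row in rows)


counts_30 = tally(rows, 2)
counts_365 = tally(rows, 3)

for (group, outcome), n in sorted(counts_30.items()):
    print(group, outcome, n)
```

Counting in this way replaces a long list of individual records with a handful of numbers that can be compared across groups.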

Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once. Table 1.2 summarizes the raw data in a more helpful way. In this table, we can quickly see what happened over the entire study. For instance, to identify the number of patients in the treatment group who had a stroke

1 Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. doi/full/10.1056/NEJMoa1105335. NY Times article reporting on the study: 2011/09/08/health/research/08stent.html.
