AP Statistics Exam Review

I. Exploring Data and the Normal Distribution (Chapters 1 and 2)

Exploring Data

Be sure that you are familiar with the terminology & how to make & interpret graphs for one-variable data. When analyzing one-variable data, always discuss shape, center, & spread. Look for the overall pattern in the data first, then for any deviations from that pattern.

Quantitative Variables

✓ Discrete vs. continuous

✓ Frequency tables – frequency, relative frequency, cumulative relative frequency

✓ Graphs

- dotplot – small amount of data

- stemplot – small amount of data – don’t forget the key!

- histogram – large amount of data – select intervals that make sense! (bars should touch!)

- boxplot – large amount of data – always do modified boxplot that shows outliers

- ogive (left endpoint of class interval vs. relative cumulative frequency) – a visual of percentiles

- timeplot

✓ Numerical

- center – mean, median, mode; median resistant to outliers, mean is sensitive to outliers

Don't confuse median and mean. They are both measures of center, but for a given data set, they may differ by a considerable amount.

(a) If distribution is skewed right, then mean is greater than median.

(b) If distribution is skewed left, then mean is less than median.

Mean > median is not sufficient to show that a distribution is skewed right.

Mean < median is not sufficient to show that a distribution is skewed left.

- spread – variance, standard deviation (how far the typical value is away from the mean), range, IQR (spread of middle 50% of data)

Don't confuse standard deviation and variance. Remember that standard deviation units are the same as the data units, while variance is measured in square units.

- distribution shape – symmetric, skewed, bimodal, uniform (not the same as symmetric), cluster

- outliers – 1.5 IQR rule – subtract 1.5×IQR from Q1 and add 1.5×IQR to Q3 to set the fences; any point outside the fences is an outlier
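
A minimal Python sketch of the 1.5×IQR rule on a made-up data set (note: NumPy's default quartile method can differ slightly from the TI calculator's):

    import numpy as np

    data = [2, 3, 5, 7, 8, 9, 11, 12, 30]            # hypothetical data set
    q1, q3 = np.percentile(data, [25, 75])           # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # the outlier fences
    print([x for x in data if x < low or x > high])  # prints [30]: an outlier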

✓ Summary Statistics

- for symmetrical distributions: use mean, standard deviation

- for skewed distributions: use 5 number summary (min, Q1, med, Q3, max)

✓ Comparing Distributions

- back-to-back stemplots

- histograms made using the same scale

- side-by-side boxplots – be sure to label which one is which

- compare shape, center, and spread

Categorical (Qualitative) Variables

✓ Graphs

- pie

- bar (bars do not touch!)

- make tables of counts, proportions, or percents for each category

Normal Distribution

The normal distribution is a density curve described by a mean (μ) and a standard deviation (σ). The 68-95-99.7 (empirical) rule is an approximation of the percent of observations that lie within one, two, and three standard deviations of the mean. It applies only to symmetric, mound-shaped distributions. Don’t classify a distribution as “normal” just because it looks symmetric and unimodal.

A percentile is the % of the distribution that is at or to the left of the observation.

Standard Normal Distribution is N(0,1) with μ = 0 and σ = 1. The standardized value z of an observation x gives the number of standard deviations from the mean: z = (x − μ)/σ (see the sketch at the end of this section).

A normal probability plot provides a good assessment of how close a set of data is to a normal distribution.

If the plot is roughly linear, the data is approximately normal. (Provide a sketch!)
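
As a quick computational check on normal calculations (a sketch using SciPy; the N(500, 100) distribution and the cutoff 680 are made-up values):

    from scipy.stats import norm

    # percentile of x = 680 under N(mu = 500, sigma = 100)
    z = (680 - 500) / 100              # z = (x - mu) / sigma = 1.8
    print(norm.cdf(z))                 # ~0.964, i.e. about the 96th percentile
    print(norm.ppf(0.90, 500, 100))    # inverse: the 90th percentile, ~628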

Effects of Changing the Data Set (i.e. changing units)

Adding or subtracting the same number n from each data value:

mean increases or decreases by n, standard deviation does not change (data values are shifted up or down, but the spread remains the same)

Multiplying or dividing each data value by the same number n:

mean is multiplied or divided by n; standard deviation is multiplied or divided by |n| (spread is never negative).
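
A quick demonstration of both effects on a hypothetical data set:

    import numpy as np

    x = np.array([10.0, 20, 30, 40])
    print(x.mean(), x.std(ddof=1))              # 25.0, ~12.91
    print((x + 5).mean(), (x + 5).std(ddof=1))  # mean shifts to 30, SD unchanged
    print((x * 2).mean(), (x * 2).std(ddof=1))  # mean and SD both double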

II. Linear and Nonlinear Regression (Chapters 3, 4, and 14)

Scatterplots

• Graph response variable (y) vs. explanatory variable (x) and LABEL YOUR AXES!

• Look at graph before proceeding – describe strength (weak, moderate, strong), form (linear or nonlinear, curved, cluster), and direction (positive or negative)

positive association (as x increases, y increases)

negative association (as x increases, y decreases)

Look for deviations/outliers – a point that falls outside of the overall pattern of the relationship

Least Squares Regression Line (LSRL)

• Line that minimizes the sum of the squares of the vertical distances of the data values (the errors) from the line

• ŷ = a + bx, where ŷ is the predicted y value (a = y-intercept, b = slope)

• Stat-Calc-8: LinReg (a + bx)

• Formulas: b = r·(sy/sx) (a change of one standard deviation in x corresponds to a change of r standard deviations in y)

a = ȳ − b·x̄ (Both of these formulas are given on the exam! A computational sketch appears after the Correlation Coefficient list below.)

• The LSRL does require an explanatory/response distinction, i.e., switching the x and y variables gives you two different LSRLs. However, the r value is the same either way.

• LSRL contains the point (x̄, ȳ) (x̄ = mean of the x’s, ȳ = mean of the y’s)

• Interpretation of slope – ON AVERAGE, the amount that y increases or decreases for every 1 unit of increase in x

• Interpretation of y-intercept – predicted y-value when x = 0

• Residual = actual y – predicted y, i.e. residual = y − ŷ

• Extrapolation (predicting values of y for values of x beyond the range of the data) is risky!

Correlation Coefficient (r)

• A measure of the strength and direction of a linear relationship between two quantitative variables

• Not dependent on the explanatory/response designation of the variables … “r” would not change if you switched x & y.

• r is between –1 and 1; the closer to –1 or 1, the stronger the linear relationship; r equal to –1 or 1 is a perfect negative or perfect positive relationship (all data points on the LSRL)

• not affected by a change in the units of measurement; r has no units!

• formula: r = [1/(n − 1)] · Σ[(xi − x̄)/sx]·[(yi − ȳ)/sy] (This is given on exam; see the sketch below.)

• remember to turn “Diagnostics On” on the calculator to see r when you use LinReg!

• A correlation coefficient near 0 doesn't necessarily mean there are no meaningful relationships between the two variables. Plot the points … there may be a strong curved relationship!
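
A minimal sketch of the LSRL and correlation formulas above, on hypothetical data (NumPy's built-ins serve as a check):

    import numpy as np

    x = np.array([1.0, 2, 3, 4, 5])             # hypothetical explanatory values
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # hypothetical responses

    r = np.corrcoef(x, y)[0, 1]                 # correlation coefficient
    b = r * y.std(ddof=1) / x.std(ddof=1)       # slope: b = r * (sy / sx)
    a = y.mean() - b * x.mean()                 # intercept: a = ybar - b*xbar
    print(a, b, r)                              # matches np.polyfit(x, y, 1)
    residuals = y - (a + b * x)                 # residual = actual - predicted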

Coefficient of Determination (r2)

• ____ % of the variation in y can be explained by the linear relationship between x & y (add context!!)

• Remember that r2 > 0 doesn't mean r > 0. For instance, if r2 = 0.81, then r = 0.9 or r = -0.9.

Residual Plot (know the difference between this and a scatterplot!)

• Graph the x values vs. residual

• if a line is an appropriate model for the data, the residuals will be balanced (+/-) showing no systematic pattern in the plot

• a point is an outlier if its residual is an outlier

- a point is an influential point if removing it from the data set significantly changes the slope and/or the correlation coefficient (influential points are often outliers in the x direction)

Cautions About Regression

correlation and regression describe only linear relationships

LSRL and r are not resistant to outliers

High correlation does not imply causation! (Remember, # of ice cream sales & # of drownings are highly correlated, but one does not cause the other!)

Causation – change in x causes change in y; Common Response – changes in x and y are caused by changes in a lurking variable z; Confounding – the effect of x on y is confounded with the effect of lurking variable z (we’re not sure whether x or z is causing change in y)

Transformations to Achieve Linearity

Exponential model

• y = a·b^x (taking logs: log y = log a + (log b)·x, which is linear in x)

• If original data is approx. exponential, then x vs. log y (or x vs. ln y) will be roughly linear.

• exponential growth increases by a fixed percentage of the previous value (linear growth increases by a fixed amount for each equal value increase in x)

Power model

• y = a·x^b (taking logs: log y = log a + b·log x, which is linear in log x)

• If original data is best modeled by a power model, then log x vs. log y (or ln x vs. ln y) will be roughly linear.

• r measures the strength of the linear relationship of the linearized data, not how well the model fits the original data!
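
A sketch of the exponential fit via transformation on made-up, roughly exponential data (the power model would fit np.log(x) vs. np.log(y) instead):

    import numpy as np

    x = np.array([1.0, 2, 3, 4, 5])
    y = np.array([2.0, 3.9, 8.1, 15.8, 32.5])       # roughly y = 2**x

    slope, intercept = np.polyfit(x, np.log(y), 1)  # fit x vs. ln y
    a, b = np.exp(intercept), np.exp(slope)         # back-transform: y = a*b**x
    print(a, b)                                     # b comes out near 2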

Inference for Regression

• true regression equation: μy = α + βx, where α, β, and σ are unknown population parameters; σ is the standard deviation of the response variable y for each x

• LSRL: ŷ = a + bx, where a, b, and s are the sample statistics (estimates of the population parameters α, β, and σ); a different sample will produce different values of a, b, and s

• Assumptions: 1) The true relationship is linear. Check the scatterplot of the data; look at the residuals; outliers and influential points should make you cautious in interpreting results. 2) The standard deviation of y about the true line (σ) is the same for all values of x. Check the residual plot – are the residuals evenly scattered about the regression line? 3) For any fixed value of x, the response y varies according to a normal distribution. Check the residuals for departures from normality – normal probability plot should be roughly linear or stemplot/histogram should not show skewness or major departures from normality. 4) Responses (y) are independent of each other.

• Formula for s: s = √(Σ residuals² / (n − 2)), or equivalently s = √(Σ(y − ŷ)² / (n − 2))

• Confidence interval for β: b ± t*·SEb with n – 2 degrees of freedom

• Hypothesis testing for slope:

Ho: β = 0 (There is no true linear relationship between ___ and ___)

Ha: β ≠ 0, β > 0, or β < 0 (There is a (pos/neg) linear relationship between ___ and ___)

Test statistic: t = b / SEb OR Stat – Test – LinRegTTest (this gives you t, p, df, s, r, a, b)
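
A sketch of the slope test using SciPy on hypothetical data (linregress reports SEb directly):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2, 3, 4, 5, 6])
    y = np.array([2.2, 4.1, 5.8, 8.3, 9.7, 12.1])   # hypothetical data

    res = stats.linregress(x, y)          # slope b, intercept a, r, p, SE_b
    t = res.slope / res.stderr            # t = b / SE_b for Ho: beta = 0
    df = len(x) - 2
    print(t, 2 * stats.t.sf(abs(t), df))  # matches res.pvalue (two-sided)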

Bivariate Categorical Data

• Organized in a two-way table

• Marginal distribution (distribution of each variable separately) vs. conditional distribution (distribution of one variable based on one condition of the other variable)

• Graphs – segmented bar graph, side-by-side bars in bar graphs, side by side circle graphs

• Simpson’s Paradox – direction of the relationship between the variables is reversed when two sets of data are combined – confounding influence

III. Producing Data (Chapter 5)

Types of Data

• Anecdotal – based on haphazardly selected cases; based on a few individual cases.

• Available data – data produced in the past for some other purpose.

• Biased data – systematically favors certain outcomes.

Types of Studies

• Observational study – observes subjects and measures variables of interest, but does not impose a treatment or try to cause any changes. An observational study doesn’t give good evidence of causation.

• Experiment – imposes some treatment on experimental units or subjects in order to observe a response; only a well-designed experiment can assess causation; objective is to compare treatments

Sampling – goal is to study part of the population to gain information about the whole population (as opposed to a census, which attempts to contact every individual in the entire population). A census is most appropriate and realistic for small groups.

• Population – the entire group you want information about.

• Sample – the part of the population that is actually examined.

• Sampling frame – the list of individuals from which a sample is actually selected (like a phone book).

Sample Design – method used to choose a sample from the population.

• Convenience sample – choosing people who are easy to reach (e.g. people at the mall)

• Voluntary response sample – participants select themselves (call-in survey, internet poll)

• Simple random sample (SRS) – every GROUP of size n from the population has an equal chance to be selected. SRS refers to how you obtain your sample; random allocation is what you use in an experiment to assign subjects to treatment groups. They are not synonyms.

• Stratified random sample – separate the population into groups (strata); then select an SRS from each stratum.

• Systematic sample – list the population in some order, choose a random starting point, then pick every nth person on the list.

• Multistage sample – divide the population into groups, subdivide the groups, select a random sample of the subdivisions, then pick a random sample from the selected subdivisions, etc.

Sample Bias

• Undercoverage – some groups are left out from being selected and therefore are under-represented.

• Nonresponse – selected to participate, but do not respond or refuse to cooperate.

• Response bias – lying on survey, among others.

• Question wording bias – question may be vague or leading.

• Voluntary response bias – over-representing people with potentially strong opinions.

• Hidden bias – unaware that the bias exists.

Experimental Design – refers to the choice of treatments & the manner in which experimental units are assigned to treatments.

• Describe the factors (explanatory variables), response variable, treatments, how treatments are assigned; reminder: factors can have levels.

• Treatment group – the subjects receiving the treatment.

• Control group – the subjects receiving no treatment or a placebo treatment.

• Placebo – “dummy” treatment.

• Placebo effect – the tendency for subjects to respond (usually favorably) to a dummy treatment.

• Types of design

▪ Comparative – compare the responses of subjects.

▪ Randomized – all experimental units are randomly assigned to treatments.

▪ Blocking – group subjects together who are similar in some way that may affect the response (age, gender, income level, growing conditions, etc.); randomly assign treatments within each block, compare responses within each block.

▪ Matched pairs (a special case of blocking) – uses pairs of similar subjects (or experimental units), one of each pair receiving each treatment, or the same subject receiving both treatments

▪ Blind – the subject does not know which treatment s/he is receiving

▪ Double-blind – neither the subject nor any person in contact with the subject (who would be evaluating the results) knows which treatment a subject receives. (a 3rd party knows!)

Principles of Experimental Design – memorize these!

• Randomization – use of chance to assign experimental units into treatment groups. Randomization refers to the random allocation of subjects to treatment groups, and not to the selection of subjects for the experiment. Randomization is an attempt to "even out" the effects of lurking variables across the treatment groups. Randomization helps avoid bias.

• Replication – using many subjects reduces natural chance variation (In science, “replicate” often means do the experiment again).

• Control – control the effect of lurking variables and reduce confounding by comparing several treatments in the same environment. Note: Control is not synonymous with "control group.” Blocking is a form of control. (Stratifying goes with sampling; blocking goes with experiments.)

Confounding Variables: Suppose that subjects in an observational study who eat an apple a day get significantly fewer cavities than subjects who eat less than one apple each week. A possible confounding variable is overall diet. Members of the apple-a-day group may tend to eat fewer sweets, while those in the non-apple-eating group may turn to sweets as a regular alternative. Since different diets could contribute to the disparity in cavities between the two groups, we cannot say that eating an apple each day causes a reduction in cavities.

Experimental bias – lack of REALISM (bad!), hidden bias

Simulation

• using random digits from a table, calculator, or computer to imitate chance behavior

• uses a model that accurately reflects the experiment under consideration

• remember to assign digits to represent outcomes and then use the TABLE (or other means) to generate outcomes
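
A small simulation sketch of the same idea, using a computer's random integers in place of a digit table (the "at least one 6 in four rolls of a fair die" question is just a made-up example):

    import random

    random.seed(1)                        # fixed seed for reproducibility
    trials = 10_000
    hits = sum(any(random.randint(1, 6) == 6 for _ in range(4))
               for _ in range(trials))
    print(hits / trials)                  # near 1 - (5/6)**4 = 0.518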

IV. Probability (Chapter 6)

VOCABULARY:

Random, Probability, Event, Complement, Union, Intersection, Sample space, Tree diagram, Venn diagram, Disjoint and joint events, Conditional probability, Independent

RULES OF PROBABILITY

• Multiplication principle: If you can do one task in A number of ways and a second task in B number of ways, then both tasks can be done in A x B number of ways.

• Legitimate probability distribution:

For any event A, 0 ≤ P(A) ≤ 1

P(sample space) = 1

• Set notation: ∪ means OR (union); ∩ means AND (intersection)

• Addition Rules (OR):

General (on formula sheet): P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Disjoint: P(A ∪ B) = P(A) + P(B) (if they share no outcomes!)

• Multiplication Rules (AND):

General: P(A ∩ B) = P(A) · P(B | A)

ONLY IF Independent: P(A ∩ B) = P(A) · P(B)

• Conditional Probability: The probability of B given that A happened. (This is how the general multiplication rule is given on the formula sheet.)

P(B | A) = P(A ∩ B) / P(A)

• The concepts of independent and disjoint events are topics that are often confused, or thought to be equivalent, but they’re not!

• Independence: Conceptually, events are independent when the occurrence (or non-occurrence) of one event does not INFLUENCE the probability of another event occurring.

• Formulas for independence….

By definition, P(B | A) = P(B) … whether or not A occurred does not influence the probability of B.

or equivalently, P(A ∩ B) = P(A) · P(B), the independent multiplication rule.

• Disjoint (a.k.a. mutually exclusive) – Events are disjoint when the two events CANNOT OCCUR at the same time, i.e. rolling a 2 and rolling an odd number on the same die roll.

Intersection of disjoint events: P(A ∩ B) = 0
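
A tiny worked check of these distinctions, using one roll of a fair die (A = even roll, B = roll a 2):

    p_a, p_b = 3/6, 1/6            # A = {2, 4, 6}, B = {2}
    p_a_and_b = 1/6                # the outcome 2 is in both events
    print(p_a_and_b == 0)          # False: A and B are NOT disjoint
    print(p_a_and_b == p_a * p_b)  # False: not independent (1/6 != 1/12)
    print(p_a_and_b / p_a)         # P(B | A) = 1/3, not P(B) = 1/6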

V. Random Variables: Binomial, Geometric, and Sampling Distributions (Ch. 7 – 9)

Chapter 7: Random Variables

Random variable – a variable whose value is a numerical outcome of a random phenomenon.

Discrete random variable – has a countable number of possible values.

• For instance, the number showing when rolling a die or the number of heads when flipping a coin several times.

Continuous random variable – takes on all values in an interval (infinite # of possible values).

• For instance, variables that follow a normal distribution (or any density curve), like height, SAT scores.

Mean = expected value = E(x) = μx = Σ xi·pi (Given on Exam)

Variance = σx² = Σ (xi − μx)²·pi (Given on Exam)

Probability distribution of a random variable x – tells all possible values of x and the probability of getting each value.

Law of large numbers – as the number of independent observations increases, the sample statistic approaches the true population value:

• A sample mean, x̄, approaches the population mean, μ.

• A sample proportion, p̂, approaches the population proportion, p.

• A sample standard deviation, s, approaches the population standard deviation, σ.

Multiplying and/or adding CONSTANT values:

For CONSTANTS a and b, and for random variable x,

μ(a + bx) = a + b·μx and σ(a + bx) = |b|·σx

Combining random VARIABLES:

For random variables x and y,

μ(x ± y) = μx ± μy

IF x and y are independent, separate VARIANCES will ADD to the total variance, whether the variables are added OR subtracted.

**This is a commonly missed concept.**

σ²(x ± y) = σ²x + σ²y NOTICE: Add the variances, even if the variables are subtracting!

THIS IS FALSE: σ(x + y) = σx + σy (Standard deviations don’t add, but variances do!)
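
A simulation sketch of this fact with two independent, made-up normal variables (SDs 3 and 4):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(50, 3, 100_000)   # sigma_x = 3
    y = rng.normal(20, 4, 100_000)   # sigma_y = 4
    print((x - y).std())             # ~5, since sqrt(3**2 + 4**2) = 5
    print((x + y).std())             # also ~5: variances add either way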

Chapter 8: Binomial & Geometric Distributions

BINOMIAL DISTRIBUTIONS

• Conditions

➢ Each observation falls into 1 of 2 possible outcomes – success and failure

➢ The probability of success, p, is the same for each observation.

➢ Observations are independent.

➢ We are counting the number of successes in a FIXED number of trials.

• Notation: B(n, p) where n is the number of trials and p is the probability of success.

• Possible values (number of successes) are whole numbers from 0 to n.

• Binomial formula (given on exam): P(X = k) = (n choose k)·p^k·(1 − p)^(n − k), where (n choose k) = n! / (k!(n − k)!)

Calculator commands:

➢ P(x = k) = binompdf(n,p,k) (2nd – DISTR menu)

➢ P(x ≤ k) = binomcdf(n,p,k)

• Expected # successes: E(x) = μ = np (Given on exam)

• Standard deviation: σ = √(np(1 − p)) (Given on exam)

• Normal approximation of binomial distribution

➢ When n is sufficiently large so that np ≥ 10 and n(1 − p) ≥ 10, a binomial distribution is approximately NORMAL with mean μ = np and standard deviation σ = √(np(1 − p))

➢ If np ≥ 10 or n(1 − p) ≥ 10 is not true, an exact binomial calculation should be used.
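
A sketch of the same calculations in SciPy for a hypothetical B(50, 0.3) variable (binom.pmf/cdf mirror the calculator commands above):

    from scipy.stats import binom, norm
    import math

    n, p = 50, 0.3
    print(binom.pmf(15, n, p))       # P(x = 15), like binompdf(50,.3,15)
    print(binom.cdf(15, n, p))       # P(x <= 15), like binomcdf(50,.3,15)

    mu, sd = n * p, math.sqrt(n * p * (1 - p))   # 15 and ~3.24
    print(norm.cdf(15, mu, sd))      # rough normal approximation of P(x <= 15)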

GEOMETRIC DISTRIBUTIONS

• Conditions (1st 3 are the same as binomial)

➢ 2 outcomes – success and failure

➢ The probability of success, p, is the same for each observation.

➢ Observations are independent.

➢ We are counting the number of trials needed to reach the FIRST success.

• Possible values (number of trials to get the first success) are whole numbers from 1 to “infinity.”

• Calculator commands:

➢ P(x = k) = geometpdf(p,k)

➢ P(x ≤ k) = geometcdf(p,k)

• Expected # trials to reach first success: E(x) = 1/p
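
The same in SciPy for a hypothetical "first 6 on a fair die" variable (scipy.stats.geom counts trials starting at 1, matching this definition):

    from scipy.stats import geom

    p = 1/6
    print(geom.pmf(3, p))    # P(first success on trial 3), like geometpdf(p,3)
    print(geom.cdf(3, p))    # P(success within 3 trials), like geometcdf(p,3)
    print(geom.mean(p))      # expected trials to first success: 1/p = 6.0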

Chapter 9: Sampling Distributions

• Parameter – names a characteristic of the population.

• Statistic – names a characteristic of a sample.

• Sampling distribution of a statistic – the distribution of values taken by the statistic in all possible samples of a particular size from the same population.

➢ This is a theoretical distribution, since it is usually infeasible to take the actual number of samples needed to create it.

• Unbiased – a statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.

➢ x̄ is an unbiased estimate of μ

➢ p̂ is an unbiased estimate of p

➢ this does NOT mean that every sample mean, x̄, is equal to μ, or that every sample proportion, p̂, is equal to p

• Variability in the sampling distribution is reduced by increasing the sample size. As long as the population is much larger than the sample (10 times as large), the spread of the sampling distribution is approximately the same for any POPULATION size.

➢ Let’s say that again, the spread of the sampling distribution depends on the SAMPLE size, not the POPULATION size.

• Sampling distribution of sample proportions (p̂):

➢ When np ≥ 10 and n(1 − p) ≥ 10, the distribution of sample proportions is approximately normal with μp̂ = p and σp̂ = √(p(1 − p)/n), where p is the population proportion and n is the sample size.

➢ The standard deviation formula is valid if the population is at least 10 times the sample size.

• Sampling distribution of sample means (x̄):

➢ If the sample size is large (n ≥ 30) or the original population is normal, then the sampling distribution of x̄ is approximately normal with μx̄ = μ and σx̄ = σ/√n, where μ is the population mean and σ is the population standard deviation.

➢ The standard deviation formula is valid if the population is at least 10 times the sample size.

• Sample size – both σx̄ and σp̂ are proportional to 1/√n

➢ To divide σx̄ or σp̂ by 2, the sample size must be increased by a factor of 4.

➢ To divide σx̄ or σp̂ by 3, the sample size must be increased by a factor of 9.

• CENTRAL LIMIT THEOREM: Regardless of the shape of the original population distribution, the distribution of the sample means x̄ is approx. normal when the sample size is large (n ≥ 30).

➢ This allows us to use normal (or “z”) calculations even when the original population is not normally distributed.
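
A simulation sketch of the CLT using a strongly skewed, made-up population (exponential with mean 10):

    import numpy as np

    rng = np.random.default_rng(0)
    pop = rng.exponential(scale=10, size=100_000)    # skewed population

    n = 40                                           # sample size >= 30
    samples = rng.integers(0, pop.size, (10_000, n)) # 10,000 samples of size n
    means = pop[samples].mean(axis=1)
    print(means.mean(), means.std())   # ~10 and ~10/sqrt(40) = 1.58
    # a histogram of `means` looks approximately normal anyway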

VI. HYPOTHESIS TESTING STUDY GUIDE (also includes conditions necessary for Confidence Intervals)

MEANS

A. Unknown μ and known σ (this happens only rarely)

1. z-test of population mean: Ho: μ = μ0 (a numerical value)

Calculation of z test statistic: z = (x̄ − μ0) / (σ/√n)

Conditions: SRS, normal population or n ≥ 30 (Central Limit Theorem)

B. Unknown μ and unknown σ (this is the case almost all of the time)

1. One-sample t-procedure

t-test of population mean: Ho: μ = μ0 (a numerical value)

Calculation of test statistic: t = (x̄ − μ0) / (s/√n), df (degrees of freedom) = n − 1

Conditions: SRS from popn of interest;

n ≤ 15: population must be normal

15 < n < 40: data are not tremendously non-normal (no outliers or extreme skewness)

n ≥ 40: okay to use t-procedures no matter what the data look like

NOTE: Matched pairs t-tests are a subset here (apply to the difference of the two measures on one sample)
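
A sketch of the one-sample t-test on hypothetical data (testing Ho: μ = 10):

    import numpy as np
    from scipy import stats

    data = np.array([9.8, 10.2, 10.4, 9.9, 10.6, 10.1, 10.3, 9.7])
    print(stats.ttest_1samp(data, popmean=10.0))   # t, two-sided p; df = 7

    # by hand: t = (xbar - mu0) / (s / sqrt(n))
    xbar, s, n = data.mean(), data.std(ddof=1), len(data)
    print((xbar - 10.0) / (s / np.sqrt(n)))        # matches the t above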

2. Two-sample t-procedure

Test of means from 2 independent samples

Ho: μ1 = μ2

Calculation of t test statistic: t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), df = smaller of n1 − 1 or n2 − 1

Conditions: SRS from popns of interest; samples are independent of each other (if not, then it’s matched pairs); same sample size rules as in #1 above for n1 + n2

Note: If the two samples are dependent, then it is not a two-sample situation anymore; this is called matched pairs, and it reverts to the 1-sample t-procedure described above.

PROPORTIONS

1. Large sample z test of population proportion:

Ho: p = p0 (a numerical value)

Calculation of test statistic:

z = (p̂ − p0) / √(p0(1 − p0)/n)

Conditions: SRS from population of interest, large population (≥ 10 times sample), np0 ≥ 10 and n(1 − p0) ≥ 10

2. Two Sample Proportions

Test of proportions from 2 independent samples: Ho: p1 = p2

Calculation of z test statistic:

z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2))

p̂ = (x1 + x2) / (n1 + n2) (this is the pooled sample prop.)

Conditions: independent SRS’s from popns of interest, large populations (≥ 10 times samples) for each, and n1p̂, n1(1 − p̂), n2p̂, and n2(1 − p̂) are all at least 5
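
A sketch of the pooled two-proportion z test with made-up counts:

    import math
    from scipy.stats import norm

    x1, n1, x2, n2 = 60, 200, 45, 180          # hypothetical counts / sizes
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)             # pooled sample proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    z = (p1 - p2) / se
    print(z, 2 * norm.sf(abs(z)))              # z and two-sided p-value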

SLOPES AND REGRESSION

A. Testing a Slope Value or a Prediction Value: Ho: β = β0 (usually β0 = 0, i.e. no linear relationship)

Calculation of test statistic from n data points (“LinReg t-test” on calculator):

t = (b − β0) / SEb, where b is the LSRL (sample) slope, with df = n − 2

Conditions: The mean response μy has a true straight-line relationship with x (graph the points and see if they’re roughly linear); the standard deviation of the response y about the true line is the same everywhere (the residuals should not be increasing or decreasing systematically with x – check the residual plot); the response varies normally about the true regression line (make a histogram or dotplot of the residuals and check for violation of normality assumptions).

CHI-SQUARED

1. Goodness of Fit Test

Ho: the actual population percents are equal to the hypothesized percentages

Calculation of the χ² test statistic:

χ² = Σ (Observed − Expected)² / Expected, with df = n – 1 (where n = the number of categories)

Conditions: all individual Expected counts are at least 1 and no more than 20% of the Expected counts are less than 5.

2. Test for Homogeneity of Populations or Test for Independence (sometimes called “two-way tables”)

Homogeneity of popns (multiple samples):

Ho: there are no differences among the proportions of success (i.e. p1 = p2 = … = pn)

Independence (one sample, two categorical variables recorded for each individual):

Ho: there is no relationship between the two categorical variables

Calculation of the χ² test statistic:

χ² = Σ (Observed − Expected)² / Expected, with df = (r − 1)(c − 1) (where r = # rows & c = # columns)

Expected counts = (row total × column total) / table total

Conditions: all individual Expected counts are at least 1 and no more than 20% of the Expected counts are less than 5. SRS (independence) and independent SRS’s (homogeneity).

Chi-square test on calculator: enter observed counts in matrix A

perform the test (this will report your χ², p-value, and df)

expected counts will be put in matrix B
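
The same test in SciPy with a made-up 2×3 two-way table (chi2_contingency returns the expected counts too):

    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[30, 20, 10],     # rows = groups/samples
                         [20, 25, 15]])    # cols = response categories
    chi2, p, df, expected = chi2_contingency(observed, correction=False)
    print(chi2, p, df)      # df = (2-1)(3-1) = 2
    print(expected)         # row total * column total / table total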

Be familiar with the concepts of Type I error, Type II error, and Power of a test.

Type I error: Rejecting a null hypothesis when it is true. P(Type I error) = α, the significance level.

Type II error: Failing to reject a null hypothesis when it is false.

Power of a test: Probability of correctly rejecting a false null hypothesis for a particular value in Ha.

Power = 1 - P(Type II error).

You can increase the power of a test by increasing the sample size or increasing the significance level (the probability of a Type I error).
