Math 54 Worksheet #1



1 Probability

Axioms:

1) For any event A, P(A) ≥ 0

2) If S is the whole sample space, then P(S) = 1

3) If two events A and B are disjoint, then P(A or B) = P(A) + P(B)

Theories:

1) Equally Likely Outcomes: Symmetry and indistinguishable outcomes.

2) Frequency: Experimental data.

3) Subjective: Credence and “experience”.

If P(A and B) = P(A) × P(B), then events A and B are independent.

Complements: P(A) + P(not A) = 1, so P(A) = 1 – P(not A)

Inclusion-Exclusion Formula: P(A or B) = P(A) + P(B) – P(A and B)

Set Theory:

[Venn diagram: sample space S containing events A, B, and C.]

Combinations and Permutations:

nCr = n! / (r! × (n – r)!) : How many ways we can select r items out of n items, when order does not matter.

nPr = n! / (n – r)! : How many ways we can select and order r items out of n items.
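
The two counts above can be checked numerically. The sketch below uses Python's standard `math.comb` and `math.perm`; the variable names are illustrative, and the numbers (20 people, 4 chosen, the letters of BERKELEY) are taken from the problems later in this section.

```python
import math

# nCr: choose r of n items when order does not matter.
committee = math.comb(20, 4)

# nPr: choose r of n items when order matters.
ordered = math.perm(20, 4)

# Each unordered selection of r items can be ordered in r! ways,
# so nPr = nCr * r! and nPr is never smaller than nCr.
assert ordered == committee * math.factorial(4)

# Arrangements of B, E, R, K, E, L, E, Y: 8 letters, with E repeated 3 times.
berkeley = math.factorial(8) // math.factorial(3)

print(committee, ordered, berkeley)
```

The relation nPr = nCr × r! is also why 2008P32 must be larger than 2008C32.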

5 Expected Value:

For a random variable x, we can calculate the ‘expected value’ of x. This will be equal to the long run average of x, if applicable.

E(x) = Σ x × P(x), summing over all possible values of x

E(x + y) = E(x) + E(y)

IF INDEPENDENT: E(x × y) = E(x) × E(y)
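
As a small worked example, an expected value is just a probability-weighted sum. The sketch below is illustrative only: `expected_value` is a hypothetical helper, and the prizes and odds are those of the lottery problem in the next section. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of value * probability over (value, probability) pairs."""
    return sum(value * prob for value, prob in distribution)

# Lottery ticket: (prize in dollars, probability of winning it).
# The "win nothing" outcome contributes 0 to the sum, so it is omitted.
lottery = [
    (6_000_000, Fraction(1, 3_000_000)),
    (10, Fraction(1, 1000)),
]

fair_price = expected_value(lottery)
print(float(fair_price))
```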

6 Problems:

1. A fair coin is flipped twice.

a. If you are told that at least one of the flips came up heads, what is the probability that both are heads? 1/3

b. If you are told that the first coin came up heads, what is the probability that both are heads? 1/2

2. A lottery has a $6,000,000 grand prize with probability of winning 1 in 3,000,000. It also has a $10 consolation prize with probability of winning 1 in 1000. What is the fair price of your $5 lottery ticket? $2.01

3. In an urn are 5 blue, 3 red, and 2 yellow balls. If you draw 3 balls, what’s the probability that fewer than 2 will be red if:

a. You draw with replacement? (7/10)^3 + 3*(7/10)^2*(3/10)

b. You draw without replacement? (7/10)*(6/9)*(5/8)+3*(7/10)*(6/9)*(3/8)

4. There are 20 people who work in an office together. Four of these people are selected to go to the same conference together. How many such selections are possible? 20C4

5. Which is larger, 2008C32 or 2008P32? 2008P32

6. How many distinct ways are there to arrange the letters B, E, R, K, E, L, E, Y? 8!/3!

7 Logic and Truth Tables:

Truth tables are a convenient way to represent combinations of statements that may be either true or false.

The ways we know to combine them are NOT, OR, AND, XOR, IMPLIES, and IFF.
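
A truth table can also be generated mechanically. The following is a minimal sketch, not a required method: each binary connective is written as a small Python function (the dictionary and the `truth_table` helper are illustrative names), and NOT is simply Python's `not p`.

```python
from itertools import product

# Each binary connective as a function of truth values p and q.
connectives = {
    "p OR q": lambda p, q: p or q,
    "p AND q": lambda p, q: p and q,
    "p XOR q": lambda p, q: p != q,
    "p IMPLIES q": lambda p, q: (not p) or q,
    "p IFF q": lambda p, q: p == q,
}

def truth_table(func):
    """Return [(p, q, value)] for all four combinations of p and q."""
    return [(p, q, func(p, q)) for p, q in product([True, False], repeat=2)]

for name, func in connectives.items():
    print(name)
    for p, q, value in truth_table(func):
        print(f"  {p!s:<5} {q!s:<5} -> {value}")
```

Note that IMPLIES is false in exactly one row: when p is true and q is false.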

8 Conditional Probability:

P(A | B) = P(A and B) / P(B)

Bayes Rule:

P(A | B) = P(B | A) × P(A) / P(B), where P(B) = P(B | A) × P(A) + P(B | not A) × P(not A)

9 Problems:

1. Fill out the truth table for: p IMPLIES (p OR q)

2. A doctor has a 90% chance of correctly diagnosing a disease if you have it, and a 20% chance of incorrectly diagnosing you with it if you don’t. Everyone who has a nosebleed out of his or her left nostril has a 20% chance of actually having the disease. You develop a nosebleed out of your left nostril.

a. What is the chance that you have the disease, given that the doctor says you do?

9/17

b. If 100 people have nosebleeds, what is the expected number of people that are diagnosed with the disease and actually have it? 18
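
The doctor problem can be checked with a short Bayes' rule calculation. This sketch assumes the second 20% is the chance that the doctor says you have the disease when you do not (a false positive), which is the reading the stated answer of 9/17 requires. The function and variable names are illustrative.

```python
from fractions import Fraction

def bayes(prior, p_pos_given_true, p_pos_given_false):
    """P(condition | positive) from the prior and the two positive rates."""
    true_pos = prior * p_pos_given_true
    false_pos = (1 - prior) * p_pos_given_false
    return true_pos / (true_pos + false_pos)

prior = Fraction(2, 10)           # P(disease | left-nostril nosebleed)
sensitivity = Fraction(9, 10)     # P(doctor says disease | disease)
false_positive = Fraction(2, 10)  # ASSUMPTION: P(doctor says disease | no disease)

posterior = bayes(prior, sensitivity, false_positive)

# Expected number, out of 100 nosebleed patients, who have the disease
# and are diagnosed with it.
expected_true_positives = 100 * prior * sensitivity

print(posterior, expected_true_positives)
```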

2 Looking at Data

A test statistic is analogous to a random variable in probability.

The test statistic can be quantitative or qualitative. Don’t be fooled: not all numbers are quantitative data, and non-numbers can sometimes be treated as quantitative.

1 Center and Spread:

We know 3 ways of measuring the center of a set of data:

Mean: The sum of all the measurements, divided by the number of measurements

Mode: The most frequent value

Median: The middle value of the list

Percentile: Median is the 50th percentile of the data. In general, we say that X is the Yth percentile of a set of data if Y% of the data is less than X.

The 25th and 75th percentiles are called the 1st and 3rd quartiles, respectively.

The IQR is the distance between the 1st and 3rd quartiles, and is one way of measuring how spread out a set of data is.

The standard deviation is another measure of spread, and is one we use more often.

SD = √( Σ(x – x̄)² / n ), where the x’s are all the labels in the box, x̄ is the average of the box, and n is the number of tickets in the box.

Shortcut: If your box has only two types of labels, A and B, then SD = |A – B| × √( p × (1 – p) ), where p is the probability of drawing A.

Standard error is an estimate of the standard deviation we use when the real standard deviation is impossible to calculate.

In general, we can find the SD of a list or a box, and we use the SE for the sample mean or sum, or when estimating the population from a sample.

SE(x) = √( E( (x – E(x))² ) )

In words, SE is the square root of the EXPECTATION of (x – E(x))².

If independent: SE(x + y) = √( SE(x)² + SE(y)² )

SE(sum) = √n × SD(box)

SE(mean) = SD(box) / √n
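
The SD of a box and the two SE rules above can be sketched as follows. The helper names are illustrative, the draws are assumed to be independent (with replacement), and the box is the one from Additional Problem 8.

```python
import math

def sd_box(tickets):
    """SD of a box: root of the average squared deviation from the box average."""
    avg = sum(tickets) / len(tickets)
    return math.sqrt(sum((x - avg) ** 2 for x in tickets) / len(tickets))

def se_sum(tickets, n):
    """SE of the sum of n independent draws (with replacement)."""
    return math.sqrt(n) * sd_box(tickets)

def se_mean(tickets, n):
    """SE of the mean of n independent draws (with replacement)."""
    return sd_box(tickets) / math.sqrt(n)

box = [-1, 2, 2, 3, 4]
print(round(sd_box(box), 3), round(se_sum(box, 3), 3), round(se_mean(box, 3), 3))
```

The SE of the sum grows with the number of draws, while the SE of the mean shrinks.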

Note that median and IQR are very resistant to change, while mean and SD can be affected greatly by just one value.
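
To illustrate this resistance, the sketch below changes a single value in a small made-up list and compares the summaries, using Python's standard `statistics` module. The data are hypothetical.

```python
import statistics

data = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]  # same list with one value changed

for label, values in [("original", data), ("one outlier", with_outlier)]:
    print(
        label,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "SD:", round(statistics.pstdev(values), 2),
    )
```

The median stays put, while the mean and SD move a great deal.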

Data can be represented graphically by either a boxplot or a histogram. A scatterplot can be used when there are two variables.

2 Problems:

Answer the following questions based on the histogram:

1. What are the mean, median, and mode of this data? Is it skewed, and if so, in which direction? Mean=18.35, Median=16-17 (say 16.5), Mode=15

2. Find the 15th, 50th, and 95th percentiles of the data.

15, 16-17 (say 16.5), 19-20 (say 19.5)

3. Is it possible to find the standard deviation of this data? If not, what other information do you need? Yes

4. If you took a sample of size 5 with replacement from this data, what would the expected value of the mean be? The expected value of the sum?

E(mean)=18.35, E(sum)=91.75

3 Predicting Data

1 Markov’s and Chebyshev’s inequalities:

These inequalities can tell us useful information about a set of data even if we do not know all the values.

Markov’s inequality: If the random variable X is nonnegative, then P(X ≥ a) ≤ E(X) / a for any a > 0.

Gives an upper bound for the proportion of values above a certain value; if there were more, the mean would be higher.

Chebyshev’s inequality: P(|X – E(X)| ≥ k × SE(X)) ≤ 1 / k² for any k > 0.

Gives an upper bound for the proportion of values a certain distance away from the mean; if there were more, the standard error would be larger.

Chebyshev’s inequality usually gives tighter bounds, but we can only use it if we have the SE of the data.
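
Both bounds are one-line calculations. The sketch below uses the standard forms P(X ≥ a) ≤ E(X)/a and P(|X – E(X)| ≥ k × SE) ≤ 1/k²; the function names are illustrative, and the numbers are those of the coin problem that follows (expected value 50, SE 5, threshold 75).

```python
def markov_bound(mean, a):
    """Upper bound on P(X >= a) for a nonnegative X with the given mean (a > 0)."""
    return min(1.0, mean / a)

def chebyshev_bound(mean, se, a):
    """Upper bound on P(|X - mean| >= |a - mean|), using k = |a - mean| / se.

    Assumes a != mean, so that k > 0.
    """
    k = abs(a - mean) / se
    return min(1.0, 1 / k**2)

# Number of heads in 100 flips of a fair coin: expected value 50, SE 5.
print(markov_bound(50, 75))        # bound on P(X >= 75)
print(chebyshev_bound(50, 5, 75))  # bound on P(|X - 50| >= 25)
```

Chebyshev's bound covers both tails at once, so it is also an upper bound for the upper tail alone.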

2 Law of Large Numbers:

As we increase the number of trials, the percent error of the measured value from the expected value tends to decrease (but the absolute error tends to increase).

3 Regression:

Regression is used to predict a value of Y when given a value of X. The regression equation is Y = mX + b, where m = r × SDy / SDx, b is determined by plugging in the point of averages, and r is the correlation coefficient.

r = average of (X × Y), with X and Y in standard units.

For regression to work, the data must have a linear relationship, be homoscedastic, have no outliers, and should be for interpolation only (mostly).
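
As a sketch of how the regression equation is assembled from summary statistics, the code below computes the slope from r and the two SDs, then solves for the intercept using the point of averages. The helper name is illustrative; the numbers are those of the midterm/final problem in the Additional Problems (midterm average 50, SD 15; final average 55, SD 5; r = 0.6).

```python
def regression_line(r, mean_x, sd_x, mean_y, sd_y):
    """Return (m, b) for Y = m*X + b passing through the point of averages."""
    m = r * sd_y / sd_x
    b = mean_y - m * mean_x
    return m, b

# X = midterm score, Y = final score.
m, b = regression_line(0.6, 50, 15, 55, 5)

# Predicted final score for a student who scored 40 on the midterm.
predicted_final = m * 40 + b
print(round(m, 3), round(b, 3), round(predicted_final, 3))
```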

Scatter Plots and Residuals:

A residual plot shows the difference between the data points and the regression line.

Regression done incorrectly: Regression line was computed incorrectly (e.g. the residual plot follows a pattern, is all positive, etc.).

Regression does not apply: Scatter plot is not football shaped (e.g. heteroscedastic, non-linear, or with outliers).


1 Problems:

1. When flipping 100 coins, the number of heads has expected value 50 and SE 5.

a. Find an upper bound for the probability of getting more than 75 heads using Markov’s inequality. 2/3=67%

b. Find an upper bound for the probability of getting more than 75 heads using Chebyshev’s inequality. 1/25=4%

2 Distributions:

3 Binomial(n,p)

P(x) = nCx × p^x × (1 – p)^(n – x)

“Find the probability of x successes in n trials, with replacement.”

E(x) = n × p, SE(x) = √( n × p × (1 – p) )

4 Geometric(p)

P(x) = (1 – p)^(x – 1) × p

“Find the probability that the first success happens at the x-th trial, with replacement.”

E(x) = 1 / p, SE(x) = √(1 – p) / p

5 Negative Binomial(p,r)

P(x) = (x – 1)C(r – 1) × p^r × (1 – p)^(x – r)

“Find the probability that it takes x trials to get r successes, with replacement.”

E(x) = r / p, SE(x) = √( r × (1 – p) ) / p

6 Hypergeometric(N,G,n)

P(x) = GCx × (N – G)C(n – x) / NCn

“Find the probability that you get x “good things” in n draws when there are G total good things, N total things, without replacement.”

E(x) = n × p, SE(x) = √( (N – n) / (N – 1) ) × √( n × p × (1 – p) ) (note p = G/N)
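
The four distributions above can be sketched in code. This is a minimal sketch using the standard textbook formula for each probability; the function names are illustrative, and the example numbers come from the fish-tank problem in the Additional Problems (30 fish, 10 diseased).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes in n independent trials)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def geometric_pmf(x, p):
    """P(first success happens on trial x)."""
    return (1 - p) ** (x - 1) * p

def negative_binomial_pmf(x, r, p):
    """P(the r-th success happens on trial x)."""
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

def hypergeometric_pmf(x, N, G, n):
    """P(x good things in n draws without replacement from N things, G of them good)."""
    return comb(G, x) * comb(N - G, n - x) / comb(N, n)

# 10 diseased fish in 40 draws with replacement, p = 1/3.
print(round(binomial_pmf(10, 40, 1 / 3), 3))

# 3 diseased fish in 5 draws without replacement from 30 fish, 10 diseased.
print(round(hypergeometric_pmf(3, 30, 10, 5), 3))
```

With r = 1 the negative binomial formula reduces to the geometric one.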

7 Problems:

1. Identify the distribution of the following random variables as binomial, geometric, hypergeometric, negative binomial, or none of the above, and give the parameters, if possible.

a. The number of questions a student will get right by randomly guessing on a 50-question multiple choice test with 5 choices per answer. Binomial, n=50, p=.2

b. The number of rounds it takes until a player rolls either a 7 or an 11 on a pair of dice. Geometric, p=8/36=2/9=.2222

c. A variable with a negative binomial distribution and parameters p=.5, r=1

Geometric, p=.5

8 Normal Curve

The normal distribution, or bell curve, is a symmetric continuous probability distribution with parameters µ and σ.

1 Central Limit Theorem:

The distribution of the sample mean and sample sum of a box of numbers approaches the normal distribution as the sample size increases, regardless of the numbers inside the box. (for a simple random sample)

If the numbers inside the box are skewed, this takes longer, but it still happens.

2 Scaling:

The standard normal distribution is a normal distribution with mean at 0 and standard error of 1.

For any random variable x with normal distribution having expected value µ and SE σ, we can convert x to standard units, or a z-score, with the formula

z = (x – µ) / σ

3 Estimating:

A z-score can be calculated from any distribution where we know the mean and SE. We can then use the normal curve to find an approximation for the probability that the random variable is less than this value. If it is a discrete distribution, remember the ½ offset.
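
A minimal sketch of the normal approximation with the ½ offset, using `statistics.NormalDist` from the Python standard library. The helper name is illustrative, and the example (the number of heads in 100 fair coin flips, with mean 50 and SE 5) is borrowed from the problems earlier in the worksheet.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_at_most(k, mean, se):
    """Approximate P(X <= k) for an integer-valued X, using the 1/2 offset."""
    z = (k + 0.5 - mean) / se
    return NormalDist().cdf(z)

# Number of heads in 100 fair coin flips: mean 50, SE sqrt(100 * 0.5 * 0.5) = 5.
n, p = 100, 0.5
mean, se = n * p, sqrt(n * p * (1 - p))

approx = normal_approx_at_most(60, mean, se)
print(round(approx, 4))
```

Here the z-score is (60.5 – 50) / 5 = 2.1, so the approximation is the area under the standard normal curve to the left of 2.1.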

Because of the central limit theorem, whenever we have a simple random sample, we can use the normal distribution as an approximation, using the sample mean and SE.

When sampling without replacement, multiply the SE with replacement by √( (N – n) / (N – 1) ), where N is the population size and n is the sample size.

Remember SD(box) = |a – b| × √( p × (1 – p) ) if there are only tickets labeled a and b, where p is the fraction of tickets labeled a.

Bootstrap Method: Use the sample proportion φ as an approximation for the population proportion p: SE ≈ f × √( φ × (1 – φ) ) / √n, where f is the correction factor for sampling without replacement (f = 1 with replacement).

Conservative method: assume the probability of success is 0.5 to get the largest SD.

Use s* as the estimate of SD(box) when sampling a continuous variable, where s* is the standard deviation of your sample.

Confidence Interval: A range of values “you think” the population parameter is in. The interval is said to “cover” the parameter if the parameter falls in the interval.

P%-Confidence Interval: The method you use in creating the interval has a P% chance of including the population parameter. The higher P is, the larger the interval.

Interval: Statistic +/- k*SE (k is z-score/standard units)

Conservative method: an interval that has confidence level P or higher. (The actual confidence level you get from this method is probably higher than the P you are using.)

1. Use the conservative estimate of the standard error

2. Use Chebyshev’s inequality to calculate k: 1 – 1/k² = P, so k = 1 / √(1 – P), with P written as a fraction.

Approximate method: an interval that is your “best guess” for a confidence level P.

1. Use the bootstrap method to estimate standard error.

2. Use the normal curve to calculate k.

3. Works best when normal approximation for the probability histogram of the sample average or sample percent applies.
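
The two methods above can be compared side by side. The sketch below is illustrative only: the function names are hypothetical, the finite population correction √((N – n)/(N – 1)) is applied because the sample is drawn without replacement, and the example is the 50-fish sample from Additional Problem 5. The normal curve gives k ≈ 1.96 for 95%; rounding k up to 2 gives a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist

def fpc(N, n):
    """Finite population correction for sampling without replacement."""
    return sqrt((N - n) / (N - 1))

def approximate_interval(phi, n, N, level=0.95):
    """Bootstrap SE with k taken from the normal curve."""
    se = fpc(N, n) * sqrt(phi * (1 - phi) / n)
    k = NormalDist().inv_cdf(0.5 + level / 2)
    return phi - k * se, phi + k * se

def conservative_interval(phi, n, N, level=0.95):
    """Largest possible SE (p = 0.5) with k taken from Chebyshev's inequality."""
    se = fpc(N, n) * 0.5 / sqrt(n)
    k = 1 / sqrt(1 - level)
    return phi - k * se, phi + k * se

# 30 sick fish in a simple random sample of 50 from a tank of 300.
approx = approximate_interval(0.6, 50, 300)
conservative = conservative_interval(0.6, 50, 300)
print(approx)
print(conservative)
```

The conservative interval is noticeably wider, which is the price of guaranteeing at least the stated confidence level.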

4 Experiment Design

When designing an experiment, look out for factors that may cause bias.

In order to minimize the impact of bias on our results, we set up controls.

5 Sampling

Random: The method of choosing something in which each possibility is equally likely to be chosen.

Systematic: A fixed method of choosing the next subject having chosen the one before. NOT random.

Stratified: Partitioning a population into disjoint groups and then sampling (not necessarily equally) from each one.

Cluster: Partitioning a population into disjoint groups, then choosing one, and then taking the data of EVERYTHING in that group.

Multi-stage: Sampling in successive stages, for example randomly choosing groups and then randomly sampling within each chosen group.

6 Hypothesis Testing

1. We are always gathering evidence to see if it is likely that a theory is not true. We determine beforehand either a rejection region for the statistic or a significance level for the P-value.

2. Null hypothesis

a. Conceptually: This is the theory that we are gathering evidence against, or that we assume to be true.

b. Practically: This is the theory that gives a predicted/expected number, and we’re testing if our value (ie statistic) is far enough away from this expected number to reject that theory. The null hypothesis usually says that the values being tested are the same, that nothing out of the ordinary happened, or that the parameter we are changing has no effect on the one we are measuring.

3. Alternate hypothesis

a. This is the theory that we take in lieu of the null hypothesis, if the test statistic or P-value is in the region for us to reject the null hypothesis. The alternate hypothesis should answer the question being asked.

b. This can be very general (the die is not fair) or very specific (the die gives too many 6’s; the long run fraction of getting a 6 is actually 1/5)

c. Determines if the test is one-sided or two-sided.

4. Confidence interval: An X% confidence interval is an interval centered on the sample statistic, built by a method that has an X% chance of covering the population parameter. Easiest to find with the normal curve.

5. Significance level: Probability of rejecting a true null (also the cutoff for the P-value)

6. Threshold: The cutoff for the test statistic

7. Rejection region: These are the actual outcomes that, when they happen, the null hypothesis is rejected. If your outcome is in the rejection region, you reject the null hypothesis

8. P-Value: The probability that we get our result or worse (farther away from the null) assuming the null hypothesis is true. We reject the null when this is smaller than the desired significance level.

9. Type I error: Rejecting a true null hypothesis. The probability of getting one is equal to the significance level.

10. Type II error: Failing to reject a false null hypothesis. The probability of getting one is the complement of the power.

11. Power of a test: Probability that the test rejects the null hypothesis assuming that the alternate hypothesis is true (so the null is false).
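
The definitions above can be tied together in a short calculation. This sketch assumes a one-sided test on a binomial count that rejects for large values; the helper names are illustrative, and the numbers are those of the fish-tank test in Additional Problem 4 (10 draws with replacement, null p = 1/3, alternate p = 2/3, 5% significance level).

```python
from math import comb

def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(c, n + 1))

def threshold(n, p_null, alpha):
    """Smallest c with P(X >= c | null) <= alpha (one-sided, upper tail)."""
    for c in range(n + 1):
        if upper_tail(c, n, p_null) <= alpha:
            return c
    return None  # no rejection region exists at this level

n, p_null, p_alt, alpha = 10, 1 / 3, 2 / 3, 0.05

c = threshold(n, p_null, alpha)
type_one = upper_tail(c, n, p_null)  # chance of rejecting a true null
power = upper_tail(c, n, p_alt)      # chance of rejecting when the alternate holds
type_two = 1 - power                 # chance of failing to reject a false null
p_value_for_8 = upper_tail(8, n, p_null)

print(c, round(type_one, 5), round(power, 5), round(type_two, 5), round(p_value_for_8, 5))
```

Because the count is discrete, the actual chance of a Type I error at the chosen threshold can be smaller than the nominal significance level.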

7 Additional Problems:

1. Suppose there is a large city with different districts. Below is a chart that contains how much money was spent on health care in a year, per capita, in each district, as well as how many people died in the district, per 100 people.

|District |Money spent on health care per person per year |Number of deaths per 100 people |

|A |$250 |4.5 |

|B |$400 |8 |

|C |$450 |7 |

|D |$300 |4 |

|E |$150 |2 |

|F |$350 |5 |

A reporter looking at this data does a news article and says that “Clearly, spending money on health care is hazardous to your health.” What do you think of this conclusion?

This conclusion is not supported by the data; less healthy districts will naturally spend more on health care. Correlation does not imply causation.

2. A student in a physics class scored a 40 on the midterm, which had an average of 50 and an SD of 15. On the final, which had an average of 55 and an SD of 5, the student scored a 50. After calculating the grade breakdowns for the curve of the class, this student fell right on the borderline between a C- and a C. In the course syllabus, the professor said that if you show improvement in the course due to your hard work, and you end up borderline, you would be bumped up to the higher grade. The correlation coefficient between the midterm and the final is 0.6.

a. Would you bump up this student? Explain. No. Compare the z-scores, about –.67 to –1. He got worse.

b. Suppose the student earned a 53. Would you bump up this student now? Explain.

Yes, now the z-score improved from about –.67 to –.4

3. Fill out the truth table for: p IMPLIES (p OR q) This is identical to the problem in the logic section

4. Suppose ten fish in your fish tank have a disease. There are 30 fish in your fish tank (it’s a big tank).

a. You draw five fish from the tank, and determine that three have the disease. Can you calculate the probability of this occurring using the binomial distribution?

No, the draws are without replacement, so use hypergeometric

b. What is the chance of drawing 40 fish with replacement and having 10 with the disease?

40C10*(1/3)^10*(2/3)^30=.075

c. You wish to determine if the disease has spread. There are only two possibilities: either the disease has spread or it has not. Moreover, based on the odd nature of the disease, if it has spread, then it must have spread to exactly ten more fish, making the total twenty fish with the disease. You draw ten fish, with replacement, and count the number of fish that have the disease. What does that number need to be for you to conclude that the disease has spread, using a 5% significance level? 7 or more: P(x≥6)=7.656%, P(x≥7)=1.966%

d. Suppose in the situation in c) you find that 8 of the 10 fish you draw actually have the disease. What is the P-Value for this situation? Based on this, what do you conclude? p-value=.34%, reject the null hypothesis: the disease has spread

e. What is the power of this test? 55.927%

f. Suppose you do this test many different times, on different tanks that all began with 30 fish with ten diseased fish. If the disease spreads 60% of the time, in what fraction of the cases where you predict that the disease has spread has it actually spread? .9771 – Use Bayes rule

5. Suppose you have a tank with 300 fish, and you wish to get an idea of how many fish are diseased. You draw a simple random sample of 5 fish and find that 3 are sick.

a. Give your best guess for 95% confidence interval of the fraction of fish that are diseased.

.6 +/- .435

b. What if you drew 50 fish and 30 were sick?

.6 +/- .127

6. Suppose you have a tank of 300 fish, and you want to know the average weight of the fish. You draw 25 fish, and find that the average weight is 100 grams with a standard deviation of 5 grams. Give an approximate 95% confidence interval for the average weight of the fish in your tank.

About 98 to 102 grams (100 +/- 2 × SE, with SE ≈ 5/√25 = 1)

7. The following probability distribution is given.

|X |P(X) |

|1 |0.5 |

|2 | |

|3 |0.04 |

|4 |0.02 |

a. Fill in the blank. .44

b. What is E(x)? 1.58 SE(x)? .666

c. Suppose you wish to do a test against the hypothesis that the probability distribution actually has a higher expected value. If you wish to use a significance level of 5%, for what values of X would you reject the null hypothesis? Only X = 4: P(X≥4)=2%, while P(X≥3)=6%

8. You have a box with tickets labeled –1, 2, 2, 3, 4.

a. Suppose you draw 3 tickets, with replacement, and add up the numbers. What is E(x) and SE(x)?

E(X)=6 SE(X)=2.90 (√3 × SD of the box, where SD = 1.67)

b. Suppose you draw 3 tickets, with replacement, and average the numbers. What is E(x) and SE(x)?

E(X)=2 SE(X)=.97 (SD of the box divided by √3)

c. Let X be the number of even numbers you pull from the box, after drawing 3 tickets with replacement. What is E(X)? SE(X)?

E(X)=1.8 SE(X)=.849

9. You wish to do a study on family size in a given city. In this city, there are no homeless people, and everyone has a phone. For each scheme below, determine whether it is a probability method, the type(s) of probability sampling, and if the probability of selection is the same for each member of the population. If possible, determine any biases, and how they might affect your results.

a. You divide the city into blocks. In each block, you select five households at random. You repeatedly go to these households until you find someone who is home, and determine their family size. Stratified random sample, smaller blocks get more weight

b. You divide the city into blocks. You randomly choose five of the blocks. In those blocks, you interview every household on those blocks, determining household size in each house.

Cluster sample, no bias, but very large chance error

c. You randomly call every house in the city until you get 1000 people who answer their phones. You conduct the survey with those 1000 people.

Bad random sample, bigger households more likely to answer phones

d. You divide the city into blocks. You randomly choose five blocks. In those five blocks, you randomly choose five households. You repeatedly go to these households until you find someone who is home, and determine their family size.

Multistage random sample, no bias, but large chance error

e. You go to every house that has a prime number in the last digit of their address.

Systematic sample, unknown bias

10. You are conducting a test to see whether a coin is fair (P(heads)=.5). You decide to flip the coin 100 times.

a. For what values should you reject the null hypothesis to achieve a desired significance level of 5%? Less than 40 or more than 60 (SE is approx. 5)

b. What is the power of the test against the alternate hypothesis that there is really a 75% chance of getting heads? 99.931%

c. What is the power of the test against the alternate hypothesis that there is really a 60% chance of getting heads? 46.209%

-----------------------

In this diagram, we have that B implies A and B and C are mutually exclusive. Note that we cannot say anything about the independence of A and C but we do know that B and C are dependent.

[Figure: the thick lines represent the threshold values; regions are labeled Type II Error, Type I Error, and Power.]
