Practice Midterm Exam - Duke University



QUESTIONS 1 – 7 REFER TO THE DATA SET DESCRIBED BELOW

Psychologists are interested in measuring perception of risk, since it is an important component in any decision-making process. Carlstrom et al. (2000) asked 611 participants to provide a numerical value of risk for several activities using a scale from 0-100 (0 being no risk, and 100 being high risk). The participants were also asked questions to identify their world view. Participants are classified either as hierarchicalists (“Everyone has his/her place in society, and societal status is hierarchical”), individualists (“I control my environment and destiny”), or egalitarians (“I have little respect for any decisions not made by the group”). For this study, participants who were not classified in one of these groups were called “unclassifiable.”

DESCRIPTION OF THE DATA

========================

The 611 participants in this study were recruited between 1997 and 1998 from five sources: UCLA psychology undergraduate classes, campus and community organizations, community and college newspaper advertisements, a paid consultant, and posted flyers. The following are the five activities that each participant was asked to rate for risk:

DOC: Work as a family physician in rural area

SWAT: Work as a member of a SWAT police team

POOL: Swim in indoor public pool each weekend

NUC: Live near nuclear power station

PLANE: Fly on commercial airplanes every month

Other variables in the data set include:

Race 1 = Caucasian, 2 = African-American, 3 = Mexican-American, 4 = Taiwanese-American.

Gender 0 = Female, 1 = Male.

Age Age of participant.

Worldview 0 = Unclassifiable, 1 = Individualist, 2 = Hierarchicalist, 3 = Egalitarian

There are no problems on this page. The next two pages display output from exploratory data analyses that you should use to answer exam questions. The questions begin on page 5.

Age of study participants

|Group |Number of participants |

|Women |385 |

|Men |226 |

|Caucasian |158 |

|African-American |147 |

|Mexican-American |140 |

|Taiwanese-American |166 |

|Unclassifiable |384 |

|Individualist |51 |

|Hierarchicalist |98 |

|Egalitarian |78 |

Distributions of NUC, SWAT, and NUC minus SWAT (histograms not reproduced)

| |NUC |SWAT |NUC minus SWAT |

|Mean |77.86 |72.25 |5.61 |

|Median |90.00 |80.00 |? |

|SD |26.30 |22.60 |29.60 |

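Correlations among selected variables (DOC, NUC, PLANE, POOL, SWAT; table values not reproduced)

World view by race (number of participants)

| |Unclassifiable |Individualist |Hierarchicalist |Egalitarian |Total |

|Caucasians |79 |10 |24 |45 |158 |

|African Americans |111 |10 |16 |10 |147 |

|Mexican Americans |91 |9 |35 |5 |140 |

|Taiwanese-Americans |103 |22 |23 |18 |166 |

|Total |384 |51 |98 |78 |611 |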

There are no exam problems on this page. Exam problems begin on the next page.

EXAM PROBLEMS BEGIN HERE

1. (2 points per part) For problems 1a and 1b, write numbers for each answer. Ranges (e.g., “between 64.3 and 68.5”) will receive no credit. For parts 1c-1e, circle the correct answer.

a) Estimate the average age of the study participants. 27.1- 31 full credit.

b) Estimate the percentage of people in the study under age 25. 52 - 65% full credit

c) Circle the number that is closest to the SD of age: 15

d) The study has no one under age 17. True

e) About 68% of the participants’ ages are within 1 SD of the average age.

False (data don’t follow a normal curve)

2. (2 points per part). For 2a – 2d, circle the appropriate answer. For 2e, write a number for your answer. Ranges will receive no credit.

a) Which one of the following three scatter plots portrays the relationship between NUC and SWAT most accurately? Circle the letter of the correct plot.

It was plot B, because that plot had lots of high scores for both SWAT and NUC.

b) Which variable has the strongest linear association with AGE in these data:

NUC has the strongest association (largest correlation in magnitude)

c) When used as the predictor in a simple regression, which variable explains the most variation in PLANE scores in these data:

POOL. (In a simple regression, R-squared is the square of the correlation, so the regression with the largest R-squared is the one whose predictor is most strongly correlated with PLANE.)

d) When using a simple regression with NUC as the outcome and AGE as the predictor, the typical deviation around the regression line is at least 26.3.

False. The RMSE is never larger than the SD of the outcome variable (26.3 for NUC), and it is smaller whenever the correlation is not zero.

e) Predict the SWAT score for a person who rates NUC as a 90. Show your work for full credit.

Use the regression method to make a prediction. The slope = correlation (SD of Y / SD of X).

The correlation equals 0.2682. The SD of Y equals 22.6. The SD of X equals 26.3. So, the slope equals 0.2305. The intercept can be found by plugging in the point of averages:

Avg. Y = slope (Avg. X) + Intercept

72.25 = 0.2305 (77.86) + Intercept.

Intercept = 54.30.

So, the predicted value equals: 0.2305(90)+54.30 = 75.04.
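The regression-method calculation above can be checked with a short script. This is a minimal sketch using only the summary statistics quoted in the answer; the function name `predict_swat` is made up for illustration.

```python
# Summary statistics quoted in the answer above (X = NUC, Y = SWAT).
r = 0.2682
mean_x, sd_x = 77.86, 26.30
mean_y, sd_y = 72.25, 22.60


def predict_swat(nuc):
    """Regression-method prediction of SWAT from a NUC rating."""
    slope = r * sd_y / sd_x
    # The regression line passes through the point of averages.
    intercept = mean_y - slope * mean_x
    return slope * nuc + intercept


prediction = predict_swat(90)
print(round(prediction, 2))
```

Carrying full precision instead of the rounded slope changes the prediction only in the second decimal place.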

3. (3 points) Like many psychology studies, this is not a random sample from a well-defined target population. List one way that the researchers’ methods of selecting study participants could result in untrustworthy conclusions about the relationships between risk perception and the other variables in the study (e.g., with age, gender, or world view). Assume the wording of questions is not a problem, i.e. focus on the data collection, not the question wording.

Any answer mentioning selection bias, such as volunteers may have different opinions than general population, college students may have different opinions, the researchers may have picked people of certain types, etc.

4. Comparisons of the perceived risks of living near a nuclear power plant (NUC) to those of being on a police SWAT team (SWAT).

a) (2 points) Estimate the median for the difference variable, “NUC minus SWAT.”

Any answer in 1 – 9 full credit.

b) (3 points) Give an interval for the average ranking of SWAT in the target population of all people affiliated with the UCLA psychology department. Use a 95% confidence level.

72.25 ± 1.96 (22.60/ square root(611))
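For readers who want the endpoints, the interval above can be evaluated as follows. This is a small sketch that plugs in the sample size, mean, and SD given on the data pages.

```python
import math

n = 611
mean_swat, sd_swat = 72.25, 22.60

# Standard error of the sample average, then the 95% margin of error.
se = sd_swat / math.sqrt(n)
margin = 1.96 * se
ci = (mean_swat - margin, mean_swat + margin)
print(f"{ci[0]:.2f} to {ci[1]:.2f}")
```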

c) (14 points) Test whether people in this target population on average rate SWAT and NUC differently. Write the null and alternative hypotheses, the value of the test-statistic, the p-value, and the conclusions. Write conclusions in at most two sentences using language that someone who doesn’t know statistics would understand. Consider p-values less than 0.10 as small.

We use the variable “NUC – SWAT” as the data, since this is a matched pairs setting (both variables are measured on the same person.) The null hypothesis is that people do not rate NUC and SWAT differently. Alternative hypothesis is that they do rate them differently.

z-stat = (5.61 – 0) / (29.6/root(611)) = 4.68.

The p-value for this z-stat is very small, less than .0001. Hence, we reject the null hypothesis. The difference in ratings of NUC and SWAT seen in these data is not likely to be due to chance, so there is evidence that people rate them differently.
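The matched-pairs z-test above can be reproduced in a few lines. This sketch assumes a two-sided p-value taken from the standard normal curve, computed with Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

n = 611
mean_diff, sd_diff = 5.61, 29.60  # NUC minus SWAT

se = sd_diff / math.sqrt(n)
z = (mean_diff - 0) / se
# Two-sided p-value: area in both tails beyond |z|.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), p_value)
```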

d) (5 points) In addition to random sampling, what condition must hold for the test in part c to be valid? Do you think this condition holds in these data? Make sure to address both questions in your answer.

The central limit theorem must hold. Since the sample size is large (611 people), and since the histogram of the differences is symmetric without any major outliers, the CLT should hold.

5. Comparisons of risk perceptions across gender for DOC, POOL, and PLANE.

a) (3 points) Give an interval for the difference in the population average of DOC for women minus the population average of DOC for men. Use a 99% confidence level.

(23.15 – 21.09) ± 2.57 square root( (SD for women)^2 / 385 + (SD for men)^2 / 226 ), using the SDs of DOC by gender from page 4

= 2.06 ± 2.57 (1.84)

b) (3 points) Is there enough evidence to say with 99% confidence that, in this target population, on average women consider being a doctor in a rural area to be a riskier activity than men consider it to be? Justify your answer based on part a.

No, there is not. The 99% CI includes negative and positive numbers, so that we can’t really be confident that women have a higher average than men.

c) (2 points) The summary statistics for POOL by gender and for PLANE by gender are not reported on page 4. They are reported here in two tables. Write the variable name (POOL or PLANE) that corresponds to each set of statistics to the right of each table.

| |Mean |Std Dev |

|Women |38.46 |26.15 |

|Men |31.45 |27.30 |

PLANE

| |Mean |Std Dev |

|Women |25.73 |22.60 |

|Men |21.91 |22.60 |

POOL

d) (7 points) Which null hypothesis has stronger evidence against it in these data: (i) the population average of PLANE for women is equal to the population average of PLANE for men; or, (ii) the population average of POOL for women is equal to the population average of POOL for men? Defend your answer with statistical arguments.

There is more evidence against the (i) hypothesis. The z-stat for the hypothesis test associated with PLANE is larger than the z-stat for the hypothesis test associated with POOL, which makes it have a smaller p-value. You can calculate the z-stats, or you can say that the SDs for PLANE and POOL are similar, but the difference in the averages for PLANE is nearly twice as large as the difference in the averages for POOL.
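The two z-stats mentioned in the answer can be computed directly from the tables in part c. This is a sketch assuming the usual SE for a difference of two independent averages; the helper name `two_sample_z` is illustrative.

```python
import math

N_WOMEN, N_MEN = 385, 226


def two_sample_z(mean_w, sd_w, mean_m, sd_m):
    """z-stat for the null hypothesis of equal population averages."""
    se = math.sqrt(sd_w**2 / N_WOMEN + sd_m**2 / N_MEN)
    return (mean_w - mean_m) / se


z_plane = two_sample_z(38.46, 26.15, 31.45, 27.30)
z_pool = two_sample_z(25.73, 22.60, 21.91, 22.60)
print(round(z_plane, 2), round(z_pool, 2))
```

The larger z-stat belongs to PLANE, which is why hypothesis (i) has the stronger evidence against it.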

6. World views of different groups

a) (3 points) Give an interval for the percentage of women in this target population who are egalitarian. Use a 95% confidence level.

38/385 ± 1.96 square root (38/385 * 347/385) / square root (385)
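The interval above can be evaluated numerically as a check. This is a minimal sketch using the 38 egalitarian women out of 385 women.

```python
import math

egalitarian_women, n_women = 38, 385

p_hat = egalitarian_women / n_women
se = math.sqrt(p_hat * (1 - p_hat) / n_women)
margin = 1.96 * se
ci = (p_hat - margin, p_hat + margin)
print(f"{100 * ci[0]:.1f}% to {100 * ci[1]:.1f}%")
```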

b) (15 points) Test whether the population percentage of men who are egalitarians differs from the population percentage of women who are egalitarians. Show your null and alternative hypotheses, the value of the test statistic, the p-value, and conclusions. Write conclusions in at most two sentences using language that someone who doesn’t know statistics would understand. Consider p-values less than 0.10 as small.

IF YOU CANNOT DETERMINE THE SAMPLE PERCENTAGES OF EGALITARIANS, use 36% for women and 25% for men. These are made-up (not correct) percentages, and you should use the correct ones for full credit. Using the made-up percentages can earn a max of 10 points.

The percentage of men who are egalitarians equals 40/226. You get the 40 from the fact that there are 78 egalitarians, 38 of whom are women.

Null hypothesis: The percentage of women who are egalitarians equals the percentage of men who are egalitarians.

Alternative hypothesis: The percentage of women who are egalitarians is different from the percentage of men who are egalitarians.

z-stat = (40/226 – 38/385) / square root( (40/226)(186/226)/226 + (38/385)(347/385)/385 )

= 2.65

p-value = 0.008

We reject the null hypothesis. The difference in the sample percentages seen in these data is not likely to be due to chance error. The evidence suggests the population percentages are different.
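The test statistic and p-value can be reproduced as follows. This sketch assumes the SE is built from the two separate sample percentages (not a pooled percentage), which is the choice that matches the reported z-stat of 2.65.

```python
import math
from statistics import NormalDist

p_men, n_men = 40 / 226, 226
p_women, n_women = 38 / 385, 385

# SE for a difference of two independent sample proportions (unpooled).
se = math.sqrt(p_men * (1 - p_men) / n_men + p_women * (1 - p_women) / n_women)
z = (p_men - p_women) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 3))
```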

c) (5 points) A chi-squared test is performed to test whether or not race and world view are independent. The chi-squared test statistic equals 69.0, and the p-value is less than 0.0001. Explain (i) how to interpret the p-value, and (ii) your conclusion about the relationship between race and world view. Make sure to explain both (i) and (ii) in your answer.

(i) If race and world view were independent, there would be less than a 0.0001 chance of getting a chi-squared test statistic as or more extreme than 69.0. (ii) Results this extreme are very unlikely to arise by chance alone, so we conclude that the evidence suggests a relationship between world view and race.

d) (5 points) How much does the entry for Mexican-American hierarchicalists contribute to the value of the chi-squared test statistic? That is, calculate the piece of the chi-squared test statistic derived from this cell of the relevant contingency table.

The expected count equals (98)(140)/611 = 22.45. The chi-squared test statistic piece equals:

(35 – 22.45)^2 / 22.45 = 7.02

Expected counts rounded to 22 also were accepted, although 22.45 should be used.
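The cell contribution, and the full test statistic quoted in part c, can be recomputed from the race-by-world-view counts on the data pages. This is a sketch; the dictionary layout and the helper name `contribution` are chosen here for illustration.

```python
# Observed counts from the race-by-world-view table.
# Columns: Unclassifiable, Individualist, Hierarchicalist, Egalitarian.
observed = {
    "Caucasian": [79, 10, 24, 45],
    "African-American": [111, 10, 16, 10],
    "Mexican-American": [91, 9, 35, 5],
    "Taiwanese-American": [103, 22, 23, 18],
}

row_totals = {race: sum(counts) for race, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())


def contribution(race, col):
    """(observed - expected)^2 / expected for one cell of the table."""
    expected = row_totals[race] * col_totals[col] / grand_total
    return (observed[race][col] - expected) ** 2 / expected


# Column index 2 is Hierarchicalist.
cell = contribution("Mexican-American", 2)
chi_squared = sum(contribution(race, col) for race in observed for col in range(4))
print(round(cell, 2), round(chi_squared, 1))
```

Without rounding the expected count to 22.45, the cell contribution differs from the hand calculation only in the second decimal place.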

7. Some conceptual questions on the results.

a) (4 points) For each of the following data points, what happens to the correlation between NUC and SWAT when it alone is added to the data? Circle the appropriate answer for each point.

SWAT NUC Effect on correlation when data point is added (circle one answer for each)

100 100 increases slightly

0 0 increases slightly

0 100 decreases slightly

100 0 decreases slightly

Adding points that are (large, large) or (small, small) makes the correlation larger, whereas adding points that are (large, small) or (small, large) makes it smaller.

b) (6 points) Suppose you sample thirty more men and thirty more women. Amazingly, all thirty men rank DOC as a 21, and all thirty women rank DOC as a 23. Using all 611+60 = 671 people, you obtain correctly the p-value for a test for the difference in average DOC score between men and women. Here’s the question: would this p-value be larger or smaller than the p-value for a similar hypothesis test based on only the original 611 people? Justify your answer, using numerical arguments in your defense.

The averages wouldn’t change much at all. But, the SDs for both men and women would decrease since we’re adding points all very close to the average values. Plus, the sample size increases. Hence, the SE decreases. This makes the z-statistic increase, which in turn means the p-value is lower.

Full credit given to calculations of approximate z-stats, followed by the logic to get to a lower p-value.

d) (5 points) Suppose you make a contingency table of just Caucasians and Taiwanese Americans who are unclassifiable or individualists. The table is thus:

| |Unclassifiable |Individualist |

|Caucasians |79 |10 |

|Taiwanese Americans |103 |22 |

Add 10 people however you want so that there is clearly no association between race and world view. For your answer, fill in the contingency table below, showing your new counts.

Many answers work. The goal is to make the conditional frequencies as similar as possible. For example, in the data table, 10/89 = 11% of Caucasians are individualists, and 22/125 = 17.6% of Taiwanese Americans are individualists. If we add 7 individualist Caucasians and 3 unclassifiable Caucasians, then we get 17/99 = 17.2% of Caucasians being individualists. This is close to 17.6%. This type of contingency table would clearly imply no association.

We accepted any answer that made the row (or column) percentages in the table similar.
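The row percentages in the example answer can be checked with a short script. This sketch uses the specific choice described above (7 individualist and 3 unclassifiable Caucasians added); other choices can be tried by changing the counts.

```python
def pct_individualist(unclassifiable, individualist):
    """Percentage of a row's participants who are individualists."""
    return 100 * individualist / (unclassifiable + individualist)


# Original 2x2 table: (Caucasians, Taiwanese Americans).
before = (pct_individualist(79, 10), pct_individualist(103, 22))
# After adding 3 unclassifiable and 7 individualist Caucasians.
after = (pct_individualist(79 + 3, 10 + 7), pct_individualist(103, 22))

print([round(p, 1) for p in before])
print([round(p, 1) for p in after])
```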

8. If you used these dice in Vegas…well, let’s just say I wouldn’t recommend it.

“Ace-six flats” are a type of crooked dice where the cube is shortened in the one-six direction, the effect being that the 1s and the 6s are more likely than 2s, 3s, 4s, and 5s. Suppose that

Pr(roll a 1) = Pr(roll a 6) = 1/4, and Pr(roll a 2) = Pr(roll a 3) = Pr(roll a 4) = Pr(roll a 5) = 1/8.

For the ace-six flats dice described, the chance that the sum of two dice is 7 equals 0.1875. For regular, fair six-sided dice, the chance that the sum of two dice is 7 equals 0.1667.

a) (5 points) You can choose to roll two ace-six flats dice 1000 times, or to roll two regular dice 100 times. If you roll more than 20% sevens, you win one million dollars. Which choice gives you the better chance of winning the million dollars? Justify your answer.

You should choose the fair dice. You can find the chances using the central limit theorem. The z-stat for rolling more than 20% sevens with the ace-six flats dice equals:

z = (.20 - .1875) / square root (.1875 * .8125 / 1000) = 1.01.

The z-stat for rolling more than 20% sevens with the fair dice equals:

z = (.20 - .1667) / square root (.1667 * .8333 / 100) = 0.89.

Since there is more area under the normal curve to the right of 0.89 than there is to the right of 1.01, there is a higher chance of rolling more than 20% sevens using the regular dice.
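The comparison of the two options can be made concrete by computing both tail areas. This is a sketch of the normal approximation used above; the function name `chance_more_than` is illustrative.

```python
import math
from statistics import NormalDist


def chance_more_than(threshold, p, n_rolls):
    """Normal-approximation z-stat and chance that the sample
    proportion of sevens exceeds the threshold."""
    se = math.sqrt(p * (1 - p) / n_rolls)
    z = (threshold - p) / se
    return z, 1 - NormalDist().cdf(z)


z_flats, chance_flats = chance_more_than(0.20, 3 / 16, 1000)
z_fair, chance_fair = chance_more_than(0.20, 1 / 6, 100)
print(round(z_flats, 2), round(z_fair, 2))
print(chance_fair > chance_flats)
```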

b) (4 points) In the casino game craps, you roll two dice. You win if the sum of the two dice is a seven or an eleven. You roll a pair of dice one time. Calculate the chances that you win with (i) the ace-six flats dice, and (ii) fair dice. Show the chances and your work for both types of dice.

Ace-six flats: .1875 + (1/8)(1/4) + (1/4)(1/8) = 0.25

Fair dice: .1667 + (1/6)(1/6) + (1/6)(1/6) = .222
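These chances can be verified by enumerating all 36 ordered outcomes of two dice. The sketch below uses exact fractions; the names `FLATS`, `FAIR`, and `chance_of_sum` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Face probabilities for each type of die.
FLATS = {1: Fraction(1, 4), 6: Fraction(1, 4),
         **{face: Fraction(1, 8) for face in (2, 3, 4, 5)}}
FAIR = {face: Fraction(1, 6) for face in range(1, 7)}


def chance_of_sum(die, targets):
    """Chance that two independent rolls of `die` sum to a value in `targets`."""
    return sum(die[a] * die[b]
               for a, b in product(die, repeat=2)
               if a + b in targets)


print(chance_of_sum(FLATS, {7}), chance_of_sum(FAIR, {7}))
print(chance_of_sum(FLATS, {7, 11}), chance_of_sum(FAIR, {7, 11}))
```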

c) (5 points) Pretend that you are the owner of the casino. You see a gambler who you suspect is using ace-six flats dice rather than regular ones. She has played 100 times and obtained 30 wins by throwing a seven or eleven on the first roll of the dice. For the ace-six-flats dice, calculate the chance she would get at least 30 wins. Show work.

Use the central limit theorem to calculate the chances. The 30 is a sum, so we use the SE for a sum here.

z = (30 – 25) / [ square root(100) * square root(.25 * .75) ]

= 1.15

Area under the normal curve to the right of 1.15 equals 12.5%.
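The same normal approximation can be run in code. This is a sketch that keeps the unrounded z-stat, so the tail area may differ slightly from the table lookup at 1.15.

```python
import math
from statistics import NormalDist

n_games, p_win = 100, 0.25  # ace-six flats: chance of 7 or 11 on one roll

expected_wins = n_games * p_win
# SE for a sum of n_games independent 0/1 outcomes.
se_sum = math.sqrt(n_games) * math.sqrt(p_win * (1 - p_win))
z = (30 - expected_wins) / se_sum
chance = 1 - NormalDist().cdf(z)
print(round(z, 2), round(chance, 3))
```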

d) (1 point) Do you think the person in part c is using the ace-six flats dice or the fair dice? Very briefly say why.

I suspect it is the ace-six flats dice, because they have a better chance of coming up 7 or 11, and she has won more often than one would expect with either type of dice. We accepted any answer getting to this idea.

9. Come on… be a Bayesian. Everyone is doing it.

A Stat 101 savvy student seeks to learn about the percentage of current Duke undergraduate students who have jobs this summer. He surveys a random sample of 50 Duke undergraduate students. Of the 50 students, 35 say that they have a summer job, and 15 say they do not. Because the sample size is reasonably large, the student uses a normal curve to approximate the likelihood function, with mean and SE based on those from the data.

Based on information from the Duke career counselors, the student has a prior belief that the percentage of Duke students with summer jobs will be around 50%, give or take 10%. To represent his prior beliefs, he uses a normal curve with a mean of 0.50 and an SD of 0.10. He then proceeds to use Bayesian statistics to estimate the percentage of current Duke students who have summer jobs.

a) (8 points) The graph below shows four curves: the likelihood function, the prior distribution, the posterior distribution, and a completely unrelated curve. Label the likelihood function with the phrase “Like”, the prior with the phrase “Prior”, the posterior with the phrase “Post”, and the unrelated curve with the phrase “Unrel.” Write labels at the top of the curve.

The leftmost curve is the prior distribution. This is centered at 0.50, which is the prior mean. The rightmost curve is the likelihood curve. It is centered at 35/50 = 0.7, which is the maximum likelihood estimate. It also is sharply peaked, because the SE for this curve is 0.065 (obtained from square root ( (35/50)(15/50)/50) ).

The curve closest to the likelihood curve is the posterior curve. It is centered at about 0.64, which is the weighted average of the likelihood mean (0.70) and the prior mean (0.50), with weights proportional to 1/SE^2 and 1/SD^2 respectively. Also, it is the narrowest curve. As discussed in class and in the Supplement, the SD of the posterior curve is always less than the SD of the prior curve, and also less than the SD of the likelihood curve.

The curve closest to the prior curve is the unrelated curve. It is centered at an arbitrary value, and its SD is very similar to the prior SD.
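The positions and widths of the three related curves can be worked out numerically. This sketch assumes the standard update for a normal prior combined with a normal likelihood, in which the posterior mean is a precision-weighted average; that the course Supplement uses exactly this formula is an assumption.

```python
import math

prior_mean, prior_sd = 0.50, 0.10
like_mean = 35 / 50
like_se = math.sqrt(like_mean * (1 - like_mean) / 50)

# Precision = 1 / variance; precisions add under the normal-normal update.
prior_precision = 1 / prior_sd**2
like_precision = 1 / like_se**2
post_precision = prior_precision + like_precision

post_mean = (prior_precision * prior_mean + like_precision * like_mean) / post_precision
post_sd = math.sqrt(1 / post_precision)
print(round(like_se, 3), round(post_mean, 3), round(post_sd, 3))
```

Under that assumption the posterior sits between the prior mean and 0.70, closer to 0.70, and is narrower than both the prior and the likelihood.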

b) (3 points) Can you determine the maximum likelihood estimate of the population percentage of Duke students who have a summer job? For your answer, either write your estimate, or say you need more information.

35/50 = 0.70

c) (4 points) Suppose the student wants to reduce the influence of his prior beliefs on the estimates. Which of the following steps would accomplish this? Circle ALL the steps that would reduce the influence of his prior beliefs.

Increase the sample size. This makes the SE go down, which diminishes the effect of the prior beliefs.

Increase the SD of the normal curve for the prior beliefs. This makes the prior beliefs more vague, which will diminish the impact of the prior beliefs.
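Both steps can be illustrated numerically. This is a sketch under the normal-prior, normal-likelihood update (an assumption about the formula used in the course); the larger sample size of 500 and the wider prior SD of 0.30 are hypothetical values chosen only for illustration, with the sample proportion held at 0.70.

```python
def posterior_mean(n, prior_sd, sample_prop=0.70, prior_mean=0.50):
    """Posterior mean under a normal prior and a normal likelihood."""
    like_var = sample_prop * (1 - sample_prop) / n
    prior_var = prior_sd**2
    # Weight on the data grows as like_var shrinks or prior_var grows.
    weight_on_data = prior_var / (prior_var + like_var)
    return weight_on_data * sample_prop + (1 - weight_on_data) * prior_mean


baseline = posterior_mean(n=50, prior_sd=0.10)
bigger_sample = posterior_mean(n=500, prior_sd=0.10)  # hypothetical larger survey
vaguer_prior = posterior_mean(n=50, prior_sd=0.30)    # hypothetical wider prior
print(round(baseline, 3), round(bigger_sample, 3), round(vaguer_prior, 3))
```

In both variations the posterior mean moves closer to the sample proportion, which is what a weaker influence of the prior looks like.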

10. True or False (6 points per part).

For each statement, if you think the statement is always true, just say it is true. If you think the statement is always false or sometimes false, say it is false and explain why or when it is false in two or fewer sentences.

a) When the p-value is 0.015, you should reject the null hypothesis because there is only a 1.5% chance that the null hypothesis is true.

FALSE. The p-value is not the probability that the null hypothesis is true. It is the chance of seeing a value of the test statistic as or more extreme than what was observed in the data, assuming the null hypothesis is true.

b) In the study on risk perception presented on this exam, the correlation between DOC and RACE is a valid summary of the association between DOC and RACE. (The correlation is not shown, but you still can answer the question as true or false.)

FALSE. Correlations are not appropriate for RACE, which is a nominal variable and not a continuous one.

c) The senior survey at Duke is sent to all 1500 seniors, who are asked to respond to various questions about their Duke experience. Out of the 500 seniors who return it, 400 cite parking as a “serious problem that negatively affected my Duke experience.” True or false: A 95% confidence interval for the percentage of Duke seniors rating parking as a serious problem is: 0.80 ± .035.

FALSE. There is serious nonresponse in the data collection. The 500 respondents likely are not representative of all Duke seniors, so the confidence interval is not valid.

d) In the U.S., a 95% confidence interval for the difference in population average years of education for men and population average years of education for women stretches from 0.25 to 0.29. True or false: The p-value for a hypothesis test of whether there is any difference in the population averages for men and women is smaller than 0.05.

TRUE. (Explanation not needed, but the idea is that since the CI doesn’t cover zero, it indicates a real difference between men and women. Hence, the p-value for a similar hypothesis test will be small.)
