CLRES 2020 Lab 1



CLRES 2020, Lab 2

Tuesday 1pm-5pm, July 19, 2004

GSCC 126

Instructors:

Joyce Chang, PhD

Doris Rubio, PhD

Maria Mor, PhD

Teaching Assistants:

Fiona Callaghan MA MS

Bill Clark

David Corcoran

Vinay Mehta

Goals for Lab 2

1. Normal Distribution.

2. Confidence interval for the sample mean, known standard deviation of the population.

3. t-distribution.

4. Confidence interval for the sample mean, unknown standard deviation of the population.

5. Confidence interval for the estimate of the variance

6. One-sample t-tests.

Instructions – How to follow this lab sheet.

Whenever you see a check-mark [pic] that means that you are required to perform some action. Whenever some words are in this font it means that this is a command that you should type in the command window of STATA. And whenever you see a > it refers to going through a series of drop-down menus, as in “All Programs>Mathematics>STATA”. There are generally two ways to do most things in STATA: using commands that you type in the command window, or using drop-down menus, as in SPSS. Whenever possible, we will give you both ways of doing things in STATA, but you are only required to do it the way you feel most comfortable. On the back of this handout are some questions that you are required to fill in.

The questions that you have to answer to get credit for this lab are enclosed in a box like this.

You will answer these questions as you go through the lab and hand them in at the end for credit, so remember to write your name on them! If you experience trouble at any time, just raise your hand to let a TA or an instructor know that you need help. Let’s get started!

Getting Started

First we will log on to the computer. To do this you will need your University of Pittsburgh user id and your password.

✓ You should see a space on the screen to enter your user id. Type it in and press return.

✓ Now enter your password and press return. You should now be logged on to the computer.

We will open a folder in which to save our work, and then we will open STATA and enter a data set into STATA.

✓ Right-click somewhere on the desktop and select “New Directory”. Name your folder “Lab2”, or some other name that makes sense to you. We will save all our work in this folder.

✓ Go to the web page:

✓ Scroll down to find the data sets and right-click on “calcium.dta” and select “Save Link As” and save the file in “/scratch/username/Desktop/Lab2”. To do this, double click on “Desktop” and then “Lab2” in the main window (you should only have to do this once; the computer will remember where you are saving your files later on). Click “Save”. The “username” is your University of Pittsburgh email id (the part of your University of Pittsburgh email address that comes before the “@” e.g. “fmc2” is the id from the email address fmc2@pitt.edu).

✓ Now open STATA. Go to the programs icon on the bottom left of your screen (this is the “Start Applications” menu) and click. Go to the menu Mathematics>STATA. Click on STATA and STATA should start up.

✓ We wish to tell STATA to save anything we do from now on in our “Lab2” folder. To do this, in the command window type: cd "/scratch/username/Desktop/Lab2" and press return.

✓ Open a log file to save your computer session. To start a log file called “logLab2”, type log using logLab2.log and press return, or use the drop-down menus.

✓ Type use calcium in the command window, and press return. You can also enter your data using a drop down window. Go to “File>Open…” and select the calcium.dta data set and click “Open”. Your data set should now be in STATA.

✓ You should see some words in the “Variables” window – “treatment”, “begin”, “end” and “decrease”. Click on the Data Editor button (or type edit in the command window). You should see 4 columns of numbers and some labels at the top of those columns. Click on the red button with the white cross at the top right of the screen [pic] to get rid of the Data Editor window. If your data does not look right, ask a TA for help.

About the Data

Does increasing calcium intake reduce blood pressure? Observational studies suggest that there is a link, and that it is strongest in African-American men. Twenty-one African-American men participated in an experiment to test this hypothesis. Ten of the men took a calcium supplement for 12 weeks while the remaining 11 men received a placebo. Researchers measured the blood pressure of each subject before and after the 12-week period. The experiment was double-blind.

Datafile Name: Calcium

Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Lyle, Roseann M., et al., "Blood pressure and metabolic effects of calcium supplementation in normotensive white and black men," JAMA, 257(1987), pp. 1772-1776.

Authorization: contact authors

Description: Results of a randomized comparative experiment to investigate the effect of calcium on blood pressure in African-American men. A treatment group of 10 men received a calcium supplement for 12 weeks, and a control group of 11 men received a placebo during the same period. All subjects had their blood pressure tested before and after the 12-week period.

Number of cases: 21

Variable Names:

1. Treatment: Whether subject received calcium or placebo

2. Begin: seated systolic blood pressure before treatment

3. End: seated systolic blood pressure after treatment

4. Decrease: Decrease in blood pressure (Begin - End)

The Normal Distribution

Much of the introductory statistics that we learn in this course is based on the assumption that the underlying data is normally distributed. Real data sets are usually not perfectly symmetrical or ‘normal’, but they are often close enough to the ideal for our purposes. Assuming normality of our data allows us to make comparisons with populations that we know are normally distributed.

✓ Type the command hist begin, normal bin(6) and press return. This is a histogram of the variable “begin” but with a normal plot printed over the graph. This helps us to compare the data to a normal distribution with the same mean and standard deviation.

✓ Type summarize begin and press return.

Question 1: What are the mean and standard deviation of begin?

Question 2: Does the distribution of begin look normally distributed? (We will assume that the population that this data comes from is normally distributed with the same mean and standard deviation as our sample for the rest of the lab).

You should find that the mean and standard deviation for “begin” are 114.048 and 9.708 (rounded), respectively. If we assume that the overall beginning blood pressure for the total population of African-American males is distributed normally with mean 114.048 and standard deviation 9.708, we can make inferences about African-American male blood pressure for subjects outside our study. Below are some worked examples.

Example 1

What is the probability that a subject in the population would have a blood pressure less than 110?

We want to find P(X < 110). Firstly we convert our x-value into a z-value: z = (110-114.048)/9.708. We can do this using STATA:

✓ Type display (110-114.048)/9.708 and press return. The answer is -0.417.

The area (probability) that we are trying to find is highlighted below:

[pic]

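✓ To find P(Z < -0.417), type display norm(-0.417) and press return. This should give the answer 0.34 (rounded).

Example 2

What is the probability that a subject in the population would have a blood pressure greater than 120?

✓ Convert the x-value into a z-value: type display (120-114.048)/9.708 and press return. The answer is 0.6131.

✓ To find P(Z > 0.6131) = 1 - P(Z < 0.6131), type display 1-norm(0.6131) and press return. This should give the answer 0.27. So the answer is 27% or 0.27.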
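As an optional cross-check outside STATA, the two tail calculations above (the lower tail for Example 1 and the upper tail 1 - P(Z < 0.6131)) can be sketched in Python. This is only a sketch: it assumes Python 3.8 or later, where statistics.NormalDist is available, and the helper name z_score is made up for this illustration.

```python
from statistics import NormalDist

# Population assumed normal with the sample's mean and sd (values from the lab).
MEAN, SD = 114.048, 9.708
std_normal = NormalDist()  # standard normal: mean 0, sd 1


def z_score(x, mean=MEAN, sd=SD):
    """Convert an x-value into a z-value (hypothetical helper, not a STATA command)."""
    return (x - mean) / sd


# Lower tail: P(X < 110) = P(Z < z), the same as norm(z) in STATA.
z_110 = z_score(110)
p_below_110 = std_normal.cdf(z_110)

# Upper tail: P(Z > 0.6131) = 1 - P(Z < 0.6131).
p_above = 1 - std_normal.cdf(0.6131)

print(round(z_110, 3), round(p_below_110, 2), round(p_above, 2))
```

The lower tail comes straight from the cumulative probability, while the upper tail is one minus it, which is the same pattern as display 1-norm(0.6131).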

Example 3

What is the probability that a subject has a blood pressure between 105 and 120?

Calculate the z-scores:

✓ Type display (105-114.048)/9.708 and press return. Type display (120-114.048)/9.708 and press return. The z-scores should be -0.932 and 0.613 (rounded).

The area (probability) that we are trying to calculate is highlighted below:

[pic]

✓ Type display norm(0.613)-norm(-0.932) to get P(-0.932 < Z < 0.613), the probability that a blood pressure falls between 105 and 120.

[…] P(X > 130.0162) = 0.05 and we also have P(X < 130.0162) = 0.95.
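The "between" probability for Example 3 and the value 130.0162 quoted above can both be checked with a short Python sketch. It assumes Python 3.8 or later and works directly on the blood-pressure scale, so no manual conversion to z-values is needed; the variable names are illustrative only.

```python
from statistics import NormalDist

# Assumed population: normal with mean 114.048 and sd 9.708 (from the lab).
blood_pressure = NormalDist(mu=114.048, sigma=9.708)

# P(105 < X < 120) = P(X < 120) - P(X < 105)
p_between = blood_pressure.cdf(120) - blood_pressure.cdf(105)

# The x-value with 95% of the population below it (the 95th percentile).
x_95 = blood_pressure.inv_cdf(0.95)

print(round(p_between, 2), round(x_95, 4))
```

Subtracting two cumulative probabilities mirrors display norm(0.613)-norm(-0.932), and inv_cdf plays the role that invnorm plays in STATA, followed by converting the z-value back to an x-value.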

Question 7: Calculate the 99th percentile of the standard normal distribution z0.99

Question 8: What value of x do 1% of males in this population have a blood pressure equal to or greater than? i.e. Find the “?” in P(X < ?) = 0.99. Sketch a normal curve and plot the X-value that you found and shade in the area that corresponds to the highest 1% of blood pressures.

Question 9: Suppose you define anyone with a blood pressure in the top 1% as having an unusually high blood pressure for people in this population. Is the reading for the unidentified subject who got 133 now considered “unusual”?

Note: The probability “cut-off” point that we choose to decide whether some data is “unusual” or not is often called the “alpha level”. The alpha level in Question 8 was 1%, or α = 0.01.

Confidence Intervals

No matter how well a study has been carried out or how carefully the data has been collected, there will always be some uncertainty as to how accurate our conclusions are. This is simply due to the fact that we have taken a sample of subjects, rather than recording results for every possible subject in our entire population. What statistics can do is try to quantify how much error is in our estimates, so that we at least have some idea of how accurate our results are.

The most common estimate that we are interested in is the mean of our sample. We would often like to know what the sample mean is AND a range of plausible values that we are fairly sure the real (true) population mean is between. This is a confidence interval. First we have to decide how “confident” we want to be that the true mean lies between these values. A common figure is 95%, although we can also calculate 99%, 90% or 96.4% confidence intervals if we like. Usually the higher the confidence level the better, but we must balance this against the fact that the more confident that we want to be about our interval, the wider the interval will be. We will learn about three types of confidence interval, and all need to assume that our data is normally distributed in order to be valid.

Calculating a confidence interval if we know the population’s standard deviation σ.

If we know the standard deviation of the population (from previous research) then the formula for the confidence interval for the sample mean is:

Sample mean ± z_{1-α/2} × (σ/√n)

Example

Suppose we know the standard deviation of blood pressure in the population of normotensive African-American males is 10 and we know that the population is normally distributed. Calculate a 95% confidence interval for our sample mean.

We know that σ = 10, n = 21, sample mean = 114.048, confidence = 0.95. Therefore our α = 0.05 because the alpha level is always 1-confidence level. Now, 1-α/2 = 1-(0.05/2) = 1-0.025 = 0.975. So we need to find z0.975. Using STATA,

✓ Type display 1-(0.05/2) to get 1-α/2 = 0.975.

✓ Type display invnorm(0.975)

You should find that z0.975 = 1.96. Now we can put all this information into the formula and calculate the confidence interval:

✓ Type display 114.048+1.96*(10/sqrt(21)) and press enter, and then type display 114.048-1.96*(10/sqrt(21)).

Your 95% confidence interval is (109.77, 118.33).
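The same interval can be reproduced outside STATA as a check. The sketch below assumes Python 3.8 or later; the z-value is rounded to two decimals (1.96) only so that the result mirrors the hand calculation in this lab.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, sigma, n = 114.048, 10, 21
confidence = 0.95
alpha = 1 - confidence

# z_{1-alpha/2}, rounded to 1.96 as in the lab text.
z = round(NormalDist().inv_cdf(1 - alpha / 2), 2)

# Margin of error: z times the standard error sigma / sqrt(n).
margin = z * sigma / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(round(lower, 2), round(upper, 2))
```

Changing confidence to 0.99 or 0.90 shows how the interval widens or narrows as the confidence level changes.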

Question 10: Often IQ tests and other “standardized” tests are designed to have a known standard deviation. Suppose we give an IQ test to 25 students and they have a mean of 115.4. The test is known to produce a standard deviation of 15. What is the 95% confidence interval for our sample mean?

Question 11: The population mean for this IQ test is 100. Do you think there is any evidence from the confidence interval that suggests that these students have a higher mean IQ than the general population? Explain.

STATA does not have an easy way to calculate this confidence interval. It is uncommon to know for sure what a population’s standard deviation is, so this formula does not get used much (even though it produces narrower confidence intervals). We use another formula to work out the confidence interval when we do NOT have prior knowledge about the population standard deviation. To calculate the confidence interval in this case, we use the sample standard deviation s, and we use the t-distribution, rather than the normal distribution (so we talk of “t-values” rather than “z-values”).

The t-distribution

The t-distribution looks very similar to the normal distribution, but it is usually “flatter” and has “thicker tails”. The t-distributions we are going to use all have a mean of 0 and are symmetrical around 0, just like the standard normal distribution. Also, just as there are many normal distributions depending on which mean and standard deviation we specify, there are also many different t-distributions depending on the “degrees of freedom” (df) we select. Normal distributions have 2 “parameters” (mean and sd) which determine the center and the shape of the distribution, but the t-distribution has only one parameter (df), which determines the shape only. Otherwise, we use the t-distribution in a very similar way to the normal distribution. The degrees of freedom do not have an intuitive interpretation, like the mean and standard deviation do for the normal distribution. However, for our purposes, the df is usually related to how many observations we have in our sample, n.

[pic]

Suppose we want to calculate the 95th percentile of the t distribution with 7 degrees of freedom. We denote this t0.95(7); this is similar notation to the zα for the normal distribution. STATA does not give us the percentile directly, because invttail works with the area in the upper tail:

✓ Type display invttail(7,0.95) and press enter.

You should get about -1.89. This is the value where 5% of the area of the graph is below this value and 95% is above it (see the shaded region in the graph below).

[pic]

To get the 95th percentile, we just take 1.89 = t0.95(7), because the graph is symmetrical.

Or we could type:

✓ display invttail(7, 0.05)

Either way, we still get the value where 95% of the graph is less than that point, and 5% is greater.

[pic]
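To see the symmetry argument at work outside STATA, here is a rough Python sketch that computes t percentiles numerically. Python's standard library has no t-distribution, so the helpers t_pdf, t_cdf and t_percentile below are hypothetical names written for this illustration; they use Simpson's rule and bisection, which is simply one convenient choice and not how STATA computes invttail.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T < x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the curve between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area


def t_percentile(p, df):
    """Value t with P(T < t) = p, found by bisection."""
    lo, hi = -1000.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


lower_5 = t_percentile(0.05, 7)   # corresponds to invttail(7, 0.95)
upper_95 = t_percentile(0.95, 7)  # corresponds to invttail(7, 0.05)
print(round(lower_5, 2), round(upper_95, 2))
```

The two printed values differ only in sign, which is the symmetry used above to turn -1.89 into t0.95(7) = 1.89.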

Question 12: What is the 97.5th percentile of the t distribution with 2 degrees of freedom? i.e. find t0.975(2)

Question 13: What is the 97.5th percentile of the t distribution with 1000 degrees of freedom? i.e. find t0.975(1000)

Note: As the degrees of freedom gets large, the t-distribution becomes closer and closer to a standard normal distribution (which is why t0.975(1000) is very close to z0.975 = 1.96).

Calculating a confidence interval when the population standard deviation is unknown.

If it is possible to know the sd of the population, then it is better to use the formula for a confidence interval based on normal percentiles, because we will get a narrower confidence interval. But if we do not know the population sd then we have to estimate the sd from our sample using the sample standard deviation. The formula for the confidence interval in this case is:

Sample mean ± t_{1-α/2}(n-1) × (s/√n)

The degrees of freedom for the t-distribution = the number of observations -1, and “s” is the sample standard deviation.

Example

Find the 90% confidence interval for the sample mean for the initial blood pressure of the subjects in the calcium study.

We know n = 21, s = 9.708, sample mean = 114.048. Because the confidence level is 90%, the α level is 0.10, and the df = 21-1 = 20.

✓ Type display 1-(0.1/2) which tells us that we want to find the 95th percentile.

✓ Type display invttail(20, 0.05) to get the value of t0.95(20) = 1.725

✓ Type display 114.048+1.725*9.708/sqrt(21) and then display 114.048-1.725*9.708/sqrt(21)

Your confidence interval is (110.4, 117.7). There is an easier way in STATA to calculate this confidence interval!

✓ Type ci begin, level(90) or you could also use the drop-down menus Statistics>Summaries, tables, & tests>Summary Statistics>Confidence Intervals. Type “begin” in the variable box and select the confidence level from the “Confidence Level” box.

✓ Suppose you know your data has a sample mean of 114.048 and a sample standard deviation of 9.708, and 21 observations. Then we could type: cii 21 114.048 9.708, level(90) and this would also give us a 90% confidence interval. We could also go to Statistics>Summaries, tables, & tests>Summary Statistics>Normal CI calculator and enter in our information to get the same result. The confidence interval should be very similar to the one you calculated previously (any difference is due to the fact that we have rounded our mean and sd).
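As a cross-check of the hand calculation for the 90% interval, the formula Sample mean ± t × (s/√n) can be sketched in Python. The t-value 1.725 is taken from the invttail(20, 0.05) step above rather than computed here, and note that the standard error divides by √21 (the sample size), not √20 (the degrees of freedom).

```python
from math import sqrt

n, sample_mean, s = 21, 114.048, 9.708
df = n - 1        # degrees of freedom = 20
t_crit = 1.725    # t0.95(20), taken from invttail(20, 0.05) in the lab

# Margin of error: t times the standard error s / sqrt(n).
margin = t_crit * s / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(round(lower, 1), round(upper, 1))
```

The result should agree closely with the output of ci begin, level(90); small differences come from rounding the mean, the sd and the t-value.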

Note that when STATA gives you the option of “Normal Variables” in Statistics>Summaries, tables, & tests>Summary Statistics>Confidence Intervals, you want that box to be checked, because we are assuming that our data is normally distributed. It is NOT the same as calculating a confidence interval with standard normal percentiles zα, like we did before. Also, Statistics>Summaries, tables, & tests>Summary Statistics>Normal CI calculator uses the same formula as Statistics>Summaries, tables, & tests>Summary Statistics>Confidence Intervals and they calculate the same thing; they just ask for slightly different information. To repeat, this does NOT calculate the confidence intervals we did in the previous section, because it requires you to enter the sample standard deviation, not the population standard deviation, and so it does not use zα (percentiles based on the normal distribution); both these STATA methods use t1-α/2(n-1) values instead.

If we want to calculate confidence intervals separately for the calcium group and the Placebo group then we can do the following:

✓ Type bysort treatment : ci begin, level(90) and you can also use Statistics>Summaries, tables, & tests>Summary Statistics>Confidence Intervals and click on “by/if/in” and check the box for “repeat command for groups defined by” and type “treatment”.

Question 14: What is the 99% confidence interval for the decrease?

Question 15: What is the 99% confidence interval for the decrease for each treatment group separately?

Question 16: Do any of these confidence intervals include the value 0 in their range of values? Why might this be of interest to us?

Question 17: Do the confidence intervals for each treatment group over-lap much (have values in common)? Why might this be of interest to us?

Confidence interval for the sample variance

So far, we have found two ways to calculate a confidence interval for the sample mean. We can also calculate a confidence interval for the sample variance (and hence for the sample standard deviation). We use another continuous distribution called the Chi-squared Distribution. It also has one parameter (degrees of freedom), like the t-distribution, but it is NOT symmetrical. A Chi-squared distribution with, say, 10 degrees of freedom is denoted χ²(10). Here is an example of the Chi-Squared Distribution:

[pic]

We can find percentiles of the Chi-squared distribution. The 90th percentile of the Chi-squared distribution with 7 degrees of freedom is denoted χ²0.90(7). We can find this using STATA:

✓ Type display invchi2(7,0.90) and press enter. You should get a value of 12.017. So 90% of the graph is less than this point. See the diagram below.

[pic]

The formula for the confidence interval for the sample variance is:

Upper Limit = (n-1)×s²/χ²_{α/2}(n-1)

Lower Limit = (n-1)×s²/χ²_{1-α/2}(n-1)

Example

Calculate the 95% confidence interval for the sample variance of the variable “begin”.

We know s² = 9.708² = 94.245 and n = 21, so df = n-1 = 20, and α = 0.05. We need to find the percentiles of the Chi-squared distribution.

✓ Type display invchi2(20, 1-0.05/2) and display invchi2(20, 0.05/2) which should give us the values 34.17 and 9.59.

✓ Type display 20*94.245/34.17 and display 20*94.245/9.59 which should give values of 55.16 and 196.55. This is our confidence interval.

To find the confidence interval for the standard deviation, you just take the square root of both values. There is no single command in STATA (that I could find) that gives the confidence interval for the variance (or standard deviation).
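The variance interval and its square-root step can be sketched in Python as a check. The two Chi-squared percentiles are copied from the invchi2 step above (34.17 and 9.59) rather than computed, since Python's standard library has no Chi-squared distribution; the variable names are illustrative only.

```python
from math import sqrt

n, s = 21, 9.708
df = n - 1
s2 = s ** 2  # sample variance, about 94.245

chi2_975 = 34.17  # invchi2(20, 0.975), from the lab
chi2_025 = 9.59   # invchi2(20, 0.025), from the lab

# The larger percentile gives the lower limit, the smaller one the upper limit.
var_lower = df * s2 / chi2_975
var_upper = df * s2 / chi2_025

# Confidence interval for the standard deviation: square root of both limits.
sd_lower, sd_upper = sqrt(var_lower), sqrt(var_upper)

print(round(var_lower, 2), round(var_upper, 2))
print(round(sd_lower, 2), round(sd_upper, 2))
```

Because the Chi-squared distribution is not symmetrical, the interval is not centred on s²: the upper limit lies much further from 94.245 than the lower limit does.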

Question 18: What is the 90% confidence interval for the sample variance of the “decrease” variable?

One Sample t-test (one-sided and two-sided)

We may want to investigate whether the mean that we calculate from our sample is significantly different from some known benchmark or some number of special interest. The men in this study are described as normotensive, meaning that their blood pressure is normal to begin with. Suppose that we wanted to verify that this was true and we defined “elevated” blood pressure as any value above 120. We would like to perform a test to compare the mean blood pressure of our sample against 120. So we will assume that our mean is less than or equal to 120, and try to gather evidence to disprove this hypothesis. The notation for the two competing hypotheses is given below:

Ho: μ ≤ 120 versus Ha: μ > 120

“Ho” is called the “null hypothesis” and in this case is “Our population mean is less than or equal to 120”. “Ha” is called the “alternative hypothesis” and in this case is “The population mean is greater than 120”. The point of the test is to decide between these two alternatives. We will perform what is called a “one-sided” test, because the alternative hypothesis has an inequality only going in one direction.

You will learn more about t-tests in class, but for now you need to know that we calculate a t value from our data (called a t-statistic) and then we compare it to the appropriate t-distribution to see how unusual the value is. We do this by calculating the probability of observing that t-statistic or something even more extreme, if we assume the null hypothesis. This probability value is called the “p-value”. We also decide on an α level to compare our p-value to, in order to decide whether it is small enough to be considered to come from an “unusual” event.

If our p-value is less than α, then we say that we “reject” the null hypothesis. If our p-value is larger than (or equal to) the α level, then we “fail to reject” the null hypothesis.

The logic behind the test goes something like this: if the p-value is small, this tells us that if we assume the null hypothesis is true then observing data like ours is very unlikely. Hence we conclude that there is evidence that the null hypothesis is false.
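The lab sheet leaves the formula for the t-statistic to the lectures. As a sketch, the standard one-sample t-statistic is (sample mean - hypothesized mean) / (s/√n), compared against a t-distribution with n-1 degrees of freedom. The function name t_statistic and the numbers below are hypothetical and are deliberately not taken from the calcium data.

```python
from math import sqrt


def t_statistic(sample_mean, s, n, mu0):
    """One-sample t-statistic: distance from mu0 in standard-error units."""
    return (sample_mean - mu0) / (s / sqrt(n))


# Hypothetical sample: mean 52, sd 8, 16 observations, tested against mu0 = 50.
t = t_statistic(sample_mean=52.0, s=8.0, n=16, mu0=50.0)
df = 16 - 1

print(t, df)
```

A t-statistic far from 0, in the direction of the alternative hypothesis, corresponds to a small p-value; STATA's ttest command reports both the statistic and the p-values.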

✓ Type ttest begin == 120 and press return. This will give you some summary statistics, confidence intervals for the means, and three t-tests corresponding to the three possible alternative hypotheses. The one we are interested in is the one labeled “Ha: mean > 120”.

The output in the results window should look like this:

[pic]

I have circled the part of the output that is of most interest to us. Suppose that we choose α = 0.05 for all the following tests.

Question 19: What is the t-statistic and the p-value of this test? Do we reject or fail to reject the null hypothesis? Does the evidence suggest that this group of men have normal or elevated blood pressure?

Question 20: Suppose we had Ho: μ ≥ 120 and Ha: μ < 120. What is the t-statistic and the p-value of this test? Do we reject or fail to reject the null hypothesis? Does the evidence suggest that this group of men have normal or elevated blood pressure?

A two-sided test is when we have something like this:

Ho: μ = 120 versus Ha: μ ≠ 120

Another way of writing this is:

Ho: μ = 120 versus Ha: μ < 120 or Ha: μ > 120

It is called two-sided because we are allowing two possibilities for the alternative hypothesis. We use this kind of hypothesis when we are checking whether the mean is equal to 120, but we are open to the possibility that the population mean may be greater or less than 120.

Question 21: What is the t-statistic and the p-value of this test? Do we reject or fail to reject the null hypothesis? Does the evidence suggest that this group of men have an average blood pressure of 120?

The End!

Saving the Lab

At the end of the session, follow the following procedure so that you can save any files you may want to review later on (e.g. your log file). These are the instructions if you are saving your files onto a floppy disk. If you have a zip disk, just do the same steps but with the "Zip" folder on the Desktop rather than the "Floppy" folder.

✓ Insert floppy disk (or zip disk).

✓ Right click on the "Floppy" icon on the Desktop and select "Mount". We can now save files onto this disk. If you do not “Mount” the disk, then your files may not save properly.

✓ Close your "Lab2" folder if it is open. Click on the "Lab2" icon on the Desktop and drag the whole folder to the floppy disk icon on your Desktop. You should get a small menu giving you a choice to "Move" or "Copy" the documents. Click on "Copy". Your files should now be on your floppy disk.

✓ Double click on the floppy disk icon to check that there is now a "Lab2" folder on your floppy disk.

✓ Now close the floppy disk window, and right click on the floppy disk icon and select "Unmount". You must do this in order to take your disk out of these machines and still have your files saved.

✓ Now press the button on your computer to eject the floppy disk.

It is very important to save a backup on the university computer in case something happens to the disk.

✓ Click on the “Lab2” folder icon and drag the whole folder to the “AFS” folder on your desktop. You should get a small menu giving you a choice to "Move" or "Copy" the documents. Click on "Copy". Your files are now stored on the University of Pittsburgh computer system and can be accessed from any computer with an internet connection. See the instructions below on how to access these documents from your home computer.

✓ You have finished.

Accessing the files from home from the University of Pittsburgh computer system

Here are some instructions FYI to help you access your backup copy in case there is some problem with your floppy disk or zip, when you get out of here. To access your backup copies from your home or office (Microsoft Windows!) computer do the following steps:

✓ Open Netscape Navigator or Internet Explorer. Type and go to this destination. (eg. Using my username, I would type ).

✓ After a few seconds, Internet Explorer will ask you for your University of Pittsburgh username and password. Enter these and press return.

✓ After the screen has loaded, you should see a list of files and one of them should be your “Lab2”. Just click and drag that file to wherever you want to put it on your home computer. Close Internet Explorer.

Answer Sheet – Lab2 CLRES 2020 Summer 04.

NAME and DATE:

Question 1:

Question 2:

Question 3:

Question 4:

Question 5:

Question 6:

Question 7:

Question 8:

Question 9:

Question 10:

Question 11:

Question 12:

Question 13:

Question 14:

Question 15:

Question 16:

Question 17:

Question 18:

Question 19:

Question 20:

Question 21:
