ANALYSIS OF CATEGORICAL DATA - Winona



18 - ANALYSIS OF CATEGORICAL DATA

Goodness-of-Fit Tests, 2 X 2 Tables (Comparing Two Population Proportions), Tests of Homogeneity, Tests of Independence

[pic]

GOODNESS OF FIT EXAMPLE – PEA SEEDS (A genetics experiment)

It is hypothesized that the ratio of yellow-smooth to yellow-wrinkled to green-smooth to green-wrinkled seeds is 9:3:3:1. Using the data below, do the four seed phenotypes appear to follow the hypothesized 9:3:3:1 ratio?

|Yellow Smooth |Yellow Wrinkled |Green Smooth |Green Wrinkled |Total |

|152 |39 |53 |6 |250 |

1. State Hypotheses

Ho:  The sample comes from a population having a 9:3:3:1 ratio of yellow-smooth to yellow-wrinkled to green-smooth to green-wrinkled seeds.

(i.e. $p_{YS} = 9/16$, $p_{YW} = 3/16$, $p_{GS} = 3/16$, $p_{GW} = 1/16$)

Ha:  The sample comes from a population not having a 9:3:3:1 ratio of the above four seed phenotypes. (i.e. at least one of the proportions above is not correct)

2. Pearson Chi-Square Goodness-of-Fit Test

The chi-square test statistic $\chi^2$ is given by:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where,

k = # of categories, $O_i$ = observed frequency in category i, and $E_i = n p_i$, i.e. the total sample size (n) times the hypothesized proportion for category i.

If the observed frequencies differ substantially from the expected frequencies the $\chi^2$ test statistic will be “BIG”. How do we define “BIG”? We find the probability of getting a test statistic value as extreme or more extreme than the one observed if the null hypothesis were true, i.e. we find the p-value associated with our test statistic. If the null hypothesis is true the $\chi^2$ goodness-of-fit test statistic follows a Chi-squared distribution with degrees of freedom df = $k - 1$ where k = # of categories.

Chi-Square Distribution with p-value

The larger the $\chi^2$ test statistic value, the smaller the p-value. As always, if our p-value is less than .05 (typically) we reject the null hypothesis in favor of the alternative.

3. Compute Test Statistic

Calculations for this example:

| |Yellow Smooth |Yellow Wrinkled |Green Smooth |Green Wrinkled |Total |

|Observed Frequency (O) |152 |39 |53 |6 |250 |

|Expected Frequency (E) | | | | | |
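The Expected Frequency row and the test statistic can be checked with a short script. This is only a sketch in Python, which is not the JMP workflow these notes use. It assumes the counts 152, 39, 53 and 6 for yellow-smooth, yellow-wrinkled, green-smooth and green-wrinkled, the assignment that agrees with the p-value of .0297 reported later in this example. All variable names are illustrative.

```python
# Pearson goodness-of-fit statistic for the pea seed example.
# Assumption: green-smooth = 53 and green-wrinkled = 6 (see the lead-in).
observed = {
    "Yellow Smooth": 152,
    "Yellow Wrinkled": 39,
    "Green Smooth": 53,
    "Green Wrinkled": 6,
}
ratio = {
    "Yellow Smooth": 9,
    "Yellow Wrinkled": 3,
    "Green Smooth": 3,
    "Green Wrinkled": 1,
}

n = sum(observed.values())
ratio_total = sum(ratio.values())

# Expected count for each category: n times the hypothesized proportion.
expected = {name: n * ratio[name] / ratio_total for name in observed}

# Chi-square statistic: sum over categories of (O - E)^2 / E.
chi_square = sum(
    (observed[name] - expected[name]) ** 2 / expected[name] for name in observed
)
df = len(observed) - 1

for name in observed:
    print(f"{name:16s} O = {observed[name]:4d}  E = {expected[name]:8.3f}")
print(f"chi-square = {chi_square:.3f}, df = {df}")
```

With these counts the expected frequencies are 140.625, 46.875, 46.875 and 15.625, and the statistic is about 8.97 on 3 degrees of freedom.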

4. Compute p-value (“by hand” approach)

(use either the Chi-square Probability Calculator in JMP or the Chi-square Table in book)

Table C.5 contains a Chi-square table (handed out in class). If our observed test statistic value exceeds the value in the Area in Lower Tail 0.95 column for our degrees of freedom, we reject the null hypothesis in favor of the alternative at the α = .05 level.

Area in Lower Tail

df 0.90 0.95 0.975 0.99 0.995

1 2.71 3.84 5.02 6.63 7.88

2 4.61 5.99 7.38 9.21 10.6

3 6.25 7.81 9.35 11.3 12.8

. … … … … …

Because our test statistic value is greater than ___________ we reject the null hypothesis. We can use the table to put an upper bound on the p-value by noting that the largest value our test statistic value exceeds is __________. This says that our p-value < __________.

Chi-Square Probability Calculator in JMP

To use this simply enter the test statistic value and the degrees of freedom (df) in the table and p-value will be calculated. Here we see our p-value = .0297.

[pic]
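For readers without JMP at hand, the same upper-tail probability can be approximated in plain Python. The function below is a sketch, not JMP's implementation: `chi2_upper_tail` is a hypothetical helper name, and it relies on the standard recurrence for the upper incomplete gamma function, which is valid for whole-number degrees of freedom only.

```python
import math


def chi2_upper_tail(x: float, df: int) -> float:
    """P(X >= x) for a chi-square variable with integer df and x >= 0."""
    y = x / 2.0
    if df % 2 == 0:
        a = 1.0                          # start from Q(1, y) = exp(-y)
        tail = math.exp(-y)
    else:
        a = 0.5                          # start from Q(1/2, y) = erfc(sqrt(y))
        tail = math.erfc(math.sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# Pea seed example: test statistic of about 8.972 on 3 degrees of freedom.
p_value = chi2_upper_tail(8.972, 3)
print(f"p-value = {p_value:.4f}")

# Sanity check against the table: 3.84, 5.99 and 7.81 are the 0.95 entries.
for df, critical in [(1, 3.84), (2, 5.99), (3, 7.81)]:
    print(df, round(chi2_upper_tail(critical, df), 3))
```

The first line printed should agree with the JMP calculator's .0297 to four decimals, and each table check should return roughly .05.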

5. Make Decision and Interpret

We conclude that…

Performing a Goodness of Fit Test in JMP

Entering data in JMP involves setting up two columns: a categorical/nominal column containing the seed phenotype and a numerical/continuous column containing the number of seeds of that type observed in the sample of 250 seeds. Make sure these numbers are interpreted as frequencies by using the Preselect Role... option obtained by right-clicking at the top of the column in the JMP spreadsheet. When completed the data table should look like:

[pic]

To analyze these data in JMP do the following:

First select Distribution from the Analyze menu and place Seed Type in the Y box and click OK.  The resulting bar graph and frequency distribution table are shown below.

[pic]

The sample proportions are a bit different from the hypothesized proportions, but are they far enough away to convince us the hypothesized 9:3:3:1 ratio is not followed for this population? We can obtain 95% confidence intervals for the individual proportions for each phenotype by selecting Confidence Intervals with the appropriate confidence level from the pull-down menu for Seed Type.   The resulting confidence intervals are given in the table below.

[pic]
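Intervals of this kind can be approximated outside JMP. The sketch below uses the Wilson score interval, which is an assumption on our part: the notes do not say which interval formula JMP applies, so the endpoints may differ slightly from the table above. It also assumes the counts 152, 39, 53 and 6 for yellow-smooth, yellow-wrinkled, green-smooth and green-wrinkled; `wilson_interval` is a hypothetical helper name.

```python
import math

counts = {
    "Yellow Smooth": 152,
    "Yellow Wrinkled": 39,
    "Green Smooth": 53,
    "Green Wrinkled": 6,
}
hypothesized = {
    "Yellow Smooth": 9 / 16,
    "Yellow Wrinkled": 3 / 16,
    "Green Smooth": 3 / 16,
    "Green Wrinkled": 1 / 16,
}
n = sum(counts.values())
z = 1.959964  # 97.5th percentile of the standard normal, for 95% confidence


def wilson_interval(x: int, n: int, z: float) -> tuple[float, float]:
    """Wilson score interval for a single proportion x / n."""
    p_hat = x / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Collect the phenotypes whose interval misses the hypothesized proportion.
excluded = []
for name, x in counts.items():
    lo, hi = wilson_interval(x, n, z)
    if not lo <= hypothesized[name] <= hi:
        excluded.append(name)
    print(f"{name:16s} {x / n:.4f}  ({lo:.4f}, {hi:.4f})  hyp = {hypothesized[name]:.4f}")
print("Intervals missing the hypothesized value:", excluded)
```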

It is interesting to note that the confidence interval for the proportion of green-wrinkled is the only one that does not contain its hypothesized proportion (1/16 = .0625).  Finally, to obtain the goodness of fit test for these data select Test Probabilities from the pull-down menu for Seed Type.   The table below will then appear in the output window.  We need to enter hypothesized values for the probability of each phenotype in the column labeled Hypoth Prob (9/16 = .5625, 3/16 = .1875, 1/16 = .0625).

[pic]

When you are finished entering the hypothesized probabilities click the Done button and the following test information will be displayed.

[pic]

The p-value for the Pearson Chi-Square Goodness-of-Fit test is less than .05 so we reject the null hypothesis which states that the four seed phenotypes occur in a 9:3:3:1 ratio.  

COMPARING TWO POPULATION PROPORTIONS EXAMPLE -

AGE AT FIRST PREGNANCY AND CERVICAL CANCER (a Case-Control Study)

These data come from a case-control study to examine the potential relationship between age at first pregnancy and cervical cancer.  In a case-control study a random sample of cases (i.e. people with the disease in question) and a random sample of controls (i.e. people similar to those in the case group, except they do not have the disease) are selected, and the proportion of people with some potential risk factor is compared across the two groups.  In this study we will be comparing the proportion of women who had their first pregnancy at or before the age of 25, because researchers suspected that an early age at first pregnancy leads to increased risk of developing cervical cancer.  The data are presented in the table below:

| |Age at First Pregnancy ≤ 25 (risk factor present) |Age at First Pregnancy > 25 (risk factor absent) |Row Totals (fixed) |

|Cervical Cancer (Case) |42 |7 |49 |

|Control |203 |114 |317 |

|Column Totals (random) |245 |121 |n=366 |

Ho: There is no association between age at first pregnancy and cervical cancer, i.e. age at first pregnancy and cervical cancer are independent.

or

The distribution of the risk factor is the same for both cases and controls.

or

The proportion of women with the risk factor is the same for both groups, i.e. $p_{\text{case}} = p_{\text{control}}$.

Ha: There IS an association between age at first pregnancy and cervical cancer, i.e. age at first pregnancy and cervical cancer are NOT independent.

or

The distribution of the risk factor is NOT the same for both cases and controls.

or

The proportion of women with the risk factor is not the same for both groups, i.e. $p_{\text{case}} \neq p_{\text{control}}$.

Development of the Test Statistic

If the null hypothesis is true we expect the proportion of women with the risk factor in the case and control groups to be the same. We can also think of this in terms of conditional probabilities. Two events A and B are said to be independent if

$$P(A \mid B) = P(A)$$

i.e. knowledge about the occurrence of B tells you nothing about the occurrence of A.

Consider the following generic representation of the contingency table for this example.

| |1st Preg. Age ≤ 25 (Risk Factor Present) |1st Preg. Age > 25 (Risk Factor Absent) |ROW TOTALS |

|Case (Disease Present) |a |b |R1=(a+b) |

|Control (Disease Absent) |c |d |R2=(c+d) |

|COLUMN TOTALS |C1=(a+c) |C2=(b+d) |n |

From this table we can calculate the conditional probability of having the risk factor given that the subject has the disease as follows:

P(Risk|Disease) = $\frac{a}{a+b} = \frac{a}{R_1}$

which, if risk and disease are independent, should be equal to

P(Risk) = $\frac{a+c}{n} = \frac{C_1}{n}$

Setting these two expressions equal to one another gives:

$\frac{a}{R_1} = \frac{C_1}{n} \Rightarrow a = \frac{R_1 C_1}{n}$.

This gives what we expect a, the number of women that have the risk factor in the disease group, to be if the null hypothesis is true.

In a similar fashion we could find what we expect c, the number of women with the risk factor in the control group, to be. If the null hypothesis is true we expect

$\frac{c}{R_2}$ to be equal to $\frac{C_1}{n}$ $\Rightarrow$ We expect $c = \frac{R_2 C_1}{n}$.

We can also look at absence of the risk factor in the same way, which gives the following expected values for b and d.

$b = \frac{R_1 C_2}{n}$ and $d = \frac{R_2 C_2}{n}$

Notice there is a general pattern here: the expected value for the frequency in the ith row and the jth column of the table is found by taking the row total for that row ($R_i$) times the column total for that column ($C_j$) and then dividing by the total sample size (n), i.e.

$$E_{ij} = \frac{R_i C_j}{n}$$.
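The row-total-times-column-total rule is easy to apply to the cervical cancer table. The following is a minimal Python sketch using the observed counts from that table; the variable names are illustrative and nothing here is JMP output.

```python
# Observed 2 x 2 table: rows = (case, control),
# columns = (first pregnancy at or before 25, first pregnancy after 25).
observed = [
    [42, 7],
    [203, 114],
]

row_totals = [sum(row) for row in observed]          # R1, R2
col_totals = [sum(col) for col in zip(*observed)]    # C1, C2
n = sum(row_totals)

# Expected count in row i, column j under independence: R_i * C_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["Case", "Control"], expected):
    print(label, [round(e, 2) for e in row])
```

The expected counts come out near 32.80 and 16.20 for the cases and 212.20 and 104.80 for the controls. They add up to the same row and column totals as the observed counts.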

Our test statistic looks at the difference or discrepancy between what we observe when our data is collected and what we expect to see if the null hypothesis of independence is true. Intuitively, if the observed frequencies are far away from what we expect to see if the variables in question were independent, then we will reject the null hypothesis and conclude a significant relationship between the variables exists.

Pearson’s Chi-Square Test Statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where the expected frequencies for the cells are given by the formula:

$E_{ij} = \frac{R_i C_j}{n}$ , $i = 1, \dots, r$; $j = 1, \dots, c$

If the observed frequencies differ substantially from the expected frequencies the $\chi^2$ test statistic will be “BIG”. How do we define “BIG”? We find the probability of getting a test statistic value as extreme or more extreme than the one observed if there was truly no association between the two variables in question, i.e. we find the p-value associated with our test statistic. If the null hypothesis is true the $\chi^2$ test statistic follows a Chi-squared distribution with degrees of freedom df = $(r - 1)(c - 1)$. Here, r = # of rows and c = # of columns in the contingency table. Again, the larger the $\chi^2$ test statistic value, the smaller the p-value.

EXAMPLE (cont’d) CONDUCTING THE CHI-SQUARE TEST FOR CONTINGENCY TABLES

| |Age at First Pregnancy ≤ 25 (risk factor present) |Age at First Pregnancy > 25 (risk factor absent) |Row Totals (fixed) |

|Cervical Cancer (Case) |42 |7 |R1 = 49 |

|Control |203 |114 |R2 = 317 |

|Column Totals (random) |C1 = 245 |C2 = 121 |n=366 |

CONDUCTING THE TEST

1. State Hypotheses

Ho: There is no association between age at first pregnancy and cervical cancer, i.e. the percentage/proportion of women with the risk factor is the same for both cases and controls.

Ha: There is an association between age at first pregnancy and cervical cancer, i.e. the percentage/proportion of women with the risk factor is NOT the same for both cases and controls.

2. Determine Test Criteria

Choose $\alpha = .05$

Test Statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

3. Compute Test Statistic

a) Find expected frequencies and put them in the contingency table beneath the observed frequencies in parentheses (see above).

b) Calculate the Chi-Square statistic (see above).

$\chi^2 =$ ____________ df = 1
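Steps a) and b) can be checked numerically. The sketch below is plain Python with illustrative variable names. The p-value line uses the identity P(χ² ≥ x) = erfc(√(x/2)), which holds only when df = 1.

```python
import math

# Observed counts: rows = (case, control), columns = (risk present, risk absent).
observed = [
    [42, 7],
    [203, 114],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# a) Expected frequencies under independence: R_i * C_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# b) Pearson chi-square: sum over all cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form is valid for df = 1 only.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.5f}")
```

The statistic is about 9.01 and the p-value about .0027, in line with the value the JMP calculator reports for this example.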

4. Compute p-value

(use either the Chi-square Probability Calculator in JMP or the Chi-square Table in book)

Area in Lower Tail

df 0.90 0.95 0.975 0.99 0.995

1 2.71 3.84 5.02 6.63 7.88

2 4.61 5.99 7.38 9.21 10.6

. … … … … …

Because our test statistic value is greater than ___________ we reject the null and conclude that age at first pregnancy and cervical cancer status are not independent. We can use the table to put an upper bound on the p-value by noting that the largest value in the df = 1 row that our test statistic value exceeds is 7.88. This says that our p-value < __________.

Chi-Square Probability Calculator in JMP

To use this simply enter the test statistic value and the degrees of freedom (df) in the table and p-value will be calculated. Here we see our p-value = .0026836.

[pic]

Yates’ Correction for Continuity

When we have a 2 X 2 contingency table the Chi-square test outlined above is not really appropriate, particularly when the sample size (n) is “small”. When working with 2 X 2 tables it is often preferable to use Yates’ Correction when calculating the $\chi^2$ test statistic, or to use Fisher’s Exact Test which, as the name suggests, is an exact test!

Test Statistic: Pearson’s Chi-Square Test with Yates’ Correction for Continuity in 2 X 2 Contingency Tables.

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$

where $E_{ij} = \frac{R_i C_j}{n}$.
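To see how much the correction changes the answer for the cervical cancer data, here is a hedged Python sketch. It assumes the usual form of the continuity correction, in which 0.5 is subtracted from each |O − E| before squaring. The variable names are illustrative, and the erfc identity for the p-value applies only because df = 1.

```python
import math

observed = [
    [42, 7],
    [203, 114],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Flatten both tables into (O, E) pairs, one pair per cell.
cells = [
    (o, e)
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
]

# Uncorrected Pearson statistic, for comparison.
pearson = sum((o - e) ** 2 / e for o, e in cells)

# Continuity-corrected statistic (assumed form): shrink each |O - E| by 0.5.
yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

# Upper-tail probabilities for df = 1.
p_pearson = math.erfc(math.sqrt(pearson / 2))
p_yates = math.erfc(math.sqrt(yates / 2))

print(f"Pearson: {pearson:.3f} (p = {p_pearson:.4f})")
print(f"Yates:   {yates:.3f} (p = {p_yates:.4f})")
```

The corrected statistic is smaller than the uncorrected one, so its p-value is somewhat larger. For these data the conclusion at the .05 level does not change.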

Analysis in JMP

Enter these data into JMP as shown below.

[pic]

The contingency table in JMP is obtained by selecting Fit Y by X from the Analyze menu and placing Disease in the X box and Preg. Age in the Y box.   The resulting mosaic plot and contingency table are shown below.

[pic]

The Row %'s give the proportion of women with the potential risk factor in each group.  From these data we can see that the proportion of women who had their first pregnancy at or before age 25 in the cervical cancer group is .8571 or 85.71% vs. .6404 or 64.04% for the women in the control group.  It certainly appears that the proportion of women with the potential risk factor is higher in the cervical cancer (case) group, i.e. there appears to be a relationship between the potential risk factor and cervical cancer.

Chi-square test results are given automatically; however, this test does NOT use Yates’ correction for continuity.

[pic]

There is strong evidence that the proportion of women with the risk factor differs significantly between the cases and the controls (p = .0027).

Instead of the chi-square test we can use Fisher's Exact Test, which is included in the JMP output whenever we are working with a 2 X 2 table. The results are shown below.

[pic]

The three p-values given are for testing the following:

(1) Left,  p-value = .9996 is for testing if the proportion of women with the potential risk factor is larger for the control group. Had this been significant that would suggest that having a first pregnancy at or before the age of 25 reduces your risk of developing cervical cancer.  This is clearly not supported as the p-value >> .05.

(2) Right, p-value = .0014 is for testing if the proportion of women with the potential risk factor is larger for the cervical cancer (case) group.  The fact this p-value is significant suggests that having a first pregnancy at or before the age of 25 increases your risk of developing cervical cancer.  This was the research hypothesis for the doctors who conducted this study.

(3) 2-Tail, p-value = .0029 is for testing if the proportion of women with the potential risk factor differs between the two groups.  The fact this p-value is significant suggests that the proportion of women having a first pregnancy at or before the age of 25 is not the same for both groups.  Because the sample proportion is larger for the case group we can again conclude that early age at first pregnancy increases risk of cervical cancer.
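The three Fisher p-values can be reproduced approximately from the hypergeometric distribution. This is a sketch under stated assumptions, not JMP's algorithm: it conditions on all four margins, takes "Left" and "Right" to mean the lower and upper tails for the number of cases with the risk factor, and defines the two-tail p-value as the total probability of all tables no more likely than the observed one. The function and variable names are illustrative.

```python
from math import comb

# Observed 2 x 2 table: a = cases with risk factor, b = cases without,
# c = controls with risk factor, d = controls without.
a, b, c, d = 42, 7, 203, 114
row1 = a + b            # number of cases
col1 = a + c            # number of women with the risk factor
n = a + b + c + d


def pmf(k: int) -> float:
    """Hypergeometric probability that k of the cases have the risk factor."""
    return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)


k_min = max(0, row1 - (n - col1))
k_max = min(row1, col1)
probs = {k: pmf(k) for k in range(k_min, k_max + 1)}

left = sum(p for k, p in probs.items() if k <= a)     # case proportion smaller
right = sum(p for k, p in probs.items() if k >= a)    # case proportion larger
# Two-tail: all tables at most as probable as the observed one
# (small tolerance guards against floating-point ties).
two_tail = sum(p for p in probs.values() if p <= probs[a] * (1 + 1e-9))

print(f"Left = {left:.4f}, Right = {right:.4f}, 2-Tail = {two_tail:.4f}")
```

If the assumptions above match JMP's definitions, the printed values should land close to the .9996, .0014 and .0029 quoted in the list.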

TEST OF INDEPENDENCE EXAMPLE - HISTOLOGICAL TYPE of HODGKIN’S DISEASE and RESPONSE TO TREATMENT

Is there a relationship between type of Hodgkin’s and response to treatment? To answer this question researchers randomly sampled medical records for 538 patients who had been classified as having some form of Hodgkin’s disease and then cross-classified these patients according to the histological type of Hodgkin’s they have and their response to treatment.

Response to Treatment

|Type of Hodgkin’s |None |Partial |Positive |Row Totals (Random) |

|LD |44 |10 |18 |72 |

|LP |12 |18 |74 |104 |

|MC |58 |54 |154 |266 |

|NS |12 |16 |68 |96 |

|Column Totals (Random) |126 |98 |314 |N = 538 |

[pic]

Pearson Chi-Square Test for r x c Contingency Tables

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where r = # of rows, c = # of columns, and

$E_{ij} = \frac{R_i C_j}{n}$.
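Applied to the Hodgkin's table, the r x c statistic can be computed as follows. This is a plain Python sketch with illustrative names. The p-value line uses a closed form for the chi-square upper tail that is valid only for an even number of degrees of freedom, which happens to be the case here (df = 6).

```python
import math

# Rows: histological type; columns: response (None, Partial, Positive).
observed = {
    "LD": [44, 10, 18],
    "LP": [12, 18, 74],
    "MC": [58, 54, 154],
    "NS": [12, 16, 68],
}

row_totals = {t: sum(counts) for t, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Pearson chi-square with expected counts E_ij = R_i * C_j / n.
chi_square = 0.0
for t, counts in observed.items():
    for o, c_total in zip(counts, col_totals):
        e = row_totals[t] * c_total / n
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(col_totals) - 1)

# Upper tail for EVEN df: exp(-y) * sum_{k < df/2} y^k / k!, with y = x / 2.
y = chi_square / 2
p_value = math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(df // 2))

print(f"chi-square = {chi_square:.2f}, df = {df}, p-value = {p_value:.2e}")
```

The statistic is roughly 75.9 on 6 degrees of freedom, so the p-value is far below .0001.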

Analysis in JMP

To perform a Chi-square Test of Independence in JMP select the Fit Y by X option from the Analyze menu to examine the relationship between Histological Type (X) and Response (Y). The mosaic plot and contingency table for these data are shown below.

Mosaic Plot of Response vs. Histological Type

[pic]

From the mosaic plot we can clearly see that individuals with LP or NS have the highest proportion of subjects with positive response to treatment, while LD has the largest percentage of patients with no response to treatment. The contingency table for these data is shown below with only Row% added to each cell (i.e. the Total % and Col % options have been unselected).   These percentages can be interpreted as the conditional chance/probability of each response type given the histological type (e.g. P(Positive|LP) = .7115 while P(None|LD) = .6111).
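The Row % values are just each count divided by its row total. A small Python sketch, with illustrative names, that reproduces the two conditional probabilities quoted above:

```python
responses = ["None", "Partial", "Positive"]
table = {
    "LD": [44, 10, 18],
    "LP": [12, 18, 74],
    "MC": [58, 54, 154],
    "NS": [12, 16, 68],
}

# Row proportion = count / row total = P(response | histological type).
row_pct = {
    t: [count / sum(counts) for count in counts] for t, counts in table.items()
}

for t, pcts in row_pct.items():
    cells = ", ".join(f"P({r}|{t}) = {p:.4f}" for r, p in zip(responses, pcts))
    print(cells)
```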

Contingency Analysis of Response By Histological type

[pic]

Contingency Table with Expected Frequencies and Cell Chi-Square

[pic]

Chi-square Test of Independence

[pic]

The p-value is less than .0001 which provides very strong evidence against the assumption of independence. Thus we conclude that histological type and response to treatment are not independent amongst Hodgkin’s disease patients.  The nature of this relationship can be examined graphically by using the mosaic plot and/or the results of a correspondence analysis.

[pic]

TEST OF HOMOGENEITY EXAMPLE – Blister Rust and Age of Trees

Is the susceptibility to blister rust related to the age of the tree? To answer this question researchers grafted trees in four age classes (4, 10, 20 and 40 years) with blister rust and then recorded whether or not each tree ultimately became diseased. The results are shown below.

What is a test of homogeneity and how is it different from a test of independence?

|Age of Tree |Contracted Blister Rust (Diseased) |Did Not Contract Blister Rust (Healthy)|Row Totals (FIXED) |

|4 years |14 |7 |21 |

|10 years |11 |6 |17 |

|20 years |5 |11 |16 |

|40 years |8 |15 |23 |

|Column Totals (Random) |38 |39 |N = 77 |

The Hypotheses:

[pic]

Pearson Chi-Square Test for r x c Contingency Tables

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where r = # of rows, c = # of columns, and

$E_{ij} = \frac{R_i C_j}{n}$.

Contingency Analysis of Reaction By Age Class

Reaction

|Age Class |Diseased |Healthy |Row Totals |

|4 |14 |7 |21 |

|10 |11 |6 |17 |

|20 |5 |11 |16 |

|40 |8 |15 |23 |

|Column Totals|38 |39 |n = 77 |

Some calculations:
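The hand calculations for the blister rust table can be checked with the sketch below. It is plain Python with illustrative names. The p-value uses a closed form for the chi-square upper tail that applies only when df = 3, as it is here.

```python
import math

# Rows: age class in years; columns: (Diseased, Healthy).
observed = {
    4: [14, 7],
    10: [11, 6],
    20: [5, 11],
    40: [8, 15],
}

row_totals = {age: sum(counts) for age, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

chi_square = 0.0
for age, counts in observed.items():
    for o, c_total in zip(counts, col_totals):
        e = row_totals[age] * c_total / n      # expected count R_i * C_j / n
        chi_square += (o - e) ** 2 / e
        print(f"age {age:2d}: O = {o:2d}, E = {e:6.3f}")

df = (len(observed) - 1) * (len(col_totals) - 1)

# Upper tail for df = 3 only: erfc(sqrt(y)) + 2 * sqrt(y / pi) * exp(-y), y = x / 2.
y = chi_square / 2
p_value = math.erfc(math.sqrt(y)) + 2 * math.sqrt(y / math.pi) * math.exp(-y)

print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The statistic is about 8.17 on 3 degrees of freedom, giving a p-value close to the .0426 reported for this example.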

Analysis in JMP

[pic]

To begin our analysis select Fit Y by X from the Analyze menu and place Age Class in the X, Factor box and Reaction in the Y, Response box.  The resulting output is on the following page.

[pic]

Goodness-of-Fit for Numeric/Continuous Data:

Assessing Normality (Shapiro-Wilk Test)

This is a goodness-of-fit test that looks at whether the distribution of a numeric variable, e.g. Hg levels found in walleyes from Island Lake, is normal. Simply speaking, how well do our observed data fit a normal distribution?

[pic]

[pic]

• Hg Level (p = .0004) ⇒ Reject Ho, conclude that Hg levels are not normally distributed.

• Log10(Hg Level) (p = .9402) ⇒ Fail to Reject Ho, normality of the Hg levels in the log base 10 scale seems plausible.

To obtain the output above select the options shown below

[pic] [pic]

-----------------------

Note: The basic form of the corrected test statistic is still essentially the same in that it measures the discrepancy between what we observe and what we expected to see if the Ho were true.

[pic]

We conclude that there is an association between age of the tree and the result of a blister rust graft (p = .0426). In particular, younger trees are more susceptible to blister rust than older trees.

Treatable forms of Hodgkin's Disease (MC, LP & NS)

This pull-down menu allows you to choose which numbers will be displayed in each cell of the table. Here only Count and Row % have been selected. You can also select Expected to get the expected frequencies ($E_{ij}$), Deviation to get $O_{ij} - E_{ij}$, and Cell Chi Square to get $\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ for each cell in the table (see table on the next page).

This pull down menu allows you to select what numbers will be displayed in each cell of the table.  Here only Count and Row % are selected.

From the Contingency Table pull-down menu we have selected Expected to get the expected frequencies ($E_{ij}$) and Cell Chi Square to get $\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ for each cell in the table.
