ANALYSIS OF CATEGORICAL DATA - Winona



ANALYSIS OF CATEGORICAL DATA

Goodness-of-Fit Tests, 2 X 2 Tables (Comparing Two Population Proportions), Tests of Homogeneity, Tests of Independence

[pic]

GOODNESS OF FIT EXAMPLE – PEA SEEDS (A genetics experiment)

It is hypothesized that the ratio of yellow-smooth to yellow-wrinkled to green-smooth to green-wrinkled seeds is 9:3:3:1. Using the data below, do the four seed phenotypes appear to follow the hypothesized 9:3:3:1 ratio?

|Yellow Smooth |Yellow Wrinkled |Green Smooth |Green Wrinkled |Total |
|152 |39 |53 |6 |250 |

Ho:  The sample comes from a population having a 9:3:3:1 ratio of yellow-smooth to yellow-wrinkled to green-smooth to green-wrinkled seeds.

(i.e. $p_{YS} = 9/16,\ p_{YW} = 3/16,\ p_{GS} = 3/16,\ p_{GW} = 1/16$)

Ha:  The sample comes from a population not having a 9:3:3:1 ratio of the above four seed phenotypes (i.e. at least one of the proportions above is not correct).

The Details: Pearson Chi-Square Goodness-of-Fit Test

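The chi-square test statistic $\chi^2$ is given by:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where,

k = # of categories, $O_i$ = observed frequency in category i, and $E_i = n p_i$ = expected frequency in category i when Ho is true.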
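As a hand check on the statistic above, here is a minimal Python sketch for the pea seed data. It assumes the observed counts are 152, 39, 53 and 6 (the green-smooth count of 53 is inferred so that the counts sum to the stated total of 250), and it uses the closed-form upper-tail probability for a chi-square variable with 3 degrees of freedom. The variable and function names are illustrative only and have nothing to do with JMP.

```python
import math

# Observed counts. Assumption: green-smooth = 53, inferred so the counts sum to 250.
observed = {
    "yellow-smooth": 152,
    "yellow-wrinkled": 39,
    "green-smooth": 53,
    "green-wrinkled": 6,
}
# Hypothesized 9:3:3:1 ratio expressed as probabilities p_i.
hypothesized = {
    "yellow-smooth": 9 / 16,
    "yellow-wrinkled": 3 / 16,
    "green-smooth": 3 / 16,
    "green-wrinkled": 1 / 16,
}

n = sum(observed.values())
expected = {k: n * p for k, p in hypothesized.items()}  # E_i = n * p_i
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1  # k - 1 degrees of freedom


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_3df(chi2)
print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.4f}")
```

Under these assumed counts the statistic comes out near 9 on 3 degrees of freedom, which corresponds to a p-value below .05.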

Performing a Goodness of Fit Test in JMP

Entering data in JMP involves setting up two columns: a categorical/nominal column containing the seed phenotype and a numerical/continuous column containing the number of seeds of that type observed in the sample of 250. Make sure these numbers are interpreted as frequencies by using the Preselect Role... option, obtained by right-clicking at the top of the column in the JMP spreadsheet. When completed, the data table should look like:

[pic]

To analyze these data in JMP do the following:

First select Distribution from the Analyze menu, place Pea Seeds in the Y box, and click OK.  The resulting bar graph and frequency distribution table are shown below.

[pic]

The sample proportions are a bit different from the hypothesized proportions, but are they far enough away to convince us that the hypothesized 9:3:3:1 ratio is not followed in this population? We can obtain 95% confidence intervals for the individual proportions for each phenotype by selecting Confidence Intervals with the appropriate confidence level from the pull-down menu for Pea Seeds.   The resulting confidence intervals are given in the table below.

[pic]
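Intervals of this kind can be approximated outside JMP. The sketch below assumes a Wilson score interval, which may not be exactly the method JMP uses, and assumes observed counts of 152, 39, 53 and 6 (the green-smooth count of 53 is inferred from the total of 250). The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(count, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.959964 gives about 95%)."""
    p_hat = count / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts (green-smooth = 53 is inferred from the total of 250).
counts = {
    "yellow-smooth": 152,
    "yellow-wrinkled": 39,
    "green-smooth": 53,
    "green-wrinkled": 6,
}
hypothesized = {
    "yellow-smooth": 9 / 16,
    "yellow-wrinkled": 3 / 16,
    "green-smooth": 3 / 16,
    "green-wrinkled": 1 / 16,
}

n = sum(counts.values())
covers = {}  # does each interval contain its hypothesized proportion?
for phenotype, count in counts.items():
    low, high = wilson_interval(count, n)
    covers[phenotype] = low <= hypothesized[phenotype] <= high
    print(f"{phenotype:16s} {count / n:.4f}  ({low:.4f}, {high:.4f})  covers: {covers[phenotype]}")
```

The `covers` dictionary records, for each phenotype, whether the interval contains the proportion implied by the 9:3:3:1 ratio.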

It is interesting to note that the confidence interval for the proportion of green-wrinkled seeds is the only one that does not contain its hypothesized proportion (1/16 = .0625).  Finally, to obtain the goodness-of-fit test for these data, select Test Probabilities from the pull-down menu for Pea Seeds.   The table below will then appear in the output window.  We need to enter the hypothesized probability of each phenotype in the column labeled Hypoth Prob (9/16 = .5625, 3/16 = .1875, 1/16 = .0625).

[pic]

When you are finished entering the hypothesized probabilities click the Done button and the following test information will be displayed.

[pic]

The p-value for the Pearson Chi-Square Goodness-of-Fit test is less than .05 so we reject the null hypothesis which states that the four seed phenotypes occur in a 9:3:3:1 ratio.  

COMPARING TWO POPULATION PROPORTIONS EXAMPLE -

AGE AT FIRST PREGNANCY AND CERVICAL CANCER (a Case-Control Study)

These data come from a case-control study examining the potential relationship between age at first pregnancy and cervical cancer.  In a case-control study, a random sample of cases (i.e. people with the disease in question) and a random sample of controls (i.e. people similar to those in the case group, except they do not have the disease) are selected, and the proportion of people with some potential risk factor is compared across the two groups.  In this study we will be comparing the proportion of women who had their first pregnancy at or before the age of 25, because researchers suspected that an early age at first pregnancy leads to increased risk of developing cervical cancer.  The data are presented in the table below:

                                                                                               

| |Age at First Pregnancy ≤ 25 (risk factor present) |Age at First Pregnancy > 25 (risk factor absent) |Row Totals (fixed) |
|Cervical Cancer (Case) |42 |7 |49 |
|Control |203 |114 |317 |
|Column Totals (random) |245 |121 |n = 366 |

[pic]

Test Statistic Details: Pearson Chi-Square Test for Contingency Tables with Yates' continuity correction for 2 X 2 tables.

[pic]

where r = # of rows, c = # of columns, and $E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$.

Analysis in JMP

Enter these data into JMP as shown below.

[pic]

The contingency table in JMP is obtained by selecting Fit Y by X from the Analyze menu and placing Disease in the X box and Preg. Age in the Y box.   The resulting mosaic plot and contingency table are shown below.

[pic]

The Row %'s give the proportion of women with the potential risk factor in each group.  From these data we can see that the proportion of women who had their first pregnancy at or before age 25 is .8571 or 85.71% in the cervical cancer group vs. .6404 or 64.04% in the control group.  It certainly appears that the proportion of women with the potential risk factor is higher in the cervical cancer (case) group, i.e. there is a relationship between the potential risk factor and cervical cancer.

Chi-square test results are given automatically:

[pic]

There is strong evidence that the proportion of women with the risk factor differs significantly between the cases and the controls (p = .0027).
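The row proportions and the chi-square test can be reproduced by hand from the 2 X 2 table. The sketch below computes the Pearson statistic without a continuity correction, an assumption made here because the uncorrected statistic yields a p-value close to the .0027 reported above; applying Yates' correction would give a somewhat larger p-value. With 1 degree of freedom the upper-tail chi-square probability reduces to a complementary error function.

```python
import math

# Rows: case, control. Columns: first pregnancy at or before 25, after 25.
table = [[42, 7], [203, 114]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Row proportions: share of each group with the potential risk factor.
risk_share = [row[0] / total for row, total in zip(table, row_totals)]

# Expected counts under equal proportions: E_ij = row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
chi2 = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)
# With 1 df the upper-tail chi-square probability equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"case: {risk_share[0]:.4f}, control: {risk_share[1]:.4f}")
print(f"Pearson chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
```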

Instead of the chi-square test we can use Fisher's Exact Test, which is included in the JMP output whenever we are working with a 2 X 2 table. The results are shown below.

[pic]

The three p-values given are for testing the following:

(1) Left,  p-value = .9996 is for testing if the proportion of women with the potential risk factor is larger for the control group. Had this been significant that would suggest that having a first pregnancy at or before the age of 25 reduces your risk of developing cervical cancer.  This is clearly not supported as the p-value >> .05.

(2) Right, p-value = .0014 is for testing if the proportion of women with the potential risk factor is larger for the cervical cancer (case) group.  The fact this p-value is significant suggests that having a first pregnancy at or before the age of 25 increases your risk of developing cervical cancer.  This was the research hypothesis for the doctors who conducted this study.

(3) 2-Tail, p-value = .0029 is for testing if the proportion of women with the potential risk factor differs between the two groups.  The fact that this p-value is significant suggests that the proportion of women having a first pregnancy at or before the age of 25 is not the same for both groups.  Because the sample proportion is larger for the case group we can again conclude that early age at first pregnancy increases risk of cervical cancer.
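The three p-values above come from the hypergeometric distribution of the top-left cell when all margins are held fixed. A minimal sketch of that calculation follows; it assumes the two-tail p-value is defined as the total probability of all tables no more likely than the observed one, which is a common convention but may not be exactly the rule JMP applies. The names `pmf`, `p_left`, `p_right` and `p_two_tail` are illustrative.

```python
from math import comb

# a = cases with the risk factor; margins taken from the 2 X 2 table.
a, case_total, risk_total, n = 42, 49, 245, 366


def pmf(k):
    """P(X = k): hypergeometric probability that k of the cases have the risk factor."""
    return comb(risk_total, k) * comb(n - risk_total, case_total - k) / comb(n, case_total)


# Possible values of the top-left cell given the fixed margins.
support = range(max(0, case_total - (n - risk_total)), min(case_total, risk_total) + 1)

p_left = sum(pmf(k) for k in support if k <= a)   # fewer or equal cases with risk factor
p_right = sum(pmf(k) for k in support if k >= a)  # more or equal cases with risk factor
# Two-tail: total probability of tables no more likely than the observed one.
p_two_tail = sum(pmf(k) for k in support if pmf(k) <= pmf(a) * (1 + 1e-9))

print(f"left = {p_left:.4f}, right = {p_right:.4f}, 2-tail = {p_two_tail:.4f}")
```

Note that the left and right tails both include the observed table, so they sum to slightly more than 1.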

TEST OF HOMOGENEITY EXAMPLE – Blister Rust and Age of Trees

Is the susceptibility to blister rust related to the age of the tree? To answer this question researchers grafted trees in four age classes (4, 10, 20 and 40 years) with blister rust and then recorded whether or not each tree ultimately became diseased. The results are shown below.

|Age of Tree |Contracted Blister Rust (Diseased) |Did Not Contract Blister Rust (Healthy) |Row Totals (Fixed) |
|4 years |14 |7 |21 |
|10 years |11 |6 |17 |
|20 years |5 |11 |16 |
|40 years |8 |15 |23 |
|Column Totals (Random) |38 |39 |N = 77 |

Hypotheses:

[pic]

Test Statistic Details: Pearson Chi-Square Test for r x c Contingency Tables

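$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where r = # of rows, c = # of columns, $O_{ij}$ = observed count in row i and column j, and

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}.$$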
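The r x c statistic can be computed directly from the blister rust table. The sketch below is a plain-Python illustration, not JMP's implementation; the helper names `pearson_chi2` and `chi2_sf` are hypothetical, and `chi2_sf` uses the standard closed-form series for the upper-tail chi-square probability with integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df (closed-form series)."""
    half = x / 2.0
    if df % 2 == 0:
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def pearson_chi2(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E_ij
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: 4, 10, 20 and 40 years. Columns: diseased, healthy.
blister_rust = [[14, 7], [11, 6], [5, 11], [8, 15]]
chi2, df = pearson_chi2(blister_rust)
p_value = chi2_sf(chi2, df)
print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.4f}")
```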

Analysis in JMP

[pic]

To begin our analysis select Fit Y by X from the Analyze menu and place Age Class in the X, Factor box and Reaction in the Y, Response box.  The resulting output is shown below.

[pic]

TEST OF INDEPENDENCE EXAMPLE - HISTOLOGICAL TYPE of HODGKIN’S DISEASE and RESPONSE TO TREATMENT

Is there a relationship between type of Hodgkin’s and response to treatment? To answer this question researchers randomly sampled medical records for 538 patients who had been classified as having some form of Hodgkin’s disease and then looked at their response to treatment.

Response to Treatment

|Type of Hodgkin’s |None |Partial |Positive |Row Totals (Random) |
|LD |44 |10 |18 |72 |
|LP |12 |18 |74 |104 |
|MC |58 |54 |154 |266 |
|NS |12 |16 |68 |96 |
|Column Totals (Random) |126 |98 |314 |N = 538 |

[pic]

Test Statistic Details: Pearson Chi-Square Test for r x c Contingency Tables

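$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where r = # of rows, c = # of columns, $O_{ij}$ = observed count in row i and column j, and

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}.$$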

Analysis in JMP

To perform a Chi-square Test of Independence in JMP select the Fit Y by X option from the Analyze menu to examine the relationship between Histological Type (X) and Response (Y). The mosaic plot and contingency table for these data are shown below.

Mosaic Plot of Response vs. Histological Type

[pic]

From the mosaic plot we can clearly see that individuals with LP or NS have the highest proportion of subjects with a positive response to treatment, while LD has the largest percentage of patients with no response to treatment. The contingency table for these data is shown below with only Row % added to each cell (i.e. the Total % and Col % options have been unselected).   These percentages can be interpreted as the conditional chance/probability of each response type given the histological type (e.g. P(Positive|LP) = .7115 while P(None|LD) = .6111).
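The Row % values are simply each count divided by its row total. A short sketch using the counts from the Hodgkin's table (the dictionary layout and names are illustrative):

```python
responses = ["None", "Partial", "Positive"]
counts = {
    "LD": [44, 10, 18],
    "LP": [12, 18, 74],
    "MC": [58, 54, 154],
    "NS": [12, 16, 68],
}

# Row % = count / row total = estimated P(response | histological type).
row_pct = {
    hist_type: {resp: c / sum(row) for resp, c in zip(responses, row)}
    for hist_type, row in counts.items()
}

for hist_type, pcts in row_pct.items():
    cells = "  ".join(f"P({resp}|{hist_type}) = {p:.4f}" for resp, p in pcts.items())
    print(cells)
```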

Contingency Analysis of Response By Histological type

[pic]

Chi-square Test of Independence

[pic]

The p-value is less than .0001 which provides very strong evidence against the assumption of independence. Thus we conclude that histological type and response to treatment are not independent amongst Hodgkin’s disease patients.  The nature of this relationship can be examined graphically by using the mosaic plot and/or the results of a correspondence analysis.
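To see where the evidence against independence comes from, the chi-square statistic can be broken into its cell-by-cell contributions. The sketch below is a plain-Python illustration using the counts from the Hodgkin's table. The degrees of freedom are (4 - 1)(3 - 1) = 6, and for even degrees of freedom the upper-tail probability has a simple closed form, used here in place of a statistics library.

```python
import math

types = ["LD", "LP", "MC", "NS"]
responses = ["None", "Partial", "Positive"]
table = [[44, 10, 18], [12, 18, 74], [58, 54, 154], [12, 16, 68]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Contribution of each cell: (O_ij - E_ij)^2 / E_ij with E_ij = row * column / n.
cell_chi2 = {}
for i, hist_type in enumerate(types):
    for j, resp in enumerate(responses):
        expected = row_totals[i] * col_totals[j] / n
        cell_chi2[(hist_type, resp)] = (table[i][j] - expected) ** 2 / expected

chi2 = sum(cell_chi2.values())
df = (len(types) - 1) * (len(responses) - 1)

# Closed-form upper tail for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
half = chi2 / 2
p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

largest = max(cell_chi2, key=cell_chi2.get)
print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.2e}")
print(f"largest contribution: {largest} ({cell_chi2[largest]:.2f})")
```

The cell with the largest contribution identifies the combination of histological type and response that departs most from what independence would predict.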

Assessing Normality (Shapiro-Wilk Test)

This is a goodness-of-fit test that looks at whether the distribution of a numeric variable, like Hg levels found in walleyes from Island Lake, is normal.

[pic]

[pic]

Hg Level (p = .0004) → Reject Ho; conclude that Hg levels are not normally distributed.

Log10(Hg Level) (p = .9402) → Fail to reject Ho; normality of the Hg levels in the log base 10 scale seems plausible.

To obtain the output above, select the options shown below.

[pic] [pic]

-----------------------

Treatable forms of Hodgkin's Disease (MC, LP & NS)

This pull-down menu allows you to choose which numbers will be displayed in each cell of the table. Here only Count and Row % have been selected. You can also select Expected to get the expected frequencies ($E_{ij}$), Deviation to get $O_{ij} - E_{ij}$, and Cell Chi Square to get $(O_{ij} - E_{ij})^2 / E_{ij}$ for each cell in the table.

This pull down menu allows you to select what numbers will be displayed in each cell of the table.  Here only Count and Row % are selected.

The basic form of the test statistic is the same: it measures the discrepancy between what we observed and what we would expect to see if Ho were true.

[pic]

We conclude that there is an association between age of the tree and the result of a blister rust graft (p = .0426). In particular, younger trees are more susceptible to blister rust than older trees.
