ANALYSIS OF CATEGORICAL DATA



11 - ANALYSIS OF CATEGORICAL DATA

In this handout we cover several different methods for analyzing categorical data.

The methods we will examine are:

• 2 × 2 Contingency Tables

(we have seen these earlier when we discussed conditional probability, RR’s, [pic])

• r × c Contingency Tables

• McNemar’s Test (dependent samples comparison of p1 vs. p2)

COMPARING TWO POPULATION PROPORTIONS EXAMPLE -

AGE AT FIRST PREGNANCY AND CERVICAL CANCER (a case-control Study)

These data come from a case-control study examining the potential relationship between age at first pregnancy and cervical cancer. In a case-control study, random samples of cases (i.e. people with the disease in question) and controls (i.e. people similar to those in the case group, except that they do not have the disease) are selected, and the proportions of people with some potential risk factor are compared across the two groups. In this study we will compare the proportions of women who had their first pregnancy at or before the age of 25, because researchers suspected that an early age at first pregnancy leads to an increased risk of developing cervical cancer. The data are presented in the table below:

| |Age at First Pregnancy ≤ 25 (risk factor present) |Age at First Pregnancy > 25 (risk factor absent) |Row Totals (fixed) |
|Cervical Cancer (Case) |42 |7 |49 |
|Control |203 |114 |317 |
|Column Totals (random) |245 |121 |n=366 |

Ho: There is no association between age at first pregnancy and cervical cancer, i.e. age at

first pregnancy and cervical cancer are independent.

or

The distribution of the risk factor is the same for both cases and controls

or

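The proportion of women with the risk factor is the same for both groups, i.e.

$p_1 = p_2$, where $p_1$ and $p_2$ are the proportions of cases and controls, respectively, who have the risk factor.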

Ha: There IS an association between age at first pregnancy and cervical cancer, i.e. age at

first pregnancy and cervical cancer are NOT independent.

or

The distribution of the risk factor is NOT the same for both cases and controls.

or

The proportion of women with the risk factor is not the same for both groups, i.e.

$p_1 \neq p_2$.

Development of the Test Statistic

If the null hypothesis is true we expect the proportion of women with the risk factor in the case and control groups to be the same. We can also think of this in terms of conditional probabilities. Two events A and B are said to be independent if

$P(A \mid B) = P(A)$

i.e. knowledge about the occurrence of B tells you nothing about the occurrence of A.

Consider the following generic representation of the contingency table for this example.

| | 1st Preg. Age ≤ 25 | 1st Preg. Age > 25 | |

| |(Risk Factor Present) |(Risk Factor Absent) |ROW TOTALS |

|Case |a |b | R1=(a+b) |

|(Disease Present) | | | |

|Control |c |d | R2=(c+d) |

|(Disease Absent) | | | |

|COLUMN TOTALS |C1=(a+c) |C2=(b+d) | n |

From this table we can calculate the conditional probability of having the risk factor given the disease status of the subject as follows:

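P(Risk|Disease) = $\frac{a}{a+b} = \frac{a}{R_1}$

which if risk and disease are independent should be equal to

P(Risk) = $\frac{a+c}{n} = \frac{C_1}{n}$

Setting these two expressions equal to one another gives:

$\frac{a}{R_1} = \frac{C_1}{n} \;\Rightarrow\; a = \frac{R_1 C_1}{n}$.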

This gives what we expect a, the number of women that have the risk factor in the

disease group, to be if the null hypothesis is true.

In a similar fashion we could find what we expect c, the number of women with the

risk factor in control group, to be. If the null hypothesis is true we expect

P(Risk|No Disease) = $\frac{c}{R_2}$ to be equal to P(Risk) = $\frac{C_1}{n}$, so we expect $c = \frac{R_2 C_1}{n}$.

We also can look at absence of the risk factor in the same way which gives the following expected values for b and d.

$b = \frac{R_1 C_2}{n}$ and $d = \frac{R_2 C_2}{n}$

Notice there is a general pattern here: the expected frequency for the cell in the ith row and the jth column of the table is found by multiplying the row total for that row ($R_i$) by the column total for that column ($C_j$) and then dividing by the total sample size (n), i.e.

$E_{ij} = \frac{R_i C_j}{n}$.
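The row-total-times-column-total rule is easy to check numerically. The following is a minimal Python sketch, not part of the original handout; the function name `expected_counts` is an illustrative choice, and the counts are those of the cervical cancer table.

```python
def expected_counts(observed):
    """Return expected counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: case, control. Columns: first pregnancy at age <= 25, age > 25.
observed = [[42, 7], [203, 114]]
expected = expected_counts(observed)

for row in expected:
    print([round(e, 2) for e in row])
```

For these data the expected counts come out to roughly 32.80 and 16.20 for the cases and 212.20 and 104.80 for the controls. The row and column totals of the expected table match those of the observed table.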

Our test statistic looks at the difference or discrepancy between what we observe when our data is collected and what we expect to see if the null hypothesis of independence is true. Intuitively, if the observed frequencies are far away from what we expect to see if the variables in question were independent, then we will reject the null hypothesis and conclude a significant relationship between the variables exists.

Pearson’s Chi-Square Test Statistic

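$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where $O_{ij}$ is the observed frequency in the ith row and jth column, and the expected frequencies for the cells are given by the formula:

$E_{ij} = \frac{R_i C_j}{n}$ , $i = 1, \dots, r$; $j = 1, \dots, c$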

If the observed frequencies differ substantially from the expected frequencies the $\chi^2$ test statistic will be “BIG”. How do we define “BIG”? We find the probability of getting a test statistic value as extreme or more extreme than the one observed if there were truly no association between the two variables in question, i.e. we find the p-value associated with our test statistic. If the null hypothesis is true the $\chi^2$ test statistic follows a Chi-squared distribution with degrees of freedom df = $(r - 1)(c - 1)$. Here, r = # of rows and c = # of columns in the contingency table.

Chi-Square Distribution with p-value

The larger the $\chi^2$ test statistic value, the smaller the p-value.

EXAMPLE (cont’d) CONDUCTING THE TEST OF INDEPENDENCE

| |Age at First Pregnancy ≤ 25 (risk factor present) |Age at First Pregnancy > 25 (risk factor absent) |Row Totals (fixed) |
|Cervical Cancer (Case) |42 |7 |R1 = 49 |
|Control |203 |114 |R2 = 317 |
|Column Totals (random) |C1 = 245 |C2 = 121 |n=366 |

1. State Hypotheses

Ho: There is no association between age at first pregnancy and cervical cancer, i.e. age at

first pregnancy and cervical cancer are independent.

Ha: There is an association between age at first pregnancy and cervical cancer, i.e. age at

first pregnancy and cervical cancer are NOT independent.

2. Determine Test Criteria

Choose $\alpha = 0.05$

Test Statistic

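$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, with df = (2 − 1)(2 − 1) = 1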

Conditions for the Test Statistic to be Valid

We should have no cells with expected frequencies less than 1 and at least 80% of the expected frequencies should be greater than 5. If either of these conditions is violated you have two options:

• Increase the sample size (n) so that the expected frequencies increase, assuming additional data could be gathered under the same experimental conditions.

• Combine sparse categories, which increases the cell frequencies and the associated row/column totals and therefore increases the expected frequencies.
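The two validity conditions can be checked mechanically once the expected frequencies are known. The sketch below is illustrative only: the function names are hypothetical, and the `sparse` table is made-up data used purely to show a failing case.

```python
def expected_counts(observed):
    """Expected counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def conditions_ok(expected):
    """True if no expected count is below 1 and at least 80% exceed 5."""
    cells = [e for row in expected for e in row]
    none_below_one = all(e >= 1 for e in cells)
    share_above_five = sum(e > 5 for e in cells) / len(cells)
    return none_below_one and share_above_five >= 0.80


cervical = [[42, 7], [203, 114]]  # counts from the handout
sparse = [[1, 2], [3, 40]]        # hypothetical sparse table, not from the handout

print(conditions_ok(expected_counts(cervical)))
print(conditions_ok(expected_counts(sparse)))
```

The cervical cancer table passes because all four expected counts are well above 5. The hypothetical sparse table fails because one expected count falls below 1.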

3. Compute Test Statistic

a) Find expected frequencies and put them in the contingency table beneath the

observed frequencies in parentheses.

b) Calculate the Chi-Square statistic.
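Steps (a) and (b) can be carried out with a few lines of Python. This is a sketch rather than the handout's own method: the function names are illustrative, and the p-value uses the fact that for df = 1 the upper-tail chi-square probability equals erfc(sqrt(x/2)), so it applies only to 2 × 2 tables.

```python
import math


def pearson_chi_square(observed):
    """Return (statistic, df) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count for cell (i, j)
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def upper_tail_df1(x):
    """P(chi-square with 1 df > x); valid for df = 1 only."""
    return math.erfc(math.sqrt(x / 2))


observed = [[42, 7], [203, 114]]  # case/control by age <= 25 / age > 25
statistic, df = pearson_chi_square(observed)
p_value = upper_tail_df1(statistic)
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.4f}")
```

For the cervical cancer data this gives a statistic of about 9.01 on 1 degree of freedom and a p-value of about .0027.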

4. Compute p-value

(Use either the Chi-square Probability Calculator in JMP or the Chi-square Table included at the end of this section of notes, pg. 127.)

Table VII - If our observed test statistic value exceeds the value in the Area in Upper Tail 0.050 column, we reject the null hypothesis in favor of the alternative at the α = .05 level.

Area in Upper Tail

df 0.100 0.050 0.025 0.010 0.005

1 2.706 3.841 5.024 6.635 7.879

2 4.605 5.991 7.378 9.210 10.597

. … … … … …

Because our test statistic value is greater than ___________ we reject the null and conclude that age at first pregnancy and cervical cancer status are not independent. We can use the table to put an upper bound on the p-value by noting that the largest value our test statistic value exceeds is 7.879. This says that our p-value < __________.
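The table lookup can be written as a small search: find the smallest tabled upper-tail area whose critical value the test statistic exceeds. The sketch below assumes a test statistic of about 9.01, which is what the counts in the table give, and copies the df = 1 row of the table; the function name is hypothetical.

```python
# Upper-tail areas and the df = 1 critical values from the chi-square table.
UPPER_TAIL_AREAS = [0.100, 0.050, 0.025, 0.010, 0.005]
CRITICAL_VALUES_DF1 = [2.706, 3.841, 5.024, 6.635, 7.879]


def p_value_upper_bound(statistic, areas, critical_values):
    """Smallest tabled tail area whose critical value the statistic exceeds.

    Returns None when the statistic exceeds no tabled value, in which case
    the p-value is larger than every area in the table.
    """
    bound = None
    for area, critical in zip(areas, critical_values):
        if statistic > critical:
            bound = area
    return bound


bound = p_value_upper_bound(9.01, UPPER_TAIL_AREAS, CRITICAL_VALUES_DF1)
print(bound)
```

Because the table stops at an upper-tail area of 0.005, this approach can only bound the p-value from above; a probability calculator is needed for the exact value.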

Chi-Square Probability Calculator in JMP

To use this simply enter the test statistic value and the degrees of freedom (df) in the table and p-value will be calculated. Here we see our p-value = .0026836.

[pic]

Conducting the Analysis in JMP

Enter these data into JMP as shown below.

[pic]

The contingency table in JMP is obtained by selecting Fit Y by X from the Analyze menu and placing Disease in the X box and Preg. Age in the Y box.   The resulting mosaic plot and contingency table are shown below.

[pic]

The Row %'s give the proportion of women with the potential risk factor in each group. From these data we can see that the proportion of women who had their first pregnancy at or before age 25 in the cervical cancer group is .8571 or 85.71% vs. .6404 or 64.04% for the women in the control group. It certainly appears that the proportion of women with the potential risk factor is higher in the cervical cancer (case) group, i.e. there appears to be a relationship between the potential risk factor and cervical cancer.

Chi-square test results are given automatically

[pic]

There is strong evidence that the proportion of women with the risk factor differs significantly between the cases and the controls, i.e. we have strong evidence that cervical cancer and age at first pregnancy are not independent (p = .0027).

Instead of the chi-square test we can use Fisher's Exact Test, which is included in the JMP output whenever we are working with a 2 X 2 table. The results are shown below.

[pic]

The three p-values given are for testing the following:

(1) Left,  p-value = .9996 is for testing if the proportion of women with the potential risk factor is larger for the control group. Had this been significant that would suggest that having a first pregnancy at or before the age of 25 reduces your risk of developing cervical cancer.  This is clearly not supported as the p-value >> .05.

(2) Right, p-value = .0014 is for testing if the proportion of women with the potential risk factor is larger for the cervical cancer (case) group.  The fact this p-value is significant suggests that having a first pregnancy at or before the age of 25 increases your risk of developing cervical cancer.  This was the research hypothesis for the doctors who conducted this study.

(3) 2-Tail, p-value = .0029 is for testing if the proportion of women with the potential risk factor differs between the two groups. The fact that this p-value is significant suggests that the proportion of women having a first pregnancy at or before the age of 25 is not the same for both groups. Because the sample proportion is larger for the case group we can again conclude that early age at first pregnancy increases risk of cervical cancer.
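Fisher's Exact Test treats the row and column totals as fixed, so the count in the upper-left cell follows a hypergeometric distribution. The sketch below computes the three tail probabilities directly from that distribution. It is an illustration under stated assumptions: the function names are hypothetical, the two-sided p-value uses the common convention of summing the probabilities of all tables no more likely than the observed one, and which tail a package labels "Left" or "Right" depends on how it orders the rows and columns.

```python
from math import comb


def hypergeom_pmf(k, row1_total, col1_total, n):
    """P(upper-left cell = k) with all margins of the 2 x 2 table held fixed."""
    return (comb(col1_total, k) * comb(n - col1_total, row1_total - k)
            / comb(n, row1_total))


def fisher_exact_2x2(a, b, c, d):
    """Return (lower-tail, upper-tail, two-sided) p-values for the table [[a, b], [c, d]]."""
    r1, c1, n = a + b, a + c, a + b + c + d
    lowest = max(0, r1 - (n - c1))
    highest = min(r1, c1)
    pmf = {k: hypergeom_pmf(k, r1, c1, n) for k in range(lowest, highest + 1)}
    lower = sum(p for k, p in pmf.items() if k <= a)  # a or fewer in the cell
    upper = sum(p for k, p in pmf.items() if k >= a)  # a or more in the cell
    # Small relative tolerance guards against floating-point ties.
    two_sided = sum(p for p in pmf.values() if p <= pmf[a] * (1 + 1e-9))
    return lower, upper, two_sided


# Cases: 42 with the risk factor, 7 without. Controls: 203 with, 114 without.
left, right, two_sided = fisher_exact_2x2(42, 7, 203, 114)
print(f"lower tail = {left:.4f}, upper tail = {right:.4f}, two-sided = {two_sided:.4f}")
```

Here the upper tail is the probability of seeing 42 or more cases with the risk factor if there were no association, which corresponds to the research hypothesis that early first pregnancy is more common among cases.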

What other analyses could we perform for these data?

Example 2: Type of Skin Melanoma and Site on Body

Is there a relationship between the type of skin melanoma and where on the body the melanoma appeared? To answer this question n = 400 patients were cross-classified according to type of melanoma and where the melanoma appeared. The data collected are summarized in the contingency table below.

| |Site of Melanoma | | | |
|Type of Melanoma |Head & Neck |Trunk |Extremities |Row Totals |
|Hutchinson’s Melanomic Freckle |22 |2 |10 |R1 = 34 |
|Superficial Spreading Melanoma |16 |54 |115 |R2 = 185 |
|Nodular |19 |33 |73 |R3 = 125 |
|Indeterminate |11 |17 |28 |R4 = 56 |
|Column Totals |C1 = 68 |C2 = 106 |C3 = 226 |n = 400 |

1. State Hypotheses:

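Ho: There is no association between type of melanoma and site on the body, i.e. type of melanoma and site are independent.

Ha: There is an association between type of melanoma and site on the body, i.e. type of melanoma and site are NOT independent.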

2. Determine Test Criteria

Choose $\alpha = 0.05$

Test Statistic: Pearson Chi-Square Test for r x c Contingency Tables

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where r = # of rows, c = # of columns, and $E_{ij} = \frac{R_i C_j}{n}$.

3. Compute Test Statistic

a) Compute expected frequencies and place them in table beneath the observed

frequencies in parentheses.

| |Site of Melanoma | | | |
|Type of Melanoma |Head & Neck |Trunk |Extremities |Row Totals |
|Hutchinson’s Melanomic Freckle |22 (5.78) |2 (9.01) |10 (19.21) |R1 = 34 |
|Superficial Spreading Melanoma |16 ( ) |54 (49.03) |115 (104.53) |R2 = 185 |
|Nodular |19 (21.25) |33 (33.13) |73 (70.62) |R3 = 125 |
|Indeterminate |11 (9.52) |17 ( ) |28 (31.64) |R4 = 56 |
|Column Totals |C1 = 68 |C2 = 106 |C3 = 226 |n = 400 |

b) Compute Test Statistic

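$\chi^2 = \frac{(22 - 5.78)^2}{5.78} + \frac{(2 - 9.01)^2}{9.01} + \cdots + \frac{(28 - 31.64)^2}{31.64} \approx 65.81$, with df = (4 − 1)(3 − 1) = 6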
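The same computation for the 4 × 3 melanoma table can be sketched in Python. The names are illustrative. The p-value function relies on a closed form that holds only when df is even: the upper-tail probability is exp(−x/2) times the sum of (x/2)^j / j! for j from 0 to df/2 − 1. That happens to cover this table, where df = 6.

```python
import math


def pearson_chi_square(observed):
    """Return (statistic, df) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count for cell (i, j)
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


# Rows: Hutchinson's, superficial spreading, nodular, indeterminate.
# Columns: head & neck, trunk, extremities.
melanoma = [[22, 2, 10], [16, 54, 115], [19, 33, 73], [11, 17, 28]]
statistic, df = pearson_chi_square(melanoma)
p_value = upper_tail_even_df(statistic, df)
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.2e}")
```

The statistic comes out to about 65.8 on 6 degrees of freedom, far beyond the largest df = 6 value in the chi-square table (18.548), so the p-value is well below .0001.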

4. Compute p-value

Using Chi-square table on pg. 127

Area in Upper Tail is Denoted by Subscript

df .100 .050 .025 .010 .005

1 2.7055 3.841 5.024 6.635 7.879

2 4.605 5.991 7.378 9.210 10.597

... … … … … …

6 10.645 12.592 14.449 16.812 18.548

Chi-Square Probability Calculator in JMP

To use this simply enter the test statistic value and the degrees of freedom (df) in the table and p-value will be calculated. Here we see our p-value < .0001.

[pic]

Analysis in JMP

We can enter these data into JMP as shown below.

[pic]

To begin our analysis select Fit Y by X from the Analyze menu and place Type of Melanoma in the X, Factor box and Location in the Y, Response box. Here we could also reverse the roles of X and Y, because we could imagine looking at the distribution of the location of the melanoma conditional on the type, or we could look at the distribution of the type of melanoma given its location. The resulting output is shown below.

[pic]

[pic]

Example 3: Histological Type and Response to Treatment for Hodgkin’s Patients

Is there a relationship between type of Hodgkin’s and response to treatment? To answer this question researchers randomly sampled medical records for 538 patients who had been classified as having some form of Hodgkin’s disease and then looked at their response to treatment.

| |Response to Treatment | | | |
|Type of Hodgkin’s |None |Partial |Positive |Row Totals (Random) |
|LD |44 |10 |18 |72 |
|LP |12 |18 |74 |104 |
|MC |58 |54 |154 |266 |
|NS |12 |16 |68 |96 |
|Column Totals (Random) |126 |98 |314 |N = 538 |

[pic]

Analysis in JMP

To perform a Chi-square Test of Independence in JMP select the Fit Y by X option from the Analyze menu to examine the relationship between Histological Type (X) and Response (Y). The mosaic plot and contingency table for these data are shown below.

Mosaic Plot of Response vs. Histological Type

[pic]

From the mosaic plot we can clearly see that individuals with LP or NS have the highest proportion of subjects with positive response to treatment, while LD has the largest percentage of patients with no response to treatment. The contingency table for these data is shown below with only Row % added to each cell (i.e. the Total % and Col % options have been unselected). These percentages can be interpreted as the conditional chance/probability of each response type given the histological type, e.g. P(Positive|LP) = .7115 while P(None|LD) = .6111.
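The row percentages are simply each cell count divided by its row total. A short Python sketch, using the counts from the Hodgkin's table and a hypothetical helper name, reproduces the two conditional probabilities quoted above.

```python
# Counts from the handout's table: histological type by response to treatment.
hodgkins = {
    "LD": {"None": 44, "Partial": 10, "Positive": 18},
    "LP": {"None": 12, "Partial": 18, "Positive": 74},
    "MC": {"None": 58, "Partial": 54, "Positive": 154},
    "NS": {"None": 12, "Partial": 16, "Positive": 68},
}


def row_proportions(table):
    """Divide each count by its row total to estimate P(response | type)."""
    result = {}
    for row_label, counts in table.items():
        total = sum(counts.values())
        result[row_label] = {col: count / total for col, count in counts.items()}
    return result


proportions = row_proportions(hodgkins)
for hist_type, row in proportions.items():
    print(hist_type, {response: round(p, 4) for response, p in row.items()})
```

Each row of the output sums to 1, since it is a conditional distribution of response given histological type.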

[pic]

Chi-square Test of Independence

[pic]

The p-value is less than .0001 which provides very strong evidence against the assumption of independence. Thus we conclude that histological type and response to treatment are not independent amongst Hodgkin’s disease patients.  The nature of this relationship can be discussed by using the mosaic plot and/or the row percentages.

McNemar’s Test for Comparing Two Proportions Using Paired or Dependent Samples

As was the case for continuous data we can use matched pairs to control for potential confounding factors when comparing two population proportions ($p_1$ vs. $p_2$). Each observation sampled from population 1 has a corresponding matched observation sampled from population 2. These observations may actually represent observations made on the same subject or they may be different subjects matched according to some criteria (e.g. sex, age, race, etc.).

Example 1: Effect of Disinfectant Use on Acute Cutaneous Complications (ACCs) During Insulin Pump Treatment

A study examined the effect of disinfectant use on acute cutaneous complications (ACCs) during insulin-pump treatment. At the time of the initial exam, 70.0% of 40 diabetic patients with insulin pumps had ACCs at the needle insertion site. After use of a disinfectant on the skin before needle insertion for two to four weeks, 27.5% of the patients had ACCs at the needle insertion site. Do the data below provide evidence that the population proportion of diabetic patients with ACCs before using the disinfectant differs from the proportion of diabetic patients with ACCs after using the disinfectant?

|Before Disinfectant |After Disinfectant: ACCs |After Disinfectant: No ACCs |Row Total |
|ACCs |9 |19 |28 |
|No ACCs |2 |10 |12 |
|Column Total |11 |29 |n = 40 |

Each of the 40 subjects in the study is classified according to ACC status before and after the use of disinfectant:

• 9 subjects had ACCs both before and after disinfectant

• 19 subjects had ACCs before the disinfectant and did NOT have ACCs after the use of disinfectant

• 2 subjects had no ACCs before disinfectant use but had them after disinfectant was used

• 10 patients had no ACCs before or after the use of disinfectant.

The subjects where no change in their ACC status occurred tell us nothing about the effectiveness of the disinfectant. These pairs of observations are called concordant pairs because the response was the same for each observation in the pair. The subjects where there was a change in ACC status provide information about the effectiveness of the disinfectant. In particular if the number of patients where ACCs were present before disinfectant but were absent after the use of disinfectant is large relative to the number of patients where the opposite was true then we have evidence that the disinfectant was effective at reducing ACCs. Pairs of observations where there is a difference in the response are called discordant pairs. The discordant pairs provide the basis for the test statistic.

Let

r = number of subjects who had ACCs before using disinfectant and no ACCs after using disinfectant = 19

s = number of subjects who did not have ACCs before using disinfectant but did have ACCs after using disinfectant = 2

If the null hypothesis of no difference in the proportion of ACCs before and after treatment is true, r and s should be approximately equal. If there are substantial differences between r and s then we would want to reject the null in favor of the alternative which states that the proportion of patients with ACCs is different before and after the use of disinfectant.

McNemar’s Test Statistic

$\chi^2 = \frac{(|r - s| - 1)^2}{r + s}$, which has a chi-square distribution with df = 1.

Note: Because the absolute difference is used the labeling of r and s is completely

arbitrary.
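McNemar's statistic depends only on the two discordant counts, so it is simple to compute. The sketch below assumes the continuity-corrected form, (|r − s| − 1)² / (r + s), which is consistent with the note about the absolute difference; the function name is illustrative, and the p-value uses the df = 1 identity P(χ² > x) = erfc(sqrt(x/2)).

```python
import math


def mcnemar_test(r, s):
    """Continuity-corrected McNemar statistic and its df = 1 chi-square p-value."""
    if r + s == 0:
        raise ValueError("McNemar's test needs at least one discordant pair")
    statistic = (abs(r - s) - 1) ** 2 / (r + s)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Disinfectant example: r = 19 (ACCs before, none after), s = 2 (none before, ACCs after).
statistic, p_value = mcnemar_test(19, 2)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.4f}")

# Swapping the labels of r and s leaves the result unchanged.
print(mcnemar_test(2, 19) == (statistic, p_value))
```

For the disinfectant data the statistic is 256/21, or about 12.19. The concordant counts (9 and 10) play no role in the calculation.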

1. State Hypotheses

Ho: The proportions of patients with ACCs before and after the use of disinfectant are the same.

Ha: The proportions of patients with ACCs before and after the use of disinfectant are different.

2. Test Criteria

Choose $\alpha = 0.05$

Use McNemar’s Test

3. – 4. Compute test statistic and compute p-value

$\chi^2 = \frac{(|19 - 2| - 1)^2}{19 + 2} = \frac{256}{21} \approx 12.19$ >> 7.879 $\Rightarrow$ p-value < ____________.

5. Conclusion

Thus we have evidence to conclude that the ACC rate in diabetics using insulin pumps is significantly lower after using disinfectant (p < ____).

Example 2: Diabetes and Myocardial Infarctions amongst Navajo Indians

(pgs. 350-351)

A study investigating the potential relationship between diabetes and myocardial infarctions amongst Navajos was conducted by matching 144 victims of acute myocardial infarction with 144 individuals free of heart disease on the basis of gender and age. The members of each pair were then asked whether they had been diagnosed with diabetes. The results are shown in the table below.

|MI |No MI: Diabetes |No MI: No Diabetes |Row Total |
|Diabetes |9 |37 |46 |
|No Diabetes |16 |82 |98 |
|Column Total |25 |119 |n = 144 |

Ho: There is no association between diabetes and myocardial infarctions in the population of Navajo Indians.

Ha: There is an association between diabetes and myocardial infarctions in the population of Navajo Indians.

r = number of pairs where the MI sufferer had diabetes but the age and gender matched non-MI sufferer did not = 37

s = number of pairs where the MI sufferer did not have diabetes but the age and gender matched non-MI sufferer did = 16.

$\chi^2 = \frac{(|37 - 16| - 1)^2}{37 + 16} = \frac{400}{53} \approx 7.55$

p-value < _________

Conclude that there is an association between myocardial infarctions and diabetes amongst Navajo Indians. In particular, victims of acute myocardial infarctions are more likely to also suffer from diabetes than individuals free from heart disease who have been matched on age and gender (p < ).

Chi-square Table

[pic]

-----------------------

|Treatable forms of Hodgkin's Disease (MC, LP & NS) |

This pull down menu allows you to select what numbers will be displayed in each cell of the table.  Here only Count and Row % are selected.

[pic]

Conclusion and discussion:

We conclude that there is an association between type of skin melanoma and where it is found on the body (p < .0001). In particular we see that Hutchinson’s melanomic freckle is most likely found on the head or neck, while indeterminate, nodular, and superficial spreading melanomas are most likely found on the extremities. Superficial spreading melanomas are unlikely to be found on the head or neck.

This pull-down menu allows you to choose which numbers will be displayed in each cell of the table. Here both Count and Row % have been selected. We have also selected Expected to get the expected frequencies (E) and Cell Chi Square to see $\frac{(O - E)^2}{E}$ for each cell in the table.
