Statistical Concepts for Disease Detectives
Division C

Descriptive Epidemiology deals with the frequency and the distribution of illness and risk factors in populations, generally in terms of person, place and time. It enables epidemiologists to determine the extent of a disease: the epidemiologist collects information to characterize and summarize the health event or problem.
- Used as a first step to look at health-related outcomes
- Examine numbers of cases to identify an increase
- Examine patterns of cases to see who gets sick, and where and when they get sick (person, place and time)

Mean - the average: the sum of all the given elements divided by the number of elements. Means are typically reported when data are normally distributed.

$\bar{x} = \frac{\sum x}{n}$

where $\sum x$ = sum of the x scores and n = number of units in a sample.

Median - The median of a series of numbers is the number that appears in the middle of the list when the numbers are arranged from smallest to largest. Medians are typically reported when data are skewed. For a list with an odd number of members, the way to find the position of the middle number is to take the number of members, add one, and divide that value by two. In our case, there are 9 numbers in the series. 9 + 1 = 10 and half of 10 is 5. The fifth number in the series is the median, or 14. If the number of members of the series were even, the average of the two middle numbers would be the median.

Mode - The mode is the number in the series that appears the most often. If there is no single number that appears more often than any other number in the series, there is no value for the mode.
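The three measures above can be checked with a few lines of Python. This is only a sketch: the nine values below are hypothetical and chosen for illustration (they are not the series used in the handout), and the variable names are assumptions.

```python
from statistics import mean, median, multimode

# Hypothetical series of 9 values (illustrative only, not data from the handout).
values = [16, 10, 14, 19, 11, 16, 13, 12, 15]

ordered = sorted(values)
n = len(ordered)

# Mean: sum of the elements divided by the number of elements.
avg = mean(ordered)

# Median: for odd n it is the ((n + 1) / 2)-th ordered value;
# for even n it is the average of the two middle values.
if n % 2 == 1:
    middle = ordered[(n + 1) // 2 - 1]
else:
    middle = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mode: multimode returns every value tied for most frequent;
# report a mode only when exactly one value is the most frequent.
modes = multimode(ordered)
mode = modes[0] if len(modes) == 1 else None

print(f"mean={avg}, median={middle}, mode={mode}")
```

The hand-rolled median follows the counting rule described above and agrees with `statistics.median`, which could be used directly instead.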
Variance - measures how far the results are spread from the expected result (the mean).

Standard deviation - expresses how close the results are to each other; it is the square root of the variance.

Standard error of the mean - the sample standard deviation divided by the square root of the sample size:

$SE = \frac{s}{\sqrt{n}}$

Confidence intervals of means - specify a range of values within which the mean may lie. The interval is the sample mean plus or minus the Z score for the desired confidence level times the standard error of the mean: $\bar{x} \pm Z \times SE$.

Z Scores for Commonly Used Confidence Intervals
Desired Confidence Interval    Z Score
90%                            1.645
95%                            1.96
99%                            2.576

Analytic Epidemiology - the epidemiologist relies on comparisons between groups to determine the role of various risk factors in causing the problem. This is generally expressed as an association between an illness and various risk factors. These associations have two components: strength and statistical significance. Strength is measured by the relative risk or odds ratio, while statistical significance is determined by using a chi-square test, McNemar test or Fisher's exact test to calculate a p-value.

Z-test - compares sample and population means to determine if there is a significant difference. It requires a simple random sample from a population with a Normal distribution whose mean and standard deviation are known. The Z-test is preferable when the sample size n is greater than 30. The Z value indicates how many standard error units the sample mean lies from the population mean. The Z measure is calculated as:

$z = \frac{\bar{x} - \mu}{SE}$

where $\bar{x}$ is the sample mean to be standardized, $\mu$ (mu) is the population mean and SE is the standard error of the mean,

$SE = \frac{\sigma}{\sqrt{n}}$

where $\sigma$ is the population standard deviation and n is the sample size. The z value is then looked up in a z-table.
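The quantities above can be sketched in Python as follows. The data values and the z-test inputs are hypothetical, the function names are assumptions, and the variance shown is the sample variance (divisor n - 1), which is an assumption because the handout's variance formula did not survive. The normal distribution from the standard library stands in for the printed z-table.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev, variance

# Hypothetical sample (illustrative only).
values = [16, 10, 14, 19, 11, 16, 13, 12, 15]
n = len(values)

x_bar = mean(values)
var = variance(values)   # sample variance (divisor n - 1): an assumption
sd = stdev(values)       # square root of the sample variance
sem = sd / sqrt(n)       # standard error of the mean

# Z scores from the table of commonly used confidence intervals.
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}

def confidence_interval(level):
    """Return (lower, upper) limits for the mean at the given level."""
    half_width = Z_SCORES[level] * sem
    return x_bar - half_width, x_bar + half_width

def z_test(sample_mean, pop_mean, pop_sd, sample_size):
    """Return (z, two-tailed p) for a sample mean against a known population."""
    se = pop_sd / sqrt(sample_size)
    z = (sample_mean - pop_mean) / se
    # The sign of z is ignored for the two-tailed lookup.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

print("variance:", var, "sd:", round(sd, 3), "sem:", round(sem, 3))
print("95% CI:", confidence_interval(95))

# Hypothetical z-test: sample mean 14.0 from n = 36, population mean 13.0, sd 3.0.
z, p = z_test(sample_mean=14.0, pop_mean=13.0, pop_sd=3.0, sample_size=36)
print("z:", z, "p:", round(p, 4))
```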
A negative z value means the sample mean is below the population mean (the sign is ignored in the lookup table).

T-test - An independent one-sample t-test is used to test whether the average of a sample differs significantly from a population mean, a specified value $\mu_0$.

Paired T-test - The dependent t-test for paired samples is used when the samples are paired. This implies that each individual observation of one sample has a unique corresponding member in the other sample.

Note: The Z-test, t-test and paired t-test are used with continuous data such as age, weight or height. Epidemiologists might use these tests to determine if there is a significant difference between the ages or weights of a group of cases and a control group. They might also use these tests to show that case and control groups are comparable. Paired t-tests are used in settings where the observations are matched, as in a before-and-after type of study. One might use a paired t-test to show whether there was a statistically significant reduction in body mass index after a weight control program. Each participant would contribute two values: a before BMI and an after BMI. These would be analyzed as paired data.

Chi-square - Any statistical test that uses the chi-square distribution, used to decide whether there is any difference between the observed (experimental) value and the expected (theoretical) value.

Chi-square test for independence of two attributes:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

where O is the observed frequency and E is the expected frequency.

Fisher's exact test - a statistical test used to determine if there are nonrandom associations between two categorical variables.

Let there exist two such variables X and Y, with m and n observed states, respectively. Now form an $m \times n$ matrix in which the entries $a_{ij}$ represent the number of observations in which X = i and Y = j. Calculate the row and column sums $R_i$ and $C_j$, respectively, and the total sum

$N = \sum_i R_i = \sum_j C_j$   (1)

of the matrix.
Then calculate the conditional probability of getting the actual matrix given the particular row and column sums, given by

$P_{\text{cutoff}} = \frac{(R_1! R_2! \cdots R_m!)(C_1! C_2! \cdots C_n!)}{N! \prod_{i,j} a_{ij}!}$   (2)

(with $a_{ij}$ the matrix entries, $R_i$ and $C_j$ the row and column sums, and N the total), which is a multivariate generalization of the hypergeometric probability function. Now find all possible matrices of nonnegative integers consistent with the row and column sums $R_i$ and $C_j$. For each one, calculate the associated conditional probability using (2), where the sum of these probabilities must be 1.

To compute the P-value of the test, the tables must then be ordered by some criterion that measures dependence, and those tables that represent equal or greater deviation from independence than the observed table are the ones whose probabilities are added together. There are a variety of criteria that can be used to measure dependence. In the $2 \times 2$ case, which is the one Fisher looked at when he developed the exact test, either the Pearson chi-square or the difference in proportions (which are equivalent) is typically used. Other measures of association, such as the likelihood-ratio test, G-squared, or any of the other measures typically used for association in contingency tables, can also be used.

The test is most commonly applied to $2 \times 2$ matrices, and is computationally unwieldy for large m or n. For tables larger than $2 \times 2$, the difference in proportions can no longer be used, but the other measures mentioned above remain applicable (and in practice, the Pearson statistic is most often used to order the tables). In the case of the $2 \times 2$ matrix, the P-value of the test can be simply computed by the sum of all P-values which are $\le P_{\text{cutoff}}$.

For an example application of the $2 \times 2$ test, let X be a journal, say either Mathematics Magazine or Science, and let Y be the number of articles on the topics of mathematics and biology appearing in a given issue of one of these journals.
If Mathematics Magazine has five articles on math and one on biology, and Science has none on math and four on biology, then the relevant matrix would be

              Mathematics Magazine    Science
math                   5                 0
biology                1                 4          (3)

Computing $P_{\text{cutoff}}$ gives

$P_{\text{cutoff}} = \frac{5!^2 \, 6! \, 4!}{10! \, (5! \, 0! \, 1! \, 4!)} = 0.0238$   (4)

and the other possible matrices and their P's are

[4 1; 2 3]   P = 0.2381   (5)
[3 2; 3 2]   P = 0.4762   (6)
[2 3; 4 1]   P = 0.2381   (7)
[1 4; 5 0]   P = 0.0238   (8)

which indeed sum to 1, as required. The sum of P-values less than or equal to $P_{\text{cutoff}} = 0.0238$ is then 0.0476 which, because it is less than 0.05, is significant.

The following data were obtained from investigating a foodborne outbreak at a small company outing. Only 20 people were in attendance and 12 became ill. You are looking at exposure to tuna salad.

Table: Illness among participants of company A outing who ate and did not eat tuna salad

                       Illness
Ate Tuna Salad     Yes     No     Total
Yes                 10      5      15
No                   2      3       5
Total               12      8      20

The relative risk is (10/15)/(2/5) = 1.67. These are small numbers, and two of the four cells have fewer than 5 observations. Fisher's exact test (also known as the Fisher-Irwin exact test) is an appropriate measure of the statistical significance of these differences. The p-value for a one-tailed Fisher's exact test is 0.296, while that for a two-tailed test is 0.603.

McNemar test for paired data - McNemar's test is basically a paired version of the chi-square test: a form of chi-square ($\chi^2$) test for matched paired data which is used to compare paired proportions. It can be used to analyze retrospective case-control studies, where each case is matched to a particular control. Or it can be used to analyze experimental studies, where the two treatments are given to matched subjects.

McNemar's test calculates a P value. This test uses only the number of discordant pairs, that is, the number of pairs for which the control was exposed to the risk factor but the case was not (4 as an example) and the number of pairs where the case was exposed to the risk factor but the control was not (25 as an example). Call these two numbers R and S. Calculate chi-square using this equation:

$\chi^2 = \frac{(|R - S| - 1)^2}{R + S}$

For this example, chi-square = 13.79, which has one degree of freedom.
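The Fisher and McNemar figures quoted above can be reproduced with a short standard-library sketch. The function names are assumptions made for illustration. The two-tailed Fisher value here orders tables by the Pearson chi-square criterion (for a 2x2 table with fixed margins, that is the distance of the top-left cell from its expected count); other ordering criteria can give a different two-tailed value.

```python
from math import comb, erfc, sqrt

def fisher_exact_2x2(a, b, c, d):
    """Return (one_tailed, two_tailed) p-values for the table [[a, b], [c, d]].

    one_tailed sums tables whose top-left cell is >= the observed one.
    two_tailed sums tables at least as far from independence as the
    observed one, using the Pearson chi-square ordering (an assumption).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    low, high = max(0, col1 - row2), min(row1, col1)

    def prob(k):
        # Hypergeometric probability of a table with top-left cell k.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    def deviation(k):
        # total * (k - expected count), kept in integers to avoid rounding.
        return abs(k * total - row1 * col1)

    cells = range(low, high + 1)
    one_tailed = sum(prob(k) for k in cells if k >= a)
    two_tailed = sum(prob(k) for k in cells if deviation(k) >= deviation(a))
    return one_tailed, two_tailed

def mcnemar(r, s):
    """Return (chi-square, two-tailed p) from the two discordant-pair counts."""
    chi2 = (abs(r - s) - 1) ** 2 / (r + s)
    # Upper tail of a chi-square distribution with one degree of freedom.
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

print("journals:", fisher_exact_2x2(5, 0, 1, 4))
print("tuna salad:", fisher_exact_2x2(10, 5, 2, 3))
print("McNemar (4, 25):", mcnemar(4, 25))
```

With these inputs the journal table gives a two-tailed value of about 0.0476, the tuna salad table gives about 0.296 (one-tailed) and 0.603 (two-tailed), and the discordant pairs 4 and 25 give a chi-square of about 13.79.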
The two-tailed P value is 0.0002. If there were really no association between risk factor and disease, there would be a 0.02 percent chance that the observed odds ratio would be so far from 1.0 (no association).

The example below is modified from Statistical Methods for Rates and Proportions (J. L. Fleiss, 2nd edition). An investigator is interested in determining if there is a significant difference between the proportions of patients diagnosed as having schizophrenia by two different physicians (physician A and physician B). Each of 100 patients is seen by both physicians, who each give a diagnosis, so each patient has two diagnoses, one from each physician.

Table 8.7. Two-by-two table for comparing rates of schizophrenia by physician A and B

                            Physician B
Physician A           Schizophrenia    Not schizophrenia    Total
Schizophrenia              35                  5              40
Not schizophrenia          25                 35              60
Total                      60                 40             100

$\chi^2 = \frac{(|b - c| - 1)^2}{b + c} = \frac{(|5 - 25| - 1)^2}{5 + 25} = 12.03$

where b and c are the two discordant cells. The p-value is determined for a $\chi^2$ statistic with 1 degree of freedom. The odds ratio in this instance is b/c = 5/25 = 0.2.

The chi-square, Fisher's exact and McNemar tests are used to determine the statistical significance of measures based on proportions, such as odds ratios or relative risks. The chi-square test is used where all of the cells in a 2x2 table have at least 5 observations, while Fisher's exact test is used when one or more of the cells have fewer than 5 observations. The McNemar test is used when cases and controls are matched by one or more variables, and it accounts for the lack of independence between observations.

Cochran-Mantel-Haenszel summary odds ratio - the Cochran-Mantel-Haenszel test (often called the Mantel-Haenszel test) is a hypothesis test for association between two variables while controlling for one or more nuisance or control variables. Mantel and Haenszel proposed stratification techniques to account for confounding.

How to Perform a Mantel-Haenszel test of a series of fourfold (2x2) tables.
Given two variables where each variable has exactly two possible outcomes (typically defined as success and failure), we define the odds ratio as:

$o = \frac{N_{11}/N_{12}}{N_{21}/N_{22}} = \frac{N_{11} N_{22}}{N_{12} N_{21}}$

where
N11 = number of successes in sample 1
N21 = number of failures in sample 1
N12 = number of successes in sample 2
N22 = number of failures in sample 2

The first definition shows the meaning of the odds ratio clearly, although it is more commonly given in the literature with the second definition.

The log odds ratio is the logarithm of the odds ratio:

$l(o) = \log\left\{\frac{N_{11}/N_{12}}{N_{21}/N_{22}}\right\} = \log\left\{\frac{N_{11} N_{22}}{N_{12} N_{21}}\right\}$

Alternatively, the log odds ratio can be given in terms of the proportions:

$l(o) = \log\left\{\frac{p_{11}/p_{12}}{p_{21}/p_{22}}\right\} = \log\left\{\frac{p_{11} p_{22}}{p_{12} p_{21}}\right\}$

where
p11 = N11/(N11 + N21) = proportion of successes in sample 1
p21 = N21/(N11 + N21) = proportion of failures in sample 1
p12 = N12/(N12 + N22) = proportion of successes in sample 2
p22 = N22/(N12 + N22) = proportion of failures in sample 2

Success and failure can denote any binary response. Dataplot expects "success" to be coded as "1" and "failure" to be coded as "0".

The bias-corrected version of the statistic is:

$l'(o) = \log\left[\frac{(N_{11} + 0.5)(N_{22} + 0.5)}{(N_{12} + 0.5)(N_{21} + 0.5)}\right]$

In addition to reducing bias, this statistic also has the advantage that the odds ratio is still defined even when N12 or N21 is zero (the uncorrected statistic will be undefined for these cases).

Note that N11, N21, N12, and N22 define a 2x2 contingency table. These types of contingency tables are also referred to as fourfold tables.

Fleiss, Levin, and Paik also use the following formulation for the ith 2x2 table:

                   Outcome Variable
Sample       Present         Absent           Total
1            X_i             n_i1 - X_i       n_i1
2            m_i - X_i       X_i - l_i        n_i2
Total        m_i             n_i. - m_i       n_i.

where $l_i = m_i + n_{i1} - n_{i.}$, which equals $m_i - n_{i2}$, so that each row and column adds up to its stated total.

The Mantel-Haenszel test can be used to estimate the common odds ratio and to test whether the overall degree of association is significant.
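It is a consistent estimator in the following two cases:
- The number of tables is fixed, and possibly small, but each table has large marginal frequencies.
- The number of tables is large. The marginal frequencies can be small in the individual tables.

Define the following quantities:

$R_i = \frac{X_i (X_i - l_i)}{n_{i.}}, \quad S_i = \frac{(n_{i1} - X_i)(m_i - X_i)}{n_{i.}}, \quad P_i = \frac{X_i + (X_i - l_i)}{n_{i.}}, \quad Q_i = \frac{(n_{i1} - X_i) + (m_i - X_i)}{n_{i.}}$

The Mantel-Haenszel estimate of the common odds ratio is

$\hat{o}_{MH} = \frac{\sum_{i=1}^{g} R_i}{\sum_{i=1}^{g} S_i}$

where g denotes the number of groups. An estimate of the variance of $\log(\hat{o}_{MH})$ is

$\widehat{\mathrm{Var}}\left(\log \hat{o}_{MH}\right) = \frac{\sum_i P_i R_i}{2 \left(\sum_i R_i\right)^2} + \frac{\sum_i (P_i S_i + Q_i R_i)}{2 \left(\sum_i R_i\right)\left(\sum_i S_i\right)} + \frac{\sum_i Q_i S_i}{2 \left(\sum_i S_i\right)^2}$

A confidence interval for the log(odds ratio) is then

$\log(\hat{o}_{MH}) \pm \Phi^{-1}(1 - \alpha/2) \, SE$

where $\Phi^{-1}$ is the normal percent point function and SE is the standard error of the estimate (= square root of the variance).

The Mantel-Haenszel chi-square statistic for the significance of the overall degree of association is

$\chi^2_{MH} = \frac{\left( \left| \sum_i \frac{n_{i1} n_{i2}}{n_{i.}} (p_{i1} - p_{i2}) \right| - 0.5 \right)^2}{\sum_i \frac{n_{i1} n_{i2}}{n_{i.} - 1} \, \bar{p}_i \bar{q}_i}$

where, for the ith table,
p_i1 = n11/n_i1 (proportion of successes in sample 1)
p_i2 = n12/n_i2 (proportion of successes in sample 2)
$\bar{p}_i = (n_{i1} p_{i1} + n_{i2} p_{i2}) / n_{i.}$
$\bar{q}_i = 1 - \bar{p}_i$

The test statistic is compared to a chi-square distribution with one degree of freedom.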
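The Mantel-Haenszel procedure can be sketched in Python as below. This is an illustration under stated assumptions, not a reference implementation: each table is written as (a, b, c, d) = (successes in sample 1, failures in sample 1, successes in sample 2, failures in sample 2), the variance of the log odds ratio is taken to be the Robins-Breslow-Greenland estimate, the chi-square uses the 0.5 continuity correction, and the two strata are hypothetical counts. The function name is also an assumption.

```python
from math import exp, log, sqrt

def mantel_haenszel(tables, z=1.96):
    """Return (common odds ratio, CI on the odds-ratio scale, chi-square).

    tables: iterable of (a, b, c, d) counts, one 2x2 table per stratum.
    z: Z score for the desired confidence level (1.96 for 95%).
    """
    sum_r = sum_s = 0.0
    sum_pr = sum_ps_qr = sum_qs = 0.0
    sum_diff = sum_var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
        # Observed minus expected successes in sample 1, and its variance.
        n1, n2, m = a + b, c + d, a + c
        sum_diff += a - n1 * m / n
        sum_var += n1 * n2 * m * (n - m) / (n * n * (n - 1))

    odds_ratio = sum_r / sum_s
    # Assumed variance estimate for log(odds ratio): Robins-Breslow-Greenland.
    var_log = (sum_pr / (2 * sum_r ** 2)
               + sum_ps_qr / (2 * sum_r * sum_s)
               + sum_qs / (2 * sum_s ** 2))
    se = sqrt(var_log)
    # Interval is built on the log scale, then exponentiated.
    ci = (exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se))
    chi2 = (abs(sum_diff) - 0.5) ** 2 / sum_var
    return odds_ratio, ci, chi2

# Two hypothetical strata (illustrative counts only).
strata = [(10, 20, 5, 25), (15, 10, 10, 15)]
odds_ratio, ci, chi2 = mantel_haenszel(strata)
print("common odds ratio:", round(odds_ratio, 3))
print("95% CI:", tuple(round(x, 3) for x in ci))
print("chi-square (1 df):", round(chi2, 3))
```

The resulting chi-square is compared with a chi-square distribution with one degree of freedom; the interval is reported on the odds-ratio scale by exponentiating the limits of the log-scale interval.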

