Intro to Epidemiology

Lecture Notes: Week 2

Beginning Exercise:

Look at the article you brought in today and answer these questions:

1. What is the name of the disease or condition your article is about?

2. How prevalent is this condition?

3. What are the causes of this condition?

4. How certain are the researchers/authors about their findings?

5. What groups do the article's findings apply to?

6. What was the state of knowledge about this topic prior to this study?

7. How was the study set up?

8. What laboratory methods were employed to gain these results?

9. How were the people in the study selected to participate in the study?

10. Could the study’s results be caused by something other than what the researchers were studying?

…Most of those answers won’t be in the articles you brought. In order to get the answers to those questions, you need to read the research itself.

Research papers can be found in scientific journals. Scientific journals are generally peer-reviewed – this means that before an article is published, it is read and reviewed by a panel of experts in that field. This process helps to ensure that papers published in the journal are of good quality. Peer review is not a perfect process; however, it is fair to say that an article from a peer-reviewed journal is a more reliable source than a paper which has not been peer-reviewed.

What is statistics?

Statistics is the branch of mathematics we apply to understanding data, especially large amounts of data. Statistics can be divided into two kinds: descriptive and inferential.

Descriptive statistics refers to any statistical method which summarizes or describes a set of data.

Some examples of descriptive statistics are mean, median and mode.

The mean is the average:

mean = (x_1 + x_2 + … + x_n) / n

The median is the middle number when all the numbers are listed in ascending or descending order. If there is an even number of numbers, the median is the mean of the two middle numbers.

The mode is the number which appears the most times in the list of numbers.

We will be using the mean a great deal, but it is important to remember the main failing of the mean – a single very large or very small number (that is, a small number of “outliers”) can have a significant effect on the mean. This is why the median is said to be more “robust” – the median is affected much less by a single out-of-the-ordinary number.
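If you like programming, here is a quick sketch in Python (the ages are made-up numbers) showing all three measures, and how a single outlier drags the mean much more than the median:

    from statistics import mean, median, mode

    ages = [12, 13, 13, 14, 15]
    print(mean(ages), median(ages), mode(ages))                  # 13.4  13  13

    # add one outlier: the mean jumps, the median barely moves
    ages_with_outlier = ages + [80]
    print(mean(ages_with_outlier), median(ages_with_outlier))    # 24.5  13.5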

Another form of descriptive statistics is graphical representation: charts and graphs. These include bar graphs, box-and-whisker plots, line graphs, pie charts and more.

Inferential statistics refers to any time we use statistical analysis of a smaller group to lead us to make guesses about the larger group. That is, when we take a sample that we believe to be representative, we feel we can infer that the mean of the sample is the same or similar to what the mean of the whole population would be, if we were able to measure the whole population. Inferential statistics are vital because it is virtually always impossible to study every member of a population.
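As a rough illustration (the "population" below is simulated, not real data), we can draw one sample from a large population and see that the sample mean lands close to the population mean:

    import random

    population = [random.gauss(165, 10) for _ in range(100000)]   # made-up heights, in cm
    sample = random.sample(population, 200)                       # the group we actually measure

    print(sum(population) / len(population))   # population mean (normally unknowable)
    print(sum(sample) / len(sample))           # sample mean - our estimate of it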

The Normal Curve:

The Central Limit Theorem states that if you take many samples from a population and compute the mean of each sample, those sample means will be approximately normally distributed (as long as the samples are reasonably large), even if the underlying population is not.

This is one reason the normal distribution shows up so often. When you graph normally distributed data, you get a normal curve, which looks like this:

[Figure: the normal (bell-shaped) curve]

The highest point of the curve is at the mean. Most data points cluster near the mean, and fewer and fewer data points occur as we move away from it.

Stratification of data: stratifying data means splitting all the data into categories based on some factor (ex: men and women) and then comparing the smaller groups to each other to decide whether the groups differ on the variable being examined.

For example, here’s a hypothetical sketch of what it might look like to stratify height data by gender:

[Figure: hypothetical height distributions for men and women, stratified by gender]
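A small sketch of the same idea in Python (the heights are invented, and "M"/"F" are just labels for the strata):

    heights = [("M", 178), ("F", 164), ("M", 172), ("F", 160), ("M", 181), ("F", 167)]

    strata = {}
    for group, height in heights:
        strata.setdefault(group, []).append(height)       # split the data by group

    for group, values in strata.items():
        print(group, sum(values) / len(values))           # compare the group means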

The variance is a measure of how spread out the data are: it is the average of the squared distances of the data points from the mean. The standard deviation is the square root of the variance.

The 68-95-99.7 rule says that about 68% of the data will fall within one standard deviation on either side of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations of the mean.
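Here is a sketch (with invented numbers) that computes a variance and standard deviation by hand, and then checks the 68-95-99.7 rule on simulated normal data:

    import random

    data = [2, 4, 4, 4, 5, 5, 7, 9]
    m = sum(data) / len(data)                                  # mean = 5
    variance = sum((x - m) ** 2 for x in data) / len(data)     # average squared distance = 4
    sd = variance ** 0.5                                       # standard deviation = 2
    print(m, variance, sd)

    normal_data = [random.gauss(0, 1) for _ in range(100000)]  # simulated normal data
    for k in (1, 2, 3):
        frac = sum(abs(x) < k for x in normal_data) / len(normal_data)
        print(k, "SD:", round(frac, 3))                        # roughly 0.68, 0.95, 0.997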

Data are sometimes rescaled ("normalized") to simpler, more convenient numbers. IQ scores, for example, are scaled so that the mean is 100 and the standard deviation is 15.

Confidence intervals are another very important concept: due to the nature of inferential statistics, we cannot be certain about our numbers. Confidence intervals represent that uncertainty by providing both a range of two numbers and a “% confidence” which qualifies that range.

A 95% confidence interval of [1.1, 1.5] means that if we were to repeat our experiment many times, each time drawing a new sample and computing a new confidence interval, we would expect about 95% of those intervals to contain the true value. It does NOT mean that we are "95% sure" the truth lies between 1.1 and 1.5, which is a common misconception.
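One way to see this is to simulate it. The sketch below (all numbers invented) builds a 95% confidence interval for the mean from many different samples and counts how often the interval covers the true mean – it should be roughly 95% of the time:

    import random

    true_mean, true_sd, n, trials = 50, 10, 100, 1000
    covered = 0
    for _ in range(trials):
        sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
        m = sum(sample) / n
        sd = (sum((x - m) ** 2 for x in sample) / (n - 1)) ** 0.5
        half_width = 1.96 * sd / n ** 0.5          # simple normal-approximation interval
        if m - half_width <= true_mean <= m + half_width:
            covered += 1
    print(covered / trials)                        # roughly 0.95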

A brief word about accuracy vs. precision. Precision is a measure of how consistent and repeatable our measurements are; accuracy is a measure of how close our measurements are to the truth. A darts player who always hits the same one square inch near the top of the board is very precise, but not very accurate. The darts example may help you see that the width of a confidence interval is not the whole story: a narrow (precise) interval can still miss the truth if the study is inaccurate (biased).

Something else to be aware of is the p-value, a significance-testing tool. When we ask whether a study's results are consistent with the null hypothesis (the hypothesis that there is no real effect), the p-value measures how likely it would be to see results at least as extreme as ours if the null hypothesis were true. A cutoff of .05 is often used: results with a p-value below it are called "statistically significant."
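As a toy example (the numbers are made up): suppose the null hypothesis is that a coin is fair, and we observe 58 heads in 100 flips. The p-value for "this many heads or more" can be computed directly from the binomial probabilities:

    from math import comb

    n, observed = 100, 58
    p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2 ** n
    print(round(p_value, 3))    # about 0.067 - not below the usual .05 cutoff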

Probability

Many of you already know that if you flip a coin a large number of times, you expect that about half the time it will come up heads and about half the time it will come up tails. The mathematical way to phrase that is that P(heads)=.5 (or you could also say P(tails)=.5)

As we demonstrated in class, while one person flipping a coin four times might get 3 heads and 1 tail or even 4 heads, if the whole class flips a coin four times, when we graph the results we see something close to a normal curve.
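Here is a rough sketch of that class demonstration in Python (1,000 simulated "students", each flipping a fair coin four times):

    import random
    from collections import Counter

    counts = Counter(sum(random.random() < 0.5 for _ in range(4)) for _ in range(1000))
    for heads in range(5):
        print(heads, "heads:", counts[heads])
    # expect roughly 1/16, 4/16, 6/16, 4/16, 1/16 of the students - a rough bell shape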

Here is what the probability tree looks like: starting from the first flip, each branch splits into heads (H) and tails (T), so after two flips there are four equally likely paths: HH, HT, TH and TT.

So what is the probability of getting two heads in a row?

(.5) * (.5) = .25, or ¼

We can see this on the tree: one of the four equally likely paths (HH) leads to two heads.

What is the probability of getting exactly one head in two flips?

P(heads, then tails) + P(tails, then heads) = .25 + .25 = .5

What about exactly 2 heads in four flips? There are six orderings that give exactly 2 heads (HHTT, HTHT, HTTH, THHT, THTH, TTHH), each with probability (.5)^4 = 1/16, so the answer is 6/16 = .375.

When you want the probability of event A AND event B both happening, multiply their probabilities (this works when the events are independent, like separate coin flips).

When you want the probability of event A OR event B happening, add their probabilities (this works when the events cannot both happen, like the different orderings above).
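You can check all three answers above by brute force – a small sketch that lists every equally likely sequence of flips and counts the ones we care about:

    from itertools import product

    flips2 = list(product("HT", repeat=2))    # 4 equally likely sequences
    flips4 = list(product("HT", repeat=4))    # 16 equally likely sequences

    print(sum(seq == ("H", "H") for seq in flips2) / len(flips2))      # 0.25  (two heads in a row)
    print(sum(seq.count("H") == 1 for seq in flips2) / len(flips2))    # 0.5   (exactly one head in two flips)
    print(sum(seq.count("H") == 2 for seq in flips4) / len(flips4))    # 0.375 (exactly 2 heads in four flips)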

Causation

What does it mean to say that a certain factor caused a disease?

One set of questions might be, could the person have gotten sick without that factor? (was it necessary)

And, if the factor in question was the only risk-factor present, would that be enough to cause the disease? (was it sufficient)

One way to think about risk factors is to imagine four kinds of people: doomed, causal, preventative and immune. The doomed people will get sick whether they are exposed to the thing we are studying or not. The causal group will get sick only if they are exposed. The preventative group will get sick only if they are NOT exposed. The immune group will not get sick either way.

The Bradford Hill criteria are a list of factors which can be persuasive when determining causation. They are not absolute rules; they are guidelines.

Some factors on the list are:

strength of effect

temporality (cause happened before effect)

plausible biological model

Remember, correlation is when you see an association between two things. That is, when A increases, B increases also. We will often talk about correlation, but it is vital to remember:

CORRELATION DOES NOT EQUAL CAUSATION

Also, keep in mind the existence of confounding factors. When two things appear to be correlated but in fact a third thing is causing both of them, that third thing is the confounding factor.

Measures of Disease Frequency

All measures of disease frequency need to have three elements:

1. the number of people sick

2. the total number of people in the larger population (the population at risk)

3. a measure of time

Ratios, Proportions and Rates:

It’s important to understand the difference between these:

-A ratio is a fraction made of two numbers where the numerator is not part of the denominator (the two quantities are independent of each other)

-A proportion is a fraction made of two numbers where the denominator is the total (the number in a population, in our case) and the numerator is always a subset of the denominator.

-A rate is a fraction whose denominator always involves time (for example, person-years)

Incidence:

Incidence rate is a measure of new cases of disease per unit of time (more precisely, per unit of person-time at risk).

Cumulative incidence is the proportion of a population which develops the disease during a given period of time.

Cumulative Incidence = Incidence Rate * time

CI = IR * t
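A worked example with made-up numbers: if the incidence rate is 5 new cases per 1,000 people per year and we follow the population for 2 years, then

CI = IR * t = (5 per 1,000 per year) * (2 years) = 10 per 1,000

so about 1% of the population develops the disease over those two years. (This simple multiplication is an approximation that works best when the disease is rare and the time period is short.)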

Prevalence:

Prevalence is a measure of how much of a given population has a disease at a given point in time. Remember the sand-through-the-funnel analogy – the sand pouring in represents new cases of the disease (incidence), the sand pouring out represents people either recovering or dying, and the sand sitting inside the funnel represents the prevalence.

P= prevalence, IR = Incidence Rate, D = Duration of disease

P/(1-P) = IR * D
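A worked example with made-up numbers: if IR = 0.01 new cases per person-year and the disease lasts D = 5 years on average, then

P/(1-P) = 0.01 * 5 = 0.05, so P = 0.05/1.05 ≈ 0.048, or about 4.8%

For a rare disease P is small, so the shortcut P ≈ IR * D is often used.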

Risk:

We can use measures of incidence, prevalence or cumulative incidence to measure risk.

Risk Difference = Risk_exposed – Risk_unexposed

RD = R_e – R_u

Where R_e and R_u can be incidence, prevalence or cumulative incidence, as long as they represent the same measure of risk

Risk Ratio = R_e/R_u

Risk Ratio, or RR, is a measure which compares the risk in two populations in a relative way. That is, the RR is not an absolute number but a comparison of the two risks. For example, if the RR is 1, the risk is the same in both groups. If it is 1.5, the exposed group's risk is 50% higher than the unexposed group's.
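A worked example with made-up numbers: if R_e = 30 cases per 1,000 exposed people and R_u = 20 cases per 1,000 unexposed people, then

RD = R_e – R_u = 30/1,000 – 20/1,000 = 10 per 1,000

RR = R_e/R_u = 30/20 = 1.5

so exposure is associated with 10 extra cases per 1,000 people (the absolute difference) and a risk 1.5 times as high (the relative difference).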
