Statistics in Biology - WELCOME TO MRS. HARENBERG'S CLASS!



Statistics in Biology

Measures of Central Tendency

The center of the distribution in a set of data can be measured by the mean, median, or mode.

Mean

The mean is calculated by summing all of the data points and dividing that sum by the total number of data points. The sample mean (x̄ = Σx / n) is an estimate of the mean of the entire population (μ).

Students in a biology class planted eight bean seeds in separate plastic cups and placed them under a bank of fluorescent lights. Fourteen days later, the students measured the height of the bean plants that grew from those seeds and recorded their results in Table 2.

Calculate the mean of these values in the space below.

What can you determine from your 8-plant sample mean compared to the center of the distribution of a population of 200 plants?

When would you use the mean? When might you not use the mean?

Median

The median is the middle point of the data. If you line up your data from largest to smallest, the middle value is your median. The median is valuable when your data show a large range or when there are data points that are extremely large or small. It is also valuable when your data set is small.

A researcher studying mouse behavior recorded in Table 3 the time (in seconds) it took 13 different mice to locate food in a maze.

Calculate the mean of the values.

Calculate the median of the values.

How do these two values compare in looking at the time it takes for a mouse to find food in a maze?

Justify which measure of central tendency you would use to explain the "average" time for the mouse to run the maze and find the food.

Mode

The mode is the value that occurs most often in your data set.
The value of the mode is less about an average and more about where your data cluster and how they are distributed. Below are examples of a modal distribution and a bimodal distribution.

[Figures: a modal distribution and a bimodal distribution]

If the mean or median had been used from the raw data, explain whether the modal patterns in each example would have been shown as they are with the mode.

Why might a scientist be interested in the modal distribution (where data are clustered)?

Measures of Variability

The variability in a set of data can be measured by the range, standard deviation, and variance.

Range

The range can be calculated in normally distributed data by subtracting the smallest value from the largest value. This is a simple measure of variability in the data. A large value indicates relatively large variability in the data, and a small value indicates low variability.

Students in a biology class measured the width in centimeters of eight leaves from eight different maple trees and recorded their results in Table 4.

Show your calculation of the range below.

What does your range value say about the variability of leaf width for maple trees?

Notice the value of leaf number 5. How does that value make you feel about the range of all maple leaves on a tree?

Standard Deviation

Standard deviation is the most common measure of variability. You should become intimately familiar with using this equation:

s = √( Σ(x − x̄)² / (n − 1) )

The sample standard deviation (s) is roughly the average deviation of each value from the sample mean; in other words, how different each value is from the mean. If we square each deviation, sum the squares, divide by the sample size minus one (n − 1), and take the square root, we get the standard deviation of the data set. The higher the standard deviation, the further your values are spread from the mean.

In a normally distributed sample, 34.1% of the values fall within 1 standard deviation on one side of the mean.
If we count both sides of the mean (the mean +/- 1 standard deviation), we should expect about 68.3% of our values to fall within that part of the normal curve. If we bump that out to +/- 2 standard deviations, we should expect about 95.4% of our values to fall within that part of the normal curve.

You are interested in knowing how tall bean plants (Phaseolus vulgaris) grow in two weeks after planting. You plant a sample of 20 seeds (n = 20) in separate pots and give them equal amounts of water and light. After two weeks, 17 of the seeds have germinated and have grown into small seedlings (now n = 17). You measure each plant from the tips of the roots to the top of the tallest stem. You record the measurements in Table 5, along with the steps for calculating the standard deviation.

Table 5. Plant Measurements and Steps for Calculating the Standard Deviation

Plant Number   Plant Height (mm)   (x − x̄) (mm)   (x − x̄)² (mm²)
1              112
2              102
3              106
4              120
5              98
6              106
7              80
8              105
9              106
10             110
11             95
12             98
13             74
14             112
15             115
16             109
17             100

mean (x̄) = ___________   Σ(x − x̄)² = ___________
variance (s²) = ___________
standard deviation (s) = ___________

Visualizing the Variability

Calculate the mean +/- 1 standard deviation:
Mean + 1s = ___________   Mean - 1s = ___________

Calculate the mean +/- 2 standard deviations:
Mean + 2s = ___________   Mean - 2s = ___________

Now that you have calculated the standard deviation, you can look at your data and see that 68.3% of the measurements fall between ______________ mm, and 95.4% fall between ____________ mm.

Another way to visualize the standard deviation relative to the mean is to create standard deviation bars. This is simply done by plotting the mean on a graph, either as a single point or as a column, and adding a T-shaped bar above and below the mean value. For your data on plant height, plot below the mean and representations of +/- 1 standard deviation and +/- 2 standard deviations.

How do these calculations change with sample size?
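The standard deviation steps for the Table 5 bean plants can be sketched in Python as a way to check hand calculations. This is a minimal sketch, not part of the original handout: the variable names are illustrative, and it assumes the sample formula that divides by n − 1.

```python
import math

# Plant heights in mm, transcribed from Table 5 (n = 17).
heights_mm = [112, 102, 106, 120, 98, 106, 80, 105, 106,
              110, 95, 98, 74, 112, 115, 109, 100]

n = len(heights_mm)
mean = sum(heights_mm) / n

# Squared deviation of each height from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in heights_mm]

# Assumption: sample variance, so the sum is divided by n - 1 rather than n.
variance = sum(squared_deviations) / (n - 1)
s = math.sqrt(variance)

print(f"mean = {mean:.1f} mm, variance = {variance:.1f} mm^2, s = {s:.1f} mm")
print(f"mean +/- 1s: {mean - s:.1f} mm to {mean + s:.1f} mm")
print(f"mean +/- 2s: {mean - 2 * s:.1f} mm to {mean + 2 * s:.1f} mm")
```

Shortening `heights_mm` to only a few plants and rerunning is one quick way to explore how the results change with sample size.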
Measures of Confidence

While the standard deviation tells us how spread out our data are from the mean, a different statistic can help us figure out the uncertainty of our mean calculation in the first place.

How do sample means vary from the entire population?

Take a 20-sided die and roll it 100 times. Record your values below.

Mean of all 100 samples = ________________

Use a random number generator to select at random 8 sub-samples of 5 values from the 100 you collected. Calculate the mean of each and write it in the bold box below.

SS1   SS2   SS3   SS4   SS5   SS6   SS7   SS8

For each mean in the box, how confident are you that it represents the mean of the entire population? Color each box red (not confident), yellow (somewhat confident), or green (confident).

Calculating Uncertainty

The relationship of variability between a sample mean and the population mean can be expressed by calculating the standard error of the mean (abbreviated as SE(x̄) or SEM), which is the sample standard deviation divided by the square root of the sample size (s/√n). If you took many samples and plotted the distribution of their means, the standard error of the mean represents the standard deviation of such a distribution and estimates how close the sample mean is to the population mean.

The less your values deviate from the mean, the smaller the SE. Also, the greater each sample size (i.e., 50 roll values rather than 5 roll values), the more closely the sample mean will estimate the population mean, and therefore the smaller the standard error of the mean becomes.

What the standard error of the mean tells you is that about two-thirds (68.3%) of the sample means would be within ±1 standard error of the population mean and 95.4% would be within ±2 standard errors.

Another more precise measure of the uncertainty in the mean is the 95% confidence interval (95% CI). For large sample sizes, the 95% CI can be calculated using this formula: 1.96 s/√n, which is typically rounded to 2 s/√n for ease of calculation.
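The standard error of the mean and the large-sample 95% CI described above can be sketched in Python. This is only an illustration under stated assumptions: the function names `sem` and `ci95_half_width` are hypothetical, the die is assumed to have 20 sides, and the rolls are simulated with a fixed seed rather than recorded by hand.

```python
import math
import random
import statistics


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def ci95_half_width(values):
    """Large-sample 95% CI half-width: 1.96 * s / sqrt(n)."""
    return 1.96 * sem(values)


rng = random.Random(0)  # fixed seed so the simulated rolls are repeatable
rolls = [rng.randint(1, 20) for _ in range(100)]  # assumed 20-sided die

print("mean of all 100 rolls =", round(statistics.mean(rolls), 2))

# Compare a sub-sample of 5 rolls with a sub-sample of 50 rolls.
for size in (5, 50):
    sub_sample = rng.sample(rolls, size)
    print(f"n = {size}: mean = {statistics.mean(sub_sample):.2f}, "
          f"SEM = {sem(sub_sample):.2f}, "
          f"95% CI = +/- {ci95_half_width(sub_sample):.2f}")
```

Because the 1.96 multiplier is a large-sample approximation, the interval printed for the 5-roll sub-sample is narrower than a proper small-sample interval would be.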
In other words, the 95% CI is about twice the standard error of the mean.

SEM Error Bars

Many bar graphs include error bars, which may represent standard deviation, SEM, or 95% CI. When the bars represent SEM, you know that if you took many samples, only about two-thirds of the error bars would include the population mean. This is very different from standard deviation bars, which show how much variation there is among individual observations in a sample. When the error bars represent 95% confidence intervals in a graph, you know that in about 95% of cases the error bars include the population mean. If a graph shows error bars that represent SEM, you can estimate the 95% confidence interval by making the bars twice as big. This is a fairly accurate approximation for large sample sizes, but for small samples the 95% confidence intervals are actually more than twice as big as the SEMs.

If we are trying to tell whether two or more samples are significantly different from each other, we can look for error bar overlap. Error bars for different columns that overlap the mean may indicate that these two samples are not significantly different. A non-overlap of error bars may indicate that these two samples are significantly different. To be sure, another statistical test called a t-test could be performed.

Seeds of many weed species germinate best in recently disturbed soil that lacks a light-blocking canopy of vegetation. Students in a biology class hypothesized that weed seeds germinate best when exposed to light. To test this hypothesis, the students placed a seed from crofton weed (Ageratina adenophora, an invasive species on several continents) in each of 20 petri dishes and covered the seeds with distilled water. They placed half the petri dishes in the dark and half in the light. After one week, the students measured the combined lengths in millimeters of the radicles and shoots extending from the seeds in each dish.

Table 6.
Combined Lengths of Crofton Weed Radicles and Shoots after One Week in the Dark and the Light

Petri Dish   Dark (x1) (mm)   Light (x2) (mm)   Dark (x1 − x̄1)² (mm²)   Light (x2 − x̄2)² (mm²)
1 and 2      12               18                5.8                      0.16
3 and 4      8                22                2.6                      12.96
5 and 6      15               17                29.1                     1.96
7 and 8      13               23                11.5                     21.16
9 and 10     6                16                13.0                     5.76
11 and 12    4                18                31.4                     0.16
13 and 14    13               22                11.6                     12.96
15 and 16    14               12                19.3                     40.96
17 and 18    5                19                21.1                     0.36
19 and 20    6                17                13.0                     1.96
Sum                                             158                      98.4

Mean = 9.6 mm (dark), 18.4 mm (light)
s = ______ mm (dark), s = ______ mm (light)
Standard Error = ______ (dark), ______ (light)
95% CI = ______ (dark), ______ (light)

Use columns to graph your data for both treatments, using SE and 95% CI below.

Remember the initial hypothesis, "that weed seeds germinate best when exposed to light". With the statistical evidence you collected, explain whether the hypothesis can be rejected or fails to be rejected.

Explain how the SEM error bars support your answer above. You may use either 1 SEM or 95% CI as support.

Based on the error bars, explain whether you can conclude that the population means (and thus the treatments) are different. Can you extend your thinking to whether the difference is purely by chance or whether it is statistically significant?

Proceed to Calculating Descriptive Statistics

Inferential Statistics: T-Test

Statistical hypotheses are different from experimental hypotheses. In experimental hypotheses you are measuring whether one variable has an effect on a process. Statistics evaluates a statistical null hypothesis. This null hypothesis states that when comparing groups, the experimental effect had no impact on the process and any change is due to chance alone. The null is given the variable H0.

Suppose you grow 10 bean plants in dirt with added nitrogen and 10 bean plants in dirt without added nitrogen. You find that the means of these two samples are 13.2 centimeters and 11.9 centimeters, respectively. Does this result indicate that there is a difference between the two populations and that nitrogen might promote plant growth? Or is the difference in the two means merely due to chance?
A statistical test is required to discriminate between these possibilities.

How do we define "chance"?

The significance level is the probability of getting a test statistic rare enough that you are comfortable rejecting the null hypothesis (H0). The widely accepted significance level in biology is 0.05. If the probability (p) value is less than 0.05, you reject the null hypothesis; if p is greater than or equal to 0.05, you don't reject the null hypothesis.

Comparing Means

Remember back when we calculated and graphed SEM bars to see if light and dark treatments showed any difference in growth? We mentioned that to see if this was a statistically significant difference that we could hang our hat on, we would need a method of comparing the means. Enter the t-test.

The t-test assesses the probability of getting the observed result if the null statistical hypothesis (H0) is true. Typically, the null statistical hypothesis in a t-test is that the mean of sample 1 is equal to the mean of sample 2, or μ1 = μ2. Rejecting H0 supports the alternative hypothesis, H1, that the means are significantly different (μ1 not equal to μ2). In the plant example, the t-test determines whether any observed differences between the means of the two groups of plants are statistically significant or have likely occurred simply by chance.

Calculating the T-Test

Table 6. Combined Lengths of Crofton Weed Radicles and Shoots after One Week in the Dark and the Light

1. State your null hypothesis (H0):
2. Calculate the t-test SE value: divide the variance (s²) for each treatment by its sample size, add the two results, and take the square root.
3. Subtract mean 2 from mean 1 and take the absolute value.
4. Divide the answer in step 3 by the answer in step 2. This is the tobs (t observed) measurement.
5. Find the degrees of freedom by adding the total number of data points in the experiment and subtracting 2. Use the degrees of freedom to look up tcrit in the critical t-value table.
6. Compare the tcrit (t-critical) value to the tobs value.
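The calculation steps above can be sketched in Python using the Table 6 crofton weed measurements. This is a minimal sketch: the function name `t_observed` is hypothetical, and the critical value 2.101 (two-tailed, p = 0.05, 18 degrees of freedom) is taken from a standard t-table rather than from this handout.

```python
import math
import statistics

# Combined radicle and shoot lengths in mm, transcribed from Table 6.
dark_mm = [12, 8, 15, 13, 6, 4, 13, 14, 5, 6]
light_mm = [18, 22, 17, 23, 16, 18, 22, 12, 19, 17]


def t_observed(sample1, sample2):
    """t statistic for two independent samples, following the handout's steps."""
    # Variance of each treatment divided by its sample size, summed, then square-rooted.
    se = math.sqrt(statistics.variance(sample1) / len(sample1)
                   + statistics.variance(sample2) / len(sample2))
    # Absolute difference between the two sample means.
    difference = abs(statistics.mean(sample1) - statistics.mean(sample2))
    return difference / se


t_obs = t_observed(dark_mm, light_mm)

# Degrees of freedom: total number of data points minus 2.
degrees_of_freedom = len(dark_mm) + len(light_mm) - 2

# Assumption: critical t for df = 18 at p = 0.05 (two-tailed), from a standard table.
T_CRIT_DF18 = 2.101

print(f"t_obs = {t_obs:.2f}, df = {degrees_of_freedom}, t_crit = {T_CRIT_DF18}")
print("reject H0" if t_obs > T_CRIT_DF18 else "fail to reject H0")
```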
If the calculated t-value is greater than the appropriate critical t-value, this indicates that the means of the two samples are significantly different at the probability value listed (in this case, 0.05). If the calculated t is smaller, then you cannot reject the null hypothesis that there is no significant difference.

Explain how your tobs (t observed) measurement compared to the tcrit (t-critical) value. Was your null rejected, or did it fail to be rejected?

What does this say about the statistical significance of the difference between your two treatments?

If you were to answer the hypothesis based on your statistical evidence, what would you write?

Proceed to part C, Calculating t-Test Statistics

Inferential Statistics: Chi Square

Statistical hypotheses are different from experimental hypotheses. In experimental hypotheses you are measuring whether one variable has an effect on a process. Statistics evaluates a statistical null hypothesis. This null hypothesis states that when comparing groups, the experimental effect had no impact on the process and any change is due to chance alone. The null is given the variable H0.

The t-test is used to compare the sample means of two sets of data. The chi-square test is used to determine how the observed results compare to an expected or theoretical result.

Chi square should be done with raw values, not percentages or frequencies.
Chi square should also be calculated from a large sample size.

For this example we will use the example of Sickle Cell Anemia, pages 9-12.

Establishing Relationships Between Sets of Data

Correlation and causation are two relationships that often get confused in society. Does the fact that two pieces of data have a relationship to each other mean that one causes the other? The answer is no. In fact, we would first need to establish correlation. One way to do this is through the correlation coefficient (r).
This statistic measures how related two variables are, and the result will be a value between 1 and -1. The closer the value is to 0, the weaker the correlation. The correlation coefficient (r) establishes the relatedness between variables X and Y. A correlation greater than 0.8 is generally described as strong, whereas a correlation less than 0.5 is generally described as weak.

An r² value of 0.0 means that knowing X does not help you predict Y. There is no linear relationship between X and Y, and the best-fit line is a horizontal line going through the mean of all Y values. When r² equals 1.0, all points lie exactly on a straight line with no scatter. Knowing X lets you predict Y perfectly.

The r² value tells the strength of the relationship between variables X and Y. For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in Y can be explained by the linear relationship between X and Y (as described by the regression equation). The other 15% of the total variation in Y remains unexplained.

If you want to see data of correlation and how it does not prove causation, I would direct you to Spurious Correlations by Tyler Vigen, where you can experiment with his data.

(Document created by Bob Kuhn)
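As an illustration of the correlation coefficient (r) and r² discussed in this section, the calculation can be sketched in Python. The function name `pearson_r` and the paired values are hypothetical, invented only to show the arithmetic; they are not data from this handout.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements, for illustration only.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

r = pearson_r(x_values, y_values)
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
print(f"{r_squared:.0%} of the variation in Y is explained by the linear relationship with X")
```

A value of r computed this way only describes how closely the points follow a straight line; it says nothing about whether X causes Y.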