Statistics in Biology



Statistics for AS Biology

Descriptive Statistics

Repeated measurements in biology are rarely identical, due to random errors and natural variation. If enough measurements are repeated they can be plotted on a histogram, like the one on the right. This usually shows a normal distribution, with most of the repeats close to some central value. Many biological phenomena follow this pattern: eg. peoples' heights, number of peas in a pod, the breathing rate of insects, etc.

The central value of the normal distribution curve is the mean (also known as the arithmetic mean or average). But how reliable is this mean? If the data are all close together, then the mean is probably good, but if they are scattered widely, then the calculated mean may not be very reliable. The width of the normal distribution curve is given by the standard deviation (SD), and the larger the SD, the less reliable the data. For comparing different sets of data, a better measure is the 95% confidence interval (CI). This is derived from the SD, and is the range above and below the mean within which 95% of the repeated measurements lie (marked on the histogram above). You can be pretty confident that the real mean lies somewhere in this range. Whenever you calculate a mean you should also calculate a confidence limit to indicate the quality of your data.

In Excel the mean is calculated using the formula  =AVERAGE (range) , the SD is calculated using  =STDEV (range) , and the 95% CI is calculated using  =CONFIDENCE (0.05, STDEV(range), COUNT(range)) .

This spreadsheet shows two sets of data with the same mean. In group A the confidence interval is small compared to the mean, so the data are reliable and you can be confident that the real mean is close to your calculated mean. But in group B the confidence interval is large compared to the mean, so the data are unreliable, as the real mean could be quite far away from your calculated mean. Note that Excel will always return the results of a calculation to about 8 decimal places of precision. This is meaningless, and cells with calculated results should always be formatted to a more sensible precision (Format menu > Cells > Number tab > Number).

Plotting Data

Once you have collected data you will want to plot a graph or chart to show trends or relationships clearly. With a little effort, Excel produces very nice charts. First enter the data you want to plot into two columns (or rows) and select them.

Drawing the Graph. Click on the chart wizard [pic]. This has four steps:

1. Graph Type. For a bar graph choose Column and for a scatter graph (also known as a line graph) choose XY(Scatter) then press Next. Do not choose Line.

2. Source Data. If the sample graph looks OK, just hit Next. If it looks wrong you can correct it by clicking on the Series tab, then the red arrow in the X Values box, then highlight the cells containing the X data on the spreadsheet. Repeat for the Y Values box.

3. Chart Options. You can do these now or change them later, but you should at least enter suitable titles for the graph and the axes and probably turn off the gridlines and legend.

4. Graph Location. Just hit Finish. This puts the chart beside the data so you can see both.

Changing the Graph. Once you have drawn the graph, you can now change any aspect of it by double-clicking (or sometimes right-clicking) on the part you want to change. For example you can:

• move and re-shape the graph

• change the background colour (white is usually best!)

• change the shape and size of the markers (dots)

• change the axes scales and tick marks

• add a trend line or error bars (see below)

Lines. To draw a straight "line of best fit" right click on a point, select Add Trendline, and choose linear. In the option tab you can force it to go through the origin if you think it should, and you can even have it print the line equation if you are interested in the slope or intercept of the trend line. If instead you want to "join the dots" (and you don't often) double-click on a point and set line to automatic.

Error bars. These are used to show the confidence intervals on the graph. You must already have entered the 95% confidence limits on the spreadsheet beside the X and Y data columns. Then double-click on the points on the graph to get the Format Data Series dialog box and choose the Y Error Bars tab. Click on the red arrow in the Custom + box, and highlight the range of cells containing your confidence limits. Repeat for the Custom - box.

Problems

1. Here are the results of an investigation into the rate of photosynthesis in the pond weed Elodea. The number of bubbles given off in one minute was counted under different light intensities, and each measurement was repeated 5 times. Use Excel to calculate the means and 95% confidence limits of these results, then plot a graph of the mean results with error bars and a line of best fit.

| |light intensity |repeat 1 |repeat 2 |repeat 3 |repeat 4 |repeat 5 |

| |(Lux) | | | | | |

| |0 |5 |2 |0 |2 |1 |

| |500 |12 |4 |5 |8 |7 |

| |1000 |7 |20 |18 |14 |24 |

| |2000 |42 |25 |31 |14 |38 |

| |3500 |45 |40 |36 |50 |28 |

| |5000 |65 |54 |72 |58 |36 |

Statistics for A2 Biology

There is a bewildering variety of statistical tests available, and it is important to choose the right one. This flow chart will help you to decide which statistical test to use, and the tests are described in detail on the next 5 pages.

[pic]

Statistics to Test for a Correlation

Correlation statistics are used to investigate an association between two factors such as age and height; weight and blood pressure; or smoking and lung cancer. After collecting as many pairs of measurements as possible of the two factors, plot a scatter graph of one against the other. If both factors increase together then there is a positive correlation, or if one factor decreases when the other increases then there is a negative correlation. If the scatter graph has apparently random points then there is no correlation.

[pic]

There are two statistical tests to quantify a correlation: the Pearson correlation coefficient (r), and Spearman's rank-order correlation coefficient (rs). These both vary from +1 (perfect correlation) through 0 (no correlation) to –1 (perfect negative correlation). If your data are continuous and normally-distributed use Pearson, otherwise use Spearman. In both cases the larger the absolute value (positive or negative), the stronger, or more significant, the correlation. Values grater than 0.8 are very significant, values between 0.5 and 0.8 are probably significant, and values less than 0.5 are probably insignificant.

In Excel the Pearson coefficient r is calculated using the formula: =CORREL (X range, Y range) . To calculate the Spearman coefficient rs, first make two new columns showing the ranks (or order) of the two sets of data, and then calculate the Pearson correlation on the rank data. The highest value is given a rank of 1, the next highest a rank of 2 and so on. Equal values are given the same rank, but the next rank should allow for this (e.g. if there are two values ranked 3, then the next value is ranked 5).

In this example the size of breeding pairs of penguins was measured to see if there was correlation between the sizes of the two sexes. The scatter graph and both correlation coefficients clearly indicate a strong positive correlation. In other words large females do pair with large males. Of course this doesn't say why, but it shows there is a correlation to investigate further.

Linear Regression to Investigate a Causal Relationship.

If you know that one variable causes the changes in the other variable, then there is a causal relationship. In this case you can use linear regression to investigate the relation in more detail. Regression fits a straight line to the data, and gives the values of the slope and intercept of that line (m and c in the equation y = mx + c).

The simplest way to do this in Excel is to plot a scatter graph of the data and use the trendline feature of the graph. Right-click on a data point on the graph, select Add Trendline, and choose Linear. Click on the Options tab, and select Display equation on chart. You can also choose to set the intercept to be zero (or some other value). The full equation with the slope and intercept values are now shown on the chart.

In this example the absorption of a yeast cell suspension is plotted against its cell concentration from a cell counter. The trendline intercept was fixed at zero (because 0 cells have 0 absorbance), and the equation on the graph shows the slope of the regression line.

[pic]

The regression line can be used to make quantitative predictions. For example, using the graph above, we could predict that a cell concentration of 9 x 107 cells per cm3 would have an absorbance of 1.37 (9 x 0.152).

T-Test to Compare Two Sets of Data

Another common form of data analysis is to compare two sets of measurements to see if they are the same or different. For example are plants treated with fertiliser taller than those without? If the means of the two sets are very different, then it is easy to decide, but often the means are quite close and it is difficult to judge whether the two sets are the same or are significantly different. To compare two sets of data use the t-test, which tells you the probability (P) that there is no difference between the two sets. This is called the null hypothesis.

P varies from 0 (impossible) to 1 (certain). The higher the probability, the more likely it is that the two sets are the same, and that any differences are just due to random chance. The lower the probability, the more likely it is that that the two sets are significantly different, and that any differences are real. Where do you draw the line between these two conclusions? In biology the critical probability is usually taken as 0.05 (or 5%). This may seem very low, but it reflects the facts that biology experiments are expected to produce quite varied results. So if P > 5% then the two sets are the same (i.e. accept the null hypothesis), and if P  ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download