Biodiversity Data Analysis: Testing Statistical Hypotheses - Miami


By Joanna Weremijewicz, Simeon Yurek, Steven Green, Ph.D., and Dana Krempels, Ph.D.

In biological science, investigators often collect biological observations that can be tabulated as numerical facts, also known as data (singular = datum).

Figure 1. Three types of data can be collected.

I. Types of Data

Recall that data can be of three basic types:

1. Attribute data are descriptive, "either-or" measurements, such as
• presence or absence of a particular attribute
• presence or absence of a genetic trait ("freckles" or "no freckles")
• type of a genetic trait (type A, B, AB, or O blood)

Because such data have no specific sequence, they are considered unordered.

2. Discrete numerical data are observations counted as integers, such as
• the number of leaves on each member of a group of plants
• the number of breaths per minute in a group of newborns
• the number of beetles per square meter of forest floor

These data are ordered, but they do not describe physical attributes.

3. Continuous numerical data fall along a numerical continuum, such as
• tail length
• brain volume
• percent body fat
• anything that varies on a continuous scale

Rates (e.g., miles per hour; mL/min) are continuous numerical data. Continuous numerical data often fall along a normal (Gaussian, bell-shaped) distribution. The limit of resolution of continuous data is set by the precision of the methods and instruments used to collect them.

Biodiversity - Data Analysis


II. Data Values of Interest

When an investigator collects numerical data from a group of subjects, s/he must determine how and with what frequency the data vary. For example, if we wish to study the distribution of body weight in adult male humans, we could

• weigh a sample of that population (say, 100 adult male humans)
• plot the numbers
    o independent variable "body weight" along the abscissa (x-axis)
    o dependent variable "frequency (%)" along the ordinate (y-axis)

The resulting frequency distribution (Figure 2) shows how often each measurement value occurs in the sample.

Figure 2. Example of a Gaussian curve showing the number of individuals in a sample at each point on a continuum.
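The tallying step behind a frequency distribution can be sketched in a few lines of Python. This is only an illustration: the ten weights, the 10 lb bin width, and the function name `frequency_distribution` are assumptions made up for this sketch, not data or methods from the course.

```python
from collections import Counter

# Hypothetical body weights (lb); invented for illustration only.
weights = [180, 195, 200, 200, 205, 210, 200, 195, 220, 185]

def frequency_distribution(values, bin_width):
    """Return {bin_start: percent of observations falling in that bin}."""
    # Integer division groups each value into the bin that starts at a multiple of bin_width.
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {start: 100 * n / total for start, n in sorted(counts.items())}

freq = frequency_distribution(weights, bin_width=10)
for start, pct in freq.items():
    # x-axis: weight bin; y-axis: frequency (%)
    print(f"{start}-{start + 9} lb: {pct:.0f}%")
```

Plotting the bin starts on the x-axis against the percentages on the y-axis gives the kind of curve described above.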

Data are usually distributed over a range of values. Measures of the tendency of data to occur near the center of the range include the population

• mean - the average measurement
• median - the middle measurement when all measurements are arranged in order
• mode - the most frequent measurement in the range
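As a minimal sketch, the three measures of central tendency can be computed with Python's standard `statistics` module. The weights below are hypothetical values invented for this example.

```python
import statistics

# Hypothetical body weights (lb); invented for illustration only.
weights = [180, 195, 200, 200, 205, 210, 200, 195, 220, 185]

mean = statistics.mean(weights)      # sum of the values divided by their count
median = statistics.median(weights)  # middle value of the sorted data
mode = statistics.mode(weights)      # most frequently occurring value

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

In a symmetric, bell-shaped distribution these three values fall close together; when they differ widely, the distribution is probably skewed.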

It is also important to understand variation around the mean.

For example, if the average human weighs 200 lb, we must determine whether weight forms a very wide distribution (with individuals spread across the entire range from 100 to 300 lb) or a narrow distribution that hovers near the mean (with most individuals close to 200 lb).

Measurements of dispersion around the mean include

• range
• variance
• standard deviation
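The three measures of dispersion can be sketched the same way. The weights are again hypothetical, and the sketch assumes the data are a sample, so the variance and standard deviation divide by n - 1.

```python
import statistics

# Hypothetical body weights (lb); invented for illustration only.
weights = [180, 195, 200, 200, 205, 210, 200, 195, 220, 185]

value_range = max(weights) - min(weights)       # largest minus smallest value
sample_variance = statistics.variance(weights)  # mean squared deviation, dividing by n - 1
sample_sd = statistics.stdev(weights)           # square root of the sample variance

print(f"range = {value_range} lb")
print(f"variance = {sample_variance:.1f} lb^2")
print(f"standard deviation = {sample_sd:.1f} lb")
```

A small standard deviation relative to the mean corresponds to the narrow distribution described above; a large one corresponds to the wide distribution.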


III. Parameters vs. Statistics

If you were able to measure the weight of every adult male Homo sapiens who ever lived, and then calculate a mean, median, mode, range, variance and standard deviation from your measurements, those values would be known as parameters. Parameters represent the actual values calculated from measuring every member of a population of interest.

It is usually difficult (or impossible) to obtain data from every member of a population of interest. However, one can estimate parameters by randomly sampling members of the population to calculate a statistic. Statistics are estimates of population parameters.

• A parameter is written as a Greek symbol.
• A statistic is written as the equivalent Roman letter.

For example, the standard deviation of the weight of every adult human male who ever lived is represented as σ. The statistic calculated to estimate that parameter is written as s. More examples are shown in Table 1.

Table 1. Greek and corresponding Roman letters used to represent population parameters and sample statistics.
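The distinction between a parameter and a statistic can be illustrated with a simulation. The sketch below assumes a made-up "population" of 10,000 normally distributed weights; the population size, mean, spread, and sample size are all arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

# A made-up population of weights (lb). In real studies the whole
# population is rarely available, which is why we sample.
population = [random.gauss(200, 25) for _ in range(10_000)]

# Parameters: calculated from every member of the population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population SD divides by N

# Statistics: calculated from a random sample of 100 members.
sample = random.sample(population, 100)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample SD divides by n - 1

print(f"parameter mu = {mu:.1f}, statistic x-bar = {x_bar:.1f}")
print(f"parameter sigma = {sigma:.1f}, statistic s = {s:.1f}")
```

The statistics will be close to, but usually not identical to, the parameters they estimate, and they change each time a new random sample is drawn.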


IV. Probability

The threshold probability below which an observed result is considered unlikely to be due to chance alone is known as alpha (α). By convention, α is usually set at 0.05, or 5%, which means that we accept a 5% risk of rejecting the null hypothesis when it is actually true. In essence, α is a "cut-off value" that defines the area(s) in a probability distribution where a particular value is unlikely to fall if only chance is at work.

(In some studies, a more rigorous α of 0.01 (1%) is required to reject the null hypothesis, and in some others, a more lenient α of 0.1 (10%) is allowed. We will use an α level of 0.05.)

In scientific endeavors, statistical significance has a highly specific definition. The difference between an observed and expected result is said to be statistically significant if and only if:

Under the assumption that there is no true difference,

• the probability of obtaining a difference at least as large as the one actually observed is less than or equal to α (5%; 0.05).

• equivalently, the probability of obtaining a difference smaller than the one actually observed is at least 95% (0.95).
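The definition above can be made concrete with an exact permutation test. Note that this is a different technique from the Mann-Whitney U test used in this lab; it is swapped in here only because it follows the definition step by step. The two groups of values and the function name `exact_permutation_p_value` are hypothetical.

```python
from itertools import combinations

def exact_permutation_p_value(group_a, group_b):
    """Two-tailed probability, assuming no true difference, of a difference
    in means at least as large as the one actually observed."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    total = sum(pooled)

    at_least_as_large = 0
    n_splits = 0
    # If there is no true difference, every way of splitting the pooled
    # values into groups of these sizes is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for floating-point rounding
            at_least_as_large += 1
    return at_least_as_large / n_splits

ALPHA = 0.05

# Hypothetical measurements from two sites; invented for illustration only.
site_a = [1.2, 1.5, 1.7, 1.9, 2.1]
site_b = [2.6, 2.8, 3.0, 3.3, 3.5]

p = exact_permutation_p_value(site_a, site_b)
print(f"p = {p:.4f}; statistically significant at alpha = {ALPHA}: {p <= ALPHA}")
```

Here only 2 of the 252 possible splits give a difference as large as the observed one, so the probability falls well below α and the difference would be called statistically significant.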

A probability distribution assigns a relative probability to each possible outcome (e.g., each possible value of Menhinick's Index). The species richness values you calculated for each sample, while expressed as numbers, are not distributed along a normal curve. They are treated as ordinal, rather than continuous, data. For this reason, a non-parametric statistical test, the Mann-Whitney U test, should be used for your analysis.

V. Statistical Hypotheses

Your team should already have devised two statistical hypotheses:
• null hypothesis (Ho)
• alternative hypothesis (Ha)

The null hypothesis states that there is no difference between the two populations being compared.

The alternative hypothesis may be either
• nondirectional (two-tailed)
• directional (one-tailed)


A two-tailed hypothesis does not specify the way the populations will differ. ("Pond A and Pond B will differ in species richness.")

A one-tailed hypothesis does specify the way the populations will differ. ("Pond A will have greater species richness than Pond B.")

Your team should already have devised null and alternative hypotheses for your survey of biodiversity. To determine whether or not there is a significant difference in biodiversity between your two sample sites, you must perform a statistical test on your data, the series of Menhinick's Indices (D) that you calculated from your individual survey samples.
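As a reminder of where each D value comes from, here is a minimal sketch of the calculation. It assumes the usual definition of Menhinick's Index, D = S / √N (S = number of species in the sample, N = total number of individuals), which this handout does not spell out. The species counts and the function name `menhinick_index` are hypothetical.

```python
import math

def menhinick_index(counts_per_species):
    """Menhinick's Index, assuming the usual definition D = S / sqrt(N)."""
    present = [c for c in counts_per_species if c > 0]
    s = len(present)  # S: number of species observed in the sample
    n = sum(present)  # N: total number of individuals in the sample
    if n == 0:
        raise ValueError("sample contains no individuals")
    return s / math.sqrt(n)

# Hypothetical survey sample: individuals counted for each of 5 species.
sample_counts = [12, 7, 3, 2, 1]

d = menhinick_index(sample_counts)
print(f"D = {d:.2f}")
```

Repeating this for every survey sample at each site produces the two series of D values that the statistical test compares.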

VI. Applying a Statistical Test to Your Data: Example

The Mann-Whitney U test determines whether there is a significant difference between two sets of ranked data by detecting their degree of overlap.

• A large degree of overlap means that the two data sets are not very different.
• A small degree of overlap means that the two data sets are different.

If the probability of obtaining so little overlap by chance alone is 0.05 (5%) or less, the null hypothesis is rejected.

The Mann-Whitney test allows the investigator (you) to compare two data sets (from your two habitat types) without assuming that your data (D values or species abundance values) are normally distributed.

The Mann-Whitney U does have its rules. For this test to be appropriate:

• Data sets must be from random, independent samples (your two sites)
• The measurements (Menhinick's Indices, in our case) should be ordinal
• No two measurements should have exactly the same value

The Mann-Whitney U test allows the investigator to determine whether there is a significant difference between two sets of ordered/ranked data, such as those your team has collected in its biodiversity study.
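One way to see how U measures overlap is to count pairs directly: for every pairing of a value from one data set with a value from the other, note which is larger. The sketch below uses this pairwise-counting form of the U statistic. The index values and the function name `mann_whitney_u` are hypothetical, and the sketch does not replace looking up the critical value for your sample sizes.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b, U), where U is the smaller of the two.

    U_a counts the pairs in which the value from sample_a is larger than
    the value from sample_b; a tied pair counts as one half.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # Every pair is counted once, so the two U values sum to n_a * n_b.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b, min(u_a, u_b)

# Hypothetical Menhinick's Indices from two sites; invented for illustration only.
silted = [0.8, 1.1, 1.3, 1.4, 1.6]
clear = [1.5, 1.9, 2.2, 2.4, 2.7]

u_silted, u_clear, u = mann_whitney_u(silted, clear)
print(f"U(silted) = {u_silted}, U(clear) = {u_clear}, U = {u}")
```

A U near zero means the two data sets barely overlap; a U near half the number of pairs means they overlap almost completely. The calculated U is then compared with the critical value from a Mann-Whitney table for the chosen α and sample sizes.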

Example: Table 1 shows 18 (imaginary) values for Menhinick's Indices from two ponds, silted (S) and clear (C). Table 2 shows the values ranked and labeled by pond.

If two values are the same, then each one receives the average of their two ranks.
• Value 9 appears twice, at ranks 6 and 7.
• Add the two ranks and divide by two to get their mean: 13/2 = 6.5.
• Whenever there is a tie, each tied value is assigned the same mean rank.
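The tie rule can be sketched as a small ranking function. The list of values below is hypothetical (it is not the handout's table); it was chosen only so that the value 9 lands at ranks 6 and 7, as in the example above. The function name `average_ranks` is also an assumption.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share their mean rank."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' to cover the whole run of values tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Mean of the ranks the tied values would otherwise occupy.
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

# Hypothetical values; 9 appears twice and would occupy ranks 6 and 7.
values = [2, 3, 4, 5, 7, 9, 9, 11, 12]
print(average_ranks(values))
# prints [1.0, 2.0, 3.0, 4.0, 5.0, 6.5, 6.5, 8.0, 9.0]
```

Both 9s receive rank 6.5, and the next value continues at rank 8, so the total of all ranks is unchanged by the tie.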

