Chapter 6

Putting Statistics to Work: the Normal Distribution

Productive inference from sample to population requires that the appropriate statistic be used to characterize the probabilities associated with the distributions of interest. Because there are hypothetically an infinite number of means and an infinite number of standard deviations describing potential distributions, we face a problem: we do not have an infinite number of statistical procedures to deal with every possible distribution. Does this mean that we cannot use statistics to analyze the vast majority of our data? Thankfully, no. While each distribution is unique, most distributions can be grouped with other distributions that share important characteristics. Each group of similar distributions can in turn be characterized by an 'ideal' (i.e., theoretical) distribution that typifies those important characteristics. Statistics applicable to the entire group of similar distributions can then be developed from our knowledge of the ideal distribution. Perhaps the most important ideal distribution is the 'normal' distribution (Figure 6.1). Once one understands the characteristics of the normal distribution, knowledge of other distributions is easily obtained.

Figure 6.1. The normal distribution.

Most people are familiar with the normal distribution described as a "bell-shaped curve," perhaps as a scale for grading. The bell-shaped curve is nothing but a special case of the normal distribution; the words "bell-shaped" describe the general shape of the distribution, and the word "curve" is used as a synonym for distribution. While we generally refer to the normal distribution, there are really many different normal distributions. In fact, there are as many different normal distributions as there are possible means and standard deviations, both theoretically and in the real world. However, all of these normal distributions share five characteristics.

1. Symmetry. If divided into left and right halves, each half is a mirror image of the other.

2. The maximum height of the distribution is at the mean. One of the consequences of this stipulation and number 1 above is that the mean, the mode and the median have identical values.

3. The area under a normal distribution sums to unity. This characteristic is simpler than it sounds but is important because of how we use the normal distribution. Areas within the theoretical distribution as a geometric form represent probabilities of events that range from 0 to 1 (i.e., 0% to 100%). The phrase 'sums to unity' means that all of the probabilities represented by the area under the normal distribution sum to 1 and thus represent all possible outcomes. Each half of the symmetrical distribution, of which the mean is the center, represents half (.5) of the probabilities.
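The claim that the area sums to unity can be checked numerically. The sketch below (Python is used here purely as an illustration; it is not part of the chapter) adds up the heights of the normal curve over a fine grid of narrow rectangles; integrating over μ ± 10σ captures essentially the whole distribution.

```python
import math

def normal_pdf(y, mu, sigma):
    """Height of the normal curve at the value y."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Approximate the area under the curve with narrow rectangles.
mu, sigma, step = 0, 1, 0.001
grid_start, grid_end = mu - 10 * sigma, mu + 10 * sigma
n = int((grid_end - grid_start) / step)
area = sum(normal_pdf(grid_start + i * step, mu, sigma) * step for i in range(n))
print(round(area, 6))  # ≈ 1.0
```

Changing mu and sigma leaves the total area unchanged, which is exactly the point: every normal distribution encloses the same unit of probability.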

4. Normal distributions are theoretically asymptotic at both ends, or tails, of the distribution. If we were to follow the slope of the curve along the tail toward infinity, the curve would come ever closer to zero without ever quite reaching it. This aspect of the normal distribution is necessary because we need to consider every possible variate to infinity. Put another way, every single possible variate can be assigned some probability of occurring, even if it is astronomically small.

5. The distribution of means of multiple samples from a normal distribution will have a tendency to be normally distributed. Considering this commonality among normal distributions requires thinking about means somewhat differently. As you know, means characterize groupings of variates. In this special context we need to consider calculating individual means on repeated samples, and plotting these means as variates that collectively create a new distribution that is composed of means. Accordingly, this new distribution has a tendency to be normally distributed. This issue will be further discussed in Chapter 7.
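The tendency described in point 5 can be demonstrated with a short simulation. In the sketch below (an illustration only; the population values μ = 100 and σ = 15, the sample size of 30, and the 2,000 repetitions are all arbitrary choices), each sample's mean is treated as a variate in a new distribution of means:

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Draw repeated samples from a normal population (mu = 100, sigma = 15)
# and record the mean of each sample as a variate in a new distribution.
sample_means = []
for _ in range(2000):
    sample = [random.gauss(100, 15) for _ in range(30)]
    sample_means.append(statistics.mean(sample))

# The distribution of means clusters around the population mean, and its
# spread is much narrower than the population's (roughly sigma / sqrt(n)).
print(round(statistics.mean(sample_means), 2))   # close to 100
print(round(statistics.stdev(sample_means), 2))  # close to 15 / sqrt(30), about 2.7
```

Plotting `sample_means` as a frequency distribution would show the familiar bell shape, which anticipates the discussion in Chapter 7.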

With these commonalities in mind, let us further consider some of the differences among normal distributions. First, as Figures 3.17, 3.18 and 3.19 show, normal distributions may be conceptualized as leptokurtic, platykurtic, or mesokurtic. Additionally, any combination of means and standard deviations is possible, and there is no necessary relationship between the mean and the standard deviation for any given distribution. Normal distributions may have different means and the same standard deviation (Figure 6.2) or the same mean and different standard deviations (Figure 6.3).

Figure 6.2. Two normal distributions with different means and the same standard deviation.

Figure 6.3. Two normal distributions with the same mean and different standard deviations.

If σ is large, variates are generally far from the mean. If σ is small, most variates are relatively close to the mean. Regardless of the standard deviation, variates near the mean in a normal distribution are more common, and therefore more probable, than variates in the tails of the distribution. One of the most useful aspects of normal distributions is that regardless of the value of μ or σ (Figure 6.4):

μ ± 1σ contains 68.26% of all variates

μ ± 2σ contains 95.44% of all variates

μ ± 3σ contains 99.74% of all variates

Figure 6.4. Percentages of variates within 1, 2, and 3 standard deviations from μ.
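These coverage figures can be verified numerically: for any normal distribution, the proportion of variates within k standard deviations of the mean equals erf(k/√2), where erf is the error function available in Python's standard library (Python is our illustration language here, not part of the chapter; slight differences in the last decimal reflect rounding).

```python
import math

def within_k_sigma(k):
    """Proportion of a normal distribution falling within k standard
    deviations of the mean: P(mu - k*sigma <= Y <= mu + k*sigma)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mu +/- {k} sigma contains {within_k_sigma(k) * 100:.2f}% of variates")
```

Note that the function takes no mean or standard deviation: the coverage depends only on k, which is why the percentages hold for every normal distribution.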

It is also possible to express this relationship in terms of more commonly used percentages. For example:

50% of all items fall between μ ± .674σ

95% of all items fall between μ ± 1.96σ

99% of all items fall between μ ± 2.58σ
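The multipliers .674, 1.96, and 2.58 can be recovered by inverting the coverage relationship: find the z for which the central area under the standard normal curve equals the desired percentage. A simple bisection search (a sketch for illustration; statistical tables or software would normally supply these values) does the job:

```python
import math

def central_z(coverage, tol=1e-10):
    """Find z such that mu +/- z*sigma contains `coverage` of all variates,
    i.e. P(-z <= Z <= z) = erf(z / sqrt(2)) = coverage, by bisection."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if math.erf(mid / math.sqrt(2)) < coverage:
            lo = mid  # central area still too small: move the bound outward
        else:
            hi = mid  # central area too large: move the bound inward
    return lo

for coverage in (0.50, 0.95, 0.99):
    print(f"{coverage:.0%} of variates fall within mu +/- {central_z(coverage):.3f} sigma")
```

Run as written, this reproduces the three multipliers quoted above to three decimal places.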

If μ ± 1σ contains 68.26% of all variates, μ ± 2σ contains 95.44% of all variates, and μ ± 3σ contains 99.74% of all variates (Figure 6.4), we know that any values beyond μ ± 2σ are rare events, expected less than 5 times in 100, and values beyond μ ± 3σ are rarer still, expected less than 1 time in 100.

This characteristic of the normal distribution allows us to consider the probability of individual variates occurring within a geometric space under the distribution. As the probability space (i.e., the sum of the area of probability we are considering) under the normal distribution = 1.0, we know that the percentages mentioned above may be converted to probabilities. When we consider the relationship between a distribution and an individual variate of that distribution, we know that the probability is .6826 that the variate is within μ ± 1σ; .9544 that the variate is within μ ± 2σ; and .9974 that the variate is within μ ± 3σ (Figure 6.5).

Figure 6.5. Standard deviations as areas under the normal curve expressed as probabilities.

The probabilities illustrated in Figure 6.5 are unchanging for all normal distributions regardless of their means or standard deviations. Furthermore, probabilities may be calculated for any area under the curve. For example, we might be interested in the area between two points on the axis, or between one point and the mean, or between one point and infinity. These areas under the curve do vary depending on the location and the shape of the distribution as described by the mean and the standard deviation. In other words, there are as many relationships between any individual variate and the probabilities associated with normal distributions as there are different possible means and standard deviations. All are infinite in number.

To make the most effective use of the normal distribution in generating probabilities, statisticians have created the standard normal distribution. The standard normal distribution has, by definition, μ = 0 and σ = 1. Rather than calculate probabilities of areas under the curve for every possible mean and standard deviation, it is easiest to convert any distribution to the standard normal. This transformation occurs through the calculation of z, where:

Formula 6.1: z = (Yi − μ) / σ

The calculation of z establishes the difference between any variate and the mean (Yi − μ), and expresses that difference in standard deviation units (by dividing by σ). In other words, the result of the formula, called a z-score, is how many standard deviations Yi is from μ in the standard normal distribution. Appendix A is a table of areas under the curve of the standard normal distribution. Once we have a z-score, it is possible to use Appendix A to determine the exact probabilities under the curve. To illustrate this point, let us consider the following example.
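Formula 6.1 and the table lookup can both be sketched in a few lines. The example values below are hypothetical (an IQ-style score of 115 on a scale with μ = 100 and σ = 15), and the `area_below` function computes the same left-tail area a table like Appendix A would report:

```python
import math

def z_score(y, mu, sigma):
    """Formula 6.1: how many standard deviations y lies from the mean."""
    return (y - mu) / sigma

def area_below(z):
    """Area under the standard normal curve to the left of z — the
    quantity a table of areas under the curve lets you look up."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical example: a score of 115 where mu = 100 and sigma = 15.
z = z_score(115, 100, 15)       # z = 1.0: one standard deviation above the mean
print(round(area_below(z), 4))  # about .8413 of variates fall below this score
```

Because z strips away the original mean and standard deviation, the same `area_below` function serves every normal distribution once its variates are converted to z-scores.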
