5 Measures of Variation

BIOL 4243

IUG

2. Measures of Variation (dispersion)

Just as measures of central tendency locate the "center" of data, measures of variation measure its "spread". When the variation is small, this means that the values are close together (but not the same). The figure below shows the frequency curves for two populations that have equal means but different amounts of variation.

[Figure: frequency curves for Population 1 and Population 2. Two frequency distributions with equal means but different amounts of variation.]

The most commonly used measures of variation are the range, the variance, the standard deviation, the coefficient of variation, and the interquartile range.

Range

The range is defined as the difference in value between the highest (maximum) and lowest (minimum) observation:

$$\text{Range} = x_{\max} - x_{\min}$$

The range can be computed quickly, but it is not very useful since it considers only the extremes and does not take into consideration the bulk of the observations.

The variance and standard deviation from ungrouped data

One way of measuring the spread of the data is to determine the extent to which each observation deviates from the arithmetic mean. Clearly, the larger the deviations, the greater the variability of the observations. However, we cannot use the mean of these deviations as a measure of spread because the positive differences exactly cancel out the negative differences. We overcome this problem by squaring each deviation, and finding the mean of these squared deviations; we call this the variance.

The variance is small when all values are close to the mean and large when the values are spread out from the mean. The sample variance, $s^2$, is computed by either formula

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \quad \text{or} \quad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$

where:

$s^2$ = sample variance

$\bar{x}$ = sample mean


n = total number of observations in the sample

We can see that this is not quite the same as the arithmetic mean of the squared deviations because we have divided by n - 1 instead of n. The reason for this is that we almost always rely on sample data in our investigations. It can be shown theoretically that we obtain a better sample estimate of the population variance if we divide by n - 1. The units of the variance are the square of the units of the original observations, e.g. if the variable is weight measured in kg, the units of the variance are kg².
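The two formulas given earlier for the sample variance yield the same value. A short derivation, using only the symbols already defined and the fact that $\sum_{i=1}^{n} x_i = n\bar{x}$, shows why the numerators agree:

$$
\begin{aligned}
\sum_{i=1}^{n}(x_i - \bar{x})^2
&= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
\end{aligned}
$$

Dividing both sides by n - 1 gives the second (shortcut) form of $s^2$.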

The standard deviation is represented by the symbol s. It is the square root of the variance, i.e. $s = \sqrt{s^2}$. It expresses exactly the same information as the variance, but re-scaled to be in the same units as the original measurement (raw data) and the mean.

The population variance of the observations x is defined by the formula

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

where:

$\sigma^2$ (sigma squared) = population variance

$x_i$ = the item or observation

$\mu$ = population mean

N = total number of observations in the population.

The standard deviation of a population is equal to the square root of the variance:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$$

Since most populations are large, the computation of $\sigma^2$ is rarely performed. In practice, the population variance (or standard deviation) is usually estimated by taking a sample from the population and using $s^2$ as an estimate of $\sigma^2$.

Example A pediatric registrar in a district general hospital is investigating the amount of lead in the urine of children from a nearby housing estate. In a particular street there are 15 children whose ages range from 1 year to under 16, and in a preliminary study the registrar has found the amounts of urinary lead (μmol/24hr) given in Table 1.

Table 1 Urinary concentration of lead in 15 children from housing estate (μmol/24hr)

0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5, 3.2, 1.7, 1.9, 1.9, 2.2

What are the variance and standard deviation?


Solution The calculation of the variance is illustrated in the next table with the 15 readings in the preliminary study of urinary lead concentrations. The readings are set out in column (1). In column (2) the difference between each reading and the mean is recorded. In column (3) the differences are squared, and the sum of those squares is given at the bottom of the column. Column (4) lists the squares of the original readings, whose sum is the quantity needed for the second (shortcut) formula.

Calculation of standard deviation

(1) Lead         (2) Differences    (3) Differences    (4) Observations in
concentration    from mean          squared            col. (1) squared
x_i              x_i - x̄            (x_i - x̄)^2        x_i^2

0.1              -1.4               1.96               0.01
0.4              -1.1               1.21               0.16
0.6              -0.9               0.81               0.36
0.8              -0.7               0.49               0.64
1.1              -0.4               0.16               1.21
1.2              -0.3               0.09               1.44
1.3              -0.2               0.04               1.69
1.5               0                 0                  2.25
1.7               0.2               0.04               2.89
1.9               0.4               0.16               3.61
1.9               0.4               0.16               3.61
2.0               0.5               0.25               4.00
2.2               0.7               0.49               4.84
2.6               1.1               1.21               6.76
3.2               1.7               2.89               10.24

Total            22.5               0                  9.96               43.71
(column sums: Σx_i = 22.5, Σ(x_i - x̄) = 0, Σ(x_i - x̄)^2 = 9.96, Σx_i^2 = 43.71)

n = 15, x̄ = 1.5

The sum of the squares of the differences (or deviations) from the mean, 9.96, is now divided by the total number of observations minus one, to give the variance. Thus,

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$


In this case we find:

$$s^2 = \frac{9.96}{14} = 0.7114\ (\mu\text{mol/24hr})^2$$

Finally, the square root of the variance provides the standard deviation:

$$s = \sqrt{s^2} = \sqrt{0.7114} = 0.843\ \mu\text{mol/24hr}$$
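The hand calculation for the urinary lead data can be checked with a few lines of Python. This is only an illustrative sketch: the function names `sample_variance_definitional` and `sample_variance_shortcut` are made up for this example, and the data are the 15 readings of Table 1.

```python
import math

# Urinary lead concentrations (umol/24hr) for the 15 children in Table 1
lead = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
        3.2, 1.7, 1.9, 1.9, 2.2]


def sample_variance_definitional(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


s2 = sample_variance_definitional(lead)
s = math.sqrt(s2)
print(f"variance = {s2:.4f} (umol/24hr)^2")  # about 0.7114
print(f"sd       = {s:.3f} umol/24hr")       # about 0.843
```

Both functions should agree up to floating-point rounding, mirroring columns (3) and (4) of the calculation table.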

Means and variance from grouped data

More often than not, data are presented in grouped form. That is, the data are in part summarized and grouped in a frequency table. This actually simplifies the handling of data: it is easier to work with a single value of 130.5 that has a frequency of 28 serum cholesterol readings than to list the value 130.5 separately 28 times. Means and standard deviations may be computed from grouped data, but the equations are a bit different.

Formulas for calculating the mean and the variance for grouped data:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}, \qquad s^2 = \frac{\sum_{i=1}^{k} f_i x_i^2 - \dfrac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}}{n-1}$$

where

$\bar{x}$ = mean of the data set,

$s^2$ = variance of the data set,

$x_i$ = midpoint of the ith class,

$f_i$ = frequency of the ith class,

k = number of classes,

n = total number of observations.

Example Given below is the frequency distribution of the heights (in centimeters) of a sample of 100 students in the Islamic University. Find the approximate values of the mean and the standard deviation of the students' heights.

Table 4.3 Frequency of heights of a sample of 100 students in the Islamic University

Class interval    x_i    x_i^2     f_i        f_i x_i    f_i x_i^2

150-154           152    23,104    9          1,368      207,936
155-159           157    24,649    22         3,454      542,278
160-164           162    26,244    31         5,022      813,564
165-169           167    27,889    24         4,008      669,336
170-174           172    29,584    13         2,236      384,592
175-179           177    31,329    1          177        31,329

Total                              n = 100    16,265     2,649,035

For the grouped data, we obtain

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{16{,}265}{100} = 162.65\ \text{cm}$$


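$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i x_i^2 - \dfrac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}}{n-1}} = \sqrt{\frac{2{,}649{,}035 - 2{,}645{,}502.25}{99}} = \sqrt{35.68} = 5.97\ \text{cm}$$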

Note that there is some difference between the results of computations on ungrouped and grouped data. The size of the discrepancy depends on the width of the class interval and on the number of observations within an interval. With short class intervals and large samples, the discrepancy is negligible.
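The grouped-data formulas can be sketched in Python as follows, using the class midpoints and frequencies from the height table. The function name `grouped_mean_and_sd` and the `(midpoint, frequency)` tuple layout are assumptions made for this illustration.

```python
import math

# (class midpoint x_i in cm, frequency f_i) from the table of student heights
classes = [(152, 9), (157, 22), (162, 31), (167, 24), (172, 13), (177, 1)]


def grouped_mean_and_sd(classes):
    """Return (mean, sample sd) computed from class midpoints and frequencies."""
    n = sum(f for _, f in classes)                 # total observations
    sum_fx = sum(f * x for x, f in classes)        # sum of f_i * x_i
    sum_fx2 = sum(f * x * x for x, f in classes)   # sum of f_i * x_i^2
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return mean, math.sqrt(variance)


mean, sd = grouped_mean_and_sd(classes)
print(f"mean = {mean:.2f} cm")  # about 162.65
print(f"sd   = {sd:.2f} cm")    # about 5.97
```

Because every observation in a class is treated as if it sat at the class midpoint, the result is an approximation of what the raw heights would give.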

Coefficient of Variation

A disadvantage of the standard deviation as a comparative measure of variation is that it cannot be used to compare variability in two different kinds of variables. For this reason, statisticians have defined the coefficient of variation, which helps in comparing the relative variability among different variables. Also, while the standard deviation depends on the units of measurement, the coefficient of variation (CV) is unitless (dimensionless), since both the standard deviation and the mean are expressed in the same units. Therefore, it is useful in comparing the scatter of variables measured in different units. It is also possible to use the coefficient of variation to compare the relative variation of unrelated quantities, or of quantities that are vastly different in scale or magnitude, e.g. elephant weight versus mouse weight.

$$CV = \text{Coefficient of variation} = \frac{\text{Standard deviation}}{\bar{x}} \times 100\%$$

For example, a standard deviation (sd) of 5 mmHg for systolic blood pressure (BP) readings is small, but a standard deviation of even 3 g/dl for hemoglobin (Hb) level is large. This is because a standard deviation has to be assessed in relation to the corresponding mean. If the mean systolic BP of the subjects under study is 100 mmHg, then sd = 5 mmHg is only 5% of the mean. If the mean Hb level is 10 g/dl, then sd = 3 g/dl is 30% of the mean, which is relatively much larger.

The next table contains the CV of various hematological parameters of children with megaloblastic anemia in the age group of 3½ months to 12 years. Note that the variability in mean corpuscular volume (stated in femtoliters, 1 fl = 10⁻¹⁵ liters) is much lower than in TLC (total leucocyte count); that is, mean corpuscular volume is relatively more consistent between patients. Such a comparison cannot be made on the basis of the sd alone.

Hematological Data on 29 Children of Megaloblastic Anemia Between 3½ Months and 12 Years of Age

Variable                           Mean (sd)        Range         CV (%)

Hb (g/dl)                          5.32 (1.88)      1.7 - 9.6     35.3
TLC (cu mm)                        8.39 (5.65)      2.7 - 27.6    67.3
Platelet count (10^9/l)            110.83 (56.8)    31 - 238      51.2
Mean corpuscular volume (fl)       110.3 (10.6)     82 - 126      9.6

Source: Gomber et al. Note: We have computed the CV column. This was not given by the authors.
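The CV column of the hematological table can be reproduced from the reported means and standard deviations. The sketch below is illustrative only; the function name `coefficient_of_variation` and the dictionary layout are assumptions, and the numbers are copied from the table.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    return sd / mean * 100


# variable name -> (mean, sd), as reported in the hematological table
hematology = {
    "Hb (g/dl)": (5.32, 1.88),
    "TLC (cu mm)": (8.39, 5.65),
    "Platelet count (10^9/l)": (110.83, 56.8),
    "Mean corpuscular volume (fl)": (110.3, 10.6),
}

for name, (mean, sd) in hematology.items():
    print(f"{name:30s} CV = {coefficient_of_variation(sd, mean):.1f}%")
```

The same function applied to the blood pressure and hemoglobin illustration (sd 5 on mean 100, sd 3 on mean 10) gives roughly 5% and 30%.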


Measures of Positions

In cases where our data distributions are heavily skewed, we often get a better summary of the distribution by using the relative positions of the data rather than their exact values. Measures of position are used to describe the location of a particular observation in relation to the rest of the data set. The median is an example of a measure computed from the relative position of the data. If we are told that 81 is the median score on a biostatistics test, we know that, after the data have been ordered, 50% of the data fall below the median value of 81.

Percentiles, Deciles, and Quartiles

Percentiles

Percentiles are values of x that divide the ordered set into 100 equally sized groups. The pth percentile of a data set is a value such that p percent of the observations in the ordered set lie below it and (100 - p) percent of the observations lie above it. The first percentile, for example, is the value of x that has 1% of the observations in the ordered set lying below it (and 99% of the observations lying above it). The value of x that has 4% of the observations lying below it is called the 4th percentile, and so on.

The first, second, ..., 99th percentiles are expressed as P1, P2, ..., P99 respectively.

[Figure: the ordered data from lowest to highest, divided by the 1st, 2nd, ..., 99th percentiles into 100 groups, each containing 1% of the observations.]

Deciles

The values of x that divide the ordered set into 10 equally sized groups are called deciles. The first decile is the value of x that has 10% of the observations in the ordered set lying below it (and 90% of the observations lying above it). The value of x that has 40% of the observations lying below it is called the 4th decile, and so on. The first, second, ..., 9th deciles are expressed as D1, D2, ..., D9 respectively.

[Figure: the ordered data from lowest to highest, divided by the 1st through 9th deciles into 10 groups, each containing 10% of the observations.]

Quartiles

The values of x that divide the ordered set into four equally sized groups are called quartiles. The first quartile is the value of x that has 25% of the observations in the ordered set lying below it (and 75% of the observations lying above it). The value of x that has 50% of the observations lying below it is called the 2nd quartile, and the value of x that has 75% of the observations in the ordered set lying below it is called the 3rd quartile.

The first, second, and third quartiles are expressed as Q1, Q2, and Q3.


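[Figure: the ordered data from lowest to highest, divided by Q1, Q2 (the median, 50th percentile), and Q3 into four groups, each containing 25% of the observations.]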

Important Note: All the quartiles and deciles are percentiles. For example, the 7th decile is the 70th percentile and the 1st quartile is the 25th percentile. The 50th percentile is the 5th decile, the second quartile, and the median of the ordered set. Consequently, deciles and quartiles are often stated as percentiles.

Procedure to compute the percentile of an observation x

The techniques for finding the various measures of position will be illustrated by using the data in the next Table which contains the aortic diameters measured in centimeters for 45 patients. Notice that the data in the Table are already ordered. Raw data need to be ordered prior to finding measures of position.

Aortic diameters (cm) for 45 patients, in ascending order reading down each column

3.0    5.0    6.2    7.6    9.4
3.3    5.2    6.3    7.6    9.5
3.5    5.5    6.4    7.7    9.5
3.5    5.5    6.6    7.8    10.0
3.6    5.5    6.6    7.8    10.5
4.0    5.8    6.8    8.5    10.8
4.0    5.8    6.8    8.5    10.9
4.2    5.9    6.8    8.8    11.0
4.6    6.0    7.0    8.8    11.0

The percentile for observation x is found by dividing the number of observations less than x by the total number of observations and then multiplying this quantity by 100. This percent is then rounded to the nearest whole number.

Example Find the percentile for the observation 5.5. The number of observations in the Table less than 5.5 is 11.

$$\frac{11}{45} \times 100 = 24.4\%$$

This percent rounds to 24. The diameter 5.5 is the 24th percentile and we express this as P24 = 5.5.

Example Find the percentile for the observation 5.0.

The number of observations less than 5.0 is 9.

$$\frac{9}{45} \times 100 = 20\%$$

i.e. P20 = 5.0.

Example Find the percentile for the observation 10.0.

The number of observations less than 10.0 is 39.

$$\frac{39}{45} \times 100 = 86.7\% \approx 87\%$$

i.e. P87 = 10.0.
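The rule used in these three examples (count the observations below x, divide by n, multiply by 100, and round) can be sketched in Python. The function name `percentile_of` is an assumption made for this illustration; the data are the 45 aortic diameters.

```python
# Aortic diameters (cm) for 45 patients, already in ascending order
aortic = [
    3.0, 3.3, 3.5, 3.5, 3.6, 4.0, 4.0, 4.2, 4.6,
    5.0, 5.2, 5.5, 5.5, 5.5, 5.8, 5.8, 5.9, 6.0,
    6.2, 6.3, 6.4, 6.6, 6.6, 6.8, 6.8, 6.8, 7.0,
    7.6, 7.6, 7.7, 7.8, 7.8, 8.5, 8.5, 8.8, 8.8,
    9.4, 9.5, 9.5, 10.0, 10.5, 10.8, 10.9, 11.0, 11.0,
]


def percentile_of(data, x):
    """Percentile of observation x: percent of values strictly below x, rounded."""
    below = sum(1 for value in data if value < x)
    # Note: Python's round() sends exact halves to the nearest even integer,
    # which could differ from the hand rule in a tie; no tie occurs here.
    return round(100 * below / len(data))


for x in (5.5, 5.0, 10.0):
    print(f"{x} is at percentile {percentile_of(aortic, x)}")
```

For the three worked examples this gives the 24th, 20th, and 87th percentiles.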


Procedure to compute the pth percentile

The pth percentile for a ranked data set consisting of n observations is found by a two-step procedure.

1. The first step is to compute the index $i = \dfrac{(p)(n)}{100}$.

2. If i is not an integer, the next integer greater than i locates the position of the pth percentile in the ranked data set. If i is an integer, the pth percentile is the average of the observations in positions i and i + 1 in the ranked data set.

Example Find the 10th percentile for the data in the Table.

$$i = \frac{(10)(45)}{100} = 4.5$$

The next integer greater than 4.5 is 5. The observation in the 5th position in the Table is 3.6. Therefore, the 10th percentile is P10 = 3.6.

Example Find the 40th percentile for the data in the Table.

$$i = \frac{(40)(45)}{100} = 18$$

The 40th percentile is the average of the observations in the 18th and 19th positions in the ranked data set. The observation in the 18th position is 6.0 and the observation in the 19th position is 6.2. Therefore

$$P_{40} = \frac{6.0 + 6.2}{2} = 6.1$$

Deciles and quartiles are determined in the same manner as percentiles, since they may be expressed as percentiles.
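The two-step procedure can be sketched in Python as below. This follows the rule stated in these notes, which is only one of several percentile conventions in use (library routines may interpolate differently). The function name `pth_percentile` is an assumption for this example, and p is restricted to whole numbers from 1 to 99 so that the index can be checked with integer arithmetic.

```python
# Aortic diameters (cm) for 45 patients
aortic = [
    3.0, 3.3, 3.5, 3.5, 3.6, 4.0, 4.0, 4.2, 4.6,
    5.0, 5.2, 5.5, 5.5, 5.5, 5.8, 5.8, 5.9, 6.0,
    6.2, 6.3, 6.4, 6.6, 6.6, 6.8, 6.8, 6.8, 7.0,
    7.6, 7.6, 7.7, 7.8, 7.8, 8.5, 8.5, 8.8, 8.8,
    9.4, 9.5, 9.5, 10.0, 10.5, 10.8, 10.9, 11.0, 11.0,
]


def pth_percentile(data, p):
    """pth percentile by the two-step index rule, i = p * n / 100."""
    if not isinstance(p, int) or not 1 <= p <= 99:
        raise ValueError("p must be a whole number from 1 to 99")
    ranked = sorted(data)
    n = len(ranked)
    whole, remainder = divmod(p * n, 100)  # i = whole + remainder / 100
    if remainder != 0:
        # i is not an integer: take position whole + 1 (list index whole)
        return ranked[whole]
    # i is an integer: average positions i and i + 1 (list indices i-1 and i)
    return (ranked[whole - 1] + ranked[whole]) / 2


print("P10 =", pth_percentile(aortic, 10))
print("P40 =", round(pth_percentile(aortic, 40), 2))
# Quartiles expressed as percentiles: Q1 = P25, Q2 = P50, Q3 = P75
print("Q1, Q2, Q3 =", [pth_percentile(aortic, p) for p in (25, 50, 75)])
```

Since deciles and quartiles are percentiles, the same function covers them, for example Q1 as the 25th percentile and D7 as the 70th.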

Percentiles are most commonly used for child growth monitoring, i.e. for monitoring the physical progress (weight and height) of infants and children. Here, the same percentiles, say the 90th, of weight or height for groups of different ages are joined by a curve. If the 95th percentile of weight of 2-year-old boys is 14.8 kg, then 95% of such children weigh 14.8 kg or less, and the other 5% weigh more. The growth curve is not linear but rather sigmoidal. The period of rapid growth occurs during the first 12 months of age. Growth normally starts to slow down at about 12 to 15 months of age, which is reflected in the growth chart.

The growth chart is an essential tool to diagnose failure to thrive (FTT) or growth failure. Although there are no universal criteria for FTT, most clinicians consider the diagnosis if the child's weight is below the 5th percentile or drops more than two major percentile lines. When a measurement falls outside the 5th and 95th percentile curves, it is useful to mention the age at which the growth parameter is at its median value (50th percentile). For example, if a 12-month-old baby weighs 8 kg, this weight is below the 5th percentile for a one-year-old, and it is at the 50th percentile for a 6-month-old. One could state that the weight age is 6 months, which is a better quantitative description of the growth abnormality.
