Chapter 1: Introduction




1.1 What is Statistics?

Statistics involves collecting, analysing, presenting and interpreting data.

We frequently see statistical tools (such as bar charts, tables, plots of data, averages and percentages) on TV, in newspapers and in magazines. Such methods used to organise and summarise data, so as to increase the understanding of the data, are called descriptive statistics.

Statistics is also used in practice in many different walks of life, going beyond simple data summarisation to answer a wide variety of questions such as:

• Medicine: Does a certain new drug prolong life for AIDS sufferers?

• Science: Is global warming really happening?

• Education: Are GCSE and A level examination standards declining?

• Psychology: Is the national lottery making us a nation of compulsive gamblers?

• Sociology: Is the gap between rich and poor widening in Britain?

• Business: Do Persil adverts really make us want to buy Persil?

• Finance: What will interest rates be in 6 months' time?

1.2 Populations and Samples

Suppose that we wanted to investigate whether smoking during pregnancy leads to lower birth weight of babies. We use this example to illustrate the following definitions.

Definitions:

• Experimental unit: the object on which measurements are made.

For the above example, we are measuring birth weights of newborn babies, so a unit is a newborn baby.

• Variable: a measurable characteristic of a unit.

For the above example, the variable is birth weight.

• Population: the set of all units about which information is required.

For the above example, the population is all newborn babies.

• Sample: a subset of units of the population for which we can observe the variable of interest.

For the above example, a sample would be the observed birth weights for a set of newborn babies (which will be a subset of all newborn babies).

• Random sample: a sample such that each unit in the population has the same chance of being chosen independently of whether or not any other unit is chosen.

To determine whether smoking during pregnancy leads to lower birth weight of babies, we would compare a random sample of weights of new-born babies whose mothers smoke, with a random sample of weights of new-born babies of non-smoking mothers. By analysing the sample data, we would hope to be able to draw conclusions about the effects on birth weight of smoking during pregnancy for all babies (i.e. the population). The process of using a random sample to draw conclusions about a population is called statistical inference.

If we do not have a random sample, then sampling bias can invalidate our statistical results. For example, birth weights of twins are generally lower than the weights of babies born alone. So if all the non-smoking mothers in the sample were giving birth to twins, whereas all the smoking mothers were giving birth to single babies, then the conclusions we draw about the effects of smoking in pregnancy will not necessarily be correct as they are affected by sampling bias.

Different units of the same population will have different values of the same variable; this is called natural variation. For example, the weights of all newborn babies are obviously not the same. So different samples will contain different data; this is called sampling variability. It is therefore important to bear in mind that slightly different conclusions could be reached from different samples.
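Sampling variability can be illustrated with a short simulation. The sketch below is only an illustration: the "population" of birth weights is synthetic (drawn from a normal distribution with an assumed mean of 3.4 kg and standard deviation of 0.5 kg, values chosen purely for the example), and the sample size of 50 is likewise an assumption.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic "population" of 10,000 birth weights in kg (made-up values, not real data).
population = [random.gauss(3.4, 0.5) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw five random samples of 50 units each. random.sample selects without
# replacement, giving every unit the same chance of being chosen.
sample_means = [statistics.mean(random.sample(population, 50)) for _ in range(5)]

for m in sample_means:
    print(f"sample mean = {m:.3f} kg (population mean = {population_mean:.3f} kg)")
```

Each sample gives a slightly different mean, even though all five samples come from the same population.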

1.3 Types of Data

Different types of data require different types of analysis. The type of data set is determined by several factors:

• Type of variable:

➢ quantitative data - i.e. numerical (e.g., heights of students, number of phone calls in an hour).

➢ qualitative data - i.e. non-numerical (for example, eye colour, M/F).

Quantitative data can be subdivided further:

➢ discrete – a discrete variable can take only particular values (e.g., number of phone calls received at an exchange).

➢ continuous – a continuous variable can take any value in a given range (e.g., heights of students).

• Number of variables measured:

➢ 1 variable → univariate data.

➢ 2 variables → bivariate data. E.g., we may have both the heights and weights of a set of individuals. The data set then consists of pairs of observations on each unit such as (1.7m, 65kg).

➢ 3 or more variables → multivariate data. E.g., we have heights, weights, eye colour, gender for a group of individuals. In this case the data set consists of sets of 4 observations made on each unit such as (1.7m, 65kg, blue, M).

• Number of samples: For example, when investigating the effects of smoking during pregnancy, we would observe two samples:

➢ a sample of birth weights of babies born to smoking mothers

➢ a sample of birth weights of babies born to non-smoking mothers.

• Relationship of samples (if more than 1 sample):

➢ Are the samples independent? E.g., the two birth weight samples should be independent.

➢ Are the samples dependent?

❖ Example:

Suppose that a doctor would like to assess the effectiveness of changing to a low-fat diet in lowering cholesterol for a group of patients. To do this the doctor might measure the cholesterol of the patients before starting on the low-fat diet and then measure the cholesterol for the same patients after they have been on the low-fat diet. We therefore have 2 samples of measured cholesterol:

• a sample before the diet

• a sample after the diet.

However, the 2 samples are not independent, since the cholesterol measurements for each sample were taken on the same patients. Samples of this type are called matched pair data.

1.4 Recommended Books

You will need to use statistical tables for the course. The tables used in the exams are:

• Lindley, D.V. and Scott, W.F., New Cambridge Elementary Statistical Tables, C.U.P., 1984.

Statistical tables will be used throughout this course.

There are many books which cover the material in this course. Some good books are:

• Ross, S.M., Introduction to Probability and Statistics for Engineers and Scientists [with CD-ROM].

• Walpole, R.E., Myers, R.H., Myers, S.L. and Ye, K., Probability and Statistics for Engineers and Scientists, Prentice Hall, 7th edition, 2002.

• Clarke, G.M., and Cooke, D. A Basic Course in Statistics, Edward Arnold, 4th edition, 1999.

• Daly, F., Hand, D.J., Jones, M.C., Lunn, A.D. and McConway, K.J. Elements of Statistics, Open University, 1995.

Goes beyond what's required for this course, but is quite clearly written with some real examples.

• Devore, J and Peck, R. Introductory Statistics, West, 1990.

Rather simplistic at times, but has lots of real examples. Especially good if you have not done any statistics before.

• Spiegel, M.R., Probability and Statistics, Schaum Outline Series, 1988.

In addition, you could browse in the library around QA276 and find a book which suits you. For starters you could try looking at some of the following.

• Anderson, D.R., Sweeney, D.J. and Williams, T.A. Introduction to Statistics: Concepts and Applications, West, 2nd edition, 1991.

• Bassett, E.E., Bremner, J.M., Jolliffe, I.T., Jones, B., Morgan, B.J.T. and North, P.M., Statistics: Problems and Solutions, Edward Arnold, 1986.

• Moore, D.S., The Basic Practice of Statistics, Freeman, 1995.

• Moore, D.S., Think and Explain with Statistics, Addison-Wesley, 1986.

• Moore, D.S., Statistics: Concepts and Controversies, Freeman, 1991, 1985, 1979.

There are many online books which could be useful. See for example



Chapter 2: Graphical and Numerical Statistics

2.1 Histograms

Histograms give a visual representation of continuous data. We consider two separate cases corresponding to when (i) all the bars in the histogram have the same width; (ii) the intervals are of variable widths.

2.1.1 Histograms with equal class widths

❖ Example:

Mercury contamination can be particularly high in certain types of fish. The mercury contents (ppm) in the hair of 40 fishermen in a region thought to be particularly vulnerable are given below. (From the paper “Mercury content of commercially imported fish of the Seychelles, and hair mercury levels of a selected part of the population”, Environ. Research, (1983), 305-312.)

|13.26 |32.43 |18.10 |58.23 |64.00 |68.20 |35.35 |33.92 |23.94 |18.28 |

|22.05 |39.14 |31.43 |18.51 |21.03 | 5.50 | 6.96 | 5.19 |28.66 |26.29 |

|13.89 |25.87 | 9.84 |26.88 |16.81 |38.65 |19.23 |21.82 |31.58 |30.13 |

|42.42 |16.51 |21.16 |32.97 | 9.84 |10.64 |29.56 |40.69 |12.86 |13.80 |

❖ The first step is to group the data. A reasonable choice of class intervals is:

0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70.

The frequency table that results from the use of these intervals is:

|Interval |Frequency |

|0-10 |5 |

|10-20 |11 |

|20-30 |10 |

|30-40 |9 |

|40-50 |2 |

|50-60 |1 |

|60-70 |2 |

To construct the histogram in this situation (i.e. all class widths equal):

• Mark boundaries of the class intervals on the horizontal axis.

• The height of the bars above each interval can be taken as the frequency for that interval.

Instead of using frequencies to give the heights of the rectangles in a histogram, relative frequencies may be used. The relative frequency for an interval is that interval's frequency divided by the total frequency.

❖ So for the mercury example…

|Interval |Frequency |Relative frequency |

|0-10 |5 |.125 |

|10-20 |11 |.275 |

|20-30 |10 |.250 |

|30-40 |9 |.225 |

|40-50 |2 |.050 |

|50-60 |1 |.025 |

|60-70 |2 |.050 |

|Total |40 |1 |

The relative frequencies can be expressed as percentages (which is how Minitab produces a relative frequency histogram):

Notice that the shape of the histograms, whether using frequencies or relative frequencies, is the same.
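The frequency and relative frequency tables for the mercury data can be checked with a few lines of Python. This is a minimal sketch using only the 40 values and the class intervals given above; the variable names are illustrative. An observation on a class boundary is counted in the higher class.

```python
mercury = [
    13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92, 23.94, 18.28,
    22.05, 39.14, 31.43, 18.51, 21.03, 5.50, 6.96, 5.19, 28.66, 26.29,
    13.89, 25.87, 9.84, 26.88, 16.81, 38.65, 19.23, 21.82, 31.58, 30.13,
    42.42, 16.51, 21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80,
]
edges = [0, 10, 20, 30, 40, 50, 60, 70]
intervals = list(zip(edges, edges[1:]))

# Count the observations in each class [lower, upper).
frequencies = [sum(lower <= x < upper for x in mercury) for lower, upper in intervals]

# Relative frequency = class frequency / total frequency.
n = len(mercury)
relative = [f / n for f in frequencies]

for (lower, upper), f, r in zip(intervals, frequencies, relative):
    print(f"{lower}-{upper}: frequency {f}, relative frequency {r:.3f} ({100 * r:.1f}%)")
```

Because every relative frequency is the frequency divided by the same total (40), the two histograms differ only in the scale of the vertical axis.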

2.1.2 Histograms with unequal class widths

There is no hard and fast rule as to how many intervals should be used. Too many classes produce an uneven distribution, but having too few loses information. Usually the number of classes is about 6-20. The more observations we have, the more classes we will usually use.

The width of the intervals defining the histograms need not all be equal. It is often sensible to choose short intervals where the data is quite dense but intervals with a longer width where the data is more sparse. This will ensure that we don’t have too many intervals with zero frequency, yet keeps as much information about the distributional shape of the data as possible.

When unequal interval widths are used, then the frequency density should be used on the vertical scale on the histogram, where

Frequency density = Frequency ÷ class width.

❖ Example:

The lengths (in metres) of 250 vehicles aboard a cross-channel ferry are summarised in the following table:

|Vehicle length (m) |Class width |Frequency |Frequency density |

|3.0-4.0 |1 |90 |90 |

|4.0-4.5 |0.5 |80 |160 |

|4.5-5.0 |0.5 |40 |80 |

|5.0-5.5 |0.5 |24 |48 |

|5.5-7.5 |2 |16 |8 |

Notice that if we had simply defined the heights of the rectangles to be the frequencies, then the histogram would exaggerate, for example, the incidence of cars between 3 and 4 metres in length.

An alternative way of producing a histogram in situations where not all class widths are equal is to set the bar height to be the relative frequency density. This is given by:

Relative freq. density = Relative freq. ÷ class width.

If the histogram is produced in this way, then the total area of all the bars is 1.

❖ Example (continued)

The relative frequency densities for the vehicle length data are as follows:

|Vehicle length (m) |Class width |Frequency |Relative freq. |Rel. freq. density |

|3.0-4.0 |1 |90 |0.36 |0.36 |

|4.0-4.5 |0.5 |80 |0.32 |0.64 |

|4.5-5.0 |0.5 |40 |0.16 |0.32 |

|5.0-5.5 |0.5 |24 |0.096 |0.192 |

|5.5-7.5 |2 |16 |0.064 |0.032 |

The corresponding histogram can then be produced:
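The densities in the two vehicle-length tables can be reproduced as follows. This is a small sketch using the class boundaries and frequencies from the tables; the variable names are illustrative.

```python
# (lower boundary, upper boundary, frequency) for each class of vehicle length in metres
classes = [(3.0, 4.0, 90), (4.0, 4.5, 80), (4.5, 5.0, 40), (5.0, 5.5, 24), (5.5, 7.5, 16)]
n = sum(freq for _, _, freq in classes)  # total number of vehicles

rows = []
for lower, upper, freq in classes:
    width = upper - lower
    freq_density = freq / width            # frequency / class width
    rel_freq_density = (freq / n) / width  # relative frequency / class width
    rows.append((lower, upper, width, freq_density, rel_freq_density))
    print(f"{lower}-{upper}: density {freq_density:g}, relative density {rel_freq_density:.3f}")

# Area of each bar = width x height; with relative frequency densities the areas sum to 1.
total_area = sum(width * rfd for _, _, width, _, rfd in rows)
print(f"total area = {total_area:.3f}")
```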

2.1.3 Histogram shapes

Histograms are very useful for giving some idea of the shape of a density by approximating the histogram to a smooth curve.

Densities can take many different shapes:

[Figures: sketches of density shapes, in three rows: unimodal, bimodal, multimodal; symmetric, positive skew, negative skew; normal, heavy-tailed, light-tailed.]

2.1.4 Histograms for discrete data

Discrete data is usually illustrated using a bar-line chart (or a bar chart), whilst histograms are generally used for continuous data. However, when the number of possible values for the observations is large, a bar diagram would become uninformative. In this case it is acceptable to group the values into class intervals, much as you would for continuous data.

❖ Example:

Suppose we have the following data:

|1 | 1 |

|0.5 - 3.5 |7 |

|3.5 - 5.5 |8 |

|5.5 - 9.5 |8 |

|9.5 - 12.5 |13 |

|12.5 - 15.5 |14 |

|15.5 - 18.5 |8 |

|18.5 - 21.5 |5 |

|21.5 - 24.5 |5 |

|24.5 - 27.5 |2 |

|27.5 - 30.5 |1 |

The histogram can now be drawn in the normal way.

2.2 Stem-and-leaf plots

Stem-and-leaf plots are an effective way of providing a visual display of quantitative data with very little effort. The idea of the plots is to separate each observation into 2 parts - the first part being the stem and the second the leaf.

To construct a stem-and-leaf plot:

• Select one or more leading digits for the stem values. The following digit or digits become the leaves.

• List possible stem values in a vertical column.

• Record the leaf value for every observation beside the corresponding stem value.

• Indicate the units for stems and leaves.

❖ Example:

To investigate the efficiency of new air-conditioning equipment installed on Boeing 720 aircraft, the times (in hours) to first failure of the equipment were obtained from 28 different aircraft:

|79 | 90 | 10 | 60 | 61 | 49 | 14 | 24 |

|57.8 | 70.6 | 70.5 | 68.9 | 62.6 | 69.7 | 74.6 | |

A.E.'s of voles:

|51.7 | 66.7 | 72.0 | 69.8 | 63.7 | 77.2 | 62.6 | 63.5 | 69.2 | 67.5 |

|70.1 | 67.3 | 75.2 | 73.8 | 59.6 | 69.9 | 77.6 | 74.1 | 73.7 | |

Rounding observations to the nearest integer gives us:

| |A.E.s for field mice | |A.E.s for voles | | |

| | | | | | |

| | | | | | |5L |

| 0 |4 |0 |3 |8 |5 |7 |

| 1 |1 |0 |2 |8 |4 | |

| 2 |3 | | | | | |

A slightly more informative diagram can be obtained by splitting each stem up into two parts (one for the lower leaves and the other for higher leaves):

|-0H |8 |5 | | |

|-0L |1 | | | |

| 0L |4 |0 |3 | |

| 0H |8 |5 |7 | |

| 1L |1 |0 |2 |4 |

| 1H |8 | | | |

| 2L |3 | | | |

Each diagram could then be ordered.
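The construction steps can be sketched in code. The example below builds an ordered stem-and-leaf diagram with split stems (L for leaves 0 to 4, H for leaves 5 to 9) from the voles' A.E. values listed earlier, rounded to the nearest integer. It is a minimal illustration; the function and variable names are not from the course.

```python
import math
from collections import defaultdict

voles = [51.7, 66.7, 72.0, 69.8, 63.7, 77.2, 62.6, 63.5, 69.2, 67.5,
         70.1, 67.3, 75.2, 73.8, 59.6, 69.9, 77.6, 74.1, 73.7]

# Round halves upwards to the nearest integer (Python's round() rounds halves to even).
rounded = [math.floor(x + 0.5) for x in voles]

plot = defaultdict(list)
for value in sorted(rounded):        # sorting first gives an ordered diagram
    stem, leaf = divmod(value, 10)   # stem = tens digit, leaf = units digit
    half = "L" if leaf <= 4 else "H"
    plot[f"{stem}{half}"].append(leaf)

# List every possible stem, including any with no leaves.
for stem in range(5, 8):
    for half in ("L", "H"):
        label = f"{stem}{half}"
        print(f"{label} | {' '.join(str(leaf) for leaf in plot[label])}")
print("Scale: Stem = 10's, Leaves = 1's")
```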

2.2.4 Problems

Stem-and-leaf plots cannot be used for displaying qualitative data and they become impractical for large numbers of observations.

2.3 Cumulative Frequency Plots

A cumulative frequency plot also uses classes and frequencies. The cumulative frequency for a class is the number of observations with values less than the upper boundary for that class.

❖ Example:

Consider the mercury example again. The cumulative frequencies are given in the table below:

|Interval |Frequency |Cumulative frequency |

|0-10 |5 |5 |

|10-20 |11 |16 |

|20-30 |10 |26 |

|30-40 |9 |35 |

|40-50 |2 |37 |

|50-60 |1 |38 |

|60-70 |2 |40 |

In a cumulative frequency polygon the cumulative frequencies are plotted against the upper class boundaries of the classes. Consecutive points are then joined by straight lines.

❖ Example (continued)

For the mercury example we want to plot the points (0, 0), (10, 5), (20, 16),…, (70, 40) and then join these points:

A cumulative frequency plot is useful for giving us some idea of the shape of the distribution function of the variable. They can also be used to obtain estimates of the median and other quantiles for grouped data.
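Reading a value off a cumulative frequency polygon amounts to linear interpolation between the plotted points. The sketch below does this for the mercury data; the helper `read_off` is a hypothetical name introduced for this example.

```python
from itertools import accumulate

upper_boundaries = [10, 20, 30, 40, 50, 60, 70]
frequencies = [5, 11, 10, 9, 2, 1, 2]

# Running totals of the class frequencies.
cumulative = list(accumulate(frequencies))

# Points of the polygon, starting from (0, 0).
points = [(0, 0)] + list(zip(upper_boundaries, cumulative))

def read_off(target, points):
    """Return the horizontal-axis value at which the polygon reaches `target`."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            # Linear interpolation along the segment joining the two points.
            return x0 + (x1 - x0) * (target - c0) / (c1 - c0)
    raise ValueError("target lies outside the range of the cumulative frequencies")

n = cumulative[-1]
median_estimate = read_off(n / 2, points)
print(cumulative)        # [5, 16, 26, 35, 37, 38, 40]
print(median_estimate)   # 24.0
```

Other quantiles are read off in the same way, for example `read_off(n / 4, points)` for the lower quartile.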

2.4 Scatter Plots

Scatter plots are useful for assessing relationships between 2 variables. To draw a scatter plot we represent one of the variables by the horizontal axis and the other variable by the vertical axis. We then simply plot the pairs of data points on the graph.

❖ Example:

Fifteen children were given a visual-discrimination (V) test during the first week at primary school and a reading-achievement (R) test at the end of their first year of schooling. Scores out of 100 were calculated for each test.

|Child no. | 1 |

|0-10 |5 |

|10-20 |11 |

|20-30 |10 |

|30-40 |9 |

|40-50 |2 |

|50-60 |1 |

|60-70 |2 |

Here we have 7 classes, so that K = 7. Then $f_1 = 5$, $f_2 = 11$ and so on, such that $\sum_{k=1}^{K} f_k = 40$.

2.5.2 Measures of location

▪ The Sample Mean

Let $X_1, X_2, \dots, X_n$ denote the random variables for a sample of size n. The sample mean, denoted $\bar{X}$, is defined by:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The observed value of the sample mean for a particular sample is therefore:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

When the data are grouped by means of a frequency table, then the equivalent formula for $\bar{x}$ is given by:

$$\bar{x} = \frac{1}{n}\sum_{k=1}^{K} f_k x_k$$

where K is the number of classes or groups, $f_k$ is the frequency of class k, and $x_k$ is the mid-point of class k.

❖ Example:

Consider the mercury example again.

|Interval |Mid-point, $x_k$ |Frequency, $f_k$ |

|0-10 |5 |5 |

|10-20 |15 |11 |

|20-30 |25 |10 |

|30-40 |35 |9 |

|40-50 |45 |2 |

|50-60 |55 |1 |

|60-70 |65 |2 |

The sample mean is therefore:

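$$\bar{x} = \frac{1}{40}\left(5 \times 5 + 11 \times 15 + 10 \times 25 + 9 \times 35 + 2 \times 45 + 1 \times 55 + 2 \times 65\right) = \frac{1030}{40} = 25.75$$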

Note: The mean is probably the most useful measure of location. Its advantages are that it uses all the values in the data and is easy to manipulate mathematically. A disadvantage is that it is not robust- this means that its value can be sensitive to the presence of outlying values. More robust measures of location (such as the median or trimmed mean) are increasing in popularity amongst statisticians.
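The grouped-data calculation for the mercury example can be checked with a short sketch, using the class mid-points and frequencies from the table above (variable names are illustrative):

```python
midpoints = [5, 15, 25, 35, 45, 55, 65]
frequencies = [5, 11, 10, 9, 2, 1, 2]

n = sum(frequencies)  # total number of observations
# Each mid-point is weighted by the number of observations in its class.
weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))
grouped_mean = weighted_sum / n

print(n, weighted_sum, grouped_mean)  # 40 1030 25.75
```

The result is an approximation to the mean of the raw data, since every observation in a class is treated as if it were equal to the class mid-point.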

▪ The Median

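To find the median of a set of n data values, we must first rearrange them in order of size. The median is then equal to the middle observation if n is odd, and the average of the middle two observations if n is even.

More formally, writing $x_{(i)}$ for the i-th smallest observation,

$$\text{median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even.} \end{cases}$$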

❖ Example 1:

The values below are systolic blood pressures of patients admitted to a hospital:

112.1 138.6 115.9 109.5 108.2 110.9 159.6 115.8 122.3 122.4 123.8 117.5.

To find the median value for the blood pressure, we must first list them in ascending order:

108.2 109.5 110.9 112.1 115.8 115.9 117.5 122.3 122.4 123.8 138.6 159.6.

Here we have an even number of observations. So

Sample median = $\frac{115.9 + 117.5}{2} = 116.7$

For these data the sample mean is:

Sample mean = $\frac{1456.6}{12} = 121.4$ (to 1 d.p.)

which is somewhat larger than the sample median. The mean is influenced by the outlying value (159.6). The median is more robust than the mean and is not really affected by outliers.
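The effect of the outlying value on the two measures can be seen by recomputing them with and without it. This is a minimal sketch using Python's standard `statistics` module; dropping 159.6 is done here purely for comparison and is not a recommendation to discard data.

```python
import statistics

pressures = [112.1, 138.6, 115.9, 109.5, 108.2, 110.9,
             159.6, 115.8, 122.3, 122.4, 123.8, 117.5]

sample_median = statistics.median(pressures)  # average of the 6th and 7th ordered values
sample_mean = statistics.mean(pressures)

# Repeat the calculation with the outlying value removed.
without_outlier = [x for x in pressures if x != 159.6]
median_without = statistics.median(without_outlier)
mean_without = statistics.mean(without_outlier)

print(f"all data:        median = {sample_median:.1f}, mean = {sample_mean:.1f}")
print(f"without 159.6:   median = {median_without:.1f}, mean = {mean_without:.1f}")
```

Removing the outlier moves the mean by several units but changes the median by less than one unit.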

❖ Example 2:

A football team has scored the following number of goals in the last 44 matches:

|Number of goals |0 |1 |2 |3 |4 |

|Frequency |9 |8 |15 |9 |3 |

As n = 44, the median will lie halfway between the 22nd and 23rd observations. Since both $x_{(22)}$ and $x_{(23)}$ are 2, the median value is 2.

For grouped data, the most convenient way to estimate the median is by graphical methods. This is most easily demonstrated via an example.

❖ Example

Consider the mercury example once again. The cumulative frequency plot is given below. We have a total of 40 observations, so when the cumulative frequency is 20 we might expect the corresponding value of mercury read off from the graph to be an estimate of the median. Since the polygon passes through (20, 16) and (30, 26), it reaches a cumulative frequency of 20 at a mercury value of about 24, so we estimate the median as approximately 24.

Note:

The median is also often a better measure of location than the mean when data are highly skewed. The following show the relative positions of the mean and median for 3 densities:

[pic]

❖ Example:

Distributions of incomes are commonly positively skewed as there are typically a few very large salaries, which give the density a long right-hand tail. Therefore the median is often used to give a typical salary value, rather than the mean.

Disadvantages for the median:

There are two main disadvantages of using the median. It ignores the actual values of the data and uses only their ranks (it effectively uses only the “middle” part of the data set). It is also not as easy to use mathematically in the theory of statistics as the arithmetic mean.

▪ The Trimmed Mean

The trimmed mean can be viewed as some sort of compromise between the mean and the median. To calculate a trimmed mean:

• order the data values

• delete a selected number of values from each end of the ordered list

• average the remaining values.

The trimmed mean avoids the disadvantages of the mean by excluding extreme observations and avoids that of the median by taking some account of the observations other than the middle one. To calculate the 5% trimmed mean for example, discard the top 5% and the bottom 5% of observations, and average those remaining.

❖ Example:

The body temperatures (deg. F) of 10 patients hospitalised with meningitis are as follows:

|104.0 | 104.8 | 101.6 | 108.0 | 103.8 |

|100.8 | 104.2 | 100.2 | 102.4 | 101.4 |

The sample mean for these data is: $\bar{x} = \frac{1031.2}{10} = 103.12$

To find the 10% trimmed mean, as we have 10 observations, we drop the smallest and largest data values.

10% trimmed mean = $\frac{1031.2 - 100.2 - 108.0}{8} = \frac{823.0}{8} = 102.875$

In this case the 10% trimmed mean is probably a better representation of the centre of the distribution as it ignores the (possible) outlier, 108.
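The three steps for a trimmed mean translate directly into code. The function below is a sketch under one stated assumption: when the trimming proportion does not correspond to a whole number of observations, the number dropped from each end is rounded down. The function name is illustrative.

```python
def trimmed_mean(values, proportion):
    """Mean after discarding the given proportion of values from each end."""
    ordered = sorted(values)                 # step 1: order the data
    k = int(len(ordered) * proportion)       # number to drop from each end (rounded down)
    kept = ordered[k:len(ordered) - k]       # step 2: delete k values from each end
    return sum(kept) / len(kept)             # step 3: average the remaining values

temperatures = [104.0, 104.8, 101.6, 108.0, 103.8,
                100.8, 104.2, 100.2, 102.4, 101.4]

print(f"mean             = {sum(temperatures) / len(temperatures):.3f}")
print(f"10% trimmed mean = {trimmed_mean(temperatures, 0.10):.3f}")
```

With a proportion of 0 the function returns the ordinary sample mean.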

▪ The Mode

The mode is a very simple measure of location. For discrete data, it is the value of x with the largest frequency. We cannot calculate a mode for ungrouped continuous data. For data grouped into classes we obtain a modal class.

❖ Example:

Consider again the family size data presented in the previous section. The numbers of children in the sampled families are:

2, 6, 3, 2, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4, 1.

Here the most commonly occurring value is 2 and so this is the mode.
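For discrete data the mode can be found by tabulating the frequencies. The sketch below uses `collections.Counter` from the standard library and returns every value that attains the largest frequency, in case of ties.

```python
from collections import Counter

children = [2, 6, 3, 2, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4, 1]

counts = Counter(children)               # frequency of each distinct value
highest = max(counts.values())           # the largest frequency
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)  # [2] 4
```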

▪ Quantiles

The median divides the data into two equal parts. In a similar way, quartiles divide the data into four equal parts, deciles divide the data into 10 equal parts and percentiles divide it into 100 equal parts.

The upper and lower quartiles can be found in the following way:

sample lower quartile = median of lower half of data

sample upper quartile = median of upper half of data

If n is odd, then the median of the entire sample is included in both halves.

Note that deciles and percentiles only tend to be used on very large data sets.

❖ Example:

The salinity values for 28 water specimens are as follows:

|7.6 |7.7 | 4.3 | 5.9 | 5.0 | 10.5 | 7.7 | 9.5 | 12.0 | 12.6 |

|6.5 | 8.3 | 8.2 | 13.2 | 12.6 | 13.6 | 14.1 | 13.5 | 11.5 | 12.0 |

|10.4 | 10.8 | 13.1 | 12.3 | 10.4 | 13.0 | 14.1 | 15.1 | | |

To find the quartiles we first need to order the data:

|4.3 | 5.0 | 5.9 | 6.5 | 7.6 | 7.7 | 7.7 | 8.2 | 8.3 | 9.5 |

|10.4 | 10.4 | 10.5 | 10.8 | 11.5 | 12.0 | 12.0 | 12.3 | 12.6 | 12.6 |

|13.0 | 13.1 | 13.2 | 13.5 | 13.6 | 14.1 | 14.1 | 15.1 | | |

We have 28 observations and so

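$$\text{median} = \frac{x_{(14)} + x_{(15)}}{2} = \frac{10.8 + 11.5}{2} = 11.15$$

To find the lower and upper quartiles we need to find the median of the lower 14 and upper 14 observations respectively:

$$\text{lower quartile} = \frac{x_{(7)} + x_{(8)}}{2} = \frac{7.7 + 8.2}{2} = 7.95$$

$$\text{upper quartile} = \frac{x_{(21)} + x_{(22)}}{2} = \frac{13.0 + 13.1}{2} = 13.05$$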
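The "median of each half" rule can be written as a small function. This is a sketch of the method described above, with the sample median included in both halves when n is odd; the function name is illustrative. Other software may use a different quartile rule and so give slightly different values.

```python
import statistics

def quartiles(values):
    """Return (lower quartile, median, upper quartile) by the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # for odd n the middle value falls in both halves
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return (statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half))

salinity = [7.6, 7.7, 4.3, 5.9, 5.0, 10.5, 7.7, 9.5, 12.0, 12.6,
            6.5, 8.3, 8.2, 13.2, 12.6, 13.6, 14.1, 13.5, 11.5, 12.0,
            10.4, 10.8, 13.1, 12.3, 10.4, 13.0, 14.1, 15.1]

lq, med, uq = quartiles(salinity)
print(f"LQ = {lq:.2f}, median = {med:.2f}, UQ = {uq:.2f}")
```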

❖ Exercise:

Find the median, together with the lower and upper quartiles for the following examination marks:

68, 72, 31, 60, 90, 96, 45, 57, 54, 45, 16, 22, 82, 63, 52.

Just as with finding the median, we can estimate quantiles graphically.

❖ Example:

Consider again the cumulative frequency polygon for the mercury data. As the total number of observations is 40, we can estimate the lower and upper quartiles by reading off the mercury values from the graph for a cumulative frequency of 10 and 30, respectively.

We see UQ = 34 and LQ = 14 (approximately).

2.5.3 Measures of dispersion

Obviously specifying the central value of a set of data does not tell the whole story. We also need to consider the variability (or spread or dispersion) of the data.

▪ The Range

The simplest measure of dispersion is the range which is simply the difference between the largest and smallest values in the data set. If we have grouped data then we cannot calculate an exact range, only an upper limit.

❖ Example:

For the water salinity data, the largest observation is 15.1 and the smallest is 4.3. Therefore,

range = 15.1 - 4.3 = 10.8.

Note: The range is sensitive to the presence of one or two extremely large or small values in the data.

▪ Inter-quartile range

This is a more useful measure of dispersion than the range. It is simply the difference between the upper and lower quartiles. The inter-quartile range contains the middle half of the data set.

❖ Example:

We calculated the upper and lower quartiles for the water salinity data to be 13.05 and 7.95 respectively. Therefore,

Inter-quartile range = 13.05 - 7.95 = 5.1.

▪ The Mean Deviation

The deviations in a sample are the differences,

$$x_i - \bar{x}, \quad i = 1, \dots, n.$$

One possible idea for obtaining a summary measure of the dispersion in the sample would be to calculate the mean of these deviations. However, the mean of these deviations is always zero. [Think about why this should be.]

Instead we could take the absolute value of each of the deviations and calculate the mean of these. This gives the mean (absolute) deviation:

$$\text{mean deviation} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$

For grouped data the equivalent formula is:

$$\text{mean deviation} = \frac{1}{n}\sum_{k=1}^{K} f_k\,|x_k - \bar{x}|$$

where $x_k$ is the midpoint of the kth class and $f_k$ is its frequency.

❖ Example

Twelve students record their weight in kg, creating the following sample:

50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.

The mean of these 12 observations is: $\bar{x} = \frac{777}{12} = 64.75$

The deviations of each value from the mean are:

-14.75, -13.75, -3.75, 10.25, -2.75, 8.25, -0.75, 21.25, 0.25, -6.75, 8.25, -5.75.

So the mean deviation is:

Mean deviation = $\frac{96.5}{12} = 8.04$ kg (to 2 d.p.).
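The calculation for the twelve weights can be reproduced as follows. This is a short sketch with illustrative variable names; it also confirms that the deviations themselves sum to zero.

```python
weights = [50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59]

n = len(weights)
mean = sum(weights) / n                      # 777 / 12 = 64.75
deviations = [x - mean for x in weights]     # positive and negative values cancel out

# Mean of the absolute deviations.
mean_deviation = sum(abs(d) for d in deviations) / n

print(f"mean = {mean}, sum of deviations = {sum(deviations)}")
print(f"mean deviation = {mean_deviation:.4f} kg")
```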

▪ The Sample Variance and Sample Standard Deviation

Instead of taking the absolute values of the deviations (so that the positive and negative deviations don't just cancel each other out), we could use the squares of the deviations. The sample variance (usually denoted by $s^2$) can be thought of as an ‘average’ of the squared deviations.

The sample variance is defined by:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Note that although we are summing n squared deviations, we divide through by n – 1. This is important! The reason why we use n - 1 and not n in the definition of the sample variance will become apparent later on in the course when we look at unbiased estimators.

The disadvantage of using the sample variance is that it is not measured in the units of measurement used for the data, but in squared units. This problem is overcome by using the standard deviation. The sample standard deviation is simply the square root of the sample variance, i.e.:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Note: For grouped data, we use the following definition for a sample s.d.:

$$s = \sqrt{\frac{1}{n-1}\sum_{k=1}^{K} f_k\,(x_k - \bar{x})^2}$$

❖ Example

Consider again the weights of the 12 students given above. The deviations from the mean were:

-14.75, -13.75, -3.75, 10.25, -2.75, 8.25, -0.75, 21.25, 0.25, -6.75, 8.25, -5.75.

So the sample variance is:

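$$s^2 = \frac{1}{11}\left((-14.75)^2 + (-13.75)^2 + \dots + (-5.75)^2\right) = \frac{1200.25}{11} = 109.1136$$

This means that the sample standard deviation is $s = \sqrt{109.1136} = 10.446$ kg.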

Result:

Using the above formula to calculate the sample variance can be complicated. In general it is better to use the expression:

$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)$$

To calculate the variance using this expression we need to know the sum of the observations and the sum of the squares.

Proof:

We need to show that both formulae for the sample variance are equivalent. It suffices to show:

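$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$

Now,

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}\left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right) = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2$$

But, $\sum_{i=1}^{n} x_i = n\bar{x}$ so

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$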

as required.

Note:

There is an equivalent expression for grouped data, so that:

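$$s^2 = \frac{1}{n-1}\left(\sum_{k=1}^{K} f_k x_k^2 - \frac{1}{n}\left(\sum_{k=1}^{K} f_k x_k\right)^2\right), \quad \text{where } n = \sum_{k=1}^{K} f_k.$$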

❖ Example 1:

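Consider again the student weight data:

50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.

We can check that the new formula for calculating the variance does in fact give us the same result:

$$\sum_{i=1}^{12} x_i = 777, \qquad \sum_{i=1}^{12} x_i^2 = 51511$$

So,

$$s^2 = \frac{1}{11}\left(51511 - \frac{777^2}{12}\right) = \frac{1200.25}{11} = 109.1136$$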

as before.
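The agreement between the two formulae can also be checked numerically. The sketch below computes the sample variance of the twelve weights from the definition and from the sums of the observations and their squares, and compares both with `statistics.variance` from the standard library (which also divides by n - 1).

```python
import math
import statistics

weights = [50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59]
n = len(weights)
mean = sum(weights) / n

# Definition: sum of squared deviations divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in weights) / (n - 1)

# Alternative expression: needs only the sum and the sum of squares.
sum_x = sum(weights)
sum_x2 = sum(x * x for x in weights)
var_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

sd = math.sqrt(var_shortcut)

print(sum_x, sum_x2)                                   # 777 51511
print(f"{var_definition:.4f} {var_shortcut:.4f}")      # 109.1136 109.1136
print(f"{statistics.variance(weights):.4f} {sd:.3f}")  # 109.1136 10.446
```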

❖ Example 2:

For an example of grouped data, consider the mercury data again:

|Interval |Mid-point, $x_k$ |Frequency |

|0 - 10 |5 |5 |

|10 - 20 |15 |11 |

|20 - 30 |25 |10 |

|30 - 40 |35 |9 |

|40 - 50 |45 |2 |

|50 - 60 |55 |1 |

|60 - 70 |65 |2 |

Here we have,

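$$n = \sum_{k=1}^{7} f_k = 40, \qquad \sum_{k=1}^{7} f_k x_k = 1030, \qquad \sum_{k=1}^{7} f_k x_k^2 = 35400$$

So,

$$s^2 = \frac{1}{39}\left(35400 - \frac{1030^2}{40}\right) = \frac{8877.5}{39} = 227.6282$$

The sample standard deviation is therefore $\sqrt{227.6282} = 15.09$.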
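The grouped-data calculation can be checked in the same way. This is a minimal sketch using the class mid-points and frequencies of the mercury data, with n taken as the total frequency and a divisor of n - 1 as in the ungrouped case.

```python
import math

midpoints = [5, 15, 25, 35, 45, 55, 65]
frequencies = [5, 11, 10, 9, 2, 1, 2]

n = sum(frequencies)                                               # total frequency
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))        # sum of f_k * x_k
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))   # sum of f_k * x_k^2

variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(n, sum_fx, sum_fx2)             # 40 1030 35400
print(f"{variance:.4f} {sd:.2f}")     # 227.6282 15.09
```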

❖ Exercise:

A sample of 50 adults were asked how many lottery tickets they purchased last week:

|Number of lottery tickets |0 |1 |2 |3 |4 |5 |

|Frequency |19 |11 |10 |3 |4 |3 |

Find the sample standard deviation.

Note:

Find out how to use your calculator’s statistical mode to calculate s.d.s.

5. Box-and-whisker plots

Box-and-whisker plots aim to highlight a few important features of a data set. They are based on the following location summaries: minimum, lower quartile, median, upper quartile and maximum. These 5 quantities are sometimes referred to as the five-number summary.

❖ Simple Example:

The number of runs scored by a batsman on 14 occasions are as follows:

40, 22, 17, 50, 24, 48, 5, 0, 28, 19, 30, 25, 16, 37.

Ordering these values we get:

0, 5, 16, 17, 19, 22, 24, 25, 28, 30, 37, 40, 48, 50.

The five-number summary then is:

Minimum value = 0, Maximum value = 50,

Median, Q2 = 24.5, Lower quartile, Q1 = 17, Upper quartile, Q3 = 37.

The box-and-whisker plot then looks like:

In the above diagram, the box indicates the interquartile range. The whiskers go from the lower and upper quartiles to the smallest and largest observations respectively. The median is represented by a line within the box.

Note: the position of the median within the box gives an indication of whether the data are skewed:

• Symmetry: $Q_3 - Q_2 = Q_2 - Q_1$;

• positive skew: $Q_3 - Q_2 > Q_2 - Q_1$;

• negative skew: $Q_3 - Q_2 < Q_2 - Q_1$.
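The five-number summary for the batsman's scores, and a simple check of skewness from the position of the median within the box, can be sketched as follows. The quartiles are found as medians of the lower and upper halves of the ordered data. The skewness rule used in the code (comparing the distance from the median to each quartile) is stated here as an assumption of the sketch.

```python
import statistics

runs = [40, 22, 17, 50, 24, 48, 5, 0, 28, 19, 30, 25, 16, 37]

ordered = sorted(runs)
n = len(ordered)
half = (n + 1) // 2                          # for odd n the median falls in both halves

q1 = statistics.median(ordered[:half])       # lower quartile
q2 = statistics.median(ordered)              # median
q3 = statistics.median(ordered[n - half:])   # upper quartile
summary = (ordered[0], q1, q2, q3, ordered[-1])

# Assumed rule: a median nearer the lower quartile suggests positive skew.
upper_gap, lower_gap = q3 - q2, q2 - q1
if upper_gap > lower_gap:
    shape = "positive skew"
elif upper_gap < lower_gap:
    shape = "negative skew"
else:
    shape = "symmetric"

print(summary)  # (0, 17, 24.5, 37, 50)
print(shape)    # positive skew
```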

Box-and-whisker plots are especially useful for comparing two different data sets as they give a simple picture of the locations and spreads of different distributions.

❖ Example:

The numbers of hysterectomies performed by 15 male doctors and 10 female doctors are given below:

|Male | 20 | 25 | 25 | 27 | 28 |

|doctors | | | | | |

|Field mice |Stem |Voles |

| |5L |2 |

|8 |5H | |

|4 3 3 2 1 |6L |0 3 4 4 |

|9 8 5 |6H |7 7 8 9 |

|2 1 1 0 |7L |0 0 0 2 4 4 4 |

|6 5 |7H |5 7 8 |

Scale: Stem = 10’s Leaves = 1’s

Draw box-and-whisker plots for the field mice and voles and compare the shapes of these.

Note:

Minitab calculates the quartiles slightly differently to the method used in this course. Consequently, slightly different values for the quartiles can arise when using Minitab.

-----------------------

N.B. By convention, any observation that is at a boundary of a class will be put into the higher class. For example, an observation of 10 above would be put into the 10-20 category.


An unordered stem-and-leaf diagram for the Boeing data

Leaves- these should be in columns

Stem

An ordered stem-and-leaf diagram for the Boeing data

Leaves have now been put in order

This could be considered an outlying value

Scale: Stem = 1000's Leaves = 100's

Scale: Stem = 100's Leaves = 10's

In the high category you write any 5s, 6s, 7s, 8s or 9s.

In each low category you put any 0s, 1s, 2s, 3s, or 4s.

Scale: Stem = 100's Leaves = 10's

An unordered back-to-back stem-and-leaf diagram for the protein data

Outlier?

An ordered back-to-back stem-and-leaf diagram showing the protein data

A stem-and-leaf diagram showing the change in typing speeds after a short course

Scale: Stem = 10’s Leaves = units.

