Environmental Analysis Analytical Chemistry Lecture 3 October 7, 2002

I have placed one copy of my lecture notes in the Chem Cave. The notes are also a text file in the program Share Directory under Handouts.

Also, there is a detailed set of solutions to the first week's homework in the Cave and on the Share. I had prepared a set of solutions that took 2½ pages, but then wrote a more detailed set that is 8 pages long. I expect you to try the problems before reading the solutions. On the other hand, I don't want you staring at a problem for hours and hours. The solutions are there so you can get help when you get stuck. Remember to work together, discuss the material, and start learning.

Being able to apply concepts to the solution of numerical problems is a major part of this program. When you finish a problem, go back over it and ask yourself: What concepts did I use? What do the concepts say about tackling this kind of problem? How might I use these concepts to tackle similar problems?

Numerical solutions are just part of the answers. We want you to think about your learning and to solve problems actively.

By now you should have had a chance to skim through the text and see how it is laid out.

After these first few weeks we will stop going quite so fast through the material and give you more time to digest the concepts. I hope that much of this is review, but if not, keep in mind that this is a year long program and that you have a whole year of learning ahead of you. You will struggle with some topics and find others easy.

You may find solubility equilibrium difficult now, but there will come a time when you realize that solubility is easy, while Analysis of Variance doesn't yet make sense. Because we are a field-oriented, application-based program, you will apply nearly every concept we cover to your field work at some time during the year, many of them over and over.

XXXXXX

Homework assignment due Wednesday of Week 3 is the following problems from Harris:

3-12, 4-11, 4-20, 4-22, 4-3, 3-15 (a, b, g), 1-25, 3-21

Chapter 3 is concerned with ways to deal with and express experimental errors while Chapter 4 is a discussion of statistics in analytical chemistry.

Today I want to give an introduction to experimental data and the use of statistics in data analysis, followed by doing some of these homework problems in workshop.

Errors in Chemical Analyses

It is impossible to perform a chemical analysis that is totally free of errors, or uncertainties.

All one can hope is to minimize these errors and to estimate their size with acceptable accuracy.

Every measurement is influenced by many uncertainties, which combine to produce a scatter of results. Measurement uncertainties can never be completely eliminated, so the true value for any measured quantity is always unknown.

XXXXXX

As the book points out, we can divide measurement errors into two classes. The first, systematic errors, arise from flaws in your methods or equipment; these can be detected and corrected, although not always easily.

On the other hand, Random errors arise from uncontrollable variabilities in the measurement.

Systematic errors tend to decrease as one gains experience performing an assay, so training is important. Another type of systematic error arises from mistakes in procedures, such as making the wrong calibration standard.

Suppose you are to make a 1.0 mM standard but instead make one that is 0.9 mM. When the instrument reads 1.0, it really should read 0.9. Will your determinations be off on the high side or the low side? On the high side: when you report that a sample has a concentration of 2.0 mM, the real value is only 1.8 mM.
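As a quick numerical check of that reasoning, here is a minimal sketch (all numbers are made up for illustration, not taken from the lecture) showing how a one-point calibration against a mislabeled standard biases every reported concentration high by the same relative factor:

```python
# Minimal sketch of a one-point calibration with a mislabeled standard.
# All numbers are illustrative.

true_standard = 0.9     # mM actually in the flask
labeled_standard = 1.0  # mM you *think* you made

# Assume the instrument response is proportional to concentration:
# signal = k * concentration, with k = 100 (arbitrary units per mM).
k = 100.0
signal_of_standard = k * true_standard  # what the instrument really sees

# Calibrating as if the standard were 1.0 mM gives a response factor that is too low:
response_factor = signal_of_standard / labeled_standard  # 90 instead of 100

# Measure a sample whose true concentration is 1.8 mM:
true_sample = 1.8
reported = (k * true_sample) / response_factor
print(f"reported = {reported:.1f} mM")  # 2.0 mM, i.e. high by the factor 1.0/0.9
```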

So you need to practice methods, be extremely careful with procedures and constantly test your results by running standards and calibration samples to eliminate systematic errors as much as possible.

One of the first questions to answer before beginning an analysis is, "What is the maximum error that I can tolerate in the result?" The answer to this question usually determines the time required to do the work. For example, a tenfold increase in accuracy may take hours, days, or even weeks of added labor.

No one can afford to waste time generating data that are more reliable than is needed. On the other hand, data that are known to only 1 or 2 significant figures are often useless.

In the lab you will generally carry three to five aliquots of a sample through an entire analytical procedure so that you can determine the variation between individual measurements.

We then describe a central or best value for the set of data by determining the mean.

XXXXXX

Mean, arithmetic mean, or average is obtained by dividing the sum of the experimental measurements by the number of measurements in the set.

The median is the middle value when the measurements are ranked in order; for an even number of measurements, it is the average of the two middle values.

XXXXXX

Suppose we are determining the amount of iron(III) in water samples. Six equal portions of an aqueous solution that contained exactly 20.00 ppm of iron(III) were analyzed in exactly the same way.

Note that the results range from a low of 19.4 ppm to a high of 20.3 ppm of iron(III). The average of the data is 19.8 ppm to three significant figures.

The mean is usually written x̄. The mean of the iron(III) data is 19.78 ppm and the median is 19.7 ppm, determined by the average of the two middle points.
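To make the two summary statistics concrete, here is a minimal sketch using Python's standard library. The six values are illustrative ones chosen to match the stated low (19.4), high (20.3), mean (19.78), and median (19.7), since the individual results are not listed in these notes.

```python
import statistics

# Illustrative iron(III) results in ppm (chosen to match the quoted summary values).
results = [19.4, 19.5, 19.6, 19.8, 20.1, 20.3]

mean = statistics.mean(results)      # 19.78 ppm
median = statistics.median(results)  # average of the middle two values: 19.7 ppm

print(f"mean = {mean:.2f} ppm, median = {median:.2f} ppm")
```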

Precision describes the reproducibility of measurements, that is, the closeness of results that have been obtained in exactly the same way.

Accuracy indicates the closeness of the measurement to its true or accepted value and is expressed by the error in the measurement.

XXXXXX

This transparency illustrates the basic difference between accuracy and precision. Accuracy measures agreement between a result and its true value. Precision describes the agreement among several results that have been measured in the same way.

Precision is determined by simply repeating a measurement. On the other hand, accuracy can never be determined exactly because the true value of a quantity can never be known exactly. An accepted value must be used instead.

Accuracy is expressed in terms of either absolute or relative uncertainty.

Equations for both are given on page 51. Most of Chapter 3 deals with significant figures and how to determine the absolute and relative uncertainties of calculated quantities. Be sure you understand how these are calculated.

Let’s look at random errors and their effects on the precision of measurements.

Suppose we have four different random errors that combine to give an overall error of a measurement. We will assume that each error has an equal probability of occurring and that each can cause the final result to be high or low by a fixed amount ± U.

XXXXXX

This transparency shows all the possible ways these four errors can combine to give the indicated deviations from the mean value. Note that only one combination leads to a deviation of + 4 U, four combinations give a deviation of + 2 U, and six give a deviation of 0 U.
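That counting argument is easy to reproduce; here is a minimal sketch that enumerates every way four equal errors of ±U can combine (U = 1 is an arbitrary illustrative choice):

```python
from itertools import product
from collections import Counter

U = 1  # magnitude of each individual error (arbitrary units)

# Each of the four errors is equally likely to be +U or -U.
totals = Counter(sum(combo) for combo in product((+U, -U), repeat=4))

for deviation in sorted(totals, reverse=True):
    print(f"deviation {deviation:+d} U: {totals[deviation]} combination(s)")
# +4 U: 1 way, +2 U: 4 ways, 0 U: 6 ways, -2 U: 4 ways, -4 U: 1 way
```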

XXXXXX

If we plot this deviation from the mean we get the top graph in this transparency. The middle graph is for 10 random errors in an experiment.

We see that the most frequent occurrence is zero deviation from the mean. At the other extreme, a maximum deviation of 10 U occurs only about once in 500 measurements.

The bottom graph is for an experiment with a very large number of individual errors which has a bell-shaped curve that is called a Gaussian curve or a normal error curve.

XXXXXX

We find empirically that the distribution of replicate data from quantitative analytical experiments approaches that of the Gaussian curve. As an example, consider the data in this Table for 50 replicate calibrations of a 10-mL pipet.

The results vary from a low of 9.969 mL to a high of 9.994 mL. This 0.025 mL spread of data results directly from an accumulation of all of the random uncertainties in the experiment.

XXXXXX

The information in the Table is easier to visualize when the data are rearranged into frequency distribution groups. Here, we tabulate the number of data falling into a series of adjacent 0.003-mL bins.

This plot is called a histogram. We can imagine that as the number of measurements increases, the histogram would approach the shape of the continuous curve shown as plot B.

This curve is a Gaussian curve derived for an infinite set of data having the same mean and the same precision.
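Here is a minimal sketch of that binning step. Because the 50 individual pipet volumes are not reproduced in these notes, the data are simulated; the mean, spread, and bin width are illustrative choices roughly matching the quoted 9.969-9.994 mL range.

```python
import random
from collections import Counter

random.seed(1)

# Simulate 50 pipet calibrations (mL); mean and spread are illustrative.
volumes = [random.gauss(9.982, 0.006) for _ in range(50)]

bin_width = 0.003  # mL, as in the lecture's frequency distribution
bins = Counter(int((v - 9.969) // bin_width) for v in volumes)

for b in sorted(bins):
    low = 9.969 + b * bin_width
    print(f"{low:.3f}-{low + bin_width:.3f} mL: {'*' * bins[b]}")
```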

So we see that variations in replicate results arise from numerous small and individually undetectable random errors. Such small errors ordinarily tend to cancel one another and thus have a minimal effect so that most of the values are close to the mean.

Occasionally, however, they occur in the same direction and produce a large positive or negative net error.

Now let us turn to statistical treatments of random error. Statistical analysis of analytical data is based upon the assumption that random errors in an analysis follow a Gaussian, or normal distribution.

In statistics, a finite number of experimental observations is called a sample. The sample is treated as a tiny fraction of an infinite number of observations that could, in principle, be made given infinite time. Statisticians call the theoretical infinite number of data a population.

Statistical laws have been derived assuming a population of data; often they must be modified substantially when applied to a small sample because a few data may not be representative of the population.

The population mean is given the symbol μ (mu), while the sample mean is written x̄.

In the absence of systematic error, the population mean is also the true value for the measured quantity.

More often than not, particularly when N is small, μ differs from x̄ because a small sample of data does not exactly represent its population. For example, it is quite possible that if you take only 3 measurements, all three could be above the actual mean of many measurements.

The probable difference between x̄ and μ decreases rapidly as the number of measurements making up the sample increases; ordinarily, by the time N reaches 20 to 30, this difference is negligible.

Unfortunately (or fortunately) it is rarely cost effective to repeat the same measurement in analytical chemistry 30 times.

Three terms are widely used to describe the precision of a set of experimental data: the standard deviation, the variance, and the coefficient of variation.

XXXXXXX

The population standard deviation σ (sigma) is a measure of the precision of a population of data and is given by

σ = √[ Σ(xi − μ)² / N ]

XXXXXXX

Here are two distribution curves for two populations of data that differ only in their standard deviations. The standard deviation for the data set yielding the broader but lower curve B is twice that for the measurements yielding curve A. So random errors in measurement B are greater than in A.

One of the goals of analytical chemistry is to find methods that have fewer random errors. Thus methods with a low standard deviation are favored over ones with large standard deviations.

XXXXX

The bottom graph gives another type of normal error curve in which the abscissa is expressed in units of σ by the variable z, which is defined as

z = (x − μ) / σ

Plotted this way, the two curves A and B above become identical. z is a function of the standard deviation, i.e. of the measurement precision.
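A minimal sketch of that rescaling (the numbers are illustrative): once raw values are converted to z, measurements from populations with different standard deviations fall on the same curve.

```python
def z_score(x, mu, sigma):
    """Deviation from the population mean in units of sigma."""
    return (x - mu) / sigma

# Illustrative curves A and B: same mean, but sigma_B is twice sigma_A.
print(z_score(10.5, mu=10.0, sigma=0.5))  # curve A point -> z = 1.0
print(z_score(11.0, mu=10.0, sigma=1.0))  # curve B point -> z = 1.0 as well
```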

XXXXXX

This generalized Gaussian or normal error curve has several important properties.

(1) The mean occurs at the central point of maximum frequency.

(2) There is a symmetrical distribution of positive and negative deviations about the maximum.

(3) There is an exponential decrease in frequency as the magnitude of the deviations increases.

(4) It can be shown that, regardless of its width, 68.3% of the area beneath a Gaussian curve for a population of data lies within one standard deviation (±1σ) of the mean μ.

Thus, 68.3% of the data making up the population lie within one standard deviation of the mean.

Furthermore, 95.5% of all data are within ±2σ of the mean and 99.7% are within ±3σ.

Thus it is important to realize that although most measurements are written as value ± 1 standard deviation, this range only includes about 2/3 of the measurements in a normal procedure.

Because of area relationships such as these, the standard deviation of a population of data is a useful predictive tool. For example, we can say that the chances are 68.3 in 100 that the random uncertainty of any single measurement is no more than ±1σ. Similarly, the chances are 95.5 in 100 that the error is less than 2σ, and so forth.
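Those area percentages can be checked numerically. Here is a minimal sketch using the standard library's error function; for a Gaussian population, the fraction lying within ±kσ of the mean is erf(k/√2).

```python
import math

def fraction_within(k):
    """Fraction of a Gaussian population within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k}σ: {100 * fraction_within(k):.1f}%")
# within ±1σ: 68.3%, within ±2σ: 95.4%, within ±3σ: 99.7%
```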

Now let's look at sample statistics. Remember samples are finite data sets. The sample standard deviation is given by the equation

s = √[ Σ(xi − x̄)² / (N − 1) ]

The term (N – 1) is called the number of degrees of freedom and is 1 less than the number of data points.

The N − 1 term adjusts the math for the fact that a small number of determinations may not be spread evenly over the Gaussian distribution of all possible values.

You can see that the standard deviation of a single measurement is meaningless as it involves a divide-by-zero error. This is why measurements are repeated a number of times to get a reasonable N and to get a mean that more likely represents the true value.

Most scientific calculators have the standard deviation function built in. However, be sure you use the sample standard deviation s, not the population standard deviation σ. On some calculators the difference is denoted by a subscript n or n−1 on the sigma symbol.
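The same distinction exists in Python's standard library; a minimal sketch, reusing the illustrative iron(III) values from above:

```python
import statistics

results = [19.4, 19.5, 19.6, 19.8, 20.1, 20.3]  # illustrative ppm values

s = statistics.stdev(results)         # sample standard deviation, divides by N - 1
sigma_n = statistics.pstdev(results)  # population formula, divides by N

print(f"sample s (N - 1): {s:.3f} ppm")
print(f"population formula (N): {sigma_n:.3f} ppm")  # always the smaller of the two
```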

Chemists ordinarily employ the sample standard deviation to report the precision of their data. Three other terms that are often encountered are:

XXXXXX

The variance, which is equal to the square of the standard deviation, is written s².

The relative standard deviation is the standard deviation divided by the mean.

RSD = s / x̄, often expressed in parts per thousand (ppt): RSD = (s / x̄) × 1000 ppt.

If the relative standard deviation is expressed as a percent instead, it is called the coefficient of variation.

CV = (s / x̄) × 100%
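A minimal sketch collecting these precision measures in one place (same illustrative values as above):

```python
import statistics

results = [19.4, 19.5, 19.6, 19.8, 20.1, 20.3]  # illustrative ppm values

xbar = statistics.mean(results)
s = statistics.stdev(results)

variance = s ** 2               # s squared
rsd_ppt = (s / xbar) * 1000     # relative standard deviation in parts per thousand
cv_percent = (s / xbar) * 100   # coefficient of variation in percent

print(f"s² = {variance:.4f}, RSD = {rsd_ppt:.0f} ppt, CV = {cv_percent:.1f}%")
```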

These terms help us describe data and the precision of our measurements.

Statistics is also used extensively in analytical chemistry for the purpose of comparing method results with actual values or with other methods.

For example, we will want to establish how close the population mean is to a sample mean.

Remember, the mean μ for a population of data can never be determined exactly, because such a determination would require an infinite number of measurements.

Statistical theory, however, allows us to set limits around an experimentally determined mean x̄ within which the population mean μ lies with a given degree of probability.

This assumes all experimental errors are random, i.e. no systematic errors.

XXXXXX

The confidence interval is an expression stating that the true mean, μ, is likely to lie within a certain distance from the measured mean. The ends of this interval are called the confidence limits.

The confidence level is the probability that the true mean will lie within the defined limits.

The confidence interval is derived from the sample standard deviation. It depends upon how accurately we know s and how closely our sample standard deviation approximates the population standard deviation σ.

XXXXXXXX

This transparency shows a series of five normal error curves. In each, the relative frequency is plotted as a function of the quantity z, which is the deviation from the mean in units of the population standard deviation.

z = (x − μ) / σ

The shaded areas in each plot lie between the values of - z and + z that are indicated to the left and right of the curves. The numbers within the shaded areas are the percentage of the total area under the curve that is included within these values of z.

For example, as shown in the top curve, 50% of the area under any Gaussian curve is located between -0.67σ and +0.67σ. Proceeding on down, we see that 80% of the total area lies between -1.29σ and +1.29σ, and 90% lies between -1.64σ and +1.64σ.

Relationships such as these allow us to define a range of values around a measurement within which the true mean is likely to lie with a certain probability. For example, we may assume that 90 times out of 100, the true mean, μ, will be within ±1.64σ of any measurement that we make.
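Those z values can be reproduced with the standard library's NormalDist; a minimal sketch for two of the confidence levels quoted above, plus 95%:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

for confidence in (0.50, 0.90, 0.95):
    # Two-sided interval: leave (1 - confidence)/2 of the area in each tail.
    z = std_normal.inv_cdf(0.5 + confidence / 2)
    print(f"{confidence:.0%} of the area lies within ±{z:.2f}σ of the mean")
# 50% -> ±0.67σ, 90% -> ±1.64σ, 95% -> ±1.96σ
```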

XXXXXX

Here, the confidence level is 90% and the confidence interval is ±zσ = ±1.64σ, as shown in this table.

XXXXXX

For the mean of N > 30 measurements, the confidence limits are

CL for μ = x̄ ± zσ / √N

XXXXXX

Notice that in order to determine the confidence limits and confidence interval we must first pick a confidence level. The book uses 95% for almost all its confidence levels.

This equation applies only in the absence of systematic error and only under the condition that s is a good estimate of σ, i.e. only when N is large (greater than about 20 or 30).
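A minimal sketch of that large-N confidence interval (the numbers are illustrative, and s is treated as a good stand-in for σ because N is large):

```python
import math

# Illustrative summary of N replicate measurements.
N = 40
xbar = 19.78  # ppm, sample mean
s = 0.35      # ppm, used in place of sigma since N is large
z = 1.96      # two-sided z for the 95% confidence level

half_width = z * s / math.sqrt(N)
print(f"95% CL for mu: {xbar:.2f} ± {half_width:.2f} ppm")
```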

What happens when the sample size is too small?

As indicated earlier, s calculated from a small set of data may be quite uncertain. Thus, confidence limits are necessarily broader when a good estimate of σ is not available.

To account for the variability of s, we use the important statistical parameter t, which is defined in exactly the same way as z, except that s is substituted for σ:

t = (x − μ) / s

Like z, t depends on the desired confidence level. In addition, however, t also depends on the number of degrees of freedom (N − 1) used in the calculation of s.

XXXXXXXX

Table 4-2 on page 67 provides values for t for a few degrees of freedom. More extensive tables are found in various mathematical and statistical books.

Now the confidence limits for the mean x̄ of N replicate measurements can be derived from t by the equation

CL for μ = x̄ ± t s / √N

Remember, a confidence interval is an estimate of the experimental uncertainty. It expresses the probability that the true value lies within the interval surrounding the experimental mean. It also assumes all errors are random, i.e. all systematic errors have been eliminated.
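A minimal sketch of the small-N version (the eight data values are illustrative; 2.365 is the standard two-sided 95% t value for 7 degrees of freedom, the kind of entry found in Table 4-2):

```python
import math
import statistics

# Illustrative set of 8 replicate measurements (ppm).
data = [19.6, 19.9, 20.1, 19.7, 20.0, 19.8, 20.2, 19.9]

N = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)
t = 2.365  # two-sided t, 95% confidence, N - 1 = 7 degrees of freedom

half_width = t * s / math.sqrt(N)
print(f"95% CL for mu: {xbar:.2f} ± {half_width:.2f} ppm")
```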

Much of statistics is based upon hypothesis testing, which we will go into in detail next quarter. A t-test, as shown in the book, is used to make the following kinds of comparisons:

(1) the mean from an analysis, x̄, with what is believed to be the true value μ; [Case 1]

(2) the means x̄1 and x̄2 and standard deviations from two sets of analyses (usually by different methods); [Case 2 in book]

(3) compare 2 methods on different samples [Case 3 in book]

{next section not included}

We will just talk about Case 1 for now.

Suppose we have a new method for determining iron in drinking water and we want to compare our results with the "gold standard" method. We will assume that the gold standard experimental mean gives the true value, xt, which is then the same as μ [Case 1].

We then can rearrange the confidence limit equation and substitute xt for μ:

CL for μ = x̄ ± t s / √N   ==>   tcalculated = |xt − x̄| √N / s

This allows us to calculate a t value that corresponds to the difference between the mean for our new method and the "true value".

To make this test we compare the calculated t with values of t from the t-table. But first, we must pick a confidence level. If we want a good method we might pick 95 or 99% as our confidence level, i.e. we may want our method to be correct at least 99 times out of 100.

Let's suppose N = 8 and tcalculated = 2.8. We look in Table 4-2 on page 67 for 7 degrees of freedom and get 3.5 as the t value at 99% confidence.

Conclusion: tcalculated < ttable, therefore the means of the two methods are likely to give the same result at least 99 times out of 100.

If tcalculated > ttable, then the two results are different at the 99% confidence level: the two means are not describing the same population, or the difference cannot be ascribed to random error.
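A minimal sketch of that Case 1 comparison. The mean, standard deviation, and true value are illustrative numbers chosen so that the calculated t comes out near the 2.8 used above; 3.5 is the 99% table value for 7 degrees of freedom quoted from Table 4-2.

```python
import math

# Illustrative Case 1 numbers: new method vs. the accepted "true" value.
xt = 20.00    # ppm, gold-standard result, treated as mu
xbar = 19.90  # ppm, mean of the new method's N replicate results
s = 0.10      # ppm, sample standard deviation of the new method
N = 8

t_calc = abs(xt - xbar) * math.sqrt(N) / s
t_table = 3.5  # 99% confidence, N - 1 = 7 degrees of freedom

print(f"t_calculated = {t_calc:.1f}")
if t_calc < t_table:
    print("The two means agree at the 99% confidence level.")
else:
    print("The difference cannot be ascribed to random error at the 99% level.")
```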

{End section not included.}

Let me stop here but note that there are other statistical tests for different kinds of hypotheses and comparisons.

The t-test compares means of data sets and populations.

When the value of xt is not known, we use a pooled method as on page 70.

The F-test compares the precision of measurements, i.e. whether one method is more precise than another. This is a comparison of the variances (s² values) of the two methods. We will do more with F-tests later.

The Q-test tests for outliers – i.e. can data points that are far from the mean be considered part of the statistical population or might they be dropped as not likely part of the population? You should get to know and use this test on your experimental data.

For the workshop start on the homework problems in order.

Read Chapter 8 for Wednesday.
