Random processes - Johns Hopkins University



Uncertainty

Many physical processes, like the falling of a ball under the influence of gravity, or the motion of the sun across the sky, or the acceleration of air across the surface of an airplane wing, can be modeled explicitly as a set of equations. Then, if we want to predict, for example, the distance a ball falls in one second under gravity when it starts from rest, we simply plug in the value for gravity $g = 9.8$ m/s$^2$ and $t = 1$ s, and use the predictive equation $s = \frac{1}{2} g t^2$ to obtain $s = 4.9$ m. At this level of sophistication, this is a deterministic process, i.e., we can specify the parameters and initial conditions and then calculate the results.
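As a small illustration of what "deterministic" means here, the following sketch evaluates the falling-ball equation. The function name `fall_distance` is our own label, and the model assumes no air resistance; the same inputs always give the same output.

```python
def fall_distance(t_seconds, g=9.8):
    """Distance in meters fallen from rest after t_seconds, ignoring air resistance."""
    return 0.5 * g * t_seconds ** 2

# Same parameters and initial conditions in, same answer out, every time.
one_second_drop = fall_distance(1.0)
two_second_drop = fall_distance(2.0)
print(one_second_drop, two_second_drop)
```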

Let’s contrast this with another physical process: that of rolling a die. Now, what is the predictive equation for the outcome? In principle, we could come up with a set of equations, but they would be horrendously complicated and not at all practical. So, in effect, we have a process whose outcome is unpredictable. When a process produces an outcome that is almost completely unpredictable, we say it is a random process or stochastic process.

The distinction between a deterministic process and a stochastic one is not absolute. In our rolling die example, we could produce a set of equations to model its motion. But the equations would depend critically on a number of parameters, whose specifications might be almost impossible to prescribe. For example, the initial orientation of the die, its initial release velocity and angular momentum, its height above the table, and the characteristics of the table’s surface would all play critical roles in how the die would come to rest. Even the geometry would be important: whether the edges were sharp or rounded and, if rounded, by how much. In practice you could never prescribe these parameters with sufficient accuracy to ensure that the equations would predict the outcome. So, in practice, the process is random.

(But sometimes, an apparently-perfect random process is not so perfect. The roulette wheel in gambling casinos is supposed to be perfectly random. In this apparatus a small ball is supposed to fall with equal probability in one of the 38 numbered slots in the wheel. But some years ago, a graduate student, Albert Hibbs, at the University of Chicago and visiting Las Vegas, began recording the slots into which the ball fell on one roulette wheel. After several days, he realized that the wheel favored some slots over others, i.e., it was imperfectly made. So with some strategic wagering he was able to “beat the odds”, and, additionally, made a name for himself.)

How completely we model a physical process will determine how well we can predict behavior. Usually we try to include those parameters that have large and systematic influences. That is, we try to include parameters that affect the process in a significant and expected way. In our first example of a falling ball, our model was the simple relationship $s = \frac{1}{2} g t^2$. One of the parameters neglected in this model is air resistance. If we were to use this simple equation as a predictor of an actual experiment, we would always overestimate distance, because the neglected ingredient always acts to slow the ball. That is, the element that we neglected had a systematic effect.

At some point we stop trying to include additional parameters in our model because the model becomes too cumbersome. And we hope that the remaining parameters will have a relatively small influence on prediction. Further, we hope that these unmodeled parameters will partially cancel one another out, i.e., they won’t all act to affect the process in one direction. (Eventually, there is a limit beyond which deterministic modeling is not possible. You’ve probably heard of the Heisenberg uncertainty principle.) Often a process is considered mostly deterministic or mostly random based on practicality. Is it worth it to expand the model to include additional details?

Let’s take the example of water pressure in a municipal water system. If, at some point within the system, you were to measure pressure as a function of time you would discover it fluctuates, sometimes smoothly, sometimes wildly. In principle, one could model the system and make predictions from it, but the system is so complex that it would not be practical to do so in detail. Pump pressure, pipe diameter, and pipe roughness could be relatively easily included in a model. Those parameters are relatively constant. But water pressure also depends on flow rate. So, every time someone turns on a faucet or flushes a toilet, it has an effect. And it would be impossible to monitor all faucets in the system to be able to include their individual impacts on the system. So, what to do?

One answer is to predict “in the average”. That is, make some assumptions about the distribution of faucets turned on and toilets flushed by time of day, and use a measure or statistic to characterize a “typical” situation, perhaps by time of day or day of week. Then use this statistic to approximate the actual conditions in the model. Predictions resulting from an averaged input will be imperfect, but they could be reasonably useful. (Actually, one of the more interesting problems for water-works engineers, and one that defies getting usable predictions from “averaged” inputs, is the Super Bowl problem. In a period of about twenty minutes during Super Bowl half-time, 50 million toilets are flushed throughout the U.S., a once-a-year occurrence. This wreaks havoc with municipal water systems.)

Uncertainty and randomness enter the engineering world in yet another way: measurement. If fifty people were given meter sticks and you asked them to measure the length of a soccer field, you would get fifty different answers. They might be closely clustered, but they would be different. Why? The soccer field isn’t changing in size. The answer is that errors are introduced in taking the measurement. Maybe the meter sticks are not all exactly one meter long. Maybe the tick marks on the meter sticks were read incorrectly. Maybe the number of meter stick lengths along the field was miscounted. Maybe the meter sticks were not laid in a straight line. There are a lot of “maybes”.

So, with fifty different answers, how long is the soccer field? Really, we can’t tell. But we can estimate its most likely length by taking the average value of all the measurements. So, let’s say that average value is 112 m. How confident are we that the actual length is 112m? If all fifty measurements lie between 111.5m and 112.5m, we’re pretty confident. But, if the fifty measurements lie between 100m and 120m, we would be far less sure. So, the spread or distribution of values makes a difference in the confidence of our estimate. Can we actually quantify measurement confidence? How can we deal with non-deterministic quantities? How can we characterize random processes or distributions of outcomes? These are all questions vital to engineering. And they are addressed in terms of probability and statistics.

DISTRIBUTIONS

With random processes we can never predict a specific outcome—that’s what makes it random. But we might be able to deduce the likelihood that a particular outcome will occur. That can be very helpful. But determining this likelihood requires knowledge of the distribution of possible outcomes. Sometimes we can infer what the distribution is; other times we cannot. In the case of a “perfect” die, we presume that each side of the die is equally likely to land face up. So, 1/6th of the time we would expect to find a “2” face up, for example. And the same would be true for each of the other numbers. Another way of saying it is that the probability of getting a “2” on any one roll of the die would be 1/6 or 0.16667.

Probability is the likelihood that an event will occur, or a particular outcome will occur. Probabilities always lie between 0.0 and 1.0. If the probability of an event is 0.0, that means it will never occur. If the probability of an outcome is 1.0, that means it is certain to occur.

What is the probability that a “1” or a “2” or a “3” or a “4” or a “5” or a “6” will occur in one roll of the die? Since this event encompasses every possible outcome, its probability must be 1.0 (presuming that the die cannot end up on an edge or corner). That is, the sum of probabilities of all possible outcomes is 1.0. This fact allows us to define a probability distribution function f(n), where n is a particular outcome. For a rolled die f(1) = f(2) = f(3) = f(4) = f(5) = f(6) = 0.16667. A plot of this function looks like this:

This is a uniform distribution or flat distribution, i.e., each outcome is equally likely to occur. And, since we have scaled the values so that the sum of the heights of the rectangles is 6 × 0.16667 = 1.0, this plot can be thought of as a probability distribution function. Mathematically, this can be written as $\sum_{n=1}^{6} f(n) = 1$. This is a property of probability distribution functions.

An event can also be more complicated, for example the rolling of two dice. Then we might define the outcome as the sum of the spots on the two dice. In this case there are 6 × 6 = 36 possible ways the dice can land, each equally likely. But in those 36 ways, there are only 11 possible outcomes: the values 2 through 12. And these values are not all equally likely to occur. There is only one way to obtain a 2: when both dice show a “1”. But there are four ways of obtaining a 5: (1,4), (4,1), (2,3), (3,2). So, if every combination of faces is equally likely to occur, one would expect a 5 to occur 4/36ths of the time and a 2 to occur 1/36th of the time. Again, it’s useful to plot the probability distribution function:

And, again, because this plot includes every possible outcome, the sum of the heights of the rectangles is one.
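The counting argument for two dice can be checked by brute force. The sketch below, which assumes two fair dice, enumerates all 36 combinations and tallies the sums; the variable names are ours. Note that Python's `Fraction` prints values in lowest terms, so 4/36 appears as 1/9.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (die1, die2) combinations and tally their sums.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability distribution function: number of ways divided by 36.
two_dice_pdf = {total: Fraction(count, 36) for total, count in sorted(ways.items())}

for total, prob in two_dice_pdf.items():
    print(total, prob)

# The heights of all the rectangles must add to one.
total_probability = sum(two_dice_pdf.values())
```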

The rolling of dice is an example of a random process which produces discrete outcomes, i.e., outcomes which one can enumerate. There are also random processes which produce non-denumerable or continuous outcomes, e.g., the angle at which a spinner stops turning. Here, the probability that the spinner will stop at exactly, say, 30.0123456° is essentially zero. The reason is that 30.0123456° is only one of an infinite number of possible values. So, how can this be useful? The answer is that if we specify a range of values over which the spinner may stop, then the probability becomes non-zero. For example, the probability that the spinner will stop between 30° and 31° is 1/360.

With continuous outcomes, we no longer speak of probability distribution functions, but rather of probability density functions. And, we can no longer scale the sum over all possible outcomes as $\sum_{n} f(n) = 1$, because we cannot enumerate individual outcomes. But we can write the equivalent expression using calculus. Let x be a continuum of outcomes and f(x) be the probability density function of the occurrence of x; then, over all possible values of x, the integral $\int_{-\infty}^{\infty} f(x)\,dx = 1$. The probability density function for the spinner is uniform and would be plotted as follows:

In the case of discrete outcomes, the sum of heights of the rectangles must add to 1.0. In the case of continuous outcomes, the area under the curve must equal 1.0. Then, the probability of obtaining a value between x = a and x = b is the area under the curve between a and b.
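To make "probability is the area under the curve" concrete, here is a rough numerical sketch for the spinner. It assumes a perfectly uniform spinner on 0° to 360° and approximates the area with a simple midpoint rule; the function names and the number of steps are illustrative choices only.

```python
def spinner_density(angle_deg):
    """Uniform probability density (per degree) for a spinner on [0, 360)."""
    return 1.0 / 360.0 if 0.0 <= angle_deg < 360.0 else 0.0

def probability_between(a_deg, b_deg, steps=10_000):
    """Approximate the area under the density between a and b (midpoint rule)."""
    width = (b_deg - a_deg) / steps
    heights = (spinner_density(a_deg + (k + 0.5) * width) for k in range(steps))
    return sum(heights) * width

p_30_31 = probability_between(30.0, 31.0)   # should be close to 1/360
p_all = probability_between(0.0, 360.0)     # total area should be close to 1
print(p_30_31, p_all)
```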

n and x in the discussion above are called random variables. Their values are distributed according to the generating random process.

Depending on the underlying random process, probability distribution or density functions can take on many forms. One of the more well-known is this one:

This is often called a bell-shaped distribution. More formally, it is known as a Gaussian distribution or normal distribution. It can be skinny or wide, but the area under the curve is always 1.0. This function is extremely important in engineering. We’ll discuss it in detail a little later.

One of the difficulties in studying random processes is that we almost never know what the probability density function (pdf) is. Sometimes we have a general idea about its shape, but not much more. In fact, to learn more, we usually infer its characteristics by taking sample outcomes. From these data, we try to estimate its form. Remember, there is some underlying process whose outcomes are probabilistically distributed. It is the characteristics of that underlying process that we are trying to determine. (In the case of Albert Hibbs, everyone’s initial expectation was that the roulette wheel had a flat or uniform distribution function, i.e., each number on the wheel was equally likely to occur. But Hibbs collected data and concluded that the distribution function was not perfectly uniform—some numbers were more likely to occur than others. Based on his estimate of the wheel’s distribution function, he was able to improve his odds of winning.)

Deducing pdfs, however, is not easy, because they don’t necessarily have analytical forms, i.e., they may not be explicitly expressible as mathematical functions. Nevertheless, we can learn something about a pdf’s characteristics by exploring its moments. It turns out that every pdf can be expressed in terms of an infinite number of parameters called moments.

In general, moments denote the effect of something which is applied at a distance. For example, in physics there is the concept of torque: the twisting effect of applying a force at the end of a lever arm. If the force is applied at right angles to the lever arm, then the torque is $T = rF$, the product of the length of the lever arm r and the magnitude of the force F. Torque is a moment.

In probability and statistics the idea is the same, except that the “something” is an outcome, and the “distance” is how far that outcome is from zero or an average value. Statistical moments characterize the shape of the distribution. For discrete and continuous random processes, respectively, the Pth moment is defined as:

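$m_P = \sum_{i=1}^{N} (x_i - \mu)^P f(x_i)$ , $m_P = \int_{-\infty}^{\infty} (x - \mu)^P f(x)\,dx$

where $\mu = \sum_{i=1}^{N} x_i f(x_i)$ and $\mu = \int_{-\infty}^{\infty} x f(x)\,dx$, respectively. For the discrete case, N is the total number of possible outcomes.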

$m_P$ is called the Pth moment about the mean. $\mu$ is the mean or the average value of x. In fact, $\mu$ is the 1st moment about zero. Each $m_P$ emphasizes a feature in the distribution of f(x). So, if we knew the values of all the $m_P$’s, we could deduce the actual probability density function f(x). Knowing all the moments tells us everything there is to know about f(x). But, there are two problems: first, there are an infinite number of moments; second, all we will have at our disposal is a set of sample outcomes produced by the underlying random process.
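For a discrete process whose distribution we do know, the moments can be computed directly as probability-weighted sums. The sketch below does this for a fair die, taking the mean as the sum of x times f(x) and the Pth moment about the mean as the sum of (x − mean)^P times f(x). The function names are our own, and exact fractions are used so the results are not blurred by rounding.

```python
from fractions import Fraction

def mean(pdf):
    """First moment about zero: sum of x * f(x) over all possible outcomes."""
    return sum(x * p for x, p in pdf.items())

def central_moment(pdf, order):
    """Moment of the given order about the mean: sum of (x - mu)**order * f(x)."""
    mu = mean(pdf)
    return sum((x - mu) ** order * p for x, p in pdf.items())

# Probability distribution function of a fair ("perfect") six-sided die.
die_pdf = {face: Fraction(1, 6) for face in range(1, 7)}

mu = mean(die_pdf)                # 7/2
m2 = central_moment(die_pdf, 2)   # 35/12, the variance
m3 = central_moment(die_pdf, 3)   # 0, because the distribution is symmetric
print(mu, m2, m3)
```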

It turns out that the first problem is not so serious, because in most applications, almost everything we would like to know about a pdf is contained in the first several moments. In fact, only rarely are we interested in more than the first four moments. The second problem is a little more serious, because the best we can do is estimate what those moments are. We will never be able to really know what they are, but with enough sampling, we can estimate them arbitrarily closely.

So, how do we proceed? First, suppose we have obtained N sample outcomes from some random process. (Note that we will now use N to denote the number of samples, not the number of possible outcomes, as before.) So, we have a list of $x_i$’s for $i = 1, \dots, N$. The first moment that we’re interested in is the 1st moment about zero, i.e., the mean. To find the mean, we use the formula $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. What’s going on here? First, why is the mean denoted by $\bar{x}$ and not $\mu$? Second, why is the formula so different from what was given before? (Actually, this is the equation that probably looks most familiar.) The answer to the first question is that we can’t calculate $\mu$. It’s a property of the underlying pdf. We can only estimate $\mu$ from our sample data. We denote that estimate as $\bar{x}$. Our expectation is that $\bar{x}$ is close to $\mu$. In fact, we’ll even be able to calculate how close it’s likely to be. Now the second question. In our original formula for calculating $\mu$, we considered all possible values of x, and we “weighted” them by their probability of occurring. In adding up (or integrating) all weighted values, we arrived at $\mu$. In calculating $\bar{x}$, however, the probability of getting a certain value of x has already been taken into account by the sampling procedure: x’s with low probabilities of occurring are not found very often in the sample. So, with the relative distribution of x’s already accounted for, adding the unweighted samples is equivalent. And, since we’re interested in the “per sample” average of x, we divide that sum by N.
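The difference between the probability-weighted mean of the underlying process and the unweighted sample mean can be seen in a short simulation. This is only a sketch: it assumes a fair die as the underlying process, and the seed and sample size are arbitrary choices made so the run is repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Draw N sample outcomes from a fair die; the sampling itself supplies the weighting.
N = 10_000
samples = [rng.randint(1, 6) for _ in range(N)]

# Unweighted sample mean: add the samples and divide by N.
x_bar = sum(samples) / N

# Probability-weighted mean of the underlying process (known here only because
# we chose the process ourselves): sum of x * f(x), which is 3.5 for a fair die.
mu = sum(face * (1 / 6) for face in range(1, 7))

print(x_bar, mu)
```

In a real measurement problem only the first calculation is available, since the underlying distribution is not known.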

So, what can we do with our new-found knowledge, an estimate of the mean? Not a lot. It simply gives us a zero reference point around which outcomes from the random process will occur. To illustrate, here are some pdfs, all with the same mean $\mu = 10$.

We begin to get some really useful information with $m_2$, the second moment about the mean. This moment also has a special name: the variance. Here the calculation is $s^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$, where $s^2$ is a sample estimate of the true variance $\sigma^2$. The variance is a measure of the spread of the distribution. Often we’re most interested in $s = \sqrt{s^2}$, the square root of the variance. This is called the standard deviation (s.d.) or standard error, depending on the application. Notice that $s$ has the same units as x, i.e., if x is in meters, then the standard deviation is in meters; if x is a score on an exam, then the standard deviation is a score.

In some cases we may not care to know much more about the pdf. For example, we might want to know how well we did on an exam. Here we are usually satisfied with how well we did with respect to the class average. Scores on an exam can be considered outcomes of a random process whose distribution is usually “bell-shaped”. So, knowing the class average and the standard deviation is information enough to make that determination. If your score is, say, 2 s.d.’s above the class average, then you know you did very well, because scores more than 2 s.d.’s above the average occur only about 2% of the time. But that last tidbit of information arises from your assuming that the distribution is “bell-shaped”. So, you could be wrong.

There are two more moments that are very helpful in characterizing a distribution function: $m_3$ and $m_4$. These two moments by themselves, however, can be difficult to interpret, because they have units of $x^3$ and $x^4$, respectively. So their magnitudes will depend on the units of the random variable x. To make these moments a bit more useful, it is customary to non-dimensionalize them, i.e., normalize them with respect to another parameter, the variance. Normalizing these two moments gives us the non-dimensional statistics skewness and kurtosis:

skewness = $\dfrac{m_3}{m_2^{3/2}}$ , and kurtosis = $\dfrac{m_4}{m_2^{2}}$.

The skewness is a measure of symmetry. A distribution with zero skewness will be perfectly symmetric. If the skewness is non-zero, the magnitude of the skewness indicates how lopsided the distribution is. Notice that we wouldn’t be able to make that interpretation if only $m_3$ were used, because different units for x would produce different values for $m_3$ even if they came from the same underlying distribution. Non-dimensionalizing parameters is a very useful practice in statistics in particular, and in engineering in general.

Finally, we have kurtosis. This is considered to be a measure of “peakedness”, i.e., how “pointy” the distribution is. For reference, the kurtosis of a Gaussian distribution is 3.0, a wonderful item of statistical trivia. And, if you want to add to your statistics vocabulary, distributions are called leptokurtic, platykurtic, or mesokurtic, depending on whether their kurtosis is greater than, less than, or equal to that of the Gaussian.
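The sketch below estimates skewness and kurtosis from samples, using the conventional normalizations: the third moment about the mean divided by the variance to the 3/2 power, and the fourth moment divided by the variance squared. The two test distributions (a Gaussian and a lopsided exponential), the seed, and the sample sizes are our own illustrative choices. The last lines show that changing the units of x leaves the skewness unchanged.

```python
import random

def central_moment(samples, order):
    """Sample estimate of the moment of the given order about the sample mean."""
    n = len(samples)
    x_bar = sum(samples) / n
    return sum((x - x_bar) ** order for x in samples) / n

def skewness(samples):
    """Third moment about the mean, normalized by variance**(3/2)."""
    return central_moment(samples, 3) / central_moment(samples, 2) ** 1.5

def kurtosis(samples):
    """Fourth moment about the mean, normalized by variance**2."""
    return central_moment(samples, 4) / central_moment(samples, 2) ** 2

rng = random.Random(1)  # fixed seed so the sketch is repeatable
gaussian = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
lopsided = [rng.expovariate(1.0) for _ in range(50_000)]

skew_gauss, kurt_gauss = skewness(gaussian), kurtosis(gaussian)
skew_lopsided = skewness(lopsided)

# Same data expressed in different units (say, cm instead of m).
skew_rescaled = skewness([100.0 * x for x in lopsided])
print(skew_gauss, kurt_gauss, skew_lopsided, skew_rescaled)
```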

There are, of course, an infinite number of additional moments to consider. But knowing the mean and variance is often enough to make engineering predictions and decisions. Remember these ideas and formulas.

Now, let’s apply some of these ideas. A little while ago I mentioned the problem of measuring the length of a soccer field with only a meter stick. Let’s explore this problem in some detail. Let’s presume that every time you lay down the meter stick you record your length. But, in fact, because of slight misplacements, the actual distance for each measurement i is $d_i = d_i^{(\mathrm{meas})} + e_i$, where the measurements are in cm, and $e_i$ is a random error. And, let’s say that the random error is -1 cm 30% of the time, 0 cm 40% of the time, and +1 cm 30% of the time. One meter at a time, you measure from one end of the soccer field to the other, and you discover it’s 110.53 m long. At least, your measurements indicate that it’s 110.53 meters long. But how long is it actually? One way to tell is by carrying out the measurement again, and again, and again.

We can simulate this problem with our statistics applet at jhu.edu/virtlab. First, let the random variable x in the applet be used to represent the error. Then define the distribution of x in terms of the individual values as

x        -1     0     1
Pr(x)    0.3   0.4   0.3

This is our definition of the error every time we take a measurement with the meter stick, i.e., our measurement will be in error by -1 cm, 0 cm, or 1 cm. Now, we need to see what effect this has on our total measurement of the soccer field. That requires, say, 111 measurements. So we can define a total error w, a random variable based on x, as w = sum(x,111). This expression will take 111 realizations of the random variable x and add them together. w will be the total error of our measurement. Because this is a computer simulation, it is easy to carry out our “measurement” 1000 times. So, set the number of realizations to 1000, and click on draw. What do you get? A fairly broad distribution of errors. Recall that this is the distribution of the sum of 111 individual errors added together. What is the average value of this distribution? It is probably close to zero. And what about the spread or standard deviation? That’s probably about 8 cm.
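If the applet is not at hand, the same experiment can be sketched in a few lines of Python. This is our own stand-in, not the applet's code: `single_error_cm` plays the role of x, `total_error_cm` plays the role of w = sum(x,111), and the seed is arbitrary. The last line gives the spread we would expect if the 111 errors are independent, since the variance of one error is 0.3 × 1 + 0.3 × 1 = 0.6 and variances of independent errors add.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is repeatable

def single_error_cm():
    """One placement error x: -1, 0, or +1 cm with probability 0.3, 0.4, 0.3."""
    return rng.choices([-1, 0, 1], weights=[0.3, 0.4, 0.3])[0]

def total_error_cm(placements=111):
    """Total error w: the sum of the errors from all placements of the stick."""
    return sum(single_error_cm() for _ in range(placements))

# Carry out the whole "measurement" 1000 times.
realizations = [total_error_cm() for _ in range(1000)]

mean_w = statistics.mean(realizations)
sd_w = statistics.pstdev(realizations)
expected_sd = (111 * 0.6) ** 0.5   # a little over 8 cm
print(mean_w, sd_w, expected_sd)
```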

Now, click on the button “Normal curve”. This will produce a Gaussian curve with the same mean and standard deviation as the displayed distribution. It looks pretty close. So, the sum of 111 errors which are distributed as –1, 0, +1 is approximately a Normal distribution. Interesting.

Usually, we are required to include some indication of accuracy when we report a measurement. That report is usually the measurement ± one standard deviation. The quantity ± one standard deviation is often called the standard error. Thus we would report the length of the soccer field as 110.53 m ± 8 cm.

That’s not bad accuracy for measuring the length of a soccer field: a standard error of ± 8 cm. But, could we do better? In the case we just calculated, a single measurement has a standard error of ± 8 cm. Would our measurement improve if we were to take an average value of several measurements? What we mean by improvement is that the standard error would be less. Again, we can use our random variable simulator to get an answer. First, let’s assume that the distribution of errors for our total measurement is Gaussian distributed with a mean of zero and a standard deviation of 8. That’s roughly what we discovered in the first simulation. Let’s begin all over again, this time defining a random variable x as being Gaussian distributed with mean zero and standard deviation 8.

Now we want to use w as the average value of a number of measurements, say 10. So, we define w = sum(x,10)/10. sum(x,10) produces the sum of ten realizations of the total error. Since we want the average error over those 10 realizations, we must divide by 10.

Carry out this calculation, say, 1000 times, and plot, as before.

What do we get? Good news. The standard deviation of the error in taking an average of 10 measurements is 2.5 cm. What happens if we take an average over 100 measurements? Try it. The standard deviation of the error in taking an average of 100 measurements is 0.8 cm. So, the more measurements we average over, the smaller the standard error. If you plot standard error as a function of M, the number of measurements in the average, you will discover that the standard error is proportional to $1/\sqrt{M}$.
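The averaging experiment can also be sketched in Python as a stand-in for the applet. Here `sd_of_average` is a hypothetical helper of our own; it assumes each total measurement error is Gaussian with mean zero and standard deviation 8 cm, as in the text, and compares the simulated spread with 8 divided by the square root of M.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is repeatable

def sd_of_average(m, sigma=8.0, realizations=1000):
    """Simulated s.d. of w = sum(x, m) / m, with x Gaussian(0, sigma)."""
    averages = [
        sum(rng.gauss(0.0, sigma) for _ in range(m)) / m
        for _ in range(realizations)
    ]
    return statistics.pstdev(averages)

results = {m: sd_of_average(m) for m in (1, 10, 100)}
for m, sd in results.items():
    # simulated spread next to the 8 / sqrt(M) prediction
    print(m, round(sd, 2), round(8.0 / m ** 0.5, 2))
```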

In fact, it’s not too difficult to prove that the s.d. of $\bar{x}$ is $\sigma/\sqrt{M}$, where $\sigma^2$ is the variance of the random variable x. So, let’s do it. To make the algebra easier, let’s define a new random variable $y = x - \mu$, so that y has zero mean. That means that the variance of y will be the same as that of x, but we can estimate it with the more compact formula $s^2 = \frac{1}{N}\sum_{i=1}^{N} y_i^2$. Suppose, now, that M y’s are averaged together. We’ll label them $y_1, y_2, \dots, y_M$. Then, we could write $s_{\bar{y}}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_{1i} + y_{2i} + \cdots + y_{Mi}}{M}\right)^2$. Note that this equation says that we will take N realizations, each consisting of M samples of y which are then averaged together. We expand this to obtain:

$s_{\bar{y}}^2 = \frac{1}{M^2}\left[\frac{1}{N}\sum_{i=1}^{N} y_{1i}^2 + \frac{1}{N}\sum_{i=1}^{N} y_{2i}^2 + \cdots + \frac{1}{N}\sum_{i=1}^{N} y_{Mi}^2\right] + \frac{2}{M^2}\left[\frac{1}{N}\sum_{i=1}^{N} y_{1i}y_{2i} + \frac{1}{N}\sum_{i=1}^{N} y_{1i}y_{3i} + \cdots + \frac{1}{N}\sum_{i=1}^{N} y_{(M-1)i}\,y_{Mi}\right]$

The first set of terms contains all the squared values of y; the second set contains all the cross products. Each of the M terms in the first set of brackets is nothing other than the variance estimate $s^2$. So the first set of terms reduces to $\frac{1}{M^2} \cdot M s^2 = \frac{s^2}{M}$. But, what is the second set? What, for example, is $\frac{1}{N}\sum_{i=1}^{N} y_{1i}y_{2i}$? Recall that the y’s are random variables having zero mean. Most importantly, $y_{1i}$ is picked or sampled totally independently of $y_{2i}$, i.e., all the y’s are independent samples. This means that the value of one y is uncorrelated with any other y. So, every cross product of y’s averages to zero, and all the terms in the second set of brackets are zero. The final result is then $s_{\bar{y}}^2 = \frac{s^2}{M}$, i.e., $s_{\bar{y}} = \frac{s}{\sqrt{M}}$.

This is the result we obtained by simulation. Actually what we’ve just said is slightly wrong. We really should say that the estimated [pic]. Or we could say the [pic]. These appear to be subtle distinctions, but in the study of statistics, these distinctions are very important. In almost all engineering applications, you will never know $\mu$ or $\sigma$; they will have to be estimated from the data.

Mathematical notation and properties of averaging

Using $\sum$ and subscript notation is a fairly cumbersome way to indicate average value. A more convenient notation is to use the overbar, i.e., $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. Then, by knowing a few simple mathematical rules, we can simplify some of our random variable expressions considerably. In what follows, suppose that constants are represented by upper-case letters and random variables are represented by lower-case letters.

Suppose we want the average value of $M + e$. Using our overbar notation we would write this as $\overline{M + e} = \bar{M} + \bar{e}$. When elements are added under an overbar, we can separate the terms into two separate averages, because the average of a sum (in this case the average of M + e) equals the sum of the averages; the terms of a finite sum can be regrouped freely. And, since M is a constant, $\bar{M}$ is just M.

There are also two simple rules for multiplication within an overbar. One of them is $\overline{Me} = M\bar{e}$. That is, the average of a constant times a random variable is the constant times the average of that random variable. The second rule of multiplication is that $\overline{ef}$, the average of the cross products, cannot be mathematically separated. But, there are some things we can say about its value. Suppose e and f are random variables, each with zero mean. If e and f are statistically independent of one another, then $\overline{ef} = 0$, i.e., the average cross product of independent random variables with zero means is zero. On the other hand, if e and f have zero means, but are not independent of one another, then $\overline{ef}$ is the covariance between them. In the special case where e and f are the same variable, we would get the expression $\overline{e^2}$, which is the variance of e.
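These averaging rules are easy to check numerically. In the sketch below, `avg` stands in for the overbar, M is a constant, and e and f are independent zero-mean Gaussian samples; the standard deviations (2 and 3), the seed, and the sample size are arbitrary choices of ours. The first two rules hold to rounding error for any sample, while the cross product and variance only approach their ideal values as the sample grows.

```python
import random

rng = random.Random(4)  # fixed seed so the sketch is repeatable
n = 100_000

def avg(values):
    """The overbar: the plain average of a list of values."""
    return sum(values) / len(values)

M = 5.0                                        # a constant
e = [rng.gauss(0.0, 2.0) for _ in range(n)]    # zero-mean random variable
f = [rng.gauss(0.0, 3.0) for _ in range(n)]    # zero mean, independent of e

# Rule for sums: average of (M + e) minus (M + average of e) should vanish.
sum_rule_gap = avg([M + ei for ei in e]) - (M + avg(e))

# Rule for constants: average of (M * e) minus M * (average of e) should vanish.
scale_rule_gap = avg([M * ei for ei in e]) - M * avg(e)

# Average cross product of independent zero-mean variables: near zero.
cross_product = avg([ei * fi for ei, fi in zip(e, f)])

# Average of e * e: near the variance of e, which is 2**2 = 4 here.
variance_e = avg([ei * ei for ei in e])
print(sum_rule_gap, scale_rule_gap, cross_product, variance_e)
```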

This overbar notation is quite standard in those areas of engineering where the problems themselves contain random quantities, for example, in the study of turbulence. We'll use the notation a little further on.

The Gaussian distribution

Why is it that any time we add together a bunch of random variables, the resulting distribution looks bell-shaped? For example, the score on each question of an exam could be considered a random variable; the distribution of total scores for the exam is almost always bell-shaped. If we add together 111 measurements, each of which could contain a random error, the resulting total error has a distribution which is bell-shaped. If we count the total number of dots on a throw of 10 dice, the distribution of dots is bell-shaped. What is even more remarkable is that if we add together the outcomes of 100 random variables, the sum is bell-shaped almost no matter what the probability distribution is for each of the random variables.

Run some experiments using the random function simulator. Define a random variable x with flat distribution between 0 and 1. Construct a new random variable w as the sum of ten realizations of x, i.e., w= sum(x,10). Obtain 1000 realizations. And, what do you get? A distribution that is very close to Gaussian. Define a random variable y as a Gaussian with a mean of 0 and a standard deviation of 2, and define w =sum(y,10). Obtain 1000 realizations. And, what do you get? A distribution that is very close to Gaussian. In fact, construct any random variable x with as wild a probability distribution as you can think of. Define w as w=sum(x,10). And w will tend to have a Gaussian distribution.
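As a stand-in for the simulator, the first of these experiments can be sketched in Python. The code sums ten uniform random numbers, repeats that 1000 times, and compares the result with what theory gives for such a sum (mean 10 × 0.5 = 5 and variance 10 × 1/12). As a rough check on the shape, it also counts the fraction of outcomes within one standard deviation of the mean, which is about 68% for a Gaussian. The seed and variable names are our own.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# w = sum(x, 10) with x uniform on [0, 1); 1000 realizations, as in the text.
w = [sum(rng.random() for _ in range(10)) for _ in range(1000)]

mean_w = statistics.mean(w)      # theory: 5
sd_w = statistics.pstdev(w)      # theory: sqrt(10 / 12), about 0.91

# Rough shape check: fraction of outcomes within one s.d. of the mean.
within_one_sd = sum(abs(v - mean_w) <= sd_w for v in w) / len(w)
print(mean_w, sd_w, within_one_sd)
```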

This remarkable result has been observed for a hundred years, but only little by little has the observation been converted into a mathematical theorem. This theorem is the Central Limit Theorem. In short, the central limit theorem says that if one takes the sum of outcomes of a set of random variables (with suitable restrictions), the resulting sum will tend to have a Gaussian distribution. Actually, there are a number of central limit theorems, each with its own list of restrictions. And, the topic is still open for research. A Gaussian or normal probability distribution function has the form:

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$

Of course, if you integrate over all x, it integrates to 1.0. The most important feature of this function is that it is fully defined by the mean $\mu$ and the standard deviation $\sigma$. None of the other moments need be known to specify this function. Some other properties: it is symmetric, so we know that all odd moments about the mean are zero. And, since earlier we mentioned kurtosis, the normalized fourth moment about the mean, we will state here that the kurtosis of a Gaussian is 3.0. (That’s a genuine factoid.) Sometimes this function will be referred to as $N(\mu, \sigma)$, a normal distribution with mean $\mu$ and standard deviation $\sigma$. For example, if you read that a variable x is distributed as N(0,1), you should know what that means.
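Both claims, that the Gaussian integrates to 1.0 and that its kurtosis is 3.0, can be checked numerically. The sketch below uses the standard Gaussian density with an arbitrarily chosen mean of 10 and standard deviation of 2, and a simple midpoint rule over ±10 standard deviations; the helper names `gaussian_pdf` and `integrate` are ours.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian (normal) probability density with mean mu and s.d. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(func, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of func from a to b."""
    width = (b - a) / steps
    return sum(func(a + (k + 0.5) * width) for k in range(steps)) * width

mu, sigma = 10.0, 2.0
lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma   # tails beyond this are negligible

area = integrate(lambda x: gaussian_pdf(x, mu, sigma), lo, hi)
m2 = integrate(lambda x: (x - mu) ** 2 * gaussian_pdf(x, mu, sigma), lo, hi)
m4 = integrate(lambda x: (x - mu) ** 4 * gaussian_pdf(x, mu, sigma), lo, hi)
kurt = m4 / m2 ** 2
print(area, m2, kurt)
```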

Although the Gaussian is a fairly complicated function to work with, we do have its exact functional form. This means we can learn anything we need to know about it, either through mathematical analysis or by tabulation. Earlier I mentioned that if your score on an exam was two standard deviations above the average, only about 2% of the class would have a higher score than you. The reason I could make that statement is that the total score on an exam, being a sum, is approximately Gaussian distributed. And it has been tabulated that about 95% of the area of a Gaussian distribution lies between $-2\sigma$ and $+2\sigma$ from the mean. Since the Gaussian is symmetric, half of the remaining 5% must lie below $-2\sigma$ and the other half must lie above $+2\sigma$.
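The tabulated areas can be reproduced with the error function in Python's standard `math` module: for a Gaussian, the fraction of area within k standard deviations of the mean is erf(k/√2). The function name below is our own.

```python
import math

def gaussian_area_within(k):
    """Fraction of a Gaussian's area lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

within_1 = gaussian_area_within(1.0)   # about 0.683
within_2 = gaussian_area_within(2.0)   # about 0.954
above_2 = (1.0 - within_2) / 2.0       # upper tail beyond +2 s.d., about 0.023
print(within_1, within_2, above_2)
```

So the "about 2%" quoted for a score two standard deviations above the average is the upper tail, and it depends on the distribution really being Gaussian.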

Now, think about this question: Suppose you measure the length of a soccer field 10 times and take the average value as your best estimate. What’s the distribution of the average value? Of course, since we’re talking about Gaussian distributions, you’ll say “Gaussian”. (And, you’d be right.) But, think about what an average is. It’s the sum of a sequence of values divided by a constant. Since it’s a sum, its variation will tend to be Gaussian. And, we noted earlier that the standard deviation of an average is $\sigma/\sqrt{N}$, where $\sigma$ is the standard deviation of each individual element and N is the number of elements in the sum.

The present discussion should also make it a little clearer why we might denote the quality of a measurement D by expressing it as $D \pm \sigma$, an expected value plus or minus the standard error. Since any measurement is likely to be contaminated by any number of contributing errors, the total error in D is likely to be Gaussian distributed. That means that a measurement of, say, 100 m $\pm$ 10 cm will incorporate the true value 68% of the time: the area under a Gaussian curve between $-1\sigma$ and $+1\sigma$.

The Gaussian curve is certainly the most important in science and engineering. Unfortunately, it is not universal. There are a number of random processes that generate other probability distributions, e.g., a Poisson process.

And life is a little more complicated than we’ve led you to believe. Recall, we don’t know $\mu$ and $\sigma$. We can only estimate them with $\bar{x}$ and $s^2$. So, with uncertainties in $\mu$ and $\sigma$, we can’t really justify the precise probabilities that we’ve mentioned. A more thorough study of probability and statistics would show us how to deal with the problem.

ESTIMATION AND VARIANCE

Let’s talk a bit more about estimation. Estimation is the process of trying to determine properties of a population by sampling. Almost always, this consists of taking samples—whether it’s determining the distribution of spots on a pair of rolled dice or determining the errors in measuring the length of a soccer field. The idea is to collect data whose characteristics most closely match those of the population. Then, we can calculate sample statistics which we hope will represent the population as a whole.

As we have seen, some ways are better than others. What do we mean when we say “better”? We mean that the variance of our statistic is smaller. The goal is always to arrive at a sample statistic whose variance is smallest. Every estimate we make has a variance. We can even talk about the variance of the sample variance. That is, when we estimate the sample variance of a random variable as $s^2$, that estimate of $s^2$ is just one value of a distribution of possible sample variances. So, sampling strategies can be important.

In finding the length of a soccer field, for example, it wasn’t so much the sampling strategy, but rather how to use the data. Recall that the variance of a single measurement was $s^2$, but the variance of an average of M measurements was $s^2/M$. So, we can reduce variance by taking an average.

In some situations, we have little choice for a sampling strategy. For example, if we want to find the distribution of outcomes from a roulette wheel, there is not much choice but to spin and record, spin and record, spin and record.

But, there are some instances where one does have a choice. Consider the following schematic:

This represents a plot of land with trees. Area A is sparsely populated with trees; areas B are densely populated with trees. The total area is known and is very large, say, thousands of square kilometers. The task is to estimate how many trees are in the plot.

Since there are, perhaps, tens of millions of trees, counting them is out of the question. Sampling is the only reasonable approach. First, what will you measure? Since you know the total area, you can sample tree density $\rho$, e.g., trees per hectare (100 m x 100 m), and calculate the total number of trees from that. Recall that one of the goals of sampling is to accurately represent the total population in the sample. In this case, the population is the distribution of trees. And its distribution would look like this:

One strategy is to randomly pick locations within the area. Then, use those as center points about which you measure your 100 m x 100 m sections. Then count the trees in each of these sections. Since the locations are chosen randomly, you can expect, on average, to sample the right proportion of A and B areas. So, you expect to be able to estimate a representative statistic: average trees per hectare. The difficulty is that the statistic you would deduce would have a fairly high variance.

Is there another sampling strategy that would have a lower variance? The answer is yes. The land area consists of two subpopulations: high tree density and low tree density. The sampling strategy we outlined above is based on random sampling over the entire area. It turns out that we can significantly lower the variance of our estimates if we separately sample each of the subareas and combine their results. This is the technique of stratified sampling.

Here’s why it works. First, we’ll simplify the problem so we don’t get bogged down in algebra. Make the assumption that half the area is of type A, and half of type B. And, let’s denote the sample average tree density of the two areas as $\bar{\rho}_A$ and $\bar{\rho}_B$. Since the areas are of equal size, we can write the average density of the whole plot as $\bar{\rho} = (\bar{\rho}_A + \bar{\rho}_B)/2$. If we define $D = \bar{\rho} - \bar{\rho}_A$, then $-D = \bar{\rho} - \bar{\rho}_B$. That is, D is the difference between the global mean $\bar{\rho}$ and the individual means $\bar{\rho}_A$ and $\bar{\rho}_B$. Now we can write the sample variance of $\rho$ as:

$$s^2 = \frac{1}{N}\sum_{i=1}^{N}(\rho_i - \bar{\rho})^2$$

But, if we reorganize this equation to indicate from which area the samples were taken, we would obtain:

$$s^2 = \frac{1}{N}\left[\sum_{i=1}^{N_A}(\rho_{A,i} - \bar{\rho})^2 + \sum_{j=1}^{N_B}(\rho_{B,j} - \bar{\rho})^2\right]$$

where $N_A + N_B = N$. But $\bar{\rho}$ can be expressed in terms of D and the individual means: $\bar{\rho} = \bar{\rho}_A + D = \bar{\rho}_B - D$. So:

$$s^2 = \frac{1}{N}\left[\sum_{i=1}^{N_A}(\rho_{A,i} - \bar{\rho}_A - D)^2 + \sum_{j=1}^{N_B}(\rho_{B,j} - \bar{\rho}_B + D)^2\right]$$

Expanding the squared terms, we get:

$$s^2 = \frac{1}{N}\left[\sum_{i=1}^{N_A}\left((\rho_{A,i} - \bar{\rho}_A)^2 - 2D(\rho_{A,i} - \bar{\rho}_A) + D^2\right) + \sum_{j=1}^{N_B}\left((\rho_{B,j} - \bar{\rho}_B)^2 + 2D(\rho_{B,j} - \bar{\rho}_B) + D^2\right)\right]$$

The first sum can be rewritten as:

$$\sum_{i=1}^{N_A}(\rho_{A,i} - \bar{\rho}_A)^2 + N_A D^2 = N_A s_A^2 + N_A D^2$$

where $s_A^2$ is the sample variance of the samples taken from region A. What happened to the middle term? It’s zero because it’s simply the sum of the observations about their mean. The second sum in the previous equation gives a similar result. So, now we can write:

$$s^2 = \frac{N_A}{N}s_A^2 + \frac{N_B}{N}s_B^2 + D^2$$

What does this say? It says that the variance of a set of random samples taken from the entire plot consists of three elements: a weighted variance of samples taken from region A, a weighted variance of samples taken from regions B, and the square of D, which is set by the difference between the mean tree densities of the two areas. So, even if the distribution of trees is very narrow in each of the two types of areas, the sample variance could be large, simply because of the difference in the average tree density between the two areas. There must be a better way! And, there is.

Rather than carrying out global random sampling, carry out stratified random sampling. That is, sample the areas A and B separately and obtain $\bar{\rho}_A$ and $\bar{\rho}_B$. We have eliminated the last term in the equation for $s^2$, at least almost. D is, of course, constant; but we don’t know what that constant is. We can only estimate it from the sample data. And that estimate will have some variation. It is that variation that we must add to the total variance of $\rho$.

[pic]
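The benefit of stratified sampling can be seen in a small simulation. This is a sketch under stated assumptions, not the method used to produce any result in these notes: the tree densities (20 and 200 trees per hectare), their spreads, the Gaussian model for each region, and all function names are invented for illustration. As in the simplified derivation, regions A and B are assumed to cover equal areas.

```python
import random
import statistics

# Assumed tree densities (trees per hectare); these numbers are illustrative only.
MEAN_A, SD_A = 20.0, 5.0     # sparse region A
MEAN_B, SD_B = 200.0, 20.0   # dense regions B

def sample_density(rng, region):
    """Measured density of one randomly placed hectare in the given region."""
    if region == "A":
        return rng.gauss(MEAN_A, SD_A)
    return rng.gauss(MEAN_B, SD_B)

def global_estimate(rng, n):
    """Mean density from n hectares placed at random over the whole plot."""
    # With equal areas, each random location falls in A or B with probability 1/2.
    samples = [sample_density(rng, rng.choice("AB")) for _ in range(n)]
    return sum(samples) / n

def stratified_estimate(rng, n):
    """Mean density from n/2 hectares in A and n/2 in B, combined with equal weights."""
    half = n // 2
    mean_a = sum(sample_density(rng, "A") for _ in range(half)) / half
    mean_b = sum(sample_density(rng, "B") for _ in range(half)) / half
    return 0.5 * (mean_a + mean_b)

def spread(estimator, n, trials, seed=0):
    """Standard deviation of the estimate over many repeated surveys."""
    rng = random.Random(seed)
    return statistics.pstdev([estimator(rng, n) for _ in range(trials)])

global_spread = spread(global_estimate, n=20, trials=2000)
stratified_spread = spread(stratified_estimate, n=20, trials=2000)
print(f"global random sampling: spread = {global_spread:.1f} trees/ha")
print(f"stratified sampling:    spread = {stratified_spread:.1f} trees/ha")
```

With these assumed numbers the stratified estimate is several times less variable for the same number of surveyed hectares, because the large $D^2$ contribution no longer enters through the random mix of A and B sections.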

PROPAGATION OF ERROR

To conclude this section on random processes we’ll discuss their impact on measurement error. Usually, we will take some measurement $m_i$ and presume that it consists of the real value m and some error $e_i$, i.e., $m_i = m + e_i$. And we presume that the error $e_i$ has zero average. This means that the average value of a measurement $m_i$ would be expected to equal the true value m. If the $e_i$’s do not have zero average then our measurement is biased. Or we can say that it contains a systematic error. Of course, we always try to carry out a measurement that does not contain a systematic error.

But just because we can carry out a measurement with an average error of zero does not mean that we will be free of systematic error. Depending on how we use that measurement, it is possible to introduce one. We can illustrate this with a very simple example. Suppose we want to estimate the area of a square. So, we measure the length of a side, and we square that value. Just to make sure, we do this a number of times and take an average:

$$\bar{A} = \frac{1}{N}\sum_{i=1}^{N}(m + e_i)^2 = m^2 + \frac{2m}{N}\sum_{i=1}^{N}e_i + \frac{1}{N}\sum_{i=1}^{N}e_i^2$$

The right hand side of this equation has three terms. The first is the true area $m^2$. The second is the average of a random variable with zero mean. So, on average, it’s zero. But the third is the average of a random variable which is squared. So no item in the sum is negative, and the term does not average to zero. This term has introduced a bias into our estimate of A, even though our measurement error was not biased. So, on average, we will overestimate A.

We have created this bias because we have used the measurement (and the error) in a non-linear way. This means that the error does not appear in the calculations just as a first power. Here, in one term, the error is squared. Is there a way to carry out the measurement so that we don’t introduce such a problem? In general, no. But, in this case, yes. All we need to do is to take two measurements of the square: one to measure the height and one to measure the width, even though they're supposed to be the same length. If we do that, then we can calculate the average area as

$$\bar{A} = \frac{1}{N}\sum_{i=1}^{N}(h + e_i)(w + \epsilon_i) = hw + \frac{h}{N}\sum_{i=1}^{N}\epsilon_i + \frac{w}{N}\sum_{i=1}^{N}e_i + \frac{1}{N}\sum_{i=1}^{N}e_i\epsilon_i$$

where h and w are the true height and width of the square, and $e_i$ and $\epsilon_i$ are the errors in taking those measurements. Now, if you look at the terms, all but the first average to zero. Terms two and three average to zero because the random errors $e_i$ and $\epsilon_i$ have zero averages. Term four is the average of the products $e_i\epsilon_i$ of errors which are independent of one another. So, the expected value of this product is zero: some terms will be positive; some will be negative.
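The two ways of estimating the area of the square can be compared with a quick simulation. This is a minimal sketch: the side length of 10, the error standard deviation of 0.5, the Gaussian error model, and the function names are all assumptions made for illustration.

```python
import random

def mean_area_one_side(m, sigma, n, rng):
    """Average of (m + e_i)^2: one side is measured and the result is squared."""
    total = 0.0
    for _ in range(n):
        side = m + rng.gauss(0.0, sigma)
        total += side * side
    return total / n

def mean_area_two_sides(h, w, sigma, n, rng):
    """Average of (h + e_i)(w + eps_i): height and width are measured independently."""
    total = 0.0
    for _ in range(n):
        height = h + rng.gauss(0.0, sigma)
        width = w + rng.gauss(0.0, sigma)
        total += height * width
    return total / n

rng = random.Random(2)
side, sigma, n = 10.0, 0.5, 400000   # assumed true side, error size, and sample count
bias_one = mean_area_one_side(side, sigma, n, rng) - side**2
bias_two = mean_area_two_sides(side, side, sigma, n, rng) - side**2
print(f"bias when squaring one measurement:    {bias_one:+.3f}")
print(f"bias with two independent measurements: {bias_two:+.3f}")
```

For zero-mean errors the average of $e_i^2$ is the error variance, so the first bias should settle near $\sigma^2 = 0.25$, while the second should hover around zero.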

An example of a measurement which will always have a biased error is estimating the height of a tree by measuring out some distance X from the tree, then measuring the angle $\theta$ to the top of the tree. Then, the height of the tree is $h = X\tan(\theta)$.

In N measurements, we would obtain an average height of the tree as:

$$\bar{h} = \frac{1}{N}\sum_{i=1}^{N}(X + e_i)\tan(\theta + \epsilon_i) = \frac{X}{N}\sum_{i=1}^{N}\tan(\theta + \epsilon_i) + \frac{1}{N}\sum_{i=1}^{N}e_i\tan(\theta + \epsilon_i)$$

where $e_i$ is the error in measuring X and $\epsilon_i$ is the error in measuring $\theta$. Here, the mathematics begins to get sticky. On average the second term is zero because it is the product of one random error times a function of another, independent, random error. But the average of the first term is not $X\tan(\theta)$. Its value will depend on $\epsilon_i$. And, since there is no closed form expression which separates $\epsilon_i$ from the other variables, we can't even calculate its effect. However, if you're curious, you can experiment with the random function simulator and see for yourself. Especially when $\theta$ is large, say greater than 60°, the nonlinearity in the problem yields quite biased results.
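For readers without access to the simulator mentioned above, the experiment can be sketched in a few lines. Everything numerical here is an assumption for illustration: a distance of 20 m, a distance error of 0.1 m, an angle error of 2 degrees, Gaussian errors, and the function name `mean_height_bias`.

```python
import math
import random

def mean_height_bias(distance, angle_deg, sigma_dist, sigma_angle_deg, n, seed=3):
    """Average of (X + e_i) * tan(theta + eps_i) minus the true height X * tan(theta)."""
    rng = random.Random(seed)
    theta = math.radians(angle_deg)
    sigma_theta = math.radians(sigma_angle_deg)
    total = 0.0
    for _ in range(n):
        x = distance + rng.gauss(0.0, sigma_dist)    # noisy distance measurement
        angle = theta + rng.gauss(0.0, sigma_theta)  # noisy angle measurement
        total += x * math.tan(angle)
    return total / n - distance * math.tan(theta)

bias_30 = mean_height_bias(20.0, 30.0, 0.1, 2.0, 200000)
bias_70 = mean_height_bias(20.0, 70.0, 0.1, 2.0, 200000)
print(f"bias at 30 degrees: {bias_30:+.3f} m")
print(f"bias at 70 degrees: {bias_70:+.3f} m")
```

With the same measurement errors, the overestimate at 70 degrees is far larger than at 30 degrees, because the tangent curves upward much more steeply at large angles.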

The Calculus of errors

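There's another way to estimate the effect of measurement error that has nothing to do with probability or random processes. It involves the way a function F(x,y) changes as x and y change, which is the essential idea of calculus. We'll begin with a problem. Suppose we want to calculate the volume of a structure that consists of a cone resting upon a rectangular parallelepiped. The total volume of this structure is:

$$V = \frac{1}{3}\pi R^2 H_c + LWH$$

where R and $H_c$ are the radius and height of the cone, and L, W, and H are the length, width, and height of the parallelepiped.

We will not measure V directly, but rather we'll calculate V by taking measurements of R, $H_c$, L, W, and H. But suppose these measurements are not perfectly accurate. So the question is how much error will we introduce into our calculation of V by using inaccurate values for the measured variables?

If each measurement is in error, then the calculated volume would consist of the true volume V plus an error v. Writing the measurement errors as the lowercase letters r, $h_c$, l, w, and h, the relation between the error-borne measurements and the resulting calculated volume would be:

$$V + v = \frac{1}{3}\pi (R + r)^2 (H_c + h_c) + (L + l)(W + w)(H + h)$$

Expanding this equation, then subtracting out the equation for V, we get

$$v = \frac{\pi}{3}\left[R^2 h_c + 2RH_c r + 2R\,r h_c + H_c r^2 + r^2 h_c\right]$$

$$\quad + LWh + LHw + WHl + Lwh + Wlh + Hlw + lwh$$

First, let's look at v statistically, i.e., what would be the average value of the error v if measurements were taken many times and averaged. Using overbar notation, we obtain

$$\bar{v} = \frac{\pi}{3}\left[R^2\,\overline{h_c} + 2RH_c\,\overline{r} + 2R\,\overline{r h_c} + H_c\,\overline{r^2} + \overline{r^2 h_c}\right] + LW\,\overline{h} + LH\,\overline{w} + WH\,\overline{l} + L\,\overline{wh} + W\,\overline{lh} + H\,\overline{lw} + \overline{lwh}$$

If $h_c$, r, l, w, and h are all independent random variables with zero mean, then we get

$$\bar{v} = \frac{\pi}{3} H_c\,\overline{r^2}$$

All the other terms average to zero.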
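The claim that only the squared radius error survives averaging can be tested by simulation. This is a sketch under assumed values: the dimensions, the common error standard deviation of 0.1, the Gaussian error model, and the helper names `volume` and `mean_volume_error` are illustrative choices, not part of these notes.

```python
import math
import random

def volume(R, Hc, L, W, H):
    """Volume of a cone (radius R, height Hc) resting on an L x W x H box."""
    return math.pi * R**2 * Hc / 3.0 + L * W * H

def mean_volume_error(dims, sigma, n, seed=4):
    """Average error v when every dimension carries an independent zero-mean error."""
    rng = random.Random(seed)
    true_volume = volume(*dims)
    total = 0.0
    for _ in range(n):
        noisy = [d + rng.gauss(0.0, sigma) for d in dims]
        total += volume(*noisy) - true_volume
    return total / n

dims = (1.0, 3.0, 1.0, 1.0, 1.0)  # assumed R, Hc, L, W, H
sigma = 0.1                       # assumed standard deviation of every measurement error
observed = mean_volume_error(dims, sigma, n=200000)
predicted = math.pi / 3.0 * dims[1] * sigma**2   # (pi/3) * Hc * mean(r^2)
print(f"observed mean error {observed:.4f}, predicted {predicted:.4f}")
```

The average error is small compared with the scatter of any single calculation, which is why a large number of trials is needed before the bias becomes visible.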

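Thus, we have deduced an average expected error in v. But, suppose we can't take a lot of sample measurements, and we would like to know what is the "worst case" error. That's fairly simple. Suppose we can estimate the maximum possible error on each measurement. Let these be labeled $r_{max}$, $l_{max}$, $h_{c,max}$, $h_{max}$, and $w_{max}$. Then

$$v_{max} = \frac{\pi}{3}\left[R^2 h_{c,max} + 2RH_c r_{max} + 2R\,r_{max}h_{c,max} + H_c r_{max}^2 + r_{max}^2 h_{c,max}\right]$$

$$\quad + LWh_{max} + LHw_{max} + WHl_{max} + Lw_{max}h_{max} + Wl_{max}h_{max} + Hl_{max}w_{max} + l_{max}w_{max}h_{max}$$

A simplification of this is to assume that $r_{max}$, $l_{max}$, $h_{c,max}$, $h_{max}$, and $w_{max}$ are all very small compared to R, L, $H_c$, H, and W. Then terms containing two or more of these maximum errors will be much smaller than terms containing only one. Consequently, if we ignore these smaller terms, $v_{max}$ can be approximated as

$$v_{max} \approx \frac{\pi}{3}\left[R^2 h_{c,max} + 2RH_c r_{max}\right] + LWh_{max} + LHw_{max} + WHl_{max}$$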
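To see how little is lost by dropping the higher-order terms, the exact and simplified worst-case errors can be compared directly. This is a minimal sketch; the dimensions and the maximum errors of 0.01 are assumed values, and the function names are illustrative.

```python
import math

def volume(R, Hc, L, W, H):
    """Volume of a cone (radius R, height Hc) resting on an L x W x H box."""
    return math.pi * R**2 * Hc / 3.0 + L * W * H

def v_max_exact(R, Hc, L, W, H, r, hc, l, w, h):
    """Worst case: every measurement is too large by its maximum error."""
    return volume(R + r, Hc + hc, L + l, W + w, H + h) - volume(R, Hc, L, W, H)

def v_max_linear(R, Hc, L, W, H, r, hc, l, w, h):
    """Simplified worst case: keep only terms containing a single maximum error."""
    cone = math.pi / 3.0 * (R**2 * hc + 2.0 * R * Hc * r)
    box = L * W * h + L * H * w + W * H * l
    return cone + box

dims = (2.0, 3.0, 4.0, 5.0, 6.0)          # assumed R, Hc, L, W, H
errors = (0.01, 0.01, 0.01, 0.01, 0.01)   # assumed r_max, hc_max, l_max, w_max, h_max
exact = v_max_exact(*dims, *errors)
linear = v_max_linear(*dims, *errors)
print(f"exact {exact:.5f}, simplified {linear:.5f}")
```

For errors this small relative to the dimensions, the two values differ by well under one percent, and the simplified form always comes out slightly smaller because the dropped terms are all positive.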

There's a reason for making this simplification, even though it's only an approximation to the maximum error in v. The reason is that this simplified expression is easily deduced for any combination of measurements. This is the result that one would obtain by taking the total differential of the function $V(R, H_c, L, W, H)$. The total differential is defined like this: If F is a differentiable function depending on n variables $x_1, x_2, \ldots, x_n$, then infinitesimal variations in F are determined by infinitesimal variations in the $x_i$ as:

$$dF = \frac{\partial F}{\partial x_1}dx_1 + \frac{\partial F}{\partial x_2}dx_2 + \cdots + \frac{\partial F}{\partial x_n}dx_n$$

dF is called the “total differential”. Our variations $r_{max}$, $l_{max}$, $h_{c,max}$, $h_{max}$, and $w_{max}$ are not infinitesimal. But, if they're quite small, then the total differential is a pretty good approximation to the total error $v_{max}$ (that is, dF in the notation immediately above).
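Because the total differential only needs partial derivatives, the same worst-case estimate can be produced for any function without doing the algebra by hand. The sketch below approximates each partial derivative with a central finite difference; that numerical shortcut, the use of absolute values so that no errors cancel, the step size, and the name `worst_case_error` are all assumptions of this example rather than part of the method described above.

```python
import math

def volume(R, Hc, L, W, H):
    """Volume of a cone (radius R, height Hc) resting on an L x W x H box."""
    return math.pi * R**2 * Hc / 3.0 + L * W * H

def worst_case_error(F, values, max_errors, step=1e-6):
    """Sum of |dF/dx_i| * |dx_i|, with each partial taken by central difference."""
    total = 0.0
    for i, (x, dx) in enumerate(zip(values, max_errors)):
        up = list(values)
        down = list(values)
        up[i] = x + step
        down[i] = x - step
        partial = (F(*up) - F(*down)) / (2.0 * step)
        total += abs(partial) * abs(dx)
    return total

dims = (2.0, 3.0, 4.0, 5.0, 6.0)   # assumed R, Hc, L, W, H
errors = (0.01,) * 5               # assumed maximum error on each measurement
print(f"worst-case error from the total differential: {worst_case_error(volume, dims, errors):.5f}")
```

Any other formula can be passed in place of `volume`, which is the practical appeal of the total differential: the error estimate follows mechanically from the function itself.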

Notice that the value for vmax really is an error that you would never expect to have. Not only are all of the measurement errors assumed to be at their maximum, but all of them are contributing with the same sign. That is, there are no canceling errors. If the measurement errors are truly random variables with zero mean, the expected error in a calculated value of V would be much less. Nevertheless, estimating error using this calculus can be extremely valuable, especially if you must absolutely determine some parameter within a specified error.

Another way of representing this maximum error is by percentages. In our example, if the total volume V is separated into its constituent pieces $V = V_c + V_p$, where the subscripts c and p refer to the cone and parallelepiped, respectively, then the simplified expression for $v_{max}$ can be rewritten as

$$v_{max} \approx V_c\left(2\frac{r_{max}}{R} + \frac{h_{c,max}}{H_c}\right) + V_p\left(\frac{l_{max}}{L} + \frac{w_{max}}{W} + \frac{h_{max}}{H}\right)$$

This equation shows that percentage errors in the parallelepiped measurements, e.g., $l_{max}/L$, are linearly additive with weight $V_p$, whereas a percentage error in the measurement of R is doubly additive with weight $V_c$.
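The percentage form lends itself to a quick calculation. This sketch assumes the same illustrative dimensions and maximum errors of 0.01 used nowhere in the original notes; the function name is hypothetical.

```python
import math

def worst_case_from_relative_errors(R, Hc, L, W, H, r, hc, l, w, h):
    """Approximate v_max as relative errors weighted by the cone and box volumes."""
    v_cone = math.pi * R**2 * Hc / 3.0
    v_box = L * W * H
    # The radius enters as R^2, so its relative error counts twice.
    cone_part = v_cone * (2.0 * r / R + hc / Hc)
    box_part = v_box * (l / L + w / W + h / H)
    return cone_part + box_part

dims = (2.0, 3.0, 4.0, 5.0, 6.0)          # assumed R, Hc, L, W, H
errors = (0.01, 0.01, 0.01, 0.01, 0.01)   # assumed r_max, hc_max, l_max, w_max, h_max
v_max = worst_case_from_relative_errors(*dims, *errors)
total = math.pi * dims[0]**2 * dims[1] / 3.0 + dims[2] * dims[3] * dims[4]
print(f"v_max = {v_max:.5f}, or {100.0 * v_max / total:.2f}% of the total volume")
```

Writing the error this way makes it easy to see which measurement deserves the most care: the one whose relative error is multiplied by the largest weight.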

For more information on this technique, look under “total differential” or “calculus of errors” in elementary calculus books.
