Random processes



Uncertainty

Many physical processes, like the falling of a ball under the influence of gravity, or the motion of the sun across the sky, or the acceleration of air across the surface of an airplane wing, can be modeled explicitly as a set of equations. Then, if we want to predict, for example, the distance a ball falls in one second under gravity when it starts from rest, we simply plug in the value for gravity g = 9.8 m/s² and t = 1 s, and use the predictive equation s = ½ g t² to obtain s = 4.9 m. At this level of sophistication, this is a deterministic process, i.e., we can specify the parameters and initial conditions and then calculate the results.
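As a small illustration of what "deterministic" means here, the falling-ball prediction can be written as a function: the same inputs always give the same output. This is only a sketch; the function name fall_distance is ours, not part of any library.

```python
def fall_distance(t, g=9.8):
    """Distance in meters fallen from rest after t seconds, ignoring air resistance."""
    return 0.5 * g * t ** 2

# The same inputs always produce the same prediction.
for t in (0.5, 1.0, 2.0):
    print(f"t = {t:.1f} s  ->  s = {fall_distance(t):.2f} m")
```

Calling fall_distance(1.0) reproduces the one-second prediction from the text.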

Let’s contrast this with another physical process: that of rolling a die. Now, what is the predictive equation for the outcome? In principle, we could come up with a set of equations, but they would be horrendously complicated and not at all practical. So, in effect, we have a process whose outcome is unpredictable. When a process produces an outcome that is almost completely unpredictable, we say it is a random process or stochastic process.

The distinction between a deterministic process and a stochastic one is not absolute. In our rolling die example, we could produce a set of equations to model its motion. But the equations would depend critically on a number of parameters, whose specifications might be almost impossible to prescribe. For example, the initial orientation of the die, its initial release velocity and angular momentum, its height above the table, and the characteristics of the table’s surface would all play critical roles in how the die would come to rest. Even the geometry would be important: whether the edges were sharp or rounded and, if rounded, by how much. In practice you could never prescribe these parameters with sufficient accuracy to ensure that the equations would predict the outcome. So, in practice, the process is random.

(But sometimes, an apparently-perfect random process is not so perfect. The roulette wheel in gambling casinos is supposed to be perfectly random. In this apparatus a small ball is supposed to fall with equal probability in one of the 38 numbered slots in the wheel. But some years ago, a graduate student, Albert Hibbs, at the University of Chicago and visiting Las Vegas, began recording the slots into which the ball fell on one roulette wheel. After several days, he realized that the wheel favored some slots over others, i.e., it was imperfectly made. So with some strategic wagering he was able to “beat the odds”, and, additionally, made a name for himself.)

How completely we model a physical process will determine how well we can predict behavior. Usually we try to include those parameters that have large and systematic influences. That is, we try to include parameters that affect the process in a significant and expected way. In our first example of a falling ball, our model was the simple relationship s = ½ g t². One of the parameters neglected in this model is air resistance. If we were to use this simple equation as a predictor of an actual experiment, we would always overestimate the distance, because the neglected ingredient always acts to slow the ball. That is, the element that we neglected had a systematic effect.

At some point we stop trying to include additional parameters in our model because the model becomes too cumbersome. And we hope that the remaining parameters will have a relatively small influence on prediction. Further, we hope that these unmodeled parameters will partially cancel one another out, i.e., they won’t all act to affect the process in one direction. (Eventually, there is a limit beyond which deterministic modeling is not possible. You’ve probably heard of the Heisenberg uncertainty principle.) Often a process is considered mostly deterministic or mostly random based on practicality: is it worth it to expand the model to include additional details?

Let’s take the example of water pressure in a municipal water system. If, at some point within the system, you were to measure pressure as a function of time you would discover it fluctuates, sometimes smoothly, sometimes wildly. In principle, one could model the system and make predictions from it, but the system is so complex that it would not be practical to do so in detail. Pump pressure, pipe diameter, and pipe roughness could be relatively easily included in a model. Those parameters are relatively constant. But water pressure also depends on flow rate. So, every time someone turns on a faucet or flushes a toilet, it has an effect. And it would be impossible to monitor all faucets in the system to be able to include their individual impacts on the system. So, what to do?

One answer is to predict “in the average”. That is, make some assumptions about the distribution of faucets turned on and toilets flushed, and use a measure or statistic to characterize a “typical” situation, perhaps by time of day or day of week. Then use this statistic to approximate the actual conditions in the model. Predictions resulting from an averaged input will be imperfect, but they could be reasonably useful. (Actually, one of the more interesting problems for water-works engineers, and one that defies getting usable predictions from “averaged” inputs, is the Superbowl problem. In a period of about twenty minutes during Superbowl half-time, 50 million toilets are flushed throughout the U.S., a once-a-year occurrence. This wreaks havoc with municipal water systems.)

Uncertainty and randomness enter the engineering world in yet another way: measurement. If fifty people were given meter sticks and you asked them to measure the length of a soccer field, you would get fifty different answers. They might be closely clustered, but they would be different. Why? The soccer field isn’t changing in size. The answer is that errors are introduced in taking the measurement. Maybe the meter sticks are not all exactly one meter long. Maybe the tick marks on the meter sticks were read incorrectly. Maybe the number of meter stick lengths along the field was miscounted. Maybe the meter sticks were not laid in a straight line. There are a lot of “maybes”.

So, with fifty different answers, how long is the soccer field? Really, we can’t tell. But we can estimate its most likely length by taking the average value of all the measurements. So, let’s say that average value is 112 m. How confident are we that the actual length is 112m? If all fifty measurements lie between 111.5m and 112.5m, we’re pretty confident. But, if the fifty measurements lie between 100m and 120m, we would be far less sure. So, the spread or distribution of values makes a difference in the confidence of our estimate. Can we actually quantify measurement confidence? How can we deal with non-deterministic quantities? How can we characterize random processes or distributions of outcomes? These are all questions vital to engineering. And they are addressed in terms of probability and statistics.

DISTRIBUTIONS

With random processes we can never predict a specific outcome—that’s what makes it random. But we might be able to deduce the likelihood that a particular outcome will occur. That can be very helpful. But determining this likelihood requires knowledge of the distribution of possible outcomes. Sometimes we can infer what the distribution is; other times we cannot. In the case of a “perfect” die, we presume that each side of the die is equally likely to land face up. So, 1/6th of the time we would expect to find a “2” face up, for example. And the same would be true for each of the other numbers. Another way of saying it is that the probability of getting a “2” on any one roll of the die would be 1/6 or 0.16667.

Probability is the likelihood that an event will occur, or a particular outcome will occur. Probabilities always lie between 0.0 and 1.0. If the probability of an event is 0.0, that means it will never occur. If the probability of an outcome is 1.0, that means it is certain to occur.

What is the probability that a “1” or a “2” or a “3” or a “4” or a “5” or a “6” will occur in one roll of the die? Since this event encompasses every possible outcome, its probability must be 1.0 (presuming that the die cannot end up on an edge or corner). That is, the sum of probabilities of all possible outcomes is 1.0. This fact allows us to define a probability distribution function f(n), where n is a particular outcome. For a rolled die f(1) = f(2) = f(3) = f(4) = f(5) = f(6) = 0.16667. A plot of this function looks like this:

This is a uniform distribution or flat distribution, i.e., each outcome is equally likely to occur. And, since we have scaled the values so that the heights of the rectangles sum to one (6 * 0.16667 = 1.0), this plot can be thought of as a probability distribution function. Mathematically, this scaling can be written as $\sum_{n=1}^{N} f(n) = 1$, where N is the number of possible outcomes. This is a property of probability distribution functions.

An event can also be more complicated, for example the rolling of two dice. Then we might define the outcome as the sum of the spots on the two dice. In this case there are 6 x 6 = 36 possible ways the dice can land, each equally likely. But in those 36 ways, there are only 11 possible outcomes: the values 2 through 12. And these values are not all equally likely to occur. There is only one way to obtain a 2: when both dice show a “1”. But there are four ways of obtaining a 5: (1,4), (4,1), (2,3), (3,2). So, if every combination of faces is equally likely to occur, one would expect a 5 to occur 4/36ths of the time and a 2 to occur 1/36th of the time. Again, it’s useful to plot the probability distribution function:

And, again, because this plot includes every possible outcome, the sum of the heights of the rectangles is one.
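The counting argument for two dice can be checked by brute force. The sketch below simply enumerates all 36 equally likely pairs and tallies the sums; the variable names are ours, and the text-mode bar chart is only a rough stand-in for the plot.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) pairs give each sum.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability of each sum = (number of ways) / 36, kept as exact fractions.
two_dice_pdf = {total: Fraction(count, 36) for total, count in sorted(ways.items())}

for total in two_dice_pdf:
    print(f"{total:2d}  {ways[total]}/36  {'#' * ways[total]}")

print("sum of all probabilities:", sum(two_dice_pdf.values()))
```

Using exact fractions rather than decimals makes it easy to confirm that the heights add to exactly one.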

The rolling of dice is an example of a random process which produces discrete outcomes, i.e., outcomes which one can enumerate. There are also random processes which produce non-denumerable or continuous outcomes, e.g., the angle at which a spinner stops turning. Here, the probability that the spinner will stop at, say, exactly 30.0123456° is effectively zero. The reason is that 30.0123456° is only one of an infinite number of possible values. So, how can this be useful? The answer is that if we specify a range of values over which the spinner may stop, then the probability becomes finite. For example, the probability that the spinner will stop between 30° and 31° is 1/360.

With continuous outcomes, we no longer speak of probability distribution functions, but rather of probability density functions. And, we can no longer scale the sum of all possible outcomes f(n) as $\sum_{n=1}^{N} f(n) = 1$ because we cannot enumerate individual outcomes. But we can write the equivalent expression using calculus. Let x be a continuum of outcomes and f(x) be the probability density function of the occurrence of x; then, over all possible values of x, the integral $\int_{-\infty}^{\infty} f(x)\,dx = 1$. The probability density function for the spinner is uniform and would be plotted as follows:

In the case of discrete outcomes, the sum of heights of the rectangles must add to 1.0. In the case of continuous outcomes, the area under the curve must equal 1.0. Then, the probability of obtaining a value between x = a and x = b is the area under the curve between a and b.
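For the spinner, the "area under the curve" is easy to compute because the density is flat. A minimal sketch, assuming a uniform density f(x) = 1/360 per degree on the range 0 to 360 (the function name is ours):

```python
def spinner_probability(a_deg, b_deg):
    """P(a <= angle <= b) for a uniform density f(x) = 1/360 on 0..360 degrees."""
    if not 0.0 <= a_deg <= b_deg <= 360.0:
        raise ValueError("need 0 <= a <= b <= 360")
    # Area of a rectangle: width (b - a) times height 1/360.
    return (b_deg - a_deg) / 360.0

print(spinner_probability(30, 31))    # a one-degree range
print(spinner_probability(0, 360))    # the whole circle
print(spinner_probability(30, 30))    # a single exact angle has zero width
```

The last call shows why a single exact angle has no probability of its own: a range of zero width encloses zero area.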

n and x in the discussion above are called random variables. Their values are distributed according to the generating random process.

Depending on the underlying random process, probability distributions or density functions can take on many forms. One of the more well-known is this one:

This is often called a bell-shaped distribution. More formally, it is known as a Gaussian distribution or normal distribution. It can be skinny or wide, but the area under the curve is always 1.0. This function is extremely important in engineering. We’ll discuss it in detail a little later.

One of the difficulties in studying random processes is that we almost never know what the probability density function (pdf) is. Sometimes we have a general idea about its shape, but not much more. In fact, to learn more, we usually infer its characteristics by taking sample outcomes. From these data, we try to estimate its form. Remember, there is some underlying process whose outcomes are probabilistically distributed. It is the characteristics of that underlying process that we are trying to determine. (In the case of Albert Hibbs, everyone’s initial expectation was that the roulette wheel had a flat or uniform distribution function, i.e., each number on the wheel was equally likely to occur. But Hibbs collected data and concluded that the distribution function was not perfectly uniform—some numbers were more likely to occur than others. Based on his estimate of the wheel’s distribution function, he was able to improve his odds of winning.)

Deducing pdfs, however, is not easy, because they don’t necessarily have analytical forms, i.e., they may not be explicitly expressible as mathematical functions. Nevertheless, we can learn something about a pdf’s characteristics by exploring its moments. Every pdf can be expressed in terms of an infinite number of parameters called moments. Moments characterize probability distribution functions just as Taylor series polynomials characterize mathematical functions. But they characterize them in a different way. Moments characterize pdfs in terms of shape: spread, symmetry, peakedness, etc. Taylor series polynomials characterize math functions in terms of curvature: linear, quadratic, cubic. They are different techniques for exploring different properties of functions.

In general, moments denote the effect of something which is applied at a distance. For example, in physics there is the concept of torque, the twisting effect of applying a force at the end of a lever arm. If the force is applied at right angles to the lever arm, then the torque is T = r * F, the product of the length of the lever arm r and the magnitude of the force F. Torque is a moment.

In probability and statistics the idea is the same, except that the “something” is an outcome, and the “distance” is how far that outcome is from zero or an average value. Statistical moments characterize the shape of the distribution. For discrete and continuous random processes, respectively, the Pth moment is defined as:

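$m_P = \sum_{n=1}^{N} (x_n - \mu)^P f(x_n)$ , $m_P = \int_{-\infty}^{\infty} (x - \mu)^P f(x)\,dx$

where $\mu = \sum_{n=1}^{N} x_n f(x_n)$ and $\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$, respectively. For the discrete case, recall that N is the total number of possible outcomes.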
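To see the definitions in action, here is a small sketch that computes moments for a single fair die. It assumes the usual discrete forms: the mean is the sum of x times f(x), and the Pth moment about the mean is the sum of (x - mean)^P times f(x). The function names are ours.

```python
from fractions import Fraction

def mean(pdf):
    """First moment about zero: sum of x * f(x) over all outcomes."""
    return sum(x * p for x, p in pdf.items())

def moment_about_mean(pdf, P):
    """P-th moment about the mean: sum of (x - mu)**P * f(x)."""
    mu = mean(pdf)
    return sum((x - mu) ** P * p for x, p in pdf.items())

# A fair die: each face has probability 1/6.
die = {n: Fraction(1, 6) for n in range(1, 7)}

mu = mean(die)
m2 = moment_about_mean(die, 2)
m3 = moment_about_mean(die, 3)

print("mean:", mu)
print("2nd moment about the mean:", m2)
print("3rd moment about the mean:", m3)
```

The third moment comes out to zero because the die's distribution is symmetric about its mean.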

$m_P$ is called the Pth moment about the mean. μ is the mean or the average value of x. In fact, μ is the 1st moment about zero. Each $m_P$ emphasizes a feature in the distribution of f(x). So, if we knew the values of all the $m_P$’s, we could deduce the actual probability density function f(x). Knowing all the moments tells us everything there is to know about f(x). But, there are two problems: first, there are an infinite number of moments; second, all we will have at our disposal is a set of sample outcomes produced by the underlying random process.

It turns out that the first problem is not so serious, because in most applications, almost everything we would like to know about a pdf is contained in the first several moments. In fact, only rarely are we interested in more than the first four moments. The second problem is a little more serious, because the best we can do is estimate what those moments are. We will never be able to really know what they are, but with enough sampling, we can estimate them arbitrarily closely.

Let’s start with a concrete example. Suppose we want to characterize the distribution of defective screws coming off an assembly line. They’re packaged 1000 screws per box. So, how do we proceed? Maybe we decide to count the number of defective screws in N = 100 boxes, with the boxes selected randomly. (Note that we will now use N to denote the number of samples, not the number of possible outcomes, as before.) So, we have a list of $x_i$’s, or defective screws per box, for i = 1, ..., N. First, we’re interested in the 1st moment about zero, i.e., the mean. On average, how many screws per box are defective? To find the mean, we use the formula: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.

What’s going on here? First, why is the mean denoted by $\bar{x}$ and not μ? Second, why is the formula so different from what was given before? (Actually, this is the equation that probably looks most familiar.) The answer to the first question is that we can’t calculate μ. It’s a property of the underlying pdf. We can only estimate μ from our sample data. (There are a lot more than 100 boxes of screws coming off the assembly line.) We denote that estimate as $\bar{x}$. Our expectation is that $\bar{x}$ is close to μ. In fact, we’ll even be able to calculate how close it’s likely to be.

Now the second question. In our original formula for calculating μ, we considered all possible values of x, and we “weighted” them by their probability of occurring. In adding up (or integrating) all weighted values, we arrived at μ. In estimating μ with $\bar{x}$, however, the probability of getting a certain value of x has already been taken into account by the sampling procedure. Values of x with low probabilities of occurring are not found very often in the sample. So, with the relative distribution of x’s already accounted for, adding the unweighted samples is equivalent. And, since we’re interested in the “per sample” average of x, we divide that sum by N.

So, now we have an estimate of the mean number of defective screws per box. Maybe we’d like to know how that mean comes about. Do all boxes have 10.0 defective screws? Or maybe most boxes have no defective screws, while a few boxes have many. These are questions about the distribution of defects. To illustrate, here are some pdfs, all with the same mean μ = 10.0.

From the point of view of our boxes of screws, these graphs represent the following situations: a) almost all boxes have exactly 10 defective screws; b) boxes tend to have roughly 5 or roughly 15 defective screws, but hardly anything else; c) many boxes have about 9 defective screws, but the number can vary from one to very many. These are quite different quality characteristics, yet they all have an average value of 10.

We get additional useful information with $m_2$, the second moment about the mean. This moment also has a special name: the variance. It’s a measure of the “spread” of the distribution. In our example of defective screws, this statistic would tell us how much variation there could be from box to box. Here the calculation is $s^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$, where s² is a sample estimate of the true variance σ². Again, we can only estimate the true variance. Often we’re most interested in $s = \sqrt{s^2}$, the square root of the variance. This is called the standard deviation (s.d.) or standard error, depending on the application. Notice that s has the same units as x, i.e., if x is in meters, then the standard deviation is in meters; if x is the number of defective screws, then the standard deviation is the number of defective screws.
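Here is a minimal sketch of these estimates using Python's statistics module. The ten box counts are made-up numbers for illustration only. Note one assumption worth flagging: statistics.pvariance divides by N, which mirrors the definition of the second moment, while statistics.variance divides by N - 1, a convention many textbooks prefer for samples. With 100 boxes the two differ very little.

```python
import statistics

# Hypothetical counts of defective screws in 10 sampled boxes (made-up data).
defects = [8, 12, 10, 9, 11, 13, 7, 10, 12, 8]

x_bar = statistics.mean(defects)          # estimate of the mean
s2_n = statistics.pvariance(defects)      # variance estimate, dividing by N
s2_n1 = statistics.variance(defects)      # variance estimate, dividing by N - 1
s = s2_n ** 0.5                           # standard deviation, same units as x

print("mean:", x_bar)
print("variance (divide by N):", s2_n)
print("variance (divide by N - 1):", s2_n1)
print("standard deviation:", round(s, 3))
```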

One can define lots of different variances and they all have the same meaning: amount of spread. To specify which variance, one can use the notation var(). For example, to specify the variance of a random variable x as we did immediately above, one can write var(x). But, one can also calculate the variance of an estimate, say $\bar{x}$, and denote it as var($\bar{x}$). We might also want to know how much variation there is in our estimate of the variance of $\bar{x}$. That would be denoted as var(var($\bar{x}$)).

You probably already know the use of standard deviation: Suppose you’ve received a score of 55 on an exam. Is that good or bad? What’s the first question you ask: “What was the class average?” Suppose the answer is 50. Now, at least you know that you did better than average. But how much better? Here’s where you ask your next question: “What was the standard deviation?”. Why the standard deviation? Because the standard deviation indicates the spread of scores. And the standard deviation is especially useful for exam scores because the distribution of scores on an exam is often “bell-shaped” or “Gaussian”. And a Gaussian distribution is completely characterized by its mean and standard deviation. (We’ll elaborate on this later.) So, if you know the standard deviation you can estimate quite precisely how well you did with respect to the rest of the class. For example, if the standard deviation is 10.0, that means that you are only 0.5 standard deviations above the mean. Assuming a Gaussian distribution, that means that approximately 31% of the class did better than you. Your score is OK, but not great. On the other hand, if the standard deviation is 2.5, your score is 2.0 s.d.’s above the class average. That means only about 2% of the class did better than you. That’s terrific.
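The percentages quoted for the exam example come from the area under the Gaussian curve to the right of your score. A short sketch, assuming the scores really are Gaussian, using the complementary error function from Python's math module (the function name fraction_above is ours):

```python
import math

def fraction_above(score, mean, sd):
    """Fraction of a Gaussian population scoring higher than `score`."""
    z = (score - mean) / sd          # distance from the mean in standard deviations
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for sd in (10.0, 2.5):
    z = (55 - 50) / sd
    pct = 100.0 * fraction_above(55, 50, sd)
    print(f"s.d. = {sd:4.1f}: {z:.1f} s.d. above the mean, about {pct:.1f}% did better")
```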

In the example of defective screws, you would use the standard deviation in a similar way. You might report your defective rate as 10.0 ± 2.3 per box, where the “2.3” is the standard deviation. This characterization, then, relates not only the average rate, but also suggests the variability of the rate of defects.

But not all distributions are Gaussian, which means that more moments are useful in characterizing the distribution function. Two of these moments are $m_3$ and $m_4$. By themselves they are not so useful, because their magnitudes depend on the choice of units of the random variable x, in fact as x³ and x⁴, respectively. To make these moments more representative of the shape of the distribution function, it is customary to non-dimensionalize them, i.e., normalize them with respect to another parameter: the variance. Normalizing these two moments gives us the non-dimensional statistics skewness and kurtosis:

skewness = $\dfrac{m_3}{m_2^{3/2}}$ , and kurtosis = $\dfrac{m_4}{m_2^{2}}$.

Notice that the values have no units whatsoever. That is, they are unit independent. So if your data were converted, say, from millimeters to kilometers, the result would be the same.
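The unit independence is easy to check numerically. The sketch below assumes the usual normalizations, skewness = m3 / m2^(3/2) and kurtosis = m4 / m2^2, applied to sample moments about the sample mean. The data are made up, and the function names are ours.

```python
import math

def central_moment(xs, P):
    """Sample P-th moment about the sample mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** P for x in xs) / len(xs)

def skewness(xs):
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

# Made-up lengths in millimeters, with a long tail toward large values.
lengths_mm = [2.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 9.0]
lengths_km = [x * 1e-6 for x in lengths_mm]   # same data, different units

print("m3 in mm^3:", central_moment(lengths_mm, 3))
print("m3 in km^3:", central_moment(lengths_km, 3))
print("skewness (mm, km):", skewness(lengths_mm), skewness(lengths_km))
print("kurtosis (mm, km):", kurtosis(lengths_mm), kurtosis(lengths_km))
```

The raw third moment changes by many orders of magnitude between the two unit systems, while skewness and kurtosis agree up to floating-point rounding.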

Skewness is a measure of symmetry. A distribution with zero skewness will tend to be symmetric about the mean. If the skewness is non-zero, the magnitude of the skewness indicates how lopsided the distribution is. Notice, that we wouldn’t be able to make that interpretation if only m3 were used, because different units for x would produce different values for m3 even if they came from the same underlying distribution. Non-dimensionalizing parameters is a very useful practice in statistics in particular, and in engineering in general.

Finally, we have kurtosis. This is considered to be a measure of “peakedness”, i.e., how “pointy” the distribution is. For reference, the kurtosis of a Gaussian distribution is 3.0, a wonderful item of statistical trivia. And, if you want to add to your statistics vocabulary, distributions are called leptokurtic if their kurtosis is greater than that of the Gaussian, platykurtic if it is less, and mesokurtic if it is the same.

There are, of course, an infinite number of additional moments to consider. But knowing the mean and variance is often enough to make engineering predictions and decisions. Remember these ideas and formulas.

In our example above, we estimated the mean number μ of defective screws as $\bar{x}$ = 10.0. How good an estimate is that? What we mean by “good” is: what is the variance of the estimate $\bar{x}$? Actually, if we know the variance of $x_i$, we can deduce the variance of $\bar{x}$. Here’s how.

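To make the algebra easier, let’s define a new random variable $y_i = x_i - \bar{x}$. That means that the variance of y will be the same as that of x, but, because y has zero mean, we can estimate it with the more compact formula $s^2 = \frac{1}{N}\sum_{i=1}^{N} y_i^2$. Suppose, now, that M $y_i$’s are averaged together, each being sampled N times. We’ll label them $y_1, y_2, \ldots, y_M$. Then, we can write $\mathrm{var}(\bar{y}) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{M}\left(y_{1i} + y_{2i} + \cdots + y_{Mi}\right)\right]^2$, that is, N realizations, each of M samples of $y_i$ which are then averaged together. We expand this to obtain:

$\mathrm{var}(\bar{y}) = \frac{1}{M^2}\left[\frac{1}{N}\sum_{i=1}^{N} y_{1i}^2 + \frac{1}{N}\sum_{i=1}^{N} y_{2i}^2 + \cdots + \frac{1}{N}\sum_{i=1}^{N} y_{Mi}^2\right] + \frac{2}{M^2}\left[\frac{1}{N}\sum_{i=1}^{N} y_{1i}y_{2i} + \frac{1}{N}\sum_{i=1}^{N} y_{1i}y_{3i} + \cdots + \frac{1}{N}\sum_{i=1}^{N} y_{(M-1)i}\,y_{Mi}\right]$

The first set of terms contains all the squared values of $y_i$; the second set of terms contains all the cross products of $y_i$. Each of the terms in the first set of brackets is nothing other than the variance of $y_i$, namely s². There are M of them, so the first set of terms reduces to $\frac{1}{M^2}\,M s^2 = \frac{s^2}{M}$. But, what is the second set? What, for example, is $\frac{1}{N}\sum_{i=1}^{N} y_{1i}y_{2i}$? Recall that the $y_i$’s are random variables having zero mean. Most importantly, $y_{1i}$ is picked or sampled totally independently of $y_{2i}$, i.e., all the $y_i$’s are independent samples. This means that the value of one $y_i$ is uncorrelated with any other $y_i$. So, every cross product of $y_i$’s averages to zero. And all the terms in the second set of brackets are zero. The final result is then: $\mathrm{var}(\bar{y}) = \frac{s^2}{M}$.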

We really should say that the estimated $\mathrm{var}(\bar{x}) = s^2/M$. Or we could say that $\mathrm{var}(\bar{x}) = \sigma^2/M$. These appear to be subtle distinctions, but in the study of statistics, these distinctions are very important. In almost all engineering applications, you will never know μ or σ; they will have to be estimated from the data.

So, why do we care about this result? Look carefully. It says that an average measurement has less variance than a single measurement. In fact, the variance is reduced by the factor 1/M. If you wanted to know the width of a soccer field, intuitively you might make three measurements and take the average value. Now you know why. Your intuition told you that an average is a better estimate than a single measurement. Here, we’ve demonstrated what that improvement is. Since we’re usually interested in the standard deviation or standard error of a measurement, our improvement by taking averages is proportional to $1/\sqrt{M}$. So, I should now be able to ask you the following question: If you know that the variability of defective screws per box is s² = 5.0, then how many boxes must you sample to estimate the average number of defectives to within 0.1? That is, how big must M be so that the s.d. of the average number of defects, $s/\sqrt{M}$, is less than 0.1? Questions like this arise in science and engineering very frequently.
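One way to work a question like this is to turn the result around: if the s.d. of an average of M independent samples is the square root of s²/M, then the target is met once M passes s² divided by the target squared. The sketch below assumes independent samples; the function names are ours.

```python
import math

def sd_of_average(s2, M):
    """S.d. of an average of M independent samples, each with variance s2."""
    return math.sqrt(s2 / M)

def threshold_samples(s2, target_sd):
    """Value of M at which sqrt(s2 / M) equals target_sd; larger M does better."""
    return s2 / target_sd ** 2

s2 = 5.0        # variance of defective screws per box, from the text
target = 0.1    # desired s.d. of the average

print("M must exceed about", round(threshold_samples(s2, target)))
for M in (100, 500, 1000):
    print(f"M = {M:4d}: s.d. of the average = {sd_of_average(s2, M):.4f}")
```

Notice the price of the square root: cutting the standard error by a factor of ten takes a hundred times as many boxes.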

Random processes, distribution functions, moments, uncertainty, etc., are concepts that can be hard to grasp. And theory doesn’t always provide insight. Fortunately, we can get a feel for some of these ideas using a random process simulator. We’ll experiment with the simulator in our virtual lab at jhu.edu/virtlab/stats/statistics.htm

A while back I mentioned the task of measuring the length of a soccer field with only a meter stick. Let’s think about that situation in some detail. Let’s presume that every time you lay down the meter stick you introduce a random placement error $e_i$. And, let’s say that the random error is –1 cm 30% of the time, 0 cm 40% of the time, and +1 cm 30% of the time. One meter at a time, you measure from one end of the soccer field to the other, and you discover that it takes 112.60 lengths of the meter stick to span the field, i.e., you think the field is 112.60 m long. At least, your measurements indicate that it’s 112.60 meters long. But with each placement, you’ve introduced a possible error of ± 1 cm. And these have accumulated 113 times, once for each time you placed the meter stick. How has this accumulation of errors $\sum_{i=1}^{113} e_i$ affected your result? One way of estimating this is by carrying out the measurement again, and again, and again.

This is a process in uncertainty that we can simulate in our virtual lab. First, let the random variable x in the simulation be used to represent the error ei. Then define the distribution of x in terms of “individual values” as

Pr(x)    0.3    0.4    0.3
x         -1      0     +1

This is our definition of the error every time we take a measurement with the meter stick, i.e., our measurement will be in error by –1 cm, 0 cm, or +1 cm. First, let’s determine the standard deviation of this error. Do this by setting w = x. That is, our final random variable is nothing other than the measurement error itself. Then, set the number of realizations to 1000; that is, you want to get 1000 separate values of this random variable. Then click on “draw”. What you’ll get is a distribution that consists of three values: -1, 0, +1. Not so surprising. And their frequencies of occurrence should be about 3 to 4 to 3. You should also obtain a calculated mean of approximately zero, and a standard deviation of about 0.77. These two numbers partially characterize the nature of the error in taking a single measurement with the meter stick.
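If the virtual lab isn't handy, the same experiment can be imitated in a few lines of Python. This is only a stand-in for the simulator, not its actual code: it draws 1000 values with random.choices and compares them with the exact mean and standard deviation computed from the table of probabilities. The seed is fixed so the run is repeatable.

```python
import math
import random
import statistics

values = [-1, 0, 1]          # placement error in cm
probs = [0.3, 0.4, 0.3]      # probability of each error

# Exact mean and standard deviation from the definition of the moments.
exact_mean = sum(v * p for v, p in zip(values, probs))
exact_sd = math.sqrt(sum((v - exact_mean) ** 2 * p for v, p in zip(values, probs)))

# Simulated version: 1000 realizations of a single placement error.
rng = random.Random(0)       # fixed seed, an arbitrary choice
draws = rng.choices(values, weights=probs, k=1000)
sim_mean = statistics.mean(draws)
sim_sd = statistics.pstdev(draws)

print(f"exact:     mean {exact_mean:.3f}, s.d. {exact_sd:.3f}")
print(f"simulated: mean {sim_mean:.3f}, s.d. {sim_sd:.3f}")
print("counts:", {v: draws.count(v) for v in values})
```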

Now, we need to see what effect this error has on our total measurement of the soccer field. That requires, say, 113 measurements. So we can form a new random variable w, which is the sum of 113 values of the random error x, as w = sum(x,113). This expression will take the sum of 113 realizations of the random error x and add them together. w will be the total error of our measurement. Again, we want to carry out this “measurement” 1000 times. So, set the number of realizations to 1000, and click on draw. What do you get? A fairly broad distribution of errors. Recall, that this is the distribution of the sum of 113 individual errors added together. What is the average value of this distribution? It is probably close to zero. And what about the spread or standard deviation? That’s probably about 8 cm. Might we have predicted that 8 cm?

Yes. Here’s how.

The standard deviation of an average of M values is $1/\sqrt{M}$ times the standard deviation of the individual random variable. In this case we’re summing the random error 113 times. That’s the same as taking its average, except we’re not dividing by the total number of elements in the sum. Put another way, the sum is 113 times the average, so its standard deviation is $113 \times 1/\sqrt{113} = \sqrt{113}$ times that of a single error. So, we might predict that the standard deviation of this sum would be $\sqrt{113}$ times the s.d. of the individual error, or $\sqrt{113} \times 0.77 = 8.18$. That’s just about right. So the theory does work. (Or, if you’re a skeptic, you might say the simulation works.)
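The accumulated error can be imitated the same way. The sketch below is again a stand-in for the virtual lab rather than its actual code: each "measurement of the field" adds up 113 placement errors, and this is repeated 1000 times. The prediction uses the unrounded single-error s.d. (the square root of 0.6), so it comes out slightly above the 8.18 obtained with the rounded value 0.77.

```python
import math
import random
import statistics

rng = random.Random(1)       # fixed seed, an arbitrary choice

def total_error(n_placements=113):
    """Sum of n_placements independent placement errors, in cm."""
    return sum(rng.choices([-1, 0, 1], weights=[0.3, 0.4, 0.3], k=n_placements))

# 1000 simulated measurements of the whole field.
totals = [total_error() for _ in range(1000)]
sim_mean = statistics.mean(totals)
sim_sd = statistics.pstdev(totals)

# Prediction: sqrt(113) times the s.d. of a single error, sqrt(0.6).
predicted_sd = math.sqrt(113) * math.sqrt(0.6)

print(f"simulated mean of total error: {sim_mean:.2f} cm")
print(f"simulated s.d. of total error: {sim_sd:.2f} cm")
print(f"predicted s.d.:                {predicted_sd:.2f} cm")
```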

Now, click on the button “Normal curve”. This will produce a Gaussian curve with the same mean and standard deviation as the displayed distribution. It looks pretty close. So, the sum of 113 errors which are distributed as –1, 0, +1 is approximately a Normal distribution. Interesting. Would you have expected anything else?

Usually, we are required to include some indication of accuracy when we report a measurement. That report is usually the measurement ± one standard deviation. The ± one standard deviation is often called the standard error. Thus we would report the length of the soccer field as 112.60 m ± 8 cm.

But that value of 112.60 m is based on a single measurement. What might be a better estimate? An average, of course. What we mean by improvement is that the standard error is less. For one measurement the standard error is ± 8 cm (based on the error simulation). Let’s see what happens if we take a number of measurements. Again, we can use our random variable simulator to get an answer. First, let’s assume that the distribution of errors for our total measurement is Gaussian with a mean of 0 cm and a standard deviation of 8 cm. That’s roughly what we discovered in our first simulation. Let’s begin all over again, this time defining a random variable x as being Gaussian distributed with mean 0 and standard deviation 8.

Now we want to use w as the average value of a number of measurements, say 10. So, we define w = sum(x,10)/10. sum(x,10) produces the sum of ten realizations of the total error. Since we want the average error over those 10 realizations, we must divide by 10.

Carry out this calculation, say, 1000 times, and plot, as before.

What do we get? Good news. The standard deviation of the error in taking an average of 10 measurements is about 2.5 cm. What happens if we take an average over 100 measurements? Try it. The standard deviation of the error in taking an average of 100 measurements is about 0.8 cm. So, the more measurements we average over, the smaller the standard error. If you plot standard error as a function of M, the number of measurements in the average, you will discover that the standard error is proportional to $1/\sqrt{M}$, just as we deduced. Plot it. See how close it really is.
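Here is a rough Python imitation of that experiment, assuming, as above, that a single field measurement has a Gaussian error with mean 0 cm and s.d. 8 cm. It is a sketch, not the virtual lab's code; the names and the seed are ours.

```python
import math
import random
import statistics

rng = random.Random(2)       # fixed seed, an arbitrary choice
SIGMA = 8.0                  # s.d. of one field measurement, in cm

def sd_of_average(M, realizations=1000):
    """Simulated s.d. of the error in an average of M measurements."""
    averages = [
        sum(rng.gauss(0.0, SIGMA) for _ in range(M)) / M
        for _ in range(realizations)
    ]
    return statistics.pstdev(averages)

print(" M   simulated   8/sqrt(M)")
for M in (1, 10, 100):
    print(f"{M:3d}   {sd_of_average(M):8.2f}   {SIGMA / math.sqrt(M):8.2f}")
```

The two columns should track each other closely, though not exactly, since each simulated value is itself an estimate based on 1000 realizations.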

Mathematical notation and properties of averaging

Using $\sum$ and subscript notation is a fairly cumbersome way to indicate average value. We can often work with just the overbar notation itself, like $\bar{x}$. By knowing a few simple mathematical rules, we can reduce summation expressions directly. In what follows, constants are represented by upper-case letters and random variables are represented by lower-case letters.

Suppose we want the average value of $M + e$. Using our overbar notation we would write this as $\overline{M + e}$. When terms are added under an overbar, we can separate them into two separate averages, $\overline{M + e} = \overline{M} + \bar{e}$, because the terms of a sum can be regrouped freely, so the average of a sum equals the sum of the averages. And, since M is a constant, $\overline{M}$ is just M.

There are also two simple rules for multiplication within an overbar. One of them is $\overline{Me} = M\bar{e}$. That is, the average of a constant times a random variable is the constant times the average of that random variable. The second rule of multiplication is that $\overline{ef}$, the average of the cross-products, cannot be mathematically separated. But, there are some things we can say about its value. Suppose e and f are random variables, each with zero mean. If e and f are statistically independent of one another, then $\overline{ef} = 0$, i.e., the average cross-product of independent random variables with zero means is zero. On the other hand, if e and f have zero means, but are not independent of one another, then $\overline{ef}$ is the covariance between them. In the special case where e and f are the same variable, we would get the expression $\overline{e^2}$, which is the variance of e.
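These rules are easy to check numerically. The sketch below is illustrative only: the standard deviations, the sample size, the helper name `mean_product`, and the dependent variable g = e + f are all choices made here, not part of the text.

```python
import random

rng = random.Random(2)
n = 20000
e = [rng.gauss(0.0, 3.0) for _ in range(n)]   # zero-mean error, std 3
f = [rng.gauss(0.0, 2.0) for _ in range(n)]   # independent zero-mean error, std 2
g = [ei + fi for ei, fi in zip(e, f)]         # built from e, so not independent of it

def mean_product(a, b):
    """Average of the element-by-element products (the overbar of a*b)."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

print("mean(e f) =", round(mean_product(e, f), 2))   # independent: close to 0
print("mean(e g) =", round(mean_product(e, g), 2))   # dependent: close to var(e) = 9
print("mean(e e) =", round(mean_product(e, e), 2))   # same variable: close to var(e) = 9
```

Because the sample is finite, the first average comes out small but not exactly zero.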

This overbar notation is quite standard in those areas of engineering where the problems contain statistical or random quantities.

The Gaussian distribution

Why is it that any time we add together a bunch of random variables, the resulting distribution looks bell-shaped? For example, the score on each question of an exam could be considered a random variable; the distribution of total scores for the exam is almost always bell-shaped. If we add together 111 measurements, each of which could contain a random error, the resulting total error has a distribution which is bell-shaped. If we count the total number of dots on a throw of 10 dice, the distribution of dots is bell-shaped. What is even more remarkable is that if we add together the outcomes of 100 random variables, the sum is almost always bell-shaped, nearly regardless of the probability distribution of each of the random variables.

Run some experiments using the random function simulator. Define a random variable x with flat distribution between 0 and 1. Construct a new random variable w as the sum of ten realizations of x, i.e., w= sum(x,10). Obtain 1000 realizations. And, what do you get? A distribution that is very close to Gaussian. Define a random variable y as a Gaussian with a mean of 0 and a standard deviation of 2, and define w =sum(y,10). Obtain 1000 realizations. And, what do you get? A distribution that is very close to Gaussian. In fact, construct any random variable x with as wild a probability distribution as you can think of. Define w as w=sum(x,10). And w will tend to have a Gaussian distribution.
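The first of these experiments can be sketched in Python as follows. The function name `sum_of_uniforms` and the seed are assumptions made for this illustration; the check against the familiar 68% and 95% figures is just one simple way to judge how Gaussian the result looks.

```python
import random
import statistics

def sum_of_uniforms(k, trials, seed=3):
    """Return `trials` realizations of the sum of k uniform(0, 1) variables."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(k)) for _ in range(trials)]

w = sum_of_uniforms(10, 1000)
mean_w = statistics.fmean(w)
std_w = statistics.pstdev(w)

# For a Gaussian, about 68% of outcomes lie within one standard deviation
# of the mean and about 95% lie within two.
within_1 = sum(abs(x - mean_w) <= std_w for x in w) / len(w)
within_2 = sum(abs(x - mean_w) <= 2 * std_w for x in w) / len(w)
print(f"mean {mean_w:.2f}, std {std_w:.2f}")
print(f"within 1 std: {within_1:.2f}, within 2 std: {within_2:.2f}")
```

The mean should come out near 5 and the standard deviation near 0.91, since each flat variable on 0 to 1 has mean 1/2 and variance 1/12.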

This remarkable result has been observed for a hundred years, but only little by little has the observation been elevated into a mathematical theorem. This theorem is the Central Limit Theorem. In short, the Central Limit Theorem says that if one takes the sum of outcomes of a large set of random variables (with suitable restrictions), the resulting sum will tend to have a Gaussian distribution. Actually, there are a number of central limit theorems, each with its own list of restrictions. And, the topic is still open for research. A Gaussian or normal probability distribution function has the form:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Of course, if you integrate over all x, it integrates to 1.0. The most important feature of this function is that it is fully defined by the mean $\mu$ and the standard deviation $\sigma$. None of the other moments need be known to specify this function. Some other properties: it is symmetric about the mean, so all odd moments about the mean are zero. And, since earlier we mentioned kurtosis, the fourth moment about the mean, we will state here that the kurtosis of a Gaussian is 3.0. (That’s a genuine factoid.) Sometimes this function will be referred to as $N(\mu,\sigma)$, a normal distribution with mean $\mu$ and standard deviation $\sigma$. For example, if you read that a variable x is distributed as N(0,1), you should know what that means.

Although the Gaussian is a fairly complicated function to work with, we do have its exact functional form. This means we can learn anything we need to know about it, either through mathematical analysis or by tabulation. Earlier I mentioned that if your score on an exam was two standard deviations above the average, only about 2% of the class would have a higher score than you. The reason I could make that statement is that the total score on an exam, being a sum, is Gaussian distributed. And it has been tabulated that about 95% of the area of a Gaussian distribution lies between $-2\sigma$ and $+2\sigma$ from the mean. Since the Gaussian is symmetric, half of the remaining 5% must lie below $-2\sigma$ and the other half must lie above $+2\sigma$.
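You don’t need a printed table to check these areas. For a Gaussian, the fraction of the area within k standard deviations of the mean is erf(k/√2), and Python’s standard `math.erf` evaluates it directly. The helper name `gaussian_area_within` is just a label chosen for this sketch.

```python
import math

def gaussian_area_within(k):
    """Fraction of a Gaussian's area within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    inside = gaussian_area_within(k)
    upper_tail = (1.0 - inside) / 2.0   # by symmetry, half of what is left over
    print(f"within +/-{k} sigma: {inside:.4f}, above +{k} sigma: {upper_tail:.4f}")
```

For k = 2 this gives an area of roughly 95.4%, so the upper tail is a little over 2%, in line with the round numbers quoted above.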

Now, think about this question: Suppose you measure the length of a soccer field 10 times and take the average value as your best estimate. What’s the distribution of the average value? Of course, since we’re talking about Gaussian distributions, you’ll say “Gaussian”. (And, you’d be right.) But, think about what an average is. It’s the sum of a sequence of values divided by a constant. Since it’s a sum, its variation will tend to be Gaussian. And, we noted earlier that the standard deviation of an average is $\sigma/\sqrt{N}$, where $\sigma$ is the standard deviation of each individual element and N is the number of elements in the sum.

The present discussion should also make it a little clearer why we might denote the quality of a measurement D by expressing it as $D \pm \sigma$, an expected value plus or minus the standard error. Since any measurement is likely to be contaminated by any number of contributing errors, the total error in D is likely to be Gaussian distributed. That means that a measurement of, say, 100 m ± 10 cm will bracket the true value about 68% of the time, which is the area under a Gaussian curve between $-1\sigma$ and $+1\sigma$.

The Gaussian curve is certainly the most important one in science and engineering. Unfortunately, it is not universal. There are a number of random processes that generate other probability distributions, e.g., a Poisson process.

And life is a little more complicated than we’ve led you to believe. Recall, we don’t know $\mu$ and $\sigma$. We can only estimate $\mu$ and $\sigma^2$ with $\bar{x}$ and $s^2$. So, with uncertainties in $\mu$ and $\sigma$, we can’t really justify the precise probabilities that we’ve mentioned. A more thorough study of probability and statistics would show us how to deal with the problem.

ESTIMATION AND VARIANCE

Let’s talk a bit more about estimation. Estimation is the process of trying to determine properties of a population or event by sampling—whether it’s determining the distribution of defective screws in a box or determining the length of a soccer field. The idea is to collect data in a way which will allow us to most closely estimate properties of the population or event. “Most closely” means with the least variance.

In finding the length of a soccer field, for example, the issue wasn’t so much the sampling strategy as how to use the data. Recall that the variance of a single measurement was $s^2$, but the variance of an average of M measurements was $s^2/M$. So, we can reduce variance by taking averages. But, there are other ways as well.

In some situations, we have little choice for a sampling strategy. For example, if we want to find the distribution of outcomes from a roulette wheel, there is not much choice but to spin and record, spin and record, spin and record.

But, there are some instances where one does have a choice, and the choice one makes can have a significant effect. Consider the following schematic:

This represents a plot of land with trees. Area A is sparsely populated with trees; areas B are densely populated with trees. The total area is known and is very large, say, thousands of square kilometers. The task is to estimate how many trees are in the plot.

Since there are, perhaps, tens of millions of trees, counting them is out of the question. Sampling is the only reasonable approach. First, what will you measure? Since you know the total area, you can sample tree density $\rho$, e.g., trees per hectare (100 m x 100 m), and calculate the total number of trees from that. Recall that one of the goals of sampling is to accurately represent the total population in the sample. In this case, the population is the collection of trees. And its distribution would look like this:

One strategy is to randomly pick locations within the entire area. Then, use those locations as center points about which you mark out 100 m x 100 m sections, and count the trees in each of these sections. Since the locations are chosen randomly, you expect to sample the correct proportion of A and B areas. So, you can expect to estimate a representative statistic: average trees per hectare. The difficulty is that the statistic you would deduce would have a fairly high variance. Is there another sampling strategy that would have a lower variance? The answer is yes. The land area consists of two sub-populations: high tree density and low tree density. The sampling strategy we just outlined is based on random sampling over the entire area. It turns out that we can significantly lower the variance of our estimates if we separately sample each of the sub-areas and combine their results. This is the technique of stratified sampling.

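Here’s the theory. First, we’ll simplify the problem so we don’t get bogged down in algebra. Make the assumption that half the area is of type A, and half of type B. And, let’s denote the sample average tree density of the areas as $\bar{\rho}_A$ and $\bar{\rho}_B$. Since the areas are of equal size, we can write the average density of the whole plot as $\bar{\rho} = (\bar{\rho}_A + \bar{\rho}_B)/2$. If we define $D = \bar{\rho} - \bar{\rho}_A$, then $-D = \bar{\rho} - \bar{\rho}_B$. That is, D is the difference between the global mean $\bar{\rho}$ and the individual means $\bar{\rho}_A$ and $\bar{\rho}_B$. First, we’ll do the problem as if we’re doing a simple random sample over the entire area. But we’ll develop it to acknowledge that there are these two different sub-areas. We write the sample variance of $\rho$ as:

$$\mathrm{var}(\rho) = \frac{1}{N}\sum_{i=1}^{N}(\rho_i - \bar{\rho})^2$$

If we reorganize this equation to indicate from which area the samples were taken, we obtain:

$$\mathrm{var}(\rho) = \frac{1}{N}\left[\sum_{i \in A}(\rho_i - \bar{\rho})^2 + \sum_{i \in B}(\rho_i - \bar{\rho})^2\right],$$

where the first sum runs over the $N_A$ samples from area A, the second over the $N_B$ samples from areas B, and $N_A + N_B = N$. Now, $\bar{\rho}$ can be expressed in terms of D and the individual means, $\bar{\rho} = \bar{\rho}_A + D = \bar{\rho}_B - D$. So:

$$\mathrm{var}(\rho) = \frac{1}{N}\left[\sum_{i \in A}(\rho_i - \bar{\rho}_A - D)^2 + \sum_{i \in B}(\rho_i - \bar{\rho}_B + D)^2\right]$$

Expanding the squared terms, we get:

$$\mathrm{var}(\rho) = \frac{1}{N}\left[\sum_{i \in A}\left((\rho_i - \bar{\rho}_A)^2 - 2D(\rho_i - \bar{\rho}_A) + D^2\right) + \sum_{i \in B}\left((\rho_i - \bar{\rho}_B)^2 + 2D(\rho_i - \bar{\rho}_B) + D^2\right)\right]$$

The first sum can be rewritten as:

$$\sum_{i \in A}(\rho_i - \bar{\rho}_A)^2 + N_A D^2 = N_A\,\mathrm{var}(\rho_A) + N_A D^2$$

What happened to the middle term? It’s zero because it’s simply the sum of the deviations of the observations about their own mean. The second sum in the previous equation gives a similar result. So, now we can write:

$$\mathrm{var}(\rho) = \frac{N_A}{N}\mathrm{var}(\rho_A) + \frac{N_B}{N}\mathrm{var}(\rho_B) + D^2$$

What does this say? It says that the variance of a set of random samples taken from the entire plot consists of three elements: a weighted variance of samples taken from region A, a weighted variance of samples taken from regions B, and the square of D, which reflects the difference between the mean tree densities of the two areas. So, even if the distribution of trees is very narrow in each of the two types of areas, the sample variance could be large, simply because of the difference in the average tree density between the two areas.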

We’re not quite done. We’re interested in how well we can estimate the total number of trees, and that is characterized by the variance of $\bar{\rho}$. We know that the variance of an average value is related to the variance of the random variable itself as $\mathrm{var}(\bar{\rho}) = \mathrm{var}(\rho)/N$, so we deduce that

$$\mathrm{var}(\bar{\rho}) = \frac{1}{N}\left[\frac{N_A}{N}\mathrm{var}(\rho_A) + \frac{N_B}{N}\mathrm{var}(\rho_B) + D^2\right].$$

Rather than carrying out global random sampling, one can carry out stratified random sampling to obtain another estimate of $\bar{\rho}$, say, $\bar{\rho}_S$. That is, sample the areas A and B separately and obtain values for $\mathrm{var}(\rho_A)$ and $\mathrm{var}(\rho_B)$. The statistic we want is $\bar{\rho}_S = (\bar{\rho}_A + \bar{\rho}_B)/2$, and its variance. Since $\bar{\rho}_A$ is unrelated to $\bar{\rho}_B$, the variances add, and we get

$$\mathrm{var}(\bar{\rho}_S) = \frac{1}{4}\left[\frac{\mathrm{var}(\rho_A)}{N_A} + \frac{\mathrm{var}(\rho_B)}{N_B}\right].$$

By evaluating the area as two sub-areas, we’ve eliminated the $D^2$ term. That’s our improvement using stratified sampling. If you want to see the value of this technique, try the tree-counting simulation at jhu.edu/virtlab/trees/howmany.htm .

Note: The expressions for $\mathrm{var}(\bar{\rho})$ and $\mathrm{var}(\bar{\rho}_S)$ are not quite parallel, because in our derivation of $\mathrm{var}(\bar{\rho}_S)$ we explicitly took into account that sub-areas A and B are of equal size. We did not make that assumption in deriving $\mathrm{var}(\bar{\rho})$. The two expressions would be similar (except for the $D^2$ term) if $N_A$ and $N_B$ were taken as N/2 to reflect equal areas and equal sampling of each area.
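The benefit of stratifying can also be seen in a small simulation. Everything numerical below is hypothetical: the tree densities, their spreads, the sample size of 20 hectares, and the function names are invented for this sketch, and densities are drawn from Gaussians purely for convenience.

```python
import random
import statistics

# Hypothetical tree densities (trees per hectare); not taken from the text.
MEAN_A, STD_A = 20.0, 5.0      # sparse area A
MEAN_B, STD_B = 200.0, 20.0    # dense areas B

def sample_density(rng, area):
    """Tree density of one randomly placed hectare inside the given area type."""
    if area == "A":
        return rng.gauss(MEAN_A, STD_A)
    return rng.gauss(MEAN_B, STD_B)

def simple_random_estimate(rng, n):
    """Mean density from n hectares placed at random over the whole plot."""
    return statistics.fmean([sample_density(rng, rng.choice("AB")) for _ in range(n)])

def stratified_estimate(rng, n):
    """Mean density from n/2 hectares in A and n/2 in B, combined with equal weight."""
    half = n // 2
    mean_a = statistics.fmean([sample_density(rng, "A") for _ in range(half)])
    mean_b = statistics.fmean([sample_density(rng, "B") for _ in range(half)])
    return (mean_a + mean_b) / 2.0

def variance_of_estimates(estimator, n=20, trials=2000, seed=5):
    """Variance of the estimated mean density over many repeated surveys."""
    rng = random.Random(seed)
    return statistics.pvariance([estimator(rng, n) for _ in range(trials)])

var_simple = variance_of_estimates(simple_random_estimate)
var_stratified = variance_of_estimates(stratified_estimate)
print(f"simple random sampling: variance of the mean = {var_simple:.1f}")
print(f"stratified sampling:    variance of the mean = {var_stratified:.1f}")
```

With densities this different in the two areas, almost all of the variance of the simple random estimate comes from the $D^2$ term, which is why the stratified estimate comes out so much tighter.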

PROPAGATION OF ERROR

To conclude this section on random processes we’ll discuss their impact on measurement error. Usually, we will take some measurement $\hat{m}_i$ and presume that it consists of the real value m and some error $e_i$, i.e., $\hat{m}_i = m + e_i$. And we presume that the error $e_i$ has zero average. This means that the average value of the measurements $\hat{m}_i$ would be expected to equal the true value m. If the $e_i$’s do not have zero average, then our measurement is biased, or we can say that it contains a systematic error. Of course, we always try to carry out a measurement that does not contain a systematic error.

But, just because we can carry out a measurement with an average error of zero, does not mean that we will be free of systematic error. Depending on how we use that measurement, it is possible to introduce one. We can illustrate this with a very simple example. Suppose we want to estimate the area of a square. So, we measure the length of a side, and we square that value. Just to make sure, we do this a number of times and take an average:

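$$\bar{A} = \frac{1}{N}\sum_{i=1}^{N}(m + e_i)^2 = m^2 + \frac{2m}{N}\sum_{i=1}^{N} e_i + \frac{1}{N}\sum_{i=1}^{N} e_i^2$$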

The right hand side of this equation has three terms. The first is the true area $m^2$. The second is the average of a random variable with zero mean. So, it’s zero. But the third is the average of a random variable which is squared. So no item in that sum is negative, and their average is positive. This term has introduced a bias into our estimate of A, even though our measurement error was not biased. So, on average, we will overestimate A.

The reason that we have created this bias is because we have used the measurement (and the error) in a non-linear way. This means that the error does not appear in the calculations just as a first power. Here, in one term, the error is squared. Is there a way to carry out the measurement so that we don’t introduce such a problem? In general, no. But, in this case, yes. All we need to do is to take two measurements of the square: one to measure the height; one to measure the width--even though they're supposed to be the same length. If we do that, then we can calculate the average area as

$$\bar{A} = \frac{1}{N}\sum_{i=1}^{N}(h + e_i)(w + \delta_i) = hw + \frac{h}{N}\sum_{i=1}^{N}\delta_i + \frac{w}{N}\sum_{i=1}^{N} e_i + \frac{1}{N}\sum_{i=1}^{N} e_i\delta_i,$$

where h and w are the true height and width of the square, and $e_i$ and $\delta_i$ are the errors in taking those measurements. Now, if you look at the terms, all but the first average to zero. Terms two and three average to zero because the random errors $e_i$ and $\delta_i$ have zero averages. Term four is the average of the products $e_i\delta_i$ of errors which are independent of one another. So, the expected value of this product is zero: some terms will be positive; some will be negative.
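Here is a small numerical check of both ways of estimating the area. The side length, the error size, the number of trials, and the function name `area_estimates` are made-up values for illustration; the errors are assumed Gaussian with zero mean.

```python
import random
import statistics

def area_estimates(side=10.0, sigma=0.5, n=200000, seed=6):
    """Average area of a square computed two ways from noisy length measurements."""
    rng = random.Random(seed)
    one_measurement = []    # measure one side, then square it
    two_measurements = []   # measure height and width independently, then multiply
    for _ in range(n):
        e = rng.gauss(0.0, sigma)   # error in the first measurement
        d = rng.gauss(0.0, sigma)   # independent error in the second measurement
        one_measurement.append((side + e) ** 2)
        two_measurements.append((side + e) * (side + d))
    return statistics.fmean(one_measurement), statistics.fmean(two_measurements)

squared, product = area_estimates()
print(f"true area 100.00, side squared: {squared:.2f}, height x width: {product:.2f}")
```

Squaring a single measurement should overshoot the true area by about the variance of the error (0.25 here), while the product of two independent measurements should show no such offset.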

An example of a measurement which will always have a biased error is estimating the height of a tree by measuring out some distance X from the tree, then measuring the angle $\theta$ to the top of the tree. Then, the height of the tree is $h = X\tan\theta$.

In N measurements, we would obtain an average height of the tree as:

$$\bar{h} = \frac{1}{N}\sum_{i=1}^{N}(X + x_i)\tan(\theta + \phi_i) = \frac{X}{N}\sum_{i=1}^{N}\tan(\theta + \phi_i) + \frac{1}{N}\sum_{i=1}^{N} x_i\tan(\theta + \phi_i),$$

where $x_i$ and $\phi_i$ denote the errors in measuring X and $\theta$. Here, the mathematics begins to get sticky. On average the second term is zero because it is the product of one random error times a function of another, independent, random error. But the average of the first term is not $X\tan\theta$. Its value will depend on the angular errors $\phi_i$. And, since there is no closed-form expression which separates $\phi_i$ from the other variables, we can't even calculate its effect. However, if you're curious, you can experiment with the random function simulator and see for yourself. Especially when $\theta$ is large, say greater than 60°, the nonlinearity in the problem yields quite biased results.
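If you’d rather experiment in Python than in the simulator, a sketch along the following lines will do. The distance of 20 m, the error sizes (10 cm in distance, 2 degrees in angle), and the function name `mean_height_estimate` are invented for this illustration, and both errors are assumed Gaussian with zero mean.

```python
import math
import random
import statistics

def mean_height_estimate(theta_deg, x=20.0, sigma_x=0.1, sigma_theta_deg=2.0,
                         n=100000, seed=7):
    """Average of (X + x_i) * tan(theta + phi_i) over n noisy measurements."""
    rng = random.Random(seed)
    theta = math.radians(theta_deg)
    sigma_theta = math.radians(sigma_theta_deg)
    heights = [
        (x + rng.gauss(0.0, sigma_x)) * math.tan(theta + rng.gauss(0.0, sigma_theta))
        for _ in range(n)
    ]
    return statistics.fmean(heights)

for theta_deg in (30, 60, 75):
    true_height = 20.0 * math.tan(math.radians(theta_deg))
    bias = mean_height_estimate(theta_deg) - true_height
    print(f"theta = {theta_deg} deg: true height {true_height:.2f} m, bias {bias:+.3f} m")
```

The average height should come out too large in every case, and the overshoot should grow quickly as the angle steepens.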

The Calculus of errors

There's another way to estimate the effect of measurement error that has nothing to do with probability or random processes. It involves the way a function F(x,y) changes as x and y change--the essential ideas of calculus. We'll begin with a problem. Suppose we want to calculate the volume of a structure that consists of a cone resting upon a rectangular parallelepiped. The total volume of this structure is:

$$V = \frac{1}{3}\pi R^2 H_c + LWH.$$

We will not measure V directly, but rather we'll calculate V by taking measurements of R, Hc, L, W, and H. But suppose these measurements are not perfectly accurate. So the question is how much error will we introduce into our calculation of V by using inaccurate values for the measured variables?

If each measurement is in error, then the calculated volume would consist of the true volume V plus an error v. The relation between the error-borne measurements and the resulting calculated volume would be:

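$$V + v = \frac{1}{3}\pi (R + r)^2 (H_c + h_c) + (L + l)(W + w)(H + h).$$

Expanding this equation, then subtracting out the equation for V, we get

$$v = \frac{\pi}{3}\left[R^2 h_c + 2RH_c r + 2R\,r h_c + H_c r^2 + r^2 h_c\right]$$

$$+ \left[LW h + LH w + WH l + L\,wh + W\,lh + H\,lw + lwh\right].$$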

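First, let's look at v statistically, i.e., what would be the average value of the error v if measurements were taken many times and averaged. Using overbar notation, we obtain

$$\bar{v} = \frac{\pi}{3}\left[R^2\,\overline{h_c} + 2RH_c\,\bar{r} + 2R\,\overline{r h_c} + H_c\,\overline{r^2} + \overline{r^2 h_c}\right] + \left[LW\,\bar{h} + LH\,\bar{w} + WH\,\bar{l} + L\,\overline{wh} + W\,\overline{lh} + H\,\overline{lw} + \overline{lwh}\right].$$

If $h_c$, r, l, w, and h are all independent random variables with zero mean, then we get

$$\bar{v} = \frac{\pi}{3} H_c\,\overline{r^2}.$$

All the other terms average to zero.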

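Thus, we have deduced an average expected error in v. But, suppose we can't take a lot of sample measurements, and we would like to know what is the "worst case" error. That's fairly simple. Suppose we can estimate the maximum possible error on each measurement. Let these be labeled $r_{max}$, $l_{max}$, $h_{c,max}$, $h_{max}$, and $w_{max}$. Then

$$v_{max} = \frac{\pi}{3}\left[R^2 h_{c,max} + 2RH_c\,r_{max} + 2R\,r_{max}h_{c,max} + H_c\,r_{max}^2 + r_{max}^2 h_{c,max}\right]$$

$$+ \left[LW\,h_{max} + LH\,w_{max} + WH\,l_{max} + L\,w_{max}h_{max} + W\,l_{max}h_{max} + H\,l_{max}w_{max} + l_{max}w_{max}h_{max}\right].$$

A simplification of this is to assume that $r_{max}$, $l_{max}$, $h_{c,max}$, $h_{max}$, and $w_{max}$ are all very small compared to R, L, $H_c$, H, and W. Then terms containing two or more of these maximum errors will be much smaller than terms containing only one. Consequently, if we ignore these smaller terms, $v_{max}$ can be approximated as

$$v_{max} \approx \frac{\pi}{3}\left[R^2 h_{c,max} + 2RH_c\,r_{max}\right] + LW\,h_{max} + LH\,w_{max} + WH\,l_{max}.$$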

There's a reason for making this simplification, even though it's only an approximation to the maximum error in v. The reason is that this simplified expression is easily deduced for any combination of measurements. This is the result that one would obtain by taking the total differential of the function $V(R, H_c, L, W, H)$. The total differential is defined like this: If F is a differentiable function depending on n variables $x_1, x_2, \ldots, x_n$, then infinitesimal variations in F are determined by infinitesimal variations in the $x_i$'s as:

$$dF = \sum_{i=1}^{n} \frac{\partial F}{\partial x_i}\,dx_i$$

dF is called the “total differential”. Our variations $r_{max}$, $l_{max}$, $h_{c,max}$, $h_{max}$, and $w_{max}$ are not infinitesimal. But, if they're quite small, then the total differential is a pretty good approximation to the total error $v_{max}$ (that is, dF in the notation immediately above).

Notice that the value for vmax really is an error that you would never expect to have. Not only are all of the measurement errors assumed to be at their maximum, but all of them are contributing with the same sign. That is, there are no canceling errors. If the measurement errors are truly random variables with zero mean, the expected error in a calculated value of V would be much less. Nevertheless, estimating error using this calculus can be extremely valuable, especially if you must absolutely determine some parameter within a specified error.

Another way of representing this maximum error is by percentages. In our example, if the total volume V is separated into its constituent pieces $V = V_c + V_p$, where the subscripts c and p refer to the cone and parallelepiped, respectively, then the above equation can be rewritten as

$$v_{max} \approx V_c\left(2\frac{r_{max}}{R} + \frac{h_{c,max}}{H_c}\right) + V_p\left(\frac{l_{max}}{L} + \frac{w_{max}}{W} + \frac{h_{max}}{H}\right).$$

This equation shows that percentage errors in the parallelepiped measurements, e.g., $l_{max}/L$, are linearly additive with weight $V_p$, whereas a percentage error in the measurement of R is doubly additive with weight $V_c$.

This technique is described as the “total differential” or “calculus of errors” in elementary calculus books.
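To get a feel for how good the total-differential approximation is, here is a short comparison for the cone-on-a-box example. The dimensions and the 1 cm maximum errors are hypothetical numbers chosen for this sketch, as are the function names.

```python
import math

def volume(r, hc, l, w, h):
    """Volume of a cone (radius r, height hc) resting on a box (l x w x h)."""
    return math.pi * r**2 * hc / 3.0 + l * w * h

def worst_case_exact(dims, errs):
    """Volume error when every dimension is too large by its maximum error."""
    shifted = [d + e for d, e in zip(dims, errs)]
    return volume(*shifted) - volume(*dims)

def worst_case_differential(dims, errs):
    """First-order estimate: sum of (dV/dx_i) * (maximum error in x_i)."""
    r, hc, l, w, h = dims
    partials = [
        2.0 * math.pi * r * hc / 3.0,   # dV/dR
        math.pi * r**2 / 3.0,           # dV/dHc
        w * h,                          # dV/dL
        l * h,                          # dV/dW
        l * w,                          # dV/dH
    ]
    return sum(p * e for p, e in zip(partials, errs))

# Hypothetical dimensions in metres, each with a maximum error of 1 cm.
dims = (1.0, 2.0, 3.0, 2.0, 1.5)          # R, Hc, L, W, H
errs = (0.01, 0.01, 0.01, 0.01, 0.01)     # r_max, hc_max, l_max, w_max, h_max

exact = worst_case_exact(dims, errs)
approx = worst_case_differential(dims, errs)
print(f"exact worst case: {exact:.5f} m^3, total differential: {approx:.5f} m^3")
```

With errors this small compared to the dimensions, the two numbers agree to within about one percent. The total differential comes out slightly smaller because it drops the positive higher-order terms.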

Why do we care about all this stuff: propagation of errors, lower variances of estimates, Gaussian distribution functions? Are these ideas just academic curiosities? Or are they actually important to engineers? The answer is they are actually important—really important.

Let’s take the case of counting trees, something that would appear to be a make-work exercise. Let’s put it into an engineering context. Suppose you are an engineer working for a lumber company which is trying to decide how much money it should offer for 100,000 acres of forested land. We’re talking about tens of millions of dollars. The land is valuable to the lumber company for its timber, and that depends on how many trees are on the plot. If more money is offered for the land than its timber value, the company loses money. If less money is offered than its timber value, then another company is likely to outbid yours. So you need to make the right offer: not too big, not too small. Of course, you could actually count the trees. Then you could make a bid with almost perfect accuracy. But how much time and money would it take for you to actually count the trees on 100,000 acres? Far too much. So, the engineering solution is to make an estimate based on sampling. The more samples, the more accurate the estimate, but the higher the cost. Remember, you’re in competition with other companies bidding for the land. What you need to do is optimize your estimation procedure: obtain the most accurate estimate at the least cost. In this example, that might be carried out through stratified sampling. What we covered above was only an introduction to the concept of stratified sampling. In reality, there are whole books on the subject: how to get the most information, most accurately, and at the least cost. There’s that universal engineering parameter again: $$$.

Uncertainty is an integral part of engineering and science. Understand it, because every measurement, every experiment, every real situation has uncertainty. Understand it, because it’s central to the design of earthquake-resistant buildings, safety factors for airplane wings, and production strategies in a market economy. Understand it, because it represents those things that an engineer does not know about a given physical situation: the details of the process, the variability of the environment, the error in the measurement. Sometimes engineers don’t want to know any more, because good statistics can give them enough information to proceed. But engineers need to know how statistics characterize uncertainty. Only then can they use it as a tool to solve their problems.

