Probability Distributions

CEE 201L. Uncertainty, Design, and Optimization
Department of Civil and Environmental Engineering
Duke University

Philip Scott Harvey, Henri P. Gavin and Jeffrey T. Scruggs
Spring 2022

In the context of random variables, capital italics (X) represent an uncertain quantity (a random variable) and lower case italics (x) represent a particular value of that random variable. Random variables can be discrete or continuous.

• Discrete random variables can take on values that are members of a (finite or infinite) set of discrete values. If X can take on only positive whole numbers (e.g., the number of times a team can win over all time), then X is a discrete random variable with an infinitely large population. If X can take on only whole numbers between 0 and 23 (e.g., the hour of a day), then X is a discrete random variable with a finite population.

• Continuous random variables can take on any value within finite or infinite bounds. The population of potential values of any continuous random variable is infinitely large.

This document focuses on continuous random variables.

1 Probability distributions of continuous random variables

The properties of a random variable (rv) $X$ distributed over the domain $\check{x} \le X \le \hat{x}$ are fully described by its probability density function or its cumulative distribution function.

The probability density function (PDF) of $X$ is the function $f_X(x)$ such that for any two numbers $a$ and $b$ within the domain, $\check{x} \le a \le b \le \hat{x}$,

$$ P[a < X \le b] = \int_a^b f_X(x)\, dx $$

For $f_X(x)$ to be a proper distribution, it must satisfy the following two conditions:

• The PDF $f_X(x)$ is not negative; $f_X(x) \ge 0$ for all values of $x$ between $\check{x}$ and $\hat{x}$.

• The rule of total probability holds; the total area under $f_X(x)$ is 1; $\int_{\check{x}}^{\hat{x}} f_X(x)\, dx = 1$.

The cumulative distribution function (CDF) of $X$ is the function $F_X(x)$ that gives, for any specified value $b$ between $\check{x}$ and $\hat{x}$, the probability that the random variable $X$ is less than or equal to $b$. This probability is written as $P[X \le b]$. The CDF is defined by

$$ F_X(x) = P[X \le x] = \int_{-\infty}^{x} f_X(s)\, ds , $$

where $s$ is a dummy variable of integration. So, $P[a < X \le b] = F_X(b) - F_X(a)$.

By the first fundamental theorem of calculus, the functions $f_X(x)$ and $F_X(x)$ are related as

$$ f_X(x) = \frac{d}{dx} F_X(x) $$

Some important characteristics of CDFs of $X$ are:

• CDFs, $F_X(x)$, are monotonic non-decreasing functions of $x$.

• For any number $a$, $P[X > a] = 1 - P[X \le a] = 1 - F_X(a)$.

• For any two numbers $a$ and $b$, with $a \le b$, $P[a < X \le b] = F_X(b) - F_X(a) = \int_a^b f_X(x)\, dx$.
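The relations between a PDF and its CDF can be checked numerically. The Python sketch below uses an illustrative density that is an assumption of this example and not part of the notes, $f_X(x) = 2x$ on $[0, 1]$ with CDF $F_X(x) = x^2$, and a simple midpoint-rule integrator.

```python
def pdf(x):
    """Illustrative PDF (assumed for this example): f_X(x) = 2x on [0, 1]."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def cdf(x):
    """Matching CDF: F_X(x) = x^2 on [0, 1], 0 below and 1 above."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x * x


def integrate(f, lo, hi, n=10000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


a, b = 0.2, 0.7
area_total = integrate(pdf, 0.0, 1.0)   # total probability, should be close to 1
p_integral = integrate(pdf, a, b)       # P[a < X <= b] from the PDF
p_cdf = cdf(b) - cdf(a)                 # P[a < X <= b] from the CDF
p_exceed = 1.0 - cdf(a)                 # P[X > a]
print(area_total, p_integral, p_cdf, p_exceed)
```

Both routes to $P[a < X \le b]$ should agree to within the integration error, and the total area should be 1 to within the same error.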

(CC) BY-NC-ND April 27, 2022 PSH, HPG, JTS

2 Statistics of random variables

The expected or mean value of a continuous random variable $X$ with PDF $f_X(x)$ is the centroid of the probability density.

$$ \mu_X = E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx $$

The expected value of an arbitrary function of $X$, $g(X)$, with respect to the PDF $f_X(x)$ is

$$ \mu_{g(X)} = E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx $$

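The variance of a continuous random variable $X$ with PDF $f_X(x)$ and mean $\mu_X$ gives a quantitative measure of how much spread or dispersion there is in the distribution of $x$ values. The variance is the expectation of $(X - \mu_X)^2$.

$$
\begin{aligned}
\sigma_X^2 \equiv V[X] &= E[(X - \mu_X)^2] \\
&= \int_{-\infty}^{\infty} (x - \mu_X)^2\, f_X(x)\, dx \\
&= \int_{-\infty}^{\infty} (x^2 - 2\mu_X x + \mu_X^2)\, f_X(x)\, dx \\
&= \int_{-\infty}^{\infty} x^2 f_X(x)\, dx - 2\mu_X \int_{-\infty}^{\infty} x f_X(x)\, dx + \mu_X^2 \int_{-\infty}^{\infty} f_X(x)\, dx \\
&= E[X^2] - 2\mu_X E[X] + \mu_X^2 \qquad \text{... but } \mu_X = E[X] \text{ so ...} \\
&= E[X^2] - \mu_X^2 \qquad \text{... the mean of the square minus the square of the mean}
\end{aligned}
$$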

The standard deviation (s.d.) of $X$ is $\sigma_X = \sqrt{V[X]}$. The coefficient of variation (c.o.v.) of $X$ is the standard deviation as a fraction of the mean:

$$ c_X = \frac{\sigma_X}{\mu_X} \qquad \text{... for } \mu_X \ne 0 $$

The c.o.v. is a normalized measure of dispersion and is dimensionless.

A mode of a probability density function, $f_X(x)$, is a value of $x$ such that the PDF is maximized;

$$ \left. \frac{d}{dx} f_X(x) \right|_{x = x_{\text{mode}}} = 0 . $$

A multi-modal distribution is a distribution with multiple modes.

The median value, $x_{\text{med}}$, is the value of $x$ such that

$$ P[X \le x_{\text{med}}] = P[X > x_{\text{med}}] = F_X(x_{\text{med}}) = 1 - F_X(x_{\text{med}}) = 0.5 . $$
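The definitions in this section can be evaluated numerically for a concrete density. The Python sketch below assumes, purely for illustration, the density $f_X(x) = 2x$ on $[0, 1]$ (not a distribution from the notes), for which $\mu_X = 2/3$, $\sigma_X^2 = 1/18$ and $x_{\text{med}} = \sqrt{0.5}$. It checks that the direct variance integral and the shortcut $E[X^2] - \mu_X^2$ agree.

```python
import math


def integrate(f, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


def pdf(x):
    """Illustrative PDF (assumed for this example): f_X(x) = 2x on [0, 1]."""
    return 2.0 * x


mean = integrate(lambda x: x * pdf(x), 0.0, 1.0)              # E[X]
mean_sq = integrate(lambda x: x ** 2 * pdf(x), 0.0, 1.0)      # E[X^2]
var_direct = integrate(lambda x: (x - mean) ** 2 * pdf(x), 0.0, 1.0)
var_shortcut = mean_sq - mean ** 2    # mean of the square minus square of the mean
sd = math.sqrt(var_shortcut)          # standard deviation
cov = sd / mean                       # coefficient of variation (dimensionless)
x_med = math.sqrt(0.5)                # solves F_X(x) = x^2 = 0.5
print(mean, var_direct, var_shortcut, sd, cov, x_med)
```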

3 Statistics from a sample of values of a random variable (Sample Statistics)

Consider a fixed sample of $m$ specific observed numerical values $\{x_1, \cdots, x_m\}$ drawn from a population with CDF $F_X(x)$. If $X$ is a continuous random variable, it can take on any value within potentially infinite bounds. In such cases the population is infinitely large, and it is impossible to know its distribution $F_X(x)$ exactly. A random sample of the population can, however, be used to estimate the population statistics. A few sample statistics are:

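• $x_{\max}$ and $x_{\min}$: the maximum and minimum values of the sample $\{x_1, \cdots, x_m\}$

$$ x_{\min} = \min_i (x_i), \qquad x_{\max} = \max_i (x_i), \qquad i = 1, \cdots, m $$

• $x_{\text{avg}}$: the arithmetic average of values of the sample $\{x_1, \cdots, x_m\}$ ... is the estimate of the population mean, $\hat{\mu}_X$

$$ \hat{\mu}_X \approx x_{\text{avg}} = \frac{1}{m} \sum_{i=1}^{m} x_i $$

• $x_{\text{gm}}$: the geometric average of values of the sample $\{x_1, \cdots, x_m\}$

$$ x_{\text{gm}} = \left( \prod_{i=1}^{m} x_i \right)^{1/m} $$

• $x_{\text{hm}}$: the harmonic average of values of the sample $\{x_1, \cdots, x_m\}$

$$ x_{\text{hm}} = \left( \frac{1}{m} \sum_{i=1}^{m} \frac{1}{x_i} \right)^{-1} $$

• $x_{\text{med}}$: the median value of the sample, for which half of the sample is greater than $x_{\text{med}}$.

• $x_{\text{mad}}$: the average absolute deviation of the sample,

$$ x_{\text{mad}} = \frac{1}{m} \sum_{i=1}^{m} | x_i - x_{\text{avg}} | $$

• $x_{\text{sd}}$: the standard deviation of values in the sample ... is the estimate of the population standard deviation, $\hat{\sigma}_X$

$$ \hat{\sigma}_X^2 \approx x_{\text{sd}}^2 = \frac{1}{m-1} \sum_{i=1}^{m} ( x_i - x_{\text{avg}} )^2 $$

• $x_{\text{cov}}$: the coefficient of variation of the sample

$$ x_{\text{cov}} = \frac{x_{\text{sd}}}{x_{\text{avg}}} $$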

• Sample statistics of a function of a sample $\{g(x_1), g(x_2), \cdots, g(x_m)\}$ are analogously ...

$$ g(x)_{\min} = \min_i ( g(x_i) ), \qquad g(x)_{\max} = \max_i ( g(x_i) ), \qquad i = 1, \cdots, m $$

$$ g(x)_{\text{avg}} = \frac{1}{m} \sum_{i=1}^{m} g(x_i) $$

$$ g(x)_{\text{sd}}^2 = \frac{1}{m-1} \sum_{i=1}^{m} ( g(x_i) - g(x)_{\text{avg}} )^2 $$

Importantly, note that in general, $g(x)_{\min} \ne g(x_{\min})$, $g(x)_{\text{avg}} \ne g(x_{\text{avg}})$, et cetera.
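The sample statistics listed above translate almost line by line into code. The following Python sketch uses only the standard library and a small made-up sample (the values are an assumption for illustration). It also shows that $g(x)_{\text{avg}} \ne g(x_{\text{avg}})$ for $g(x) = x^2$. Note that `statistics.median` averages the two middle values when $m$ is even, which is one common convention for the sample median.

```python
import math
import statistics

x = [0.5, 1.0, 2.0, 4.0]   # assumed sample, chosen only for illustration
m = len(x)

x_min, x_max = min(x), max(x)
x_avg = sum(x) / m                                    # arithmetic average
x_gm = math.prod(x) ** (1.0 / m)                      # geometric average
x_hm = 1.0 / (sum(1.0 / xi for xi in x) / m)          # harmonic average
x_med = statistics.median(x)                          # sample median
x_mad = sum(abs(xi - x_avg) for xi in x) / m          # average absolute deviation
x_sd = math.sqrt(sum((xi - x_avg) ** 2 for xi in x) / (m - 1))  # uses 1/(m-1)
x_cov = x_sd / x_avg                                  # coefficient of variation


def g(v):
    """Example function of the sample values (assumed): g(x) = x^2."""
    return v ** 2


g_avg = sum(g(xi) for xi in x) / m    # average of g(x_i)
g_of_avg = g(x_avg)                   # g evaluated at the average: not the same
print(x_avg, x_gm, x_hm, x_med, x_mad, x_sd, x_cov, g_avg, g_of_avg)
```

For positive samples such as this one, the harmonic, geometric and arithmetic averages come out in increasing order.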

4 Empirical PDFs, CDFs, and exceedance rates

A PDF and a CDF of a sample of values can be computed directly from the sample, without assuming any particular probability distribution.

A sample of $m$ random values can be sorted into increasing numerical order, so that

$$ x_1 \le x_2 \le \cdots \le x_{i-1} \le x_i \le x_{i+1} \le \cdots \le x_{m-1} \le x_m . $$

In the ordered sample there are $i$ data points less than or equal to $x_i$. So, if the sample is representative of the population, and the sample is "big enough", the probability that a random $X$ is less than or equal to the $i$th ordered value is $i/m$. In other words, $P[X \le x_i] = i/m$. Unless we know that no value of $X$ can exceed $x_m$, we must accept some probability that $X > x_m$. So, $P[X \le x_m]$ should be less than 1. In such cases, the unbiased estimate [1] $E[F_X(x_i)]$ for $P[X \le x_i]$ is $i/(m+1)$.

The empirical CDF computed from an ordered sample of $m$ values is

$$ \hat{F}_X(x_i) = \frac{i}{m+1} $$

The empirical PDF is basically a histogram of the data. The following Matlab lines plot empirical CDFs and PDFs from a vector of random data, x.

m = length(x);                          % number of values in the sample
x = sort(x);                            % sort the sample
x_avg = sum(x)/m;                       % average value of the sample
x_med = x(round(m/2));                  % median value of the sample
x_sd  = sqrt(var(x));                   % standard deviation of the sample
x_cov = abs(x_sd/x_avg);                % coefficient of variation of the sample
nBins = floor(m/50);                    % number of bins in the histogram
[fx,xx] = hist(x,nBins);                % compute the histogram
fx = fx/m * nBins/(max(x)-min(x));      % scale the histogram to a PDF
F_x = (1:m)/(m+1);                      % empirical CDF
subplot(121); bar(xx,fx);               % plot empirical PDF
subplot(122); stairs(sort(x),F_x);      % plot empirical CDF
probability_X_gt_1 = sum(x>1)/(m+1)     % fraction of the sample for which X > 1

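[Figure: empirical PDF $f_X(x)$ (left) and empirical CDF $F_X(x)$ (right) of the sample, plotted against $x$ with markers at $x_{\text{avg}} - x_{\text{sd}}$, $x_{\text{med}}$, $x_{\text{avg}}$, and $x_{\text{avg}} + x_{\text{sd}}$. Annotated values: $x_{\text{avg}} = 0.786$, $x_{\text{med}} = 0.703$, $x_{\text{sd}} = 0.387$, $x_{\text{cov}} = 0.492$, $P[X > 1] = 0.237$.]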

The number of values in the sample greater than $x_i$ is $(m - i)$. If the sample is representative, the probability of a value exceeding $x_i$ is $P[X > x_i] = 1 - F_X(x_i) \approx 1 - i/(m+1)$. If the $m$ observations were collected over a period of time $T$, the average exceedance rate (number of events greater than $x_i$ per unit time) is $\nu(x_i) = (1 - F_X(x_i))(m/T) \approx (1 - i/(m+1))(m/T)$.
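The empirical CDF and the exceedance rate can be sketched in a few lines of Python. The sample values and the observation period `T` below are assumptions made up for this example, and the helper names `empirical_cdf` and `exceedance_rate` are hypothetical. The estimate uses $i/(m+1)$, where $i$ is the number of sample values less than or equal to the threshold.

```python
import bisect


def empirical_cdf(sample_sorted, value):
    """Estimate P[X <= value] as i/(m+1), where i counts sample values <= value."""
    i = bisect.bisect_right(sample_sorted, value)
    return i / (len(sample_sorted) + 1)


def exceedance_rate(sample_sorted, value, T):
    """Average number of events exceeding `value` per unit time over duration T."""
    m = len(sample_sorted)
    return (1.0 - empirical_cdf(sample_sorted, value)) * (m / T)


# assumed sample of m = 9 observations collected over T = 3 time units
x = sorted([0.3, 1.2, 0.7, 2.5, 0.9, 1.6, 0.4, 1.1, 0.8])
T = 3.0

F_at_1 = empirical_cdf(x, 1.0)       # 5 of the 9 values are <= 1.0, so 5/10
p_exceed_1 = 1.0 - F_at_1            # estimate of P[X > 1.0]
nu_1 = exceedance_rate(x, 1.0, T)    # exceedances of 1.0 per unit time
F_last = empirical_cdf(x, x[-1])     # m/(m+1): strictly less than 1
print(F_at_1, p_exceed_1, nu_1, F_last)
```

Because of the $m+1$ in the denominator, the estimate at the largest observed value stays below 1, which leaves some probability for values beyond the sample maximum.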

[1] E.J. Gumbel, Statistics of Extremes, Columbia Univ. Press, 1958; Lasse Makkonen, "Problems in the extreme value analysis," Structural Safety 2008:30:405-419

5 Some common distributions

The National Institute of Standards and Technology (NIST) lists properties of nineteen commonly used probability distributions in their Engineering Statistics Handbook. This section describes the properties of seven distributions. For each of these distributions, this document provides figures and equations for the PDF and CDF, equations for the mean and variance, the names of Matlab functions to generate samples, and empirical distributions of such samples.

5.1 The Normal distribution

The Normal (or Gaussian) distribution is perhaps the most commonly used distribution function.

The notation $X \sim N(\mu_X, \sigma_X^2)$ denotes that $X$ is a normal random variable with mean $\mu_X$ and variance $\sigma_X^2$. The standard normal random variable, $Z$, or "z-statistic", is distributed as $N(0, 1)$. The probability density function of a standard normal random variable is so widely used it has its own special symbol, $\phi(z)$,

$$ \phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) $$

Any normally distributed random variable can be defined in terms of the standard normal random variable, through the change of variables

$$ X = \mu_X + \sigma_X Z . $$

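If $X$ is normally distributed, it has the PDF

$$ f_X(x) = \frac{1}{\sigma_X}\, \phi\left( \frac{x - \mu_X}{\sigma_X} \right) = \frac{1}{\sqrt{2\pi\sigma_X^2}} \exp\left( -\frac{(x - \mu_X)^2}{2\sigma_X^2} \right) $$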

There is no closed-form equation for the CDF of a normal random variable. Solving the integral

$$ \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-u^2/2}\, du $$

would make you famous. Try it. The CDF of a normal random variable is expressed in terms of the error function, erf(z). If $X$ is normally distributed, $P[X \le x]$ can be found from the standard normal CDF

$$ P[X \le x] = F_X(x) = \Phi\left( \frac{x - \mu_X}{\sigma_X} \right) . $$

Values for $\Phi(z)$ are tabulated and can be computed, e.g., with the Matlab command Prob_X_le_x = normcdf(x,muX,sigX). The standard normal PDF is symmetric about $z = 0$, so $\phi(-z) = \phi(z)$, $\Phi(-z) = 1 - \Phi(z)$, and $P[X > x] = 1 - F_X(x) = 1 - \Phi((x - \mu_X)/\sigma_X) = \Phi((\mu_X - x)/\sigma_X)$.

The linear combination of two independent normal rv's $X_1$ and $X_2$ (with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$) is also normally distributed,

$$ aX_1 - bX_2 \sim N\left( a\mu_1 - b\mu_2 ,\; (a\sigma_1)^2 + (b\sigma_2)^2 \right) , $$

and more specifically, $aX - b \sim N\left( a\mu_X - b ,\; (a\sigma_X)^2 \right)$.
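The rule for linear combinations can be checked with a short simulation. The Python sketch below assumes example values for the means, standard deviations and coefficients (none of them come from the notes), computes the mean and standard deviation of $aX_1 - bX_2$ from the formula, and compares them with a seeded Monte Carlo sample.

```python
import math
import random
import statistics
from statistics import NormalDist

# assumed example values
mu1, sig1 = 5.0, 0.3
mu2, sig2 = 2.0, 0.2
a, b = 1.0, 2.0

# D = a*X1 - b*X2 for independent normal X1 and X2
mu_D = a * mu1 - b * mu2                    # 5 - 4 = 1
sig_D = math.hypot(a * sig1, b * sig2)      # sqrt(0.3^2 + 0.4^2) = 0.5
p_D_le_0 = NormalDist(mu_D, sig_D).cdf(0.0) # P[D <= 0] = Phi(-2)

# Monte Carlo check with a fixed seed so the result is repeatable
random.seed(0)
N = 20000
d = [a * random.gauss(mu1, sig1) - b * random.gauss(mu2, sig2) for _ in range(N)]
mc_mean = statistics.fmean(d)
mc_sd = statistics.stdev(d)
print(mu_D, sig_D, p_D_le_0, mc_mean, mc_sd)
```

The variances add even though the second term is subtracted, because the sign of $b$ disappears when it is squared.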

Given the probability of a normal rv, i.e., given $P[X \le x]$, the associated value of $x$ can be found from the inverse standard normal CDF,

$$ \frac{x - \mu_X}{\sigma_X} = z = \Phi^{-1}( P[X \le x] ) . $$

Values of the inverse standard normal CDF are tabulated, and can be computed, e.g., with the Matlab command x = norminv(Prob_X_le_x,muX,sigX).
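For readers working outside Matlab, the same calculations can be sketched with Python's standard-library `statistics.NormalDist`, whose `cdf` and `inv_cdf` methods play roles analogous to `normcdf` and `norminv`. The numerical values of the mean, standard deviation and threshold below are assumptions for illustration.

```python
from statistics import NormalDist

mu_X, sig_X = 10.0, 2.0          # assumed example mean and standard deviation
X = NormalDist(mu_X, sig_X)      # X ~ N(mu_X, sig_X^2)
Z = NormalDist(0.0, 1.0)         # standard normal

x = 13.0
p_le = X.cdf(x)                          # P[X <= x]
p_le_std = Z.cdf((x - mu_X) / sig_X)     # same value via Phi((x - mu_X)/sig_X)
p_gt = Z.cdf((mu_X - x) / sig_X)         # P[X > x] by symmetry of phi
z = Z.inv_cdf(p_le)                      # z = Phi^{-1}(P[X <= x])
x_back = mu_X + sig_X * z                # recovers x
print(p_le, p_le_std, p_gt, z, x_back)
```

Here $z = (13 - 10)/2 = 1.5$, so the three probabilities are $\Phi(1.5)$, $\Phi(1.5)$ and $\Phi(-1.5)$.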

5.2 The Log-Normal distribution

The Normal distribution is symmetric and can be used to describe random variables that can take positive as well as negative values, regardless of the value of the mean and standard deviation. For many random quantities a negative value makes no sense (e.g., modulus of elasticity, air pressure, and distance). Using a distribution which admits only positive values for such quantities eliminates any possibility of non-sensible negative values. The log-normal distribution is such a distribution.

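If $\ln X$ is normally distributed (i.e., $\ln X \sim N(\mu_{\ln X}, \sigma_{\ln X}^2)$) then $X$ is called a log-normal random variable. In other words, if $Y$ $(= \ln X)$ is normally distributed, $e^Y$ $(= X)$ is log-normally distributed.

$$ \mu_Y = \mu_{\ln X} , \qquad \sigma_Y^2 = \sigma_{\ln X}^2 , $$

$$
\begin{aligned}
P[X \le x] = P[\ln X \le \ln x] &= P[Y \le y] \\
F_X(x) = F_{\ln X}(\ln x) &= F_Y(y) = \Phi\left( \frac{y - \mu_Y}{\sigma_Y} \right) = \Phi\left( \frac{\ln x - \mu_{\ln X}}{\sigma_{\ln X}} \right)
\end{aligned}
$$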

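The mean and standard deviation of a log-normal variable $X$ are related to the mean and standard deviation of $\ln X$.

$$ \mu_{\ln X} = \ln \mu_X - \frac{1}{2} \sigma_{\ln X}^2 $$

$$ \sigma_{\ln X}^2 = \ln\left( 1 + (\sigma_X / \mu_X)^2 \right) $$

If $(\sigma_X / \mu_X) < 0.30$, $\sigma_{\ln X} \approx (\sigma_X / \mu_X) = c_X$.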

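The median, $x_{\text{med}}$, is a useful parameter of log-normal rv's. By definition of the median value, half of the population lies above the median, and half lies below, so

$$ \Phi\left( \frac{\ln x_{\text{med}} - \mu_{\ln X}}{\sigma_{\ln X}} \right) = 0.5 $$

$$ \frac{\ln x_{\text{med}} - \mu_{\ln X}}{\sigma_{\ln X}} = \Phi^{-1}(0.5) = 0 $$

and, $\ln x_{\text{med}} = \mu_{\ln X}$ $\Leftrightarrow$ $x_{\text{med}} = \exp(\mu_{\ln X})$ $\Leftrightarrow$ $\mu_X = x_{\text{med}} \sqrt{1 + c_X^2}$.

For the log-normal distribution $x_{\text{mode}} < x_{\text{median}} < x_{\text{mean}}$. If $c_X < 0.15$, $x_{\text{median}} \approx x_{\text{mean}}$.

If $\ln X$ is normally distributed ($X$ is log-normal) then (for $c_X < 0.3$)

$$ P[X \le x] \approx \Phi\left( \frac{\ln x - \ln x_{\text{med}}}{c_X} \right) $$

If $\ln X \sim N(\mu_{\ln X}, \sigma_{\ln X}^2)$ and $\ln Y \sim N(\mu_{\ln Y}, \sigma_{\ln Y}^2)$, with $X$ and $Y$ independent, and $Z = aX^p / Y^q$, then

$$ \ln Z = \ln a + p \ln X - q \ln Y \sim N(\mu_{\ln Z}, \sigma_{\ln Z}^2) $$

where

$$ \mu_{\ln Z} = \ln a + p \mu_{\ln X} - q \mu_{\ln Y} = \ln a + p \ln x_{\text{med}} - q \ln y_{\text{med}} $$

and

$$ \sigma_{\ln Z}^2 = (p \sigma_{\ln X})^2 + (q \sigma_{\ln Y})^2 = p^2 \ln(1 + c_X^2) + q^2 \ln(1 + c_Y^2) = \ln(1 + c_Z^2) $$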
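The log-normal relations above can be collected into a short Python sketch. The mean, coefficients of variation, threshold and exponents below are assumed example values, and the helper name `lognormal_params` is hypothetical. The sketch converts $(\mu_X, c_X)$ to $(\mu_{\ln X}, \sigma_{\ln X})$, compares the exact probability with the small-c.o.v. approximation, and evaluates $c_Z$ for $Z = aX^p/Y^q$ with independent $X$ and $Y$.

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf    # standard normal CDF


def lognormal_params(mu_X, c_X):
    """Return (mu_lnX, sigma_lnX) from the mean and c.o.v. of a log-normal X."""
    sig_ln2 = math.log(1.0 + c_X ** 2)
    mu_ln = math.log(mu_X) - 0.5 * sig_ln2
    return mu_ln, math.sqrt(sig_ln2)


mu_X, c_X = 100.0, 0.2    # assumed example mean and c.o.v. of X
mu_ln, sig_ln = lognormal_params(mu_X, c_X)
x_med = math.exp(mu_ln)   # median; equals mu_X / sqrt(1 + c_X^2)

x = 80.0                  # assumed threshold
p_exact = Phi((math.log(x) - mu_ln) / sig_ln)            # P[X <= x]
p_approx = Phi((math.log(x) - math.log(x_med)) / c_X)    # for c_X < 0.3

# Z = a * X**p / Y**q with independent log-normal X and Y (assumed values)
a, p, q = 2.0, 1.0, 2.0
c_Y = 0.1
sig_lnZ2 = p ** 2 * math.log(1 + c_X ** 2) + q ** 2 * math.log(1 + c_Y ** 2)
c_Z = math.sqrt(math.exp(sig_lnZ2) - 1.0)   # inverts sig_lnZ^2 = ln(1 + c_Z^2)
print(mu_ln, sig_ln, x_med, p_exact, p_approx, c_Z)
```

With $c_X = 0.2$ the approximation $\sigma_{\ln X} \approx c_X$ changes the computed probability only in the third decimal place.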

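Uniform: $X \sim U[a, b]$, with $a \le X \le b$

$$ f(x) = \begin{cases} \dfrac{1}{b-a}, & x \in [a, b] \\ 0, & \text{otherwise} \end{cases} $$

$$ F(x) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{b-a}, & x \in [a, b] \\ 1, & x \ge b \end{cases} $$

$$ \mu_X = x_{\text{med}} = \frac{1}{2}(a + b) , \qquad \sigma_X^2 = \frac{1}{12}(b - a)^2 $$

x = a + (b-a)*rand(1,N);

Triangular: $X \sim T(a, b, c)$, with $a \le X \le b$ and $a \le c \le b$

$$ f(x) = \begin{cases} \dfrac{2(x-a)}{(b-a)(c-a)}, & x \in [a, c] \\ \dfrac{2(b-x)}{(b-a)(b-c)}, & x \in [c, b] \\ 0, & \text{otherwise} \end{cases} $$

$$ F(x) = \begin{cases} 0, & x \le a \\ \dfrac{(x-a)^2}{(b-a)(c-a)}, & x \in [a, c] \\ 1 - \dfrac{(b-x)^2}{(b-a)(b-c)}, & x \in [c, b] \\ 1, & x \ge b \end{cases} $$

$$ \mu_X = \frac{1}{3}(a + b + c) , \qquad \sigma_X^2 = \frac{1}{18}(a^2 + b^2 + c^2 - ab - ac - bc) $$

x = triangular_rnd(a,b,c,1,N);

[Figure: p.d.f. $f(x)$ and c.d.f. $F(x)$ of the uniform distribution (left column) and the triangular distribution (right column), plotted against $x$. The uniform p.d.f. has height $1/(b-a)$ and the triangular p.d.f. peaks at $2/(b-a)$ at $x = c$. The uniform c.d.f. is marked at 0.21, 0.5 and 0.79 for $x = \mu - \sigma$, $\mu$, $\mu + \sigma$; the triangular c.d.f. is marked at 0.17, 0.55, 0.82 and 0.97 for $x = \mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$. Below, empirical p.d.f.s and empirical c.d.f.s of samples from each distribution with $\mu = 1.0$ and $\sigma = 0.5$.]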
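One way to generate samples from these two distributions is to invert the CDF and apply it to uniform random numbers on $[0, 1]$. The Python sketch below is a minimal illustration under assumed parameter values; the function names `triangular_cdf` and `triangular_inv_cdf` are hypothetical and are not the course's `triangular_rnd` implementation. The argument order $(a, b, c)$ follows $T(a, b, c)$: lower bound, upper bound, mode.

```python
import math
import random
import statistics


def triangular_cdf(x, a, b, c):
    """CDF of T(a, b, c) with lower bound a, upper bound b and mode c."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    if x <= c:
        return (x - a) ** 2 / ((b - a) * (c - a))
    return 1.0 - (b - x) ** 2 / ((b - a) * (b - c))


def triangular_inv_cdf(u, a, b, c):
    """Inverse CDF of T(a, b, c) for 0 <= u <= 1."""
    if u < (c - a) / (b - a):                      # F(c) = (c-a)/(b-a)
        return a + math.sqrt(u * (b - a) * (c - a))
    return b - math.sqrt((1.0 - u) * (b - a) * (b - c))


random.seed(0)                 # fixed seed so the sketch is repeatable
a, b, c = 0.0, 3.0, 1.0        # assumed example parameters
N = 20000

x_uni = [a + (b - a) * random.random() for _ in range(N)]              # U[a, b]
x_tri = [triangular_inv_cdf(random.random(), a, b, c) for _ in range(N)]

# compare sample statistics with the formulas for the mean and variance
mu_uni, var_uni = (a + b) / 2.0, (b - a) ** 2 / 12.0
mu_tri = (a + b + c) / 3.0
var_tri = (a * a + b * b + c * c - a * b - a * c - b * c) / 18.0
print(statistics.fmean(x_uni), statistics.variance(x_uni), mu_uni, var_uni)
print(statistics.fmean(x_tri), statistics.variance(x_tri), mu_tri, var_tri)
```

The sample means and variances should approach the closed-form values as N grows.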
