STA561: Probabilistic machine learning

Gaussian Models (9/9/13)

Lecturer: Barbara Engelhardt

Scribes: Xi He, Jiangwei Pan, Ali Razeen, Animesh Srivastava

1 Multivariate Normal Distribution

The multivariate normal distribution (MVN), also known as the multivariate Gaussian, is a generalization of the one-dimensional normal distribution to higher dimensions. The probability density function (pdf) of an MVN for a random vector $x \in \mathbb{R}^d$ is as follows:

$$\mathcal{N}(x \mid \mu, \Sigma) \triangleq \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right] \tag{1}$$

where $\mu = \mathbb{E}[x] \in \mathbb{R}^d$ is the mean vector, and $\Sigma = \mathrm{cov}[x]$ is a $d \times d$ symmetric positive definite matrix, known as the covariance matrix. $\Sigma^{-1}$ is known as the precision matrix.
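As a quick numerical illustration of Eqn (1), here is a minimal sketch in Python/numpy (the `mvn_pdf` helper and the example parameters are our own, for illustration only; `scipy.stats.multivariate_normal` provides the same density):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate the MVN density of Eqn (1) at the point x."""
    d = len(mu)
    diff = x - mu
    # Compute (x - mu)^T Sigma^{-1} (x - mu) without forming the explicit inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(mu, mu, Sigma))  # the density is highest at the mean
```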

[Figure 1: 2-dimensional Gaussian density, showing the means $\mu_1$, $\mu_2$ on the axes $X_1$, $X_2$, a red contour (level set), and the 1st and 2nd eigenvectors of $\Sigma$.]

Fig. 1 shows a 2-dimensional Gaussian density. The random vector spans two dimensions, denoted in the plot by $X_1$ (x-axis) and $X_2$ (y-axis). The means of $X_1$ and $X_2$ are $\mu_1$ and $\mu_2$, respectively. The density is highest at $\mu$, and it decreases as the random vector moves away from $\mu$. All of the points on the red contour (level set) have the same density. The first and second eigenvectors of the covariance matrix are orthogonal to each other, as shown in Fig. 1. The first eigenvector is the direction of maximum variance in the MVN; the second eigenvector is orthogonal to the first.
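The eigenstructure above is easy to inspect numerically; a small sketch (the covariance matrix is an arbitrary example):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
# For a symmetric matrix, eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
v1 = eigvecs[:, -1]  # 1st eigenvector: direction of maximum variance
v2 = eigvecs[:, 0]   # 2nd eigenvector: orthogonal to the first
assert np.isclose(v1 @ v2, 0.0)
```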

The expression inside the exponent can be rewritten as $\sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$. The Mahalanobis distance between two vectors $x_1$ and $x_2$ is this same quantity, treating one of the two points as the argument and the other as the mean:

$$md(x_1, x_2) = \sqrt{(x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2)}. \tag{2}$$

Note that this distance is symmetric.
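A minimal sketch of Eqn (2) (the vectors and covariance are illustrative; `scipy.spatial.distance.mahalanobis` offers an equivalent, taking the precision matrix directly):

```python
import numpy as np

def mahalanobis(x1, x2, Sigma):
    """Mahalanobis distance of Eqn (2); symmetric in x1 and x2."""
    diff = x1 - x2
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 2.0])
assert np.isclose(mahalanobis(x1, x2, Sigma), mahalanobis(x2, x1, Sigma))
```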

2 MLE for MVN

In order to determine the MLE for an MVN, we need some basic results from linear algebra. Recall the following definitions.

- The trace of a matrix $A \in \mathbb{R}^{d \times d}$ is defined as the sum of $A$'s diagonal elements, i.e., $\mathrm{tr}(A) = \sum_{i=1}^d A_{ii}$.

- The determinant of a matrix $A \in \mathbb{R}^{d \times d}$ is defined as the product of its eigenvalues. A positive definite matrix $A$ has positive eigenvalues, so its determinant is always positive.

- Symmetric positive definite matrices (as we will consider for our covariance matrices) are defined as having eigenvalues that are strictly positive.

- Trace has the cyclic permutation property: $\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)$.

Given vectors a, b and matrices A, B, C, we have the following facts:

- $\dfrac{\partial\, b^T a}{\partial a} = b$

- $\dfrac{\partial\, (a^T A a)}{\partial a} = (A + A^T)a$. (Note that if $A$ is symmetric, this equals $2Aa$.)

- $\dfrac{\partial\, \mathrm{tr}(BA)}{\partial A} = B^T$

- $\dfrac{\partial \log |A|}{\partial A} = A^{-T} \triangleq (A^{-1})^T$

- Trace trick: $a^T A a = \mathrm{tr}(a^T A a) = \mathrm{tr}(a a^T A) = \mathrm{tr}(A a a^T)$
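These identities are easy to sanity-check numerically; a quick sketch with random matrices (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
a = rng.standard_normal(d)
A, B, C = (rng.standard_normal((d, d)) for _ in range(3))

# Cyclic permutation property: tr(ABC) = tr(CAB) = tr(BCA).
t = np.trace(A @ B @ C)
assert np.isclose(t, np.trace(C @ A @ B))
assert np.isclose(t, np.trace(B @ C @ A))

# Trace trick: a^T A a = tr(a a^T A) = tr(A a a^T).
q = a @ A @ a
assert np.isclose(q, np.trace(np.outer(a, a) @ A))
assert np.isclose(q, np.trace(A @ np.outer(a, a)))
```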

We will be using these aforementioned facts in deriving the MLEs, $\hat{\mu}_{MLE}$ and $\hat{\Sigma}_{MLE}$, for an MVN given a data set $D = \{x_1, x_2, x_3, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$ is a sample vector from the MVN. The log likelihood of the data set $D$ given MVN parameters $\mu, \Sigma$ can be written (dropping the additive constant $-\frac{nd}{2}\log(2\pi)$) as

$$L(\mu, \Sigma; D) = \log \prod_{i=1}^n p(x_i \mid \mu, \Sigma) \tag{3}$$
$$= \frac{n}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^n \mathrm{tr}\left[(x_i - \mu)(x_i - \mu)^T \Sigma^{-1}\right]$$

Setting the partial derivative with respect to $\mu$ to 0,

$$\frac{\partial L(\mu, \Sigma; D)}{\partial \mu} = -\frac{1}{2} \sum_{i=1}^n \left[-2\,\Sigma^{-1}(x_i - \mu)\right] = 0, \tag{4}$$


we get the MLE of $\mu$ as follows:

$$\sum_{i=1}^n (x_i - \mu) = 0$$
$$\sum_{i=1}^n x_i - \sum_{i=1}^n \mu = 0$$
$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i \tag{5}$$

This means that the MLE of $\mu$ for an MVN is just the empirical mean of the samples.

Similarly, setting the partial derivative of the log likelihood (Equation 3) with respect to $\Sigma^{-1}$ to 0,

$$\frac{\partial L(\mu, \Sigma; D)}{\partial \Sigma^{-1}} = \frac{n}{2}\,\Sigma - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T = 0,$$

we get the MLE of $\Sigma$ as follows:

$$\hat{\Sigma}_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{MLE})(x_i - \hat{\mu}_{MLE})^T \tag{6}$$

This expression is just the empirical covariance of the data, centered at $\hat{\mu}_{MLE}$.
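Eqns (5) and (6) translate directly into code; a minimal sketch on simulated data (the true parameters are arbitrary; note the $1/n$ normalization, unlike the $1/(n-1)$ default of `np.cov`):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)  # shape (n, d)

mu_mle = X.mean(axis=0)                      # Eqn (5): empirical mean
centered = X - mu_mle
Sigma_mle = centered.T @ centered / len(X)   # Eqn (6): 1/n empirical covariance

print(mu_mle)     # close to mu_true
print(Sigma_mle)  # close to Sigma_true
```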

3 The MVN is in the Exponential Family

We have already seen that if $x \sim \mathcal{N}(\mu, \Sigma)$ then $\mathbb{E}[x] = \mu$ and $\mathrm{cov}[x] = \Sigma$. These are also called the mean or the moment parameters of the distribution. We can also express the MVN in exponential family form in terms of the natural parameters as

$$\Lambda = \Sigma^{-1}, \quad \eta = \Sigma^{-1}\mu. \tag{7}$$

Similarly, we can convert the natural parameters back to moment parameters as

$$\Sigma = \Lambda^{-1}, \quad \mu = \Lambda^{-1}\eta. \tag{8}$$

Note that the natural parameter corresponding to the covariance matrix is the precision matrix. Also note that the relationship between the mean parameters and the natural parameters is invertible, so the MLE for the natural parameters can be converted into the MLE for the mean parameters (and vice versa). This enables us to work in whichever space is most mathematically convenient, and convert between parameterizations afterwards.
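Since the map between parameterizations is invertible, the conversion is a short round trip in code; a sketch of Eqns (7) and (8) (the helper names are ours):

```python
import numpy as np

def to_natural(mu, Sigma):
    """Eqn (7): Lambda = Sigma^{-1}, eta = Sigma^{-1} mu."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam

def to_moment(eta, Lam):
    """Eqn (8): Sigma = Lambda^{-1}, mu = Lambda^{-1} eta."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ eta, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
eta, Lam = to_natural(mu, Sigma)
mu_back, Sigma_back = to_moment(eta, Lam)
assert np.allclose(mu, mu_back) and np.allclose(Sigma, Sigma_back)
```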

We can rewrite the MVN density in Eqn (1) in exponential family form using the natural parameters as follows:

$$
\begin{aligned}
P(x \mid \eta, \Lambda) &= (2\pi)^{-d/2}\,|\Lambda|^{1/2} \exp\Big[-\frac{1}{2}\big(x^T \Lambda x + \eta^T \Lambda^{-1} \eta - 2 x^T \eta\big)\Big] \\
&= (2\pi)^{-d/2}\,|\Lambda|^{1/2} \exp\Big[\eta^T x - \frac{1}{2} x^T \Lambda x - \frac{1}{2}\,\eta^T \Lambda^{-1} \eta\Big] \\
&= (2\pi)^{-d/2}\,|\Lambda|^{1/2} \exp\Big[\eta^T x - \frac{1}{2}\,\mathrm{tr}(\Lambda x x^T) - \frac{1}{2}\,\eta^T \Lambda^{-1} \eta\Big]
\end{aligned} \tag{9}
$$

(Here $\eta^T \Lambda^{-1} \eta = \mu^T \Lambda \mu$, using $\mu = \Lambda^{-1}\eta$.)

Recall the exponential family form:

$$P(X \mid \eta) = h(X)\exp\{\eta^T T(X) - A(\eta)\}, \tag{10}$$


where $\eta$ in Eqn (10) denotes the natural parameter vector, and $T(X)$ is the vector of sufficient statistics for the MVN.

We can see then that Eqn (9) is in exponential family form, and we can read off the sufficient statistics of the MVN:





$$T(x) = \begin{bmatrix} x \\ x x^T \end{bmatrix},$$

and the natural parameters are

$$\tilde{\eta} = \begin{bmatrix} \eta \\ -\frac{1}{2}\Lambda \end{bmatrix},$$

and the log partition function is

$$A(\eta, \Lambda) = \frac{1}{2}\,\eta^T \Lambda^{-1} \eta = \frac{1}{2}\,\mathrm{tr}(\Lambda^{-1}\eta\eta^T).$$

Thus, the sufficient statistics for an MVN are $\sum_i x_i$ and $\sum_i x_i x_i^T$: the empirical mean and the empirical (uncentered) second moment, which together give the empirical covariance.
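One practical consequence: because only $\sum_i x_i$ and $\sum_i x_i x_i^T$ are needed, the MLE can be computed from running sums without storing the data. A sketch (the streaming loop and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 10_000
sum_x = np.zeros(d)
sum_xxT = np.zeros((d, d))

# Accumulate the sufficient statistics one sample at a time.
for _ in range(n):
    x = rng.standard_normal(d)
    sum_x += x
    sum_xxT += np.outer(x, x)

mu_mle = sum_x / n
# Empirical covariance from the uncentered second moment.
Sigma_mle = sum_xxT / n - np.outer(mu_mle, mu_mle)
```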

4 Marginals and Conditionals for an MVN

Let's consider an example where $d = 2$. If it is simpler, let $X = (x_1, x_2)$, where $x_1$ and $x_2$ are scalars. However, if we instead consider $x_1$ and $x_2$ to be a split of the dimensions of an MVN with $d > 2$, where each of $x_1$ and $x_2$ is a vector, everything in this section goes through naturally. Suppose that $x_1$ and $x_2$ are jointly Gaussian:

 









$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \quad \text{and hence} \quad \Lambda = \Sigma^{-1} = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}.$$

Can we find the marginal and conditional distributions in this space? Recall that

$$P(x_1) = \int_{x_2} \mathcal{N}(x_1, x_2 \mid \mu, \Sigma)\, dx_2.$$

Using this model, we can derive the following distributions with Eqn (8):

- Marginal: $P(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11})$. In natural parameters, the marginal of $x_1$ has $\eta_1^{m} = \eta_1 - \Lambda_{12}\Lambda_{22}^{-1}\eta_2$ and precision $\Lambda_1^{m} = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}$.

- Marginal (equivalent): $P(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$, with $\eta_2^{m} = \eta_2 - \Lambda_{21}\Lambda_{11}^{-1}\eta_1$ and precision $\Lambda_2^{m} = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}$.

- Conditional distribution: $P(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$, where $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$; or, more concisely in natural parameters, $\eta_{1|2} = \eta_1 - \Lambda_{12}x_2$ and $\Sigma_{1|2} = \Lambda_{11}^{-1}$.

The converse conditional distribution, $p(x_2 \mid x_1)$, is written out equivalently (swapping the 1, 2 indices). These formulas are derived using the Schur complement of a matrix and the matrix inversion lemma. Note that conditional probabilities are straightforward to work with in the natural parameter space, whereas marginal probabilities are much simpler in the mean parameter space.

The marginal distribution in the mean parameter space is a simple projection of (for example) a 2D MVN cloud onto each of the univariate Gaussian distributions in one dimension. The conditional distribution is a similar projection, but considering only a slice of the space at the conditioning random variable. When the off-diagonal elements of the covariance matrix are 0, the conditional distribution is identical to the marginal distribution (as the two univariate Gaussians are independent).
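The conditional formulas above are straightforward to implement from a partitioned $\mu$ and $\Sigma$; a minimal sketch (the split point `k` and the example values are arbitrary):

```python
import numpy as np

def condition_on_x2(mu, Sigma, x2, k):
    """P(x_1 | x_2 = x2) for a joint MVN; x_1 is the first k coordinates."""
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    # mu_{1|2} = mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    # Sigma_{1|2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 (= Lambda_11^{-1})
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = condition_on_x2(mu, Sigma, x2=np.array([1.2, -0.7]), k=1)
```

When $\Sigma_{12} = 0$, `mu_cond` reduces to `mu1` and `Sigma_cond` to `S11`, recovering the marginal, as noted above.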


5 Conjugate prior

The conjugate prior for the mean term $\mu$ of a multivariate normal distribution is a multivariate normal distribution:

$$p(\mu \mid X) \propto p(\mu)\,p(X \mid \mu), \tag{11}$$

where $p(\mu)$ is a multivariate normal distribution, $\mu \sim \mathcal{N}(\mu_0, \Sigma_0)$. The implication of this prior is that the mean term has a Gaussian distribution across the space in which it might lie: generally, large values of $\Sigma_0$ are preferable unless we have good prior information about the mean term (e.g., that it will be right around zero).

The conjugate prior for the covariance matrix $\Sigma$ of a multivariate normal distribution is the inverse Wishart distribution:

$$p(\Sigma \mid X) \propto p(\Sigma)\,p(X \mid \Sigma), \tag{12}$$

where $p(\Sigma)$ is an inverse Wishart distribution, $\Sigma \sim \mathcal{IW}(\nu, \Psi)$. The inverse Wishart is a pdf over positive definite matrices.
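The notes stop at naming the conjugate priors. As an illustrative sketch of how the Gaussian prior on $\mu$ is used, here is the standard closed-form posterior update for $\mu$ when $\Sigma$ is known (an assumption not made explicit above; the function is ours):

```python
import numpy as np

def posterior_mean_update(X, Sigma, mu0, Sigma0):
    """Posterior p(mu | X) = N(mu_n, Sigma_n) for a known covariance Sigma.

    Standard conjugate update:
      Sigma_n = (Sigma0^{-1} + n Sigma^{-1})^{-1}
      mu_n    = Sigma_n (Sigma0^{-1} mu0 + n Sigma^{-1} xbar)
    """
    n = X.shape[0]
    xbar = X.mean(axis=0)
    prec0 = np.linalg.inv(Sigma0)
    prec = np.linalg.inv(Sigma)
    Sigma_n = np.linalg.inv(prec0 + n * prec)
    mu_n = Sigma_n @ (prec0 @ mu0 + n * (prec @ xbar))
    return mu_n, Sigma_n
```

Consistent with the advice above, a large $\Sigma_0$ makes the prior weak, and $\mu_n$ approaches the empirical mean.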
