STA561: Probabilistic machine learning

Gaussian Models (9/9/13)

Lecturer: Barbara Engelhardt

Scribes: Xi He, Jiangwei Pan, Ali Razeen, Animesh Srivastava

1 Multivariate Normal Distribution

The multivariate normal distribution (MVN), also known as the multivariate Gaussian, is a generalization of the one-dimensional normal distribution to higher dimensions. The probability density function (pdf) of an MVN for a random vector $x \in \mathbb{R}^d$ is as follows:

$$\mathcal{N}(x \mid \mu, \Sigma) \triangleq \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right] \tag{1}$$

where $\mu = \mathbb{E}[x] \in \mathbb{R}^d$ is the mean vector, and $\Sigma = \mathrm{cov}[x]$ is a $d \times d$ symmetric positive definite matrix, known as the covariance matrix. $\Sigma^{-1}$ is known as the precision matrix.
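As a quick numerical illustration of Eqn (1), here is a minimal sketch in Python/numpy (the `mvn_pdf` helper and the example parameters are our own, for illustration only; `scipy.stats.multivariate_normal` provides the same density):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate the MVN density of Eqn (1) at the point x."""
    d = len(mu)
    diff = x - mu
    # Compute (x - mu)^T Sigma^{-1} (x - mu) without forming the explicit inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(mu, mu, Sigma))  # the density is highest at the mean
```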

[Figure 1: 2-dimensional Gaussian density, showing the means $\mu_1$, $\mu_2$ on the axes $X_1$, $X_2$, a red contour (level set), and the 1st and 2nd eigenvectors of $\Sigma$.]

Fig. 1 shows a 2-dimensional Gaussian density. The random vector spans two dimensions, denoted in the plot by $X_1$ (x-axis) and $X_2$ (y-axis). The means of $X_1$ and $X_2$ are $\mu_1$ and $\mu_2$, respectively. The density is highest at $\mu$, and it decreases as the random vector moves away from $\mu$. All of the points on the red contour (level set) have the same density. The first and second eigenvectors of the covariance matrix are orthogonal to each other, as shown in Fig. 1. The first eigenvector is the direction of maximum variance in the MVN; the second eigenvector is orthogonal to the first.
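The eigenstructure above is easy to inspect numerically; a small sketch (the covariance matrix is an arbitrary example):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
# For a symmetric matrix, eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
v1 = eigvecs[:, -1]  # 1st eigenvector: direction of maximum variance
v2 = eigvecs[:, 0]   # 2nd eigenvector: orthogonal to the first
assert np.isclose(v1 @ v2, 0.0)
```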

The expression inside the exponent can be rewritten as $\sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$. The Mahalanobis distance between two vectors $x_1$ and $x_2$ is this same quantity, treating one of the two points as the argument and the other as the mean:

$$md(x_1, x_2) = \sqrt{(x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2)}. \tag{2}$$

Note that this distance is symmetric.
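A minimal sketch of Eqn (2) (the vectors and covariance are illustrative; `scipy.spatial.distance.mahalanobis` offers an equivalent, taking the precision matrix directly):

```python
import numpy as np

def mahalanobis(x1, x2, Sigma):
    """Mahalanobis distance of Eqn (2); symmetric in x1 and x2."""
    diff = x1 - x2
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 2.0])
assert np.isclose(mahalanobis(x1, x2, Sigma), mahalanobis(x2, x1, Sigma))
```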

2 MLE for MVN

In order to determine the MLE for an MVN, we need some basic results from linear algebra. Recall the following definitions.

- The trace of a matrix $A \in \mathbb{R}^{d \times d}$ is defined as the sum of $A$'s diagonal elements, i.e., $\mathrm{tr}(A) = \sum_{i=1}^d A_{ii}$.

- The determinant of a matrix $A \in \mathbb{R}^{d \times d}$ is defined as the product of its eigenvalues. A positive definite matrix $A$ has positive eigenvalues, so its determinant is always positive.

- Symmetric positive definite matrices (as we will consider for our covariance matrices) are defined as having eigenvalues that are strictly positive.

- Trace has the cyclic permutation property: $\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)$.

Given vectors a, b and matrices A, B, C, we have the following facts:

- $\dfrac{\partial\, b^T a}{\partial a} = b$

- $\dfrac{\partial\, (a^T A a)}{\partial a} = (A + A^T)a$. (Note that if $A$ is symmetric, this equals $2Aa$.)

- $\dfrac{\partial\, \mathrm{tr}(BA)}{\partial A} = B^T$

- $\dfrac{\partial \log |A|}{\partial A} = A^{-T} \triangleq (A^{-1})^T$

- Trace trick: $a^T A a = \mathrm{tr}(a^T A a) = \mathrm{tr}(a a^T A) = \mathrm{tr}(A a a^T)$
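These identities are easy to sanity-check numerically; a quick sketch with random matrices (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
a = rng.standard_normal(d)
A, B, C = (rng.standard_normal((d, d)) for _ in range(3))

# Cyclic permutation property: tr(ABC) = tr(CAB) = tr(BCA).
t = np.trace(A @ B @ C)
assert np.isclose(t, np.trace(C @ A @ B))
assert np.isclose(t, np.trace(B @ C @ A))

# Trace trick: a^T A a = tr(a a^T A) = tr(A a a^T).
q = a @ A @ a
assert np.isclose(q, np.trace(np.outer(a, a) @ A))
assert np.isclose(q, np.trace(A @ np.outer(a, a)))
```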

We will be using these aforementioned facts in deriving the MLEs, $\hat{\mu}_{MLE}$ and $\hat{\Sigma}_{MLE}$, for an MVN given a data set $D = \{x_1, x_2, x_3, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$ is a sample vector from the MVN. The log likelihood of the data set $D$ given MVN parameters $\mu, \Sigma$ can be written (dropping the additive constant $-\frac{nd}{2}\log(2\pi)$) as

$$L(\mu, \Sigma; D) = \log \prod_{i=1}^n p(x_i \mid \mu, \Sigma) \tag{3}$$
$$= \frac{n}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^n \mathrm{tr}\left[(x_i - \mu)(x_i - \mu)^T \Sigma^{-1}\right]$$

Setting the partial derivative with respect to $\mu$ to 0,

$$\frac{\partial L(\mu, \Sigma; D)}{\partial \mu} = -\frac{1}{2} \sum_{i=1}^n \left[-2\,\Sigma^{-1}(x_i - \mu)\right] = 0, \tag{4}$$


we get the MLE of $\mu$ as follows:

$$\sum_{i=1}^n (x_i - \mu) = 0$$
$$\sum_{i=1}^n x_i - \sum_{i=1}^n \mu = 0$$
$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i \tag{5}$$

This means that the MLE of $\mu$ for an MVN is just the empirical mean of the samples.

Similarly, setting the partial derivative of the log likelihood (Equation 3) with respect to $\Sigma^{-1}$ to 0,

$$\frac{\partial L(\mu, \Sigma; D)}{\partial \Sigma^{-1}} = \frac{n}{2}\,\Sigma - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T = 0,$$

we get the MLE of $\Sigma$ as follows:

$$\hat{\Sigma}_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{MLE})(x_i - \hat{\mu}_{MLE})^T \tag{6}$$

This expression is just the empirical covariance of the data, centered at $\hat{\mu}_{MLE}$.
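Eqns (5) and (6) translate directly into code; a minimal sketch on simulated data (the true parameters are arbitrary; note the $1/n$ normalization, unlike the $1/(n-1)$ default of `np.cov`):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)  # shape (n, d)

mu_mle = X.mean(axis=0)                      # Eqn (5): empirical mean
centered = X - mu_mle
Sigma_mle = centered.T @ centered / len(X)   # Eqn (6): 1/n empirical covariance

print(mu_mle)     # close to mu_true
print(Sigma_mle)  # close to Sigma_true
```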

3 The MVN is in the Exponential Family

We have already seen that if $x \sim \mathcal{N}(\mu, \Sigma)$ then $\mathbb{E}[x] = \mu$ and $\mathrm{cov}[x] = \Sigma$. These are also called the mean or the moment parameters of the distribution. We can also express the MVN in exponential family form in terms of the natural parameters as

$$\Lambda = \Sigma^{-1}, \quad \eta = \Sigma^{-1}\mu. \tag{7}$$

Similarly, we can convert the natural parameters back to moment parameters as

$$\Sigma = \Lambda^{-1}, \quad \mu = \Lambda^{-1}\eta. \tag{8}$$

Note that the natural parameter corresponding to the covariance matrix is the precision matrix. Also note that the relationship between the mean parameters and the natural parameters is invertible, so the MLE for the natural parameters can be converted into the MLE for the mean parameters (and vice versa). This enables us to work in whichever space is most mathematically convenient, and convert between parameterizations afterwards.
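Since the map between parameterizations is invertible, the conversion is a short round trip in code; a sketch of Eqns (7) and (8) (the helper names are ours):

```python
import numpy as np

def to_natural(mu, Sigma):
    """Eqn (7): Lambda = Sigma^{-1}, eta = Sigma^{-1} mu."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam

def to_moment(eta, Lam):
    """Eqn (8): Sigma = Lambda^{-1}, mu = Lambda^{-1} eta."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ eta, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
eta, Lam = to_natural(mu, Sigma)
mu_back, Sigma_back = to_moment(eta, Lam)
assert np.allclose(mu, mu_back) and np.allclose(Sigma, Sigma_back)
```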

We can rewrite the MVN density in Eqn (1) in exponential family form using the natural parameters as follows:

$$
\begin{aligned}
P(x \mid \eta, \Lambda) &= (2\pi)^{-d/2}\,|\Lambda|^{1/2} \exp\Big[-\frac{1}{2}\big(x^T \Lambda x + \eta^T \Lambda^{-1} \eta - 2 x^T \eta\big)\Big] \\
&= (2\pi)^{-d/2}\,|\Lambda|^{1/2} \exp\Big[\eta^T x - \frac{1}{2} x^T \Lambda x - \frac{1}{2}\,\eta^T \Lambda^{-1} \eta\Big] \\
&= (2\pi)^{-d/2}\,|\Lambda|^{1/2} \exp\Big[\eta^T x - \frac{1}{2}\,\mathrm{tr}(\Lambda x x^T) - \frac{1}{2}\,\eta^T \Lambda^{-1} \eta\Big]
\end{aligned} \tag{9}
$$

(Here $\eta^T \Lambda^{-1} \eta = \mu^T \Lambda \mu$, using $\mu = \Lambda^{-1}\eta$.)

Recall the exponential family form:

$$P(X \mid \eta) = h(X)\exp\{\eta^T T(X) - A(\eta)\}, \tag{10}$$


where $\eta$ in Eqn (10) denotes the natural parameter vector, and $T(X)$ is the vector of sufficient statistics for the MVN.

We can see then that Eqn (9) is in exponential family form, and we can read off the sufficient statistics of the MVN:





$$T(x) = \begin{bmatrix} x \\ x x^T \end{bmatrix},$$

and the natural parameters are

$$\tilde{\eta} = \begin{bmatrix} \eta \\ -\frac{1}{2}\Lambda \end{bmatrix},$$

and the log partition function is

$$A(\eta, \Lambda) = \frac{1}{2}\,\eta^T \Lambda^{-1} \eta = \frac{1}{2}\,\mathrm{tr}(\Lambda^{-1}\eta\eta^T).$$

Thus, the sufficient statistics for an MVN are $\sum_i x_i$ and $\sum_i x_i x_i^T$: the empirical mean and the empirical (uncentered) second moment, which together give the empirical covariance.
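One practical consequence: because only $\sum_i x_i$ and $\sum_i x_i x_i^T$ are needed, the MLE can be computed from running sums without storing the data. A sketch (the streaming loop and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 10_000
sum_x = np.zeros(d)
sum_xxT = np.zeros((d, d))

# Accumulate the sufficient statistics one sample at a time.
for _ in range(n):
    x = rng.standard_normal(d)
    sum_x += x
    sum_xxT += np.outer(x, x)

mu_mle = sum_x / n
# Empirical covariance from the uncentered second moment.
Sigma_mle = sum_xxT / n - np.outer(mu_mle, mu_mle)
```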

4 Marginals and Conditionals for an MVN

Let's consider an example where $d = 2$. If it is simpler, let $X = (x_1, x_2)$, where $x_1$ and $x_2$ are scalars. However, if we instead consider $x_1$ and $x_2$ to be a split of the dimensions of an MVN with $d > 2$, where each of $x_1$ and $x_2$ is a vector, everything in this section goes through naturally. Suppose that $x_1$ and $x_2$ are jointly Gaussian:

 









$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \quad \text{and hence} \quad \Lambda = \Sigma^{-1} = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}.$$

Can we find the marginal and conditional distributions in this space? Recall that

$$P(x_1) = \int_{x_2} \mathcal{N}(x_1, x_2 \mid \mu, \Sigma)\, dx_2.$$

Using this model, we can derive the following distributions with Eqn (8):

- Marginal: $P(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11})$. In natural parameters, the marginal of $x_1$ has $\eta_1^{m} = \eta_1 - \Lambda_{12}\Lambda_{22}^{-1}\eta_2$ and precision $\Lambda_1^{m} = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}$.

- Marginal (equivalent): $P(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$, with $\eta_2^{m} = \eta_2 - \Lambda_{21}\Lambda_{11}^{-1}\eta_1$ and precision $\Lambda_2^{m} = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}$.

- Conditional distribution: $P(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$, where $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$; or, more concisely in natural parameters, $\eta_{1|2} = \eta_1 - \Lambda_{12}x_2$ and $\Sigma_{1|2} = \Lambda_{11}^{-1}$.

The converse conditional distribution, $p(x_2 \mid x_1)$, is written out equivalently (swapping the 1, 2 indices). These formulas are derived using the Schur complement of a matrix and the matrix inversion lemma. Note that conditional probabilities are straightforward to work with in the natural parameter space, whereas marginal probabilities are much simpler in the mean parameter space.

The marginal distribution in the mean parameter space is a simple projection of (for example) a 2D MVN cloud onto each of the univariate Gaussian distributions in one dimension. The conditional distribution is a similar projection, but considering only a slice of the space at the conditioning random variable. When the off-diagonal elements of the covariance matrix are 0, the conditional distribution is identical to the marginal distribution (as the two univariate Gaussians are independent).
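The conditional formulas above are straightforward to implement from a partitioned $\mu$ and $\Sigma$; a minimal sketch (the split point `k` and the example values are arbitrary):

```python
import numpy as np

def condition_on_x2(mu, Sigma, x2, k):
    """P(x_1 | x_2 = x2) for a joint MVN; x_1 is the first k coordinates."""
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    # mu_{1|2} = mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    # Sigma_{1|2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 (= Lambda_11^{-1})
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = condition_on_x2(mu, Sigma, x2=np.array([1.2, -0.7]), k=1)
```

When $\Sigma_{12} = 0$, `mu_cond` reduces to `mu1` and `Sigma_cond` to `S11`, recovering the marginal, as noted above.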


5 Conjugate prior

The conjugate prior for the mean term $\mu$ of a multivariate normal distribution is a multivariate normal distribution:

$$p(\mu \mid X) \propto p(\mu)\,p(X \mid \mu), \tag{11}$$

where $p(\mu)$ is a multivariate normal distribution, $\mu \sim \mathcal{N}(\mu_0, \Sigma_0)$. The implication of this prior is that the mean term has a Gaussian distribution across the space in which it might lie: generally, large values of $\Sigma_0$ are preferable unless we have good prior information about the mean term (e.g., that it will be right around zero).

The conjugate prior for the covariance matrix $\Sigma$ of a multivariate normal distribution is the inverse Wishart distribution:

$$p(\Sigma \mid X) \propto p(\Sigma)\,p(X \mid \Sigma), \tag{12}$$

where $p(\Sigma)$ is an inverse Wishart distribution, $\Sigma \sim \mathcal{IW}(\nu, \Psi)$. The inverse Wishart is a pdf over positive definite matrices.
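The notes stop at naming the conjugate priors. As an illustrative sketch of how the Gaussian prior on $\mu$ is used, here is the standard closed-form posterior update for $\mu$ when $\Sigma$ is known (an assumption not made explicit above; the function is ours):

```python
import numpy as np

def posterior_mean_update(X, Sigma, mu0, Sigma0):
    """Posterior p(mu | X) = N(mu_n, Sigma_n) for a known covariance Sigma.

    Standard conjugate update:
      Sigma_n = (Sigma0^{-1} + n Sigma^{-1})^{-1}
      mu_n    = Sigma_n (Sigma0^{-1} mu0 + n Sigma^{-1} xbar)
    """
    n = X.shape[0]
    xbar = X.mean(axis=0)
    prec0 = np.linalg.inv(Sigma0)
    prec = np.linalg.inv(Sigma)
    Sigma_n = np.linalg.inv(prec0 + n * prec)
    mu_n = Sigma_n @ (prec0 @ mu0 + n * (prec @ xbar))
    return mu_n, Sigma_n
```

Consistent with the advice above, a large $\Sigma_0$ makes the prior weak, and $\mu_n$ approaches the empirical mean.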
