STA561: Probabilistic machine learning
Gaussian Models (9/9/13)
Lecturer: Barbara Engelhardt
Scribes: Xi He, Jiangwei Pan, Ali Razeen, Animesh Srivastava

1 Multivariate Normal Distribution
The multivariate normal distribution (MVN), also known as the multivariate Gaussian, is a generalization of the one-dimensional normal distribution to higher dimensions. The probability density function (pdf) of an MVN for a random vector x ∈ ℝ^d is

N(x | μ, Σ) ≜ (2π)^{−d/2} |Σ|^{−1/2} exp[ −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) ],    (1)
where μ = E[x] ∈ ℝ^d is the mean vector, and Σ = cov[x] is a d × d symmetric positive definite matrix, known as the covariance matrix. Σ⁻¹ is known as the precision matrix.

[Figure 1: a 2-dimensional Gaussian density over (X1, X2), showing the mean (μ1, μ2), a level set, and the first and second eigenvectors of Σ.]
Fig. 1 shows a 2-dimensional Gaussian density. The random vectors span two dimensions and are denoted in the plot by X1 (x-axis) and X2 (y-axis). The means of X1 and X2 are μ1 and μ2 respectively. The density is highest at μ, and it decreases as the random vector moves away from μ. All of the points on the red contour (level set) have the same density. The first and second eigenvectors of the covariance matrix are orthogonal to each other, as shown in Fig. 1. The first eigenvector is the direction of maximum variance in the MVN; the second eigenvector is orthogonal to the first.
The square root of the quadratic form inside the exponent, √((x − μ)ᵀ Σ⁻¹ (x − μ)), is the Mahalanobis distance between x and μ. More generally, the Mahalanobis distance between two vectors x1 and x2 treats one of the two points as the mean of the MVN:

md(x1, x2) = √((x1 − x2)ᵀ Σ⁻¹ (x1 − x2)).    (2)

Note that this distance is symmetric.
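As a numerical sanity check, Eqn (2) is easy to compute with NumPy; this is a minimal sketch, with the covariance matrix and the two points invented for illustration:

```python
import numpy as np

def mahalanobis(x1, x2, Sigma):
    """Mahalanobis distance between x1 and x2 under covariance Sigma, Eqn (2)."""
    diff = x1 - x2
    # Solve Sigma y = diff rather than forming Sigma^{-1} explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # illustrative covariance matrix
x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 0.0])

d12 = mahalanobis(x1, x2, Sigma)
d21 = mahalanobis(x2, x1, Sigma)  # symmetric: same value as d12
```

Unlike Euclidean distance, directions of high variance under Σ contribute less to the Mahalanobis distance.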
2 MLE for MVN
In order to determine the MLE for an MVN, we need some basic results from linear algebra. Recall the following definitions.

- The trace of a matrix A ∈ ℝ^{d×d} is defined as the sum of A's diagonal elements, i.e., tr(A) = ∑_{i=1}^d A_{ii}.
- The determinant of a matrix A ∈ ℝ^{d×d} is defined as the product of its eigenvalues. A positive definite matrix A has positive eigenvalues, so its determinant is always positive.
- Symmetric positive definite matrices (as we will consider for our covariance matrices) are defined as having eigenvalues that are strictly positive.
- The trace has the cyclic permutation property: tr(ABC) = tr(CAB) = tr(BCA).
Given vectors a, b and matrices A, B, we have the following facts:

- ∂(bᵀa)/∂a = b
- ∂(aᵀAa)/∂a = (A + Aᵀ)a. (Note that if A is symmetric, this equals 2Aa.)
- ∂tr(BA)/∂A = Bᵀ
- ∂log|A|/∂A = A⁻ᵀ ≜ (A⁻¹)ᵀ
- Trace trick: aᵀAa = tr(aᵀAa) = tr(aaᵀA) = tr(Aaaᵀ)
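These identities are easy to verify numerically; a small sketch checking the cyclic permutation property and the trace trick on random matrices (the dimension and seed are arbitrary):

```python
import numpy as np

# Numerical sanity check of the trace identities above, on random inputs.
rng = np.random.default_rng(0)
d = 4
a = rng.standard_normal(d)
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
C = rng.standard_normal((d, d))

# Cyclic permutation property: tr(ABC) = tr(CAB) = tr(BCA).
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)

# Trace trick: a^T A a = tr(a a^T A) = tr(A a a^T).
quad = a @ A @ a
q1 = np.trace(np.outer(a, a) @ A)
q2 = np.trace(A @ np.outer(a, a))
```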
We will be using these facts in deriving the MLEs, μ̂_MLE and Σ̂_MLE, for an MVN given a data set D = {x1, x2, ..., xn}, where each xi ∈ ℝ^d is a sample vector from the MVN. The log likelihood of the data set D given MVN parameters μ, Σ can be written (dropping the additive constant −(nd/2) log 2π) as

L(μ, Σ; D) = log ∏_{i=1}^n p(xi | μ, Σ)    (3)
           = (n/2) log|Σ⁻¹| − (1/2) ∑_{i=1}^n tr[(xi − μ)(xi − μ)ᵀ Σ⁻¹]
Setting the partial derivative with respect to μ to 0,

∂L(μ, Σ; D)/∂μ = −(1/2) ∑_{i=1}^n [−2Σ⁻¹(xi − μ)] = 0,    (4)
we get the MLE of μ as follows:
∑_{i=1}^n (xi − μ) = 0
∑_{i=1}^n xi − ∑_{i=1}^n μ = 0
μ̂_MLE = (1/n) ∑_{i=1}^n xi    (5)

This means that the MLE of μ for the MVN is just the empirical mean of the samples.
Similarly, setting the partial derivative of the log likelihood (Equation 3) with respect to Σ⁻¹ to 0,

∂L(μ, Σ; D)/∂Σ⁻¹ = (n/2) Σ − (1/2) ∑_{i=1}^n (xi − μ)(xi − μ)ᵀ = 0,
we get the MLE of Σ as follows:

Σ̂_MLE = (1/n) ∑_{i=1}^n (xi − μ)(xi − μ)ᵀ    (6)

This expression is just the empirical covariance of the data, centered at μ.
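Equations (5) and (6) translate directly into code; a minimal NumPy sketch, with the true parameters invented for illustration, showing that the MLEs recover them from a large sample:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[1.0, 0.3],
                       [0.3, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=50_000)  # rows are samples

n = X.shape[0]
mu_mle = X.mean(axis=0)                 # Eqn (5): empirical mean
centered = X - mu_mle
Sigma_mle = centered.T @ centered / n   # Eqn (6): empirical covariance (1/n, not 1/(n-1))
```

Note that the MLE divides by n, not the unbiased n − 1; for large n the difference is negligible.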
3 The MVN is in the Exponential Family
We have already seen that if x ∼ N(μ, Σ) then E[x] = μ and cov[x] = Σ. These are also called the mean (or moment) parameters of the distribution. We can also express the MVN in exponential family form in terms of the natural parameters as

Λ = Σ⁻¹,  η = Σ⁻¹μ.    (7)

Similarly, we can convert the natural parameters back to moment parameters as

Σ = Λ⁻¹,  μ = Λ⁻¹η.    (8)
Note that the natural parameter covariance matrix is the precision matrix. Also note that the relationship
between the mean parameters and the natural parameters is an invertible relationship, so the MLE for the
natural parameters can be converted into the MLE for the mean parameters (and vice versa). This enables
us to work in the most mathematically convenient space, and convert afterwards between parameterizations.
We can rewrite the MVN density, Eqn (1), in exponential family form using the natural parameters as follows:

P(x | η, Λ) = (2π)^{−d/2} |Λ|^{1/2} exp[ −(1/2)(xᵀΛx + ηᵀΛ⁻¹η − 2xᵀη) ]
            = (2π)^{−d/2} |Λ|^{1/2} exp[ ηᵀx − (1/2)xᵀΛx − (1/2)ηᵀΛ⁻¹η ]
            = (2π)^{−d/2} |Λ|^{1/2} exp[ ηᵀx − (1/2)tr(Λxxᵀ) − (1/2)ηᵀΛ⁻¹η ]    (9)
Recall the exponential family form:

P(x | η) = h(x) exp{ηᵀT(x) − A(η)},    (10)

where η in Eqn (10) denotes the natural parameter vector, and T(x) is the vector of sufficient statistics for the MVN.
We can see then that Eqn (9) is in exponential family form, and we can read off the sufficient statistics of the MVN:

T(x) = (x, xxᵀ),

the natural parameters

η̃ = (η, −(1/2)Λ),

and the log partition function

A(η, Λ) = (1/2) ηᵀΛ⁻¹η = (1/2) tr(Λ⁻¹ηηᵀ).

Thus, the sufficient statistics for an MVN are the empirical mean and the empirical covariance.
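The conversions in Eqns (7) and (8) are just a matrix inverse, and the relationship is invertible; a minimal sketch with invented example parameters:

```python
import numpy as np

mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])

# Moment -> natural parameters, Eqn (7).
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu

# Natural -> moment parameters, Eqn (8): the map is invertible,
# so we recover the original mean and covariance exactly.
Sigma_back = np.linalg.inv(Lam)
mu_back = Sigma_back @ eta
```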
4 Marginals and Conditionals for an MVN
Let's consider an example where d = 2. If it is simpler, let x = (x1, x2), where x1 and x2 are scalar. However, if we consider x1 and x2 to be a split of the MVN data in dimension d > 2, where each of x1 and x2 is a vector, all of this section goes through naturally. Suppose that x1 and x2 are jointly Gaussian:

μ = (μ1, μ2),  Σ = [Σ11 Σ12; Σ21 Σ22],  and hence Λ = Σ⁻¹ = [Λ11 Λ12; Λ21 Λ22].
Can we find the marginal and conditional distributions in this space? Recall that

P(x1) = ∫ N(x1, x2 | μ, Σ) dx2.
Using this model we can derive the following distributions with Eqn (8):

- Marginal: P(x1) = N(x1 | μ1, Σ11). In natural parameters, the marginal of x1 has precision Λ11 − Λ12Λ22⁻¹Λ21 and linear parameter η1 − Λ12Λ22⁻¹η2.
- Marginal (equivalently): P(x2) = N(x2 | μ2, Σ22), with natural parameters Λ22 − Λ21Λ11⁻¹Λ12 and η2 − Λ21Λ11⁻¹η1.
- Conditional distribution: P(x1 | x2) = N(x1 | μ_{1|2}, Σ_{1|2}), where μ_{1|2} = μ1 + Σ12Σ22⁻¹(x2 − μ2) and Σ_{1|2} = Σ11 − Σ12Σ22⁻¹Σ21; or, more concisely in natural parameters, η_{1|2} = η1 − Λ12x2 and Λ_{1|2} = Λ11.
The converse conditional distribution, p(x2 | x1), is written out equivalently (swapping the 1, 2 indices). These formulas are derived using the Schur complement of a matrix and the matrix inversion lemma. Note that conditional probabilities are straightforward to work with in the natural parameter space, whereas marginal probabilities are much simpler in the mean parameter space.

The marginal distribution in the mean parameter space is a simple projection of a (for example) 2D MVN cloud onto each of the univariate Gaussian distributions in one dimension. The conditional distribution is a similar projection, but considering only a slice of the space at the conditioned-on value of the random variable. When the off-diagonal elements of the covariance matrix are 0, the conditional distribution is identical to the marginal distribution (as the two univariate Gaussians are independent).
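The two parameterizations of the conditional can be checked against each other numerically; a small sketch for the scalar (d = 2) case, with an illustrative joint distribution:

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu
x2 = 2.0  # the value we condition on

# Mean-parameter form: mu_{1|2} = mu1 + Sigma12 Sigma22^{-1} (x2 - mu2),
# Sigma_{1|2} = Sigma11 - Sigma12 Sigma22^{-1} Sigma21.
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Natural-parameter form: eta_{1|2} = eta1 - Lam12 x2, Lam_{1|2} = Lam11.
eta_cond = eta[0] - Lam[0, 1] * x2
lam_cond = Lam[0, 0]
```

Converting the natural-parameter result back with Eqn (8) (mean = η/λ, variance = 1/λ) recovers the mean-parameter result.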
5 Conjugate prior
The conjugate prior for the mean term μ of a multivariate normal distribution is itself a multivariate normal distribution:

p(μ|X) ∝ p(μ)p(X|μ),    (11)

where p(μ) is a multivariate normal distribution, μ ∼ N(μ0, Σ0). The implication of this prior is that the mean term has a Gaussian distribution across the space in which it might lie: generally large values of Σ0 are preferable unless we have good prior information about the mean term (e.g., that it will be right around zero).
The conjugate prior for the covariance matrix Σ of a multivariate normal distribution is the inverse Wishart distribution:

p(Σ|X) ∝ p(Σ)p(X|Σ),    (12)

where p(Σ) is an inverse Wishart distribution, Σ ∼ IW(ν, Ψ). The inverse Wishart is a pdf over positive definite matrices.
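For the mean with known covariance, conjugacy gives a closed-form Gaussian posterior; a sketch of the standard conjugate update (all hyperparameters and data here are invented for illustration, and the weak prior uses a large Σ0 as suggested above):

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])     # known likelihood covariance
mu_true = np.array([2.0, -1.0])
X = rng.multivariate_normal(mu_true, Sigma, size=200)
n, xbar = X.shape[0], X.mean(axis=0)

# Prior mu ~ N(mu0, Sigma0); a large Sigma0 encodes weak prior information.
mu0 = np.zeros(2)
Sigma0 = 100.0 * np.eye(2)

# Standard conjugate update for the mean with known covariance:
#   Sigma_n = (Sigma0^{-1} + n Sigma^{-1})^{-1}
#   mu_n    = Sigma_n (Sigma0^{-1} mu0 + n Sigma^{-1} xbar)
prec_post = np.linalg.inv(Sigma0) + n * np.linalg.inv(Sigma)
Sigma_n = np.linalg.inv(prec_post)
mu_n = Sigma_n @ (np.linalg.inv(Sigma0) @ mu0 + n * np.linalg.inv(Sigma) @ xbar)
```

With such a weak prior, the posterior mean lands essentially on the sample mean, and the posterior covariance shrinks at rate 1/n.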