Mathematics for Machine Learning

Garrett Thomas

Department of Electrical Engineering and Computer Sciences

University of California, Berkeley

January 11, 2018

1 About

Machine learning uses tools from a variety of mathematical fields. This document is an attempt to provide a summary of the mathematical background needed for an introductory class in machine learning, which at UC Berkeley is known as CS 189/289A.

Our assumption is that the reader is already familiar with the basic concepts of multivariable calculus and linear algebra (at the level of UCB Math 53/54). We emphasize that this document is not a replacement for the prerequisite classes. Most subjects presented here are covered rather minimally; we intend to give an overview and point the interested reader to more comprehensive treatments for further details.

Note that this document concerns math background for machine learning, not machine learning itself. We will not discuss specific machine learning models or algorithms except possibly in passing to highlight the relevance of a mathematical concept.

Earlier versions of this document did not include proofs. We have begun adding in proofs where they are reasonably short and aid in understanding. These proofs are not necessary background for CS 189 but can be used to deepen the reader's understanding.

You are free to distribute this document as you wish. The latest version can be found at http://gwthomas.github.io/docs/math4ml.pdf. Please report any mistakes to gwthomas@berkeley.edu.

Contents

1 About
2 Notation
3 Linear Algebra
  3.1 Vector spaces
    3.1.1 Euclidean space
    3.1.2 Subspaces
  3.2 Linear maps
    3.2.1 The matrix of a linear map
    3.2.2 Nullspace, range
  3.3 Metric spaces
  3.4 Normed spaces
  3.5 Inner product spaces
    3.5.1 Pythagorean Theorem
    3.5.2 Cauchy-Schwarz inequality
    3.5.3 Orthogonal complements and projections
  3.6 Eigenthings
  3.7 Trace
  3.8 Determinant
  3.9 Orthogonal matrices
  3.10 Symmetric matrices
    3.10.1 Rayleigh quotients
  3.11 Positive (semi-)definite matrices
    3.11.1 The geometry of positive definite quadratic forms
  3.12 Singular value decomposition
  3.13 Fundamental Theorem of Linear Algebra
  3.14 Operator and matrix norms
  3.15 Low-rank approximation
  3.16 Pseudoinverses
  3.17 Some useful matrix identities
    3.17.1 Matrix-vector product as linear combination of matrix columns
    3.17.2 Sum of outer products as matrix-matrix product
    3.17.3 Quadratic forms
4 Calculus and Optimization
  4.1 Extrema
  4.2 Gradients
  4.3 The Jacobian
  4.4 The Hessian
  4.5 Matrix calculus
    4.5.1 The chain rule
  4.6 Taylor's theorem
  4.7 Conditions for local minima
  4.8 Convexity
    4.8.1 Convex sets
    4.8.2 Basics of convex functions
    4.8.3 Consequences of convexity
    4.8.4 Showing that a function is convex
    4.8.5 Examples
5 Probability
  5.1 Basics
    5.1.1 Conditional probability
    5.1.2 Chain rule
    5.1.3 Bayes' rule
  5.2 Random variables
    5.2.1 The cumulative distribution function
    5.2.2 Discrete random variables
    5.2.3 Continuous random variables
    5.2.4 Other kinds of random variables
  5.3 Joint distributions
    5.3.1 Independence of random variables
    5.3.2 Marginal distributions
  5.4 Great Expectations
    5.4.1 Properties of expected value
  5.5 Variance
    5.5.1 Properties of variance
    5.5.2 Standard deviation
  5.6 Covariance
    5.6.1 Correlation
  5.7 Random vectors
  5.8 Estimation of Parameters
    5.8.1 Maximum likelihood estimation
    5.8.2 Maximum a posteriori estimation
  5.9 The Gaussian distribution
    5.9.1 The geometry of multivariate Gaussians
References

2 Notation

Notation      Meaning
R             set of real numbers
R^n           set (vector space) of n-tuples of real numbers, endowed with the usual inner product
R^(m×n)       set (vector space) of m-by-n matrices
δ_ij          Kronecker delta, i.e. δ_ij = 1 if i = j, 0 otherwise
∇f(x)         gradient of the function f at x
∇²f(x)        Hessian of the function f at x
A^T           transpose of the matrix A
Ω             sample space
P(A)          probability of event A
p(X)          distribution of random variable X
p(x)          probability density/mass function evaluated at x
A^c           complement of event A
A ∪̇ B         union of A and B, with the extra requirement that A ∩ B = ∅
E[X]          expected value of random variable X
Var(X)        variance of random variable X
Cov(X, Y)     covariance of random variables X and Y
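A couple of these notational conventions can be made concrete. As an illustrative sketch (NumPy is our choice here, not something the document itself uses): the matrix whose (i, j) entry is the Kronecker delta δ_ij is exactly the identity matrix, and the transpose of an m-by-n matrix is n-by-m.

```python
import numpy as np

# Kronecker delta: delta_ij = 1 if i = j, 0 otherwise.
# Collecting delta_ij into an n-by-n matrix gives the identity matrix.
delta = np.eye(3)
assert delta[0, 0] == 1 and delta[0, 1] == 0

# Transpose: A is 2-by-3, so A^T is 3-by-2, with (A^T)_ij = A_ji.
A = np.arange(6).reshape(2, 3)
assert A.T.shape == (3, 2)
assert A.T[1, 0] == A[0, 1]
```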

Other notes:

• Vectors and matrices are in bold (e.g. x, A). This is true for vectors in R^n as well as for vectors in general vector spaces. We generally use Greek letters for scalars and capital Roman letters for matrices and random variables.

• To stay focused at an appropriate level of abstraction, we restrict ourselves to real values. In many places in this document, it is entirely possible to generalize to the complex case, but we will simply state the version that applies to the reals.

• We assume that vectors are column vectors, i.e. that a vector in R^n can be interpreted as an n-by-1 matrix. As such, taking the transpose of a vector is well-defined (and produces a row vector, which is a 1-by-n matrix).
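The column-vector convention can be sketched in code (a minimal NumPy illustration, assuming nothing beyond the conventions stated above):

```python
import numpy as np

# Convention: a vector in R^n is a column vector, i.e. an n-by-1 matrix.
x = np.array([[1.0], [2.0], [3.0]])
assert x.shape == (3, 1)

# Its transpose is then a row vector, a 1-by-n matrix.
assert x.T.shape == (1, 3)

# With this convention, x^T x is a 1-by-1 matrix whose single entry
# is the squared Euclidean norm of x.
sq_norm = (x.T @ x)[0, 0]
assert sq_norm == 1.0 + 4.0 + 9.0
```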
