A Brief Introduction to Machine Learning for Engineers
arXiv:1709.02840v3 [cs.LG] 17 May 2018
(2018), "A Brief Introduction to Machine Learning for Engineers", : Vol. XX, No. XX, pp 1?231. DOI: XXX.
Osvaldo Simeone
Department of Informatics, King's College London
osvaldo.simeone@kcl.ac.uk
Contents

I Basics  5

1 Introduction  6
1.1 What is Machine Learning?  6
1.2 When to Use Machine Learning?  8
1.3 Goals and Outline  11

2 A Gentle Introduction through Linear Regression  15
2.1 Supervised Learning  15
2.2 Inference  17
2.3 Frequentist Approach  19
2.4 Bayesian Approach  36
2.5 Minimum Description Length (MDL)  42
2.6 Information-Theoretic Metrics  44
2.7 Interpretation and Causality  47
2.8 Summary  49

3 Probabilistic Models for Learning  51
3.1 Preliminaries  52
3.2 The Exponential Family  53
3.3 Frequentist Learning  59
3.4 Bayesian Learning  63
3.5 Supervised Learning via Generalized Linear Models (GLM)  69
3.6 Maximum Entropy Property  71
3.7 Energy-based Models  72
3.8 Some Advanced Topics  73
3.9 Summary  74

II Supervised Learning  75

4 Classification  76
4.1 Preliminaries: Stochastic Gradient Descent  77
4.2 Classification as a Supervised Learning Problem  78
4.3 Discriminative Deterministic Models  80
4.4 Discriminative Probabilistic Models: Generalized Linear Models  90
4.5 Discriminative Probabilistic Models: Beyond GLM  96
4.6 Generative Probabilistic Models  102
4.7 Boosting  106
4.8 Summary  107

5 Statistical Learning Theory  113
5.1 A Formal Framework for Supervised Learning  114
5.2 PAC Learnability and Sample Complexity  119
5.3 PAC Learnability for Finite Hypothesis Classes  120
5.4 VC Dimension and Fundamental Theorem of PAC Learning  124
5.5 Summary  126

III Unsupervised Learning  129

6 Unsupervised Learning  130
6.1 Unsupervised Learning  131
6.2 K-Means Clustering  134
6.3 ML, ELBO and EM  136
6.4 Directed Generative Models  148
6.5 Undirected Generative Models  155
6.6 Discriminative Models  159
6.7 Autoencoders  161
6.8 Ranking  163
6.9 Summary  164

IV Advanced Modelling and Inference  165

7 Probabilistic Graphical Models  166
7.1 Introduction  167
7.2 Bayesian Networks  170
7.3 Markov Random Fields  178
7.4 Bayesian Inference in Probabilistic Graphical Models  182
7.5 Summary  185

8 Approximate Inference and Learning  186
8.1 Monte Carlo Methods  187
8.2 Variational Inference  189
8.3 Monte Carlo-Based Variational Inference  197
8.4 Approximate Learning  199
8.5 Summary  201

V Conclusions  202

9 Concluding Remarks  203

Appendices  206

A Appendix A: Information Measures  207
A.1 Entropy  207
A.2 Conditional Entropy and Mutual Information  210
A.3 Divergence Measures  212

B Appendix B: KL Divergence and Exponential Family  215

Acknowledgements  217

References  218
A Brief Introduction to Machine Learning for Engineers
Osvaldo Simeone
Department of Informatics, King's College London; osvaldo.simeone@kcl.ac.uk
ABSTRACT This monograph aims to provide an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. The material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra.
ISSN ; DOI XXXXXXXX © 2018 XXXXXXXX
Notation
• Random variables or random vectors (both abbreviated as rvs) are represented using roman typeface, while their values and realizations are indicated by the corresponding standard font. For instance, the equality x = x indicates that rv x takes value x.
• Matrices are indicated using uppercase fonts, with roman typeface used for random matrices.
• Vectors will be taken to be in column form.
• Xᵀ and X† are the transpose and the pseudoinverse of matrix X, respectively.
• The distribution of a rv x, either probability mass function (pmf) for a discrete rv or probability density function (pdf) for continuous rvs, is denoted as p_x, p_x(x), or p(x).
• The notation x ∼ p_x indicates that rv x is distributed according to p_x.
• For jointly distributed rvs (x, y) ∼ p_{xy}, the conditional distribution of x given the observation y = y is indicated as p_{x|y=y}, p_{x|y}(x|y), or p(x|y).
• The notation x|y = y ∼ p_{x|y=y} indicates that rv x is drawn according to the conditional distribution p_{x|y=y}.
• The notation E_{x∼p_x}[·] indicates the expectation of the argument with respect to the distribution of the rv x ∼ p_x. Accordingly, we will also write E_{x∼p_{x|y}}[·|y] for the conditional expectation with respect to the distribution p_{x|y=y}. When clear from the context, the distribution over which the expectation is computed may be omitted.
• The notation Pr_{x∼p_x}[·] indicates the probability of the argument event with respect to the distribution of the rv x ∼ p_x. When clear from the context, the subscript is dropped.
• The notation log represents the logarithm in base two, while ln represents the natural logarithm.
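The base-two convention means that the information measures reviewed in Appendix A come out in bits. As a minimal illustration (a sketch, not from the monograph; numpy and the example probability are assumptions), the binary entropy of a Bernoulli(p) rv can be computed as follows:

```python
import numpy as np

def binary_entropy(p: float) -> float:
    """H(p) = -p*log2(p) - (1 - p)*log2(1 - p), measured in bits."""
    if p in (0.0, 1.0):
        return 0.0  # limit convention: 0*log2(0) = 0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit, the maximum for a binary rv
```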
• x ∼ N(µ, Σ) indicates that random vector x is distributed according to a multivariate Gaussian pdf with mean vector µ and covariance matrix Σ. The multivariate Gaussian pdf is denoted as N(x|µ, Σ) as a function of x.
• x ∼ U(a, b) indicates that rv x is distributed according to a uniform distribution in the interval [a, b]. The corresponding uniform pdf is denoted as U(x|a, b).
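To make the sampling and expectation notation above concrete, the following minimal Python sketch (illustrative only and not part of the monograph; numpy, the seed, and the parameter values µ, Σ, a, b are all assumptions) draws samples x ∼ N(µ, Σ) and x ∼ U(a, b), and approximates E_{x∼p_x}[x] by an empirical average:

```python
import numpy as np

rng = np.random.default_rng(0)

# x ~ N(mu, Sigma): multivariate Gaussian with illustrative parameters
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x_gauss = rng.multivariate_normal(mu, Sigma, size=10_000)

# x ~ U(a, b): uniform distribution on the interval [a, b]
a, b = -1.0, 1.0
x_unif = rng.uniform(a, b, size=10_000)

# E_{x~p_x}[x] approximated by the empirical average over the samples
print(x_gauss.mean(axis=0))  # close to mu
print(x_unif.mean())         # close to (a + b) / 2 = 0
```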
• δ(x) denotes the Dirac delta function or the Kronecker delta function, as clear from the context.
• ||a||² = ∑_{i=1}^{N} a_i² is the quadratic, or l2, norm of a vector a = [a_1, ..., a_N]ᵀ. We similarly define the l1 norm as ||a||_1 = ∑_{i=1}^{N} |a_i|, and the l0 pseudo-norm ||a||_0 as the number of non-zero entries of vector a.
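As a quick numerical check of these norm definitions (a sketch, not from the monograph; numpy and the example vector are assumptions):

```python
import numpy as np

a = np.array([3.0, 0.0, -4.0])  # arbitrary example vector

l2_squared = np.sum(a**2)       # ||a||^2: quadratic (l2) norm, squared
l1 = np.sum(np.abs(a))          # ||a||_1: sum of absolute values
l0 = np.count_nonzero(a)        # ||a||_0: number of non-zero entries

print(l2_squared, l1, l0)       # 25.0 7.0 2
```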
• I denotes the identity matrix, whose dimensions will be clear from the context. Similarly, 1 represents a vector of all ones.
• R is the set of real numbers; R_+ the set of non-negative real numbers; R_− the set of non-positive real numbers; and Rᴺ the set of all vectors of N real numbers.
• 1(·) is the indicator function: 1(x) = 1 if x is true, and 1(x) = 0 otherwise.
• |S| represents the cardinality of a set S.
• x_S represents a set of rvs x_k indexed by the integers k ∈ S.