Mathematical Statistics Review - Rice University
Summary of Stat 531/532
Taught by Dr. Dennis Cox
Text: A work-in-progress by Dr. Cox
Supplemented by Statistical Inference by Casella & Berger
Chapter 1: Probability Theory and Measure Spaces
1.1: Set Theory (Casella-Berger)
The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment.
An event is any collection of possible outcomes of an experiment (i.e. any subset of S).
A and B are disjoint (or mutually exclusive) if A ∩ B = ∅. The events A1, A2, … are pairwise disjoint if Ai ∩ Aj = ∅ for all i ≠ j.
If A1, A2, … are pairwise disjoint and ∪ Ai = S, then the collection A1, A2, … forms a partition of S.
1.2: Probability Theory (Casella-Berger)
A collection of subsets of S is called a Borel field (or sigma algebra) denoted by β, if it satisfies the following three properties:
1. ∅ ∈ β
2. If A ∈ β, then A^c ∈ β
3. If A1, A2, … ∈ β, then ∪ Ai ∈ β
i.e. the empty set is contained in β, β is closed under complementation, and β is closed under countable unions.
Given a sample space S and an associated Borel field β, a probability function is a function P with domain β that satisfies:
1. P(A) ≥ 0 for all A ∈ β
2. P(S)=1
3. If A1, A2, … ∈ β are pairwise disjoint, then P(∪ Ai) = Σ P(Ai).
Axiom of Finite Additivity: If A ∈ β and B ∈ β are disjoint, then P(A ∪ B) = P(A) + P(B).
1.1: Measures (Cox)
A measure space is denoted as (Ω, F, μ), where Ω is an underlying set, F is a sigma-algebra of subsets of Ω, and μ is a measure, meaning that it satisfies:
1. μ(A) ≥ 0 for all A ∈ F
2. μ(∅) = 0
3. If A1, A2, … is a sequence of disjoint elements of F, then μ(∪ Ai) = Σ μ(Ai).
A measure, P, where P(Ω)=1 is called a probability measure.
A measure on (R, B), where B is the Borel σ-field, is called a Borel measure.
A measure, #, where #(A)=number of elements in A, is called a counting measure.
A counting measure may be written in terms of unit point masses (UPM) as # = Σ δ_xi, where δ_x(A) = 1 if x ∈ A and 0 otherwise. A unit point mass is itself a probability measure.
There is a unique Borel measure, m, satisfying m([a,b]) = b − a for every finite closed interval [a,b], −∞ < a ≤ b < ∞ (Lebesgue measure).
Properties of Measures:
a) (monotonicity) If A ⊆ B, then μ(A) ≤ μ(B).
b) (subadditivity) μ(∪ Ai) ≤ Σ μ(Ai), where Ai is any sequence of measurable sets.
c) If Ai, i = 1, 2, … is a decreasing sequence of measurable sets (i.e. A1 ⊇ A2 ⊇ …) and if μ(A1) < ∞, then μ(∩ Ai) = lim μ(Ai).
1.2: Measurable Functions and Integration (Cox)
Properties of the Integral:
a) If ∫ f dμ exists and c ∈ R, then ∫ cf dμ exists and equals c ∫ f dμ.
b) If ∫ f dμ and ∫ g dμ both exist and ∫ f dμ + ∫ g dμ is defined (i.e. not of the form ∞ − ∞), then ∫ (f + g) dμ is defined and equals ∫ f dμ + ∫ g dμ.
c) If f ≤ g, then ∫ f dμ ≤ ∫ g dμ, provided the integrals exist.
d) | ∫ f dμ | ≤ ∫ |f| dμ, provided ∫ f dμ exists.
A statement S holds μ-almost everywhere (μ-a.e.) iff there is a set N ∈ F with μ(N) = 0 such that S is true for all ω outside N.
Consider all extended-real-valued Borel functions on (Ω, F):
Monotone Convergence Theorem: If f1, f2, … is an increasing sequence of nonnegative functions (i.e. 0 ≤ f1 ≤ f2 ≤ …) and fn → f pointwise, then ∫ fn dμ → ∫ f dμ.
Lebesgue's Dominated Convergence Theorem: If fn → f μ-a.e. as n → ∞ and there exists an integrable function g such that |fn| ≤ g for all n, then ∫ fn dμ → ∫ f dμ.
g is called the dominating function.
Change of Variables Theorem: Suppose T : (Ω, F) → (Λ, G) is measurable and ν = μ ∘ T⁻¹ is the induced measure on (Λ, G). Then ∫ f dν = ∫ (f ∘ T) dμ, whenever either side exists.
Interchange of Differentiation and Integration Theorem: For each fixed θ in an open interval, consider a function f(x, θ). If
1. the integral ∫ f(x, θ) dμ(x) is finite,
2. the partial derivative ∂f(x, θ)/∂θ exists, and
3. the absolute value of the partial derivative is bounded by an integrable function g(x) not depending on θ, then
∂f(x, θ)/∂θ is integrable w.r.t. μ and (d/dθ) ∫ f(x, θ) dμ(x) = ∫ ∂f(x, θ)/∂θ dμ(x).
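The interchange can be checked numerically in a simple case. The sketch below (my own illustration in Python with scipy, not from the text; the integrand exp(−θx) and the step size h are arbitrary choices) compares the derivative of the integral with the integral of the partial derivative; both should equal −1/θ².

# Numerical check that d/dtheta of the integral equals the integral of the
# partial derivative, for f(x, theta) = exp(-theta * x) on (0, infinity).
import numpy as np
from scipy.integrate import quad

theta = 2.0
h = 1e-5

def integral(t):
    # integral over x of exp(-t * x) dx, which equals 1/t
    return quad(lambda x: np.exp(-t * x), 0, np.inf)[0]

# Left side: numerical derivative of the integral with respect to theta.
lhs = (integral(theta + h) - integral(theta - h)) / (2 * h)

# Right side: integral of the partial derivative, -x * exp(-theta * x).
rhs = quad(lambda x: -x * np.exp(-theta * x), 0, np.inf)[0]

print(lhs, rhs, -1 / theta**2)   # all approximately -0.25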
1.3: Measures on Product Spaces (Cox)
A measure space (Ω, F, μ) is called σ-finite iff there exists a sequence A1, A2, … in F such that
i) μ(Ai) < ∞ for each i, and
ii) ∪ Ai = Ω.
Product Measure Theorem: Let (Ω1, F1, μ1) and (Ω2, F2, μ2) be σ-finite measure spaces. Then there exists a unique product measure μ1 × μ2 on (Ω1 × Ω2, F1 × F2) satisfying (μ1 × μ2)(A1 × A2) = μ1(A1) μ2(A2) for all A1 ∈ F1, A2 ∈ F2.
Fubini's Theorem: Let (Ω, F, μ) = (Ω1 × Ω2, F1 × F2, μ1 × μ2), where μ1 and μ2 are σ-finite. If f is a Borel function on Ω1 × Ω2 whose integral w.r.t. μ1 × μ2 exists, then
∫ f d(μ1 × μ2) = ∫ [ ∫ f(ω1, ω2) dμ2(ω2) ] dμ1(ω1).
Conclude that the inner integral ∫ f(ω1, ω2) dμ2(ω2) exists for μ1-a.e. ω1 and defines a Borel function of ω1 whose integral w.r.t. μ1 exists.
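A quick numerical illustration with Lebesgue measure on the unit square (my own sketch, not from the text; the integrand x·y² is an arbitrary choice): the double integral should match either iterated integral.

# Fubini check: double integral of f(x, y) = x * y**2 over [0,1]^2
# versus the two iterated integrals; all three should agree (= 1/6).
import numpy as np
from scipy.integrate import quad, dblquad

f = lambda x, y: x * y**2

# dblquad integrates func(y, x), so the arguments are swapped.
double = dblquad(lambda y, x: f(x, y), 0, 1, 0, 1)[0]

x_then_y = quad(lambda y: quad(lambda x: f(x, y), 0, 1)[0], 0, 1)[0]
y_then_x = quad(lambda x: quad(lambda y: f(x, y), 0, 1)[0], 0, 1)[0]

print(double, x_then_y, y_then_x)   # all approximately 0.1667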
A measurable function X : (Ω, F) → (R^n, B^n) is called an n-dimensional random vector (a.k.a. random n-vector). The distribution of X is the same as the law of X (i.e. Law[X] = P ∘ X⁻¹).
• Law[X](B) = P(X ∈ B) for B ∈ B^n.
1.4: Densities and The Radon-Nikodym Theorem (Cox)
Let ν and μ be measures on (Ω, F). ν is absolutely continuous w.r.t. μ (written ν << μ) iff μ(A) = 0 implies ν(A) = 0 for all A ∈ F. This can also be said as μ dominates ν (i.e. μ is a dominating measure for ν). If both measures dominate one another, then they are equivalent.
Radon-Nikodym Theorem: Let (Ω, F, μ) be a σ-finite measure space and suppose ν << μ. Then there exists a nonnegative Borel function f = dν/dμ such that ν(A) = ∫_A f dμ for all A ∈ F. Furthermore, f is unique μ-a.e.
1.5: Conditional Expectation (Cox)
Theorem 1.5.1: Suppose Y is a random element taking values in a measurable space (Λ, G) and Z is a random variable, both defined on (Ω, F). Then Z is σ(Y)-measurable iff there exists a Borel function h such that Z = h(Y).
Let X be a random variable with E|X| < ∞, and suppose G is a sub-σ-field of F. Then the conditional expectation of X given G, denoted E[X | G], is the essentially unique random variable Z (where Z = E[X | G]) satisfying
i) Z is G-measurable
ii) E[Z 1_A] = E[X 1_A] for all A ∈ G.
Proposition 1.5.4: Suppose X and Y are random elements and μi is a σ-finite measure on the range space of the i-th coordinate, i = 1, 2, such that Law[(X, Y)] << μ1 × μ2. Let f_{X,Y} denote the corresponding joint density. Let g be any Borel function with E|g(X)| < ∞. Then
E[g(X) | Y] = ∫ g(x) f_{X,Y}(x, Y) dμ1(x) / ∫ f_{X,Y}(x, Y) dμ1(x), a.s.
The conditional density of X given Y is denoted by f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y).
Proposition 1.5.5: Suppose the assumptions of proposition 1.5.4 hold. Then the following are true:
i) the family of regular conditional distributions Law[X | Y = y] exists;
ii) Law[X | Y = y] << μ1 for Law[Y]-a.e. y;
iii) the Radon-Nikodym derivative is given by (d Law[X | Y = y] / dμ1)(x) = f_{X|Y}(x | y).
Theorem 1.5.7 - Basic Properties of Conditional Expectation: Let X, X1, and X2 be integrable r.v.'s on (Ω, F, P), and let G be a fixed sub-σ-field of F.
a) If X = c a.s., c a constant, then E[X | G] = c a.s.
b) If X1 ≤ X2 a.s., then E[X1 | G] ≤ E[X2 | G] a.s.
c) If a1, a2 ∈ R, then E[a1 X1 + a2 X2 | G] = a1 E[X1 | G] + a2 E[X2 | G], a.s.
d) (Law of Total Expectation) E[ E[X | G] ] = E[X] (see the simulation sketch following this theorem).
e) | E[X | G] | ≤ E[ |X| | G ] a.s.
f) If X is independent of G, then E[X | G] = E[X] a.s.
g) (Law of Successive Conditioning) If G0 is a sub-σ-field of G, then E[ E[X | G] | G0 ] = E[X | G0] a.s.
h) If Y is G-measurable and E|XY| < ∞, then E[XY | G] = Y E[X | G] a.s.
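Property (d) is easy to see in a small simulation (my own sketch, not from the text; the two-point model and sample size are arbitrary): averaging E[X | Y] over the distribution of Y recovers E[X].

# Law of Total Expectation check: Y ~ Bernoulli(0.3), X | Y = y ~ Normal(2y, 1),
# so E[X | Y] = 2Y and E[X] = 0.6.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.binomial(1, 0.3, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0, size=n)

cond_exp = 2.0 * y                  # E[X | Y] in closed form for this toy model
print(x.mean(), cond_exp.mean())    # both approximately 0.6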
Theorem 1.5.8 – Convergence Theorems for Conditional Expectation: Let X, X1, X2, … be integrable r.v.'s on (Ω, F, P), and let G be a fixed sub-σ-field of F.
a) (Monotone Convergence Theorem) If 0 ≤ Xn ↑ X a.s., then E[Xn | G] ↑ E[X | G] a.s.
b) (Dominated Convergence Theorem) Suppose there is an integrable r.v. Y such that |Xn| ≤ Y a.s. for all n, and suppose that Xn → X a.s. Then E[Xn | G] → E[X | G] a.s.
Two-Stage Experiment Theorem: Let (Ω1, F1) and (Ω2, F2) be measurable spaces and suppose P_{2|1}(B | ω1), defined for B ∈ F2 and ω1 ∈ Ω1, satisfies the following:
i) P_{2|1}(B | ·) is Borel measurable in ω1 for each fixed B ∈ F2;
ii) P_{2|1}(· | ω1) is a Borel p.m. on (Ω2, F2) for each fixed ω1 ∈ Ω1.
Let ν be any p.m. on (Ω1, F1). Then there is a unique p.m. P on (Ω1 × Ω2, F1 × F2) satisfying
P(A × B) = ∫_A P_{2|1}(B | ω1) dν(ω1) for all A ∈ F1, B ∈ F2.
Theorem 1.5.12 - Bayes Formula: Suppose Θ is a random element taking values in a measurable space (T, C), and let λ be a σ-finite measure on (T, C) with Law[Θ] << λ. Denote the corresponding density (the prior) by
π(θ) = (d Law[Θ] / dλ)(θ).
Let μ be a σ-finite measure on the sample space of X. Suppose that for each θ ∈ T there is a given pdf w.r.t. μ denoted f(x | θ). Denote by X a random element taking values in the sample space with conditional density f(x | θ) given Θ = θ. Then there is a version of the posterior density of Θ given X (w.r.t. λ) given by
π(θ | x) = f(x | θ) π(θ) / ∫ f(x | θ′) π(θ′) dλ(θ′).
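A concrete instance of the formula (my own sketch, not from the text; the Beta prior, Binomial likelihood, and observed value are arbitrary choices): evaluating the numerator on a grid and normalizing reproduces the known conjugate posterior.

# Bayes formula on a grid: prior Beta(2, 3) for theta, likelihood Binomial(10, theta),
# observed x = 7.  The normalized product should match the Beta(2 + 7, 3 + 3) posterior.
import numpy as np
from scipy.stats import beta, binom

a, b, n, x = 2.0, 3.0, 10, 7
theta = np.linspace(1e-6, 1 - 1e-6, 2001)
dtheta = theta[1] - theta[0]

numerator = binom.pmf(x, n, theta) * beta.pdf(theta, a, b)
posterior_grid = numerator / (numerator.sum() * dtheta)     # numerical normalization

posterior_exact = beta.pdf(theta, a + x, b + n - x)
print(np.max(np.abs(posterior_grid - posterior_exact)))     # small (grid error only)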
Chapter 2: Transformations and Expectations
2.1: Moments and Moment Inequalities (Cox)
A moment refers to an expectation of a random variable or a function of a random variable.
• The first moment is the mean
• The kth moment is E[X^k]
• The kth central moment is E[(X − E[X])^k] = E[(X − μ)^k]
Finite second moments imply finite first moments (i.e. E[X^2] < ∞ implies E|X| < ∞).
Markov's Inequality: Suppose X ≥ 0 a.s. Then for any a > 0, P(X ≥ a) ≤ E[X] / a.
Corollary to Markov's Inequality: For any a > 0 and r > 0, P(|X| ≥ a) ≤ E|X|^r / a^r.
Chebyshev's Inequality: Suppose X is a random variable with E[X^2] < ∞. Then for any ε > 0,
P( |X − E[X]| ≥ ε ) ≤ Var(X) / ε².
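A quick simulation check (my own sketch; the Exponential(1) example and the values of k are arbitrary): the observed tail probability P(|X − μ| ≥ kσ) sits below the Chebyshev bound 1/k².

# Compare empirical tail probabilities with the Chebyshev bound 1/k^2
# for X ~ Exponential(1), where mu = sigma = 1.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=500_000)
mu, sigma = 1.0, 1.0

for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    print(k, empirical, 1 / k**2)   # empirical tail <= 1/k^2 in each case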
Jensen's Inequality: Let f be a convex function on a convex set C ⊆ R^n and suppose X is a random n-vector with E‖X‖ < ∞ and X ∈ C a.s. Then E[X] ∈ C and f(E[X]) ≤ E[f(X)].
• If f is concave, the inequality is reversed.
• If f is strictly convex/concave, the inequality is strict unless X is a.s. constant.
Cauchy-Schwarz Inequality: |E[XY]| ≤ E|XY| ≤ √( E[X²] E[Y²] ).
Theorem 2.1.6 (Spectral Decomposition of a Symmetric Matrix): Let A be a symmetric matrix. Then there is an orthogonal matrix U and a diagonal matrix Λ such that A = U Λ U^T.
• The diagonal entries of Λ are eigenvalues of A
• The columns of U are the corresponding eigenvectors.
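The decomposition above can be computed directly (a numpy sketch with an arbitrary symmetric matrix, for illustration only):

# Spectral decomposition A = U diag(lambda) U^T of a symmetric matrix.
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])

eigenvalues, U = np.linalg.eigh(A)        # columns of U are eigenvectors
Lambda = np.diag(eigenvalues)

print(np.allclose(A, U @ Lambda @ U.T))   # True: A is reconstructed
print(np.allclose(U.T @ U, np.eye(3)))    # True: U is orthogonal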
2.2: Characteristic and Moment Generating Functions (Cox)
The characteristic function (chf) of a random n-vector X is the complex-valued function φ_X : R^n → C given by φ_X(u) = E[exp(i u^T X)].
The moment generating function (mgf) is defined as M_X(u) = E[exp(u^T X)].
Theorem 2.2.1: Let X be a random n-vector with chf φ_X and mgf M_X.
a) (Continuity) φ_X is uniformly continuous on R^n, and M_X is continuous at every point u0 such that M_X(u) < ∞ for all u in a neighborhood of u0.
b) (Relation to moments) If X is integrable, then the gradient ∇φ_X is defined at u = 0 and equals i E[X]. Also, X has finite second moments iff the Hessian ∇²φ_X of φ_X exists at u = 0, and then ∇²φ_X(0) = −E[X X^T].
c) (Linear Transformation Formulae) Let Y = AX + b for some m × n matrix A and some m-vector b. Then for all u ∈ R^m
φ_Y(u) = exp(i u^T b) φ_X(A^T u)
M_Y(u) = exp(u^T b) M_X(A^T u)
d) (Uniqueness) If Y is a random n-vector and if φ_X(u) = φ_Y(u) for all u ∈ R^n, then Law[X] = Law[Y]. If M_X and M_Y are both defined and equal in a neighborhood of 0, then Law[X] = Law[Y].
e) (Chf for Sums of Independent Random Variables) Suppose X and Y are independent random p-vectors and let Z = X + Y. Then φ_Z(u) = φ_X(u) φ_Y(u).
The Cumulant Generating Function is K(u) = ln M_X(u).
2.3: Common Distributions Used in Statistics (Cox)
Location-Scale Families
A family of distributions is a location family if its pdfs are of the form f(x − μ), μ ∈ R, for a fixed pdf f.
• i.e. X = Z + μ, where Z has pdf f
A family of distributions is a scale family if its pdfs are of the form (1/σ) f(x/σ), σ > 0.
• X = σZ, where Z has pdf f
A family of distributions is a location-scale family if its pdfs are of the form (1/σ) f((x − μ)/σ), μ ∈ R, σ > 0.
• X = μ + σZ, where Z has pdf f
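For example (a standard fact, not a statement from the text): if Z has the standard normal pdf f(z) = (1/√(2π)) e^(−z²/2), then X = μ + σZ has pdf (1/σ) f((x − μ)/σ), which is exactly the N(μ, σ²) density, so the normal distributions form a location-scale family generated by the standard normal pdf.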
A class of transformations T on the sample space is called a transformation group iff the following hold:
i) Every g ∈ T is a one-to-one measurable transformation of the sample space onto itself.
ii) T is closed under composition, i.e. if g1 and g2 are in T, then so is g2 ∘ g1.
iii) T is closed under taking inverses, i.e. if g ∈ T, then g⁻¹ ∈ T.
Let T be a transformation group and let P be a family of probability measures on the sample space. Then P is T-invariant iff for every g ∈ T, Law[g(X)] ∈ P whenever Law[X] ∈ P.
2.4: Distributional Calculations (Cox)
Jensen's Inequality for Conditional Expectation: g(E[X | Y], Y) ≤ E[g(X, Y) | Y], Law[Y]-a.s., provided g(·, y) is a convex function for each y.
2.1: Distribution of Functions of a Random Variable (Casella-Berger)
Transformation for Monotone Functions:
Theorem 2.1.1 (for cdf): Let X have cdf FX(x), let Y = g(X), and let 𝒳 and 𝒴 be defined as 𝒳 = {x : fX(x) > 0} and 𝒴 = {y : y = g(x) for some x ∈ 𝒳}.
1. If g is an increasing function on 𝒳, then FY(y) = FX(g⁻¹(y)) for y ∈ 𝒴.
2. If g is a decreasing function on 𝒳 and X is a continuous random variable, then
FY(y) = 1 − FX(g⁻¹(y)) for y ∈ 𝒴.
Theorem 2.1.2 (for pdf): Let X have pdf fX(x) and let Y = g(X), where g is a monotone function. Let 𝒳 and 𝒴 be defined as above. Suppose fX(x) is continuous on 𝒳 and that g⁻¹(y) has a continuous derivative on 𝒴. Then the pdf of Y is given by
fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)| for y ∈ 𝒴, and fY(y) = 0 otherwise.
The Probability Integral Transformation:
Theorem 2.1.4: Let X have a continuous cdf FX(x) and define the random variable Y as
Y = FX(X). Then Y is uniformly distributed on (0,1), i.e. P(Y ≤ y) = y, 0 < y < 1. (Interpretation: if X is continuous and Y = FX(X), then Y ~ U(0,1).)
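A simulation sketch of the theorem (mine, not from the text; the Exponential(3) example is arbitrary): applying the exponential cdf to exponential draws gives values that pass a test of uniformity.

# Probability integral transform: if X ~ Exponential(beta) with cdf
# F(x) = 1 - exp(-x / beta), then Y = F(X) should be Uniform(0, 1).
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)
beta_scale = 3.0
x = rng.exponential(scale=beta_scale, size=100_000)
y = 1.0 - np.exp(-x / beta_scale)          # Y = F_X(X)

print(kstest(y, "uniform"))                # large p-value: consistent with U(0, 1)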
2.2: Expected Values (Casella-Berger)
The expected value or mean of a random variable g(X), denoted Eg(X), is
Eg(X) = ∫ g(x) fX(x) dx if X is continuous, and Eg(X) = Σ g(x) fX(x) if X is discrete (provided the integral or sum exists).
2.3: Moments and Moment Generating Functions (Casella-Berger)
For each integer n, the nth moment of X, μn′, is μn′ = E[X^n]. The nth central moment of X is μn = E[(X − μ)^n], where μ = E[X].
• Mean = 1st moment of X
• Variance = 2nd central moment of X
The moment generating function of X is MX(t) = E[e^(tX)].
NOTE: The nth moment is equal to the nth derivative of the mgf evaluated at t = 0,
i.e. E[X^n] = MX^(n)(0) = (d^n/dt^n) MX(t) evaluated at t = 0.
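As a numeric sanity check (my own sketch; the N(1.5, 4) mgf and the step size are arbitrary choices): differentiating the known normal mgf at t = 0 by finite differences recovers the first two moments.

# Check that M'(0) = E[X] and M''(0) = E[X^2] for the Normal(mu, sigma^2) mgf
# M(t) = exp(mu*t + sigma^2 * t^2 / 2), using central finite differences.
import numpy as np

mu, sigma = 1.5, 2.0
M = lambda t: np.exp(mu * t + 0.5 * sigma**2 * t**2)

h = 1e-5
first = (M(h) - M(-h)) / (2 * h)                 # approximates E[X] = mu
second = (M(h) - 2 * M(0.0) + M(-h)) / h**2      # approximates E[X^2] = sigma^2 + mu^2

print(first, mu)                  # ~1.5
print(second, sigma**2 + mu**2)   # ~6.25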
Useful relationship: the Binomial Formula is (x + y)^n = Σ_{i=0}^{n} C(n, i) x^i y^(n−i), where C(n, i) = n! / (i!(n − i)!).
Lemma 2.3.1: Let a1, a2, … be a sequence of numbers converging to a. Then lim (1 + an/n)^n = e^a as n → ∞.
Chapter 3: Fundamental Concepts of Statistics
3.1: Basic Notations of Statistics (Cox)
Statistics is the art and science of gathering, analyzing, and making inferences from data.
Four parts to statistics:
1. Models
2. Inference
3. Decision Theory
4. Data Analysis, model checking, robustness, study design, etc…
Statistical Models
A statistical model consists of three things:
1. a measurable space (Ω, F),
2. a collection of probability measures P on (Ω, F), and
3. a collection of possible observable random vectors X defined on that space (the data).
Statistical Inference
i) Point Estimation: Goal – estimate the true θ (or a function g(θ)) from the data.
ii) Hypothesis Testing: Goal – choose between two hypotheses: the null or the alternative.
iii) Interval & Set Estimation: Goal – find a set C(X) such that it's likely true that θ ∈ C(X).
Decision Theory
Nature vs. the Statistician
Nature picks a value of θ and generates X according to Pθ.
There exists an action space of allowable decisions/actions and the Statistician must choose a decision rule from a class of allowable decision rules.
There is also a loss function based on the Statistician’s decision rule chosen and Nature’s true value picked.
Risk is Expected Loss, denoted R(θ, δ) = Eθ[ L(θ, δ(X)) ].
A Δ-optimal decision rule δ* is one that satisfies R(θ, δ*) ≤ R(θ, δ) for all θ and all decision rules δ in the allowable class Δ.
Bayesian Decision Theory
Suppose Nature chooses the parameter as well as the data at random.
We know the distribution Nature uses for selecting θ is π(θ), the prior distribution.
Goal is to minimize the Bayes risk: r(π, δ) = E[ R(Θ, δ) ] = ∫ R(θ, δ) π(θ) dθ.
The posterior risk of an action a is E[ L(Θ, a) | X = x ], where a = any allowable action.
Want to find the Bayes rule δπ by minimizing the posterior risk for each x, i.e. δπ(x) = argmin over a of E[ L(Θ, a) | X = x ].
This then implies that r(π, δπ) ≤ r(π, δ) for any other decision rule δ.
3.2: Sufficient Statistics (Cox)
A statistic T = T(X) is sufficient for θ iff the conditional distribution of X given T = t does not depend on θ. Intuitively, a sufficient statistic contains all the information about θ that is in X.
A loss function, L(θ, a), is convex if (i) the action space A is a convex set and (ii) for each θ, L(θ, ·) is a convex function.
Rao-Blackwell Theorem: Let X1, …, Xn be iid random variables with pdf f(x | θ). Let T be a sufficient statistic for θ and U be an unbiased estimate of g(θ) which is not a function of T alone. Set Ū = E[U | T]. Then we have that:
1) The random variable Ū is a function of the sufficient statistic T alone.
2) Ū is an unbiased estimator for g(θ).
3) Varθ(Ū) < Varθ(U) for all θ, provided Varθ(U) < ∞.
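A small simulation illustrating the theorem (my own example, not from the text; the Poisson model and the target e^(−λ) = P(X = 0) are arbitrary choices): U = 1{X1 = 0} is unbiased, T = ΣXi is sufficient, and E[U | T] = (1 − 1/n)^T is unbiased with visibly smaller variance.

# Rao-Blackwell illustration: estimate exp(-lambda) = P(X = 0) from a Poisson sample.
# U = 1{X1 = 0} is unbiased; conditioning on the sufficient statistic T = sum(X)
# gives E[U | T] = (1 - 1/n)**T, also unbiased but with smaller variance.
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 10, 100_000

x = rng.poisson(lam, size=(reps, n))
T = x.sum(axis=1)

U = (x[:, 0] == 0).astype(float)     # naive unbiased estimator
U_rb = (1.0 - 1.0 / n) ** T          # Rao-Blackwellized estimator

print(np.exp(-lam))                  # target, about 0.135
print(U.mean(), U_rb.mean())         # both close to the target (unbiased)
print(U.var(), U_rb.var())           # U_rb has the much smaller variance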
Factorization Theorem: T(X) is a sufficient statistic for θ iff f(x | θ) = g(T(x) | θ) h(x) for some functions g and h.
• If you can put the density in the form f(x | θ) = h(x) c(θ) exp( Σ wi(θ) ti(x) ), then (t1(X), …, tk(X)) are sufficient statistics.
Minimal Sufficient Statistic
If m is the smallest number for which T = (T1(X), …, Tm(X)) is a sufficient statistic for θ, then T is called a minimal sufficient statistic for θ.
Alternative definition: If T is a sufficient statistic for a family P, then T is minimal sufficient iff for any other sufficient statistic S, T = h(S) P-a.s. I.e. the minimal sufficient statistic can be written as a function of any other sufficient statistic.
Proposition 4.2.5: Let P be an exponential family on a Euclidean space with densities f(x | θ) = h(x) c(θ) exp( η(θ)^T T(x) ), where η(θ) and T(x) are p-dimensional. Suppose there exist θ0, θ1, …, θp such that the vectors η(θ1) − η(θ0), …, η(θp) − η(θ0) are linearly independent in R^p. Then T(X) is minimal sufficient for θ. In particular, if the exponential family is full rank, then T is minimal sufficient.
3.3: Complete Statistics and Ancillary Statistics (Cox)
A statistic T is complete if for every function g, Eθ[g(T)] = 0 for all θ implies Pθ(g(T) = 0) = 1 for all θ.
Example: Consider the Poisson family F = {Poisson(λ) : λ ∈ A}, where A = (0, ∞). If Eλ[g(X)] = e^(−λ) Σ g(x) λ^x / x! = 0 for all λ ∈ A, then the power series Σ g(x) λ^x / x! vanishes identically, so g(x) = 0 for all x. So F is complete.
A statistic V is ancillary iff Lawθ[V] does not depend on θ, i.e. the distribution of V is the same for all θ.
Basu's Theorem: Suppose T is a complete and sufficient statistic and V is ancillary for θ. Then T and V are independent under every θ.
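A simulation consistent with Basu's theorem (my own sketch, not from the text): for a N(μ, 1) sample the sample mean is complete sufficient for μ while the sample variance is ancillary, so they are independent; in particular their correlation over many replications should be near zero (correlation is only a partial check, of course).

# Basu illustration for a Normal(mu, 1) sample: sample mean (complete sufficient)
# and sample variance (ancillary) are independent; their correlation is ~0.
import numpy as np

rng = np.random.default_rng(4)
mu, n, reps = 5.0, 20, 50_000

x = rng.normal(mu, 1.0, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(np.corrcoef(xbar, s2)[0, 1])   # close to 0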
A statistic T = T(x1, …, xn) is:
• Location invariant iff T(x1 + a, …, xn + a) = T(x1, …, xn) for all a
• Location equivariant iff T(x1 + a, …, xn + a) = T(x1, …, xn) + a for all a
• Scale invariant iff T(c x1, …, c xn) = T(x1, …, xn) for all c > 0
• Scale equivariant iff T(c x1, …, c xn) = c T(x1, …, xn) for all c > 0
PPN: If the pdf is a location family and T is location invariant, then T is also ancillary.
PPN: If the pdf is a scale family with iid observations and T is scale invariant, then T is also ancillary.
❖ A sufficient statistic has ALL the information with respect to a parameter.
❖ An ancillary statistic has NO information with respect to a parameter.
3.1: Discrete Distributions (Casella-Berger)
Uniform, U(N0,N1): P(X = x | N0, N1) = 1 / (N1 − N0 + 1), x = N0, N0 + 1, …, N1
• puts equal mass on each outcome (i.e. x = 1,2,...,N)
Hypergeometric: P(X = x | N, M, K) = C(M, x) C(N − M, K − x) / C(N, K), x = 0, 1, …, K
• sampling without replacement
• example: N balls, M of which are one color, N − M another, and you select a sample of size K.
• restriction: M − (N − K) ≤ x ≤ M
Bernoulli: P(X = x | p) = p^x (1 − p)^(1−x), x = 0, 1
• has only two possible outcomes
Binomial: P(X = x | n, p) = C(n, x) p^x (1 − p)^(n−x), x = 0, 1, …, n
• X = Y1 + … + Yn with Yi iid Bernoulli(p) (binomial is the sum of i.i.d. Bernoulli trials)
• counts the number of successes in a fixed number of bernoulli trials
Binomial Theorem: For any real numbers x and y and integer n ≥ 0, (x + y)^n = Σ_{i=0}^{n} C(n, i) x^i y^(n−i).
Poisson: P(X = x | λ) = e^(−λ) λ^x / x!, x = 0, 1, 2, …
• Assumption: for a small time interval, the probability of an arrival is proportional to the length of the interval (e.g. waiting for a bus or for customers); the longer one waits, the more likely an arrival becomes.
• other applications: spatial distributions (i.e. fish in a lake)
Poisson Approximation to the Binomial Distribution: If n is large and p is small, then Binomial(n, p) ≈ Poisson(λ) with λ = np.
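A quick comparison (a sketch of mine; n = 1000 and p = 0.003 are arbitrary): the Binomial(n, p) and Poisson(np) pmfs agree closely.

# Poisson approximation to the Binomial: compare pmfs for large n, small p.
import numpy as np
from scipy.stats import binom, poisson

n, p = 1000, 0.003
k = np.arange(0, 12)

print(np.max(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, n * p))))  # very small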
Negative Binomial: P(X = x | r, p) = C(x − 1, r − 1) p^r (1 − p)^(x−r), x = r, r + 1, …
• X = trial at which the rth success occurs
• counts the number of bernoulli trials necessary to get a fixed number of successes (i.e. number of trials to get x successes)
• independent, identically distributed Bernoulli(p) trials
• must be r-1 successes in first x-1 trials...and then the rth success
• can also be viewed as Y, the number of failures before rth success, where Y=X-r.
• NB(r, p) → Poisson(λ) as r → ∞ and p → 1 with r(1 − p) → λ
Geometric: P(X = x | p) = p (1 − p)^(x−1), x = 1, 2, …
• Same as NB(1,p)
• X = trial at which first success occurs
• distribution is "memoryless", i.e. P(X > s | X > t) = P(X > s − t) for s > t (checked numerically in the sketch after this list)
o Interpretation: The probability of getting s failures after already getting t failures is the same as getting s-t failures right from the start.
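The memoryless property can be checked directly from the geometric survival function P(X > k) = (1 − p)^k (a sketch of mine; the values of p, s, t are arbitrary).

# Memorylessness of the Geometric distribution: P(X > s | X > t) = P(X > s - t).
from scipy.stats import geom

p, s, t = 0.3, 7, 4
lhs = geom.sf(s, p) / geom.sf(t, p)   # P(X > s) / P(X > t)
rhs = geom.sf(s - t, p)               # P(X > s - t)

print(lhs, rhs)                       # both equal (1 - p)**(s - t)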
3.2: Continuous Distributions (Casella-Berger)
Uniform, U(a,b): f(x | a, b) = 1 / (b − a), a ≤ x ≤ b
• spreads mass uniformly over an interval
Gamma: f(x | α, β) = x^(α−1) e^(−x/β) / (Γ(α) β^α), 0 < x < ∞, α > 0, β > 0
• α = shape parameter; β = scale parameter
• Γ(α + 1) = α Γ(α)
• Γ(n) = (n − 1)! for any integer n > 0
• Γ(1/2) = √π
• Γ(α) = ∫_0^∞ t^(α−1) e^(−t) dt = Gamma Function
• if α is an integer, then the gamma cdf is related to Poisson probabilities via λ = x/β: P(X ≤ x) = P(Y ≥ α), where Y ~ Poisson(x/β).
• E[X] = αβ, Var(X) = αβ²
• Exponential(β) ~ Gamma(1, β)
• If X ~ exponential(β), then X^(1/γ) ~ Weibull(γ, β)
Normal: f(x | μ, σ²) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²)), −∞ < x < ∞
• symmetric, bell-shaped
• often used to approximate other distributions
Beta: f(x | α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β), 0 < x < 1, where B(α, β) = Γ(α)Γ(β)/Γ(α + β)
• used to model proportions (since domain is (0,1))
• Beta(1, 1) = U(0, 1)
Cauchy: f(x | θ, σ) = 1 / (πσ [1 + ((x − θ)/σ)²]), −∞ < x < ∞
• Symmetric, bell-shaped
• Similar to the normal distribution, but the mean does not exist
• θ is the median
• the ratio of two standard normals is Cauchy
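A simulation check of the last bullet (mine; the quantile grid is arbitrary): quantiles of Z1/Z2 for independent standard normals track the standard Cauchy quantiles.

# The ratio of two independent standard normals is standard Cauchy:
# compare empirical quantiles of Z1/Z2 with scipy's Cauchy quantiles.
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(5)
z1 = rng.standard_normal(200_000)
z2 = rng.standard_normal(200_000)
ratio = z1 / z2

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(ratio, qs))   # close to ...
print(cauchy.ppf(qs))           # ... the standard Cauchy quantiles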
Lognormal: f(x | μ, σ²) = (1 / (x σ √(2π))) e^(−(ln x − μ)² / (2σ²)), 0 < x < ∞
• used when the logarithm of a random variable is normally distributed
• used in applications where the variable of interest is right skewed
Double Exponential: f(x | μ, σ) = (1 / (2σ)) e^(−|x − μ| / σ), −∞ < x < ∞
• reflects the exponential around its mean
• symmetric with “fat” tails
• has a peak (a sharp, nondifferentiable point) at x = μ
3.3: Exponential Families (Casella-Berger)
A pdf (or pmf) is an exponential family if it can be expressed as
f(x | θ) = h(x) c(θ) exp( Σ_{i=1}^{k} wi(θ) ti(x) ).
This can also be written in the natural parameterization as f(x | η) = h(x) c*(η) exp( Σ_{i=1}^{k} ηi ti(x) ).
The set H = {η = (η1, …, ηk) : ∫ h(x) exp( Σ ηi ti(x) ) dx < ∞} is called the natural parameter space. H is convex.
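As a quick check of the definition (a standard worked example, not from the text): the Poisson(λ) pmf can be put in this form, f(x | λ) = e^(−λ) λ^x / x! = (1/x!) e^(−λ) exp(x ln λ), so h(x) = 1/x!, c(λ) = e^(−λ), w1(λ) = ln λ, and t1(x) = x; the natural parameter is η = ln λ, and the natural parameter space is all of R.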
3.4: Location and Scale Families (Casella-Berger)
SEE COX 2.3 (ABOVE) FOR DEFINITIONS AND EXAMPLE PDF'S FOR EACH FAMILY.
Chapter 4: Multiple Random Variables
4.1: Joint and Marginal Distributions (Casella-Berger)
An n-dimensional random vector is a function from a sample space S into Rn, n-dimensional Euclidean space.
Let (X,Y) be a discrete bivariate random vector. Then the function f(x,y) from R2 to R defined by f(x,y)=fX,Y(x,y)=P(X=x,Y=y) is called the joint probability mass function or the joint pmf of (X,Y).
Let (X,Y) be a discrete bivariate random vector with joint pmf fX,Y(x,y). Then the marginal pmfs of X and Y are fX(x) = Σ_y fX,Y(x, y) and fY(y) = Σ_x fX,Y(x, y).
On the continuous side, the joint probability density function or joint pdf is the function f(x, y) satisfying
P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy for every set A ⊆ R².
The continuous marginal pdfs are fX(x) = ∫ f(x, y) dy and fY(y) = ∫ f(x, y) dx, −∞ < x < ∞ and −∞ < y < ∞.
4.7: Inequalities (Casella-Berger)
Lemma 4.7.1: Let a and b be positive numbers, and let p > 1 and q > 1 satisfy 1/p + 1/q = 1. Then (1/p) a^p + (1/q) b^q ≥ ab, with equality only if a^p = b^q.
Hölder's Inequality: Let X and Y be random variables and p and q satisfy Lemma 4.7.1. Then
|E[XY]| ≤ E|XY| ≤ (E|X|^p)^(1/p) (E|Y|^q)^(1/q).
When p = q = 2, this yields the Cauchy-Schwarz Inequality,
|E[XY]| ≤ E|XY| ≤ √( E[X²] E[Y²] ).
Liapunov's Inequality: If 1 < r < s < ∞, then (E|X|^r)^(1/r) ≤ (E|X|^s)^(1/s).