Mathematical Statistics Review - Rice University



Summary of Stat 531/532

Taught by Dr. Dennis Cox

Text: A work-in-progress by Dr. Cox

Supplemented by Statistical Inference by Casella & Berger

Chapter 1: Probability Theory and Measure Spaces

1.1: Set Theory (Casella-Berger)

The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment.

An event is any collection of possible outcomes of an experiment (i.e. any subset of S).

A and B are disjoint (or mutually exclusive) if A ∩ B = ∅. The events A1, A2, … are pairwise disjoint if Ai ∩ Aj = ∅ for all i ≠ j.

If A1, A2, … are pairwise disjoint and ∪_{i=1}^∞ Ai = S, then the collection A1, A2, … forms a partition of S.

1.2: Probability Theory (Casella-Berger)

A collection of subsets of S is called a Borel field (or sigma algebra) denoted by β, if it satisfies the following three properties:

1. ∅ ∈ β

2. If A ∈ β, then A^c ∈ β

3. If A1, A2, … ∈ β, then ∪_{i=1}^∞ Ai ∈ β

i.e. the empty set is contained in β, β is closed under complementation, and β is closed under countable unions.

Given a sample space S and an associated Borel field β, a probability function is a function P with domain β that satisfies:

1. P(A) ≥ 0 for all A ∈ β

2. P(S) = 1

3. If A1, A2, … ∈ β are pairwise disjoint, then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).

Axiom of Finite Additivity: If A ∈ β and B ∈ β are disjoint, then P(A ∪ B) = P(A) + P(B).

1.1: Measures (Cox)

A measure space is denoted as (Ω, F, μ), where Ω is an underlying set, F is a sigma-algebra of subsets of Ω, and μ is a measure, meaning that it satisfies:

1. μ(A) ∈ [0, ∞] for all A ∈ F

2. μ(∅) = 0

3. If A1, A2, … is a sequence of disjoint elements of F, then μ(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ μ(Ai).

A measure, P, where P(Ω)=1 is called a probability measure.

A measure on (ℝ^n, B^n), where B^n denotes the Borel sets, is called a Borel measure.

A measure, #, where #(A)=number of elements in A, is called a counting measure.

A counting measure may be written in terms of unit point masses (UPM) as # = Σ_x δ_x, where δ_x(A) = 1 if x ∈ A and δ_x(A) = 0 otherwise. A unit point mass is itself a probability measure.

There is a unique Borel measure, m, satisfying m([a,b]) = b − a for every finite closed interval [a,b], −∞ < a ≤ b < ∞ (Lebesgue measure).

Properties of Measures:

a) (monotonicity) If A ⊂ B, then μ(A) ≤ μ(B).

b) (subadditivity) μ(∪_i Ai) ≤ Σ_i μ(Ai), where Ai is any sequence of measurable sets.

c) If Ai, i = 1, 2, … is a decreasing sequence of measurable sets (i.e. A1 ⊃ A2 ⊃ …) and if μ(A1) < ∞, then μ(∩_{i=1}^∞ Ai) = lim_{i→∞} μ(Ai).

1.2: Measurable Functions and Integration (Cox)

Properties of the Integral:

a) If ∫ f dμ exists and c ∈ ℝ, then ∫ cf dμ exists and equals c ∫ f dμ.

b) If ∫ f dμ and ∫ g dμ both exist and ∫ f dμ + ∫ g dμ is defined (i.e. not of the form ∞ − ∞), then ∫ (f + g) dμ is defined and equals ∫ f dμ + ∫ g dμ.

c) If f ≤ g, then ∫ f dμ ≤ ∫ g dμ, provided the integrals exist.

d) |∫ f dμ| ≤ ∫ |f| dμ, provided ∫ f dμ exists.

A statement S holds μ-almost everywhere (μ-a.e.) iff there exists a set N ∈ F with μ(N) = 0 such that S is true for all ω ∉ N.

Consider all extended Borel functions on (Ω, F, μ):

Monotone Convergence Theorem: If f1, f2, … is an increasing sequence of nonnegative functions (i.e. 0 ≤ f1 ≤ f2 ≤ …) and fn → f pointwise, then ∫ fn dμ → ∫ f dμ.

Lebesgue’s Dominated Convergence Theorem: If fn → f μ-a.e. as n → ∞ and there exists an integrable function g such that |fn| ≤ g for all n, then ∫ fn dμ → ∫ f dμ.

g is called the dominating function.

Change of Variables Theorem: Suppose T : (Ω, F) → (Λ, G) is measurable and f is a Borel function on (Λ, G). Then ∫_Λ f d(μ∘T⁻¹) = ∫_Ω (f∘T) dμ, in the sense that if either integral exists, so does the other and they are equal.

Interchange of Differentiation and Integration Theorem: For each fixed θ in an open interval, consider a function f(x, θ). If

1. the integral ∫ f(x, θ) dμ(x) is finite,

2. the partial derivative ∂f(x, θ)/∂θ exists, and

3. the absolute value of the partial derivative is bounded by an integrable function g(x), i.e. |∂f(x, θ)/∂θ| ≤ g(x), then

∂f(x, θ)/∂θ is integrable w.r.t. μ and (d/dθ) ∫ f(x, θ) dμ(x) = ∫ ∂f(x, θ)/∂θ dμ(x).
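
A quick numerical sanity check of this interchange (not from the notes; the integrand f(x, θ) = e^{−θx} on (0, ∞) is an arbitrary choice satisfying the three conditions):

# Numerical check of interchanging d/d(theta) and the integral,
# using f(x, theta) = exp(-theta * x) on (0, inf) as an illustrative integrand.
import numpy as np
from scipy.integrate import quad

def f(x, theta):
    return np.exp(-theta * x)

def df_dtheta(x, theta):
    return -x * np.exp(-theta * x)

theta = 2.0
h = 1e-5

# d/dtheta of the integral, via a central finite difference
I_plus, _ = quad(f, 0, np.inf, args=(theta + h,))
I_minus, _ = quad(f, 0, np.inf, args=(theta - h,))
lhs = (I_plus - I_minus) / (2 * h)

# integral of the partial derivative
rhs, _ = quad(df_dtheta, 0, np.inf, args=(theta,))

print(lhs, rhs)  # both close to -1/theta**2 = -0.25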

1.3: Measures on Product Spaces (Cox)

A measure space (Ω, F, μ) is called σ-finite iff there exists a sequence A1, A2, … in F such that

i) μ(Ai) < ∞ for each i, and

ii) ∪_{i=1}^∞ Ai = Ω.

Product Measure Theorem: Let (Ω1, F1, μ1) and (Ω2, F2, μ2) be σ-finite measure spaces. Then there exists a unique product measure μ1 × μ2 on (Ω1 × Ω2, F1 ⊗ F2) satisfying (μ1 × μ2)(A1 × A2) = μ1(A1) μ2(A2) for all A1 ∈ F1, A2 ∈ F2.

Fubini’s Theorem: Let (Ω, F, μ) = (Ω1 × Ω2, F1 ⊗ F2, μ1 × μ2), where μ1 and μ2 are σ-finite. If f is a Borel function on (Ω, F) whose integral w.r.t. μ exists, then ∫_Ω f d(μ1 × μ2) = ∫_{Ω1} [ ∫_{Ω2} f(ω1, ω2) dμ2(ω2) ] dμ1(ω1).

Conclude that the inner integral ∫_{Ω2} f(ω1, ω2) dμ2(ω2) exists for μ1-a.e. ω1 and defines a Borel function on Ω1 whose integral w.r.t. μ1 exists.

A measurable function X : (Ω, F) → (ℝ^n, B^n) is called an n-dimensional random vector (a.k.a. random n-vector). The distribution of X is the same as the law of X.

• Law(X) = P_X = P ∘ X⁻¹, i.e. P_X(B) = P(X ∈ B) for B ∈ B^n.

1.4: Densities and The Radon-Nikodym Theorem (Cox)

Let ν and μ be measures on (Ω, F). ν is absolutely continuous w.r.t. μ (written ν ≪ μ) iff μ(A) = 0 implies ν(A) = 0 for all A ∈ F. This can also be said as μ dominates ν (i.e. μ is a dominating measure for ν). If both measures dominate one another, then they are equivalent.

Radon-Nikodym Theorem: Let (Ω, F, μ) be a σ-finite measure space and suppose ν ≪ μ. Then there exists a nonnegative Borel function f = dν/dμ such that ν(A) = ∫_A f dμ for all A ∈ F. Furthermore, f is unique μ-a.e.
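
A minimal numerical illustration (not from the notes): with μ = Lebesgue measure and ν = the N(0,1) law, ν ≪ μ and dν/dμ is the standard normal pdf, so ν(A) = ∫_A (dν/dμ) dμ.

# Check nu(A) = integral over A of the Radon-Nikodym derivative for A = [a, b]
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 2.0                                  # A = [a, b]
nu_A = norm.cdf(b) - norm.cdf(a)                  # nu(A) = P(a <= X <= b)
integral, _ = quad(norm.pdf, a, b)                # integral_A (d nu / d mu) d mu

print(nu_A, integral)                             # equal up to quadrature error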

1.5: Conditional Expectation (Cox)

Theorem 1.5.1: Suppose Y : (Ω, F) → (Λ, G) is a random element and Z is a random variable on (Ω, F). Then Z is σ(Y)-measurable iff there exists a Borel function h such that Z = h(Y).

Let X be a random variable with E|X| < ∞, and suppose G is a sub-σ-field of F. Then the conditional expectation of X given G, denoted E[X|G], is the essentially unique random variable satisfying (when G = σ(Y), one writes E[X|Y] for E[X|σ(Y)])

i) E[X|G] is G-measurable

ii) ∫_A E[X|G] dP = ∫_A X dP for all A ∈ G.

Proposition 1.5.4: Suppose X and Y are random elements and νi is a σ-finite measure on the corresponding range space for i = 1, 2 such that Law(X, Y) ≪ ν1 × ν2. Let f_{X,Y} denote the corresponding joint density. Let g be any Borel function with E|g(X)| < ∞. Then E[g(X)|Y] = ∫ g(x) f_{X,Y}(x, Y) dν1(x) / ∫ f_{X,Y}(x, Y) dν1(x), a.s.

The conditional density of X given Y is denoted by f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).

Proposition 1.5.5: Suppose the assumptions of proposition 1.5.4 hold. Then the following are true:

i) the family of regular conditional distributions P[X ∈ · | Y = y] exists;

ii) P[X ∈ B | Y = y] = ∫_B f_{X|Y}(x|y) dν1(x) for B ∈ B1;

iii) the Radon-Nikodym derivative is given by d P[X ∈ · | Y = y] / dν1 = f_{X|Y}(· | y).

Theorem 1.5.7 - Basic Properties of Conditional Expectation: Let X, X1, and X2 be integrable r.v.’s on (Ω, F, P), and let G be a fixed sub-σ-field of F.

a) If X = c a.s., c a constant, then E[X|G] = c a.s.

b) If X1 ≤ X2 a.s., then E[X1|G] ≤ E[X2|G] a.s.

c) If a1, a2 ∈ ℝ, then E[a1X1 + a2X2 | G] = a1 E[X1|G] + a2 E[X2|G], a.s.

d) (Law of Total Expectation) E[E[X|G]] = E[X]. (A simulation sketch follows this list.)

e) |E[X|G]| ≤ E[|X| | G] a.s.

f) If X is independent of G (i.e. σ(X) and G are independent), then E[X|G] = E[X] a.s.

g) (Law of Successive Conditioning) If G0 is a sub-σ-field of G, then E[ E[X|G] | G0 ] = E[X|G0] a.s.

h) If Y is G-measurable and E|XY| < ∞, then E[XY|G] = Y E[X|G] a.s.
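
A Monte Carlo sketch of property d), the Law of Total Expectation (the Gamma/Poisson model below is an arbitrary illustrative choice, not from the notes):

# Law of Total Expectation, E[E[X|G]] = E[X], with G = sigma(Y):
# Y ~ Gamma(2, 1) and X | Y ~ Poisson(Y), so E[X|Y] = Y and E[X] = E[Y] = 2.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

y = rng.gamma(shape=2.0, scale=1.0, size=n)
x = rng.poisson(lam=y)

print(x.mean())          # E[X], should be close to 2
print(y.mean())          # E[E[X|Y]] = E[Y], should also be close to 2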

Theorem 1.5.8 – Convergence Theorems for Conditional Expectation: Let X, X1, X2, … be integrable r.v.’s on (Ω, F, P), and let G be a fixed sub-σ-field of F.

a) (Monotone Convergence Theorem) If 0 ≤ Xi ↑ X a.s., then E[Xi|G] ↑ E[X|G] a.s.

b) (Dominated Convergence Theorem) Suppose there is an integrable r.v. Y such that |Xi| ≤ Y a.s. for all i, and suppose that Xi → X a.s. Then E[Xi|G] → E[X|G] a.s.

Two Stage Experiment Theorem: Let (X, B1) and (Y, B2) be measurable spaces and suppose P(·|·) : B2 × X → [0, 1] satisfies the following:

i) P(B|x) is Borel measurable in x for each fixed B ∈ B2;

ii) P(·|x) is a Borel p.m. on (Y, B2) for each fixed x ∈ X.

Let ν be any p.m. on (X, B1). Then there is a unique p.m. P on (X × Y, B1 ⊗ B2) satisfying

P(A × B) = ∫_A P(B|x) dν(x) for all A ∈ B1, B ∈ B2.

Theorem 1.5.12 - Bayes Formula: Suppose Θ is a random element taking values in the parameter space and let λ be a σ-finite measure on that space with Law(Θ) ≪ λ. Denote the corresponding density (the prior) by

π(θ) = dLaw(Θ)/dλ (θ).

Let μ be a σ-finite measure on the sample space. Suppose that for each θ there is a given pdf w.r.t. μ denoted f(x|θ). Denote by X a random element taking values in the sample space with conditional density f(x|θ) given Θ = θ. Then there is a version of the posterior distribution of Θ given X = x with density given by

π(θ|x) = f(x|θ) π(θ) / ∫ f(x|θ′) π(θ′) dλ(θ′).
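
A sketch of the Bayes formula in the conjugate Beta/binomial case (the prior parameters and data below are arbitrary illustrative choices), computing the posterior on a grid and comparing it with the known Beta(a + x, b + n − x) answer:

# Posterior density via the Bayes formula on a grid, vs. the conjugate answer.
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0        # prior Beta(a, b)
n, x = 20, 7           # observed x successes in n trials

theta = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(theta, a, b)
likelihood = binom.pmf(x, n, theta)

numerator = likelihood * prior
dtheta = theta[1] - theta[0]
posterior = numerator / (numerator.sum() * dtheta)   # normalize by the marginal of X

exact = beta.pdf(theta, a + x, b + n - x)
print(np.max(np.abs(posterior - exact)))   # small (grid approximation error)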

Chapter 2: Transformations and Expectations

2.1: Moments and Moment Inequalities (Cox)

A moment refers to an expectation of a random variable or a function of a random variable.

• The first moment is the mean

• The kth moment is E[X^k]

• The kth central moment is E[(X − E[X])^k] = E[(X − μ)^k]

Finite second moments imply finite first moments (i.e. E[X²] < ∞ ⇒ E|X| < ∞).

Markov’s Inequality: Suppose X ≥ 0 a.s. Then for any a > 0, P(X ≥ a) ≤ E[X]/a.

Corollary to Markov’s Inequality: For any r > 0 and a > 0, P(|X| ≥ a) ≤ E|X|^r / a^r.

Chebyshev’s Inequality: Suppose X is a random variable with E[X²] < ∞. Then for any a > 0,

P(|X − E[X]| ≥ a) ≤ Var(X)/a².
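
A quick Monte Carlo check of Chebyshev’s inequality (the exponential(1) sample is an arbitrary choice; any distribution with a finite second moment would do):

# Empirical tail probability vs. the Chebyshev bound Var(X)/a^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, variance 1

for a in (1.0, 2.0, 3.0):
    lhs = np.mean(np.abs(x - x.mean()) >= a)     # empirical P(|X - E X| >= a)
    rhs = x.var() / a**2                         # Chebyshev bound
    print(a, lhs, rhs, lhs <= rhs)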

Jensen’s Inequality: Let f be a convex function on a convex set C ⊂ ℝ^n and suppose X is a random n-vector with E‖X‖ < ∞ and X ∈ C a.s. Then E[X] ∈ C and f(E[X]) ≤ E[f(X)].

• If f is concave, the inequality is reversed.

• If f is strictly convex (or strictly concave), the inequality is strict unless X is a.s. constant.

Cauchy-Schwarz Inequality: E|XY| ≤ (E[X²])^{1/2} (E[Y²])^{1/2}, i.e. (E[XY])² ≤ E[X²] E[Y²].

Theorem 2.1.6 (Spectral Decomposition of a Symmetric Matrix): Let A be a symmetric matrix. Then there is an orthogonal matrix U and a diagonal matrix Λ such that A = U Λ Uᵀ.

• The diagonal entries of Λ are the eigenvalues of A.

• The columns of U are the corresponding eigenvectors.
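
A minimal numpy sketch of the spectral decomposition (the symmetric matrix below is an arbitrary example):

# Spectral decomposition A = U L U^T of a symmetric matrix.
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])

eigvals, U = np.linalg.eigh(A)     # eigh is for symmetric/Hermitian matrices
L = np.diag(eigvals)

print(np.allclose(U @ L @ U.T, A))       # reconstruction A = U L U^T
print(np.allclose(U.T @ U, np.eye(3)))   # U is orthogonal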

2.2: Characteristic and Moment Generating Functions (Cox)

The characteristic function (chf) of a random n-vector X is the complex-valued function φ_X : ℝ^n → ℂ given by φ_X(u) = E[exp(i uᵀX)].

The moment generating function (mgf) is defined as M_X(u) = E[exp(uᵀX)].

Theorem 2.2.1: Let X be a random n-vector with chf φ_X and mgf M_X.

a) (Continuity) φ_X is uniformly continuous on ℝ^n, and M_X is continuous at every point u0 such that M_X(u) < ∞ for all u in a neighborhood of u0.

b) (Relation to moments) If X is integrable, then the gradient ∇φ_X is defined at u = 0 and equals i E[X]. Also, X has finite second moments iff the Hessian ∇²φ_X of φ_X exists at u = 0, and then ∇²φ_X(0) = −E[XXᵀ].

c) (Linear Transformation Formulae) Let Y = AX + b for some m × n matrix A and some m-vector b. Then for all u ∈ ℝ^m

φ_Y(u) = exp(i uᵀb) φ_X(Aᵀu)

M_Y(u) = exp(uᵀb) M_X(Aᵀu)

d) (Uniqueness) If Y is a random n-vector and if φ_X(u) = φ_Y(u) for all u ∈ ℝ^n, then Law(X) = Law(Y). If M_X and M_Y are both defined and equal in a neighborhood of 0, then Law(X) = Law(Y).

e) (Chf for Sums of Independent Random Variables) Suppose X1 and X2 are independent random p-vectors and let Y = X1 + X2. Then φ_Y(u) = φ_{X1}(u) φ_{X2}(u).

The Cumulant Generating Function is K(u) = ln M_X(u).

2.3: Common Distributions Used in Statistics (Cox)

Location-Scale Families

A family of distributions is a location family if its densities have the form f(x − μ), μ ∈ ℝ.

• I.e. if X0 has density f(x), then X = X0 + μ has density f(x − μ).

A family of distributions is a scale family if its densities have the form (1/σ) f(x/σ), σ > 0.

• If X0 has density f(x), then X = σX0 has density (1/σ) f(x/σ).

A family of distributions is a location-scale family if its densities have the form (1/σ) f((x − μ)/σ), μ ∈ ℝ, σ > 0.

• If X0 has density f(x), then X = μ + σX0 has density (1/σ) f((x − μ)/σ).

A class of transformations T on (X, B) is called a transformation group iff the following hold:

i) Every g ∈ T is a measurable one-to-one transformation of X onto itself.

ii) T is closed under composition, i.e. if g1 and g2 are in T, then so is g2 ∘ g1.

iii) T is closed under taking inverses, i.e. if g ∈ T, then g⁻¹ ∈ T.

Let T be a transformation group and let P be a family of probability measures on (X, B). Then P is T-invariant iff for every g ∈ T and every P ∈ P, the law of g(X) when X ~ P is again a member of P (i.e. P ∘ g⁻¹ ∈ P).

2.4: Distributional Calculations (Cox)

Jensen’s Inequality for Conditional Expectation: If g(·, y) is a convex function for each y, then g(E[X|Y = y], y) ≤ E[g(X, y) | Y = y], Law[Y]-a.s.

2.1: Distribution of Functions of a Random Variable (Casella-Berger)

Transformation for Monotone Functions:

Theorem 2.1.1 (for cdf): Let X have cdf F_X(x), let Y = g(X), and let the sets X* and Y* be defined as X* = {x : f_X(x) > 0} and Y* = {y : y = g(x) for some x ∈ X*}.

1. If g is an increasing function on X*, then F_Y(y) = F_X(g⁻¹(y)) for y ∈ Y*.

2. If g is a decreasing function on X* and X is a continuous random variable, then

F_Y(y) = 1 − F_X(g⁻¹(y)) for y ∈ Y*.

Theorem 2.1.2 (for pdf): Let X have pdf f_X(x) and let Y = g(X), where g is a monotone function. Let X* and Y* be defined as above. Suppose f_X(x) is continuous on X* and that g⁻¹(y) has a continuous derivative on Y*. Then the pdf of Y is given by

f_Y(y) = f_X(g⁻¹(y)) |d g⁻¹(y)/dy| for y ∈ Y*, and f_Y(y) = 0 otherwise.
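
A sketch of this formula for g(x) = e^x with X ~ N(0,1) (an illustrative choice, not from the notes), comparing f_Y(y) = f_X(g⁻¹(y)) |d g⁻¹(y)/dy| with scipy’s lognormal density:

# Density of Y = exp(X), X ~ N(0, 1), via the transformation formula.
import numpy as np
from scipy.stats import norm, lognorm

y = np.linspace(0.05, 5.0, 200)

# g^{-1}(y) = log(y), |d/dy g^{-1}(y)| = 1/y
f_Y = norm.pdf(np.log(y)) * (1.0 / y)

print(np.max(np.abs(f_Y - lognorm.pdf(y, s=1.0))))   # should be ~0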

The Probability Integral Transformation:

Theorem 2.1.4: Let X have a continuous cdf F_X(x) and define the random variable Y as

Y = F_X(X). Then Y is uniformly distributed on (0,1), i.e. P(Y ≤ y) = y, 0 < y < 1. (Interpretation: if X is continuous and Y is obtained by plugging X into its own cdf, then Y is U(0,1).)
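
An illustration of the probability integral transformation (using X ~ exponential(1) as an arbitrary continuous example):

# Y = F_X(X) should be U(0, 1) when F_X is continuous.
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)
y = expon.cdf(x)                       # Y = F_X(X)

print(kstest(y, 'uniform'))            # large p-value => consistent with U(0, 1)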

2.2: Expected Values (Casella-Berger)

The expected value or mean of a random variable g(X), denoted Eg(X), is

E g(X) = ∫_{−∞}^{∞} g(x) f_X(x) dx if X is continuous, and E g(X) = Σ_x g(x) P(X = x) if X is discrete.

2.3: Moments and Moment Generating Functions (Casella-Berger)

For each integer n, the nth moment of X, μ_n′, is μ_n′ = E[X^n]. The nth central moment of X is μ_n = E[(X − μ)^n], where μ = E[X].

• Mean = 1st moment of X

• Variance = 2nd central moment of X

The moment generating function of X is M_X(t) = E[e^{tX}].

NOTE: The nth moment is equal to the nth derivative of the mgf evaluated at t=0.

i.e. E[X^n] = M_X^{(n)}(0) = (d^n/dt^n) M_X(t) evaluated at t = 0.
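
A symbolic check of this relationship using the N(μ, σ²) mgf M(t) = exp(μt + σ²t²/2) (an illustrative choice; assumes sympy is available):

# nth derivative of the mgf at t = 0 gives E[X^n].
import sympy as sp

t, mu = sp.symbols('t mu')
sigma = sp.symbols('sigma', positive=True)
M = sp.exp(mu*t + sigma**2*t**2/2)

EX  = sp.diff(M, t, 1).subs(t, 0)      # first moment
EX2 = sp.diff(M, t, 2).subs(t, 0)      # second moment

print(sp.simplify(EX))                 # mu
print(sp.simplify(EX2))                # mu**2 + sigma**2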

Useful relationship: the Binomial Formula is (x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^{n−k}.

Lemma 2.3.1: Let a1, a2, … be a sequence of numbers converging to a. Then lim_{n→∞} (1 + a_n/n)^n = e^a.

Chapter 3: Fundamental Concepts of Statistics

3.1: Basic Notations of Statistics (Cox)

Statistics is the art and science of gathering, analyzing, and making inferences from data.

Four parts to statistics:

1. Models

2. Inference

3. Decision Theory

4. Data Analysis, model checking, robustness, study design, etc…

Statistical Models

A statistical model consists of three things:

1. a measurable space (Ω, F),

2. a collection P of probability measures on (Ω, F), and

3. a collection of possible observable random vectors X defined on (Ω, F).

Statistical Inference

i) Point Estimation: Goal – estimate the true θ from the data.

ii) Hypothesis Testing: Goal – choose between two hypotheses: the null or the alternative.

iii) Interval & Set Estimation: Goal – find a set C(X) such that it’s likely true that θ ∈ C(X).


Decision Theory

Nature vs. the Statistician

Nature picks a value of θ ∈ Θ and generates X according to P_θ.

There exists an action space of allowable decisions/actions and the Statistician must choose a decision rule from a class of allowable decision rules.

There is also a loss function based on the Statistician’s decision rule chosen and Nature’s true value picked.

Risk is Expected Loss, denoted R(θ, δ) = E_θ[L(θ, δ(X))].

A Δ-optimal decision rule δ* is one that satisfies R(θ, δ*) ≤ R(θ, δ) for all θ ∈ Θ and all δ ∈ Δ.

Bayesian Decision Theory

Suppose Nature chooses the parameter as well as the data at random.

We know the distribution Nature uses for selecting θ is π, the prior distribution.

Goal is to minimize the Bayes risk: r(π, δ) = ∫ R(θ, δ) π(dθ) = E[L(Θ, δ(X))].

The posterior risk of an action is E[L(Θ, a) | X = x], where a = any allowable action.

Want to find the Bayes rule δ_π by minimizing the posterior risk for each x ⇒ δ_π(x) = argmin_a E[L(Θ, a) | X = x].

This then implies that r(π, δ_π) ≤ r(π, δ) for any other decision rule δ.

3.2: Sufficient Statistics (Cox)

A statistic T(X) is sufficient for θ iff the conditional distribution of X given T = t does not depend on θ. Intuitively, a sufficient statistic contains all the information about θ that is in X.

A loss function, L(θ, a), is convex if (i) the action space A is a convex set and (ii) for each θ, L(θ, ·) is a convex function.

Rao-Blackwell Theorem: Let X1, …, Xn be iid random variables with pdf f(x|θ). Let T be a sufficient statistic for θ and U be an unbiased estimator of g(θ), which is not a function of T alone. Set Ũ = E[U|T]. Then we have that:

1) The random variable Ũ is a function of the sufficient statistic T alone.

2) Ũ is an unbiased estimator for g(θ).

3) Var_θ(Ũ) < Var_θ(U) for all θ, provided Var_θ(U) < ∞.
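
A Monte Carlo sketch of Rao-Blackwellization for X1, …, Xn iid Bernoulli(p) (a standard textbook example, not necessarily the one in the notes): U = X1 is unbiased for p, T = ΣXi is sufficient, and Ũ = E[X1|T] = T/n.

# Compare the crude unbiased estimator with its Rao-Blackwellized version.
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 10, 100_000

x = rng.binomial(1, p, size=(reps, n))
U = x[:, 0]                 # unbiased but crude estimator of p
U_tilde = x.mean(axis=1)    # E[X1 | sum X] = T/n, the sample mean

print(U.mean(), U_tilde.mean())     # both approx p (unbiased)
print(U.var(), U_tilde.var())       # variance drops from ~p(1-p) to ~p(1-p)/n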

Factorization Theorem: T(X) is a sufficient statistic for θ iff f(x|θ) = g(T(x)|θ) h(x) for some functions g and h.

• If you can put the distribution in the form f(x|θ) = g(T1(x), …, Tk(x)|θ) h(x), then T1, …, Tk are sufficient statistics (a worked example follows).
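
As a standard illustration (not taken from the notes), factoring an iid Poisson(λ) sample shows that T(x) = Σ xi is sufficient:

f(x1, …, xn | λ) = Π_{i=1}^n e^{−λ} λ^{xi} / xi! = [ e^{−nλ} λ^{Σ_i xi} ] · [ 1 / Π_{i=1}^n xi! ] = g(T(x) | λ) · h(x), with T(x) = Σ_{i=1}^n xi.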

Minimal Sufficient Statistic

If m is the smallest number for which T = (T1(X), …, Tm(X)) is a sufficient statistic for θ, then T is called a minimal sufficient statistic for θ.

Alternative definition: If T is a sufficient statistic for a family P, then T is minimal sufficient iff for any other sufficient statistic S, T = h(S) P-a.s. for some function h. I.e. the minimal sufficient statistic can be written as a function of any other sufficient statistic.

Proposition 4.2.5: Let P be an exponential family on a Euclidean space with densities f(x|θ) = c(θ) h(x) exp[η(θ)ᵀ T(x)], where η(θ) and T(x) are p-dimensional. Suppose there exist θ0, θ1, …, θp such that the vectors η(θ1) − η(θ0), …, η(θp) − η(θ0) are linearly independent in ℝ^p. Then T is minimal sufficient for θ. In particular, if the exponential family is full rank, then T is minimal sufficient.

3.3: Complete Statistics and Ancillary Statistics (Cox)

A statistic T is complete if for every function g, E_θ[g(T)] = 0 for all θ implies P_θ(g(T) = 0) = 1 for all θ.

Example: Consider the Poisson family F = {Poisson(λ) : λ > 0}. If E_λ[g(X)] = Σ_{x=0}^∞ g(x) e^{−λ} λ^x / x! = 0 for all λ > 0, then every coefficient g(x)/x! of the power series in λ must be zero, so g(x) = 0 for all x. So F is complete.

A statistic V is ancillary iff its distribution does not depend on θ, i.e. Law_θ(V) is the same for all θ ∈ Θ.

Basu’s Theorem: Suppose T is a complete and sufficient statistic and V is ancillary for θ. Then T and V are independent.
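
A Monte Carlo illustration of Basu’s theorem for the N(θ, 1) family, where T = the sample mean is complete sufficient and V = the sample variance is ancillary (zero correlation is of course only a necessary consequence of independence, used here as a quick sanity check):

# Sample mean and sample variance should be independent for normal samples.
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 5.0, 8, 50_000

x = rng.normal(theta, 1.0, size=(reps, n))
T = x.mean(axis=1)
V = x.var(axis=1, ddof=1)

print(np.corrcoef(T, V)[0, 1])   # approximately 0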

A statistic T = T(X1, …, Xn) is:

• Location invariant iff T(X1 + a, …, Xn + a) = T(X1, …, Xn) for all a

• Location equivariant iff T(X1 + a, …, Xn + a) = T(X1, …, Xn) + a for all a

• Scale invariant iff T(cX1, …, cXn) = T(X1, …, Xn) for all c > 0

• Scale equivariant iff T(cX1, …, cXn) = c T(X1, …, Xn) for all c > 0

PPN: If the pdf is a location family and T is location invariant, then T is also ancillary.

PPN: If the pdf is a scale family with iid observations and T is scale invariant, then T is also ancillary.

❖ A sufficient statistic has ALL the information with respect to a parameter.

❖ An ancillary statistic has NO information with respect to a parameter.

3.1: Discrete Distributions (Casella-Berger)

Uniform, U(N0,N1): P(X = x) = 1/(N1 − N0 + 1), x = N0, N0 + 1, …, N1

• puts equal mass on each outcome (i.e. x = 1,2,...,N)

Hypergeometric: P(X = x) = C(M, x) C(N − M, K − x) / C(N, K)

• sampling without replacement

• example: N balls, M of which are one color, N-M another, and you select a sample of size K.

• restriction: max(0, K − (N − M)) ≤ x ≤ min(M, K)

Bernoulli: P(X = 1) = p, P(X = 0) = 1 − p, 0 ≤ p ≤ 1

• has only two possible outcomes

Binomial: P(X = x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, …, n

• X = Σ_{i=1}^n Xi with Xi iid Bernoulli(p) (binomial is the sum of i.i.d. bernoulli trials)

• counts the number of successes in a fixed number of bernoulli trials

Binomial Theorem: For any real numbers x and y and integer n ≥ 0, (x + y)^n = Σ_{k=0}^n C(n, k) x^k y^{n−k}.

Poisson: P(X = x) = e^{−λ} λ^x / x!, x = 0, 1, 2, …

• Assumption: For a small time interval, the probability of an arrival is proportional to the length of waiting time. (i.e. waiting for a bus or customers) Obviously, the longer one waits, the more likely it is a bus will show up.

• other applications: spatial distributions (i.e. fish in a lake)

Poisson Approximation to the Binomial Distribution: If n is large and p is small, then Binomial(n, p) ≈ Poisson(λ) with λ = np.
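
A quick numerical comparison of the two pmfs (the values of n and p below are arbitrary illustrative choices):

# Binomial(n, p) vs. Poisson(n*p) for large n, small p.
import numpy as np
from scipy.stats import binom, poisson

n, p = 500, 0.01
lam = n * p
k = np.arange(0, 21)

print(np.max(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, lam))))  # small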

Negative Binomial: P(X = x) = C(x − 1, r − 1) p^r (1 − p)^{x−r}, x = r, r + 1, …

• X = trial at which the rth success occurs

• counts the number of bernoulli trials necessary to get a fixed number of successes (i.e. number of trials to get x successes)

• independent bernoulli trials (not necessarily identical)

• must be r-1 successes in first x-1 trials...and then the rth success

• can also be viewed as Y, the number of failures before rth success, where Y=X-r.

• NB(r,p) → Poisson(λ) as r → ∞ and p → 1 with r(1 − p) → λ

Geometric: P(X = x) = p(1 − p)^{x−1}, x = 1, 2, …

• Same as NB(1,p)

• X = trial at which first success occurs

• distribution is “memoryless”, i.e. P(X > s | X > t) = P(X > s − t) for s > t (checked numerically after this list)

o Interpretation: The probability of getting s failures after already getting t failures is the same as getting s-t failures right from the start.
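
A numerical check of the memoryless property using scipy’s geometric survival function (the values of p, s, t below are arbitrary):

# P(X > s | X > t) = P(X > s - t) for the geometric distribution.
from scipy.stats import geom

p, s, t = 0.2, 12, 5
lhs = geom.sf(s, p) / geom.sf(t, p)   # P(X > s | X > t)
rhs = geom.sf(s - t, p)               # P(X > s - t)
print(lhs, rhs)                       # equal (both = (1-p)**(s-t))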

3.2: Continuous Distributions (Casella-Berger)

Uniform, U(a,b): f(x) = 1/(b − a), a ≤ x ≤ b

• spreads mass uniformly over an interval

Gamma: f(x) = 1/(Γ(α) β^α) x^{α−1} e^{−x/β}, x > 0, α > 0, β > 0

• α = shape parameter; β = scale parameter

• Γ(α + 1) = α Γ(α), α > 0

• Γ(n) = (n − 1)! for any integer n > 0

• Γ(1/2) = √π

• Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt = Gamma Function

• if α is an integer, then gamma is related to Poisson: P(X ≤ x) = P(Y ≥ α), where Y ~ Poisson(λ) with λ = x/β

• E[X] = αβ, Var(X) = αβ²

• Exponential ~ Gamma(1,β)

• If X ~ exponential(β), then X^{1/γ} ~ Weibull(γ,β)

Normal: f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), −∞ < x < ∞

• symmetric, bell-shaped

• often used to approximate other distributions

Beta: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}, 0 < x < 1, α > 0, β > 0

• used to model proportions (since domain is (0,1))

• Beta(1,1) = U(0,1)

Cauchy: f(x) = 1 / {πσ[1 + ((x − θ)/σ)²]}, −∞ < x < ∞, σ > 0

• Symmetric, bell-shaped

• Similar to the normal distribution, but the mean does not exist

• θ is the median

• the ratio of two standard normals is Cauchy

Lognormal: f(x) = (1/(x σ√(2π))) exp(−(ln x − μ)²/(2σ²)), x > 0

• used when the logarithm of a random variable is normally distributed

• used in applications where the variable of interest is right skewed

Double Exponential: f(x) = (1/(2σ)) exp(−|x − μ|/σ), −∞ < x < ∞, σ > 0

• reflects the exponential around its mean

• symmetric with “fat” tails

• has a peak (a sharp, nondifferentiable point) at x = μ

3.3: Exponential Families (Casella-Berger)

A pdf is an exponential family if it can be expressed as

f(x|θ) = h(x) c(θ) exp( Σ_{i=1}^k w_i(θ) t_i(x) ).

This can also be written in the natural parameterization as f(x|η) = h(x) c*(η) exp( Σ_{i=1}^k η_i t_i(x) ).

The set H = {η = (η1, …, ηk) : ∫ h(x) exp(Σ_i η_i t_i(x)) dx < ∞} is called the natural parameter space. H is convex.
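
As a standard illustration (not specific to these notes), the Bernoulli(p) family can be put in this form:

f(x|p) = p^x (1 − p)^{1−x} = (1 − p) exp[ x ln(p/(1 − p)) ], x ∈ {0, 1},

so h(x) = 1, c(p) = 1 − p, t1(x) = x, and w1(p) = ln(p/(1 − p)); the natural parameter η = ln(p/(1 − p)) ranges over H = ℝ, which is convex.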

3.4: Location and Scale Families (Casella-Berger)

SEE COX (SECTION 2.3 ABOVE) FOR DEFINITIONS AND EXAMPLE PDF’S FOR EACH FAMILY.


Chapter 4: Multiple Random Variables

4.1: Joint and Marginal Distributions (Casella-Berger)

An n-dimensional random vector is a function from a sample space S into R^n, n-dimensional Euclidean space.

Let (X,Y) be a discrete bivariate random vector. Then the function f(x,y) from R² to R defined by f(x,y) = f_{X,Y}(x,y) = P(X = x, Y = y) is called the joint probability mass function or the joint pmf of (X,Y).

Let (X,Y) be a discrete bivariate random vector with joint pmf f_{X,Y}(x,y). Then the marginal pmfs of X and Y are f_X(x) = Σ_y f_{X,Y}(x, y) and f_Y(y) = Σ_x f_{X,Y}(x, y).

On the continuous side, the joint probability density function or joint pdf is the function f(x,y) satisfying

P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy for every A ⊂ R².

The continuous marginal pdfs are f_X(x) = ∫_{−∞}^{∞} f(x, y) dy and f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx, for −∞ < x < ∞ and −∞ < y < ∞.

Lemma 4.7.1: Let a > 0 and b > 0, and let p > 1 and q > 1 satisfy 1/p + 1/q = 1. Then (1/p)a^p + (1/q)b^q ≥ ab, with equality only if a^p = b^q.

Hölder’s Inequality: Let X and Y be random variables and let p and q satisfy Lemma 4.7.1. Then

E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.

When p = q = 2, this yields the Cauchy-Schwarz Inequality,

E|XY| ≤ (E[X²])^{1/2} (E[Y²])^{1/2}.

Liapunov’s Inequality: For 1 < r < s < ∞, (E|X|^r)^{1/r} ≤ (E|X|^s)^{1/s}.