Statistics 550 Notes 5

Reading: Section 1.3-1.4

I. More details for Admissibility Example from last class.

For X₁, ..., Xₙ iid from a family of densities {f_θ : θ ∈ Θ} whose members all have the same support (for example, N(θ, 1)), the constant estimator δ_c(X₁, ..., Xₙ) ≡ c is an admissible point estimator of θ for squared error loss. Write X = (X₁, ..., Xₙ) and let f_θ also denote the joint density of X.

Proof: Suppose δ_c is inadmissible. Then there exists a decision procedure δ* that dominates δ_c. This implies that R(θ, δ*) ≤ R(θ, δ_c) for all θ. Hence, E_{θ=c}[(δ*(X) − c)²] ≤ E_{θ=c}[(δ_c(X) − c)²] = 0. Since (δ*(X) − c)² is nonnegative, this implies P_{θ=c}(δ*(X) = c) = 1.

Let A be the event that δ*(X) ≠ δ_c(X) = c. We want to show that P_θ(A) = 0 for all θ. This would mean that δ*(X) = δ_c(X) with probability 1 for all θ, which would mean that R(θ, δ*) = R(θ, δ_c) for all θ; this would contradict the assumption that δ* dominates δ_c and prove that δ_c is admissible.

To show that P_θ(A) = 0 for all θ, we use the importance sampling idea that the expectation of a random variable h(X) under a density f can be evaluated as the expectation of the random variable h(X)f(X)/g(X) under a density g, as long as f and g have the same support:

E_f[h(X)] = ∫ h(x) f(x) dx = ∫ h(x) [f(x)/g(x)] g(x) dx = E_g[h(X) f(X)/g(X)].    (0.1)

Since P_{θ=c}(A) = 0, the random variable

1_A(X) f_θ(X)/f_c(X)

is zero with probability one under θ = c. Thus, by (0.1), P_θ(A) = E_θ[1_A(X)] = E_{θ=c}[1_A(X) f_θ(X)/f_c(X)] = 0 for all θ.
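To make identity (0.1) concrete, here is a small Monte Carlo sketch (not part of the original notes; the target N(1.5, 1), the proposal N(0, 1), and the event {X > 1} are arbitrary illustrative choices, and numpy/scipy are assumed to be available). It estimates the same probability directly under f and by reweighting draws from g with the density ratio f/g:

# Monte Carlo check of the importance-sampling identity (0.1):
# E_f[h(X)] = E_g[h(X) f(X) / g(X)] when f and g share the same support.
# Illustration: f = N(1.5, 1), g = N(0, 1), h(x) = 1{x > 1}.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta, n_draws = 1.5, 1_000_000

# Direct estimate of P_f(X > 1) using draws from f
x_f = rng.normal(theta, 1, n_draws)
direct = np.mean(x_f > 1)

# Reweighted estimate using draws from g and the ratio f(x)/g(x)
x_g = rng.normal(0, 1, n_draws)
weights = norm.pdf(x_g, loc=theta, scale=1) / norm.pdf(x_g, loc=0, scale=1)
reweighted = np.mean((x_g > 1) * weights)

print(direct, reweighted, 1 - norm.cdf(1, loc=theta, scale=1))  # all approximately 0.69

Both estimates agree with the exact value, which is the content of (0.1).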

II. Review of Example 2 from last class and Bayes criteria

Example 2: We are trying to decide whether to drill a location for oil. There are two possible states of nature,

[pic]location contains oil and [pic]location doesn’t contain oil. We are considering three actions, [pic]=drill for oil, [pic]=sell the location or [pic]=sell partial rights to the location.

The following loss function L(θ, a) is decided on:

| | a₁ (Drill) | a₂ (Sell) | a₃ (Partial rights) |
| θ₁ (Oil) | 0 | 10 | 5 |
| θ₂ (No oil) | 12 | 1 | 6 |

An experiment is conducted to obtain information about θ, resulting in the random variable X with possible values 0, 1 and frequency function p(x | θ) given by the following table:

| | X = 0 | X = 1 |
| θ₁ (Oil) | 0.3 | 0.7 |
| θ₂ (No oil) | 0.6 | 0.4 |

X indicates the presence (X = 1) or absence (X = 0) of a certain geological formation that is more likely to be present when there is oil.

The nine possible nonrandomized decision procedures δ: {0, 1} → {a₁, a₂, a₃} are listed below, together with their risks R(θ, δ), their Bayes risks r(π, δ) for the prior π with π(θ₁) = 0.2, and their maximum risks:

| Rule | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| δ(0) | a₁ | a₁ | a₁ | a₂ | a₂ | a₂ | a₃ | a₃ | a₃ |
| δ(1) | a₁ | a₂ | a₃ | a₁ | a₂ | a₃ | a₁ | a₂ | a₃ |
| R(θ₁, δ) | 0 | 7 | 3.5 | 3 | 10 | 6.5 | 1.5 | 8.5 | 5 |
| R(θ₂, δ) | 12 | 7.6 | 9.6 | 5.4 | 1 | 3 | 8.4 | 4 | 6 |
| Bayes risk | 9.6 | 7.48 | 8.38 | 4.92 | 2.8 | 3.7 | 7.02 | 4.9 | 5.8 |
| Max risk | 12 | 7.6 | 9.6 | 5.4 | 10 | 6.5 | 8.4 | 8.5 | 6 |

The minimax nonrandomized rule is rule 4, with maximum risk 5.4. For the restricted risk Bayes criterion, which minimizes the Bayes risk subject to the maximum risk being at most (1 + ε) times this minimax value: for ε = 0.1 (maximum risk at most 5.94), decision rule 4 is the restricted risk Bayes procedure; for ε = 0.25 (maximum risk at most 6.75), decision rule 6 is the restricted risk Bayes procedure.
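The sketch below (not part of the original notes) recomputes the table and the restricted risk Bayes choices in Python; the dictionary names are mine, and the prior P(oil) = 0.2 and the constraint "max risk at most (1 + ε) times the minimax value" are the ones implied by the numbers above.

# Enumerate the 9 nonrandomized rules for the oil-drilling example and compute
# each rule's risks, Bayes risk (prior P(oil) = 0.2), and maximum risk; then
# pick the restricted risk Bayes rule for epsilon = 0.1 and 0.25.
from itertools import product

loss = {('oil', 'drill'): 0, ('oil', 'sell'): 10, ('oil', 'partial'): 5,
        ('no_oil', 'drill'): 12, ('no_oil', 'sell'): 1, ('no_oil', 'partial'): 6}
p_x = {('oil', 0): 0.3, ('oil', 1): 0.7, ('no_oil', 0): 0.6, ('no_oil', 1): 0.4}
prior = {'oil': 0.2, 'no_oil': 0.8}

rules = list(product(['drill', 'sell', 'partial'], repeat=2))  # rule = (delta(0), delta(1))

def risk(theta, rule):
    return sum(loss[(theta, rule[x])] * p_x[(theta, x)] for x in (0, 1))

table = []
for i, rule in enumerate(rules, start=1):
    r_oil, r_no_oil = risk('oil', rule), risk('no_oil', rule)
    bayes = prior['oil'] * r_oil + prior['no_oil'] * r_no_oil
    table.append((i, rule, r_oil, r_no_oil, bayes, max(r_oil, r_no_oil)))
    print(i, rule, r_oil, r_no_oil, round(bayes, 2), max(r_oil, r_no_oil))

minimax_value = min(row[5] for row in table)        # 5.4, attained by rule 4
for eps in (0.1, 0.25):
    feasible = [row for row in table if row[5] <= (1 + eps) * minimax_value]
    best = min(feasible, key=lambda row: row[4])    # smallest Bayes risk among feasible rules
    print('epsilon =', eps, '-> rule', best[0])     # rule 4 for 0.1, rule 6 for 0.25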

(2) Gamma minimaxity. Let Γ be a class of prior distributions. A decision procedure δ* is Γ-minimax (over a class D of considered decision procedures) if

sup_{π ∈ Γ} r(π, δ*) = inf_{δ ∈ D} sup_{π ∈ Γ} r(π, δ).

Thus, the procedure δ* minimizes the maximum Bayes risk over the priors in the class Γ.
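As an illustration (the class Γ below is a hypothetical choice, not from the original notes), take Γ to be the priors for the oil example with P(oil) = p, 0.1 ≤ p ≤ 0.3. Because the Bayes risk r(π, δ) is linear in p, its maximum over Γ is attained at an endpoint, so the Γ-minimax rule among the nine nonrandomized rules can be found by checking p = 0.1 and p = 0.3:

# Gamma-minimax over a hypothetical class of priors: P(oil) = p with 0.1 <= p <= 0.3.
# The Bayes risk is linear in p, so its maximum over the class is attained at an endpoint.
from itertools import product

loss = {('oil', 'drill'): 0, ('oil', 'sell'): 10, ('oil', 'partial'): 5,
        ('no_oil', 'drill'): 12, ('no_oil', 'sell'): 1, ('no_oil', 'partial'): 6}
p_x = {('oil', 0): 0.3, ('oil', 1): 0.7, ('no_oil', 0): 0.6, ('no_oil', 1): 0.4}
rules = list(product(['drill', 'sell', 'partial'], repeat=2))

def risk(theta, rule):
    return sum(loss[(theta, rule[x])] * p_x[(theta, x)] for x in (0, 1))

def bayes_risk(p_oil, rule):
    return p_oil * risk('oil', rule) + (1 - p_oil) * risk('no_oil', rule)

worst = {i: max(bayes_risk(0.1, r), bayes_risk(0.3, r)) for i, r in enumerate(rules, start=1)}
print(min(worst, key=worst.get), min(worst.values()))  # rule 5 ('always sell'), worst Bayes risk 3.7

For this particular class, the rule that always sells (rule 5) is Γ-minimax; a different class of priors would generally pick a different rule.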

Computational issues: We will study how to find Bayes and minimax point estimators in more detail in Chapter 3. The restricted risk Bayes procedure is appealing, but it is generally difficult to compute.

V. Randomized decision procedures

A randomized decision procedure δ(X) is a decision procedure which assigns to each possible outcome x of the data a random variable δ(x), where the values of δ(x) are actions in the action space. When X = x is observed, a draw from the distribution of δ(x) is taken, and that draw constitutes the action taken.

We will show in Chapter 3 that for any prior, there is always a nonrandomized decision procedure whose Bayes risk is at least as small as that of any randomized decision procedure (so we can ignore randomized decision procedures when looking for a Bayes rule).

Students of game theory will recognize that a randomized decision procedure may achieve a lower maximum risk than any nonrandomized decision procedure.

Example: For Example 2, one randomized decision procedure δ* is to flip a fair coin and use decision rule 4 if the coin lands heads and decision rule 6 if the coin lands tails, i.e., δ*(x) = δ₄(x) with probability 0.5 and δ*(x) = δ₆(x) with probability 0.5. The risk of this randomized decision procedure is

R(θ, δ*) = 0.5 R(θ, δ₄) + 0.5 R(θ, δ₆), so R(θ₁, δ*) = 0.5(3) + 0.5(6.5) = 4.75 and R(θ₂, δ*) = 0.5(5.4) + 0.5(3) = 4.2,

which gives a lower maximum risk (4.75) than any nonrandomized decision procedure (the smallest maximum risk among the nine nonrandomized rules is 5.4, attained by rule 4).
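A short check of this computation (the risk numbers come from the table above; the variable names are mine):

# Risk of the randomized rule that mixes rule 4 and rule 6 with probability 0.5 each.
risk_rule4 = {'oil': 3.0, 'no_oil': 5.4}
risk_rule6 = {'oil': 6.5, 'no_oil': 3.0}

risk_mixed = {theta: 0.5 * risk_rule4[theta] + 0.5 * risk_rule6[theta]
              for theta in ('oil', 'no_oil')}
print(risk_mixed)                 # {'oil': 4.75, 'no_oil': 4.2}
print(max(risk_mixed.values()))   # 4.75, below 5.4, the best nonrandomized maximum risk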

Randomized decision procedures are somewhat impractical: it makes the statistician's inferences seem less credible if she has to explain to a scientist that she flipped a coin after observing the data to determine the inferences.

We will show in Chapter 1.5 that a randomized decision procedure cannot lower the maximum risk if the loss function is convex.

VI. Prediction (Chapter 1.4)

A common decision problem is that we want to predict a variable Y based on a covariate vector Z.

Examples: (1) Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient; (2) Predict the price of a stock 6 months from now, on the basis of company performance measures and economic data; (3) Predict the numbers in a handwritten ZIP code, from a digitized image.

We typically have a "training" sample (Z₁, Y₁), ..., (Zₙ, Yₙ) available from the joint distribution of (Z, Y), and we want to predict Y for a new observation from the distribution of (Z, Y) for which Z is observed but Y is not.

In Section 1.4, we consider how to make predictions when we know the joint distribution of (Z, Y); in practice, we often have only an estimate of the joint distribution based on the training sample.

Let g(Z) be a rule for predicting Y based on Z. A criterion that is often used for judging different prediction rules is the mean squared prediction error:

MSPE(g) = E[(Y − g(Z))²];

this is the average squared prediction error when g(Z) is used to predict Y for a new observation (Z, Y) drawn from the joint distribution. We want MSPE(g) to be as small as possible.

Theorem 1.4.1: Let μ(z) = E(Y | Z = z) and suppose E(Y²) < ∞. Then μ(Z) is the best prediction rule with respect to mean squared prediction error.

Proof: For any prediction rule g(Z),

E[(Y − g(Z))²] = E[(Y − μ(Z) + μ(Z) − g(Z))²]
 = E[(Y − μ(Z))²] + 2 E[(Y − μ(Z))(μ(Z) − g(Z))] + E[(μ(Z) − g(Z))²].

The cross term is zero: conditioning on Z, E[(Y − μ(Z))(μ(Z) − g(Z)) | Z] = (μ(Z) − g(Z)) E[Y − μ(Z) | Z] = 0. Hence

E[(Y − g(Z))²] = E[(Y − μ(Z))²] + E[(μ(Z) − g(Z))²] ≥ E[(Y − μ(Z))²],

with equality if and only if g(Z) = μ(Z) with probability one.
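A small simulation sketch of Theorem 1.4.1 under an assumed joint distribution (Z ~ N(0, 1) and Y = Z² + ε with ε ~ N(0, 1), so that μ(z) = z²; the competing constant rule g(z) = 1 = E(Y) is an arbitrary choice):

# Compare the mean squared prediction error of mu(Z) = E(Y | Z) with that of a
# competing rule, for the assumed model Z ~ N(0,1), Y = Z**2 + eps, eps ~ N(0,1).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
z = rng.normal(size=n)
y = z**2 + rng.normal(size=n)

mspe_cond_mean = np.mean((y - z**2) ** 2)   # rule mu(Z) = E(Y | Z) = Z**2
mspe_constant = np.mean((y - 1.0) ** 2)     # competing rule g(Z) = 1 = E(Y)
print(mspe_cond_mean, mspe_constant)        # about 1.0 versus about 3.0

The conditional-mean rule attains the noise variance, while the constant rule pays an extra E[(μ(Z) − 1)²] = Var(Z²) = 2, exactly as the decomposition in the proof predicts.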

VII. Sufficiency (Chapter 1.5)

Once we have postulated a statistical model, we would like to separate out any aspects of the data that are irrelevant in the context of our model and that may obscure our understanding of the situation.

A sufficient statistic T(X) is a statistic that carries all the information in the data X about the parameter vector θ. T(X) can be a scalar or a vector. If T(X) is of lower "dimension" than X, then we have a good summary of the data that does not discard any important information.

For example, consider a sequence of independent Bernoulli trials with unknown probability of success θ. We may have the intuitive feeling that the total number of successes contains all the information about θ that there is in the sample, and that the order in which the successes occurred, for example, gives no additional information. The following definition formalizes this idea:

Definition: A statistic T(X) is sufficient for θ if the conditional distribution of X given T(X) = t does not depend on θ for any value of t.

Example 1: Let X₁, ..., Xₙ be a sequence of independent Bernoulli random variables with P(Xᵢ = 1) = θ. We will verify that T(X) = Σᵢ Xᵢ is sufficient for θ.

Consider

P_θ(X₁ = x₁, ..., Xₙ = xₙ | Σᵢ Xᵢ = t).

For Σᵢ xᵢ ≠ t, the conditional probability is 0 and does not depend on θ.

For Σᵢ xᵢ = t,

P_θ(X₁ = x₁, ..., Xₙ = xₙ | Σᵢ Xᵢ = t) = P_θ(X₁ = x₁, ..., Xₙ = xₙ) / P_θ(Σᵢ Xᵢ = t)
 = θ^t (1 − θ)^(n−t) / [C(n, t) θ^t (1 − θ)^(n−t)] = 1 / C(n, t),

where C(n, t) is the binomial coefficient. The conditional distribution thus does not involve θ at all, and thus Σᵢ Xᵢ is sufficient for θ.
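A simulation sketch of this fact (n = 4, t = 2, and the two θ values are arbitrary choices; numpy is assumed): conditional on Σᵢ Xᵢ = t, every arrangement of the t successes should have probability 1/C(n, t), whatever θ is.

# Check that, given sum(X) = t, each arrangement of successes is (approximately)
# equally likely, with probability 1 / C(n, t), regardless of theta.
import numpy as np
from math import comb

rng = np.random.default_rng(2)
n, t = 4, 2

for theta in (0.3, 0.7):
    x = rng.binomial(1, theta, size=(500_000, n))
    x = x[x.sum(axis=1) == t]                          # keep only samples with sum = t
    patterns, counts = np.unique(x, axis=0, return_counts=True)
    print(theta, np.round(counts / counts.sum(), 3))   # 6 frequencies, each about 0.167

print(1 / comb(n, t))                                  # exact value 1/6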

Example 2:

Let X₁, ..., Xₙ be iid Uniform(0, θ). Consider the statistic M = max(X₁, ..., Xₙ).

We showed in Notes 4 that the density of M is

f_M(t | θ) = n t^(n−1) / θⁿ for 0 ≤ t ≤ θ.

For max(x₁, ..., xₙ) = t (with 0 ≤ xᵢ ≤ θ for all i), we have

f(x₁, ..., xₙ | M = t) = f(x₁, ..., xₙ | θ) / f_M(t | θ) = (1/θⁿ) / (n t^(n−1)/θⁿ) = 1 / (n t^(n−1)),

which does not depend on θ.

For max(x₁, ..., xₙ) ≠ t, the conditional density is 0. Thus M is sufficient for θ.
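A quick numeric check of this ratio (the data vector, n = 5, and the two θ values below are arbitrary choices): the ratio of the joint density to the density of the maximum is the same for both θ's.

# For X_1, ..., X_n iid Uniform(0, theta), the ratio of the joint density to the
# density of the maximum does not depend on theta; it equals 1 / (n * t**(n - 1)).
import numpy as np

x = np.array([0.2, 0.5, 0.9, 0.1, 0.7])    # assumed data; n = 5, t = max = 0.9
n, t = len(x), x.max()

for theta in (2.0, 5.0):                   # any theta >= max(x)
    joint = 1 / theta**n                   # joint density of x under Uniform(0, theta)
    f_max = n * t**(n - 1) / theta**n      # density of the maximum evaluated at t
    print(theta, joint / f_max)            # both print 1 / (5 * 0.9**4), about 0.3048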

It is often hard to verify or disprove sufficiency of a statistic directly from the definition, because we need to find the distribution of the statistic and the conditional distribution of the data given it. The following theorem is often helpful.

Factorization Theorem: A statistic T(X) is sufficient for θ if and only if there exist functions g(t, θ) and h(x) such that

p(x | θ) = g(T(x), θ) h(x)

for all x and all θ

(where p(x | θ) denotes the probability mass function of the data for discrete data given the parameter θ and the probability density function for continuous data).

Proof: We prove the theorem for discrete data; the proof for continuous distributions is similar. First, suppose that the probability mass function factors as given in the theorem. We have

P_θ(T(X) = t) = Σ_{x: T(x) = t} g(T(x), θ) h(x) = g(t, θ) Σ_{x: T(x) = t} h(x),

so that, for any x with T(x) = t,

P_θ(X = x | T(X) = t) = P_θ(X = x) / P_θ(T(X) = t)
 = g(t, θ) h(x) / [g(t, θ) Σ_{x': T(x') = t} h(x')] = h(x) / Σ_{x': T(x') = t} h(x'),

which does not depend on θ. Thus, T(X) is sufficient for θ by the definition of sufficiency.

Conversely, suppose T(X) is sufficient for θ. Then the conditional distribution of X given T(X) does not depend on θ. Let g(t, θ) = P_θ(T(X) = t) and h(x) = P(X = x | T(X) = T(x)). Then

P_θ(X = x) = P_θ(T(X) = T(x)) P(X = x | T(X) = T(x)) = g(T(x), θ) h(x).

Thus, we can take g(t, θ) = P_θ(T(X) = t) and h(x) = P(X = x | T(X) = T(x)).

Example 1 Continued: Let X₁, ..., Xₙ be a sequence of independent Bernoulli random variables with P(Xᵢ = 1) = θ. To show that Σᵢ Xᵢ is sufficient for θ, we factor the probability mass function as follows:

p(x₁, ..., xₙ | θ) = Πᵢ θ^(xᵢ) (1 − θ)^(1−xᵢ) = θ^(Σᵢ xᵢ) (1 − θ)^(n − Σᵢ xᵢ).

The pmf is of the form g(Σᵢ xᵢ, θ) h(x) where h(x) = 1 and g(t, θ) = θ^t (1 − θ)^(n−t).
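A numeric sanity check of this factorization (the data vector and θ values are arbitrary choices): the pmf evaluated directly equals g(Σᵢ xᵢ, θ) h(x) with h(x) = 1, for any θ.

# Verify p(x | theta) = g(sum(x), theta) * h(x) with h(x) = 1 for Bernoulli data.
import numpy as np

x = np.array([1, 0, 1, 1, 0])                  # assumed data; n = 5, sum = 3
n, t = len(x), x.sum()

for theta in (0.2, 0.6):
    pmf_direct = np.prod(theta**x * (1 - theta)**(1 - x))
    g = theta**t * (1 - theta)**(n - t)        # g(t, theta); h(x) = 1
    print(theta, pmf_direct, g)                # the two values agree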

Example 2 continued: Let X₁, ..., Xₙ be iid Uniform(0, θ). To show that M = max(X₁, ..., Xₙ) is sufficient, we factor the pdf as follows:

p(x₁, ..., xₙ | θ) = Πᵢ (1/θ) 1[0 ≤ xᵢ ≤ θ] = (1/θⁿ) 1[max(x₁, ..., xₙ) ≤ θ] 1[min(x₁, ..., xₙ) ≥ 0].

The pdf is of the form g(max(x₁, ..., xₙ), θ) h(x) where g(t, θ) = (1/θⁿ) 1[t ≤ θ] and h(x) = 1[min(x₁, ..., xₙ) ≥ 0].

Example 3: Let X₁, ..., Xₙ be iid Normal(μ, σ²). The pdf factors as

p(x₁, ..., xₙ | μ, σ²) = (2πσ²)^(−n/2) exp( −(1/(2σ²)) Σᵢ (xᵢ − μ)² )
 = (2πσ²)^(−n/2) exp( −(1/(2σ²)) (Σᵢ xᵢ² − 2μ Σᵢ xᵢ + nμ²) ).

The pdf is thus of the form g((Σᵢ xᵢ, Σᵢ xᵢ²), (μ, σ²)) h(x) where h(x) = 1.

Thus, (Σᵢ Xᵢ, Σᵢ Xᵢ²) is a two-dimensional sufficient statistic for (μ, σ²), i.e., the conditional distribution of (X₁, ..., Xₙ) given (Σᵢ Xᵢ, Σᵢ Xᵢ²) does not depend on (μ, σ²).
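A numeric illustration (the two data vectors below are hypothetical, constructed to share the same Σᵢ xᵢ and Σᵢ xᵢ²; scipy is assumed): since the likelihood depends on the data only through these two sums, the two datasets have identical likelihoods at every (μ, σ²).

# Two different datasets with the same (sum, sum of squares) have the same
# N(mu, sigma^2) likelihood at every (mu, sigma).
import numpy as np
from scipy.stats import norm

x1 = np.array([0.0, 2.0, 2.0])                             # sum = 4, sum of squares = 8
x2 = np.array([1.0, (3 + 5**0.5) / 2, (3 - 5**0.5) / 2])   # also sum = 4, sum of squares = 8

for mu, sigma in [(0.0, 1.0), (1.3, 2.5)]:
    like1 = np.prod(norm.pdf(x1, loc=mu, scale=sigma))
    like2 = np.prod(norm.pdf(x2, loc=mu, scale=sigma))
    print(mu, sigma, like1, like2)                         # like1 equals like2 (up to rounding)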
