Statistics 550 Notes 1

Reading: Section 1.1.

I. Basic definitions and examples of models (Section 1.1.1)

Goal of statistics: Draw useful information from data.

Model based approach to statistics: Treat data as the outcome of a random experiment that we model mathematically.

Random Experiment: Any procedure that (1) can be repeated, theoretically, over and over; and (2) has a well-defined set of possible outcomes (the sample space).

The outcome of the experiment is the data $X$.

Examples of experiments and data:

▪ Experiment: Randomly select 1000 people without replacement from the U.S. adult population and ask them whether they are employed. Data: $X = (X_1, \ldots, X_{1000})$, where $X_i = 1$ if person $i$ in the sample is employed and $X_i = 0$ if person $i$ in the sample is not employed.

▪ Experiment: Randomly sample 500 handwritten ZIP codes on envelopes from U.S. postal mail. Data: $X = (X_1, \ldots, X_{500})$, where $X_i$ is a 216 × 72 matrix whose elements are numbers from 0 to 255 that represent the intensity of writing in each part of the image.


The probability distribution for the data $X$ over repeated experiments is denoted $P$.

Frequentist concept of probability:

$P(X \in E) =$ proportion of times in repeated experiments that the data $X$ falls in the set $E$.
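
As a small illustration of this frequentist interpretation, the Python sketch below repeats a simple experiment many times and compares the proportion of repetitions in which the data falls in a set $E$ with the exact probability. The particular experiment (1000 Bernoulli(0.5) observations) and the set $E$ (more than 530 successes) are arbitrary choices made only for the illustration.

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

n_obs = 1000       # size of each experiment (e.g., 1000 survey responses)
theta = 0.5        # hypothetical probability that X_i = 1 (illustrative value)
n_repeats = 10000  # number of repetitions of the whole experiment

# Event E: "more than 530 of the 1000 responses equal 1"
def in_E(x):
    return x.sum() > 530

# Repeat the experiment many times and record whether the data falls in E
hits = sum(in_E(rng.binomial(1, theta, size=n_obs)) for _ in range(n_repeats))
empirical = hits / n_repeats

# Exact probability P(X in E) under the Bernoulli model
exact = binom.sf(530, n_obs, theta)  # P(number of successes > 530)

print(f"empirical proportion: {empirical:.4f}, exact probability: {exact:.4f}")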

(Statistical) Model: A family of possible $P$'s:

$\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$.

The $\theta$'s label the $P_\theta$'s, and $\Theta$ is a space of labels called the parameter space.

Goal of statistical inference: On the basis of the data $X$, make inferences about the true $\theta$ that generated the data.

We will study three types of inferences:

1) Point estimation – best estimate of $\theta$.

2) Hypothesis testing – decide whether $\theta$ is in a specified subset of $\Theta$.

3) Interval (set) estimation – estimate a set in which $\theta$ lies.

Goal of this course: Study how to make “good” inferences.

Example of a statistical model:

Example 1: Shaquille O’Neal’s free throw shooting.


O'Neal's free throw shooting is regarded as one of his major weaknesses.

The following are the number of free throw attempts and number of free throws made by Shaquille O’Neal during each game of the 2000 NBA playoffs:

| Game | Number made | Number of attempts | Game | Number made | Number of attempts |
| 1    | 4           | 5                  | 13   | 6           | 12                 |
| 2    | 5           | 11                 | 14   | 9           | 9                  |
| 3    | 5           | 14                 | 15   | 7           | 12                 |
| 4    | 5           | 12                 | 16   | 3           | 10                 |
| 5    | 2           | 7                  | 17   | 8           | 12                 |
| 6    | 7           | 10                 | 18   | 1           | 6                  |
| 7    | 6           | 14                 | 19   | 18          | 39                 |
| 8    | 9           | 15                 | 20   | 3           | 13                 |
| 9    | 4           | 12                 | 21   | 10          | 17                 |
| 10   | 1           | 4                  | 22   | 1           | 6                  |
| 11   | 13          | 27                 | 23   | 3           | 12                 |
| 12   | 5           | 17                 |      |             |                    |

Experiment: In a sequence of 23 games, Shaq shoots 5 free throws in the first game, 11 free throws in the second game, ..., 12 free throws in the 23rd game.

Data: $X = (X_{ij} : i = 1, \ldots, 23;\ j = 1, \ldots, n_i)$, where $n_i$ is the number of free throws attempted in game $i$.

$X_{ij} = 1$ or $0$ according to whether Shaq makes his $j$th free throw in his $i$th game.

Potential model (Model 1): The $X_{ij}$ are independent and identically distributed (iid) Bernoulli random variables with probability of success $\theta$:

$P_\theta(X_{ij} = 1) = \theta$, $P_\theta(X_{ij} = 0) = 1 - \theta$, with parameter space $\Theta = [0, 1]$.

Commentators remarked that Shaq’s shooting varied dramatically from game to game.

Another model (Model 2):

The $X_{ij}$ are independent.

For each game $i$, $X_{i1}, \ldots, X_{i n_i}$ are Bernoulli($\theta_i$), $i = 1, \ldots, 23$:

$P_\theta(X_{ij} = 1) = \theta_i$, $P_\theta(X_{ij} = 0) = 1 - \theta_i$, with parameter $\theta = (\theta_1, \ldots, \theta_{23})$ and parameter space $\Theta = [0, 1]^{23}$.
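
As a concrete illustration, the following Python sketch takes the data from the table above and computes the overall proportion of free throws made (the single success probability that Model 1 posits) and the game-by-game proportions, which Model 2 allows to differ across games.

import numpy as np

# Free throw data from the table above (games 1-23 of the 2000 playoffs)
made = np.array([4, 5, 5, 5, 2, 7, 6, 9, 4, 1, 13, 5,
                 6, 9, 7, 3, 8, 1, 18, 3, 10, 1, 3])
attempts = np.array([5, 11, 14, 12, 7, 10, 14, 15, 12, 4, 27, 17,
                     12, 9, 12, 10, 12, 6, 39, 13, 17, 6, 12])

# Model 1 (common theta): pooled proportion of made free throws
pooled = made.sum() / attempts.sum()
print(f"pooled proportion of made free throws: {pooled:.3f}")

# Model 2 (game-specific theta_i): proportion made in each game
per_game = made / attempts
for i, p in enumerate(per_game, start=1):
    print(f"game {i:2d}: {p:.2f}")

print(f"range across games: {per_game.min():.2f} to {per_game.max():.2f}")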

Choosing models:

Consultation with subject matter experts and knowledge about how the data are collected are important for selecting a reasonable model.

George Box (1979): “Models, of course, are never true but fortunately it is only necessary that they be useful.”

We will focus mostly on making inferences about the true $\theta$ conditional on the model’s validity, i.e., assuming that the true $P$ belongs to $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, but another important step in data analysis is to investigate the model’s validity through diagnostics (techniques for doing this will be discussed in Chapter 4).

II. Parameterization and Parameters (Section 1.1.2)

Model: $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. The vector $\theta$ is a way of labeling the distributions in the model.

Parameterization: Formally, an onto map $\theta \mapsto P_\theta$ from a parameter space $\Theta$ to the model $\mathcal{P}$ is called a parameterization of $\mathcal{P}$. The parameterization is a way of labeling the distributions in the model.

The parameterization is not unique. For example, in Example 1, Model 1, instead of labeling the distributions by the probability of success $\theta \in [0, 1]$, we can label them by a one-to-one transformation of $\theta$, for instance the odds $\theta / (1 - \theta)$.

We try to choose a parameterization in which the components of the parameterization are interpretable in terms of the phenomenon we are trying to measure.

Example 3: Sal is a pizza inspector for the city health department. Recently, he has received a number of complaints directed against a certain pizzeria for allegedly failing to comply with their advertisements. The pizzeria claims that on the average, each of their large pepperoni pizzas is topped with 2 ounces of pepperoni. The dissatisfied customers feel that the actual average amount of pepperoni used is considerably less than that. To settle the matter, Sal takes a random sample of 10 pizzas. The data is $X = (X_1, \ldots, X_{10})$, the amount of pepperoni (in ounces) on each of the ten pizzas.

Sal assumes the model is

$X_1, \ldots, X_{10}$ iid $N(\mu, \sigma^2)$ (where $\mu$ and $\sigma^2$ are the mean and variance of the normal distribution, respectively).

Three possible parameterizations are $(\mu, \sigma^2)$, $(\mu, \sigma)$ and, say, $(\mu + \sigma, \sigma)$. The first two parameterizations are more interpretable because they contain one parameter, $\mu$, that corresponds exactly to what we are interested in.

Parametric vs. Nonparametric models: Models in which $\Theta$ is a nice subset of a finite-dimensional Euclidean space are called “parametric” models; e.g., the model in Example 3 is parametric. Models in which $\Theta$ is infinite dimensional are called “nonparametric.” For example, if in Example 3 we considered $X_1, \ldots, X_{10}$ iid from any distribution with a density, the model would be nonparametric.

Identifiability: The parameterization is identifiable if the map $\theta \mapsto P_\theta$ is one-to-one, i.e., if $\theta_1 \neq \theta_2$ implies $P_{\theta_1} \neq P_{\theta_2}$.

The parameterization is unidentifiable if there exist $\theta_1 \neq \theta_2$ such that $P_{\theta_1} = P_{\theta_2}$.

When the parameterization is unidentifiable, parts of $\theta$ remain unknowable even with “infinite amounts of data,” i.e., even if we knew the true $P_\theta$.

Example 4: Suppose $X_1, \ldots, X_n$ are iid Exponential with mean $\mu > 0$, i.e., with density

$p_\mu(x) = \frac{1}{\mu} e^{-x/\mu}, \quad x > 0$.

The parameterization by $\mu$ is identifiable. A parameterization that labels the mean redundantly, say by $(\theta_1, \theta_2)$ with $\mu = \theta_1 + \theta_2$, is unidentifiable because, for example, $(\theta_1, \theta_2) = (1, 2)$ and $(2, 1)$ give the same distribution.
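
A quick numerical check of this unidentifiability, using the redundant additive labeling above (that labeling is only an illustration): the labels $(1, 2)$ and $(2, 1)$ determine exactly the same exponential distribution, so no amount of data can distinguish them.

import numpy as np

def exp_cdf(x, theta1, theta2):
    """CDF of the Exponential distribution labeled by (theta1, theta2),
    where the mean is mu = theta1 + theta2."""
    mu = theta1 + theta2
    return 1.0 - np.exp(-x / mu)

x_grid = np.linspace(0.0, 20.0, 201)

# Two different labels that correspond to the same distribution (mu = 3)
cdf_a = exp_cdf(x_grid, 1.0, 2.0)
cdf_b = exp_cdf(x_grid, 2.0, 1.0)

# The CDFs agree everywhere, so the labels cannot be told apart from data
print("max |F_a - F_b| =", np.abs(cdf_a - cdf_b).max())  # 0.0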

Parameter: A parameter is a feature $\nu(P)$ of $P$, i.e., a map $\nu$ from $\mathcal{P}$ to another space $\mathcal{N}$.

e.g., for Example 3, $X_1, \ldots, X_{10}$ iid $N(\mu, \sigma^2)$:

$\mu$, the mean of each $X_i$, is a parameter.

$\sigma^2$, the variance of each $X_i$, is a parameter.

Any function of $(\mu, \sigma^2)$, for example $\mu / \sigma$, is also a parameter.

Some parameters are of interest and others are nuisance parameters that are not of central interest.

In Example 3, for the parameterization $(\mu, \sigma^2)$, the parameter $\mu$ is the parameter of interest and the parameter $\sigma^2$ is a nuisance parameter. The pizzeria’s claim concerns the average amount of pepperoni.

A parameter is by definition identified, meaning that if we knew the true $P$, we would know the parameter.

Proposition: For a given parameterization $\theta \mapsto P_\theta$, $\theta$ is a parameter if and only if the parameterization is identifiable.

Proof: If the parameterization is identifiable, then the map $\nu(P_\theta) = \theta$ is well defined; it is the inverse of the parameterization, which maps $P_\theta$ to $\theta$. If the parameterization is not identifiable, then for some $\theta_1 \neq \theta_2$ we have $P_{\theta_1} = P_{\theta_2}$, and consequently we can’t write $\theta = \nu(P_\theta)$ for any function $\nu$.

Remark: Even if the parameterization is unidentifiable, components of the parameterization may be identified (i.e., parameters).

Why would we ever want to consider an unidentifiable parameterization?

Components of the parameterization may capture the scientific features of interest. We may then be interested in whether certain components of the parameterization are identifiable.

Example 5: Survey nonresponse. A major problem in survey sampling is nonresponse.

Example: On Sunday, Sept. 11, 1988, the San Francisco Examiner ran a story headlined:

3 IN 10 BIOLOGY TEACHERS BACK BIBLICAL CREATIONISM

Arlington, Texas. Thirty percent of high school biology teachers polled believe in the biblical creation and 19 percent incorrectly think that humans and dinosaurs lived at the same time, according to a nationwide survey published Saturday…

The poll was conducted by choosing 400 teachers at random from the National Science Teachers Association’s list of 20,000 teachers and sending these 400 teachers questionnaires. 200 of these 400 teachers returned the questionnaires and 60 of the 200 believe in biblical creationism.

Let $Y_i = 1$ or $0$ according to whether the $i$th teacher on the list believes in biblical creationism, $i = 1, \ldots, 20{,}000$.

Let $R_i = 1$ or $0$ according to whether the $i$th teacher would respond to the questionnaire if sent it, $i = 1, \ldots, 20{,}000$.

We would like to know the proportion of teachers who believe in biblical creationism, $p_Y = \frac{1}{20{,}000} \sum_{i=1}^{20{,}000} Y_i$.

The data $X$ from the experiment of randomly sampling 400 teachers consists of (i) the number of sampled teachers who respond, call this $N$, and (ii) the number of sampled teachers who respond and believe in biblical creationism, call this $M$.

The distribution of the data over repeated random samples is as follows. Let $K_{ry}$ denote the number of teachers in the population of 20,000 with $R_i = r$ and $Y_i = y$ ($r, y \in \{0, 1\}$). Then $N$ has a hypergeometric distribution,

$P(N = n) = \dfrac{\binom{K_{11} + K_{10}}{n} \binom{20{,}000 - K_{11} - K_{10}}{400 - n}}{\binom{20{,}000}{400}}$,

and the conditional distribution of $M$ given $N = n$ is also hypergeometric,

$P(M = m \mid N = n) = \dfrac{\binom{K_{11}}{m} \binom{K_{10}}{n - m}}{\binom{K_{11} + K_{10}}{n}}$.

A parameterization for the model is

$\theta = (K_{11}, K_{10}, K_{01}, K_{00})$.

[Note that $K_{11} + K_{10} + K_{01} + K_{00} = 20{,}000$.]

This parameterization includes our quantity of interest,

$p_Y = \dfrac{K_{11} + K_{01}}{20{,}000}$.

But this parameterization is not identifiable:

$\theta_1 = (K_{11}, K_{10}, K_{01}, K_{00}) = (3{,}000,\ 7{,}000,\ 0,\ 10{,}000)$ and

$\theta_2 = (3{,}000,\ 7{,}000,\ 10{,}000,\ 0)$, for example,

have the same distribution for the data $(N, M)$, because the distribution of $(N, M)$ depends on $\theta$ only through $K_{11}$ and $K_{10}$.

The quantity the article reported an estimate of, the proportion of believers among teachers who would respond, $K_{11} / (K_{11} + K_{10})$, is identified (i.e., a parameter), but our quantity of interest,

$p_Y = (K_{11} + K_{01}) / 20{,}000$, is not identified (i.e., not a parameter).
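
A small Python sketch evaluating the hypergeometric probabilities above under $\theta_1$ and $\theta_2$ (the illustrative values chosen above) confirms that the two parameter values assign the same probability to the observed data, and indeed to every possible data set, even though they imply very different values of $p_Y$.

from scipy.stats import hypergeom

POP = 20_000    # population size (NSTA list)
SAMPLE = 400    # number of teachers sampled

def data_prob(theta, n_resp, m_believe):
    """P(N = n_resp, M = m_believe) under theta = (K11, K10, K01, K00),
    where K_ry = number of teachers with R_i = r, Y_i = y."""
    K11, K10, K01, K00 = theta
    # N ~ hypergeometric: responders among the 400 sampled from POP,
    # of whom K11 + K10 would respond
    p_n = hypergeom(POP, K11 + K10, SAMPLE).pmf(n_resp)
    # M | N = n ~ hypergeometric: believers among the n sampled responders
    p_m_given_n = hypergeom(K11 + K10, K11, n_resp).pmf(m_believe)
    return p_n * p_m_given_n

theta1 = (3_000, 7_000, 0, 10_000)   # implies p_Y = 0.15
theta2 = (3_000, 7_000, 10_000, 0)   # implies p_Y = 0.65

# Same probability for the observed data (200 responses, 60 believers),
# since the distribution depends only on K11 and K10
print(data_prob(theta1, 200, 60))
print(data_prob(theta2, 200, 60))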

III. Statistics

A statistic $T = T(X)$ is a random variable or random vector that is a function of the data.

Example 3 continued: $X_1, \ldots, X_{10}$ iid $N(\mu, \sigma^2)$. Two statistics are the sample mean $\bar{X} = \frac{1}{10} \sum_{i=1}^{10} X_i$ and the sample variance $s^2 = \frac{1}{9} \sum_{i=1}^{10} (X_i - \bar{X})^2$.
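
A brief sketch computing these two statistics on a hypothetical sample (the measurements below are simulated purely for illustration; the values of $\mu$ and $\sigma$ used to generate them are arbitrary):

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measurements (ounces of pepperoni on 10 pizzas),
# simulated from a N(mu = 1.8, sigma = 0.3) distribution
x = rng.normal(loc=1.8, scale=0.3, size=10)

xbar = x.mean()       # sample mean
s2 = x.var(ddof=1)    # sample variance (divisor n - 1 = 9)

print(f"sample mean = {xbar:.3f} ounces, sample variance = {s2:.4f}")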

Section 1.1.4 (Examples: Regression Models) provides an example of one of the most important models in statistics.

Google Online Preview   Download