Module III Lecture 2



One Sample Situations

Testing Hypotheses About the Mean

In the last lecture, we studied intensively the situation where we tested hypotheses of the form:

H0: μ = μ0  vs.  HA: μ ≠ μ0

in the circumstance where we took a random sample of size n and computed the sample mean x̄. However, we assumed that we knew σ! This assumption is usually only tenable in the quality-control case, where we have a great deal of information on the underlying process. But suppose we don't know σ, which is the usual case.

For example, suppose you worked for Bull's Eye, a corporation which owns mid-scale department stores. Suppose you took a random sample of 25 customers and determined that the average purchase they made was $35.00 worth of goods, with a standard deviation of $30.00, using the formulae from Module I:

x̄ = (Σ xi)/n = $35.00,  s = √( Σ(xi − x̄)² / (n − 1) ) = $30.00.

Since you don't know σ, you would be tempted to substitute the sample standard deviation s in the formulae of the preceding lecture. Would this make a difference?

The answer is maybe yes and maybe no. Modifying Method 2 from the preceding lecture, we could compute the statistic:

tobs = (x̄ − μ0) / (s/√n),

which differs from zobs only by substituting s for σ.

In 1908, William Sealy Gosset (publishing under the pseudonym "Student") showed that tobs follows a distribution called the "t" distribution, which is symmetric and bell-shaped and indexed by a term called the degrees of freedom.

Degrees of freedom is a mathematical term which has to do with the dimension of certain spaces which enter into theoretical derivations. It will vary from problem to problem.

How does the t distribution with df degrees of freedom compare with the standard normal distribution?

The answer depends on how large df is. For example, below is a graph showing the standard normal distribution (mean = 0 and standard deviation = 1) and the t distribution with 1 degree of freedom. (This example can be found in the file tdist.xls.)

Notice that the t distribution is flatter in the middle but has more probability in the tails.

Now let us increase the degrees of freedom to 5. The resulting comparison is shown below:

Notice that although the t distribution is still lower in the middle and has heavier tails, the two distributions are much closer.

Now let us look at the situation with 20 degrees of freedom.

The two distributions are even closer although close inspection will still show that the t distribution is a little lower in the middle and has slightly heavier tails.

Finally, let us look at the comparison when the degrees of freedom are 30.

As can be seen the two distributions are almost indistinguishable.
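These tail comparisons are easy to check numerically. The sketch below uses Python's scipy library (an assumption; the lecture itself works in EXCEL) to print the upper-tail probability P(T > 2) as the degrees of freedom grow:

```python
from scipy import stats

# Upper-tail probability P(T > 2) for the t distribution as df grows,
# compared with the standard normal tail it converges to.
for df in (1, 5, 20, 30):
    print(df, stats.t.sf(2.0, df))

print("normal", stats.norm.sf(2.0))
```

The printed tail probabilities shrink monotonically toward the normal value, matching the pattern in the graphs above.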

We will therefore use the following rule: use the t distribution when estimating the standard deviation from the sample if the degrees of freedom are fewer than 30; otherwise use the normal values as before.

A 100(1 − α)% confidence interval based on a sample of size n would then be given by:

x̄ ± tα/2 · s/√n.

The appropriate t value can be found from the EXCEL function “tinv”. The form of “tinv” is:

=tinv(two-sided α, df).

Since in this case df = n − 1, one gets:

tα/2 = tinv(α, n − 1).

In our sample case, if we want a 95% confidence interval so that α = .05, we have for df = 25 − 1 = 24 that:

tα/2 = tinv(.05, 24) = 2.0639.

Therefore our 95% confidence interval is given by:

$35.00 ± 2.0639 × $30.00/√25 = $35.00 ± $12.38,

or,

$22.62 to $47.38.

Any hypothesized value between $22.62 and $47.38 would be accepted at the 5% level of significance.
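The same interval can be computed outside EXCEL. Here is a minimal Python sketch (assuming scipy is available); scipy's t.ppf(1 − α/2, df) plays the role of EXCEL's =tinv(α, df):

```python
from scipy import stats

def t_confidence_interval(xbar, s, n, alpha=0.05):
    # Excel's =tinv(alpha, df) equals scipy's t.ppf(1 - alpha/2, df)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half_width = t_crit * s / n ** 0.5
    return xbar - half_width, xbar + half_width

low, high = t_confidence_interval(35.00, 30.00, 25)
print(round(low, 2), round(high, 2))  # 22.62 47.38
```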

In the EXCEL file "onesam.xls" I have included a section that will automatically compute the confidence interval if you enter the sample mean, the sample standard deviation, the sample size and the alpha level. For our example the result is shown below:

[pic]

If you want a 99% confidence interval, you need only change alpha to .01 to obtain:

[pic]

As you can see, the 95% confidence interval is very wide (±$12.38). Can it be made smaller? From the formula for the confidence interval, the only things under the control of the manager are the significance level and the sample size. Suppose we wanted to know the average purchase to within $5.00; how big a sample should we take?

Let W be the desired half-width (W = 5.00 in our case). Then, mathematically, we would want a confidence interval of the form:

x̄ ± W.

But our confidence interval is:

x̄ ± tα/2 · s/√n.

We can ensure that we achieve the width W if:

tα/2 · s/√n ≤ W.

Rearranging and solving for n, one obtains:

n = (tα/2 · s / W)².

Since this sample size is usually over 30, one usually uses zα/2 instead of tα/2, and σ instead of s, so that the formula is usually given as:

n = (zα/2 · σ / W)².

A practical problem in using this formula is that it requires knowledge of σ. However, it is easy to get around this problem by taking a pre-sample and using the standard deviation of this pre-sample to estimate σ.

In our case we want to know the average purchase to within $5.00. We have already taken a sample of size 25 and obtained s = $30.00. For a 95% confidence interval we take zα/2 = 1.96. Plugging into the formula, we obtain:

n = (1.96 × 30.00 / 5.00)² = 138.3, which we round to 138.

Now, since we already have 25 observations, we would take a further 138 − 25 = 113 random observations. We would then have a total of 138, and we would re-compute the sample mean and sample standard deviation to obtain the confidence interval.
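As a quick check of the arithmetic, the sample-size formula can be evaluated directly; this short Python sketch uses the lecture's z = 1.96:

```python
# Sample size needed so that a 95% CI has half-width W = $5.00,
# using the pre-sample estimate s = $30.00 in place of sigma.
z = 1.96       # z_{alpha/2} for alpha = .05, as in the lecture
s_pre = 30.00  # standard deviation from the pre-sample of 25 customers
W = 5.00       # desired half-width of the interval

n_raw = (z * s_pre / W) ** 2
print(n_raw)  # 138.2976, which the lecture rounds to n = 138
```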

In our case, suppose that the mean of all 138 observations is $35.25 and the standard deviation is $29.85. Plugging into the formula given earlier yields:

$35.25 ± 1.977 × $29.85/√138 = $35.25 ± $5.02.

The confidence interval is $35.25 ± $5.02, which is very close to our desired accuracy.

Technical Note: (Not Required)

The formulae above all assume that the ratio of sample size to population size, n'/N, is small (say, less than 5%). If that is not the case, then the correct formula for a 100(1 − α)% confidence interval is:

x̄ ± tα/2 · (s/√n') · √( (N − n')/(N − 1) ).

If N, the population size, is small, it is possible for the sample-size formula above (for a confidence interval of half-width W) to yield a value of n greater than N; that is, a sample size bigger than the population. If that should happen, the formula below gives the corrected sample size n', which should be used in the confidence-interval formula given immediately above.

Let n be given by the formula:

n = (zα/2 · σ / W)²

as before; then n' is given by the formula:

n' = n / (1 + n/N).

Although I feel that confidence intervals are by far the most practical method of inference (when they can be computed), it is possible to apply Methods 2 and 3 if one wishes to test a specific hypothesis.

Let us return to the case that we started with: a random sample of n = 25 customers with x̄ = $35.00 and s = $30.00.

Suppose we wish to test the hypothesis:

H0: μ = 45  vs.  HA: μ ≠ 45.

(We know this hypothesis will be accepted, since 45 is inside the confidence interval.)

Using Method 2, we would compute:

tobs = (35 − 45) / (30/√25) = −10/6 = −1.6667.

Since this value is within the range ±2.0639, we would accept the hypothesis.

To use Method 3 and determine the p-value, one need only use the formula:

two-sided p-value = tdist(abs(tobs), n − 1, 2).

The absolute value sign is necessary due to the programming assumptions made in EXCEL. The second entry is the degrees of freedom. The final entry is 2 for a two-sided p-value and 1 for a one-sided p-value.

In our case we get

two-sided p-value = tdist(abs(−1.6667), 24, 2) = .10858.

Since .10858 > .05, we again would accept the null hypothesis that μ = 45.
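Methods 2 and 3 together amount to a one-sample t test, which can be sketched in Python (assuming scipy) as follows; scipy's t.sf plays the role of EXCEL's tdist:

```python
from scipy import stats

xbar, s, n, mu0 = 35.00, 30.00, 25, 45.00

# Method 2: the test statistic
t_obs = (xbar - mu0) / (s / n ** 0.5)

# Method 3: the two-sided p-value, i.e. Excel's =tdist(abs(tobs), n-1, 2)
p_two_sided = 2 * stats.t.sf(abs(t_obs), n - 1)

print(round(t_obs, 4), round(p_two_sided, 4))  # t ~ -1.6667, p ~ .1086
```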

Testing Hypotheses About Proportions

Another common one sample problem deals with proportions. Suppose we are concerned about the gender distribution of our middle managers. Assuming that this position now requires an MBA at the entry level, and also assuming that the proportion of men and women who are obtaining an MBA is approximately 50:50, does our work force reflect this gender distribution?

One might formulate this as a hypothesis in the following way. Let p represent the probability of a middle manager being female. If we take a random sample of n of our middle managers and determine the sample proportion of females, say p̂, does this value provide evidence that our proportion of female employees differs from .5?

Let x be the number of females in the sample of size n and define:

p̂ = x/n.

Then, as we have shown before, the sampling distribution of p̂ is approximately normal (this requires np > 5 and n(1 − p) > 5) with:

mean = p and standard deviation = √( p(1 − p)/n ).

Based on this information, we can formally test the hypothesis:

H0: p = p0  vs.  HA: p ≠ p0,

where p0 = .5 in this specific case.

All four of our previous methods can be applied to this problem. For purposes of illustration, we will use the example where n = 25, x = 10, p0 = .5, and α = .05.

Method 1, the quality-control method, would yield the following 100(1 − α)% quality-control limits:

p0 ± zα/2 √( p0(1 − p0)/n ),

by an argument similar to what we developed in the case of the sample mean. This leads to the rule:

Accept H0 if p̂ is in the range p0 ± zα/2 √( p0(1 − p0)/n );

Reject otherwise.

In our case the limits become:

.5 ± 1.96 √( (.5)(.5)/25 ) = .5 ± .196,

or .304 to .696.

Since p̂ = 10/25 = .4 falls inside the interval, we accept the null hypothesis that p = .5.

Method 2 also directly applies to this situation. We would first compute zobs using the formula:

zobs = (p̂ − p0) / √( p0(1 − p0)/n ).

We would accept the null hypothesis if:

−zα/2 ≤ zobs ≤ zα/2;

otherwise we would reject the null hypothesis.

In our specific example, we have:

zobs = (.4 − .5) / √( (.5)(.5)/25 ) = −.1/.1 = −1.0.

Since this falls within the limits of ±1.96, we accept the hypothesis.

Method 3, the p-value method, can also be applied. As in the case of the sample mean, we compute the two-sided p-value as:

two-sided p-value = 2*(1 − normsdist(abs(zobs))).

In this case we get:

two-sided p-value = 2*(1 − normsdist(abs(−1))) = .317311.

Since this value is greater than .05, we accept the null hypothesis.

Finally, we come to my preferred method, the confidence interval. It turns out, theoretically, that the formula for the exact confidence interval is somewhat complex; however, a very good approximation to the exact result is given by the formula:

p̂ ± zα/2 √( p̂(1 − p̂)/n ).

This is equivalent to interchanging the roles of p̂ and p in the quality-control formula.

In our case the confidence interval becomes:

.4 ± 1.96 √( (.4)(.6)/25 ) = .4 ± .192,

which gives the interval:

.208 to .592.

Since .5 is in the confidence interval, we would accept the null hypothesis.
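All three calculations for the proportion example can be reproduced with nothing but the standard library; the normsdist helper below is a stand-in for EXCEL's function of the same name:

```python
import math

def normsdist(z):
    # Standard normal CDF, equivalent to Excel's =normsdist(z)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

x, n, p0, z_crit = 10, 25, 0.5, 1.96
p_hat = x / n                                        # 0.4

# Method 2: z statistic computed under H0
z_obs = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)  # -1.0

# Method 3: two-sided p-value
p_value = 2 * (1 - normsdist(abs(z_obs)))            # ~ .3173

# Approximate confidence interval: roles of p-hat and p0 interchanged
half = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - half, 3), round(p_hat + half, 3))  # 0.208 0.592
```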

In the EXCEL file "onesam.xls", I have also included a template for the computation of the approximate confidence interval for the population proportion.

By entering x, n, and α, the confidence interval is computed. In our case the result looks like:

[pic]

A 99% confidence interval could be obtained by changing .05 to .01 with the result:

[pic]

You may have noticed that these confidence intervals are quite wide. Just as in the case of the mean, there is little the manager can do to make the confidence intervals narrower other than increase the sample size n. With a large enough sample, we can determine the proportion to any desired precision.

Suppose we want to determine p to within ±W. That is, we want a confidence interval of the form:

p̂ ± W.

The confidence interval is given by:

p̂ ± zα/2 √( p̂(1 − p̂)/n );

therefore we will achieve the goal if:

zα/2 √( p(1 − p)/n ) ≤ W.

This leads to the equation:

n = zα/2² · p(1 − p) / W².

As in the case of the mean, this result is problematic since it requires knowledge of p to determine how large a sample we will need to determine p. This is a classic case of circular reasoning.

However, we can take a pre-sample, as we did in the case of the mean. Let us assume that we wish to determine p to within ±.025 (i.e., W = .025). Let us suppose that we wish to construct a 95% confidence interval, so that zα/2 = 1.96. Now, we already have a random sample of size 25 with an estimate of p of .4. Therefore, we would estimate the total sample size necessary as:

n = (1.96)²(.4)(.6)/(.025)² = 1,475.2, rounded to 1,475.

Since we already have 25, we would need to sample 1,450 more persons. Once we have all 1,475, let us suppose that 578 are female, so that:

p̂ = 578/1475 = .392.

Then our 95% confidence interval would be:

.392 ± 1.96 √( (.392)(.608)/1475 ) = .392 ± .025.

This gives a confidence interval of .367 to .417, which is very close to our desired accuracy.

Actually in the case of estimating the sample size for a proportion, we are in a slightly better position than in estimating the mean since we can get a worst-case estimate.

Notice that the numerator of the formula for n has the term p(1 − p).

Since p is always between 0 and 1, one can plot the function p(1 − p), as shown below:

Notice that this reaches a maximum value of .25 = ¼ when p = ½.

This means that the following inequality always holds:

p(1 − p) ≤ ¼.

Therefore, if we choose n so that:

n = zα/2² · (¼) / W²,

the value of n may be larger than necessary for any value of p, but it cannot be smaller!

In our particular case, the equation becomes:

n = (1.96)²(.25)/(.025)² = 1,536.6, rounded to 1,537.

This worst-case estimate does not require knowledge of p. In our case it would require taking 1,537 − 1,475 = 62 more sample values than using the pre-sample method. If the cost of an individual sample is not large, the worst-case analysis is often used.
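Both sample-size estimates come from the same formula, evaluated at different values of p; a small Python sketch makes the comparison explicit:

```python
def n_for_proportion(p, z=1.96, W=0.025):
    # n = z^2 * p(1-p) / W^2: sample size for a CI of half-width W on p
    return z * z * p * (1 - p) / (W * W)

n_presample = n_for_proportion(0.4)  # uses the pre-sample estimate p-hat = .4
n_worstcase = n_for_proportion(0.5)  # p(1-p) is largest at p = 1/2

print(round(n_presample), round(n_worstcase))  # 1475 1537
```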

Technical Point (Not Required):

As in the case of the sample mean, it could happen that the sample size n chosen from the above formulas is greater than the population size N. In that case, compute n' using the formula:

n' = n / (1 + n/N),

and use the following formula for the approximate confidence interval on p:

p̂ ± zα/2 √( p̂(1 − p̂)/n' ) · √( (N − n')/(N − 1) ).

Testing Hypotheses in Regression

Consider the following data:

[pic]

These data represent a random sample of 20 high school boys, where height is measured in inches above 5 feet tall ("3" = 5 foot 3 inches) and weight is given in pounds.

The plot of the raw data is shown below:

[pic]

Clearly a linear relationship seems to exist.

Running our regression program, as we did in Module I, yields the following results:

[pic]

This indicates that there is a correlation of r = .8519 between x and y. As you know, we square this value to interpret it, giving r² = .7257. Therefore, approximately 72.57% of the variability in weight can be "explained" by using height as a predictor.

You may have wondered what value of r² is enough. I cannot answer that question. However, there is a test of the hypothesis that ρ, the population correlation between x and y, is equal to zero.

The formal statement of the hypothesis to be tested is:

H0: ρ = 0  vs.  HA: ρ ≠ 0.

If you reject the null hypothesis and conclude that ρ ≠ 0, then the correlation is said to be statistically significant. If you accept the null hypothesis that ρ = 0, then the correlation is said to be not statistically significant.

Although it is possible to construct a confidence interval for ρ, the process is complex and approximate. Traditionally only Methods 2 and 3 are used.

If you have all of the raw data, as in the case above, then one can test the hypothesis by simply running the regression and using Method 3, the p-value method. The following output shows (in yellow) the required p-value on the regression output.

[pic]

The observed correlation coefficient is r = .851856, and the two-sided p-value is .00000188. Since this is much lower than α = .05 or even α = .01, we would say that there is a statistically significant correlation between height and weight.

If you do not have the raw data, but only the actual value of the correlation coefficient r, then one can use Method 2, the t-test method. Let us work with α = .01. The test statistic that will be used is given by the formula:

tobs = r √(n − 2) / √(1 − r²),

which follows the t distribution with degrees of freedom = n − 2.

In our case n = 20, so the t distribution has df = 20 − 2 = 18 degrees of freedom.

The appropriate cut-off point, using the EXCEL function tinv, is:

tα/2 = tinv(.01, 18) = 2.878442.

Then compute:

tobs = .8519 √(20 − 2) / √(1 − .7257) = 6.90.

Since this value falls outside the range ±2.878442, we would reject the null hypothesis and conclude that there is a statistically significant correlation between height and weight.

The choice of the English word "significant" is unfortunate, since it gives the impression that if something is significant, it is important. Consider a situation where a random sample of size 400 is taken, and the correlation computed between two variables based on this sample is .1. Suppose we work with α = .05, so that our cut-off points are approximately ±1.96. Then the test statistic would be:

tobs = .1 √(400 − 2) / √(1 − .01) = 2.01.

Since this is outside the ± bounds, we would reject the null hypothesis and say that there is a statistically significant correlation between the two variables. However, all this means is that the correlation is probably not zero. It does not mean that the relationship is useful for forecasting!

In order to determine the practical use of any relationships we still need to square r. In this case r2 = (.1)2 = .01. This indicates that only about 1% of the variability in the y variable can be explained by using x as a predictor leaving almost 99% unexplained.

Whenever you hear someone claim that there is a statistically significant correlation between two variables, remember that only means the correlation is probably not zero. Ask what the value of r is and then square it to determine if there is something of practical value in the relationship.
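Both correlation examples can be run through the same t statistic; the Python sketch below (assuming scipy) reproduces the height/weight test and the large-sample cautionary example:

```python
import math
from scipy import stats

def corr_t_test(r, n):
    """t = r*sqrt(n-2)/sqrt(1-r^2) with df = n-2; returns (t_obs, two-sided p)."""
    t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return t_obs, 2 * stats.t.sf(abs(t_obs), n - 2)

# Height/weight example: significant AND practically useful (r^2 = .73)
t1, p1 = corr_t_test(0.851856, 20)  # t ~ 6.90, p ~ .0000019

# Large-sample example: "significant" but practically useless (r^2 = .01)
t2, p2 = corr_t_test(0.1, 400)      # t ~ 2.01, just outside +/-1.96
```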

We can also test hypotheses about regression coefficients using the theory we have developed. Consider the regression model:

y = β0 + β1x + ε,

as we studied in Module I. EXCEL provides the p-values to test the hypotheses:

H0: βi = 0  vs.  HA: βi ≠ 0.

In our current example, these values are highlighted below:

[pic]

In this case both b0 and b1 are significantly different from zero, even with alpha as low as .01.

In fact comparing the p-value to .05 formed the basis of the Backwards Elimination procedure that we developed in the first module.

Structural Hypotheses

Assume your firm is in an area with only one major competitor. The marketing department has determined that buyers fall into one of three groups. They are either loyal buyers of your product, loyal to your competitor's product, or opportunity buyers who will purchase from either of you depending on their whim.

Last year you held 60% of the market, your competitor held 25%, and 15% of the market were opportunistic buyers. A recent survey gave the following results:

[pic]

Has there been a change?

Notice that in this situation there really is no "statistic," like the mean or the proportion, to formulate a hypothesis about. Rather, the question is structural: does the data conform to a fixed pattern, in this case the market-share distribution of last year?

If we can specify the structural pattern as a probability distribution, then a very useful statistic based on the Chi-Squared Distribution can often be used. Formally, we need the following set-up:

Category    Probability    Observed Number
   1            π1               x1
   2            π2               x2
   .             .                .
   .             .                .
   K            πK               xK
            __________     _____________
              1.00               n

Define:

EXPi = n πi.

Now, if EXPi > 3.5 for each of the K categories, then the Chi-Squared Distribution with K − 1 degrees of freedom can be used to test whether the observed data conform to the structure of the probability distribution.

Formally, the hypothesis being tested is:

H0 : Data Conforms To The Specified Probability Distribution

HA : Data Does Not Conform To The Specified Probability Distribution

Again, notice that no specific parameter value appears in the hypothesis. This means that we cannot approach this problem using the confidence-interval approach, since there is nothing to put a confidence interval on.

The actual test statistic is:

χ²obs = Σi (OBSi − EXPi)² / EXPi,

where OBSi is the observed value in category i, i.e., xi.

Notice that whenever the observed value in a category differs from the expected value, in either the positive or negative direction, it contributes a positive amount to the chi-square statistic, since the difference between the observed and expected values is squared.

Accordingly, we shall use a one-sided p-value when testing these kinds of structural hypotheses.

The Chi-Square distribution is a right-skewed distribution. There are two functions in EXCEL associated with its use.

The first is

=chidist(value, degrees of freedom)

For a given value and degrees of freedom, this function gives the one-sided p-value of being greater than or equal to the observed value.

The second is

=chiinv(p, degrees of freedom).

This gives the value which has a probability p of being exceeded for a chi-square distribution with the given degrees of freedom.

Unfortunately, EXCEL does not perform the Chi-Square test directly. However it is very easy to set up as shown below:

[pic]

The "EXP" column is obtained by simply multiplying the probability for each of the categories last year by 6,478. The value of [pic] is shown in yellow.

I can get the p-value by using the EXCEL function chidist with (3 − 1) = 2 degrees of freedom:

one-sided p-value = chidist(10.44, 2) = .005412.

Using either α = .05 or α = .01, we would reject the hypothesis that the data conform to last year's pattern, concluding that the pattern has changed.
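The chi-square computation is easy to script. Since the survey's observed counts appear only in the screenshot above, the sketch below (assuming scipy) defines a general-purpose function, checks it on a perfectly conforming example, and reproduces the quoted p-value from the lecture's χ² of 10.44:

```python
from scipy import stats

def chi_square_test(observed, probs):
    """Structural test: chi2 = sum of (OBS - EXP)^2 / EXP, df = K - 1."""
    n = sum(observed)
    expected = [n * p for p in probs]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = stats.chi2.sf(chi2, len(probs) - 1)  # Excel: =chidist(chi2, K-1)
    return chi2, p_value

# A sample that matches last year's 60/25/15 split exactly gives chi2 = 0:
chi2_zero, p_one = chi_square_test([60, 25, 15], [0.60, 0.25, 0.15])

# The lecture's observed chi2 of 10.44 with df = 2 gives its quoted p-value:
print(stats.chi2.sf(10.44, 2))  # ~ .0054
```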

If one rejects a structural hypothesis, the next question is: where does the structure differ from what was hypothesized? An empirical procedure suggests that one look for cells where:

(OBSi − EXPi)² / EXPi > 3.5.

Examining the table below indicates that the major source of change is in the Opportunity category, with value 8.84 (shown in green below).

[pic]

Examining the cells highlighted in yellow, the data suggest that the Opportunity group is shrinking and that people are becoming more loyal customers of either your company or your competitor.

As another example, consider the pseudo-random numbers that we have been using throughout this course. How could I test whether they really are close to random? One way is to see if they behave like random numbers and have an equal probability of taking on any value between 0 and 1.

Below I have generated 100 random numbers:

[pic]

A histogram of the 100 numbers, distributed into the ranges 0 to .10, .10 to .20, …, .80 to .90, .90 to 1.00, is shown below:

[pic]

Do these data conform to approximately 10% of the data in each bin?

The table below shows the computation as before:

[pic]

As can be seen, the p-value is .071177, which is the value given by chidist(15.8, 9). The result is not significant at the .05 level, so there is no reason to doubt that the pseudo-random numbers are behaving like genuine random numbers.
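The same recipe can be applied to any pseudo-random generator. The sketch below (an illustration, not the lecture's spreadsheet) tests 1,000 draws from Python's own generator for uniformity across ten bins:

```python
import random
from scipy import stats

random.seed(12345)  # arbitrary seed so the run is reproducible
draws = [random.random() for _ in range(1000)]

# Bin the draws into ten equal ranges: 0-.1, .1-.2, ..., .9-1.0
observed = [0] * 10
for u in draws:
    observed[min(int(u * 10), 9)] += 1

expected = 100.0  # 1000 draws * 1/10 probability per bin
chi2 = sum((o - expected) ** 2 / expected for o in observed)
p_value = stats.chi2.sf(chi2, 9)  # df = 10 - 1 = 9
print(chi2, p_value)
```

A large p-value here means there is no evidence against uniformity, just as with the 100 numbers above.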

Finally, note that in two of the cells the contribution to chi-square exceeds 3.5. Had the overall result been significant, we would have focused on these cells.

However, we only look at the individual contributions if the overall result is significant!

In other words, we only look for the deviations in individual categories if the overall pattern does not seem to conform to the hypothesized structure.
