MATH1725 Introduction to Statistics: Worked examples

Worked Example: Lectures 1–2

The lifetimes of 400 light-bulbs were found to the nearest hour. The results were recorded as follows.

Lifetime (hours)   0–199   200–399   400–599   600–799   800–999   1000–1199   1200–1999
Frequency          143     97        64        51        14        14          17

Construct a histogram and cumulative frequency polygon for these data. Estimate the percentage of bulbs with lifetime less than 480 hours.

Answer: Lifetimes cannot be negative so class intervals are [0, 199.5), [199.5, 399.5), [399.5, 599.5), and so on.

[Figure: histogram of lifetime (hours), 0 to 2000, against frequency per 200 hour class.]

Adjust the height of the rectangle for the "1200–1999" interval to make histogram area proportional to frequency. If the vertical axis is "frequency per interval of 200 hours", the height of the [0, 199.5) class is $143 \times 200/199.5 = 143.4$, to allow for the first class not being of width 200.
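The height adjustment can be sketched in a few lines of Python. This is a minimal illustration, assuming the class boundaries 0, 199.5, 399.5, ..., 1199.5, 1999.5 implied by the answer; the variable names are illustrative and not part of the original notes.

```python
# Class boundaries (hours) and class frequencies from the light-bulb data.
boundaries = [0.0, 199.5, 399.5, 599.5, 799.5, 999.5, 1199.5, 1999.5]
freqs = [143, 97, 64, 51, 14, 14, 17]

# Height of each rectangle in "frequency per 200 hours":
# frequency scaled by 200 / (class width), so that area is proportional to frequency.
heights = [
    f * 200 / (hi - lo)
    for f, lo, hi in zip(freqs, boundaries, boundaries[1:])
]

# Recover the total frequency from the areas as a sanity check.
total_area = sum(
    h * (hi - lo) / 200
    for h, lo, hi in zip(heights, boundaries, boundaries[1:])
)

for (lo, hi), h in zip(zip(boundaries, boundaries[1:]), heights):
    print(f"[{lo}, {hi}): height {h:.2f}")
```

Only the first and last classes differ from width 200, so only their heights differ from the raw frequencies.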

Lifetime (hours)       0.0   199.5   399.5   599.5   799.5   999.5   1199.5   1999.5
Cumulative frequency   0     143     240     304     355     369     383      400

Make the cumulative frequency at time zero equal to 0.

[Figure: cumulative frequency polygon against lifetime (hours), 0 to 2000, with an enlarged view of the range 400 to 600 hours showing a cumulative frequency of 265.8 at 480 hours.]

Estimated number of light-bulbs with lifetime less than 480 hours is

$240 + \frac{480 - 399.5}{200} \times (304 - 240) = 265.8.$

Required percentage is

$\frac{265.8}{400} \times 100 = 66.4\%.$
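The interpolation along the cumulative frequency polygon can be checked numerically. The following is a minimal sketch, assuming straight-line segments between the class boundaries; the helper name `cumulative_at` is illustrative and not part of the original notes.

```python
from bisect import bisect_right

# Class boundaries (hours) and cumulative frequencies for the light-bulb data.
boundaries = [0.0, 199.5, 399.5, 599.5, 799.5, 999.5, 1199.5, 1999.5]
cum_freq = [0, 143, 240, 304, 355, 369, 383, 400]


def cumulative_at(t, boundaries, cum_freq):
    """Linearly interpolate the cumulative frequency polygon at time t."""
    if t <= boundaries[0]:
        return float(cum_freq[0])
    if t >= boundaries[-1]:
        return float(cum_freq[-1])
    # Index of the class boundary at or immediately to the left of t.
    k = bisect_right(boundaries, t) - 1
    width = boundaries[k + 1] - boundaries[k]
    fraction = (t - boundaries[k]) / width
    return cum_freq[k] + fraction * (cum_freq[k + 1] - cum_freq[k])


count = cumulative_at(480, boundaries, cum_freq)
percentage = 100 * count / cum_freq[-1]
print(round(count, 1), round(percentage, 1))
```

The same function gives the estimated count below any other lifetime, which is what reading a value off the polygon amounts to.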

Worked Example: Lectures 1–2

The Christmas cactus Zygocactus truncatus has branches made up of separate segments. For one such cactus the number of segments in each branch was counted.

Number x of segments                 1   2   3   4   5   6    7   8   9
Number of branches with x segments   3   0   6   7   8   18   8   0   2

Construct a cumulative frequency polygon to represent these data.

Answer: The data are discrete, so the cumulative frequency plot is a step function.

Number x of segments                         1   2   3   4    5    6    7    8    9
Number of branches with x or fewer segments  3   3   9   16   24   42   50   50   52

[Figure: step-function plot of cumulative frequency (0 to 60) against number of segments (0 to 10).]
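The cumulative counts, and the step function they define, can be reproduced with a short Python sketch. The function name `step_cdf` is illustrative only; it assumes the usual convention that the step function at x counts branches with at most x segments.

```python
from itertools import accumulate

segments = [1, 2, 3, 4, 5, 6, 7, 8, 9]
branches = [3, 0, 6, 7, 8, 18, 8, 0, 2]

# Running totals: number of branches with at most x segments.
cumulative = list(accumulate(branches))


def step_cdf(x):
    """Cumulative frequency at x: constant between integers, jumping at each integer."""
    total = 0
    for value, running in zip(segments, cumulative):
        if value <= x:
            total = running
    return total


print(cumulative)
print(step_cdf(0.5), step_cdf(6), step_cdf(6.9), step_cdf(12))
```

Because the value only changes at the integers 1 to 9, the plot is flat between them, unlike the polygon used for continuous grouped data.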

Worked Example: Lectures 1–2

The following data give one hundred measurement errors made during the mapping of the American state of Massachusetts during the last century.

Error X (in minutes of arc)   (-4, -2]   (-2, 0]   (0, +2]   (+2, +4]   (+4, +6]
Frequency                     10         43        39        5          3

Show that the sample mean and sample standard deviation for these data are $\bar{x} = -0.04$ and $s = 1.717$ respectively.

Answer:

Class            Class frequency f   Class mid-point x   fx     fx^2
-4 < x <= -2     10                  -3                  -30    90
-2 < x <= 0      43                  -1                  -43    43
 0 < x <= +2     39                  +1                   39    39
+2 < x <= +4     5                   +3                   15    45
+4 < x <= +6     3                   +5                   15    75
Totals           n = 100                                  -4    292

$\bar{x} = \frac{-4}{100} = -0.04.$

$s^2 = \frac{1}{99} \left( 292 - 100 \times (-0.04)^2 \right) = 2.9479,$

so $s = \sqrt{s^2} = \sqrt{2.9479} = 1.717$.
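The grouped-data calculation can be verified with a short Python sketch. It assumes, as in the table, that every observation in a class sits at the class mid-point; the variable names are illustrative.

```python
from math import sqrt

# Class mid-points and frequencies for the mapping errors.
midpoints = [-3, -1, 1, 3, 5]
freqs = [10, 43, 39, 5, 3]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))        # total of the fx column
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))   # total of the fx^2 column

mean = sum_fx / n
variance = (sum_fx2 - n * mean ** 2) / (n - 1)   # divisor n - 1, as in the notes
sd = sqrt(variance)

print(n, sum_fx, sum_fx2)
print(round(mean, 2), round(variance, 4), round(sd, 3))
```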

Worked Example: Lectures 1–2

The times between arrivals of 60 patients at an intensive care unit were recorded to the nearest hour. The data are shown below.

Time (hours)   0–19   20–39   40–59   60–79   80–99   100–119   120–139   140–159   160–179
Frequency      16     13      17      4       4       3         1         1         1

Determine the median and semi-interquartile range. Explain why this pair of statistics might be preferred to the mean and standard deviation for these data.

Answer:

Time (hours)           0.0   19.5   39.5   59.5   79.5   99.5   119.5   139.5   159.5   179.5
Cumulative frequency   0     16     29     46     50     54     57      58      59      60

Median lies in "40–59" class, corresponding to cumulative frequency 30. Lower quartile is in "0–19" class, corresponding to cumulative frequency 15. Notice that this class has width 19.5 hours, not 20 hours. Upper quartile is in "40–59" class, corresponding to cumulative frequency 45.

Median $= 39.5 + \frac{30 - 29}{46 - 29} \times 20 = 40.7$ hours.

Lower quartile $= 0.0 + \frac{15 - 0}{16 - 0} \times 19.5 = 18.3$ hours.

Upper quartile $= 39.5 + \frac{45 - 29}{46 - 29} \times 20 = 58.3$ hours.

Semi-interquartile range $= \frac{1}{2}(58.3 - 18.3) = 20.0$ hours.

The histogram for these data is positively skew, so the median and semi-interquartile range might be preferred to the mean and standard deviation as measures of location and dispersion respectively.
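The quartile calculations all use the same interpolation within a class, which can be sketched in Python as follows. The helper `grouped_quantile` is an illustrative name, and the targets 15, 30 and 45 are the cumulative frequencies n/4, n/2 and 3n/4 used in the answer.

```python
# Class boundaries (hours) and cumulative frequencies for the inter-arrival times.
boundaries = [0.0, 19.5, 39.5, 59.5, 79.5, 99.5, 119.5, 139.5, 159.5, 179.5]
cum_freq = [0, 16, 29, 46, 50, 54, 57, 58, 59, 60]


def grouped_quantile(target, boundaries, cum_freq):
    """Time at which the cumulative frequency polygon reaches `target`."""
    for k in range(len(cum_freq) - 1):
        lo, hi = cum_freq[k], cum_freq[k + 1]
        if hi > lo and lo <= target <= hi:
            width = boundaries[k + 1] - boundaries[k]
            return boundaries[k] + (target - lo) / (hi - lo) * width
    raise ValueError("target lies outside the range of cumulative frequencies")


n = cum_freq[-1]
lower_quartile = grouped_quantile(n / 4, boundaries, cum_freq)
median = grouped_quantile(n / 2, boundaries, cum_freq)
upper_quartile = grouped_quantile(3 * n / 4, boundaries, cum_freq)
siqr = (upper_quartile - lower_quartile) / 2

print(round(lower_quartile, 1), round(median, 1), round(upper_quartile, 1), round(siqr, 1))
```

The width is taken from the boundaries rather than fixed at 20, which handles the first class of width 19.5 automatically.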

[Figure: histogram of inter-arrival time (hours), 0 to 200, against frequency per 20 hour class.]

Worked Example: Lectures 4–6

A firm investigates the length of telephone conversations of their office staff. Ten consecutive conversations had lengths, in minutes:

10.7, 9.5, 11.1, 7.8, 11.9, 4.1, 10.0, 9.2, 6.5, 9.2.

Derive a 95% confidence interval for the mean conversation length. Test whether the mean length of a conversation is eight minutes.

Answer:

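$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{90}{10} = 9$ minutes.

$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) = 5.42.$

Estimate the population variance $\sigma^2$ by $s^2$, with $s = \sqrt{5.42} = 2.33$. Then

$\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}.$

A 95% confidence interval for $\mu$ is $\bar{x} \pm t_9(2.5\%)\, s/\sqrt{10}$. Here $s/\sqrt{10} = 0.737$ and $t_9(2.5\%) = 2.262$.

$\bar{x} \pm t_9(2.5\%) \frac{s}{\sqrt{10}} = 9 \pm (2.262 \times 0.737) = 9 \pm 1.667 = (7.3, 10.7).$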

Since 8 minutes lies inside the 95% confidence interval, we would accept $H_0$ in testing $H_0: \mu = 8$ vs. $H_1: \mu \neq 8$ at the 5% significance level.
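The interval can be reproduced with Python's standard library. This is a minimal sketch: the critical value 2.262 is taken from the text rather than computed, and the variable names are illustrative. Because the code does not round s to 2.33 at an intermediate step, the half-width differs from 1.667 in the third decimal place, but the interval agrees to one decimal place.

```python
from math import sqrt
from statistics import mean, variance

lengths = [10.7, 9.5, 11.1, 7.8, 11.9, 4.1, 10.0, 9.2, 6.5, 9.2]

n = len(lengths)
xbar = mean(lengths)
s2 = variance(lengths)        # sample variance with divisor n - 1
s = sqrt(s2)

t_crit = 2.262                # t_9(2.5%), as quoted in the text
half_width = t_crit * s / sqrt(n)
lower, upper = xbar - half_width, xbar + half_width

# Two-sided test of H0: mu = 8 at the 5% level via the confidence interval.
accept_h0 = lower < 8 < upper

print(round(xbar, 2), round(s2, 2), round(s, 2))
print(round(lower, 1), round(upper, 1), accept_h0)
```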

Worked Example: Lectures 5–6

A population has a Poisson distribution but it is not known whether the mean $\mu$ is 1 or 4. To test the hypothesis $H_0: \mu = 1$ vs. $H_1: \mu \neq 1$ on the basis of one observation $X$ the following test procedure is considered: reject $H_0$ if $X \geq i$.

Type I error is defined to be "rejecting H0 when H0 is true". Find the probability of type I error for the three cases i = 2, 3, 4.

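Answer: If $H_0$ is true, $\mu = 1$ and

$\mathrm{pr}\{X = x\} = \frac{e^{-1}}{x!}, \quad x = 0, 1, 2, \ldots,$

so that $\mathrm{pr}\{\text{Type I error}\} = \mathrm{pr}\{X \geq i\}$. If $i = 2$,

$\mathrm{pr}\{\text{Type I error}\} = \mathrm{pr}\{X \geq 2\} = 1 - \mathrm{pr}\{X < 2\} = 1 - \mathrm{pr}\{X = 0\} - \mathrm{pr}\{X = 1\} = 1 - e^{-1} - e^{-1} = 0.264.$

Similarly if $i = 3$,

$\mathrm{pr}\{\text{Type I error}\} = \mathrm{pr}\{X \geq 3\} = 1 - \mathrm{pr}\{X < 3\} = 0.080.$

If $i = 4$,

$\mathrm{pr}\{\text{Type I error}\} = \mathrm{pr}\{X \geq 4\} = 0.019.$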

Notice that an exact 5% or 10% significance level test does not exist for this discrete distribution.
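The three tail probabilities can be checked directly from the Poisson probability function. The sketch below uses only the standard library; the function names are illustrative and not from the original notes.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """pr{X = x} for a Poisson random variable with mean mu."""
    return exp(-mu) * mu ** x / factorial(x)


def type_one_error(i, mu0=1.0):
    """pr{X >= i} when H0 is true (mean mu0), i.e. the size of the test 'reject if X >= i'."""
    return 1.0 - sum(poisson_pmf(x, mu0) for x in range(i))


sizes = {i: type_one_error(i) for i in (2, 3, 4)}
for i, p in sizes.items():
    print(i, round(p, 3))
```

The available sizes jump from about 0.080 to about 0.019 as i goes from 3 to 4, which is why no rejection rule of this form has size exactly 5%.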

Worked Example: Lectures 5–6

A sample of size 64 is drawn by simple random sampling from a normal population which has known variance 4. The sample mean is -0.45. Test the hypothesis $H_0: \mu = 0$ vs. $H_1: \mu \neq 0$ at the 5% level of significance. Repeat for testing $H_0: \mu = 0$ vs. $H_1: \mu < 0$.

Answer: Here $\bar{X} \sim N(\mu, \sigma^2/n)$ with $\sigma^2 = 4$, $n = 64$, so $\sigma^2/n = 0.0625$ and $\bar{X} \sim N(\mu, 0.0625)$.

Test statistic is

$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X}}{\sqrt{0.0625}} = \frac{\bar{X}}{0.25}$ (taking $\mu = 0$ under $H_0$),

where $Z \sim N(0, 1)$ if $H_0$ is true.

For $\alpha = 0.05$ with a two-sided test, $z_{\alpha/2} = 1.96$. Critical region is $Z < -1.96$ or $Z > 1.96$. Observed value is $z = -0.45/0.25 = -1.8$. This does not lie in the critical region so accept $H_0$.

For $\alpha = 0.05$ with a one-sided test, $z_{\alpha} = 1.645$. Critical region is $Z < -1.645$. Observed value is $z = -1.8$, which lies in the critical region so reject $H_0$.
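Both tests can be reproduced with `statistics.NormalDist` from the Python standard library. This is a minimal sketch with illustrative variable names; the one-sided test uses the lower-tail critical region $Z < -1.645$ given in the answer.

```python
from math import sqrt
from statistics import NormalDist

sigma2 = 4.0      # known population variance
n = 64
xbar = -0.45
mu0 = 0.0         # value of the mean under H0
alpha = 0.05

z = (xbar - mu0) / sqrt(sigma2 / n)

standard_normal = NormalDist()                        # N(0, 1)
z_two_sided = standard_normal.inv_cdf(1 - alpha / 2)  # about 1.96
z_one_sided = standard_normal.inv_cdf(1 - alpha)      # about 1.645

reject_two_sided = abs(z) > z_two_sided
reject_lower_tail = z < -z_one_sided

print(round(z, 2), round(z_two_sided, 3), round(z_one_sided, 3))
print(reject_two_sided, reject_lower_tail)
```

The same observed value leads to different decisions because the one-sided test puts the whole 5% in the lower tail.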

Worked Example: Lecture 6

The absenteeism rates (in days and parts of days) for nine employees of a large company were recorded in two consecutive years.

Employee   1     2     3      4     5     6      7      8      9
Year 1     3.0   6.7   11.3   5.0   9.4   15.7   8.0    10.0   9.7
Year 2     2.8   5.1   8.4    5.0   6.2   12.2   10.0   6.8    6.0

Is there any evidence that the average absenteeism rate is different for the two years?

Answer: The data are paired, as the same employee is studied in each of the two years. Form the differences $d_i = (\text{year 1})_i - (\text{year 2})_i$. We need to estimate the variance $\sigma_d^2$. Test $H_0: \mu_d = 0$ vs. $H_1: \mu_d \neq 0$. See lecture 6.
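The notes leave the arithmetic to lecture 6, so the following Python sketch is only one way the paired test might be carried out. It assumes a two-sided paired t-test at the 5% level; the critical value $t_8(2.5\%) = 2.306$ is taken from standard t tables and does not appear in the original, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

year1 = [3.0, 6.7, 11.3, 5.0, 9.4, 15.7, 8.0, 10.0, 9.7]
year2 = [2.8, 5.1, 8.4, 5.0, 6.2, 12.2, 10.0, 6.8, 6.0]

# Differences d_i = (year 1)_i - (year 2)_i for each employee.
d = [a - b for a, b in zip(year1, year2)]

n = len(d)
dbar = mean(d)
s_d = stdev(d)                     # sample standard deviation, divisor n - 1
t_stat = dbar / (s_d / sqrt(n))    # compared with a t distribution on n - 1 = 8 d.f.

t_crit = 2.306                     # ASSUMPTION: t_8(2.5%) from standard tables, not from the notes
reject_h0 = abs(t_stat) > t_crit

print(round(dbar, 3), round(s_d, 3), round(t_stat, 2), reject_h0)
```

Working with the differences removes the employee-to-employee variation, which is the reason for treating the data as paired rather than as two independent samples.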

Worked Example: Lecture 8

Which phrases i-iv below apply to the sample correlation coefficient $r_{XY}$? (i) measures linear association between two variables, (ii) is never negative, (iii) has positive slope, (iv) depends on the units of measurement of X and Y.

Answer: i only.

Worked Example: Lecture 8

The tensile strength of a glued joint is related to the glue thickness. A sample of six values gave the following results:

Glue Thickness (inches)   0.12   0.12   0.13   0.13   0.14   0.14
Tensile Strength (lbs.)   49.8   46.1   46.5   45.8   44.3   45.9

Calculate the sample correlation coefficient r for these data. Use the fitted least squares regression line to predict the tensile strength of a joint for a glue thickness of 0.14 inches. Using scatter-diagrams, sketch the form of regression line expected in the three cases when r takes the values -1, 0, and +1.

Answer: Let X denote the glue thickness and Y the joint strength.

         x      y       x^2      y^2        xy
         0.12   49.8    0.0144   2480.04    5.976
         0.12   46.1    0.0144   2125.21    5.532
         0.13   46.5    0.0169   2162.25    6.045
         0.13   45.8    0.0169   2097.64    5.954
         0.14   44.3    0.0196   1962.49    6.202
         0.14   45.9    0.0196   2106.81    6.426
Totals   0.78   278.4   0.1018   12934.44   36.135

$\bar{x} = \frac{0.78}{6} = 0.13, \qquad \bar{y} = \frac{278.4}{6} = 46.4,$

$s_X^2 = \frac{1}{5} \{ 0.1018 - 6(0.13)^2 \} = 0.00008,$

$s_Y^2 = \frac{1}{5} \{ 12934.44 - 6(46.4)^2 \} = 3.336,$

$s_{XY} = \frac{1}{5} \{ 36.135 - 6(0.13)(46.4) \} = -0.0114.$

$r_{XY} = \frac{s_{XY}}{s_X s_Y} = \frac{-0.0114}{\sqrt{0.00008 \times 3.336}} = -0.698.$

Regression line:

$y = \bar{y} + (x - \bar{x}) \frac{s_{XY}}{s_X^2} = 46.4 + (x - 0.13) \frac{-0.0114}{0.00008} = 64.925 - 142.5x.$

At x = 0.14 this gives y = 44.975 lbs.

Scatter-plots:
r = -1: data lie on a straight line with negative slope.
r = +1: data lie on a straight line with positive slope.
r = 0: data randomly scattered (X and Y independent), or could show a case with X and Y having a non-linear dependence as in the lecture notes. You could even show both of these cases!
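The correlation and regression calculations can be checked with a short Python sketch. It uses sums of deviations from the means, which are algebraically equivalent to the shortcut formulas in the answer but less sensitive to rounding; the variable names are illustrative.

```python
from math import sqrt

x = [0.12, 0.12, 0.13, 0.13, 0.14, 0.14]   # glue thickness (inches)
y = [49.8, 46.1, 46.5, 45.8, 44.3, 45.9]   # tensile strength (lbs.)

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Sample variances and covariance, all with divisor n - 1.
s_xx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
s_yy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

r = s_xy / sqrt(s_xx * s_yy)

# Least squares line y = intercept + slope * x.
slope = s_xy / s_xx
intercept = ybar - slope * xbar
prediction = intercept + slope * 0.14

print(round(r, 3), round(slope, 1), round(intercept, 3), round(prediction, 3))
```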

Worked Example: Lecture 11

A coin is tossed three times. Let X denote the number of heads and Y the length of the longest run of heads or tails. Thus HTT gives X = 1 and Y = 2, THT gives X = 1 and Y = 1. (a) Obtain the joint probabilities of X and Y . (b) Obtain the marginal probability distribution of X and Y . (c) If X = 1, what is the distribution of Y ?

Answer: (a and b) All eight outcomes are equally likely, so occur with probability 1/8.

Outcome   X   Y   Probability
HHH       3   3   1/8
HHT       2   2   1/8
HTH       2   1   1/8
HTT       1   2   1/8
THH       2   2   1/8
THT       1   1   1/8
TTH       1   2   1/8
TTT       0   3   1/8

                  Y
            1     2     3     pX(x)
X    0      0     0     1/8   1/8
     1      1/8   1/4   0     3/8
     2      1/8   1/4   0     3/8
     3      0     0     1/8   1/8
pY(y)       1/4   1/2   1/4   Total = 1

Joint probabilities p(x, y) are found by summing probabilities for each outcome giving rise to (X = x, Y = y). Thus p(1, 2) = pr{HTT or TTH} = 1/4.

Marginal probabilities are found by forming row or column sum. For example

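$\mathrm{pr}\{X = 2\} = p(2, 1) + p(2, 2) + p(2, 3) = \frac{3}{8}.$

(c) If X = 1, then

$\mathrm{pr}\{Y = y \mid X = 1\} = \frac{p(1, y)}{p_X(1)} = \frac{p(1, y)}{3/8}.$

Thus

$\mathrm{pr}\{Y = 1 \mid X = 1\} = \frac{1/8}{3/8} = 1/3,$

$\mathrm{pr}\{Y = 2 \mid X = 1\} = \frac{2/8}{3/8} = 2/3,$

$\mathrm{pr}\{Y = 3 \mid X = 1\} = 0.$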

If X = 1, then the outcome is one of HTT, THT, TTH. In one out of these three cases we observe Y = 1 and in two out of three we observe Y = 2.
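The joint, marginal and conditional distributions can be obtained by enumerating the eight outcomes. The sketch below uses exact fractions from the Python standard library; the helper `longest_run` and the dictionary names are illustrative and not part of the original notes.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import groupby, product


def longest_run(outcome):
    """Length of the longest run of identical symbols in a sequence of tosses."""
    return max(len(list(group)) for _, group in groupby(outcome))


# Joint distribution p(x, y): each of the 8 outcomes has probability 1/8.
joint = defaultdict(Fraction)
for outcome in product("HT", repeat=3):
    x = outcome.count("H")
    y = longest_run(outcome)
    joint[(x, y)] += Fraction(1, 8)

# Marginal distributions: row sums and column sums of the joint table.
p_x = defaultdict(Fraction)
p_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

# Conditional distribution of Y given X = 1.
cond_y_given_x1 = {y: joint.get((1, y), Fraction(0)) / p_x[1] for y in (1, 2, 3)}

print(dict(p_x))
print(dict(p_y))
print(cond_y_given_x1)
```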

Worked Example: Lecture 11

Suppose X and Y are independent continuous random variables which are each uniformly distributed on the interval (0, 1). (a) Find the probability that $0 < X + Y < z$ for values $z \in (0, 2)$. (b) If $Z = X + Y$, deduce the form of the probability density function $f(z)$ of Z. Hints: In (a), think about the area on the x-y plane corresponding to $0 < x + y < z$. In (b), first find the cumulative distribution function $F(z) = \mathrm{pr}\{Z \leq z\}$.

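Answer: As X and Y are uniformly distributed on the interval (0, 1) they have pdf

$f_X(x) = \begin{cases} 1 & \text{if } 0 < x < 1, \\ 0 & \text{otherwise,} \end{cases} \qquad f_Y(y) = \begin{cases} 1 & \text{if } 0 < y < 1, \\ 0 & \text{otherwise.} \end{cases}$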

(a) Joint probability density is f (x, y) = fX(x)fY (y) by independence of X and Y . Hence f (x, y) = 1, a constant, for 0 < x < 1 and 0 < y < 1.

Probability of an event A is volume under pdf with base area given by A. Here A is the region for which 0 < X + Y < z.

Consider the two cases z < 1 and z > 1 separately.
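As a rough numerical check on the area argument, the Python sketch below compares a grid approximation of the area of A with the elementary geometry of the two cases: a right-angled triangle with legs z when z < 1, and the unit square minus a triangle with legs 2 - z when z > 1. The closed form is a standard result stated here as an assumption, not quoted from the notes, and the function names are illustrative.

```python
def area_closed_form(z):
    """Area of {0 < x < 1, 0 < y < 1, x + y < z} for 0 < z < 2 (assumed standard result)."""
    if z <= 1:
        return z ** 2 / 2            # triangle with legs of length z
    return 1 - (2 - z) ** 2 / 2      # unit square minus triangle with legs 2 - z


def area_grid(z, m=400):
    """Fraction of midpoints of an m-by-m grid on the unit square with x + y < z."""
    h = 1.0 / m
    hits = sum(
        1
        for i in range(m)
        for j in range(m)
        if (i + 0.5) * h + (j + 0.5) * h < z
    )
    return hits / (m * m)


for z in (0.5, 1.0, 1.5):
    print(z, area_closed_form(z), round(area_grid(z), 4))
```

Since the joint density equals 1 on the unit square, the probability of A is simply its area, so agreement between the two columns supports the case-by-case treatment.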

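[Figure: the joint density f(x, y) = 1 over the unit square in the x-y plane with the region A marked, and the region x + y < z drawn for the case z < 1 and for the case z > 1.]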
................
