Statistics:



Handout 1 (Chapter 1): Overview and Descriptive Statistics

Statistics: The branch of scientific inquiry that provides methods for organizing and summarizing data and for using information in the data to draw various conclusions.

Population: All individuals or objects of a particular type. We will denote the population size by N.

Sample: A portion or subset of the population. We will denote the sample size by n.

Variable: Any characteristic whose value may change from one object to another in the population.

Example 1: Population: engineering students at Texas A&M University

Sample: graduating engineering students at Texas A&M University

Variable: gender of graduating engineering students at Texas A&M University, or GPA of graduating engineering students at Texas A&M University

Question: Is the sample or the variable unique for the same population?

Question: Population: all daily newspapers published in the United States

Sample:

Variable:

Question: Published papers propose that consumption of vitamin A prevents cancer. How would you determine whether the data support this proposal?

Sample:

Variable:

Question: Studies show that smoking cigarettes causes cancer and yellow fingers. How would you determine whether the data support this?

Sample:

Variable:

Descriptive statistics: Organizing and summarizing the data.

Inferential statistics: Drawing conclusions about the population based on sample information.

DATA
   → Univariate
        → Qualitative: categorical
        → Quantitative: numerical
             → Discrete
             → Continuous
   → Bivariate: two groups
   → Multivariate: more than two groups

Example 2: Identify the following as categorical or numeric (if numeric, discrete or continuous).

• Color of eyes,

• number of students who play baseball in different schools,

• price of your textbook,

• type of car each student drives,

• height (in inches) or weight (in pounds),

• zip code,

• actual weight of tea-leaves in a 1-lb package,

• number of customers waiting in different banks.

Tabular and Pictorial Methods for Describing Data

Given a data set of n observations on some variable X, the individual observations are $x_1, x_2, \ldots, x_n$. The ordered observations (if numeric, from smallest to largest) are denoted $x_{(i)}$, i=1,2,...,n, where $x_{(i)}$ is the ith ordered value. n is the sample size and N is the population size.

Stem–and–Leaf Display

Stem-and-leaf plots are an easy way to display numeric data. An advantage of this type of plot is that you can still see the actual data values. How to make a stem-and-leaf display:

1. Look at the range of your data.

2. Choose your stem – this is the leading digit(s), usually the 1’s, 10’s, 100’s, etc. place.

3. Add your leaf – this is the trailing digit(s). Some people plot only the next digit, while others plot the next few digits.

Example 3 (Exercise 1.14): The data set consists of observations on shower-flow rate, X (L/min), for 129 houses in Perth, Australia. The unordered data (x1=4.6, x2=12.3, x3=7.1, ..., x127=6.3, x128=3.8, x129=6.0) are listed in the textbook. The data range from 2.2 to 18.9. Thus I will use the whole-number part (the 1’s and 10’s places) as my stem and attach the leaf, which is the next digit (the tenths place).

The stem-and-leaf display for the 6 data points (4.6, 12.3, 7.1, 6.3, 3.8, 6.0) selected from the 129 observations is as follows:

3 | 8

4 | 6

5 |

6 | 03

7 | 1

8 |

9 |

10 |

11 |

12 | 3
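The three steps above can be sketched in a few lines of Python. This is only an illustration, not how Minitab builds its display: the function name `stem_and_leaf` and the `leaf_unit` parameter are made up for this sketch, and it assumes nonnegative data with a single-digit leaf.

```python
from collections import defaultdict


def stem_and_leaf(data, leaf_unit=0.1):
    """Return the lines of a simple stem-and-leaf display.

    Each value is rounded to a whole number of leaf units; the last digit
    is the leaf and the remaining leading digits form the stem.
    Assumes nonnegative data (an assumption of this sketch).
    """
    scaled = sorted(round(x / leaf_unit) for x in data)
    leaves = defaultdict(list)
    for value in scaled:
        stem, leaf = divmod(value, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # List every stem between the smallest and the largest, even if empty.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines


# The six observations selected from the flow rate data in Example 3.
sample = [4.6, 12.3, 7.1, 6.3, 3.8, 6.0]
for line in stem_and_leaf(sample):
    print(line)
```

Empty stems (5, 8, 9, 10, 11) are printed on purpose so that gaps in the data stay visible.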

The following Minitab output summarizes the final result for the complete data:

Stem-and-leaf of Rate n = 129

Leaf Unit = 0.10

2 2 23

12 3 2344567789

20 4 01356889

37 5 00001114455666789

62 6 0000122223344456667789999

(17) 7 00012233455555668

50 8 02233448

42 9 012233335666788

27 10 2344455688

17 11 2335999

10 12 37

8 13 8

7 14 36

5 15 0035

1 16

1 17

1 18 9

Example 4 (Exercise 9.23): Fusible interlinings are being used with increasing frequency to support outer fabrics and improve the shape and drape of various pieces of clothing. The data on extensibility (%) at 100 gm/cm for both high-quality fabric (H) and poor-quality fabric (P) specimens are as follows.

H 1.2 0.9 0.7 1.0 1.7 1.7 1.1 0.9 1.7 1.9 1.3 2.1 1.6 1.8 1.4 1.3 1.9 1.6 0.8 2.0 1.7 1.6 2.3 2.0

P 1.6 1.5 1.1 2.1 1.5 1.3 1.0 2.6

The following Minitab output summarizes the complete data using two versions of the stem-and-leaf display: one with five lines per stem and one with two lines per stem.

Stem-and-leaf of H: n = 24 (five lines per stem)

Leaf Unit = 0.10

1 0 7

4 0 899

6 1 01

9 1 233

10 1 4

(7) 1 6667777

7 1 899

4 2 001

1 2 3

Stem-and-leaf of H: n = 24 (two lines per stem)

Leaf Unit = 0.10

4 0 7899

10 1 012334

(10) 1 6667777899

4 2 0013

Stem-and-leaf of P: n = 8 (five lines per stem)

Leaf Unit = 0.10

2 1 01

3 1 3

(2) 1 55

3 1 6

2 1

2 2 1

1 2

1 2

1 2 6

Stem-and-leaf of P: n = 8 (two lines per stem)

Leaf Unit = 0.10

3 1 013

(3) 1 556

2 2 1

1 2 6

Frequency Distributions

For numeric data, a frequency distribution lists the class intervals (continuous data) or the distinct values (discrete data) and counts the number of observations that fall into each one. This count is called the frequency. The relative frequency is obtained by dividing the frequency by the total number of observations; it is the fraction or proportion of the time the interval is observed (the value occurs). For categorical data, the frequency is the number of observations that fall into each category.

Example 5 (Exercise 1.21): The number of intersections, z is listed as one of the characteristics of subdivisions.

|z |Count |Relative frequency |Cumulative relative frequency |
|0 |13 |13/47=0.2766 |13/47=0.2766 |
|1 |11 |11/47=0.2340 |24/47=0.5106 |
|2 |3 |3/47=0.0638 |27/47=0.5745 |
|3 |7 |7/47=0.1489 |34/47=0.7234 |
|4 |5 |5/47=0.1064 |39/47=0.8298 |
|5 |3 |3/47=0.0638 |42/47=0.8936 |
|6 |3 |3/47=0.0638 |45/47=0.9574 |
|7 |0 |0/47=0 |45/47=0.9574 |
|8 |2 |2/47=0.0426 |47/47=1 |

n=47
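The relative and cumulative relative frequencies for Example 5 can be recomputed from the counts alone. The sketch below is one way to do it; the function name `frequency_table` is an assumption of this illustration, and the counts are the ones listed for the intersections data.

```python
# Counts of subdivisions by number of intersections (Example 5).
counts = {0: 13, 1: 11, 2: 3, 3: 7, 4: 5, 5: 3, 6: 3, 7: 0, 8: 2}


def frequency_table(counts):
    """Return rows of (value, count, relative freq, cumulative relative freq)."""
    n = sum(counts.values())
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], counts[value] / n, running / n))
    return rows


for value, count, rel, cum in frequency_table(counts):
    print(f"{value:>2} {count:>3} {rel:.4f} {cum:.4f}")
```

Questions such as "at most 3" or "between 2 and 5 (inclusive)" can then be read off as a cumulative value or as a difference of two cumulative values.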

What percentage of these subdivisions had at most 3 intersections?

What percentage of these subdivisions had fewer than 3 intersections?

What percentage of these subdivisions had between 2 and 5 (inclusive) intersections?

What percentage of these subdivisions had fewer than 2 or more than 5 intersections?

Histogram

A pictorial representation of a frequency distribution can be obtained by constructing a histogram. The histogram is a much better way of visualizing a data set than the stem-and-leaf.

How to construct a histogram for continuous data:

a) Divide the range of observations into intervals (plotted on the x axis).

b) Count the number of observations that fall in each interval; this is the frequency.

c) Compute the relative frequency = frequency / n (the proportion of observations that fall into the interval).

d) Plot a rectangle above each interval whose height is proportional to the frequency or relative frequency. If the intervals for the continuous data do not all have the same width, density (relative frequency / interval width) is a better measure to use for the histogram.
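Steps a) through d) can be sketched as a small table-building function. The class boundaries used here (0.5, 1.0, 1.5, 2.0, 2.5) are an assumption chosen for illustration, not the classes used in the handout; the data are the high-quality fabric values (H) from Example 4, and the name `histogram_table` is made up for this sketch.

```python
from bisect import bisect_right


def histogram_table(data, edges):
    """Build rows of (left, right, count, relative frequency, density)
    for left-closed classes [edges[i], edges[i+1]).

    Values outside the outermost edges are not counted.
    """
    n = len(data)
    counts = [0] * (len(edges) - 1)
    for x in data:
        i = bisect_right(edges, x) - 1  # index of the class containing x
        if 0 <= i < len(counts):
            counts[i] += 1
    rows = []
    for i, count in enumerate(counts):
        width = edges[i + 1] - edges[i]
        rel = count / n
        rows.append((edges[i], edges[i + 1], count, rel, rel / width))
    return rows


# Extensibility (%) of the high-quality fabric specimens, Example 4.
H = [1.2, 0.9, 0.7, 1.0, 1.7, 1.7, 1.1, 0.9, 1.7, 1.9, 1.3, 2.1,
     1.6, 1.8, 1.4, 1.3, 1.9, 1.6, 0.8, 2.0, 1.7, 1.6, 2.3, 2.0]

for left, right, count, rel, density in histogram_table(H, [0.5, 1.0, 1.5, 2.0, 2.5]):
    print(f"[{left}, {right}) count={count} rel={rel:.4f} density={density:.4f}")
```

Because the density column divides by the class width, the areas of the rectangles (density times width) add up to 1 even when the classes have unequal widths.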

The following is the histogram for Example 5.

[pic]

The following are the histogram and the frequency distribution for Example 3 (flow rate).

[pic]

|Rate |Count |Relative frequency |Cumulative relative frequency |Density |
|[1,3) |2 |2/129 |2/129 ≈ 0.0155 |1/129 |
|[3,5) |18 |18/129 |20/129 ≈ 0.1550 |9/129 |
|[5,7) |42 |42/129 |62/129 ≈ 0.4806 |21/129 |
|[7,9) |25 |25/129 |87/129 ≈ 0.6744 |12.5/129 |
|[9,11) |25 |25/129 |112/129 ≈ 0.8682 |12.5/129 |
|[11,13) |9 |9/129 |121/129 ≈ 0.9380 |4.5/129 |
|[13,15) |3 |3/129 |124/129 ≈ 0.9612 |1.5/129 |
|[15,17) |4 |4/129 |128/129 ≈ 0.9922 |2/129 |
|[17,19) |1 |1/129 |129/129 = 1 |0.5/129 |

Rule of thumb: number of classes ≈ $\sqrt{n}$

What to Look For In Your Graph: (Use with stem-and-leaf & histogram)

1. The center of the distribution

2. The overall Shape of the distribution.

Unimodal

• Symmetric – portions on each side of the center value are mirror images of each other

• Skewed left (negatively skewed) – the left tail (lower values) is stretched out longer than the right tail (higher values)

• Skewed right (positively skewed) – the right tail (higher values) is stretched out longer than the left tail (lower values)

Thus, whichever direction the curve is pulled – that is the direction in which it is skewed.

Bimodal

Multimodal

3. Marked deviations from the overall shape of the distribution.

• Outliers – individual observations that fall well outside the overall pattern of the graph

• Gaps in the distribution

For the intersections data, the center of the distribution is around 2 intersections. The graph is skewed to the right with one major distinct peak (unimodal). The observations at 8 intersections may be outliers. Note one gap: no subdivision has 7 intersections.

For the flow rate data, the center of the distribution is around 7 L/min. The graph is skewed to the right with one major distinct peak (unimodal). There are no major gaps or outliers.

For categorical data (such as the type of car each student drives), we can count the number of students in each category of car. With the categories on the horizontal axis and the counts on the vertical axis, drawing a bar above each category whose height equals its frequency creates the histogram (for categorical data this is usually called a bar chart).

Measures of location:

The sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, is the arithmetic average.

The population mean is $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.

There is only one mean for a quantitative data set. Its value is influenced by extreme measurements.

Note that the sample mean is a statistic, whereas the population mean is a parameter.

The sample median, $\tilde{x}$, is the middle value when the measurements are arranged from lowest to highest. If n is odd, the median is the observation with exactly (n-1)/2 values below it and (n-1)/2 values above it in the ordered list. If n is even, the median is the average of the two middle values, so that n/2 values lie below it and n/2 values lie above it in the ordered list.

There is only one median for a quantitative data set, and its value is not likely to be influenced by a few extreme measurements.

The mode is the most frequently occurring value. This measure may not be unique in that two (or more) values may occur with the same greatest frequency.

There can be more than one mode for a data set. The mode is applicable to both quantitative and qualitative data. Its value is not likely to be influenced by a few extreme measurements.

Note that a distribution is negatively skewed if mean < median and positively skewed if median < mean.

- the pth percentile is [pic]

Trimmed mean is a compromise between the mean and the median. A 5% trimmed mean would be computed by eliminating the smallest 5% and the largest 5% of the sample and averaging what is left over.

Sample proportion is the number of successes divided by the total number of observations.
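The measures of location above can be checked on the high-quality fabric data (H) from Example 4 with Python's standard `statistics` module. The helper `trimmed_mean` is a sketch written for this illustration; it assumes that the number of observations dropped from each end is 5% of n rounded to the nearest integer, which reproduces the TrMean value reported for H in this handout but is not guaranteed to match every software package.

```python
import statistics

# Extensibility (%) of the high-quality fabric specimens, Example 4.
H = [1.2, 0.9, 0.7, 1.0, 1.7, 1.7, 1.1, 0.9, 1.7, 1.9, 1.3, 2.1,
     1.6, 1.8, 1.4, 1.3, 1.9, 1.6, 0.8, 2.0, 1.7, 1.6, 2.3, 2.0]


def trimmed_mean(data, percent=5):
    """Mean after dropping `percent`% of the observations from each end.

    Assumption of this sketch: the count dropped from each end is
    percent% of n, rounded to the nearest integer.
    """
    ordered = sorted(data)
    k = round(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


print("mean        ", round(statistics.mean(H), 4))
print("median      ", statistics.median(H))
print("mode(s)     ", statistics.multimode(H))
print("trimmed mean", round(trimmed_mean(H), 4))
```

For these data the mean is slightly below the median, so by the rule above the sample leans toward negative skew.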

Measures of variability:

The sample range measures the distance between the largest and smallest observations: $R = x_{(n)} - x_{(1)}$.

It is sensitive to outliers and provides no information on patterns of variability.

Interquartile range (IQR = Upper Quartile - Lower Quartile = Q3 - Q1): It is the range of the middle half of the distribution. Your textbook calls it the fourth spread.

It is not sensitive to outliers.

Deviations from the mean: $x_i - \bar{x}$, where $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.

The sample variance and standard deviation are $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $s = \sqrt{s^2}$, respectively, for a sample of size n, whereas the population variance and standard deviation are $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ and $\sigma = \sqrt{\sigma^2}$, respectively, for a population of size N.

The standard deviation is the most commonly used measure of variability and is sensitive to outliers.

Note that the sample variance and standard deviation are statistics, whereas the population variance and standard deviation are parameters.

Example 6:

|Random variable X |Mean |Median |Q1 |Q3 |s |IQR |R |
|-100, -50, 0, 50, 100 |0 |0 |-50 |50 |79.05694 |100 |200 |
|-100, -50, 0, 0, 50, 100 |0 |0 |-50 |50 |70.71068 |100 |200 |
|-200, -100, -50, 0, 50, 100 |-33.3333 |-25 |-100 |50 |108.0123 |150 |300 |
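The entries of Example 6 can be reproduced with a short script. The quartiles in that table appear to follow the "fourths" convention (the median of the lower half and of the upper half, with the middle value included in both halves when n is odd); that reading is an assumption of this sketch, and the helper names `fourths` and `spread_summary` are made up for illustration.

```python
import statistics


def fourths(data):
    """Lower and upper fourths: medians of the lower and upper halves.

    When n is odd, the middle observation is included in both halves
    (assumed convention for this sketch).
    """
    ordered = sorted(data)
    half = (len(ordered) + 1) // 2
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[-half:])
    return lower, upper


def spread_summary(data):
    """Sample standard deviation, fourth spread (IQR) and range."""
    q1, q3 = fourths(data)
    return {
        "s": statistics.stdev(data),  # divides by n - 1
        "IQR": q3 - q1,
        "R": max(data) - min(data),
    }


for sample in ([-100, -50, 0, 50, 100],
               [-100, -50, 0, 0, 50, 100],
               [-200, -100, -50, 0, 50, 100]):
    print(sample, spread_summary(sample))
```

Comparing the first two rows shows that adding an observation at the mean lowers s while leaving IQR and R unchanged; the third row shows all three measures growing when one extreme value is added.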

Coefficient of variation (CV): A unit-free measure of variation (the amount of variability relative to the value of the mean), whereas the variance and standard deviation measure variability in a way that depends on the units of measurement. CV = $100 \cdot s / \bar{x}$.

Example 7: When the heights of students (in inches) and their weights (in pounds) are recorded, the two variables have different units, so the CV is used to decide which data set has more variation.
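Example 7 gives no data, so the sketch below applies the CV formula to the two fabric samples from Example 4 instead. The function name `coefficient_of_variation` is an assumption of this illustration; it uses the sample standard deviation (divisor n - 1) and the sample mean.

```python
import statistics

# Extensibility (%) data from Example 4.
H = [1.2, 0.9, 0.7, 1.0, 1.7, 1.7, 1.1, 0.9, 1.7, 1.9, 1.3, 2.1,
     1.6, 1.8, 1.4, 1.3, 1.9, 1.6, 0.8, 2.0, 1.7, 1.6, 2.3, 2.0]
P = [1.6, 1.5, 1.1, 2.1, 1.5, 1.3, 1.0, 2.6]


def coefficient_of_variation(data):
    """CV = 100 * s / x-bar, a unit-free measure of relative variability."""
    return 100 * statistics.stdev(data) / statistics.mean(data)


print("CV of H:", round(coefficient_of_variation(H), 2))
print("CV of P:", round(coefficient_of_variation(P), 2))
```

Both samples are in the same units here, so the comparison is only a demonstration of the arithmetic; the CV is most useful when the units or the typical magnitudes differ, as with heights and weights.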

The following Minitab output summarizes some of the measures of location and variability for flow rate data.

Variable n Mean Median TrMean StDev SE Mean

Rate 129 7.708 7.000 7.540 3.077 0.271

Variable Minimum Maximum Q1 Q3

Rate 2.200 18.900 5.600 9.600

The following Minitab output summarizes some of the measures of location and variability for the extensibility of high-quality versus low-quality fabric.

Variable n Mean Median TrMean StDev SE Mean

H: 24 1.5083 1.6000 1.5091 0.4442 0.0907

P: 8 1.588 1.500 1.588 0.530 0.188

Variable Minimum Maximum Q1 Q3

H: 0.7000 2.3000 1.1250 1.8750

P: 1.000 2.600 1.150 1.975

The following Minitab output summarizes some of the measures of location and variability for intersections data.

Variable n Mean Median TrMean StDev SE Mean

z: 47 2.277 1.000 2.116 2.253 0.329

Variable Minimum Maximum Q1 Q3

z: 0.000 8.000 0.000 4.000

Example 8:

a) If the sample mean of 10 observations is 50 and the 11th observation is 50, what is the sample mean of the 11 observations? If the 11th observation is 40, would your answer still be the same? Explain.

b) If the sample mean and variance of 10 observations are 50 and 3.25 and the 11th observation is 50, what is the sample variance of the 11 observations? If the 11th observation is 40, would your answer still be the same? Explain.

c) If the deviations from the mean for 5 of the observations in a data set of 10 observations are -0.3, 0.1, 2, 1.4, -1.7, what is the sum of the remaining 5 deviations from the mean?

Question: Let c be a constant and let X and Y be random variables. How would the mean and variance change if you

• add the same constant to each observation (yi=xi+c, i=1,2,...,n)

• multiply each observation by the same constant (yi=cxi, i=1,2,...,n)
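One way to check an answer to this question numerically is to transform a small data set and compare the summaries. The sketch below uses the third data set from Example 6 and a constant c = 3, which is an arbitrary value picked for illustration.

```python
import statistics

x = [-200, -100, -50, 0, 50, 100]  # third data set in Example 6
c = 3                               # hypothetical constant

shifted = [xi + c for xi in x]      # y_i = x_i + c
scaled = [c * xi for xi in x]       # y_i = c * x_i

for label, y in (("original", x), ("shifted", shifted), ("scaled", scaled)):
    print(f"{label:>8}: mean={statistics.mean(y):.4f} "
          f"variance={statistics.variance(y):.4f}")
```

Comparing the printed rows shows how each summary responds to a shift and to a change of scale; trying a negative c is a useful extra check.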

Boxplots

Boxplots are formed using what is called the five number summary:

1. minimum

2. first (lower) quartile, 25th percentile, Q1.

3. median, 50th percentile, Q2.

4. third (upper) quartile, 75th percentile, Q3.

5. maximum

Boxplots are ideal for comparing two populations (samples) when measuring a continuous random variable.

1. The ends of the box are at the quartiles. The length of the box is Q3-Q1. This box will contain 50% of the data values

2. The median is marked by a line within the box

3. The two vertical lines (called whiskers) outside the box extend to the smallest and largest observations within 1.5×IQR of the edges of the box.

4. Observations beyond the whiskers (that is, farther than 1.5×IQR from the nearest edge of the box) are called outliers. In general, an outlier is an observation that is much larger or smaller than the rest of the data. Observations that fall between 1.5×IQR and 3×IQR from the edge to which they are closest are called mild outliers. Observations that fall more than 3×IQR from the edge to which they are closest are called extreme outliers.

Comparative boxplot for the extensibility of high-quality versus low-quality fabric:

[pic]

Boxplot for flow rate:

[pic]

Boxplot for intersections:

[pic]

For all the boxplots above,

|Variable |1.5∙IQR |Q1-1.5∙IQR |Q3+1.5∙IQR |
|H |1.5(1.875-1.125)=1.125 |0 |3 |
|P |1.5(1.975-1.15)=1.2375 |-0.0875 |3.2125 |
|Rate |1.5(9.6-5.6)=6 |-0.4 |15.6 |
|Z |1.5(4-0)=6 |-6 |10 |
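The outlier limits above and the mild/extreme rule can be sketched as two small functions. The names `fences` and `classify` are assumptions of this illustration. For the H data the quartiles are computed with `statistics.quantiles` and its default `'exclusive'` method, which reproduces the Q1 and Q3 reported for H in this handout; for the flow rate data only the reported summary values are used, since the raw data are not listed here.

```python
import statistics


def fences(q1, q3, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(x, q1, q3):
    """Label x as an extreme outlier, a mild outlier, or not an outlier."""
    inner_low, inner_high = fences(q1, q3, 1.5)
    outer_low, outer_high = fences(q1, q3, 3.0)
    if x < outer_low or x > outer_high:
        return "extreme outlier"
    if x < inner_low or x > inner_high:
        return "mild outlier"
    return "not an outlier"


# High-quality fabric data (Example 4): quartiles from the raw data.
H = [1.2, 0.9, 0.7, 1.0, 1.7, 1.7, 1.1, 0.9, 1.7, 1.9, 1.3, 2.1,
     1.6, 1.8, 1.4, 1.3, 1.9, 1.6, 0.8, 2.0, 1.7, 1.6, 2.3, 2.0]
q1, q2, q3 = statistics.quantiles(H, n=4)
print("H five-number summary:", min(H), q1, q2, q3, max(H))
print("H fences:", fences(q1, q3))

# Flow rate data: reported Q1 = 5.6, Q3 = 9.6, minimum 2.2, maximum 18.9.
print("rate maximum:", classify(18.9, 5.6, 9.6))
print("rate minimum:", classify(2.2, 5.6, 9.6))
```

An observation that lies exactly on a fence is treated as not an outlier here, which matches the wording "within 1.5×IQR of the edges of the box".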

INTERPRETING BOXPLOTS

• Note the position of the median. A median that is not in the middle of the box can indicate skewness in the middle 50% of the data as well as in the whole data set. Recall that the mean is drawn in the direction of the skewness. Thus the part of the box on the side of the skewness will be noticeably longer.

• Note the length of the whiskers and the outliers. If the data are symmetric, the whiskers will be of approximately equal length.

Coefficient of Skewness (SK): The direction of and degree to which a frequency distribution is skewed (SK < 0 → negatively skewed; SK > 0 → positively skewed).

Example 9: A computer scientist is investigating the usefulness of two different design languages in improving programming tasks. Twelve expert programmers, familiar with both languages, are asked to code a standard function in both languages, and the time in minutes is recorded.

|Programmer |1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |
|Design language 1 |17 |16 |21 |14 |18 |24 |16 |14 |21 |23 |13 |18 |
|Design language 2 |18 |14 |19 |11 |23 |21 |10 |13 |19 |24 |15 |20 |

The following are the descriptive statistics and the comparative boxplots obtained by MINITAB.

Variable n Mean Median TrMean StDev SE Mean

Design1 12 17.92 17.50 17.80 3.63 1.05

Design2 12 17.25 18.50 17.30 4.59 1.33

Variable Minimum Maximum Q1 Q3

Design1 13.00 24.00 14.50 21.00

Design2 10.00 24.00 13.25 20.75

[pic]

Example 10 (Exercise 1.60): Observations on burst strength (lb/in²) were obtained both for test nozzle closure welds and for production canister nozzle welds.

The following are the descriptive statistics and the comparative boxplots obtained by MINITAB.

Variable n n* Mean Median TrMean StDev

Test 11 1 7355 7300 7389 614

Cannister 12 0 5887.5 5887.5 5880.0 317.9

Variable SE Mean Minimum Maximum Q1 Q3 SK

Test 185 6100 8300 7200 8000 -0.463

Cannister 91.8 5250.0 6600.0 5725.0 6037.5 0.314

[pic]
