Day 1

Lecture 1 – Basic Descriptive Techniques

Describing Data

Three Descriptive Techniques

1. Tables.

Regular Frequency Distribution, Grouped Frequency Distribution

2. Graphs/ plots

Bar Graph, Histogram, Dot Plot, Stem and Leaf Display, Box and Whisker plot, Scatterplot

3. Numeric Summaries

Mean, median, standard deviation, variance, Pearson correlation coefficient

Major considerations in describing

1. The data type - important for choice of table or graph.

Categorical: Differences between people are qualitative, not quantitative

Quantitative: Differences between people are quantitative

2. The way the data are distributed - important for choice of numeric summary.

Two basic shapes of distributions: Symmetric and Skewed

3. Outliers.

Data which are extreme in some form.

Data point outlier: A single datum which is extreme relative to the other values of the variable.

May be extreme in one dimension or extreme in two or more dimensions.

Case outlier: A case whose values on one or more variables are extreme.

Variable Outlier: A variable whose values aren't as they were expected to be.

E.g., an aberrant question in a set of questions forming a scale.

A variable whose correlation with other variables is negative rather than the expected positive.

Study outlier: A study whose results were extreme relative to expectations.

Example: A study that reported a negative relationship of teacher evaluations to learning

Tabular Techniques

Regular or Ungrouped Frequency Distribution (M, et al., Table 2.2, p. 18)

Definition: An ordered list of all possible score values, from the largest observed value down to the smallest observed value, with the frequency/percentage of each value. Interior values that could have occurred but didn’t are represented.

Best used for: Categorical data.

Example: Responses of employees to the question

My job involves risky, serious health hazards

Response Frequency Percent

7: Strongly agree 34 17.5

6: Moderately agree 27 13.9

5: Slightly agree 33 17.0

4: Neither A nor D 20 10.3

3: Slightly Disagree 27 13.9

2: Moderately disagree 15 7.7

1: Strongly disagree 38 19.6

1. Large scores should go at the top. Note that SPSS’s default table above does not follow that rule.

2. Interior scores with 0 frequency must be represented in the table.
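The two rules above can be checked mechanically. The following is a minimal Python sketch, not SPSS output; it rebuilds the example table from the frequencies listed above. The function name `frequency_table` and its parameters are illustrative assumptions.

```python
from collections import Counter

def frequency_table(counts, lowest, highest):
    """Return (value, frequency, percent) rows, largest value first.

    Every integer value from `highest` down to `lowest` gets a row,
    so interior values with zero frequency are still represented.
    """
    total = sum(counts.values())
    rows = []
    for value in range(highest, lowest - 1, -1):
        freq = counts.get(value, 0)
        rows.append((value, freq, round(100 * freq / total, 1)))
    return rows

# Frequencies copied from the job-hazard example above (7 = Strongly agree).
hazard_counts = Counter({7: 34, 6: 27, 5: 33, 4: 20, 3: 27, 2: 15, 1: 38})
hazard_rows = frequency_table(hazard_counts, lowest=1, highest=7)

for value, freq, pct in hazard_rows:
    print(f"{value}  {freq:>3}  {pct:>5.1f}")
```

Looping over the full range of possible values, rather than over the observed values only, is what keeps zero-frequency rows in the table.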

Grouped Frequency Distribution (M, et al., Table 2.3, p. 19)

Definition: An ordered list of score groups, from a group containing the largest observed value down to a group containing the smallest observed value, with the frequency of occurrence in each group.

Best used for: Continuous variables or variables with many values, e.g., Length, IQ, Job Satisfaction, Organizational Commitment.

Alas, SPSS has no procedure which produces grouped frequency distributions such as those presented in the text. If you want one, you’ll have to make it yourself.

Examples: Initial Peabody Picture Vocabulary Scores of Early Success Students

Table 1. Grouped Frequency Distribution of initial PPVT scores of Control and Experimental Groups.

| Control Group                 |  | Experimental Group            |
| PPVT Interval | N   | Percent |  | PPVT Interval | N   | Percent |
| 130-139       |   1 |     0.5 |  | 130-139       |   3 |     0.8 |
| 120-129       |   0 |     0.0 |  | 120-129       |   7 |     1.8 |
| 110-119       |  17 |    10.1 |  | 110-119       |  32 |     8.3 |
| 100-109       |  22 |    13.1 |  | 100-109       |  69 |    17.9 |
| 90-99         |  30 |    17.9 |  | 90-99         |  96 |    24.9 |
| 80-89         |  42 |    25.0 |  | 80-89         |  90 |    23.4 |
| 70-79         |  31 |    18.5 |  | 70-79         |  50 |    13.0 |
| 60-69         |  20 |    11.9 |  | 60-69         |  31 |     8.1 |
| 50-59         |   3 |     1.8 |  | 50-59         |   5 |     1.3 |
| 40-49         |   1 |     0.5 |  | 40-49         |   2 |     0.5 |
| Total         | 167 |     100 |  | Total         | 385 |     100 |

Rules for creating tables:

0. Put tables side-by-side when comparing groups.

1. Largest scores are at the top of the table.

2. When comparing groups, scales for both groups must be identical.

40-49, 50-59, . . . 130-139 in both of the above tables.

3. Percent column must be included if group sizes are different.

4. Interior intervals with 0 frequency must be represented.
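Since a grouped frequency distribution has to be built by hand, a short script can do the counting. This is a minimal Python sketch, assuming integer scores and intervals aligned to multiples of the interval width; the function name `grouped_frequency_table` and the example scores are hypothetical, not the PPVT data.

```python
def grouped_frequency_table(scores, width=10):
    """Return (low, high, frequency, percent) rows, highest interval first.

    Intervals are aligned to multiples of `width` (40-49, 50-59, ...),
    and interior intervals with zero frequency are kept.
    """
    lowest_start = (min(scores) // width) * width
    highest_start = (max(scores) // width) * width
    rows = []
    for start in range(highest_start, lowest_start - 1, -width):
        freq = sum(1 for s in scores if start <= s < start + width)
        percent = round(100 * freq / len(scores), 1)
        rows.append((start, start + width - 1, freq, percent))
    return rows

# Hypothetical integer scores, invented only to exercise the function.
example_scores = [45, 62, 67, 71, 88, 85, 90, 93, 97, 131]
example_rows = grouped_frequency_table(example_scores)

for low, high, freq, pct in example_rows:
    print(f"{low}-{high}  {freq:>2}  {pct:>5.1f}")
```

When comparing two groups, the minimum and maximum of the combined scores would have to be used for both tables so that the interval scales are identical.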

Stem & Leaf Display (Not in M, et al.)

Definition: An ordered representation of scores in which rows represent score intervals and numbers within rows represent individual values. The rows are called stems and the numbers within rows are called leaves.

Popularized by John Tukey.

The most straightforward such table is one representing two-digit scores. In this case, rows correspond to the first digit of each number. Within each row, each number is represented by its last digit.

For example, consider the following values (the single-digit value 9 is treated as 09) . . .

24 29 40 58 42 9 15 20 78 90 96 26 10 16 38 46 29 65 82 71 81 45 52 68 49 94

These would be represented in a stem & leaf display as follows . . .

Stems | Leaves

0 | 9

1 | 5 0 6

2 | 4 9 0 6 9

3 | 8

4 | 0 2 6 5 9

5 | 8 2

6 | 5 8

7 | 8 1

8 | 2 1

9 | 0 6 4

Usually, the leaves are ordered from smallest to largest within stems . . .

Stems | Ordered Leaves

0 | 9

1 | 0 5 6

2 | 0 4 6 9 9

3 | 8

4 | 0 2 5 6 9

5 | 2 8

6 | 5 8

7 | 1 8

8 | 1 2

9 | 0 4 6
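The ordered display can be reproduced with a few lines of code. This is a minimal Python sketch that uses the values listed above; the function name `stem_and_leaf` is an illustrative assumption, and it handles only non-negative integers below 100.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        leaves_by_stem[v // 10].append(v % 10)
    # Keep every stem between the smallest and largest, even if empty.
    return {stem: leaves_by_stem.get(stem, [])
            for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1)}

scores = [24, 29, 40, 58, 42, 9, 15, 20, 78, 90, 96, 26, 10, 16, 38,
          46, 29, 65, 82, 71, 81, 45, 52, 68, 49, 94]

display = stem_and_leaf(scores)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Sorting the values before splitting them into stems and leaves is what orders the leaves within each row.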

Graphs

Bar Graph or Bar Chart

Definition: Columns whose location represents value and whose length represents frequency.

Used for: Categorical data

Graphical equivalent of the Regular Frequency Distribution.

Creation: Produced by SPSS Frequencies procedure

(Analyze -> Descriptive Statistics -> Frequencies)

Or by SPSS Graphs -> Bar…

Example: Types of jobs held by respondents in a manufacturing facility.

[pic]

Histogram

Definition: Columns whose location represents value and whose area represents frequency.

Columns are usually contiguous and of equal width so height represents frequency.

Used for: Continuous or quantitative data, usually grouped.

Creation: SPSS Frequencies procedure

(Analyze -> Descriptive Statistics -> Frequencies)

Or by SPSS Graphs -> Histogram…

Example 1: Distribution of salaries of employees at a bank in the 1970s

[pic]

[pic]

Rules:

1. If you’re comparing groups, stack the histograms vertically, one above the other.

2. Make x-axis scales of histograms being compared identical.

Box and Whisker Plot

A single group

A representation of the following five values:

Maximum

3rd Quartile (75th Percentile point)

Median (50th Percentile point)

1st Quartile (25th Percentile point)

Minimum

Used for ordered response and "many valued" variables.

Produced by SPSS Examine procedure (Analyze -> Descriptive Statistics -> Explore)

Or by SPSS Graphs -> Boxplot…
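The five values that a box and whisker plot represents can be computed directly. This is a minimal Python sketch using the scores from the stem & leaf example; the function name `five_number_summary` is an illustrative assumption. Quartiles can be defined in several ways, so SPSS may report slightly different quartile values than this sketch does.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return the five values a box-and-whisker plot represents."""
    # method="inclusive" is one of several quartile conventions; other
    # software may interpolate differently and give different quartiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"minimum": min(values), "q1": q1, "median": median(values),
            "q3": q3, "maximum": max(values)}

# Scores from the stem & leaf example.
scores = [24, 29, 40, 58, 42, 9, 15, 20, 78, 90, 96, 26, 10, 16, 38,
          46, 29, 65, 82, 71, 81, 45, 52, 68, 49, 94]

summary = five_number_summary(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```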

[pic]

Comparing Groups

[pic]

Description Using SPSS

The FREQUENCIES procedure

This example will use data from a consulting project involving victims of ATV accidents.

Entering the data (‘G:MDBT\InClassDatasets\ATVDataForClass050906.sav’)

The SPSS Data Editor showing numbers actually entered.

[pic]

Same data with View -> Value Labels chosen

[pic]

The dependent variable for this example is a measure of injury severity, appropriately named the Injury Severity Score (ISS)

The ISS is a quantity that is computed for hospital patients based on examination of trauma on several parts of the body. Each part is assigned a value. The ISS is a composite of those individual body-area values. The larger the number, the more severely injured the patient. The value, 0, means essentially no injury.

1. Whole sample analysis.

Menu Sequence to access the FREQUENCIES procedure

[pic]

Dialog with FREQUENCIES PROCEDURE

Specifying which variables to analyze

[pic]

Choosing specifics

[pic]

The FREQUENCIES output.

Frequencies

[pic]

[pic]

2. Comparing Groups using the EXPLORE procedure

One of the goals of the ATV study was to determine whether severity of injury was related to whether or not the rider was wearing a helmet.

So the variable, helmet, is an independent variable in the study.

PASW/SPSS FREQUENCIES output for helmet

[pic]

Missing value: A value entered in the absence of a valid data value.

Sometimes missingness is represented by simply leaving the cell in the data editor empty.

Sometimes it is represented by a specific value.

SPSS has to be told that a specific value represents missingness.
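The same idea applies outside SPSS: a code that stands for missingness has to be declared before any counting or averaging, or it will be treated as a real value. This is a minimal Python sketch; the missing code 9, the helmet codes, and the function name `split_valid_and_missing` are all hypothetical, since the codes used in the ATV data file are not given here.

```python
MISSING_CODE = 9  # hypothetical code; the actual code in the data file is not stated

def split_valid_and_missing(values, missing_code=MISSING_CODE):
    """Return the valid values and the number of missing ones.

    A value counts as missing if the cell is empty (None) or if it
    equals the declared missing code.
    """
    valid = [v for v in values if v is not None and v != missing_code]
    return valid, len(values) - len(valid)

# Hypothetical helmet codes: 0 = no helmet, 1 = helmet, None = empty cell.
helmet = [0, 1, 0, 9, 1, None, 0, 9]
valid, n_missing = split_valid_and_missing(helmet)
print(len(valid), "valid,", n_missing, "missing")
```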

The EXPLORE procedure

A procedure in SPSS designed to allow comparison of groups using a variety of descriptive techniques.

The EXPLORE main dialog window

[pic]

Analysis specifics

[pic]

[pic]

The EXPLORE Output

[pic]

[pic]

The Histograms

[pic]

[pic]

[pic]

Note:

1. The histograms are stacked vertically – following the rule for comparing groups using histograms.

2. The histograms have identical x-axis scales and equal column widths.

To manipulate x-axis labels.

1. Double-click on the histogram to open the Chart Editor window.

[pic]

2. Double-click on one of the x-axis numbers

[pic]

3. Then click on Scale and choose the appropriate scale values – in this case I chose 0, 80, and 10 for Minimum, Maximum, and Major Increment

[pic]

4. Click on Apply.

To manipulate column width

1. Double-click the figure to open the Chart Editor window.

2. Double-click on a column.

[pic]

3. Click on Binning, then click on Custom, and enter the desired width. I entered 5.

[pic]

4. Click on Apply.
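The effect of the scale and binning settings can be illustrated with a small script. This is a minimal Python sketch, assuming integer scores, an x-axis running from 0 to 80, and a column width of 5 as in the steps above. The scores are invented for illustration and are not the ATV data, and SPSS may assign values that fall exactly on a bin boundary differently.

```python
BIN_WIDTH = 5

def bin_counts(values, minimum=0, maximum=80, width=BIN_WIDTH):
    """Count values in each bin [start, start + width) from minimum to maximum."""
    counts = {start: 0 for start in range(minimum, maximum, width)}
    for v in values:
        if minimum <= v < maximum:
            start = minimum + ((v - minimum) // width) * width
            counts[start] += 1
    return counts

# Hypothetical scores, invented for illustration only.
iss_example = [0, 1, 4, 5, 9, 9, 10, 17, 22, 25, 34, 75]
iss_counts = bin_counts(iss_example)

for start, count in iss_counts.items():
    print(f"{start:>2}-{start + BIN_WIDTH - 1:<2} {'*' * count}")
```

Using the same minimum, maximum, and width for every group is what makes the resulting columns directly comparable across histograms.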

The Goals of Descriptive Statistics

What kinds of characteristics can a collection of numbers have?

People can be kind, aloof, gregarious, tall, friendly, mean, spacy, etc. Cities can be forward-looking, violent, progressive, etc. Cars can be fast, economical, stylish, ugly, heavy, etc.

Just as there are certain characteristics which seem to "belong" to people or cities or cars, there are a few characteristics which "belong" to collections of numbers and which statisticians feel should be mentioned whenever an attempt is made to describe a collection.

The Big Three Characteristics

1: Central Tendency

The first characteristic is called the central tendency. (It's also called "average" value, location, and expected value.) It reflects the sizes of the numbers in the collection.

Consider the following weights: 230, 260, 305, 195.

Compare them with the following: 115, 120, 105, 94, 110, 115, 100, 90, 85.

Even though the second collection has more scores in it, the central tendency of the first is larger. The scores in the first collection are larger than those in the second.
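The comparison can be checked numerically. This is a minimal Python sketch using the mean and median as measures of central tendency; the variable names are illustrative, and the second list is read as nine separate weights.

```python
from statistics import mean, median

first_weights = [230, 260, 305, 195]
# Read as nine values, with 100 and 90 as separate weights.
second_weights = [115, 120, 105, 94, 110, 115, 100, 90, 85]

for label, weights in (("first", first_weights), ("second", second_weights)):
    print(f"{label}: n={len(weights)}, "
          f"mean={mean(weights):.1f}, median={median(weights)}")
```

Both summaries are larger for the first collection even though it contains fewer scores.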

2: Variability

The second important characteristic of collections of numbers is the variability of the values. It is also called the dispersion, heterogeneity or width of the values. This characteristic reflects the differences between the values. If all the values are close to each other we say that variability is small. If the values in the collection are quite different from each other, we say that variability is large.

Consider the following collection: 150, 155, 158, 160, 153, 156, 152.

Compare it with: 85, 175, 305, 95, 130.

Note that the scores in the second collection are quite different from each other. Thus, the second collection is more variable than the first.
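The difference in variability can be quantified as well. This is a minimal Python sketch using the range and the sample standard deviation (the `statistics.stdev` function, which divides by n - 1); the helper name `value_range` is an illustrative assumption.

```python
from statistics import stdev

first = [150, 155, 158, 160, 153, 156, 152]
second = [85, 175, 305, 95, 130]

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

for label, values in (("first", first), ("second", second)):
    print(f"{label}: range={value_range(values)}, sd={stdev(values):.1f}")
```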

3: Shape

Shape refers to the way score values are positioned or placed on the number line.

In some distributions, the scores are all piled up on one side or the other.

In others, the scores are piled up in the middle.

Shape will be considered in detail after graphical methods of description have been introduced.

Other Characteristics

We will consider the relationship or correlation between paired data later in the course.

Mike – Make memo assignment 1 here.

Use this as opportunity to get PINs. Put name and PIN on back of this homework assignment.

Future assignments, put only PIN on back, no other identifying information anywhere.

-----------------------


A nominal variable.

The distributions are positively skewed.

Comparison of the histograms suggests that male salaries were higher than female salaries in this organization.

The distributions are slightly positively skewed.

It appears that females are a little older than males in this organization.

Find the variable’s name in the leftmost field, highlight it, and click on the arrow between the fields to move it to the rightmost field.

To create a graph, click on the Charts button.

ISS is a continuous variable, so a Histogram will be used. The normal curve overlay will be used to give a visual indication of how nearly normal the distribution is.

[pic]

The histogram has been reduced in size to make it fit on the page.

I clicked on Options and told the program to include reports for missing values.

I told the program to give me histograms.
