Statistics Help Guide




Becki Paynich

I. A Few Important Things:

A. Math is the least important factor in statistics. Knowing which statistical

test to employ based on sample, level of measurement, and what you are trying to understand is the most important part of statistics. However, you need to understand how the math works to better understand why you pick any given test.

B. Statistics is not brain surgery. Lighten up. You’ll do fine.

C. This is a help guide, not a comprehensive treatment of statistics. Please

refer to a good Statistics textbook for a deeper understanding of statistics.

II. Levels of Measurement: Knowing the level of measurement is extremely

important in statistics. Most tests of significance are explicit about the level of measurement they require. If the requirements for the level of measurement are not adhered to, results lose validity. For example, Pearson’s r assumes that variables are at least interval level. If a Pearson’s r is employed on nominal or ordinal level variables, results cannot be directly interpreted (pretty much your results are meaningless).

A. Nominal—a variable that is measured in categories that cannot be ranked.

Usually nominal variables are collapsed into the smallest number of categories possible (0,1) so that they may be used in analysis. A variable in this form (dichotomous) can be used in virtually any type of analysis. For example: the variable RACE may be originally separated into 12 different categories. These 12 might be collapsed into 2 (most likely being Caucasian=1 and Non-Caucasian=0). The important thing to remember is that collapsing categories into larger ones can hide important relationships between them. For example, the difference between African-Americans and Native Americans on a given variable may be important but if they are both collapsed into the category of Non-Caucasian, this relationship will be obscured. Another thing to keep in mind is that a dichotomous or dummy variable (one with only two categories as in above example) is the most powerful form, statistics wise, for a nominal variable to be in.

B. Ordinal—a variable that can be ranked but the exact difference between

categories cannot be identified. For example: The categories of Strongly Agree, Agree, No Opinion, Disagree, and Strongly Disagree can be differentiated by order, but one does not know the exact mathematical difference between two respondents who answer Strongly Agree and Agree respectively. Another example would be categorical ranges. Income, for example, can be grouped into the following ranges: $0-$10,000; $10,001-$20,000; and $20,001+. While this variable is certainly measured with numbers, one cannot tell the exact difference between a respondent who selected the $0-$10,000 range and a respondent who selected the $10,001-$20,000 range. In theory, they could be $1.00 apart or nearly $20,000 apart in income. A final form of ordinal variable, which is actually treated as interval-level, arises when one adds together multiple ordinal-level responses. For example, I may want to make an index of how well a student likes the course materials by combining his/her responses to questions about the books, the web page, and the class notes. Thus, if a student answered Strongly Agree to questions about these three items, and the score for Strongly Agree is 5, the student’s total score for this new index is 15.

Ordinal variables can also be and often are collapsed into a smaller number of categories. For example: A variable measuring rank may have 8 or more original categories that can be collapsed into the categories of administration=1 and non-administration=0. Or, there could be three categories with a middle category being middle-management. The problem again becomes sacrificing important relationships that may be obscured by the collapse for ease in analysis. The best rule of thumb is to run analyses with all the categories in their smaller form first to identify important relationships between them and then run analyses again after collapsing the variables to identify important relationships across larger ones.

** When you create nominal and ordinal level questions, make sure they are mutually exclusive (no respondent can fit into more than one category) and exhaustive (you have listed all possibilities). Also, when designing numerical ranges, make sure that the intervals have the same width. In the following example, notice that the width for each category is $10,000:

Question: What is your income range?

1 = 0-10,000

2 = Above 10,000 to 20,000

3 = Above 20,000 to 30,000

C. Interval—a variable or scale that uses numbers to rank order. This is the best type of variable to have as you can keep it in interval form or turn it into an ordinal or nominal level variable and do virtually any type of analysis. Furthermore, the exact difference between two respondents can be identified. For example: with the variable age, when respondents report their actual age, one can mathematically determine the differences between the ages of respondents. If however, the variable is grouped into age ranges (0-5, 6-10, 11-15, 16-20…) then the variable is ordinal and not interval scale. Interval level variables can be collapsed into ordinal or even nominal categories but this is usually avoided as most statistical tests designed for interval level variables are more powerful and provide more understanding of the nature of the relationships than those tests designed for ordinal or nominal variables.

D. Ratio—a ratio scale variable is identical to interval in almost every aspect

except that it has an absolute zero. Age and Income have absolute zeros. If a variable can be considered on a ratio scale, then it is also interval scale. **Many researchers in criminal justice ignore the difference between interval and ratio as the tests of significance that we use for ratio and interval variables are the same.**

**Make sure to code variables in the most logical order.** For example, if you are trying to measure the frequency of smoking, use the highest numbers for the most amount of smoking and the lowest numbers for the least amount of smoking. For example:

Question: How often do you smoke?

0 = Never

1 = Less than once per week

2 = At least once per week

3 = 2 or 3 times per week

4 = Daily

III. Numerical Measures:

A. Measures of Central Tendency

1. Mode-the most prevalent value or category (for example, in a data set of

pets: 25 cats, 13 dogs, 12 hamsters, 47 fish. Fish is the modal category, with 47 being its frequency). The only measure of central tendency for nominal-level data is the mode. The mode can also be used with ordinal and interval-level data.

2. Median-the midpoint of a distribution. An equal number of the sample

will be above the median as will be below it. For example, in a sample of

n = 101, 50 of the scores will be above the median and 50 will be

below the median. The median is most often used for interval-level data but also works for ordinal data. When ranked from lowest to highest, the 51st score will be the median. If your n is an even number, take the average of the two middle scores.

3. Mean-the average of a value. The mean is usually not computed on

ordinal and nominal-level data because it has little explanatory value. However, many statistical tests require that the mean be computed for the formula. The mean is most appropriate for interval-level data.
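
The three measures above can be checked quickly with Python's standard library. This is only a sketch: the pet counts come from the mode example above, while the ages are made-up numbers used for illustration.

```python
from statistics import mean, median, multimode

# Nominal data: the pet counts from the mode example (25 cats, 13 dogs, ...).
pets = ["cat"] * 25 + ["dog"] * 13 + ["hamster"] * 12 + ["fish"] * 47
modal_categories = multimode(pets)  # list of the most frequent categories

# Hypothetical interval-level data: ages of seven respondents.
ages = [19, 22, 22, 25, 31, 40, 58]
age_median = median(ages)  # middle score once the values are ranked
age_mean = mean(ages)      # sum of the scores divided by n

# With an even n, median() averages the two middle scores.
even_median = median([19, 22, 25, 31])

print(modal_categories, age_median, age_mean, even_median)
```

Notice that the mean (31) sits above the median (25) here because the one older respondent pulls the average upward.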

B. Measures of Variability

1. Range—the difference between the largest and smallest values.

2. Variance—the sum of the squared deviations of n measurements

from their mean divided by (n – 1).

3. Standard Deviation—the positive square root of the variance. One

must compute the variance first to get the standard deviation: the deviations are squared because the raw deviations from the mean always sum to zero, and taking the square root returns the measure to the original units.

4. Typically, standard deviations are used as distances from the mean

in a normal curve. In a normal distribution, the interval from one standard deviation below the mean to one standard deviation above the mean contains approximately 68% of the measurements. The interval between two standard deviations below and above the mean contains approximately 95% of the measurements.

5. In a perfect normal curve, the mode, median, and mean are all the

same number.
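
A short sketch of the variability measures, using a small set of made-up scores. It also shows why the deviations have to be squared (they sum to zero otherwise) and where the 68% and 95% figures come from.

```python
from math import sqrt
from statistics import NormalDist

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
n = len(scores)
score_mean = sum(scores) / n

value_range = max(scores) - min(scores)

# Raw deviations from the mean always sum to zero, which is why we square them.
deviations = [x - score_mean for x in scores]
sum_of_deviations = sum(deviations)

# Sample variance: sum of squared deviations divided by (n - 1).
variance = sum(d ** 2 for d in deviations) / (n - 1)
standard_deviation = sqrt(variance)

# Share of a normal distribution within one and two standard deviations of the mean.
z = NormalDist()  # standard normal curve: mean 0, standard deviation 1
within_one_sd = z.cdf(1) - z.cdf(-1)
within_two_sd = z.cdf(2) - z.cdf(-2)

print(value_range, round(variance, 3), round(standard_deviation, 3))
print(round(within_one_sd, 4), round(within_two_sd, 4))
```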

IV. Probability:

A. Classic theory of probability—the chance of a particular outcome

occurring is determined by the ratio of the number of favorable outcomes (successes) to the total number of outcomes. This theory only pertains to outcomes that are equally likely and mutually exclusive (disjoint).

B. Relative Frequency Theory—if an experiment is repeated an

extremely large number of times and a particular outcome occurs a certain percentage of the time, then that percentage is close to the probability of that outcome.

C. Independent Events—outcomes not affected by other outcomes.

D. Dependent Events—outcomes affected by other outcomes.

E. Multiplication Rule

1. Joint Occurrence—to compute the probability of two or more

independent events all occurring, multiply their probabilities.

F. Addition Rule—to determine the probability of at least one of several

mutually exclusive events occurring, add their probabilities.

G. Remember to account for probabilities with and without replacement. For

example, when picking cards out of a deck, the probability of choosing the Queen of Hearts on the first try is 1/52. However, if the first card was not the Queen of Hearts and you don’t put it back before choosing a second time, the probability increases to 1/51 because there are only 51 cards left to choose from.

H. Probability Distribution is nothing more than a visual representation of the

probabilities of success for given outcomes.

I. Discrete variables are those that do not have possible outcomes in

between values. Thus, a coin toss can only result in either a heads or tails outcome. Continuous variables are those that have possible outcomes between values. For example, it is absolutely possible to be 31.26 years old.
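
The multiplication rule, the addition rule, and the card example can be worked through with exact fractions. This is a sketch; the coin and card scenarios are illustrative choices, and the variable names are made up for this example.

```python
from fractions import Fraction

# Multiplication rule (independent events): two fair coin tosses both land heads.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

# Addition rule (mutually exclusive events): one draw is a queen OR a king.
p_queen = Fraction(4, 52)
p_king = Fraction(4, 52)
p_queen_or_king = p_queen + p_king

# Without replacement: the first card is NOT the Queen of Hearts and is kept out,
# so the second draw is from the 51 remaining cards.
p_first_miss = Fraction(51, 52)
p_second_hit_after_miss = Fraction(1, 51)
p_miss_then_hit = p_first_miss * p_second_hit_after_miss

# With replacement: the deck is back to 52 cards for the second draw.
p_second_hit_with_replacement = Fraction(1, 52)

print(p_two_heads, p_queen_or_king, p_second_hit_after_miss, p_miss_then_hit)
```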

V. Sampling:

A. Independent Random Sampling: Most statistical tests are based on the

premise that samples are independently and randomly selected. If a sample was in fact purposive, the results of statistical analysis cannot be generalized outside the sample. Random sampling is done so that inferential statistics can be interpreted outside the sample. Statistical tests can still be employed on samples drawn through non-random techniques; however, their interpretation must be confined to the sample at hand, and the limitations of the sampling design must be addressed in the results and methods discussions of your research.

B. Central Limit Theorem—the larger the number of observations (the bigger

your n), the more closely the sampling distribution of the sample mean will approximate a normal curve, even when the population itself is not normal.
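
A small simulation can make the Central Limit Theorem concrete. The sketch below assumes a deliberately skewed (exponential) population, draws many samples, and compares how lopsided the sample means are for a small n versus a larger n. The population, sample sizes, number of repetitions, and helper names are all arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch gives the same numbers each run

def sample_means(n, reps=2000):
    """Draw `reps` samples of size n from a skewed population; return their means."""
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

def skewness(values):
    """Rough skewness: near 0 for a symmetric shape such as the normal curve."""
    m, s = mean(values), stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

means_small_n = sample_means(2)
means_large_n = sample_means(50)

skew_small = skewness(means_small_n)
skew_large = skewness(means_large_n)

# The means from the larger samples should be noticeably less skewed.
print(round(skew_small, 2), round(skew_large, 2))
```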

VI. Principles of Testing:

A. Research hypothesis—can be stated in a certain direction or without

direction. Directional hypotheses are tested with one-tailed tests of significance. Non-directional hypotheses are tested with two-tailed tests of significance.

B. Null hypothesis—is essentially the statement that two values are not

statistically related. If a test of a research hypothesis is not significant, then the null hypothesis cannot be rejected.

C. Type I error: (alpha) is when a researcher rejects the null hypothesis when

in fact it is true. That is, stating that two variables are significantly related when in fact they are not.

D. Type II error: (beta) is when a researcher fails to reject the null hypothesis when in fact the null hypothesis is false. That is, stating that two variables are not significantly related when in fact they are.

VII. Univariate Inferential Tests: (See also Appendix A for quick reference guide) Essentially, univariate tests look at only one variable and how the scores for that variable differ between a sample and a population, between two samples, or within groups at different points in time. Many univariate tests can be employed on proportions when that is all that is available. This is not a comprehensive listing of all univariate tests, but the most commonly used are briefly discussed below.

A. Steps in testing:

1. State the assumptions

2. State the Null and Research Hypotheses.

3. Decide on a significance level for the test and determine the test to be used.

4. Compute the value of a test statistic.

5. Compare the test statistic with the critical value to determine

whether or not the test statistic falls in the region of acceptance or the region of rejection.

B. One-sample z-test:

1. Requirements: normally distributed population. Population

variance is known.

2. Test for population mean

3. One-tailed or two-tailed.

4. This test essentially tells you if your sample mean is statistically

different from the larger population mean.

C. One-sample t-test:

1. Requirements: normally distributed population. Population

variance is unknown and is estimated from the sample.

2. Test for population mean

3. One-tailed or two-tailed.

4. This test essentially tells you if your sample mean is statistically

different from the larger population mean.

D. Two-sample t and z-tests for comparing two means:

1. Requirements: two normally distributed but independent

populations. Population variance can be known or unknown. Different formulas depending on whether the variance is known.

E. Paired Difference t-test:

1. Requirements: a set of paired observations from a normal

population. This test is usually employed to compare “before” and “after” scores. Also employed in twin studies.

F. Chi-Square for population distributions:

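1. Requirements: random sampling and expected frequencies that are not too small (see the chi-square discussion under Tests of Significance). Unlike the z and t-tests, chi-square does not assume a normally distributed population.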

2. This test basically compares frequencies from a known sample

with frequencies from an assumed population. There are many different types of chi-square tests which test different things—consistency, distribution, goodness of fit, independence…make sure you are employing the right chi-square for your needs.
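
The one-sample z and t statistics above can be sketched in a few lines. The numbers are hypothetical, the function names are made up for this example, and the critical t of 2.262 (df = 9, alpha = .05, two-tailed) is the value one would look up in a standard t table.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """z-test: the population standard deviation is known."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed

def one_sample_t(sample, pop_mean):
    """t statistic: the population variance is estimated from the sample."""
    n = len(sample)
    t = (mean(sample) - pop_mean) / (stdev(sample) / sqrt(n))
    return t, n - 1  # obtained t and degrees of freedom

# Hypothetical z-test: sample of 100 with mean 104; population mean 100, sd 15.
z_obtained, p_value = one_sample_z(sample_mean=104, pop_mean=100, pop_sd=15, n=100)

# Hypothetical t-test: ten scores compared with a population mean of 12.
scores = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
t_obtained, df = one_sample_t(scores, pop_mean=12)
critical_t = 2.262  # from a t table: df = 9, alpha = .05, two-tailed
reject_null = abs(t_obtained) > critical_t

print(round(z_obtained, 3), round(p_value, 4), round(t_obtained, 3), df, reject_null)
```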

VIII. Bivariate Relationships: Bivariate simply means analysis between two

variables. Before beginning this section it is important to note that there is a difference between measures of association and tests of significance. A Pearson’s r (a measure of association) will give you a statistic between -1 and +1: -1 is a perfect negative correlation, +1 is a perfect positive correlation, and 0 is no correlation. When SPSS provides output, it has to run a separate test to tell you if the relationship is significant. Thus, if doing stats by hand, one has to complete two different formulas: one to get the correlation and one to get the significance.

A. Measures of Association: (See also Appendix A for quick reference guide)

Nominal Variables

1. Lambda-a PRE (proportionate reduction in error) approach that

essentially tells one how much the prediction of one variable improves with knowledge of another. Lambda ranges from 0 to 1, so only the strength of the relationship between two variables can be determined, not its direction. This measure of association is used for nominal level variables and is based on modal categories.

2. Phi-a correlation coefficient used to estimate association in a 2 x 2

table.

3. Cramer’s V- a correlation coefficient to estimate association in

tables larger than 2 x 2.

Ordinal Variables

1. Gamma-the most frequently used PRE measure of ordered cross-

tabular association. It ranges from -1 to +1 thus the direction and the strength of the relationship can be determined with gamma. Gamma is used for ordinal level variables. Gamma is based on

concordant and discordant pairs. Gamma does not account for tied pairs.

2. Kendall’s tau b- like gamma, is a PRE measure based on

concordant and discordant pairs, but accounts for tied pairs on both the independent and dependent variables. Kendall’s tau b ranges from -1 to +1. Use Kendall’s tau b for ordinal variables that, when arranged in a table, have an equal number of rows and columns.

3. Kendall’s tau c- same as Kendall’s tau b but used when the

number of rows and columns is not equal.

4. Somers’s dxy – an asymmetric measure of association that accounts for tied pairs

on the dependent variable only. Somers’s dxy will

have different values depending on which variable is X (independent) and which variable is Y (dependent).
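
To see what "concordant" and "discordant" pairs mean in practice, here is a minimal sketch of gamma computed by brute force over every pair of cases. The six cases are made up and the function name is hypothetical; tied pairs are simply skipped, as gamma requires.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma = (concordant - discordant) / (concordant + discordant); ties ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both variables rank the pair the same way
            concordant += 1
        elif product < 0:    # the variables rank the pair in opposite ways
            discordant += 1
        # product == 0 means the pair is tied on x or y, so it is left out
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical ordinal scores (1 = low, 3 = high) for six cases.
x_scores = [1, 1, 2, 2, 3, 3]
y_scores = [1, 2, 1, 3, 2, 3]
gamma = goodman_kruskal_gamma(x_scores, y_scores)

print(round(gamma, 3))  # 7 concordant and 2 discordant pairs
```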

Interval Variables

1. Spearman’s rho- measure of association for interval level data that

is less sensitive to outliers than Pearson’s r. Pearson’s r is usually the preferred test statistic so if outliers are not a problem, use Pearson’s r.

2. Pearson’s r – measure of association that requires interval level

variables. Pearson’s is great and used often because it holds strong interpretive power. However, it is sensitive to outliers. In cases where your data holds outliers that cannot be taken out of the analysis, use Spearman’s rho which is less sensitive to outliers.
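
The outlier point is easier to see with numbers. The sketch below computes Pearson's r from its definition and Spearman's rho as Pearson's r on ranks, using made-up data in which one Y value is an extreme outlier. The helper names are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from the sums of cross-products and squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (lowest) upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: Y rises steadily with X, but the last case is an outlier.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 6, 8, 100]

r = pearson_r(x_values, y_values)
rho = spearman_rho(x_values, y_values)
print(round(r, 3), round(rho, 3))
```

The outlier drags Pearson's r down to roughly .74, while Spearman's rho stays at 1 because the rank order is still perfectly consistent.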

B. Tests of Significance (Again, see also Appendix A)

1. Chi-Square- used for nominal and sometimes ordinal level data.

The important thing to remember here is that if you have too many categories, the data will be spread out which can be problematic for this test. Chi-square assumes that there are at least 5 cases in each box in the table. To avoid this problem, categories can be collapsed (see discussion under nominal variables in levels of measurement). Chi-square can tell you if the observed frequency is statistically significant relative to what we might expect by chance. Again, be careful that you are using the right chi-square formula as there are several uses for chi-square.

2. Bivariate Regression- regression can be used with as few as two

variables. However, the assumptions of linear regression must be followed, and the causal order of X and Y must be specified correctly. If the causal order is not clear, try running the regression while flipping around X and Y. Perhaps a reciprocal relationship is present. A more detailed discussion of regression is provided below under multivariate linear regression.

3. Logistic regression-Logistic regression is used when the

dependent variable is dichotomous (only two categories: yes/no, black/white, etc.). Logistic regression is interpreted slightly differently than linear regression in that the coefficients describe how X changes the odds, and therefore the probability, that Y falls in one category rather than the other.
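
As an illustration of the chi-square test of independence described in item 1, the sketch below computes expected frequencies and the obtained chi-square for a made-up 2 x 2 table. The function name is hypothetical, and the critical value of 3.841 (df = 1, alpha = .05) is what a standard chi-square table gives.

```python
def chi_square_independence(table):
    """Obtained chi-square and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df

# Hypothetical 2 x 2 table: rows are two groups, columns are yes/no answers.
observed_counts = [[30, 20],
                   [20, 30]]

chi_square, df = chi_square_independence(observed_counts)
critical_value = 3.841  # from a chi-square table: df = 1, alpha = .05
reject_null = chi_square >= critical_value

print(chi_square, df, reject_null)
```

Every expected count in this table is 25, so the rule of thumb about at least 5 cases per cell is comfortably met.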

IX. Multivariate Relationships:

A. Linear Regression Assumptions: There are several assumptions that must

be followed when employing linear regression. As a researcher, if you violate any of these assumptions, you must provide a discussion of the limitations produced by the violated assumptions. Some violations are more serious and problematic than others. These assumptions are listed and briefly discussed below. Many of them apply to logistic and bivariate equations discussed above. They are in no particular order.

1. The relationship between the independent variables (X) and the

dependent variable (Y) is linear.

2. Errors are random and normally distributed.

3. The mean of error is zero.

4. Errors are not correlated with X.

5. No autocorrelation between the disturbances.

6. No covariance between error and X.

7. The number of observations must be greater than the number of

parameters to be estimated (n must be larger than the number of X’s in the equation).

8. There is variability in both the X and Y values. (X & Y should be

interval level but this assumption violation is not a huge deal if Y is interval and at least one X is interval). X can be dichotomous as well, but the coefficient cannot be directly interpreted; it must be restated into probabilities.

9. No specification bias of error.

10. No perfect multicollinearity.

11. The direction of the hypothesis is properly stated (X impacts Y in

this manner)

12. X and Y are normally distributed.

B. Multivariate regression assumes that the variables included in the

equation are meaningful and make theoretical sense. If one throws every variable possible into the equation, the r2 will be inflated. To avoid this, only include those variables which make theoretical sense and use the Adjusted r2 if there is a significant difference between r2 and adjusted r2.

C. The output from SPSS will provide a table with the standardized and

unstandardized beta coefficients. The standardized coefficients reflect the relative importance of any given X in the equation. For standardized beta coefficients, the largest one (in absolute value) signifies the variable with the biggest impact on Y. The unstandardized coefficients will tell you the change in Y with a one unit increase in X. There is also a column (sig) which will tell you if that X variable is significantly related to Y.

D. Regression is an extremely powerful and useful tool in statistics. Please

refer to class notes and stats books if you have any questions about regression.
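
To connect the unstandardized and standardized coefficients, here is a minimal sketch of a bivariate least-squares regression done by hand. The education and income figures are invented, and the function name is hypothetical; this is not how SPSS computes its output internally, only the textbook formulas.

```python
from statistics import mean, stdev

def bivariate_regression(x, y):
    """Return intercept a, unstandardized slope b, and standardized beta."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                    # change in Y for a one-unit increase in X
    a = mean_y - b * mean_x          # predicted Y when X is zero
    beta = b * stdev(x) / stdev(y)   # slope expressed in standard deviation units
    return a, b, beta

# Hypothetical data: years of education (X) and income in thousands (Y).
education = [10, 12, 14, 16, 18]
income = [25, 30, 38, 45, 52]

intercept, slope, standardized_beta = bivariate_regression(education, income)
print(round(intercept, 2), round(slope, 2), round(standardized_beta, 3))
```

With only one X in the equation, the standardized beta equals Pearson's r between X and Y.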

X. Common Mistakes:

A. Forgetting to convert between standard deviation and variance. Some

formulas require standard deviation, others require variance. Remember that one must compute the variance first and then take its square root to get the standard deviation; the deviations are squared because the raw deviations from the mean always sum to zero.

B. Misstating one-tailed and two-tailed hypotheses. If your hypotheses

suggest a direction—greater than or less than (for example individuals with more education will have a higher income), then the hypothesis is one-tailed. If your hypothesis only says that there will be a difference in values—not equal (for example, education has an effect on income), then your hypothesis is two-tailed. Make sure that the appropriate test is employed based on how your hypothesis is stated.

C. Failing to split the alpha level for two-tailed tests.

D. Misreading the standard normal (z) table.

E. Using n instead of n – 1 degrees of freedom in one-sample t-tests. Make

sure to use the appropriate formula to figure out the degrees of freedom; it differs across tests.

F. Confusing confidence level with confidence interval. The confidence level

is the probability (for example, 95%) that intervals built this way will capture the true parameter; it equals one minus the significance level (alpha) of the test. The confidence interval is a range of values between the lowest and highest values that the estimated parameter could take at a given confidence level.

G. Confusing interval width with margin of error. Interval width is the

distance between the lower and upper limits of the interval, which is twice the margin of error. For example: +/-3 margin of error has an interval width of 6.

H. Confusing statistics with parameters. Parameters are characteristics of the

population that you usually don’t know. Statistics are characteristics of samples that you are usually able to compute. Essentially you compute statistics in order to estimate parameters.

I. Confusing the addition rule with the multiplication rule.

J. Forgetting the addition rule applies to mutually exclusive outcomes.

K. Using a “percentage” instead of a “proportion” in a formula.
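
Several of these mistakes (splitting alpha, confidence level versus interval, margin of error versus width) show up together when building a confidence interval. The sketch below assumes a known standard deviation so that the z distribution applies; the numbers and the function name are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence_level=0.95):
    """Confidence interval for a mean using z (assumes the sd is known)."""
    alpha = 1 - confidence_level
    # Two-tailed: alpha is split in half, one half in each tail.
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    margin_of_error = critical_z * sd / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error, margin_of_error

# Hypothetical sample: mean 50, standard deviation 15, n = 100.
lower, upper, margin = mean_confidence_interval(50, 15, 100)
interval_width = upper - lower  # twice the margin of error

print(round(lower, 2), round(upper, 2), round(margin, 2), round(interval_width, 2))
```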

XI. What do I do with a data set?

A. There are no hard and fast rules because every research project is

different. However, there are a few basic steps in data analysis:

1. Run frequencies of every variable. This will give you a feel for the

data and will alert you to any problems or anomalies that need to be addressed.

2. Run a correlation matrix of important variables. Use Kendall’s tau

b for this preliminary analysis. I know that Kendall’s tau b is for

ordinal level data and not all variables will be ordinal; however, it is a good starting point to begin to see what relationships look like between variables. Also, the matrix might help you identify degrees of multicollinearity which will justify combining variables into scales and indices. Once you review the matrix you may get some concrete ideas about how to proceed with the analyses. Do not use this matrix as a form of hypothesis testing unless both variables you are looking at are ordinal level.

3. Construct scales, indices, and interaction variables. Usually,

questions/items which measure similar things are combined into a scale by adding each item together. This step will eliminate multicollinearity problems. Keep in mind that the items you add together must be theoretically or empirically related to each other. Also, each item must be measured the same way and in the same direction. Also, you may have to place weight on items which hold more predictive power. Make sure to run an alpha reliability analysis. Alpha should be at least a .60 to justify the scale. However, if it is close and you have sound theoretical arguments why the item(s) should remain together in scale form, it is OK to keep them together. But, you should analyze items separately to identify which variables/relationships were most important.

4. Begin hypothesis testing. At this stage, you should be careful

about the assumptions you violate because you are going to use your analyses as tests of hypotheses rather than exploratory analyses. Choose the appropriate tests based on the requirements and assumptions. Recode variables in scales and indices (be sure to run reliability analyses). Be aware that you may have to test hypotheses in different ways with different tests depending on the data set. For example, experience may be measured in several ways and the measures may be at different levels. You may not be able to combine these measures into a scale or index which means you will have to analyze each measure of experience separately.

5. Discuss your analyses. Make sure to include limitations and

alternative explanations for your results.
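
The alpha reliability analysis mentioned in step 3 is usually Cronbach's alpha. Below is a minimal sketch of the standard formula, applied to three made-up items (books, web page, class notes) answered by five hypothetical students on a 1 to 5 scale. The function name is an illustrative choice.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, with respondents in the same order."""
    k = len(items)
    # Each respondent's scale score is the sum of his/her item scores.
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))

# Hypothetical responses from five students (5 = Strongly Agree).
books = [5, 4, 4, 2, 1]
web_page = [4, 4, 3, 2, 2]
class_notes = [5, 3, 4, 1, 2]

alpha = cronbach_alpha([books, web_page, class_notes])
scale_is_justified = alpha >= 0.60  # the rule of thumb given in step 3

print(round(alpha, 3), scale_is_justified)
```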

APPENDIX A

Tests of Significance

Information Obtained: Tests of significance can only tell you whether or not something is statistically significant such as:

• Whether or not two variables are dependent upon one another (chi-square)

• Whether or not the distribution of one variable is not likely based on random chance (chi-square)

• Whether or not a sample mean is significantly different from a population mean (T-tests and Z-tests for comparing a sample to a population)

• Whether or not two sample means are significantly different from one another (T-tests and Z-tests for comparing 2 samples)

• Whether or not 3 or more groups or samples are significantly different from each other (ANOVA)

T-Tests (Univariate)

|Level of measurement |The test variable is either measured in the nominal (when you compare proportions) or interval (when you |

| |compare means) |

|Why Use? |To see if a population (based on information from a sample) is significantly different from a population. |

| |To see if two samples are significantly different. |

|Step 1 |State your Assumptions |

| |Random Sampling (Independent random sampling if comparing two samples) |

| |Level of Measurement _________ |

| |Normal sampling distribution |

|Step 2 |State the Null and Research Hypotheses |

| |H0: The population (based on the sample) is not |

| |significantly different from the overall population. |

| |Or |

| |The two samples are not significantly different. |

| | |

| |H1: The population (based on the sample) is |

| |significantly different from the overall population. |

| |Or |

| |The two samples are significantly different. |

| | |

| |*You may also state the research hypothesis in a certain direction if you know the direction. This is called|

| |a one-tailed test.* |

|Step 3 |Select the Sampling Distribution (T Distribution) and Establish the Critical Region |

| |Pick your alpha level (For ex: .05) |

| |Determine if this is a one or two-tailed test |

| |Determine the degrees of freedom |

| |Df= n-1 (for comparing a sample to a population) |

| |Df= n1+ n2 – 2 (for comparing two samples) |

| |Look up the critical T in back of book based on if it is a one or two tailed test and the alpha. |

|Step 4 |Compute the Test Statistic (Solve the formula to get the obtained T) |

|Step 5 |Interpret your results. |

| |Anything beyond the critical T on both tails for two-tailed test, and beyond one tail for a one-tailed test |

| |you reject the null hypothesis (The area beyond T is considered the “critical” region). |

| | |

| |Anything smaller than the critical T for a one-tailed test and within the region between the critical Ts |

| |(plus and minus) means you fail to reject the null hypothesis |

|Special Instructions |T can be used when the n is smaller than 100. It can also be used when n is larger than 100. |

| |When comparing samples (Ch.8) be sure to compute ALL formulas needed. |

Z –Tests (Univariate)

|Level of measurement |The test variable is either measured in the nominal (when you compare proportions) or interval (when you|

| |compare means) |

|Why Use? |To see if a population (based on information from a sample) is significantly different from a |

| |population. |

| |To see if two samples are significantly different. |

|Step 1 |State your Assumptions |

| |Random Sampling (Independent random sampling if comparing two samples) |

| |Level of Measurement _________ |

| |Normal sampling distribution |

|Step 2 |State the Null and Research Hypotheses |

| |H0: The population (based on the sample) is not |

| |significantly different from the overall population. |

| |Or |

| |The two samples are not significantly different. |

| | |

| |H1: The population (based on the sample) is |

| |significantly different from the overall population. |

| |Or |

| |The two samples are significantly different. |

| | |

| |*You may also state the research hypothesis in a certain direction if you know the direction. This is |

| |called a one-tailed test.* |

|Step 3 |Select the Sampling Distribution (Z Distribution) and Establish the Critical Region |

| |Pick your alpha level (For ex: .05) |

| |Determine if this is a one or two-tailed test |

| |Look up the critical Z in back of book based on if it is a one or two tailed test and the alpha. |

|Step 4 |Compute the Test Statistic (Solve the formula to get the obtained Z) |

|Step 5 |Interpret your results. |

| |Anything beyond the critical Z on both tails for two-tailed test, and beyond one tail for a one-tailed |

| |test you reject the null hypothesis (The area beyond Z is considered the “critical” region). |

| | |

| |Anything smaller than the critical Z for a one-tailed test and within the region between the critical Zs|

| |(plus and minus) means you fail to reject the null hypothesis |

|Special Instructions |Z can only be used when the n is larger than 100. |

| |When comparing samples be sure to compute ALL formulas needed. |

ANOVA (Univariate)

|Level of measurement |The test variable is interval (you are comparing means). The groups are based on Nominal or Ordinal |

| |categories. |

|Why Use? |To see if 3 or more groups or samples are significantly different. |

|Step 1 |State your Assumptions |

| |Independent Random Sampling |

| |Level of Measurement _________ |

| |Normal sampling distribution |

| |Population variances are equal. |

|Step 2 |State the Null and Research Hypotheses |

| |H0: The population means (based on the samples) are not |

| |significantly different from one another. |

| | |

| |H1: The population means (based on the samples) are |

| |significantly different from one another. |

|Step 3 |Select the Sampling Distribution (F Distribution) and Establish the Critical Region |

| |Pick your alpha level (For ex: .05) |

| |Determine the degrees of freedom between and within. |

| |Dfb = k-1 |

| |Dfw = N-k |

| |Look up the critical F in back of book based on the |

| |degrees of freedom and the alpha. |

|Step 4 |Compute the Test Statistic (Solve the formula to get the obtained F) |

|Step 5 |Interpret your results. |

| |Anything equal to or larger than the critical F means you reject the null hypothesis. |

| | |

| |Anything smaller than the critical F means you fail to reject the null hypothesis |

|Special Instructions |ANOVA should be used when the n’s are comparable in number. |

| |When using ANOVA be sure to compute ALL formulas needed. |

| |With ANOVA, there is no such thing as a one or two-tailed test. The F ratio compares the variation |

| |between groups with the variation within groups, so it has no direction. |
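
As a rough sketch of Steps 3 through 5, the obtained F and both degrees of freedom can be computed by hand in Python. The scores below are made up, the function name `one_way_anova` is just for illustration, and the critical F of 3.89 is the usual table value for alpha = .05 with Dfb = 2 and Dfw = 12.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # N, all cases combined
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Sum of squares between: group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Sum of squares within: scores around their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between = k - 1        # Dfb = k-1
    df_within = n_total - k   # Dfw = N-k
    f_obtained = (ss_between / df_between) / (ss_within / df_within)
    return f_obtained, df_between, df_within


# Made-up scores for three groups of five cases each.
groups = [[2, 3, 4, 3, 3], [5, 6, 5, 4, 5], [8, 7, 9, 8, 8]]
f_obtained, dfb, dfw = one_way_anova(groups)

critical_f = 3.89  # from an F table: alpha = .05, Dfb = 2, Dfw = 12
print(round(f_obtained, 2), dfb, dfw, f_obtained >= critical_f)
# 63.33 2 12 True
```

Here the obtained F is far larger than the critical F, so the null hypothesis would be rejected.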

Chi-Square (Univariate and Bivariate)

|Level of measurement |Nominal and Ordinal |

|Why Use? |To see if two variables are dependent. |

| |To see if the distribution of one variable is unlikely to be due to random chance. |

|Step 1 |State your Assumptions |

| |Random Sampling |

| |Level of Measurement _________ |

|Step 2 |State the Null and Research Hypotheses |

| |H0: The two variables are independent. |

| |Or |

| |The distribution of ____ is random. |

| | |

| |H1: The two variables are dependent. |

| |Or |

| |The distribution of ____ is not random. |

|Step 3 |Select the Sampling Distribution (chi-square) and Establish the Critical Region |

| |Pick your alpha level (For ex: .05) |

| |Calculate degrees of freedom: |

| |Df = (r-1)(c-1) for two variable tests. |

| |Df = (k-1) for one variable tests |

| |Look up the critical chi-square in the back of the |

| |book. |

|Step 4 |Compute the Test Statistic (Solve the formula to get the obtained chi-square) |

|Step 5 |Interpret your results. |

| |Anything equal to or above the critical chi-square means you reject the null hypothesis. |

| |Anything smaller than the critical chi-square means you fail to reject the null hypothesis. |

|Special Instructions |Don’t use on anything bigger than a 4 x 4 table. |

| |If any cell has an expected frequency of less than 5, do not use chi-square. |
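
A minimal sketch of the two-variable chi-square steps, using a made-up 2 x 2 table. The function name `chi_square` is hypothetical, and the critical value of 3.841 is the usual table value for alpha = .05 with Df = 1.

```python
def chi_square(table):
    """Return (chi_square, df, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count for each cell if the two variables were independent.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)  # Df = (r-1)(c-1)
    return statistic, df, expected


observed = [[30, 20], [10, 40]]  # made-up 2 x 2 table of counts
chi2, df, expected = chi_square(observed)

critical = 3.841  # from a chi-square table: alpha = .05, Df = 1
print(round(chi2, 2), df, chi2 >= critical)  # 16.67 1 True
```

The `expected` list is returned so you can check the small-cell rule in the Special Instructions before trusting the result.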

Measures of Association (Bivariate)

Information Obtained: Measures of association tell you the strength of a relationship but not whether it is statistically significant. For example, the relationship between eating apples and getting cancer may achieve statistical significance (because it is consistent), yet it can be either a strong relationship (it only takes a few apples to give you cancer) or a weak one (it takes many apples, like millions over a lifetime, to increase the probability of getting cancer).

Tests which measure the association between two variables are based on the following assumptions:

• Random Sampling

• Normal Sampling Distribution

The level of measurement of the variables will determine which test you choose.

Nominal Level Variables

|Test |Formula |Special Instructions |

|Phi |Formula is based on Chi-Square statistic |Use for 2x2 tables. Ranges 0 to +1 |

|Cramer’s V |Formula is based on Chi-Square statistic |Use for larger than 2x2 tables. Ranges 0 to|

| | |+1. |

|Lambda |PRE measure (means your prediction error is|Use for any type of table except for when |

| |reduced when you have knowledge of the |the row marginals are vastly different. |

| |independent variable) |Ranges from 0 to +1. Multiply Lambda by 100|

| | |to get the percentage reduction in error |

| | |from knowing the independent variable. |
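
The three nominal measures above can be sketched as follows. This is an illustration only: the function names are made up, the table is made-up data, and Lambda assumes the columns hold the independent variable and the rows hold the dependent variable.

```python
from math import sqrt


def chi_square_stat(table):
    """Obtained chi-square for a table of observed counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum(
        (table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(len(rows))
        for j in range(len(cols))
    )


def phi(table):
    """Phi for a 2 x 2 table: sqrt(chi-square / N)."""
    n = sum(sum(r) for r in table)
    return sqrt(chi_square_stat(table) / n)


def cramers_v(table):
    """Cramer's V: sqrt(chi-square / (N * min(r - 1, c - 1)))."""
    n = sum(sum(r) for r in table)
    smaller = min(len(table) - 1, len(table[0]) - 1)
    return sqrt(chi_square_stat(table) / (n * smaller))


def lambda_pre(table):
    """Lambda, assuming columns hold the independent variable."""
    n = sum(sum(r) for r in table)
    # E1: errors when predicting the modal row for every case.
    e1 = n - max(sum(r) for r in table)
    # E2: errors when predicting the modal row within each column.
    e2 = sum(sum(col) - max(col) for col in zip(*table))
    return (e1 - e2) / e1


table = [[30, 20], [10, 40]]  # made-up 2 x 2 table of counts
print(round(phi(table), 3), round(lambda_pre(table), 2))  # 0.408 0.4
```

With this table, a Lambda of 0.4 means knowing the independent variable cuts prediction errors by 40 percent. On a 2 x 2 table Cramer's V gives the same answer as Phi.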

Ordinal Level Variables

|Test |Formula |Special Instructions |

|Gamma |PRE measure based on concordant and |Ranges from -1 to +1 (-1 is a perfect |

| |discordant pairs (Gamma does not account |negative relationship and +1 is a perfect |

| |for tied pairs) |positive relationship). 0 means no |

| | |relationship. |

|Kendall’s tau-b |PRE measure based on concordant and |Ranges from -1 to +1 (-1 is a perfect |

|(use when the number of rows is equal to |discordant pairs. This test accounts for |negative relationship and +1 is a perfect |

|the number of columns) |tied pairs (thus a little more accurate |positive relationship). 0 means no |

| |than Gamma) |relationship. |

|Kendall’s tau-c |PRE measure based on concordant and |Ranges from -1 to +1 (-1 is a perfect |

|(use when the number of rows is NOT equal |discordant pairs. This test accounts for |negative relationship and +1 is a perfect |

|to the number of columns) |tied pairs (thus a little more accurate |positive relationship). 0 means no |

| |than Gamma) |relationship. |

|Somers dxy |This test accounts for tied pairs on both |Ranges from -1 to +1 (-1 is a perfect |

| |the independent and dependent variable. |negative relationship and +1 is a perfect |

| |With this formula, you must specify the |positive relationship). 0 means no |

| |independent and dependent variables |relationship. |

| |correctly. | |
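
To show what "concordant and discordant pairs" means in practice, here is a small sketch that counts them and computes Gamma. The function names and the 3 x 3 table are made up, and the code assumes the rows and columns are both listed from low to high.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in an ordered table.

    Assumes rows and columns are both listed from low to high.
    """
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:    # higher on both variables
                        concordant += table[i][j] * table[k][m]
                    elif m < j:  # higher on one, lower on the other
                        discordant += table[i][j] * table[k][m]
    return concordant, discordant


def gamma(table):
    """Gamma = (C - D) / (C + D); tied pairs are ignored."""
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


table = [[20, 10, 5], [10, 15, 10], [5, 10, 20]]  # made-up 3 x 3 counts
c, d = concordant_discordant(table)
print(c, d, round(gamma(table), 2))  # 2000 575 0.55
```

Pairs that share a row or a column are tied and are simply skipped here, which is why Gamma tends to come out a little higher than the tau measures on the same table.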

Interval and Ratio Level Variables

|Test |Formula |Special Instructions |

|Spearman’s rho |Use this formula when you have outliers. |Ranges from -1 to +1. Once you square the |

|(Can be used in some cases for ordinal |This test is less sensitive to outliers. |obtained Spearman’s rho, you may multiply |

|level variables that are “continuous,” | |that number by 100 and interpret it as the |

|i.e., scales with actual scores) | |percentage of reduced error. |

|Pearson’s r |This formula is the most often used but is |Ranges from -1 to +1 (-1 is a perfect |

| |sensitive to outliers |negative relationship and +1 is a perfect |

| | |positive relationship). 0 means no |

| | |relationship. If you square this number and|

| | |multiply by 100, it can be interpreted as |

| | |the percentage of variation in the |

| | |dependent variable that the independent |

| | |variable explains. |
