Reliability and Validity

Judy Shoemaker, Ph.D.

University of California, Irvine

September 2006

Every type of measurement instrument, including educational assessment methods, contains some degree of error. Sources of error may be the individuals being assessed (a student who doesn’t perform well due to illness), the administration and scoring procedures used (a noisy event outside the classroom that keeps test takers from concentrating), or the instrument itself. To ensure that the instrument itself is sound, it is important to review evidence of its reliability (consistency) and validity (accuracy).

Since most of the work on reliability and validity is done within the context of tests and test scores, that terminology will be used here. However, the concepts can be applied to any other form of assessment.

Reliability

Reliability refers to the consistency of the scores. That is, can we count on getting the same or similar scores if the test were administered at a different time of day, or if different raters scored the test? Reliability also refers to how internally consistent the test is.

Reliability is estimated using correlation coefficients (Pearson r) derived from various sets of test scores. Correlation coefficients range from 0.00 (no correlation) to 1.00 (perfect correlation). Professionally developed standardized tests often have reliability coefficients of .90 or higher, while teacher-made tests often have coefficients of .50 or lower (Ebel & Frisbie, 1991).
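
To make the calculation concrete, here is a minimal sketch in Python (with NumPy) of how a Pearson r is obtained from two sets of scores for the same students; the same computation underlies the test-retest and alternate forms methods described below. The score values are invented for illustration.

```python
import numpy as np

# Hypothetical scores for the same eight students on two administrations (or two forms)
scores_first = np.array([72, 85, 90, 64, 78, 88, 70, 95])
scores_second = np.array([70, 83, 92, 60, 80, 85, 72, 94])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson r
r = np.corrcoef(scores_first, scores_second)[0, 1]
print(f"reliability estimate (Pearson r) = {r:.2f}")  # values near 1.00 indicate consistent scores
```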

There are several different methods for estimating the reliability of a test:

Test-Retest Method: Test-retest reliability is a measure of stability (Gronlund & Linn, 1990). For this method the test is administered twice to the same set of students. The correlation between students’ scores on the first test and their scores on the second test is an estimate of reliability. This method makes several assumptions, some of which might not be realistic, such as that nothing students know or can do has changed between administrations and that students do not remember details of the first test.

Alternate Forms Method: With this method two parallel tests are developed at the same time, following the same test blueprint.[1] Each student takes both tests, and the correlation between the two test scores is an estimate of the reliability of the test. Alternate forms reliability is a measure of equivalence (Gronlund & Linn, 1990).

Split-Halves Method: Since it is often difficult and inefficient to develop two parallel tests, a more common approach is to split the current test into two equivalent halves and correlate the scores on those halves. One method is to use the odd-numbered items as one half of the test and the even-numbered items as the other half. The correlation between the two halves is an estimate of the reliability of the test. In this case, the correlation coefficient is usually adjusted upward (typically with the Spearman-Brown formula) to reflect the length of the original test, since each half is only half as long as the full test. Split-halves reliability is a measure of internal consistency.
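
A minimal sketch of this procedure in Python, assuming a small students-by-items matrix of dichotomous (1/0) scores invented for illustration:

```python
import numpy as np

# Rows = students, columns = items; 1 = correct, 0 = incorrect (made-up data)
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

odd_half = item_scores[:, 0::2].sum(axis=1)   # total score on items 1, 3, 5, ...
even_half = item_scores[:, 1::2].sum(axis=1)  # total score on items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown adjustment: estimate the reliability of the full-length test
# from the correlation between the two half-length tests
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, adjusted full-test reliability = {r_full:.2f}")
```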

Kuder-Richardson (K-R) Method: Taking the split-halves method one step further, the method developed by Kuder and Richardson yields an estimate of reliability that is equivalent to the “average correlation achieved by computing all possible split-halves correlations for a test” (Ebel & Frisbie, 1991). Use of the K-R formulas assumes that each test item is scored dichotomously (one point for a correct answer and no points for an incorrect answer). K-R reliability is a measure of internal consistency.
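
One widely used version of these formulas, KR-20, can be written out directly. Here is a minimal sketch in Python, using an invented matrix of 1/0 item scores; it uses the sample variance of the total scores, though some texts use the population variance instead.

```python
import numpy as np

# Rows = students, columns = items; 1 = correct, 0 = incorrect (made-up data)
item_scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
])

k = item_scores.shape[1]                           # number of items
p = item_scores.mean(axis=0)                       # proportion answering each item correctly
q = 1 - p                                          # proportion answering each item incorrectly
total_var = item_scores.sum(axis=1).var(ddof=1)    # variance of total scores

# KR-20 = (k / (k - 1)) * (1 - sum(p * q) / total score variance)
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.2f}")
```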

Cronbach’s Alpha: This method is appropriate for items scored with values other than 1 or 0, such as an essay item that might be scored using a 5-point scale. Like the K-R formulas, Cronbach’s alpha represents an average correlation that would be obtained over all split-halves of the test. Cronbach’s alpha is a measure of internal consistency and is the most widely used and reported method for estimating the reliability of test scores.
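
A minimal sketch of Cronbach’s alpha in Python, assuming a small matrix of invented ratings on a 5-point scale (rows = students, columns = items):

```python
import numpy as np

# Hypothetical ratings of five students on four 5-point items (made-up data)
item_scores = np.array([
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
])

k = item_scores.shape[1]                           # number of items
item_vars = item_scores.var(axis=0, ddof=1)        # variance of each item
total_var = item_scores.sum(axis=1).var(ddof=1)    # variance of total scores

# alpha = (k / (k - 1)) * (1 - sum of item variances / total score variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```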

Inter-Rater Reliability: For essay or other performance-based items that are scored by more than one rater, it is important that all raters score the items in the same way. To estimate inter-rater reliability, we calculate the percentage of scores that are in absolute agreement when multiple raters rate the same set of papers. Another commonly used measure is the average correlation of scores between pairs of raters. In both cases, we would look for values of .70 or higher. Inter-rater reliability is especially important when multiple raters are using a scoring rubric to assess student learning outcomes.
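
A minimal sketch of both estimates in Python, assuming an invented matrix of ratings in which each column holds one rater’s scores for the same set of papers:

```python
import numpy as np
from itertools import combinations

# Rows = papers, columns = raters (made-up ratings on a 5-point rubric)
ratings = np.array([
    [4, 4, 5],
    [3, 3, 3],
    [5, 4, 5],
    [2, 2, 2],
    [4, 4, 4],
    [3, 2, 3],
])

n_raters = ratings.shape[1]
rater_pairs = list(combinations(range(n_raters), 2))

# 1. Proportion of papers on which a pair of raters gives identical scores,
#    averaged over all pairs of raters
agreement = np.mean([(ratings[:, i] == ratings[:, j]).mean() for i, j in rater_pairs])
print(f"average absolute agreement = {agreement:.2f}")

# 2. Average Pearson correlation between pairs of raters
avg_r = np.mean([np.corrcoef(ratings[:, i], ratings[:, j])[0, 1] for i, j in rater_pairs])
print(f"average inter-rater correlation = {avg_r:.2f}")
```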

Improving reliability: Reliability can be improved by ensuring that the test items are written clearly and without ambiguity. Response options should be appropriate and meaningful. Where possible, lengthening the test (by adding more items, for example) will generally improve its reliability; the formula sketched below estimates the size of that gain. If scoring rubrics are being used, the reliability of ratings can be improved through rater training and practice.
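
The effect of lengthening a test can be quantified with the Spearman-Brown prophecy formula, which predicts the reliability of a test whose length is multiplied by a factor n. A minimal sketch in Python, with hypothetical numbers:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when the test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical example: a 20-item test with reliability .60 is doubled to 40 items
print(round(spearman_brown(0.60, 2), 2))  # predicted reliability = 0.75
```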

Types of Reliability

|Type of reliability |What’s measured |Procedure |Comments |
|Test-retest |Stability |Give the same test twice to the same set of students over a very short period of time (a day or two) and correlate the test scores. |Assumes no learning over time; fairly unrealistic assumptions. |
|Alternate forms |Equivalence |Develop two parallel versions of the same test, give both tests at the same time, and correlate the test scores. |Difficult and time consuming to develop two different versions of the same test. |
|Split-halves |Internal consistency |Before scoring, split the test into two halves (odd/even items) and correlate the scores on the two halves. |Works best when the test has many items; correlations should be adjusted upward to reflect the total number of items on the test. |
|Kuder-Richardson (K-R) |Internal consistency |Hypothetical average of all possible split-halves correlations. |Assumes each test item is scored right/wrong (1/0). |
|Cronbach’s alpha |Internal consistency |Hypothetical average of all possible split-halves correlations. |Allows test items to be scored with values other than 1/0. Most widely used measure of internal consistency. |
|Inter-rater |Consistency between raters |Degree of consistency among raters when more than one rater is used to score the same test items. |Important for performance exams or where scoring rubrics are used by multiple raters. |

Validity

Validity refers to how accurately a test is measuring what it is supposed to measure. A foreign language placement exam is said to be valid if it accurately predicts grades in introductory foreign language courses. To be valid, the test must first be reliable, but not all reliable tests demonstrate validity. Stated another way, “Reliability is a necessary but not sufficient condition for validity” (Gronlund & Linn, 1990, p. 79).

There are different types of validity depending on the purpose of the test. Commonly used types of validity are face validity, construct validity, predictive validity, and content validity. Unlike reliability, there is no single statistical method that is used to demonstrate validity.

Face validity. Face validity is the weakest type of validity. Face validity refers to how well the test “on the face of it” looks like it measures what it is supposed to measure. A mathematics test made up of mathematics problems is said to have face validity. This type of validity is especially important to test takers.

Content validity. Content validity is the most important type of validity for assessment of student learning. Content validity refers to how well the test items represent what is learned in a course or in a similar knowledge domain. That is, to what extent are the test items representative of the types of content or skills that were taught? Content validity can be enhanced by carefully designing the test to reflect what was taught. To ensure this alignment, many test makers use a test blueprint or matrix in which the rows are the important elements of the content, the columns are the cognitive levels to be assessed, and the cells identify the associated test items.

Not all achievement tests demonstrate content validity for use in classroom assessment. For example, although nationally standardized achievement tests are developed by content experts, the content selected may not actually be a good fit for what was taught in a specific course. Thus it might be inappropriate to assess course learning outcomes with a standardized test unless it can be shown that there is a good fit between the test and what is actually taught in the course.

Predictive validity. Predictive validity is important when the purpose of the test is to predict future behavior, such as the foreign language placement exam described earlier. Predictive validity is demonstrated when scores on the test are positively correlated with future behavior, such as a grade in a course. This type of validity is also known as criterion-related validity since test scores are compared with an external criterion.

Construct validity. Construct validity indicates the extent to which a test measures an underlying construct, such as intelligence or anxiety. Construct validity is demonstrated if the test correlates with similar tests measuring the same construct, or if test scores are consistent with what the construct would predict. For example, when individuals are placed in a stressful environment, we would expect their scores on an anxiety test to go up.

Types of Validity

|Type of validity |Definition |Example |
|Face |The extent to which a test “looks like” what it is supposed to measure. |An in-class essay in a writing course. |
|Content |The extent to which a test is representative of what was taught. |Essential for assessing student learning outcomes. |
|Predictive (criterion) |The extent to which a test accurately predicts future behavior. |A calculus placement exam used to place students into pre-calculus or regular calculus courses (criterion = course grades). |
|Construct |The extent to which a test corresponds to other variables, as predicted by the construct or theory. |Scores on a depression scale correlate with a physician’s diagnosis of depression. |

References

Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice Hall.

Gronlund, N.E., & Linn, R.L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.

-----------------------

[1] A test blueprint is a matrix showing areas to be assessed (rows) and the cognitive level to be assessed (columns). Numbers in each cell identify how many test items will be used to measure that area at that specific cognitive level. A test blueprint is also called a Table of Specifications.
