Test Reliability—Basic Concepts

Research Memorandum

ETS RM-18-01

Samuel A. Livingston

January 2018

ETS Research Memorandum Series

EIGNOR EXECUTIVE EDITOR

James Carlson Principal Psychometrician

ASSOCIATE EDITORS

Beata Beigman Klebanov Senior Research Scientist

Heather Buzick Senior Research Scientist

Brent Bridgeman Distinguished Presidential Appointee

Keelan Evanini Research Director

Marna Golub-Smith Principal Psychometrician

Shelby Haberman Distinguished Research Scientist, Edusoft

Anastassia Loukina Research Scientist

John Mazzeo Distinguished Presidential Appointee

Donald Powers Principal Research Scientist

Gautam Puhan Principal Psychometrician

John Sabatini Managing Principal Research Scientist

Elizabeth Stone Research Scientist

Rebecca Zwick Distinguished Presidential Appointee

PRODUCTION EDITORS

Kim Fryer Manager, Editing Services

Ayleen Gontz Senior Editor

Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Memorandum series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Memorandum series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.

The Daniel Eignor Editorship is named in honor of Dr. Daniel R. Eignor, who from 2001 until 2011 served the Research and Development division as Editor for the ETS Research Report series. The Eignor Editorship has been created to recognize the pivotal leadership role that Dr. Eignor played in the research publication process at ETS.

Educational Testing Service, Princeton, New Jersey

January 2018

Corresponding author: S. A. Livingston, E-mail: slivingston@

Suggested citation: Livingston, S. A. (2018). Test reliability--Basic concepts (Research Memorandum No. RM-18-01). Princeton, NJ: Educational Testing Service.

Find other ETS-published reports by searching the ETS ReSEARCHER database at

To obtain a copy of an ETS research report, please visit

Action Editor: Gautam Puhan

Reviewers: Shelby Haberman and Marna Golub-Smith

Copyright © 2018 by Educational Testing Service. All rights reserved. ETS, the ETS logo, GRE, MEASURING THE POWER OF LEARNING, and TOEFL are registered trademarks of Educational Testing Service (ETS). All other trademarks are the property of their respective owners.

Abstract

The reliability of test scores is the extent to which they are consistent across different occasions of testing, different editions of the test, or different raters scoring the test taker's responses. This guide explains the meaning of several terms associated with the concept of test reliability: "true score," "error of measurement," "alternate-forms reliability," "interrater reliability," "internal consistency," "reliability coefficient," "standard error of measurement," "classification consistency," and "classification accuracy." It also explains the relationship between the number of questions, problems, or tasks in the test and the reliability of the scores.

Key words: reliability, true score, error of measurement, alternate-forms reliability, interrater reliability, internal consistency, reliability coefficient, standard error of measurement, classification consistency, classification accuracy

Preface

This guide grew out of a class that I teach for staff at Educational Testing Service (ETS). The class is a nonmathematical introduction to the topic, emphasizing conceptual understanding and practical applications. The class consists of illustrated lectures, interspersed with written exercises for the participants. I have included the exercises in this guide, at roughly the same points as they occur in the class. The answers are in the appendix at the end of the guide.

In preparing this guide, I have tried to capture as much as possible of the conversational style of the class. I have used the word "we" to refer to myself and most of my colleagues in the testing profession. (We tend to agree on most of the topics discussed in this guide, and I think it will be clear where we do not.)

Table of Contents

Instructional Objectives
Prerequisite Knowledge
What Factors Influence a Test Score?
  The Luck of the Draw
  Reducing the Influence of Chance Factors
  Exercise: Test Scores and Chance
What Is Reliability?
  Reliability Is Consistency
  Reliability and Validity
  Exercise: Reliability and Validity
Consistency of What Information?
"True Score" and "Error of Measurement"
  Reliability and Measurement Error
  Exercise: Measurement Error
Reliability and Sampling
Alternate-Forms Reliability and Internal Consistency
Interrater Reliability
Test Length and Reliability
  Exercise: Interrater Reliability and Alternate-Forms Reliability
Reliability and Precision
Reliability Statistics
  The Reliability Coefficient
  The Standard Error of Measurement
  How Are the Reliability Coefficient and the Standard Error of Measurement Related?
  Test Length and Alternate-Forms Reliability
  Number of Raters and Interrater Reliability
  Reliability of Differences Between Scores
  Demystifying the Standard Error of Measurement
  Exercise: The Reliability Coefficient and the Standard Error of Measurement
Reliability of Essay Tests
Reliability of Classifications and Decisions
Summary
Acknowledgments
Appendix. Answers to Exercises
  Exercise: Test Scores and Chance
  Exercise: Reliability and Validity
  Exercise: Measurement Error
  Exercise: Interrater Reliability and Alternate-Forms Reliability
  Exercise: The Reliability Coefficient and the Standard Error of Measurement
Notes
