Test Statistics Checklist



Overall Exam Statistics Explained

The following statistics may be used to summarize the overall performance of any exam.

Measures of Central Tendency. The exam mean (average) score, median (the midpoint of the distribution), standard deviation (average distance from the mean), and range of scores (lowest to highest) should all be reviewed. These statistics are helpful in determining acceptable cut points.

Reliability Coefficient, or the Kuder-Richardson Formula 20 (KR20) coefficient. The KR20 is a measure of a test's stability, namely the internal consistency and precision of test scores. It helps answer the question, "If you were to administer the same exam again to the same set of examinees, would you get the same results?" If a test produces approximately the same item difficulty indices when administered to similar groups of examinees, that is an indication of high reliability. If, on the other hand, the same test administered to the same (or a very similar) group of examinees showed variation in the difficulty of the items, that could indicate a problem with the test and/or the test items. The KR20 is a linear correlation statistic that ranges between 0 and 1.0. For our purposes, a KR20 of about .70 would be considered satisfactory. Note: poorly constructed items and a large number of very easy items interfere with a test's reliability.

How the KR20 coefficient is calculated: Ideally, to determine a test's reliability one would administer one or more equivalent tests to the same group of students to determine whether their performance is consistent in relation to each other and across tests. Since this is not often practical, we can use a correlation coefficient (in this case the KR20) from a single administration. Conceptually, this is like making two tests from a single exam by randomly splitting it into two halves and comparing students' performance on each half; the KR20 is computed directly from the item-level statistics and can be thought of as averaging this split-half comparison over the possible ways of splitting the test.

Standard Error of Measurement. The standard error of measurement provides information about imprecision in an exam, and as such it relates to reliability. It is a statistic that estimates the amount of variability in an individual examinee's performance that is due to random measurement error. The standard error is expressed in the same scale as the test score. Therefore, we can use it to identify a range within which the true measurement of a student's performance is likely to fall. Since no test is completely reliable, any given score on an exam is likely to deviate a few points from a student's true score. The standard error can be useful in reviewing cut-offs. For instance, suppose a student receives a score of 65.00 on an exam and the faculty determines that the cut-off for failing is 64.50. If the standard error of that test is 1.5, that student's true score is likely to fall in the range of 63.5 to 66.5, which spans the cut-off.
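The sketch below shows how these overall statistics could be computed from a scored exam, assuming a small hypothetical 0/1 item-score matrix (rows are examinees, columns are items); the data and variable names are illustrative only and are not part of the original checklist.

import numpy as np

# Hypothetical 0/1 item-score matrix: rows = examinees, columns = items.
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
])
totals = scores.sum(axis=1)                 # each examinee's total score

# Measures of central tendency and spread
mean_score = totals.mean()
median_score = np.median(totals)
std_dev = totals.std(ddof=1)                # sample standard deviation
score_range = (totals.min(), totals.max())

# KR20 = k/(k-1) * (1 - sum(p*q) / variance of total scores),
# where p is each item's proportion correct and q = 1 - p.
k = scores.shape[1]
p = scores.mean(axis=0)
q = 1 - p
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / totals.var(ddof=1))

# Standard error of measurement, expressed in the same scale as the test score
sem = std_dev * np.sqrt(1 - kr20)

print(mean_score, median_score, round(std_dev, 2), score_range)
print("KR20:", round(kr20, 2), "SEM:", round(sem, 2))

Applied to the example above, a standard error of 1.5 around a score of 65.00 gives the 63.5 to 66.5 band mentioned in the text.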
Individual Item Performance Statistics Explained

The following statistics may be used to assess the individual questions that make up an exam.

Difficulty Index (sometimes called the p-value). The difficulty index, or p-value, represents the proportion of students in a group who correctly answer a question. (Note: this applies to items with one correct answer, worth one point.) When multiplied by 100, the p-value becomes a percentage (i.e., the percentage of examinees who got the item correct). To calculate it, divide the number of students who chose the correct answer by the total number of students who took the test. For instance, if 64 out of 98 students correctly answer a question, the difficulty index would be .65 (64/98). Difficulty indices range from 0 to 1.0. The higher the difficulty index, the easier the item; the lower the index, the more difficult the item. There should be a range of difficulty values on an exam. An item's difficulty is important for determining whether or not students have mastered the content being tested.

Discrimination (D) Index. The discrimination index measures a test item's ability to differentiate (i.e., "discriminate") between students based on how well they have mastered the content being tested. The discrimination index ranges from -1.0 to +1.0. It serves as a method of testing for item validity. A positive discrimination index indicates that students who got the test item correct also tended to have a high overall exam score. A negative discrimination index means that the examinees in the low-performing group got the answer correct at a higher rate than the higher-performing group. When more students who performed poorly on the overall exam were able to correctly answer the question than students who performed well, the item should be reviewed. When both the difficulty and discrimination indices are outside of the normal range, questions should be reviewed: Is the item well constructed? Are some of the distractors non-functioning? Does the question fairly represent concepts and content taught in the course?

How the discrimination index is calculated: The discrimination index is calculated by creating two groups within the total number of examinees, a high-scoring group and a low-scoring group, each comprising 27% of the total group. The difficulty index for each of the two groups is calculated, and the difficulty index of the lower-scoring group is subtracted from that of the higher-scoring group (usually producing a positive value). A negative discrimination index indicates that the item may have low validity; this occurs when the examinees in the low-performing group got the answer correct at a higher rate than the higher-performing group. For instance, if 96 examinees answered an item correctly and 64 did not (from a total group of 160), the overall difficulty index would be .60 (96/160). For this example, say that we form a high-scoring group of 96 examinees and a low-scoring group of 64, and that 74 in the high group and 18 in the low group answered this item correctly. Calculate a difficulty index for both groups: the high-group difficulty is .77 (74/96) and the low-group difficulty is .28 (18/64). Subtracting the low-group difficulty from the high-group difficulty produces a discrimination index of .49. Note: the discrimination index will be low if the difficulty index is very high.

Point Biserial Index (PBI). This statistic correlates the performance of examinees on the overall exam with a specific response option (i.e., the correct answer). The correlation reflects the degree of relationship between scores on an individual exam item and total test scores. When it is positive, it means that those who got the item correct also tended to have a high score on the test, i.e., students who performed well on the exam also got this item correct (what one would expect). When the PBI is negative, it indicates that the higher performers were less likely to choose the correct answer than the lower performers, and signals that the item should be reviewed (is the item keyed correctly, is the correct answer ambiguous, or was there a problem with the way the information was taught?).
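The following sketch illustrates the item-level indices described above on the same kind of hypothetical 0/1 score matrix: the discrimination (D) index from the high- and low-scoring 27% groups, and the point biserial index correlating one item with total scores. The data, group-size rounding, and variable names are assumptions for illustration.

import numpy as np

# Hypothetical 0/1 item-score matrix: rows = examinees, columns = items.
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
])
totals = scores.sum(axis=1)
item = 0                                     # column index of the item being analysed

# Rank examinees by total score and take the top and bottom 27%
order = np.argsort(totals)
n_group = max(1, int(round(0.27 * len(totals))))
low_group = scores[order[:n_group], item]
high_group = scores[order[-n_group:], item]

# D index = difficulty in the high group minus difficulty in the low group
d_index = high_group.mean() - low_group.mean()

# Point biserial index: correlation between the 0/1 item score and the total score
pbi = np.corrcoef(scores[:, item], totals)[0, 1]

print("D index:", round(d_index, 2), "PBI:", round(pbi, 2))

With the worked example in the text, the same subtraction (.77 - .28) gives the discrimination index of .49.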
Analysis of Response Options. This is a comparison of the number of examinees who selected each answer option/distractor. Analysis of response options allows for the evaluation of distractor quality. If examinees are not tempted by a particular distractor, it may be better to revise it because it is not functioning as a plausible distractor; in that case guessing becomes more likely, and guessing affects the validity of an item. The presence of one or more implausible distractors may also make the exam item easier. Whereas the point biserial index for the correct answer should be positive, the point biserial for each distractor should be negative (and, if not negative, at least lower than that of the correct answer).
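A final sketch for the analysis of response options, assuming a hypothetical answer key, response vector, and total scores: it tallies how many examinees chose each option and computes a point biserial for every option, which should come out positive for the keyed answer and negative (or at least lower) for the distractors.

import numpy as np

options = ["A", "B", "C", "D"]
key = "B"                                        # hypothetical keyed correct answer
responses = np.array(list("BBABDCBBADBBCB"))     # hypothetical choice per examinee
totals = np.array([28, 31, 17, 25, 15, 19, 30, 27, 16, 22, 29, 33, 14, 26])  # total exam scores

for opt in options:
    chose = (responses == opt).astype(float)     # 1 if the examinee picked this option
    count = int(chose.sum())
    # Point biserial for the option: correlation between choosing it and the total score
    if 0 < count < len(chose):
        pbi = np.corrcoef(chose, totals)[0, 1]
    else:
        pbi = float("nan")                       # option never (or always) chosen
    label = " (key)" if opt == key else ""
    print(f"{opt}{label}: chosen by {count}, point biserial = {pbi:.2f}")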