


Differential Item Functioning

The reliability of test scores is also influenced by examinees' demographic characteristics. Every examinee belongs to a subgroup, and examinees' responses tend to be influenced by their subgroup. Differential item functioning (DIF) refers to the phenomenon of subgroups of examinees responding differently to items because (1) there are actual differences between the subgroups or (2) the subgroups interpret the items differently. Either situation can result in differential item functioning. Take, for instance, child development: girls tend to develop fine motor skills at an earlier age than boys do, while boys tend to outperform girls on gross motor skills. Consequently, items or activities that require gross motor skills would demonstrate differential item functioning between boys and girls. This is an example of an actual difference. We might also consider the issue of crying. In some cultures crying is considered an acceptable way of showing pain, and in others it is not. Therefore, any item related to crying could be perceived differently in one subculture than in another, and items that refer to crying would likely demonstrate differential item functioning when answered by individuals from different subcultures.

Once data have been collected on the items, a statistical analysis referred to as a DIF procedure can be performed on the item responses to reveal differential item functioning. The statistical analysis will simply flag items that need to be looked at more carefully; it will not tell you what the problem is, so human judgment is still required. The Standards for Educational and Psychological Testing discuss the need to detect DIF.

It is important to note that simply comparing the percentage correct for different groups is not sufficient. Such a comparison ignores the fact that real differences between groups do exist, and it assumes that everyone had an equal opportunity to learn the subject material and that testing conditions were truly standardized for everyone.
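A small numerical sketch (with invented counts) can make this concrete: two groups can show very different overall percent-correct rates on an item even when the item behaves identically for both groups at every ability level, simply because the groups differ in their ability distributions.

```python
# Invented counts illustrating why a raw percent-correct gap is not
# evidence of DIF: within each ability stratum the item behaves
# identically for both groups, yet overall rates differ because the
# groups have different ability distributions.

# (n_examinees, n_correct) per (group, ability stratum); numbers invented.
counts = {
    ("A", "low"):  (20, 8),   # 40% correct
    ("B", "low"):  (80, 32),  # 40% correct
    ("A", "high"): (80, 64),  # 80% correct
    ("B", "high"): (20, 16),  # 80% correct
}

def overall_rate(group):
    """Overall proportion correct for a group, pooled over strata."""
    n = sum(v[0] for k, v in counts.items() if k[0] == group)
    c = sum(v[1] for k, v in counts.items() if k[0] == group)
    return c / n

print(overall_rate("A"))  # 0.72
print(overall_rate("B"))  # 0.48
```

Despite the 24-point gap in raw percent correct, a DIF procedure that matches examinees on ability would (correctly) find no DIF here, since the within-stratum rates are identical.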

There are four different methods to determine DIF.

▪ Item Response Theory methods (require large groups of examinees)

▪ Mantel-Haenszel statistics (a chi-square test; the most popular method because it is easy to compute and does not require a large N)

▪ Standardization

▪ Logistic regression
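As a sketch of the Mantel-Haenszel approach, examinees can be stratified (matched) on total test score, and a 2x2 table of group by item correctness built for each stratum. The function below computes the standard Mantel-Haenszel common odds ratio, the continuity-corrected chi-square, and the ETS delta value; the counts in the example are invented for illustration.

```python
import math

def mantel_haenszel_dif(tables):
    """Mantel-Haenszel DIF statistics from per-stratum 2x2 tables.

    Each table is (a, b, c, d):
      a = reference-group correct, b = reference-group incorrect,
      c = focal-group correct,     d = focal-group incorrect,
    with examinees stratified (matched) on total test score.
    Returns (common odds ratio, chi-square, ETS delta).
    """
    num = den = 0.0                 # odds-ratio components
    a_sum = e_sum = var_sum = 0.0   # chi-square components
    for a, b, c, d in tables:
        t = a + b + c + d
        num += a * d / t
        den += b * c / t
        a_sum += a
        e_sum += (a + b) * (a + c) / t             # expected a under H0
        var_sum += ((a + b) * (c + d) * (a + c) * (b + d)
                    / (t * t * (t - 1)))            # hypergeometric variance
    or_mh = num / den
    # Continuity-corrected chi-square; clamped at 0 when |obs - exp| < 0.5.
    chi2 = max(abs(a_sum - e_sum) - 0.5, 0.0) ** 2 / var_sum
    # ETS delta scale; larger |delta| means more DIF.
    delta = -2.35 * math.log(or_mh)
    return or_mh, chi2, delta

# Invented counts for one item across two score strata.
tables = [(8, 2, 4, 6), (8, 2, 4, 6)]
or_mh, chi2, delta = mantel_haenszel_dif(tables)
print(round(or_mh, 2), round(chi2, 2), round(delta, 2))  # 6.0 4.85 -4.21
```

In practice a library implementation (e.g., a stratified-table routine in a statistics package) would be preferable to hand-rolled code; this sketch only shows the shape of the computation.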

DIF analyses provide the most accurate results on tests that measure the same or highly related KSAs or learning objectives (i.e., tests with high internal consistency reliability). However, it is important to remember that criterion-referenced tests typically do not have high internal consistency reliability because their KSAs or learning objectives may not be highly related. In other words, DIF analyses may be less accurate on criterion-referenced tests.

DIF is reported as a z value for each item. If an item has a z value of 2 or higher and the result is statistically significant, you need to examine the item, try to determine what the problem is, and if possible fix it or remove it.

However, you should also take the item's point-biserial correlation into account when deciding whether or not to discard an item. If you remove an item that has a point biserial of .30 or higher, you may lower the overall reliability of the test. It is possible that, on review, you will be able to remove whatever is causing the problem without actually removing the item.
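The point biserial referred to here is the correlation between the dichotomous item score (0/1) and the total test score. A minimal sketch of the uncorrected version (the item's own point left in the total; many programs also report a corrected value with the item removed) is:

```python
import math

def point_biserial(item_scores, total_scores):
    """Correlation between a dichotomous item (0/1) and total test score.

    Uncorrected version: the item's own contribution is left in the
    total score. Invented data below are for illustration only.
    """
    n = len(item_scores)
    n_correct = sum(item_scores)
    p = n_correct / n                              # proportion correct
    q = 1 - p
    mean1 = (sum(t for i, t in zip(item_scores, total_scores) if i == 1)
             / n_correct)                          # mean total, item right
    mean0 = (sum(t for i, t in zip(item_scores, total_scores) if i == 0)
             / (n - n_correct))                    # mean total, item wrong
    mean_all = sum(total_scores) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    return (mean1 - mean0) / sd * math.sqrt(p * q)

# Invented responses for five examinees.
item = [1, 1, 1, 0, 0]
totals = [5, 4, 3, 2, 1]
print(round(point_biserial(item, totals), 3))  # 0.866
```

An item with a point biserial of .30 or higher discriminates well between high and low scorers, which is why removing it can lower the test's overall reliability.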

Dan Biddle’s book Adverse Impact and Test Validation does a good job of addressing what you should consider when an item has high DIF. A full reference can be found in the course library.
