IQ's Corner: IAP AP101 Brief #10: Understanding IQ score differences: Examiner Errors

Why do significant differences in IQ scores often occur between different tests, or between administrations of the same test at different times? The explanations are many, and previous IAP Applied Psychometrics 101 (AP101) Reports and Briefs have touched on a number of them.

In the first AP101 report, which I recommend reading before the material below, test administration and/or scoring errors (examiner errors) were mentioned as a possible reason for score discrepancies. The brief report below addresses this topic.

Test procedural and administration errors (examiner error)

Although most psychologists receive rigorous graduate training in the standardized administration of intelligence tests, research on adherence to standardized administration and scoring procedures has consistently (and unfortunately) found that examiner errors occur often enough, among both novice and experienced examiners, to be a concern.

Ramos, Alfonso, and Schermerhorn (2009) summarized the existing research on examiner errors and reported that most studies found average examiner error high enough to produce significant changes in individuals' IQ scores. The most frequently reported errors included failing to record responses, applying incorrect basal and ceiling rules, reporting an incorrect global IQ score, adding subtest scores incorrectly, assigning incorrect points to specific items, and calculating the individual's age incorrectly. For studies of the Wechsler scales, Ramos et al.'s review found reported average error rates of 7.8 to 25.8 errors per test record, almost 90% of examiners making at least one error, and, in one study, a change in the Full Scale IQ for two-thirds of the test records reviewed. Examiner errors do not appear to be instrument-specific: Ramos et al. reported an average of 4.63 errors per test record on the WJ III Tests of Cognitive Abilities.

The importance of verifying accurate administration and scoring is evident in Ryan and Schnakenberg-Ott's (2003) finding that, across experienced psychologists and graduate students scoring the WAIS-III, scores differed by as much as 25, 22, and 11 points for the Verbal, Performance, and Full Scale IQs, respectively. Although the examiners reported confidence in their scoring accuracy, their average agreement with the standard (accurately scored) test record was only 26.3% (Verbal IQ), 36.8% (Performance IQ), and 42.1% (Full Scale IQ).

This level of examiner error is alarming, particularly when IQ scores drive important decisions (e.g., life-and-death Atkins decisions regarding mental retardation/intellectual disability [MR/ID]; eligibility for intervention programs; eligibility for Social Security disability benefits). Examiner experience does not appear to explain the errors. More recently, in a study of a single subtest (WISC-IV Vocabulary), Erdodi, Richard, and Hopwood (2009) reported that scoring errors may be more frequent when evaluating examinees of low and high ability.

Numerous recommendations for test development and for professional training and monitoring have been proposed (see Erdodi et al., 2009; Hopwood & Richard, 2005; Kuentzel et al., 2011; Ramos et al., 2009; Ryan & Schnakenberg-Ott, 2003), some of which have been shown empirically to improve accuracy (see Kuentzel, Hetterscheidt, & Barnett, 2011).

Examiner administration and scoring errors can explain discrepant IQ scores. Before attempting to interpret any IQ score, or to reconcile differences between scores from different tests, the first step is for every examiner to double-check their scoring. Another wise step is to have another experienced examiner independently review the scored test record. In Atkins cases, attempts should be made to secure copies of the original IQ test records for independent review. Any clear errors should be corrected and the scores recalculated. Only then should psychologists draw conclusions about the consistency of, or differences between, scores from different IQ tests or from versions of the same test given at different points in an individual's lifespan.
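The arithmetic part of that double-check can be partly mechanized. The Python sketch below is a minimal, hypothetical illustration, not a feature of any published scoring software: it assumes each subtest's item-level points have been transcribed from the record form into a `SubtestRecord` (a name invented here), flags subtests whose recorded raw total does not equal the sum of the item points, and lists items where an independent reviewer awarded different points. It cannot catch misapplied basal/ceiling rules or errors in converting raw scores to norm-referenced scores; those still require the test manual.

```python
from dataclasses import dataclass


@dataclass
class SubtestRecord:
    """Hypothetical transcription of one subtest from a scored record form."""
    name: str
    item_points: list      # points awarded on each item, in administration order
    recorded_total: int    # raw total the examiner wrote on the form


def arithmetic_errors(records):
    """Flag subtests whose recorded raw total differs from the sum of item points.

    Returns (subtest name, recorded total, recomputed total) tuples.
    """
    flagged = []
    for rec in records:
        recomputed = sum(rec.item_points)
        if recomputed != rec.recorded_total:
            flagged.append((rec.name, rec.recorded_total, recomputed))
    return flagged


def scoring_disagreements(original, reviewed):
    """List items where an independent reviewer awarded different points.

    Both arguments map subtest name -> list of item points. Only items present
    in both lists are compared (zip stops at the shorter list). Returns
    (subtest, item number starting at 1, original points, reviewer points).
    """
    diffs = []
    for name, orig_points in original.items():
        review_points = reviewed.get(name, [])
        for item, (a, b) in enumerate(zip(orig_points, review_points), start=1):
            if a != b:
                diffs.append((name, item, a, b))
    return diffs


# Made-up example records, for illustration only.
records = [
    SubtestRecord("Vocabulary", [2, 2, 1, 0, 1], recorded_total=7),  # items sum to 6
    SubtestRecord("Similarities", [2, 1, 1, 1], recorded_total=5),   # items sum to 5
]
print(arithmetic_errors(records))  # [('Vocabulary', 7, 6)]

original = {"Vocabulary": [2, 2, 1, 0, 1]}
reviewed = {"Vocabulary": [2, 1, 1, 0, 1]}
print(scoring_disagreements(original, reviewed))  # [('Vocabulary', 2, 2, 1)]
```

A disagreement surfaced this way is a prompt for the two examiners to consult the scoring criteria in the manual, not an automatic correction.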

Any intelligence test results used in an Atkins hearing must be subject to independent review of the original test protocol (which may be impossible for older, historical test results) to guard against administration or scoring errors that could produce significant differences in the reported IQ score. This is critically important in Atkins cases, where courts often use a strict, specific IQ “bright line” cut score to determine the presence of an intellectual disability.
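To make the bright-line concern concrete, the hypothetical Python sketch below assumes a cut score of 70 applied as "at or below the cut score" (an illustrative assumption; the actual cutoff and how it is applied vary by jurisdiction). It checks whether correcting a scoring error moves a score across the line, and whether IQs obtained by different scorers of the same record fall on both sides of it, which a spread as large as the 11-point Full Scale IQ range reported by Ryan and Schnakenberg-Ott (2003) could produce for scores near the cutoff.

```python
# Hypothetical bright-line cut score; real cutoffs and their application vary.
CUTOFF = 70


def meets_bright_line(iq, cutoff=CUTOFF):
    """True if the score is at or below the (assumed) cut score."""
    return iq <= cutoff


def classification_flips(recorded_iq, corrected_iq, cutoff=CUTOFF):
    """True if correcting a scoring error moves the score across the cut line."""
    return meets_bright_line(recorded_iq, cutoff) != meets_bright_line(corrected_iq, cutoff)


def straddles_cutoff(scores, cutoff=CUTOFF):
    """True if IQs from different scorers of the same record fall on both sides."""
    below = [meets_bright_line(s, cutoff) for s in scores]
    return any(below) and not all(below)


# Made-up scores, for illustration only.
print(classification_flips(72, 69))      # True: a 3-point correction crosses the line
print(classification_flips(85, 82))      # False: same correction, far from the line
print(straddles_cutoff([66, 70, 77]))    # True: scorers disagree about the classification
```

The same few points of scoring error matter very differently depending on how close the score sits to the cut line, which is why near-cutoff scores in particular warrant independent review.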

Below are the abstracts from the primary sources for this brief report.