IQ's Corner: Interpretation of Gf tests: Ideas from Wilhelm

Monday, May 30, 2005

Interpretation of Gf tests: Ideas from Wilhelm

In a prior post I summarized a taxonomic lens for analyzing performance on figural/spatial matrix measures of fluid intelligence (Gf). Since then I have had the opportunity to read “Measuring Reasoning Ability” by Oliver Wilhelm (see earlier blog post on recommended books to read – this chapter is part of the Handbook of Understanding and Measuring Intelligence, edited by Wilhelm and Engle). Below are a few select highlights.

The need for a more systematic framework for understanding Gf measures

As noted by Wilhelm, “there is certainly no lack of reasoning measures” (p. 379). Furthermore, as I learned when classifying tests as per CHC theory with Dr. Dawn Flanagan, the classification of Gf tests as measures of general sequential (deductive) reasoning (RG), inductive reasoning (I), and quantitative reasoning (RQ) is very difficult. Wilhelm also presents Kyllonen and Christal’s 1990 statement that the “development of good tests of reasoning ability has been almost an art form, owing more to empirical trial-and-error than to systematic delineation of the requirements which such tests must satisfy” (p. 446 in Kyllonen and Christal; p. 379 in Wilhelm). It thus follows that the logical classification of Gf tests is often difficult…or, as we used to say when I was in high school…“no sh____ batman!!!!”

As a result, “scientists and practitioners are left with little advice from test authors as to why a specific test has the form it has. It is easy to find two reasoning tests that are said to measure the same ability but that are vastly different in terms of their features, attributes, and requirements” (p. 379).

Wilhelm’s system for formally classifying reasoning measures

Wilhelm articulates four aspects to consider in the classification of reasoning measures. These are:

Formal operation task requirements – this is what most CHC assessment professionals have been encouraged to examine via the CHC lens. Is a test a measure of RG, I, RQ, or a mixture of more than one narrow ability?

Content of tasks – this is where Wilhelm’s research group has made one of its many significant contributions during the past decade. Wilhelm et al. have reminded us that just because the Rubik’s cube model of intelligence (Guilford’s SOI model) was found seriously wanting, the analysis of intelligence tests by operation (see above) and content facets is still theoretically and empirically sound. I fear that many psychologists, having been burned by the unfulfilled promise of the SOI interpretative framework, have often thrown out the content facet with the SOI bath water. There is clear evidence (see my prior post that presents evidence for content facets based on a Carroll analysis of data from 50 CHC-designed measures) that most psychometric tests can be meaningfully classified as per stimulus content – figural, verbal, and quantitative.

The instantiation of the reasoning tasks/problems – what is the formal underlying structure of the reasoning tasks? Space does not allow a detailed treatment here, but Wilhelm provides a flavor of this feature when he suggests that one must go through a “decision tree” to ascertain if the problems are concrete vs. abstract. Following the abstract branch, further differentiation might occur vis-à-vis the distinction of “nonsense” vs. “variable” instantiation. Following the concrete branch of the decision tree, reasoning problems can be differentiated as to whether they require prior knowledge or not. And so on.

As noted by Wilhelm, “it is well established that the form of the instantiation has substantial effects on the difficulty of structurally identical reasoning tasks” (p. 380).

Vulnerability of task to reasoning “strategies” – all good clinicians know, and have seen, that certain examinees often change the underlying nature of a psychometric task via the deployment of unique metacognitive/learning strategies. I often call this the “expansion of a test’s specificity by the examinee.” According to Wilhelm, “if a subgroup of participants chooses a different approach to work on a given test, the consequence is that the test is measuring different abilities for different subgroups…depending on which strategy is chosen, different items are easy and hard, respectively” (p. 381). Unfortunately, research-based protocols for ascertaining which strategies are used during reasoning task performance are more-or-less non-existent.
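For readers who think better in code, here is one rough way the four dimensions above could be jotted down as a record for a single test. This is only my own minimal sketch in Python, not anything from Wilhelm's chapter: the class names, the function classify_instantiation, the label strings, and the example test are all hypothetical, and the decision tree covers only the first two levels described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class Operation(Enum):
    """CHC narrow abilities named in this post (dimension 1)."""
    RG = "general sequential (deductive) reasoning"
    I = "inductive reasoning"
    RQ = "quantitative reasoning"


class Content(Enum):
    """Stimulus content facets (dimension 2)."""
    FIGURAL = "figural"
    VERBAL = "verbal"
    QUANTITATIVE = "quantitative"


def classify_instantiation(concrete: bool,
                           uses_nonsense_terms: bool = False,
                           requires_prior_knowledge: bool = False) -> str:
    """Walk the first two levels of the instantiation decision tree
    (dimension 3). The returned labels are illustrative, not Wilhelm's."""
    if concrete:
        # Concrete branch: does the problem require prior knowledge?
        if requires_prior_knowledge:
            return "concrete/prior-knowledge"
        return "concrete/no-prior-knowledge"
    # Abstract branch: "nonsense" vs. "variable" instantiation.
    if uses_nonsense_terms:
        return "abstract/nonsense"
    return "abstract/variable"


@dataclass(frozen=True)
class GfTestProfile:
    """One test described on all four dimensions."""
    name: str
    operations: FrozenSet[Operation]
    content: Content
    instantiation: str
    # Dimension 4: strategies examinees have been observed to use.
    # Left empty by default, since research-based protocols are scarce.
    observed_strategies: Tuple[str, ...] = ()

    def is_mixed_operation(self) -> bool:
        """True when the test taps more than one narrow ability."""
        return len(self.operations) > 1


# A made-up test with made-up values, NOT a classification of any real measure.
example = GfTestProfile(
    name="Hypothetical Matrices Test",
    operations=frozenset({Operation.I}),
    content=Content.FIGURAL,
    instantiation=classify_instantiation(concrete=False),
)
```

Filling in a table of such records for a battery of Gf tests would make it easy to spot which cells of the taxonomy are crowded and which are empty, although the hard part (deciding the values for real tests) is exactly the judgment call discussed above.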

Ok…that’s enough for this blog post. Readers are encouraged to chew on this taxonomic framework. I do plan (but don’t hold me to the promise…it is a benefit of being the benevolent blog dictator) to summarize additional information from this excellent chapter. Wilhelm’s taxonomy has obvious implications for those who engage in test development. Wilhelm’s framework suggests a structure from which to systematically design/specify Gf tests as per the four dimensions.

On the flip side (applied practice), Wilhelm’s work suggests that our understanding of the abilities measured by existing Gf tests might be facilitated via the classification of different Gf tests as per these dimensions. Work on the “operation” characteristic has been going strong since the mid-1990s as per the CHC narrow ability classification of tests.

Might not a better understanding of Gf measures emerge if those leading the pack on how to best interpret intelligence tests add (to the CHC operation classifications of Gf tests) the analysis of tests as per the content and instantiation dimensions, as well as identifying the different types of cognitive strategies that might be elicited by different Gf tests by different individuals?

I smell a number of nicely focused and potentially important doctoral dissertations based on the administration of a large collection of available practical Gf measures (e.g., Gf tests from WJ III, KAIT, Wechslers, DAS, CAS, SB5, Raven’s, and other prominent “nonverbal” Gf measures) to a decent sample, followed by exploratory and/or confirmatory factor analyses and multidimensional scaling (MDS). Heck…doesn’t someone out there have access to that ubiquitous pool of psychology experiment subjects --- viz., undergraduates in introductory psychology classes? This would be a good place to start.

More later…I hope.