
CHAPTER 6 Group Tests and Controversies in Ability Testing

TOPIC 6A Group Tests of Ability and Related Concepts

6.1 Nature, Promise, and Pitfalls of Group Tests

6.2 Group Tests of Ability

6.3 Multiple Aptitude Test Batteries

6.4 Predicting College Performance

6.5 Postgraduate Selection Tests

6.6 Educational Achievement Tests

The practical success of early intelligence scales such as the 1905 Binet-Simon test motivated psychologists and educators to develop instruments that could be administered simultaneously to large numbers of examinees. Test developers were quick to realize that group tests allowed for the efficient evaluation of dozens or hundreds of examinees at the same time. As reviewed in an earlier chapter, one of the first uses of group tests was for screening and assignment of military personnel during World War I. The need to quickly test thousands of Army recruits inspired psychologists in the United States, led by Robert M. Yerkes, to make rapid advances in psychometrics and test development (Yerkes, 1921). Many new applications followed immediately—in education, industry, and other fields. In Topic 6A, Group Tests of Ability and Related Concepts, we introduce the reader to the varied applications of group tests and also review a sampling of typical instruments. In addition, we explore a key question raised by the consequential nature of these tests—can examinees boost their scores significantly by taking targeted test preparation courses? This is but one of many unexpected issues raised by the widespread use of group tests. In Topic 6B, Test Bias and Other Controversies, we continue a reflective theme by looking into test bias and other contentious issues in testing.



6.1 NATURE, PROMISE, AND PITFALLS OF GROUP TESTS

Group tests serve many purposes, but the vast majority can be assigned to one of three types: ability, aptitude, or achievement tests. In the real world, the distinction among these kinds of tests often is quite fuzzy (Gregory, 1994a). These instruments differ mainly in their functions and applications, less so in actual test content. In brief, ability tests typically sample a broad assortment of proficiencies in order to estimate current intellectual level. This information might be used for screening or placement purposes, for example, to determine the need for individual testing or to establish eligibility for a gifted and talented program. In contrast, aptitude tests usually measure a few homogeneous segments of ability and are designed to predict future performance. Predictive validity is foundational to aptitude tests, and often they are used for institutional selection purposes. Finally, achievement tests assess current skill attainment in relation to the goals of school and training programs. They are designed to mirror educational objectives in reading, writing, math, and other subject areas. Although often used to identify the educational attainment of students, they also function to evaluate the adequacy of school educational programs.

Whatever their application, group tests differ from individual tests in five ways:

- Multiple-choice versus open-ended format
- Objective machine scoring versus examiner scoring
- Group versus individualized administration
- Applications in screening versus remedial planning
- Huge versus merely large standardization samples

These differences allow for great speed and cost efficiency in group testing, but a price is paid for these advantages.

Although the early psychometric pioneers embraced group testing wholeheartedly, they recognized fully the nature of their Faustian bargain: Psychologists had traded the soul of the individual examinee in return for the benefits of mass testing. Whipple (1910) summed up the advantages of group testing but also pointed to the potential perils:

Most mental tests may be administered either to individuals or to groups. Both methods have advantages and disadvantages. The group method has, of course, the particular merit of economy of time; a class of 50 or 100 children may take a test in less than a fiftieth or a hundredth of the time needed to administer the same test individually. Again, in certain comparative studies, e.g., of the effects of a week’s vacation upon the mental efficiency of school children, it becomes imperative that all S’s should take the tests at the same time. On the other hand, there are almost sure to be some S’s in every group that, for one reason or another, fail to follow instructions or to execute the test to the best of their ability. The individual method allows E to detect these cases, and in general, by the exercise of personal supervision, to gain, as noted above, valuable information concerning S’s attitude toward the test.

In sum, group testing poses two interrelated risks: (1) some examinees will score far below their true ability, owing to motivational problems or difficulty following directions and (2) invalid scores will not be recognized as such, with undesirable consequences for these atypical examinees. There is really no simple way to entirely avoid these risks, which are part of the trade-off for the efficiency of group testing.



However, it is possible to minimize the potentially negative consequences if examiners scrutinize very low scores with skepticism and recommend individual testing for these cases.

We turn now to an analysis of group tests in a variety of settings, including cognitive tests for schools and clinics, placement tests for career and military evaluation, and aptitude tests for college and postgraduate selection.


6.2 GROUP TESTS OF ABILITY

Multidimensional Aptitude Battery-II (MAB-II) The Multidimensional Aptitude Battery-II (MAB-II; Jackson, 1998) is a recent group intelligence test designed to be a paper-and-pencil equivalent of the WAIS-R. As the reader will recall, the WAIS-R is a highly respected instrument (now replaced by the WAIS-III), in its time the most widely used of the available adult intelligence tests. Kaufman (1983) noted that the WAIS-R was “the criterion of adult intelligence, and no other instrument even comes close.” However, a highly trained professional needs about 1½ hours just to administer the Wechsler adult test to a single person. Because professional time is at a premium, a complete Wechsler intelligence assessment—including administration, scoring, and report writing—easily can cost hundreds of dollars. Many examiners have long suspected that an appropriate group test, with the attendant advantages of objective scoring and computerized narrative report, could provide an equally valid and much less expensive alternative to individual testing for most persons.

The MAB-II was designed to produce subtests and factors parallel to the WAIS-R but employing a multiple-choice format capable of being computer scored. The apparent goal in designing this test was to produce an instrument that could be administered to dozens or hundreds of persons by one examiner (and perhaps a few proctors) with minimal training. In addition, the MAB-II was designed to yield IQ scores with psychometric properties similar to those found on the WAIS-R. Appropriate for examinees from ages 16 to 74, the MAB-II yields 10 subtest scores, as well as Verbal, Performance, and Full Scale IQs.

Although it consists of original test items, the MAB-II is mainly a sophisticated subtest-by-subtest clone of the WAIS-R. The 10 subtests are listed as follows:

Verbal           Performance
Information      Digit Symbol
Comprehension    Picture Completion
Arithmetic       Spatial
Similarities     Picture Arrangement
Vocabulary       Object Assembly

The reader will notice that Digit Span from the WAIS-R is not included on the MAB-II. The reason for this omission is largely practical: There would be no simple way to present a Digit-Span-like subtest in paper-and-pencil format. In any case, the omission is not serious. Digit Span has the lowest correlation with overall WAIS-R IQ, and it is widely recognized that this subtest makes a minimal contribution to the measurement of general intelligence.

The only significant deviation from the WAIS-R is the replacement of Block Design with a Spatial subtest on the MAB-II. In the Spatial subtest, examinees must mentally perform spatial rotations of figures and select one of five possible rotations presented as their answer (Figure 6.1). Only mental rotations are involved (although “flipped-over” versions of the original stimulus are included as distractor items). The advanced items are very complex and demanding.



The items within each of the 10 MAB-II subtests are arranged in order of increasing difficulty, beginning with questions and problems that most adolescents and adults find quite simple and proceeding upward to items that are so difficult that very few persons get them correct. There is no penalty for guessing and examinees are encouraged to respond to every item within the time limit. Unlike the WAIS-R in which the verbal subtests are untimed power measures, every MAB-II subtest incorporates elements of both power and speed: Examinees are allowed only seven minutes to work on each subtest. Including instructions, the Verbal and Performance portions of the MAB-II each take about 50 minutes to administer.

The MAB-II is a relatively minor revision of the MAB, and the technical features of the two versions are nearly identical. A great deal of psychometric information is available for the original version, which we report here. With regard to reliability, the results are generally quite impressive. For example, in one study of over 500 adolescents ranging in age from 16 to 20, the internal consistency reliability of Verbal, Performance, and Full Scale IQs was in the high .90s. Test–retest data for this instrument are also excellent. In a study of 52 young psychiatric patients, the individual subtests showed test–retest reliabilities that ranged from .83 to .97 (median of .90) for the Verbal scale and from .87 to .94 (median of .91) for the Performance scale (Jackson, 1984). These results compare quite favorably with the psychometric standards reported for the WAIS-R.
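
The reliability statistics reported above are easy to reproduce in principle. Below is a minimal Python sketch, using simulated scores rather than MAB-II data, of how internal consistency (Cronbach’s alpha) and test–retest reliability are computed; all variable names and values are illustrative.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency for an examinee-by-item score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))                       # latent trait
scores = ability + rng.normal(scale=0.5, size=(500, 20))  # 20 correlated items

print(round(cronbach_alpha(scores), 2))  # high alpha: items share one factor

# Test-retest reliability is simply the correlation between two administrations.
retest_totals = scores.sum(axis=1) + rng.normal(scale=2.0, size=500)
print(round(np.corrcoef(scores.sum(axis=1), retest_totals)[0, 1], 2))
```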

Factor analyses of the MAB-II are broadly supportive of the construct validity of this instrument and its predecessor (Lee, Wallbrown, & Blaha, 1990). Most recently, Gignac (2006) examined the factor structure of the MAB-II using a series of confirmatory factor analyses with data on 3,121 individuals reported in Jackson (1998). The best fit to the data was provided by a nested model consisting of a first-order general factor, a first-order Verbal Intelligence factor, and a first-order Performance Intelligence factor. The one caveat of this study was that Arithmetic did not load specifically on the Verbal Intelligence factor independent of its contribution to the general factor.



FIGURE 6.1 Demonstration Items from Three Performance Tests of the Multidimensional Aptitude Battery-II (MAB)

Source: Reprinted with permission from Jackson, D. N. (1984a). Manual for the Multidimensional Aptitude Battery. Port Huron, MI: Sigma Assessment Systems, Inc. (800) 265–1285.

Other researchers have noted the strong congruence between factor analyses of the WAIS-R (with Digit Span removed) and the MAB. Typically, separate Verbal and Performance factors emerge for both tests (Wallbrown, Carmin, & Barnett, 1988). In a large sample of inmates, Ahrens, Evans, and Barnett (1990) observed validity-confirming changes in MAB scores in relation to education level. In general, with the possible exception that Arithmetic does not contribute reliably to the Verbal factor, there is good justification for the use of separate Verbal and Performance scales on this test.

In general, the validity of this test rests upon its very strong physical and empirical resemblance to its parent test, the WAIS-R. Correlational data between MAB and WAIS-R scores are crucial in this regard. For 145 persons administered the MAB and WAIS-R in counterbalanced fashion, correlations between subtests ranged from .44 (Spatial/Block Design) to .89 (Arithmetic and Vocabulary), with a median of .78. WAIS-R and MAB IQ correlations were very healthy, namely, .92 for Verbal IQ, .79 for Performance IQ, and .91 for Full Scale IQ (Jackson, 1984a). With only a few exceptions, correlations between MAB and WAIS-R scores exceed those between the WAIS and the WAIS-R. Carless (2000) reported a similar, strong overlap between MAB scores and WAIS-R scores in a study of 85 adults for the Verbal, Performance, and Full Scale IQ scores. However, she found that 4 of the 10 MAB subtests did not correlate with the WAIS-R subscales they were designed to represent, suggesting caution in using this instrument to obtain detailed information about specific abilities.

Chappelle et al. (2010) obtained MAB-II scores for military personnel in an elite training program for AC-130 gunship operators. The officers who passed training (N = 59) and those who failed training (N = 20) scored above average (mean Full Scale IQs of 112.5 and 113.6, respectively), but there were no significant differences between the two groups on any of the test indices. This is a curious result insofar as IQ typically demonstrates at least mild predictive potential for real-world vocational outcomes. Further research on the MAB-II as a predictor of real-world results would be desirable.

The MAB-II shows great promise in research, career counseling, and personnel selection. In addition, this test could function as a screening instrument in clinical settings, as long as the examiner views low scores as a basis for follow-up testing with an individual intelligence test. Examiners must keep in mind that the MAB-II is a group test and, therefore, carries with it the potential for misuse in individual cases. The MAB-II should not be used in isolation for diagnostic decisions or for placement into programs such as classes for intellectually gifted persons.

A Multilevel Battery: The Cognitive Abilities Test (CogAT) One important function of psychological testing is to assess students’ abilities that are prerequisite to traditional classroom-based learning. In designing tests for this purpose, the psychometrician must contend with the obvious and nettlesome problem that school-aged children differ hugely in their intellectual abilities. For example, a test appropriate for a sixth grader will be much too easy for a tenth grader, yet impossibly difficult for a third grader.

The answer to this dilemma is a multilevel battery, a series of overlapping tests. In a multilevel battery, each group test is designed for a specific age or grade level, but adjacent tests possess some common content. Because of the overlapping content with adjacent age or grade levels, each test possesses a suitably low floor and high ceiling for proper assessment of students at both extremes of ability. Virtually every school system in the United States uses at least one nationally normed multilevel battery.

The Cognitive Abilities Test (CogAT) is one of the best school-based test batteries in current use (Lohman & Hagen, 2001). A recent revision of the test is the CogAT Multilevel Edition, Form 6, released in 2001. Norms for 2005 also are available. We discuss this instrument in some detail.

The CogAT evolved from the Lorge-Thorndike Intelligence Tests, one of the first group tests of intelligence intended for widespread use within school systems. The CogAT is primarily a measure of scholastic ability but also incorporates a nonverbal reasoning battery with items that bear no direct relation to formal school instruction. The two primary batteries, suitable for students in kindergarten through third grade, are briefly discussed at the end of this section. Here we review the multilevel edition intended for students in 3rd through 12th grade.

The nine subtests of the multilevel CogAT are grouped into three areas: Verbal, Quantitative, and Nonverbal, each including three subtests. Representative items for the subtests of the CogAT are depicted in Figure 6.2. The tests on the Verbal Battery evaluate verbal skills and reasoning strategies (inductive and deductive) needed for effective reading and writing. The tests on the Quantitative Battery appraise quantitative skills important for mathematics and other disciplines. The Nonverbal Battery can be used to estimate the cognitive level of students with limited reading skill, poor English proficiency, or inadequate educational exposure.

For each CogAT subtest, items are ordered by difficulty level in a single test booklet. However, entry and exit points differ for each of eight overlapping levels (A through H). In this manner, grade-appropriate items are provided for all examinees.
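
The overlapping-levels design can be pictured as eight windows into a single difficulty-ordered item pool. The sketch below is a hypothetical illustration; the item ranges are invented and do not match the actual CogAT entry and exit points.

```python
# Hypothetical entry/exit points into one difficulty-ordered booklet of 120 items.
# Adjacent levels share items, giving every level a low floor and a high ceiling.
LEVELS = {  # level: (first_item, last_item), inclusive
    "A": (1, 50),  "B": (11, 60),  "C": (21, 70),  "D": (31, 80),
    "E": (41, 90), "F": (51, 100), "G": (61, 110), "H": (71, 120),
}

def items_for(level: str) -> range:
    first, last = LEVELS[level]
    return range(first, last + 1)

# Overlap between adjacent levels links scores onto a common scale.
shared = set(items_for("A")) & set(items_for("B"))
print(min(shared), max(shared))  # 11 50: items administered at both levels
```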

The subtests are strictly timed, with limits that vary from 8 to 12 minutes. Each of the three batteries can be administered in less than an hour. However, the manual recommends three successive testing days for younger children. For older children, two batteries should be administered the first day, with a single testing period the next.


FIGURE 6.2 Subtests and Representative Items of the Cognitive Abilities Test, Form 6

Note: These items resemble those on the CogAT 6. Correct answers: 1: B. yogurt (the only dairy product). 2: D. swim (fish swim in the ocean). 3: E. bottom (the opposite of top). 4: A. I is greater than II (4 is greater than 2). 5: C. 26 (the algorithm is add 10, subtract 5, add 10 . . .). 6: A. −1 (the only answer that fits). 7: A (four-sided shape that is filled in). 8: D (same shape, bigger to smaller). 9: E (correct answer).

Raw scores for each battery can be transformed into an age-based normalized standard score with mean of 100 and standard deviation of 15. In addition, percentile ranks and stanines for age groups and grade level are also available. Interpolation was used to determine fall, winter, and spring grade-level norms.
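
Each of these conversions is a simple transformation of a z score computed from age-group norms. Below is a sketch under the stated scale (mean 100, SD 15); the raw-score mean and SD are hypothetical norm-table values, not CogAT parameters.

```python
from math import erf, sqrt, floor

NORM_MEAN, NORM_SD = 52.0, 9.0  # hypothetical raw-score norms for one age group

def standard_score(raw: float) -> float:
    z = (raw - NORM_MEAN) / NORM_SD
    return 100 + 15 * z  # age-based normalized standard score

def percentile_rank(raw: float) -> float:
    z = (raw - NORM_MEAN) / NORM_SD
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))  # normal CDF

def stanine(raw: float) -> int:
    z = (raw - NORM_MEAN) / NORM_SD
    return max(1, min(9, floor(2 * z + 0.5) + 5))  # half-SD bands, mean 5

raw = 61  # one z unit above the hypothetical mean
print(standard_score(raw), round(percentile_rank(raw)), stanine(raw))
# -> 115.0, 84th percentile, stanine 7
```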

The CogAT was co-normed (standardized concurrently) with two achievement tests, the Iowa Tests of Basic Skills and the Iowa Tests of Educational Development. Concurrent standardization with achievement measures is a common and desirable practice in the norming of multilevel intelligence tests. The particular virtue of joint norming is that the expected correspondence between intelligence and achievement scores is determined with great precision. As a consequence, examiners can more accurately identify underachieving students in need of remediation or further assessment for potential learning disability.
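
A hedged sketch of how co-normed scores support this use: regress achievement on ability in the joint norm sample, then flag students whose observed achievement falls well below the value predicted from their ability score. The data and the 1.5-SD cutoff are illustrative choices, not the publisher’s procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(100, 15, size=1000)  # CogAT-like standard scores
achievement = 100 + 0.7 * (ability - 100) + rng.normal(0, 9, size=1000)

# Expected achievement given ability, from the co-normed sample.
slope, intercept = np.polyfit(ability, achievement, 1)
residual = achievement - (slope * ability + intercept)

# Flag students scoring far below expectation for follow-up assessment.
flagged = np.where(residual < -1.5 * residual.std())[0]
print(len(flagged), "students flagged as possible underachievers")
```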

The reliability of the CogAT is exceptionally good. In previous editions, the Kuder-Richardson-20 reliability estimates for the multilevel batteries averaged .94 (Verbal), .92 (Quantitative), and .93 (Nonverbal) across all grade levels. The six-month test–retest reliabilities for alternate forms ranged from .85 to .93 (Verbal), .78 to .88 (Quantitative), and .81 to .89 (Nonverbal).

The manual provides a wealth of information on content, criterion-related, and construct validity of the CogAT; we summarize only the most pertinent points here. Correlations between the CogAT and achievement batteries are substantial. For example, the CogAT verbal battery correlates in the .70s to .80s with achievement subtests from the Iowa Tests of Basic Skills.

The CogAT batteries predict school grades reasonably well. Correlations range from the .30s to the .60s, depending on grade level, sex, and ethnic group. There does not appear to be a clear trend as to which battery is best at predicting grade point average. Correlations between the CogAT and individual intelligence tests are also substantial, typically ranging from .65 to .75. These findings speak well for the construct validity of the CogAT insofar as the Stanford-Binet is widely recognized as an excellent measure of individual intelligence.

Ansorge (1985) has questioned whether all three batteries are really necessary. He points out that correlations among the Verbal, Quantitative, and Nonverbal batteries are substantial. The median values across all grades are as follows:

Verbal and Quantitative .78

Nonverbal and Quantitative .78

Verbal and Nonverbal .72

Since the Quantitative battery offers little uniqueness, from a purely psychometric point of view there is no justification for including it. Nonetheless, the test authors recommend use of all batteries in hopes that differences in performance will assist teachers in remedial planning. However, the test authors do not make a strong case for doing this.

A study by Stone (1994) provides a notable justification for using the CogAT as a basis for student evaluation. He found that CogAT scores for 403 third graders provided an unbiased prediction of student achievement that was more accurate than teacher ratings. In particular, teacher ratings showed bias against Caucasian and Asian American students by underpredicting their achievement scores.

Raven’s Progressive Matrices (RPM) First introduced in 1938, Raven’s Progressive Matrices (RPM) is a nonverbal test of inductive reasoning based on figural stimuli (Raven, Court, & Raven, 1986, 1992). This test has been very popular in basic research and is also used in some institutional settings for purposes of intellectual screening.


RPM was originally designed as a measure of Spearman’s g factor (Raven, 1938). For this reason, Raven chose a special format for the test that presumably required the exercise of g. The reader is reminded that Spearman defined g as the “eduction of correlates.” The term eduction refers to the process of figuring out relationships based on the perceived fundamental similarities between stimuli. In particular, to correctly answer items on the RPM, examinees must identify a recurring pattern or relationship between figural stimuli organized in a 3 × 3 matrix. The items are arranged in order of increasing difficulty, hence the reference to progressive matrices.
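
A toy example may make “eduction of correlates” concrete. The snippet below encodes a made-up matrix item in which each row increases a shape count by one; it is a schematic stand-in, not an actual RPM item.

```python
# Toy 3 x 3 matrix item: cells hold shape counts; the rule is +1 across each row.
matrix = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, None],  # the missing cell the examinee must educe
]
options = [2, 4, 5, 6, 7]

step = matrix[0][1] - matrix[0][0]  # educe the rule from a complete row: +1
answer = matrix[2][1] + step        # apply it to the incomplete row: 5
print("correct option:", options.index(answer) + 1)
```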

Raven’s test is actually a series of three different instruments. Much of the confusion about validity, factorial structure, and the like stems from the unexamined assumption that all three forms should produce equivalent findings. The reader is encouraged to abandon this unwarranted hypothesis. Even though the three forms of the RPM resemble one another, there may be subtle differences in the problem-solving strategies required by each.

The Coloured Progressive Matrices is a 36-item test designed for children from 5 to 11 years of age. Raven incorporated colors into this version of the test to help hold the attention of the young children. The Standard Progressive Matrices is normed for examinees from 6 years and up, although most of the items are so difficult that the test is best suited for adults. This test consists of 60 items grouped into 5 sets of 12 progressions. The Advanced Progressive Matrices is similar to the Standard version but has a higher ceiling. The Advanced version consists of 12 problems in Set I and 36 problems in Set II. This form is especially suitable for persons of superior intellect.

Large-sample U.S. norms for the Coloured and Standard Progressive Matrices are reported in Raven and Summers (1986). Separate norms for Mexican American and African American children are included. Although there was no attempt to use a stratified random-sampling procedure, the selection of school districts was so widely varied that the American norms for children appear to be reasonably sound. Sattler (1988) summarizes the relevant norms for all versions of the RPM. Raven, Court, and Raven (1992) produced new norms for the Standard Progressive Matrices, but Gudjonsson (1995) has raised a concern that these data are compromised because the testing was not monitored.

For the Coloured Progressive Matrices, split-half reliabilities in the range of .65 to .94 are reported, with younger children producing lower values (Raven, Court, & Raven, 1986). For the Standard Progressive Matrices, a typical split-half reliability is .86, although lower values are found with younger subjects (Raven, Court, & Raven, 1983). Test–retest reliabilities for all three forms vary considerably from one sample to the next (Raven, 1965; Raven et al., 1986). For normal adults in their late teens or older, reliability coefficients of .80 to .93 are typical. However, for preteen children, reliability coefficients as low as .71 are reported. Thus, for younger subjects, RPM may not possess sufficient reliability to warrant its use for individual decision making.
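
Split-half reliability of the kind reported here is typically computed by correlating odd- and even-item half scores and stepping the result up to full test length with the Spearman-Brown formula. A minimal sketch on simulated dichotomous responses (not RPM data):

```python
import numpy as np

def split_half_reliability(items: np.ndarray) -> float:
    odd = items[:, 0::2].sum(axis=1)   # score on odd-numbered items
    even = items[:, 1::2].sum(axis=1)  # score on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)   # Spearman-Brown step-up

rng = np.random.default_rng(2)
theta = rng.normal(size=(300, 1))                            # latent ability
difficulty = np.linspace(-2, 2, 60)                          # 60 items, rising difficulty
p_correct = 1 / (1 + np.exp(-(theta - difficulty)))          # logistic item model
responses = (rng.random((300, 60)) < p_correct).astype(int)  # 0/1 scores

print(round(split_half_reliability(responses), 2))
```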

Factor-analytic studies of the RPM provide little, if any, support for the original intention of the test to measure a unitary construct (Spearman’s g factor). Studies of the Coloured Progressive Matrices reveal three orthogonal factors (e.g., Carlson & Jensen, 1980). Factor I consists largely of very difficult items and might be termed closure and abstract reasoning by analogy. Factor II is labeled pattern completion through identity and closure. Factor III consists of the easiest items and is defined as simple pattern completion (Carlson & Jensen, 1980). In sum, the very easy and the very hard items on the Coloured Progressive Matrices appear to tap different intellectual processes.

The Advanced Progressive Matrices breaks down into two factors that may have separate predictive validities (Dillon, Pohlmann, & Lohman, 1981). The first factor is composed of items in which the solution is obtained by adding or subtracting patterns (Figure 6.3a). Individuals performing well on these items may excel in rapid decision making and in situations where part–whole relationships must be perceived. The second factor is composed of items in which the solution is based on the ability to perceive the progression of a pattern (Figure 6.3b). Persons who perform well on these items may possess good mechanical ability as well as good skills for estimating projected movement and performing mental rotations. However, the skills represented by each factor are conjectural at this point and in need of independent confirmation.

A huge body of published research bears on the validity of the RPM. The early data are well summarized by Burke (1958), while later findings are compiled in the current RPM manuals (Raven & Summers, 1986; Raven, Court, & Raven, 1983, 1986, 1992). In general, validity coefficients with achievement tests range from the .30s to the .60s. As might be expected, these values are somewhat lower than found with more traditional (verbally loaded) intelligence tests. Validity coefficients with other intelligence tests range from the .50s to the .80s.


FIGURE 6.3 Raven’s Progressive Matrices: Typical Items

Also, as might be expected, the correlations tend to be higher with performance than with verbal tests. In a massive study involving thousands of schoolchildren, Saccuzzo and Johnson (1995) concluded that the Standard Progressive Matrices and the WISC-R showed approximately equal predictive validity and no evidence of differential validity across eight different ethnic groups. In a lengthy review, Raven (2000) discusses stability and variation in the norms for the Raven’s Progressive Matrices across cultural, ethnic, and socioeconomic groups over the last 60 years. Indicative of the continuing interest in this venerable instrument, Costenbader and Ngari (2001) describe the standardization of the Coloured Progressive Matrices in Kenya. Further indicating the huge international popularity of the test, Khaleefa and Lynn (2008) provide standardization data for 6- to 11-year-old children in Yemen.


Even though the RPM has not lived up to its original intention of measuring Spearman’s g factor, the test is nonetheless a useful index of nonverbal, figural reasoning. The recent updating of norms was a much-welcomed development for this well-known test, in that many American users were leery of the outdated and limited British norms. Nonetheless, adult norms for the Standard and Advanced Progressive Matrices are still quite limited.

The RPM is particularly valuable for the supplemental testing of children and adults with hearing, language, or physical disabilities. Often these examinees are difficult to assess with traditional measures that require auditory attention, verbal expression, or physical manipulation. In contrast, the RPM can be explained through pantomime, if necessary. Moreover, the only output required of the examinee is a pencil mark or gesture denoting the chosen alternative. For these reasons, the RPM is ideally suited for testing persons with limited command of the English language. In fact, the RPM is about as culturally reduced as possible: The test protocol does not contain a single word in any language. Mills and Tissot (1995) found that the Advanced Progressive Matrices identified a higher proportion of minority children as gifted than did a more traditional measure of academic aptitude (the School and College Ability Test).

Bilker, Hansen, Brensinger, and others (2012) developed a psychometrically sound 9-item version of the 60-item Standard Progressive Matrices (SPM) test. The short test cuts testing time to a fraction of the full test. Correlations of scores on the 9-item version with the full scale were in the range of .90 to .98, indicating a minimal loss of measurement accuracy. The short SPM promises to be highly useful for research applications.

Perspective on Culture-Fair Tests Cattell’s Culture Fair Intelligence Test (CFIT) and Raven’s Progressive Matrices (RPM) often are cited as examples of culture-fair tests, a concept with a long and confused history. We will attempt to clarify terms and issues here.

The first point to make is that intelligence tests are merely samples of what people know and can do. We must not reify intelligence and overvalue intelligence tests. Tests are never samples of innate intelligence or culture-free knowledge. All knowledge is based in culture and acquired over time. As Scarr (1994) notes, there is no such thing as a culture-free test.

But what about a culture-fair test, one that poses problems that are equally familiar (or unfamiliar) to all cultures? This would appear to be a more realistic possibility than a culture-free test, but even here the skeptic can raise objections. Consider the question of what a test means, which differs from culture to culture. In theory, a test of matrices would appear to be equally fair to most cultures. But in practice, issues of equity arise. Persons reared in Western cultures are trained in linear, convergent thinking. We know that the purpose of a test is to find the single, best answer and to do so quickly. We examine the 3 × 3 matrix from left to right and top to bottom, looking for the logical principles invoked in the succession of forms. Can we assume that persons reared in Nepal or New Guinea or even the remote, rural stretches of Idaho will do the same? The test may mean something different to them. Perhaps they will approach it as a measure of aesthetic progression rather than logical succession. Perhaps they will regard it as so much silliness not worthy of intense intellectual effort. To assume that a test is equally fair to all cultural groups merely because the stimuli are equally familiar (or unfamiliar) is inappropriate. We can talk about degrees of cultural fairness (or unfairness), but the notion that any test is absolutely culture-fair surely is mistaken.


6.3 MULTIPLE APTITUDE TEST BATTERIES

In a multiple aptitude test battery, the examinee is tested in several separate, homogeneous aptitude areas. Typically, the development of the subtests is dictated by the findings of factor analysis. For example, Thurstone developed one of the first multiple aptitude test batteries, the Primary Mental Abilities Test, a set of seven tests chosen on the basis of factor analysis (Thurstone, 1938).

More recently, several multiple aptitude test batteries have gained favor for educational and career counseling, vocational placement, and armed services classification (Gregory, 1994a). Each year hundreds of thousands of persons are administered one of these prominent batteries: the Differential Aptitude Test (DAT), the General Aptitude Test Battery (GATB), and the Armed Services Vocational Aptitude Battery (ASVAB). These batteries either used factor analysis directly for the delineation of useful subtests or were guided in their construction by the accumulated results of other factor-analytic research. The salient characteristics of each battery are briefly reviewed in the following sections.

The Differential Aptitude Test (DAT) The DAT was first issued in 1947 to provide a basis for the educational and vocational guidance of students in grades 7 through 12. Subsequently, examiners have found the test useful in the vocational counseling of young adults out of school and in the selection of employees. Now in its fifth edition (1992), the test has been periodically revised and stands as one of the most popular multiple aptitude test batteries of all time (Bennett, Seashore, & Wesman, 1982, 1984). Wang (1995) provides a succinct overview of the test.

The DAT consists of eight independent tests:

- Verbal Reasoning (VR)
- Numerical Reasoning (NR)
- Abstract Reasoning (AR)
- Perceptual Speed and Accuracy (PSA)
- Mechanical Reasoning (MR)
- Space Relations (SR)
- Spelling (S)
- Language Usage (LU)

A characteristic item from each test is shown in Figure 6.4.

The authors chose the areas for the eight tests based on experimental and experiential data rather than relying on a formal factor analysis of their own.


FIGURE 6.4 Differential Aptitude Tests and Characteristic Items

In constructing the DAT, the authors were guided by several explicit criteria:

- Each test should be an independent test: There are situations in which only part of the battery is required or desired.
- The tests should measure power: For most vocational purposes to which test results contribute, the evaluation of power—solving difficult problems with adequate time—is of primary concern.
- The test battery should yield a profile: The eight separate scores can be converted to percentile ranks and plotted on a common profile chart.
- The norms should be adequate: In the fifth edition, the norms are derived from 100,000 students for the fall standardization, 70,000 for the spring standardization.
- The test materials should be practical: With time limits of 6 to 30 minutes per test, the entire DAT can be administered in a morning or an afternoon school session.
- The tests should be easy to administer: Each test contains excellent “warm-up” examples and can be administered by persons with a minimum of special training.
- Alternate forms should be available: For purposes of retesting, the availability of alternate forms (currently forms C and D) will reduce any practice effects.

The reliability of the DAT is generally quite high, with split-half coefficients largely in the .90s and alternate-forms reliabilities ranging from .73 to .90, with a median of .83. Mechanical Reasoning is an exception, with reliabilities as low as .70 for girls. The tests show a mixed pattern of intercorrelations with each other, which is optimistically interpreted by the authors as establishing the independence of the eight tests. Actually, many of the correlations are quite high, and it seems likely that the eight tests reflect a smaller number of ability factors. Certainly, the Verbal Reasoning and Numerical Reasoning tests measure a healthy general factor, with correlations around .70 in various samples.

The manual presents extensive data demonstrating that the DAT tests, especially the VR + NR combination, are good predictors of other criteria such as school grades and scores on other aptitude tests (correlations in the .60s and .70s). For this reason, the combination of VR + NR often is considered an index of scholastic aptitude. Evidence for the differential validity of the other tests is rather slim. Bennett, Seashore, and Wesman (1974) do present results of several follow-up studies correlating vocational entry/success with DAT profiles, but their research methods are more impressionistic than quantitative; the independent observer will find it difficult to make use of their results. Schmitt (1995) notes that a major problem with the battery is the lack of discriminant validity between the eight subtests. With the exception of the Perceptual Speed and Accuracy test, all of the subscales are highly intercorrelated (.50 to .75). If one wants only a general index of the person’s academic ability, this is fine; if the scores on the subtests are to be used in some diagnostic sense, this level of intercorrelation makes statements about students’ relative strengths and weaknesses highly questionable.
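
Schmitt’s concern is straightforward to check whenever a matrix of subtest scores is available: compute the intercorrelations and flag pairs too high to support separate interpretation. The sketch below uses simulated scores dominated by a single general factor; the subtest abbreviations follow the DAT list above, but the data are invented.

```python
import numpy as np

def flag_high_intercorrelations(scores, names, cutoff=0.50):
    """Print subtest pairs whose correlation exceeds the cutoff."""
    r = np.corrcoef(scores, rowvar=False)  # subtests are columns
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if r[i, j] > cutoff:
                print(f"{names[i]} x {names[j]}: r = {r[i, j]:.2f}")

rng = np.random.default_rng(3)
g = rng.normal(size=(400, 1))                            # general factor
scores = 0.8 * g + rng.normal(scale=0.6, size=(400, 8))  # 8 g-loaded subtests
names = ["VR", "NR", "AR", "PSA", "MR", "SR", "S", "LU"]
flag_high_intercorrelations(scores, names)  # every pair lands near r = .64
```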

Even so, the revised DAT is better than previous editions. One significant improvement is the elimination of apparent sex bias on the Language Usage and Mechanical Reasoning tests—a source of criticism from earlier reviews. The DAT has been translated into several languages and is widely used in Europe for vocational guidance and research applications (e.g., Nijenhuis, Evers, & Mur, 2000; Colom, Quiroga, & Juan-Espinosa, 1999).

A computerized version of the DAT has been available for several years, although its equivalence to the traditional paper-and-pencil format cannot be taken for granted (Alkhadher, Clarke, & Anderson, 1998). We will have more to say about computerized testing in a later section of the book. For now, it will suffice to mention that the psychometric qualities of a test may shift when the mode of administration is changed. Using counterbalanced testing in which examinees completed both versions (half taking the traditional version first, half taking the computerized version first), Alkhadher et al. (1998) found that oil refinery trainees (N = 122) scored higher on one subtest of the computerized version than on the traditional version of the DAT, namely, the Numerical Ability subtest. The researchers conjectured that the computerized version reduced test fatigue, alleviated time pressure, and also provided novelty—thus boosting test performance modestly.
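
Mode equivalence is an empirical question, and the counterbalanced design described above is the standard way to answer it. Here is a sketch of the analysis step with simulated paired scores (a real study would also test for order effects across the two counterbalanced groups):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(4)
paper = rng.normal(30, 5, size=122)                # paper-and-pencil scores
computer = paper + rng.normal(0.8, 2.0, size=122)  # simulated small mode effect

t, p = ttest_rel(computer, paper)  # paired test: same examinees, both modes
print(f"mean difference = {np.mean(computer - paper):.2f}, t = {t:.2f}, p = {p:.4f}")
```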

The General Aptitude Test Battery (GATB) In the late 1930s, the U.S. Department of Labor developed aptitude tests to predict job performance in 100 specific occupations. In the 1940s, the department hired a panel of experts in measurement and industrial-organizational psychology to create a multiple aptitude test battery to assess the 100 occupations previously studied and many more. The outcome of this Herculean effort was the General Aptitude Test Battery (GATB), widely acknowledged as the premier test battery for predicting job performance (Hunter, 1994).

The GATB was derived from a factor analysis of 59 tests administered to thousands of male trainees in vocational courses (United States Employment Service, 1970). The interpretive standards have been periodically revised and updated, so the GATB is a thoroughly modern instrument even though its content is little changed. One limitation is that the battery is available mainly to state employment offices, although nonprofit organizations, including high schools and certain colleges, can make special arrangements for its use.

The GATB is composed of eight paper-and-pencil tests and four apparatus measures. The entire battery can be administered in approximately two-and-a-half hours and is appropriate for high school seniors and adults. The 12 tests yield a total of nine factor scores:

- General Learning Ability (intelligence) (G). This score is a composite of Vocabulary, Arithmetic Reasoning, and Three-Dimensional Space.
- Verbal Aptitude (V). Derived from a Vocabulary test that requires the examinee to indicate which two words in a set are either synonyms or antonyms.
- Numerical Aptitude (N). This score is a composite of both the Computation and Arithmetic Reasoning tests.
- Spatial Aptitude (S). Consists of the Three-Dimensional Space test, a measure of the ability to perceive two-dimensional representations of three-dimensional objects and to visualize movement in three dimensions.
- Form Perception (P). This score is a composite of Form Matching and Tool Matching, two tests in which the examinee must match identical drawings.
- Clerical Perception (Q). A proofreading test called Name Comparison, in which the examinee must match names under pressure of time.
- Motor Coordination (K). Measures the ability to quickly make specified pencil marks in the Mark Making test.
- Finger Dexterity (F). A composite of the Assemble and Disassemble tests, two measures of dexterity with rivets and washers.
- Manual Dexterity (M). A composite of Place and Turn, two tests requiring the examinee to transfer and reverse pegs in a board.

The nine factor scores on the GATB are expressed as standard scores with a mean of 100 and an SD of 20. These standard scores are anchored to the original normative sample of 4,000 workers obtained in the 1940s. Alternate-forms reliability coefficients for factor scores range from the .80s to the .90s. The GATB manual summarizes several studies of the validity of the test, primarily in terms of its correlation with relevant criterion measures. Hunter (1994) notes that GATB scores predict training success for all levels of job complexity. The average validity coefficient is a phenomenal .62.

The absolute scores are of less interest than their comparison to updated Occupational Aptitude Patterns (OAPs) for dozens of occupations. Based on test results for huge samples of applicants and employees in different occupations, counselors and employers now have access to a wealth of information about score patterns needed for success in a variety of jobs. Thus, one way of using the GATB is to compare an examinee’s scores with OAPs believed necessary for proficiency in various occupations.

Hunter (1994) recommends an alternative strategy based on composite aptitudes (Figure 6.5). The nine specific factor scores combine nicely into three general factors: Cognitive, Perceptual, and Psychomotor. Hunter notes that different jobs require various contributions of the Cognitive, Perceptual, and Psychomotor aptitudes. For example, an assembly line worker in an automotive plant might need high scores on the Psychomotor and Perceptual composites, whereas the Cognitive score would be less important for this occupation. Hunter’s research demonstrates that general factors dominate over specific factors in the prediction of job performance. Davison, Gasser, and Ding (1996) discuss additional approaches to GATB profile analysis and interpretation.
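
A minimal sketch of the composite strategy: group the nine factor scores into Hunter’s three general factors and average them. The grouping follows the usual description of Figure 6.5; the examinee’s scores and the use of a simple average are illustrative assumptions, not the operational GATB scoring rule.

```python
# One examinee's GATB factor scores (standard scores: mean 100, SD 20).
factors = {"G": 118, "V": 110, "N": 122, "S": 95, "P": 101, "Q": 98,
           "K": 88, "F": 90, "M": 85}

COMPOSITES = {  # Hunter's three general factors
    "Cognitive":   ["G", "V", "N"],
    "Perceptual":  ["S", "P", "Q"],
    "Psychomotor": ["K", "F", "M"],
}

for name, parts in COMPOSITES.items():
    print(f"{name}: {sum(factors[p] for p in parts) / len(parts):.1f}")
```

A profile like this one (strong Cognitive, average Perceptual, weaker Psychomotor) would point away from the assembly-line example in the text and toward cognitively demanding work.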

Van de Vijver and Harsveld (1994) investigated the equivalence of a computerized version of the GATB with the traditional paper-and-pencil version. Of course, only the cognitive and perceptual subtests were compared—tests of motor skills cannot be computerized. They found that the two versions were not equivalent. In particular, the computerized subtests produced faster but less accurate responses than the conventional subtests. Their research demonstrates once again that the equivalence of traditional and computerized versions of a test should not be assumed. This is an empirical question answerable only with careful research. Nijenhuis and van der Flier (1997) discuss a Dutch version of the GATB and its application in the study of cognitive differences between immigrants and majority group members in the Netherlands.


FIGURE 6.5 Specific and General Factors on the GATB

The Armed Services Vocational Aptitude Battery (ASVAB)

The ASVAB is probably the most widely used aptitude test in existence. This instrument is used by the Armed Services to screen potential recruits and to assign personnel to different jobs and training programs. The ASVAB is also available in a computerized version that is rapidly supplanting the original paper-and-pencil test (Segall & Moreno, 1999). The computerized ASVAB is discussed in more detail at the end of this section. More than 2 million examinees take the ASVAB each year. The current version consists of nine subtests, four of which produce the Armed Forces Qualification Test (AFQT), the common qualifying exam for all services (Table 6.1). Alternate-forms reliability coefficients for ASVAB scores are in the mid-.80s to mid-.90s, and test–retest coefficients range from the mid-.70s to the mid-.80s (Larson, 1994). The one exception is Paragraph Comprehension, with a reliability of only .50. The test is well normed on a representative sample of 12,000 persons between the ages of 16 and 23 years. The ASVAB manual reports a median validity coefficient of .60 with measures of training performance.

Decisions about ASVAB examinees are typically based on composite scores, not subtest scores. For example, an Electronics Composite is derived by combining Arithmetic Reasoning, Mathematics Knowledge, Electronics Information, and General Science. Persons scoring well on this composite might be assigned to electronics-related positions. Since the composite scores are empirically derived, new ones can be developed for placement decisions at any time. Composite scores are continually updated and revised.

At one point, the Armed Services relied heavily on the seven composites in the following list (Murphy, 1984). The Coding Speed subtest, listed here, is no longer used. The first three constitute academic composites, whereas the remaining four are occupational composites. The reader will notice that individual subtests may appear in more than one composite:

Academic Ability: Word Knowledge, Paragraph Comprehension, and Arithmetic Reasoning
Verbal: Word Knowledge, Paragraph Comprehension, and General Science
Math: Mathematics Knowledge and Arithmetic Reasoning
Mechanical and Crafts: Arithmetic Reasoning, Mechanical Comprehension, Auto and Shop Information, and Electronics Information
Business and Clerical: Word Knowledge, Paragraph Comprehension, Mathematics Knowledge, and Coding Speed
Electronics and Electrical: Arithmetic Reasoning, Mathematics Knowledge, Electronics Information, and General Science
Health, Social, and Technology: Word Knowledge, Paragraph Comprehension, Arithmetic Reasoning, and Mechanical Comprehension

TABLE 6.1 The Armed Services Vocational Aptitude Battery (ASVAB) Subtests

Arithmetic Reasoning*: 16-item test of arithmetic word problems based on simple calculation
Mathematics Knowledge*: 25-item test of algebra, geometry, fractions, decimals, and exponents
Word Knowledge*: 35-item test of vocabulary knowledge and synonyms
Paragraph Comprehension*: 15-item test of reading comprehension in short paragraphs
General Science: 25-item test of general knowledge in physical and biological science
Mechanical Comprehension: 25-item test of mechanical and physical principles
Electronics Information: 20-item test of electronics, radio, and electrical principles
Assembling Objects: 16-item test of mechanical and assembly concepts
Auto and Shop: 25-item test of basic knowledge of autos, shop practices, and tool usage

*Armed Forces Qualifying Test (AFQT).

The problem with forming composites in this manner is that they are so highly correlated with one another as to be essentially redundant. In fact, the average intercorrelation among these seven composite scores is .86 (Murphy, 1984)! Clearly, composites do not always provide differential information about specific aptitudes. Perhaps that is why recent editions of the ASVAB have steered clear of multiple, complex composites. Instead, the emphasis is on simpler composites composed of highly related constructs. For example, a Verbal Ability composite is derived from Word Knowledge and Paragraph Comprehension, two highly interrelated subtests. In like manner, a Math Ability composite is obtained from the combination of Arithmetic Reasoning and Mathematics Knowledge.
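The redundancy claim rests on a simple computation: the mean of the off-diagonal entries of the composites' correlation matrix. The sketch below illustrates the calculation with a made-up 3 x 3 matrix rather than Murphy's actual 7 x 7 values.

```python
# Average intercorrelation among composite scores: the mean of the
# off-diagonal entries of the correlation matrix. The 3x3 matrix here is
# illustrative, not Murphy's (1984) actual 7x7 ASVAB values.
import itertools

R = [[1.00, 0.88, 0.84],
     [0.88, 1.00, 0.86],
     [0.84, 0.86, 1.00]]

pairs = list(itertools.combinations(range(len(R)), 2))
avg_r = sum(R[i][j] for i, j in pairs) / len(pairs)
print(f"average intercorrelation = {avg_r:.2f}")  # -> 0.86
```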

Some researchers have concluded that the ASVAB does not function as a multiple aptitude test battery but achieves success in predicting diverse vocational assignments because the composites invariably tap a general factor of intelligence. For example, Dunai and Porter (2001) report favorably on the ASVAB as a predictor of entry-level success of radiography students in Air Force medical training. The ASVAB may be a good test of general intelligence, but it falls short as a multiple aptitude test battery. Another concern is that the test may possess different psychometric structures for men and women. Specifically, the Electronics Information subtest is a good measure of g (the general factor of intelligence) for men but not for women (Ree & Carretta, 1995). The likely explanation is that men are about nine times more likely to enroll in high school classes in electronics and auto shop; men, therefore, have the opportunity for their general ability to shape what they learn about electronics, whereas women typically do not. For women, scores on this subtest will, therefore, function as a measure of achievement (what has already been learned) but not as an index of aptitude (forecasting future results).

Research on a computerized adaptive testing (CAT) version of the ASVAB has been under way since the 1980s. Computerized adaptive testing is discussed in Topic 12B, Computerized Assessment and the Future of Testing. We provide a brief overview here. In CAT, the examinee takes the test while sitting at a computer terminal. The difficulty level of the items presented on the screen is continually readjusted as a function of the examinee's ongoing performance. In general, an examinee who answers a subtest item correctly will receive a harder item, whereas an examinee who fails that item will receive an easier item. The computer uses item response theory as a basis for selecting items. Each examinee receives a unique set of test items tailored to his or her ability level.
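The logic of adaptive item selection can be sketched with a one-parameter (Rasch) IRT model: after each response, nudge the ability estimate and administer the unused item whose difficulty is closest to it. This bare-bones Python illustration conveys the principle only; the CAT-ASVAB's actual item-selection and scoring procedures are considerably more sophisticated.

```python
# Bare-bones computerized adaptive testing loop under a Rasch (1-parameter
# IRT) model: pick the unused item whose difficulty is closest to the current
# ability estimate, then nudge the estimate up or down after each response.
import math, random

def p_correct(theta, b):
    """Rasch model: probability of a correct response given ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def adaptive_test(item_difficulties, true_theta, n_items=10, step=0.5):
    theta, pool = 0.0, list(item_difficulties)       # start at average ability
    for _ in range(n_items):
        b = min(pool, key=lambda d: abs(d - theta))  # most informative item
        pool.remove(b)
        correct = random.random() < p_correct(true_theta, b)
        theta += step if correct else -step          # crude update rule
        step *= 0.9                                  # shrink the step size
    return theta

random.seed(1)
bank = [i / 10 for i in range(-30, 31)]              # difficulties -3.0 to 3.0
print(f"estimated ability: {adaptive_test(bank, true_theta=1.2):.2f}")
```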

In 1990, the CAT-ASVAB began to replace the paper-and-pencil ASVAB. Currently, more than two-thirds of all military applicants are tested with the computerized version. Larson (1994) lists the reasons for adopting the CAT-ASVAB as follows:

Shorten overall testing time (adaptive tests require roughly one-half the items of standard tests).
Increase test security by eliminating the possibility that test booklets could be stolen.
Increase test precision at the upper and lower ability extremes.
Provide a means for immediate feedback on test scores, since the computers used for testing can immediately score the tests and output the results.
Provide a means for flexible test start times (unlike group-administered paper-and-pencil tests, for which everyone must start and stop at the same time, computer-based testing can be tailored to the examinees' personal schedules) (Larson, 1994).

Reliability and validity studies of the CAT-ASVAB provide strong support for its equivalence to the original test. In general, the computerized version of the instrument measures the same constructs as its paper-and-pencil counterpart—and does so in less time and with greater precision (Moreno & Segall, 1997). With the success of this project, the CAT-ASVAB and other tests likely will be expanded to measure new aspects of performance, such as response latencies, and to display unique item types, such as visuospatial tests of objects in motion (Larson, 1994). The CAT-ASVAB has the potential to change the future of testing.


6.4 PREDICTING COLLEGE PERFORMANCE

As almost every college student knows, a major use of aptitude tests is the prediction of academic performance. In most cases, applicants to college must contend with the Scholastic Assessment Test (SAT) or the American College Test (ACT) assessment program. Institutions may set minimum standards on the SAT or ACT for admission, based on the knowledge that low scores foretell college failure. In this section we explore the technical adequacy and predictive validity of the major college aptitude tests.

The Scholastic Assessment Test (SAT)

Formerly known as the Scholastic Aptitude Test, the Scholastic Assessment Test, or SAT, is the oldest of the college admissions tests, dating back to 1926. The SAT is published by the College Board (formerly the College Entrance Examination Board), a group formed in 1899 to provide a national clearinghouse for admissions testing. As noted by historian Fuess (1950), the purpose of a nationally based admissions test was "to introduce law and order into an educational anarchy which towards the close of the nineteenth century had become exasperating, indeed almost intolerable, to schoolmasters." Over the years, the test has been extensively revised, continuously updated, and repeatedly renormed. In the early 1990s, the SAT was renamed the Scholastic Assessment Test to emphasize changes in content and format. The new SAT assesses mastery of high school subject matter to a greater extent than its predecessor but continues to tap reasoning skills. The SAT represents the state of the art in aptitude testing.

The new SAT, released in 2005, consists of the SAT Reasoning Test and the SAT Subject Tests. The SAT Reasoning Test is used for college admission decisions, whereas the optional SAT Subject Tests typically are needed for advanced college placement in fields such as Biology, Chemistry, History, Foreign Languages, and Mathematics. We restrict our discussion here to the SAT Reasoning Test. For ease of discussion, we refer to it simply as the “SAT.”

The SAT consists of three sections, each containing three or four subtests (Table 6.2). The Critical Reading section involves reading individual paragraphs and then answering multiple-choice questions about the passages. The questions embody three approaches:

Vocabulary in Context—discerning the meaning of words from their context in the passage
Literal Comprehension—understanding significant information directly available in the passage
Extended Reasoning—following an argument or making inferences from the passage

TABLE 6.2 Sections and Subtests of the SAT Reasoning Test

Critical Reading: Extended Reasoning; Literal Comprehension; Vocabulary in Context
Math: Numbers and Operations; Algebra and Functions; Geometry and Measurement; Data Analysis, Statistics, and Probability
Writing: Essay; Improving Sentences; Identifying Sentence Errors; Improving Paragraphs

Some questions in the Critical Reading section also involve a complex form of fill-in-the-blank item. However, instead of testing for mere factual knowledge, the questions evaluate verbal comprehension. Here is a straightforward example:

Hoping to ________ the dispute, the family therapist proposed a concession that he felt would be ________ to both mother and daughter.

A. end . . . divisive B. overcome . . . unappealing C. protract . . . satisfactory D. resolve . . . acceptable E. enforce . . . useful

The correct answer is D. Of course, the SAT incorporates more difficult items of this genre.

The second part of the SAT is the Math section, consisting of three subtests. Collectively, these subtests assess basic math skills in algebra, geometry, statistics, and data analysis needed for successful navigation of college. Most of the questions are in multiple-choice format, for example:

A special lottery was announced to select the student who will live in the only luxury apartment in student housing. In all, 50 juniors, 125 sophomores, and 175 freshmen applied. However, juniors were allowed to purchase 4 tickets each. What is the probability that the room will be awarded to a junior?

A. 1/5 B. 1/2 C. 2/5 D. 1/7 E. 2/7

The correct answer is C. In addition to multiple-choice questions, the Math section includes several items that require the student to generate a single correct answer and then enter it on the response sheet. For example:

What value of x satisfies both equations below?

x² − 4 = 0
|4x + 6| = 2

The correct answer is −2. Strategies for finding a solution that might work with a multiple-choice question—trial and error, or process of elimination—are not likely to help with this style of question. Here the examinee must generate the correct answer by dint of careful analysis.
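Both sample answers can be verified mechanically, as the short sketch below shows: the lottery item reduces to counting tickets, and the grid-in item to scanning for a value that satisfies both equations.

```python
# Verify the two SAT sample items. Juniors hold 4 tickets each, so the
# probability item reduces to counting tickets; the grid-in item is a scan
# for the value satisfying both equations.

junior_tickets = 50 * 4                      # 200 tickets held by juniors
total_tickets = junior_tickets + 125 + 175   # 500 tickets in all
print(junior_tickets / total_tickets)        # -> 0.4, i.e., 2/5 (choice C)

solutions = [x for x in range(-10, 11)
             if x ** 2 - 4 == 0 and abs(4 * x + 6) == 2]
print(solutions)                             # -> [-2]
```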

The Writing portion of the SAT now consists of a 25-minute Essay section and three multiple-choice subtests that evaluate the ability of the examinee to improve sentences, identify sentence errors, and improve paragraphs. In the Essay test, the examinee reads a short excerpt and then writes a brief paper that takes a point of view. Here is an example of an excerpt and assignment:

A sense of happiness and fulfillment, not personal gain, is the best motivation and reward for one’s achievements. Expecting a reward of wealth or recognition for achieving a goal can lead to disappointment and frustration. If we want to be happy in what we do in life, we should not seek achievement for the sake of winning wealth and fame. The personal satisfaction of a job well done is its own reward.

Assignment: Are people motivated to achieve by personal satisfaction rather than by money or fame? Plan and write an essay in which you develop your point of view on this issue. Support your position with reasoning and examples taken from your reading, studies, experience, or observations. (College Board, 2005)

The essay is evaluated by two trained readers on a 1 to 6 scale, resulting in a total score of 2 to 12 for the Essay test. Students also receive a separate score, on a scale from 20 to 80, for the multiple-choice portion of the Writing section. These two scores are combined into the overall Writing section score. SAT scores for each of the three sections—Critical Reading, Math, and Writing—are reported on the familiar 200- to 800-point scale, with an approximate mean of 500 and standard deviation of 100.

Great care is taken in the construction of new forms of the SAT because unfailing reliability and a high degree of parallelism are essential to the mission of this testing program. Historically, the internal consistency reliability of all sections has fallen consistently in the range of .91 to .93; with only a few exceptions, test–retest correlations vary between .87 and .89. The standard error of measurement is 30 to 35 points.
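The reported standard error follows from the classical formula SEM = SD√(1 − r). A quick check with the section SD of 100 and the reliability range just cited shows why the SEM lands near 30 points:

```python
# Classical standard error of measurement: SEM = SD * sqrt(1 - reliability).
# With a section SD of 100 and reliabilities of .91 to .93, the SEM falls in
# the 26- to 30-point range, broadly consistent with the reported 30 to 35
# points for operational forms.
import math

sd = 100
for reliability in (0.91, 0.93):
    sem = sd * math.sqrt(1 - reliability)
    print(f"reliability {reliability}: SEM = {sem:.0f} points")
```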

Frey and Detterman (2004) conducted a sophisticated factor analytic study of the relationship between the SAT and g, or general intelligence. Results for 917 youth who took both the SAT and the ASVAB indicated a correlation of .82 between g (as extracted from ASVAB results) and SAT scores. They concluded that the SAT is an excellent measure of general cognitive ability.

The primary evidence for SAT validity is criterion-related, in this case, the ability to predict first-year college grades. Donlon (1984, chap. VIII) reports a wealth of information on this point for earlier editions; we can only summarize trends here. In 685 studies, the combined SAT Verbal and Math scores correlated .42, on average, with college first-year grade point average. Interestingly, high school record (e.g., rank or grade point average) fares better than the SAT in predicting college grades (r = .48). But the combination of SAT and high school record proves even more predictive; these variables correlated .55, on average, with college first-year grade point average. Of course, these findings reflect a substantial restriction of range: low-scoring high school students tend not to attend college. Donlon (1984) estimated that the real correlation without restriction of range (SAT + high school record) would be in the neighborhood of .65. According to the College Board website, the combination of SAT and high school GPA continues to provide a robust correlation (r = .62) with freshman grades. Based on a sample of 151,316 students attending 110 colleges and universities across the United States, these results leave no room for doubt as to the general predictive power of SAT scores (www.collegeboard.com). However, the results also show that for students whose best language is not English (e.g., children of recent immigrants), the crucial reading and writing portions of the SAT underpredict freshman grades.

The American College Test (ACT)

The American College Test (ACT) assessment program is a program of testing and reporting designed for college-bound students. In addition to traditional test scores, the ACT assessment program includes a brief 90-item interest inventory (based on Holland's typology) and a student profile section (in which the student may list subjects studied, notable accomplishments, work experience, and community service). We will not discuss these ancillary measures here, except to note that they are useful in generating the Student Profile Report, which is sent to the examinee and the colleges listed on the registration folder.

Initiated in 1959, the ACT is based on the philosophy that direct tests of the skills needed in college courses provide the most efficient basis for predicting college performance. In terms of the number of students who take it, the ACT occupies second place behind the SAT as a college admissions test. The four ACT tests require knowledge of a subject area, but emphasize the use of that knowledge:

English (75 questions, 45 minutes). The examinee is presented with several prose passages excerpted from published writings. Certain portions of the text are underlined and numbered, and possible revisions for the underlined sections are presented; in addition, "no change" is one choice. The examinee must choose the best option.

Mathematics (60 questions, 60 minutes). Here the examinee is asked to solve the kinds of mathematics problems likely to be encountered in basic college mathematics courses. The test emphasizes concepts rather than formulas and uses a multiple-choice format.

Reading (40 questions, 35 minutes). This subtest is designed to assess the examinee's level of reading comprehension; subscores are reported for social studies/sciences and arts/literature reading skills.

Science Reasoning (40 questions, 35 minutes). This test assesses the ability to read and understand material in the natural sciences. The questions are drawn from data representations, research summaries, and conflicting viewpoints.

In addition to the area scores listed previously, ACT results are also reported as an overall Composite score, which is the average of the four tests. ACT scores are reported on a standard-score scale with a maximum of 36 points. In 2012, the average ACT Composite score of high school graduates was 21.1, with a standard deviation of about 5 points.

Critics of the ACT program have pointed to the heavy emphasis on reading comprehension that saturates all four tests. The average intercorrelation of the tests is typically around .60. These data suggest that a general achievement/ability factor pervades all four tests; results for any one test should not be overinterpreted. Fortunately, college admission officers probably place the greatest emphasis on the Composite score. The ACT appears to measure much the same thing as the SAT; the correlation between these two tests approaches .90. It is not surprising, then, that the predictive validity of the ACT Composite score rivals the SAT combined score, with correlations in the vicinity of .40 to .50 with college first-year grade point average. The predictive validity coefficients are virtually identical for advantaged and disadvantaged students, suggesting that the ACT tests are not predictively biased.

Kifer (1985) does not question the technical adequacy of the ACT and similar testing programs but does protest the enormous symbolic power these tests have accrued. The heavy emphasis on test scores for college admissions is not a technical issue, but a social, moral, and political concern:

Selective admissions means simply that an institution cannot or will not admit each person who completes an application. Choices of who will or will not be admitted should be, first of all, a matter of what the institution believes is desirable and may or may not include the use of prediction equations. It is just as defensible to select on talent broadly construed as it is to use test scores however high. There are talented students in many areas—leaders, organizers, doers, musicians, athletes, science award winners, opera buffs—who may have moderate or low ACT scores but whose presence on a campus would change it.

The reader may wish to review Topic 6B, Test Bias and Other Controversies, for further discussion of this point.


6.5 POSTGRADUATE SELECTION TESTS

Graduate and professional programs also rely heavily on aptitude tests for admission decisions. Of course, many other factors are considered when selecting students for advanced training, but there is no denying the centrality of aptitude test results in the selection decision. For example, Figure 6.6 depicts a fairly typical quantitative weighting system used in evaluating applicants for graduate training in psychology. The reader will notice that an overall score on the Graduate Record Exam (GRE) receives the single highest weighting in the selection process. We review the GRE in the following sections, as well as admission tests used by medical schools and law schools.

FIGURE 6.6 Representative Weighting Scheme Used by Graduate Program Admission Committees in Psychology

Graduate Record Exam (GRE)

The GRE is a multiple-choice and essay test widely used by graduate programs in many fields as one component in the selection of candidates for advanced training. The GRE offers subject examinations in many fields (e.g., Biology, Computer Science, History, Mathematics, Political Science, Psychology), but the heart of the test is the general test, designed to measure verbal, quantitative, and analytical writing aptitudes. The verbal section (GRE-V) includes verbal items such as analogies, sentence completion, antonyms, and reading comprehension. The quantitative section (GRE-Q) consists of problems in algebra, geometry, reasoning, and the interpretation of data, graphs, and diagrams. The analytical writing section (GRE-AW) was added in October 2002 as a measure of higher-level critical thinking and analytical writing skills. It consists of two writing tasks: a 30-minute essay in which the applicant analyzes an issue, and a 30-minute essay in which the applicant analyzes an argument. Here is an example of an issue question:

As people rely more and more on technology to solve problems, the ability of humans to think for themselves will surely deteriorate.

Discuss the extent to which you agree or disagree with the statement and explain your reasoning for the position you take. In developing and supporting your position, you should consider ways in which the statement might or might not hold true and explain how these considerations shape your position. (www.ets.org/gre)

The argument questions entail reading a short paragraph that advances an argument and then writing a critique of that argument.

Beginning in 2012, the first two scores (GRE-V and GRE-Q) were reported as standard scores with a mean of about 150 and a range of 130 to 170. This new scaling metric represents a substantial change from the familiar GRE scale employed since the 1950s. Prior to 2012, the first two scores were reported as standard scores with a mean of about 500 and standard deviation of 100 (range of 200 to 800). Actually, the mean scores shifted from year to year because all test results were anchored to a standard reference group of 2,095 college seniors tested in 1952 on the verbal and quantitative portions of the test. Historically, graduate programs have paid more attention to the first two parts of the test (GRE-V and GRE-Q). Recently, programs have acknowledged the importance of writing skills among their applicants, which explains the addition of the analytical writing section (GRE-AW).

Scoring of the analytical writing section is based on 6-point holistic ratings provided independently by two trained raters. If the two scores differ by more than one point on the scale, the discrepancy is adjudicated by a third GRE-AW reader. According to the GRE Board (www.gre.org), the GRE-AW test reveals smaller ethnic group differences than are found in the multiple-choice sections. For example, the differences between African American and Caucasian examinees and between Hispanic and Caucasian examinees are smaller on the GRE-AW than on the GRE-V or GRE-Q. This suggests that the new test does not unduly penalize ethnic groups traditionally underrepresented in graduate programs.
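The two-rater rule translates directly into a small scoring function. Averaging the two ratings when they agree within a point is a common convention; exactly how the third reader's rating is combined is not specified here, so the sketch simply defers to the adjudicator.

```python
# Scoring rule for the GRE analytical writing section: two independent
# 6-point holistic ratings; if they differ by more than one point, a third
# reader adjudicates. Averaging agreeing ratings is a common convention;
# deferring wholly to the third reader is an assumption for illustration.

def score_essay(rater1, rater2, adjudicate):
    """adjudicate: callable returning a third reader's 1-6 rating."""
    if abs(rater1 - rater2) <= 1:
        return (rater1 + rater2) / 2
    return adjudicate()          # discrepancy: defer to the third reader

print(score_essay(4, 5, adjudicate=lambda: 4))   # -> 4.5 (raters agree)
print(score_essay(3, 6, adjudicate=lambda: 4))   # -> 4   (adjudicated)
```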

The reliability of the GRE is strong, with internal consistency reliability coefficients typically around .90 for the three components. The validity of the GRE commonly has been examined in relation to the ability of the test to predict performance in graduate school. Performance has been operationalized mainly as grade point average, although faculty ratings of student aptitude also have been used. For example, based on a meta-analytic review of 22 studies with a total of 5,186 students, Morrison and Morrison (1995) concluded that GRE-V correlated .28 and GRE-Q correlated .22 with graduate grade point average. Thus, on average, GRE scores accounted for only 6.3 percent of the variance in graduate-level academic performance. In a study of 170 graduate students in psychology at Yale University, Sternberg and Williams (1997) also found minimal correlations between GRE scores and graduate grades. When GRE scores were correlated with faculty ratings on five variables (analytical, creative, practical, research, and teaching abilities), the correlations were even lower, for the most part hovering right around zero. The single exception was the GRE analytical thinking score, which correlated modestly with almost all of the faculty ratings. However, this correlation was observed only for men (on the order of r = .3); for women it was almost exactly zero in every case! Based on these and similar studies, the consensus would appear to be that excessive reliance on the GRE for graduate school selection may overlook a talented pool of promising graduate students.
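The "6.3 percent of the variance" figure follows from squaring the correlations, since r² gives the proportion of criterion variance a predictor explains:

```python
# Proportion of variance explained is the squared correlation. Averaging the
# squared validities for GRE-V (.28) and GRE-Q (.22) reproduces the roughly
# 6 percent figure cited from Morrison and Morrison (1995).
r_v, r_q = 0.28, 0.22
variance_explained = (r_v ** 2 + r_q ** 2) / 2
print(f"{variance_explained:.1%}")   # -> 6.3%
```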

However, other researchers are more supportive in their evaluation of the GRE, noting that the correlation of GRE scores and graduate grades is not a good index of validity because of the restriction of range problem (Kuncel, Campbell, & Ones, 1998). Specifically, applicants with low GRE scores are unlikely to be accepted for graduate training in the first place and, thus, relatively little information is available with respect to whether low scores predict poor academic performance. Put simply, the correlation of GRE scores with graduate academic performance is based mainly on persons with middle to high levels of GRE scores, that is, GRE-V + GRE-Q totals of 1,000 and up. As such, the correlation will be attenuated precisely because those with low GREs are not included in the sample. Another problem with validating the GRE against grades in graduate school is the unreliability of the criterion (grades). Based on the expectation that graduate students will perform at high levels, some professors may give blanket A's, such that grades do not reflect real differences in student aptitude. This would lower the correlation between the predictor (GRE scores) and the criterion (graduate grades). When these factors are accounted for, many researchers find reason to believe the GRE is still a valid tool for graduate school selection (Powers, 2004).
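The criterion-unreliability argument can be quantified with the classical correction for attenuation, which divides the observed validity by the square root of the criterion's reliability. The grade reliability used below is a hypothetical value chosen for illustration:

```python
# Classical correction for attenuation due to an unreliable criterion:
#   r_true = r_observed / sqrt(criterion_reliability)
# The graduate-grade reliability of .60 is a hypothetical value chosen for
# illustration, not an estimate from the GRE literature.
import math

r_observed = 0.28          # GRE-V vs. graduate GPA (Morrison & Morrison, 1995)
grade_reliability = 0.60   # hypothetical reliability of graduate grades
print(f"disattenuated r = {r_observed / math.sqrt(grade_reliability):.2f}")
# -> disattenuated r = 0.36
```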

In a comprehensive meta-analysis of 1,753 independent groups of students, Kuncel, Hezlett, and Ones (2001) confirmed the validity of the GRE tests (Verbal, Quantitative, and Analytical) for the prediction of graduate student performance. The total sample size for their analysis was huge, including 82,659 students. The breadth of their investigation allowed them to code studies for several different forms of student accomplishment. GRE general test scores were significantly associated with the following student outcomes: first-year GPA, overall GPA, comprehensive exam scores, faculty ratings, and publication citation counts. The researchers also discovered that the GRE Psychology subject test outperformed the general test as a predictive measure of student success.

Medical College Admission Test (MCAT)

The MCAT is required of applicants to almost all medical schools in the United States. The test is designed to assess achievement of the basic skills and concepts that are prerequisites for successful completion of medical school. There are three multiple-choice sections: Verbal Reasoning, Physical Sciences, and Biological Sciences. The Verbal Reasoning section (40 questions) is designed to evaluate the ability to understand and apply information and arguments presented in written form. Specifically, the test consists of several passages of about 500 to 600 words each, taken from the humanities, social sciences, and natural sciences. Each passage is followed by several questions based on information included in the passage. The Physical Sciences section (52 questions) is designed to evaluate reasoning in general chemistry and physics. The Biological Sciences section (52 questions) is designed to evaluate reasoning in biology and organic chemistry. The physical and biological science sections each contain 10 to 11 problem sets described in about 250 words each, with several questions following.

Following the three required parts of the MCAT, an optional trial section of 32 questions is administered. This portion is not scored. The purpose of the trial section is to pretest questions for future exams. Some trial questions are designed for a new section of the MCAT, Psychological, Social, and Biological Foundations of Behavior, scheduled to commence in 2015. This new section will test knowledge of important concepts in introductory psychology, sociology, and biology, related to mental processes and behavior. The addition of this section acknowledges that effective doctors need to understand the whole person, including social and cultural determinants of health and health-related behaviors.

Each of the MCAT scores is reported on a scale from 1 to 15 (means of about 8.0 and standard deviations of about 2.5). The reliability of the test is lower than that of other aptitude tests used for selection, with internal consistency and split-half coefficients mainly in the low .80s (Gregory, 1994a). MCAT scores are mildly predictive of success in medical school, but once again the restriction of range conundrum (previously discussed in relation to the GRE) is at play. In particular, examinees with low MCAT scores, who would presumably confirm the validity of the test by performing poorly in medical school, are rarely admitted, which reduces the apparent validity of the test.


Julian (2005) confirmed the validity of the MCAT for predicting medical school performance by following 4,076 students who entered 14 medical schools in 1992 and 1993. Outcome variables included GPA and national medical licensing exam scores. When corrected for restriction of range, the predictive validity coefficients for MCAT scores were impressive, on the order of .6 for medical school grades, and as high as .7 for licensing exam scores. In fact, the MCAT scores were so strongly predictive of licensing exam scores that adding undergraduate GPAs into the equation did not appreciably boost the correlation. Julian (2005) concludes that MCAT scores essentially replace the need for undergraduate GPAs in medical school student selection because of their remarkable capacity to predict medical licensing exam scores.

Law School Admission Test (LSAT)

The LSAT is more than 60 years old. The test arose in the 1940s as a group effort from deans of leading law schools, who used first-year grades in the early validation of the instrument (LaPiana, 1998). Practicality was a major impetus for test development, as law schools were flooded with worthy applicants. There was also an idealistic desire to ensure that admission to law school was based on aptitude and potential, not on privilege or connection. A leading figure in LSAT development has noted:

What makes us Americans is our adherence to the system that governs our nation. If that's true, then being a lawyer is one of the most important jobs in American society because it is the lawyer's job to make sure the law works and serves people. And if that is true, then the American legal profession is much too important to be left in the hands of a self-perpetuating elite. It has to be open to all Americans with the talent and ability to do legal work, no matter how their last names are spelled or where they or their ancestors were born or the color of their skin (LaPiana, 1998, p. 12).

About 150,000 individuals take the LSAT each year. Of course, many other variables come into play in law school admissions, but test results probably are the single most important factor.

The LSAT is a half-day standardized test required of applicants to virtually every law school in the United States. The test is designed to measure skills considered essential for success in law school, including the reading and understanding of complex material, the organization and management of information, and the ability to reason critically and draw correct inferences. The LSAT consists of multiple-choice questions in four areas: reading comprehension, analytical reasoning, and two logical reasoning sections. An additional section is used to pretest new test items and to preequate new test forms, but this section does not contribute to the LSAT score. The score scale for the LSAT extends from a low of 120 to a high of 180. In addition to the objective portions, a 35-minute writing sample is administered at the end of the test. The section is not scored, but copies of the writing sample are sent to all law schools to which the examinee applies.

The LSAT has acceptable reliability (internal consistency coefficients in the .90s) and is regarded as a moderately valid predictor of law school grades. Yet, in one fascinating study, LSAT scores correlated more strongly with state bar test results than with law school grades (Melton, 1985). This speaks well for the validity of the test, insofar as it links LSAT scores with an important, real-world criterion.

In recent years, those responsible for law school admissions have shown interest in selection methods that go beyond the LSAT. One example is a promising project from the University of California, Berkeley, which ambitiously seeks to assess 26 traits identified as crucial to the effective performance of lawyers (Chamberlin, 2009). Using focus groups and individual interviews, psychologist Sheldon Zedeck and lawyer Marjorie Shultz distilled these 26 traits, which include varied capacities like practical judgment, researching the law, writing, integrity/honesty, negotiation skills, developing relationships, stress management, fact finding, diligence, listening, and community involvement/service. Next they developed realistic scenarios designed to evaluate one or more of these qualities. A sample question might ask the applicant to take the role of a team leader in a law firm. A verbal fight breaks out between two of the team members over the best way to proceed with the project. What should the team leader do? A number of options are listed, and the applicant is asked to rank them from best to worst. The format of the questions is varied. For other questions, the applicant might be asked to provide a short written response. Initial research with this yet-unnamed instrument indicates that it predicts success in the practice of law substantially better than the LSAT.


6.6 EDUCATIONAL ACHIEVEMENT TESTS

Achievement tests permit a wide range of potential uses. Practical applications of group achievement tests include the following:

To identify children and adults with specific achievement deficits who might need more detailed assessment for learning disabilities
To help parents recognize the academic strengths and weaknesses of their children and thereby foster individual remedial efforts at home
To identify classwide or schoolwide achievement deficiencies as a basis for redirection of instructional efforts
To appraise the success of educational programs by measuring the subsequent skill attainment of students
To group students according to similar skill level in specific academic domains
To identify the level of instruction that is appropriate for individual students

Thus, achievement tests serve institutional goals such as monitoring schoolwide achievement levels, but also play an important role in the assessment of individual learning difficulties. As previously noted, different kinds of achievement tests are used to pursue these two fundamental applications (institutional and individual). Institutional goals are best served by group achievement test batteries, whereas individual assessment is commonly pursued with individual achievement tests (even though group tests may play a role here, too). Here we focus on group educational achievement tests.

Virtually every school system in the nation uses at least one educational achievement test, so it is not surprising that test publishers have responded to the widespread need by developing a panoply of excellent instruments.

In the following sections, we describe three of the most widely used group standardized achievement tests, each distinctive in its own way. The Iowa Tests of Basic Skills (ITBS) is representative of the huge industry of standardized achievement testing used in virtually all school systems nationwide. The Metropolitan Achievement Test is of the same genre as the ITBS but embodies a new and powerful technique of reading assessment known as the Lexile approach and, thus, merits special attention. Finally, almost everyone has heard of the Tests of General Educational Development, known familiarly as the "GED." We would be remiss not to discuss this testing program.

Iowa Tests of Basic Skills (ITBS)

First published in 1935, the Iowa Tests of Basic Skills (ITBS) were most recently revised and restandardized in 2001. The ITBS is a multilevel battery of achievement tests that covers grades K through 8. A companion test, the Tests of Achievement and Proficiency (TAP), covers grades 9 through 12. In order to expedite direct and accurate comparisons of achievement and ability, the ITBS and the TAP were both concurrently normed with the Cognitive Abilities Test (CogAT), a respected group test of general intellectual ability.

The ITBS is available in several levels that correspond roughly with the ages of the potential examinees: levels 5–6 (grades K–1), levels 7–8 (grades 2–3), and levels 9–14 (grades 3–8). The basic subtests for the older levels measure vocabulary, reading, language, mathematics, social studies, science, and sources of information (e.g., uses of maps and diagrams). A brief description of the subtests for grades 3–8 is provided in Table 6.3.

From the first edition onward, the ITBS has been guided by a pragmatic philosophy of educational measurement. The manual states the purpose of testing as follows:

The purpose of measurement is to provide information which can be used in improving instruction. Measurement has value to the extent that it results in better decisions which directly affect pupils.

To this end, the ITBS incorporates a criterion-referenced skills analysis to supplement the usual array of norm-referenced scores. For example, one feature available from the publisher's scoring service is item-level information. This information indicates topic areas, items sampling the topic, and correct or wrong responses for each item. Teachers, therefore, have access to a wealth of diagnostic-instructional information for each student. Whether this information translates to better instruction—as the test authors desire—is very difficult to quantify. As Linn (1989) notes, "We must rely mostly on logic, anecdotes, and opinions when it comes to answering such questions."
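A criterion-referenced, item-level report of the kind described can be pictured as a per-topic tally of correct and incorrect responses. The topic names and responses in the sketch below are invented for illustration; the publisher's actual report format is richer.

```python
# Sketch of a criterion-referenced, item-level skills report: tally correct
# and incorrect responses by topic area. Topic names and responses below are
# invented for illustration only.
from collections import defaultdict

# (topic, correct?) for each item a student answered
responses = [("Capitalization", True), ("Capitalization", False),
             ("Punctuation", True), ("Punctuation", True),
             ("Math Computation", False), ("Math Computation", False)]

tally = defaultdict(lambda: [0, 0])          # topic -> [n_correct, n_items]
for topic, correct in responses:
    tally[topic][1] += 1
    tally[topic][0] += int(correct)

for topic, (n_correct, n_items) in tally.items():
    print(f"{topic}: {n_correct}/{n_items} correct")
```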

The technical properties of the ITBS are beyond reproach. Historically, internal consistency and equivalent-form reliability coefficients are mostly in the mid-.80s to low .90s. Stability coefficients for a one-year interval are almost all in the .70 to .90 range. The test is free from overt racial and gender bias, as determined by content evaluation and item bias studies. The year 2000 norms for the test were empirically developed from large, representative national probability samples.

TABLE 6.3 Brief Description of ITBS Subtests for Grades 3–8


Vocabulary: A word is presented in the context of a short phrase or sentence, and students select the correct meaning from multiple-choice alternatives.

Reading Comprehension: Students read a brief passage and answer multiple-choice questions that require inference or generalization.

Spelling: Each multiple-choice item presents four words, one of which may be misspelled, and a fifth option, "no mistakes."

Capitalization: Test items require students to identify errors of under- or overcapitalization present in brief written passages.

Punctuation: Multiple-choice items require students to identify errors of punctuation involving commas, apostrophes, quotation marks, colons, and so on, or to choose "no mistakes."

Usage and Expression: In the first part, students identify errors in usage or expression; in the second part, students choose the best way to express an idea.

Math Concepts and Estimation: Questions deal with computation, algebra, geometry, measurement, and probability and statistics.

Math Problem Solving and Data Interpretation: Questions may involve multistep word problems or interpretation of tables and graphs.

Math Computation: These test items require the use of one arithmetic operation (addition, subtraction, multiplication, or division) with whole numbers, fractions, and decimals.

Social Studies: These questions involve aspects of history, geography, economics, and so on that are ordinarily covered in most school systems.

Science: These test items involve aspects of biology, ecology, space science, and physical sciences ordinarily covered in most school systems.

Maps and Diagrams: These questions evaluate the ability to use maps for a variety of purposes such as determining locations, directions, and distances.

Reference Materials: These questions measure the ability to use reference materials and library resources.

Item content of the ITBS is judged relevant by curriculum experts and reviewers, which speaks to the content validity of the test (Lane, 1992; Linn, 1989). Although the predictive validity of the latest ITBS has not been studied extensively, evidence from prior editions is very encouraging. For example, ITBS scores correlate moderately with high school grades (r's around .60). The ITBS is not a perfect instrument, but it represents the best that modern test development methods can produce.

Metropolitan Achievement Test (MAT)

The Metropolitan Achievement Test dates back to 1930, when the test was designed to meet the curriculum assessment needs of New York City. The stated purpose of the MAT is "to measure the achievement of students in the major skill and content areas of the school curriculum." The MAT is concurrently normed with the Otis-Lennon School Ability Test (OLSAT).


Now in its eighth edition, the MAT is a multilevel battery designed for grades K through 12 and was most recently normed in 2000. The areas tested by the MAT include the traditional school-related skills:

Reading
Mathematics
Language
Writing
Science
Social Studies

An attractive feature of the MAT is that student reading scores are reported as Lexile measures, a new and practical indicator of reading level. Lexile measures are likely to become a standard feature in most group achievement tests in the years ahead, so it is worth a brief detour to explain their nature and significance.

Lexile Measures

The Lexile approach is a major improvement in the assessment of reading skill. It was developed over a span of more than 12 years using millions of dollars in grant funds from the National Institute of Child Health and Human Development (NICHD) (www.lexile.com). The Lexile approach is based on two simple, commonsense assumptions: (1) reading materials can be placed on a continuum of difficulty level (comprehensibility), and (2) readers can be ordered on a continuum of reading ability. The Lexile framework provides a common metric for matching readers and text, which, in turn, permits parents and educators to choose appropriate reading materials for children.

The Lexile scale is a true interval scale. The Lexile measure for a reading selection is a specific number indicating the reading demand of the text based on its semantic difficulty (vocabulary) and syntactic complexity (sentence length). Lexile measures for reading selections typically range from 200L to 1,700L (Lexiles). The Lexile score for a student, obtained from the Reading Comprehension test of the MAT or other achievement tests, is a precise index of the student's reading ability, calibrated on the same scale as the Lexile measure for text. The value of the Lexile approach is that student comprehension can be predicted as a function of the disparity between the demands of the text and the student's ability. For example, when readers are well targeted (the difference between text and reader is close to 0 Lexiles), research indicates that reader comprehension will be about 75 percent. When the text difficulty exceeds the reader's ability by 250L, comprehension drops to about 50 percent. When the skill of the reader exceeds the demands of the text by 250L, comprehension is about 90 percent (www.lexile.com).
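A rough comprehension forecast can be obtained by interpolating between the three anchor points just cited. Linear interpolation is an assumption made here for illustration; the actual Lexile framework uses its own forecasting model. The 700L-reader/910L-text example anticipates the Harry Potter illustration quoted below.

```python
# Rough comprehension forecast from the reader-minus-text Lexile gap, by
# linear interpolation between the three anchor points cited in the text
# (-250L -> 50%, 0L -> 75%, +250L -> 90%). Interpolation is an assumption;
# the Lexile framework uses its own proprietary forecasting model.

ANCHORS = [(-250, 0.50), (0, 0.75), (250, 0.90)]

def forecast_comprehension(reader_lexile, text_lexile):
    gap = reader_lexile - text_lexile
    gap = max(min(gap, 250), -250)            # clamp to the anchored range
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= gap <= x1:
            return y0 + (y1 - y0) * (gap - x0) / (x1 - x0)

print(f"{forecast_comprehension(700, 910):.0%}")  # 700L reader, 910L text -> 54%
```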

The Lexile approach has a number of potential benefits and applications for teachers and parents. Teachers can look up Lexile measures for specific books (the Lexile corporation has evaluated over 30,000 titles to date) as a way of building a library of titles at varying levels. They can also produce individualized reading lists suitable for each student; a brief sketch of this matching step appears after the excerpt below. Likewise, parents can select well-matched books to read to their children. Stenner (2001) captures the allure of the Lexile approach as follows:

One of the great strengths of the Lexile Framework is the way it encourages thought about what forecasted comprehension rate would be optimal for different instructional contexts. Harry Potter and the Goblet of Fire is a 910L text. Readers at 400L to 500L can nonetheless enjoy listening to this story read aloud. A 700L reader could read the text in a one-on-one tutoring context. A 900L reader will disappear for an hour or two, fully capable of self-engaging with the text, and a 1600L adult reader can become so engrossed that a two-hour plane ride flies by.

The Lexile approach is not a panacea, but it is a major improvement in the assessment of reading skill.
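To make the book-matching step concrete, here is a minimal sketch that filters a small library down to titles forecast to land in roughly the 70 to 80 percent comprehension band for a given reader. The Goblet of Fire measure comes from the excerpt above; the other titles and measures are invented for illustration.

```python
import math

def forecast_comprehension(reader_lexile, text_lexile):
    # Same illustrative logistic model as in the earlier sketch
    gap = reader_lexile - text_lexile
    return 1.0 / (1.0 + math.exp(-(gap + 250.0) * math.log(3) / 250.0))

# Lexile measures: the first is cited in the excerpt; the rest are hypothetical
library = {
    "Harry Potter and the Goblet of Fire": 910,
    "Hypothetical Title A": 620,
    "Hypothetical Title B": 880,
    "Hypothetical Title C": 1150,
}

def reading_list(reader_lexile, books, lo=0.70, hi=0.80):
    """Titles whose forecasted comprehension falls in the targeted band."""
    return [title for title, measure in books.items()
            if lo <= forecast_comprehension(reader_lexile, measure) <= hi]

print(reading_list(900, library))
# ['Harry Potter and the Goblet of Fire', 'Hypothetical Title B']
```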

Tests of General Educational Development (GED) Another widely used achievement test battery is the Tests of General Educational Development (GED), developed by the American Council on Education and administered nationwide for high school equivalency certification (www.acenet.edu). The GED consists of multiple-choice examinations in five educational areas:

Language Arts—Writing
Language Arts—Reading
Mathematics
Science
Social Studies

The Language Arts—Writing section also contains an essay question. Essays are scored independently by two trained readers according to a 6-point holistic method, in which each reader judges the essay’s overall effectiveness relative to that of other essays.

The GED comes in numerous alternate forms. Typically, internal consistency reliabilities for the subscales are above .90. However, the interrater reliability of scoring on the writing samples is more modest, typically between .60 and .70. These findings indicate that a liberal criterion for passing this subtest is appropriate so as to reduce decision errors. Regarding validity, the GED correlates strongly (r = .77) with the graduation reading test used in New York (Whitney, Malizio, & Patience, 1985). Furthermore, the standards for passing the GED are more stringent than those employed by most high schools: currently, individuals who receive a passing score for a GED credential outperform at least 40 percent of graduating high school seniors (www.acenet.edu).
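As a brief illustration of the interrater statistic just mentioned, the sketch below computes a Pearson correlation between two readers' 6-point holistic scores on the same set of essays. The ratings are invented for illustration; the GED program's actual reliability analyses are more extensive.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two sequences of paired ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical 6-point holistic scores from two independent readers
reader_1 = [4, 3, 5, 2, 4, 6, 3, 4, 5, 2]
reader_2 = [3, 4, 4, 3, 5, 5, 2, 3, 6, 3]

print(round(pearson_r(reader_1, reader_2), 2))  # 0.66, within the cited range
```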

The GED emphasizes broad concepts rather than specific facts and details. In general, the purpose of the GED is to allow adults who did not graduate from high school to prove that they have obtained an equivalent level of knowledge from life experiences or independent study. Employers regard the GED as equivalent (if not superior) to earning a high school diploma. Successful performance on the GED enables individuals to apply to colleges, seek jobs, and request promotions that require a high school diploma as a prerequisite. Rogers (1992) provides an unusually thorough review of the GED.

Additional Group Standardized Achievement Tests In addition to the previously described batteries, a few other widely used group standardized achievement tests deserve brief listing. These instruments are depicted in Table 6.4.

TABLE 6.4 Selected Group Achievement Tests for Elementary and Secondary School Assessment


Iowa Tests of Educational Development (ITED) Designed for grades 9 through 12, this test battery aims to measure progress toward the fundamental goals of education, that is, generalized skills that are independent of any particular curriculum. Most of the test items require the synthesis of knowledge or a multiple-step solution.

Tests of Achievement and Proficiency (TAP) This instrument is designed to provide a comprehensive appraisal of student progress toward traditional academic goals in grades 9 through 12. It is co-normed with the ITED and the CogAT.

Stanford Achievement Test (SAchT) Along with the ITBS, the SAchT is one of the leading contemporary achievement tests. Dating back more than 80 years and now in its tenth edition, it is administered to more than 15 million students every year.

TerraNova CTBS Designed for grades 1 through 12, this multilevel battery combines multiple-choice questions with constructed-response items that require students to produce correct answers rather than merely select them from alternatives.

TOPIC 6B Test Bias and Other Controversies

6.7 The Question of Test Bias

Case Exhibit 6.1 The Impact of Culture on Testing Bias

6.8 Social Values and Test Fairness

6.9 Genetic and Environmental Determinants of Intelligence

6.10 Origins and Trends in Racial IQ Differences

6.11 Age Changes in Intelligence

6.12 Generational Changes in IQ Scores

An intelligence test is a neutral, inconsequential tool until someone assigns significance to the results derived from it. Once meaning is attached to a person’s test score, that individual will experience many repercussions, ranging from superficial to life-changing. These repercussions will be fair or prejudiced, helpful or harmful, appropriate or misguided—depending on the meaning attached to the test score.

Unfortunately, the tendency to imbue intelligence test scores with inaccurate and unwarranted connotations is rampant. Laypersons and students of psychology commonly stray into one thicket of
