Stability of Rasch Latent Trait between Paper-and-Pencil and Computer-Based Modes of Test Administration for Gender and Ethnicity Subgroups
Do-Hong Kim
dkim15@uncc.edu
University of North Carolina at Charlotte
Huynh Huynh
hhuynh@gwm.sc.edu
This study illustrates Rasch-based procedures for investigating the stability
of the Rasch latent trait across paper and computer administration modes, using
a 9th-grade statewide English test. The Rasch dichotomous model was used to
examine item-level mode effects for the total group and four subgroups defined
by gender and ethnicity. The procedures used were the robust Z statistic and
the fit indices programmed in the WINSTEPS software. In the application, the
robust Z statistic indicated that item parameter (difficulty) estimates were
quite stable across modes for the total group. When the data were analyzed by
subgroup, no conclusive evidence was found to suggest that the mode effects
were related to gender or ethnicity. Taken together, the results suggest that
the Rasch latent trait remains stable across modes of administration and that
computerized testing had no noticeable effect on the item-level performance of
students.
Objectives
The current study examined the stability of the Rasch
latent trait between the paper-and-pencil testing (PPT) and computer-based
testing (CBT) modes of the statewide end-of-course (EOC) grade 9 English test.
The study focused on the total student group as well as subgroups aggregated by
gender and ethnicity. With PPT and CBT data collected from the same test form,
it was possible to compare student performance at the item level. Beyond
studying differences between PPT and CBT, this paper also contributes to the
methodology that can be used for other comparability studies of the same nature
that rely on the Rasch model.
Perspective
In recent years, computer-based administration has become more prevalent in
statewide testing programs, partly because of the accountability and reporting
requirements of the No Child Left Behind Act (NCLB) and the increasing
pressure to reduce turnaround time for test results. For the convenience of
students and depending on the availability of computer infrastructure at their
schools, several states are now offering both paper and computer modes of
testing. In this dual environment, questions are often raised about
comparability of CBT and PPT scores. Recently, a number of comparability
studies were conducted for statewide computerized assessments (e.g.,
Fitzpatrick & Triscari, 2005; Nichols & Kirkpatrick, 2005; Poggio, Glasnapp,
Yang, & Poggio, 2005). Many of these studies examined comparability at the
total test level. To further ensure that test scores are equivalent in
meaning, it is also necessary to examine item-level mode effects, because some
items may be more susceptible to administration mode even when there is little
or no mode effect at the total score level (Pommerich, 2004). Item-level mode
effects can be identified by examining the stability of item parameter
estimates across administration modes. Item-fit statistics can also be used to
evaluate the performance of items. Misfitting items may indicate measurement
disturbances such
as guessing, item bias, content/person interactions, and curriculum/content
interactions (Smith, 1991). If item characteristics appear to be unstable
across modes, statistical adjustment or separate equating studies may be
needed; without them, student scores obtained from the two administration modes
are not comparable. Another critical issue that needs to be addressed with CBT
is the comparability of administration modes for different subgroups of
students. Much of the early research on mode effects for subgroups involved
young adult and adult populations in licensure, certification, admissions, and
Methods
Stability of the Rasch latent trait was examined by comparing Rasch item
parameter estimates (b) and item fit statistics between PPT and CBT. Item
calibration was carried out under the Rasch dichotomous model (Rasch, 1960;
Wright & Stone, 1979), using the WINSTEPS software (Linacre, 2006).
For each administration mode, a separate calibration was performed for the
total group (i.e., all students under study) and for the four subgroups of female,
male, African American, and White/Non-Hispanic students. After calibration,
comparisons between PPT and CBT were made separately for each of these five
groups. For example, item b estimates and item fit statistics of
female students taking PPT were compared to those of female students taking
CBT. The robust Z statistic was used to identify items that showed a
substantial discrepancy between the PPT and CBT modes.
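The paper does not spell out the robust Z formula. A minimal sketch, assuming the commonly used form Z = (D - Md) / (0.74 x IQR), where D is an item's CBT-minus-PPT difference in b, Md is the median of those differences, and IQR is their interquartile range, might look like the following. The function names, the direction of the difference, the quartile method, and the 1.645 cutoff are illustrative assumptions, not details taken from this study.

```python
from statistics import median, quantiles


def robust_z(b_ppt, b_cbt):
    """Return one robust Z value per item for the CBT-minus-PPT difference in b.

    Assumed formula: Z = (D - median(D)) / (0.74 * IQR(D)).
    """
    if len(b_ppt) != len(b_cbt):
        raise ValueError("both modes must have estimates for the same items")
    diffs = [cbt - ppt for ppt, cbt in zip(b_ppt, b_cbt)]
    center = median(diffs)
    # Quartiles of the differences; the "inclusive" method is an assumption.
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust Z is undefined")
    scale = 0.74 * iqr  # 0.74 * IQR approximates the SD under normality
    return [(d - center) / scale for d in diffs]


def flag_items(z_values, cutoff=1.645):
    """Return indices of items whose |Z| exceeds the (assumed) cutoff."""
    return [i for i, z in enumerate(z_values) if abs(z) > cutoff]


if __name__ == "__main__":
    # Hypothetical difficulty estimates (logits) for five items in each mode.
    ppt = [-1.2, -0.4, 0.0, 0.6, 1.1]
    cbt = [-1.1, -0.5, 0.1, 0.5, 2.3]
    z = robust_z(ppt, cbt)
    print([round(v, 2) for v in z], flag_items(z))
```

Because the median and IQR are resistant to outliers, a few items with large mode differences do not inflate the scale against which all other items are judged, which is the usual motivation for this statistic over a mean-and-SD standardization.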
Data source
This study was based on an operational statewide test administration in a
state in the Southeast. Examinees were administered the grade 9 English test
(an end-of-course, EOC, exam) in either the PPT or the CBT mode during 2006.
Data were also available for all students for the grade 8 English language
arts (ELA) test that had been administered in 2005. Students were allowed to
take the English EOC test in either the PPT or the CBT mode. Thus, the CBT
student groups did not constitute a random sample of students from the state.
To address the nonrandomness inherent in the data, a
quasi-experimental approach was taken to assure some degree of equivalence
between the PPT and CBT student groups. To accomplish this, a
stratified random sample of examinees taking the PPT was selected to match the
characteristics of examinees taking the same test on CBT, using both school-
and student-level variables. School-level matching variables included school
average scale scores on the 2005 grade 8 ELA test, size of schools, and
percentage of White/Non-Hispanic students. Student-level matching variables
included 2005 grade 8 ELA test score, gender, and ethnicity.
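The exact sampling procedure is not described here. As a rough sketch only, one way to draw a stratified random sample of PPT examinees that mirrors the CBT group on the student-level variables is to define strata from gender, ethnicity, and a band of the prior ELA score, then sample from each PPT stratum as many examinees as the CBT group has in that stratum. The dictionary keys, the score band width, and the omission of the school-level variables are all simplifying assumptions.

```python
import random
from collections import Counter, defaultdict


def stratum_key(student, band_width=10):
    """Stratum = (gender, ethnicity, prior ELA score band); keys are hypothetical."""
    return (
        student["gender"],
        student["ethnicity"],
        student["ela_score"] // band_width,
    )


def matched_sample(ppt_students, cbt_students, seed=0, band_width=10):
    """Sample PPT examinees so stratum counts mirror those of the CBT group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    targets = Counter(stratum_key(s, band_width) for s in cbt_students)

    # Pool the PPT examinees by stratum.
    pools = defaultdict(list)
    for s in ppt_students:
        pools[stratum_key(s, band_width)].append(s)

    sample = []
    for key, n_needed in targets.items():
        pool = pools.get(key, [])
        # If a PPT stratum is too small, take everyone available in it.
        k = min(n_needed, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample
```

In practice the strata would also need to reflect the school-level variables, and sparse strata would need a documented fallback rule, since taking everyone available leaves the matched groups slightly unbalanced.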
Results and conclusions
Overall, both modes of test administration resulted in similar patterns across
the total group and subgroups. Item-level analyses showed that item parameter
estimates and item fit statistics were strongly correlated between PPT and CBT.
Item parameter (difficulty; b) estimates were quite stable across modes, and
only a few items were identified as having significant mode differences in b
for the total group and subgroups. Interestingly, most items showing
significant differences across modes favored examinees taking the test on
computer. Across all groups under study, only two items consistently favored
examinees taking the test on computer. These two items were
attached to the same reading passage.
Differences in fit statistics varied by group and by type of fit
statistic. The overall differences were small with no apparent patterns in the
results. Many of the items displaying unstable item parameter estimates or item
fit statistics across modes corresponded to the Reading Comprehension and
Analysis of Text content domains. When analyzing the data by subgroups, the
mode effects on item b estimates were more evident, although the overall
effects were marginal. The number of items with significant b differences was
higher in the gender and ethnicity subgroups than in the total group, with the
largest number of unstable items identified for male students. The findings
suggest that it is important to examine subgroups because the focus on the
performance of the student body as a whole may mask mode effects for specific
groups of examinees. Although these effects may be small, subgroup analyses
can still provide more comprehensive information regarding mode effects.
Overall, there were no consistent, systematic differences in the results of
item fit statistics across the total group and subgroups. Thus, there was no
evidence suggesting that the mode effects were related to gender or ethnicity.
In all, the findings of this study suggest that the Rasch latent trait remains
stable across modes of administration for the statewide grade 9 English test,
and that computerized testing had no adverse impact on the item-level
performance of students as a whole or of subgroups of students. Finally, this
study also illustrates Rasch-based procedures that can be used to assess
differences in important characteristics of test items (such as item difficulty
and item fit) when test administration changes from one mode (such as PPT) to
another (such as CBT).
 
References
Fitzpatrick, S., & Triscari, R. (2005, April).
Comparability studies of the Virginia computer-delivered tests. Paper presented
at the annual meeting of the American Educational Research Association,
Montreal, Canada.
Linacre, J. M. (2006). WINSTEPS version 3.63 [Computer Software]. Chicago, IL:
Author.
Nichols, P., & Kirkpatrick, R. (2005, April). Comparability of the
computer-administered tests with existing paper-and-pencil tests in reading and
mathematics tests. Paper presented at the annual meeting of the American
Educational Research Association, Montreal, Canada.
Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A. J. (2005). A comparative
evaluation of score results from computerized and paper and pencil mathematics
testing in a large scale state assessment program. Journal of Technology,
Learning, and Assessment, 3(6). Retrieved July 9, 2005, from http://www.jtla.org.
Pommerich, M. (2004). Developing computerized
versions of paper-and-pencil tests: Mode effects for passage-based tests.
Journal of Technology, Learning, and Assessment, 2(6). Retrieved July 9, 2005,
from http://www.jtla.org.
Rasch, G. (1960). Probabilistic models for some
intelligence and attainment tests. Copenhagen: Danish Institute for Educational
Research.
Smith, R. M. (1991). The distributional properties of Rasch
item fit statistic. Educational and Psychological Measurement, 51, 541-565.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA
Press.