Stability of Rasch Latent Trait between Paper-and-Pencil and Computer-Based Modes of Test Administration for Gender and Ethnicity Subgroups

Do-Hong Kim
dkim15@uncc.edu
University of North Carolina at Charlotte
Huynh Huynh
hhuynh@gwm.sc.edu

This study illustrates Rasch-based procedures for investigating the stability of the Rasch latent trait across paper and computer administration modes, using a 9th grade statewide English test. The Rasch dichotomous model was used to examine item-level mode effects for the total group and for four subgroups defined by gender and ethnicity. The procedures used were the robust Z statistic and the fit indices implemented in the WINSTEPS software. In the application, the robust Z statistic indicated that item parameter (difficulty) estimates were quite stable across modes for the total group. When the data were analyzed by subgroup, no conclusive evidence was found to suggest that the mode effects were related to gender or ethnicity. Taken together, the results suggest that the Rasch latent trait remains stable across modes of administration and that computerized testing had no noticeable effect on the item-level performance of students.

Objectives

The current study examined the stability of the Rasch latent trait between the paper-and-pencil testing (PPT) and computer-based testing (CBT) modes of the statewide end-of-course (EOC) grade 9 English test. The study focused on the total student group as well as on subgroups defined by gender and ethnicity. Because the PPT and CBT data were collected from the same test form, it was possible to compare student performance at the item level. Beyond documenting differences between PPT and CBT, this paper also contributes methodology that can be used in other comparability studies of the same nature that rely on the Rasch model.

Perspective

In recent years, computer-based administration has become more prevalent in statewide testing programs, partly because of the accountability and reporting requirements of the No Child Left Behind Act (NCLB). Under increasing pressure to reduce turnaround time for test results, many statewide assessment programs have adopted CBT. For the convenience of students, and depending on the availability of computer infrastructure at their schools, several states now offer both paper and computer modes of testing. In this dual environment, questions are often raised about the comparability of CBT and PPT scores. A number of comparability studies have recently been conducted for statewide computerized assessments (e.g., Fitzpatrick & Triscari, 2005; Nichols & Kirkpatrick, 2005; Poggio, Glasnapp, Yang, & Poggio, 2005). Many of these studies examined comparability at the total test level. To further ensure that test scores are equivalent in meaning, it is also necessary to examine item-level mode effects, because some items may be more susceptible to administration mode than others even when there is little or no mode effect at the total score level (Pommerich, 2004). Item-level mode effects can be identified by examining the stability of item parameter estimates across administration modes. Item-fit statistics can also be used to evaluate the performance of items. Misfitting items may indicate measurement disturbances such as guessing, item bias, content/person interactions, and curriculum/content interactions (Smith, 1991). If item characteristics appear to be unstable across modes, statistical adjustment or separate equating studies may be needed; without them, student scores obtained from the two administration modes are not comparable. Another critical issue that needs to be addressed with CBT is the comparability of administration modes for different subgroups of students.
Much of the early research on mode effects for subgroups involved young adult and adult populations in licensure, certification, admissions, and

Methods

Stability of the Rasch latent trait was examined by comparing Rasch item parameter estimates (b) and item fit statistics between PPT and CBT. Item calibration was carried out under the Rasch dichotomous model (Rasch, 1960; Wright & Stone, 1979), using the WINSTEPS software (Linacre, 2006). For each administration mode, a separate calibration was performed for the total group (i.e., all students under study) and for each of the four subgroups of female, male, African American, and White/Non-Hispanic students. After calibration, comparisons between PPT and CBT were made separately for each of these five groups. For example, the item b estimates and item fit statistics of female students taking PPT were compared to those of female students taking CBT. The robust Z statistic was used to identify items showing a substantial discrepancy between the PPT and CBT modes.
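The quantities compared in this section can be sketched in a few lines of Python. The sketch below is illustrative only and is not the WINSTEPS implementation. It assumes (a) the standard Rasch dichotomous response probability, (b) fit indices of the infit and outfit mean-square kind, and (c) a commonly used form of the robust Z statistic, in which each between-mode difference is centered at the median of the differences and scaled by 0.74 times their interquartile range. The function names and the 1.645 cutoff are assumptions made for illustration; the text above does not state the flagging criterion that was used.

```python
import math
import statistics


def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch dichotomous model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one item.

    responses: 0/1 scores; thetas: person ability estimates; b: item difficulty.
    """
    sq_resid_sum = 0.0   # sum of squared raw residuals
    variance_sum = 0.0   # sum of model variances p(1 - p)
    z_sq_sum = 0.0       # sum of squared standardized residuals
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        variance = p * (1.0 - p)
        sq_resid = (x - p) ** 2
        sq_resid_sum += sq_resid
        variance_sum += variance
        z_sq_sum += sq_resid / variance
    infit = sq_resid_sum / variance_sum   # information-weighted mean square
    outfit = z_sq_sum / len(responses)    # unweighted mean square
    return infit, outfit


def robust_z(values_ppt, values_cbt):
    """Robust Z for each item's CBT-minus-PPT difference (assumed form)."""
    if len(values_ppt) != len(values_cbt):
        raise ValueError("both modes must contain the same items")
    diffs = [c - p for p, c in zip(values_ppt, values_cbt)]
    median = statistics.median(diffs)
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; robust Z is undefined")
    return [(d - median) / (0.74 * iqr) for d in diffs]


def flag_items(z_values, cutoff=1.645):
    """Indices of items whose |robust Z| reaches the cutoff (hypothetical value)."""
    return [i for i, z in enumerate(z_values) if abs(z) >= cutoff]
```

Under these assumptions, the same robust_z function could be applied to item b estimates or to either fit statistic, once for each of the five groups, by passing the PPT and CBT values for the same items in the same order.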

Data source

This study was based on operational statewide test data from a state in the Southeast. Examinees took the grade 9 English test (an end-of-course, or EOC, exam) in either the PPT or the CBT mode during 2006. Scores on the grade 8 English language arts (ELA) test administered in 2005 were also available for all students. Because students were allowed to take the English EOC test in either mode, the CBT student group did not constitute a random sample of students from the state. To address the nonrandomness inherent in the data, a quasi-experimental approach was taken to ensure some degree of equivalence between the PPT and CBT student groups. Specifically, a stratified random sample of examinees taking the PPT was selected to match the characteristics of examinees taking the same test on CBT, using both school-level and student-level variables. School-level matching variables included school average scale scores on the 2005 grade 8 ELA test, school size, and percentage of White/Non-Hispanic students. Student-level matching variables included 2005 grade 8 ELA test score, gender, and ethnicity.
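The matching step can be illustrated with a minimal sketch of stratified random sampling. It is not the procedure actually used in the study: for brevity it stratifies on student-level variables only (gender, ethnicity, and a band of the 2005 grade 8 ELA score), and the field names, the band width of 10 score points, and the function names are all hypothetical. School-level variables could be added to the stratum key in the same way.

```python
import random
from collections import defaultdict


def stratum_key(student, band_width=10):
    """Stratum for one student; field names and band width are hypothetical."""
    score_band = student["ela_2005"] // band_width
    return (student["gender"], student["ethnicity"], score_band)


def matched_ppt_sample(ppt_pool, cbt_group, seed=0):
    """Draw PPT examinees so that stratum counts mirror the CBT group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the available PPT examinees by stratum.
    ppt_by_stratum = defaultdict(list)
    for student in ppt_pool:
        ppt_by_stratum[stratum_key(student)].append(student)

    # Count how many CBT examinees fall in each stratum.
    cbt_counts = defaultdict(int)
    for student in cbt_group:
        cbt_counts[stratum_key(student)] += 1

    # Sample without replacement within each stratum; if the PPT pool is
    # too small for a stratum, take every available candidate.
    sample = []
    for key, n_needed in cbt_counts.items():
        candidates = ppt_by_stratum.get(key, [])
        n_draw = min(n_needed, len(candidates))
        sample.extend(rng.sample(candidates, n_draw))
    return sample
```

A design choice worth noting is that the sketch silently draws fewer students when a stratum is short of PPT candidates; a real application would want to report such shortfalls, since they weaken the equivalence of the two groups.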

Results and conclusions

Overall, both modes of test administration produced similar patterns of results for the total group and the subgroups. Item-level analyses showed that item parameter estimates and item fit statistics were strongly correlated between PPT and CBT. Item parameter (difficulty; b) estimates were quite stable across modes, and only a few items were identified as having significant mode differences in b for the total group and subgroups. Interestingly, most items showing significant differences across modes favored examinees taking the test on computer. Only two items, however, did so consistently across all groups under study, and these two items were attached to the same reading passage.

Differences in fit statistics varied by group and by type of fit statistic. The overall differences were small, with no apparent patterns in the results. Many of the items displaying unstable item parameter estimates or item fit statistics across modes belonged to the Reading Comprehension and Analysis of Text content domains. When the data were analyzed by subgroup, the mode effects on item b estimates were more evident, although the overall effects were marginal. The number of items with significant b differences was higher in the gender and ethnicity subgroups than in the total group, with the largest number of unstable items identified for male students. The findings suggest that it is important to examine subgroups, because focusing on the performance of the student body as a whole may mask mode effects for specific groups of examinees. The effects may be small, but examining them can still provide more comprehensive information regarding mode effects.

Overall, there were no consistent, systematic differences in the results of item fit statistics across the total group and subgroups. Thus, there was no evidence suggesting that the mode effects were related to gender or ethnicity. In all, the findings of this study suggest that the Rasch latent trait remains stable across modes of administration for the statewide grade 9 English test, and that computerized testing had no adverse impact on the item-level performance of students as a whole or of the student subgroups. Finally, this study also illustrates Rasch-based procedures that can be used to assess differences in important characteristics of test items (such as item difficulty and item fit) when test administration changes from one mode (such as PPT) to another (such as CBT).
 
References

Fitzpatrick, S., & Triscari, R. (2005, April). Comparability studies of the Virginia computer-delivered tests. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Linacre, J. M. (2006). WINSTEPS version 3.63 [Computer software]. Chicago, IL: Author.

Nichols, P., & Kirkpatrick, R. (2005, April). Comparability of the computer-administered tests with existing paper-and-pencil tests in reading and mathematics tests. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A. J. (2005). A comparative evaluation of score results from computerized and paper and pencil mathematics testing in a large scale state assessment program. Journal of Technology, Learning, and Assessment, 3(6). Retrieved July 9, 2005, from http://www.jtla.org.

Pommerich, M. (2004). Developing computerized versions of paper-and-pencil tests: Mode effects for passage-based tests. Journal of Technology, Learning, and Assessment, 2(6). Retrieved July 9, 2005, from http://www.jtla.org.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

Smith, R. M. (1991). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.