A Developmental Framework for the Measurement of Writing Ability
Hal Burdick hburdick@lexile.com MetaMetrics Inc.
Jack Stenner, MetaMetrics, jstenner@lexile.com
Carl Swartz, MetaMetrics, cswartz@lexile.com
This study provides the empirical foundation for a developmental scale of writer ability that generalizes across rater, prompt, rubric, and occasion. A developmental scale of writing ability allows educators, researchers, and policy-makers to monitor student growth in written language expression both within and across grades from elementary through post-secondary education. The proposed developmental scale is a significant advance over human and machine-scoring systems that employ raw scores based on holistic and analytical scoring rubrics. Participants (n=589) enrolled in grades 4, 6, 8, 10, or 12 in a rural/suburban school district in north-central Mississippi wrote one essay in response to each of six prompts: two narrative, two informative, and two persuasive. Three prompts, one from each genre, were shared with the adjacent grade, thus ensuring the connectivity needed to detect a developmental scale. Many-Facet Rasch Measurement (MFRM; Linacre, 1994) was used to create the developmental scale while adjusting for genre, prompt, and rubric variation. Results provide strong evidence that the goal of constructing a developmental scale of writing across grades 4 to 12 was achieved. Implications for research on writing assessment and instruction are discussed.
Introduction
Unlike reading and mathematical ability, which are routinely, if not universally, measured on a developmental scale potentially spanning grades K-16, writing achievement is typically assessed by raters (human or computer) and reported on a grade-specific raw score scale ranging from 1 to 4 or 1 to 6 points. Student essays are often rated on a single holistic 1-6 scale, whereas other testing programs attempt analytic scoring in which writing traits such as "organization" or "conventions knowledge" are similarly rated on a 1-4 or 1-6 scale. Whether a holistic or analytic scoring scheme is employed, the resulting raw scores are local to the context of measurement, and no pretense is made regarding how a fifth-grade student's score of "4" compares to a seventh-grade student's score of "4". With few exceptions, writing ability has not been expressed on a developmental scale. Thus, researchers, policy-makers, and educators know little about the qualitative and quantitative changes a developing writer undergoes from kindergarten through college, how these changes can be described by individual growth trajectories, or how reading and writing development co-vary. The difficulties in generating such a scale for writing are numerous and varied. For instance, writing assessment is time-consuming: a student can complete many multiple-choice items on a reading or mathematics assessment in the time it takes to produce a single essay. Furthermore, one essay is not enough to build the connectivity needed to produce a developmental scale using the Rasch model. Multiple prompts must be administered to the same students, requiring even more assessment time. For the scale to be developmental, the participants in a study need to come from different grade levels spanning the developmental continuum of writing.
Once enough data are collected from an appropriate sample of students, human raters need to score the essays against a rubric, which can be a costly procedure. Essays from each grade sample need to be scored in relation to essays from other grade samples to vertically calibrate the scale. Connectivity among raters is needed to ensure that the effects of rater severity on measures can be estimated. Various genres of writing need to be accounted for to determine their effect on prompt difficulty and the dimensionality of the writing construct. Enough replications of the measurement procedure need to occur to ensure the precision and accuracy of writing ability estimates along the scale.
Purpose of the Research
The purpose of the research was to evaluate whether a developmental scale for writing is feasible, and of what quality, when proper measurement methods are used. This paper details what that methodology should be and discusses the models needed for the scale's construction.
Methods
Participants and Design
A sample of 663 students from five grades (4, 6, 8, 10, and 12) in a school district in Mississippi participated in the research. Students in the fourth grade attended one of two elementary schools, all students in the sixth and eighth grades attended one middle school, and all students in the 10th and 12th grades attended one high school. Participants were selected from intact classrooms. The number of teachers at each grade was as follows: (1) 7 fourth-grade teachers, (2) 7 sixth-grade teachers, and (3) one teacher each for grades 8, 10, and 12. Each student responded to six National Assessment of Educational Progress (NAEP) writing prompts over a four-week period. A total of 18 prompts were used across the five grades. Under the linking design, 4th graders wrote to 3 prompts that they alone were assigned and 3 prompts shared with 6th graders; 6th graders also wrote to 3 prompts shared with 8th graders; 8th graders also wrote to 3 prompts shared with 10th graders; 10th graders also wrote to 3 prompts shared with 12th graders; and 12th graders also wrote to 3 prompts that they alone were assigned. Each set of shared prompts included all three genres. Only 6th and 10th graders received prompts that were not specifically targeted for their grade by NAEP. Students were trimmed from the sample if they did not complete at least 4 essays or either of the two companion reading tests. This left a trimmed sample of 589 students.
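The linking design described above can be sketched in a few lines of code. This is only an illustration: the block labels (`block_0`, `block_1`, ...) and function names are hypothetical, each block stands for a set of three prompts (one per genre), and the connectivity check is a simple graph search rather than the procedure used by any particular Rasch software.

```python
GRADES = [4, 6, 8, 10, 12]
PROMPTS_PER_BLOCK = 3  # one narrative, one informative, one persuasive


def build_linking_design(grades):
    """Assign hypothetical prompt-block labels to each grade.

    The first block is unique to the lowest grade, the last block is
    unique to the highest grade, and every other block is shared by
    two adjacent grades in the list.
    """
    design = {g: [] for g in grades}
    n_blocks = len(grades) + 1
    for b in range(n_blocks):
        label = f"block_{b}"
        if b > 0:
            design[grades[b - 1]].append(label)
        if b < len(grades):
            design[grades[b]].append(label)
    return design


def is_connected(design):
    """Return True if every grade is linked to every other grade
    through a chain of shared prompt blocks."""
    grades = list(design)
    seen = {grades[0]}
    frontier = [grades[0]]
    while frontier:
        g = frontier.pop()
        for other in grades:
            if other not in seen and set(design[g]) & set(design[other]):
                seen.add(other)
                frontier.append(other)
    return len(seen) == len(grades)


design = build_linking_design(GRADES)
all_blocks = {b for blocks in design.values() for b in blocks}
total_prompts = PROMPTS_PER_BLOCK * len(all_blocks)
prompts_per_student = {g: PROMPTS_PER_BLOCK * len(blocks)
                       for g, blocks in design.items()}

print(design)
print("total prompts:", total_prompts)
print("connected:", is_connected(design))
```

Under these assumptions the sketch reproduces the counts stated in the text: six blocks of three prompts give 18 prompts in total, and each grade receives two blocks, or six prompts per student.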
Procedure
Students wrote their essays with paper and pencil. Prompt order was randomized at the classroom level, so that all students within a class heard the same instructions for the task. These instructions included a teacher reading the prompt to the students and providing the administrative guidelines. Each student was given 5 minutes of planning time and an additional 25 minutes to complete an essay. Each essay was scored by 4 experienced raters employed by a leading firm that specializes in the scoring of essays with human raters. A total of 19 raters scored the resulting 3976 essays (3534 in the trimmed sample) using the NAEP six-point rubric. FACETS software was used to convert the raw scores into measures on a scale using a many-facet Rasch model. MGENOVA software was used to produce generalizability coefficients and average standard errors of measurement (ASEMs) to evaluate the reliability and precision of the instrumentation.
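For readers unfamiliar with the model, a many-facet Rasch model of the kind FACETS estimates is commonly written in the rating-scale form below. This is the general textbook form, given here as an assumption for illustration; it is not necessarily the exact specification used in this study, and the choice of facets and symbols is ours.

```latex
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

Here $P_{nijk}$ is the probability that writer $n$ receives rating $k$ from rater $j$ on prompt $i$, $B_n$ is the writer's ability, $D_i$ is the difficulty of the prompt, $C_j$ is the severity of the rater, and $F_k$ is the difficulty of the step from rubric category $k-1$ to category $k$. Additional facets, such as genre, would enter as further terms on the right-hand side.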
Results
Tables and figures do not translate well into this textbox. A full paper with tables and figures is available upon request.
Discussion
Direct assessment of writing ability is becoming more common as its role as a post-secondary gateway competency is more fully appreciated. Many practitioners are aware, however, that writing assessment technology is remarkably primitive when compared with reading or mathematics assessments. As a society we invest a great deal of money to obtain a raw rating on a 1-4 or 1-6 scale. These raw ratings are comparable neither within grade (because no adjustments are made for rater effects) nor across grades (because the developmental nature of writing ability is ignored). Reporting a raw score or percent correct on a statewide reading or mathematics test would be recognized immediately as 100-year-old technology, yet we accept such practices in the assessment of writing ability. The methodology and data presented in this paper demonstrate that the issues causing these limitations are surmountable and that a developmental scale assessing writing ability is possible using human ratings of essays.
The measurement context for writing is more complicated than the context for reading and mathematics assessment. However, psychometric models are available to handle the additional complexity. The present study employed a many-facet extension of the well-known Rasch model (Rasch, 1960). The FACETS computer program takes as input counts or ratings produced by a complicated measurement context (multiple prompts, multiple raters, multiple occasions of measurement) and outputs linear quantities that behave like centimeters and degrees Celsius. It is because these FACETS measures of writing ability behave as quantities that we can precisely model individual growth trajectories over the life course. Reading measures constructed with FACETS, WINSTEPS, and RUMM 2020 have been used to build individual growth trajectories for reading ability whose features are remarkably similar to those of individual growth trajectories for biological processes (height and weight). It seems likely that these similarities in growth trajectories appear because the measures of reading (and now potentially writing) behave like measures of height, weight, and temperature. However, these methods require a linking design in which students write to multiple prompts and are scored by multiple raters. The precision of the measures improves greatly with multiple administrations, demonstrating that precision is a function of the number of replications of the measurement process. With each new measure, however, the cost of assessing writing increases. Though a developmental scale is needed if we want to track a writer's growth over time, the benefit of that scale needs to be weighed against the considerable cost of its construction.
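The trade-off between replications and precision noted above can be illustrated with a small sketch. It assumes, purely for illustration, that each essay yields an independent measure with the same standard error, so the standard error of the averaged measure shrinks with the square root of the number of essays. The single-essay value of 1.0 is hypothetical and is not a result from this study.

```python
import math


def sem_of_mean(single_sem: float, n_replications: int) -> float:
    """Standard error of the average of n independent measures that
    each have the same standard error (an assumption of this sketch)."""
    if n_replications < 1:
        raise ValueError("n_replications must be at least 1")
    return single_sem / math.sqrt(n_replications)


SINGLE_ESSAY_SEM = 1.0  # hypothetical value in arbitrary scale units

for n in (1, 2, 4, 6):
    print(f"{n} essay(s): SEM = {sem_of_mean(SINGLE_ESSAY_SEM, n):.3f}")
```

Under these assumptions, halving the standard error requires four times as many essays, which is one way to see why gains in precision become progressively more expensive.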
The solution for reducing the cost of writing assessment, so that its efficacy equals that of reading and mathematics assessments, may reside in the development of a substantive theory that explains the variation in person abilities that writing assessments detect. The field of automated essay scoring has existed for 40 years and has advanced to the point that a computer can assess a student's writing ability with nearly the same accuracy as human raters (Shermis & Burstein, 2003). Underlying the algorithms within these programs must be a substantive theory as to what writing assessments measure.
Determining what writing assessments measure is our next step. Future research should test whether a writer develops similarly to a developing reader, by improving their semantic and syntactic store. Research is also needed on whether reading and writing can be measured on the same scale, whether the relationship between them is consistent throughout the developmental continuum, and whether the SEM for computer scoring of essays is equivalent to that for human scoring of essays.