A Comparative Analysis of Human and AI-Based Assessments of Second Language Essays Across Typed and Handwritten Formats: An Experimental Study
Karimi, Mehrnoush
Abstract
The issue of evaluating handwritten versus typed responses in standardized writing assessments first emerged two to three decades ago, when personal computing was still emerging. Although computer-based testing is now widespread, many high-stakes exams still require handwritten responses, so concerns about scoring fairness across response formats remain relevant and warrant renewed investigation. Moreover, the effect of response mode on rater behavior is relatively under-explored, and it remains unclear whether raters evaluate handwritten and typed essays differently. This dissertation investigates whether essay mode (typed versus handwritten) influences the ratings assigned by human raters and an AI-based rater. GPT-4 is included as a rater to examine whether AI systems, like human raters, are affected by format-related bias when scoring the same written content in different modes. Although AI-based scoring systems have been widely studied in assessment contexts, their reliability and fairness across response formats, especially with the addition of Optical Character Recognition (OCR) capabilities, remain relatively unexplored. Seventy-five essays from a paper-based, high-stakes English proficiency exam were transcribed into typed format to allow direct comparison. Essays were evaluated analytically by 10 human raters and by GPT-4 across four categories: Grammatical Accuracy, Vocabulary Range, Development and Task Completion, and Genre Appropriateness and Writing Conventions. Preliminary results revealed no significant overall score differences between formats at the group level. However, results from the Many-Facet Rasch Measurement (MFRM) model showed that seven of the ten human raters exhibited statistically significant format-related bias, and 27 of the 75 essays (36%) showed statistically significant differences in their Rasch scores across the two formats, indicating mode-related rating discrepancies. While the AI rater was generally consistent, a rater-by-format-by-criterion analysis showed that it, too, exhibited systematic variation when evaluating specific traits. Qualitative analysis further revealed that all human raters expressed a preference for evaluating typed essays; despite this stated preference, the quantitative findings showed that some raters favored handwritten versions. These findings shed light on persistent mode effects in scoring and underscore the need for rater training, equitable test design, and responsible integration of AI into assessment systems.
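For readers unfamiliar with MFRM, the bias findings reported above are consistent with a standard many-facet Rasch specification in which the log-odds of an essay receiving a given score are decomposed into additive facets. The exact facet structure used in the dissertation is not given in this abstract; the sketch below is illustrative only and assumes essay, rater, format, and criterion facets.

$$\log\!\left(\frac{P_{njmik}}{P_{njmi(k-1)}}\right) = B_n - C_j - A_m - D_i - F_k$$

Here $B_n$ is the proficiency reflected in essay $n$, $C_j$ the severity of rater $j$, $A_m$ the difficulty associated with format $m$ (typed or handwritten), $D_i$ the difficulty of scoring criterion $i$, and $F_k$ the threshold for score category $k$ relative to $k-1$. In this kind of model, format-related bias is detected through interaction terms (e.g., rater-by-format) added to the baseline specification.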
Files
- Embargoed until 2027-07-25
