Abstract
The development of automated essay grading systems requiring minimal human intervention has been pursued for decades. While these systems have advanced significantly for English, there is still a lack of in-depth analysis of modern Large Language Models for automatic essay scoring in Portuguese. This work addresses that gap by evaluating different language model architectures (encoder-only, decoder-only, and reasoning-based), along with fine-tuning and prompt engineering strategies. Our study focuses on scoring argumentative essays, written as practice exercises for the Brazilian national entrance exam, according to five trait-specific criteria. Our results show that no single architecture is uniformly dominant, and that encoder-only models offer a good balance between accuracy and computational cost. We obtain state-of-the-art results for the dataset, with trait-specific performance ranging from .60 to .73 as measured by Quadratic Weighted Kappa.
Type
Publication
Journal of the Brazilian Computer Society