Abstract
The development of automated essay grading systems requiring minimal human intervention has been pursued for decades. While these systems have advanced significantly for English, there is still a lack of in-depth analysis of modern Large Language Models for automatic essay scoring in Portuguese. This work addresses that gap by evaluating different language model architectures (encoder-only, decoder-only, and reasoning-based), along with fine-tuning and prompt engineering strategies. Our study focuses on scoring argumentative essays, written as practice exercises for the Brazilian national entrance exam, according to five trait-specific criteria. Our results show that no single architecture is uniformly dominant, and that encoder-only models offer a good balance between accuracy and computational cost. We obtain state-of-the-art results for the dataset, with trait-specific performance ranging from .60 to .73 as measured by Quadratic Weighted Kappa.
Type
Publication
Journal of the Brazilian Computer Society