Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
By: Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, Igor Kviatkovsky
Published: 2026-04-27
Category: cs.AI
Abstract
This paper proposes an LLM-as-a-judge framework for evaluating mathematical reasoning in large language models (LLMs), replacing rigid symbolic answer checks with a more robust, model-based assessment of correctness.
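To illustrate the "symbolic rigidity" the abstract refers to, the sketch below (hypothetical code, not the paper's implementation; the function name `rigid_match` is an assumption for illustration) shows how an exact-string answer check rejects mathematically equivalent answers, which motivates a judge-based evaluation:

```python
def rigid_match(pred: str, gold: str) -> bool:
    # Rigid check: exact string equality after trimming whitespace.
    # This is the kind of brittle comparison a judge model would replace.
    return pred.strip() == gold.strip()

# Mathematically equivalent answer pairs that a rigid check marks as wrong:
pairs = [("0.5", "1/2"), ("x = 2", "2"), (r"\frac{1}{2}", "1/2")]
results = [rigid_match(pred, gold) for pred, gold in pairs]
print(results)  # → [False, False, False]
```

An LLM judge would instead be prompted with the question, the reference answer, and the model's answer, and asked whether the two are equivalent, tolerating differences in notation and phrasing.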