ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

By: Nearchos Potamitis, Lars Klein, Akhil Arora

Published: 2025-12-09

Category: cs.AI

Abstract

This paper presents ReasonBENCH, a new benchmark designed to evaluate and quantify the stability and consistency of reasoning capabilities in Large Language Models. The findings are vital for understanding and improving the reliability of LLMs in real-world applications where dependable decision-making is critical.

Impact: practical

Topics: 6

💡 Simple Explanation

Imagine a student who aces a math test but fails if you simply change the names of the people in the word problems. This paper shows that advanced AI models are like that student. The researchers created a test called ReasonBENCH that gives AIs variations of the same problem. They found that AIs often memorize answers or guess based on patterns, rather than truly understanding the logic, causing them to fail when the question looks slightly different.

🎯 Problem Statement

Large Language Models (LLMs) achieve high scores on reasoning benchmarks, but it is unclear if this reflects true reasoning capabilities or data contamination (memorization). Existing benchmarks are static, allowing models to overfit to specific phrasings, masking their inability to generalize logic across semantically equivalent contexts.

🔬 Methodology

The study proposes a benchmark framework that generates multiple perturbed versions of reasoning tasks (Arithmetic, Commonsense, Logic). Perturbations include surface-level changes (synonyms, formatting) and structural changes (order of premises). Model performance is evaluated with two metrics: 'Average Accuracy' and a 'Consistency Score', the probability that a model answers every variant of a seed problem correctly.
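A minimal sketch of how these two metrics relate, using hypothetical per-variant results (the seed IDs, data layout, and numbers below are illustrative assumptions, not the authors' evaluation harness):

```python
# Minimal sketch (illustrative data, not the paper's evaluation harness).
# results[seed_id] holds one boolean per perturbed variant of that seed problem:
# True if the model answered that variant correctly.
results = {
    "arith_001":  [True, True, False, True],    # 3/4 variants correct
    "logic_014":  [True, True, True, True],     # 4/4 variants correct
    "common_007": [False, True, False, False],  # 1/4 variants correct
}

# Average Accuracy: fraction of all (seed, variant) pairs answered correctly.
n_correct = sum(sum(variants) for variants in results.values())
n_total = sum(len(variants) for variants in results.values())
average_accuracy = n_correct / n_total

# Consistency Score: fraction of seeds for which *every* variant is correct.
consistency_score = sum(all(v) for v in results.values()) / len(results)

print(f"Average Accuracy:  {average_accuracy:.2f}")   # 0.67
print(f"Consistency Score: {consistency_score:.2f}")  # 0.33
```

The gap between the two numbers is exactly the instability the benchmark is designed to surface.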

📊 Results

The experiments demonstrated a widespread lack of stability across major LLMs. Models that scored high on standard benchmarks showed significant performance drops when faced with ReasonBENCH perturbations. The 'Consistency Score' was systematically lower than 'Average Accuracy', indicating that models often arrive at the right answer for the wrong reasons (chance or memorization). Chain-of-Thought prompting improved overall accuracy but did not fully mitigate the instability.
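As a rough illustration of why the two metrics diverge (an independence assumption of ours, not an analysis from the paper): if a model answers each of k variants of a seed correctly with independent probability p, it is consistent on that seed only with probability p^k, which shrinks quickly as k grows.

```python
# Illustrative only: with per-variant accuracy p and independent errors,
# the probability of answering all k variants of a seed correctly is p ** k.
p, k = 0.90, 4
print(f"Per-variant accuracy:        {p:.0%}")        # 90%
print(f"All-{k}-variant consistency:   {p ** k:.0%}")   # 66%
```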

✨ Key Takeaways

High accuracy on static benchmarks is a deceptive metric for reasoning capability. True reasoning requires stability across semantic variations. ReasonBENCH serves as a critical diagnostic tool, highlighting that current SOTA models are still brittle and prone to pattern-matching fallacies rather than executing abstract logic reliably.

🔍 Critical Analysis

ReasonBENCH provides a necessary reality check for the LLM industry. By systematically exposing the fragility of current reasoning capabilities, it challenges the narrative of AGI proximity. The methodology is rigorous, though the reliance on synthetic perturbations requires careful validation to ensure that the rewrites do not change how difficult the problems are for humans. It is a valuable contribution that shifts the focus from 'passing the test' to 'understanding the material'.

💰 Practical Applications

  • SaaS platform for evaluating corporate LLM deployments for robustness.
  • Licensing the perturbation dataset for model fine-tuning.
  • Consulting services for 'Hardening' AI agents against prompt injection and variability.

🏷️ Tags

#LLM #Reasoning #Benchmarking #Robustness #NLP #AISafety

🏢 Relevant Industries

Artificial Intelligence, Software Development, Fintech, LegalTech, Academic Research