Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
By: Ruicheng Ao, David Simchi-Levi, Xinshang Wang
Published: 2026-01-21
Category: cs.AI
Abstract
This work introduces two new benchmarks, ORDebug and ORBias, that integrate a solver into the evaluation loop for large language models. ORDebug assesses iterative self-correction when repairing infeasible operations research models, while ORBias evaluates behavioral rationality in newsvendor instances. This approach aims to improve the diagnostic and self-repair capabilities of large language models in practical optimization settings.
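For readers unfamiliar with the newsvendor setting, the following is a minimal sketch of the textbook critical-fractile solution, which gives the rational order quantity against which a model's answer could be compared. It is not taken from the paper: the normal demand assumption, the function names, and the pull-to-center measure (a bias documented in the behavioral operations literature) are illustrative assumptions, and the abstract does not state which metrics ORBias actually uses.

```python
from statistics import NormalDist


def newsvendor_quantity(price, cost, salvage, mean, std):
    """Expected-profit-maximizing order quantity under normal demand.

    Assumption: demand ~ Normal(mean, std). This is the standard
    critical-fractile result, not the paper's benchmark code.
    """
    underage = price - cost    # profit lost per unit of unmet demand
    overage = cost - salvage   # loss per unsold unit
    critical_ratio = underage / (underage + overage)
    return NormalDist(mean, std).inv_cdf(critical_ratio)


def pull_to_center(order, optimal, mean):
    """Hypothetical bias score: 0 means the order is optimal,
    1 means the order equals mean demand."""
    if optimal == mean:
        return 0.0  # optimum coincides with the mean, so no pull is measurable
    return (optimal - order) / (optimal - mean)


# Example: price 10, unit cost 2.5, no salvage value, demand ~ N(100, 20).
q_star = newsvendor_quantity(price=10, cost=2.5, salvage=0, mean=100, std=20)
# An order halfway between the optimum and the mean scores 0.5.
bias = pull_to_center(order=(q_star + 100) / 2, optimal=q_star, mean=100)
```

Here the critical ratio is 0.75, so the rational order sits above mean demand. An order that drifts back toward the mean would score between 0 and 1 on this hypothetical measure.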