Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

This work introduces two benchmarks, ORDebug and ORBias, that place a solver in the evaluation loop for AI models. ORDebug assesses iterative self-correction when repairing infeasible operations research models, while ORBias evaluates behavioral rationality on newsvendor instances. The approach aims to improve the diagnostic and self-repair capabilities of large language models in practical optimization settings.

Abstract
