Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight
By: Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati
Published: 2025-12-22
Abstract
Automating clinical risk score calculations can significantly reduce physician administrative burden and improve patient care. However, current benchmarks such as MedCalc-Bench, constructed using LLM-based extraction, risk perpetuating historical model errors, especially when used as RL reward signals. This work proposes treating benchmarks as "living documents" maintained through systematic, physician-in-the-loop pipelines. The authors leverage agentic verifiers to audit and relabel MedCalc-Bench, using automated triage to focus clinician attention on contentious cases. Their audit revealed significant discrepancies between the existing labels and medical ground truth. Fine-tuning a Qwen3-8B model on the corrected labels yielded an 8.7% absolute accuracy improvement, demonstrating the material impact of label noise. These findings highlight rigorous benchmark maintenance as crucial for genuine model alignment in safety-critical domains.
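To make the triage idea concrete, here is a minimal sketch of routing only verifier-benchmark disagreements to clinicians. All names here (BenchmarkCase, TOLERANCE, triage) are hypothetical illustrations, not the paper's actual pipeline, which the abstract does not specify in detail.

```python
# Sketch of the triage step described in the abstract: an agentic verifier
# recomputes each risk score, and only cases where it disagrees with the
# shipped benchmark label are escalated for physician review.
# All identifiers are invented for illustration.
from dataclasses import dataclass

TOLERANCE = 1e-6  # numeric labels closer than this are treated as equal


@dataclass
class BenchmarkCase:
    case_id: str
    note: str               # patient note the score is computed from
    benchmark_label: float  # label shipped with the benchmark
    verifier_label: float   # label recomputed by the agentic verifier


def triage(cases: list[BenchmarkCase]) -> tuple[list[BenchmarkCase], list[BenchmarkCase]]:
    """Split cases into auto-confirmed and physician-review buckets."""
    confirmed, contested = [], []
    for case in cases:
        if abs(case.benchmark_label - case.verifier_label) <= TOLERANCE:
            confirmed.append(case)  # verifier agrees: keep the label as-is
        else:
            contested.append(case)  # disagreement: queue for a physician
    return confirmed, contested
```

Under this scheme, physician effort scales with the disagreement rate rather than the benchmark size, which is what makes the physician-in-the-loop audit scalable.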