Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight

By: Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

Published: 2025-12-22

#cs.AI

Abstract

Automating clinical risk score calculations can significantly reduce physician administrative burden and improve patient care. Current benchmarks like MedCalc-Bench, constructed using LLM-based extraction, risk perpetuating historical model errors, especially when used for RL rewards. This work proposes treating benchmarks as "living documents" maintained through systematic, physician-in-the-loop pipelines. The authors leverage agentic verifiers to audit and relabel MedCalc-Bench, using automated triage to focus clinician attention on contentious cases. The audit revealed significant discrepancies from medical ground truth. Fine-tuning a Qwen3-8B model on the corrected labels yielded an 8.7% absolute accuracy improvement, demonstrating the material impact of label noise. These findings highlight rigorous benchmark maintenance as crucial for genuine model alignment in safety-critical domains.
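
To make the triage idea concrete, here is a minimal Python sketch of disagreement-based routing under stated assumptions: independent automated verifiers recompute each score, and only cases where a verifier disagrees with the benchmark label, or the verifiers disagree with each other, are escalated to a physician. The `BenchmarkCase` fields, the `triage` function, and the tolerance parameter are illustrative names, not the paper's actual pipeline.

```python
# Illustrative sketch of disagreement-based triage (not the paper's code).

from dataclasses import dataclass


@dataclass
class BenchmarkCase:
    case_id: str
    patient_note: str
    calculator: str        # e.g., "CHA2DS2-VASc"
    benchmark_answer: float


def triage(cases, verifiers, tol=1e-6):
    """Split cases into auto-confirmed vs. physician-review buckets.

    `verifiers` is a list of callables; each independently recomputes the
    score from the patient note (e.g., an agentic LLM verifier with tools).
    """
    auto_confirmed, physician_review = [], []
    for case in cases:
        answers = [v(case.patient_note, case.calculator) for v in verifiers]
        agrees_with_label = all(
            abs(a - case.benchmark_answer) <= tol for a in answers
        )
        verifiers_agree = max(answers) - min(answers) <= tol
        if agrees_with_label and verifiers_agree:
            auto_confirmed.append(case)       # label kept as-is
        else:
            physician_review.append(case)     # contentious: needs a clinician
    return auto_confirmed, physician_review
```

The design choice to review any disagreement (rather than majority-vote the verifiers) matches the abstract's emphasis on concentrating scarce clinician attention on contentious cases instead of silently trusting automated relabeling.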
