SciCoQA: Quality Assurance for Scientific Paper–Code Alignment

By: Tim Baumgärtner, Nitay

Published: 2026-01-19

View on arXiv →

Abstract

We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best performing model in our evaluation, GPT-5, can only detect 45.7% of real-world paper-code discrepancies.

Impact: practical

Topics: 6

💡 Simple Explanation

When scientists publish a discovery, they often release computer code alongside the paper. Frequently, that code doesn't actually match what the paper describes, because of errors or later updates. This paper introduces SciCoQA, a benchmark of known paper-code mismatches used to test whether AI models can read a paper, inspect its code, and automatically spot the differences, like a spell-checker for scientific logic. Such tools would help ensure that scientific results are reliable and reproducible.

🎯 Problem Statement

Scientific papers and their official code implementations increasingly drift apart. These discrepancies hinder reproducibility, waste researcher time, and degrade the reliability of scientific claims, yet manual verification is too time-consuming to perform at scale.
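
To make this concrete, here is a hypothetical example of the kind of silent mismatch SciCoQA targets; the paper quote and code below are invented for illustration and do not come from the dataset:

```python
# Hypothetical illustration of a silent paper-code discrepancy.
# Paper (invented quote): "We train with SGD (learning rate 0.01) and
# initialize all linear layers with Xavier initialization."

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(128, 10)

# Mismatch 1: the code uses Kaiming, not Xavier, initialization.
nn.init.kaiming_uniform_(model.weight)

# Mismatch 2: the code uses Adam with lr=1e-3, not SGD with lr=0.01.
optimizer = optim.Adam(model.parameters(), lr=1e-3)
```

The script runs without error, so nothing flags the mismatch; only reading the paper and the repository side by side reveals it.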

🔬 Methodology

The authors construct SciCoQA from two real-world sources, GitHub issues that report mismatches in official codebases and reproducibility papers, and scale the dataset with a synthetic generation method that plants discrepancies in otherwise aligned paper-code pairs. They analyze the collected discrepancies in detail and propose a taxonomy of discrepancy types and categories. Models are then evaluated on whether, given a paper and its codebase, they can detect the planted or reported mismatch.
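
The authors' exact prompting setup isn't reproduced here, but a minimal sketch of the general recipe (pair a paper statement with a retrieved code snippet, then ask an LLM a targeted consistency question) might look like the following; the model name, prompt wording, and `check_consistency` helper are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of an LLM-based paper-code consistency check.
# Assumes the OpenAI Python SDK; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_consistency(paper_claim: str, code_snippet: str) -> str:
    """Ask a model whether a code snippet matches a claim from the paper."""
    prompt = (
        "Paper claim:\n"
        f"{paper_claim}\n\n"
        "Code snippet:\n"
        f"{code_snippet}\n\n"
        "Does the code faithfully implement the claim? "
        "Answer Yes or No, then justify briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper evaluates 21 different LLMs
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

claim = "All linear layers are initialized with Xavier initialization."
snippet = "nn.init.kaiming_uniform_(model.weight)"
print(check_consistency(claim, snippet))
```

In practice the hard part, which SciCoQA stresses, is retrieval and scale: the relevant snippet may be buried anywhere in a large repository, and the paper may omit the detail entirely.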

📊 Results

The final dataset contains 611 paper-code discrepancies (81 real, 530 synthetic) spanning AI, Physics, Quantitative Biology, and other computational science disciplines. The benchmark proves difficult for all 21 evaluated LLMs: the best performer, GPT-5, detects only 45.7% of the real-world discrepancies. Models struggle most on instances involving omitted paper details, long-context inputs, and data outside their pre-training corpus.

Key Takeaways

Paper-code alignment is far from solved: even the strongest evaluated model misses more than half of real-world discrepancies. SciCoQA provides a standard benchmark for measuring progress on the task, and automated verification is the next logical step for scientific integrity in the AI era.

🔍 Critical Analysis

SciCoQA addresses a critical pain point in modern science, but LLM-based verification over long papers and large repositories remains brittle and may be computationally expensive. Furthermore, with 530 of the 611 instances generated synthetically, and the real cases sourced from GitHub issues of presumably popular, well-maintained repositories, the benchmark may overestimate how well these methods transfer to 'wild' research code.

💰 Practical Applications

  • Subscription service for academic labs to 'audit' their papers before submission.
  • Licensing the QA engine to major publishers (Elsevier, Springer).
  • API access for automated due diligence platforms.

🏷️ Tags

#Reproducibility · #NLP · #Code Analysis · #Scientific QA · #LLMs · #Benchmarking

🏢 Relevant Industries

Artificial Intelligence · Scientific Publishing · Software Development · Education Technology