daVinci-Dev: Agent-native Mid-training for Software Engineering

By: Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Hanning Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, Pengfei Liu

Published: 2026-01-26

#cs.AI

Abstract

This paper introduces daVinci-Dev, a systematic agentic mid-training approach that equips large language models (LLMs) with foundational agentic behaviors for software engineering. It addresses the distribution mismatch between static training data and dynamic development environments by using "agent-native data" of two kinds: contextually-native trajectories from GitHub Pull Requests and environmentally-native trajectories from real Docker interactions. This enables LLMs to autonomously navigate, edit, and test complex codebases, achieving state-of-the-art resolution rates on SWE-Bench Verified.
