In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution
By: Frank Xiao, Santiago Aranguri
Published: 2026-02-23
cs.AI
Abstract
This paper addresses undesirable emergent behaviors in large language models (LLMs) deployed in production environments. It proposes a data attribution method that traces these behaviors back to the post-training data that induced them, so that the responsible examples can be identified and the behaviors mitigated. Improving the safety and reliability of deployed models in this way is a prerequisite for their wider adoption.
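The abstract does not specify which attribution technique the authors use. As a rough illustration only, a common first-order approach (in the style of gradient-based influence scores such as TracIn) ranks training examples by how well their loss gradients align with the gradient of the loss on a query that exhibits the undesired behavior. The toy logistic model, data, and all function names below are assumptions for this sketch, not the paper's method:

```python
import numpy as np

def logistic_grad(w, x, y):
    # Gradient of the logistic loss for a single (x, y) pair, y in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def attribution_scores(w, X_train, y_train, x_bad, y_bad):
    # First-order influence proxy: dot product between the gradient of the
    # undesired behavior's loss and each training example's gradient.
    g_bad = logistic_grad(w, x_bad, y_bad)
    return np.array([logistic_grad(w, x, y) @ g_bad
                     for x, y in zip(X_train, y_train)])

# Toy setup: cluster 0 sits far from the decision boundary (confidently
# classified, small gradients); cluster 1 sits near both the boundary and
# the "bad" query, so its examples should receive the highest scores.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-3, 0.3, (5, 2)),   # indices 0-4
                     rng.normal(1, 0.3, (5, 2))])   # indices 5-9
y_train = np.array([0] * 5 + [1] * 5)
w = np.array([1.0, 1.0])                  # stand-in for post-trained weights
x_bad, y_bad = np.array([1.2, 0.9]), 1    # query exhibiting the behavior

scores = attribution_scores(w, X_train, y_train, x_bad, y_bad)
print(int(np.argmax(scores)))  # index of the most implicated training example
```

In a real post-training pipeline the same idea would use per-example gradients of the LLM's loss (or a low-rank approximation of them); the top-scoring examples become candidates for removal or down-weighting before retraining.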