Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
By: Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt
Published: 2026-04-02
arXiv: cs.AI
Abstract
As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard human oversight. This work introduces NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and proposes five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. The results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion.
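The abstract describes aggregating per-agent deception scores into a group-level classification. As a minimal illustrative sketch (the function name, aggregation modes, and threshold are hypothetical, not taken from the paper), such an aggregation might look like:

```python
from statistics import mean

# Hypothetical per-agent deception scores, e.g. from a probe on model
# internals; all names and thresholds here are illustrative only.
def classify_scenario(agent_scores, threshold=0.5, mode="max"):
    """Aggregate per-agent scores into one scenario-level collusion label."""
    if mode == "max":   # flag the scenario if any single agent looks deceptive
        score = max(agent_scores)
    else:               # "mean": require elevated scores across the group
        score = mean(agent_scores)
    return score >= threshold, score

# One strongly deceptive agent flips the max-aggregated label,
# while the mean aggregate stays below threshold.
flagged_max, _ = classify_scenario([0.1, 0.2, 0.9], mode="max")
flagged_mean, _ = classify_scenario([0.1, 0.2, 0.9], mode="mean")
```

The choice of aggregator encodes an assumption about collusion: max-style aggregation catches a single deceptive agent, while mean-style aggregation targets coordinated deception spread across the group.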