Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

By: Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt

Published: 2026-04-02

Category: cs.AI

Abstract

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard human oversight. This work introduces NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and proposes five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. The results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion.
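The group-level classification described above can be illustrated with a minimal sketch: per-agent deception scores are pooled into a single scenario score and thresholded. The function names, aggregation rules, and threshold below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def group_collusion_score(agent_scores, method="max"):
    """Aggregate per-agent deception scores into one group-level score.

    `method="max"` flags a scenario if any single agent looks deceptive;
    `method="mean"` requires elevated scores across the group.
    """
    scores = np.asarray(agent_scores, dtype=float)
    if method == "max":
        return float(scores.max())
    if method == "mean":
        return float(scores.mean())
    raise ValueError(f"unknown aggregation method: {method!r}")


def classify_scenario(agent_scores, threshold=0.5, method="max"):
    """Return True if the aggregated score crosses the decision threshold."""
    return group_collusion_score(agent_scores, method) >= threshold
```

For example, with scores `[0.1, 0.2, 0.8]`, max-pooling flags the scenario while mean-pooling does not, showing how the choice of aggregation rule trades sensitivity to a single deceptive agent against robustness to noisy per-agent scores.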

