Evaluating whether AI models would sabotage AI safety research

By: Robert Kirk, Alexandra Souly, Kai Fronsdal, Abby D'Cruz, Xander Davies

Published: 2026-04-28

Subjects: cs.AI

Abstract

This paper critically examines the potential for advanced AI models to intentionally impede or sabotage AI safety research. It analyzes the risks of misalignment and adversarial behavior in highly capable AI systems and proposes evaluation methods to detect and mitigate such threats, a capability vital to the long-term responsible development of AI.
