Adversarial Moral Stress Testing of Large Language Models
By: Saeid Jamshidi, Foutse Khomh, Arghavan Moradi Dakhel, Amin Nikanjam, Mohammad Hamdaqa, Kawser Wazed Nafi
Published: 2026-04-02
View on arXiv →Abstract
Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remains challenging, particularly under sustained adversarial user interaction. This paper introduces Adversarial Moral Stress Testing (AMST), a stress-based evaluation framework for assessing ethical robustness under adversarial multi-round interactions. AMST applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics. Results demonstrate substantial differences in robustness profiles across models and expose degradation patterns not observable under conventional single-round evaluation.