RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models
By: Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He
Published: 2025-12-09
Category: cs.AI
Abstract
This paper proposes RL-MTJail, a reinforcement learning approach to automated black-box multi-turn jailbreaking of large language models (LLMs). Its findings are intended to inform LLM security work and the development of robust defenses against adversarial attacks and malicious prompts in practical deployments.