RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

By: Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He

Published: 2025-12-09

#cs.AI

Abstract

This paper proposes RL-MTJail, a reinforcement learning approach for automated black-box multi-turn jailbreaking of large language models (LLMs). The authors present the work as a source of insights for strengthening LLM security and for building defenses against adversarial attacks and malicious prompts in practical deployments.
