LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech
By: Bingshen Mu, Xian Shi, Xiong Wang, Hexin Liu, Jin Xu, Lei Xie
Published: 2026-01-26
Abstract
Traditional forced alignment (FA) methods often suffer from language specificity and cumulative temporal shifts. This paper introduces LLM-ForcedAligner, a novel approach that reformulates FA as a slot-filling paradigm using large language models (LLMs) for multilingual, crosslingual, and long-form speech. By treating timestamps as discrete indices and inserting special timestamp tokens as slots, the model directly predicts time indices at these slots. This design supports non-autoregressive inference, which effectively avoids hallucinations and significantly improves speed, yielding a substantial reduction in accumulated temporal shift compared to previous FA methods.
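The slot-filling formulation described above can be illustrated with a small sketch. This is not the authors' code: the slot token name, frame rate, and the mock per-slot predictions are all assumptions for illustration. The idea is that each transcript word is flanked by special timestamp tokens ("slots"), timestamps are quantized to discrete indices, and every slot is filled independently in one non-autoregressive pass, so no prediction depends on a previously generated one.

```python
# Illustrative sketch (assumed names, not the paper's implementation) of
# slot-filling forced alignment: words interleaved with timestamp slots,
# continuous times quantized to discrete indices, slots filled in parallel.

TS_SLOT = "<|ts|>"  # assumed special timestamp-slot token


def build_slot_sequence(words):
    """Interleave transcript words with timestamp slots: one slot before
    and one after each word, marking its start and end times."""
    seq = []
    for w in words:
        seq.extend([TS_SLOT, w, TS_SLOT])
    return seq


def quantize(seconds, frame_rate_hz=25):
    """Map a continuous time (seconds) to a discrete frame index, so
    timestamps become classification targets over a fixed vocabulary
    rather than continuous regression outputs."""
    return round(seconds * frame_rate_hz)


def fill_slots(seq, slot_predictions):
    """Non-autoregressive decoding: each slot is filled from its own
    prediction, independently of the other slots, so errors do not
    accumulate along the sequence and no text can be hallucinated."""
    it = iter(slot_predictions)
    return [f"<|{next(it)}|>" if tok == TS_SLOT else tok for tok in seq]


words = ["hello", "world"]
seq = build_slot_sequence(words)
# Hypothetical per-slot times in seconds (start/end for each word):
pred_indices = [quantize(t) for t in (0.00, 0.40, 0.48, 1.00)]
aligned = fill_slots(seq, pred_indices)
# aligned -> ["<|0|>", "hello", "<|10|>", "<|12|>", "world", "<|25|>"]
```

Because the transcript tokens are fixed and only the slots are predicted, the model cannot drop or invent words, which is the mechanism the abstract credits for avoiding hallucinations.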