LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech
By: Bingshen Mu, Xian Shi, Xiong Wang, Hexin Liu, Jin Xu, Lei Xie
Published: 2026-01-26
Abstract
Traditional forced alignment (FA) methods often suffer from language specificity and cumulative temporal shifts. This paper introduces LLM-ForcedAligner, a novel approach that reformulates FA as a slot-filling paradigm using large language models (LLMs) for multilingual, crosslingual, and long-form speech. By treating timestamps as discrete indices and inserting special timestamp tokens as slots, the model directly predicts time indices at these slots. This design supports non-autoregressive inference, which effectively avoids hallucinations and significantly improves speed, yielding a substantial reduction in accumulated temporal shift compared to previous FA methods.
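The slot-filling formulation described above can be illustrated with a small sketch. This is not the authors' code: the slot token name, frame rate, and the mock per-slot predictions are all assumptions for illustration. The idea is that each transcript word is flanked by special timestamp tokens ("slots"), timestamps are quantized to discrete indices, and every slot is filled independently in one non-autoregressive pass, so no prediction depends on a previously generated one.

```python
# Illustrative sketch (assumed names, not the paper's implementation) of
# slot-filling forced alignment: words interleaved with timestamp slots,
# continuous times quantized to discrete indices, slots filled in parallel.

TS_SLOT = "<|ts|>"  # assumed special timestamp-slot token


def build_slot_sequence(words):
    """Interleave transcript words with timestamp slots: one slot before
    and one after each word, marking its start and end times."""
    seq = []
    for w in words:
        seq.extend([TS_SLOT, w, TS_SLOT])
    return seq


def quantize(seconds, frame_rate_hz=25):
    """Map a continuous time (seconds) to a discrete frame index, so
    timestamps become classification targets over a fixed vocabulary
    rather than continuous regression outputs."""
    return round(seconds * frame_rate_hz)


def fill_slots(seq, slot_predictions):
    """Non-autoregressive decoding: each slot is filled from its own
    prediction, independently of the other slots, so errors do not
    accumulate along the sequence and no text can be hallucinated."""
    it = iter(slot_predictions)
    return [f"<|{next(it)}|>" if tok == TS_SLOT else tok for tok in seq]


words = ["hello", "world"]
seq = build_slot_sequence(words)
# Hypothetical per-slot times in seconds (start/end for each word):
pred_indices = [quantize(t) for t in (0.00, 0.40, 0.48, 1.00)]
aligned = fill_slots(seq, pred_indices)
# aligned -> ["<|0|>", "hello", "<|10|>", "<|12|>", "world", "<|25|>"]
```

Because the transcript tokens are fixed and only the slots are predicted, the model cannot drop or invent words, which is the mechanism the abstract credits for avoiding hallucinations.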