Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning
By: Alex Echeverria, Sávio Salvarino Teles de Oliveira, Fernando Marques Federson
Published: 2026-01-20
Abstract
Adapting large language models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer, or Q&A). However, generating these datasets, especially from unstructured sources such as call center audio recordings, poses a significant challenge. This paper presents an end-to-end automated pipeline for generating Q&A instructional datasets from such recordings, comprising audio processing, textual processing, semantic extraction, and matching via semantic search. The practical value of the pipeline was demonstrated through successful fine-tuning of an LLM, highlighting its potential to create more effective AI systems for Q&A tasks in customer service.
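
The final stage named in the abstract, matching via semantic search, can be illustrated with a small sketch. The abstract does not specify the embedding model, similarity measure, or output schema, so everything below is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the `threshold` value is arbitrary, and the `instruction`/`output` field names are a common instruction-tuning convention rather than the authors' format.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption: a real pipeline would likely use a neural embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_questions_to_answers(questions, answers, threshold=0.2):
    """Pair each question with its most similar answer, dropping weak matches.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    if not answers:
        return []
    answer_vectors = [embed(a) for a in answers]
    pairs = []
    for question in questions:
        q_vec = embed(question)
        scores = [cosine(q_vec, a_vec) for a_vec in answer_vectors]
        best = max(range(len(answers)), key=scores.__getitem__)
        if scores[best] >= threshold:
            # Hypothetical record layout for an instruction dataset.
            pairs.append({"instruction": question, "output": answers[best]})
    return pairs


if __name__ == "__main__":
    # Invented sample utterances, standing in for extracted call segments.
    sample_questions = [
        "How do I reset my password?",
        "What is the delivery time for my order?",
    ]
    sample_answers = [
        "Your order has a delivery time of five business days.",
        "To reset your password, open the login page and click reset.",
    ]
    for pair in match_questions_to_answers(sample_questions, sample_answers):
        print(pair)
```

Swapping `embed` for a real embedding model would leave the matching loop unchanged, which is the main reason the sketch keeps the embedding and the similarity search as separate functions.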