Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning

Adapting Large Language Models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer, Q&A). However, generating these datasets from unstructured sources such as call center audio recordings poses a significant challenge. This paper presents an end-to-end automated pipeline for generating Q&A instructional datasets from such recordings, comprising audio processing, textual processing, semantic extraction, and matching via semantic search. The pipeline's practical value was demonstrated through the successful fine-tuning of an LLM, highlighting its potential to create more effective AI systems for Q&A tasks in customer service.
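
The final stage named above, matching via semantic search, can be illustrated with a minimal sketch. It assumes that questions and answers have already been extracted as plain strings. The embedding model and search index used by the pipeline are not stated here, so a toy bag-of-words vector with cosine similarity stands in for them. The names `embed`, `cosine`, `match_questions_to_answers`, and `min_score`, as well as the `instruction`/`output` record keys, are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_questions_to_answers(questions, answers, min_score=0.1):
    """Pair each question with its most similar answer, dropping weak matches."""
    answer_vecs = [(ans, embed(ans)) for ans in answers]
    pairs = []
    for q in questions:
        q_vec = embed(q)
        best_answer, best_score = max(
            ((ans, cosine(q_vec, vec)) for ans, vec in answer_vecs),
            key=lambda item: item[1],
            default=(None, 0.0),
        )
        # min_score is an assumed threshold, not a value from the paper.
        if best_answer is not None and best_score >= min_score:
            pairs.append({"instruction": q, "output": best_answer})
    return pairs


# Illustrative inputs only; these are not taken from any real recording.
questions = [
    "How do I reset my password?",
    "When will my invoice be issued?",
]
answers = [
    "Your invoice is issued on the first business day of each month.",
    "To reset your password, open the login page and choose the reset option.",
]
dataset = match_questions_to_answers(questions, answers)
```

In practice the `embed` function would be replaced by a learned sentence-embedding model and the linear scan by a vector index; the pairing logic would stay the same.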
