T-LLM: Teaching Large Language Models to Forecast Time Series via Temporal Distillation

By: Suhan Guo, Furao Shen, Yiwen Luo, Yunfeng Liu

Published: 2026-02-02

View on arXiv →
#cs.AI

Abstract

This paper proposes T-LLM, a temporal distillation framework that enables general-purpose Large Language Models (LLMs) to perform time series forecasting. By transferring predictive behavior from a lightweight temporal teacher during training, T-LLM consistently outperforms existing LLM-based forecasting methods and offers an efficient deployment pipeline.

Impact

practical


💡 Simple Explanation

Imagine hiring a genius professor (a giant AI like ChatGPT) to forecast the stock market, but they are very expensive and slow. T-LLM is a method to have this professor teach a brilliant intern (a smaller AI). The intern watches how the professor pays attention to past trends and learns to mimic their thinking process. The result is an intern who is almost as smart as the professor but works 10 times faster and much cheaper.

🎯 Problem Statement

State-of-the-art Large Language Models (LLMs) have shown potential in time series forecasting due to their pattern matching abilities. However, they are prohibitively large (billions of parameters), slow to infer, and expensive to run for real-time applications. Existing smaller models often lack the generalization capabilities and context understanding of these LLMs.

🔬 Methodology

The authors propose a Knowledge Distillation framework where a frozen Large Language Model acts as the Teacher. The input time series is patched and tokenized. The Teacher processes this to generate a forecast and internal attention maps. A smaller Student model (based on a lightweight Transformer architecture) is trained to minimize two losses: the standard forecast error against ground truth, and a distillation loss that forces the Student's internal representations and attention weights to match the Teacher's. This effectively transfers the Teacher's ability to recognize long-term dependencies and semantic patterns in the data to the Student.
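The training step described above can be sketched in PyTorch. This is an illustrative toy, not the paper's implementation: the model architecture, `patchify` parameters, and loss weights (`alpha`, `beta`) are assumptions chosen to show how the forecast loss and the attention/representation distillation losses combine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchify(series, patch_len=16, stride=8):
    """Split a (batch, seq_len) series into overlapping patches."""
    return series.unfold(dimension=1, size=patch_len, step=stride)

class TinyForecaster(nn.Module):
    """Stand-in for both Teacher and Student: returns forecast,
    attention weights, and hidden representations."""
    def __init__(self, patch_len=16, d_model=32, horizon=24):
        super().__init__()
        self.embed = nn.Linear(patch_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, patches):
        h = self.embed(patches)              # (B, num_patches, d_model)
        h, attn_w = self.attn(h, h, h)       # attn_w: (B, P, P)
        forecast = self.head(h.mean(dim=1))  # pool patches -> horizon
        return forecast, attn_w, h

def distillation_loss(student, teacher, series, target, alpha=0.5, beta=0.5):
    patches = patchify(series)
    with torch.no_grad():                    # the Teacher stays frozen
        t_fc, t_attn, t_hid = teacher(patches)
    s_fc, s_attn, s_hid = student(patches)
    return (F.mse_loss(s_fc, target)             # forecast vs. ground truth
            + alpha * F.mse_loss(s_attn, t_attn) # attention alignment
            + beta * F.mse_loss(s_hid, t_hid))   # representation alignment

# Usage: one training step on random data
torch.manual_seed(0)
series = torch.randn(8, 96)   # batch of 8 series, 96 time steps each
target = torch.randn(8, 24)   # 24-step-ahead ground truth
teacher, student = TinyForecaster(), TinyForecaster()
loss = distillation_loss(student, teacher, series, target)
loss.backward()               # gradients flow only into the Student
```

Note that because the Teacher runs under `torch.no_grad()`, backpropagation updates only the Student, which is what makes the Teacher's size irrelevant at deployment time.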

📊 Results

T-LLM was evaluated on ETTh1, ETTh2, Weather, and Traffic datasets. The student model achieved a Mean Squared Error (MSE) comparable to the Teacher model (within 5% margin) while reducing the parameter count by nearly 95%. T-LLM outperformed supervised baselines like DLinear and PatchTST on zero-shot transfer tasks where the student was distilled on diverse data and tested on unseen domains.

✨ Key Takeaways

It is possible to compress the 'temporal wisdom' of a giant LLM into a compact model without significant performance loss. The key is not just copying the output, but aligning the internal attention mechanisms (Temporal Distillation). This opens the door for deploying foundation-model-quality forecasting on edge devices.

🔍 Critical Analysis

The paper presents a compelling answer to the latency and cost bottleneck of using LLMs for time series. However, relying on a powerful teacher caps the student's upper bound at the teacher's zero-shot capability, which is imperfect on numerical data. Temporal Distillation is a smart addition, but the paper lacks a robust analysis of what happens when the teacher hallucinates or misreads a trend, which risks baking those errors into the student. Finally, while the comparisons against standard Transformers are welcome, a comparison against specialized time series foundation models such as Chronos is needed for a complete picture.

💰 Practical Applications

  • Licensing the distillation pipeline to financial institutions.
  • Providing a 'Model Compression Service' for companies employing large forecasting models.
  • Embedded AI forecasting chips for smart meters.

🏷️ Tags

Time Series · LLM · Knowledge Distillation · Forecasting · Efficiency · Machine Learning

🏢 Relevant Industries

Fintech · Energy · Supply Chain · Retail · IoT