Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification
By: Yahya Masri, Emily Ma, Zifu Wang, Joseph Rogers, Chaowei Yang
Published: 2026-01-13
Category: cs.AI
Abstract
This paper benchmarks nine small language models (SLMs) and small reasoning language models (SRLMs) on system log severity classification, using real-world `journalctl` data from Linux production servers. It evaluates each model under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results show strong stratification across models, and RAG yields significant improvements for some of them. Notably, the tiny Qwen3-0.6B reaches 88.12% accuracy with retrieval despite weak performance without it.
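To make the three prompting regimes concrete, the following is a minimal sketch of how zero-shot, few-shot, and retrieval-augmented prompts for log severity classification might be assembled. Everything in it is an assumption for illustration, not the paper's setup: the label set is the standard syslog/journald priority names, the prompt wording is invented, the helper names (`build_prompt`, `retrieve`, `jaccard`) are hypothetical, and token-overlap (Jaccard) similarity stands in for whatever retriever the authors actually used.

```python
from typing import List, Sequence, Tuple

# Standard syslog/journald priority names, most to least severe.
# Assumption: the paper's actual label set may differ.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

Example = Tuple[str, str]  # (log message, severity label)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, corpus: Sequence[Example], k: int = 2) -> List[Example]:
    """Return the k labeled logs most similar to the query (hypothetical helper)."""
    ranked = sorted(corpus, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(log_line: str, examples: Sequence[Example] = ()) -> str:
    """Build a classification prompt; with no examples this is zero-shot."""
    parts = [
        "Classify the severity of the system log message.",
        "Answer with exactly one of: " + ", ".join(SEVERITIES) + ".",
    ]
    for msg, label in examples:
        parts.append(f"Log: {msg}\nSeverity: {label}")
    parts.append(f"Log: {log_line}\nSeverity:")
    return "\n\n".join(parts)


# Illustrative labeled logs (made up for this sketch, not from the paper's data).
corpus = [
    ("kernel: Out of memory: Killed process 1234 (java)", "err"),
    ("sshd[812]: Accepted publickey for admin from 10.0.0.5", "info"),
    ("systemd[1]: Started Daily apt download activities.", "info"),
    ("kernel: EXT4-fs error (device sda1): unable to read inode", "crit"),
]
query = "kernel: Out of memory: Killed process 9876 (python)"

zero_shot = build_prompt(query)                        # no examples
few_shot = build_prompt(query, corpus[1:3])            # fixed examples
rag = build_prompt(query, retrieve(query, corpus, 1))  # query-specific example
```

The only difference between the few-shot and RAG prompts in this sketch is how the examples are chosen: few-shot uses a fixed set regardless of the input, while the retrieval variant selects labeled logs that resemble the query. That difference is one plausible reason a very small model could benefit strongly from retrieval, though the sketch does not reproduce any reported number.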