Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification

By: Yahya Masri, Emily Ma, Zifu Wang, Joseph Rogers, Chaowei Yang

Published: 2026-01-13

#cs.AI

Abstract

This paper benchmarks nine small language models (SLMs) and small reasoning language models (SRLMs) on system log severity classification, using real-world `journalctl` data from Linux production servers. Each model is evaluated under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results show strong stratification across models, and RAG yields significant improvements for some of them. Notably, the tiny Qwen3-0.6B reaches 88.12% accuracy with retrieval despite weak performance without it.
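
To make the three prompting regimes concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes the labels are the eight standard syslog severity levels that `journalctl` exposes as priorities, and it substitutes a simple token-overlap (Jaccard) retriever for whatever retrieval method the paper uses. The helper names (`build_prompt`, `retrieve`, `accuracy`), the prompt wording, and the sample log lines in `POOL` are all hypothetical.

```python
import re

# Standard syslog severity names (journalctl priorities 0-7).
# Assumption: the benchmark's label set matches these levels.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# Hypothetical instruction text; the paper's actual prompt is not shown here.
INSTRUCTION = (
    "Classify the severity of this Linux journal message as one of: "
    + ", ".join(SEVERITIES)
    + ". Answer with the label only."
)


def tokens(text):
    """Lowercased alphanumeric tokens of a log line, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, pool, k=2):
    """Return the k labeled examples most similar to the query.

    Similarity is Jaccard overlap of token sets, a stand-in for a real
    embedding-based retriever.
    """
    q = tokens(query)

    def score(example):
        t = tokens(example[0])
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(pool, key=score, reverse=True)[:k]


def build_prompt(message, examples=()):
    """Build a prompt: zero-shot if examples is empty, otherwise with demonstrations."""
    lines = [INSTRUCTION, ""]
    for text, label in examples:
        lines += [f"Message: {text}", f"Severity: {label}", ""]
    lines += [f"Message: {message}", "Severity:"]
    return "\n".join(lines)


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical labeled pool; these lines are illustrative, not from the dataset.
POOL = [
    ("Out of memory: Killed process 4121 (java)", "err"),
    ("Started Daily apt download activities.", "info"),
    ("CPU temperature above threshold, cpu clock throttled", "warning"),
    ("Accepted publickey for admin from 10.0.0.5 port 52144", "info"),
]

query = "Out of memory: Killed process 977 (python3)"
zero_shot = build_prompt(query)                        # no demonstrations
few_shot = build_prompt(query, POOL[1:3])              # fixed demonstrations
rag = build_prompt(query, retrieve(query, POOL, k=1))  # demonstrations chosen per query
```

The only difference between the few-shot and RAG prompts in this sketch is how the demonstrations are chosen: fixed in advance versus retrieved for each query. That is one plausible reading of why retrieval could help a very small model, since the nearest labeled example often carries the answer.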
