Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests of structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model-building coaching). On CLD extraction, cloud models achieve 77–89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50–100% on model-building steps and 47–75% on feedback explanation, but only 0–50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments.
