XSkill: Continual Learning from Experience and Skills in Multimodal Agents
This paper introduces XSkill, a dual-stream framework enabling multimodal agents to continually learn from visually-grounded task-level skills and action-level experiences without explicit retraining. This approach improves agent performance by enhancing tool-use efficiency and flexibility.
cs.AI
On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
This paper explores the phenomenon of "information self-locking" in reinforcement learning for active reasoning in Large Language Model (LLM) agents. It investigates how LLM agents might get stuck in suboptimal reasoning loops and proposes methods to overcome these limitations for improved active reasoning.
cs.AI
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
This research investigates using reasoning Large Language Models (LLMs) as judges for evaluating other LLMs during post-training in non-verifiable domains, exploring their effectiveness, practical impact, and potential pitfalls in complex, subjective tasks.
cs.AI
A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
This paper presents a prospective clinical feasibility study of an LLM-based conversational AI (Amy) in a real-world primary care setting. It evaluates Amy's diagnostic capabilities, management plans, and user satisfaction, finding high safety and acceptance, though human providers retained an edge in the practicality and cost-effectiveness of management plans. The study is a step towards broader clinical translation.
cs.AI
A Robust and Efficient Multi-Agent Reinforcement Learning Framework for Traffic Signal Control
This paper proposes a robust Multi-Agent Reinforcement Learning (MARL) framework for Traffic Signal Control, validated in the Vissim traffic simulator. It addresses generalization challenges through adaptive state representation, a novel reward function, and agent communication. The framework shows superior performance in diverse traffic scenarios.
cs.AI
Can RL Improve Generalization of LLM Agents? An Empirical Study
This empirical study investigates whether Reinforcement Learning (RL) can enhance the generalization capabilities of Large Language Model (LLM) agents. The research explores various RL techniques and their impact on LLM agents' performance across diverse and unseen tasks.
cs.AI
OpenClaw-RL: Train Any Agent Simply by Talking
This framework converts real-time "next-state signals" from AI agent interactions into continuous, online learning sources. It recovers both implicit evaluative signals and explicit directive signals, enabling agents to achieve rapid personalization in conversational settings and improve performance across diverse general agent tasks like terminal, GUI, SWE, and tool-calling environments. This allows agents to improve simply by being used, adapting to user re-queries, corrections, and explicit feedback.
cs.AI
Highly Autonomous Cyber-Capable Agents: Anticipating Capabilities, Tactics, and Strategic Implications
This report introduces "Highly Autonomous Cyber-Capable Agents" (HACCAs), AI systems capable of autonomously conducting multi-stage cyber campaigns comparable to top hacking groups. It defines HACCAs, forecasts their emergence, identifies five core operational tactics (e.g., autonomous infrastructure setup, detection evasion), and analyzes strategic implications like intensified interstate cyber competition and proliferation of offensive capabilities. It also flags tail risks such as inadvertent cyber-nuclear escalation and sustained loss of control, proposing policy recommendations.
cs.AI
Few-for-Many Personalized Federated Learning
This paper addresses scalability in Personalized Federated Learning (PFL) for heterogeneous data distributions by reformulating PFL as a "few-for-many" optimization problem. It maintains a small number of shared server models (K << M clients) to collectively serve all clients, rather than M distinct models. The proposed algorithm, FedFew, automatically discovers optimal model diversity through efficient gradient-based updates, achieving near-optimal personalization and outperforming state-of-the-art approaches with as few as 3 models on vision, NLP, and medical imaging datasets.
cs.AI
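The few-for-many idea reduces to letting each client pick its best of K shared models while the server updates each model from its assigned clients. The toy numpy sketch below illustrates that loop on scalar "models"; it is a simplified stand-in, not the FedFew algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "few-for-many" setup: M clients whose local data means fall into three
# latent groups, served by only K shared server models (K << M). Each model
# is a single scalar parameter here, purely for illustration.
M, K, steps, lr = 12, 3, 200, 0.1
client_means = rng.choice([-2.0, 0.0, 2.0], size=M) + 0.1 * rng.standard_normal(M)
models = client_means[rng.choice(M, size=K, replace=False)].copy()

def avg_min_loss(models):
    # Each client's loss under its best-fitting shared model.
    return ((models[None, :] - client_means[:, None]) ** 2).min(axis=1).mean()

loss_before = avg_min_loss(models)
for _ in range(steps):
    losses = (models[None, :] - client_means[:, None]) ** 2   # (M, K)
    assign = losses.argmin(axis=1)                            # client -> best model
    for k in range(K):                                        # server-side update
        members = client_means[assign == k]
        if members.size:
            models[k] -= lr * 2 * (models[k] - members).mean()
loss_after = avg_min_loss(models)
print(loss_before, loss_after)
```

Alternating assignment and gradient steps monotonically reduces the average per-client loss, which is the sense in which a few shared models can "collectively serve" many heterogeneous clients.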
Ψ_0: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
This paper introduces an open foundation model for universal humanoid loco-manipulation. It employs a decoupled learning strategy that first pre-trains on human egocentric videos to acquire generalizable visual-action representations, and then post-trains on significantly less robot data for precise joint control. This approach achieves over 40% higher success rates on complex, long-horizon tasks compared to state-of-the-art baselines and demonstrates improved data efficiency, paving the way for more capable humanoid robots.
cs.AI
When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows
This work proposes an architecture that adapts LLM agents for hospital environments to significantly improve clinical workflows. It addresses reliability, security, and long-term memory limitations by introducing a restricted execution environment, a document-centric interaction paradigm, a page-indexed memory architecture, and a curated medical skills library. The system forms the basis of an Agentic Operating System for Hospital, capable of coordinating clinical workflows with safety, transparency, and auditability.
cs.AI
DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. The paper proposes DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces. This method significantly improves tool-use generalization and outperforms quantity scaling for out-of-distribution generalization, even with 4x less data.
cs.AI
TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
TinyVLM enables zero-shot object detection directly on microcontrollers by employing vision-language distillation with Matryoshka embeddings. This significantly pushes the boundaries of edge AI, allowing powerful visual recognition capabilities on highly resource-constrained devices for IoT and embedded applications.
cs.CV
Towards Data-driven Nitrogen Estimation in Wheat Fields using Multispectral Images
This research explores a data-driven approach for estimating nitrogen levels in wheat fields using multispectral images. This has direct real-world application in precision agriculture, enabling optimized fertilization, improved crop yields, and reduced environmental impact.
cs.CV
Latent Replay Detection: Memory-Efficient Continual Object Detection on Microcontrollers via Task-Adaptive Compression
This paper introduces Latent Replay Detection, a memory-efficient approach for continual object detection on microcontrollers. It leverages task-adaptive compression to mitigate catastrophic forgetting, crucial for deploying adaptive AI systems on edge devices that learn over time.
cs.CV
OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
OmniStream introduces a unified framework for real-time perception, 3D reconstruction, and action planning in continuous data streams. This approach is crucial for embodied AI and robotics, enabling agents to understand and interact with dynamic environments in a coherent and efficient manner.
cs.CV
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
This paper proposes EVATok, a novel adaptive length video tokenization method designed for efficient visual autoregressive generation. It aims to improve the efficiency of video generation models by dynamically adjusting token lengths, leading to better performance and reduced computational costs, particularly useful for high-quality video synthesis and editing applications. This work was accepted by CVPR 2026.
cs.CV
GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
This paper introduces GRADE, a benchmark for evaluating discipline-informed reasoning in image editing. It provides a structured framework to assess how well AI models understand and apply domain-specific rules during image manipulation, which is vital for professional artistic and design applications.
cs.CV
Automated Quality Check of Sensor Data Annotations
This paper proposes an automated method for checking the quality of sensor data annotations, a critical component for training reliable machine learning models in autonomous systems. Ensuring high-quality annotations is vital for the safety and performance of AI applications in self-driving cars, robotics, and surveillance.
cs.CV
Understanding LoRA as Knowledge Memory: An Empirical Analysis
Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as the complementary axis of memory alongside RAG and ICL, offering distinct advantages.
cs.AI
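The mechanics the paper studies rest on LoRA's additive low-rank update: each "memory module" is a pair (A, B) contributing delta_W = B @ A, and composing modules amounts to summing their deltas. The numpy sketch below shows this identity; it illustrates the standard LoRA parameterization, not the paper's specific experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 2  # hidden size and LoRA rank (the capacity knob the paper probes)

W = rng.standard_normal((d, d))  # frozen pre-trained weight

def lora_module(rank):
    # One LoRA "memory module": a low-rank update delta_W = B @ A.
    A = rng.standard_normal((rank, d)) * 0.1
    B = rng.standard_normal((d, rank)) * 0.1
    return A, B

modules = [lora_module(r) for _ in range(3)]

# Merged view: the effective weight after "writing" all modules into memory.
W_eff = W + sum(B @ A for A, B in modules)

x = rng.standard_normal(d)
# Factored view: apply the base weight plus each module on the fly,
# without ever materializing W_eff. The two views agree exactly.
y = W @ x + sum(B @ (A @ x) for A, B in modules)
print(np.allclose(W_eff @ x, y))  # True
```

The equivalence of the merged and factored views is what makes LoRA modules swappable and composable as parametric memory, in contrast to context-dependent ICL and RAG.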
Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking
Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social-contextual patterns learned from training data, which limits their generalization to new data distributions. Recently, Retrieval Augmented Generation (RAG) based methods have been proposed to utilize the reasoning capability of LLMs with retrieved grounding evidence documents.
cs.AI
Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Agentic reasoning models, which leverage external tools for multi-step tasks, hold immense promise but also introduce new safety challenges. A critical aspect of their safe deployment is the ability to intelligently decide when to act and when to refuse an action, especially when faced with uncertain or potentially harmful tool outputs. This paper proposes a novel framework for guarding agentic reasoning models by explicitly training them to learn refusal policies. Our approach integrates a confidence estimation module and a refusal mechanism directly into the agent's decision-making loop. The confidence module assesses the reliability of generated tool calls and intermediate reasoning steps, while the refusal mechanism triggers a safe fallback (e.g., asking for human intervention or re-planning) if confidence drops below a learned threshold. Through extensive experiments on various tool-use benchmarks involving web navigation, API calls, and code execution, we demonstrate that our guarded agent significantly improves safety and robustness, reducing harmful actions and erroneous tool uses by up to 70% while maintaining high task completion rates. This work provides a crucial step towards building more trustworthy and controllable agentic AI systems for real-world applications.
cs.AI
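The decision loop the paper describes, a confidence estimate gating each tool call against a learned threshold, can be sketched in a few lines. Here the confidence heuristic and the threshold are hypothetical stand-ins (the paper learns both); the sketch only shows the control flow.

```python
# Illustrative act-or-refuse gate; the confidence function and threshold
# below are hypothetical, standing in for the paper's learned modules.
THRESHOLD = 0.6

def confidence(tool_call: dict) -> float:
    # Hypothetical heuristic: penalize calls with unresolved placeholder args.
    args = tool_call.get("args", {})
    if not args:
        return 0.2
    unresolved = sum(1 for v in args.values() if v in ("", None, "<UNKNOWN>"))
    return max(0.0, 1.0 - unresolved / len(args))

def act_or_refuse(tool_call: dict) -> str:
    score = confidence(tool_call)
    if score < THRESHOLD:
        # Safe fallback: defer instead of executing an uncertain call.
        return "refuse: escalate to human / re-plan"
    return f"execute: {tool_call['name']}"

print(act_or_refuse({"name": "delete_file", "args": {"path": "<UNKNOWN>"}}))
print(act_or_refuse({"name": "read_file", "args": {"path": "/tmp/log.txt"}}))
```

The first call is refused (its only argument is unresolved, so confidence falls below the threshold) while the second executes, mirroring the act-vs-refuse trade-off the paper trains for.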
Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and applies multi-dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state-of-the-art LLM agents on tau-bench yields findings at the axis, compliance, and benchmark levels. At the procedural compliance level, 27-78% of benchmark reported successes are corrupt successes concealing violations across interaction and integrity. Furthermore, gating substantially collapses Pass^4 rate and affects model rankings. The analysis of corrupt success cases reveals distinctive per-model failure signatures: GPT-5 spreads errors across policy, execution, and intent dimensions; Kimi-K2-Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral-Large-3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.
cs.AI
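PAE's multi-dimensional gating can be sketched as a conjunction over the four axes: a trajectory counts as a success only if every axis passes, which is what disqualifies "corrupt successes". The axis names follow the summary; the pass/fail inputs below are illustrative, not scores from the paper.

```python
# Sketch of PAE-style multi-dimensional gating: task completion (Utility)
# alone is not enough; every gate must hold for the run to count.
AXES = ("utility", "efficiency", "interaction_quality", "procedural_integrity")

def gated_success(scores: dict) -> bool:
    return all(scores[axis] for axis in AXES)

# A run that completed the task but violated interaction quality en route.
completed_but_corrupt = {"utility": True, "efficiency": True,
                         "interaction_quality": False,
                         "procedural_integrity": True}
clean = {axis: True for axis in AXES}

print(gated_success(completed_but_corrupt))  # False: a corrupt success, disqualified
print(gated_success(clean))                  # True
```

This conjunction is also why gating "substantially collapses" pass rates: a run must clear every axis on every attempt, so hidden violations that plain completion metrics absorb become hard failures.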
Adaptive Confidence Regularization for Multimodal Failure Detection
The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction should be higher when all modalities agree, and lower when they disagree. ACR explicitly models this discrepancy by learning a confidence score that regularizes the multimodal prediction. Our experiments demonstrate that ACR significantly improves failure detection performance on various multimodal tasks and datasets, especially in challenging real-world scenarios with noisy and incomplete data.
cs.AI
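The key observation, that confidence should be high when modalities agree and low when they disagree, can be made concrete with a fixed agreement heuristic. ACR learns its regularizer; the sketch below is only an illustrative proxy, with made-up per-modality distributions.

```python
import numpy as np

# Illustrative agreement-based confidence in the spirit of ACR: confidence
# tracks how many modalities independently back the fused prediction.
def agreement_confidence(modality_probs):
    probs = np.stack(modality_probs)   # (num_modalities, num_classes)
    fused = probs.mean(axis=0)         # simple late fusion
    top = fused.argmax()
    # Fraction of modalities whose own top prediction matches the fused one.
    agree = (probs.argmax(axis=1) == top).mean()
    return fused, agree

image = np.array([0.9, 0.05, 0.05])
audio_agree = np.array([0.8, 0.1, 0.1])
audio_conflict = np.array([0.1, 0.8, 0.1])

_, c_agree = agreement_confidence([image, audio_agree])
_, c_conflict = agreement_confidence([image, audio_conflict])
print(c_agree, c_conflict)  # 1.0 0.5
```

The conflicting pair yields half the confidence of the agreeing pair, the discrepancy signal that ACR turns into a learned failure-detection score.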
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster-than-torch.compile rates of 100%, 100%, and 92% on the KernelBench Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
cs.AI
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
This research presents SeeThrough3D, a groundbreaking method for text-to-image generation that incorporates occlusion-aware 3D control. It allows for more precise and realistic synthesis of images by understanding and manipulating objects in a three-dimensional space, even when they are partially obscured. This advancement has significant implications for virtual reality, content creation, and design industries.
cs.AI
SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents
The SWE-MiniSandbox paper presents a novel container-free reinforcement learning environment designed for developing and testing software engineering agents. This sandbox facilitates efficient training and evaluation of AI agents capable of automating various software development tasks, from code generation to bug fixing, without the overhead of containerization.
cs.AI
Model Agreement via Anchoring
This paper introduces a novel approach to achieving agreement between different AI models through a technique called anchoring. It explores how anchoring can enhance the robustness and reliability of multi-model systems, offering a new perspective on ensemble learning and improving decision-making in complex AI applications. This method has potential applications in areas requiring high-stakes decision validation.
cs.AI
MHDash: An Online Platform for Benchmarking Mental Health-Aware AI Assistants
This paper introduces MHDash, a new online platform specifically designed for benchmarking mental health-aware AI assistants. It provides standardized metrics and datasets to evaluate the effectiveness, empathy, and safety of AI tools in supporting mental well-being, paving the way for more responsible and beneficial applications of AI in healthcare.
cs.AI
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
This paper delves into the structure and function of "agentic memory" in AI systems, proposing a comprehensive taxonomy and an empirical analysis of its evaluation and inherent limitations. Understanding agentic memory is crucial for developing more intelligent and adaptive AI agents capable of complex tasks and long-term interactions in dynamic environments.
cs.AI