Papers
arxiv:2603.17187

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

Published on Mar 17
Submitted by Yao on Mar 19
#1 Paper of the day
Authors:

Abstract

A continual meta-learning framework for large language model agents that jointly evolves policies and reusable behavioral skills while minimizing downtime through opportunistic updates and skill-driven adaptation.

AI-generated summary

Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.
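The support/query separation mentioned above can be sketched as a minimal versioning scheme keyed on the skill-library version. This is an illustrative reconstruction, not the paper's actual code: names like `SkillVersionLog` and `Trajectory` are placeholders, assuming only what the abstract states (failure trajectories before a skill update feed skill evolution; post-adaptation trajectories feed RL updates).

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task_id: str
    success: bool
    skill_version: int  # skill-library version active when this trajectory ran

@dataclass
class SkillVersionLog:
    """Illustrative support/query split to prevent stale reward contamination.

    Support data: failure trajectories produced under an *earlier* skill
    version, consumed by the LLM evolver to synthesize new skills.
    Query data: trajectories produced under the *current* version, used
    for RL updates, so rewards are never computed against stale skills.
    """
    current_version: int = 0
    trajectories: list = field(default_factory=list)

    def record(self, traj: Trajectory):
        self.trajectories.append(traj)

    def support_set(self):
        # Failures under any earlier skill version feed skill evolution.
        return [t for t in self.trajectories
                if not t.success and t.skill_version < self.current_version]

    def query_set(self):
        # Only post-adaptation trajectories feed gradient-based updates.
        return [t for t in self.trajectories
                if t.skill_version == self.current_version]

    def bump_version(self):
        # Called after the evolver commits a new batch of skills.
        self.current_version += 1
```

The point of the split is that a trajectory can serve as support data or query data, but never both: once the skill library advances, everything recorded under older versions is quarantined from the reward signal.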

Community

Paper submitter

Large language model (LLM) agents have rapidly emerged as powerful assistants for complex, multi-step tasks, yet agents deployed in the wild remain largely static, trained once and served unchanged regardless of how user needs evolve. This creates a fundamental tension: they must serve users continuously without interruption, yet their capabilities grow stale as the task distribution drifts with real-world usage. On platforms such as OpenClaw, where a single agent connects to 20+ messaging channels and handles diverse, evolving workloads, existing approaches either store raw trajectories without distilling transferable behavioral knowledge, maintain static skill libraries disconnected from weight optimization, or incur service downtime during retraining.

We present MetaClaw, a continual meta-learning framework that jointly maintains a base LLM policy and an evolving skill library of reusable behavioral instructions, improving both through two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories and synthesizes new skills via an LLM evolver, taking effect immediately with zero service downtime. Opportunistic policy optimization performs gradient-based weight updates via cloud LoRA fine-tuning using RL with a process reward model, triggered only during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors configurable sleep hours, system keyboard inactivity, and Google Calendar occupancy.

The two mechanisms are mutually reinforcing: a better policy produces more informative failures for skill synthesis, and richer skills yield higher-reward trajectories for policy optimization. To prevent stale reward contamination, a skill generation versioning mechanism strictly separates support data (failure trajectories consumed by skill evolution) from query data (post-adaptation trajectories used for RL updates). Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without a local GPU.
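The OMLS trigger logic described above (sleep hours, keyboard inactivity, calendar occupancy) can be sketched roughly as follows. This is a hedged sketch, not the paper's implementation: the function names, the 30-minute idle threshold, and the 23:00-07:00 sleep window are all illustrative assumptions, and the real scheduler would obtain `idle_seconds` and `calendar_free` from platform-specific keyboard and Google Calendar APIs.

```python
from datetime import datetime, time

def in_sleep_hours(now: datetime, start: time = time(23, 0),
                   end: time = time(7, 0)) -> bool:
    """Check a configurable sleep window, handling windows that cross midnight."""
    t = now.time()
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end

def should_trigger_update(now: datetime, idle_seconds: float,
                          calendar_free: bool,
                          idle_threshold: float = 1800) -> bool:
    """Fire gradient-based policy optimization only when the user is
    plausibly away: inside sleep hours, or keyboard-idle for long enough
    with no calendar event in progress. Thresholds are illustrative."""
    if in_sleep_hours(now):
        return True
    return idle_seconds >= idle_threshold and calendar_free
```

For example, 2 a.m. always triggers, while a busy afternoon triggers only after a long idle stretch with a free calendar; requiring both signals outside sleep hours keeps disruptive LoRA runs away from moments when the user is merely pausing between messages.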
Experiments on MetaClaw-Bench (934 questions, 44 simulated workdays) and AutoResearchClaw (23-stage autonomous research pipeline) demonstrate consistent improvements: skill-driven adaptation improves accuracy by up to 32% relative; the full pipeline advances Kimi-K2.5 from 21.4% to 40.6% accuracy (vs. GPT-5.2 baseline 41.1%) with an 8.25x gain in end-to-end task completion; and skill injection alone improves AutoResearchClaw composite robustness by 18.3%.
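As a quick sanity check on the reported numbers, the full-pipeline jump from 21.4% to 40.6% is a much larger relative improvement than the 32% attributed to skill-driven adaptation alone:

```python
base, tuned = 21.4, 40.6  # accuracies reported for Kimi-K2.5
relative_gain = (tuned - base) / base
print(f"{relative_gain:.1%}")  # → 89.7% relative improvement
```

This puts the tuned Kimi-K2.5 within half a point of the quoted GPT-5.2 baseline of 41.1%.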


