Summary
Add a first-class Slime-powered Algorithm to Agent Lightning (similar to agl.VERL(...)) so users can:
- run existing agents (LangGraph / OpenAI Agents SDK / custom) unchanged via the Runner,
- collect spans + rewards in LightningStore,
- train/update policy weights with Slime,
- inject the updated OpenAI-compatible endpoint into rollouts via the main_llm resource convention (usage sketch below).
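For concreteness, here is a usage sketch mirroring the existing agl.VERL(...) entry point. agl.SLIME, its config keys, and the Trainer wiring shown are assumptions for this discussion, not an existing API:

```python
# Hypothetical usage sketch. agl.SLIME and the config keys below are
# assumptions mirroring the agl.VERL(...) pattern, not an existing API.
import agentlightning as agl

algorithm = agl.SLIME(
    config={
        "model": "Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
        "rollout_batch_size": 32,
        "train_steps": 100,
    }
)

# my_agent / train_dataset are placeholders for the user's agent and data.
# Existing agents (LangGraph / OpenAI Agents SDK / custom) run unchanged:
# the Runner resolves the main_llm resource published by the algorithm.
trainer = agl.Trainer(algorithm=algorithm)  # Trainer wiring is schematic
trainer.fit(my_agent, train_dataset)
```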
Agent Lightning architecture reference: Algorithm ↔ Runner ↔ LightningStore, with auxiliary Tracer/Adapter/LLM Proxy.
Docs: Bird’s Eye View + Store deep dive.
- https://microsoft.github.io/agent-lightning/latest/deep-dive/birds-eye-view/
- https://microsoft.github.io/agent-lightning/stable/deep-dive/store/
Slime
Slime is an RL post-training framework designed for RL scaling with Megatron + SGLang, and it supports flexible data generation engines.
Motivation / problem
Agent Lightning already supports RL via the VERL wrapper, with main_llm resource injection and per-attempt endpoint formatting via ProxyLLM.
- VERL wrapper docs: https://microsoft.github.io/agent-lightning/latest/algorithm-zoo/verl/
- SQL RL recipe: https://microsoft.github.io/agent-lightning/latest/how-to/train-sql-agent/
I plan to work on Slime integration next week and want to open a technical discussion early to avoid choosing the wrong abstraction boundary.
Design constraints from Agent Lightning (what we must respect)
- LightningStore is the single control plane: rollouts, attempts, ordered spans (sequence_id), and resources.
- ProxyLLM exists specifically to inject rollout/attempt routing into endpoints (for attribution).
- Adapters exist (e.g., LlmProxyTraceToTriplet), and ordering should rely on sequence_id (not timestamps) due to clock skew (ordering sketch below).
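To make the sequence_id constraint concrete, a minimal sketch; the Span fields and helper below are illustrative assumptions, not the actual LightningStore schema:

```python
# Order spans by sequence_id, never by wall-clock timestamps.
from dataclasses import dataclass

@dataclass
class Span:
    attempt_id: str
    sequence_id: int   # monotonically assigned by the store
    timestamp: float   # unreliable across machines due to clock skew
    payload: dict

def ordered_spans(spans: list[Span], attempt_id: str) -> list[Span]:
    """Return one attempt's spans in store order (not arrival/clock order)."""
    mine = [s for s in spans if s.attempt_id == attempt_id]
    return sorted(mine, key=lambda s: s.sequence_id)  # not s.timestamp
```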
Two integration tracks
Track A — "tinker-style" training service backbone (recommended if we need a stable boundary)
Create a minimal training service API that Slime sits behind (service-ish boundary), then implement an Agent Lightning Algorithm that:
- reads spans + rewards from LightningStore,
- converts them into training batches,
- calls a stable service surface (e.g., train_step / load_weights / get_endpoint / checkpoint) implemented by Slime (interface sketched below).
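A possible shape for that surface, assuming a typed Protocol; the method set comes from the list above, but the Protocol itself and all signatures are proposals, not an existing Agent Lightning or Slime interface:

```python
# Track A boundary sketch: the stable training-service surface.
from typing import Protocol

class TrainingService(Protocol):
    def train_step(self, batch: dict) -> dict:
        """Run one optimization step on an adapted batch; return metrics."""
        ...

    def load_weights(self, checkpoint_path: str) -> None:
        """Load policy weights into the training/serving engines."""
        ...

    def get_endpoint(self) -> str:
        """Return the current OpenAI-compatible endpoint URL for rollouts."""
        ...

    def checkpoint(self, tag: str) -> str:
        """Persist the current weights; return the checkpoint path."""
        ...
```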
Inspiration: OpenTinker’s disaggregated client/scheduler/training-server model.
Related: radixark/miles#239
Pros:
- Stable interface for Agent Lightning; Slime can evolve internally.
- Easier to swap backends in the future.
Cons:
- Risk of over-abstracting and fighting Slime’s native architecture.
Track B — Native Slime integration (fastest POC)
Implement agl.SLIME(config) as a thin wrapper that:
- launches Slime router + trainer using Slime-native config/entrypoints,
- publishes a ProxyLLM-like OpenAI endpoint under main_llm,
- builds a TraceAdapter that emits Slime-consumable batches (loop sketched below).
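The intended data flow, as a hypothetical loop; every name here (SlimeAlgorithm, the store/trainer/adapter methods) is assumed for discussion, not an existing Agent Lightning or Slime API:

```python
# Rough shape of the Track B algorithm loop. The point is the data flow:
# publish endpoint -> collect spans -> adapt to batches -> train -> republish.
class SlimeAlgorithm:
    def __init__(self, store, slime_trainer, adapter):
        self.store = store          # LightningStore handle
        self.slime = slime_trainer  # Slime-native trainer wrapper (assumed)
        self.adapter = adapter      # TraceAdapter emitting Slime batches

    def run(self, num_iterations: int) -> None:
        for _ in range(num_iterations):
            # 1. Publish the current OpenAI-compatible endpoint as main_llm,
            #    so Runners pick it up without agent-code changes.
            self.store.update_resources({"main_llm": self.slime.get_endpoint()})

            # 2. Wait for rollouts/attempts to finish and fetch ordered spans.
            spans = self.store.wait_for_completed_spans()  # assumed helper

            # 3. Convert spans + rewards into a Slime-consumable batch.
            batch = self.adapter.to_batch(spans)

            # 4. One Slime training step; the served weights then update.
            self.slime.train_step(batch)
```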
Pros:
- Quick to get working.
- Minimal extra layers.
Cons:
- Tighter coupling to Slime internals; upgrades may be fragile.
Open questions for maintainers
- Should SLIME live in the Algorithm Zoo or as an external plugin?
- Should we require ProxyLLM routing semantics for attribution (recommended), or allow a “dumb endpoint”?
- What is the expected minimal training signal in spans? (token IDs, logprobs, tool-call segmentation; strawman schema below)
- v1 scope: single-turn only, or multi-turn/tool-call credit assignment?
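As a starting point for the training-signal question, a strawman per-transition schema; every field name is a proposal, not something that exists in Agent Lightning today:

```python
# Strawman: the fields one adapted transition might need for Slime training.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Transition:
    prompt_token_ids: list[int]                # input tokens for this LLM call
    response_token_ids: list[int]              # sampled completion tokens
    response_logprobs: Optional[list[float]]   # sampling logprobs, if off-policy correction is needed
    tool_call_mask: Optional[list[bool]]       # marks response tokens belonging to tool calls
    reward: float                              # credit assigned to this transition
    attempt_id: str                            # links back to the store for attribution
```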
Acceptance criteria
- agl.SLIME(...) exists and can run an end-to-end loop with Trainer.
- Runners can consume the main_llm endpoint (like the SQL RL recipe).
- Training updates change the served endpoint (resource versioning) without modifying agent code (test sketch below).
- Trace ordering and attribution are correct at attempt granularity.
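One way to pin down the resource-versioning criterion as a test; the store and algorithm accessors here are assumptions for discussion:

```python
# Hedged check: agent code is untouched, yet the endpoint it receives
# changes after a training update publishes new weights.
def test_endpoint_updates_without_agent_changes(store, algorithm):
    before = store.get_resources()["main_llm"]  # endpoint before training
    algorithm.run(num_iterations=1)             # one train/update cycle
    after = store.get_resources()["main_llm"]   # endpoint after weight update
    assert before != after, "resource versioning should publish a new endpoint"
```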
Ownership
I plan to self-assign and drive the initial implementation + example.