Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,10 +24,22 @@ Join our [Discord community](https://discord.gg/RYk7CdvDR7) to connect with othe

Read more on our [documentation website](https://microsoft.github.io/agent-lightning/).

## Why Tinker?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this as it's not related to this PR.


- Running large scale LLM experiments locally can be difficult and resource intensive,especially for users without GPUs or complex infrastructure.

- Tinker allows Agent Lightning users to offload experiment execution to a managed third party service.This removes the need for local GPU setup and reduces operational complexity.

- Compared to the alternatives such as `verl`,Tinker provides a simpler API and easier integration,making it suitable for rapid experimentation and onboarding.

- Use Tinker when you want fast setup and managed execution,use local backends when you need full control over infrastructure.


<p align="center">
<img src="docs/assets/readme-diff.svg" alt="Agent-Lightning Core Quickstart" style="width:100%"/>
</p>


## ⚡ Installation

```bash
Expand Down
6 changes: 5 additions & 1 deletion agentlightning/adapter/messages.py
Original file line number Diff line number Diff line change
Expand Up @@ -254,7 +254,11 @@ def adapt(self, source: Sequence[Span], /) -> List[OpenAIMessages]:
if not isinstance(prompt, list):
raise ValueError(f"Extracted prompt from trace is not a list: {prompt}")
if not isinstance(completion, list):
raise ValueError(f"Extracted completion from trace is not a list: {completion}")
raise ValueError(
f"Expected completion to be a list, got {type(completion)}. "
f"Value: {repr(completion)[:200]}. "
"If the trace contains a single completion, wrap it in a list before passing it."
)
if not isinstance(request, dict):
raise ValueError(f"Extracted request from trace is not a dict: {request}")
if not isinstance(response, dict):
Expand Down
16 changes: 16 additions & 0 deletions docs/how-to/failed-rollouts.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# Handling Failed Rollouts

Rollouts may fail due to transient system issues such as network errors, timeouts or external service failures.

## Retry behavior
- Rollout retries are configured via `RolloutConfig`, including settings such as `max_attempts`, retry conditions and timeouts.
- If a rollout fails and returns `None`, it still counts as an attempt and follows the configured retry limits.

## Batch behavior
- Failed rollouts are handled at the individual rollout level.
- There is currently no built-in mechanism to a automatically skip an entire batch when multiple rollouts fail.

## Best practices
- Retries are useful for transient failures (e.g. temporary network issues).
- If failures occur frequently, this usually indicates an infrastructure problem rather than an issue retries can fix.
- In such cases, it is recommended to address the underlying system issue instead of increasing retry limits.