Context engineering is the practice of controlling what makes it into your prompt. Managing context poorly can lead to bugs and unexpected agent behavior. This repository demonstrates common context-management pitfalls and how to use LangSmith to evaluate, iterate, and develop better strategies for building more reliable agents.
# Install dependencies
uv sync
# Set environment variables
cp .env.example .env # Add your ANTHROPIC_API_KEY and LANGSMITH_API_KEY
# Run notebooks
jupyter notebook notebooks/context-failure-evals/
├── notebooks/
│   ├── context_confusion_demo.ipynb    # Context confusion demonstration
│   └── context_distraction_demo.ipynb  # Context distraction demonstration
├── context_confusion/
│   ├── tools.py                 # 75 tools including near-duplicates
│   ├── instructions.py          # Base instructions
│   ├── additional_context.py    # Irrelevant domain instructions
│   ├── resources/               # Mock data and test cases
│   ├── tests/                   # Evaluators and dataset utilities
│   ├── utils/                   # Agent helpers and plotting utilities
│   └── solutions/               # Consolidated tools and solutions
└── context_distraction/
    ├── agent.py                 # Standard ReAct agent
    ├── graph.py                 # Graph agent with context isolation
    ├── resources/               # Mock APIs and test tasks
    ├── tests/                   # Evaluators and dataset utilities
    └── debug/                   # Claude Code debugging utilities
Demonstrates how context confusion - superfluous context from excessive tools, verbose instructions, and irrelevant information - degrades LLM agent performance.
The Problem: The Berkeley Function-Calling Leaderboard shows every model performs worse with more tools. Too much in the context leads to poor tool selection, unnecessary calls, and incorrect responses.
Three problems measured with trajectory-based evaluation:
- Tool Overload - ~75 tools → confusion and poor selection
- Irrelevant Noise - Unrelated tools distract even at moderate counts
- Instruction Bloat - Verbose multi-domain instructions reduce focus
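To make the tool-overload setup concrete, here is a minimal, hedged sketch (the repository's actual ~75 tools live in context_confusion/tools.py; the tool names and model id below are assumptions, not the repo's code). Even two near-duplicate tools bound to one model make selection ambiguous, and the effect compounds as the catalog grows:

```python
# Hypothetical sketch of the tool-overload setup, not the repo's tools.py:
# near-duplicate tools bound to one model make tool selection ambiguous.
from langchain_core.tools import tool
from langchain_anthropic import ChatAnthropic

@tool
def get_order_status(order_id: str) -> str:
    """Look up the shipping status of an order."""
    return f"Order {order_id}: shipped"

@tool
def check_order_state(order_id: str) -> str:
    """Check the current state of an order."""  # near-duplicate of get_order_status
    return f"Order {order_id}: in transit"

# ...imagine ~75 of these spread across unrelated domains...
llm = ChatAnthropic(model="claude-3-5-sonnet-latest")  # model id is an assumption
agent_llm = llm.bind_tools([get_order_status, check_order_state])

response = agent_llm.invoke("Where is order 42?")
print(response.tool_calls)  # with many similar tools, which one gets picked is unreliable
```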
Evaluators:
- Trajectory Match: Do tool calls match expected tools?
- Success Criteria: Is the response accurate and complete?
- LLM Trajectory Judge: Are tool calls appropriate?
- Tool Efficiency: Ratio of expected/actual tool calls
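As a rough illustration of the first evaluator (the real implementations live in context_confusion/tests/; the field names `outputs["messages"]` and `reference_outputs["expected_tools"]` are assumptions), a trajectory-match check can compare the tools the agent actually called against the expected list:

```python
# Hypothetical trajectory-match evaluator sketch; the real evaluators
# live in context_confusion/tests/.
def trajectory_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 only if the agent called exactly the expected tools, in order."""
    actual = [
        call["name"]
        for msg in outputs["messages"]                         # agent's message history
        for call in (getattr(msg, "tool_calls", None) or [])   # AIMessage tool calls
    ]
    expected = reference_outputs["expected_tools"]
    return {
        "key": "trajectory_match",
        "score": float(actual == expected),
        "comment": f"expected={expected}, actual={actual}",
    }
```

A function like this could be passed to `langsmith.evaluate(...)` alongside the success-criteria and LLM-judge evaluators to get per-example trajectory scores in LangSmith.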
Solutions demonstrated:
- Context compression via tool consolidation and pruning
- Context selection via prompt routing
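A minimal sketch of the context-selection idea (the routing prompt, domains, and tools here are hypothetical, not the repo's solutions/ implementation): classify the request first, then bind only that domain's tools instead of the full catalog.

```python
# Hypothetical prompt-routing sketch: classify the request into a domain,
# then expose only that domain's tools to the model.
from langchain_core.tools import tool
from langchain_anthropic import ChatAnthropic

@tool
def get_order_status(order_id: str) -> str:
    """Look up the shipping status of an order."""
    return f"Order {order_id}: shipped"

@tool
def get_invoice(invoice_id: str) -> str:
    """Fetch an invoice by id."""
    return f"Invoice {invoice_id}: $42.00"

TOOLS_BY_DOMAIN = {"orders": [get_order_status], "billing": [get_invoice]}

llm = ChatAnthropic(model="claude-3-5-sonnet-latest")  # model id is an assumption

def route_and_answer(question: str):
    # Cheap routing step: one word of output decides which tools the agent sees.
    domain = llm.invoke(
        f"Reply with exactly one word, 'orders' or 'billing': {question}"
    ).content.strip().lower()
    scoped = llm.bind_tools(TOOLS_BY_DOMAIN.get(domain, []))
    return scoped.invoke(question)

# route_and_answer("Where is order 42?") only ever sees the 'orders' tools.
```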
Demonstrates how context distraction - accumulated tool call results over long task sequences - degrades recall accuracy in complex, multi-step research tasks.
The Problem: As agents perform complex tasks with many operations, each tool call and result accumulates in the conversation context. Research shows LLMs struggle to maintain recall accuracy over very long contexts.
Evaluators:
- Recall Accuracy: Does the agent correctly recall facts from throughout the task?
- Tool Call Completeness: Are all expected research steps executed?
- Tool Call Efficiency: Ratio of expected to actual tool calls
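A hedged sketch of the efficiency metric (field names are assumptions; the actual evaluators are in context_distraction/tests/):

```python
# Hypothetical tool-call-efficiency evaluator: scores how close the number of
# tool calls the agent made is to the number the task actually requires.
def tool_call_efficiency(outputs: dict, reference_outputs: dict) -> dict:
    actual_calls = [
        call
        for msg in outputs["messages"]
        for call in (getattr(msg, "tool_calls", None) or [])
    ]
    expected = len(reference_outputs["expected_tool_calls"])
    # 1.0 means no wasted calls; extra calls push the score below 1.0.
    # Missing steps are caught separately by the completeness evaluator.
    score = min(expected / len(actual_calls), 1.0) if actual_calls else 0.0
    return {"key": "tool_call_efficiency", "score": score}
```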
Solutions demonstrated:
- Context isolation via supervisor/researcher pattern
- Reflection tools for maintaining plans over long tasks
- Explicit information passing between nodes
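A minimal sketch of the isolation idea (plain functions rather than the repo's graph.py; the model id and prompts are assumptions): the supervisor delegates subtasks to researcher calls that each run in a fresh context, and only compact summaries flow back upward, so intermediate tool results never pile up in one long conversation.

```python
# Hypothetical supervisor/researcher isolation sketch, not the repo's graph.py.
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-5-sonnet-latest")  # model id is an assumption

def researcher(subtask: str) -> str:
    """Run one subtask in an isolated context and return only a short summary."""
    result = llm.invoke(
        f"Research this subtask and reply with a 2-3 sentence summary: {subtask}"
    )
    return result.content  # raw tool calls and results would stay inside this call

def supervisor(task: str, subtasks: list[str]) -> str:
    # The supervisor only ever sees compact summaries, not every tool result.
    notes = [f"- {s}: {researcher(s)}" for s in subtasks]
    return llm.invoke(
        f"Task: {task}\nFindings:\n" + "\n".join(notes) + "\nWrite the final answer."
    ).content
```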
Debugging utilities: Claude Code debugging scripts in context_distraction/debug/ for inspecting traces and agent behavior.
Coming soon