Massively Decomposed Agentic Processes - A CLI tool implementing the MAKER pattern for ultra-reliable AI task execution
MAKER is a novel approach to LLM reliability that challenges the obsession with "bigger models" and "larger context windows." Instead of relying on a single model to complete complex tasks, MAKER:
- Decomposes tasks into atomic steps
- Executes each step with fresh, stateless agents
- Votes on each step's output using consensus (First-to-Ahead-by-K)
If a model is 99% accurate per step, a 100-step task has a 36% success rate (0.99^100).
At 1,000 steps, you drop to 0.004% success.
By running multiple cheap models in parallel and voting, MAKER achieves higher reliability than using a single expensive model. Each step has a fresh context, preventing error propagation.
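The compounding-error math above is easy to verify; a two-line Node sketch (the 99% per-step figure is the illustrative number from the text):

```javascript
// Probability that an n-step chain succeeds when each step
// independently succeeds with probability p.
function chainSuccess(p, n) {
  return Math.pow(p, n);
}

console.log((chainSuccess(0.99, 100) * 100).toFixed(1) + "%");  // → 36.6%
console.log((chainSuccess(0.99, 1000) * 100).toFixed(4) + "%"); // → 0.0043%
```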
Read the paper: MAKER: Massively Decomposed Agentic Processes
I created this as a proof of concept for the MAKER paper. Think of this as "gemini-flash-lite-infinity": a way to scale lightweight models to solve tasks far beyond their normal context limits. I hope this inspires others to improve current models and advance the state of AI. Bring your own Gemini key.
Verified at Scale:
- drunkWalker: Successfully completed 1,000 steps (navigating a square pattern 250 times), maintaining perfect spatial accuracy and returning to the origin `{x: 0, y: 0}`.
- brokenCalculator: Successfully completed 1,000 steps of continuous calculation without error.
Test results can be found in tests/results/ to verify these claims.
My next goal is to push this approach to more complex tasks: coding challenges, game playing, multi-step puzzles, and real-world problem-solving. I want to see how far the current engineering can take us before hitting fundamental limitations. If the architecture proves insufficient, I'll iterate on the methods (smarter decomposition, dynamic voting thresholds, error recovery mechanisms). I encourage others to experiment with MAKER on their own challenging tasks and share what works (and what doesn't).
Install dependencies:

```shell
npm install
```

Create a `.env` file:

```shell
GEMINI_API_KEY=your_api_key_here
DEV_MODE=false
```

Run a task:

```shell
node bin/maker.js "Your task here"

# High quality mode (uses gemini-flash-latest instead of lite)
node bin/maker.js "Your task here" --high
```

Note: `--high` mode uses gemini-flash-latest, which has a 10,000 RPM rate limit and can get expensive quickly. Use it only for tasks requiring higher quality reasoning.
Example:
```shell
node bin/maker.js "Start with 0, add 10, multiply by 2, then subtract 5"
```

Output:

```
🤖 MAKER CLI - Initializing...
✓ Plan created with 4 steps.
✓ Step 1 Complete
✓ Step 2 Complete
✓ Step 3 Complete
✓ Step 4 Complete
🏁 Final Result:
15
```
```
┌─────────────┐
│   Planner   │  Decomposes task into steps
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Step 1    │
├─────────────┤
│ Agent A ... │  Multiple stateless agents vote
│ Agent B ... │
│ Agent C ──▶ │  First-to-Ahead-by-K wins
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Step 2    │  Fresh agents, no memory
└─────────────┘
```
- **Maximal Agentic Decomposition (MAD)**: Tasks are broken into atomic steps. Each step is executed by a fresh agent with no memory of previous steps.
- **Red-Flagging**: Malformed outputs (bad JSON, syntax errors) are immediately discarded. No retry logic: treat it like a "failed neuron."
- **First-to-Ahead-by-K Voting**: Multiple agents vote on each step. The first answer to lead by K votes (default: 2) wins.
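A minimal sketch of the voting loop, combining red-flagging with First-to-Ahead-by-K (function and option names here are illustrative, not the actual `src/consensus.js` API):

```javascript
// First-to-Ahead-by-K: keep sampling agents until one answer
// leads every other answer by at least K votes.
// `sampleAgent` is a stand-in for a fresh, stateless LLM call.
function firstToAheadByK(sampleAgent, { k = 2, maxAttempts = 15 } = {}) {
  const tally = new Map();
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    let answer;
    try {
      answer = JSON.stringify(sampleAgent()); // serialize for comparison
    } catch {
      continue; // red-flag: malformed output is discarded, no retries
    }
    tally.set(answer, (tally.get(answer) ?? 0) + 1);
    const counts = [...tally.values()].sort((a, b) => b - a);
    if (counts[0] - (counts[1] ?? 0) >= k) {
      const [winner] = [...tally.entries()].sort((a, b) => b[1] - a[1]);
      return JSON.parse(winner[0]);
    }
  }
  return null; // consensus not reached within the attempt budget
}
```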
```
d:/Dev/MAKER/
├── .env              # API key (not committed)
├── package.json
├── bin/
│   └── maker.js      # CLI entry point
├── src/
│   ├── config.js     # Configuration
│   ├── agent.js      # Worker agent (stateless)
│   ├── consensus.js  # Voting mechanism
│   ├── planner.js    # Task decomposition
│   └── utils.js      # Canonical JSON stringify
└── tests/
    ├── suite.js      # Test definitions
    └── runner.js     # Test executor
```
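Votes can only be tallied when semantically identical answers serialize identically, which is why `src/utils.js` provides a canonical JSON stringify. A sketch of what such a helper typically looks like (not the file's actual contents):

```javascript
// Canonical JSON: sort object keys recursively so that
// {a: 1, b: 2} and {b: 2, a: 1} produce the same vote string.
function canonicalStringify(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    return "{" + Object.keys(value).sort()
      .map(k => JSON.stringify(k) + ":" + canonicalStringify(value[k]))
      .join(",") + "}";
  }
  return JSON.stringify(value); // primitives: number, string, boolean, null
}
```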
Edit `src/config.js`:

```javascript
export const CONFIG = {
  MODEL_NAME: "gemini-flash-lite-latest",
  VOTE_MARGIN_K: 10,         // Votes ahead to win
  MAX_ATTEMPTS_PER_STEP: 15, // Max voting rounds
  // Speed Optimization Settings
  MAX_RPM: 3500,             // Rate limit (flash-lite: 4000 RPM)
  BATCH_SIZE: 50,            // Parallel agents per batch
  ENABLE_PARALLEL: true,     // Parallel consensus voting
  EARLY_TERMINATION: true,   // Stop when K margin reached
};
```

Performance Optimizations Implemented:
- Parallel Batch Voting: Launch agents in parallel batches instead of sequentially
- Early Termination: Stop voting as soon as the K-vote margin is reached (saves ~40% on average)
- Rate Limiting: Smart throttling to maximize throughput within API limits
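The batching and early-termination ideas above can be sketched as an async loop (names and option shapes are illustrative, not the actual implementation):

```javascript
// Parallel consensus: fire agents in batches, tally after each batch,
// and stop early once one answer leads by K votes.
async function parallelConsensus(sampleAgent, { k = 2, batchSize = 10, maxBatches = 5 } = {}) {
  const tally = new Map();
  for (let batch = 0; batch < maxBatches; batch++) {
    const results = await Promise.allSettled(
      Array.from({ length: batchSize }, () => sampleAgent())
    );
    for (const r of results) {
      if (r.status !== "fulfilled") continue; // red-flagged agent, discard
      const key = JSON.stringify(r.value);
      tally.set(key, (tally.get(key) ?? 0) + 1);
    }
    const counts = [...tally.values()].sort((a, b) => b - a);
    if (counts[0] - (counts[1] ?? 0) >= k) {
      const [winner] = [...tally.entries()].sort((a, b) => b[1] - a[1]);
      return JSON.parse(winner[0]); // early termination
    }
  }
  return null;
}
```

`Promise.allSettled` (rather than `Promise.all`) matters here: one malformed or rejected agent call should be dropped as a "failed neuron," not abort the whole batch.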
Benchmark Results (brokenCalculator 10 steps):
- Sequential baseline: ~60s
- With optimizations: ~6.0s (10x faster with 50 parallel agents)
We've included reliability tests with known answers:

```shell
# 10-step math chain
node tests/runner.js 10

# 50-step chain
node tests/runner.js 50

# 100-step chain
node tests/runner.js 100
```

These tests verify MAKER's ability to maintain accuracy over very long step sequences:
```shell
node tests/lite-runner.js drunkWalker 100            # Spatial reasoning - navigation
node tests/lite-runner.js paritySwitch 100           # Conditional logic - even/odd rules
node tests/lite-runner.js brokenCalculator 1000 --high  # Instruction parsing - varied formats
```

Note: The `brokenCalculator 1000` test has been verified to run successfully, maintaining accuracy over 1,000 steps of continuous calculation. This demonstrates the power of stateless decomposition: errors do not accumulate because each step is verified by consensus.
Results are logged to tests/results/.
Note: Natural language task handling (e.g., "Convert 100 USD to EUR") is also supported via the CLI. See tests/results/usd-to-eur-natural-language.txt for an example of consensus-based multi-step reasoning on ambiguous real-world tasks.
MAKER isn't just a task runner; it's a tool for understanding and improving LLMs. Here is how you can use it for self-learning and research:
Use MAKER with a high VOTE_MARGIN_K (e.g., K=10 or K=20) to generate "Gold Standard" reasoning traces.
- Run complex tasks.
- Save the successful step-by-step execution logs.
- Use these logs to fine-tune smaller models (like Gemma 2B or Llama 3 8B) to perform better at reasoning without needing consensus.
If you see a step where consensus is split (e.g., 50 votes for Answer A, 45 votes for Answer B), you have found a model blind spot.
- These are valuable data points!
- Collect these ambiguous cases to create "Adversarial Evaluation Datasets."
- Example: If `3055 + 5` results in `310` (a hallucination) even with K=5, increasing to K=10 helps filter it out, but the existence of the error reveals a weakness in the base model's tokenization or arithmetic handling.
Use the consensus vote count as a Reward Signal for Reinforcement Learning (RLHF).
- If an answer gets 100/100 votes, it has high reward.
- If it gets 51/100 votes, it has low reward.
- Train a Reward Model to predict the "Consensus Score" of an answer without actually running 100 agents.
- Ask MAKER to write a prompt for a task.
- Run the task using that prompt with MAKER (K=10).
- Measure the success rate.
- Ask MAKER to "Analyze the failure cases and improve the prompt."
- Repeat.
MAKER is a research prototype demonstrating the power of decomposition + consensus. Here are potential improvements for the community to explore:
- Dynamic K Adjustment: Use K=1 for simple early steps, K=3 for critical final steps
- Adaptive Batch Sizing: Automatically scale batch size based on rate limit headroom
- Prompt Caching: Cache the enriched system prompt to reduce token costs
- Smarter Decomposition: Train a specialized planner model to create better step sequences
- Self-Healing: Detect when consensus fails and automatically simplify the step
- Verification Layers: Add final validation agents that check results for sanity
- Confidence Scoring: Track which steps required many retries vs. quick consensus
- Checkpointing: Save intermediate state to resume long tasks after failures
What MAKER is Good At:
- Long-Horizon Reasoning: 100-1000+ step tasks that require perfect accuracy at each step
- Structured Problem Solving: Math chains, code generation with verification, multi-step analysis
- Reliability-Critical Tasks: Where a single error cascades into total failure
- Cost-Sensitive Scaling: Using cheap models (flash-lite) to achieve flagship-level reliability
Real-World Use Cases:
- Code Migration Tools: Decompose large refactoring tasks into atomic, verifiable steps
- Data Processing Pipelines: Multi-stage transformations where each step must be correct
- Educational Tutoring: Step-by-step problem solving with verification at each stage
- Planning Assistants: Break complex goals into actionable, validated subtasks
This implementation uses gemini-flash-lite-latest as a proof of concept. But what if we applied MAKER to frontier models like GPT-4o, Claude Sonnet, or Gemini Pro?
Key Insight: If decomposition + consensus can make a lightweight model reliable at 100-step tasks, could it push frontier models to 10,000-step tasks? Or enable them to solve problems currently considered impossible?
Open Research Questions:
- Does consensus voting still help when the base model is already very capable?
- Can smarter decomposition strategies unlock exponential scaling in reasoning depth?
- What's the theoretical limit of task complexity we can achieve with this approach?
- Could MAKER-style techniques be integrated into model training itself?
We encourage researchers to experiment with:
- Applying MAKER to frontier models and measuring the reliability gains
- Testing on domains like mathematics, formal verification, or multi-agent simulations
- Comparing different voting strategies (K-margin, confidence weighting, tiered consensus)
- Exploring hybrid approaches (MAKER for decomposition, frontier model for synthesis)
We are actively working on pushing the boundaries of what's possible with decomposed agentic processes. Here is our roadmap for the immediate future:
- Current State: Basic syntax and JSON validation.
- Goal: Implement semantic red-flagging where agents can vote to "veto" a step if it seems logically unsound, even if syntactically correct.
- Implementation: Add a pre-voting "sanity check" layer that runs cheaply before the main consensus round.
- Current State: Linear history of step results.
- Goal: Move to a Tree-of-Thoughts data structure where branches can be explored and pruned.
- Benefit: Allows for backtracking and exploring alternative solutions when a path hits a dead end, rather than just failing the task.
- Current State: Excellent at linear reasoning and math chains.
- Goal: Tackle tasks requiring multi-file coding, creative writing, and strategic planning.
- Testing: We are building a suite of "Hard" tests including:
- Writing a complete mini-game in Python
- Solving cryptic crosswords
- Debugging complex race conditions
- Goal: Reduce the "K" margin needed for consensus by improving the quality of individual agents.
- Method: Fine-tuning a small model (Gemma 2B/7B) specifically on successful MAKER traces to create a "Specialized Worker" model that is cheaper and more accurate than generic flash-lite.
Contributions are welcome! Please open an issue or submit a pull request.
Ideas for contributions:
- Implement Phase 3 optimizations (dynamic K, adaptive batching)
- Add support for other model providers (OpenAI, Anthropic, OpenRouter)
- Create new test suites for different problem domains
- Build visualization tools for consensus voting patterns
- Improve the planner with few-shot examples or fine-tuning
MIT License - see LICENSE for details.
This project is open source and free to use, modify, and distribute. We encourage experimentation and welcome improvements from the community.
Inspired by the MAKER research paper on Massively Decomposed Agentic Processes.
| Approach | Model | Cost | Reliability |
|---|---|---|---|
| Single Call | GPT-4 | $$$$ | ~36% (100 steps) |
| MAKER | Gemini Flash Lite | $ | >95% (100 steps) |
Economics: Running 10 cheap models per step is cheaper and more reliable than one expensive model.
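Under a simplified two-answer model (each vote is correct with probability p, and all wrong votes agree on one alternative), First-to-Ahead-by-K behaves like a biased random walk, so per-step reliability follows the classic gambler's-ruin formula. A quick sanity check of the kind of gain the table claims, with assumed per-vote accuracy of 90%:

```javascript
// Gambler's-ruin probability that the correct answer reaches a
// K-vote lead before the wrong answer does, given per-vote accuracy p.
function stepReliability(p, k) {
  const q = 1 - p;
  return Math.pow(p, k) / (Math.pow(p, k) + Math.pow(q, k));
}

// Reliability of a full chain of `steps` consensus-verified steps.
function chainSuccess(p, k, steps) {
  return Math.pow(stepReliability(p, k), steps);
}

// 90% per-vote accuracy, K = 4, 100 steps:
console.log((chainSuccess(0.9, 4, 100) * 100).toFixed(1) + "%"); // → 98.5%
```

This is a toy model (real disagreements rarely concentrate on a single wrong answer), but it illustrates why modest K values can lift a cheap model's chain reliability well above the single-call baseline.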