A governing pyramid, five architecture types, the agent harness, lab deep dives, memory
systems, failure modes, architectural limits, benchmarks, and the 12-Factor principles for
building production agents.
Governing Pyramid
Three pillars that hold the field up. Foundation, middle, apex.
The Harness, Not the Model, Is the Product
Production agents are 70%+ harness: context, memory, tools, verification, observability. The model is the CPU; the harness is the OS.
Frontier capability gains plateau quickly without the surrounding scaffolding that turns a model into a system. Claude Code, Cursor, and Anthropic's research agents all show that the long tail of reliability — context curation, tool ergonomics, verification loops, trace observability — dominates end-to-end performance, and that swapping models inside a strong harness produces smaller deltas than swapping harnesses around a strong model.
Once each agent is capable enough, the bottleneck shifts to how agents share context, hand off work, and recover from each other's failures. Coordination is the new scaling axis.
The MAST failure taxonomy (arXiv 2503.13657) finds that 70%+ of multi-agent failures are systemic — miscommunication, context loss, role confusion — not single-agent reasoning errors. Anthropic's multi-agent research system shows that disciplined orchestrator-worker coordination can yield 90.2% improvements, while Cognition's counter-argument warns that naive multi-agent setups fragment context and degrade below a strong single agent. Either way, the decisive variable is coordination design, not raw model IQ.

However, three counterpoints emerged from primary sources. First, Anthropic's own data shows token usage explains 80% of performance variance — raw compute, not coordination finesse, is the primary driver. Upgrading from Sonnet 3.7 to Sonnet 4 delivered a larger gain than doubling the token budget. Second, Google Research found a capability saturation threshold at ~45%: once a single agent exceeds this baseline, adding coordination yields diminishing returns. Third, Deloitte and McKinsey emphasize organizational readiness (process redesign, governance, data architecture) as the primary blocker, not technical coordination alone — only 30% of organizations reach maturity level 3+ in agentic AI governance.
The agents that ship are narrow, well-scoped, and composable. Generality emerges from composition of focused agents, not from one omni-agent that does everything.
12-Factor Agents Factor III ('One agent, one job') and Anthropic's simplicity thesis both argue that scope discipline is what separates demos from production. Focused agents are easier to evaluate, debug, and sandbox; they fail in smaller ways and compose into larger systems via explicit protocols like A2A or MCP. The pattern across Claude Code sub-agents, Devin, and LangChain Deep Agents is the same: a top-level planner dispatches to single-purpose workers rather than asking one agent to be universally competent.
Deloitte 2025 Emerging Tech Trends, 500 US tech leaders
40%+
of agentic projects projected to be canceled by 2027
Gartner, June 2025
70%+
of multi-agent failures are systemic — system design ~41.8% + inter-agent misalignment ~31% = ~73%. Coordination, not model capability, is the dominant failure mode
MAST study, arXiv
90.2%
improvement from multi-agent vs. single-agent on research tasks
Anthropic internal eval (Claude Opus 4 + Sonnet 4)
15×
token consumption for multi-agent vs. plain chat
Anthropic multi-agent research system
4×
token consumption for single agent vs. plain chat
Anthropic multi-agent research system
40%
decrease in task completion time from tool description rewriting
Anthropic tool-testing agent
13.7pt
improvement on TerminalBench 2.0 from harness changes alone
LangChain — no model swap
280×
drop in inference costs over two years
Stanford AI Index 2025
0.6%
accuracy over 100 steps, starting from 95% per-step accuracy
Chip Huyen compound error analysis
50×
cost variation for the same accuracy across agent configurations on SWE-bench (Kapoor et al., Princeton, TMLR 2025). Separately, Anthropic's data shows token usage explains 80% of performance variance — raw compute is the primary driver.
AI Agents That Matter — Kapoor et al.
77%
human vs. AI gap on GAIA benchmark (466 tasks testing agent reasoning over tool use).
Mialon et al. — GAIA
37%
average gap between agent benchmark lab accuracy and enterprise production accuracy.
CLASSic framework analyses
2,800%
reduction in hallucination propagation achieved by multi-agent verification pipelines over unchecked cascades.
Spark to Fire — City University of Macau, 2026
97M
monthly SDK downloads for the Model Context Protocol (MCP), making it the dominant agent-to-tool connectivity standard
Anthropic / MCP ecosystem data, mid-2026
150+
organizations supporting the Agent-to-Agent (A2A) protocol, including Adobe, SAP, ServiceNow, Salesforce, and all major cloud providers
Google / A2A Protocol, 2025-2026
~45%
baseline accuracy threshold above which adding multi-agent coordination yields diminishing or negative returns (Google Research, 180 configurations)
Chen et al. — Towards a Science of Scaling Agent Systems, Google Research Dec 2025
+80.8%
improvement from centralized multi-agent coordination on parallelizable tasks, but 39–70% degradation on sequential reasoning (Google Research, 180 configurations)
Chen et al. — Towards a Science of Scaling Agent Systems, Google Research Dec 2025
30%
of organizations reach maturity level 3+ in agentic AI governance, making organizational readiness — not technology — the primary deployment blocker
McKinsey — State of AI Trust in 2026
Architecture Types, Frameworks, Patterns
Five architecture types, seven frameworks, and seven coordination patterns.
You need an AI copilot for a specific workflow but want a human in the loop to see and approve every action before it runs. ChatGPT with plugins drafting emails, basic RAG for knowledge Q&A, GitHub Copilot suggesting completions.
How it differs
Differs from Type II: no fixed workflow — the human drives what step happens next instead of a coded sequence.
Key insight
Dominant production pattern. The LLM decides what to do, the human approves each step. Minimal risk, minimal coordination complexity. 'Start simple, add complexity only when validated' — every major source converges on this as the first rung.
The process steps are known and stable, but individual steps need LLM judgment. Invoice processing pipelines, document classification, email triage where rules + LLM gates combine cleanly.
How it differs
Differs from Type III: the sequence is hard-coded — there is no LLM orchestrator deciding what to do next at runtime.
Key insight
The LLM is a component in a workflow the human designed. Predictable, debuggable, but rigid — the system only handles paths the designer anticipated. Most 'agents' in production today are actually Type II workflows.
The task benefits from parallel specialized sub-work. Research reports, code generation across many files, complex multi-domain analysis. Examples in production: Claude Code delegating file edits to sub-agents, Anthropic's research system running specialist searches in parallel.
How it differs
Differs from Type IV: one-level only — the lead controls all workers directly, no peer-to-peer negotiation or nested delegation.
Key insight
The current frontier. Orchestrator decides WHAT needs doing; workers handle HOW. Anthropic reports 90.2% improvement over single-agent on research tasks — but at 15× token cost. One-level-only delegation prevents cascading hallucination. Near-term projection: 40%+ adoption by 2027.
Multiple autonomous systems owned by different principals need to coordinate. Agent marketplaces, cross-enterprise workflows, research sandboxes testing A2A/MCP protocols at scale.
How it differs
Differs from Type III: no single leader — agents coordinate peer-to-peer through standardized protocols rather than through an orchestrator.
Key insight
Agents communicate through standardized protocols (A2A, MCP) without a central orchestrator. Scales beyond single-team coordination but introduces distributed-systems problems: Byzantine agents, consensus failure, emergent goal drift. 70%+ of multi-agent failures in this class are systemic (MAST study).
Research exploration only. Voyager learning Minecraft skills autonomously, AutoGPT derivatives, open-ended scientific discovery experiments. Not production-ready — safety and governance frameworks for self-modifying systems don't exist yet.
How it differs
Differs from Type IV: agents choose their own objectives and modify their own capabilities, not just coordinate on human-set goals.
Key insight
Agents set their own goals and modify their own code or skill library. Raises safety questions that current governance frameworks aren't built for. AGI candidates may emerge through this pattern — not via single model scale-up but via multi-agent collaboration at scale.
Examples
Voyager (Minecraft), AutoGPT derivatives, research-only systems
Pregel/BSP-inspired execution engine. Each superstep selects nodes whose subscribed channels changed, executes in parallel with isolated state, then applies updates deterministically. Supports arbitrary cycles — critical for agent loops, impossible in DAG-only frameworks like Airflow.
LangGraph Cloud decouples triggers from execution with retry from last checkpoint
Adoption
~25K GitHub stars (LangGraph repo; the langchain parent repo has ~100K), 400+ companies (Uber, LinkedIn, Replit)
Trade-offs
Low-abstraction design requires explicit state schema and edge wiring. Learning curve is the #1 cited limitation — but this explicitness is what makes complex workflows debuggable.
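To make the cycle support concrete, here is a minimal sketch of a write → review loop that routes back on itself, assuming current langgraph APIs; the node bodies and the three-revision budget are placeholders rather than LangGraph's own examples.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END  # assumes the langgraph package is installed

class DraftState(TypedDict):
    draft: str
    revisions: int

def write(state: DraftState) -> DraftState:
    # Placeholder node: in practice this would call an LLM to revise the draft.
    return {"draft": state["draft"] + " (revised)", "revisions": state["revisions"] + 1}

def review(state: DraftState) -> DraftState:
    # Placeholder critic node; real code would score the draft and record feedback.
    return state

def should_continue(state: DraftState) -> str:
    # Cycle back to the writer until the revision budget is spent.
    return "write" if state["revisions"] < 3 else END

graph = StateGraph(DraftState)
graph.add_node("write", write)
graph.add_node("review", review)
graph.set_entry_point("write")
graph.add_edge("write", "review")
graph.add_conditional_edges("review", should_continue)

app = graph.compile()
result = app.invoke({"draft": "first pass", "revisions": 0})
```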
Agents defined by role, goal, and backstory. Three process types: sequential, hierarchical (manager agent delegates dynamically), custom. Flows architecture: event-driven control with single LLM calls at each node, adding deterministic orchestration with typed state.
Dual-workflow: Crews for autonomous collaboration, Flows for deterministic orchestration
Hierarchical mode — manager LLM acts as dispatcher with meta-reasoning
Unified memory: LanceDB with 0.85 similarity consolidation threshold
MCP + LangChain tool compatibility
AMP (agent management platform) for deployment/monitoring
Cognitive Memory backed by LanceDB with five cognitive operations (encode, consolidate, recall, extract, forget) and hierarchical scope trees
Previous ChromaDB backend caused 'database is locked' errors under concurrent access
Adoption
100K+ certified devs, 60% of Fortune 500, 1.4B executions
Trade-offs
Less fine-grained control than LangGraph. Communication overhead scales quadratically in fully-connected patterns. Key open problem: detecting consensus vs. unproductive debate loops. Flows + Cognitive Memory represent a maturing production story, but enterprise features require paid AMP (Agent Management Platform).
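A minimal sketch of the role/goal/backstory model in sequential process mode, assuming the current crewai API; the two agents, their wording, and the tasks are illustrative, not taken from CrewAI's documentation.

```python
from crewai import Agent, Task, Crew, Process  # assumes the crewai package is installed

researcher = Agent(
    role="Research analyst",
    goal="Collect the key facts for the brief",
    backstory="Methodical, cites sources, avoids speculation.",
)
writer = Agent(
    role="Technical writer",
    goal="Turn the research notes into a one-page brief",
    backstory="Writes plainly and keeps to the evidence provided.",
)

research = Task(
    description="Gather facts about the assigned topic.",
    expected_output="A bullet list of sourced facts.",
    agent=researcher,
)
draft = Task(
    description="Write a one-page brief from the research notes.",
    expected_output="A short markdown brief.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research, draft], process=Process.sequential)
result = crew.kickoff()
```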
Separates WHAT (Signatures) from HOW (Optimizers). Define the task specification, then let the framework find optimal prompts automatically through metric-driven optimization.
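A hedged sketch of the WHAT/HOW split, assuming a recent dspy release: the Signature fixes the task contract, and an optimizer such as BootstrapFewShot searches for prompts against a caller-supplied metric. The ticket/note fields, the brevity metric, and the empty trainset are placeholders.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot  # assumes the dspy package, with an LM configured

class Summarize(dspy.Signature):
    """Summarize a support ticket into a one-sentence triage note."""
    ticket: str = dspy.InputField()
    note: str = dspy.OutputField()

summarizer = dspy.ChainOfThought(Summarize)   # HOW: one possible module for the same WHAT

def brevity_metric(example, prediction, trace=None):
    # Hypothetical metric: reward notes of 30 words or fewer.
    return len(prediction.note.split()) <= 30

train_examples = []  # in practice, a list of dspy.Example(ticket=..., note=...) pairs
optimizer = BootstrapFewShot(metric=brevity_metric)
compiled = optimizer.compile(summarizer, trainset=train_examples)
```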
FSM-based graph execution with Pydantic models for all inputs and outputs. Built-in dependency injection via RunContext. History processors run before each LLM call.
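A minimal sketch of the typed-output and dependency-injection model, assuming a recent pydantic-ai release (older versions use result_type and .data instead of output_type and .output); the Deps store, the Triage schema, and the tool are illustrative.

```python
from dataclasses import dataclass
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext  # assumes the pydantic-ai package is installed

@dataclass
class Deps:
    ticket_db: dict[str, str]        # injected dependency, available to every tool

class Triage(BaseModel):             # typed output validated by the framework
    severity: str
    summary: str

agent = Agent("openai:gpt-4o", deps_type=Deps, output_type=Triage)

@agent.tool
def lookup_ticket(ctx: RunContext[Deps], ticket_id: str) -> str:
    """Fetch the raw ticket text from the injected store."""
    return ctx.deps.ticket_db.get(ticket_id, "not found")

result = agent.run_sync(
    "Triage ticket T-42",
    deps=Deps(ticket_db={"T-42": "login page down"}),
)
print(result.output)   # a validated Triage instance
```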
Managed agent service with policy-governed execution. Integrates with AWS identity, security, and compliance infrastructure — the closest thing to 'enterprise agents out of the box'.
Six components that turn a language model into an agent. The harness is the OS,
the model is the CPU.
Harness Component
System Prompt Architecture
Layered documents built to delete with each model release.
When to invest
Prioritize the moment your system prompt exceeds ~500 tokens or you're about to spend real engineering time tuning it. Minimum viable: split your prompt into 3 layers — project-specific (bottom), reusable (middle), deletion-ready (top) — with comments marking which is which. Red flag: you're editing one giant string with no versioning.
The system prompt isn't a single string — it's a stack of layers (custom, built-in, skills, filesystem, tool-specific) that evolve as models improve.
Key data point
Claude Code rewrites prompts across model upgrades without code changes
Claude Code uses a layered stack: custom → built-in → skills → filesystem → tool-specific. Manus uses XML semantic markup. The key insight: write prompts assuming they'll be rewritten when the next model drops. Don't invest in prompt engineering that locks you into current model limitations — invest in the scaffolding that lets you swap prompts cleanly.
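A sketch of the scaffolding that lets you swap prompts cleanly: each layer lives in its own versioned file and is assembled at startup, so the top layer can be deleted when the next model drops. The file names and layout are hypothetical, not Claude Code's.

```python
from pathlib import Path

# Hypothetical layout: each layer is its own file so it can be versioned,
# swapped, or deleted independently when a new model ships.
PROMPT_LAYERS = [
    Path("prompts/00_project.md"),     # bottom: project-specific, stable
    Path("prompts/10_reusable.md"),    # middle: shared conventions and tool norms
    Path("prompts/20_model_patch.md"), # top: model-specific workarounds, deletion-ready
]

def build_system_prompt(layers: list[Path]) -> str:
    parts = []
    for layer in layers:
        if layer.exists():             # a missing top layer just means "nothing to patch"
            parts.append(f"<!-- layer: {layer.name} -->\n{layer.read_text().strip()}")
    return "\n\n".join(parts)

system_prompt = build_system_prompt(PROMPT_LAYERS)
```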
Each MCP tool costs 500–1,000+ context tokens. On-demand loading is critical.
When to invest
Prioritize the moment you have more than 5 tools loaded, or your agent is making obviously wrong tool choices. Minimum viable: audit every tool description against the question "would a new hire understand exactly when to use this vs. the others?" — rewrite any that fail. Red flag: tools with overlapping descriptions.
Tool descriptions aren't free — they consume context you can't get back. Treat them with the same engineering rigor as system prompts.
Key data point
40% completion time reduction from description rewriting alone
Anthropic built a tool-testing agent that rewrites tool descriptions automatically based on task outcomes. Result: 40% reduction in task completion time. Tools that are always loaded should be rare; most should be surfaced on-demand via progressive disclosure. Vercel famously removed 80% of their tools and got better, faster results with fewer steps.
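A sketch of progressive disclosure under stated assumptions: a small core set of tool schemas is always sent, and the rest are surfaced only when a request looks relevant. The registry, tool names, and keyword matching are illustrative; production systems typically use embedding search or a dedicated tool-search tool instead.

```python
# Hypothetical registry: only the "core" schemas ride along on every request;
# everything else costs context only when it is actually needed.
CORE_TOOLS = {"read_file", "search_tools"}

TOOL_REGISTRY = {
    "read_file":    {"description": "Read a file from the workspace by path."},
    "search_tools": {"description": "Find tools relevant to a task and return their full schemas."},
    "run_sql":      {"description": "Run a read-only SQL query against the analytics warehouse."},
    "send_email":   {"description": "Send an email on behalf of the user (requires approval)."},
}

def tools_for_request(query: str) -> list[dict]:
    """Return core tool schemas plus any whose description matches the query."""
    selected = set(CORE_TOOLS)
    for name, spec in TOOL_REGISTRY.items():
        if any(word in spec["description"].lower() for word in query.lower().split()):
            selected.add(name)
    return [{"name": n, **TOOL_REGISTRY[n]} for n in sorted(selected)]
```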
Five memory layers, from ephemeral context window to procedural AGENTS.md files.
When to invest
Prioritize the moment your agent needs to remember anything across sessions. Minimum viable: start with filesystem (.md files in a known directory) before reaching for vector stores. The filesystem is free, debuggable, survives context compression, and covers 80% of production cases. Only add vector search when you have >100 memories AND semantic retrieval is the bottleneck. Red flag: reaching for Pinecone on day one.
Memory isn't a single store — it's a hierarchy: Working (context) → Short-term/Session → Long-term Semantic (vectors) → Long-term Episodic (logs) → Procedural (AGENTS.md files).
Key data point
The filesystem IS the external cognition layer
The COALA framework (Cognitive Architectures for Language Agents) separates memory by lifetime and retrieval method. Working memory is the context window — ephemeral, fast, expensive. Short-term is session-durable. Long-term semantic is vector-retrievable. Long-term episodic is conversation logs. Procedural is codified in AGENTS.md / CLAUDE.md files. Key insight from Manus, Claude Code, and Voyager: the filesystem IS the external cognition layer. Plain .md files are the dominant production substrate.
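A minimal sketch of filesystem-as-memory, assuming a plain directory of .md files; the paths and helper names are hypothetical. The point is that append and read are the whole API, and the result is greppable, diffable, and survives context compression.

```python
from datetime import date
from pathlib import Path

MEMORY_DIR = Path("memory")   # hypothetical layout: one plain .md file per topic

def remember(topic: str, note: str) -> None:
    """Append a dated note to the topic's memory file."""
    MEMORY_DIR.mkdir(exist_ok=True)
    path = MEMORY_DIR / f"{topic}.md"
    with path.open("a") as f:
        f.write(f"- {date.today().isoformat()}: {note}\n")

def recall(topic: str) -> str:
    """Read everything remembered about a topic (empty string if nothing yet)."""
    path = MEMORY_DIR / f"{topic}.md"
    return path.read_text() if path.exists() else ""

remember("deploy-process", "Staging deploys require the #releases approval bot.")
print(recall("deploy-process"))
```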
The #1 bottleneck. Six strategies, no silver bullet.
When to invest
Prioritize the moment your traces show the context getting >70% full. Minimum viable (in order): (1) enable prompt caching breakpoints — free 90% savings, (2) write intermediate work to files instead of holding in context, (3) delegate long side-tasks to sub-agents. Only reach for compaction/summarization when the first three aren't enough. Red flag: your agent fails at long tasks and you don't know the current context utilization.
Every production agent eventually hits the context wall. Solving it is the hardest harness engineering problem — and the biggest differentiator between toy and shipped systems.
Key data point
98.7% context savings via MCP code-as-API pattern (Anthropic)
Six proven strategies: (1) compaction at 92% capacity — summarize old context into new, (2) sub-agent delegation for context isolation — Claude Code's core trick, (3) file buffering — write intermediate work to disk, (4) fan-out to smaller models — Haiku for 50%+ of calls, (5) todo lists as attention anchors, (6) progressive disclosure of tools and skills. Claude Code's compressor triggers at 92% and uses 4 cache breakpoints for 90% savings. Context management is harness work, not model work — LLM improvements alone don't fix it.
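A sketch of the compaction trigger in strategy (1), under stated assumptions: a hypothetical 200K-token window, a caller-supplied tokenizer, and a caller-supplied summarization call. Real compressors, Claude Code's included, are more careful about what they keep verbatim.

```python
COMPACTION_THRESHOLD = 0.92   # mirrors the 92% trigger described above
CONTEXT_LIMIT = 200_000       # hypothetical model context window, in tokens

def maybe_compact(messages: list[dict], count_tokens, summarize) -> list[dict]:
    """Summarize older turns into one message when utilization crosses the threshold.

    count_tokens and summarize are caller-supplied: a tokenizer and an LLM call.
    """
    used = sum(count_tokens(m["content"]) for m in messages)
    if used / CONTEXT_LIMIT < COMPACTION_THRESHOLD:
        return messages
    head, tail = messages[:-10], messages[-10:]   # keep the most recent turns verbatim
    summary = summarize(head)                     # compress everything older
    return [{"role": "system", "content": f"Summary of earlier work:\n{summary}"}] + tail
```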
Four tiers: self-recovery → validation → critic → human.
When to invest
Prioritize the moment your tasks exceed ~5 steps. Compound math: 95% per-step accuracy drops to 77% at 5 steps and 0.6% at 100 steps without recovery. Minimum viable in order: (1) exponential backoff on tool failures with max-retry cap, (2) schema validation on every tool output, (3) add a critic agent only for quality-critical outputs, (4) human escalation path for the remaining 1% of cases. Red flag: your agent silently retries forever, or returns junk without any check.
Agents will fail. The harness decides how gracefully. Build escalation paths in, not as an afterthought.
Key data point
95% step accuracy → 0.6% over 100 steps without error recovery
Tier 1: self-recovery via retries with exponential backoff. Tier 2: validation gates that check tool outputs against schemas. Tier 3: critic agents that review work before finalization. Tier 4: human escalation when the agent can't recover. Max-retry caps prevent runaway loops. The 0.6% compound-error math (Chip Huyen) is why this matters: 95% per-step accuracy collapses to 0.6% over 100 steps if errors aren't caught and corrected along the way.
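A sketch of Tiers 1 and 2, assuming pydantic for the validation gate; the ToolResult schema and retry budget are illustrative. The closing comment shows the compound-error arithmetic behind the 77% and 0.6% figures.

```python
import random
import time
from pydantic import BaseModel, ValidationError

class ToolResult(BaseModel):          # Tier 2: the schema every tool output must satisfy
    status: str
    rows: list[dict]

def call_with_recovery(tool_call, max_retries: int = 3) -> ToolResult:
    """Tier 1: retry with exponential backoff and a hard cap to prevent runaway loops."""
    for attempt in range(max_retries):
        try:
            raw = tool_call()
            return ToolResult.model_validate(raw)         # validation gate
        except (ValidationError, TimeoutError) as err:
            if attempt == max_retries - 1:
                # Tiers 3-4: hand off to a critic agent or a human instead of retrying forever.
                raise RuntimeError("escalation required") from err
            time.sleep((2 ** attempt) + random.random())   # backoff with jitter

# Why the tiers matter: per-step accuracy compounds multiplicatively.
# 0.95 ** 5 is about 0.77 and 0.95 ** 100 is about 0.006, the figures quoted above.
```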
Way more impactful in agents than in single LLM apps.
When to invest
Prioritize this BEFORE your first multi-step failure — you cannot debug what you cannot see. Minimum viable: pick one of LangSmith, Logfire, or Langfuse and wire it on day one. Trace every tool call, every LLM request/response, and every retry. This is the highest-leverage item on the entire list — you can't fix the other five components if you can't see them working. Red flag: you're debugging by re-running prompts manually.
Without traces, you can't debug multi-step failures. Observability is the difference between 'the agent is broken' and 'the agent called tool X with wrong args on step 4 of 7'.
Key data point
13.7pt TerminalBench 2.0 improvement from harness changes alone
LangSmith, Logfire (OpenTelemetry), and Langfuse are the current production choices. Emerging pattern: harness-as-dataset — failure trajectories become training data for the next iteration of the harness. LangChain used this approach to achieve a 13.7pt improvement on TerminalBench 2.0 through harness changes alone, with no model swap. Observability isn't just debugging — it's the feedback loop that makes agents improvable.
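A minimal sketch of per-tool-call tracing using the OpenTelemetry API, which Logfire builds on; LangSmith and Langfuse offer equivalent decorators. The wrapper and attribute names are illustrative.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

def traced_tool_call(name: str, args: dict, tool_fn):
    """Wrap every tool call in a span so multi-step failures are reconstructable."""
    with tracer.start_as_current_span(f"tool:{name}") as span:
        span.set_attribute("tool.args", str(args))
        try:
            result = tool_fn(**args)
            span.set_attribute("tool.ok", True)
            return result
        except Exception as err:
            span.record_exception(err)
            span.set_attribute("tool.ok", False)
            raise
```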
Fourteen labs building agent systems — what they build, how they think about
coordination, what distinguishes their approach.
Anthropic
US
Claude Code harness, multi-agent research systems, constitutional AI
Key system
Claude Code 'nO' harness
Simplicity thesis — the harness is the product, not the model. Uses structured tool-calling loops over heavy orchestration frameworks. Multi-agent research system reports 90.2% improvement on research tasks via parallel sub-agents with isolated contexts, while Claude Code itself defaults to single-agent with focused subagent dispatch. Claude Code operates as a 5-layer stack: MCP (connectivity), Skills (task-specific knowledge), Agent (primary agentic loop), Subagents (parallel workers), and Agent Teams (coordination, shipped early 2026). The system prompt is assembled from ~80 modular pieces across three injection points, split by a cache boundary with 1-hour global TTL and 5-minute per-session TTL. The Tool Search Tool dynamically discovers relevant tools, reducing token usage by 85% (from ~72K to ~8.7K tokens). Sub-agents get independent context windows but cannot spawn their own sub-agents.
Signature contribution
First production-grade coding agent with a transparent subagent model. Popularized the 'focused-first, general-later' pattern and published the Claude Agent SDK for external builders.
Evidence
Claude Code (GA 2025, 'nO' harness internally)
Multi-Agent Research System engineering writeup (June 2025)
Claude Agent SDK (2025)
Model Context Protocol (MCP) specification
Claude Code #1 AI coding tool in 8 months, $2.5B+ annualized revenue
5-layer stack: MCP/Skills/Agent/Subagents/Agent Teams
~80 modular prompt pieces with cache boundary architecture
Bets that simplicity and focused agents beat complex orchestration at current model capability levels. Risk: if frontier models plateau, the simplicity thesis breaks — you need more harness sophistication to compensate for weaker reasoning. For developers: start with Claude Code's single-agent pattern; add sub-agents only when parallelizable breadth tasks justify the coordination cost.
OpenAI
Agent primitives in the API, reasoning-model-driven tool use, consumer agents
Key system
Responses API + tool_use loop
Agent capabilities are exposed as first-class API primitives rather than a framework. The Responses API replaces Assistants with a lighter, stateful tool-calling loop; reasoning models (o-series) drive long-horizon planning internally rather than via external orchestration. Deep Research and Operator are productized agents layered on the same primitives. The Agents SDK (March 2025, evolved from experimental Swarm) provides 3 core primitives: Agents (LLMs with instructions and tools), Handoffs (one-way agent-to-agent delegation implemented as tool calls), and Guardrails (input/output validation running in parallel with execution, with tripwire halting). Deep Research uses a 4-agent pipeline: triage → clarifier → instruction builder → research agent, powered by specialized o3/o4-mini models.
Signature contribution
Shifted the industry from framework-heavy orchestration to model-driven agents where the reasoning model owns the loop.
Bets that reasoning models will internalize planning so well that external orchestration becomes unnecessary. Risk: vendor lock-in — the Responses API is tightly coupled to OpenAI's model ecosystem with no portability story. For developers: fastest path to production agents if you're already on OpenAI, but switching costs grow with every o-series-specific optimization you adopt.
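A hedged sketch of the Agents and Handoffs primitives, assuming the openai-agents Python SDK; the triage and billing agents and their instructions are illustrative, and guardrails are omitted for brevity.

```python
from agents import Agent, Runner  # assumes the openai-agents package is installed

billing = Agent(
    name="Billing agent",
    instructions="Resolve invoice and refund questions. Answer only billing topics.",
)
triage = Agent(
    name="Triage agent",
    instructions="Route each request to the right specialist.",
    handoffs=[billing],   # the handoff is exposed to the model as a tool call
)

result = Runner.run_sync(triage, "I was charged twice for my March invoice.")
print(result.final_output)
```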
Google
Open agent framework, inter-agent protocols, Gemini-native tool use
Key system
Agent Development Kit (ADK) + A2A protocol
Two-layer strategy: ADK as a code-first framework for building multi-agent systems on Gemini, and A2A (Agent-to-Agent) as an open protocol for cross-vendor agent interoperability. Emphasizes standardized agent discovery, capability negotiation, and typed message passing over ad-hoc orchestration. ADK is an open-source (Apache 2.0) toolkit with 5 components: Agent (BaseAgent with LlmAgent, Workflow, and Custom variants), Tool, Callbacks, Runner, and Session/State. The December 2025 Developers Blog documents 8 multi-agent patterns: sequential pipeline, parallel dispatch, routing/coordinator, generator-critic, hierarchical delegation, aggregator, human-in-the-loop, and dynamic routing. ADK is model-agnostic (optimized for Gemini), deployment-agnostic, and GA on Vertex AI.
Signature contribution
Pushed the first broadly-adopted open protocol for agent-to-agent communication, separating transport from orchestration.
Evidence
Agent Development Kit (ADK) open-source release (2025)
Bets that open inter-agent protocols (A2A) will matter more than any single framework. Risk: protocol adoption is slow and A2A competes with MCP for developer attention — if neither wins critical mass, the interop layer fragments. For developers: ADK is a solid Gemini-native choice, but the A2A protocol bet only pays off if multiple vendors actually implement it.
DeepSeek
Open-weight reasoning models purpose-built for agentic tool use
Key system
DeepSeek V3.1 / V3.2 agent-era models
Trains base and reasoning models against 1,800+ agentic training environments so tool-calling, planning, and error recovery are baked into the weights rather than bolted on via prompting. V3.1 was framed as the 'first step toward the agent era'; V3.2 integrates reasoning and tool invocation in a single unified decoding loop. V3.2 (December 2025) is the first model to reason while executing tools — maintaining chain-of-thought across multiple tool calls rather than reasoning first, then executing. Three innovations: DeepSeek Sparse Attention for efficient long-context, a scalable RL framework allocating 10%+ of pre-training compute, and a large-scale agentic task synthesis pipeline covering 1,800+ environments and 85K+ complex instructions.
Signature contribution
First open-weight model family explicitly optimized end-to-end for agent workloads at frontier scale.
V3.2: reasoning-while-executing (maintains CoT across tool calls)
DeepSeek Sparse Attention
10%+ pre-training compute allocated to RL
85K+ complex instructions across 1,800+ environments
SO WHAT?
Bets that baking agentic capabilities into model weights via RL over 1,800+ environments will outperform bolting tools onto generic models. Risk: open weights mean anyone can fine-tune, but the agentic RL training infrastructure is not open — so the moat is in training pipeline, not weights. For developers: strongest open-weight option for agent workloads, but you're dependent on DeepSeek's training choices with no way to retrain the agentic behaviors yourself.
Mistral
Stateful conversational agents and European sovereign agent infrastructure
Key system
Mistral Conversations API + Le Chat agents
Treats agent state as a first-class API concept. The Conversations API persists messages, tool calls, and agent identities server-side so clients do not have to reconstruct context on every turn. Le Chat exposes these agents to end users with web search, code execution, and image generation tools.
Signature contribution
European frontier lab pushing a sovereignty-aware, persistently-stateful agent API as an alternative to stateless tool-calling.
Evidence
Mistral Agents API / Conversations API (2025)
Le Chat agent capabilities (code interpreter, web search, image gen)
Mistral Large and Medium models with native function calling
SO WHAT?
Bets that server-side state persistence is the right abstraction for agent memory, and that European sovereignty matters for enterprise adoption. Risk: stateful APIs create stickier vendor lock-in than stateless alternatives, and the sovereignty advantage only matters in regulated EU verticals. For developers: simplifies multi-turn agent state management significantly, but migrating away means rebuilding your persistence layer.
Cohere
Enterprise retrieval-grounded agents with verifiable citations
Key system
Command models with grounded generation and tool_plan
Treats citation and tool planning as structural outputs rather than prompt conventions. Command models emit a tool_plan field describing intended tool calls before execution and produce inline citations tied to source spans, making enterprise audits and RAG verification tractable. The tool_plan is a natural-language reasoning step generated before tool calls — an explicit chain-of-thought for tool selection. Two citation modes: fast (inline during generation) and accurate (post-generation, higher precision). Command A (111B) supports 256K context with 150% higher throughput than R+.
Signature contribution
Made inline citation and explicit tool planning first-class model outputs, not prompt hacks.
Evidence
Command R / R+ tool use with tool_plan field
Grounded generation with inline citations
Cohere RAG and connector APIs
North enterprise agent platform
tool_plan: explicit chain-of-thought for tool selection
Two citation modes: fast (inline) and accurate (post-generation)
Bets that citation provenance and explicit tool planning as model outputs solve the enterprise trust problem. Risk: structural citation is only as good as the retrieval — garbage sources with perfect citations create false confidence. For developers: if your use case requires auditable RAG with traceable sources, Cohere's approach is materially ahead; if you don't need citation, the overhead adds complexity.
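A hedged sketch of reading the tool_plan field, assuming the Cohere v2 chat API as documented for tool use; the tool schema, model name, and prompt are illustrative.

```python
import cohere  # assumes the cohere SDK with the v2 client and CO_API_KEY set

co = cohere.ClientV2()

tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice",
        "description": "Fetch an invoice by id.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

response = co.chat(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": "Pull up invoice INV-1042 and summarize it."}],
    tools=tools,
)

print(response.message.tool_plan)    # natural-language plan emitted before any tool call
print(response.message.tool_calls)   # the structured calls that plan commits to
```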
Perplexity
Launched February 2026. Uses a meta-router that classifies incoming requests by type and complexity, dispatching to the optimal model from a pool of ~19 models (Claude Opus 4.6 for core reasoning, Gemini for deep research, Grok for speed, GPT-5.2 for long-context recall). Complex tasks decompose into task graphs with sub-agents in isolated Linux sandboxes. The citation-first architecture is built atop Perplexity's search-native retrieval backbone — grounding came before the agent layer.
Signature contribution
Multi-model routing with citation provenance from a search-native foundation. The agent layer was built on top of an existing retrieval system, not the other way around.
Evidence
Perplexity Computer (Feb 2026)
~19 model pool with task-type routing
Isolated Linux sandboxes for sub-agents
Search-native citation backbone
SO WHAT?
Bets that the best agent is a router, not a single powerful model. Risk: model pool management complexity grows with each new frontier model; the routing heuristics need constant tuning. For users: strongest choice when grounded, cited answers matter more than deep reasoning chains.
xAI
Native inference-time multi-agent reasoning with council-based architecture
Key system
Grok 4.20 Beta — 4-agent council on MoE backbone
Grok 4.20 Beta (February 2026) implements native inference-time multi-agent processing with 4 specialized agents — Grok (coordinator), Harper (research/facts), Benjamin (logic/code), Lucas (creative/divergent) — running as heads on the same MoE backbone. A cross-attention block enables critique embedding exchange, and MARL during post-training rewards rapid convergence (average debate under 180 tokens). A lightweight router bypasses the council for simple queries. SuperGrok Heavy scales to 16 agents.
Signature contribution
First production system to embed multi-agent debate as a native inference-time mechanism rather than an application-layer orchestration pattern.
~65% reduction in hallucinations on multi-step reasoning
SuperGrok Heavy: 16 agents
SO WHAT?
⚠️ Internal architecture details are partially speculative — xAI has not confirmed the exact parameter count or adapter mechanism. Bets that multi-agent reasoning should be in the weights, not in external orchestration. Risk: higher inference cost per query; fixed agent roles may not generalize. For users: strongest for multi-step reasoning tasks where hallucination reduction justifies the compute cost.
Cognition
Strong single-agent stance: one context, one decision-maker, tools and subagents only when strictly necessary. The 'Don't Build Multi-Agents' essay argues that context fragmentation between parallel agents is the dominant failure mode in long-horizon work. Devin operates end-to-end on real repositories with its own VM, browser, and editor. The general Devin agent uses frontier models (currently Claude Sonnet 4.5), not a custom model. Kevin-32B is a separate open-source model trained with multi-turn RL specifically for CUDA kernel generation (91% correctness on KernelBench), not the general Devin agent model. Devin 2.0 supports fleet parallelism for scaling across tasks and dropped pricing from $500/month to $20/month.
Signature contribution
Reported 67% PR merge rate on real open-source repos and publicly pushed back on multi-agent orthodoxy.
Evidence
Devin 2.0 release (2025)
'Don't Build Multi-Agents' essay (2025)
67% merge rate on open-source PR benchmark
Cognition agent VM + browser harness
SO WHAT?
Bets that a single agent with unified context outperforms multi-agent decomposition for long-horizon software tasks. Risk: single-context scaling has hard limits — as repository size grows, the agent eventually can't hold enough context to reason effectively, and the 'no multi-agent' stance may not survive that ceiling. For developers: the 67% merge rate is impressive but measured on curated OSS tasks; expect lower rates on messy internal codebases with implicit conventions.
Windsurf
Originally Codeium, now owned by Cognition AI. Uses the Cascade engine with two modes: Code and Chat. A specialized planning agent refines long-term plans in the background while the selected model handles short-term actions. Cascade tracks all user actions — file edits, terminal commands, navigation — to infer intent in real time (the 'Flow' paradigm). Graph-based codebase reasoning uses RAG-based indexing converting files to 768-dimensional embeddings with a proprietary M-Query retrieval technique.
Signature contribution
Real-time intent inference from user behavior patterns. The agent watches what you do, not just what you ask.
Evidence
Cascade engine (Code + Chat modes)
Flow paradigm — real-time user action tracking
Graph-based codebase reasoning
768-dimensional embeddings
M-Query retrieval technique
SO WHAT?
⚠️ 'Dependency graph' and 'mini-compiler' descriptions come from community analysis, not official documentation. Bets that observing developer behavior gives better context than explicit instructions. Risk: privacy concerns with action tracking; accuracy degrades in unfamiliar codebases where the graph is sparse. For users: strongest for long editing sessions in established codebases.
Manus
Treats harness design as the core product and iterates aggressively — publicly documented five full harness refactors in six months. Uses a single long-lived agent with a virtual computer, file system, and browser rather than multi-agent decomposition. Leaked system prompts became a widely-studied case in the agent community. Acquired by Meta for ~$2B in December 2025. Uses CodeAct (ICML 2024): instead of JSON tool calls, the agent generates Python scripts as actions, achieving ~20% higher success rates. A context-aware state machine manages tool availability using logit masking during decoding — constraining which tools can be selected at each state without invalidating KV-cache. The todo.md recitation pattern pushes the global plan into the model's recent attention span. Average task involves ~50 tool calls with a ~100:1 input-to-output token ratio.
Signature contribution
Public case study for single-agent iteration velocity and for the operational realities of shipping a consumer autonomous agent.
Evidence
Manus consumer autonomous agent launch (2025)
Publicly discussed five harness refactors in six months
Bets that rapid harness iteration — five refactors in six months — will converge on a good consumer agent architecture faster than careful upfront design. Risk: iteration velocity without public benchmarks makes it hard to assess actual capability vs. demo polish. For developers: Manus is more instructive as a case study in harness design evolution than as a technology to build on, since the stack is closed.
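A sketch of the todo.md recitation pattern in isolation, under stated assumptions; the file name follows the write-up above, but the helper functions are illustrative, not Manus's code.

```python
from pathlib import Path

TODO = Path("todo.md")   # the recited plan lives in a plain checklist file

def recite_plan(messages: list[dict]) -> list[dict]:
    """Re-append the current plan each turn so it sits in the model's recent attention span."""
    plan = TODO.read_text() if TODO.exists() else "(no plan yet)"
    return messages + [{"role": "user", "content": f"Current plan (recited):\n{plan}"}]

def check_off(item: str) -> None:
    """Mark a finished step so the recited plan reflects real progress."""
    if not TODO.exists():
        return
    lines = TODO.read_text().splitlines()
    TODO.write_text("\n".join(
        line.replace("- [ ]", "- [x]", 1) if item in line else line for line in lines
    ) + "\n")
```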
Cursor
Code-graph-aware IDE agents with shadow execution environments
Key system
Shadow Workspace + Background Agents
Runs agents against a mirrored copy of the user's project (the Shadow Workspace) so speculative edits, builds, and tests never touch the live workspace. Background Agents run longer horizons asynchronously against remote VMs. Heavy investment in code graph indexing so retrieval is structure-aware rather than purely embedding-based. Cursor 3 (April 2026) rebuilt around Composer 2, a custom model trained via real-time RL in the exact Cursor harness. Cloud agents run in isolated VMs. A 'semantic diff' pipeline has the main LLM produce diffs → a cheaper apply-model writes files → linter checks feed back for self-correction. The original shadow workspace (September 2024) used a hidden Electron window for parallel iteration but was later removed.
Signature contribution
Pioneered isolated shadow execution environments for in-IDE agents, decoupling agent exploration from user state.
Shadow workspace concept (Sep 2024, later removed)
SO WHAT?
Bets that shadow execution environments and code-graph-aware retrieval are the right primitives for IDE agents. Risk: the Shadow Workspace adds significant infrastructure complexity — if models get good enough to reason about edits without speculative execution, the isolation layer becomes overhead. For developers: best-in-class for speculative multi-file edits today, but the approach is tightly coupled to IDE context and doesn't transfer to non-coding agent use cases.
Microsoft
Open multi-agent frameworks and enterprise agent tooling
Key system
AutoGen / Microsoft Agent Framework
AutoGen pioneered conversational multi-agent patterns; the newer Microsoft Agent Framework merges AutoGen and Semantic Kernel into a single supported stack. Magentic-One demonstrates a generalist orchestrator coordinating specialized agents for web, file, and code tasks. Copilot Studio exposes these patterns to enterprise low-code builders. ⚠️ Critical context: original AutoGen creators Chi Wang and Qingyun Wu departed Microsoft in late 2024 to establish AG2 as a community fork. Multiple sources report AutoGen has 'virtually disappeared from production environments' in 2026, with the pyautogen PyPI package no longer under Microsoft control. Microsoft merged AutoGen and Semantic Kernel into the Microsoft Agent Framework with RC 1.0 shipped February 2026.
Signature contribution
Largest body of open research on conversational multi-agent orchestration, now being consolidated into a unified enterprise stack.
Evidence
AutoGen open-source framework
Microsoft Agent Framework (AutoGen + Semantic Kernel merger)
Magentic-One generalist multi-agent system (2024)
Copilot Studio agent builder
SO WHAT?
Bets that conversational multi-agent patterns are the right abstraction for enterprise agent systems, now consolidating AutoGen and Semantic Kernel into one stack. Risk: framework merger creates migration churn for existing AutoGen users, and the multi-agent-first philosophy adds orchestration overhead that simpler reasoning-model approaches avoid. Additional framework stability risk: the departure of AutoGen's original creators and the pyautogen package changing hands raises questions about continuity for teams that built on the original AutoGen. For developers: strongest option if you need structured multi-agent workflows with enterprise governance, but evaluate whether a single-agent approach solves your problem first.
Enterprise data catalog agents, production simplification case study
Key system: o3-backed single-agent stack (replacement for hierarchical multi-agent)
Approach
Publicly documented replacing a hierarchical multi-agent orchestration with a single reasoning-model-driven agent once o3-class models became available. The case study is widely cited as evidence that model capability gains can collapse entire orchestration layers, and that most enterprise agent work does not need multi-agent decomposition.
Signature contribution
Canonical production case for 'the model ate the framework' — retiring a multi-agent system in favor of one reasoning model.
Evidence
Alation engineering talk / writeup on retiring hierarchical multi-agent stack
Migration from multi-agent orchestration to o3 single-agent
Production deployment in enterprise data catalog workflows
Approach
Argues that enterprise agents should be expressed as explicit state machines with auditable transitions, approval gates, and reversible actions rather than free-form LLM loops. This aligns agent behavior with existing enterprise risk, audit, and compliance structures and makes failure modes inspectable.
Signature contribution
Leading voice for governance-first agent architectures in regulated enterprise deployments.
Evidence
QuantumBlack publications on enterprise agent architecture
McKinsey Digital reports on generative AI operating models
Client case studies on state-machine agent governance
SO WHAT?
Bets that explicit state machines with approval gates are worth the upfront design cost for enterprise agent deployments. Risk: state-machine rigidity can bottleneck iteration speed — every new agent behavior requires a new auditable state transition, which slows teams used to prompt-and-ship cycles. For developers: the right pattern if you're in regulated industries (finance, healthcare) where auditability is non-negotiable; overkill for internal tools or consumer products.
Open-source AI research + Llama Stack pluggable-provider API framework for agent infrastructure
Key system: Llama Stack — provider-agnostic agent API
Approach
Long-running open research program on agents that must reason about other agents — negotiation, planning under partial information, and natural-language coordination. CICERO combined a strategic planning module with a dialogue model to reach human-level performance in Diplomacy. Subsequent work has focused on open tool-use datasets and evaluation. Llama Stack is a pluggable-provider API framework with full OpenAI API compatibility. Key APIs span inference, safety (Llama Guard 3, Prompt Guard), memory (vector/KV/keyword/graph), and agentic orchestration. Core design principle is provider-agnostic: 'Develop locally with Ollama, deploy to production with vLLM — the API stays the same.' Llama 3.1+ models have built-in tool calling. Toolformer's research influence is indirect — Llama Stack standardizes the infrastructure, not the model capability.
Signature contribution
Only lab to demonstrate human-level performance in a natural-language negotiation game, and a primary source of open agent research artifacts.
Evidence
CICERO Diplomacy agent (Science, 2022)
Open tool-use and agent evaluation datasets
Llama models with native tool use
FAIR open research publications on multi-agent interaction
Llama Stack API framework
Llama Guard 3 + Prompt Guard safety
OpenAI API compatibility layer
SO WHAT?
Bets on open-weight ecosystem moat. Risk: Meta's agent infrastructure maturity lags behind Anthropic/OpenAI/Google; Llama Stack adoption is early. For developers: strongest value proposition when you want to avoid vendor lock-in and own the full stack.
Approach
Domain-specialized agents that combine a learned policy with extensive verified search. AlphaCode samples and filters programs against test cases; AlphaProof translates problems into Lean and searches proof space under a learned prior; Isomorphic Labs applies analogous ideas to drug discovery. All share the pattern of grounding LLM candidates in a verifier rather than trusting single-shot generation.
Signature contribution
Defines the verifier-grounded frontier — agents whose outputs are filtered through formal or experimental verification rather than self-critique.
Evidence
AlphaCode and AlphaCode 2 (competitive programming)
SO WHAT?
Bets that formal verification (proof checkers, test suites) as a filter on LLM-generated candidates is the path to reliable long-horizon agents. Risk: verifier-grounded approaches only work in domains where verification is tractable — math proofs and code tests qualify, but most business tasks don't have clean verifiers. For developers: the pattern (generate many, verify few) is transferable, but building your own verifier is the hard part.
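The generate-many, verify-few pattern is simple to express in code. A minimal sketch, assuming the caller supplies the `generate` function (an LLM call) and a `check` verifier; the `solve` and `make_io_checker` names are illustrative, not DeepMind's implementation:

```python
from typing import Callable

def generate_verify_select(
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> candidate program text
    check: Callable[[str], bool],     # verifier: run tests / type-check / proof-check the candidate
    prompt: str,
    n_samples: int = 32,
) -> str | None:
    """Sample many candidates cheaply, trust only the verifier, return a survivor (or None)."""
    survivors = [c for c in (generate(prompt) for _ in range(n_samples)) if check(c)]
    return survivors[0] if survivors else None

# Example verifier for code tasks: exec the candidate and test it on known input/output pairs.
def make_io_checker(cases: list[tuple[int, int]]) -> Callable[[str], bool]:
    def check(candidate: str) -> bool:
        namespace: dict = {}
        try:
            exec(candidate, namespace)              # candidate is expected to define solve(x)
            return all(namespace["solve"](x) == y for x, y in cases)
        except Exception:
            return False
    return check
```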
Twelve approaches to agent memory. Memory is a learnable decision, not a data structure.
Memory-R1 · Hybrid
Reinforcement-learned memory management for long-horizon agents
Trains a policy to decide what to write, recall, and forget in agent memory. The policy is optimized end-to-end against task rewards rather than handcrafted heuristics, and outperforms rule-based memory managers on long-horizon reasoning benchmarks.
Key insight
Memory is a learnable policy, not a data structure. Writing and forgetting are decisions that can be optimized.
Tradeoff
Requires training data and RL compute; policies are brittle outside the trained task distribution.
Revisits completed agent trajectories after the fact and rewrites them into cleaner, more instructive episodes before storing them as memory. The revised traces act as higher-quality exemplars for future retrieval, improving downstream task performance without changing the base policy.
Key insight
What the agent remembers about an episode should not be the raw trace — it should be the lesson extracted in hindsight.
Tradeoff
Adds an offline revision pass and risks introducing hindsight bias if the revision model hallucinates.
Treats the LLM context window like RAM and external storage like disk. A controller LLM pages information in and out of context using explicit function calls, maintaining a working set, a recall archive, and a core persona block. This lets agents operate over effectively unbounded histories with a fixed context window.
Key insight
Virtual memory is the right abstraction for bounded context windows — give the model tools to manage its own paging.
Tradeoff
Every page-in/page-out is a tool call that costs tokens and latency, and the controller can mis-page under pressure.
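A toy sketch of the paging idea, assuming illustrative tool names (`page_in`, `page_out`) and a fixed working-set size rather than MemGPT's actual interface; the point is that the model manages its own window through explicit calls:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class VirtualContext:
    """Context window as RAM, external archive as disk; the model pages explicitly via tool calls."""
    max_working_items: int = 8
    working_set: deque = field(default_factory=deque)       # what the model sees each turn
    archive: dict[str, str] = field(default_factory=dict)   # unbounded external storage

    def page_out(self, key: str, content: str) -> str:
        """Evict content from the window into the archive (a tool the model can call)."""
        self.archive[key] = content
        return f"paged out '{key}'"

    def page_in(self, key: str) -> str:
        """Bring archived content back into the window, evicting the oldest item if full."""
        if len(self.working_set) >= self.max_working_items:
            self.working_set.popleft()                        # lossy eviction under pressure
        content = self.archive.get(key, f"<no archive entry for '{key}'>")
        self.working_set.append(f"{key}: {content}")
        return content

    def render(self) -> str:
        """Serialize the working set into the prompt for the next model call."""
        return "\n".join(self.working_set)
```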
Cognitive architectures framework for language agents
Provides a reference architecture decomposing language agents into working memory, episodic memory, semantic memory, and procedural memory, connected via reasoning, retrieval, learning, and grounding actions. CoALA is primarily a conceptual framework that lets researchers classify and compare otherwise incompatible agent designs.
Key insight
Language agents already implement cognitive-architecture concepts implicitly; naming them explicitly makes design choices comparable.
Tradeoff
Descriptive rather than prescriptive — does not specify implementations, so two CoALA-compliant agents can differ enormously.
Timestamped observation stream with reflection and importance-weighted retrieval
Each agent maintains an append-only stream of natural-language observations. Retrieval scores candidates on recency, importance, and relevance. Periodically the agent generates higher-level 'reflections' by asking itself what the recent stream implies, and those reflections are written back into the same stream, producing a ladder of abstraction.
Key insight
A flat observation log plus periodic self-reflection is enough to produce emergent long-horizon behavior in social simulations.
Tradeoff
Retrieval quality degrades as the stream grows, and reflections can amplify errors if earlier observations were wrong.
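The scoring rule is easy to sketch. A minimal version with equal weights, an assumed exponential recency decay, and a caller-supplied `embed` function (all illustrative choices, not the paper's exact constants):

```python
import math
import time
from dataclasses import dataclass

@dataclass
class Observation:
    text: str
    created_at: float       # unix timestamp
    importance: float       # 0-10, assigned by the model when the memory is written

def recency(obs: Observation, now: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay on age; the half-life is an illustrative choice."""
    age_hours = (now - obs.created_at) / 3600.0
    return 0.5 ** (age_hours / half_life_hours)

def relevance(obs: Observation, query_vec: list[float], embed) -> float:
    """Cosine similarity between the query and the observation embedding."""
    v = embed(obs.text)
    dot = sum(a * b for a, b in zip(v, query_vec))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in query_vec))
    return dot / norm if norm else 0.0

def retrieve(memories: list[Observation], query_vec: list[float], embed, k: int = 5):
    """Score = recency + importance + relevance, each roughly normalized to [0, 1]."""
    now = time.time()
    scored = [
        (recency(m, now) + m.importance / 10.0 + relevance(m, query_vec, embed), m)
        for m in memories
    ]
    return [m for _, m in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]
```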
Procedural memory as a growing library of executable skill code
In Minecraft, Voyager asks an LLM to write JavaScript functions that accomplish tasks, verifies them in the environment, and stores successful functions in a skill library keyed by natural-language descriptions. Future tasks retrieve relevant skills by description and compose them, so the agent's capability grows monotonically as code rather than as weights.
Key insight
Procedural memory can be literal, inspectable source code — skills that accumulate form a curriculum the agent writes for itself.
Tradeoff
Limited to domains where actions can be expressed as verifiable code and where failures are cheap to retry.
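A sketch of a skill library under these assumptions: skills are plain source text keyed by a description, the `embed` function is supplied by the caller, and verification happens in the environment before `add` is called; none of this is Voyager's Minecraft-specific machinery:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    description: str     # natural-language key used for retrieval
    source: str          # the executable code the agent wrote for itself
    verified: bool       # only environment-verified skills enter the library

class SkillLibrary:
    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.skills: list[tuple[list[float], Skill]] = []

    def add(self, skill: Skill) -> None:
        """Store only skills that passed verification in the environment."""
        if skill.verified:
            self.skills.append((self.embed(skill.description), skill))

    def retrieve(self, task: str, k: int = 3) -> list[Skill]:
        """Return the k skills whose descriptions are most similar to the task."""
        q = self.embed(task)
        def sim(v: list[float]) -> float:
            dot = sum(a * b for a, b in zip(q, v))
            nq = sum(a * a for a in q) ** 0.5
            nv = sum(b * b for b in v) ** 0.5
            return dot / (nq * nv) if nq and nv else 0.0
        ranked = sorted(self.skills, key=lambda pair: sim(pair[0]), reverse=True)
        return [skill for _, skill in ranked[:k]]
```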
User-scoped long-term semantic memory with explicit user control
Projects bundle documents, instructions, and conversations into a persistent workspace that Claude can reference across sessions. Claude Memory extends this with automatically maintained user facts that the model can read and update. Both are scoped per user and designed for transparent inspection and editing by the user rather than opaque background personalization.
Key insight
Long-term memory in a consumer product is as much a UX problem as a retrieval problem — users need to see and edit what the agent remembers.
Tradeoff
Explicit memory surfaces add UI friction and require the model to stay consistent with user-edited facts.
ChatGPT automatically extracts salient facts from conversations and stores them in a user-scoped memory store that is injected into future system prompts. Users can list, edit, and delete memories, and can disable the feature entirely. Memory is cross-session but bounded per user account.
Key insight
Most consumer agent memory value comes from a small set of stable user facts, not from full conversational replay.
Tradeoff
Salience extraction is a heuristic; over-remembering creates privacy concerns and under-remembering makes the feature feel inert.
Stateful agent platform productizing the MemGPT memory architecture
Letta (formerly the MemGPT company) provides a server and SDK for building agents whose memory, tools, and identity persist across sessions. It exposes MemGPT-style core memory blocks, archival memory, and recall memory as first-class concepts, with a database backend so state survives process restarts.
Key insight
Stateful agents need a memory database, not a chat history — persistence should be a platform concern, not a prompt concern.
Tradeoff
Introduces a server and schema that application developers must operate and reason about.
Mem0 adds a pluggable memory layer to LLM apps with user-level, session-level, and agent-level scopes. It extracts facts from conversations, deduplicates and updates them over time, and exposes a simple add/search API so applications do not have to build their own retrieval pipeline.
Key insight
Most production LLM apps need memory at multiple scopes simultaneously (user, session, agent); a shared layer is cheaper than rebuilding each.
Tradeoff
Extraction and dedup quality depends on the underlying model; stale or duplicated facts leak into retrieval results.
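The shape of such a layer fits in a few lines. A minimal sketch (not Mem0's actual API) with user, session, and agent scopes over one store and a naive keyword match standing in for embedding search:

```python
from dataclasses import dataclass, field

@dataclass
class ScopedMemory:
    """One store, three scopes; real implementations add extraction, dedup, and vector search."""
    facts: dict[tuple[str, str], list[str]] = field(default_factory=dict)

    def add(self, scope: str, scope_id: str, fact: str) -> None:
        """scope is one of 'user', 'session', 'agent'; dedup here is a simple exact-match check."""
        bucket = self.facts.setdefault((scope, scope_id), [])
        if fact not in bucket:
            bucket.append(fact)

    def search(self, query: str, scope: str, scope_id: str) -> list[str]:
        """Naive keyword match standing in for embedding retrieval."""
        bucket = self.facts.get((scope, scope_id), [])
        terms = query.lower().split()
        return [f for f in bucket if any(t in f.lower() for t in terms)]

memory = ScopedMemory()
memory.add("user", "u-42", "prefers answers in French")
memory.add("session", "s-7", "is debugging a flaky CI pipeline")
print(memory.search("french preference", scope="user", scope_id="u-42"))
```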
Zep builds a temporal knowledge graph from agent conversations and documents, where entities and relationships carry validity intervals. Retrieval returns facts that were true at a given time, not just semantically similar text. This lets agents answer questions about how user state evolved and avoid recalling stale facts.
Key insight
Semantic memory without time is a lie waiting to happen — facts have validity windows and memory should model them.
Tradeoff
Maintaining a temporal graph is heavier than vector search and requires reliable entity resolution.
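A sketch of what validity intervals buy you, with illustrative field names rather than Zep's schema: retrieval filters on 'true at time t', not on similarity alone:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TemporalFact:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: datetime | None = None   # None means still believed true

    def valid_at(self, t: datetime) -> bool:
        return self.valid_from <= t and (self.valid_to is None or t < self.valid_to)

facts = [
    TemporalFact("user", "employer", "Acme", datetime(2022, 1, 1), datetime(2024, 6, 1)),
    TemporalFact("user", "employer", "Globex", datetime(2024, 6, 1)),
]

def employer_as_of(t: datetime) -> str | None:
    """Return the fact that was true at time t, not the most recent or most similar text."""
    for f in facts:
        if f.predicate == "employer" and f.valid_at(t):
            return f.obj
    return None

print(employer_as_of(datetime(2023, 3, 1)))   # Acme
print(employer_as_of(datetime(2025, 1, 1)))   # Globex
```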
Session-scoped checkpoints plus cross-session store as framework primitives
LangGraph separates memory into two primitives. The Checkpointer persists the full graph state after every step, giving session-scoped short-term memory, resumability, and time-travel debugging. The Store provides a cross-session, namespaced key-value memory for long-term facts. Both are pluggable across SQLite, Postgres, and Redis backends.
Key insight
Short-term and long-term memory have different failure modes — treating them as separate primitives with different backends avoids conflating durability with retrieval.
Tradeoff
Two APIs to reason about, and the Store leaves higher-level concerns like extraction and dedup to the application.
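A deliberately framework-agnostic sketch of the two primitives (not LangGraph's exact API): a checkpointer that snapshots full per-thread state after each step, and a namespaced key-value store for cross-session facts:

```python
import json
import sqlite3

class Checkpointer:
    """Session-scoped: persist the full agent state after every step, keyed by thread and step."""
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS checkpoints (thread TEXT, step INTEGER, state TEXT)")

    def save(self, thread: str, step: int, state: dict) -> None:
        self.db.execute("INSERT INTO checkpoints VALUES (?, ?, ?)", (thread, step, json.dumps(state)))
        self.db.commit()

    def latest(self, thread: str) -> dict | None:
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE thread=? ORDER BY step DESC LIMIT 1", (thread,)
        ).fetchone()
        return json.loads(row[0]) if row else None

class Store:
    """Cross-session: namespaced key-value memory for long-term facts."""
    def __init__(self):
        self.data: dict[tuple[str, str], dict] = {}

    def put(self, namespace: str, key: str, value: dict) -> None:
        self.data[(namespace, key)] = value

    def get(self, namespace: str, key: str) -> dict | None:
        return self.data.get((namespace, key))
```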
Five unresolved questions shaping how agents are built. Each shows pro, con, and a practitioner resolution.
Single vs. Multi: Single-Agent vs. Multi-Agent
Pro
Multi-agent systems deliver measurable improvements on parallelizable breadth tasks and outperform single-agent baselines on research and analysis work. Google Research (December 2025, 180 configurations) found centralized coordination improves parallelizable tasks by 80.8%.
Con
Reasoning models handle planning natively. Adding agents adds coordination overhead, failure modes, and a 15× token cost without corresponding gains on depth tasks, and degrades sequential reasoning by 39–70%. The 'Rule of 4' limits effective team sizes to 3–4 agents before communication overhead dominates. Princeton NLP found single-agent matched or outperformed multi-agent on 64% of benchmarked tasks.
Princeton NLP — single-agent wins 64% of benchmarks
Resolution
Task-dependent, not universal. Multi-agent for parallelizable breadth. Single-agent for sequential depth. The Rule of 4 caps effective team size. Cost multiplier: single agents 4× tokens vs chat, multi-agent 15×.
Pro
Frameworks capture collective wisdom from thousands of production deployments. They provide useful infrastructure — task queues, persistence, checkpointing — that you'd otherwise rebuild. Harrison Chase distinguishes frameworks (abstractions), runtimes (durable execution), and harnesses (batteries-included) — arguing the runtime and harness layers provide lasting value. 'Use LangGraph for agents, not LangChain.'
Con
Extra abstraction layers obscure what the model actually sees. When prompts and responses are hidden behind framework code, debugging becomes archaeology. MindStudio argues better models have made 'many framework abstractions unnecessary or actively harmful.'
Resolution
Own your cognitive architecture (it's your differentiator). Outsource agentic infrastructure (task queues, persistence, checkpointing — commodity work). The line isn't framework vs no framework — it's 'what part do I need to see directly to reason about my system?' The debate is evolving: the consensus is moving toward raw SDK for simple agents, LangGraph for stateful orchestration, CrewAI for rapid prototyping.
Scaffolding vs. Minimal: The "Bitter Lesson" for Agents
Pro
Harness engineering yields measurable improvements independent of model quality. LangChain achieved 13.7pt TerminalBench 2.0 gains through harness changes alone — no model swap. Google Research's capability saturation threshold (~45%) bounds this: once models clear the threshold, engineering's marginal value decreases.
Con
Heavy scaffolding has diminishing returns. As models improve, the scaffolding you built last year becomes the ceiling that prevents this year's model from showing its real capability. But Manus rebuilt their harness 5 times with the same models and improved each time. And compound-error math (0.99^100 = 36.6%) constrains all architectures regardless.
Manus — 5 harness rebuilds, same models, improving each time
Resolution
Current evidence supports a hybrid: engineering matters now, but simpler architectures survive model upgrades better. The bitter lesson applies at the architecture level but not at the harness level.
Pro
Full automation isn't the goal — cognitive augmentation is. The best AI products extend human capability rather than replace it. Tesla ran Autopilot for 12 years without achieving full self-driving. Karpathy's autonomy slider (YC Keynote, June 2025): 'Less AGI hype and flashy demos that don't work, more partial autonomy, custom GUIs and autonomy sliders.' Cursor's Tab → Cmd+K → Cmd+L → Agent mode exemplifies graduated autonomy.
Con
Genuine autonomous capability exists today in bounded domains. Claude Code and Devin demonstrate long-horizon autonomy when the task shape is well-understood. Refusing to automate is leaving value on the table.
Resolution
Karpathy's "autonomy slider" — Cursor demonstrates this cleanly: Tab → Cmd+K → Cmd+L → Agent mode. Let users dial in the autonomy level themselves based on task risk and their own trust. The debate isn't 'which mode' — it's 'expose the slider'. This debate is effectively settled as a design pattern rather than an ideological argument — builders should implement autonomy sliders, not choose sides.
Focused vs. General: Focused-First vs. General-Orchestrator
Pro
Focused agents with narrow scope win. Production evidence shows monolithic agents reach ~40% success on complex tasks while focused-and-composed agents hit ~95% at 20–30% context each — scope discipline is the single biggest reliability lever. Google Research's December 2025 study (180 configurations) found task type determines architecture more than any other variable — builders should choose by task type, not autonomy level.
Con
General orchestrators scale better as model capability grows. The Bitter Lesson applied to agents says that frontier reasoning models (DeepSeek V3.1, o3-class) absorb planning and decomposition natively, so a single capable agent outperforms hand-built hierarchies.
DeepSeek V3.1 technical report
Alation — replaced a multi-agent hierarchy with a single o3 agent
Resolution
Focused-first, general-later. Start with narrow agents and compose them; migrate toward a single-agent architecture as frontier-model capability absorbs orchestration overhead. The right answer shifts with each model generation. A dual taxonomy is emerging: the 5-type autonomy framework for strategic planning, but task-type-first (parallelizable vs sequential vs exploratory) for day-to-day design decisions.
How multi-agent systems fail in production — and the mitigations that measurably reduce them.
Cascade amplification
critical
A single erroneous output propagates through a multi-agent topology until all nodes hold the corrupted state. Downstream agents treat upstream claims as ground truth and re-emit them with added confidence.
Evidence
5 of 6 frameworks tested reached 100% infection from one seed error (Spark to Fire, City University of Macau, 2026).
Mitigation
Multi-agent verification pipelines — require independent re-derivation of load-bearing claims before accepting upstream output, with verifiers that cannot see the original chain of reasoning.
Impact: ~2,800% reduction in hallucination propagation reported by Spark to Fire with verification pipelines in place.
Topological fragility
high
Errors injected at hub nodes dominate the outcome far more than errors at leaves. Star and tree topologies concentrate blast radius on whichever agent routes or aggregates traffic.
Evidence
Hub-vs-leaf injection showed a ~10.3× impact gap in Spark to Fire experiments across standard multi-agent topologies.
Mitigation
Avoid hub bottlenecks. Prefer flatter topologies, replicate critical routing/aggregation roles, and add independent cross-checks on any node the system cannot afford to be wrong.
Impact: Flattening topology and duplicating hub responsibilities brings worst-case impact closer to the leaf-injection baseline.
Consensus inertia
high
Iterative debate and voting loops lock in early errors. Each round inherits the prior round's framing, so agents rationalize rather than revisit, and consensus hardens around whichever answer got traction first.
Evidence
~3.9× contextual debt accumulated by round 6 in Spark to Fire's debate protocols, with error rates increasing rather than decreasing with additional rounds.
Mitigation
Bound the number of rounds. Periodically restart from independent initial conditions and compare, rather than letting a single thread converge unchallenged.
Impact: Bounded-round protocols with independent restarts recover most of the accuracy lost to runaway debate loops.
Genealogy drift
high
Agents forget the provenance of intermediate claims. As outputs get summarized, paraphrased, and forwarded, the link back to the originating source (and its confidence) is lost, so every claim ends up looking equally authoritative.
Evidence
Spark to Fire reports defense success against adversarial injection rising from 0.32 to 0.89 when genealogy-graph governance is enforced end-to-end.
Mitigation
Attach and propagate a genealogy graph with every intermediate artifact — who produced it, from which inputs, with what confidence. Refuse to act on claims whose provenance has been stripped.
Impact: Defense success 0.32 → 0.89 (Spark to Fire) once provenance is mandatory rather than optional.
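A sketch of what propagating a genealogy graph looks like in practice, with illustrative field names: every derived artifact records its producer, inputs, and confidence, and the governance rule refuses claims whose lineage is missing:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    claim: str
    produced_by: str                      # agent or tool that emitted the claim
    confidence: float                     # producer's own calibration, 0-1
    parents: list["Artifact"] = field(default_factory=list)  # inputs this claim was derived from

def derive(claim: str, agent: str, confidence: float, parents: list[Artifact]) -> Artifact:
    """Every transformation records its inputs, so lineage survives summarization and handoff."""
    return Artifact(claim, agent, confidence, parents)

def actionable(artifact: Artifact, min_confidence: float = 0.6) -> bool:
    """Refuse to act on claims whose provenance has been stripped or whose lineage is weak."""
    if not artifact.parents and artifact.produced_by != "primary_source":
        return False                      # orphaned claim: no traceable origin
    lineage_conf = min([artifact.confidence] + [a.confidence for a in artifact.parents])
    return lineage_conf >= min_confidence
```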
Systemic coordination failure
critical
Most multi-agent failures are not individual-agent capability gaps — they are role confusion, dropped handoffs, redundant work, and missing termination conditions between otherwise-capable agents.
Evidence
MAST (Multi-Agent System Failure Taxonomy) finds 70%+ of observed multi-agent failures are systemic coordination issues, not model capability limits.
Mitigation
Reduce agent count. Apply 12-Factor Factor X (small, focused agents) and collapse overlapping roles; every additional agent is a new coordination surface and a new failure mode.
Impact: Teams that cut agent count and tightened roles after MAST-style audits recover most of the lost reliability without changing models.
Hallucination compounding
high
Per-step hallucination rates that look tolerable in isolation compound multiplicatively across a multi-agent pipeline. A 5% step error over ten steps leaves roughly 40% of trajectories corrupted end-to-end.
Evidence
Both MAST (arXiv 2503.13657) and Spark to Fire (arXiv 2603.04474) document compounding error rates as a dominant source of multi-agent unreliability.
Mitigation
Ground every step with retrieval (RAG) against authoritative sources and refuse to act on unsourced claims. Prefer fewer, grounded steps over many ungrounded ones.
Impact: 42–68% reduction in compounded hallucination reported across RAG-grounded multi-agent pipelines vs. ungrounded baselines.
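The compounding arithmetic is worth keeping at hand; assuming independent per-step errors, reliability p over n dependent steps leaves p^n of trajectories intact:

```python
def surviving_fraction(step_reliability: float, steps: int) -> float:
    """Fraction of trajectories with zero corrupted steps, assuming independent errors."""
    return step_reliability ** steps

print(f"{1 - surviving_fraction(0.95, 10):.0%} corrupted after 10 steps")   # ~40%
print(f"{surviving_fraction(0.95, 100):.1%} intact after 100 steps")        # ~0.6%
```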
Context poisoning
critical
Adversarial instructions embedded in untrusted inputs (web pages, emails, tool outputs, retrieved docs) are ingested into the agent's context window and then executed as if they were operator instructions. One poisoned input can compromise an entire downstream agent chain.
Evidence
Spark to Fire and MAST both document prompt-injection propagation as a first-class multi-agent failure mode, not merely a single-agent concern.
Mitigation
Per-agent untrusted-input firewalls: segregate untrusted content into quarantined channels, strip or escape instruction-like patterns, and never allow untrusted text to reach the instruction layer of downstream agents.
Impact: Firewalled architectures dramatically reduce cross-agent propagation of injected instructions, though they do not eliminate single-agent exposure.
Reward hacking in review
high
An agent grading its own work (or a sibling's work from the same lineage) systematically inflates scores, ignores its own errors, and rationalizes failures as successes. Self-grading loops converge on self-justification rather than truth.
Evidence
MAST identifies self-verification as a recurring coordination failure; production teams repeatedly report accuracy metrics collapsing once the grader is replaced with an independent judge.
Mitigation
Apply 12-Factor Factor V — no agent grades its own work. Use a separately-prompted, separately-contexted judge agent, or better, deterministic verification (tests, schema checks, re-derivation).
Impact: Independent judges and deterministic checks routinely surface failures that self-graders marked as passing, restoring calibrated accuracy signals.
Architecture Limits
Five architectures we can't yet build and the engineering blockers in front of each.
Persistent Cognitive Systems
Agents with durable identity and evolving expertise across weeks-to-years of continuous operation, accumulating skill rather than restarting each session.
Memory structure
No consensus on how to organize episodic vs. semantic vs. procedural memory at scale without context degradation. Current systems collapse distinctions into flat vector stores that lose temporal and causal ordering.
Context degradation
Long rolling contexts accumulate noise and contradictions; summaries lose fidelity. Every compaction step is lossy, and the loss is unevenly distributed across facts.
Autonomous memory management
Forgetting is a decision, not a side effect. Current agents rely on brittle heuristics or fixed token windows and cannot autonomously decide what to retain, revise, or discard.
Large populations of specialist agents that discover, bid for, and trade work through price signals rather than hand-coded routing — coordination as an emergent property of an internal economy.
O(n²) coordination cost
Pairwise negotiation and discovery costs scale quadratically with agent count; without a matching layer the economy collapses under its own message volume long before producing useful allocations.
Latency
Bidding, settlement, and arbitration add round-trips on top of already-slow LLM inference. Any task sensitive to wall-clock time cannot tolerate a market in the hot path.
Shared knowledge
Markets presuppose a common ontology of goods, qualities, and contracts. Agents with divergent world models cannot reliably price each other's outputs, leading to adverse selection and collapse of the price signal.
Nearest precedent: Research prototypes (MetaGPT-style role markets, auction-based MAS) — none in production at meaningful scale.
RossLabs synthesis
Self-Improving Agent Systems
Agents that measurably get better at their own job through deployment — not just more context, but genuinely updated capability that compounds over time.
Static LLM weights
The underlying model is frozen between releases. All 'learning' happens in prompts, tools, or retrieval layers — which are easier to poison than to harden, and reset on every model upgrade.
Weak feedback signals
Production traces rarely include ground-truth outcomes. Self-reported success, user thumbs, and downstream proxies are noisy and game-able, so gradient direction is unreliable even when a learning loop exists.
Catastrophic drift
Online updates optimized against noisy rewards routinely degrade previously-solid behavior. Without regression harnesses comparable to pretraining evals, each 'improvement' risks silent capability loss.
Nearest precedent: Voyager, Reflexion, STaR-style self-training — bounded domains, no durable identity carried across sessions.
RossLabs synthesis
Hierarchical Cognitive Organizations
Deeply-layered agent organizations where strategy at the top decomposes into tactics in the middle and execution at the leaves — coherent over long horizons, like a functional org chart that actually works.
Planning horizons
Current models lose coherence beyond a few dozen dependent steps. Multi-week plans fragment into locally plausible but globally inconsistent subgoals, and no layer can reliably detect the divergence.
Constraint enforcement
Upper layers cannot reliably bind lower layers to their intent. Sub-agents routinely rationalize around constraints rather than respect them, and the parent has no privileged channel to verify compliance.
Weak world models
Hierarchy presumes the top layer can reason about consequences in the environment. LLMs still lack the durable causal models needed for long-horizon planning to stay grounded rather than drift into narrative.
Nearest precedent: MetaGPT, ChatDev, AutoGen hierarchical teams — impressive demos, brittle under load.
RossLabs synthesis
Distributed Reasoning Networks
Large networks of agents collectively reasoning over shared problems — no single coordinator, intermediate results flowing through the graph until consensus emerges.
Memory infrastructure
Shared blackboards, vector stores, and message queues are not yet engineered for the throughput, consistency, and provenance guarantees a reasoning network requires. Current infra is built for request/response, not continuous collective cognition.
Observability
When a distributed reasoning network produces a wrong answer, there is no tractable way to attribute the error. Traces explode combinatorially and existing tracing tools were built for microservices, not for semantic causality.
Compute cost
Every additional node multiplies inference spend. Without order-of-magnitude cost reductions or drastic model specialization, the economic floor for a meaningful network is already above most budgets.
Nearest precedent: Research demos on graph-of-thought and debate ensembles — small node counts, short horizons.
RossLabs synthesis
Benchmarks
Six benchmarks scored across the CLASSic framework — Cost, Latency, Accuracy, Security, Stability.
SWE-bench
2,294 real GitHub issues across 12 popular Python repositories
Measures whether agents can resolve real bug reports and feature requests end-to-end against a repository's own test suite.
Top score: SWE-bench Verified >65% (2026)
Key finding: 50× cost variation observed for equivalent accuracy (Kapoor et al., Princeton TMLR 2025).
CLASSic
Cost: High — full pytest runs per iteration
Latency: Minutes per task
Accuracy: Primary axis of comparison
Security: N/A — sandboxed containers
Stability: Contamination-sensitive; see SWE-bench-Live
WebArena
812 long-horizon tasks across 4 self-hosted, fully reproducible web apps (e-commerce, forum, dev, CMS)
Measures whether agents can complete realistic multi-step web tasks in a deterministic, offline-replicable environment.
Top score: Top agents ~40–50% task success (2025)
Key finding: Gap between LLM 'knows the steps' and 'executes them reliably on a live DOM' remains the dominant failure mode.
CLASSic
Cost: Medium — many DOM observations per task
Latency: Tens of seconds to minutes per task
Accuracy: Exact end-state matching
Security: Sandboxed, self-hosted apps
Stability: High — fully reproducible snapshots
AgentBench
8 distinct environments spanning OS, DB, knowledge graph, card game, web shopping, web browsing, household, and code
Cross-domain evaluation of LLM-as-agent across reasoning, tool use, and decision-making in heterogeneous environments.
Top score: Frontier closed models significantly outperform open models on multi-turn tasks
Key finding: Persistent gap between closed and open models in long-horizon instruction following, even when single-turn scores are comparable.
CLASSic
Cost: Variable by environment
Latency: Seconds to minutes per task
Accuracy: Per-environment success metrics
Security: Sandboxed per environment
Stability: Mixed — some environments stochastic
GAIA
466 real-world assistant questions requiring reasoning, multi-modality, web browsing, and tool use
Tests general AI assistant competence on tasks easy for humans but hard for models — conceptually simple but execution-heavy.
Top score: Top agents ~70%+ on Level 1, sharply lower on Levels 2–3 (2025)
Key finding: Humans score ~92% with no training; frontier agents still drop steeply as task depth increases.
CLASSic
Cost: Medium-high — web and tool calls per question
Latency: Variable, often minutes per question
Accuracy: Exact-match answer
Security: Live web — read-only
Stability: Anchored to frozen question set
WorkArena
33+ enterprise knowledge-worker tasks on a live ServiceNow instance (forms, lists, dashboards, workflows)
Measures agent competence on realistic enterprise SaaS workflows that knowledge workers actually perform day-to-day.
Top score: Frontier agents succeed on basic tasks but drop sharply on compositional workflows
Key finding: Enterprise UIs expose the gap between web-browsing agents and true workflow execution; small UI variations break otherwise capable agents.
CLASSic
Cost: Medium — live enterprise UI interaction
Latency: Tens of seconds to minutes per task
Accuracy: Task-completion verification via instance state
Security: Sandboxed ServiceNow dev instances
Stability: Medium — depends on instance version
SWE-bench-Live
Continuously refreshed stream of recent GitHub issues, replacing the frozen SWE-bench snapshot
Addresses training-set contamination in SWE-bench by evaluating agents on issues that post-date model training cutoffs.
Top score: Scores typically lower than SWE-bench Verified, exposing contamination deltas
Key finding: Gap between SWE-bench and SWE-bench-Live quantifies how much reported agent performance is memorization vs. genuine repair capability.
CLASSic
Cost: High — same pytest-per-iteration profile as SWE-bench
Latency: Minutes per task
Accuracy: Per-issue test pass/fail
Security: Sandboxed containers
Stability: Intentionally non-stationary — that is the point
12-Factor Agent Principles
Production discipline for agentic systems.
I
Natural Language to Tool Calls
The job of the model is to turn natural language into structured tool calls, not to run the program.
Treat the LLM as a translator from intent to a typed action, then hand that action to deterministic code. Keeping the model on the translation side of the line is what makes agent behavior testable, observable, and replayable.
Anti-pattern: Letting the model emit free-form prose that downstream code has to parse with regex and vibes.
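A minimal sketch of keeping the model on the translation side of the line, with an illustrative two-tool schema: the model's only job is to emit JSON that validates, and deterministic code owns execution:

```python
import json

# The only thing we ask of the model: emit an action matching one of these shapes.
TOOL_SCHEMAS = {
    "create_ticket": {"required": ["title", "priority"]},
    "send_email": {"required": ["to", "subject", "body"]},
}

def parse_action(model_output: str) -> dict:
    """Validate the model's structured output; reject anything that doesn't match a known tool."""
    action = json.loads(model_output)
    schema = TOOL_SCHEMAS.get(action.get("tool"))
    if schema is None:
        raise ValueError(f"unknown tool: {action.get('tool')!r}")
    missing = [k for k in schema["required"] if k not in action.get("args", {})]
    if missing:
        raise ValueError(f"missing args for {action['tool']}: {missing}")
    return action

def execute(action: dict) -> str:
    """Deterministic code owns execution; the model never runs the program."""
    if action["tool"] == "create_ticket":
        return f"created ticket '{action['args']['title']}' at priority {action['args']['priority']}"
    if action["tool"] == "send_email":
        return f"queued email to {action['args']['to']}"
    raise AssertionError("unreachable: parse_action already validated the tool name")

raw = '{"tool": "create_ticket", "args": {"title": "Fix login bug", "priority": "high"}}'
print(execute(parse_action(raw)))
```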
II
Own Your Prompts
Prompts are production code. Write them, version them, diff them, and test them like any other critical system.
Hidden framework prompts, autogenerated templates, and 'smart' wrappers make behavior impossible to reason about. The prompts your model actually sees should live in your repo, under review, with the same rigor as application code.
Anti-pattern: Relying on a framework's default prompt and shipping whatever it happens to emit this release.
III
Own Your Context Window
You, not the framework, decide what goes into the context window on every turn.
Context is the agent's short-term world model. Leaving its construction to a framework black box means you cannot reproduce behavior, cannot audit failures, and cannot optimize cost. Build the window explicitly from typed inputs you control.
Anti-pattern: Auto-appending full chat history, all tool outputs, and retrieved docs without a deliberate shaping step.
IV
Tools Are Just Structured Outputs
A tool call is nothing more than the model producing a structured output that your code chooses to act on.
Demystifying tools removes the magic: the model emits typed JSON, a handler validates and executes it, and the result is fed back as context. Framing everything this way lets you add, remove, or swap tools without reaching for a new abstraction layer.
Anti-pattern: Treating tool-calling as a special model feature that requires a framework-specific runtime to manage.
V
Unify Execution State and Business State
The agent's execution state and the application's business state should live in the same durable store.
Splitting 'where the agent is in its loop' from 'what the business thinks happened' guarantees drift. Writing both to one transactional store makes pause, resume, retry, and audit trivial, and lets you reason about the system as a state machine instead of a ghost.
Anti-pattern: Keeping agent progress in memory or ephemeral framework objects while business records live in a database.
VI
Launch / Pause / Resume With Simple APIs
Agents should be startable, pausable, and resumable through plain HTTP-style APIs, not bespoke runtimes.
Treating an agent run as a resource you create, inspect, and continue — with durable state behind it — gives you the operational surface every other production system already has: retries, timeouts, scaling, and human intervention, all without special tooling.
Anti-pattern: A long-running in-process loop that cannot be stopped, inspected, or resumed after a crash.
VII
Contact Humans With Tool Calls
Asking a human for input is just another tool call — model it that way.
When the agent needs approval, clarification, or review, emit a structured 'ask_human' call with a typed payload. The human response becomes a typed result that flows back into context like any other tool output, so human-in-the-loop is a first-class part of the control flow, not a bolted-on escape hatch.
Anti-pattern: Pausing via side-channel UI widgets whose responses never re-enter the agent's formal state.
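A sketch of human contact modeled as a tool call, with an illustrative `ask_human` payload and a pluggable transport standing in for Slack or email; the typed answer re-enters the loop like any other tool result:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolResult:
    tool: str
    content: dict

def ask_human(args: dict, transport: Callable[[str], str]) -> ToolResult:
    """Emit a structured question; the typed answer comes back as an ordinary tool result."""
    prompt = f"[approval needed] {args['question']} options={args.get('options')}"
    answer = transport(prompt)           # in production: Slack, email, or a webhook, not stdin
    return ToolResult(tool="ask_human", content={"answer": answer})

def handle_tool_call(call: dict, transport: Callable[[str], str]) -> ToolResult:
    """ask_human is dispatched exactly like any other tool in the control flow."""
    if call["tool"] == "ask_human":
        return ask_human(call["args"], transport)
    raise ValueError(f"unknown tool: {call['tool']}")

call = json.loads('{"tool": "ask_human", "args": {"question": "Deploy to prod?", "options": ["yes", "no"]}}')
result = handle_tool_call(call, transport=lambda prompt: "yes")   # simulated human approval
print(result)   # the answer flows back into the agent's context like any other tool output
```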
VIII
Own Your Control Flow
You write the loop. The model advises; your code decides.
Framework-owned control flow hides the most load-bearing logic in your system: when to call a tool, when to retry, when to stop, when to escalate. Keeping control flow in code you own makes behavior debuggable, testable, and portable across models.
Anti-pattern: Handing the outer loop to a framework and hoping its heuristics match your business rules.
IX
Compact Errors Into Context Window
Errors are signal. Summarize them, feed them back, and let the model self-correct instead of crashing.
A 4KB stack trace is noise; a one-line compacted error ("HTTP 429 from api/x, retry after 30s") is something the model can act on. Treating exceptions as just another tool result — in a compact, typed form — turns most transient failures into recoverable steps.
Anti-pattern: Letting raw exceptions bubble up and kill the run, or dumping full stack traces into the prompt.
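A sketch of compacting an exception into something the model can act on instead of a crashed run; the one-line format is an illustrative choice:

```python
import traceback

def compact_error(exc: Exception, max_chars: int = 200) -> str:
    """Reduce an exception to a one-line, typed summary the model can reason about."""
    last_frame = traceback.extract_tb(exc.__traceback__)[-1] if exc.__traceback__ else None
    where = f"{last_frame.filename}:{last_frame.lineno}" if last_frame else "unknown"
    summary = f"error={type(exc).__name__} msg={exc} at={where}"
    return summary[:max_chars]

def run_tool(handler, args: dict, context: list[str]) -> None:
    """Errors become just another tool result in context, not a killed run."""
    try:
        context.append(f"tool_result: {handler(**args)}")
    except Exception as exc:
        context.append(f"tool_error: {compact_error(exc)}")   # model sees this and can retry or adapt

context: list[str] = []
run_tool(lambda url: 1 / 0, {"url": "api/x"}, context)
print(context)   # ['tool_error: error=ZeroDivisionError msg=division by zero at=...']
```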
X
Small, Focused Agents
Each agent owns a single, narrow responsibility — then you compose them.
Production evidence shows monolithic agents achieve roughly 40% success on complex tasks while focused-and-composed agents reach ~95% at 20–30% context each. Scope discipline is the single biggest lever on reliability, and it costs nothing except restraint.
Anti-pattern: A 'god agent' that plans, executes, verifies, and reports in one ever-growing prompt.
XI
Trigger From Anywhere, Meet Users Where They Are
Agents should be reachable from Slack, email, CLI, cron, webhooks — wherever work actually originates.
Pinning an agent to a single UI wastes most of its value. If the trigger surface is just 'post a structured event to the agent's API', every new channel is a thin adapter rather than a rewrite, and users get help in the tools they already live in.
Anti-pattern: Building a dedicated chat UI and requiring users to context-switch into it for every interaction.
XII
Make Your Agent a Stateless Reducer
An agent step is a pure function: (state, event) → (new state, actions). Keep it stateless and push durability to the store.
Modeling each turn as a reducer makes the system replayable, testable, and horizontally scalable. State lives in the database; the agent process is disposable. This is the property that turns agent runs from fragile long-lived processes into ordinary distributed systems.
Anti-pattern: Stateful in-memory agent objects whose behavior depends on how long they've been alive.
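A sketch of the reducer shape with an illustrative state dict: each step is a pure function of durable state plus an incoming event, and the caller persists the new state before dispatching actions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str        # e.g. "user_message", "tool_result", "timer"
    payload: dict

def step(state: dict, event: Event) -> tuple[dict, list[dict]]:
    """(state, event) -> (new_state, actions). No hidden instance variables, no wall-clock reads."""
    history = state.get("history", []) + [event.kind]
    actions: list[dict] = []
    if event.kind == "user_message":
        actions.append({"tool": "plan", "args": {"text": event.payload["text"]}})
    elif event.kind == "tool_result":
        actions.append({"tool": "respond", "args": {"summary": event.payload["output"]}})
    new_state = {**state, "history": history}
    return new_state, actions       # caller persists new_state, then dispatches actions

# Replayable by construction: the same event log always rebuilds the same state.
state: dict = {}
for ev in [Event("user_message", {"text": "ship it"}), Event("tool_result", {"output": "tests green"})]:
    state, acts = step(state, ev)
print(state["history"])   # ['user_message', 'tool_result']
```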
Architecture Timeline
Five eras of agent architecture, 2015 to today, plus the projected road ahead.
2015–2017
RL Pioneers
Goal-directed agents in closed environments. Reinforcement learning produces the first systems that plan and act toward objectives.
AlphaGo defeats Lee Sedol
DQN (Deep Q-Networks)
OpenAI Gym standardizes RL benchmarks
2017–2020
Transformer
Scalable reasoning emerges from attention. The foundation for every subsequent agent architecture lands in a single 2017 paper.
Attention Is All You Need (2017)
BERT (2018)
GPT-2 / GPT-3 (2019–2020)
2021–2023
First LLM Agents
Chain-of-thought plus tool use creates the first real agent primitives. The field discovers that loops + prompts + tools = planning.
ReAct reasoning framework
Toolformer tool integration
AutoGPT, BabyAGI autonomous loops
2023–2025
Framework Explosion
Multi-agent orchestration matures. Frameworks compete to codify 'collective wisdom' from early production experience.
A2A and MCP become de facto standards. Vertical agents dominate specific domains — legal, finance, healthcare, code. Enterprise trust frameworks mature. Type III orchestrated teams reach 40%+ adoption. The phase where 'agents' stops being a buzzword and becomes infrastructure.
Mid · 3–5 years · 2027–2029
Agent Economies
Type IV Networked Fabrics reach production. Agent-to-agent marketplaces emerge where agents transact on behalf of principals. Multi-agent governance becomes a regulatory requirement in high-stakes domains. The human role shifts from operator to strategic auditor.
Far · 5+ years · 2029+
AGI Through Agents
Type V Autonomous Ecosystems enter narrow production. Self-improving systems raise safety considerations current frameworks weren't built for. International governance for autonomous AI networks becomes a treaty-level concern. AGI may emerge through multi-agent collaboration at scale — not via single model breakthroughs.
Points of Consensus
Five things every major source agrees on — the stable foundation underneath the debates.
01
Minimal agent = LLM + tools in a loop
Every major source converges on the same primitive. Strip away frameworks and you're left with a language model that calls tools and iterates on the results.
Anthropic, Chip Huyen, and Harrison Chase emphasize this independently. Most 'complex agent' projects succeed only after they're simplified. The default move should be subtraction.
Anthropic engineering guidance · Chip Huyen — AI Engineering · Harrison Chase (LangChain)
03
Evaluations are the single biggest predictor of execution quality
Without evals you can't tell whether a change improved things. Andrew Ng frames evals as the difference between teams that ship and teams that ship regressions.
Andrew Ng · Anthropic evaluation guidance
04
Compound errors are the fundamental math challenge
95% accuracy per step becomes 0.6% accuracy over 100 steps. Multi-step agents live or die by error recovery — there is no 'good enough' step accuracy that survives long horizons.
Chip Huyen compound error analysis
05
Tool design matters as much as prompt design
Anthropic's tool-testing agent achieved 40% completion time reduction just by rewriting tool descriptions. Tools deserve the same engineering rigor as system prompts.
Anthropic tool-testing agent research
Sources & Research
Every citation used on this page, grouped by category.