A governing pyramid, five architecture types, the agent harness, lab deep dives, memory
systems, failure modes, architectural limits, benchmarks, and the 12-Factor principles for
building production agents.
Governing Pyramid
Three pillars that hold the field up. Foundation, middle, apex.
The Harness, Not the Model, Is the Product
Production agents are 70%+ harness: context, memory, tools, verification, observability. The model is the CPU; the harness is the OS.
Frontier capability gains plateau quickly without the surrounding scaffolding that turns a model into a system. Claude Code, Cursor, and Anthropic's research agents all show that the long tail of reliability — context curation, tool ergonomics, verification loops, trace observability — dominates end-to-end performance, and that swapping models inside a strong harness produces smaller deltas than swapping harnesses around a strong model.
Once each agent is capable enough, the bottleneck shifts to how agents share context, hand off work, and recover from each other's failures. Coordination is the new scaling axis.
The MAST failure taxonomy (arXiv 2503.13657) finds that 70%+ of multi-agent failures are systemic — miscommunication, context loss, role confusion — not single-agent reasoning errors. Anthropic's multi-agent research system shows that disciplined orchestrator-worker coordination can yield 90.2% improvements, while Cognition's counter-argument warns that naive multi-agent setups fragment context and degrade below a strong single agent. Either way, the decisive variable is coordination design, not raw model IQ.

However, three counterpoints emerged from primary sources. First, Anthropic's own data shows token usage explains 80% of performance variance — raw compute, not coordination finesse, is the primary driver. Upgrading from Sonnet 3.7 to Sonnet 4 delivered a larger gain than doubling the token budget. Second, Google Research found a capability saturation threshold at ~45%: once a single agent exceeds this baseline, adding coordination yields diminishing returns. Third, Deloitte and McKinsey emphasize organizational readiness (process redesign, governance, data architecture) as the primary blocker, not technical coordination alone — only 30% of organizations reach maturity level 3+ in agentic AI governance.
The agents that ship are narrow, well-scoped, and composable. Generality emerges from composition of focused agents, not from one omni-agent that does everything.
12-Factor Agents Factor III ('One agent, one job') and Anthropic's simplicity thesis both argue that scope discipline is what separates demos from production. Focused agents are easier to evaluate, debug, and sandbox; they fail in smaller ways and compose into larger systems via explicit protocols like A2A or MCP. The pattern across Claude Code sub-agents, Devin, and LangChain Deep Agents is the same: a top-level planner dispatches to single-purpose workers rather than asking one agent to be universally competent.
Deloitte 2025 Emerging Tech Trends, 500 US tech leaders
40%+
of agentic projects projected to be canceled by 2027
Gartner, June 2025
70%+
of multi-agent failures are systemic — system design ~41.8% + inter-agent misalignment ~31% = ~73%. Coordination, not model capability, is the dominant failure mode
MAST study, arXiv
90.2%
improvement from multi-agent vs. single-agent on research tasks
Anthropic internal eval (Claude Opus 4 + Sonnet 4)
15×
token consumption for multi-agent vs. plain chat
Anthropic multi-agent research system
4×
token consumption for single agent vs. plain chat
Anthropic multi-agent research system
40%
decrease in task completion time from tool description rewriting
Anthropic tool-testing agent
13.7pt
improvement on TerminalBench 2.0 from harness changes alone
LangChain — no model swap
280×
drop in inference costs over two years
Stanford AI Index 2025
0.6%
accuracy over 100 steps, starting from 95% per-step accuracy
Chip Huyen compound error analysis
50×
cost variation for the same accuracy across agent configurations on SWE-bench (Kapoor et al., Princeton, TMLR 2025). Separately, Anthropic's data shows token usage explains 80% of performance variance — raw compute is the primary driver.
AI Agents That Matter — Kapoor et al.
77%
human vs. AI gap on GAIA benchmark (466 tasks testing agent reasoning over tool use).
Mialon et al. — GAIA
37%
average gap between agent benchmark lab accuracy and enterprise production accuracy.
CLASSic framework analyses
2,800%
reduction in hallucination propagation achieved by multi-agent verification pipelines over unchecked cascades.
Spark to Fire — City University of Macau, 2026
97M
monthly SDK downloads for the Model Context Protocol (MCP), making it the dominant agent-to-tool connectivity standard
Anthropic / MCP ecosystem data, mid-2026
150+
organizations supporting the Agent-to-Agent (A2A) protocol, including Adobe, SAP, ServiceNow, Salesforce, and all major cloud providers
Google / A2A Protocol, 2025-2026
~45%
baseline accuracy threshold above which adding multi-agent coordination yields diminishing or negative returns (Google Research, 180 configurations)
Chen et al. — Towards a Science of Scaling Agent Systems, Google Research Dec 2025
+80.8%
improvement from centralized multi-agent coordination on parallelizable tasks, but 39–70% degradation on sequential reasoning (Google Research, 180 configurations)
Chen et al. — Towards a Science of Scaling Agent Systems, Google Research Dec 2025
30%
of organizations reach maturity level 3+ in agentic AI governance, making organizational readiness — not technology — the primary deployment blocker
McKinsey — State of AI Trust in 2026
Architecture Types, Frameworks, Patterns
Five architecture types, seven frameworks, and seven coordination patterns.
You need an AI copilot for a specific workflow but want a human in the loop to see and approve every action before it runs. ChatGPT with plugins drafting emails, basic RAG for knowledge Q&A, GitHub Copilot suggesting completions.
How it differs
Differs from Type II: no fixed workflow — the human drives what step happens next instead of a coded sequence.
Key insight
Dominant production pattern. The LLM decides what to do, the human approves each step. Minimal risk, minimal coordination complexity. 'Start simple, add complexity only when validated' — every major source converges on this as the first rung.
The process steps are known and stable, but individual steps need LLM judgment. Invoice processing pipelines, document classification, email triage where rules + LLM gates combine cleanly.
How it differs
Differs from Type III: the sequence is hard-coded — there is no LLM orchestrator deciding what to do next at runtime.
Key insight
The LLM is a component in a workflow the human designed. Predictable, debuggable, but rigid — the system only handles paths the designer anticipated. Most 'agents' in production today are actually Type II workflows.
The task benefits from parallel specialized sub-work. Research reports, code generation across many files, complex multi-domain analysis. Examples in production: Claude Code delegating file edits to sub-agents, Anthropic's research system running specialist searches in parallel.
How it differs
Differs from Type IV: one-level only — the lead controls all workers directly, no peer-to-peer negotiation or nested delegation.
Key insight
The current frontier. Orchestrator decides WHAT needs doing; workers handle HOW. Anthropic reports 90.2% improvement over single-agent on research tasks — but at 15× token cost. One-level-only delegation prevents cascading hallucination. Near-term projection: 40%+ adoption by 2027.
Multiple autonomous systems owned by different principals need to coordinate. Agent marketplaces, cross-enterprise workflows, research sandboxes testing A2A/MCP protocols at scale.
How it differs
Differs from Type III: no single leader — agents coordinate peer-to-peer through standardized protocols rather than through an orchestrator.
Key insight
Agents communicate through standardized protocols (A2A, MCP) without a central orchestrator. Scales beyond single-team coordination but introduces distributed-systems problems: Byzantine agents, consensus failure, emergent goal drift. 70%+ of multi-agent failures in this class are systemic (MAST study).
Research exploration only. Voyager learning Minecraft skills autonomously, AutoGPT derivatives, open-ended scientific discovery experiments. Not production-ready — safety and governance frameworks for self-modifying systems don't exist yet.
How it differs
Differs from Type IV: agents choose their own objectives and modify their own capabilities, not just coordinate on human-set goals.
Key insight
Agents set their own goals and modify their own code or skill library. Raises safety questions that current governance frameworks aren't built for. AGI candidates may emerge through this pattern — not via single model scale-up but via multi-agent collaboration at scale.
Examples
Voyager (Minecraft), AutoGPT derivatives, research-only systems
Pregel/BSP-inspired execution engine. Each superstep selects nodes whose subscribed channels changed, executes in parallel with isolated state, then applies updates deterministically. Supports arbitrary cycles — critical for agent loops, impossible in DAG-only frameworks like Airflow.
LangGraph Cloud decouples triggers from execution with retry from last checkpoint
Adoption
~25K GitHub stars (LangGraph repo; the langchain parent repo has ~100K), 400+ companies (Uber, LinkedIn, Replit)
Trade-offs
Low-abstraction design requires explicit state schema and edge wiring. Learning curve is the #1 cited limitation — but this explicitness is what makes complex workflows debuggable.
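To make the cycle support concrete, here is a minimal sketch of a write → review loop that routes back on itself, assuming current langgraph APIs; the node bodies and the three-revision budget are placeholders rather than LangGraph's own examples.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END  # assumes the langgraph package is installed

class DraftState(TypedDict):
    draft: str
    revisions: int

def write(state: DraftState) -> DraftState:
    # Placeholder node: in practice this would call an LLM to revise the draft.
    return {"draft": state["draft"] + " (revised)", "revisions": state["revisions"] + 1}

def review(state: DraftState) -> DraftState:
    # Placeholder critic node; real code would score the draft and record feedback.
    return state

def should_continue(state: DraftState) -> str:
    # Cycle back to the writer until the revision budget is spent.
    return "write" if state["revisions"] < 3 else END

graph = StateGraph(DraftState)
graph.add_node("write", write)
graph.add_node("review", review)
graph.set_entry_point("write")
graph.add_edge("write", "review")
graph.add_conditional_edges("review", should_continue)

app = graph.compile()
result = app.invoke({"draft": "first pass", "revisions": 0})
```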
Agents defined by role, goal, and backstory. Three process types: sequential, hierarchical (manager agent delegates dynamically), custom. Flows architecture: event-driven control with single LLM calls at each node, adding deterministic orchestration with typed state.
Dual-workflow: Crews for autonomous collaboration, Flows for deterministic orchestration
Hierarchical mode — manager LLM acts as dispatcher with meta-reasoning
Unified memory: LanceDB with 0.85 similarity consolidation threshold
MCP + LangChain tool compatibility
AMP (agent management platform) for deployment/monitoring
Cognitive Memory backed by LanceDB with five cognitive operations (encode, consolidate, recall, extract, forget) and hierarchical scope trees
Previous ChromaDB backend caused 'database is locked' errors under concurrent access
Adoption
100K+ certified devs, 60% of Fortune 500, 1.4B executions
Trade-offs
Less fine-grained control than LangGraph. Communication overhead scales quadratically in fully-connected patterns. Key open problem: detecting consensus vs. unproductive debate loops. Flows + Cognitive Memory represent a maturing production story, but enterprise features require paid AMP (Agent Management Platform).
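A minimal sketch of the role/goal/backstory model in sequential process mode, assuming the current crewai API; the two agents, their wording, and the tasks are illustrative, not taken from CrewAI's documentation.

```python
from crewai import Agent, Task, Crew, Process  # assumes the crewai package is installed

researcher = Agent(
    role="Research analyst",
    goal="Collect the key facts for the brief",
    backstory="Methodical, cites sources, avoids speculation.",
)
writer = Agent(
    role="Technical writer",
    goal="Turn the research notes into a one-page brief",
    backstory="Writes plainly and keeps to the evidence provided.",
)

research = Task(
    description="Gather facts about the assigned topic.",
    expected_output="A bullet list of sourced facts.",
    agent=researcher,
)
draft = Task(
    description="Write a one-page brief from the research notes.",
    expected_output="A short markdown brief.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research, draft], process=Process.sequential)
result = crew.kickoff()
```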
Separates WHAT (Signatures) from HOW (Optimizers). Define the task specification, then let the framework find optimal prompts automatically through metric-driven optimization.
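A hedged sketch of the WHAT/HOW split, assuming a recent dspy release: the Signature fixes the task contract, and an optimizer such as BootstrapFewShot searches for prompts against a caller-supplied metric. The ticket/note fields, the brevity metric, and the empty trainset are placeholders.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot  # assumes the dspy package, with an LM configured

class Summarize(dspy.Signature):
    """Summarize a support ticket into a one-sentence triage note."""
    ticket: str = dspy.InputField()
    note: str = dspy.OutputField()

summarizer = dspy.ChainOfThought(Summarize)   # HOW: one possible module for the same WHAT

def brevity_metric(example, prediction, trace=None):
    # Hypothetical metric: reward notes of 30 words or fewer.
    return len(prediction.note.split()) <= 30

train_examples = []  # in practice, a list of dspy.Example(ticket=..., note=...) pairs
optimizer = BootstrapFewShot(metric=brevity_metric)
compiled = optimizer.compile(summarizer, trainset=train_examples)
```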
FSM-based graph execution with Pydantic models for all inputs and outputs. Built-in dependency injection via RunContext. History processors run before each LLM call.
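A minimal sketch of the typed-output and dependency-injection model, assuming a recent pydantic-ai release (older versions use result_type and .data instead of output_type and .output); the Deps store, the Triage schema, and the tool are illustrative.

```python
from dataclasses import dataclass
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext  # assumes the pydantic-ai package is installed

@dataclass
class Deps:
    ticket_db: dict[str, str]        # injected dependency, available to every tool

class Triage(BaseModel):             # typed output validated by the framework
    severity: str
    summary: str

agent = Agent("openai:gpt-4o", deps_type=Deps, output_type=Triage)

@agent.tool
def lookup_ticket(ctx: RunContext[Deps], ticket_id: str) -> str:
    """Fetch the raw ticket text from the injected store."""
    return ctx.deps.ticket_db.get(ticket_id, "not found")

result = agent.run_sync(
    "Triage ticket T-42",
    deps=Deps(ticket_db={"T-42": "login page down"}),
)
print(result.output)   # a validated Triage instance
```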
Managed agent service with policy-governed execution. Integrates with AWS identity, security, and compliance infrastructure — the closest thing to 'enterprise agents out of the box'.
Six components that turn a language model into an agent. The harness is the OS,
the model is the CPU.
Harness Component
System Prompt Architecture
Layered documents built to delete with each model release.
When to invest
Prioritize the moment your system prompt exceeds ~500 tokens or you're about to spend real engineering time tuning it. Minimum viable: split your prompt into 3 layers — project-specific (bottom), reusable (middle), deletion-ready (top) — with comments marking which is which. Red flag: you're editing one giant string with no versioning.
The system prompt isn't a single string — it's a stack of layers (custom, built-in, skills, filesystem, tool-specific) that evolve as models improve.
Key data point
Claude Code rewrites prompts across model upgrades without code changes
Claude Code uses a layered stack: custom → built-in → skills → filesystem → tool-specific. Manus uses XML semantic markup. The key insight: write prompts assuming they'll be rewritten when the next model drops. Don't invest in prompt engineering that locks you into current model limitations — invest in the scaffolding that lets you swap prompts cleanly.
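A sketch of the scaffolding that lets you swap prompts cleanly: each layer lives in its own versioned file and is assembled at startup, so the top layer can be deleted when the next model drops. The file names and layout are hypothetical, not Claude Code's.

```python
from pathlib import Path

# Hypothetical layout: each layer is its own file so it can be versioned,
# swapped, or deleted independently when a new model ships.
PROMPT_LAYERS = [
    Path("prompts/00_project.md"),     # bottom: project-specific, stable
    Path("prompts/10_reusable.md"),    # middle: shared conventions and tool norms
    Path("prompts/20_model_patch.md"), # top: model-specific workarounds, deletion-ready
]

def build_system_prompt(layers: list[Path]) -> str:
    parts = []
    for layer in layers:
        if layer.exists():             # a missing top layer just means "nothing to patch"
            parts.append(f"<!-- layer: {layer.name} -->\n{layer.read_text().strip()}")
    return "\n\n".join(parts)

system_prompt = build_system_prompt(PROMPT_LAYERS)
```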
Each MCP tool costs 500–1,000+ context tokens. On-demand loading is critical.
When to invest
Prioritize the moment you have more than 5 tools loaded, or your agent is making obviously wrong tool choices. Minimum viable: audit every tool description against the question "would a new hire understand exactly when to use this vs. the others?" — rewrite any that fail. Red flag: tools with overlapping descriptions.
Tool descriptions aren't free — they consume context you can't get back. Treat them with the same engineering rigor as system prompts.
Key data point
40% completion time reduction from description rewriting alone
Anthropic built a tool-testing agent that rewrites tool descriptions automatically based on task outcomes. Result: 40% reduction in task completion time. Tools that are always loaded should be rare; most should be surfaced on-demand via progressive disclosure. Vercel famously removed 80% of their tools and got better, faster results with fewer steps.
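A sketch of progressive disclosure under stated assumptions: a small core set of tool schemas is always sent, and the rest are surfaced only when a request looks relevant. The registry, tool names, and keyword matching are illustrative; production systems typically use embedding search or a dedicated tool-search tool instead.

```python
# Hypothetical registry: only the "core" schemas ride along on every request;
# everything else costs context only when it is actually needed.
CORE_TOOLS = {"read_file", "search_tools"}

TOOL_REGISTRY = {
    "read_file":    {"description": "Read a file from the workspace by path."},
    "search_tools": {"description": "Find tools relevant to a task and return their full schemas."},
    "run_sql":      {"description": "Run a read-only SQL query against the analytics warehouse."},
    "send_email":   {"description": "Send an email on behalf of the user (requires approval)."},
}

def tools_for_request(query: str) -> list[dict]:
    """Return core tool schemas plus any whose description matches the query."""
    selected = set(CORE_TOOLS)
    for name, spec in TOOL_REGISTRY.items():
        if any(word in spec["description"].lower() for word in query.lower().split()):
            selected.add(name)
    return [{"name": n, **TOOL_REGISTRY[n]} for n in sorted(selected)]
```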
Five memory layers, from ephemeral context window to procedural AGENTS.md files.
When to invest
Prioritize the moment your agent needs to remember anything across sessions. Minimum viable: start with filesystem (.md files in a known directory) before reaching for vector stores. The filesystem is free, debuggable, survives context compression, and covers 80% of production cases. Only add vector search when you have >100 memories AND semantic retrieval is the bottleneck. Red flag: reaching for Pinecone on day one.
Memory isn't a single store — it's a hierarchy: Working (context) → Short-term/Session → Long-term Semantic (vectors) → Long-term Episodic (logs) → Procedural (AGENTS.md files).
Key data point
The filesystem IS the external cognition layer
The COALA framework (Cognitive Architectures for Language Agents) separates memory by lifetime and retrieval method. Working memory is the context window — ephemeral, fast, expensive. Short-term is session-durable. Long-term semantic is vector-retrievable. Long-term episodic is conversation logs. Procedural is codified in AGENTS.md / CLAUDE.md files. Key insight from Manus, Claude Code, and Voyager: the filesystem IS the external cognition layer. Plain .md files are the dominant production substrate.
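A minimal sketch of filesystem-as-memory, assuming a plain directory of .md files; the paths and helper names are hypothetical. The point is that append and read are the whole API, and the result is greppable, diffable, and survives context compression.

```python
from datetime import date
from pathlib import Path

MEMORY_DIR = Path("memory")   # hypothetical layout: one plain .md file per topic

def remember(topic: str, note: str) -> None:
    """Append a dated note to the topic's memory file."""
    MEMORY_DIR.mkdir(exist_ok=True)
    path = MEMORY_DIR / f"{topic}.md"
    with path.open("a") as f:
        f.write(f"- {date.today().isoformat()}: {note}\n")

def recall(topic: str) -> str:
    """Read everything remembered about a topic (empty string if nothing yet)."""
    path = MEMORY_DIR / f"{topic}.md"
    return path.read_text() if path.exists() else ""

remember("deploy-process", "Staging deploys require the #releases approval bot.")
print(recall("deploy-process"))
```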
The #1 bottleneck. Six strategies, no silver bullet.
When to invest
Prioritize the moment your traces show the context getting >70% full. Minimum viable (in order): (1) enable prompt caching breakpoints — free 90% savings, (2) write intermediate work to files instead of holding in context, (3) delegate long side-tasks to sub-agents. Only reach for compaction/summarization when the first three aren't enough. Red flag: your agent fails at long tasks and you don't know the current context utilization.
Every production agent eventually hits the context wall. Solving it is the hardest harness engineering problem — and the biggest differentiator between toy and shipped systems.
Key data point
98.7% context savings via MCP code-as-API pattern (Anthropic)
Six proven strategies: (1) compaction at 92% capacity — summarize old context into new, (2) sub-agent delegation for context isolation — Claude Code's core trick, (3) file buffering — write intermediate work to disk, (4) fan-out to smaller models — Haiku for 50%+ of calls, (5) todo lists as attention anchors, (6) progressive disclosure of tools and skills. Claude Code's compressor triggers at 92% and uses 4 cache breakpoints for 90% savings. Context management is harness work, not model work — LLM improvements alone don't fix it.
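A sketch of the compaction trigger in strategy (1), under stated assumptions: a hypothetical 200K-token window, a caller-supplied tokenizer, and a caller-supplied summarization call. Real compressors, Claude Code's included, are more careful about what they keep verbatim.

```python
COMPACTION_THRESHOLD = 0.92   # mirrors the 92% trigger described above
CONTEXT_LIMIT = 200_000       # hypothetical model context window, in tokens

def maybe_compact(messages: list[dict], count_tokens, summarize) -> list[dict]:
    """Summarize older turns into one message when utilization crosses the threshold.

    count_tokens and summarize are caller-supplied: a tokenizer and an LLM call.
    """
    used = sum(count_tokens(m["content"]) for m in messages)
    if used / CONTEXT_LIMIT < COMPACTION_THRESHOLD:
        return messages
    head, tail = messages[:-10], messages[-10:]   # keep the most recent turns verbatim
    summary = summarize(head)                     # compress everything older
    return [{"role": "system", "content": f"Summary of earlier work:\n{summary}"}] + tail
```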
Four tiers: self-recovery → validation → critic → human.
When to invest
Prioritize the moment your tasks exceed ~5 steps. Compound math: 95% per-step accuracy drops to 77% at 5 steps and 0.6% at 100 steps without recovery. Minimum viable in order: (1) exponential backoff on tool failures with max-retry cap, (2) schema validation on every tool output, (3) add a critic agent only for quality-critical outputs, (4) human escalation path for the remaining 1% of cases. Red flag: your agent silently retries forever, or returns junk without any check.
Agents will fail. The harness decides how gracefully. Build escalation paths in, not as an afterthought.
Key data point
95% step accuracy → 0.6% over 100 steps without error recovery
Tier 1: self-recovery via retries with exponential backoff. Tier 2: validation gates that check tool outputs against schemas. Tier 3: critic agents that review work before finalization. Tier 4: human escalation when the agent can't recover. Max-retry caps prevent runaway loops. The 0.6% compound-error math (Chip Huyen) is why this matters: 95% per-step accuracy collapses to 0.6% over 100 steps if errors aren't caught and corrected along the way.
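A sketch of Tiers 1 and 2, assuming pydantic for the validation gate; the ToolResult schema and retry budget are illustrative. The closing comment shows the compound-error arithmetic behind the 77% and 0.6% figures.

```python
import random
import time
from pydantic import BaseModel, ValidationError

class ToolResult(BaseModel):          # Tier 2: the schema every tool output must satisfy
    status: str
    rows: list[dict]

def call_with_recovery(tool_call, max_retries: int = 3) -> ToolResult:
    """Tier 1: retry with exponential backoff and a hard cap to prevent runaway loops."""
    for attempt in range(max_retries):
        try:
            raw = tool_call()
            return ToolResult.model_validate(raw)         # validation gate
        except (ValidationError, TimeoutError) as err:
            if attempt == max_retries - 1:
                # Tiers 3-4: hand off to a critic agent or a human instead of retrying forever.
                raise RuntimeError("escalation required") from err
            time.sleep((2 ** attempt) + random.random())   # backoff with jitter

# Why the tiers matter: per-step accuracy compounds multiplicatively.
# 0.95 ** 5 is about 0.77 and 0.95 ** 100 is about 0.006, the figures quoted above.
```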
Way more impactful in agents than in single LLM apps.
When to invest
Prioritize this BEFORE your first multi-step failure — you cannot debug what you cannot see. Minimum viable: pick one of LangSmith, Logfire, or Langfuse and wire it on day one. Trace every tool call, every LLM request/response, and every retry. This is the highest-leverage item on the entire list — you can't fix the other five components if you can't see them working. Red flag: you're debugging by re-running prompts manually.
Without traces, you can't debug multi-step failures. Observability is the difference between 'the agent is broken' and 'the agent called tool X with wrong args on step 4 of 7'.
Key data point
13.7pt TerminalBench 2.0 improvement from harness changes alone
LangSmith, Logfire (OpenTelemetry), and Langfuse are the current production choices. Emerging pattern: harness-as-dataset — failure trajectories become training data for the next iteration of the harness. LangChain used this approach to achieve a 13.7pt improvement on TerminalBench 2.0 through harness changes alone, with no model swap. Observability isn't just debugging — it's the feedback loop that makes agents improvable.
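A minimal sketch of per-tool-call tracing using the OpenTelemetry API, which Logfire builds on; LangSmith and Langfuse offer equivalent decorators. The wrapper and attribute names are illustrative.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

def traced_tool_call(name: str, args: dict, tool_fn):
    """Wrap every tool call in a span so multi-step failures are reconstructable."""
    with tracer.start_as_current_span(f"tool:{name}") as span:
        span.set_attribute("tool.args", str(args))
        try:
            result = tool_fn(**args)
            span.set_attribute("tool.ok", True)
            return result
        except Exception as err:
            span.record_exception(err)
            span.set_attribute("tool.ok", False)
            raise
```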
Fourteen labs building agent systems — what they build, how they think about
coordination, what distinguishes their approach.
Anthropic
US
Claude Code harness, multi-agent research systems, constitutional AI
Key system
Claude Code 'nO' harness
Simplicity thesis — the harness is the product, not the model. Uses structured tool-calling loops over heavy orchestration frameworks. Multi-agent research system reports 90.2% improvement on research tasks via parallel sub-agents with isolated contexts, while Claude Code itself defaults to single-agent with focused subagent dispatch. Claude Code operates as a 5-layer stack: MCP (connectivity), Skills (task-specific knowledge), Agent (primary agentic loop), Subagents (parallel workers), and Agent Teams (coordination, shipped early 2026). The system prompt is assembled from ~80 modular pieces across three injection points, split by a cache boundary with 1-hour global TTL and 5-minute per-session TTL. The Tool Search Tool dynamically discovers relevant tools, reducing token usage by 85% (from ~72K to ~8.7K tokens). Sub-agents get independent context windows but cannot spawn their own sub-agents.
Signature contribution
First production-grade coding agent with a transparent subagent model. Popularized the 'focused-first, general-later' pattern and published the Claude Agent SDK for external builders.
Evidence
Claude Code (GA 2025, 'nO' harness internally)
Multi-Agent Research System engineering writeup (June 2025)
Claude Agent SDK (2025)
Model Context Protocol (MCP) specification
Claude Code #1 AI coding tool in 8 months, $2.5B+ annualized revenue
5-layer stack: MCP/Skills/Agent/Subagents/Agent Teams
~80 modular prompt pieces with cache boundary architecture
Bets that simplicity and focused agents beat complex orchestration at current model capability levels. Risk: if frontier models plateau, the simplicity thesis breaks — you need more harness sophistication to compensate for weaker reasoning. For developers: start with Claude Code's single-agent pattern; add sub-agents only when parallelizable breadth tasks justify the coordination cost.
OpenAI
Agent primitives in the API, reasoning-model-driven tool use, consumer agents
Key system
Responses API + tool_use loop
Agent capabilities are exposed as first-class API primitives rather than a framework. The Responses API replaces Assistants with a lighter, stateful tool-calling loop; reasoning models (o-series) drive long-horizon planning internally rather than via external orchestration. Deep Research and Operator are productized agents layered on the same primitives. The Agents SDK (March 2025, evolved from experimental Swarm) provides 3 core primitives: Agents (LLMs with instructions and tools), Handoffs (one-way agent-to-agent delegation implemented as tool calls), and Guardrails (input/output validation running in parallel with execution, with tripwire halting). Deep Research uses a 4-agent pipeline: triage → clarifier → instruction builder → research agent, powered by specialized o3/o4-mini models.
Signature contribution
Shifted the industry from framework-heavy orchestration to model-driven agents where the reasoning model owns the loop.
Bets that reasoning models will internalize planning so well that external orchestration becomes unnecessary. Risk: vendor lock-in — the Responses API is tightly coupled to OpenAI's model ecosystem with no portability story. For developers: fastest path to production agents if you're already on OpenAI, but switching costs grow with every o-series-specific optimization you adopt.
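A hedged sketch of the Agents and Handoffs primitives, assuming the openai-agents Python SDK; the triage and billing agents and their instructions are illustrative, and guardrails are omitted for brevity.

```python
from agents import Agent, Runner  # assumes the openai-agents package is installed

billing = Agent(
    name="Billing agent",
    instructions="Resolve invoice and refund questions. Answer only billing topics.",
)
triage = Agent(
    name="Triage agent",
    instructions="Route each request to the right specialist.",
    handoffs=[billing],   # the handoff is exposed to the model as a tool call
)

result = Runner.run_sync(triage, "I was charged twice for my March invoice.")
print(result.final_output)
```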
Google
Open agent framework, inter-agent protocols, Gemini-native tool use
Key system
Agent Development Kit (ADK) + A2A protocol
Two-layer strategy: ADK as a code-first framework for building multi-agent systems on Gemini, and A2A (Agent-to-Agent) as an open protocol for cross-vendor agent interoperability. Emphasizes standardized agent discovery, capability negotiation, and typed message passing over ad-hoc orchestration. ADK is an open-source (Apache 2.0) toolkit with 5 components: Agent (BaseAgent with LlmAgent, Workflow, and Custom variants), Tool, Callbacks, Runner, and Session/State. The December 2025 Developers Blog documents 8 multi-agent patterns: sequential pipeline, parallel dispatch, routing/coordinator, generator-critic, hierarchical delegation, aggregator, human-in-the-loop, and dynamic routing. ADK is model-agnostic (optimized for Gemini), deployment-agnostic, and GA on Vertex AI.
Signature contribution
Pushed the first broadly-adopted open protocol for agent-to-agent communication, separating transport from orchestration.
Evidence
Agent Development Kit (ADK) open-source release (2025)
Bets that open inter-agent protocols (A2A) will matter more than any single framework. Risk: protocol adoption is slow and A2A competes with MCP for developer attention — if neither wins critical mass, the interop layer fragments. For developers: ADK is a solid Gemini-native choice, but the A2A protocol bet only pays off if multiple vendors actually implement it.
DeepSeek
Open-weight reasoning models purpose-built for agentic tool use
Key system
DeepSeek V3.1 / V3.2 agent-era models
Trains base and reasoning models against 1,800+ agentic training environments so tool-calling, planning, and error recovery are baked into the weights rather than bolted on via prompting. V3.1 was framed as the 'first step toward the agent era'; V3.2 integrates reasoning and tool invocation in a single unified decoding loop. V3.2 (December 2025) is the first model to reason while executing tools — maintaining chain-of-thought across multiple tool calls rather than reasoning first, then executing. Three innovations: DeepSeek Sparse Attention for efficient long-context, a scalable RL framework allocating 10%+ of pre-training compute, and a large-scale agentic task synthesis pipeline covering 1,800+ environments and 85K+ complex instructions.
Signature contribution
First open-weight model family explicitly optimized end-to-end for agent workloads at frontier scale.
V3.2: reasoning-while-executing (maintains CoT across tool calls)
DeepSeek Sparse Attention
10%+ pre-training compute allocated to RL
85K+ complex instructions across 1,800+ environments
SO WHAT?
Bets that baking agentic capabilities into model weights via RL over 1,800+ environments will outperform bolting tools onto generic models. Risk: open weights mean anyone can fine-tune, but the agentic RL training infrastructure is not open — so the moat is in training pipeline, not weights. For developers: strongest open-weight option for agent workloads, but you're dependent on DeepSeek's training choices with no way to retrain the agentic behaviors yourself.
Mistral
Stateful conversational agents and European sovereign agent infrastructure
Key system
Mistral Conversations API + Le Chat agents
Treats agent state as a first-class API concept. The Conversations API persists messages, tool calls, and agent identities server-side so clients do not have to reconstruct context on every turn. Le Chat exposes these agents to end users with web search, code execution, and image generation tools.
Signature contribution
European frontier lab pushing a sovereignty-aware, persistently-stateful agent API as an alternative to stateless tool-calling.
Evidence
Mistral Agents API / Conversations API (2025)
Le Chat agent capabilities (code interpreter, web search, image gen)
Mistral Large and Medium models with native function calling
SO WHAT?
Bets that server-side state persistence is the right abstraction for agent memory, and that European sovereignty matters for enterprise adoption. Risk: stateful APIs create stickier vendor lock-in than stateless alternatives, and the sovereignty advantage only matters in regulated EU verticals. For developers: simplifies multi-turn agent state management significantly, but migrating away means rebuilding your persistence layer.
Cohere
Enterprise retrieval-grounded agents with verifiable citations
Key system
Command models with grounded generation and tool_plan
Treats citation and tool planning as structural outputs rather than prompt conventions. Command models emit a tool_plan field describing intended tool calls before execution and produce inline citations tied to source spans, making enterprise audits and RAG verification tractable. The tool_plan is a natural-language reasoning step generated before tool calls — an explicit chain-of-thought for tool selection. Two citation modes: fast (inline during generation) and accurate (post-generation, higher precision). Command A (111B) supports 256K context with 150% higher throughput than R+.
Signature contribution
Made inline citation and explicit tool planning first-class model outputs, not prompt hacks.
Evidence
Command R / R+ tool use with tool_plan field
Grounded generation with inline citations
Cohere RAG and connector APIs
North enterprise agent platform
tool_plan: explicit chain-of-thought for tool selection
Two citation modes: fast (inline) and accurate (post-generation)
Bets that citation provenance and explicit tool planning as model outputs solve the enterprise trust problem. Risk: structural citation is only as good as the retrieval — garbage sources with perfect citations create false confidence. For developers: if your use case requires auditable RAG with traceable sources, Cohere's approach is materially ahead; if you don't need citation, the overhead adds complexity.
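A hedged sketch of reading the tool_plan field, assuming the Cohere v2 chat API as documented for tool use; the tool schema, model name, and prompt are illustrative.

```python
import cohere  # assumes the cohere SDK with the v2 client and CO_API_KEY set

co = cohere.ClientV2()

tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice",
        "description": "Fetch an invoice by id.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

response = co.chat(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": "Pull up invoice INV-1042 and summarize it."}],
    tools=tools,
)

print(response.message.tool_plan)    # natural-language plan emitted before any tool call
print(response.message.tool_calls)   # the structured calls that plan commits to
```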
Perplexity
Launched February 2026. Uses a meta-router that classifies incoming requests by type and complexity, dispatching to the optimal model from a pool of ~19 models (Claude Opus 4.6 for core reasoning, Gemini for deep research, Grok for speed, GPT-5.2 for long-context recall). Complex tasks decompose into task graphs with sub-agents in isolated Linux sandboxes. The citation-first architecture is built atop Perplexity's search-native retrieval backbone — grounding came before the agent layer.
Signature contribution
Multi-model routing with citation provenance from a search-native foundation. The agent layer was built on top of an existing retrieval system, not the other way around.
Evidence
Perplexity Computer (Feb 2026)
~19 model pool with task-type routing
Isolated Linux sandboxes for sub-agents
Search-native citation backbone
SO WHAT?
Bets that the best agent is a router, not a single powerful model. Risk: model pool management complexity grows with each new frontier model; the routing heuristics need constant tuning. For users: strongest choice when grounded, cited answers matter more than deep reasoning chains.
xAI
Native inference-time multi-agent reasoning with council-based architecture
Key system
Grok 4.20 Beta — 4-agent council on MoE backbone
Grok 4.20 Beta (February 2026) implements native inference-time multi-agent processing with 4 specialized agents — Grok (coordinator), Harper (research/facts), Benjamin (logic/code), Lucas (creative/divergent) — running as heads on the same MoE backbone. A cross-attention block enables critique embedding exchange, and MARL during post-training rewards rapid convergence (average debate under 180 tokens). A lightweight router bypasses the council for simple queries. SuperGrok Heavy scales to 16 agents.
Signature contribution
First production system to embed multi-agent debate as a native inference-time mechanism rather than an application-layer orchestration pattern.
~65% reduction in hallucinations on multi-step reasoning
SuperGrok Heavy: 16 agents
SO WHAT?
⚠️ Internal architecture details are partially speculative — xAI has not confirmed the exact parameter count or adapter mechanism. Bets that multi-agent reasoning should be in the weights, not in external orchestration. Risk: higher inference cost per query; fixed agent roles may not generalize. For users: strongest for multi-step reasoning tasks where hallucination reduction justifies the compute cost.
Cognition
Strong single-agent stance: one context, one decision-maker, tools and subagents only when strictly necessary. The 'Don't Build Multi-Agents' essay argues that context fragmentation between parallel agents is the dominant failure mode in long-horizon work. Devin operates end-to-end on real repositories with its own VM, browser, and editor. The general Devin agent uses frontier models (currently Claude Sonnet 4.5), not a custom model. Kevin-32B is a separate open-source model trained with multi-turn RL specifically for CUDA kernel generation (91% correctness on KernelBench), not the general Devin agent model. Devin 2.0 supports fleet parallelism for scaling across tasks and dropped pricing from $500/month to $20/month.
Signature contribution
Reported 67% PR merge rate on real open-source repos and publicly pushed back on multi-agent orthodoxy.
Evidence
Devin 2.0 release (2025)
'Don't Build Multi-Agents' essay (2025)
67% merge rate on open-source PR benchmark
Cognition agent VM + browser harness
SO WHAT?
Bets that a single agent with unified context outperforms multi-agent decomposition for long-horizon software tasks. Risk: single-context scaling has hard limits — as repository size grows, the agent eventually can't hold enough context to reason effectively, and the 'no multi-agent' stance may not survive that ceiling. For developers: the 67% merge rate is impressive but measured on curated OSS tasks; expect lower rates on messy internal codebases with implicit conventions.
Windsurf
Originally Codeium, now owned by Cognition AI. Uses the Cascade engine with two modes: Code and Chat. A specialized planning agent refines long-term plans in the background while the selected model handles short-term actions. Cascade tracks all user actions — file edits, terminal commands, navigation — to infer intent in real time (the 'Flow' paradigm). Graph-based codebase reasoning uses RAG-based indexing converting files to 768-dimensional embeddings with a proprietary M-Query retrieval technique.
Signature contribution
Real-time intent inference from user behavior patterns. The agent watches what you do, not just what you ask.
Evidence
Cascade engine (Code + Chat modes)
Flow paradigm — real-time user action tracking
Graph-based codebase reasoning
768-dimensional embeddings
M-Query retrieval technique
SO WHAT?
⚠️ 'Dependency graph' and 'mini-compiler' descriptions come from community analysis, not official documentation. Bets that observing developer behavior gives better context than explicit instructions. Risk: privacy concerns with action tracking; accuracy degrades in unfamiliar codebases where the graph is sparse. For users: strongest for long editing sessions in established codebases.
Manus
Treats harness design as the core product and iterates aggressively — publicly documented five full harness refactors in six months. Uses a single long-lived agent with a virtual computer, file system, and browser rather than multi-agent decomposition. Leaked system prompts became a widely-studied case in the agent community. Acquired by Meta for ~$2B in December 2025. Uses CodeAct (ICML 2024): instead of JSON tool calls, the agent generates Python scripts as actions, achieving ~20% higher success rates. A context-aware state machine manages tool availability using logit masking during decoding — constraining which tools can be selected at each state without invalidating KV-cache. The todo.md recitation pattern pushes the global plan into the model's recent attention span. Average task involves ~50 tool calls with a ~100:1 input-to-output token ratio.
Signature contribution
Public case study for single-agent iteration velocity and for the operational realities of shipping a consumer autonomous agent.
Evidence
Manus consumer autonomous agent launch (2025)
Publicly discussed five harness refactors in six months
Bets that rapid harness iteration — five refactors in six months — will converge on a good consumer agent architecture faster than careful upfront design. Risk: iteration velocity without public benchmarks makes it hard to assess actual capability vs. demo polish. For developers: Manus is more instructive as a case study in harness design evolution than as a technology to build on, since the stack is closed.
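A sketch of the todo.md recitation pattern in isolation, under stated assumptions; the file name follows the write-up above, but the helper functions are illustrative, not Manus's code.

```python
from pathlib import Path

TODO = Path("todo.md")   # the recited plan lives in a plain checklist file

def recite_plan(messages: list[dict]) -> list[dict]:
    """Re-append the current plan each turn so it sits in the model's recent attention span."""
    plan = TODO.read_text() if TODO.exists() else "(no plan yet)"
    return messages + [{"role": "user", "content": f"Current plan (recited):\n{plan}"}]

def check_off(item: str) -> None:
    """Mark a finished step so the recited plan reflects real progress."""
    if not TODO.exists():
        return
    lines = TODO.read_text().splitlines()
    TODO.write_text("\n".join(
        line.replace("- [ ]", "- [x]", 1) if item in line else line for line in lines
    ) + "\n")
```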
Cursor
Code-graph-aware IDE agents with shadow execution environments
Key system
Shadow Workspace + Background Agents
Runs agents against a mirrored copy of the user's project (the Shadow Workspace) so speculative edits, builds, and tests never touch the live workspace. Background Agents run longer horizons asynchronously against remote VMs. Heavy investment in code graph indexing so retrieval is structure-aware rather than purely embedding-based. Cursor 3 (April 2026) rebuilt around Composer 2, a custom model trained via real-time RL in the exact Cursor harness. Cloud agents run in isolated VMs. A 'semantic diff' pipeline has the main LLM produce diffs → a cheaper apply-model writes files → linter checks feed back for self-correction. The original shadow workspace (September 2024) used a hidden Electron window for parallel iteration but was later removed.
Signature contribution
Pioneered isolated shadow execution environments for in-IDE agents, decoupling agent exploration from user state.
Shadow workspace concept (Sep 2024, later removed)
SO WHAT?
Bets that shadow execution environments and code-graph-aware retrieval are the right primitives for IDE agents. Risk: the Shadow Workspace adds significant infrastructure complexity — if models get good enough to reason about edits without speculative execution, the isolation layer becomes overhead. For developers: best-in-class for speculative multi-file edits today, but the approach is tightly coupled to IDE context and doesn't transfer to non-coding agent use cases.
Microsoft
Open multi-agent frameworks and enterprise agent tooling
Key system
AutoGen / Microsoft Agent Framework
AutoGen pioneered conversational multi-agent patterns; the newer Microsoft Agent Framework merges AutoGen and Semantic Kernel into a single supported stack. Magentic-One demonstrates a generalist orchestrator coordinating specialized agents for web, file, and code tasks. Copilot Studio exposes these patterns to enterprise low-code builders. ⚠️ Critical context: original AutoGen creators Chi Wang and Qingyun Wu departed Microsoft in late 2024 to establish AG2 as a community fork. Multiple sources report AutoGen has 'virtually disappeared from production environments' in 2026, with the pyautogen PyPI package no longer under Microsoft control. Microsoft merged AutoGen and Semantic Kernel into the Microsoft Agent Framework with RC 1.0 shipped February 2026.
Signature contribution
Largest body of open research on conversational multi-agent orchestration, now being consolidated into a unified enterprise stack.
Evidence
AutoGen open-source framework
Microsoft Agent Framework (AutoGen + Semantic Kernel merger)
Magentic-One generalist multi-agent system (2024)
Copilot Studio agent builder
SO WHAT?
Bets that conversational multi-agent patterns are the right abstraction for enterprise agent systems, now consolidating AutoGen and Semantic Kernel into one stack. Risk: framework merger creates migration churn for existing AutoGen users, and the multi-agent-first philosophy adds orchestration overhead that simpler reasoning-model approaches avoid. Additional framework stability risk: the departure of AutoGen's original creators and the pyautogen package changing hands raises questions about continuity for teams that built on the original AutoGen. For developers: strongest option if you need structured multi-agent workflows with enterprise governance, but evaluate whether a single-agent approach solves your problem first.
Enterprise data catalog agents, production simplification case study
Key system: o3-backed single-agent stack (replacement for hierarchical multi-agent)
Approach
Publicly documented replacing a hierarchical multi-agent orchestration with a single reasoning-model-driven agent once o3-class models became available. The case study is widely cited as evidence that model capability gains can collapse entire orchestration layers, and that most enterprise agent work does not need multi-agent decomposition.
Signature contribution
Canonical production case for 'the model ate the framework' — retiring a multi-agent system in favor of one reasoning model.
Evidence
Alation engineering talk / writeup on retiring hierarchical multi-agent stack
Migration from multi-agent orchestration to o3 single-agent
Production deployment in enterprise data catalog workflows
Approach
Argues that enterprise agents should be expressed as explicit state machines with auditable transitions, approval gates, and reversible actions rather than free-form LLM loops. This aligns agent behavior with existing enterprise risk, audit, and compliance structures and makes failure modes inspectable.
Signature contribution
Leading voice for governance-first agent architectures in regulated enterprise deployments.
Evidence
QuantumBlack publications on enterprise agent architecture
McKinsey Digital reports on generative AI operating models
Client case studies on state-machine agent governance
SO WHAT?
Bets that explicit state machines with approval gates are worth the upfront design cost for enterprise agent deployments. Risk: state-machine rigidity can bottleneck iteration speed — every new agent behavior requires a new auditable state transition, which slows teams used to prompt-and-ship cycles. For developers: the right pattern if you're in regulated industries (finance, healthcare) where auditability is non-negotiable; overkill for internal tools or consumer products.
Open-source AI research + Llama Stack pluggable-provider API framework for agent infrastructure
Key system: Llama Stack — provider-agnostic agent API
Approach
Long-running open research program on agents that must reason about other agents — negotiation, planning under partial information, and natural-language coordination. CICERO combined a strategic planning module with a dialogue model to reach human-level performance in Diplomacy. Subsequent work has focused on open tool-use datasets and evaluation. Llama Stack is a pluggable-provider API framework with full OpenAI API compatibility. Key APIs span inference, safety (Llama Guard 3, Prompt Guard), memory (vector/KV/keyword/graph), and agentic orchestration. Core design principle is provider-agnostic: 'Develop locally with Ollama, deploy to production with vLLM — the API stays the same.' Llama 3.1+ models have built-in tool calling. Toolformer's research influence is indirect — Llama Stack standardizes the infrastructure, not the model capability.
Signature contribution
Only lab to demonstrate human-level performance in a natural-language negotiation game, and a primary source of open agent research artifacts.
Evidence
CICERO Diplomacy agent (Science, 2022)
Open tool-use and agent evaluation datasets
Llama models with native tool use
FAIR open research publications on multi-agent interaction
Llama Stack API framework
Llama Guard 3 + Prompt Guard safety
OpenAI API compatibility layer
SO WHAT?
Bets on open-weight ecosystem moat. Risk: Meta's agent infrastructure maturity lags behind Anthropic/OpenAI/Google; Llama Stack adoption is early. For developers: strongest value proposition when you want to avoid vendor lock-in and own the full stack.
Approach
Domain-specialized agents that combine a learned policy with extensive verified search. AlphaCode samples and filters programs against test cases; AlphaProof translates problems into Lean and searches proof space under a learned prior; Isomorphic Labs applies analogous ideas to drug discovery. All share the pattern of grounding LLM candidates in a verifier rather than trusting single-shot generation.
Signature contribution
Defines the verifier-grounded frontier — agents whose outputs are filtered through formal or experimental verification rather than self-critique.
Evidence
AlphaCode and AlphaCode 2 (competitive programming)
SO WHAT?
Bets that formal verification (proof checkers, test suites) as a filter on LLM-generated candidates is the path to reliable long-horizon agents. Risk: verifier-grounded approaches only work in domains where verification is tractable — math proofs and code tests qualify, but most business tasks don't have clean verifiers. For developers: the pattern (generate many, verify few) is transferable, but building your own verifier is the hard part.
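The generate-many, verify-few pattern is simple to express in code. A minimal sketch, assuming the caller supplies the `generate` function (an LLM call) and a `check` verifier; the `solve` and `make_io_checker` names are illustrative, not DeepMind's implementation:

```python
from typing import Callable

def generate_verify_select(
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> candidate program text
    check: Callable[[str], bool],     # verifier: run tests / type-check / proof-check the candidate
    prompt: str,
    n_samples: int = 32,
) -> str | None:
    """Sample many candidates cheaply, trust only the verifier, return a survivor (or None)."""
    survivors = [c for c in (generate(prompt) for _ in range(n_samples)) if check(c)]
    return survivors[0] if survivors else None

# Example verifier for code tasks: exec the candidate and test it on known input/output pairs.
def make_io_checker(cases: list[tuple[int, int]]) -> Callable[[str], bool]:
    def check(candidate: str) -> bool:
        namespace: dict = {}
        try:
            exec(candidate, namespace)              # candidate is expected to define solve(x)
            return all(namespace["solve"](x) == y for x, y in cases)
        except Exception:
            return False
    return check
```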
Twelve approaches to agent memory. Memory is a learnable decision, not a data structure.
Memory-R1 · Hybrid
Reinforcement-learned memory management for long-horizon agents
Trains a policy to decide what to write, recall, and forget in agent memory. The policy is optimized end-to-end against task rewards rather than handcrafted heuristics, and outperforms rule-based memory managers on long-horizon reasoning benchmarks.
Key insight
Memory is a learnable policy, not a data structure. Writing and forgetting are decisions that can be optimized.
Tradeoff
Requires training data and RL compute; policies are brittle outside the trained task distribution.
Revisits completed agent trajectories after the fact and rewrites them into cleaner, more instructive episodes before storing them as memory. The revised traces act as higher-quality exemplars for future retrieval, improving downstream task performance without changing the base policy.
Key insight
What the agent remembers about an episode should not be the raw trace — it should be the lesson extracted in hindsight.
Tradeoff
Adds an offline revision pass and risks introducing hindsight bias if the revision model hallucinates.
Treats the LLM context window like RAM and external storage like disk. A controller LLM pages information in and out of context using explicit function calls, maintaining a working set, a recall archive, and a core persona block. This lets agents operate over effectively unbounded histories with a fixed context window.
Key insight
Virtual memory is the right abstraction for bounded context windows — give the model tools to manage its own paging.
Tradeoff
Every page-in/page-out is a tool call that costs tokens and latency, and the controller can mis-page under pressure.
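A toy sketch of the paging idea, assuming illustrative tool names (`page_in`, `page_out`) and a fixed working-set size rather than MemGPT's actual interface; the point is that the model manages its own window through explicit calls:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class VirtualContext:
    """Context window as RAM, external archive as disk; the model pages explicitly via tool calls."""
    max_working_items: int = 8
    working_set: deque = field(default_factory=deque)       # what the model sees each turn
    archive: dict[str, str] = field(default_factory=dict)   # unbounded external storage

    def page_out(self, key: str, content: str) -> str:
        """Evict content from the window into the archive (a tool the model can call)."""
        self.archive[key] = content
        return f"paged out '{key}'"

    def page_in(self, key: str) -> str:
        """Bring archived content back into the window, evicting the oldest item if full."""
        if len(self.working_set) >= self.max_working_items:
            self.working_set.popleft()                        # lossy eviction under pressure
        content = self.archive.get(key, f"<no archive entry for '{key}'>")
        self.working_set.append(f"{key}: {content}")
        return content

    def render(self) -> str:
        """Serialize the working set into the prompt for the next model call."""
        return "\n".join(self.working_set)
```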
Cognitive architectures framework for language agents
Provides a reference architecture decomposing language agents into working memory, episodic memory, semantic memory, and procedural memory, connected via reasoning, retrieval, learning, and grounding actions. CoALA is primarily a conceptual framework that lets researchers classify and compare otherwise incompatible agent designs.
Key insight
Language agents already implement cognitive-architecture concepts implicitly; naming them explicitly makes design choices comparable.
Tradeoff
Descriptive rather than prescriptive — does not specify implementations, so two CoALA-compliant agents can differ enormously.
Timestamped observation stream with reflection and importance-weighted retrieval
Each agent maintains an append-only stream of natural-language observations. Retrieval scores candidates on recency, importance, and relevance. Periodically the agent generates higher-level 'reflections' by asking itself what the recent stream implies, and those reflections are written back into the same stream, producing a ladder of abstraction.
Key insight
A flat observation log plus periodic self-reflection is enough to produce emergent long-horizon behavior in social simulations.
Tradeoff
Retrieval quality degrades as the stream grows, and reflections can amplify errors if earlier observations were wrong.
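The scoring rule is easy to sketch. A minimal version with equal weights, an assumed exponential recency decay, and a caller-supplied `embed` function (all illustrative choices, not the paper's exact constants):

```python
import math
import time
from dataclasses import dataclass

@dataclass
class Observation:
    text: str
    created_at: float       # unix timestamp
    importance: float       # 0-10, assigned by the model when the memory is written

def recency(obs: Observation, now: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay on age; the half-life is an illustrative choice."""
    age_hours = (now - obs.created_at) / 3600.0
    return 0.5 ** (age_hours / half_life_hours)

def relevance(obs: Observation, query_vec: list[float], embed) -> float:
    """Cosine similarity between the query and the observation embedding."""
    v = embed(obs.text)
    dot = sum(a * b for a, b in zip(v, query_vec))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in query_vec))
    return dot / norm if norm else 0.0

def retrieve(memories: list[Observation], query_vec: list[float], embed, k: int = 5):
    """Score = recency + importance + relevance, each roughly normalized to [0, 1]."""
    now = time.time()
    scored = [
        (recency(m, now) + m.importance / 10.0 + relevance(m, query_vec, embed), m)
        for m in memories
    ]
    return [m for _, m in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]
```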
Procedural memory as a growing library of executable skill code
In Minecraft, Voyager asks an LLM to write JavaScript functions that accomplish tasks, verifies them in the environment, and stores successful functions in a skill library keyed by natural-language descriptions. Future tasks retrieve relevant skills by description and compose them, so the agent's capability grows monotonically as code rather than as weights.
Key insight
Procedural memory can be literal, inspectable source code — skills that accumulate form a curriculum the agent writes for itself.
Tradeoff
Limited to domains where actions can be expressed as verifiable code and where failures are cheap to retry.
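A sketch of a skill library under these assumptions: skills are plain source text keyed by a description, the `embed` function is supplied by the caller, and verification happens in the environment before `add` is called; none of this is Voyager's Minecraft-specific machinery:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    description: str     # natural-language key used for retrieval
    source: str          # the executable code the agent wrote for itself
    verified: bool       # only environment-verified skills enter the library

class SkillLibrary:
    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.skills: list[tuple[list[float], Skill]] = []

    def add(self, skill: Skill) -> None:
        """Store only skills that passed verification in the environment."""
        if skill.verified:
            self.skills.append((self.embed(skill.description), skill))

    def retrieve(self, task: str, k: int = 3) -> list[Skill]:
        """Return the k skills whose descriptions are most similar to the task."""
        q = self.embed(task)
        def sim(v: list[float]) -> float:
            dot = sum(a * b for a, b in zip(q, v))
            nq = sum(a * a for a in q) ** 0.5
            nv = sum(b * b for b in v) ** 0.5
            return dot / (nq * nv) if nq and nv else 0.0
        ranked = sorted(self.skills, key=lambda pair: sim(pair[0]), reverse=True)
        return [skill for _, skill in ranked[:k]]
```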
User-scoped long-term semantic memory with explicit user control
Projects bundle documents, instructions, and conversations into a persistent workspace that Claude can reference across sessions. Claude Memory extends this with automatically maintained user facts that the model can read and update. Both are scoped per user and designed for transparent inspection and editing by the user rather than opaque background personalization.
Key insight
Long-term memory in a consumer product is as much a UX problem as a retrieval problem — users need to see and edit what the agent remembers.
Tradeoff
Explicit memory surfaces add UI friction and require the model to stay consistent with user-edited facts.
ChatGPT automatically extracts salient facts from conversations and stores them in a user-scoped memory store that is injected into future system prompts. Users can list, edit, and delete memories, and can disable the feature entirely. Memory is cross-session but bounded per user account.
Key insight
Most consumer agent memory value comes from a small set of stable user facts, not from full conversational replay.
Tradeoff
Salience extraction is a heuristic; over-remembering creates privacy concerns and under-remembering makes the feature feel inert.
Stateful agent platform productizing the MemGPT memory architecture
Letta (formerly the MemGPT company) provides a server and SDK for building agents whose memory, tools, and identity persist across sessions. It exposes MemGPT-style core memory blocks, archival memory, and recall memory as first-class concepts, with a database backend so state survives process restarts.
Key insight
Stateful agents need a memory database, not a chat history — persistence should be a platform concern, not a prompt concern.
Tradeoff
Introduces a server and schema that application developers must operate and reason about.
Mem0 adds a pluggable memory layer to LLM apps with user-level, session-level, and agent-level scopes. It extracts facts from conversations, deduplicates and updates them over time, and exposes a simple add/search API so applications do not have to build their own retrieval pipeline.
Key insight
Most production LLM apps need memory at multiple scopes simultaneously (user, session, agent); a shared layer is cheaper than rebuilding each.
Tradeoff
Extraction and dedup quality depends on the underlying model; stale or duplicated facts leak into retrieval results.
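The shape of such a layer fits in a few lines. A minimal sketch (not Mem0's actual API) with user, session, and agent scopes over one store and a naive keyword match standing in for embedding search:

```python
from dataclasses import dataclass, field

@dataclass
class ScopedMemory:
    """One store, three scopes; real implementations add extraction, dedup, and vector search."""
    facts: dict[tuple[str, str], list[str]] = field(default_factory=dict)

    def add(self, scope: str, scope_id: str, fact: str) -> None:
        """scope is one of 'user', 'session', 'agent'; dedup here is a simple exact-match check."""
        bucket = self.facts.setdefault((scope, scope_id), [])
        if fact not in bucket:
            bucket.append(fact)

    def search(self, query: str, scope: str, scope_id: str) -> list[str]:
        """Naive keyword match standing in for embedding retrieval."""
        bucket = self.facts.get((scope, scope_id), [])
        terms = query.lower().split()
        return [f for f in bucket if any(t in f.lower() for t in terms)]

memory = ScopedMemory()
memory.add("user", "u-42", "prefers answers in French")
memory.add("session", "s-7", "is debugging a flaky CI pipeline")
print(memory.search("french preference", scope="user", scope_id="u-42"))
```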
Zep builds a temporal knowledge graph from agent conversations and documents, where entities and relationships carry validity intervals. Retrieval returns facts that were true at a given time, not just semantically similar text. This lets agents answer questions about how user state evolved and avoid recalling stale facts.
Key insight
Semantic memory without time is a lie waiting to happen — facts have validity windows and memory should model them.
Tradeoff
Maintaining a temporal graph is heavier than vector search and requires reliable entity resolution.
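A sketch of what validity intervals buy you, with illustrative field names rather than Zep's schema: retrieval filters on 'true at time t', not on similarity alone:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TemporalFact:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: datetime | None = None   # None means still believed true

    def valid_at(self, t: datetime) -> bool:
        return self.valid_from <= t and (self.valid_to is None or t < self.valid_to)

facts = [
    TemporalFact("user", "employer", "Acme", datetime(2022, 1, 1), datetime(2024, 6, 1)),
    TemporalFact("user", "employer", "Globex", datetime(2024, 6, 1)),
]

def employer_as_of(t: datetime) -> str | None:
    """Return the fact that was true at time t, not the most recent or most similar text."""
    for f in facts:
        if f.predicate == "employer" and f.valid_at(t):
            return f.obj
    return None

print(employer_as_of(datetime(2023, 3, 1)))   # Acme
print(employer_as_of(datetime(2025, 1, 1)))   # Globex
```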
Session-scoped checkpoints plus cross-session store as framework primitives
LangGraph separates memory into two primitives. The Checkpointer persists the full graph state after every step, giving session-scoped short-term memory, resumability, and time-travel debugging. The Store provides a cross-session, namespaced key-value memory for long-term facts. Both are pluggable across SQLite, Postgres, and Redis backends.
Key insight
Short-term and long-term memory have different failure modes — treating them as separate primitives with different backends avoids conflating durability with retrieval.
Tradeoff
Two APIs to reason about, and the Store leaves higher-level concerns like extraction and dedup to the application.
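A deliberately framework-agnostic sketch of the two primitives (not LangGraph's exact API): a checkpointer that snapshots full per-thread state after each step, and a namespaced key-value store for cross-session facts:

```python
import json
import sqlite3

class Checkpointer:
    """Session-scoped: persist the full agent state after every step, keyed by thread and step."""
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS checkpoints (thread TEXT, step INTEGER, state TEXT)")

    def save(self, thread: str, step: int, state: dict) -> None:
        self.db.execute("INSERT INTO checkpoints VALUES (?, ?, ?)", (thread, step, json.dumps(state)))
        self.db.commit()

    def latest(self, thread: str) -> dict | None:
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE thread=? ORDER BY step DESC LIMIT 1", (thread,)
        ).fetchone()
        return json.loads(row[0]) if row else None

class Store:
    """Cross-session: namespaced key-value memory for long-term facts."""
    def __init__(self):
        self.data: dict[tuple[str, str], dict] = {}

    def put(self, namespace: str, key: str, value: dict) -> None:
        self.data[(namespace, key)] = value

    def get(self, namespace: str, key: str) -> dict | None:
        return self.data.get((namespace, key))
```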
Five unresolved questions shaping how agents are built. Each shows pro, con, and a practitioner resolution.
Single vs. Multi: Single-Agent vs. Multi-Agent
Pro
Multi-agent systems deliver measurable improvements on parallelizable breadth tasks and outperform single-agent baselines on research and analysis work. Google Research (December 2025, 180 configurations) found centralized coordination improves parallelizable tasks by 80.8%.
Con
Reasoning models handle planning natively. Adding agents adds coordination overhead, failure modes, and a 15× token cost without corresponding gains on depth tasks, and degrades sequential reasoning by 39–70%. The 'Rule of 4' limits effective team sizes to 3–4 agents before communication overhead dominates. Princeton NLP found single-agent matched or outperformed multi-agent on 64% of benchmarked tasks.
Princeton NLP — single-agent wins 64% of benchmarks
Resolution
Task-dependent, not universal. Multi-agent for parallelizable breadth. Single-agent for sequential depth. The Rule of 4 caps effective team size. Cost multiplier: single agents 4× tokens vs chat, multi-agent 15×.
Pro
Frameworks capture collective wisdom from thousands of production deployments. They provide useful infrastructure — task queues, persistence, checkpointing — that you'd otherwise rebuild. Harrison Chase distinguishes frameworks (abstractions), runtimes (durable execution), and harnesses (batteries-included) — arguing the runtime and harness layers provide lasting value. 'Use LangGraph for agents, not LangChain.'
Con
Extra abstraction layers obscure what the model actually sees. When prompts and responses are hidden behind framework code, debugging becomes archaeology. MindStudio argues better models have made 'many framework abstractions unnecessary or actively harmful.'
Resolution
Own your cognitive architecture (it's your differentiator). Outsource agentic infrastructure (task queues, persistence, checkpointing — commodity work). The line isn't framework vs no framework — it's 'what part do I need to see directly to reason about my system?' The debate is evolving: the consensus is moving toward raw SDK for simple agents, LangGraph for stateful orchestration, CrewAI for rapid prototyping.
Scaffolding vs. Minimal: The "Bitter Lesson" for Agents
Pro
Harness engineering yields measurable improvements independent of model quality. LangChain achieved 13.7pt TerminalBench 2.0 gains through harness changes alone — no model swap. Google Research's capability saturation threshold (~45%) bounds this: once models clear the threshold, engineering's marginal value decreases.
Con
Heavy scaffolding has diminishing returns. As models improve, the scaffolding you built last year becomes the ceiling that prevents this year's model from showing its real capability. But Manus rebuilt their harness 5 times with the same models and improved each time. And compound-error math (0.99^100 = 36.6%) constrains all architectures regardless.
Manus — 5 harness rebuilds, same models, improving each time
Resolution
Current evidence supports a hybrid: engineering matters now, but simpler architectures survive model upgrades better. The bitter lesson applies at the architecture level but not at the harness level.
Pro
Full automation isn't the goal — cognitive augmentation is. The best AI products extend human capability rather than replace it. Tesla ran Autopilot for 12 years without achieving full self-driving. Karpathy's autonomy slider (YC Keynote, June 2025): 'Less AGI hype and flashy demos that don't work, more partial autonomy, custom GUIs and autonomy sliders.' Cursor's Tab → Cmd+K → Cmd+L → Agent mode exemplifies graduated autonomy.
Con
Genuine autonomous capability exists today in bounded domains. Claude Code and Devin demonstrate long-horizon autonomy when the task shape is well-understood. Refusing to automate is leaving value on the table.
Resolution
Karpathy's "autonomy slider" — Cursor demonstrates this cleanly: Tab → Cmd+K → Cmd+L → Agent mode. Let users dial in the autonomy level themselves based on task risk and their own trust. The debate isn't 'which mode' — it's 'expose the slider'. This debate is effectively settled as a design pattern rather than an ideological argument — builders should implement autonomy sliders, not choose sides.
Focused vs. General: Focused-First vs. General-Orchestrator
Pro
Focused agents with narrow scope win. Production evidence shows monolithic agents reach ~40% success on complex tasks while focused-and-composed agents hit ~95% at 20–30% context each — scope discipline is the single biggest reliability lever. Google Research's December 2025 study (180 configurations) found task type determines architecture more than any other variable — builders should choose by task type, not autonomy level.
Con
General orchestrators scale better as model capability grows. The Bitter Lesson applied to agents says that frontier reasoning models (DeepSeek V3.1, o3-class) absorb planning and decomposition natively, so a single capable agent outperforms hand-built hierarchies.
DeepSeek V3.1 technical report
Alation — replaced a multi-agent hierarchy with a single o3 agent
Resolution
Focused-first, general-later. Start with narrow agents and compose them; migrate toward a single-agent architecture as frontier-model capability absorbs orchestration overhead. The right answer shifts with each model generation. A dual taxonomy is emerging: the 5-type autonomy framework for strategic planning, but task-type-first (parallelizable vs sequential vs exploratory) for day-to-day design decisions.
How multi-agent systems fail in production — and the mitigations that measurably reduce them.
Cascade amplification
critical
A single erroneous output propagates through a multi-agent topology until all nodes hold the corrupted state. Downstream agents treat upstream claims as ground truth and re-emit them with added confidence.
Evidence
5 of 6 frameworks tested reached 100% infection from one seed error (Spark to Fire, City University of Macau, 2026).
Mitigation
Multi-agent verification pipelines — require independent re-derivation of load-bearing claims before accepting upstream output, with verifiers that cannot see the original chain of reasoning.
Impact: ~2,800% reduction in hallucination propagation reported by Spark to Fire with verification pipelines in place.
Topological fragility
high
Errors injected at hub nodes dominate the outcome far more than errors at leaves. Star and tree topologies concentrate blast radius on whichever agent routes or aggregates traffic.
Evidence
Hub-vs-leaf injection showed a ~10.3× impact gap in Spark to Fire experiments across standard multi-agent topologies.
Mitigation
Avoid hub bottlenecks. Prefer flatter topologies, replicate critical routing/aggregation roles, and add independent cross-checks on any node the system cannot afford to be wrong.
Impact: Flattening topology and duplicating hub responsibilities brings worst-case impact closer to the leaf-injection baseline.
Consensus inertia
high
Iterative debate and voting loops lock in early errors. Each round inherits the prior round's framing, so agents rationalize rather than revisit, and consensus hardens around whichever answer got traction first.
Evidence
~3.9× contextual debt accumulated by round 6 in Spark to Fire's debate protocols, with error rates increasing rather than decreasing with additional rounds.
Mitigation
Bound the number of rounds. Periodically restart from independent initial conditions and compare, rather than letting a single thread converge unchallenged.
Impact: Bounded-round protocols with independent restarts recover most of the accuracy lost to runaway debate loops.
Genealogy drift
high
Agents forget the provenance of intermediate claims. As outputs get summarized, paraphrased, and forwarded, the link back to the originating source (and its confidence) is lost, so every claim ends up looking equally authoritative.
Evidence
Spark to Fire reports defense success against adversarial injection rising from 0.32 to 0.89 when genealogy-graph governance is enforced end-to-end.
Mitigation
Attach and propagate a genealogy graph with every intermediate artifact — who produced it, from which inputs, with what confidence. Refuse to act on claims whose provenance has been stripped.
Impact: Defense success 0.32 → 0.89 (Spark to Fire) once provenance is mandatory rather than optional.
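A sketch of what propagating a genealogy graph looks like in practice, with illustrative field names: every derived artifact records its producer, inputs, and confidence, and the governance rule refuses claims whose lineage is missing:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    claim: str
    produced_by: str                      # agent or tool that emitted the claim
    confidence: float                     # producer's own calibration, 0-1
    parents: list["Artifact"] = field(default_factory=list)  # inputs this claim was derived from

def derive(claim: str, agent: str, confidence: float, parents: list[Artifact]) -> Artifact:
    """Every transformation records its inputs, so lineage survives summarization and handoff."""
    return Artifact(claim, agent, confidence, parents)

def actionable(artifact: Artifact, min_confidence: float = 0.6) -> bool:
    """Refuse to act on claims whose provenance has been stripped or whose lineage is weak."""
    if not artifact.parents and artifact.produced_by != "primary_source":
        return False                      # orphaned claim: no traceable origin
    lineage_conf = min([artifact.confidence] + [a.confidence for a in artifact.parents])
    return lineage_conf >= min_confidence
```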
Systemic coordination failure
critical
Most multi-agent failures are not individual-agent capability gaps — they are role confusion, dropped handoffs, redundant work, and missing termination conditions between otherwise-capable agents.
Evidence
MAST (Multi-Agent System Failure Taxonomy) finds 70%+ of observed multi-agent failures are systemic coordination issues, not model capability limits.
Mitigation
Reduce agent count. Apply 12-Factor Factor X (small, focused agents) and collapse overlapping roles; every additional agent is a new coordination surface and a new failure mode.
Impact: Teams that cut agent count and tightened roles after MAST-style audits recover most of the lost reliability without changing models.
Hallucination compounding
high
Per-step hallucination rates that look tolerable in isolation compound multiplicatively across a multi-agent pipeline. A 5% step error over ten steps leaves roughly 40% of trajectories corrupted end-to-end.
Evidence
Both MAST (arXiv 2503.13657) and Spark to Fire (arXiv 2603.04474) document compounding error rates as a dominant source of multi-agent unreliability.
Mitigation
Ground every step with retrieval (RAG) against authoritative sources and refuse to act on unsourced claims. Prefer fewer, grounded steps over many ungrounded ones.
Impact: 42–68% reduction in compounded hallucination reported across RAG-grounded multi-agent pipelines vs. ungrounded baselines.
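The compounding arithmetic is worth keeping at hand; assuming independent per-step errors, reliability p over n dependent steps leaves p^n of trajectories intact:

```python
def surviving_fraction(step_reliability: float, steps: int) -> float:
    """Fraction of trajectories with zero corrupted steps, assuming independent errors."""
    return step_reliability ** steps

print(f"{1 - surviving_fraction(0.95, 10):.0%} corrupted after 10 steps")   # ~40%
print(f"{surviving_fraction(0.95, 100):.1%} intact after 100 steps")        # ~0.6%
```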
Context poisoning
critical
Adversarial instructions embedded in untrusted inputs (web pages, emails, tool outputs, retrieved docs) are ingested into the agent's context window and then executed as if they were operator instructions. One poisoned input can compromise an entire downstream agent chain.
Evidence
Spark to Fire and MAST both document prompt-injection propagation as a first-class multi-agent failure mode, not merely a single-agent concern.
Mitigation
Per-agent untrusted-input firewalls: segregate untrusted content into quarantined channels, strip or escape instruction-like patterns, and never allow untrusted text to reach the instruction layer of downstream agents.
Impact: Firewalled architectures dramatically reduce cross-agent propagation of injected instructions, though they do not eliminate single-agent exposure.
Reward hacking in review
high
An agent grading its own work (or a sibling's work from the same lineage) systematically inflates scores, ignores its own errors, and rationalizes failures as successes. Self-grading loops converge on self-justification rather than truth.
Evidence
MAST identifies self-verification as a recurring coordination failure; production teams repeatedly report accuracy metrics collapsing once the grader is replaced with an independent judge.
Mitigation
Apply 12-Factor Factor V — no agent grades its own work. Use a separately-prompted, separately-contexted judge agent, or better, deterministic verification (tests, schema checks, re-derivation).
Impact: Independent judges and deterministic checks routinely surface failures that self-graders marked as passing, restoring calibrated accuracy signals.
Architecture Limits
Five architectures we can't yet build and the engineering blockers in front of each.
Persistent Cognitive Systems
Agents with durable identity and evolving expertise across weeks-to-years of continuous operation, accumulating skill rather than restarting each session.
Memory structure
No consensus on how to organize episodic vs. semantic vs. procedural memory at scale without context degradation. Current systems collapse distinctions into flat vector stores that lose temporal and causal ordering.
Context degradation
Long rolling contexts accumulate noise and contradictions; summaries lose fidelity. Every compaction step is lossy, and the loss is unevenly distributed across facts.
Autonomous memory management
Forgetting is a decision, not a side effect. Current agents rely on brittle heuristics or fixed token windows and cannot autonomously decide what to retain, revise, or discard.
Large populations of specialist agents that discover, bid for, and trade work through price signals rather than hand-coded routing — coordination as an emergent property of an internal economy.
O(n²) coordination cost
Pairwise negotiation and discovery costs scale quadratically with agent count; without a matching layer the economy collapses under its own message volume long before producing useful allocations.
Latency
Bidding, settlement, and arbitration add round-trips on top of already-slow LLM inference. Any task sensitive to wall-clock time cannot tolerate a market in the hot path.
Shared knowledge
Markets presuppose a common ontology of goods, qualities, and contracts. Agents with divergent world models cannot reliably price each other's outputs, leading to adverse selection and collapse of the price signal.
Nearest precedent: Research prototypes (MetaGPT-style role markets, auction-based MAS) — none in production at meaningful scale.
RossLabs synthesis
Self-Improving Agent Systems
Agents that measurably get better at their own job through deployment — not just more context, but genuinely updated capability that compounds over time.
Static LLM weights
The underlying model is frozen between releases. All 'learning' happens in prompts, tools, or retrieval layers — which are easier to poison than to harden, and reset on every model upgrade.
Weak feedback signals
Production traces rarely include ground-truth outcomes. Self-reported success, user thumbs, and downstream proxies are noisy and game-able, so gradient direction is unreliable even when a learning loop exists.
Catastrophic drift
Online updates optimized against noisy rewards routinely degrade previously-solid behavior. Without regression harnesses comparable to pretraining evals, each 'improvement' risks silent capability loss.
Nearest precedent: Voyager, Reflexion, STaR-style self-training — bounded domains, no durable identity carried across sessions.
RossLabs synthesis
Hierarchical Cognitive Organizations
Deeply-layered agent organizations where strategy at the top decomposes into tactics in the middle and execution at the leaves — coherent over long horizons, like a functional org chart that actually works.
Planning horizons
Current models lose coherence beyond a few dozen dependent steps. Multi-week plans fragment into locally plausible but globally inconsistent subgoals, and no layer can reliably detect the divergence.
Constraint enforcement
Upper layers cannot reliably bind lower layers to their intent. Sub-agents routinely rationalize around constraints rather than respect them, and the parent has no privileged channel to verify compliance.
Weak world models
Hierarchy presumes the top layer can reason about consequences in the environment. LLMs still lack the durable causal models needed for long-horizon planning to stay grounded rather than drift into narrative.
Nearest precedent: MetaGPT, ChatDev, AutoGen hierarchical teams — impressive demos, brittle under load.
RossLabs synthesis
Distributed Reasoning Networks
Large networks of agents collectively reasoning over shared problems — no single coordinator, intermediate results flowing through the graph until consensus emerges.
Memory infrastructure
Shared blackboards, vector stores, and message queues are not yet engineered for the throughput, consistency, and provenance guarantees a reasoning network requires. Current infra is built for request/response, not continuous collective cognition.
Observability
When a distributed reasoning network produces a wrong answer, there is no tractable way to attribute the error. Traces explode combinatorially and existing tracing tools were built for microservices, not for semantic causality.
Compute cost
Every additional node multiplies inference spend. Without order-of-magnitude cost reductions or drastic model specialization, the economic floor for a meaningful network is already above most budgets.
Nearest precedent: Research demos on graph-of-thought and debate ensembles — small node counts, short horizons.
RossLabs synthesis
Benchmarks
Six benchmarks scored across the CLASSic framework — Cost, Latency, Accuracy, Security, Stability.
SWE-bench
2,294 real GitHub issues across 12 popular Python repositories
Measures whether agents can resolve real bug reports and feature requests end-to-end against a repository's own test suite.
Top score: SWE-bench Verified >65% (2026)
Key finding: 50× cost variation observed for equivalent accuracy (Kapoor et al., Princeton TMLR 2025).
CLASSic
Cost: High — full pytest runs per iteration
Latency: Minutes per task
Accuracy: Primary axis of comparison
Security: N/A — sandboxed containers
Stability: Contamination-sensitive; see SWE-bench-Live
WebArena
812 long-horizon tasks across 4 self-hosted, fully reproducible web apps (e-commerce, forum, dev, CMS)
Measures whether agents can complete realistic multi-step web tasks in a deterministic, offline-replicable environment.
Top score: Top agents ~40–50% task success (2025)
Key finding: Gap between LLM 'knows the steps' and 'executes them reliably on a live DOM' remains the dominant failure mode.
CLASSic
Cost: Medium — many DOM observations per task
Latency: Tens of seconds to minutes per task
Accuracy: Exact end-state matching
Security: Sandboxed, self-hosted apps
Stability: High — fully reproducible snapshots
AgentBench
8 distinct environments spanning OS, DB, knowledge graph, card game, web shopping, web browsing, household, and code
Cross-domain evaluation of LLM-as-agent across reasoning, tool use, and decision-making in heterogeneous environments.
Top score: Frontier closed models significantly outperform open models on multi-turn tasks
Key finding: Persistent gap between closed and open models in long-horizon instruction following, even when single-turn scores are comparable.
CLASSic
Cost: Variable by environment
Latency: Seconds to minutes per task
Accuracy: Per-environment success metrics
Security: Sandboxed per environment
Stability: Mixed — some environments stochastic
GAIA
466 real-world assistant questions requiring reasoning, multi-modality, web browsing, and tool use
Tests general AI assistant competence on tasks easy for humans but hard for models — conceptually simple but execution-heavy.
Top score: Top agents ~70%+ on Level 1, sharply lower on Levels 2–3 (2025)
Key finding: Humans score ~92% with no training; frontier agents still drop steeply as task depth increases.
CLASSic
Cost: Medium-high — web and tool calls per question
Latency: Variable, often minutes per question
Accuracy: Exact-match answer
Security: Live web — read-only
Stability: Anchored to frozen question set
WorkArena
33+ enterprise knowledge-worker tasks on a live ServiceNow instance (forms, lists, dashboards, workflows)
Measures agent competence on realistic enterprise SaaS workflows that knowledge workers actually perform day-to-day.
Top score: Frontier agents succeed on basic tasks but drop sharply on compositional workflows
Key finding: Enterprise UIs expose the gap between web-browsing agents and true workflow execution; small UI variations break otherwise capable agents.
CLASSic
Cost: Medium — live enterprise UI interaction
Latency: Tens of seconds to minutes per task
Accuracy: Task-completion verification via instance state
Security: Sandboxed ServiceNow dev instances
Stability: Medium — depends on instance version
SWE-bench-Live
Continuously refreshed stream of recent GitHub issues, replacing the frozen SWE-bench snapshot
Addresses training-set contamination in SWE-bench by evaluating agents on issues that post-date model training cutoffs.
Top score: Scores typically lower than SWE-bench Verified, exposing contamination deltas
Key finding: Gap between SWE-bench and SWE-bench-Live quantifies how much reported agent performance is memorization vs. genuine repair capability.
CLASSic
Cost: High — same pytest-per-iteration profile as SWE-bench
Latency: Minutes per task
Accuracy: Per-issue test pass/fail
Security: Sandboxed containers
Stability: Intentionally non-stationary — that is the point
12-Factor Agent Principles
Production discipline for agentic systems.
I
Natural Language to Tool Calls
The job of the model is to turn natural language into structured tool calls, not to run the program.
Treat the LLM as a translator from intent to a typed action, then hand that action to deterministic code. Keeping the model on the translation side of the line is what makes agent behavior testable, observable, and replayable.
Anti-pattern: Letting the model emit free-form prose that downstream code has to parse with regex and vibes.
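A minimal sketch of keeping the model on the translation side of the line, with an illustrative two-tool schema: the model's only job is to emit JSON that validates, and deterministic code owns execution:

```python
import json

# The only thing we ask of the model: emit an action matching one of these shapes.
TOOL_SCHEMAS = {
    "create_ticket": {"required": ["title", "priority"]},
    "send_email": {"required": ["to", "subject", "body"]},
}

def parse_action(model_output: str) -> dict:
    """Validate the model's structured output; reject anything that doesn't match a known tool."""
    action = json.loads(model_output)
    schema = TOOL_SCHEMAS.get(action.get("tool"))
    if schema is None:
        raise ValueError(f"unknown tool: {action.get('tool')!r}")
    missing = [k for k in schema["required"] if k not in action.get("args", {})]
    if missing:
        raise ValueError(f"missing args for {action['tool']}: {missing}")
    return action

def execute(action: dict) -> str:
    """Deterministic code owns execution; the model never runs the program."""
    if action["tool"] == "create_ticket":
        return f"created ticket '{action['args']['title']}' at priority {action['args']['priority']}"
    if action["tool"] == "send_email":
        return f"queued email to {action['args']['to']}"
    raise AssertionError("unreachable: parse_action already validated the tool name")

raw = '{"tool": "create_ticket", "args": {"title": "Fix login bug", "priority": "high"}}'
print(execute(parse_action(raw)))
```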
II
Own Your Prompts
Prompts are production code. Write them, version them, diff them, and test them like any other critical system.
Hidden framework prompts, autogenerated templates, and 'smart' wrappers make behavior impossible to reason about. The prompts your model actually sees should live in your repo, under review, with the same rigor as application code.
Anti-pattern: Relying on a framework's default prompt and shipping whatever it happens to emit this release.
III
Own Your Context Window
You, not the framework, decide what goes into the context window on every turn.
Context is the agent's short-term world model. Leaving its construction to a framework black box means you cannot reproduce behavior, cannot audit failures, and cannot optimize cost. Build the window explicitly from typed inputs you control.
Anti-pattern: Auto-appending full chat history, all tool outputs, and retrieved docs without a deliberate shaping step.
IV
Tools Are Just Structured Outputs
A tool call is nothing more than the model producing a structured output that your code chooses to act on.
Demystifying tools removes the magic: the model emits typed JSON, a handler validates and executes it, and the result is fed back as context. Framing everything this way lets you add, remove, or swap tools without reaching for a new abstraction layer.
Anti-pattern: Treating tool-calling as a special model feature that requires a framework-specific runtime to manage.
V
Unify Execution State and Business State
The agent's execution state and the application's business state should live in the same durable store.
Splitting 'where the agent is in its loop' from 'what the business thinks happened' guarantees drift. Writing both to one transactional store makes pause, resume, retry, and audit trivial, and lets you reason about the system as a state machine instead of a ghost.
Anti-pattern: Keeping agent progress in memory or ephemeral framework objects while business records live in a database.
VI
Launch / Pause / Resume With Simple APIs
Agents should be startable, pausable, and resumable through plain HTTP-style APIs, not bespoke runtimes.
Treating an agent run as a resource you create, inspect, and continue — with durable state behind it — gives you the operational surface every other production system already has: retries, timeouts, scaling, and human intervention, all without special tooling.
Anti-pattern: A long-running in-process loop that cannot be stopped, inspected, or resumed after a crash.
VII
Contact Humans With Tool Calls
Asking a human for input is just another tool call — model it that way.
When the agent needs approval, clarification, or review, emit a structured 'ask_human' call with a typed payload. The human response becomes a typed result that flows back into context like any other tool output, so human-in-the-loop is a first-class part of the control flow, not a bolted-on escape hatch.
Anti-pattern: Pausing via side-channel UI widgets whose responses never re-enter the agent's formal state.
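A sketch of human contact modeled as a tool call, with an illustrative `ask_human` payload and a pluggable transport standing in for Slack or email; the typed answer re-enters the loop like any other tool result:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolResult:
    tool: str
    content: dict

def ask_human(args: dict, transport: Callable[[str], str]) -> ToolResult:
    """Emit a structured question; the typed answer comes back as an ordinary tool result."""
    prompt = f"[approval needed] {args['question']} options={args.get('options')}"
    answer = transport(prompt)           # in production: Slack, email, or a webhook, not stdin
    return ToolResult(tool="ask_human", content={"answer": answer})

def handle_tool_call(call: dict, transport: Callable[[str], str]) -> ToolResult:
    """ask_human is dispatched exactly like any other tool in the control flow."""
    if call["tool"] == "ask_human":
        return ask_human(call["args"], transport)
    raise ValueError(f"unknown tool: {call['tool']}")

call = json.loads('{"tool": "ask_human", "args": {"question": "Deploy to prod?", "options": ["yes", "no"]}}')
result = handle_tool_call(call, transport=lambda prompt: "yes")   # simulated human approval
print(result)   # the answer flows back into the agent's context like any other tool output
```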
VIII
Own Your Control Flow
You write the loop. The model advises; your code decides.
Framework-owned control flow hides the most load-bearing logic in your system: when to call a tool, when to retry, when to stop, when to escalate. Keeping control flow in code you own makes behavior debuggable, testable, and portable across models.
Anti-pattern: Handing the outer loop to a framework and hoping its heuristics match your business rules.
IX
Compact Errors Into Context Window
Errors are signal. Summarize them, feed them back, and let the model self-correct instead of crashing.
A 4KB stack trace is noise; a one-line compacted error ("HTTP 429 from api/x, retry after 30s") is something the model can act on. Treating exceptions as just another tool result — in a compact, typed form — turns most transient failures into recoverable steps.
Anti-pattern: Letting raw exceptions bubble up and kill the run, or dumping full stack traces into the prompt.
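A sketch of compacting an exception into something the model can act on instead of a crashed run; the one-line format is an illustrative choice:

```python
import traceback

def compact_error(exc: Exception, max_chars: int = 200) -> str:
    """Reduce an exception to a one-line, typed summary the model can reason about."""
    last_frame = traceback.extract_tb(exc.__traceback__)[-1] if exc.__traceback__ else None
    where = f"{last_frame.filename}:{last_frame.lineno}" if last_frame else "unknown"
    summary = f"error={type(exc).__name__} msg={exc} at={where}"
    return summary[:max_chars]

def run_tool(handler, args: dict, context: list[str]) -> None:
    """Errors become just another tool result in context, not a killed run."""
    try:
        context.append(f"tool_result: {handler(**args)}")
    except Exception as exc:
        context.append(f"tool_error: {compact_error(exc)}")   # model sees this and can retry or adapt

context: list[str] = []
run_tool(lambda url: 1 / 0, {"url": "api/x"}, context)
print(context)   # ['tool_error: error=ZeroDivisionError msg=division by zero at=...']
```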
X
Small, Focused Agents
Each agent owns a single, narrow responsibility — then you compose them.
Production evidence shows monolithic agents achieve roughly 40% success on complex tasks while focused-and-composed agents reach ~95% at 20–30% context each. Scope discipline is the single biggest lever on reliability, and it costs nothing except restraint.
Anti-pattern: A 'god agent' that plans, executes, verifies, and reports in one ever-growing prompt.
XI
Trigger From Anywhere, Meet Users Where They Are
Agents should be reachable from Slack, email, CLI, cron, webhooks — wherever work actually originates.
Pinning an agent to a single UI wastes most of its value. If the trigger surface is just 'post a structured event to the agent's API', every new channel is a thin adapter rather than a rewrite, and users get help in the tools they already live in.
Anti-pattern: Building a dedicated chat UI and requiring users to context-switch into it for every interaction.
XII
Make Your Agent a Stateless Reducer
An agent step is a pure function: (state, event) → (new state, actions). Keep it stateless and push durability to the store.
Modeling each turn as a reducer makes the system replayable, testable, and horizontally scalable. State lives in the database; the agent process is disposable. This is the property that turns agent runs from fragile long-lived processes into ordinary distributed systems.
Anti-pattern: Stateful in-memory agent objects whose behavior depends on how long they've been alive.
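A sketch of the reducer shape with an illustrative state dict: each step is a pure function of durable state plus an incoming event, and the caller persists the new state before dispatching actions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str        # e.g. "user_message", "tool_result", "timer"
    payload: dict

def step(state: dict, event: Event) -> tuple[dict, list[dict]]:
    """(state, event) -> (new_state, actions). No hidden instance variables, no wall-clock reads."""
    history = state.get("history", []) + [event.kind]
    actions: list[dict] = []
    if event.kind == "user_message":
        actions.append({"tool": "plan", "args": {"text": event.payload["text"]}})
    elif event.kind == "tool_result":
        actions.append({"tool": "respond", "args": {"summary": event.payload["output"]}})
    new_state = {**state, "history": history}
    return new_state, actions       # caller persists new_state, then dispatches actions

# Replayable by construction: the same event log always rebuilds the same state.
state: dict = {}
for ev in [Event("user_message", {"text": "ship it"}), Event("tool_result", {"output": "tests green"})]:
    state, acts = step(state, ev)
print(state["history"])   # ['user_message', 'tool_result']
```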
Architecture Timeline
Five eras of agent architecture, 2015 to today, plus the projected road ahead.
2015–2017
RL Pioneers
Goal-directed agents in closed environments. Reinforcement learning produces the first systems that plan and act toward objectives.
AlphaGo defeats Lee Sedol
DQN (Deep Q-Networks)
OpenAI Gym standardizes RL benchmarks
2017–2020
Transformer
Scalable reasoning emerges from attention. The foundation for every subsequent agent architecture lands in a single 2017 paper.
Attention Is All You Need (2017)
BERT (2018)
GPT-2 / GPT-3 (2019–2020)
2021–2023
First LLM Agents
Chain-of-thought plus tool use creates the first real agent primitives. The field discovers that loops + prompts + tools = planning.
ReAct reasoning framework
Toolformer tool integration
AutoGPT, BabyAGI autonomous loops
2023–2025
Framework Explosion
Multi-agent orchestration matures. Frameworks compete to codify 'collective wisdom' from early production experience.
A2A and MCP become de facto standards. Vertical agents dominate specific domains — legal, finance, healthcare, code. Enterprise trust frameworks mature. Type III orchestrated teams reach 40%+ adoption. The phase where 'agents' stops being a buzzword and becomes infrastructure.
Mid · 3–5 years · 2027–2029
Agent Economies
Type IV Networked Fabrics reach production. Agent-to-agent marketplaces emerge where agents transact on behalf of principals. Multi-agent governance becomes a regulatory requirement in high-stakes domains. The human role shifts from operator to strategic auditor.
Far · 5+ years · 2029+
AGI Through Agents
Type V Autonomous Ecosystems enter narrow production. Self-improving systems raise safety considerations current frameworks weren't built for. International governance for autonomous AI networks becomes a treaty-level concern. AGI may emerge through multi-agent collaboration at scale — not via single model breakthroughs.
Points of Consensus
Five things every major source agrees on — the stable foundation underneath the debates.
01
Minimal agent = LLM + tools in a loop
Every major source converges on the same primitive. Strip away frameworks and you're left with a language model that calls tools and iterates on the results.
Anthropic, Chip Huyen, and Harrison Chase emphasize this independently. Most 'complex agent' projects succeed only after they're simplified. The default move should be subtraction.
Anthropic engineering guidance · Chip Huyen — AI Engineering · Harrison Chase (LangChain)
03
Evaluations are the single biggest predictor of execution quality
Without evals you can't tell whether a change improved things. Andrew Ng frames evals as the difference between teams that ship and teams that ship regressions.
Andrew Ng · Anthropic evaluation guidance
04
Compound errors are the fundamental math challenge
95% accuracy per step becomes 0.6% accuracy over 100 steps. Multi-step agents live or die by error recovery — there is no 'good enough' step accuracy that survives long horizons.
Chip Huyen compound error analysis
05
Tool design matters as much as prompt design
Anthropic's tool-testing agent achieved 40% completion time reduction just by rewriting tool descriptions. Tools deserve the same engineering rigor as system prompts.
Anthropic tool-testing agent research
Sources & Research
Every citation used on this page, grouped by category.