Research Reference

Agent Architectures

A governing pyramid, five architecture types, the agent harness, lab deep dives, memory systems, failure modes, architectural limits, benchmarks, and the 12-Factor principles for building production agents.

Governing Pyramid

Three pillars hold the field up: foundation, middle, apex.

foundation

The Harness, Not the Model, Is the Product

Production agents are 70%+ harness: context, memory, tools, verification, observability. The model is the CPU; the harness is the OS.

Frontier capability gains plateau quickly without the surrounding scaffolding that turns a model into a system. Claude Code, Cursor, and Anthropic's research agents all show that the long tail of reliability — context curation, tool ergonomics, verification loops, trace observability — dominates end-to-end performance, and that swapping models inside a strong harness produces smaller deltas than swapping harnesses around a strong model.
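The harness-over-model claim can be made concrete with a minimal sketch. Everything here (the `Harness` class, the stub model, the tool signatures) is an illustrative assumption, not any lab's actual code; it only shows the division of labor the text describes: the harness owns context curation, tool dispatch, and verification, while the model is one pluggable part.

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Minimal agent harness: the model is the CPU, the harness is the OS."""
    model: callable                       # pluggable "CPU"
    tools: dict = field(default_factory=dict)
    history: list = field(default_factory=list)
    max_steps: int = 5

    def curate_context(self):
        # Context curation: only the most recent turns reach the model.
        return self.history[-4:]

    def verify(self, result):
        # Verification loop: reject empty or missing tool output.
        return result is not None and result != ""

    def run(self, task):
        self.history.append(("task", task))
        for _ in range(self.max_steps):
            action = self.model(self.curate_context())   # model proposes
            if action["type"] == "final":
                return action["answer"]
            result = self.tools[action["tool"]](action["args"])  # harness dispatches
            if not self.verify(result):                  # harness verifies
                result = "TOOL_ERROR: retry with different args"
            self.history.append(("observation", result))
        return "step budget exhausted"

def stub_model(context):
    # Deterministic stand-in for a real model, for demonstration only.
    if any(role == "observation" for role, _ in context):
        return {"type": "final", "answer": context[-1][1]}
    return {"type": "tool", "tool": "add", "args": (2, 3)}

h = Harness(model=stub_model, tools={"add": lambda args: str(sum(args))})
```

Swapping `stub_model` for a stronger one changes nothing about the loop; swapping the loop changes everything about reliability, which is the point of the pillar.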

middle

Coordination Reliability Beats Model Capability

Once each agent is capable enough, the bottleneck shifts to how agents share context, hand off work, and recover from each other's failures. Coordination is the new scaling axis.

The MAST failure taxonomy (arXiv 2503.13657) finds that over 70% of multi-agent failures are systemic (miscommunication, context loss, role confusion) rather than single-agent reasoning errors. Anthropic's multi-agent research system shows that disciplined orchestrator-worker coordination can yield a 90.2% improvement, while Cognition's counter-argument warns that naive multi-agent setups fragment context and degrade below a strong single agent. Either way, the decisive variable is coordination design, not raw model IQ.

Three counterpoints emerge from primary sources, however. First, Anthropic's own data shows token usage explains 80% of performance variance: raw compute, not coordination finesse, is the primary driver, and upgrading from Sonnet 3.7 to Sonnet 4 delivered a larger gain than doubling the token budget. Second, Google Research found a capability saturation threshold at roughly 45%: once a single agent exceeds this baseline, adding coordination yields diminishing returns. Third, Deloitte and McKinsey identify organizational readiness (process redesign, governance, data architecture) as the primary blocker rather than technical coordination; only 30% of organizations reach maturity level 3+ in agentic AI governance.
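The orchestrator-worker discipline referred to above can be sketched in a few lines. This is a hedged illustration, not Anthropic's system: the worker names and the shared-context shape are assumptions. It shows the two coordination properties the text names: explicit context handoff (no hidden conversation state between workers) and recovery from a worker's failure.

```python
def orchestrator(task, workers):
    """Decompose a task, dispatch to workers with explicit context, merge results."""
    subtasks = [(name, f"{task}: handle the {name} portion") for name in workers]
    results = {}
    for name, subtask in subtasks:
        # Explicit handoff: each worker receives the subtask plus shared
        # context, so no worker depends on another's private state.
        shared_context = {"original_task": task, "done_so_far": dict(results)}
        try:
            results[name] = workers[name](subtask, shared_context)
        except Exception as exc:
            # Recover from a worker failure instead of poisoning the run.
            results[name] = f"FAILED ({exc}); orchestrator will reassign"
    return results

def flaky_worker(subtask, context):
    raise RuntimeError("timeout")

report = orchestrator("survey agent benchmarks", {
    "search": lambda sub, ctx: f"notes for '{ctx['original_task']}'",
    "flaky": flaky_worker,
})
```

The MAST failure classes map directly onto lines here: dropping `shared_context` reintroduces context loss, and dropping the `except` branch lets one worker's failure become a systemic one.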

apex

Focused-First, General-Later

The agents that ship are narrow, well-scoped, and composable. Generality emerges from composition of focused agents, not from one omni-agent that does everything.

12-Factor Agents Factor III ('One agent, one job') and Anthropic's simplicity thesis both argue that scope discipline is what separates demos from production. Focused agents are easier to evaluate, debug, and sandbox; they fail in smaller ways and compose into larger systems via explicit protocols like A2A or MCP. The pattern across Claude Code sub-agents, Devin, and LangChain Deep Agents is the same: a top-level planner dispatches to single-purpose workers rather than asking one agent to be universally competent.
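The planner-dispatches-to-focused-workers pattern can be reduced to a toy sketch. The registry, agent names, and fixed pipeline below are illustrative assumptions (not Claude Code's, Devin's, or LangChain's actual mechanism); the point is that each agent does one job and generality comes from composing them.

```python
AGENT_REGISTRY = {}

def agent(name):
    """Register a narrow, single-purpose agent under an explicit name."""
    def wrap(fn):
        AGENT_REGISTRY[name] = fn
        return fn
    return wrap

@agent("extract")
def extract_numbers(text):
    # One job: pull integers out of free text. Easy to evaluate in isolation.
    return [int(tok) for tok in text.split() if tok.isdigit()]

@agent("sum")
def sum_numbers(numbers):
    # One job: arithmetic over a list. Fails in a small, legible way.
    return sum(numbers)

def planner(task_text):
    # Top-level planner: a pipeline of focused agents, not one omni-agent.
    # Each step can be tested, sandboxed, and swapped independently.
    numbers = AGENT_REGISTRY["extract"](task_text)
    return AGENT_REGISTRY["sum"](numbers)
```

In a real system the registry entries would be sub-agents behind an explicit protocol (A2A, MCP) rather than local functions, but the scope discipline is the same.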