Context Engineering for Commercial Agent Systems
Memory, Isolation, Hardening, and Multi-Tenant Context Infrastructure
When you build agents for a single user on a laptop, almost anything works.
When you build commercial multi-tenant agents serving enterprises, almost nothing accidental survives.
Since late 2024, I’ve spent most of my time building and optimizing commercial, multi-tenant agent systems with a team of engineers inside a large SaaS platform serving hundreds of enterprise customers. We stress-tested agents under real constraints: tenant isolation, financial accuracy, auditability, cost control, and scale.
When I wasn’t working on those systems, I was consulting with other teams, collaborating with peers building agent platforms, and running independent experiments and side projects to pressure-test orchestration models, retrieval architectures, evaluation harnesses, and model upgrades under similarly real-world conditions.
Different environments. Same constraints.
We evaluated and implemented systems across direct foundation model orchestration, managed agent runtimes, multi-agent coordination patterns, graph-oriented orchestration, and MCP integrations.
In parallel, we architected a multi-tenant semantic layer over economically material enterprise cost data. Not RAG over documents, but a deterministic parsing and interpretation engine performing canonical identity modeling, entity resolution, ontology alignment, and context-aware re-ranking with full provenance. Underneath it: hybrid search, columnar analytics, vector stores, and AI-native data models.
We built replay-based regression harnesses. We instrumented token cost and execution traces per run. We fine-tuned and distilled local models for latency and cost optimization.
We heavily utilized production agentic coding systems like Claude Code and Cursor, studying how they handle production-grade context management: pruning between turns, isolating workspaces, externalizing heavy operations, aggressively controlling context surfaces. The same patterns we were engineering, they were hardening in the wild.
Across all of this, a clear convergence emerged.
Models improve. APIs standardize. Tool use matures.
But in commercial multi-tenant systems, those are not the determining factors.
Context is.
What follows is not theory. It emerged from debugging memory drift across tenants. From tightening isolation guarantees that survive scale. From building promotion gates to prevent memory poisoning. From implementing retention under enterprise compliance. From learning that token economics only become predictable when context is disciplined.
This is not just one team’s experience. The same architectural pressures are visible across production systems like Claude Code, Cursor, Letta, AWS AgentCore, and others. Implementations differ. The convergence is real.
The systems that survive production pressure consistently share the same traits:
- Typed memory instead of transcript blobs
- Separation between canonical truth and derived acceleration layers
- Explicit promotion and compaction gates
- Trace envelopes for replay and audit
- Aggressive pruning between turns
- Isolation boundaries enforced as security boundaries
- Cost surfaces treated as first-class signals
The systems that fail optimize for demo velocity. They index raw transcripts. They reuse context wholesale. They blur truth and acceleration. They discover cost, drift, and cross-tenant risk too late.
This guide synthesizes:
- Lived experience building commercial multi-tenant agents
- Public architectural signals from mature agent systems
- Explicit engineering rules for context discipline
In commercial multi-tenant systems, context is not an implementation detail.
It is infrastructure.
And infrastructure requires engineering discipline.
IMPORTANT NOTE
This guide is NOT canon. No single person or team’s experience is. What it represents is accumulated knowledge from building under real constraints, knowledge that appears to converge with architectural decisions the broader ecosystem is arriving at independently. Where we align with production systems like Claude Code or AWS AgentCore, it is not because we copied their work. It is because the same pressures produce the same load-bearing patterns. Treat this as a field guide, not a specification.
The code blocks in this guide are pseudo-code expressing architectural invariants, not implementation details. They illustrate structural contracts that must hold regardless of language or framework.
Part I: Context Is Infrastructure
The Three Non-Negotiables of Commercial Agent Systems
- Structural Isolation
- Deterministic Replay
- Economic Predictability
These are not optional features. They are architectural constraints.
If your system cannot enforce tenant boundaries structurally, it will eventually leak.
If your system cannot run deterministic replays, you cannot debug or evolve it safely.
If your system cannot predict cost per run, it cannot scale sustainably.
Models Are Commoditized. Context Is Not.
Everyone has access to the same frontier models. Claude. GPT. Gemini. The APIs are public. The prices are falling. Capabilities are converging.
What differentiates systems is no longer the model.
It's the context.
Two teams using the same model can produce radically different outcomes depending on how they externalize knowledge and constrain behavior.
That's true for coding assistants. It's even more true for commercial agent systems.
Because in commercial systems, context isn't just about output quality. It's about:
- Policy compliance: did the agent follow the rules?
- Data isolation: can Tenant A's data leak into Tenant B's context?
- Cross-tenant safety: does shared infrastructure introduce shared risk?
- Cost control: can you predict and bound what each run costs?
- Replay and auditability: can you reconstruct what the agent knew at decision time?
- Correctness under ambiguity: does the system degrade gracefully or silently?
If you get context wrong, you don't just get worse answers. You get silent corruption.
Observed Ecosystem Convergence
Context as Competitive Surface
Pattern: Context assembly is becoming a competitive differentiator.
- Bolt frames token efficiency and richer context as economic levers, not just model inputs.
- Replit describes injecting minimal diagnostic signals instead of dumping full logs into context.
The ecosystem is optimizing context selection, not just model choice.
Decisions as First-Class Records
There's a shift happening beneath the surface that most teams haven't fully internalized.
Enterprise software historically captured objects: customers, invoices, tickets, accounts. Systems of record persisted state.
Agent systems introduce something new: decisions.
- Why was this exception approved?
- Which policy version applied?
- What precedent influenced this action?
- What context was visible at decision time?
We're not fully at “context graphs” yet. But we can't get there unless we build the foundations now:
- Append-only trace logs
- Scoped memory promotion
- Provenance tracking
- Replayable context assembly
Context management isn't just about controlling what the model sees. It's about observing and recording how context influences action.
Without traceability, you cannot optimize.
Without replay, you cannot debug.
Without provenance, you cannot trust durable memory.
The foundation for future “context graphs” is not speculation. It's disciplined trace capture starting today.
Observed Ecosystem Convergence
Decisions and Traces Are First-Class Primitives
Pattern: Platforms are standardizing structured decision records and trace envelopes.
- Anthropic's evals guidance formally defines "transcript/trace/trajectory" as the complete record of a run including outputs, tool calls, and intermediate results.
- OpenAI's Responses API assigns durable response IDs, supports `previous_response_id` threading, and exposes explicit metadata and usage fields.
- Claude Code exports structured telemetry via OpenTelemetry, making execution observable at the run level.
Traceability is becoming infrastructure, not instrumentation added later.
Architectural Principle
Non-Negotiables, Not Features
Context boundaries are infrastructure, not application logic.
- Isolation is structural, not prompt-based.
- Replay is the baseline for trust.
- Economics must be predictable per run.
Part II: Memory as a Scoped, Typed System
Memory Must Be Scoped and Typed
Before we talk about storage or retrieval, we need a shared vocabulary.
The word “memory” is overloaded. So is “context.”
If you don’t explicitly define both scope and type, you end up with a single undifferentiated blob store. And that’s where security and correctness failures begin.
A robust commercial agent system classifies memory along two dimensions:
- Scope: who can see it (a security boundary)
- Type: what kind of memory it is (semantic role)
In practice, every production system we evaluated converged on some form of this separation.
This is not academic modeling. It is operational survival.
Memory Scopes (Security Boundaries)
These are structural isolation layers. They are enforced at the storage and routing layers, never delegated to the model.
Global Scope: Platform-Wide, Tenant-Invariant Memory
- System safety rules
- Tool contracts
- Product ontology
- Agent “constitution”
Properties:
- Immutable at runtime
- Versioned
- Deployment-controlled
- Never writable by agents
Tenant Scope: Organization-Wide Shared Memory
- Organization policies
- Knowledge bases
- Playbooks
- Connector configurations
Properties:
- Shared across users in a tenant
- Policy-gated promotion
- Subject to tenant retention rules
This is where governance lives. It is also where poisoning risk becomes systemic if not handled correctly.
User Scope: Personalized Memory Within a Tenant
- Preferences
- Working style
- Personal notes
- User-specific entitlements
Properties:
- Visible only to the user (and system)
- Promotion-gated
- TTL or policy-based retention
Cursor’s user-local memory posture reflects this discipline. User state stays user-scoped by default.
Session Scope: Ephemeral Runtime State
- Tool outputs
- Intermediate plans
- Scratch buffers
- Temporary retrieval results
Properties:
- Short-lived
- Subject to aggressive garbage collection
- Not durable unless explicitly promoted
Observed Ecosystem Convergence
Memory Scopes Converge on Hierarchical, File-Like Policy
Pattern: Memory is scoped and layered, not dumped into a single store.
- Claude Code implements multiple memory layers including managed policy, project memory, modular rules, user memory, local project memory, and auto-memory, all with deterministic precedence and on-demand loading.
- Cursor describes "Rules" with explicit scopes (project vs user) and multiple activation modes (Always Apply, Apply Intelligently, Apply to Specific Files, Apply Manually).
- Windsurf implements location-based scoping via AGENTS.md where subdirectory placement defines the scope boundary.
- Warp organizes durable context as typed artifacts (Workflows, Notebooks, Rules, MCP Servers, etc.) rather than raw transcript.
Scope boundaries are becoming filesystem conventions, not prompt-level suggestions.
Memory Type (Semantic Role)
Memory types cut across scopes. Scope defines who can see it. Type defines what the memory represents and how it is expected to behave.
- Policy memory: normative rules and constraints. Typically global or tenant-scoped. Versioned and tightly controlled.
- Preference memory: stable personalization parameters. Usually user-scoped.
- Fact memory: durable assertions the agent may reuse. Must include provenance.
- Episodic memory: structured summaries of completed work ("Case resolved." "Migration completed." "Exception granted."). Reusable artifacts extracted from traces.
- Trace memory: raw, append-only execution events. This is your flight recorder.
The most common failure mode in early systems is allowing episodic or fact memory to silently drift into policy memory.
That is how precedent poisoning begins.
Observed Ecosystem Convergence
Memory Classification Is Becoming Explicit
Pattern: Memory types (policy, preference, episodic, fact) are being separated structurally.
- Letta's AI Memory SDK exposes labeled memory blocks (human, summary, policies, history, preferences) backed by per-subject agent state.
- Amazon Bedrock AgentCore separates short-term memory (raw interaction events) from long-term memory (structured records extracted across sessions) with semantic retrieval APIs.
- Bolt separates ephemeral chat history from durable "Project Knowledge," instructing users to promote constraints into the dedicated durable channel.
The more you know
A Simple Rule
Memory without scope is exposure. Memory without type is entropy.
Once you have both, you can start engineering context deliberately.
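Expressed in the same pseudo-code spirit as the rest of this guide, the two dimensions might be declared like this (enum values and fields are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from enum import Enum

class Scope(str, Enum):
    GLOBAL = "global"    # platform-wide, tenant-invariant
    TENANT = "tenant"    # shared within one organization
    USER = "user"        # personal to one user inside a tenant
    SESSION = "session"  # ephemeral runtime state

class MemoryType(str, Enum):
    POLICY = "policy"
    PREFERENCE = "preference"
    FACT = "fact"
    EPISODE = "episode"
    TRACE = "trace"

@dataclass(frozen=True)
class MemoryRecord:
    memory_id: str
    tenant_id: str
    scope: Scope              # who can see it: the security boundary
    memory_type: MemoryType   # what it represents: the semantic role
    content_ref: str          # pointer into the object store, never an inline blob
```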
Memory Layer Summary:
| Layer | Scope | Typical Contents | Retention | Write Policy | Canonical Store |
|---|---|---|---|---|---|
| Constitution | Global | Safety rules, tool contracts, ontologies | Versioned | Write-locked | Artifact registry (versioned policy bundles) + object store |
| Org memory | Tenant | Playbooks, knowledge base, connectors, norms | Policy-based | Gated promotion | Structured memory store |
| Personal memory | User | Preferences, working style, drafts | TTL-based + user controls | Gated promotion | Structured memory store |
| Runtime state | Session | Tool outputs, scratch space, intermediate plans | Hours–days | Auto GC | Ephemeral cache (working set); trace captures events/references |
| Episodes | User / Tenant | “Case resolved,” “refactor complete,” derived summaries | Months+ | Explicit promotion | Structured memory store |
| Traces | Tenant (partitioned by session/run) | Events, retrievals, tool calls, approvals | Policy-based (often long-lived) | Append-only | Event log |
Note how:
- Session memory is volatile
- User memory is semi-durable
- Tenant memory is high-stakes
- Global memory is immutable
Each layer carries different isolation risk and promotion risk.
What We Learned
Tenant Configuration Needed an Override Layer
We initially treated tenant-scoped configuration as the single source of truth for all users within a tenant. It worked well when needs were uniform.
As adoption grew, users needed to adjust specific settings without changing the tenant-wide baseline. Without partial overrides, every exception became either a tenant mutation or a workaround.
We introduced a resolution layer. Tenant configuration remained canonical, but user-scoped records could shadow specific fields. Reads resolved through explicit precedence rules, with tenant state authoritative and user preferences layered on top.
Scope is not just about visibility. It defines resolution order when multiple layers have opinions about the same setting.
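A minimal sketch of that resolution order, assuming hypothetical `tenant_config` and `user_overrides` maps and an explicit whitelist of fields users may shadow:

```python
def resolve_config(tenant_config: dict, user_overrides: dict, allowed_override_keys: set[str]) -> dict:
    """Tenant state is authoritative; user-scoped records may shadow specific whitelisted fields."""
    resolved = dict(tenant_config)  # tenant baseline is canonical

    for key, value in user_overrides.items():
        # Only explicitly allowed fields may be shadowed per user.
        if key in allowed_override_keys and key in tenant_config:
            resolved[key] = value

    return resolved

# Usage: the tenant default stands unless the user has an approved override.
config = resolve_config(
    tenant_config={"tone": "formal", "auto_approve_limit": 0},
    user_overrides={"tone": "casual", "auto_approve_limit": 500},
    allowed_override_keys={"tone"},
)
assert config == {"tone": "casual", "auto_approve_limit": 0}
```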
Why Scoped + Typed Memory Changes Everything
Without scope boundaries:
- Cross-tenant contamination becomes possible
- Retrieval filters become advisory
- Privacy guarantees degrade
Without type boundaries:
- Precedents become directives
- Facts become policies
- Session artifacts become durable memory
Scoped + typed memory is the minimum viable structure for safe autonomy.
It allows:
- Isolation enforcement
- Promotion gating
- Retention control
- Cost modeling by layer
- Evaluation at the run level
And most importantly: it prevents a single undifferentiated memory surface from quietly becoming a liability.
Architectural Principle
Name the Boundary
Memory must be explicitly typed and scoped.
- Scope defines the security boundary.
- Type defines behavioral semantics.
- If either is ambiguous, drift becomes structural.
Part III: Truth vs Acceleration
Once you define a memory taxonomy, the next mistake most systems make is collapsing storage into a single layer.
Everything goes into:
- A vector database
- A document store
- A transcript log
- Or worse, a hybrid of all three
That works for prototypes.
It does not work for commercial, multi-tenant agent systems.
The core distinction you must preserve is this:
Separate truth from acceleration.
In practice, that means designing two canonical stores and two derived stores.
Canonical Stores (Truth)
Canonical stores serve as the system of record. They hold durable, immutable facts from which state can be deterministically derived.
They must support:
- Auditability
- Replay
- Version awareness
- Deterministic reconstruction
Their full state must be reconstructible from their own persisted history, not dependent on secondary indexes, caches, or materialized views.
1. Canonical Event Log (Append-Only)
This is your flight recorder.
Every agent run emits multiple events that include:
- Context retrieved
- Policies evaluated
- Tool calls made
- Approvals routed
- Outputs generated
- Memory promoted
This log is:
- Append-only
- Immutable
- Replayable
- Version-aware
It allows you to answer:
- What did the agent know at decision time?
- Which policy version applied?
- Why was this exception granted?
- What was retrieved and why?
Without this log:
- You cannot debug autonomy.
- You cannot build evaluation loops.
- You cannot build future context graphs.
Example agent event:
1{2 "event_id": "evt_01J...",3 "run_id": "run_01J...",4 "tenant_id": "t_123",5 "user_id": "u_456",6 "ts": "2026-02-16T18:21:22Z",7 "type": "retrieval",8 "artifact_ids": ["mem_88"],9 "candidate_count": 32,10 "policy": { "version": "tenant_policy_v8" }11}
This is not logging. It is infrastructure.
2. Canonical Structured Memory Store
This stores durable memory state.
Unlike the event log, this is not raw trace data. It stores structured artifacts:
- Facts
- Preferences
- Episodic summaries
- Approved overrides
- Tenant-level knowledge
Every record must include:
- Scope
- Class
- Provenance
- Retention policy
- Sensitivity classification
Crucially: This store, not the vector index, is truth.
Example memory record:
1{2 "memory_id": "mem_88",3 "tenant_id": "t_42",4 "user_id": "u_7",5 "scope": "tenant",6 "memory_type": "episode",7 "status": "verified",8 "content_ref": "obj_441",9 "content_digest": "sha256:...",10 "provenance_run_id": "run_01J...",11 "retention_policy": "policy_12",12 "sensitivity": "internal",13 "created_at": 173102944114}
If your vector store becomes your truth layer, you will eventually:
- Lose provenance
- Lose replayability
- Break deletion guarantees
- Create retention drift
Derived Stores (Projections)
Derived stores exist for performance.
They are:
- Rebuildable
- Ephemeral
- Invalidatable
- Non-authoritative
They are accelerators, not truth.
1. Retrieval Index (Vector / Hybrid Search)
The retrieval index is your serving layer for recall.
It may include:
- Embeddings
- Lexical search (BM25 or equivalent)
- Hybrid ranking
- Freshness boosts
- Scope filters
- Metadata constraints
But it must be rebuildable from canonical sources.
It is a projection.
If your vector store becomes your truth layer, you will eventually lose structural integrity.
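One way to make rebuildability concrete, sketched against hypothetical `canonical_memory_store`, `index`, and `embed` interfaces:

```python
def rebuild_retrieval_index(tenant_id: str, index, canonical_memory_store, embed):
    """Drop and regenerate a tenant's retrieval partition purely from canonical truth."""
    index.drop_partition(tenant_id)

    for record in canonical_memory_store.iter_active(tenant_id=tenant_id):
        # Only verified, non-expired canonical records are projected into the index.
        index.upsert(
            partition=tenant_id,
            doc_id=record.memory_id,
            vector=embed(record.content),
            metadata={
                "scope": record.scope,
                "sensitivity": record.sensitivity,
                "expires_at": record.expires_at,
            },
        )
```

If this operation is impossible or lossy in your system, the index has already become accidental truth.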
2. Object Store (Large Payloads)
Agent systems frequently deal with:
- Large documents
- Attachments
- Extraction outputs
- Tool responses
- External system dumps
These do not belong in structured memory. They belong in an object store, referenced by ID, tagged with scope and sensitivity, and governed by retention policy.
Objects should be content-addressed (or at minimum content-hashed) so they can be verified, deduplicated, and traced back to immutable source bytes. Derived artifacts such as embeddings, chunks, summaries, and classifications should be stored separately and keyed by (object_id, content_digest, model/version). They are projections for retrieval and acceleration, not canonical truth.
Summaries are especially useful for faster retrieval and context compression, but they must remain reproducible and auditable. A summary should always reference:
- the source `object_id`
- the source content digest
- the model and prompt/version used to generate it
- creation timestamp and optional verification status
The event log and structured memory should reference these objects, never embed large payloads directly.
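A sketch of what a derived summary record might carry, with illustrative field names:

```python
# Hypothetical shape of a derived summary artifact; keys are illustrative.
summary_artifact = {
    "artifact_id": "sum_123",                 # derived, rebuildable, non-authoritative
    "object_id": "obj_441",                   # canonical source object
    "content_digest": "sha256:...",           # digest of the exact source bytes summarized
    "generator": {
        "model": "claude-sonnet-4-6",         # model used to produce the summary
        "prompt_version": "summarize_v4",     # prompt/version for reproducibility
    },
    "created_at": "2026-02-16T18:21:22Z",
    "verified": False,                        # optional verification status
}
```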
If using managed ingestion systems such as Amazon Bedrock Knowledge Bases, treat them as synchronization and chunking layers that build and refresh retrieval indexes from object storage. They orchestrate ingestion and pruning, but they do not replace the underlying search engine or the need for canonical content verification.
Hybrid Search
Hybrid search (lexical + semantic) provides stronger precision and filtering guarantees than vector-only retrieval when correctness and isolation matter.
Lexical search preserves deterministic filtering.
Semantic search improves recall.
Combined ranking reduces false matches and helps minimize hallucinations.
Engines such as OpenSearch, Typesense, and Pinecone natively support hybrid retrieval, combining keyword relevance (BM25-style scoring) with vector similarity to balance precision and semantic recall.
Amazon Bedrock Knowledge Bases is not a search engine itself, but instead a managed ingestion and synchronization layer that builds and refreshes retrieval indexes (typically backed by a hybrid store) from documents stored in S3. It handles pruning, chunking, and index rebuilds on your behalf.
Critically, the index, regardless of engine, must be built from:
- Canonical structured memory
- Curated documents
- Approved episodes
It must not be built directly from raw transcripts.
Raw transcripts are noisy, redundant, and context-fragmented. Indexing them directly undermines traceability and weakens retrieval discipline.
What We Learned
The Vector Index Became Accidental Truth
We relied on the vector index because it already contained embeddings, metadata, and retrieval paths. It was fast, convenient, and close to the model.
Over time, it quietly became the de facto system of record. Deletion became probabilistic, retention policies diverged across layers, and rebuilding the index shifted historical behavior because canonical truth had never been explicitly defined.
We separated acceleration from truth. Canonical records became immutable objects with explicit provenance and retention semantics. The vector index was reduced to a projection layer, fully rebuildable from canonical sources.
If your retrieval layer is the only place certain data lives, it is not acceleration. It is an unauditable system of record.
Isolation at the Projection Layer
Isolation is not a retrieval tuning feature.
It is a system invariant.
But the retrieval index is where projection-layer drift most commonly appears.
Filtering must occur before ranking, not after.
Every retrieval query must include:
- Tenant ID
- Scope visibility constraint
- Expiration checks
- Sensitivity boundaries
```python
class IdentityEnvelope:
    def __init__(self, tenant_id, user_id, roles, privacy_mode, policy_version):
        self.tenant_id = tenant_id
        self.user_id = user_id
        self.roles = roles
        self.privacy_mode = privacy_mode
        self.policy_version = policy_version


def retrieve(query, envelope: IdentityEnvelope):
    assert envelope.tenant_id is not None
    assert envelope.policy_version is not None

    filters = {
        "tenant_id": envelope.tenant_id,
        "policy_version": envelope.policy_version,
        "visibility": allowed_scopes(envelope),
        "not_expired": True
    }

    candidates = hybrid_search(query, filters)
    return rank(candidates)
```
If filtering is applied after ranking, cross-tenant artifacts may still influence embedding neighborhoods.
Partition semantics must match canonical storage.
If canonical storage is tenant-partitioned but the retrieval index is globally indexed with soft filters, isolation becomes advisory.
Observed Ecosystem Convergence
Event Logs and Structured Records Are Splitting into Distinct Tiers
Pattern: Canonical truth is separating from derived acceleration layers.
- Amazon Bedrock AgentCore implements this split explicitly: short-term memory stores raw interaction events; long-term memory holds structured information extracted asynchronously with semantic retrieval.
- OpenAI's Responses API provides durable response IDs, explicit metadata fields, and detailed usage breakdowns including cached tokens, the API-level anchors for trace envelopes.
Raw event capture and structured state are becoming architecturally distinct.
Content Proofs and Cross-Tenant Isolation
In shared or multi-tenant retrieval systems, consider cryptographic “content proofs” to prevent cross-copy leakage.
Cursor, for example, uses Merkle-tree-based content proofs during shared index onboarding to ensure results are returned only if the requester can prove legitimate possession.
This pattern can be applied at the object-store level:
- Maintain tenant-scoped manifests
- Maintain Merkle roots over authorized (`object_id`, `digest`) pairs
- Enforce verifiable inclusion boundaries
This reinforces isolation at the projection layer.
Cryptography does not replace logical isolation. It reinforces it.
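A toy sketch of the root-comparison idea (not Cursor's actual protocol): compute a Merkle root over the (`object_id`, `digest`) pairs a requester claims to possess and compare it against the tenant's authorized manifest before serving shared-index results:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold sorted leaf hashes pairwise into a single root."""
    level = sorted(_h(leaf) for leaf in leaves)
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def manifest_root(pairs: list[tuple[str, str]]) -> bytes:
    """pairs: (object_id, content_digest) tuples a party is authorized to possess."""
    return merkle_root([f"{oid}:{digest}".encode() for oid, digest in pairs])

def authorize_index_reuse(claimed_pairs, tenant_manifest_pairs) -> bool:
    # Serve shared-index results only if the requester's claimed manifest
    # matches the tenant's authorized manifest root.
    return manifest_root(claimed_pairs) == manifest_root(tenant_manifest_pairs)
```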
Observed Ecosystem Convergence
Content Proofs and Index Isolation Are Production Patterns
Pattern: Derived indexes enforce isolation cryptographically, not just logically.
- Cursor's blog and Security page describe Merkle-tree-based "content proofs" for secure teammate index reuse, filtering results unless the client can prove file possession, then deleting proofs after roots match.
- Cursor's data-use disclosure documents temporary encrypted caching with client-generated keys that exist server-side only during the request.
Projection layers are being hardened against cross-boundary leakage.
A Quick Mental Model
Think of it this way:
- The event log is your immutable journal.
- The structured memory store is your state.
- The retrieval index is your materialized view.
- The object store is your blob layer.
If you’ve worked with event sourcing, this should feel familiar (with less determinism).
If you haven’t, the rule is simple:
If you can’t rebuild it from canonical truth, it shouldn’t be trusted.
Why This Separation Matters
This architecture gives you:
- Replayability
- Deletion guarantees
- Poisoning containment
- Cross-tenant isolation clarity
- Retention enforcement
- Cost control
- Index rebuild capability
And most importantly:
It prevents your retrieval layer from silently becoming your system of record.
Architectural Principle
Truth Is Rebuildable, Acceleration Is Disposable
Canonical data must be authoritative; everything else must be regenerable.
- Canonical stores are the system of record.
- Derived layers are projections, not truth.
- If acceleration becomes authoritative, integrity erodes.
Part IV: The Context Engine Loop
Commercial agent systems don't fail because they lack storage.
They fail because they lack discipline at runtime.
Context is not something you “load.”
It is something you assemble, constrain, compact, and sometimes discard.
Most systems accumulate context. Production systems reconstruct it.
That discipline lives in the context engine loop.
The High-Level Loop
Every agent run should follow a predictable sequence:
- Ingest: establish identity, scope, constraints, and privacy mode
- Plan context needs: determine what information is required to act safely
- Retrieve: execute hybrid search within allowed scopes
- Assemble working set: layer context by priority and token budget
- Semantic stabilization: normalize references, extract structure, preserve meaning before reduction
- Agentic garbage collection: deduplicate, prune low-confidence artifacts, enforce working-set limits
- Infer and act: model + tools + policy enforcement + optional human approval
- Promotion gate: decide what becomes durable memory
- Emit trace envelope: record retrievals, actions, policies, versions, and cost surfaces
- Lifecycle garbage collection: expire session buffers, enforce retention, invalidate derived projections
This loop is not optional.
If you skip steps, you get drift.
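Taken together, the loop can be read as a single run function. The sketch below is illustrative only; helper names like `build_layers`, `infer_and_act`, and `promotion_gate` are placeholders for the steps described in the rest of this part:

```python
def run_agent(request, envelope: IdentityEnvelope):
    assert_envelope(envelope)                                   # 1. ingest: identity, scope, privacy mode
    plan = plan_context(request)                                # 2. plan what context is needed
    retrieved = retrieve(request.query, envelope, now_ts=now()) # 3. retrieve within allowed scopes
    layers = build_layers(retrieved, envelope)                  #    constitution, tenant policy, prefs, retrieved, session
    working_set = assemble(layers, plan["max_tokens"])          # 4. layered assembly under budget
    working_set = semantic_stabilization(working_set)           # 5. stabilize meaning before reduction
    working_set = agentic_gc(working_set, plan["max_tokens"])   # 6. dedupe, prune, enforce working-set limits
    result, events = infer_and_act(working_set, envelope)       # 7. model + tools + policy + approvals
    promotions = promotion_gate(result, events, envelope)       # 8. decide what becomes durable
    emit_trace_envelope(envelope, events, promotions, result)   # 9. append-only trace record
    lifecycle_gc(envelope)                                      # 10. retention and projection hygiene
    return result
```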
Step 1: Ingest
At the beginning of a run, you must establish:
- Tenant identity
- User identity
- Role and entitlements
- Privacy mode
- Sensitivity level
- Task type
Isolation begins here.
Retrieval filters are built before retrieval runs.
```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class IdentityEnvelope:
    tenant_id: str
    user_id: str
    roles: FrozenSet[str]
    privacy_mode: str    # e.g., "retained" | "no_retention"
    policy_version: str  # must be pinned per run


def assert_envelope(envelope: IdentityEnvelope) -> None:
    assert envelope.tenant_id, "tenant_id is required"
    assert envelope.user_id, "user_id is required"
    assert envelope.policy_version, "policy_version must be pinned per run"
    assert envelope.privacy_mode in {"retained", "no_retention"}, "invalid privacy_mode"
```
You DO NOT ask the model to filter data.
You filter in the data plane.
If identity and scope are ambiguous at ingestion, everything downstream becomes probabilistic.
Step 2: Plan Context Needs
Before retrieving anything, the agent should plan what kind of context it needs.
Does this task require:
- Tenant policy?
- Prior episodes?
- User preferences?
- External knowledge?
This prevents the common anti-pattern:
“Retrieve everything and let the model figure it out.”
In production, this anti-pattern shows up as:
- Gradually increasing token costs
- Slowly degrading precision
- Retrieval surfaces expanding
- Embedding neighborhoods densifying
- Prompt budgets creeping upward
No single run looks catastrophic.
Over weeks, additive context and recursive indexing begin influencing outcomes in subtle, hard-to-debug ways.
Planning reduces both risk and cost.
It's your first form of budget control.
```python
def plan_context(request):
    if request.type == "support_refund":
        return {
            "needs": ["tenant_policy", "prior_episodes", "customer_history"],
            "max_tokens": 2400
        }
    elif request.type == "draft_email":
        return {
            "needs": ["user_preferences"],
            "max_tokens": 1200
        }
```
Observed Ecosystem Convergence
Plan-Before-Execute Is Standard Practice
Pattern: Agent systems are separating planning from execution with explicit gates.
- Claude Code describes an agentic loop: gather context, take action, and verify results, with subagents that use fresh isolated contexts and return summaries.
- OpenCode documents a "plan" agent that analyzes without modifying code, with permissioned tools requiring approval before execution.
- Lovable splits into Plan mode for decision-making and Agent mode for execution with verification.
- Bolt documents Plan Mode as improving strategy and execution accuracy.
Inference without planning is giving way to deliberate, gated execution.
Step 3: Retrieve (Isolation Enforced Here)
Retrieval must respect:
- Scope
- Visibility
- Sensitivity
- Retention
- Privacy mode
Filtering happens before ranking, not after.
```python
def retrieve(query: str, envelope: IdentityEnvelope, *, now_ts: int):
    assert_envelope(envelope)
    assert query and isinstance(query, str)

    filters = {
        "tenant_id": envelope.tenant_id,               # mandatory predicate
        "visibility": allowed_scopes(envelope.roles),  # computed in data plane
        "not_expired_at": now_ts,                      # enforce retention gates
        "status_in": {"active"},                       # provisional is not broadly retrievable
    }

    # IMPORTANT: filter BEFORE rank, never after.
    candidates = hybrid_search(query=query, filters=filters)

    # Projection is not truth. Verify tenant on canonical fetch.
    artifact_ids = [c.artifact_id for c in candidates[:50]]
    records = guarded_fetch(artifact_ids, envelope.tenant_id)

    return rank(query, records)
```
Hybrid search (lexical + semantic) provides:
- Deterministic filtering
- Precision guarantees
- Improved recall
But retrieval is still a projection.
The canonical store remains the source of truth.
Step 4: Assemble the Working Set
The working set is the ephemeral context that actually enters the model’s window.
It is layered:
- Global constitution
- Tenant policies
- User preferences
- Retrieved facts and episodes
- Session state
Each layer has:
- Priority
- Token budget
- Truncation rules
Without layering and budgets, context windows become dumping grounds.
```python
def assemble(layers, budget):
    ordered = sort_by_priority(layers)
    working_set = []
    tokens_used = 0

    for item in ordered:
        if tokens_used + item.tokens <= budget:
            working_set.append(item)
            tokens_used += item.tokens
        else:
            break

    return working_set
```
What We Learned
Silent Guardrail Drift
We assumed that once policies existed in the system, they would remain influential.
As session histories expanded, tenant-level constraints were gradually pushed out of the working set. The system kept running, just without its guardrails visible at inference time.
We introduced explicit layer budgets. Global constitution and tenant policy received reserved allocations that could not be displaced.
If guardrails can be crowded out, they are suggestions, not invariants.
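A sketch of reserved allocations, assuming an illustrative `truncate_to` helper and the `assemble` function above; the invariant is that guardrail layers are admitted first and cannot be displaced:

```python
RESERVED_BUDGETS = {
    "global_constitution": 800,   # always present, never displaced
    "tenant_policy": 1200,        # always present, never displaced
}

def assemble_with_reservations(layers: dict, total_budget: int) -> list:
    working_set, used = [], 0

    # Reserved layers are admitted first, truncated only within their own allocation.
    for name, allocation in RESERVED_BUDGETS.items():
        item = truncate_to(layers[name], allocation)
        working_set.append(item)
        used += item.tokens

    # Everything else competes for the remainder, by priority.
    remainder = total_budget - used
    working_set += assemble(
        [layer for name, layer in layers.items() if name not in RESERVED_BUDGETS],
        remainder,
    )
    return working_set
```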
Observed Ecosystem Convergence
Budgeted Context Assembly Replaces Wholesale Inclusion
Pattern: Context is selectively loaded by budget and priority, not dumped wholesale.
- Replit injects minimal diagnostic signals and instructs the agent to fetch logs via a tool, explicitly avoiding full context dumps.
- Anthropic's Skills guide formalizes progressive disclosure: metadata always loaded, full instructions loaded only when needed, linked files navigated on demand.
- Cursor requires explicit context inclusion rather than blanket accumulation.
What you exclude from context is becoming as important as what you include.
Step 5: Semantic Stabilization (Pre-Compaction Flush)
Before you shrink context, you MUST stabilize meaning.
Compaction without stabilization risks deleting something the model was implicitly relying on.
Semantic stabilization answers:
What must be transformed or anchored before we enforce token limits?
This step may include:
- Collapsing verbose tool traces into structured summaries
- Extracting typed episodic artifacts from conversation fragments
- Converting free-form dialogue into structured facts
- Normalizing references (“that refund we discussed”) into concrete IDs
- Marking low-confidence artifacts explicitly
- Ensuring provenance metadata is attached
```python
def semantic_stabilization(working_set):
    working_set = collapse_tool_traces(working_set)
    working_set = extract_structured_episodes(working_set)
    working_set = normalize_references(working_set)
    working_set = attach_provenance(working_set)
    return working_set
```
This is not deletion.
It is transformation before deletion.
Without this step:
- Summarization can distort intent
- Compaction can silently remove guardrails
- Session references can become ambiguous
- Replay fidelity can degrade
Semantic stabilization preserves reasoning integrity before footprint reduction.
What We Learned
Compaction Without Stabilization Corrupted Meaning
We aggressively summarized long histories before extracting structured artifacts. It reduced tokens quickly and seemed harmless.
Over time, subtle behaviors shifted. Context that influenced tool selection and policy evaluation disappeared because it had been compressed before it was normalized or typed.
We moved structured extraction ahead of compaction. Meaning was stabilized first, then footprint was optimized.
If compaction runs before normalization, you are not reducing noise. You are discarding signal you have not yet captured.
Pre-compaction stabilization protects correctness.
Step 6: Agentic Garbage Collection (Working-Set Compaction)
After meaning is stabilized, the system can safely optimize.
Agentic garbage collection happens before inference.
It enforces:
- Token budgets by layer
- Deduplication of redundant artifacts
- Dropping stale session state
- Removing low-confidence provisional memory
- Enforcing maximum working-set size
Example:
```python
def agentic_gc(working_set, budget):
    working_set = dedupe(working_set)
    working_set = drop_low_confidence(working_set)
    return enforce_token_budget(working_set, budget)
```
Agentic GC protects:
- Guardrail visibility
- Cost predictability
- Ambiguity control
- Drift containment
It ensures that:
- Global constitution cannot be crowded out
- Tenant policy remains visible
- Session chatter does not displace structural constraints
Uncompressed history turns directly into cost.
Agentic garbage collection is not just optimization.
It is drift control.
Garbage Collection by Memory Layer:
| Memory Layer | Volatility | Promotion Risk | GC Strategy | Industry Parallel |
|---|---|---|---|---|
| Session | High | Low | Aggressive compaction, TTL | Claude Code ephemeral state |
| User | Medium | Medium | TTL + overwrite | Cursor user-local history |
| Tenant | Low | High | Verification gate | AgentCore tenant memory |
| Global | Immutable | Extreme | Write-locked | Signed system artifacts |
Session state is cheap and volatile.
Tenant memory is high-stakes and must be protected accordingly.
What We Learned
Transparency Competed With Cost
We retained full intermediate tool traces in the working context to maximize debugging transparency. Nothing else changed, but token usage per run steadily climbed.
The working set was carrying diagnostic detail the model did not need for inference. Cost increased without improving behavior.
We collapsed tool traces into structured summaries during stabilization and let agentic GC prune the rest. Only durable artifacts were eligible for promotion.
Full traces belong in the event log. The working set should carry only what inference needs to act on.
Step 7: Infer and Act
Only after:
- Context is stabilized
- Working set is compacted
- Budgets are enforced
does inference occur.
This is where:
- The model runs
- Tools are invoked
- Policies are evaluated
- Approvals are requested if needed
This is the only step most tutorials focus on.
Model invocation + tools + policy evaluation + approvals.
In commercial systems:
- Actions must be policy-evaluated
- High-risk actions may require human approval
- Tool outputs must be sensitivity-tagged
- Outputs must be traced
Tool invocations should include:
- Versioned tool contracts
- Input digests
- Output digests
External systems evolve.
Without capturing tool version and payload hash, replay fidelity degrades over time.
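A sketch of capturing that per invocation; the trace interface and helper names are illustrative:

```python
import hashlib
import json

def record_tool_call(trace, tool_name: str, contract_version: str, tool_input: dict, tool_output: dict):
    """Append a tool_call event with contract version and payload digests for replay fidelity."""
    def digest(payload: dict) -> str:
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
        return "sha256:" + hashlib.sha256(canonical).hexdigest()

    trace.record_event({
        "type": "tool_call",
        "tool": tool_name,
        "contract_version": contract_version,  # pin the external contract in use
        "input_hash": digest(tool_input),       # verify the exact request on replay
        "output_hash": digest(tool_output),     # detect upstream drift over time
    })
```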
The model is a component.
The system is the product.
Step 8: Promotion Gate
Promotion transitions session memory into durable memory.
This is the highest-risk operation in the system.
It deserves its own section, which we will fully expand in Part VI.
Step 9: Emit Trace Envelope
Disciplined trace capture requires a canonical shape.
Every run produces a single append-only trace envelope.
The envelope is a run-scoped materialization derived from the append-only event log. It does not introduce new facts. It snapshots derived aggregates so replay, audit, and cost analysis do not require reconstructing runs from raw events.
Cost views, lineage trees, evaluation harnesses, and audit dashboards are projections derived from this record. They do not redefine it.
Events are self-describing for partitioning and queryability, but the trace envelope is the authoritative run-level header. The event log is keyed by run_id.
At minimum, a canonical trace record must anchor:
- Identity (tenant, user, privacy mode)
- Model and policy versions
- Prefix/version hash
- Retrieval artifact IDs
- Tool contract versions
- Promotion decisions
- Token usage and cost
- Lineage (parent_run_id)
- Immutable event history
A minimal representation might look like this:
1{2 "run_id": "run_01J...",3 "parent_run_id": null,45 "tenant_id": "t_123",6 "user_id": "u_456",7 "privacy_mode": "retained",89 "policy": {10 "version": "tenant_policy_v8",11 "hash": "sha256:..."12 },13 "prefix": {14 "version": "constitution_v12",15 "hash": "sha256:..."16 },17 "model": {18 "provider": "anthropic",19 "name": "claude-sonnet-4-6",20 "version": "2026-02-15"21 },2223 "started_at": "2026-02-16T18:21:22Z",24 "ended_at": "2026-02-16T18:21:41Z",25 "status": "success",2627 "usage": {28 "tokens_in": 1832,29 "tokens_out": 412,30 "static_prefix_tokens": 620,31 "dynamic_context_tokens": 1212,32 "cost_estimate_usd": 0.023133 },3435 "retrieval": {36 "count": 8,37 "bytes_in": 14523,38 "rerank_candidates": 3239 },40 "tools": {41 "invoked": 2,42 "retry_count": 143 },44 "promotions": {45 "count": 1,46 "by_scope": { "tenant": 1, "user": 0, "global": 0 }47 },4849 "events": [50 {51 "event_id": "evt_01J...",52 "run_id": "run_01J...",53 "tenant_id": "t_123",54 "user_id": "u_456",55 "ts": "2026-02-16T18:21:23Z",56 "type": "retrieval",57 "artifact_ids": ["mem_88"],58 "candidate_count": 32,59 "policy": { "version": "tenant_policy_v8" }60 },61 {62 "event_id": "evt_01J...",63 "run_id": "run_01J...",64 "tenant_id": "t_123",65 "user_id": "u_456",66 "ts": "2026-02-16T18:21:27Z",67 "type": "tool_call",68 "tool": "refund_api",69 "contract_version": "v3.1",70 "input_hash": "sha256:...",71 "output_hash": "sha256:...",72 "policy": { "version": "tenant_policy_v8" }73 },74 {75 "event_id": "evt_01J...",76 "run_id": "run_01J...",77 "tenant_id": "t_123",78 "user_id": "u_456",79 "ts": "2026-02-16T18:21:38Z",80 "type": "promotion_write",81 "memory_id": "mem_441",82 "scope": "tenant",83 "memory_type": "episode",84 "status": "provisional",85 "policy": { "version": "tenant_policy_v8" }86 }87 ],8889 "integrity": {90 "envelope_hash": "sha256:...",91 "events_root_hash": "sha256:..."92 }93}
This record is append-only.
It is version-aware.
It is sufficient to replay the decision.
Everything else is projection.
Without trace envelopes, context engineering becomes guesswork.
Step 10: Lifecycle Garbage Collection (Durability & Retention Discipline)
After the run:
- Expire session buffers
- Invalidate derived indexes if needed
- Apply TTL to memory
- Archive large payloads
- Enforce retention policies
Memory is not just created. It must decay.
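A sketch of a post-run lifecycle pass, assuming hypothetical store interfaces:

```python
def lifecycle_gc(tenant_id: str, run_id: str, session_store, memory_store, index, object_store, now_ts: int):
    # Expire ephemeral session buffers for the completed run.
    session_store.expire(run_id)

    # Apply TTL / retention policy to durable memory and collect what was removed.
    expired_ids = memory_store.apply_retention(tenant_id=tenant_id, as_of=now_ts)

    # Derived projections must not outlive canonical records.
    index.delete(partition=tenant_id, doc_ids=expired_ids)

    # Archive or delete large payloads under the same retention policy.
    object_store.apply_retention(tenant_id=tenant_id, as_of=now_ts)
```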
The more you know
Why Three Forms of Garbage Collection?
It’s important to distinguish between:
- Semantic Stabilization: preserve meaning before reduction
- Agentic Garbage Collection: enforce working-set discipline before inference
- Lifecycle Garbage Collection: enforce retention and projection hygiene across runs
They operate at different layers of the architecture and protect different invariants:
- Stabilization protects correctness
- Agentic GC protects cost and drift
- Lifecycle GC protects durability and compliance

Most systems implement only one. Commercial systems require all three.
Run Boundary Events
Beyond the events emitted within the agent loop, two boundary events define the run itself.
run_started pins the execution boundary.
It captures the immutable configuration for the run: policy version, prefix hash, privacy mode, primary model, and parent linkage. From this point forward, the run operates inside that fixed context.
run_finalized closes the lifecycle.
It records final status, token usage, cost attribution, promotion counts, and integrity hashes. After this event, the run is complete and immutable.
Together, these two events make the trace envelope fully reconstructible from the append-only event log. The envelope introduces no new facts. It materializes the boundary and aggregates for fast replay, audit, and cost analysis.
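Sketched as illustrative event payloads (field names mirror the envelope example above):

```python
run_started = {
    "type": "run_started",
    "run_id": "run_01J...",
    "parent_run_id": None,
    "tenant_id": "t_123",
    "policy_version": "tenant_policy_v8",
    "prefix_hash": "sha256:...",        # constitution / static prefix in force for the run
    "privacy_mode": "retained",
    "model": "claude-sonnet-4-6",
}

run_finalized = {
    "type": "run_finalized",
    "run_id": "run_01J...",
    "status": "success",
    "tokens_in": 1832,
    "tokens_out": 412,
    "cost_estimate_usd": 0.0231,
    "promotion_count": 1,
    "events_root_hash": "sha256:...",   # integrity anchor over the event stream
}
```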
Multi-Turn Conversations Do Not Justify Persistent Windows
A common misconception:
“If this is a conversation, the entire prior context should remain in the window.”
⚠️ That is incorrect in commercial systems.
Multi-turn state should be reconstructed per turn from:
- Canonical structured memory
- Verified episodes
- Approved tenant policies
- Selective session summaries
Not from raw accumulated transcripts as the primary reconstruction mechanism.
Each turn should:
- Emit a trace
- Compact session artifacts
- Promote only approved durable memory
- Reassemble context fresh
Carrying forward full windows across turns:
- Increases token cost
- Increases drift
- Increases poisoning risk
- Obscures replayability
The acceptable pattern:
- Session memory is volatile
- Durable memory is reconstructed
- Context is assembled per turn
If context grows by accumulation rather than reassembly, you are building drift into the architecture.
What We Learned
Transcript Indexing Drift
We indexed raw transcripts directly because it was fast and required almost no additional structure. Early demos were impressive.
Over time, behavior drifted. Summarization evolved, tokenization shifted, and buried instructions inside transcripts began influencing retrieval in ways we could not replay or explain.
We moved transcripts out of the retrieval surface. Only structured artifacts, verified episodes, and canonical documents were indexed. Transcripts remained in the event log.
Raw transcripts are source material, not durable memory. If retrieval is built on conversation residue, behavior becomes a function of accumulated noise.
The Discipline
This loop is the difference between:
A chatbot with a vector DB and a commercial agent system.
Most failures come from skipping:
- Planning
- Pre-compaction
- Promotion gating
- Trace emission
The loop enforces discipline.
And discipline turns context from an experiment into infrastructure.
Architectural Principle
Assemble, Don’t Accumulate
Context must be reconstructed per run, not allowed to grow unchecked.
- Context is built intentionally each execution.
- The loop is the product: retrieve, budget, compact, promote, trace.
- Unbounded carryover becomes architectural drift.
Part V: Multi-Agent Context Boundaries
Commercial agent systems increasingly delegate work across multiple agents.
A parent agent spawns a subagent to research a topic, execute a tool chain, validate a result, or operate within a specialized domain. Multi-agent orchestration patterns such as fan-out, delegation, pipelines, and supervisory hierarchies are becoming standard.
The architectural challenge is not orchestration.
It is context discipline across agent boundaries.
Every principle established so far, scoped memory, truth vs acceleration, the context engine loop, applies within a single agent. Multi-agent systems multiply the surfaces where those principles must hold.
If context flows between agents without discipline, you get the same failures as undisciplined single-agent systems, but harder to debug because the causal chain crosses execution boundaries.
Context Inheritance vs Isolation
When a parent agent spawns a subagent, the first question is:
What context does the subagent receive?
There are two patterns:
- Shared context: The subagent inherits the parent's full working set.
- Isolated context with scoped input: The subagent receives a fresh context window with only the information the parent explicitly passes.
The first pattern is simple. It is also dangerous.
It carries the following risks:
- The subagent's token budget is consumed by the parent's context before it begins its own work.
- Irrelevant context from the parent pollutes the subagent's reasoning.
- Replay becomes ambiguous because you cannot isolate which agent's context influenced which decision.
- If the parent's context contains sensitive artifacts the subagent should not access, isolation is violated.
The second pattern, isolated context with scoped input, however, survives production pressure.
Claude Code's subagent model reflects this: subagents operate with fresh isolated contexts. The parent provides a scoped task description. The subagent executes independently. It returns a structured summary. The parent incorporates the summary into its own working set.
The isolation is deliberate:
- The subagent's token budget is its own.
- The parent controls what enters the subagent's window.
- The subagent's full internal trace stays in its own scope.
- Replay can reconstruct each agent's decision independently.
The cost of isolation is that the parent must decide what context the subagent needs. That decision is itself a context engineering problem, and it benefits from the same planning step described in the context engine loop.
If context inheritance is implicit, debugging multi-agent behavior requires reconstructing invisible state.
If context inheritance is explicit, each agent's behavior is independently replayable.
```python
def spawn_subagent(parent_trace: TraceEnvelope, envelope: IdentityEnvelope, task: dict, input_artifact_ids: list[str]):
    # Mandatory inheritance: tenant, user, policy, privacy (handled in routing outside this call).
    child_run_id = new_id("run")

    child_trace = TraceEnvelope(
        run_id=child_run_id,
        tenant_id=envelope.tenant_id,
        user_id=envelope.user_id,
        policy_version=envelope.policy_version,
        model_version=select_model_for(task),
        parent_run_id=parent_trace.run_id,
    )

    # Parent explicitly chooses what the child can see.
    result = execute_subagent(task=task, input_artifact_ids=input_artifact_ids, trace=child_trace)

    parent_trace.record_event({
        "event_type": "delegation",
        "child_run_id": child_run_id,
        "agent_type": task.get("agent_type"),
        "input_artifact_ids": input_artifact_ids,
        "output_summary_id": result.summary_id,
        "child_cost_usd": child_trace.cost_usd,
    })

    child_trace.finalize()
    return result
```
What We Learned
Context Sharing Was Correct Until It Wasn't
We initially delegated to subagents by passing the parent’s full working set. It was simple, fast to ship, and produced strong results in testing.
As parent contexts grew, subagent token costs grew with them. Behavior became sensitive to prior session state, and identical delegations produced different outcomes depending on what had happened earlier in the run.
We moved to scoped delegation. The parent assembled a minimal context package per subagent: task description, applicable policies, and explicitly selected artifacts. Each subagent ran in an isolated context and returned a structured summary.
Full context inheritance works when working sets are small. At scale, implicit inheritance turns parent history into unintended influence.
Subagent Outputs Are Promotion Events
When a subagent returns results to its parent, the parent incorporates that output into its working set.
This is a promotion event.
It deserves the same scrutiny as any other transition from ephemeral to durable state.
When the subagent's summary enters the parent's context it can influence:
- Subsequent tool invocations
- Policy evaluation
- Further delegation decisions
- Memory promotion at the end of the run
If the subagent's output is treated as trusted input without validation, the parent inherits whatever errors, hallucinations, or poisoning the subagent produced.
Defense:
- Subagent outputs should be typed: fact, episode, recommendation, tool result.
- Provenance should be tagged: which subagent, which run, which model version.
- Sensitivity classification should transfer: if the subagent accessed tenant-scoped data, the summary inherits that classification.
- The parent's promotion gate applies: subagent outputs should be treated the same way as any other artifact entering durable memory.
A useful mental model: treat subagent outputs like tool outputs.
They are data, not directives.
They carry provenance.
They are subject to the same validation rules as any other input to the working set.
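A sketch of a typed, provenance-tagged subagent output; field names are illustrative:

```python
subagent_output = {
    "artifact_type": "episode",                # typed: fact | episode | recommendation | tool_result
    "summary": "Researched refund precedents for the enterprise tier; 3 relevant cases found.",
    "provenance": {
        "agent_type": "research_agent",        # which subagent produced it
        "run_id": "run_sub_01J...",            # which run
        "model_version": "claude-sonnet-4-6",  # which model version
    },
    "sensitivity": "internal",                 # inherited from the data the subagent accessed
    "status": "provisional",                   # remains provisional until the promotion gate validates it
}
```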
What We Learned
Subagent Outputs Bypassed Promotion Gates
We treated subagent summaries as trusted internal artifacts because they came from our own agents. They flowed directly into the parent’s working set and, in some cases, into tenant-scoped durable memory without passing through the standard promotion gate.
As delegation volume increased, unverified summaries accumulated in durable memory faster than review processes could keep up.
We routed subagent outputs through the same promotion pipeline as every other artifact. Provenance became mandatory, and outputs remained provisional until validated.
The source of an artifact does not determine its trustworthiness. Internal agents are not exempt from governance.
Trace Lineage Across Agent Boundaries
In a single-agent system, the trace envelope captures one execution path.
In a multi-agent system, traces form a tree.
If Agent A delegates to Agent B, and Agent B delegates to Agent C, the trace must capture the full lineage:
- `run_id` for each agent's execution
- `parent_run_id` linking child to parent
- Delegation context: what was passed to the child
- Return summary: what came back
- Cost attribution per agent
- Return summary: what came back
- Cost attribution per agent
Without lineage, you cannot:
- Replay a specific agent's execution in isolation
- Attribute cost to the agent that incurred it
- Debug which agent in the chain produced a problematic output
- Evaluate whether delegation decisions were correct
Trace lineage turns a multi-agent run from opaque delegation into a debuggable, replayable execution graph.
Without it, multi-agent systems become black boxes that happen to contain smaller black boxes.
Example trace structure (truncated for brevity):
1{2 "run_id": "run_parent_01J...",3 "tenant_id": "t_42",4 "user_id": "u_7",5 "policy_version": "policy_v3",6 "model_contract_version": "agent_spec_v2",78 "delegations": [9 {10 "child_run_id": "run_sub_01J...",11 "agent_type": "research_agent",12 "model_version": "claude-sonnet-4-6",13 "input_artifact_ids": ["mem_88", "policy_v3"],14 "output_artifact_ids": ["sum_441"],15 "tokens_in": 1420,16 "tokens_out": 380,17 "tools_invoked": 3,18 "cost_estimate_usd": 0.018,19 "promotions": [],20 "delegations": []21 }22 ],2324 "promotions": [],25 "total_cost_estimate_usd": 0.04126}
Scope Inheritance Rules
When a parent agent delegates to a subagent, the subagent must operate within the correct scope boundaries.
Tenant scope and user scope must be inherited. If a subagent operates outside the parent's tenant boundary, isolation is violated. This is not optional.
Session scope is different. The subagent should have its own ephemeral session scope. It should not inherit the parent's session history, scratch buffers, or intermediate plans. Those belong to the parent's execution context.
Policy visibility must also propagate. If the parent operates under tenant policy version 3.2, the subagent must operate under the same version. Policy version drift across agents within a single run creates inconsistency that is extremely difficult to debug.
Summary of inheritance rules:
| Scope | Inherited? | Notes |
|---|---|---|
| Tenant identity | Yes (mandatory) | Isolation boundary |
| User identity | Yes (mandatory) | Entitlement boundary |
| Tenant policies | Yes (mandatory) | Must be same version as parent |
| Global constitution | Yes (mandatory) | Immutable, always present |
| Session state | No | Subagent gets its own session scope |
| Parent's working set | No | Only explicitly passed artifacts |
| Privacy mode | Yes (mandatory) | Cannot be downgraded by delegation |
If privacy mode is active in the parent, it must be active in every subagent. Delegation cannot downgrade privacy guarantees.
If policy version differs between parent and child, the trace will show inconsistent evaluation and replay will not reproduce the behavior.
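A sketch of enforcing those inheritance rules at delegation time, reusing the `IdentityEnvelope` from Part IV; the drift check and exception are illustrative:

```python
def derive_child_envelope(parent: IdentityEnvelope, parent_policy_version: str) -> IdentityEnvelope:
    # Tenant, user, policy version, and privacy mode are inherited verbatim.
    # Session state and the parent's working set are NOT inherited here;
    # the child gets its own session scope and only explicitly passed artifacts.
    return IdentityEnvelope(
        tenant_id=parent.tenant_id,
        user_id=parent.user_id,
        roles=parent.roles,
        privacy_mode=parent.privacy_mode,      # cannot be downgraded by delegation
        policy_version=parent_policy_version,  # must match the parent exactly
    )

def assert_no_policy_drift(parent_version: str, child_version: str) -> None:
    if parent_version != child_version:
        raise RuntimeError(f"policy version drift across delegation: {parent_version} != {child_version}")
```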
Cost Attribution Across Agents
Multi-agent runs compound every cost surface described in Part IX.
Each subagent incurs its own:
- Inference cost (its own context window, its own model invocation)
- Retrieval cost (its own queries against the retrieval index)
- Tool cost (its own external API calls)
- Persistence cost (if it promotes anything to durable memory)
Without per-agent cost attribution, optimization is impossible because you cannot see which agents are expensive.
Common failure mode:
A parent agent delegates to five subagents. Total run cost rises. The trace shows aggregate token counts. But it does not show that one subagent consumed 60% of the budget because its retrieval surface was over-broad.
Per-agent cost attribution within a run is not optional in commercial systems. It is the only way to identify which delegation paths are economically sustainable and which need tighter budgets.
The trace envelope must decompose cost by agent, not just by surface.
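A sketch of decomposing cost per agent from a delegation tree shaped like the example above; field names are assumptions, not a fixed schema:

```python
def cost_by_agent(trace: dict, path: str = "root") -> dict:
    """Walk the delegation tree and attribute cost to the agent that incurred it."""
    costs = {}
    child_total = 0.0
    for child in trace.get("delegations", []):
        child_path = f'{path}/{child.get("agent_type", "agent")}'
        costs.update(cost_by_agent(child, child_path))
        child_total += child.get("cost_estimate_usd", 0.0)

    total = trace.get("total_cost_estimate_usd", trace.get("cost_estimate_usd", 0.0))
    costs[path] = round(total - child_total, 6)  # this agent's own share, excluding children
    return costs

# With the earlier example, this surfaces the parent's own spend alongside the
# research_agent's share, making it visible when one delegation path dominates the budget.
```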
Observed Ecosystem Convergence
Multi-Agent Context Isolation Is Becoming Structural
Pattern: Agent systems are separating agent execution contexts with explicit boundaries rather than sharing state.
- Claude Code implements subagent isolation through the Task tool. These subagents receive scoped task descriptions, operate with fresh contexts, and return structured summaries. The parent's context is not shared wholesale. Subagents cannot spawn other subagents, enforcing a single-level delegation hierarchy. The Anthropic Agent SDK extends this with `parent_tool_use_id` fields for tracing delegation lineage.
- Letta supports multi-agent architectures where each agent maintains its own memory blocks and state. Cross-agent communication happens through explicit message passing (`send_message_to_agent_async` and `send_message_to_agent_and_wait_for_reply`), not shared context windows. When state must be shared, Letta uses explicitly attached shared memory blocks rather than implicit context inheritance.
- The OpenAI Agents SDK supports agent handoffs where conversation state transfers between specialized agents. By default the receiving agent sees full conversation history, but `input_filter` functions give explicit control over what context propagates. The SDK also supports `nest_handoff_history`, which collapses prior transcripts into summary messages rather than passing raw history, implementing context scoping as a first-class API. It also supports an agents-as-tools pattern for nested delegation.
- AWS Bedrock AgentCore supports multi-agent orchestration with a supervisor-agent pattern where specialized sub-agents maintain independent configurations and tool access. AgentCore Memory provides memory branching, isolated conversation branches within a shared memory resource, so each agent maintains its own context history within shared tenant boundaries.
Context isolation between agents follows the same pattern as context isolation between tenants: structural separation with explicit, controlled sharing.
Architectural Principle
Delegation Multiplies Risk
Every agent boundary must preserve isolation and replay.
- Subagents receive minimal, intentional context.
- Cross-agent flows remain scoped and traceable.
- Inherited sprawl is a systemic failure.
Part VI: Isolation, Poisoning, and Promotion Control
If you build commercial agents long enough, you eventually learn a frustrating truth:
Most failures don’t look like failures.
They look like slightly worse output... until you realize the system is drifting.
That drift is usually caused by one of three things:
- Isolation boundaries weren’t enforced consistently
- Bad context was retrieved and treated as truth
- Session artifacts were promoted into durable memory
The danger isn’t that the model hallucinates.
The danger is that the system starts believing it.
Isolation Is a Data-Plane Primitive
Here’s the invariant again because it’s worth repeating:
Isolation is enforced in the data plane, not the prompt.
If your strategy relies on the model restricting itself to tenant-specific content, you’re already in trouble.
Isolation must be structural.
This means:
- Every retrieval query must include the tenant/user filter as a required predicate
- Filters always apply before ranking
- When documents are fetched, a tenant mismatch must throw an exception
- Derived indexes should be either physically partitioned or logically partitioned with verified predicates
It sounds obvious.
```python
class IsolationBreach(Exception):
    pass

def guarded_fetch(artifact_ids: list[str], tenant_id: str):
    assert tenant_id

    records = canonical_memory_store.get_many(artifact_ids)

    for r in records:
        # Hard failure. Do not "best-effort" isolate.
        if r.tenant_id != tenant_id:
            raise IsolationBreach(f"tenant mismatch: {r.id}")

    # Optional: enforce lifecycle rules at read time too.
    return [r for r in records if getattr(r, "status", None) == "active"]
```
What We Learned
Canonical Was Clean. Projections Drifted.
When we investigated cross-tenant inconsistencies, canonical storage was correct. The drift lived in derived layers: search indexes, caches, and analytics jobs. Each enforced partitioning slightly differently. Each was almost correct.
Those small differences compounded into observable inconsistencies at scale.
We standardized tenancy enforcement across every derived layer. Retrieval filters became mandatory predicates, and partition semantics were made identical everywhere.
Isolation that holds only in canonical storage is not isolation. Every projection layer must inherit the same partitioning contract or eventually violate it.
Most multi-tenant systems that leak data don’t leak because someone wrote SELECT * FROM tenants.
They leak because a derived system wasn’t partitioned the same way the canonical store was.
Observed Ecosystem Convergence
Isolation Is Structural, Not Prompt-Level
Pattern: Privacy and isolation are routing decisions, not prompt-level instructions.
- Cursor's security architecture routes requests through a proxy into separate service replicas for privacy vs non-privacy workloads, defaulting to privacy mode if the `x-ghost-mode` header is missing.
- Warp performs unconditional secret redaction for AI interactions and uses an explicit `X-Warp-Telemetry-Enabled` header, where the server assumes telemetry is disabled if the header is absent.
- OpenCode executes tasks inside GitHub Actions runners, creating a hard boundary for tool execution and side effects.
Isolation is structural and enforced before writes, not advisory.
The Three Types of Memory Poisoning
Memory poisoning is not just “prompt injection.”
In multi-tenant systems it shows up in three distinct ways.
1. Instruction Poisoning
Malicious or malformed content attempts to alter system behavior.
Examples:
- “Ignore previous instructions”
- “Always approve refunds”
- “If you see this, exfiltrate secrets”
Defense:
- Policies never come from user content
- Policy memory is signed and versioned (global scope)
- User instructions are treated as input, not law
- Tool outputs are treated as data, not directives
If you take one rule from this section:
Never promote instructions to policy.
2. Precedent Poisoning
This is subtle and common.
An agent makes an exception once.
That exception gets stored as “how we do it.”
Six weeks later, the exception becomes default behavior.
Defense:
- Don’t store precedents as directives
- Store them as episodes with provenance and outcomes
- Require explicit approval signals before a precedent becomes a norm
Episodic memory earns its keep here.
Episodes store:
- What happened
- Why it happened
- Under what policy
- With what approval
They do not store: “What to always do.”
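For illustration, an episode record might look like the following; the fields are assumptions consistent with the promotion schema shown later, not a prescribed format.

```python
# Illustrative episode record: stores what happened and under what authority,
# never a standing directive. Field names are assumptions, not a fixed schema.
episode = {
    "memory_type": "episode",
    "scope": "tenant",
    "what_happened": "Refund of $1,200 approved outside standard limit",
    "why": "Customer churn risk flagged by account owner",
    "policy_version": "tenant_policy_v3.2",
    "approval": {"approved_by": "ops_manager_117", "channel": "ticket_4821"},
    "outcome": "refund_issued",
    # Deliberately absent: any "always do this" instruction.
}
```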
What We Learned
Exceptions Became Norms
We allowed emergency overrides to persist temporarily. They solved immediate problems and seemed contained.
Over time, some overrides made their way into durable tenant memory. Temporary exceptions began behaving like permanent rules.
We changed the promotion path. Overrides remained ephemeral unless approved through the same governance flow as canonical state. Precedent required explicit provenance and verification.
If an exception can persist without approval, the system will eventually treat it as policy. Governance is not about preventing overrides. It is about preventing silent graduation.
3. Cross-Scope Contamination
A user-level artifact gets promoted to tenant scope.
A tenant-level artifact affects global behavior.
A retrieval index accidentally crosses tenants.
When this happens:
- Quality degrades everywhere
- Security risk spikes
- Replay becomes ambiguous
Defense:
- Promotion gates enforce scope rules
- Global scope is write-locked
- Tenant scope requires stricter verification than user scope
- Every memory write includes scope, retention, sensitivity, and provenance
What We Learned
Automatic Learning Was a Trap
We experimented with automatically persisting what the agent “learned” during a run. It felt like progress. The system appeared to evolve.
Over time, small local mistakes were promoted into durable memory. Durable memory amplified those errors and fed them back into future runs.
We moved promotion behind an explicit gate. Every durable write required scope classification, provenance, retention policy, and verification. Session state remained volatile unless intentionally promoted.
Intelligence that persists without governance is not learning. It is drift with momentum.
Promotion: The Most Dangerous Operation
Promotion is the transition from session state to durable memory.
It is where most memory poisoning becomes permanent.
Promotion must be treated like a database write.
Not like a convenience feature.
A promotion gate should answer four questions:
- What scope can this live in? Session vs user vs tenant vs global
- What type is it? Preference vs fact vs episode vs policy
- What is the retention policy? TTL vs manual deletion vs legal hold
- What is the provenance? Where did it come from, and can we replay it?
note
New users may require a bootstrapping phase with more permissive promotion that tightens over time. Otherwise newly onboarded tenants face a cold-start challenge where overly restrictive promotion gates mean the system has no memory to work with and delivers poor early experiences.
Default Promotion Rules
- Session → User: allowed for preferences, drafts, working style, and user-specific episodes
- Session → Tenant: allowed only for verified facts and approved episodes
- Session → Global: never allowed at runtime
Verification Rules
Facts stored at tenant scope require:
- Human approval
- Trusted system-of-record confirmation
- Repeated corroboration across independent sources
Sensitivity Rules
- Never persist secrets (API keys, tokens)
- Be careful persisting PII without explicit retention rules and consent
Pseudo-Policy:
```python
def promote(candidate, envelope: IdentityEnvelope, *, trace, now_ts: int):
    assert_envelope(envelope)

    # Invariants:
    # - No runtime writes to global scope
    # - Policy is a signed artifact, never promotable
    if candidate.scope == "global":
        return reject("global scope is write-locked")
    if candidate.memory_type == "policy":
        return reject("policy is a signed artifact, not promotable")

    # Tenant writes are high-stakes.
    if candidate.scope == "tenant":
        if candidate.memory_type == "fact" and not (candidate.verified or candidate.human_approved):
            return reject("tenant facts require verification or explicit approval")
        if candidate.memory_type == "episode" and not (candidate.human_approved or candidate.from_trusted_workflow):
            return reject("tenant episodes require approval or trusted workflow signal")

    # Sensitivity guardrails.
    if candidate.contains_secrets:
        return reject("secrets are never persisted")

    # Minimal canonical write: reference + digest + provenance (never raw payload).
    record = {
        "memory_id": new_id("mem"),
        "tenant_id": envelope.tenant_id,
        "user_id": envelope.user_id,
        "scope": candidate.scope,
        "memory_type": candidate.memory_type,
        "status": "provisional",
        "content_ref": candidate.content_ref,
        "content_digest": candidate.content_digest,
        "provenance_run_id": trace.run_id,
        "policy_version": envelope.policy_version,
        "retention_policy": candidate.retention_policy,
        "created_at": now_ts,
    }
    canonical_memory_store.put(record)

    trace.record_event({
        "event_type": "promotion_write",
        "memory_id": record["memory_id"],
        "status": record["status"],
        "scope": record["scope"],
        "memory_type": record["memory_type"],
        "content_digest": record["content_digest"],
    })

    # Everything expensive is off the critical path.
    enqueue_hardening(memory_id=record["memory_id"])
    return record["memory_id"]
```
This is the generational GC analogy in practice:
- Session state is cheap and volatile
- User memory is moderately durable
- Tenant memory is high-stakes and requires verification
- Global memory is write-locked
Promotion discipline is not about paranoia.
It is about protecting invariants.
What We Learned
Promotion Gates Need Calibration, Not Just Restriction
We initially tightened promotion gates to prevent bad writes. The instinct was correct: durability should be earned.
Over time, the system began forgetting legitimate outcomes. Verified decisions expired with session state, and when related tasks resurfaced weeks later, the agent had no grounding in what had already been established.
We recalibrated promotion by memory type. Facts and policy remained tightly gated. Episodes from completed workflows were eligible for promotion with automatic provenance tagging and verification signals.
A promotion gate is not just a wall. It is a calibration surface. Too permissive and drift compounds. Too restrictive and the system forgets what it legitimately learned.
Observed Ecosystem Convergence
Promotion Gating Is a Product-Level Pattern
Pattern: Durable memory requires explicit approval, not automatic persistence.
- Cursor extracts Memories from chat but saves them only with user approval, which is a direct implementation of promotion gating.
- Bolt intentionally clears chat history when switching agents, instructing users to preserve durable guidance in "Project Knowledge."
- Lovable positions "Custom knowledge" as persistent shared memory applied across future edits, not accumulated chat transcript.
Session state is ephemeral by default; durability requires a gate.
What We Learned
Deletion Drifted Without Synchronized Invalidation
We treated deletion in canonical storage as sufficient. The system of record behaved correctly.
Derived indexes did not. They continued serving artifacts that no longer existed in canonical truth. Projections outlived their source.
We introduced synchronized invalidation. Deletion emitted tombstones with provenance and triggered coordinated updates across every derived layer.
Deletion without invalidation creates both cost drift and correctness drift. If a deleted artifact can resurface through a projection, deletion is not complete.
Architectural Principle
Promotion Is a Database Write
Promotion changes durable state and must be treated as such.
- Global scope is write-locked.
- Policy artifacts are not promotable.
- Tenant memory requires verification and provenance.
Part VII: Asynchronous Hardening and Memory Lifecycle
Promotion is the transition from volatile session state to durable canonical memory.
But durable memory should not be fully materialized synchronously.
In commercial systems, enrichment and hardening steps are frequently asynchronous.
These operations are:
- Computationally expensive
- Potentially slow
- Sometimes dependent on external systems
- Often unnecessary for immediate inference
The critical path should do only what is required to establish canonical truth.
Everything else belongs in a background processing pipeline.
Minimal Canonical Writes
The inference loop writes:
- A minimal canonical record
- With scope
- With type
- With provenance
- With retention policy
- With sensitivity classification
It does not block on:
- Embedding generation
- Cross-document deduplication
- Fact corroboration
- Conflict detection
- Episodic summarization
- Index rebuilds
- Retention reclassification
Those belong to hardening.
The Hardening Pipeline
Pattern:
```
run_complete
  → promotion_candidate written to canonical store (minimal record)
  → enqueue enrichment tasks

background_worker
  → validate / enrich / embed
  → update derived indexes
  → emit trace event
```
This separation accomplishes four things:
- Keeps latency predictable
- Prevents enrichment failures from blocking inference
- Preserves canonical truth even if derived layers fail
- Enables cost amortization through batched AI inference
Design rule:
The inference loop writes minimal verified canonical records.
Everything else is projection.
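A hedged sketch of the background half of this split, assuming a retryable worker and illustrative helpers (`generate_embedding`, `corroborate`, `derived_index`, `set_status`) standing in for real enrichment services:

```python
def harden(memory_id: str):
    # Background worker: everything here is retryable and off the critical path.
    record = canonical_memory_store.get(memory_id)

    # Enrichment failures degrade recall, never correctness: on any failure the
    # canonical record stays intact and its status remains provisional.
    embedding = generate_embedding(record["content_ref"])   # illustrative helper
    if not corroborate(record):                              # illustrative helper
        canonical_memory_store.set_status(memory_id, "quarantined")
        return

    derived_index.upsert(memory_id, embedding, tenant_id=record["tenant_id"])
    canonical_memory_store.set_status(memory_id, "active")
```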
Observed Ecosystem Convergence
Asynchronous Hardening Is an Emerging Architectural Norm
Pattern: Enrichment and consolidation are moving off the critical inference path.
- Letta processes messages asynchronously where a "subconscious agent" updates memory blocks out of band.
- Amazon Bedrock AgentCore generates long-term structured memory asynchronously from raw session events, with semantic retrieval APIs for later access.
Both systems keep enrichment off the critical path. Canonical truth persists immediately; projections follow.
Lifecycle States
Asynchronous hardening introduces a non-obvious reality:
Canonical truth may be persisted before it is fully trusted.
Treat newly promoted records as provisional until hardening completes. That means every promoted artifact carries an explicit status and can move through a small lifecycle:
- Provisional: persisted with provenance and scope; eligible for replay; not eligible for broad retrieval
- Active: validated and hardened; eligible for retrieval and reuse
- Quarantined: suspected poisoning, contradictions, or failed checks; excluded from retrieval
- Revoked: explicitly superseded or deleted via tombstone with provenance
Hardening determines retrieval eligibility.
Inference should never block waiting for enrichment.
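In code, the gate can be as small as this, with status values taken from the lifecycle above and an illustrative record shape:

```python
RETRIEVAL_ELIGIBLE = {"active"}

def eligible_for_retrieval(record: dict) -> bool:
    # Provisional, quarantined, and revoked records never feed broad retrieval.
    # They remain in canonical storage for replay and audit until retention expires.
    return record.get("status") in RETRIEVAL_ELIGIBLE
```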
What We Learned
Quarantine Needed a Replacement Path
Our first quarantine behaved correctly. A tenant-scoped fact was contradicted during asynchronous hardening, flagged, and excluded from retrieval.
What we did not anticipate was the vacuum it created. That fact had grounded responses for weeks. Once removed, the agent simply stopped referencing it, with no visible signal that context had changed.
We added a resolution workflow. Quarantine no longer meant exclusion alone. It triggered review, leading to correction, explicit revocation, or reinstatement.
Lifecycle states need more than transitions. They need resolution paths. Quarantine without resolution is silent deletion.
Observed Ecosystem Convergence
Lifecycle State and Retrieval Eligibility Are Explicit
Pattern: Memory records carry explicit lifecycle status that gates retrieval eligibility.
- Amazon Bedrock AgentCore formalizes memory organization with actor and session scoping, recommending namespaced organization to avoid conflicts. It also warns that event metadata isn't meant for sensitive content because it isn't encrypted with a customer-managed key.
- Windsurf implements global, workspace, and system-level memory tiers with distinct lifecycle characteristics.
Scope boundaries and sensitivity classification are becoming structural, not advisory.
Failure Modes and the Canonical Contract
Asynchronous hardening introduces several failure modes.
They must degrade recall, not correctness.
1. Enrichment Failure
Embedding generation fails.
Summarizer times out.
Downstream index is unavailable.
Rule:
- Canonical record remains intact
- Status remains provisional
- Derived projections are retried
Correctness is preserved.
2. Contradiction Discovered
Corroboration fails.
Trusted system-of-record disagrees with agent-authored fact.
Rule:
- Mark artifact as quarantined
- Emit trace event
- Optionally write corrective memory with higher precedence
Never silently overwrite.
3. Duplication and Merge Races
Multiple runs promote semantically identical artifacts.
Deduplication merges incorrectly.
Rule:
- Use deterministic identity keys (content_digest + scope + type)
- Make merge operations idempotent
- Record merge decisions in the event log
Projection must not rewrite canonical history without trace.
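A deterministic identity key is straightforward to construct; this sketch assumes the digest, scope, and type already exist on the candidate record:

```python
import hashlib

def identity_key(content_digest: str, scope: str, memory_type: str) -> str:
    # The same artifact promoted twice resolves to the same key,
    # which makes merge operations idempotent rather than racy.
    raw = f"{content_digest}:{scope}:{memory_type}".encode()
    return hashlib.sha256(raw).hexdigest()
```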
4. Late Revocation
Tenant updates policy.
User revokes consent.
Compliance deletion arrives after embedding and indexing.
Rule:
- Tombstones are first-class
- Deletion triggers coordinated invalidation across every derived layer
Deletion without invalidation resurrects stale context.
5. Partial Projection
Canonical write succeeds.
Some derived indexes update.
Others do not.
Rule:
- Retrieval must be tolerant of missing projections and fall back to canonical fetch paths.
Projections should never be required for correctness.
Memory Conflict and Drift
Over time, multi-tenant agents accumulate contradictions.
Examples:
- Tenant policy changed but old memory persists
- Preference updated but stale entries remain
- Past episode contradicts current entitlement state
You need a strategy for conflict. The simplest workable approach:
- Prefer newest memory with verified provenance
- Prefer canonical systems of record over agent-authored facts
- Mark memories with “confidence” and “verified” flags
- Allow revocation and explicit tombstones (deletions with provenance)
Tombstones matter because they:
- Prove deletion
- Stop retrieval from resurrecting stale artifacts
- Ensure derived indexes invalidate consistently
Hardening without tombstones creates retention drift.
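A minimal tombstone sketch, assuming illustrative `derived_index` and `cache` handles alongside the same `canonical_memory_store` used in the promotion example:

```python
def revoke(memory_id: str, reason: str, *, trace):
    # A tombstone is a first-class record: it proves deletion and carries provenance.
    tombstone = {
        "memory_id": memory_id,
        "status": "revoked",
        "reason": reason,
        "revoked_in_run": trace.run_id,
    }
    canonical_memory_store.put(tombstone)

    # Invalidation must cascade, or projections resurrect the artifact later.
    derived_index.delete(memory_id)   # illustrative handle
    cache.invalidate(memory_id)       # illustrative handle
    trace.record_event({"event_type": "tombstone", "memory_id": memory_id})
```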
Why Hardening Exists at All
Without asynchronous hardening:
- Latency becomes unpredictable
- Inference blocks on expensive enrichment
- Failures cascade into user-visible errors
Without hardening gates:
- Provisional artifacts influence retrieval prematurely
- Poisoned memory spreads
- Verification becomes retroactive instead of preventive
Hardening separates:
- Canonical truth persistence
- Retrieval eligibility
- Acceleration layer maintenance
This is the canonical contract:
Inference writes minimal truth.
Hardening validates and projects.
Derived layers accelerate.
Architectural Principle
If It Isn’t Canonical, It Isn’t Durable
Durable state must be established synchronously; enrichment happens asynchronously and never blocks correctness.
- The critical path writes minimal canonical truth with identity, scope, provenance, and cost surfaces pinned.
- Embeddings, deduplication, enrichment, and index updates occur off the critical path and are fully rebuildable.
- Replay and audit depend only on canonical records, never on projections or background jobs.
Part VIII: Privacy, Retention, and Cryptographic Boundaries
Isolation protects tenants from each other.
Privacy protects tenants from you.
Privacy is often treated as a feature. In commercial agent systems, it must be treated as a data-plane architecture decision.
If your system promises "we don't retain this," "this run won't train models," "this tenant's data is isolated," or "this is ephemeral", then those guarantees must be visible in your routing, storage, and indexing layers. Not just in marketing copy.
If privacy depends on flags inside the model prompt or conditionals inside business logic, it will eventually fail.
The only durable privacy guarantee is structural separation.
Compliance requires encryption at rest.
Architecture requires partitioned routing.
Threat containment requires domain separation.
Privacy as Architectural Routing
Isolation is mandatory for multi-tenancy. It's a structural invariant.
If Tenant A can influence or see Tenant B’s data, the system is broken.
Privacy posture is different.
Privacy posture is a policy commitment enforced by architecture.
It defines how much the platform itself retains, observes, or learns from tenant activity.
A common anti-pattern:
```python
if privacy_mode:
    disable_logging()
```
This is extremely fragile.
Logging isn't the only place data persists. Data can leak into async queues, derived retrieval indexes, observability pipelines, caches, debug traces, analytics jobs, and model prompt caching layers.
If "privacy mode" relies on remembering to guard every one of those paths, it will eventually fail.
In production, the failure mode isn't malicious. It's incremental. A new logging layer is introduced. A new background job is added. A new cache is deployed. And privacy assumptions silently degrade.
Privacy modes should influence:
- Where traces are written
- Whether promotion is allowed
- Whether retrieval indexes are updated
- Whether embeddings are generated
- Whether objects are persisted
Privacy must route execution differently, not just mask output.
But the critical part is this:
The no-retention path should not share the same physical data plane as the retention path.
Separate buckets.
Separate partitions.
Separate encryption domains.
Otherwise accidental logging becomes inevitable.
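A sketch of what routing-before-writes can look like, with illustrative sink objects (`metadata_only_trace_log`, `ephemeral_object_store`, and so on) standing in for physically separate infrastructure:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class DataPlane:
    trace_sink: Any
    promotion_enabled: bool
    derived_index: Optional[Any]
    object_store: Any

def select_data_plane(envelope) -> DataPlane:
    # The routing decision happens before any write occurs. The two planes do
    # not share buckets, partitions, or encryption domains.
    if envelope.privacy_mode:
        return DataPlane(
            trace_sink=metadata_only_trace_log,   # run_id, timings, token counts only
            promotion_enabled=False,
            derived_index=None,                   # no embeddings generated
            object_store=ephemeral_object_store,  # short TTL, separate key domain
        )
    return DataPlane(
        trace_sink=retained_trace_log,
        promotion_enabled=True,
        derived_index=tenant_partitioned_index,
        object_store=durable_object_store,
    )
```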
What We Learned
Logging Drifted Past Privacy Boundaries
We implemented privacy by masking sensitive fields and stripping explicit PII markers. It appeared compliant.
Over time, sensitive data surfaced in places we had not guarded: tool payloads, derived projections, and trace fragments. Cleanup logic missed new paths as the system evolved.
We moved privacy enforcement to ingestion. Privacy mode became part of the identity envelope, and routing decisions were made before any writes occurred.
Cleanup logic is fragile. Routing is structural. If privacy is enforced after writes, every new logging layer, background job, and cache becomes a potential leak.
The Stronger Pattern: Separate Data Planes
The more defensible pattern (validated by mature SaaS systems) is this: route privacy-mode traffic through a separate path.
That can mean separate replicas, separate storage backends, separate queues, suppressed or redirected observability streams, and distinct retention policies at the storage layer.
The point is structural separation. At minimum, enforce physical storage partitioning and routing-level isolation.
A privacy-aware system may include:
- Retained trace log
- Metadata-only trace log
- Durable structured memory store
- Ephemeral session store
In privacy mode:
- Trace logs may contain only metadata (`run_id`, timestamps, token counts)
- Canonical structured memory may disallow promotion entirely
- Object store writes may be disabled
This is not just policy enforcement.
It is architectural branching.
If your privacy guarantee requires scanning logs after the fact, you are already too late.
Observed Ecosystem Convergence
Privacy Routing as Architectural Separation Is Production Practice
Pattern: Privacy is a routing decision enforced before any writes occur, not a flag checked after.
- Cursor implements separate service replicas and parallel queues/workers per privacy mode, defaulting to privacy-mode if the routing header is missing.
- Warp documents that Business/Enterprise plans operate under zero-data-retention agreements, with the server assuming telemetry disabled if the header is absent.
Safe defaults and physical separation replace conditional scrubbing.
What Privacy Mode Should Actually Control
At minimum, privacy mode should govern:
- Event log retention: Are trace envelopes persisted? If yes, are payloads redacted? If no, is only metadata retained?
- Structured memory promotion: Are promotion gates disabled? Are only session-scope artifacts allowed?
- Derived index writes: Are embeddings created? Where are they stored? Are they scoped to ephemeral partitions?
- Object store persistence: Are large payloads retained? Encrypted with tenant-specific keys? Auto-expiring?
Retention Discipline
If you promise "no retention," define what that means precisely.
Does it mean no canonical trace stored? No structured memory writes? No embedding generation? No analytics logs? No prompt caching? No external model provider retention?
You cannot say "no retention" if the request is still embedded into a global vector index, flows into analytics dashboards, or is logged to a shared debugging stream.
A clean architectural approach:
- Metadata-only trace retention
- run_id, cost, timing
- no payload
- Ephemeral session store
- in-memory or short TTL
- Derived indexes disabled
- Object store writes blocked or TTL-bound
- Separate observability stream with redaction
That way your promise is enforceable, not aspirational.
Retention is where drift hides
Without explicit retention semantics:
- Session artifacts become durable
- Durable artifacts never expire
- Index entries outlive source truth
Every canonical record must carry retention metadata:
- Retention policy ID
- Expiration timestamp or rule
- Sensitivity classification
Example retention policy:
```json
{
  "memory_id": "mem_441",
  "scope": "user",
  "memory_type": "preference",
  "retention_policy": "ttl_90_days",
  "expires_at": "2026-05-16T00:00:00Z"
}
```
Retention must cascade
When a canonical record expires:
- A tombstone is emitted
- Derived projections are invalidated
- Object store references are removed or reclassified
Retention without projection invalidation creates ghost memory.
Designing this after the fact is painful.
Designing it up front as an alternate routing path is tractable.
Privacy cannot be retrofitted.
The more you know
Retention Drift Is Real
Even well-designed systems accumulate "retention drift" over time. A new logging layer is added and isn't privacy-aware. A background embedding job writes to the wrong index. A debug flag writes full transcripts to object storage. A feature team builds a derived index and forgets to scope it.
We quickly learned that deletion alone is not enough. Unless invalidation is tightly coordinated across derived layers, indexes will continue serving content that has already been removed from canonical storage.
The system remained operational. The storage layer behaved correctly.
The issue was lifecycle design. Without synchronized invalidation, derived systems outlive the truth they project.
This is why separation by architecture is stronger than separation by conditionals.
Encryption and Tenant Keys
Isolation is a policy boundary.
Encryption is a cryptographic boundary.
In any commercial system, canonical stores and object stores should be encrypted at rest. In multi-tenant enterprise-grade systems, we need to go further:
- Tenant-scoped encryption keys
- Key rotation policies
- Envelope encryption for object payloads
- Separation of encryption domains between canonical and derived stores
Why this matters:
- A storage misconfiguration should not automatically imply cross-tenant readability.
- Deletion guarantees are stronger when encryption keys can be revoked.
- “No retention” modes can be reinforced with short-lived encryption domains.
- Object stores containing large payloads require stricter cryptographic boundaries than derived summaries.
Recommended pattern:
- Canonical store: tenant-scoped envelope encryption (per-tenant KEK, rotating DEKs)
- Object store: tenant-scoped encryption + content digests (integrity/provenance), with sensitivity-based storage rules
- Retrieval index: tenant-partitioned index; store encrypted IDs + embeddings for curated artifacts only
- Cache layers: tenant-partitioned, TTL-bound caches; encrypt with short-lived keys; never authoritative
- Event log: tenant-partitioned; metadata encrypted with platform-managed KMS for cross-tenant analytics, and payloads encrypted with tenant-scoped envelope encryption (or stored as tenant-encrypted object references)
This allows:
- Tenant-level cryptographic revocation
- Controlled key rotation
- Partitioned blast radius
Key design principle:
If derived systems are compromised, canonical truth and raw payloads should remain cryptographically isolated.
Encryption is not just compliance posture. It reinforces scope boundaries.
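For illustration, envelope encryption can be sketched with symmetric keys. In practice the tenant KEK would live in a KMS, and `Fernet` here is only a stand-in for the platform's cipher:

```python
from cryptography.fernet import Fernet

# Per-tenant KEK: in production this lives in a KMS, generated locally only for illustration.
tenant_kek = Fernet(Fernet.generate_key())

def encrypt_payload(payload: bytes) -> dict:
    # Envelope encryption: a fresh DEK encrypts the payload, the tenant KEK
    # encrypts the DEK. Revoking the KEK revokes everything beneath it.
    dek = Fernet.generate_key()
    return {
        "ciphertext": Fernet(dek).encrypt(payload),
        "wrapped_dek": tenant_kek.encrypt(dek),
    }

def decrypt_payload(record: dict) -> bytes:
    dek = tenant_kek.decrypt(record["wrapped_dek"])
    return Fernet(dek).decrypt(record["ciphertext"])
```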
What We Learned
Shared Encryption Domains Complicated Lifecycle Changes
We used a single tenant-scoped KMS key across canonical storage and derived indexes. It simplified IAM and key management and worked in steady state.
The coupling surfaced when we needed stronger revocation guarantees for canonical data. Because canonical and derived layers shared the same key, a scoped change required coordinated migrations across storage and index layers.
We separated encryption domains. Canonical stores could rotate or revoke independently, and derived indexes used shorter-lived keys that could be discarded during rebuilds.
The cost of a shared encryption domain is invisible until rotation or revocation is required. By then the coupling is load-bearing.
Observed Ecosystem Convergence
Credential and Cryptographic Containment Is Formalizing
Pattern: Secrets and encryption domains are structurally isolated, not trusted to prompts.
- AWS AgentCore Identity provides a token vault that stores user tokens, handles refresh, and secures tool API keys, aligning with structural containment for secrets.
- Cursor documents temporary encrypted caching with client-generated keys that exist server-side only during the request.
- Anthropic's prompt caching stores KV cache representations and cryptographic hashes rather than raw prompt text, supporting zero-data-retention alongside caching economics.
Encryption domains are separating across volatile and durable layers.
The more you know
Canonical vs Derived Encryption Domains
Another subtle pattern:
Canonical and derived layers should not share identical encryption semantics.
Why? Because projections are rebuildable.
If you use identical encryption domains everywhere:
- A projection compromise can reveal canonical references
- Revocation becomes more complicated
Projections are disposable.
Canonical truth must be durable and revocable.
Relationship to Trace Metadata
Privacy mode complicates decision traces. If you suppress traces entirely, you lose replay. If you persist traces indiscriminately, you violate retention promises.
Trace envelopes should include:
- Model version
- Policy version
- Retrieval artifact IDs
- Token counts
- Tool contract versions
But in privacy mode:
- Persist structural trace metadata
- Redact sensitive payloads
- Allow tenants to opt into extended trace retention
- Separate trace retention from model training retention
This preserves operability without over-collecting.
Replay may become limited in privacy mode. That is acceptable.
What is not acceptable is accidentally retaining sensitive content because logging was decoupled from privacy routing.
Cross-Tenant Isolation Reinforced by Cryptography
Scope enforcement in the data plane protects retrieval.
Encryption enforces isolation at rest.
Even if a bug bypasses retrieval filters:
- Encrypted tenant domains limit exposure
- Key management boundaries provide additional containment
Cryptography does not replace logical isolation. It reinforces it.
Architectural Principle
Privacy as Architecture
Privacy must be enforced structurally, not retrofitted procedurally.
- Routing decisions must occur before any durable write.
- Retention policies must cascade across canonical and derived layers.
- Encryption must reinforce scope boundaries at rest and in transit.
Part IX: Cost Surfaces and Token Economics
In traditional cloud systems, cost scales with compute, storage, and network. In agent systems, cost scales with tokens, retrieval volume, context assembly size, tool invocation frequency, and trace retention footprint.
The difference is subtle but profound:
Cost is no longer driven by infrastructure. It's driven by intent.
And intent flows through context.
In commercial agent systems, cost does not scale linearly with traffic.
It compounds with context.
The Four Primary Cost Surfaces
Per-run cost is not just tokens in and tokens out. In commercial systems it is a composite of four surfaces:
- Inference
- Retrieval
- Tooling
- Persistence
Each surface has different scaling behavior.
A disciplined system tracks these surfaces separately so optimization doesn’t become guesswork.
1. Inference Cost
Inference cost is driven by:
- Input tokens
- Output tokens
- Context window size
- Model selection
- Prompt prefix size (caching effects)
As context grows, inference cost grows even if the business logic remains unchanged.
This is the first place drift becomes visible.
2. Retrieval Cost
Retrieval cost scales with:
- Number of documents indexed
- Embedding dimensionality
- Query volume
- Hybrid ranking complexity
- Cross-tenant filtering overhead
As memory accumulates, retrieval cost rises, even if token budgets stay fixed.
Derived projections amplify cost.
3. Tooling Cost
Tool invocation cost includes:
- External API calls
- Database reads
- Connector queries
- Downstream system calls
- Internal compute
- Side effects (including retries and human approval latency)
Tools often dwarf model cost in enterprise environments.
Without trace attribution per run, tooling cost becomes opaque.
4. Persistence Cost
Persistence cost includes:
- Canonical storage
- Object store usage
- Derived index storage
- Log and trace retention
- Backup and compliance overhead
Durable memory is an economic commitment.
Promotion decisions multiply storage cost over time.
The Economic Model of an Agent Run
Every agent run has a composite cost. It is not just tokens in and tokens out.
It is the sum of the four distinct surfaces.
Formally:

cost_per_run = inference_cost + retrieval_cost + tooling_cost + persistence_cost

Where:
- Inference scales with context size.
- Retrieval scales with memory growth.
- Tooling scales with autonomy.
- Persistence scales with promotion discipline.
Most teams focus on inference cost.
In commercial systems, that is usually not the dominant long-term driver.
Layer-Based Token Budgets
One of the most effective structural controls is explicit token budgeting by layer.
Example allocation:
- Global constitution: 500 tokens
- Tenant policy: 800 tokens
- User memory: 600 tokens
- Retrieved episodes/artifacts: 1,200 tokens
- Session state/scratch: 900 tokens
Total: 4,000 token input budget
Without allocation, session state will crowd out policy.
Without allocation, retrieval will crowd out global constraints.
Layer budgets:
- Preserve guardrails
- Bound token cost
- Reduce drift
- Enable cost prediction
Budget discipline is architectural, not cosmetic.
Without layer-specific budgets, tenant memory can crowd out safety rules, session noise can crowd out durable facts, and retrieval bloat can drown signal.
Example Token Budget:
```python
TOKEN_BUDGET = 4000
allocation = {
    "global": 500,
    "tenant": 800,
    "user": 600,
    "retrieved": 1200,
    "session": 900
}
```
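Enforcement is what makes the allocation real. A sketch, assuming hypothetical `count_tokens` and `truncate_to_tokens` helpers and pre-scoped content per layer:

```python
def assemble_context(layers: dict[str, str], allocation: dict[str, int]) -> str:
    # Each layer is truncated to its own cap, so session noise can never crowd
    # out policy, and retrieval can never crowd out the global constitution.
    parts = []
    for layer, budget in allocation.items():
        text = layers.get(layer, "")
        if count_tokens(text) > budget:               # hypothetical helper
            text = truncate_to_tokens(text, budget)   # hypothetical helper
        parts.append(text)
    return "\n\n".join(parts)
```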
What We Learned
Budgets Matter
We did not enforce strict per-layer token budgets. Context size expanded gradually across runs, sessions, and memory layers.
When traffic spiked, costs did not rise linearly. They jumped. Context growth had been compounding quietly.
We introduced explicit token allocations by layer: global, tenant, user, retrieved, and session. Each had a hard cap.
Without per-layer budgets, context discipline is aspirational. With them, it becomes both quality control and economic control.
Prompt Caching and Context Hashing
In commercial systems, many runs share identical static context layers: global constitution, tenant policy, stable user preferences.
These layers should be explicitly versioned and constructed deterministically.
A stable hash of the static prefix:
```python
prefix_hash = hash(global_constitution + tenant_policy_version)
```
This enables:
- Observability of policy drift
- Deterministic replay
- Compatibility with provider-side prefix caching
- Reduction of redundant prefix construction
When supported by the model provider, identical prefixes may benefit from prompt caching at the infrastructure layer, reducing repeated token charges. Even without provider caching, explicit prefix hashing encourages disciplined versioning and makes constitutional changes visible and auditable.
This reduces inference cost volatility.
But only if:
- Prefix boundaries are stable
- Versioning is explicit
- Trace logs record prefix version
Otherwise cache invalidation becomes opaque.
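A hedged sketch of deterministic prefix hashing; the builtin `hash()` above is shorthand, and a content hash such as SHA-256 is what you would actually record in the trace:

```python
import hashlib

def stable_prefix_hash(global_constitution: str, tenant_policy: str, policy_version: str) -> str:
    # Deterministic construction: identical inputs yield identical hashes,
    # which is what provider-side prefix caching and replay both depend on.
    raw = f"{policy_version}\n{global_constitution}\n{tenant_policy}".encode()
    return "sha256:" + hashlib.sha256(raw).hexdigest()
```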
The more you know
Why Prefix Caching Changes the Incentives
If static layers are cacheable, the marginal cost shifts toward the dynamic tail:
- retrieval payloads
- tool traces
- accumulated session state
That makes compaction and progressive disclosure even more valuable, because you stop paying repeatedly for the same invariant prefix and start paying almost entirely for what you chose to assemble.
Observed Ecosystem Convergence
Prompt Caching and Token Accounting Are Platform Primitives
Pattern: Caching is an architectural cost lever with explicit economic telemetry.
- Anthropic supports automatic and explicit cache breakpoints, storing cryptographic hashes rather than raw text.
- Google Gemini provides a first-class "cached content" object including system instructions and tool configuration.
- Google Cloud Vertex AI reports `cachedContentTokenCount` in response metadata for explicit economic telemetry.
- Amazon Bedrock exposes prompt caching with cache checkpoints and distinct pricing semantics.
All four platforms treat caching as an architectural decision, not an optimization afterthought.
What Token Drift Looks Like
Architectural discussions about token budgets stay abstract until you attach numbers.
Consider a representative workload:
- Early average input: ~1,800 tokens
- Output: ~400 tokens
- Total per run: ~2,200 tokens
- Approximate cost: ~$0.02–$0.03 per run
note
The exact dollar amounts vary by model pricing, input/output rate asymmetry, and whether static prefixes benefit from caching, but the drift dynamics are consistent.
Now introduce three common forms of drift:
- Retrieval surface expands
- Session transcripts are retained instead of compacted
- Promotion frequency increases
Six weeks later, average input grows to ~2,900 tokens. Output remains stable.
Per-run cost rises to ~$0.035–$0.045.
That increase feels small in isolation. It is a 50–70% jump that rarely triggers alarms during development.
At 50,000 runs per day:
- $0.025 → $0.040 average
- ~$22,000 monthly delta
No new features shipped. No model changed.
Context simply accumulated.
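The arithmetic behind those numbers, assuming 30 billable days per month:

```python
runs_per_day = 50_000
baseline_cost = 0.025   # USD per run before drift
drifted_cost = 0.040    # USD per run six weeks later

monthly_delta = (drifted_cost - baseline_cost) * runs_per_day * 30
print(round(monthly_delta))  # 22500 — roughly the monthly delta described above
```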
Why Cost Compounds
Drift rarely starts in inference.
It starts in promotion.
- Every durable memory write increases the retrieval surface.
- A larger retrieval surface increases context payload size.
- Larger payloads increase inference cost.
- Higher inference cost creates pressure to promote summaries.
- Promotion increases durable memory.
The system compounds.
This is why token drift is rarely linear.
It is structural.
Promotion as an Economic Multiplier
Every promotion to durable memory:
- Increases future retrieval size
- Increases index cardinality
- Increases token pressure
- Increases retention cost
Promotion frequency directly influences long-term cost curve.
If 5% of runs promote tenant-scope artifacts at 50,000 runs/day, that is 2,500 new durable records daily.
Reducing tenant-level promotion from 5% to 1% materially slows both storage and token growth curves.
Promotion discipline is not just governance. It is cost control.
Progressive Disclosure vs Up-Front Retrieval
Early systems (like chatbots) generally used a simplistic approach:
- Query retrieval layer broadly
- Insert top-k results
- Let the model sort it out
This often "works," but it lacks discipline.
Not all context should or needs to be visible at once.
Progressive disclosure is the better pattern:
- Retrieve minimal context
- Attempt inference
- If low confidence, retrieve additional context
- Repeat
This avoids the “dump everything into the prompt” anti-pattern.
```python
context = retrieve_minimal(query)
result, confidence = infer(query, context)

if confidence < 0.6:
    context += retrieve_additional(query)
    result, confidence = infer(query, context)
```
What We Learned
More Context Did Not Mean Better Answers
We initially retrieved as much context as possible up front. The assumption was simple: more knowledge should produce better answers.
It did not. Up-front retrieval increased input tokens by roughly 25–30% compared to progressive disclosure, with negligible accuracy difference and slightly lower hallucination risk under staged loading.
We shifted to progressive disclosure. Context was loaded incrementally based on need and explicit budget constraints.
Overloading the context window does not increase intelligence. It increases ambiguity. Progressive disclosure is both a correctness control and an economic control.
Progressive disclosure improves precision, reduces token usage, and stabilizes behavior.
Layer budgets, compaction, promotion limits, and retrieval discipline are not optimizations. They are economic control surfaces.
Progressive disclosure is both correctness control and cost control.
Observed Ecosystem Convergence
Progressive Disclosure Is Replacing Up-Front Context Dumps
Pattern: Minimal context first, expand only when needed.
- Replit avoids dumping full error output into context, instead injecting a minimal signal that tells the agent to pull details via a log tool on demand.
- Anthropic's Skills guide formalizes multi-level loading: metadata always present, full instructions loaded only when needed, linked files navigated on demand.
- Bolt argues that richer model context can reduce wasted iterations, framing token investment as an economic decision.
Context is loaded progressively, not preloaded exhaustively.
Observing Cost as a First-Class Signal
Cost is not a monthly bill. It is a per-run property of execution.
Beyond the obvious token charges, agent systems incur structural cost across multiple dimensions:
- Context assembly cost
- Retrieval indexing and reranking cost
- Promotion and durable writes
- Retention and storage footprint
- Tool execution cost
- Observability overhead
If these are not attributed per run, you cannot tune budgets, compare retrieval strategies, or detect runaway behavior early.
Every trace envelope should include:
- Model and prefix version
- Tokens in / tokens out
- Static prefix tokens vs dynamic context tokens
- Retrieval artifact count and bytes retrieved
- Rerank candidate count
- Tool invocations and retry count
- Promotion writes by scope
- Estimated cost (model + tools)
- Latency
- Hardening queue lag (if async memory is used)
Tracing is not logging.
It is cost instrumentation.
Example cost instrumentation view (derived from the trace envelope):
```json
{
  "run_id": "01J...",
  "usage": {
    "tokens_in": 1832,
    "tokens_out": 412,
    "static_prefix_tokens": 620,
    "dynamic_context_tokens": 1212
  },
  "retrieval": {
    "count": 8,
    "bytes_in": 14523,
    "rerank_candidates": 32
  },
  "tools": {
    "invoked": 2,
    "retry_count": 1
  },
  "promotions": 1,
  "estimated_cost_usd": 0.0231
}
```
What We Learned
Trace Gaps Obscured Cost
We initially modeled cost using aggregate token counts. When expenses rose, we could not explain why. Trace envelopes lacked layer-level token breakdowns, retrieval volume by scope, promotion counts, and model or prefix versioning.
Cost analysis required manual reconstruction of how context had been assembled.
We extended trace envelopes to include tokens per layer, retrieval artifact IDs, promotion counts, and model and prefix versions. Cost attribution became deterministic instead of anecdotal.
Cost modeling is only as strong as trace fidelity. If you cannot decompose a run into its cost surfaces, optimization is guesswork.
Context Is Now an Economic Surface
In commercial agent systems, infrastructure cost scales predictably.
Context cost compounds.
It compounds with:
- Retrieval volume
- Assembly size
- Promotion rates
- Compaction discipline
- Retention duration
Progressive disclosure turns these into controllable surfaces. Retrieve minimally. Expand only when needed.
This reduces token volume and stabilizes behavior without sacrificing accuracy.
Teams that engineer these surfaces control cost.
Teams that treat context as a byproduct discover it later.
Cost rarely grows with traffic alone.
It grows with unmanaged context.
Cost control is not model negotiation.
It is deliberate context engineering.
Architectural Principle
Context Has a Meter
Every context decision carries measurable economic cost.
- Retrieval, promotion, compaction, and retention are cost surfaces.
- Attribution must exist per run.
- Unmeasured cost compounds invisibly.
Part X: Evaluation as Context Discipline
Most teams think evaluation means scoring model outputs.
In commercial agent systems, evaluation means something far more structural:
You are evaluating context assembly under constraint.
Models are interchangeable components.
Context discipline is the system.
What You Are Actually Evaluating
A commercial agent run is the composition of:
- Identity resolution
- Scope filtering
- Retrieval selection
- Layer budgeting
- Semantic stabilization
- Garbage collection
- Tool invocation
- Policy enforcement
- Promotion gating
- Lifecycle transitions
- Cost surfaces
Evaluation must ask:
- Was the correct context retrieved?
- Were scope boundaries enforced?
- Were policies visible in the working set?
- Did promotion decisions follow rules?
- Did cost remain within expected bounds?
- Would this run behave the same way tomorrow?
If you cannot answer those questions deterministically, you are not evaluating autonomy. You are sampling outputs.
Deterministic Replay Is the Baseline
Evaluation begins with replay.
Every run must capture:
- Model version
- Policy version
- Prefix hash
- Retrieval artifact IDs
- Tool contract versions
- Promotion decisions
- Token counts
- Cost breakdown by surface
Replay fixes those variables and re-executes the run.
If output changes under identical inputs, you have drift.
Drift can originate from:
- Model upgrades
- Retrieval index mutation
- Promotion contamination
- Prefix instability
- Lifecycle state changes
Replay is not debugging.
Replay is architectural validation.
Without replay, optimization becomes guesswork.
Example replay validation view (derived from the trace envelope):
```json
{
  "run_id": "run_01J...",
  "model": {
    "version": "claude-sonnet-4-6"
  },
  "prefix": {
    "version": "constitution_v12 + tenant_policy_v8",
    "hash": "sha256:..."
  },
  "retrieval": {
    "artifact_ids": ["mem_88", "mem_104"]
  },
  "tool_contracts": {
    "refund_api": "v3.1",
    "crm_lookup": "v2.4"
  },
  "input_token_breakdown": {
    "global": 480,
    "tenant": 720,
    "user": 610,
    "retrieval": 1150,
    "session": 840
  },
  "output_tokens": 410,
  "promotion_count": 1,
  "estimated_cost": 0.082
}
```
Replay without artifact IDs becomes probabilistic.
Replay without versioned prefix becomes inaccurate.
Replay without tool contract versions becomes misleading.
Replay is only deterministic if the system treats versioning as a first-class discipline.
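A small sketch of drift detection against a recorded envelope, using the same keys as the replay view above; what counts as a meaningful delta is a policy decision, not something this snippet prescribes:

```python
def detect_drift(original: dict, replayed: dict) -> list[str]:
    # Compares a replayed run against its recorded trace envelope.
    # Keys mirror the replay view above; any mismatch is a drift signal.
    drift = []
    for key in ("prefix", "retrieval", "tool_contracts"):
        if original.get(key) != replayed.get(key):
            drift.append(f"{key} changed under replay")
    if original.get("output_tokens") != replayed.get("output_tokens"):
        drift.append("output changed under identical inputs")
    return drift
```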
Observed Ecosystem Convergence
Replay Infrastructure Is Becoming a Platform Expectation
Pattern: Trace envelopes and versioned artifacts are becoming standard replay primitives.
- Anthropic's evals guidance defines "transcript/trace/trajectory" as the canonical unit of evaluation.
- OpenAI's Responses API provides durable response IDs, `previous_response_id` threading, metadata, and detailed usage including cached tokens.
- OpenAI Agents SDK preserves per-request usage breakdowns (`request_usage_entries`), enabling measurable cost surfaces per run.
Replay is shifting from ad hoc reconstruction to platform-supported infrastructure.
What We Learned
Evaluation Without Replay Was Guesswork
Before full trace envelopes, evaluation relied on spot-checking outputs, comparing example runs, and manually reconstructing context. It was slow, subjective, and blind to subtle cost regressions.
Without versioned prefixes, artifact IDs, layer-level token breakdowns, and tool contract versions, we could not deterministically explain why behavior changed.
We extended trace fidelity to include these elements. Replay became deterministic. Evaluation became measurable. Upgrade decisions became evidence-based instead of intuition-driven.
If you cannot decompose what changed between two runs, evaluation is opinion. Trace fidelity turns it into measurement.
Model Upgrades Are Context Stress Tests
Frontier models evolve.
Tokenization changes.
Tool calling behavior shifts.
Summarization becomes more aggressive.
Safety layers adjust.
A model upgrade is not just a capability change.
It is a context stress test.
Under a new model:
- Does retrieval selection change?
- Does semantic stabilization behave differently?
- Does promotion frequency increase?
- Do policies get crowded out?
- Does token usage shift by layer?
Evaluation harnesses must compare:
- Old model + fixed context
- New model + identical context
If behavior shifts materially, you have learned something about your context design.
Stronger models do not fix weak context discipline. They amplify it.
Promotion Drift Is an Evaluation Surface
Evaluation must track:
- Promotion rate by scope
- Promotion type distribution
- Provisional → active transitions
- Quarantine frequency
- Tombstone rate
- Memory growth curve
A rising tenant-scope promotion rate is not a feature win.
It is an economic and governance signal.
Promotion drift compounds:
- Retrieval cardinality
- Token pressure
- Storage cost
- Cross-scope contamination risk
Evaluation is not just output quality.
It is structural hygiene.
Cost as a First-Class Evaluation Metric
Every run should emit:
- Inference tokens
- Retrieval volume
- Tool invocation count
- Persistence writes
- Derived index updates
Cost should be evaluated against:
- Task type
- Layer budget allocation
- Promotion decision
- Model version
- Prefix hash
If cost shifts without intentional architectural change, context drift is occurring.
Evaluation must detect this early.
Observed Ecosystem Convergence
Cost Observability Is Becoming a Standard API Surface
Pattern: Per-run cost attribution is emerging as a platform primitive.
- OpenAI Agents SDK exposes per-request usage breakdowns for fine-grained drift detection.
- Google Cloud Vertex AI reports cached token counts in response metadata, making cache economics visible.
- GitHub Copilot packages coding agent capabilities into enterprise plans with centralized audit logs and usage telemetry exports.
Production agents are expected to be observable and cost-accountable at org scope.
Confidence and Context Sufficiency
Evaluation is incomplete without measuring sufficiency.
When progressive disclosure is used, you should measure:
- How often additional retrieval was required
- Confidence thresholds triggering expansion
- Cost delta between minimal vs expanded context
If confidence thresholds drift upward over time, context may be degrading.
Evaluation is about understanding when the system needs more context and when it is over-consuming it.
From Runs to Decision Graphs
Once trace envelopes are disciplined, something more powerful emerges.
Every run captures:
- What was visible
- What influenced action
- Which policies applied
- What was written
- What cost was incurred
Over time, this becomes a structured corpus of decisions.
You can analyze:
- Policy application frequency
- Precedent influence
- Cost distribution by task type
- Drift across model versions
- Promotion patterns by scope
That corpus is the foundation of decision graphs.
Not theoretical graphs.
Replayable, auditable, cost-attributed decision networks.
But they only emerge if evaluation is embedded from the beginning.
Replay Is Not Just for Model Upgrades
Replay supports:
- Model upgrade validation
- Policy changes
- Retrieval tuning
- Prompt modifications
- Tool migration
- Cost optimization experiments
If your evaluation loop depends on synthetic examples, it will miss edge cases.
Production traces are your regression corpus.
Model Upgrade Replay
Model upgrades are the most obvious replay scenario.
Upgrade process:
- Freeze a representative replay corpus from production traces
- Re-run corpus with new model version
- Compare:
- Outputs
- Policy adherence
- Token usage
- Tool invocation patterns
- Cost
Differences are classified:
- Improvement
- Neutral
- Regression
Without replay, upgrade evaluation becomes anecdotal.
Without cost comparison, upgrade risk becomes financial.
Policy Change Evaluation
When tenant or global policy changes:
- Re-run affected traces
- Detect differences in approval decisions
- Validate no unintended escalation of privilege
- Validate no suppression of required actions
Policy becomes versioned infrastructure, not live mutation.
Policy drift without replay is invisible.
Retrieval Strategy Testing
Changes to:
- Embedding model
- Chunk size
- Hybrid ranking weights
- Scope filtering rules
- Compaction rules
- Progressive disclosure thresholds
should be replayed across prior runs.
Key metrics:
- Retrieval artifact count
- Token usage by layer
- Hallucination rate
- Tool invocation deltas
Retrieval changes affect both quality and cost.
Replay reveals both.
This turns retrieval tuning from intuition into measurement.
Shadow Mode
Replay supports offline evaluation. But mature systems also support live shadowing.
Pattern is straightforward:
- Production executes normally and output remains authoritative.
- In parallel, a shadow run uses:
- A new model version
- A modified retrieval strategy
- New compaction rules
- Etc.
- Shadow outputs are logged but not surfaced to users
- Differences are analyzed asynchronously.
This reduces risk during:
- Model migrations
- Context restructuring
- Index rebuilds
- Policy refactors
It also enables support for:
- Controlled rollout
- Drift detection
- Confidence building
Shadow mode turns architectural evolution into controlled iteration.
What We Learned
Shadow Mode Caught What Staging Missed
We upgraded to a newer model and passed regression tests in staging. Mechanically, everything worked.
What staging did not reflect was production memory: real tenant configurations, months of promoted episodes, and live retrieval surfaces. When we enabled the new model in shadow mode on a small slice of production traffic, subtle behavioral deviations appeared immediately. Policy language was interpreted differently, shifting tool selection and response framing.
We corrected the prompts before full rollout.
Staging validates mechanics. Shadow mode validates behavior against accumulated memory. They test different failure surfaces.
Regression Suites from Real Traffic
Synthetic test cases rarely capture:
- Long-tail edge cases
- Rare entitlement combinations
- Complex multi-step tool chains
- High-context runs
The strongest evaluation corpus is your own trace log:
- Sample representative runs across tenants and task types.
- Strip or redact sensitive payloads if required.
- Store structured replay fixtures.
- Version them alongside policy and model artifacts.
Your production history becomes your test suite.
Over time, you accumulate:
- Edge cases
- Near-failures
- Policy conflicts
- Retrieval anomalies
- Promotion mistakes
Those are more valuable than curated prompts.
Without real traces, evaluation remains shallow.
Architectural Principle
Evaluate Integrity, Not Just Outputs
Evaluation must validate context behavior, not only model responses.
- Isolation, scope, promotion, and lifecycle are testable.
- Replayability is non-optional.
- If you cannot reconstruct it, you cannot trust it.
Part XI: Model Versioning and Upgrade Discipline
Model upgrades are not configuration changes.
They are behavioral migrations.
In commercial agent systems, a model upgrade can alter:
- Interpretation of policy language
- Tool invocation sequencing
- Confidence thresholds
- Hallucination patterns
- Token usage patterns
- Sensitivity to ambiguity
- Response verbosity
- Context utilization behavior
If model upgrades are treated casually, system behavior becomes unpredictable.
Model evolution must be controlled at the architectural level.
Version Everything That Influences Behavior
Model versioning alone is insufficient.
Behavior is a function of:
- Model name and version
- Temperature and inference parameters
- Static prefix hash
- Policy artifact versions
- Tool contract versions
- Retrieval configuration version
- Compaction strategy version
If any of them change without trace visibility, replay fidelity degrades and regression analysis becomes impossible.
Model versioning is part of the canonical contract.
This makes behavioral provenance explicit.
Without it, you cannot explain deltas.
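A sketch of a behavior manifest stamped onto every trace envelope; the fields mirror the list above, and the fingerprint gives you one value to diff when explaining deltas (all names are illustrative):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BehaviorManifest:
    model: str                              # provider + model version string
    temperature: float
    static_prefix_hash: str
    policy_versions: tuple[str, ...]
    tool_contract_versions: tuple[str, ...]
    retrieval_config_version: str
    compaction_strategy_version: str

    def fingerprint(self) -> str:
        """Stable hash over everything that influences behavior."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

If two runs disagree but their fingerprints match, the delta came from inputs or sampling nondeterminism. If the fingerprints differ, the manifest tells you exactly which artifact moved.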
Observed Ecosystem Convergence
Versioned Behavior Contracts Are Platform-Level Primitives
Pattern: Platforms expose structured versioning for tools, models, and configuration.
- OpenAI's Responses API includes explicit tool configuration fields (tools, tool_choice, parallel_tool_calls), detailed token usage with caching breakdowns, and metadata.
- Google Gemini formalizes structured tool invocation where the model returns structured parameters.
- Anthropic's prompt caching uses hash-based storage, making constitutional and policy prefix changes auditable.
Behavioral provenance is becoming explicit, not implicit.
The Upgrade Sequence
A disciplined upgrade follows this sequence:
1. Freeze a Replay Corpus
Select representative production traces that include:
- High-context runs
- Multi-step tool chains
- Policy-sensitive flows
- Edge-case approvals
Lock the artifact IDs and prefix versions. These runs become your regression suite.
2. Deterministic Replay
Run the corpus under:
- Old model
- New model
Hold all other variables constant.
Compare:
- Output correctness
- Policy adherence
- Token usage
- Tool invocation patterns
- Cost
3. Shadow Production
Deploy new model in shadow mode.
- Run in parallel
- Record outputs
- Do not affect user-visible behavior
Track:
- Output divergence
- Policy deviations
- Token deltas
- Latency shifts
Replay validates determinism. Shadow validates live distribution behavior.
4. Progressive Rollout
Gradually increase traffic percentage.
Monitor:
- Drift signals
- Cost per run
- Tool call volume
- Promotion rate
Roll back immediately if invariants break.
5. Rollback Readiness
Rollback must be:
- Instant
- Deterministic
- Stateless
If rollback requires schema migration, reindexing, or cache rebuilding, you have coupled layers incorrectly.
Architecture must remain stable while models evolve.
Versioning without rollback is theater.
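A sketch covering both the rollout and the rollback path, assuming the rollout percentage and kill switch live in a config store you can flip without redeploying; the names are illustrative:

```python
import hashlib

def use_new_model(tenant_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket tenants so the same tenant always sees the same model."""
    bucket = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

def select_model(tenant_id: str, config: dict) -> str:
    # Rollback is a config write: flip the kill switch or set rollout_percent to 0.
    # No schema migration, no reindex, no cache rebuild.
    if config.get("kill_switch") or not use_new_model(tenant_id, config["rollout_percent"]):
        return config["stable_model"]
    return config["candidate_model"]
```

Deterministic bucketing keeps each tenant on one model for the duration of the rollout, which makes drift signals attributable.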
Context Window Changes Are Architectural Events
Larger context windows tempt teams to relax discipline.
Common failure pattern:
- “The window is bigger now.”
- Retrieval breadth increases.
- Session history persists longer.
- Promotion discipline loosens.
Short term: outputs improve.
Medium term: token cost explodes.
Long term: drift becomes embedded.
A larger window does not eliminate:
- The need for compaction
- The need for scope boundaries
- The need for promotion gating
- The need for evaluation
It magnifies the consequences of ignoring them.
Window size is a constraint multiplier.
Budgets are architectural, not model-driven.
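A sketch of per-layer budgets that do not move when the window grows; the layer names and numbers are illustrative:

```python
# Budgets are set by architecture and economics, not by the model's window size.
LAYER_BUDGETS = {
    "static_prefix": 2_000,
    "policy": 1_500,
    "retrieval": 6_000,
    "session_history": 3_000,
    "scratchpad": 1_500,
}

def enforce_budgets(layers: dict[str, list[str]], count_tokens) -> dict[str, list[str]]:
    """Trim each layer to its budget; overflow is compacted or dropped, never carried forward."""
    trimmed = {}
    for name, chunks in layers.items():
        budget, kept, used = LAYER_BUDGETS[name], [], 0
        for chunk in chunks:                  # chunks assumed ordered by priority
            cost = count_tokens(chunk)
            if used + cost > budget:
                break
            kept.append(chunk)
            used += cost
        trimmed[name] = kept
    return trimmed
```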
Separation of Concerns Across Layers
Model upgrades should not require:
- Rewriting canonical storage
- Reindexing entire memory store
- Altering encryption boundaries
- Modifying promotion semantics
If they do, the architecture is over-coupled.
Layer separation ensures:
- Canonical store remains stable
- Derived projections can be rebuilt
- Retrieval configuration can evolve independently
- Compaction strategy can be tuned without rewriting history
Models change. Architecture must remain durable.
If your system entangles model behavior with memory writes or policy evaluation, upgrades will feel dangerous.
If layers are cleanly defined, upgrades become controlled experiments.
Upgrade Without Replay Is Operational Risk
If you upgrade a model without replay:
- You cannot quantify regression
- You cannot quantify cost delta
- You cannot detect policy drift
- You cannot isolate behavior change to the model
You are trusting a black box to remain aligned.
In enterprise systems, that is unacceptable.
What We Learned
Model Upgrade Without Full Trace Was Fragile
In early iterations, we captured model version and final output. We did not capture prefix version, artifact IDs, layer token breakdowns, or tool contract versions.
When behavior changed after a model upgrade, we could not determine whether the cause was retrieval, policy visibility, compaction, or the model itself.
We extended trace envelopes to capture full execution context. Upgrade analysis became surgical instead of speculative.
If you cannot isolate what changed, every model migration is a bet. Trace fidelity turns it into a controlled experiment.
Architecture Must Outlive Models
Frontier models are constantly evolving.
Your architecture must survive for years.
Design rule:
Models are replaceable components.
Context engineering is the durable asset.
Observed Ecosystem Convergence
Architecture Is Outliving Individual API Generations
Pattern: Systems that decouple canonical state from model-specific behavior survive API transitions.
- OpenAI's migration from Assistants to the Responses API, with a published sunset date (August 26, 2026), demonstrates why architecture must remain stable while models evolve.
- Vercel Agent is converging on "skills" and MCP-like tooling ecosystems with machine-readable docs designed for agent consumption.
Models are replaceable components. Context architecture is the durable asset.
Architectural Principle
Models Are Mutable
Model upgrades must be treated as behavioral migrations, not routine swaps.
- Version everything that influences behavior.
- Replay and shadow before exposure.
- Rollback must be immediate while architecture remains stable.
Part XII: From RAG to a Context Engine
Many teams start with RAG.
Retrieve. Append. Generate. Repeat.
For prototypes, this works. For commercial systems, it does not.
RAG solves recall. It does not solve isolation, replay, promotion discipline, retention enforcement, lifecycle management, cost predictability, or upgrade safety.
RAG is a retrieval pattern.
A context engine is an architectural system.
Why Basic RAG Is Not Enough
Basic RAG assumes:
- Retrieval is stateless
- Memory is external
- The model is the primary reasoning surface
- History can be appended safely
Commercial agent systems violate all of these assumptions.
They require:
- Long-lived memory
- Scoped isolation
- Cross-tenant guarantees
- Durable decisions
- Cost accounting
- Evaluation loops
If you build on naive RAG, retrieval becomes accidental truth, transcripts become memory, promotion becomes implicit, drift becomes structural, and cost becomes unpredictable.
RAG is a building block. It is not the architecture.
Observed Ecosystem Convergence
RAG Alone Has Proven Insufficient
Pattern: Every major platform has added lifecycle, gating, and scoping layers on top of basic retrieval.
- Sourcegraph Cody documented moving away from embeddings-only retrieval back in 2024, partly because sending code to a third party created operational complexity. This echoes the earlier point that derived layers become toxic when they silently become the system of record.
- Every platform reviewed, from Claude Code to Bedrock AgentCore to Letta, has added lifecycle, gating, and scoping layers beyond retrieval.
Retrieval is a building block. It is not the architecture.
What a Commercial Context Engine Looks Like
A commercial context engine has:
- Typed memory, not blobs
- Scoped storage, not implicit visibility
- Canonical event log, not accumulated transcript fragments
- Structured memory store, not unbounded conversation residue
- Promotion gates, not automatic persistence
- Hardening lifecycle, not synchronous promotion shortcuts
- Lifecycle garbage collection, not window overflow
- Hybrid retrieval, not vector-only recall
- Token budgets by layer, not "fit what you can"
- Privacy as routing architecture, not configuration flags
- Trace envelopes, not black-box runs
- Encryption boundaries, not shared trust assumptions
- Structured trace envelopes, not free-form logs
- Deterministic replay, not probabilistic reconstruction
- Versioned artifacts, not silent drift
It assembles context deliberately. It reconstructs context per run. It promotes intentionally. It enforces isolation structurally. It measures cost explicitly. It evolves through replay.
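A sketch of what typed, scoped memory can look like in practice; the record types, states, and fields are illustrative rather than any specific platform's schema:

```python
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    FACT = "fact"
    DECISION = "decision"
    PREFERENCE = "preference"
    EPISODE_SUMMARY = "episode_summary"

class PromotionState(Enum):
    CANDIDATE = "candidate"     # observed, not yet trusted
    PROMOTED = "promoted"       # passed promotion gates
    RETIRED = "retired"         # superseded or expired

@dataclass(frozen=True)
class MemoryRecord:
    tenant_id: str
    user_id: str | None
    memory_type: MemoryType
    content: str
    source_trace_id: str        # provenance back to the canonical event log
    promotion_state: PromotionState
    policy_version: str

def visible_records(records: list[MemoryRecord], tenant_id: str) -> list[MemoryRecord]:
    """Scope filter applied in the data plane; the model never sees other tenants' memory."""
    return [
        r for r in records
        if r.tenant_id == tenant_id and r.promotion_state is PromotionState.PROMOTED
    ]
```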
Observed Ecosystem Convergence
The Context Engine Pattern Is the Common Destination
Pattern: Independent systems converge on the same structural invariants once they operate agents in production.
- Across Claude Code, Cursor, Letta, OpenClaw, Amazon Bedrock AgentCore, OpenCode, Warp, Windsurf, Bolt, and Lovable, the convergence is structural: multi-tenant boundaries, durable memory, tool orchestration, retention rules, replayable failures, and predictable economics.
The interfaces differ, but the invariants do not.
Architectural Principle
The Four Constraints of Context
Context must be constrained, versioned, scoped, and observable.
- Constrained by token budgets, compaction rules, and retrieval limits
- Versioned by policy artifacts, trace metadata, and constitution changes
- Scoped by tenant boundaries, user isolation, and memory type
- Observable through trace envelopes, replay, cost accounting, and promotion logs
When those properties exist, drift becomes diagnosable, isolation becomes provable, retention becomes enforceable, and cost becomes bounded.
Without them, cost, drift, and isolation become outcomes you discover instead of variables you control.
Conclusion: Engineering Context
Ecosystem Convergence
Over the past year, a clear pattern has emerged. Teams building IDE agents, coding copilots, orchestration layers, and model APIs, often in isolation and under very different product pressures, have arrived at strikingly similar system designs.
Throughout this guide, we've cited specific architectural signals from these systems. The convergence is not anecdotal. It spans memory design, isolation enforcement, trace capture, retrieval strategy, cost accounting, and promotion discipline.
Where the Signals Come From
Constraint, Not Coincidence
Across these systems, the same invariants appear:
- Memory is layered, scoped, and explicitly governed
- Durable memory requires deliberate promotion
- Context is assembled per run, not carried forward as a growing window
- Derived layers are projections rebuilt from canonical truth
- Retrieval combines structured and semantic signals
- Tool use is typed and contract-bound
- Isolation is architectural, not optional
- Replay and traceability are mandatory
- Async processing moves heavy work off the critical path
- Cost and token budgets are explicit control surfaces
These systems were built independently. They landed on the same patterns because production agents force these constraints.
Context is no longer a prompt engineering trick. It is infrastructure.
You do not have to adopt any specific framework. But if you ignore these structural constraints, you will rediscover them through drift, cost escalation, replay failures, privacy incidents, promotion poisoning, and upgrade regressions.
Convergence is a signal. And one you should listen to.
The more you know
Why There Is No Simple Maturity Model
It is tempting to reduce commercial agent systems to a clean ladder.
Level 1: chatbot + vector DB
Level 2: scoped memory
Level 3: canonical logs
Level 4: evaluation harnesses and replay
Level 5: privacy routing and cost instrumentation
In practice, maturity is multi-dimensional.
A system may have strict isolation but weak promotion discipline.
It may version models correctly but lack replay fidelity.
It may instrument cost but fail to control lifecycle drift.
It may run evaluation loops but persist unverified memory.
Isolation, promotion, lifecycle hardening, replay, evaluation, cost instrumentation, and privacy architecture evolve independently.
The parts described above define the axes.
Where your system sits along each axis determines its maturity.
The Architecture That Survives
A context engine exists to enforce three structural invariants:
- Structural Isolation
  Tenant boundaries are enforced in the data plane, not implied in prompts. Scope, partitioning, encryption domains, and routing make cross-tenant contamination structurally impossible, not statistically unlikely. Isolation is not a guardrail. It is a boundary.
- Deterministic Replay
  Every decision must be reconstructible. Versioned prefixes, artifact IDs, tool contracts, and trace envelopes turn autonomy from guesswork into debuggable infrastructure. If you cannot replay a run, you cannot trust it. If you cannot trust it, you cannot evolve it.
- Economic Predictability
  Cost must be bounded by architecture. Layered token budgets, promotion discipline, constrained retrieval, and per-run cost attribution prevent context from compounding silently. If cost emerges from accumulation instead of design, scale becomes drift.
If any of these are optional, autonomy will eventually degrade.
Not suddenly. Not catastrophically. Gradually.
Guardrails fade. Costs creep. Memory drifts. Isolation weakens.
The systems that survive production pressure are NOT the ones with the best prompts.
They are the ones where context is engineered as infrastructure.
And infrastructure either holds, or it doesn’t.
A Note on Adoption
This guide describes the full architecture that production pressure eventually demands.
No team should attempt to implement all of it at once.
The patterns here are not a checklist. They are a reference architecture. Your system's constraints determine which layers matter first.
If you are not yet multi-tenant, isolation can be deferred. If you are not under compliance or audit pressure, the full hardening pipeline and lifecycle state machine can wait. If your agent runs are short-lived and stateless, promotion gating matters less than retrieval discipline. If cost is not yet a problem, trace envelopes and per-run attribution can follow later.
Start with the constraints your use case actually imposes:
- If you handle enterprise data across tenants, start with isolation and scoped memory.
- If you need to debug behavioral drift, start with trace envelopes and replay.
- If cost is compounding, start with layer budgets and promotion discipline.
- If you are building durable memory, start with truth vs acceleration separation.
The full architecture is the destination. Your roadmap determines the order of arrival.
The ecosystem is moving fast. Managed memory services, provider-side context management, turnkey trace infrastructure, and retrieval-as-a-service offerings are shipping regularly. Many of the layers described in this guide are increasingly buy-not-build decisions. The architectural principles remain the same regardless of whether you implement them yourself or adopt managed services that enforce them on your behalf. What matters is that the invariants hold, regardless of who builds the plumbing.
Build what your constraints require today. Adopt what the ecosystem provides tomorrow. But know where the architecture converges, so you are building toward it rather than away from it.
Final Principle
Context is no longer just what you pass to a model.
It is your safety boundary.
Your isolation boundary.
Your cost boundary.
Your audit trail.
And your competitive moat.
Frontier models are increasingly interchangeable. APIs converge. Capabilities normalize. Pricing compresses.
What does not commoditize is how context is assembled, constrained, promoted, retained, replayed, and priced. That discipline determines safety, reliability, latency, auditability, and margin. Two teams can call the same model and produce radically different systems depending on their memory architecture and context control surfaces.
In commercial agent systems, model choice is leverage.
Context engineering is the moat.
Context is not an implementation detail. It is infrastructure. Engineer it accordingly.
Hope this helps.