Context Engineering for Commercial Agent Systems
Memory, Isolation, Hardening, and Multi-Tenant Context Infrastructure
When you build agents for a single user on a laptop, almost anything works.
When you build commercial multi-tenant agents serving enterprises, almost nothing accidental survives.
Since late 2024, I’ve spent most of my time building and optimizing commercial, multi-tenant agent systems with a team of engineers inside a large SaaS platform serving hundreds of enterprise customers. We stress-tested agents under real constraints: tenant isolation, financial accuracy, auditability, cost control, and scale.
When I wasn’t working on those systems, I was consulting with other teams, collaborating with peers building agent platforms, and running independent experiments and side projects to pressure-test orchestration models, retrieval architectures, evaluation harnesses, and model upgrades under similarly real-world conditions.
Different environments. Same constraints.
We evaluated and implemented systems across direct foundation model orchestration, managed agent runtimes, multi-agent coordination patterns, graph-oriented orchestration, and MCP integrations.
In parallel, we architected a multi-tenant semantic layer over economically material enterprise cost data. Not RAG over documents, but a deterministic parsing and interpretation engine performing canonical identity modeling, entity resolution, ontology alignment, and context-aware re-ranking with full provenance. Underneath it: hybrid search, columnar analytics, vector stores, and AI-native data models.
We built replay-based regression harnesses. We instrumented token cost and execution traces per run. We fine-tuned and distilled local models for latency and cost optimization.
We heavily utilized production agentic coding systems like Claude Code and Cursor, studying how they handle production-grade context management: pruning between turns, isolating workspaces, externalizing heavy operations, aggressively controlling context surfaces. The same patterns we were engineering, they were hardening in the wild.
Across all of this, a clear convergence emerged.
Models improve. APIs standardize. Tool use matures.
But in commercial multi-tenant systems, those are not the determining factors.
Context is.
What follows is not theory. It emerged from debugging memory drift across tenants. From tightening isolation guarantees that survive scale. From building promotion gates to prevent memory poisoning. From implementing retention under enterprise compliance. From learning that token economics only become predictable when context is disciplined.
This is not just one team’s experience. The same architectural pressures are visible across production systems like Claude Code, Cursor, Letta, AWS AgentCore, and others. Implementations differ. The convergence is real.
The systems that survive production pressure consistently share the same traits:
- Typed memory instead of transcript blobs
- Separation between canonical truth and derived acceleration layers
- Explicit promotion and compaction gates
- Trace envelopes for replay and audit
- Aggressive pruning between turns
- Isolation boundaries enforced as security boundaries
- Cost surfaces treated as first-class signals
The systems that fail optimize for demo velocity. They index raw transcripts. They reuse context wholesale. They blur truth and acceleration. They discover cost, drift, and cross-tenant risk too late.
This guide synthesizes:
- Lived experience building commercial multi-tenant agents
- Public architectural signals from mature agent systems
- Explicit engineering rules for context discipline
In commercial multi-tenant systems, context is not an implementation detail.
It is infrastructure.
And infrastructure requires engineering discipline.
IMPORTANT NOTE
This guide is NOT canon. No single person or team’s experience is. What it represents is accumulated knowledge from building under real constraints, knowledge that appears to converge with architectural decisions the broader ecosystem is arriving at independently. Where we align with production systems like Claude Code or AWS AgentCore, it is not because we copied their work. It is because the same pressures produce the same load-bearing patterns. Treat this as a field guide, not a specification.
The code blocks in this guide are pseudo-code expressing architectural invariants, not implementation details. They illustrate structural contracts that must hold regardless of language or framework.
Part I: Context Is Infrastructure
The Three Non-Negotiables of Commercial Agent Systems
- Structural Isolation
- Deterministic Replay
- Economic Predictability
These are not optional features. They are architectural constraints.
If your system cannot enforce tenant boundaries structurally, it will eventually leak.
If your system cannot run deterministic replays, you cannot debug or evolve it safely.
If your system cannot predict cost per run, it cannot scale sustainably.
Models Are Commoditized. Context Is Not.
Everyone has access to the same frontier models. Claude. GPT. Gemini. The APIs are public. The prices are falling. Capabilities are converging.
What differentiates systems is no longer the model.
It's the context.
Two teams using the same model can produce radically different outcomes depending on how they externalize knowledge and constrain behavior.
That's true for coding assistants. It's even more true for commercial agent systems.
Because in commercial systems, context isn't just about output quality. It's about:
- Policy compliance: did the agent follow the rules?
- Data isolation: can Tenant A's data leak into Tenant B's context?
- Cross-tenant safety: does shared infrastructure introduce shared risk?
- Cost control: can you predict and bound what each run costs?
- Replay and auditability: can you reconstruct what the agent knew at decision time?
- Correctness under ambiguity: does the system degrade gracefully or silently?
If you get context wrong, you don't just get worse answers. You get silent corruption.
Observed Ecosystem Convergence
Context as Competitive Surface
Pattern: Context assembly is becoming a competitive differentiator.
- Bolt frames token efficiency and richer context as economic levers, not just model inputs.
- Replit describes injecting minimal diagnostic signals instead of dumping full logs into context.
The ecosystem is optimizing context selection, not just model choice.
Decisions as First-Class Records
There's a shift happening beneath the surface that most teams haven't fully internalized.
Enterprise software historically captured objects: customers, invoices, tickets, accounts. Systems of record persisted state.
Agent systems introduce something new: decisions.
- Why was this exception approved?
- Which policy version applied?
- What precedent influenced this action?
- What context was visible at decision time?
We're not fully at “context graphs” yet. But we can't get there unless we build the foundations now:
- Append-only trace logs
- Scoped memory promotion
- Provenance tracking
- Replayable context assembly
Context management isn't just about controlling what the model sees. It's about observing and recording how context influences action.
Without traceability, you cannot optimize.
Without replay, you cannot debug.
Without provenance, you cannot trust durable memory.
The foundation for future “context graphs” is not speculation. It's disciplined trace capture starting today.
Observed Ecosystem Convergence
Decisions and Traces Are First-Class Primitives
Pattern: Platforms are standardizing structured decision records and trace envelopes.
- Anthropic's evals guidance formally defines "transcript/trace/trajectory" as the complete record of a run including outputs, tool calls, and intermediate results.
- OpenAI's Responses API assigns durable response IDs, supports `previous_response_id` threading, and exposes explicit metadata and usage fields.
- Claude Code exports structured telemetry via OpenTelemetry, making execution observable at the run level.
Traceability is becoming infrastructure, not instrumentation added later.
Architectural Principle
Non-Negotiables, Not Features
Context boundaries are infrastructure, not application logic.
- Isolation is structural, not prompt-based.
- Replay is the baseline for trust.
- Economics must be predictable per run.
Part II: Memory as a Scoped, Typed System
Memory Must Be Scoped and Typed
Before we talk about storage or retrieval, we need a shared vocabulary.
The word “memory” is overloaded. So is “context.”
If you don’t explicitly define both scope and type, you end up with a single undifferentiated blob store. And that’s where security and correctness failures begin.
A robust commercial agent system classifies memory along two dimensions:
- Scope: who can see it (a security boundary)
- Type: what kind of memory it is (semantic role)
In practice, every production system we evaluated converged on some form of this separation.
This is not academic modeling. It is operational survival.
Memory Scopes (Security Boundaries)
These are structural isolation layers. They are enforced at the storage and routing layers, never delegated to the model.
Global Scope: Platform-Wide, Tenant-Invariant Memory
- System safety rules
- Tool contracts
- Product ontology
- Agent “constitution”
Properties:
- Immutable at runtime
- Versioned
- Deployment-controlled
- Never writable by agents
Tenant Scope: Organization-Wide Shared Memory
- Organization policies
- Knowledge bases
- Playbooks
- Connector configurations
Properties:
- Shared across users in a tenant
- Policy-gated promotion
- Subject to tenant retention rules
This is where governance lives. It is also where poisoning risk becomes systemic if not handled correctly.
User Scope: Personalized Memory Within a Tenant
- Preferences
- Working style
- Personal notes
- User-specific entitlements
Properties:
- Visible only to the user (and system)
- Promotion-gated
- TTL or policy-based retention
Cursor’s user-local memory posture reflects this discipline. User state stays user-scoped by default.
Session Scope: Ephemeral Runtime State
- Tool outputs
- Intermediate plans
- Scratch buffers
- Temporary retrieval results
Properties:
- Short-lived
- Subject to aggressive garbage collection
- Not durable unless explicitly promoted
Observed Ecosystem Convergence
Memory Scopes Converge on Hierarchical, File-Like Policy
Pattern: Memory is scoped and layered, not dumped into a single store.
- Claude Code implements multiple memory layers including managed policy, project memory, modular rules, user memory, local project memory, and auto-memory, all with deterministic precedence and on-demand loading.
- Cursor describes "Rules" with explicit scopes (project vs user) and multiple activation modes (Always Apply, Apply Intelligently, Apply to Specific Files, Apply Manually).
- Windsurf implements location-based scoping via AGENTS.md where subdirectory placement defines the scope boundary.
- Warp organizes durable context as typed artifacts (Workflows, Notebooks, Rules, MCP Servers, etc.) rather than raw transcript.
Scope boundaries are becoming filesystem conventions, not prompt-level suggestions.
Memory Type (Semantic Role)
Memory types cut across scopes. Scope defines who can see it. Type defines what the memory represents and how it is expected to behave.
- Policy memory: normative rules and constraints. Typically global or tenant-scoped. Versioned and tightly controlled.
- Preference memory: stable personalization parameters. Usually user-scoped.
- Fact memory: durable assertions the agent may reuse. Must include provenance.
- Episodic memory: structured summaries of completed work ("Case resolved." "Migration completed." "Exception granted."). Reusable artifacts extracted from traces.
- Trace memory: raw, append-only execution events. This is your flight recorder.
The most common failure mode in early systems is allowing episodic or fact memory to silently drift into policy memory.
That is how precedent poisoning begins.
Observed Ecosystem Convergence
Memory Classification Is Becoming Explicit
Pattern: Memory types (policy, preference, episodic, fact) are being separated structurally.
- Letta's AI Memory SDK exposes labeled memory blocks (human, summary, policies, history, preferences) backed by per-subject agent state.
- Amazon Bedrock AgentCore separates short-term memory (raw interaction events) from long-term memory (structured records extracted across sessions) with semantic retrieval APIs.
- Bolt separates ephemeral chat history from durable "Project Knowledge," instructing users to promote constraints into the dedicated durable channel.
The more you know
A Simple Rule
Memory without scope is exposure. Memory without type is entropy.
Once you have both, you can start engineering context deliberately.
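Expressed in the same pseudo-code spirit as the rest of this guide, the two dimensions might be declared like this (enum values and fields are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from enum import Enum

class Scope(str, Enum):
    GLOBAL = "global"    # platform-wide, tenant-invariant
    TENANT = "tenant"    # shared within one organization
    USER = "user"        # personal to one user inside a tenant
    SESSION = "session"  # ephemeral runtime state

class MemoryType(str, Enum):
    POLICY = "policy"
    PREFERENCE = "preference"
    FACT = "fact"
    EPISODE = "episode"
    TRACE = "trace"

@dataclass(frozen=True)
class MemoryRecord:
    memory_id: str
    tenant_id: str
    scope: Scope              # who can see it: the security boundary
    memory_type: MemoryType   # what it represents: the semantic role
    content_ref: str          # pointer into the object store, never an inline blob
```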
Memory Layer Summary:
| Layer | Scope | Typical Contents | Retention | Write Policy | Canonical Store |
|---|---|---|---|---|---|
| Constitution | Global | Safety rules, tool contracts, ontologies | Versioned | Write-locked | Artifact registry (versioned policy bundles) + object store |
| Org memory | Tenant | Playbooks, knowledge base, connectors, norms | Policy-based | Gated promotion | Structured memory store |
| Personal memory | User | Preferences, working style, drafts | TTL-based + user controls | Gated promotion | Structured memory store |
| Runtime state | Session | Tool outputs, scratch space, intermediate plans | Hours–days | Auto GC | Ephemeral cache (working set); trace captures events/references |
| Episodes | User / Tenant | “Case resolved,” “refactor complete,” derived summaries | Months+ | Explicit promotion | Structured memory store |
| Traces | Tenant (partitioned by session/run) | Events, retrievals, tool calls, approvals | Policy-based (often long-lived) | Append-only | Event log |
Note how:
- Session memory is volatile
- User memory is semi-durable
- Tenant memory is high-stakes
- Global memory is immutable
Each layer carries different isolation risk and promotion risk.
What We Learned
Tenant Configuration Needed an Override Layer
We initially treated tenant-scoped configuration as the single source of truth for all users within a tenant. It worked well when needs were uniform.
As adoption grew, users needed to adjust specific settings without changing the tenant-wide baseline. Without partial overrides, every exception became either a tenant mutation or a workaround.
We introduced a resolution layer. Tenant configuration remained canonical, but user-scoped records could shadow specific fields. Reads resolved through explicit precedence rules, with tenant state authoritative and user preferences layered on top.
Scope is not just about visibility. It defines resolution order when multiple layers have opinions about the same setting.
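A minimal sketch of that resolution order, assuming hypothetical `tenant_config` and `user_overrides` maps and an explicit whitelist of fields users may shadow:

```python
def resolve_config(tenant_config: dict, user_overrides: dict, allowed_override_keys: set[str]) -> dict:
    """Tenant state is authoritative; user-scoped records may shadow specific whitelisted fields."""
    resolved = dict(tenant_config)  # tenant baseline is canonical

    for key, value in user_overrides.items():
        # Only explicitly allowed fields may be shadowed per user.
        if key in allowed_override_keys and key in tenant_config:
            resolved[key] = value

    return resolved

# Usage: the tenant default stands unless the user has an approved override.
config = resolve_config(
    tenant_config={"tone": "formal", "auto_approve_limit": 0},
    user_overrides={"tone": "casual", "auto_approve_limit": 500},
    allowed_override_keys={"tone"},
)
assert config == {"tone": "casual", "auto_approve_limit": 0}
```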
Why Scoped + Typed Memory Changes Everything
Without scope boundaries:
- Cross-tenant contamination becomes possible
- Retrieval filters become advisory
- Privacy guarantees degrade
Without type boundaries:
- Precedents become directives
- Facts become policies
- Session artifacts become durable memory
Scoped + typed memory is the minimum viable structure for safe autonomy.
It allows:
- Isolation enforcement
- Promotion gating
- Retention control
- Cost modeling by layer
- Evaluation at the run level
And most importantly: it prevents a single undifferentiated memory surface from quietly becoming a liability.
Architectural Principle
Name the Boundary
Memory must be explicitly typed and scoped.
- Scope defines the security boundary.
- Type defines behavioral semantics.
- If either is ambiguous, drift becomes structural.
Part III: Truth vs Acceleration
Once you define a memory taxonomy, the next mistake most systems make is collapsing storage into a single layer.
Everything goes into:
- A vector database
- A document store
- A transcript log
- Or worse, a hybrid of all three
That works for prototypes.
It does not work for commercial, multi-tenant agent systems.
The core distinction you must preserve is this:
Separate truth from acceleration.
In practice, that means designing two canonical stores and two derived stores.
Canonical Stores (Truth)
Canonical stores serve as the system of record. They hold durable, immutable facts from which state can be deterministically derived.
They must support:
- Auditability
- Replay
- Version awareness
- Deterministic reconstruction
Their full state must be reconstructible from their own persisted history, not dependent on secondary indexes, caches, or materialized views.
1. Canonical Event Log (Append-Only)
This is your flight recorder.
Every agent run emits multiple events that include:
- Context retrieved
- Policies evaluated
- Tool calls made
- Approvals routed
- Outputs generated
- Memory promoted
This log is:
- Append-only
- Immutable
- Replayable
- Version-aware
It allows you to answer:
- What did the agent know at decision time?
- Which policy version applied?
- Why was this exception granted?
- What was retrieved and why?
Without this log:
- You cannot debug autonomy.
- You cannot build evaluation loops.
- You cannot build future context graphs.
Example agent event:
1{2 "event_id": "evt_01J...",3 "run_id": "run_01J...",4 "tenant_id": "t_123",5 "user_id": "u_456",6 "ts": "2026-02-16T18:21:22Z",7 "type": "retrieval",8 "artifact_ids": ["mem_88"],9 "candidate_count": 32,10 "policy": { "version": "tenant_policy_v8" }11}
This is not logging. It is infrastructure.
2. Canonical Structured Memory Store
This stores durable memory state.
Unlike the event log, this is not raw trace data. It stores structured artifacts:
- Facts
- Preferences
- Episodic summaries
- Approved overrides
- Tenant-level knowledge
Every record must include:
- Scope
- Class
- Provenance
- Retention policy
- Sensitivity classification
Crucially: This store, not the vector index, is truth.
Example memory record:
1{2 "memory_id": "mem_88",3 "tenant_id": "t_42",4 "user_id": "u_7",5 "scope": "tenant",6 "memory_type": "episode",7 "status": "verified",8 "content_ref": "obj_441",9 "content_digest": "sha256:...",10 "provenance_run_id": "run_01J...",11 "retention_policy": "policy_12",12 "sensitivity": "internal",13 "created_at": 173102944114}
If your vector store becomes your truth layer, you will eventually:
- Lose provenance
- Lose replayability
- Break deletion guarantees
- Create retention drift
Derived Stores (Projections)
Derived stores exist for performance.
They are:
- Rebuildable
- Ephemeral
- Invalidatable
- Non-authoritative
They are accelerators, not truth.
1. Retrieval Index (Vector / Hybrid Search)
The retrieval index is your serving layer for recall.
It may include:
- Embeddings
- Lexical search (BM25 or equivalent)
- Hybrid ranking
- Freshness boosts
- Scope filters
- Metadata constraints
But it must be rebuildable from canonical sources.
It is a projection.
If your vector store becomes your truth layer, you will eventually lose structural integrity.
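One way to make rebuildability concrete, sketched against hypothetical `canonical_memory_store`, `index`, and `embed` interfaces:

```python
def rebuild_retrieval_index(tenant_id: str, index, canonical_memory_store, embed):
    """Drop and regenerate a tenant's retrieval partition purely from canonical truth."""
    index.drop_partition(tenant_id)

    for record in canonical_memory_store.iter_active(tenant_id=tenant_id):
        # Only verified, non-expired canonical records are projected into the index.
        index.upsert(
            partition=tenant_id,
            doc_id=record.memory_id,
            vector=embed(record.content),
            metadata={
                "scope": record.scope,
                "sensitivity": record.sensitivity,
                "expires_at": record.expires_at,
            },
        )
```

If this operation is impossible or lossy in your system, the index has already become accidental truth.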
2. Object Store (Large Payloads)
Agent systems frequently deal with:
- Large documents
- Attachments
- Extraction outputs
- Tool responses
- External system dumps
These do not belong in structured memory. They belong in an object store, referenced by ID, tagged with scope and sensitivity, and governed by retention policy.
Objects should be content-addressed (or at minimum content-hashed) so they can be verified, deduplicated, and traced back to immutable source bytes. Derived artifacts such as embeddings, chunks, summaries, and classifications should be stored separately and keyed by (object_id, content_digest, model/version). They are projections for retrieval and acceleration, not canonical truth.
Summaries are especially useful for faster retrieval and context compression, but they must remain reproducible and auditable. A summary should always reference:
- the source `object_id`
- the source content digest
- the model and prompt/version used to generate it
- creation timestamp and optional verification status
The event log and structured memory should reference these objects, never embed large payloads directly.
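A sketch of what a derived summary record might carry, with illustrative field names:

```python
# Hypothetical shape of a derived summary artifact; keys are illustrative.
summary_artifact = {
    "artifact_id": "sum_123",                 # derived, rebuildable, non-authoritative
    "object_id": "obj_441",                   # canonical source object
    "content_digest": "sha256:...",           # digest of the exact source bytes summarized
    "generator": {
        "model": "claude-sonnet-4-6",         # model used to produce the summary
        "prompt_version": "summarize_v4",     # prompt/version for reproducibility
    },
    "created_at": "2026-02-16T18:21:22Z",
    "verified": False,                        # optional verification status
}
```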
If using managed ingestion systems such as Amazon Bedrock Knowledge Bases, treat them as synchronization and chunking layers that build and refresh retrieval indexes from object storage. They orchestrate ingestion and pruning, but they do not replace the underlying search engine or the need for canonical content verification.
Hybrid Search
Hybrid search (lexical + semantic) provides stronger precision and filtering guarantees than vector-only retrieval when correctness and isolation matter.
Lexical search preserves deterministic filtering.
Semantic search improves recall.
Combined ranking reduces false matches and helps minimize hallucinations.
Engines such as OpenSearch, Typesense, and Pinecone natively support hybrid retrieval, combining keyword relevance (BM25-style scoring) with vector similarity to balance precision and semantic recall.
Amazon Bedrock Knowledge Bases is not a search engine itself, but instead a managed ingestion and synchronization layer that builds and refreshes retrieval indexes (typically backed by a hybrid store) from documents stored in S3. It handles pruning, chunking, and index rebuilds on your behalf.
Critically, the index, regardless of engine, must be built from:
- Canonical structured memory
- Curated documents
- Approved episodes
It must not be built directly from raw transcripts.
Raw transcripts are noisy, redundant, and context-fragmented. Indexing them directly undermines traceability and weakens retrieval discipline.
What We Learned
The Vector Index Became Accidental Truth
We relied on the vector index because it already contained embeddings, metadata, and retrieval paths. It was fast, convenient, and close to the model.
Over time, it quietly became the de facto system of record. Deletion became probabilistic, retention policies diverged across layers, and rebuilding the index shifted historical behavior because canonical truth had never been explicitly defined.
We separated acceleration from truth. Canonical records became immutable objects with explicit provenance and retention semantics. The vector index was reduced to a projection layer, fully rebuildable from canonical sources.
If your retrieval layer is the only place certain data lives, it is not acceleration. It is an unauditable system of record.
Isolation at the Projection Layer
Isolation is not a retrieval tuning feature.
It is a system invariant.
But the retrieval index is where projection-layer drift most commonly appears.
Filtering must occur before ranking, not after.
Every retrieval query must include:
- Tenant ID
- Scope visibility constraint
- Expiration checks
- Sensitivity boundaries
```python
class IdentityEnvelope:
    def __init__(self, tenant_id, user_id, roles, privacy_mode, policy_version):
        self.tenant_id = tenant_id
        self.user_id = user_id
        self.roles = roles
        self.privacy_mode = privacy_mode
        self.policy_version = policy_version


def retrieve(query, envelope: IdentityEnvelope):
    assert envelope.tenant_id is not None
    assert envelope.policy_version is not None

    filters = {
        "tenant_id": envelope.tenant_id,
        "policy_version": envelope.policy_version,
        "visibility": allowed_scopes(envelope),
        "not_expired": True
    }

    candidates = hybrid_search(query, filters)
    return rank(candidates)
```
If filtering is applied after ranking, cross-tenant artifacts may still influence embedding neighborhoods.
Partition semantics must match canonical storage.
If canonical storage is tenant-partitioned but the retrieval index is globally indexed with soft filters, isolation becomes advisory.
Observed Ecosystem Convergence
Event Logs and Structured Records Are Splitting into Distinct Tiers
Pattern: Canonical truth is separating from derived acceleration layers.
- Amazon Bedrock AgentCore implements this split explicitly: short-term memory stores raw interaction events; long-term memory holds structured information extracted asynchronously with semantic retrieval.
- OpenAI's Responses API provides durable response IDs, explicit metadata fields, and detailed usage breakdowns including cached tokens, the API-level anchors for trace envelopes.
Raw event capture and structured state are becoming architecturally distinct.
Content Proofs and Cross-Tenant Isolation
In shared or multi-tenant retrieval systems, consider cryptographic “content proofs” to prevent cross-copy leakage.
Cursor, for example, uses Merkle-tree-based content proofs during shared index onboarding to ensure results are returned only if the requester can prove legitimate possession.
This pattern can be applied at the object-store level:
- Maintain tenant-scoped manifests
- Maintain Merkle roots over authorized (`object_id`, `digest`) pairs
- Enforce verifiable inclusion boundaries
This reinforces isolation at the projection layer.
Cryptography does not replace logical isolation. It reinforces it.
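A toy sketch of the root-comparison idea (not Cursor's actual protocol): compute a Merkle root over the (`object_id`, `digest`) pairs a requester claims to possess and compare it against the tenant's authorized manifest before serving shared-index results:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold sorted leaf hashes pairwise into a single root."""
    level = sorted(_h(leaf) for leaf in leaves)
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def manifest_root(pairs: list[tuple[str, str]]) -> bytes:
    """pairs: (object_id, content_digest) tuples a party is authorized to possess."""
    return merkle_root([f"{oid}:{digest}".encode() for oid, digest in pairs])

def authorize_index_reuse(claimed_pairs, tenant_manifest_pairs) -> bool:
    # Serve shared-index results only if the requester's claimed manifest
    # matches the tenant's authorized manifest root.
    return manifest_root(claimed_pairs) == manifest_root(tenant_manifest_pairs)
```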
Observed Ecosystem Convergence
Content Proofs and Index Isolation Are Production Patterns
Pattern: Derived indexes enforce isolation cryptographically, not just logically.
- Cursor's blog and Security page describe Merkle-tree-based "content proofs" for secure teammate index reuse, filtering results unless the client can prove file possession, then deleting proofs after roots match.
- Cursor's data-use disclosure documents temporary encrypted caching with client-generated keys that exist server-side only during the request.
Projection layers are being hardened against cross-boundary leakage.
A Quick Mental Model
Think of it this way:
- The event log is your immutable journal.
- The structured memory store is your state.
- The retrieval index is your materialized view.
- The object store is your blob layer.
If you’ve worked with event sourcing, this should feel familiar (with less determinism).
If you haven’t, the rule is simple:
If you can’t rebuild it from canonical truth, it shouldn’t be trusted.
Why This Separation Matters
This architecture gives you:
- Replayability
- Deletion guarantees
- Poisoning containment
- Cross-tenant isolation clarity
- Retention enforcement
- Cost control
- Index rebuild capability
And most importantly:
It prevents your retrieval layer from silently becoming your system of record.
Architectural Principle
Truth Is Rebuildable, Acceleration Is Disposable
Canonical data must be authoritative; everything else must be regenerable.
- Canonical stores are the system of record.
- Derived layers are projections, not truth.
- If acceleration becomes authoritative, integrity erodes.
Part IV: The Context Engine Loop
Commercial agent systems don't fail because they lack storage.
They fail because they lack discipline at runtime.
Context is not something you “load.”
It is something you assemble, constrain, compact, and sometimes discard.
Most systems accumulate context. Production systems reconstruct it.
That discipline lives in the context engine loop.
The High-Level Loop
Every agent run should follow a predictable sequence:
- Ingest: establish identity, scope, constraints, and privacy mode
- Plan context needs: determine what information is required to act safely
- Retrieve: execute hybrid search within allowed scopes
- Assemble working set: layer context by priority and token budget
- Semantic stabilization: normalize references, extract structure, preserve meaning before reduction
- Agentic garbage collection: deduplicate, prune low-confidence artifacts, enforce working-set limits
- Infer and act: model + tools + policy enforcement + optional human approval
- Promotion gate: decide what becomes durable memory
- Emit trace envelope: record retrievals, actions, policies, versions, and cost surfaces
- Lifecycle garbage collection: expire session buffers, enforce retention, invalidate derived projections
This loop is not optional.
If you skip steps, you get drift.
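Taken together, the loop can be read as a single run function. The sketch below is illustrative only; helper names like `build_layers`, `infer_and_act`, and `promotion_gate` are placeholders for the steps described in the rest of this part:

```python
def run_agent(request, envelope: IdentityEnvelope):
    assert_envelope(envelope)                                   # 1. ingest: identity, scope, privacy mode
    plan = plan_context(request)                                # 2. plan what context is needed
    retrieved = retrieve(request.query, envelope, now_ts=now()) # 3. retrieve within allowed scopes
    layers = build_layers(retrieved, envelope)                  #    constitution, tenant policy, prefs, retrieved, session
    working_set = assemble(layers, plan["max_tokens"])          # 4. layered assembly under budget
    working_set = semantic_stabilization(working_set)           # 5. stabilize meaning before reduction
    working_set = agentic_gc(working_set, plan["max_tokens"])   # 6. dedupe, prune, enforce working-set limits
    result, events = infer_and_act(working_set, envelope)       # 7. model + tools + policy + approvals
    promotions = promotion_gate(result, events, envelope)       # 8. decide what becomes durable
    emit_trace_envelope(envelope, events, promotions, result)   # 9. append-only trace record
    lifecycle_gc(envelope)                                      # 10. retention and projection hygiene
    return result
```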
Step 1: Ingest
At the beginning of a run, you must establish:
- Tenant identity
- User identity
- Role and entitlements
- Privacy mode
- Sensitivity level
- Task type
Isolation begins here.
Retrieval filters are built before retrieval runs.
```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class IdentityEnvelope:
    tenant_id: str
    user_id: str
    roles: FrozenSet[str]
    privacy_mode: str    # e.g., "retained" | "no_retention"
    policy_version: str  # must be pinned per run


def assert_envelope(envelope: IdentityEnvelope) -> None:
    assert envelope.tenant_id, "tenant_id is required"
    assert envelope.user_id, "user_id is required"
    assert envelope.policy_version, "policy_version must be pinned per run"
    assert envelope.privacy_mode in {"retained", "no_retention"}, "invalid privacy_mode"
```
You DO NOT ask the model to filter data.
You filter in the data plane.
If identity and scope are ambiguous at ingestion, everything downstream becomes probabilistic.
Step 2: Plan Context Needs
Before retrieving anything, the agent should plan what kind of context it needs.
Does this task require:
- Tenant policy?
- Prior episodes?
- User preferences?
- External knowledge?
This prevents the common anti-pattern:
“Retrieve everything and let the model figure it out.”
In production, this anti-pattern shows up as:
- Gradually increasing token costs
- Slowly degrading precision
- Retrieval surfaces expanding
- Embedding neighborhoods densifying
- Prompt budgets creeping upward
No single run looks catastrophic.
Over weeks, additive context and recursive indexing begin influencing outcomes in subtle, hard-to-debug ways.
Planning reduces both risk and cost.
It's your first form of budget control.
```python
def plan_context(request):
    if request.type == "support_refund":
        return {
            "needs": ["tenant_policy", "prior_episodes", "customer_history"],
            "max_tokens": 2400
        }
    elif request.type == "draft_email":
        return {
            "needs": ["user_preferences"],
            "max_tokens": 1200
        }
```
Observed Ecosystem Convergence
Plan-Before-Execute Is Standard Practice
Pattern: Agent systems are separating planning from execution with explicit gates.
- Claude Code describes an agentic loop: gather context, take action, and verify results, with subagents that use fresh isolated contexts and return summaries.
- OpenCode documents a "plan" agent that analyzes without modifying code, with permissioned tools requiring approval before execution.
- Lovable splits into Plan mode for decision-making and Agent mode for execution with verification.
- Bolt documents Plan Mode as improving strategy and execution accuracy.
Inference without planning is giving way to deliberate, gated execution.
Step 3: Retrieve (Isolation Enforced Here)
Retrieval must respect:
- Scope
- Visibility
- Sensitivity
- Retention
- Privacy mode
Filtering happens before ranking, not after.
```python
def retrieve(query: str, envelope: IdentityEnvelope, *, now_ts: int):
    assert_envelope(envelope)
    assert query and isinstance(query, str)

    filters = {
        "tenant_id": envelope.tenant_id,               # mandatory predicate
        "visibility": allowed_scopes(envelope.roles),  # computed in data plane
        "not_expired_at": now_ts,                      # enforce retention gates
        "status_in": {"active"},                       # provisional is not broadly retrievable
    }

    # IMPORTANT: filter BEFORE rank, never after.
    candidates = hybrid_search(query=query, filters=filters)

    # Projection is not truth. Verify tenant on canonical fetch.
    artifact_ids = [c.artifact_id for c in candidates[:50]]
    records = guarded_fetch(artifact_ids, envelope.tenant_id)

    return rank(query, records)
```
Hybrid search (lexical + semantic) provides:
- Deterministic filtering
- Precision guarantees
- Improved recall
But retrieval is still a projection.
The canonical store remains the source of truth.
Step 4: Assemble the Working Set
The working set is the ephemeral context that actually enters the model’s window.
It is layered:
- Global constitution
- Tenant policies
- User preferences
- Retrieved facts and episodes
- Session state
Each layer has:
- Priority
- Token budget
- Truncation rules
Without layering and budgets, context windows become dumping grounds.
```python
def assemble(layers, budget):
    ordered = sort_by_priority(layers)
    working_set = []
    tokens_used = 0

    for item in ordered:
        if tokens_used + item.tokens <= budget:
            working_set.append(item)
            tokens_used += item.tokens
        else:
            break

    return working_set
```
What We Learned
Silent Guardrail Drift
We assumed that once policies existed in the system, they would remain influential.
As session histories expanded, tenant-level constraints were gradually pushed out of the working set. The system kept running, just without its guardrails visible at inference time.
We introduced explicit layer budgets. Global constitution and tenant policy received reserved allocations that could not be displaced.
If guardrails can be crowded out, they are suggestions, not invariants.
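A sketch of reserved allocations, assuming an illustrative `truncate_to` helper and the `assemble` function above; the invariant is that guardrail layers are admitted first and cannot be displaced:

```python
RESERVED_BUDGETS = {
    "global_constitution": 800,   # always present, never displaced
    "tenant_policy": 1200,        # always present, never displaced
}

def assemble_with_reservations(layers: dict, total_budget: int) -> list:
    working_set, used = [], 0

    # Reserved layers are admitted first, truncated only within their own allocation.
    for name, allocation in RESERVED_BUDGETS.items():
        item = truncate_to(layers[name], allocation)
        working_set.append(item)
        used += item.tokens

    # Everything else competes for the remainder, by priority.
    remainder = total_budget - used
    working_set += assemble(
        [layer for name, layer in layers.items() if name not in RESERVED_BUDGETS],
        remainder,
    )
    return working_set
```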
Observed Ecosystem Convergence
Budgeted Context Assembly Replaces Wholesale Inclusion
Pattern: Context is selectively loaded by budget and priority, not dumped wholesale.
- Replit injects minimal diagnostic signals and instructs the agent to fetch logs via a tool, explicitly avoiding full context dumps.
- Anthropic's Skills guide formalizes progressive disclosure: metadata always loaded, full instructions loaded only when needed, linked files navigated on demand.
- Cursor requires explicit context inclusion rather than blanket accumulation.
What you exclude from context is becoming as important as what you include.
Step 5: Semantic Stabilization (Pre-Compaction Flush)
Before you shrink context, you MUST stabilize meaning.
Compaction without stabilization risks deleting something the model was implicitly relying on.
Semantic stabilization answers:
What must be transformed or anchored before we enforce token limits?
This step may include:
- Collapsing verbose tool traces into structured summaries
- Extracting typed episodic artifacts from conversation fragments
- Converting free-form dialogue into structured facts
- Normalizing references (“that refund we discussed”) into concrete IDs
- Marking low-confidence artifacts explicitly
- Ensuring provenance metadata is attached
```python
def semantic_stabilization(working_set):
    working_set = collapse_tool_traces(working_set)
    working_set = extract_structured_episodes(working_set)
    working_set = normalize_references(working_set)
    working_set = attach_provenance(working_set)
    return working_set
```
This is not deletion.
It is transformation before deletion.
Without this step:
- Summarization can distort intent
- Compaction can silently remove guardrails
- Session references can become ambiguous
- Replay fidelity can degrade
Semantic stabilization preserves reasoning integrity before footprint reduction.
What We Learned
Compaction Without Stabilization Corrupted Meaning
We aggressively summarized long histories before extracting structured artifacts. It reduced tokens quickly and seemed harmless.
Over time, subtle behaviors shifted. Context that influenced tool selection and policy evaluation disappeared because it had been compressed before it was normalized or typed.
We moved structured extraction ahead of compaction. Meaning was stabilized first, then footprint was optimized.
If compaction runs before normalization, you are not reducing noise. You are discarding signal you have not yet captured.
Pre-compaction stabilization protects correctness.
Step 6: Agentic Garbage Collection (Working-Set Compaction)
After meaning is stabilized, the system can safely optimize.
Agentic garbage collection happens before inference.
It enforces:
- Token budgets by layer
- Deduplication of redundant artifacts
- Dropping stale session state
- Removing low-confidence provisional memory
- Enforcing maximum working-set size
Example:
```python
def agentic_gc(working_set, budget):
    working_set = dedupe(working_set)
    working_set = drop_low_confidence(working_set)
    return enforce_token_budget(working_set, budget)
```
Agentic GC protects:
- Guardrail visibility
- Cost predictability
- Ambiguity control
- Drift containment
It ensures that:
- Global constitution cannot be crowded out
- Tenant policy remains visible
- Session chatter does not displace structural constraints
Uncompressed history turns directly into cost.
Agentic garbage collection is not just optimization.
It is drift control.
Garbage Collection by Memory Layer:
| Memory Layer | Volatility | Promotion Risk | GC Strategy | Industry Parallel |
|---|---|---|---|---|
| Session | High | Low | Aggressive compaction, TTL | Claude Code ephemeral state |
| User | Medium | Medium | TTL + overwrite | Cursor user-local history |
| Tenant | Low | High | Verification gate | AgentCore tenant memory |
| Global | Immutable | Extreme | Write-locked | Signed system artifacts |
Session state is cheap and volatile.
Tenant memory is high-stakes and must be protected accordingly.
What We Learned
Transparency Competed With Cost
We retained full intermediate tool traces in the working context to maximize debugging transparency. Nothing else changed, but token usage per run steadily climbed.
The working set was carrying diagnostic detail the model did not need for inference. Cost increased without improving behavior.
We collapsed tool traces into structured summaries during stabilization and let agentic GC prune the rest. Only durable artifacts were eligible for promotion.
Full traces belong in the event log. The working set should carry only what inference needs to act on.
Step 7: Infer and Act
Only after:
- Context is stabilized
- Working set is compacted
- Budgets are enforced
does inference occur.
This is where:
- The model runs
- Tools are invoked
- Policies are evaluated
- Approvals are requested if needed
This is the only step most tutorials focus on.
Model invocation + tools + policy evaluation + approvals.
In commercial systems:
- Actions must be policy-evaluated
- High-risk actions may require human approval
- Tool outputs must be sensitivity-tagged
- Outputs must be traced
Tool invocations should include:
- Versioned tool contracts
- Input digests
- Output digests
External systems evolve.
Without capturing tool version and payload hash, replay fidelity degrades over time.
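A sketch of capturing that per invocation; the trace interface and helper names are illustrative:

```python
import hashlib
import json

def record_tool_call(trace, tool_name: str, contract_version: str, tool_input: dict, tool_output: dict):
    """Append a tool_call event with contract version and payload digests for replay fidelity."""
    def digest(payload: dict) -> str:
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
        return "sha256:" + hashlib.sha256(canonical).hexdigest()

    trace.record_event({
        "type": "tool_call",
        "tool": tool_name,
        "contract_version": contract_version,  # pin the external contract in use
        "input_hash": digest(tool_input),       # verify the exact request on replay
        "output_hash": digest(tool_output),     # detect upstream drift over time
    })
```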
The model is a component.
The system is the product.
Step 8: Promotion Gate
Promotion transitions session memory into durable memory.
This is the highest-risk operation in the system.
It deserves its own section, which we will fully expand in Part VI.
Step 9: Emit Trace Envelope
Disciplined trace capture requires a canonical shape.
Every run produces a single append-only trace envelope.
The envelope is a run-scoped materialization derived from the append-only event log. It does not introduce new facts. It snapshots derived aggregates so replay, audit, and cost analysis do not require reconstructing runs from raw events.
Cost views, lineage trees, evaluation harnesses, and audit dashboards are projections derived from this record. They do not redefine it.
Events are self-describing for partitioning and queryability, but the trace envelope is the authoritative run-level header. The event log is keyed by run_id.
At minimum, a canonical trace record must anchor:
- Identity (tenant, user, privacy mode)
- Model and policy versions
- Prefix/version hash
- Retrieval artifact IDs
- Tool contract versions
- Promotion decisions
- Token usage and cost
- Lineage (parent_run_id)
- Immutable event history
A minimal representation might look like this:
1{2 "run_id": "run_01J...",3 "parent_run_id": null,45 "tenant_id": "t_123",6 "user_id": "u_456",7 "privacy_mode": "retained",89 "policy": {10 "version": "tenant_policy_v8",11 "hash": "sha256:..."12 },13 "prefix": {14 "version": "constitution_v12",15 "hash": "sha256:..."16 },17 "model": {18 "provider": "anthropic",19 "name": "claude-sonnet-4-6",20 "version": "2026-02-15"21 },2223 "started_at": "2026-02-16T18:21:22Z",24 "ended_at": "2026-02-16T18:21:41Z",25 "status": "success",2627 "usage": {28 "tokens_in": 1832,29 "tokens_out": 412,30 "static_prefix_tokens": 620,31 "dynamic_context_tokens": 1212,32 "cost_estimate_usd": 0.023133 },3435 "retrieval": {36 "count": 8,37 "bytes_in": 14523,38 "rerank_candidates": 3239 },40 "tools": {41 "invoked": 2,42 "retry_count": 143 },44 "promotions": {45 "count": 1,46 "by_scope": { "tenant": 1, "user": 0, "global": 0 }47 },4849 "events": [50 {51 "event_id": "evt_01J...",52 "run_id": "run_01J...",53 "tenant_id": "t_123",54 "user_id": "u_456",55 "ts": "2026-02-16T18:21:23Z",56 "type": "retrieval",57 "artifact_ids": ["mem_88"],58 "candidate_count": 32,59 "policy": { "version": "tenant_policy_v8" }60 },61 {62 "event_id": "evt_01J...",63 "run_id": "run_01J...",64 "tenant_id": "t_123",65 "user_id": "u_456",66 "ts": "2026-02-16T18:21:27Z",67 "type": "tool_call",68 "tool": "refund_api",69 "contract_version": "v3.1",70 "input_hash": "sha256:...",71 "output_hash": "sha256:...",72 "policy": { "version": "tenant_policy_v8" }73 },74 {75 "event_id": "evt_01J...",76 "run_id": "run_01J...",77 "tenant_id": "t_123",78 "user_id": "u_456",79 "ts": "2026-02-16T18:21:38Z",80 "type": "promotion_write",81 "memory_id": "mem_441",82 "scope": "tenant",83 "memory_type": "episode",84 "status": "provisional",85 "policy": { "version": "tenant_policy_v8" }86 }87 ],8889 "integrity": {90 "envelope_hash": "sha256:...",91 "events_root_hash": "sha256:..."92 }93}
This record is append-only.
It is version-aware.
It is sufficient to replay the decision.
Everything else is projection.
Without trace envelopes, context engineering becomes guesswork.
Step 10: Lifecycle Garbage Collection (Durability & Retention Discipline)
After the run:
- Expire session buffers
- Invalidate derived indexes if needed
- Apply TTL to memory
- Archive large payloads
- Enforce retention policies
Memory is not just created. It must decay.
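A sketch of a post-run lifecycle pass, assuming hypothetical store interfaces:

```python
def lifecycle_gc(tenant_id: str, run_id: str, session_store, memory_store, index, object_store, now_ts: int):
    # Expire ephemeral session buffers for the completed run.
    session_store.expire(run_id)

    # Apply TTL / retention policy to durable memory and collect what was removed.
    expired_ids = memory_store.apply_retention(tenant_id=tenant_id, as_of=now_ts)

    # Derived projections must not outlive canonical records.
    index.delete(partition=tenant_id, doc_ids=expired_ids)

    # Archive or delete large payloads under the same retention policy.
    object_store.apply_retention(tenant_id=tenant_id, as_of=now_ts)
```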
The more you know
Why Three Forms of Garbage Collection?
It’s important to distinguish between:
- Semantic Stabilization: preserve meaning before reduction
- Agentic Garbage Collection: enforce working-set discipline before inference
- Lifecycle Garbage Collection: enforce retention and projection hygiene across runs
They operate at different layers of the architecture and protect different invariants:
- Stabilization protects correctness
- Agentic GC protects cost and drift
- Lifecycle GC protects durability and compliance

Most systems implement only one. Commercial systems require all three.
Run Boundary Events
Beyond the events emitted within the agent loop, two boundary events define the run itself.
run_started pins the execution boundary.
It captures the immutable configuration for the run: policy version, prefix hash, privacy mode, primary model, and parent linkage. From this point forward, the run operates inside that fixed context.
run_finalized closes the lifecycle.
It records final status, token usage, cost attribution, promotion counts, and integrity hashes. After this event, the run is complete and immutable.
Together, these two events make the trace envelope fully reconstructible from the append-only event log. The envelope introduces no new facts. It materializes the boundary and aggregates for fast replay, audit, and cost analysis.
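Sketched as illustrative event payloads (field names mirror the envelope example above):

```python
run_started = {
    "type": "run_started",
    "run_id": "run_01J...",
    "parent_run_id": None,
    "tenant_id": "t_123",
    "policy_version": "tenant_policy_v8",
    "prefix_hash": "sha256:...",        # constitution / static prefix in force for the run
    "privacy_mode": "retained",
    "model": "claude-sonnet-4-6",
}

run_finalized = {
    "type": "run_finalized",
    "run_id": "run_01J...",
    "status": "success",
    "tokens_in": 1832,
    "tokens_out": 412,
    "cost_estimate_usd": 0.0231,
    "promotion_count": 1,
    "events_root_hash": "sha256:...",   # integrity anchor over the event stream
}
```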
Multi-Turn Conversations Do Not Justify Persistent Windows
A common misconception:
“If this is a conversation, the entire prior context should remain in the window.”
⚠️ That is incorrect in commercial systems.
Multi-turn state should be reconstructed per turn from:
- Canonical structured memory
- Verified episodes
- Approved tenant policies
- Selective session summaries
Not from raw accumulated transcripts as the primary reconstruction mechanism.
Each turn should:
- Emit a trace
- Compact session artifacts
- Promote only approved durable memory
- Reassemble context fresh
Carrying forward full windows across turns:
- Increases token cost
- Increases drift
- Increases poisoning risk
- Obscures replayability
The acceptable pattern:
- Session memory is volatile
- Durable memory is reconstructed
- Context is assembled per turn
If context grows by accumulation rather than reassembly, you are building drift into the architecture.
What We Learned
Transcript Indexing Drift
We indexed raw transcripts directly because it was fast and required almost no additional structure. Early demos were impressive.
Over time, behavior drifted. Summarization evolved, tokenization shifted, and buried instructions inside transcripts began influencing retrieval in ways we could not replay or explain.
We moved transcripts out of the retrieval surface. Only structured artifacts, verified episodes, and canonical documents were indexed. Transcripts remained in the event log.
Raw transcripts are source material, not durable memory. If retrieval is built on conversation residue, behavior becomes a function of accumulated noise.
The Discipline
This loop is the difference between:
A chatbot with a vector DB and a commercial agent system.
Most failures come from skipping:
- Planning
- Pre-compaction
- Promotion gating
- Trace emission
The loop enforces discipline.
And discipline turns context from an experiment into infrastructure.
Architectural Principle
Assemble, Don’t Accumulate
Context must be reconstructed per run, not allowed to grow unchecked.
- Context is built intentionally each execution.
- The loop is the product: retrieve, budget, compact, promote, trace.
- Unbounded carryover becomes architectural drift.
Part V: Multi-Agent Context Boundaries
Commercial agent systems increasingly delegate work across multiple agents.
A parent agent spawns a subagent to research a topic, execute a tool chain, validate a result, or operate within a specialized domain. Multi-agent orchestration patterns such as fan-out, delegation, pipelines, and supervisory hierarchies are becoming standard.
The architectural challenge is not orchestration.
It is context discipline across agent boundaries.
Every principle established so far, scoped memory, truth vs acceleration, the context engine loop, applies within a single agent. Multi-agent systems multiply the surfaces where those principles must hold.
If context flows between agents without discipline, you get the same failures as undisciplined single-agent systems, but harder to debug because the causal chain crosses execution boundaries.
Context Inheritance vs Isolation
When a parent agent spawns a subagent, the first question is:
What context does the subagent receive?
There are two patterns:
- Shared context: The subagent inherits the parent's full working set.
- Isolated context with scoped input: The subagent receives a fresh context window with only the information the parent explicitly passes.
The first pattern is simple. It is also dangerous.
It carries the following risks:
- The subagent's token budget is consumed by the parent's context before it begins its own work.
- Irrelevant context from the parent pollutes the subagent's reasoning.
- Replay becomes ambiguous because you cannot isolate which agent's context influenced which decision.
- If the parent's context contains sensitive artifacts the subagent should not access, isolation is violated.
The second pattern, isolated context with scoped input, however, survives production pressure.
Claude Code's subagent model reflects this: subagents operate with fresh isolated contexts. The parent provides a scoped task description. The subagent executes independently. It returns a structured summary. The parent incorporates the summary into its own working set.
The isolation is deliberate:
- The subagent's token budget is its own.
- The parent controls what enters the subagent's window.
- The subagent's full internal trace stays in its own scope.
- Replay can reconstruct each agent's decision independently.
The cost of isolation is that the parent must decide what context the subagent needs. That decision is itself a context engineering problem, and it benefits from the same planning step described in the context engine loop.
If context inheritance is implicit, debugging multi-agent behavior requires reconstructing invisible state.
If context inheritance is explicit, each agent's behavior is independently replayable.
```python
def spawn_subagent(parent_trace: TraceEnvelope, envelope: IdentityEnvelope, task: dict, input_artifact_ids: list[str]):
    # Mandatory inheritance: tenant, user, policy, privacy (handled in routing outside this call).
    child_run_id = new_id("run")

    child_trace = TraceEnvelope(
        run_id=child_run_id,
        tenant_id=envelope.tenant_id,
        user_id=envelope.user_id,
        policy_version=envelope.policy_version,
        model_version=select_model_for(task),
        parent_run_id=parent_trace.run_id,
    )

    # Parent explicitly chooses what the child can see.
    result = execute_subagent(task=task, input_artifact_ids=input_artifact_ids, trace=child_trace)

    parent_trace.record_event({
        "event_type": "delegation",
        "child_run_id": child_run_id,
        "agent_type": task.get("agent_type"),
        "input_artifact_ids": input_artifact_ids,
        "output_summary_id": result.summary_id,
        "child_cost_usd": child_trace.cost_usd,
    })

    child_trace.finalize()
    return result
```
What We Learned
Context Sharing Was Correct Until It Wasn't
We initially delegated to subagents by passing the parent’s full working set. It was simple, fast to ship, and produced strong results in testing.
As parent contexts grew, subagent token costs grew with them. Behavior became sensitive to prior session state, and identical delegations produced different outcomes depending on what had happened earlier in the run.
We moved to scoped delegation. The parent assembled a minimal context package per subagent: task description, applicable policies, and explicitly selected artifacts. Each subagent ran in an isolated context and returned a structured summary.
Full context inheritance works when working sets are small. At scale, implicit inheritance turns parent history into unintended influence.
Subagent Outputs Are Promotion Events
When a subagent returns results to its parent, the parent incorporates that output into its working set.
This is a promotion event.
It deserves the same scrutiny as any other transition from ephemeral to durable state.
When the subagent's summary enters the parent's context it can influence:
- Subsequent tool invocations
- Policy evaluation
- Further delegation decisions
- Memory promotion at the end of the run
If the subagent's output is treated as trusted input without validation, the parent inherits whatever errors, hallucinations, or poisoning the subagent produced.
Defense:
- Subagent outputs should be typed: fact, episode, recommendation, tool result.
- Provenance should be tagged: which subagent, which run, which model version.
- Sensitivity classification should transfer: if the subagent accessed tenant-scoped data, the summary inherits that classification.
- The parent's promotion gate applies: subagent outputs should be treated the same way as any other artifact entering durable memory.
A useful mental model: treat subagent outputs like tool outputs.
They are data, not directives.
They carry provenance.
They are subject to the same validation rules as any other input to the working set.
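A sketch of a typed, provenance-tagged subagent output; field names are illustrative:

```python
subagent_output = {
    "artifact_type": "episode",                # typed: fact | episode | recommendation | tool_result
    "summary": "Researched refund precedents for the enterprise tier; 3 relevant cases found.",
    "provenance": {
        "agent_type": "research_agent",        # which subagent produced it
        "run_id": "run_sub_01J...",            # which run
        "model_version": "claude-sonnet-4-6",  # which model version
    },
    "sensitivity": "internal",                 # inherited from the data the subagent accessed
    "status": "provisional",                   # remains provisional until the promotion gate validates it
}
```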
What We Learned
Subagent Outputs Bypassed Promotion Gates
We treated subagent summaries as trusted internal artifacts because they came from our own agents. They flowed directly into the parent’s working set and, in some cases, into tenant-scoped durable memory without passing through the standard promotion gate.
As delegation volume increased, unverified summaries accumulated in durable memory faster than review processes could keep up.
We routed subagent outputs through the same promotion pipeline as every other artifact. Provenance became mandatory, and outputs remained provisional until validated.
The source of an artifact does not determine its trustworthiness. Internal agents are not exempt from governance.
Trace Lineage Across Agent Boundaries
In a single-agent system, the trace envelope captures one execution path.
In a multi-agent system, traces form a tree.
If Agent A delegates to Agent B, and Agent B delegates to Agent C, the trace must capture the full lineage:
- `run_id` for each agent's execution
- `parent_run_id` linking child to parent
- Delegation context: what was passed to the child
- Return summary: what came back
- Cost attribution per agent
- Return summary: what came back
- Cost attribution per agent
Without lineage, you cannot:
- Replay a specific agent's execution in isolation
- Attribute cost to the agent that incurred it
- Debug which agent in the chain produced a problematic output
- Evaluate whether delegation decisions were correct
Trace lineage turns a multi-agent run from opaque delegation into a debuggable, replayable execution graph.
Without it, multi-agent systems become black boxes that happen to contain smaller black boxes.
Example trace structure (truncated for brevity):
1{2 "run_id": "run_parent_01J...",3 "tenant_id": "t_42",4 "user_id": "u_7",5 "policy_version": "policy_v3",6 "model_contract_version": "agent_spec_v2",78 "delegations": [9 {10 "child_run_id": "run_sub_01J...",11 "agent_type": "research_agent",12 "model_version": "claude-sonnet-4-6",13 "input_artifact_ids": ["mem_88", "policy_v3"],14 "output_artifact_ids": ["sum_441"],15 "tokens_in": 1420,16 "tokens_out": 380,17 "tools_invoked": 3,18 "cost_estimate_usd": 0.018,19 "promotions": [],20 "delegations": []21 }22 ],2324 "promotions": [],25 "total_cost_estimate_usd": 0.04126}
Scope Inheritance Rules
When a parent agent delegates to a subagent, the subagent must operate within the correct scope boundaries.
Tenant scope and user scope must be inherited. If a subagent operates outside the parent's tenant boundary, isolation is violated. This is not optional.
Session scope is different. The subagent should have its own ephemeral session scope. It should not inherit the parent's session history, scratch buffers, or intermediate plans. Those belong to the parent's execution context.
Policy visibility must also propagate. If the parent operates under tenant policy version 3.2, the subagent must operate under the same version. Policy version drift across agents within a single run creates inconsistency that is extremely difficult to debug.
Summary of inheritance rules:
| Scope | Inherited? | Notes |
|---|---|---|
| Tenant identity | Yes (mandatory) | Isolation boundary |
| User identity | Yes (mandatory) | Entitlement boundary |
| Tenant policies | Yes (mandatory) | Must be same version as parent |
| Global constitution | Yes (mandatory) | Immutable, always present |
| Session state | No | Subagent gets its own session scope |
| Parent's working set | No | Only explicitly passed artifacts |
| Privacy mode | Yes (mandatory) | Cannot be downgraded by delegation |
If privacy mode is active in the parent, it must be active in every subagent. Delegation cannot downgrade privacy guarantees.
If policy version differs between parent and child, the trace will show inconsistent evaluation and replay will not reproduce the behavior.
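A sketch of enforcing those inheritance rules at delegation time, reusing the `IdentityEnvelope` from Part IV; the drift check and exception are illustrative:

```python
def derive_child_envelope(parent: IdentityEnvelope, parent_policy_version: str) -> IdentityEnvelope:
    # Tenant, user, policy version, and privacy mode are inherited verbatim.
    # Session state and the parent's working set are NOT inherited here;
    # the child gets its own session scope and only explicitly passed artifacts.
    return IdentityEnvelope(
        tenant_id=parent.tenant_id,
        user_id=parent.user_id,
        roles=parent.roles,
        privacy_mode=parent.privacy_mode,      # cannot be downgraded by delegation
        policy_version=parent_policy_version,  # must match the parent exactly
    )

def assert_no_policy_drift(parent_version: str, child_version: str) -> None:
    if parent_version != child_version:
        raise RuntimeError(f"policy version drift across delegation: {parent_version} != {child_version}")
```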
Cost Attribution Across Agents
Multi-agent runs compound every cost surface described in Part IX.
Each subagent incurs its own:
- Inference cost (its own context window, its own model invocation)
- Retrieval cost (its own queries against the retrieval index)
- Tool cost (its own external API calls)
- Persistence cost (if it promotes anything to durable memory)
Without per-agent cost attribution, optimization is impossible because you cannot see which agents are expensive.
Common failure mode:
A parent agent delegates to five subagents. Total run cost rises. The trace shows aggregate token counts. But it does not show that one subagent consumed 60% of the budget because its retrieval surface was over-broad.
Per-agent cost attribution within a run is not optional in commercial systems. It is the only way to identify which delegation paths are economically sustainable and which need tighter budgets.
The trace envelope must decompose cost by agent, not just by surface.
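A sketch of decomposing cost per agent from a delegation tree shaped like the example above; field names are assumptions, not a fixed schema:

```python
def cost_by_agent(trace: dict, path: str = "root") -> dict:
    """Walk the delegation tree and attribute cost to the agent that incurred it."""
    costs = {}
    child_total = 0.0
    for child in trace.get("delegations", []):
        child_path = f'{path}/{child.get("agent_type", "agent")}'
        costs.update(cost_by_agent(child, child_path))
        child_total += child.get("cost_estimate_usd", 0.0)

    total = trace.get("total_cost_estimate_usd", trace.get("cost_estimate_usd", 0.0))
    costs[path] = round(total - child_total, 6)  # this agent's own share, excluding children
    return costs

# With the earlier example, this surfaces the parent's own spend alongside the
# research_agent's share, making it visible when one delegation path dominates the budget.
```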
Observed Ecosystem Convergence
Multi-Agent Context Isolation Is Becoming Structural
Pattern: Agent systems are separating agent execution contexts with explicit boundaries rather than sharing state.
- Claude Code implements subagent isolation through the Task tool. These subagents receive scoped task descriptions, operate with fresh contexts, and return structured summaries. The parent's context is not shared wholesale. Subagents cannot spawn other subagents, enforcing a single-level delegation hierarchy. The Anthropic Agent SDK extends this with `parent_tool_use_id` fields for tracing delegation lineage.
- Letta supports multi-agent architectures where each agent maintains its own memory blocks and state. Cross-agent communication happens through explicit message passing (`send_message_to_agent_async` and `send_message_to_agent_and_wait_for_reply`), not shared context windows. When state must be shared, Letta uses explicitly attached shared memory blocks rather than implicit context inheritance.
- The OpenAI Agents SDK supports agent handoffs where conversation state transfers between specialized agents. By default the receiving agent sees full conversation history, but `input_filter` functions give explicit control over what context propagates. The SDK also supports `nest_handoff_history`, which collapses prior transcripts into summary messages rather than passing raw history, implementing context scoping as a first-class API. It also supports an agents-as-tools pattern for nested delegation.
- AWS Bedrock AgentCore supports multi-agent orchestration with a supervisor-agent pattern where specialized sub-agents maintain independent configurations and tool access. AgentCore Memory provides memory branching, isolated conversation branches within a shared memory resource, so each agent maintains its own context history within shared tenant boundaries.
Context isolation between agents follows the same pattern as context isolation between tenants: structural separation with explicit, controlled sharing.
Architectural Principle
Delegation Multiplies Risk
Every agent boundary must preserve isolation and replay.
- Subagents receive minimal, intentional context.
- Cross-agent flows remain scoped and traceable.
- Inherited sprawl is a systemic failure.
Part VI: Isolation, Poisoning, and Promotion Control
If you build commercial agents long enough, you eventually learn a frustrating truth:
Most failures don’t look like failures.
They look like slightly worse output... until you realize the system is drifting.
That drift is usually caused by one of three things:
- Isolation boundaries weren’t enforced consistently
- Bad context was retrieved and treated as truth
- Session artifacts were promoted into durable memory
The danger isn’t that the model hallucinates.
The danger is that the system starts believing it.
Isolation Is a Data-Plane Primitive
Here’s the invariant again because it’s worth repeating:
Isolation is enforced in the data plane, not the prompt.
If your strategy relies on the model restricting itself to tenant-specific content, you’re already in trouble.
Isolation must be structural.
This means:
- Every retrieval query must include the tenant/user filter as a required predicate
- Filters always apply before ranking
- When documents are fetched, a tenant mismatch must throw an exception
- Derived indexes should be either physically partitioned or logically partitioned with verified predicates
It sounds obvious.
```python
class IsolationBreach(Exception):
    pass

def guarded_fetch(artifact_ids: list[str], tenant_id: str):
    assert tenant_id

    records = canonical_memory_store.get_many(artifact_ids)

    for r in records:
        # Hard failure. Do not "best-effort" isolate.
        if r.tenant_id != tenant_id:
            raise IsolationBreach(f"tenant mismatch: {r.id}")

    # Optional: enforce lifecycle rules at read time too.
    return [r for r in records if getattr(r, "status", None) == "active"]
```
What We Learned
Canonical Was Clean. Projections Drifted.
When we investigated cross-tenant inconsistencies, canonical storage was correct. The drift lived in derived layers: search indexes, caches, and analytics jobs. Each enforced partitioning slightly differently. Each was almost correct.
Those small differences compounded into observable inconsistencies at scale.
We standardized tenancy enforcement across every derived layer. Retrieval filters became mandatory predicates, and partition semantics were made identical everywhere.
Isolation that holds only in canonical storage is not isolation. Every projection layer must inherit the same partitioning contract or eventually violate it.
Most multi-tenant systems that leak data don’t leak because someone wrote SELECT * FROM tenants.
They leak because a derived system wasn’t partitioned the same way the canonical store was.
Observed Ecosystem Convergence
Isolation Is Structural, Not Prompt-Level
Pattern: Privacy and isolation are routing decisions, not prompt-level instructions.
- Cursor's security architecture routes requests through a proxy into separate service replicas for privacy vs non-privacy workloads, defaulting to privacy mode if the `x-ghost-mode` header is missing.
- Warp performs unconditional secret redaction for AI interactions and uses an explicit `X-Warp-Telemetry-Enabled` header, where the server assumes telemetry is disabled if the header is absent.
- OpenCode executes tasks inside GitHub Actions runners, creating a hard boundary for tool execution and side effects.
Isolation is structural and enforced before writes, not advisory.
The Three Types of Memory Poisoning
Memory poisoning is not just “prompt injection.”
In multi-tenant systems it shows up in three distinct ways.
1. Instruction Poisoning
Malicious or malformed content attempts to alter system behavior.
Examples:
- “Ignore previous instructions”
- “Always approve refunds”
- “If you see this, exfiltrate secrets”
Defense:
- Policies never come from user content
- Policy memory is signed and versioned (global scope)
- User instructions are treated as input, not law
- Tool outputs are treated as data, not directives
If you take one rule from this section:
Never promote instructions to policy.
2. Precedent Poisoning
This is subtle and common.
An agent makes an exception once.
That exception gets stored as “how we do it.”
Six weeks later, the exception becomes default behavior.
Defense:
- Don’t store precedents as directives
- Store them as episodes with provenance and outcomes
- Require explicit approval signals before a precedent becomes a norm
Episodic memory earns its keep here.
Episodes store:
- What happened
- Why it happened
- Under what policy
- With what approval
They do not store: “What to always do.”
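For illustration, an episode record might look like the following; the fields are assumptions consistent with the promotion schema shown later, not a prescribed format.

```python
# Illustrative episode record: stores what happened and under what authority,
# never a standing directive. Field names are assumptions, not a fixed schema.
episode = {
    "memory_type": "episode",
    "scope": "tenant",
    "what_happened": "Refund of $1,200 approved outside standard limit",
    "why": "Customer churn risk flagged by account owner",
    "policy_version": "tenant_policy_v3.2",
    "approval": {"approved_by": "ops_manager_117", "channel": "ticket_4821"},
    "outcome": "refund_issued",
    # Deliberately absent: any "always do this" instruction.
}
```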
What We Learned
Exceptions Became Norms
We allowed emergency overrides to persist temporarily. They solved immediate problems and seemed contained.
Over time, some overrides made their way into durable tenant memory. Temporary exceptions began behaving like permanent rules.
We changed the promotion path. Overrides remained ephemeral unless approved through the same governance flow as canonical state. Precedent required explicit provenance and verification.
If an exception can persist without approval, the system will eventually treat it as policy. Governance is not about preventing overrides. It is about preventing silent graduation.
3. Cross-Scope Contamination
A user-level artifact gets promoted to tenant scope.
A tenant-level artifact affects global behavior.
A retrieval index accidentally crosses tenants.
When this happens:
- Quality degrades everywhere
- Security risk spikes
- Replay becomes ambiguous
Defense:
- Promotion gates enforce scope rules
- Global scope is write-locked
- Tenant scope requires stricter verification than user scope
- Every memory write includes scope, retention, sensitivity, and provenance
What We Learned
Automatic Learning Was a Trap
We experimented with automatically persisting what the agent “learned” during a run. It felt like progress. The system appeared to evolve.
Over time, small local mistakes were promoted into durable memory. Durable memory amplified those errors and fed them back into future runs.
We moved promotion behind an explicit gate. Every durable write required scope classification, provenance, retention policy, and verification. Session state remained volatile unless intentionally promoted.
Intelligence that persists without governance is not learning. It is drift with momentum.
Promotion: The Most Dangerous Operation
Promotion is the transition from session state to durable memory.
It is where most memory poisoning becomes permanent.
Promotion must be treated like a database write.
Not like a convenience feature.
A promotion gate should answer four questions:
- What scope can this live in? Session vs user vs tenant vs global
- What type is it? Preference vs fact vs episode vs policy
- What is the retention policy? TTL vs manual deletion vs legal hold
- What is the provenance? Where did it come from, and can we replay it?
note
New users may require a bootstrapping phase with more permissive promotion that tightens over time. Otherwise newly onboarded tenants face a cold-start challenge where overly restrictive promotion gates mean the system has no memory to work with and delivers poor early experiences.
Default Promotion Rules
- Session → User: allowed for preferences, drafts, working style, and user-specific episodes
- Session → Tenant: allowed only for verified facts and approved episodes
- Session → Global: never allowed at runtime
Verification Rules
Facts stored at tenant scope require:
- Human approval
- Trusted system-of-record confirmation
- Repeated corroboration across independent sources
Sensitivity Rules
- Never persist secrets (API keys, tokens)
- Be careful persisting PII without explicit retention rules and consent
Pseudo-Policy:
```python
def promote(candidate, envelope: IdentityEnvelope, *, trace, now_ts: int):
    assert_envelope(envelope)

    # Invariants:
    # - No runtime writes to global scope
    # - Policy is a signed artifact, never promotable
    if candidate.scope == "global":
        return reject("global scope is write-locked")
    if candidate.memory_type == "policy":
        return reject("policy is a signed artifact, not promotable")

    # Tenant writes are high-stakes.
    if candidate.scope == "tenant":
        if candidate.memory_type == "fact" and not (candidate.verified or candidate.human_approved):
            return reject("tenant facts require verification or explicit approval")
        if candidate.memory_type == "episode" and not (candidate.human_approved or candidate.from_trusted_workflow):
            return reject("tenant episodes require approval or trusted workflow signal")

    # Sensitivity guardrails.
    if candidate.contains_secrets:
        return reject("secrets are never persisted")

    # Minimal canonical write: reference + digest + provenance (never raw payload).
    record = {
        "memory_id": new_id("mem"),
        "tenant_id": envelope.tenant_id,
        "user_id": envelope.user_id,
        "scope": candidate.scope,
        "memory_type": candidate.memory_type,
        "status": "provisional",
        "content_ref": candidate.content_ref,
        "content_digest": candidate.content_digest,
        "provenance_run_id": trace.run_id,
        "policy_version": envelope.policy_version,
        "retention_policy": candidate.retention_policy,
        "created_at": now_ts,
    }
    canonical_memory_store.put(record)

    trace.record_event({
        "event_type": "promotion_write",
        "memory_id": record["memory_id"],
        "status": record["status"],
        "scope": record["scope"],
        "memory_type": record["memory_type"],
        "content_digest": record["content_digest"],
    })

    # Everything expensive is off the critical path.
    enqueue_hardening(memory_id=record["memory_id"])
    return record["memory_id"]
```
This is the generational GC analogy in practice:
- Session state is cheap and volatile
- User memory is moderately durable
- Tenant memory is high-stakes and requires verification
- Global memory is write-locked
Promotion discipline is not about paranoia.
It is about protecting invariants.
What We Learned
Promotion Gates Need Calibration, Not Just Restriction
We initially tightened promotion gates to prevent bad writes. The instinct was correct: durability should be earned.
Over time, the system began forgetting legitimate outcomes. Verified decisions expired with session state, and when related tasks resurfaced weeks later, the agent had no grounding in what had already been established.
We recalibrated promotion by memory type. Facts and policy remained tightly gated. Episodes from completed workflows were eligible for promotion with automatic provenance tagging and verification signals.
A promotion gate is not just a wall. It is a calibration surface. Too permissive and drift compounds. Too restrictive and the system forgets what it legitimately learned.
Observed Ecosystem Convergence
Promotion Gating Is a Product-Level Pattern
Pattern: Durable memory requires explicit approval, not automatic persistence.
- Cursor extracts Memories from chat but saves them only with user approval, which is a direct implementation of promotion gating.
- Bolt intentionally clears chat history when switching agents, instructing users to preserve durable guidance in "Project Knowledge."
- Lovable positions "Custom knowledge" as persistent shared memory applied across future edits, not accumulated chat transcript.
Session state is ephemeral by default; durability requires a gate.
What We Learned
Deletion Drifted Without Synchronized Invalidation
We treated deletion in canonical storage as sufficient. The system of record behaved correctly.
Derived indexes did not. They continued serving artifacts that no longer existed in canonical truth. Projections outlived their source.
We introduced synchronized invalidation. Deletion emitted tombstones with provenance and triggered coordinated updates across every derived layer.
Deletion without invalidation creates both cost drift and correctness drift. If a deleted artifact can resurface through a projection, deletion is not complete.
Architectural Principle
Promotion Is a Database Write
Promotion changes durable state and must be treated as such.
- Global scope is write-locked.
- Policy artifacts are not promotable.
- Tenant memory requires verification and provenance.
Part VII: Asynchronous Hardening and Memory Lifecycle
Promotion is the transition from volatile session state to durable canonical memory.
But durable memory should not be fully materialized synchronously.
In commercial systems, enrichment and hardening steps are frequently asynchronous.
These operations are:
- Computationally expensive
- Potentially slow
- Sometimes dependent on external systems
- Often unnecessary for immediate inference
The critical path should do only what is required to establish canonical truth.
Everything else belongs in a background processing pipeline.
Minimal Canonical Writes
The inference loop writes:
- A minimal canonical record
- With scope
- With type
- With provenance
- With retention policy
- With sensitivity classification
It does not block on:
- Embedding generation
- Cross-document deduplication
- Fact corroboration
- Conflict detection
- Episodic summarization
- Index rebuilds
- Retention reclassification
Those belong to hardening.
The Hardening Pipeline
Pattern:
```
run_complete
  → promotion_candidate written to canonical store (minimal record)
  → enqueue enrichment tasks

background_worker
  → validate / enrich / embed
  → update derived indexes
  → emit trace event
```
This separation accomplishes four things:
- Keeps latency predictable
- Prevents enrichment failures from blocking inference
- Preserves canonical truth even if derived layers fail
- Enables cost amortization through batched AI inference
Design rule:
The inference loop writes minimal verified canonical records.
Everything else is projection.
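A hedged sketch of the background half of this split, assuming a retryable worker and illustrative helpers (`generate_embedding`, `corroborate`, `derived_index`, `set_status`) standing in for real enrichment services:

```python
def harden(memory_id: str):
    # Background worker: everything here is retryable and off the critical path.
    record = canonical_memory_store.get(memory_id)

    # Enrichment failures degrade recall, never correctness: on any failure the
    # canonical record stays intact and its status remains provisional.
    embedding = generate_embedding(record["content_ref"])   # illustrative helper
    if not corroborate(record):                              # illustrative helper
        canonical_memory_store.set_status(memory_id, "quarantined")
        return

    derived_index.upsert(memory_id, embedding, tenant_id=record["tenant_id"])
    canonical_memory_store.set_status(memory_id, "active")
```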
Observed Ecosystem Convergence
Asynchronous Hardening Is an Emerging Architectural Norm
Pattern: Enrichment and consolidation are moving off the critical inference path.
- Letta processes messages asynchronously where a "subconscious agent" updates memory blocks out of band.
- Amazon Bedrock AgentCore generates long-term structured memory asynchronously from raw session events, with semantic retrieval APIs for later access.
Both systems keep enrichment off the critical path. Canonical truth persists immediately; projections follow.
Lifecycle States
Asynchronous hardening introduces a non-obvious reality:
Canonical truth may be persisted before it is fully trusted.
Treat newly promoted records as provisional until hardening completes. That means every promoted artifact carries an explicit status and can move through a small lifecycle:
- Provisional: persisted with provenance and scope; eligible for replay; not eligible for broad retrieval
- Active: validated and hardened; eligible for retrieval and reuse
- Quarantined: suspected poisoning, contradictions, or failed checks; excluded from retrieval
- Revoked: explicitly superseded or deleted via tombstone with provenance
Hardening determines retrieval eligibility.
Inference should never block waiting for enrichment.
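In code, the gate can be as small as this, with status values taken from the lifecycle above and an illustrative record shape:

```python
RETRIEVAL_ELIGIBLE = {"active"}

def eligible_for_retrieval(record: dict) -> bool:
    # Provisional, quarantined, and revoked records never feed broad retrieval.
    # They remain in canonical storage for replay and audit until retention expires.
    return record.get("status") in RETRIEVAL_ELIGIBLE
```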
What We Learned
Quarantine Needed a Replacement Path
Our first quarantine behaved correctly. A tenant-scoped fact was contradicted during asynchronous hardening, flagged, and excluded from retrieval.
What we did not anticipate was the vacuum it created. That fact had grounded responses for weeks. Once removed, the agent simply stopped referencing it, with no visible signal that context had changed.
We added a resolution workflow. Quarantine no longer meant exclusion alone. It triggered review, leading to correction, explicit revocation, or reinstatement.
Lifecycle states need more than transitions. They need resolution paths. Quarantine without resolution is silent deletion.
Observed Ecosystem Convergence
Lifecycle State and Retrieval Eligibility Are Explicit
Pattern: Memory records carry explicit lifecycle status that gates retrieval eligibility.
- Amazon Bedrock AgentCore formalizes memory organization with actor and session scoping, recommending namespaced organization to avoid conflicts. It also warns that event metadata isn't meant for sensitive content because it isn't encrypted with a customer-managed key.
- Windsurf implements global, workspace, and system-level memory tiers with distinct lifecycle characteristics.
Scope boundaries and sensitivity classification are becoming structural, not advisory.
Failure Modes and the Canonical Contract
Asynchronous hardening introduces several failure modes.
They must degrade recall, not correctness.
1. Enrichment Failure
Embedding generation fails.
Summarizer times out.
Downstream index is unavailable.
Rule:
- Canonical record remains intact
- Status remains provisional
- Derived projections are retried
Correctness is preserved.
2. Contradiction Discovered
Corroboration fails.
Trusted system-of-record disagrees with agent-authored fact.
Rule:
- Mark artifact as quarantined
- Emit trace event
- Optionally write corrective memory with higher precedence
Never silently overwrite.
3. Duplication and Merge Races
Multiple runs promote semantically identical artifacts.
Deduplication merges incorrectly.
Rule:
- Use deterministic identity keys (content_digest + scope + type)
- Make merge operations idempotent
- Record merge decisions in the event log
Projection must not rewrite canonical history without trace.
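A deterministic identity key is straightforward to construct; this sketch assumes the digest, scope, and type already exist on the candidate record:

```python
import hashlib

def identity_key(content_digest: str, scope: str, memory_type: str) -> str:
    # The same artifact promoted twice resolves to the same key,
    # which makes merge operations idempotent rather than racy.
    raw = f"{content_digest}:{scope}:{memory_type}".encode()
    return hashlib.sha256(raw).hexdigest()
```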
4. Late Revocation
Tenant updates policy.
User revokes consent.
Compliance deletion arrives after embedding and indexing.
Rule:
- Tombstones are first-class
- Deletion triggers coordinated invalidation across every derived layer
Deletion without invalidation resurrects stale context.
5. Partial Projection
Canonical write succeeds.
Some derived indexes update.
Others do not.
Rule:
- Retrieval must be tolerant of missing projections and fall back to canonical fetch paths.
Projections should never be required for correctness.
Memory Conflict and Drift
Over time, multi-tenant agents accumulate contradictions.
Examples:
- Tenant policy changed but old memory persists
- Preference updated but stale entries remain
- Past episode contradicts current entitlement state
You need a strategy for conflict. The simplest workable approach:
- Prefer newest memory with verified provenance
- Prefer canonical systems of record over agent-authored facts
- Mark memories with “confidence” and “verified” flags
- Allow revocation and explicit tombstones (deletions with provenance)
Tombstones matter because they:
- Prove deletion
- Stop retrieval from resurrecting stale artifacts
- Ensure derived indexes invalidate consistently
Hardening without tombstones creates retention drift.
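A minimal tombstone sketch, assuming illustrative `derived_index` and `cache` handles alongside the same `canonical_memory_store` used in the promotion example:

```python
def revoke(memory_id: str, reason: str, *, trace):
    # A tombstone is a first-class record: it proves deletion and carries provenance.
    tombstone = {
        "memory_id": memory_id,
        "status": "revoked",
        "reason": reason,
        "revoked_in_run": trace.run_id,
    }
    canonical_memory_store.put(tombstone)

    # Invalidation must cascade, or projections resurrect the artifact later.
    derived_index.delete(memory_id)   # illustrative handle
    cache.invalidate(memory_id)       # illustrative handle
    trace.record_event({"event_type": "tombstone", "memory_id": memory_id})
```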
Why Hardening Exists at All
Without asynchronous hardening:
- Latency becomes unpredictable
- Inference blocks on expensive enrichment
- Failures cascade into user-visible errors
Without hardening gates:
- Provisional artifacts influence retrieval prematurely
- Poisoned memory spreads
- Verification becomes retroactive instead of preventive
Hardening separates:
- Canonical truth persistence
- Retrieval eligibility
- Acceleration layer maintenance
This is the canonical contract:
Inference writes minimal truth.
Hardening validates and projects.
Derived layers accelerate.
Architectural Principle
If It Isn’t Canonical, It Isn’t Durable
Durable state must be established synchronously; enrichment happens asynchronously and never blocks correctness.
- The critical path writes minimal canonical truth with identity, scope, provenance, and cost surfaces pinned.
- Embeddings, deduplication, enrichment, and index updates occur off the critical path and are fully rebuildable.
- Replay and audit depend only on canonical records, never on projections or background jobs.
Part VIII: Privacy, Retention, and Cryptographic Boundaries
Isolation protects tenants from each other.
Privacy protects tenants from you.
Privacy is often treated as a feature. In commercial agent systems, it must be treated as a data-plane architecture decision.
If your system promises "we don't retain this," "this run won't train models," "this tenant's data is isolated," or "this is ephemeral", then those guarantees must be visible in your routing, storage, and indexing layers. Not just in marketing copy.
If privacy depends on flags inside the model prompt or conditionals inside business logic, it will eventually fail.
The only durable privacy guarantee is structural separation.
Compliance requires encryption at rest.
Architecture requires partitioned routing.
Threat containment requires domain separation.
Privacy as Architectural Routing
Isolation is mandatory for multi-tenancy. It's a structural invariant.
If Tenant A can influence or see Tenant B’s data, the system is broken.
Privacy posture is different.
Privacy posture is a policy commitment enforced by architecture.
It defines how much the platform itself retains, observes, or learns from tenant activity.
A common anti-pattern:
```python
if privacy_mode:
    disable_logging()
```
This is extremely fragile.
Logging isn't the only place data persists. Data can leak into async queues, derived retrieval indexes, observability pipelines, caches, debug traces, analytics jobs, and model prompt caching layers.
If "privacy mode" relies on remembering to guard every one of those paths, it will eventually fail.
In production, the failure mode isn't malicious. It's incremental. A new logging layer is introduced. A new background job is added. A new cache is deployed. And privacy assumptions silently degrade.
Privacy modes should influence:
- Where traces are written
- Whether promotion is allowed
- Whether retrieval indexes are updated
- Whether embeddings are generated
- Whether objects are persisted
Privacy must route execution differently, not just mask output.
But the critical part is this:
The no-retention path should not share the same physical data plane as the retention path.
Separate buckets.
Separate partitions.
Separate encryption domains.
Otherwise accidental logging becomes inevitable.
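A sketch of what routing-before-writes can look like, with illustrative sink objects (`metadata_only_trace_log`, `ephemeral_object_store`, and so on) standing in for physically separate infrastructure:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class DataPlane:
    trace_sink: Any
    promotion_enabled: bool
    derived_index: Optional[Any]
    object_store: Any

def select_data_plane(envelope) -> DataPlane:
    # The routing decision happens before any write occurs. The two planes do
    # not share buckets, partitions, or encryption domains.
    if envelope.privacy_mode:
        return DataPlane(
            trace_sink=metadata_only_trace_log,   # run_id, timings, token counts only
            promotion_enabled=False,
            derived_index=None,                   # no embeddings generated
            object_store=ephemeral_object_store,  # short TTL, separate key domain
        )
    return DataPlane(
        trace_sink=retained_trace_log,
        promotion_enabled=True,
        derived_index=tenant_partitioned_index,
        object_store=durable_object_store,
    )
```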
What We Learned
Logging Drifted Past Privacy Boundaries
We implemented privacy by masking sensitive fields and stripping explicit PII markers. It appeared compliant.
Over time, sensitive data surfaced in places we had not guarded: tool payloads, derived projections, and trace fragments. Cleanup logic missed new paths as the system evolved.
We moved privacy enforcement to ingestion. Privacy mode became part of the identity envelope, and routing decisions were made before any writes occurred.
Cleanup logic is fragile. Routing is structural. If privacy is enforced after writes, every new logging layer, background job, and cache becomes a potential leak.
The Stronger Pattern: Separate Data Planes
The more defensible pattern (validated by mature SaaS systems) is this: route privacy-mode traffic through a separate path.
That can mean separate replicas, separate storage backends, separate queues, suppressed or redirected observability streams, and distinct retention policies at the storage layer.
The point is structural separation. At minimum, enforce physical storage partitioning and routing-level isolation.
A privacy-aware system may include:
- Retained trace log
- Metadata-only trace log
- Durable structured memory store
- Ephemeral session store
In privacy mode:
- Trace logs may contain only metadata (`run_id`, timestamps, token counts)
- Canonical structured memory may disallow promotion entirely
- Object store writes may be disabled
This is not just policy enforcement.
It is architectural branching.
If your privacy guarantee requires scanning logs after the fact, you are already too late.
Observed Ecosystem Convergence
Privacy Routing as Architectural Separation Is Production Practice
Pattern: Privacy is a routing decision enforced before any writes occur, not a flag checked after.
- Cursor implements separate service replicas and parallel queues/workers per privacy mode, defaulting to privacy-mode if the routing header is missing.
- Warp documents that Business/Enterprise plans operate under zero-data-retention agreements, with the server assuming telemetry disabled if the header is absent.
Safe defaults and physical separation replace conditional scrubbing.
What Privacy Mode Should Actually Control
At minimum, privacy mode should govern:
- Event log retention: Are trace envelopes persisted? If yes, are payloads redacted? If no, is only metadata retained?
- Structured memory promotion: Are promotion gates disabled? Are only session-scope artifacts allowed?
- Derived index writes: Are embeddings created? Where are they stored? Are they scoped to ephemeral partitions?
- Object store persistence: Are large payloads retained? Encrypted with tenant-specific keys? Auto-expiring?
Retention Discipline
If you promise "no retention," define what that means precisely.
Does it mean no canonical trace stored? No structured memory writes? No embedding generation? No analytics logs? No prompt caching? No external model provider retention?
You cannot say "no retention" if the request is still embedded into a global vector index, flows into analytics dashboards, or is logged to a shared debugging stream.
A clean architectural approach:
- Metadata-only trace retention
- run_id, cost, timing
- no payload
- Ephemeral session store
- in-memory or short TTL
- Derived indexes disabled
- Object store writes blocked or TTL-bound
- Separate observability stream with redaction
That way your promise is enforceable, not aspirational.
Retention is where drift hides
Without explicit retention semantics:
- Session artifacts become durable
- Durable artifacts never expire
- Index entries outlive source truth
Every canonical record must carry retention metadata:
- Retention policy ID
- Expiration timestamp or rule
- Sensitivity classification
Example retention policy:
```json
{
  "memory_id": "mem_441",
  "scope": "user",
  "memory_type": "preference",
  "retention_policy": "ttl_90_days",
  "expires_at": "2026-05-16T00:00:00Z"
}
```
Retention must cascade
When a canonical record expires:
- A tombstone is emitted
- Derived projections are invalidated
- Object store references are removed or reclassified
Retention without projection invalidation creates ghost memory.
Designing this after the fact is painful.
Designing it up front as an alternate routing path is tractable.
Privacy cannot be retrofitted.
The more you know
Retention Drift Is Real
Even well-designed systems accumulate "retention drift" over time. A new logging layer is added and isn't privacy-aware. A background embedding job writes to the wrong index. A debug flag writes full transcripts to object storage. A feature team builds a derived index and forgets to scope it.
We quickly learned that deletion alone is not enough. Unless invalidation is tightly coordinated across derived layers, indexes will continue serving content that has already been removed from canonical storage.
The system remained operational. The storage layer behaved correctly.
The issue was lifecycle design. Without synchronized invalidation, derived systems outlive the truth they project.
This is why separation by architecture is stronger than separation by conditionals.
Encryption and Tenant Keys
Isolation is a policy boundary.
Encryption is a cryptographic boundary.
In any commercial system, canonical stores and object stores should be encrypted at rest. In multi-tenant enterprise-grade systems, we need to go further:
- Tenant-scoped encryption keys
- Key rotation policies
- Envelope encryption for object payloads
- Separation of encryption domains between canonical and derived stores
Why this matters:
- A storage misconfiguration should not automatically imply cross-tenant readability.
- Deletion guarantees are stronger when encryption keys can be revoked.
- “No retention” modes can be reinforced with short-lived encryption domains.
- Object stores containing large payloads require stricter cryptographic boundaries than derived summaries.
Recommended pattern:
- Canonical store: tenant-scoped envelope encryption (per-tenant KEK, rotating DEKs)
- Object store: tenant-scoped encryption + content digests (integrity/provenance), with sensitivity-based storage rules
- Retrieval index: tenant-partitioned index; store encrypted IDs + embeddings for curated artifacts only
- Cache layers: tenant-partitioned, TTL-bound caches; encrypt with short-lived keys; never authoritative
- Event log: tenant-partitioned; metadata encrypted with platform-managed KMS for cross-tenant analytics, and payloads encrypted with tenant-scoped envelope encryption (or stored as tenant-encrypted object references)
This allows:
- Tenant-level cryptographic revocation
- Controlled key rotation
- Partitioned blast radius
Key design principle:
If derived systems are compromised, canonical truth and raw payloads should remain cryptographically isolated.
Encryption is not just compliance posture. It reinforces scope boundaries.
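For illustration, envelope encryption can be sketched with symmetric keys. In practice the tenant KEK would live in a KMS, and `Fernet` here is only a stand-in for the platform's cipher:

```python
from cryptography.fernet import Fernet

# Per-tenant KEK: in production this lives in a KMS, generated locally only for illustration.
tenant_kek = Fernet(Fernet.generate_key())

def encrypt_payload(payload: bytes) -> dict:
    # Envelope encryption: a fresh DEK encrypts the payload, the tenant KEK
    # encrypts the DEK. Revoking the KEK revokes everything beneath it.
    dek = Fernet.generate_key()
    return {
        "ciphertext": Fernet(dek).encrypt(payload),
        "wrapped_dek": tenant_kek.encrypt(dek),
    }

def decrypt_payload(record: dict) -> bytes:
    dek = tenant_kek.decrypt(record["wrapped_dek"])
    return Fernet(dek).decrypt(record["ciphertext"])
```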
What We Learned
Shared Encryption Domains Complicated Lifecycle Changes
We used a single tenant-scoped KMS key across canonical storage and derived indexes. It simplified IAM and key management and worked in steady state.
The coupling surfaced when we needed stronger revocation guarantees for canonical data. Because canonical and derived layers shared the same key, a scoped change required coordinated migrations across storage and index layers.
We separated encryption domains. Canonical stores could rotate or revoke independently, and derived indexes used shorter-lived keys that could be discarded during rebuilds.
The cost of a shared encryption domain is invisible until rotation or revocation is required. By then the coupling is load-bearing.
Observed Ecosystem Convergence
Credential and Cryptographic Containment Is Formalizing
Pattern: Secrets and encryption domains are structurally isolated, not trusted to prompts.
- AWS AgentCore Identity provides a token vault that stores user tokens, handles refresh, and secures tool API keys, aligning with structural containment for secrets.
- Cursor documents temporary encrypted caching with client-generated keys that exist server-side only during the request.
- Anthropic's prompt caching stores KV cache representations and cryptographic hashes rather than raw prompt text, supporting zero-data-retention alongside caching economics.
Encryption domains are separating across volatile and durable layers.
The more you know
Canonical vs Derived Encryption Domains
Another subtle pattern:
Canonical and derived layers should not share identical encryption semantics.
Why? Because projections are rebuildable.
If you use identical encryption domains everywhere:
- A projection compromise can reveal canonical references
- Revocation becomes more complicated
Projections are disposable.
Canonical truth must be durable and revocable.
Relationship to Trace Metadata
Privacy mode complicates decision traces. If you suppress traces entirely, you lose replay. If you persist traces indiscriminately, you violate retention promises.
Trace envelopes should include:
- Model version
- Policy version
- Retrieval artifact IDs
- Token counts
- Tool contract versions
But in privacy mode:
- Persist structural trace metadata
- Redact sensitive payloads
- Allow tenants to opt into extended trace retention
- Separate trace retention from model training retention
This preserves operability without over-collecting.
Replay may become limited in privacy mode. That is acceptable.
What is not acceptable is accidentally retaining sensitive content because logging was decoupled from privacy routing.
Cross-Tenant Isolation Reinforced by Cryptography
Scope enforcement in the data plane protects retrieval.
Encryption enforces isolation at rest.
Even if a bug bypasses retrieval filters:
- Encrypted tenant domains limit exposure
- Key management boundaries provide additional containment
Cryptography does not replace logical isolation. It reinforces it.
Architectural Principle
Privacy as Architecture
Privacy must be enforced structurally, not retrofitted procedurally.
- Routing decisions must occur before any durable write.
- Retention policies must cascade across canonical and derived layers.
- Encryption must reinforce scope boundaries at rest and in transit.
Part IX: Cost Surfaces and Token Economics
In traditional cloud systems, cost scales with compute, storage, and network. In agent systems, cost scales with tokens, retrieval volume, context assembly size, tool invocation frequency, and trace retention footprint.
The difference is subtle but profound:
Cost is no longer driven by infrastructure. It's driven by intent.
And intent flows through context.
In commercial agent systems, cost does not scale linearly with traffic.
It compounds with context.
The Four Primary Cost Surfaces
Per-run cost is not just tokens in and tokens out. In commercial systems it is a composite of four surfaces:
- Inference
- Retrieval
- Tooling
- Persistence
Each surface has different scaling behavior.
A disciplined system tracks these surfaces separately so optimization doesn’t become guesswork.
1. Inference Cost
Inference cost is driven by:
- Input tokens
- Output tokens
- Context window size
- Model selection
- Prompt prefix size (caching effects)
As context grows, inference cost grows even if the business logic remains unchanged.
This is the first place drift becomes visible.
2. Retrieval Cost
Retrieval cost scales with:
- Number of documents indexed
- Embedding dimensionality
- Query volume
- Hybrid ranking complexity
- Cross-tenant filtering overhead
As memory accumulates, retrieval cost rises, even if token budgets stay fixed.
Derived projections amplify cost.
3. Tooling Cost
Tool invocation cost includes:
- External API calls
- Database reads
- Connector queries
- Downstream system calls
- Internal compute
- Side effects (including retries and human approval latency)
Tools often dwarf model cost in enterprise environments.
Without trace attribution per run, tooling cost becomes opaque.
4. Persistence Cost
Persistence cost includes:
- Canonical storage
- Object store usage
- Derived index storage
- Log and trace retention
- Backup and compliance overhead
Durable memory is an economic commitment.
Promotion decisions multiply storage cost over time.
The Economic Model of an Agent Run
Every agent run has a composite cost. It is not just tokens in and tokens out.
It is the sum of the four distinct surfaces.
Formally:

cost_per_run = inference_cost + retrieval_cost + tooling_cost + persistence_cost

Where:
- Inference scales with context size.
- Retrieval scales with memory growth.
- Tooling scales with autonomy.
- Persistence scales with promotion discipline.
Most teams focus on inference cost.
In commercial systems, that is usually not the dominant long-term driver.
Layer-Based Token Budgets
One of the most effective structural controls is explicit token budgeting by layer.
Example allocation:
- Global constitution: 500 tokens
- Tenant policy: 800 tokens
- User memory: 600 tokens
- Retrieved episodes/artifacts: 1,200 tokens
- Session state/scratch: 900 tokens
Total: 4,000 token input budget
Without allocation, session state will crowd out policy.
Without allocation, retrieval will crowd out global constraints.
Layer budgets:
- Preserve guardrails
- Bound token cost
- Reduce drift
- Enable cost prediction
Budget discipline is architectural, not cosmetic.
Without layer-specific budgets, tenant memory can crowd out safety rules, session noise can crowd out durable facts, and retrieval bloat can drown signal.
Example Token Budget:
```python
TOKEN_BUDGET = 4000
allocation = {
    "global": 500,
    "tenant": 800,
    "user": 600,
    "retrieved": 1200,
    "session": 900
}
```
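Enforcement is what makes the allocation real. A sketch, assuming hypothetical `count_tokens` and `truncate_to_tokens` helpers and pre-scoped content per layer:

```python
def assemble_context(layers: dict[str, str], allocation: dict[str, int]) -> str:
    # Each layer is truncated to its own cap, so session noise can never crowd
    # out policy, and retrieval can never crowd out the global constitution.
    parts = []
    for layer, budget in allocation.items():
        text = layers.get(layer, "")
        if count_tokens(text) > budget:               # hypothetical helper
            text = truncate_to_tokens(text, budget)   # hypothetical helper
        parts.append(text)
    return "\n\n".join(parts)
```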
What We Learned
Budgets Matter
We did not enforce strict per-layer token budgets. Context size expanded gradually across runs, sessions, and memory layers.
When traffic spiked, costs did not rise linearly. They jumped. Context growth had been compounding quietly.
We introduced explicit token allocations by layer: global, tenant, user, retrieved, and session. Each had a hard cap.
Without per-layer budgets, context discipline is aspirational. With them, it becomes both quality control and economic control.
Prompt Caching and Context Hashing
In commercial systems, many runs share identical static context layers: global constitution, tenant policy, stable user preferences.
These layers should be explicitly versioned and constructed deterministically.
A stable hash of the static prefix:
```python
prefix_hash = hash(global_constitution + tenant_policy_version)
```
This enables:
- Observability of policy drift
- Deterministic replay
- Compatibility with provider-side prefix caching
- Reduction of redundant prefix construction
When supported by the model provider, identical prefixes may benefit from prompt caching at the infrastructure layer, reducing repeated token charges. Even without provider caching, explicit prefix hashing encourages disciplined versioning and makes constitutional changes visible and auditable.
This reduces inference cost volatility.
But only if:
- Prefix boundaries are stable
- Versioning is explicit
- Trace logs record prefix version
Otherwise cache invalidation becomes opaque.
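A hedged sketch of deterministic prefix hashing; the builtin `hash()` above is shorthand, and a content hash such as SHA-256 is what you would actually record in the trace:

```python
import hashlib

def stable_prefix_hash(global_constitution: str, tenant_policy: str, policy_version: str) -> str:
    # Deterministic construction: identical inputs yield identical hashes,
    # which is what provider-side prefix caching and replay both depend on.
    raw = f"{policy_version}\n{global_constitution}\n{tenant_policy}".encode()
    return "sha256:" + hashlib.sha256(raw).hexdigest()
```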
The more you know
Why Prefix Caching Changes the Incentives
If static layers are cacheable, the marginal cost shifts toward the dynamic tail:
- retrieval payloads
- tool traces
- accumulated session state
That makes compaction and progressive disclosure even more valuable, because you stop paying repeatedly for the same invariant prefix and start paying almost entirely for what you chose to assemble.
Observed Ecosystem Convergence
Prompt Caching and Token Accounting Are Platform Primitives
Pattern: Caching is an architectural cost lever with explicit economic telemetry.
- Anthropic supports automatic and explicit cache breakpoints, storing cryptographic hashes rather than raw text.
- Google Gemini provides a first-class "cached content" object including system instructions and tool configuration.
- Google Cloud Vertex AI reports `cachedContentTokenCount` in response metadata for explicit economic telemetry.
- Amazon Bedrock exposes prompt caching with cache checkpoints and distinct pricing semantics.
All four platforms treat caching as an architectural decision, not an optimization afterthought.
What Token Drift Looks Like
Architectural discussions about token budgets stay abstract until you attach numbers.
Consider a representative workload:
- Early average input: ~1,800 tokens
- Output: ~400 tokens
- Total per run: ~2,200 tokens
- Approximate cost: ~$0.02–$0.03 per run
note
The exact dollar amounts vary by model pricing, input/output rate asymmetry, and whether static prefixes benefit from caching, but the drift dynamics are consistent.
Now introduce three common forms of drift:
- Retrieval surface expands
- Session transcripts are retained instead of compacted
- Promotion frequency increases
Six weeks later, average input grows to ~2,900 tokens. Output remains stable.
Per-run cost rises to ~$0.035–$0.045.
That increase feels small in isolation. It is a 50–70% jump that rarely triggers alarms during development.
At 50,000 runs per day:
- $0.025 → $0.040 average
- ~$22,000 monthly delta
No new features shipped. No model changed.
Context simply accumulated.
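The arithmetic behind those numbers, assuming 30 billable days per month:

```python
runs_per_day = 50_000
baseline_cost = 0.025   # USD per run before drift
drifted_cost = 0.040    # USD per run six weeks later

monthly_delta = (drifted_cost - baseline_cost) * runs_per_day * 30
print(round(monthly_delta))  # 22500 — roughly the monthly delta described above
```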
Why Cost Compounds
Drift rarely starts in inference.
It starts in promotion.
- Every durable memory write increases the retrieval surface.
- A larger retrieval surface increases context payload size.
- Larger payloads increase inference cost.
- Higher inference cost creates pressure to promote summaries.
- Promotion increases durable memory.
The system compounds.
This is why token drift is rarely linear.
It is structural.
Promotion as an Economic Multiplier
Every promotion to durable memory:
- Increases future retrieval size
- Increases index cardinality
- Increases token pressure
- Increases retention cost
Promotion frequency directly influences long-term cost curve.
If 5% of runs promote tenant-scope artifacts at 50,000 runs/day, that is 2,500 new durable records daily.
Reducing tenant-level promotion from 5% to 1% materially slows both storage and token growth curves.
Promotion discipline is not just governance. It is cost control.
Progressive Disclosure vs Up-Front Retrieval
Early systems (like chatbots) generally used a simplistic approach:
- Query retrieval layer broadly
- Insert top-k results
- Let the model sort it out
This often "works," but it lacks discipline.
Not all context should or needs to be visible at once.
Progressive disclosure is the better pattern:
- Retrieve minimal context
- Attempt inference
- If low confidence, retrieve additional context
- Repeat
This avoids the “dump everything into the prompt” anti-pattern.
```python
context = retrieve_minimal(query)
result, confidence = infer(query, context)

if confidence < 0.6:
    context += retrieve_additional(query)
    result, confidence = infer(query, context)
```
What We Learned
More Context Did Not Mean Better Answers
We initially retrieved as much context as possible up front. The assumption was simple: more knowledge should produce better answers.
It did not. Up-front retrieval increased input tokens by roughly 25–30% compared to progressive disclosure, with negligible accuracy difference and slightly lower hallucination risk under staged loading.
We shifted to progressive disclosure. Context was loaded incrementally based on need and explicit budget constraints.
Overloading the context window does not increase intelligence. It increases ambiguity. Progressive disclosure is both a correctness control and an economic control.
Progressive disclosure improves precision, reduces token usage, and stabilizes behavior.
Layer budgets, compaction, promotion limits, and retrieval discipline are not optimizations. They are economic control surfaces.
Progressive disclosure is both correctness control and cost control.
Observed Ecosystem Convergence
Progressive Disclosure Is Replacing Up-Front Context Dumps
Pattern: Minimal context first, expand only when needed.
- Replit avoids dumping full error output into context, instead injecting a minimal signal that tells the agent to pull details via a log tool on demand.
- Anthropic's Skills guide formalizes multi-level loading: metadata always present, full instructions loaded only when needed, linked files navigated on demand.
- Bolt argues that richer model context can reduce wasted iterations, framing token investment as an economic decision.
Context is loaded progressively, not preloaded exhaustively.
Observing Cost as a First-Class Signal
Cost is not a monthly bill. It is a per-run property of execution.
Beyond the obvious token charges, agent systems incur structural cost across multiple dimensions:
- Context assembly cost
- Retrieval indexing and reranking cost
- Promotion and durable writes
- Retention and storage footprint
- Tool execution cost
- Observability overhead
If these are not attributed per run, you cannot tune budgets, compare retrieval strategies, or detect runaway behavior early.
Every trace envelope should include:
- Model and prefix version
- Tokens in / tokens out
- Static prefix tokens vs dynamic context tokens
- Retrieval artifact count and bytes retrieved
- Rerank candidate count
- Tool invocations and retry count
- Promotion writes by scope
- Estimated cost (model + tools)
- Latency
- Hardening queue lag (if async memory is used)
Tracing is not logging.
It is cost instrumentation.
Example cost instrumentation view (derived from the trace envelope):
```json
{
  "run_id": "01J...",
  "usage": {
    "tokens_in": 1832,
    "tokens_out": 412,
    "static_prefix_tokens": 620,
    "dynamic_context_tokens": 1212
  },
  "retrieval": {
    "count": 8,
    "bytes_in": 14523,
    "rerank_candidates": 32
  },
  "tools": {
    "invoked": 2,
    "retry_count": 1
  },
  "promotions": 1,
  "estimated_cost_usd": 0.0231
}
```
What We Learned
Trace Gaps Obscured Cost
We initially modeled cost using aggregate token counts. When expenses rose, we could not explain why. Trace envelopes lacked layer-level token breakdowns, retrieval volume by scope, promotion counts, and model or prefix versioning.
Cost analysis required manual reconstruction of how context had been assembled.
We extended trace envelopes to include tokens per layer, retrieval artifact IDs, promotion counts, and model and prefix versions. Cost attribution became deterministic instead of anecdotal.
Cost modeling is only as strong as trace fidelity. If you cannot decompose a run into its cost surfaces, optimization is guesswork.
Context Is Now an Economic Surface
In commercial agent systems, infrastructure cost scales predictably.
Context cost compounds.
It compounds with:
- Retrieval volume
- Assembly size
- Promotion rates
- Compaction discipline
- Retention duration
Progressive disclosure turns these into controllable surfaces. Retrieve minimally. Expand only when needed.
This reduces token volume and stabilizes behavior without sacrificing accuracy.
Teams that engineer these surfaces control cost.
Teams that treat context as a byproduct discover it later.
Cost rarely grows with traffic alone.
It grows with unmanaged context.
Cost control is not model negotiation.
It is deliberate context engineering.
Architectural Principle
Context Has a Meter
Every context decision carries measurable economic cost.
- Retrieval, promotion, compaction, and retention are cost surfaces.
- Attribution must exist per run.
- Unmeasured cost compounds invisibly.
Part X: Evaluation as Context Discipline
Most teams think evaluation means scoring model outputs.
In commercial agent systems, evaluation means something far more structural:
You are evaluating context assembly under constraint.
Models are interchangeable components.
Context discipline is the system.
What You Are Actually Evaluating
A commercial agent run is the composition of:
- Identity resolution
- Scope filtering
- Retrieval selection
- Layer budgeting
- Semantic stabilization
- Garbage collection
- Tool invocation
- Policy enforcement
- Promotion gating
- Lifecycle transitions
- Cost surfaces
Evaluation must ask:
- Was the correct context retrieved?
- Were scope boundaries enforced?
- Were policies visible in the working set?
- Did promotion decisions follow rules?
- Did cost remain within expected bounds?
- Would this run behave the same way tomorrow?
If you cannot answer those questions deterministically, you are not evaluating autonomy. You are sampling outputs.
Deterministic Replay Is the Baseline
Evaluation begins with replay.
Every run must capture:
- Model version
- Policy version
- Prefix hash
- Retrieval artifact IDs
- Tool contract versions
- Promotion decisions
- Token counts
- Cost breakdown by surface
Replay fixes those variables and re-executes the run.
If output changes under identical inputs, you have drift.
Drift can originate from:
- Model upgrades
- Retrieval index mutation
- Promotion contamination
- Prefix instability
- Lifecycle state changes
Replay is not debugging.
Replay is architectural validation.
Without replay, optimization becomes guesswork.
Example replay validation view (derived from the trace envelope):
```json
{
  "run_id": "run_01J...",
  "model": {
    "version": "claude-sonnet-4-6"
  },
  "prefix": {
    "version": "constitution_v12 + tenant_policy_v8",
    "hash": "sha256:..."
  },
  "retrieval": {
    "artifact_ids": ["mem_88", "mem_104"]
  },
  "tool_contracts": {
    "refund_api": "v3.1",
    "crm_lookup": "v2.4"
  },
  "input_token_breakdown": {
    "global": 480,
    "tenant": 720,
    "user": 610,
    "retrieval": 1150,
    "session": 840
  },
  "output_tokens": 410,
  "promotion_count": 1,
  "estimated_cost": 0.082
}
```
Replay without artifact IDs becomes probabilistic.
Replay without versioned prefix becomes inaccurate.
Replay without tool contract versions becomes misleading.
Replay is only deterministic if the system treats versioning as a first-class discipline.
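A small sketch of drift detection against a recorded envelope, using the same keys as the replay view above; what counts as a meaningful delta is a policy decision, not something this snippet prescribes:

```python
def detect_drift(original: dict, replayed: dict) -> list[str]:
    # Compares a replayed run against its recorded trace envelope.
    # Keys mirror the replay view above; any mismatch is a drift signal.
    drift = []
    for key in ("prefix", "retrieval", "tool_contracts"):
        if original.get(key) != replayed.get(key):
            drift.append(f"{key} changed under replay")
    if original.get("output_tokens") != replayed.get("output_tokens"):
        drift.append("output changed under identical inputs")
    return drift
```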
Observed Ecosystem Convergence
Replay Infrastructure Is Becoming a Platform Expectation
Pattern: Trace envelopes and versioned artifacts are becoming standard replay primitives.
- Anthropic's evals guidance defines "transcript/trace/trajectory" as the canonical unit of evaluation.
- OpenAI's Responses API provides durable response IDs, `previous_response_id` threading, metadata, and detailed usage including cached tokens.
- OpenAI Agents SDK preserves per-request usage breakdowns (`request_usage_entries`), enabling measurable cost surfaces per run.
Replay is shifting from ad hoc reconstruction to platform-supported infrastructure.
What We Learned
Evaluation Without Replay Was Guesswork
Before full trace envelopes, evaluation relied on spot-checking outputs, comparing example runs, and manually reconstructing context. It was slow, subjective, and blind to subtle cost regressions.
Without versioned prefixes, artifact IDs, layer-level token breakdowns, and tool contract versions, we could not deterministically explain why behavior changed.
We extended trace fidelity to include these elements. Replay became deterministic. Evaluation became measurable. Upgrade decisions became evidence-based instead of intuition-driven.
If you cannot decompose what changed between two runs, evaluation is opinion. Trace fidelity turns it into measurement.
Model Upgrades Are Context Stress Tests
Frontier models evolve.
Tokenization changes.
Tool calling behavior shifts.
Summarization becomes more aggressive.
Safety layers adjust.
A model upgrade is not just a capability change.
It is a context stress test.
Under a new model:
- Does retrieval selection change?
- Does semantic stabilization behave differently?
- Does promotion frequency increase?
- Do policies get crowded out?
- Does token usage shift by layer?
Evaluation harnesses must compare:
- Old model + fixed context
- New model + identical context
If behavior shifts materially, you have learned something about your context design.
Stronger models do not fix weak context discipline. They amplify it.
Promotion Drift Is an Evaluation Surface
Evaluation must track:
- Promotion rate by scope
- Promotion type distribution
- Provisional → active transitions
- Quarantine frequency
- Tombstone rate
- Memory growth curve
A rising tenant-scope promotion rate is not a feature win.
It is an economic and governance signal.
Promotion drift compounds:
- Retrieval cardinality
- Token pressure
- Storage cost
- Cross-scope contamination risk
Evaluation is not just output quality.
It is structural hygiene.
Cost as a First-Class Evaluation Metric
Every run should emit:
- Inference tokens
- Retrieval volume
- Tool invocation count
- Persistence writes
- Derived index updates
Cost should be evaluated against:
- Task type
- Layer budget allocation
- Promotion decision
- Model version
- Prefix hash
If cost shifts without intentional architectural change, context drift is occurring.
Evaluation must detect this early.
Observed Ecosystem Convergence
Cost Observability Is Becoming a Standard API Surface
Pattern: Per-run cost attribution is emerging as a platform primitive.
- OpenAI Agents SDK exposes per-request usage breakdowns for fine-grained drift detection.
- Google Cloud Vertex AI reports cached token counts in response metadata, making cache economics visible.
- GitHub Copilot packages coding agent capabilities into enterprise plans with centralized audit logs and usage telemetry exports.
Production agents are expected to be observable and cost-accountable at org scope.
Confidence and Context Sufficiency
Evaluation is incomplete without measuring sufficiency.
When progressive disclosure is used, you should measure:
- How often additional retrieval was required
- Confidence thresholds triggering expansion
- Cost delta between minimal vs expanded context
If confidence thresholds drift upward over time, context may be degrading.
Evaluation is about understanding when the system needs more context and when it is over-consuming it.
From Runs to Decision Graphs
Once trace envelopes are disciplined, something more powerful emerges.
Every run captures:
- What was visible
- What influenced action
- Which policies applied
- What was written
- What cost was incurred
Over time, this becomes a structured corpus of decisions.
You can analyze:
- Policy application frequency
- Precedent influence
- Cost distribution by task type
- Drift across model versions
- Promotion patterns by scope
That corpus is the foundation of decision graphs.
Not theoretical graphs.
Replayable, auditable, cost-attributed decision networks.
But they only emerge if evaluation is embedded from the beginning.
Replay Is Not Just for Model Upgrades
Replay supports:
- Model upgrade validation
- Policy changes
- Retrieval tuning
- Prompt modifications
- Tool migration
- Cost optimization experiments
If your evaluation loop depends on synthetic examples, it will miss edge cases.
Production traces are your regression corpus.
Model Upgrade Replay
Model upgrades are the most obvious replay scenario.
Upgrade process:
- Freeze a representative replay corpus from production traces
- Re-run corpus with new model version
- Compare:
- Outputs
- Policy adherence
- Token usage
- Tool invocation patterns
- Cost
Differences are classified:
- Improvement
- Neutral
- Regression
Without replay, upgrade evaluation becomes anecdotal.
Without cost comparison, upgrade risk becomes financial.
Policy Change Evaluation
When tenant or global policy changes:
- Re-run affected traces
- Detect differences in approval decisions
- Validate no unintended escalation of privilege
- Validate no suppression of required actions
Policy becomes versioned infrastructure, not live mutation.
Policy drift without replay is invisible.
Retrieval Strategy Testing
Changes to:
- Embedding model
- Chunk size
- Hybrid ranking weights
- Scope filtering rules
- Compaction rules
- Progressive disclosure thresholds
should be replayed across prior runs.
Key metrics:
- Retrieval artifact count
- Token usage by layer
- Hallucination rate
- Tool invocation deltas
Retrieval changes affect both quality and cost.
Replay reveals both.
This turns retrieval tuning from intuition into measurement.
Shadow Mode
Replay supports offline evaluation. But mature systems also support live shadowing.
Pattern is straightforward:
- Production executes normally and output remains authoritative.
- In parallel, a shadow run uses:
- A new model version
- A modified retrieval strategy
- New compaction rules
- Etc.
- Shadow outputs are logged but not surfaced to users
- Differences are analyzed asynchronously.
This reduces risk during:
- Model migrations
- Context restructuring
- Index rebuilds
- Policy refactors
It also enables support for:
- Controlled rollout
- Drift detection
- Confidence building
Shadow mode turns architectural evolution into controlled iteration.
What We Learned
Shadow Mode Caught What Staging Missed
We upgraded to a newer model and passed regression tests in staging. Mechanically, everything worked.
What staging did not reflect was production memory: real tenant configurations, months of promoted episodes, and live retrieval surfaces. When we enabled the new model in shadow mode on a small slice of production traffic, subtle behavioral deviations appeared immediately. Policy language was interpreted differently, shifting tool selection and response framing.
We corrected the prompts before full rollout.
Staging validates mechanics. Shadow mode validates behavior against accumulated memory. They test different failure surfaces.
Regression Suites from Real Traffic
Synthetic test cases rarely capture:
- Long-tail edge cases
- Rare entitlement combinations
- Complex multi-step tool chains
- High-context runs
The strongest evaluation corpus is your own trace log:
- Sample representative runs across tenants and task types.
- Strip or redact sensitive payloads if required.
- Store structured replay fixtures.
- Version them alongside policy and model artifacts.
Your production history becomes your test suite.
Over time, you accumulate:
- Edge cases
- Near-failures
- Policy conflicts
- Retrieval anomalies
- Promotion mistakes
Those are more valuable than curated prompts.
Without real traces, evaluation remains shallow.
Architectural Principle
Evaluate Integrity, Not Just Outputs
Evaluation must validate context behavior, not only model responses.
- Isolation, scope, promotion, and lifecycle are testable.
- Replayability is non-optional.
- If you cannot reconstruct it, you cannot trust it.
Part XI: Model Versioning and Upgrade Discipline
Model upgrades are not configuration changes.
They are behavioral migrations.
In commercial agent systems, a model upgrade can alter:
- Interpretation of policy language
- Tool invocation sequencing
- Confidence thresholds
- Hallucination patterns
- Token usage patterns
- Sensitivity to ambiguity
- Response verbosity
- Context utilization behavior
If model upgrades are treated casually, system behavior becomes unpredictable.
Model evolution must be controlled at the architectural level.
Version Everything That Influences Behavior
Model versioning alone is insufficient.
Behavior is a function of:
- Model name and version
- Temperature and inference parameters
- Static prefix hash
- Policy artifact versions
- Tool contract versions
- Retrieval configuration version
- Compaction strategy version
If any of them change without trace visibility, replay fidelity degrades and regression analysis becomes impossible.
Model versioning is part of the canonical contract.
This makes behavioral provenance explicit.
Without it, you cannot explain deltas.
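A sketch of a behavior manifest stamped onto every trace envelope; the fields mirror the list above, and the fingerprint gives you one value to diff when explaining deltas (all names are illustrative):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BehaviorManifest:
    model: str                              # provider + model version string
    temperature: float
    static_prefix_hash: str
    policy_versions: tuple[str, ...]
    tool_contract_versions: tuple[str, ...]
    retrieval_config_version: str
    compaction_strategy_version: str

    def fingerprint(self) -> str:
        """Stable hash over everything that influences behavior."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

If two runs disagree but their fingerprints match, the delta came from inputs or sampling nondeterminism. If the fingerprints differ, the manifest tells you exactly which artifact moved.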
Observed Ecosystem Convergence
Versioned Behavior Contracts Are Platform-Level Primitives
Pattern: Platforms expose structured versioning for tools, models, and configuration.
- OpenAI's Responses API includes explicit tool configuration fields (tools, tool_choice, parallel_tool_calls), detailed token usage with caching breakdowns, and metadata.
- Google Gemini formalizes structured tool invocation where the model returns structured parameters.
- Anthropic's prompt caching uses hash-based storage, making constitutional and policy prefix changes auditable.
Behavioral provenance is becoming explicit, not implicit.
The Upgrade Sequence
A disciplined upgrade follows this sequence:
1. Freeze a Replay Corpus
Select representative production traces that include:
- High-context runs
- Multi-step tool chains
- Policy-sensitive flows
- Edge-case approvals
Lock the artifact IDs and prefix versions. These runs become your regression suite.
2. Deterministic Replay
Run the corpus under:
- Old model
- New model
Hold all other variables constant.
Compare:
- Output correctness
- Policy adherence
- Token usage
- Tool invocation patterns
- Cost
3. Shadow Production
Deploy new model in shadow mode.
- Run in parallel
- Record outputs
- Do not affect user-visible behavior
Track:
- Output divergence
- Policy deviations
- Token deltas
- Latency shifts
Replay validates determinism. Shadow validates live distribution behavior.
4. Progressive Rollout
Gradually increase traffic percentage.
Monitor:
- Drift signals
- Cost per run
- Tool call volume
- Promotion rate
Roll back immediately if invariants break.
5. Rollback Readiness
Rollback must be:
- Instant
- Deterministic
- Stateless
If rollback requires schema migration, reindexing, or cache rebuilding, you have coupled layers incorrectly.
Architecture must remain stable while models evolve.
Versioning without rollback is theater.
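A sketch covering both the rollout and the rollback path, assuming the rollout percentage and kill switch live in a config store you can flip without redeploying; the names are illustrative:

```python
import hashlib

def use_new_model(tenant_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket tenants so the same tenant always sees the same model."""
    bucket = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

def select_model(tenant_id: str, config: dict) -> str:
    # Rollback is a config write: flip the kill switch or set rollout_percent to 0.
    # No schema migration, no reindex, no cache rebuild.
    if config.get("kill_switch") or not use_new_model(tenant_id, config["rollout_percent"]):
        return config["stable_model"]
    return config["candidate_model"]
```

Deterministic bucketing keeps each tenant on one model for the duration of the rollout, which makes drift signals attributable.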
Context Window Changes Are Architectural Events
Larger context windows tempt teams to relax discipline.
Common failure pattern:
- “The window is bigger now.”
- Retrieval breadth increases.
- Session history persists longer.
- Promotion discipline loosens.
Short term: outputs improve.
Medium term: token cost explodes.
Long term: drift becomes embedded.
A larger window does not eliminate:
- The need for compaction
- The need for scope boundaries
- The need for promotion gating
- The need for evaluation
It magnifies the consequences of ignoring them.
Window size is a constraint multiplier.
Budgets are architectural, not model-driven.
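A sketch of per-layer budgets that do not move when the window grows; the layer names and numbers are illustrative:

```python
# Budgets are set by architecture and economics, not by the model's window size.
LAYER_BUDGETS = {
    "static_prefix": 2_000,
    "policy": 1_500,
    "retrieval": 6_000,
    "session_history": 3_000,
    "scratchpad": 1_500,
}

def enforce_budgets(layers: dict[str, list[str]], count_tokens) -> dict[str, list[str]]:
    """Trim each layer to its budget; overflow is compacted or dropped, never carried forward."""
    trimmed = {}
    for name, chunks in layers.items():
        budget, kept, used = LAYER_BUDGETS[name], [], 0
        for chunk in chunks:                  # chunks assumed ordered by priority
            cost = count_tokens(chunk)
            if used + cost > budget:
                break
            kept.append(chunk)
            used += cost
        trimmed[name] = kept
    return trimmed
```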
Separation of Concerns Across Layers
Model upgrades should not require:
- Rewriting canonical storage
- Reindexing entire memory store
- Altering encryption boundaries
- Modifying promotion semantics
If they do, the architecture is over-coupled.
Layer separation ensures:
- Canonical store remains stable
- Derived projections can be rebuilt
- Retrieval configuration can evolve independently
- Compaction strategy can be tuned without rewriting history
Models change. Architecture must remain durable.
If your system entangles model behavior with memory writes or policy evaluation, upgrades will feel dangerous.
If layers are cleanly defined, upgrades become controlled experiments.
Upgrade Without Replay Is Operational Risk
If you upgrade a model without replay:
- You cannot quantify regression
- You cannot quantify cost delta
- You cannot detect policy drift
- You cannot isolate behavior change to the model
You are trusting a black box to remain aligned.
In enterprise systems, that is unacceptable.
What We Learned
Model Upgrade Without Full Trace Was Fragile
In early iterations, we captured model version and final output. We did not capture prefix version, artifact IDs, layer token breakdowns, or tool contract versions.
When behavior changed after a model upgrade, we could not determine whether the cause was retrieval, policy visibility, compaction, or the model itself.
We extended trace envelopes to capture full execution context. Upgrade analysis became surgical instead of speculative.
If you cannot isolate what changed, every model migration is a bet. Trace fidelity turns it into a controlled experiment.
Architecture Must Outlive Models
Frontier models are constantly evolving.
Your architecture must survive for years.
Design rule:
Models are replaceable components.
Context engineering is the durable asset.
Observed Ecosystem Convergence
Architecture Is Outliving Individual API Generations
Pattern: Systems that decouple canonical state from model-specific behavior survive API transitions.
- OpenAI's migration from Assistants to the Responses API, with a published sunset date (August 26, 2026), demonstrates why architecture must remain stable while models evolve.
- Vercel Agent is converging on "skills" and MCP-like tooling ecosystems with machine-readable docs designed for agent consumption.
Models are replaceable components. Context architecture is the durable asset.
Architectural Principle
Models Are Mutable
Model upgrades must be treated as behavioral migrations, not routine swaps.
- Version everything that influences behavior.
- Replay and shadow before exposure.
- Rollback must be immediate while architecture remains stable.
Part XII: From RAG to a Context Engine
Many teams start with RAG.
Retrieve. Append. Generate. Repeat.
For prototypes, this works. For commercial systems, it does not.
RAG solves recall. It does not solve isolation, replay, promotion discipline, retention enforcement, lifecycle management, cost predictability, or upgrade safety.
RAG is a retrieval pattern.
A context engine is an architectural system.
Why Basic RAG Is Not Enough
Basic RAG assumes:
- Retrieval is stateless
- Memory is external
- The model is the primary reasoning surface
- History can be appended safely
Commercial agent systems violate all of these assumptions.
They require:
- Long-lived memory
- Scoped isolation
- Cross-tenant guarantees
- Durable decisions
- Cost accounting
- Evaluation loops
If you build on naive RAG, retrieval becomes accidental truth, transcripts become memory, promotion becomes implicit, drift becomes structural, and cost becomes unpredictable.
RAG is a building block. It is not the architecture.
Observed Ecosystem Convergence
RAG Alone Has Proven Insufficient
Pattern: Every major platform has added lifecycle, gating, and scoping layers on top of basic retrieval.
- Sourcegraph Cody documented moving away from embeddings-only retrieval back in 2024, partly because sending code to a third party created operational complexity. This echoes the earlier point that derived layers become toxic when they silently become the system of record.
- Every platform reviewed, from Claude Code to Bedrock AgentCore to Letta, has added lifecycle, gating, and scoping layers beyond retrieval.
Retrieval is a building block. It is not the architecture.
What a Commercial Context Engine Looks Like
A commercial context engine has:
- Typed memory, not blobs
- Scoped storage, not implicit visibility
- Canonical event log, not accumulated transcript fragments
- Structured memory store, not unbounded conversation residue
- Promotion gates, not automatic persistence
- Hardening lifecycle, not synchronous promotion shortcuts
- Lifecycle garbage collection, not window overflow
- Hybrid retrieval, not vector-only recall
- Token budgets by layer, not "fit what you can"
- Privacy as routing architecture, not configuration flags
- Trace envelopes, not black-box runs
- Encryption boundaries, not shared trust assumptions
- Structured trace envelopes, not free-form logs
- Deterministic replay, not probabilistic reconstruction
- Versioned artifacts, not silent drift
It assembles context deliberately. It reconstructs context per run. It promotes intentionally. It enforces isolation structurally. It measures cost explicitly. It evolves through replay.
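A sketch of what typed, scoped memory can look like in practice; the record types, states, and fields are illustrative rather than any specific platform's schema:

```python
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    FACT = "fact"
    DECISION = "decision"
    PREFERENCE = "preference"
    EPISODE_SUMMARY = "episode_summary"

class PromotionState(Enum):
    CANDIDATE = "candidate"     # observed, not yet trusted
    PROMOTED = "promoted"       # passed promotion gates
    RETIRED = "retired"         # superseded or expired

@dataclass(frozen=True)
class MemoryRecord:
    tenant_id: str
    user_id: str | None
    memory_type: MemoryType
    content: str
    source_trace_id: str        # provenance back to the canonical event log
    promotion_state: PromotionState
    policy_version: str

def visible_records(records: list[MemoryRecord], tenant_id: str) -> list[MemoryRecord]:
    """Scope filter applied in the data plane; the model never sees other tenants' memory."""
    return [
        r for r in records
        if r.tenant_id == tenant_id and r.promotion_state is PromotionState.PROMOTED
    ]
```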
Observed Ecosystem Convergence
The Context Engine Pattern Is the Common Destination
Pattern: Independent systems converge on the same structural invariants once they operate agents in production.
- Across Claude Code, Cursor, Letta, OpenClaw, Amazon Bedrock AgentCore, OpenCode, Warp, Windsurf, Bolt, and Lovable, the convergence is structural: multi-tenant boundaries, durable memory, tool orchestration, retention rules, replayable failures, and predictable economics.
The interfaces differ, but the invariants do not.
Architectural Principle
The Four Constraints of Context
Context must be constrained, versioned, scoped, and observable.
- Constrained by token budgets, compaction rules, and retrieval limits
- Versioned by policy artifacts, trace metadata, and constitution changes
- Scoped by tenant boundaries, user isolation, and memory type
- Observable through trace envelopes, replay, cost accounting, and promotion logs
When those properties exist, drift becomes diagnosable, isolation becomes provable, retention becomes enforceable, and cost becomes bounded.
Without them, cost, drift, and isolation become outcomes you discover instead of variables you control.
Conclusion: Engineering Context
Ecosystem Convergence
Over the past year, a clear pattern has emerged. Teams building IDE agents, coding copilots, orchestration layers, and model APIs, often in isolation and under very different product pressures, have arrived at strikingly similar system designs.
Throughout this guide, we've cited specific architectural signals from these systems. The convergence is not anecdotal. It spans memory design, isolation enforcement, trace capture, retrieval strategy, cost accounting, and promotion discipline.
Where the Signals Come From
Constraint, Not Coincidence
Across these systems, the same invariants appear:
- Memory is layered, scoped, and explicitly governed
- Durable memory requires deliberate promotion
- Context is assembled per run, not carried forward as a growing window
- Derived layers are projections rebuilt from canonical truth
- Retrieval combines structured and semantic signals
- Tool use is typed and contract-bound
- Isolation is architectural, not optional
- Replay and traceability are mandatory
- Async processing moves heavy work off the critical path
- Cost and token budgets are explicit control surfaces
These systems were built independently. They landed on the same patterns because production agents force these constraints.
Context is no longer a prompt engineering trick. It is infrastructure.
You do not have to adopt any specific framework. But if you ignore these structural constraints, you will rediscover them through drift, cost escalation, replay failures, privacy incidents, promotion poisoning, and upgrade regressions.
Convergence is a signal. And one you should listen to.
The more you know
Why There Is No Simple Maturity Model
It is tempting to reduce commercial agent systems to a clean ladder.
Level 1: chatbot + vector DB
Level 2: scoped memory
Level 3: canonical logs
Level 4: evaluation harnesses and replay
Level 5: privacy routing and cost instrumentation
In practice, maturity is multi-dimensional.
A system may have strict isolation but weak promotion discipline.
It may version models correctly but lack replay fidelity.
It may instrument cost but fail to control lifecycle drift.
It may run evaluation loops but persist unverified memory.
Isolation, promotion, lifecycle hardening, replay, evaluation, cost instrumentation, and privacy architecture evolve independently.
The parts described above define the axes.
Where your system sits along each axis determines its maturity.
The Architecture That Survives
A context engine exists to enforce three structural invariants:
- Structural Isolation
  Tenant boundaries are enforced in the data plane, not implied in prompts. Scope, partitioning, encryption domains, and routing make cross-tenant contamination structurally impossible, not statistically unlikely. Isolation is not a guardrail. It is a boundary.
- Deterministic Replay
  Every decision must be reconstructible. Versioned prefixes, artifact IDs, tool contracts, and trace envelopes turn autonomy from guesswork into debuggable infrastructure. If you cannot replay a run, you cannot trust it. If you cannot trust it, you cannot evolve it.
- Economic Predictability
  Cost must be bounded by architecture. Layered token budgets, promotion discipline, constrained retrieval, and per-run cost attribution prevent context from compounding silently. If cost emerges from accumulation instead of design, scale becomes drift.
If any of these are optional, autonomy will eventually degrade.
Not suddenly. Not catastrophically. Gradually.
Guardrails fade. Costs creep. Memory drifts. Isolation weakens.
The systems that survive production pressure are NOT the ones with the best prompts.
They are the ones where context is engineered as infrastructure.
And infrastructure either holds, or it doesn’t.
A Note on Adoption
This guide describes the full architecture that production pressure eventually demands.
No team should attempt to implement all of it at once.
The patterns here are not a checklist. They are a reference architecture. Your system's constraints determine which layers matter first.
If you are not yet multi-tenant, isolation can be deferred. If you are not under compliance or audit pressure, the full hardening pipeline and lifecycle state machine can wait. If your agent runs are short-lived and stateless, promotion gating matters less than retrieval discipline. If cost is not yet a problem, trace envelopes and per-run attribution can follow later.
Start with the constraints your use case actually imposes:
- If you handle enterprise data across tenants, start with isolation and scoped memory.
- If you need to debug behavioral drift, start with trace envelopes and replay.
- If cost is compounding, start with layer budgets and promotion discipline.
- If you are building durable memory, start with truth vs acceleration separation.
The full architecture is the destination. Your roadmap determines the order of arrival.
The ecosystem is moving fast. Managed memory services, provider-side context management, turnkey trace infrastructure, and retrieval-as-a-service offerings are shipping regularly. Many of the layers described in this guide are increasingly buy-not-build decisions. The architectural principles remain the same regardless of whether you implement them yourself or adopt managed services that enforce them on your behalf. What matters is that the invariants hold, regardless of who builds the plumbing.
Build what your constraints require today. Adopt what the ecosystem provides tomorrow. But know where the architecture converges, so you are building toward it rather than away from it.
Final Principle
Context is no longer just what you pass to a model.
It is your safety boundary.
Your isolation boundary.
Your cost boundary.
Your audit trail.
And your competitive moat.
Frontier models are increasingly interchangeable. APIs converge. Capabilities normalize. Pricing compresses.
What does not commoditize is how context is assembled, constrained, promoted, retained, replayed, and priced. That discipline determines safety, reliability, latency, auditability, and margin. Two teams can call the same model and produce radically different systems depending on their memory architecture and context control surfaces.
In commercial agent systems, model choice is leverage.
Context engineering is the moat.
Context is not an implementation detail. It is infrastructure. Engineer it accordingly.
Hope this helps.