Retrieval-first memory vs full-context prompting
Send the right memory, not the whole pile of tokens.
Supavector's technical bet is simple: index knowledge once, retrieve the relevant working set at answer time, and keep lifecycle policy close to retrieval. That reduces model input cost, preserves context headroom, and keeps latency tied to the selected evidence instead of the entire corpus size.
Methodology
What was measured
The measured portion comes from a local Supavector runtime telemetry run recorded in ../supavector/telemetry/events_ttl_amvl_lru.ndjson. The cost portion is a transparent model using official provider token prices and visible prompt-size assumptions.
Traditional alternatives
What developers usually compare against
The relevant comparison is not only vector search. Developers also compare against manually pasting files into ChatGPT, Claude, or Grok, or building a raw LLM API flow that dumps every possible document into every request.
| Approach | How it works | Developer impact | User impact |
|---|---|---|---|
| Consumer chat app (ChatGPT, Claude, Grok) | Upload or paste files into a chat workspace and ask questions manually. | Fast for one-off analysis, but not a governed production API with project tokens, source sync, access policy, audit trails, or repeatable deployment surfaces. | Useful for personal work, but users may re-upload content, hit context or file limits, and get inconsistent source coverage across sessions. |
| Full-context API prompt (dump all relevant text every call) | Send a large document set or conversation transcript as input tokens to the model on every answer request. | Lowest integration complexity, highest repeated input-token cost, more context-window pressure, and weaker source lifecycle controls. | Can be slow and expensive at scale, especially when most of the prompt is irrelevant to the specific question. |
| Vector database only (DIY RAG plumbing) | Store embeddings, run similarity search, then assemble prompts in custom application code. | Better token efficiency, but the team still owns source sync, hidden collections, prompt construction, portal delivery, billing, RBAC, and telemetry. | Quality depends on the custom retrieval layer, citation handling, freshness, and whether stale or unauthorized chunks are filtered correctly. |
| Supavector Agent Memory (retrieval-first operating layer) | Index source content once, use Memory search/ask/chat/code APIs, and route the same governed memory through Studio, portals, embeds, and backend calls. | Bounded prompt size, reusable Memory objects, source and access policy controls, usage telemetry, and a smaller model-token blast radius. | Answers use selected evidence instead of a whole corpus dump, with lower wait time and less manual document handling. |
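The difference between the middle two rows and the last row comes down to how the prompt is assembled per request. The sketch below is illustrative only, in Python with hypothetical function names (nothing here is Supavector API), contrasting the two strategies:

```python
# Illustrative only: contrasts full-context prompting with a retrieval-first
# prompt. Function and variable names are hypothetical.

def full_context_prompt(question: str, corpus: list[str]) -> str:
    # Every document is sent on every request, so input tokens grow with
    # corpus size regardless of the question.
    return "\n\n".join(corpus) + f"\n\nQuestion: {question}"

def retrieval_first_prompt(question: str, retrieve, k: int = 8) -> str:
    # Only the top-k retrieved chunks are sent, so input tokens stay roughly
    # constant as the corpus grows.
    evidence = "\n\n".join(retrieve(question, k=k))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}"
```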
Measured local results
AMV-L keeps retrieval bounded without the TTL scan penalty
AMV-L is not just a cost story. In the local replay, the value-aware lifecycle reduced the measured retrieval candidate set from 3,822 to 563 average scanned candidates versus TTL and cut answer p95 latency from 5.54s to 1.29s.
| Policy | Avg candidates scanned | /v1/ask p95 | /v1/ask p99 | /v1/recall p95 | Run duration |
|---|---|---|---|---|---|
| TTL baseline (large warm sample) | 3,822 | 5,544 ms | 6,396 ms | 2,379 ms | 129.3 min |
| AMV-L (selected) | 563 | 1,290 ms | 1,540 ms | 455 ms | 31.6 min |
| LRU (recency only) | 219 | 1,553 ms | 6,402 ms | 464 ms | 30.6 min |
Candidate scan pressure
Average memory_candidates.vector_search_scanned_count across recall and ask retrieval events.
LRU scanned fewer candidates, but its ask p99 tail was 6.4s in this run. AMV-L is the more defensible default when the claim is balanced latency, bounded retrieval, and value-aware memory quality rather than scan count alone.
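The scan-pressure figures can be recomputed from the telemetry file. The sketch below assumes an NDJSON layout with a per-event policy label; only the file path and the memory_candidates.vector_search_scanned_count field name come from this report, the rest of the schema is an assumption.

```python
import json
from collections import defaultdict

scanned_by_policy = defaultdict(list)
with open("../supavector/telemetry/events_ttl_amvl_lru.ndjson") as f:
    for line in f:
        event = json.loads(line)
        scanned = event.get("memory_candidates", {}).get("vector_search_scanned_count")
        if scanned is not None:
            # Group by lifecycle policy (field name assumed), then average.
            scanned_by_policy[event.get("policy", "unknown")].append(scanned)

for policy, values in sorted(scanned_by_policy.items()):
    print(f"{policy}: {sum(values) / len(values):.2f} avg candidates scanned")
```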
Cost model
Full-context prompts make every query pay for the whole corpus again
The default scenario compares 100K input tokens per request against a 5K retrieval-first prompt. Output is held constant at 800 tokens, so the savings come from reducing repeated input tokens, not from assuming shorter answers.
Model-only calculation: ((input_tokens / 1M) * input_price) + ((output_tokens / 1M) * output_price), multiplied by monthly requests. It excludes prompt caching, batch discounts, vector storage, embedding/indexing, server-side tools, and provider subscriptions. OpenAI rates use standard processing below the long-context surcharge threshold published on the API pricing page.
| Provider model | Input / output (per 1M tokens) | Full-context monthly | Retrieval-first monthly | Monthly savings | Reduction |
|---|---|---|---|---|---|
| OpenAI GPT-5.4 | $2.50 / $15.00 | $13,100 | $1,225 | $11,875 | 90.6% |
| Claude Sonnet 4.6 | $3.00 / $15.00 | $15,600 | $1,350 | $14,250 | 91.3% |
| xAI Grok 4.3 | $1.25 / $2.50 | $6,350 | $413 | $5,937 | 93.5% |
Monthly token-cost spread
Bars show full-context spend. The green marker is the retrieval-first spend for the same model.
Why the percentage stays high
At 100K input tokens, a full-context call sends 20x more input than a 5K retrieval-first call. Output cost is unchanged, so the reduction is slightly below the input reduction when output prices are high.
- OpenAI GPT-5.4: $13,100 full-context vs $1,225 retrieval-first at the default volume.
- Claude Sonnet 4.6: $15,600 full-context vs $1,350 retrieval-first at the default volume.
- xAI Grok 4.3: $6,350 full-context vs $413 retrieval-first at the default volume.
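As a sanity check on the figures above, here is a minimal sketch of the model-only cost formula from the Methodology note. The $3 / $15 per-1M rates are placeholders to substitute with current published provider prices; with those rates the output reproduces the Claude Sonnet 4.6 figures ($15,600 vs $1,350).

```python
def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    # ((input_tokens / 1M) * input_price) + ((output_tokens / 1M) * output_price),
    # multiplied by monthly requests. Cache, batch, long-context, and
    # infrastructure costs are excluded, as in the Methodology note.
    per_request = (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price
    return requests * per_request

REQUESTS = 50_000  # monthly answer requests in the default scenario
full = monthly_cost(REQUESTS, 100_000, 800, input_price=3.0, output_price=15.0)
rag = monthly_cost(REQUESTS, 5_000, 800, input_price=3.0, output_price=15.0)
print(f"full-context ${full:,.0f} vs retrieval-first ${rag:,.0f} "
      f"({1 - rag / full:.1%} reduction)")
```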
Product impact
What this means for developers and users
Developers care about
- Predictable unit economics: fewer repeated input tokens per answer request.
- Latency headroom: retrieval candidate sets stay bounded as the corpus grows.
- Context headroom: the model receives relevant chunks instead of a whole transcript or source dump.
- Operational controls: source sync, access policy, API tokens, telemetry, billing, and portal surfaces are part of the same Memory object.
Users care about
- Less waiting: the assistant does not need to drag a full document pile through every answer.
- More relevant answers: the prompt is built from selected evidence rather than whatever happened to fit in the context window.
- Fresher knowledge: source sync updates the indexed memory without asking users to re-upload files manually.
- Safer sharing: public portals and embeds can use the same governed memory without exposing builder controls.
Defensible claims
Claims that match the evidence
These are intentionally phrased as measured or modeled claims, with the boundary conditions included.
Technically supported
- In the local deterministic run, AMV-L reduced average retrieval candidates scanned by 85.3% versus the TTL baseline.
- In the same run, AMV-L reduced /v1/ask p95 latency from 5,544 ms to 1,290 ms, a 4.3x improvement.
- For the default provider-pricing scenario, replacing a 100K-token full-context prompt with a 5K-token retrieval-first prompt reduces model token spend by 90.6-93.5% across the listed OpenAI, Anthropic, and xAI models.
Caveats to keep
- The AMV-L run is a local deterministic benchmark, not a hosted production SLA.
- Token costs exclude vector storage, embedding/indexing, provider tool calls, prompt caching, batch discounts, long-context surcharges, regional processing uplifts, and negotiated enterprise pricing.
- If the answer genuinely needs the entire corpus every time, retrieval-first reduces less. The advantage is strongest when most questions need a small, relevant slice of a larger knowledge base.
Technical review ledger
Claim type, basis, and boundary
This section separates measured benchmark claims from modeled cost claims and architectural product claims so reviewers can verify the evidence path.
| Claim | Type | Basis | Boundary |
|---|---|---|---|
| AMV-L reduces retrieval candidate scanning by 85.3% versus TTL. | Measured | Average memory_candidates.vector_search_scanned_count: TTL 3,822.07, AMV-L 563.48, across 20,000 candidate events per policy. | Local deterministic workload, seed 1337, fallback embeddings enabled, not a hosted production SLA. |
| AMV-L lowers /v1/ask p95 latency by 4.30x versus TTL. | Measured | TTL ask p95 5,544.42 ms divided by AMV-L ask p95 1,289.56 ms in the same replay. | Measured on the local runtime profile and workload mix used by scripts/run_ttl_amvl_eval.sh. |
| Retrieval-first prompting cuts model token spend by 90.6-93.5% in the default scenario. | Modeled | Official standard token prices, 50,000 monthly answers, 100K full-context input tokens, 5K retrieval input tokens, 800 output tokens. | Model-only token cost; excludes cache discounts, batch discounts, vector operations, tool calls, regional uplifts, and negotiated pricing. |
| Memory is a cleaner production boundary than DIY prompt and vector plumbing. | Architectural | Memory APIs cover create, sources, sync, write, search, ask, boolean_ask, chat, code, status, and actions in the local docs. | Architecture claim; production quality still depends on source quality, chunking, retrieval tuning, evaluation, and access policy. |
Sources
Data and pricing references
- ../supavector/telemetry/events_ttl_amvl_lru.ndjson and ../supavector/telemetry/README_ttl_amvl.md
- OpenAI API pricing: GPT-5.5, GPT-5.4, GPT-5.4 mini
- Claude API pricing docs: Haiku 4.5, Sonnet 4.6, and Opus 4.7
- xAI models and pricing: Grok 4.3
- ChatGPT plan context limits and consumer plan comparison notes
Technical whitepaper
Supavector Agent Memory as retrieval infrastructure.
Supavector packages retrieval, lifecycle policy, source sync, access control, and delivery surfaces into a deployable Memory object. The goal is to give applications durable knowledge without paying the latency, cost, and governance penalty of sending entire corpora to an LLM on every request.
Abstract
Why the system exists
General-purpose chat tools are effective for individual analysis, but production applications need repeatable access to governed knowledge. Raw vector databases solve only the similarity-search portion. Supavector sits above storage and model providers to coordinate source ingestion, retrieval planning, prompt construction, access resolution, generation, delivery, and usage reporting.
The technical position is retrieval-first: treat the model context window as a scarce runtime budget. Store and score durable knowledge outside the model, then send only the selected evidence and task instructions needed for the current request.
Design goals
- Keep model input tied to selected evidence, not total corpus size.
- Resolve tenant, project, role, and source exposure before retrieval.
- Expose the same Memory through APIs, Studio, portals, and embeds.
System architecture
One Memory object coordinates many moving parts
The product is intentionally more than a vector index. The Memory object gives the runtime a stable boundary for source ownership, retrieval behavior, generation policy, access rules, public delivery, and metering.
Core components
- Control plane: Projects, service tokens, user roles, source configuration, portal settings, credits, and operator workflows live in the hosted Studio.
- Indexing pipeline: Source content is normalized, chunked, embedded, and written into hidden retrieval collections attached to a Memory.
- Runtime retrieval: Search and answer calls resolve access, retrieve candidates, choose an evidence set, and construct a bounded prompt.
- Delivery surfaces: The same Memory can back backend API calls, no-code Studio tests, public or internal portals, and embedded assistants.
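One way to picture that boundary is as a single configuration object that owns all of these concerns. The sketch below is illustrative; the field names are hypothetical and not the actual Supavector object model.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryConfig:
    # Hypothetical shape of the Memory boundary; illustrative only.
    memory_id: str
    sources: list[str] = field(default_factory=list)   # synced source identifiers
    lifecycle_policy: str = "amvl"                      # ttl | lru | amvl
    access: dict = field(default_factory=dict)          # roles, tenant and source exposure
    provider: str = "openai"                            # model adapter target
    surfaces: list[str] = field(default_factory=list)   # api, studio, portal, embed
```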
Query execution
A request is routed through policy before it reaches the model
Retrieval-first design only works if the system can decide what is allowed, what is relevant, and what fits in the prompt budget before generation starts.
Runtime sequence
- Authenticate: Resolve workspace, project, service token, or portal visitor context.
- Constrain: Apply Memory settings, source exposure rules, tenant access, and the requested mode such as search, ask, chat, or code.
- Retrieve: Run vector and metadata retrieval against the Memory's hidden collection and candidate tiers.
- Budget: Select chunks and instructions that fit the answer mode and token budget.
- Generate: Call the configured provider with bounded evidence, then return answer text, references, and usage telemetry.
Memory lifecycle
AMV-L is the lifecycle policy behind bounded retrieval
TTL and LRU are useful baselines, but they optimize for age or recency. AMV-L adds value-aware lifecycle behavior so high-signal memory can remain hot while low-value or redundant memory can decay, demote, compact, or evict.
Why AMV-L matters
- TTL removes items by time, which can keep stale low-value content too long or remove useful older content too early.
- LRU favors recent access, which can be fast but may overfit to recency and produce unstable tail behavior.
- AMV-L uses value and lifecycle thresholds to control hot/warm/cold movement, helping retrieval stay bounded without reducing the system to age or recency alone.
The measured results above show the outcome of that policy choice: AMV-L scanned 563 average candidates versus TTL's 3,822, while keeping ask p99 at 1.54s in the local run.
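The exact AMV-L scoring function is not specified in this document, so the sketch below only illustrates the general idea of value-aware tiering: hot/warm/cold movement driven by a composite value score rather than by age or recency alone. All weights and thresholds are made up for illustration.

```python
import time

def tier_for(item: dict, hot: float = 0.7, warm: float = 0.3) -> str:
    # Toy value score: hit rate and recency raise value, age decays it.
    # This is NOT the AMV-L formula, only the shape of a value-aware policy.
    age_days = (time.time() - item["created_at"]) / 86_400
    value = 0.5 * item["hit_rate"] + 0.3 * item["recency"] + 0.2 / (1 + age_days)
    if value >= hot:
        return "hot"    # stays in the primary retrieval candidate set
    if value >= warm:
        return "warm"   # demoted but still retrievable
    return "cold"       # candidate for compaction or eviction

# Example: a 30-day-old item with a strong hit rate but low recency lands in "warm".
print(tier_for({"created_at": time.time() - 30 * 86_400, "hit_rate": 0.9, "recency": 0.2}))
```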
Developer interface
The API surface is organized around memory operations
Applications should not need to know how every source was synced or how every vector collection is named. They call Memory operations and let the runtime enforce the configured retrieval and access policy.
Representative request pattern
POST /v1/memories/{memory_id}/ask
Authorization: Bearer supav_...
Content-Type: application/json
{
  "query": "What changed in the refund policy?",
  "k": 40,
  "answerLength": "short",
  "policy": "amvl"
}
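A minimal client sketch for the call above. The base URL, memory ID, and response field names ("answer", "references") are assumptions; check the API reference for the exact shapes.

```python
import os
import requests

resp = requests.post(
    "https://api.supavector.example/v1/memories/mem_123/ask",  # hypothetical host and memory ID
    headers={"Authorization": f"Bearer {os.environ['SUPAVECTOR_TOKEN']}"},
    json={"query": "What changed in the refund policy?",
          "k": 40, "answerLength": "short", "policy": "amvl"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data.get("answer"))                 # generated answer text (field name assumed)
for ref in data.get("references", []):    # selected chunk or document references
    print("-", ref)
```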
Operational outputs
- Answer payload: generated response plus references to selected chunks or documents.
- Telemetry: latency, retrieval candidate counts, prompt token estimates, model token usage, and error status.
- Billing hooks: generation tokens and storage usage can be attributed to the project and Memory boundary.
- Auditability: the same Memory configuration can be inspected in Studio and reused by API clients.
Governance and deployment
Production RAG needs controls around retrieval, not only better prompts
The technical value compounds when the system owns the operational boundary around retrieval: who may call it, which sources are visible, where it runs, and how usage is measured.
- Service tokens and project roles separate builder, backend, portal, and operator access.
- Sources can be published, internal, restricted, or scoped before retrieval chooses chunks.
- The model adapter can route to OpenAI, Anthropic, xAI, or other configured providers without changing the Memory boundary.
- Hosted, dedicated, self-hosted, or customer-cloud deployments can share the same product concepts while changing infrastructure location.
Engineering tradeoffs
What the architecture does and does not claim
Strong claims
- Retrieval-first systems can materially reduce repeated model input tokens for large knowledge bases.
- A Memory object is a cleaner production boundary than scattering source sync, vector search, prompt assembly, and portal delivery across separate services.
- Lifecycle policy matters because retrieval cost and tail latency are affected by the candidate set, not only by the final chunk count.
Explicit limits
- RAG does not guarantee correctness by itself. Source quality, chunking, prompt design, evaluation, and access rules remain part of the system.
- Some tasks need long-context reasoning across many documents. Those should use larger budgets or a staged workflow instead of forcing a tiny retrieval set.
- Benchmarks should be rerun for each deployment profile before turning local results into hosted latency commitments.