Retrieval-first memory vs full-context prompting
Send the right memory, not the whole pile of tokens.
Supavector's technical bet is simple: index knowledge once, retrieve the relevant working set at answer time, and keep lifecycle policy close to retrieval. That reduces model input cost, preserves context headroom, and keeps latency tied to the selected evidence instead of the entire corpus size.
Methodology
What was measured
The measured portion comes from a local Supavector runtime telemetry run recorded in ../supavector/telemetry/events_ttl_amvl_lru.ndjson. The cost portion is a transparent model using official provider token prices and visible prompt-size assumptions.
Traditional alternatives
What developers usually compare against
The relevant comparison is not only vector search. Developers also compare against manually pasting files into ChatGPT, Claude, or Grok, or building a raw LLM API flow that dumps every possible document into every request.
| Approach | How it works | Developer impact | User impact |
|---|---|---|---|
| Consumer chat app (ChatGPT, Claude, Grok) | Upload or paste files into a chat workspace and ask questions manually. | Fast for one-off analysis, but not a governed production API with project tokens, source sync, access policy, audit trails, or repeatable deployment surfaces. | Useful for personal work, but users may re-upload content, hit context or file limits, and get inconsistent source coverage across sessions. |
| Full-context API prompt (dump all relevant text every call) | Send a large document set or conversation transcript as input tokens to the model on every answer request. | Lowest integration complexity, highest repeated input-token cost, more context-window pressure, and weaker source lifecycle controls. | Can be slow and expensive at scale, especially when most of the prompt is irrelevant to the specific question. |
| Vector database only (DIY RAG plumbing) | Store embeddings, run similarity search, then assemble prompts in custom application code. | Better token efficiency, but the team still owns source sync, hidden collections, prompt construction, portal delivery, billing, RBAC, and telemetry. | Quality depends on the custom retrieval layer, citation handling, freshness, and whether stale or unauthorized chunks are filtered correctly. |
| Supavector Agent Memory (retrieval-first operating layer) | Index source content once, use Memory search/ask/chat/code APIs, and route the same governed memory through Studio, portals, embeds, and backend calls. | Bounded prompt size, reusable Memory objects, source and access policy controls, usage telemetry, and a smaller model-token blast radius. | Answers use selected evidence instead of a whole corpus dump, with lower wait time and less manual document handling. |
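The difference between the middle two rows and the last row comes down to how the prompt is assembled per request. The sketch below is illustrative only, in Python with hypothetical function names (nothing here is Supavector API), contrasting the two strategies:

```python
# Illustrative only: contrasts full-context prompting with a retrieval-first
# prompt. Function and variable names are hypothetical.

def full_context_prompt(question: str, corpus: list[str]) -> str:
    # Every document is sent on every request, so input tokens grow with
    # corpus size regardless of the question.
    return "\n\n".join(corpus) + f"\n\nQuestion: {question}"

def retrieval_first_prompt(question: str, retrieve, k: int = 8) -> str:
    # Only the top-k retrieved chunks are sent, so input tokens stay roughly
    # constant as the corpus grows.
    evidence = "\n\n".join(retrieve(question, k=k))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}"
```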
Measured local results
AMV-L keeps retrieval bounded without the TTL scan penalty
AMV-L is not just a cost story. In the local replay, the value-aware lifecycle reduced the measured retrieval candidate set from 3,822 to 563 average scanned candidates versus TTL and cut answer p95 latency from 5.54s to 1.29s.
| Policy | Avg candidates scanned | /v1/ask p95 | /v1/ask p99 | /v1/recall p95 | Run duration |
|---|---|---|---|---|---|
| TTL baseline (large warm sample) | 3,822 | 5,544 ms | 6,396 ms | 2,379 ms | 129.3 min |
| AMV-L (selected) | 563 | 1,290 ms | 1,540 ms | 455 ms | 31.6 min |
| LRU (recency only) | 219 | 1,553 ms | 6,402 ms | 464 ms | 30.6 min |
Candidate scan pressure
Average memory_candidates.vector_search_scanned_count across recall and ask retrieval events.
LRU scanned fewer candidates, but its ask p99 tail was 6.4s in this run. AMV-L is the more defensible default when the claim is balanced latency, bounded retrieval, and value-aware memory quality rather than scan count alone.
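The scan-pressure figures can be recomputed from the telemetry file. The sketch below assumes an NDJSON layout with a per-event policy label; only the file path and the memory_candidates.vector_search_scanned_count field name come from this report, the rest of the schema is an assumption.

```python
import json
from collections import defaultdict

scanned_by_policy = defaultdict(list)
with open("../supavector/telemetry/events_ttl_amvl_lru.ndjson") as f:
    for line in f:
        event = json.loads(line)
        scanned = event.get("memory_candidates", {}).get("vector_search_scanned_count")
        if scanned is not None:
            # Group by lifecycle policy (field name assumed), then average.
            scanned_by_policy[event.get("policy", "unknown")].append(scanned)

for policy, values in sorted(scanned_by_policy.items()):
    print(f"{policy}: {sum(values) / len(values):.2f} avg candidates scanned")
```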
Cost model
Full-context prompts make every query pay for the whole corpus again
The default scenario compares 100K input tokens per request against a 5K retrieval-first prompt. Output is held constant at 800 tokens, so the savings come from reducing repeated input tokens, not from assuming shorter answers.
Model-only calculation: ((input_tokens / 1M) * input_price) + ((output_tokens / 1M) * output_price), multiplied by monthly requests. It excludes prompt caching, batch discounts, vector storage, embedding/indexing, server-side tools, and provider subscriptions. OpenAI rates use standard processing below the long-context surcharge threshold published on the API pricing page.
| Provider model | Input / output (per 1M tokens) | Full-context monthly | Retrieval-first monthly | Monthly savings | Reduction |
|---|---|---|---|---|---|
| OpenAI GPT-5.4 | $2.50 / $15.00 | $13,100 | $1,225 | $11,875 | 90.6% |
| Claude Sonnet 4.6 | $3.00 / $15.00 | $15,600 | $1,350 | $14,250 | 91.3% |
| xAI Grok 4.3 | $1.25 / $2.50 | $6,350 | $413 | $5,937 | 93.5% |
Monthly token-cost spread
Bars show full-context spend. The green marker is the retrieval-first spend for the same model.
Why the percentage stays high
At 100K input tokens, a full-context call sends 20x more input than a 5K retrieval-first call. Output cost is unchanged, so the reduction is slightly below the input reduction when output prices are high.
- OpenAI GPT-5.4: $13,100 full-context vs $1,225 retrieval-first at the default volume.
- Claude Sonnet 4.6: $15,600 full-context vs $1,350 retrieval-first at the default volume.
- xAI Grok 4.3: $6,350 full-context vs $413 retrieval-first at the default volume.
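As a sanity check on the figures above, here is a minimal sketch of the model-only cost formula from the Methodology note. The $3 / $15 per-1M rates are placeholders to substitute with current published provider prices; with those rates the output reproduces the Claude Sonnet 4.6 figures ($15,600 vs $1,350).

```python
def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    # ((input_tokens / 1M) * input_price) + ((output_tokens / 1M) * output_price),
    # multiplied by monthly requests. Cache, batch, long-context, and
    # infrastructure costs are excluded, as in the Methodology note.
    per_request = (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price
    return requests * per_request

REQUESTS = 50_000  # monthly answer requests in the default scenario
full = monthly_cost(REQUESTS, 100_000, 800, input_price=3.0, output_price=15.0)
rag = monthly_cost(REQUESTS, 5_000, 800, input_price=3.0, output_price=15.0)
print(f"full-context ${full:,.0f} vs retrieval-first ${rag:,.0f} "
      f"({1 - rag / full:.1%} reduction)")
```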
Product impact
What this means for developers and users
Developers care about
- Predictable unit economics: fewer repeated input tokens per answer request.
- Latency headroom: retrieval candidate sets stay bounded as the corpus grows.
- Context headroom: the model receives relevant chunks instead of a whole transcript or source dump.
- Operational controls: source sync, access policy, API tokens, telemetry, billing, and portal surfaces are part of the same Memory object.
Users care about
- Less waiting: the assistant does not need to drag a full document pile through every answer.
- More relevant answers: the prompt is built from selected evidence rather than whatever happened to fit in the context window.
- Fresher knowledge: source sync updates the indexed memory without asking users to re-upload files manually.
- Safer sharing: public portals and embeds can use the same governed memory without exposing builder controls.
Defensible claims
Claims that match the evidence
These are intentionally phrased as measured or modeled claims, with the boundary conditions included.
Technically supported
- In the local deterministic run, AMV-L reduced average retrieval candidates scanned by 85.3% versus the TTL baseline.
- In the same run, AMV-L reduced /v1/ask p95 latency from 5,544 ms to 1,290 ms, a 4.3x improvement.
- For the default provider-pricing scenario, replacing a 100K-token full-context prompt with a 5K-token retrieval-first prompt reduces model token spend by 90.6-93.5% across the listed OpenAI, Anthropic, and xAI models.
Caveats to keep
- The AMV-L run is a local deterministic benchmark, not a hosted production SLA.
- Token costs exclude vector storage, embedding/indexing, provider tool calls, prompt caching, batch discounts, long-context surcharges, regional processing uplifts, and negotiated enterprise pricing.
- If the answer genuinely needs the entire corpus every time, retrieval-first reduces less. The advantage is strongest when most questions need a small, relevant slice of a larger knowledge base.
Technical review ledger
Claim type, basis, and boundary
This section separates measured benchmark claims from modeled cost claims and architectural product claims so reviewers can verify the evidence path.
| Claim | Type | Basis | Boundary |
|---|---|---|---|
| AMV-L reduces retrieval candidate scanning by 85.3% versus TTL. | Measured | Average memory_candidates.vector_search_scanned_count: TTL 3,822.07, AMV-L 563.48, across 20,000 candidate events per policy. | Local deterministic workload, seed 1337, fallback embeddings enabled, not a hosted production SLA. |
| AMV-L lowers /v1/ask p95 latency by 4.30x versus TTL. | Measured | TTL ask p95 5,544.42 ms divided by AMV-L ask p95 1,289.56 ms in the same replay. | Measured on the local runtime profile and workload mix used by scripts/run_ttl_amvl_eval.sh. |
| Retrieval-first prompting cuts model token spend by 90.6-93.5% in the default scenario. | Modeled | Official standard token prices, 50,000 monthly answers, 100K full-context input tokens, 5K retrieval input tokens, 800 output tokens. | Model-only token cost; excludes cache discounts, batch discounts, vector operations, tool calls, regional uplifts, and negotiated pricing. |
| Memory is a cleaner production boundary than DIY prompt and vector plumbing. | Architectural | Memory APIs cover create, sources, sync, write, search, ask, boolean_ask, chat, code, status, and actions in the local docs. | Architecture claim; production quality still depends on source quality, chunking, retrieval tuning, evaluation, and access policy. |
Sources
Data and pricing references
- ../supavector/telemetry/events_ttl_amvl_lru.ndjson and ../supavector/telemetry/README_ttl_amvl.md
- OpenAI API pricing: GPT-5.5, GPT-5.4, GPT-5.4 mini
- Claude API pricing docs: Haiku 4.5, Sonnet 4.6, and Opus 4.7
- xAI models and pricing: Grok 4.3
- ChatGPT plan context limits and consumer plan comparison notes
Technical whitepaper
Supavector Agent Memory as retrieval infrastructure.
Supavector packages retrieval, lifecycle policy, source sync, access control, and delivery surfaces into a deployable Memory object. The goal is to give applications durable knowledge without paying the latency, cost, and governance penalty of sending entire corpora to an LLM on every request.
Abstract
Why the system exists
General-purpose chat tools are effective for individual analysis, but production applications need repeatable access to governed knowledge. Raw vector databases solve only the similarity-search portion. Supavector sits above storage and model providers to coordinate source ingestion, retrieval planning, prompt construction, access resolution, generation, delivery, and usage reporting.
The technical position is retrieval-first: treat the model context window as a scarce runtime budget. Store and score durable knowledge outside the model, then send only the selected evidence and task instructions needed for the current request.
Design goals
- Keep model input tied to selected evidence, not total corpus size.
- Resolve tenant, project, role, and source exposure before retrieval.
- Expose the same Memory through APIs, Studio, portals, and embeds.
System architecture
One Memory object coordinates many moving parts
The product is intentionally more than a vector index. The Memory object gives the runtime a stable boundary for source ownership, retrieval behavior, generation policy, access rules, public delivery, and metering.
Core components
- Control plane: Projects, service tokens, user roles, source configuration, portal settings, credits, and operator workflows live in the hosted Studio.
- Indexing pipeline: Source content is normalized, chunked, embedded, and written into hidden retrieval collections attached to a Memory.
- Runtime retrieval: Search and answer calls resolve access, retrieve candidates, choose an evidence set, and construct a bounded prompt.
- Delivery surfaces: The same Memory can back backend API calls, no-code Studio tests, public or internal portals, and embedded assistants.
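One way to picture that boundary is as a single configuration object that owns all of these concerns. The sketch below is illustrative; the field names are hypothetical and not the actual Supavector object model.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryConfig:
    # Hypothetical shape of the Memory boundary; illustrative only.
    memory_id: str
    sources: list[str] = field(default_factory=list)   # synced source identifiers
    lifecycle_policy: str = "amvl"                      # ttl | lru | amvl
    access: dict = field(default_factory=dict)          # roles, tenant and source exposure
    provider: str = "openai"                            # model adapter target
    surfaces: list[str] = field(default_factory=list)   # api, studio, portal, embed
```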
Query execution
A request is routed through policy before it reaches the model
Retrieval-first design only works if the system can decide what is allowed, what is relevant, and what fits in the prompt budget before generation starts.
Runtime sequence
- Authenticate: Resolve workspace, project, service token, or portal visitor context.
- Constrain: Apply Memory settings, source exposure rules, tenant access, and the requested mode such as search, ask, chat, or code.
- Retrieve: Run vector and metadata retrieval against the Memory's hidden collection and candidate tiers.
- Budget: Select chunks and instructions that fit the answer mode and token budget.
- Generate: Call the configured provider with bounded evidence, then return answer text, references, and usage telemetry.
Memory lifecycle
AMV-L is the lifecycle policy behind bounded retrieval
TTL and LRU are useful baselines, but they optimize for age or recency. AMV-L adds value-aware lifecycle behavior so high-signal memory can remain hot while low-value or redundant memory can decay, demote, compact, or evict.
Why AMV-L matters
- TTL removes items by time, which can keep stale low-value content too long or remove useful older content too early.
- LRU favors recent access, which can be fast but may overfit to recency and produce unstable tail behavior.
- AMV-L uses value and lifecycle thresholds to control hot/warm/cold movement, helping retrieval stay bounded without reducing the system to age or recency alone.
The measured results above show the outcome of that policy choice: AMV-L scanned 563 average candidates versus TTL's 3,822, while keeping ask p99 at 1.54s in the local run.
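The exact AMV-L scoring function is not specified in this document, so the sketch below only illustrates the general idea of value-aware tiering: hot/warm/cold movement driven by a composite value score rather than by age or recency alone. All weights and thresholds are made up for illustration.

```python
import time

def tier_for(item: dict, hot: float = 0.7, warm: float = 0.3) -> str:
    # Toy value score: hit rate and recency raise value, age decays it.
    # This is NOT the AMV-L formula, only the shape of a value-aware policy.
    age_days = (time.time() - item["created_at"]) / 86_400
    value = 0.5 * item["hit_rate"] + 0.3 * item["recency"] + 0.2 / (1 + age_days)
    if value >= hot:
        return "hot"    # stays in the primary retrieval candidate set
    if value >= warm:
        return "warm"   # demoted but still retrievable
    return "cold"       # candidate for compaction or eviction

# Example: a 30-day-old item with a strong hit rate but low recency lands in "warm".
print(tier_for({"created_at": time.time() - 30 * 86_400, "hit_rate": 0.9, "recency": 0.2}))
```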
Developer interface
The API surface is organized around memory operations
Applications should not need to know how every source was synced or how every vector collection is named. They call Memory operations and let the runtime enforce the configured retrieval and access policy.
Representative request pattern
POST /v1/memories/{memory_id}/ask
Authorization: Bearer supav_...
Content-Type: application/json
{
  "query": "What changed in the refund policy?",
  "k": 40,
  "answerLength": "short",
  "policy": "amvl"
}
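A minimal client sketch for the call above. The base URL, memory ID, and response field names ("answer", "references") are assumptions; check the API reference for the exact shapes.

```python
import os
import requests

resp = requests.post(
    "https://api.supavector.example/v1/memories/mem_123/ask",  # hypothetical host and memory ID
    headers={"Authorization": f"Bearer {os.environ['SUPAVECTOR_TOKEN']}"},
    json={"query": "What changed in the refund policy?",
          "k": 40, "answerLength": "short", "policy": "amvl"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data.get("answer"))                 # generated answer text (field name assumed)
for ref in data.get("references", []):    # selected chunk or document references
    print("-", ref)
```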
Operational outputs
- Answer payload: generated response plus references to selected chunks or documents.
- Telemetry: latency, retrieval candidate counts, prompt token estimates, model token usage, and error status.
- Billing hooks: generation tokens and storage usage can be attributed to the project and Memory boundary.
- Auditability: the same Memory configuration can be inspected in Studio and reused by API clients.
Governance and deployment
Production RAG needs controls around retrieval, not only better prompts
The technical value compounds when the system owns the operational boundary around retrieval: who may call it, which sources are visible, where it runs, and how usage is measured.
- Service tokens and project roles separate builder, backend, portal, and operator access.
- Sources can be published, internal, restricted, or scoped before retrieval chooses chunks.
- The model adapter can route to OpenAI, Anthropic, xAI, or other configured providers without changing the Memory boundary.
- Hosted, dedicated, self-hosted, or customer-cloud deployments can share the same product concepts while changing infrastructure location.
Engineering tradeoffs
What the architecture does and does not claim
Strong claims
- Retrieval-first systems can materially reduce repeated model input tokens for large knowledge bases.
- A Memory object is a cleaner production boundary than scattering source sync, vector search, prompt assembly, and portal delivery across separate services.
- Lifecycle policy matters because retrieval cost and tail latency are affected by the candidate set, not only by the final chunk count.
Explicit limits
- RAG does not guarantee correctness by itself. Source quality, chunking, prompt design, evaluation, and access rules remain part of the system.
- Some tasks need long-context reasoning across many documents. Those should use larger budgets or a staged workflow instead of forcing a tiny retrieval set.
- Benchmarks should be rerun for each deployment profile before turning local results into hosted latency commitments.