Retrieval-first memory vs full-context prompting
Send the right memory, not the whole pile of tokens.
Supavector's technical bet is simple: index knowledge once, retrieve the relevant working set at answer time, and keep lifecycle policy close to retrieval. That reduces model input cost, preserves context headroom, and keeps latency tied to the size of the selected evidence rather than the size of the entire corpus.
Methodology
What was measured
The benchmark portion is measured directly from Supavector runtime telemetry collected during a controlled, deterministic replay. The cost portion is a transparent model that uses published provider token prices and the prompt-size assumptions shown alongside each chart.
Traditional alternatives
What developers usually compare against
The relevant comparison is not only vector search. Developers also compare against manually pasting files into ChatGPT, Claude, or Grok, or building a raw LLM API flow that dumps every possible document into every request.
| Approach | How it works | Developer impact | User impact |
|---|---|---|---|
| Consumer chat app (ChatGPT, Claude, Grok) | Upload or paste files into a chat workspace and ask questions manually. | Fast for one-off analysis, but not a governed production API with project tokens, source sync, access policy, audit trails, or repeatable deployment surfaces. | Useful for personal work, but users may re-upload content, hit context or file limits, and get inconsistent source coverage across sessions. |
| Full-context API prompt (dump all relevant text every call) | Send a large document set or conversation transcript as input tokens to the model on every answer request. | Lowest integration complexity, highest repeated input-token cost, more context-window pressure, and weaker source lifecycle controls. | Can be slow and expensive at scale, especially when most of the prompt is irrelevant to the specific question. |
| Vector database only (DIY RAG plumbing) | Store embeddings, run similarity search, then assemble prompts in custom application code. | Better token efficiency, but the team still owns source sync, hidden collections, prompt construction, portal delivery, billing, RBAC, and telemetry. | Quality depends on the custom retrieval layer, citation handling, freshness, and whether stale or unauthorized chunks are filtered correctly. |
| Supavector QueryLayer (retrieval-first operating layer) | Index source content once, use the Index search/ask/chat/code APIs, and route the same governed memory through Studio, portals, embeds, and backend calls. | Bounded prompt size, reusable query indexes, source and access policy controls, usage telemetry, and a smaller model-token blast radius. | Answers use selected evidence instead of a whole corpus dump, with lower wait time and less manual document handling. |
Benchmark results
AMV-L keeps retrieval bounded without the TTL scan penalty
AMV-L is not just a cost story. In the benchmark replay, the value-aware lifecycle reduced the average retrieval candidate set from 3,822 to 563 versus TTL and cut answer p95 latency from 5.54s to 1.29s.
| Policy | Avg candidates scanned | /v1/ask p95 | /v1/ask p99 | /v1/recall p95 | Run duration |
|---|---|---|---|---|---|
| TTL baseline (large warm sample) | 3,822 | 5,544 ms | 6,396 ms | 2,379 ms | 129.3 min |
| AMV-L (selected) | 563 | 1,290 ms | 1,540 ms | 455 ms | 31.6 min |
| LRU (recency only) | 219 | 1,553 ms | 6,402 ms | 464 ms | 30.6 min |
Candidate scan pressure
Average memory_candidates.vector_search_scanned_count across recall and ask retrieval events.
LRU scanned fewer candidates, but its ask p99 tail was 6.4s in this run. AMV-L is the stronger default when the objective is balanced latency, bounded retrieval, and value-aware memory quality rather than scan count alone.
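The gap between LRU's p95 and p99 is easier to read with the percentile arithmetic in front of you. The following is a minimal sketch, assuming nearest-rank percentiles (this page does not state which percentile method the telemetry uses) and a made-up latency sample rather than benchmark data:

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, giving a 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 slow outliers.
latencies_ms = [400] * 97 + [6000] * 3

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"p95={p95} ms, p99={p99} ms")
```

With only 3% of requests hitting the slow path, p95 still reports the fast value while p99 reports the outlier. A small share of slow retrievals is enough to produce a tail like the one in the LRU row without moving p95 much.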
Cost model
Full-context prompts make every query pay for the whole corpus again
The default scenario compares 100K input tokens per request against a 5K retrieval-first prompt. Output is held constant at 800 tokens, so the savings come from reducing repeated input tokens, not from assuming shorter answers.
Model-only calculation: ((input_tokens / 1M) * input_price) + ((output_tokens / 1M) * output_price), multiplied by monthly requests. It excludes prompt caching, batch discounts, vector storage, embedding/indexing, server-side tools, and provider subscriptions. OpenAI rates use standard processing below the long-context surcharge threshold published on the API pricing page.
| Provider model | Input / output | Full-context monthly | Retrieval-first monthly | Monthly savings | Reduction |
|---|---|---|---|---|---|
Monthly token-cost spread
Bars show full-context spend. The green marker is the retrieval-first spend for the same model.
Why the percentage stays high
At 100K input tokens, a full-context call sends 20x as much input as a 5K retrieval-first call, so input spend drops by 95%. Output cost is unchanged, so the total reduction lands below 95%, and further below it when output prices are high relative to input prices.
- OpenAI GPT-5.4: $13,100 full-context vs $1,225 retrieval-first at the default volume.
- Claude Sonnet 4.6: $15,600 full-context vs $1,350 retrieval-first at the default volume.
- xAI Grok 4.3: $6,350 full-context vs $413 retrieval-first at the default volume.
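The model-only formula can be checked against the three totals above. This is a sketch, not the page's calculator: the per-million-token prices in `PRICES` are assumptions back-solved from the monthly totals quoted here, not copied from a provider pricing page, and the 50,000 monthly requests figure is the default volume stated in the evidence table.

```python
def monthly_cost(input_tokens, output_tokens, input_price, output_price, requests):
    """Model-only monthly spend in USD; prices are USD per 1M tokens."""
    per_request = (
        (input_tokens / 1_000_000) * input_price
        + (output_tokens / 1_000_000) * output_price
    )
    return per_request * requests


REQUESTS = 50_000       # default monthly answers
OUTPUT_TOKENS = 800     # held constant in both scenarios
FULL_CONTEXT_INPUT = 100_000
RETRIEVAL_INPUT = 5_000

# ASSUMPTION: (input, output) USD per 1M tokens, back-solved from the
# monthly totals quoted in this section. Verify against current pricing.
PRICES = {
    "OpenAI GPT-5.4": (2.50, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "xAI Grok 4.3": (1.25, 2.50),
}


def compare(model):
    """Return (full_context_usd, retrieval_first_usd, fractional_reduction)."""
    input_price, output_price = PRICES[model]
    full = monthly_cost(FULL_CONTEXT_INPUT, OUTPUT_TOKENS, input_price, output_price, REQUESTS)
    retrieval = monthly_cost(RETRIEVAL_INPUT, OUTPUT_TOKENS, input_price, output_price, REQUESTS)
    return full, retrieval, 1 - retrieval / full


for model in PRICES:
    full, retrieval, reduction = compare(model)
    print(f"{model}: ${full:,.2f} vs ${retrieval:,.2f} ({reduction:.1%} lower)")
```

Because output spend is identical in both scenarios, the reduction is largest for the model whose output price is lowest relative to its input price.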
Product impact
What this means for developers and users
Developers care about
- Predictable unit economics: fewer repeated input tokens per answer request.
- Latency headroom: retrieval candidate sets stay bounded as the corpus grows.
- Context headroom: the model receives relevant chunks instead of a whole transcript or source dump.
- Operational controls: source sync, access policy, API tokens, telemetry, billing, and portal surfaces are part of the same query index.
Users care about
- Less waiting: the assistant does not need to drag a full document pile through every answer.
- More relevant answers: the prompt is built from selected evidence rather than whatever happened to fit in the context window.
- Fresher knowledge: source sync updates the indexed memory without asking users to re-upload files manually.
- Safer sharing: public portals and embeds can use the same governed memory without exposing builder controls.
Headline findings
What the numbers actually say
Every figure below is either measured directly from a benchmark run or calculated from a transparent cost model. The boundaries each finding holds within are stated next to it, so readers can decide how the result maps to their own workload.
Technically supported
- In the deterministic benchmark replay, AMV-L reduced average retrieval candidates scanned by 85.3% versus the TTL baseline.
- In the same replay, AMV-L reduced /v1/ask p95 latency from 5,544 ms to 1,290 ms, a 4.3x improvement.
- For the default provider-pricing scenario, replacing a 100K-token full-context prompt with a 5K-token retrieval-first prompt reduces model token spend by 90.6-93.5% across the listed OpenAI, Anthropic, and xAI models.
Caveats to keep
- The AMV-L run is a deterministic benchmark workload, not a hosted production SLA.
- Token costs exclude vector storage, embedding/indexing, provider tool calls, prompt caching, batch discounts, long-context surcharges, regional processing uplifts, and negotiated enterprise pricing.
- If the answer genuinely needs the entire corpus every time, retrieval-first saves less. The advantage is strongest when most questions need a small, relevant slice of a larger knowledge base.
Evidence basis
Measured, modeled, or architectural
Every figure on this page falls into one of three categories: a directly measured benchmark result, a cost value modeled from public pricing, or an architectural description of the product. The table below states the evidence type, the basis for the number, and the conditions under which the result applies.
| Finding | Type | Basis | Boundary |
|---|---|---|---|
| AMV-L reduces retrieval candidate scanning by 85.3% versus TTL. | Measured | Average candidate scan counts recorded by the retrieval engine: TTL 3,822.07, AMV-L 563.48, taken across 20,000 candidate events per policy. | Deterministic benchmark workload, seed 1337, fallback embeddings enabled. Not a hosted production SLA. |
| AMV-L lowers /v1/ask p95 latency by 4.30x versus TTL. | Measured | TTL ask p95 5,544.42 ms divided by AMV-L ask p95 1,289.56 ms in the same replay. | Measured on the same runtime profile and workload mix used for the AMV-L evaluation suite. |
| Retrieval-first prompting cuts model token spend by 90.6-93.5% in the default scenario. | Modeled | Official standard token prices, 50,000 monthly answers, 100K full-context input tokens, 5K retrieval input tokens, 800 output tokens. | Model-only token cost; excludes cache discounts, batch discounts, vector operations, tool calls, regional uplifts, and negotiated pricing. |
| The query index is a cleaner production boundary than DIY prompt and vector plumbing. | Architectural | QueryLayer APIs cover create, sources, sync, write, search, ask, boolean_ask, chat, code, status, and actions across the public Supavector documentation. | This is an architectural characterization. Production quality still depends on source quality, chunking, retrieval tuning, evaluation, and access policy. |
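The two measured rows can be reproduced from the averages quoted in the Basis column. A short worked check, using only the figures stated in the table:

```python
# Averages quoted in the Basis column of the evidence table.
ttl_candidates = 3822.07
amvl_candidates = 563.48
ttl_ask_p95_ms = 5544.42
amvl_ask_p95_ms = 1289.56

# Fraction of candidate scans avoided relative to the TTL baseline.
scan_reduction = 1 - amvl_candidates / ttl_candidates

# Ratio of TTL p95 latency to AMV-L p95 latency.
p95_speedup = ttl_ask_p95_ms / amvl_ask_p95_ms

print(f"{scan_reduction:.1%}")   # 85.3%
print(f"{p95_speedup:.2f}x")     # 4.30x
```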
References
Pricing pages and benchmark data
Cost figures reference the published API pricing pages below. The underlying benchmark replay logs and telemetry are available upon reasonable request.
Technical paper
Supavector QueryLayer as retrieval infrastructure.
Supavector packages retrieval, lifecycle policy, source sync, access control, and delivery surfaces into a deployable query index. The goal is to give applications durable knowledge without paying the latency, cost, and governance penalty of sending entire corpora to an LLM on every request.
Abstract
Why the system exists
General-purpose chat tools are effective for individual analysis, but production applications need repeatable access to governed knowledge. Raw vector databases solve only the similarity-search portion. Supavector sits above storage and model providers to coordinate source ingestion, retrieval planning, prompt construction, access resolution, generation, delivery, and usage reporting.
The technical position is retrieval-first: treat the model context window as a scarce runtime budget. Store and score durable knowledge outside the model, then send only the selected evidence and task instructions needed for the current request.
Design goals
- Keep model input tied to selected evidence, not total corpus size.
- Resolve tenant, project, role, and source exposure before retrieval.
- Expose the same query index through APIs, Studio, portals, and embeds.
System architecture
One query index coordinates many moving parts
The product is intentionally more than a vector index. The query index gives the runtime a stable boundary for source ownership, retrieval behavior, generation policy, access rules, public delivery, and metering.
Core components
- Control plane: Projects, service tokens, user roles, source configuration, portal settings, credits, and operator workflows live in the hosted Studio.
- Indexing pipeline: Source content is normalized, chunked, embedded, and written into hidden retrieval collections attached to a query index.
- Runtime retrieval: Search and answer calls resolve access, retrieve candidates, choose an evidence set, and construct a bounded prompt.
- Delivery surfaces: The same query index can back backend API calls, no-code Studio tests, public or internal portals, and embedded assistants.
Query execution
A request is routed through policy before it reaches the model
Retrieval-first design only works if the system can decide what is allowed, what is relevant, and what fits in the prompt budget before generation starts.
Runtime sequence
- Authenticate: Resolve workspace, project, service token, or portal visitor context.
- Constrain: Apply index settings, source exposure rules, tenant access, and requested mode such as search, ask, chat, or code.
- Retrieve: Run vector and metadata retrieval against the query index's hidden collection and candidate tiers.
- Budget: Select chunks and instructions that fit the answer mode and token budget.
- Generate: Call the configured provider with bounded evidence, then return answer text, references, and usage telemetry.
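The five steps above can be sketched as a small pipeline. This is an illustrative toy, not Supavector's runtime: every name here (`Chunk`, `TOKENS`, `ALLOWED`, `ask`, the token strings) is hypothetical, keyword overlap stands in for vector retrieval, word count stands in for a tokenizer, and `generate` returns the bounded prompt instead of calling a provider.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    visibility: str  # "published" or "internal"


# Hypothetical token registry and role-to-visibility policy.
TOKENS = {"supav_portal": "portal", "supav_backend": "backend"}
ALLOWED = {"portal": {"published"}, "backend": {"published", "internal"}}


def authenticate(token):
    """Resolve the caller's role, or refuse an unknown token."""
    role = TOKENS.get(token)
    if role is None:
        raise PermissionError("unknown token")
    return role


def constrain(role, chunks):
    """Drop chunks the caller may not see, before any ranking happens."""
    return [c for c in chunks if c.visibility in ALLOWED[role]]


def retrieve(query, chunks, k):
    """Rank by keyword overlap (a stand-in for vector similarity)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c) for c in chunks]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:k]]


def budget(chunks, max_tokens):
    """Keep ranked chunks until the crude token estimate would overflow."""
    selected, used = [], 0
    for c in chunks:
        cost = len(c.text.split())
        if used + cost > max_tokens:
            break
        selected.append(c)
        used += cost
    return selected


def generate(query, evidence):
    """Build the bounded prompt; a real system would call a model here."""
    prompt = "\n".join(c.text for c in evidence) + "\nQuestion: " + query
    return {"prompt": prompt, "references": [c.source for c in evidence]}


def ask(token, query, corpus, k=3, max_tokens=20):
    role = authenticate(token)
    visible = constrain(role, corpus)
    candidates = retrieve(query, visible, k)
    evidence = budget(candidates, max_tokens)
    return generate(query, evidence)


CORPUS = [
    Chunk("refund policy now allows returns within 30 days", "policy.md", "published"),
    Chunk("internal refund escalation contacts for the policy team", "runbook.md", "internal"),
    Chunk("office lunch menu for friday", "menu.md", "published"),
]
QUERY = "what changed in the refund policy"

print(ask("supav_portal", QUERY, CORPUS)["references"])
print(ask("supav_backend", QUERY, CORPUS)["references"])
```

The ordering is the point of the sketch: access filtering runs before ranking, so an internal chunk can never reach a portal caller's prompt no matter how well it matches the query.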
Index lifecycle
AMV-L is the lifecycle policy behind bounded retrieval
TTL and LRU are useful baselines, but they optimize for age or recency. AMV-L adds value-aware lifecycle behavior so high-signal memory can remain hot while low-value or redundant memory can be decayed, demoted, compacted, or evicted.
Why AMV-L matters
- TTL removes items by time, which can keep stale low-value content too long or remove useful older content too early.
- LRU favors recent access, which can be fast but may overfit to recency and produce unstable tail behavior.
- AMV-L uses value and lifecycle thresholds to control hot/warm/cold movement, helping retrieval stay bounded without reducing the system to age or recency alone.
The benchmark results show the effect of that policy choice: AMV-L scanned 563 average candidates versus TTL's 3,822, while keeping ask p99 at 1.54s in the benchmark replay.
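The difference between the three policy families can be illustrated with a toy example. This is not AMV-L's actual algorithm: this page does not specify how value is computed or which thresholds are used, so the `value` score, the `hot` and `warm` thresholds, and all item data below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    key: str
    age_s: int     # seconds since creation
    idle_s: int    # seconds since last access
    value: float   # hypothetical value score in [0, 1], assumed precomputed


def ttl_keep(memories, ttl_s):
    """TTL: keep anything younger than the time limit."""
    return {m.key for m in memories if m.age_s <= ttl_s}


def lru_keep(memories, capacity):
    """LRU: keep the most recently accessed items up to capacity."""
    by_recency = sorted(memories, key=lambda m: m.idle_s)
    return {m.key for m in by_recency[:capacity]}


def value_tier(memory, hot=0.7, warm=0.3):
    """Value-aware tiering with hypothetical thresholds."""
    if memory.value >= hot:
        return "hot"
    if memory.value >= warm:
        return "warm"
    return "cold"


MEMORIES = [
    Memory("pricing-faq", age_s=90_000, idle_s=7_200, value=0.9),   # old but high value
    Memory("debug-chatter", age_s=600, idle_s=60, value=0.1),       # fresh but low value
    Memory("refund-policy", age_s=1_800, idle_s=300, value=0.8),
]

ttl_set = ttl_keep(MEMORIES, ttl_s=3_600)
lru_set = lru_keep(MEMORIES, capacity=2)
hot_set = {m.key for m in MEMORIES if value_tier(m) == "hot"}

print("TTL keeps:", sorted(ttl_set))
print("LRU keeps:", sorted(lru_set))
print("Hot tier: ", sorted(hot_set))
```

In this toy data, both TTL and LRU retain the fresh low-value item and drop the older high-value one, while the value-tiered policy does the reverse. That is the failure mode the bullets above describe for age-only and recency-only policies.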
Developer interface
The API surface is organized around memory operations
Applications should not need to know how every source was synced or how every vector collection is named. They call Index operations and let the runtime enforce the configured retrieval and access policy.
Representative request pattern
POST /v1/memories/{memory_id}/ask
Authorization: Bearer supav_...
Content-Type: application/json

{
  "query": "What changed in the refund policy?",
  "k": 40,
  "answerLength": "short",
  "policy": "amvl"
}
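The same request can be constructed from a backend service with the Python standard library. This is a sketch under stated assumptions: `BASE_URL` is a placeholder because this page does not state the API host, `mem_123` and `supav_example` are made-up identifiers, and the response shape is not assumed beyond being JSON.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_ask_request(memory_id, api_token, query, k=40, answer_length="short", policy="amvl"):
    """Build (but do not send) a POST to the ask endpoint shown above."""
    body = {
        "query": query,
        "k": k,
        "answerLength": answer_length,
        "policy": policy,
    }
    # Percent-encode the id so it cannot alter the URL path.
    path_id = urllib.parse.quote(memory_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/memories/{path_id}/ask",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_ask_request("mem_123", "supav_example", "What changed in the refund policy?")
print(req.get_method(), req.full_url)

# To send it against a real deployment:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     payload = json.load(resp)
```

Building the request separately from sending it keeps the example runnable offline and makes the payload easy to inspect in tests.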
Operational outputs
- Answer payload: generated response plus references to selected chunks or documents.
- Telemetry: latency, retrieval candidate counts, prompt token estimates, model token usage, and error status.
- Billing hooks: generation tokens and storage usage can be attributed to the project and Index boundary.
- Auditability: the same index configuration can be inspected in Studio and reused by API clients.
Governance and deployment
Production RAG needs controls around retrieval, not only better prompts
The technical value compounds when the system owns the operational boundary around retrieval: who may call it, which sources are visible, where it runs, and how usage is measured.
- Service tokens and project roles separate builder, backend, portal, and operator access.
- Sources can be published, internal, restricted, or scoped before retrieval chooses chunks.
- The model adapter can route to OpenAI, Anthropic, xAI, or other configured providers without changing the query index boundary.
- Hosted, dedicated, self-hosted, or customer-cloud deployments can share the same product concepts while changing infrastructure location.
Engineering tradeoffs
Scope of the architecture
Supported conclusions
- Retrieval-first systems can materially reduce repeated model input tokens for large knowledge bases.
- A query index is a cleaner production boundary than scattering source sync, vector search, prompt assembly, and portal delivery across separate services.
- Lifecycle policy matters because retrieval cost and tail latency are affected by the candidate set, not only by the final chunk count.
Known limits
- RAG does not guarantee correctness by itself. Source quality, chunking, prompt design, evaluation, and access rules remain part of the system.
- Some tasks need long-context reasoning across many documents. Those should use larger budgets or a staged workflow instead of forcing a tiny retrieval set.
- Benchmarks should be rerun for each deployment profile before turning replay results into hosted latency commitments.