Supavector Performance
Benchmark report generated for repo review
Pricing checked May 9, 2026

Retrieval-first memory vs full-context prompting

Send the right memory, not the whole pile of tokens.

Supavector's technical bet is simple: index knowledge once, retrieve the relevant working set at answer time, and keep lifecycle policy close to retrieval. That reduces model input cost, preserves context headroom, and keeps latency tied to the selected evidence instead of the entire corpus size.
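To make the contrast concrete, here is a minimal illustrative sketch of the two prompting strategies. The retrieval and model calls are trivial stand-ins written for this report, not Supavector or provider APIs.

    # Illustrative contrast only: the retrieval and model calls are trivial
    # stand-ins, not Supavector or provider APIs.

    def call_model(prompt: str) -> str:
        return f"<answer for {len(prompt)} prompt chars>"   # placeholder model call

    def retrieve_top_k(corpus: list[str], question: str, k: int) -> list[str]:
        # Placeholder scorer: keep the k chunks sharing the most words with the question.
        words = set(question.lower().split())
        return sorted(corpus, key=lambda c: -len(words & set(c.lower().split())))[:k]

    def answer_full_context(question: str, corpus: list[str]) -> str:
        # Every request pays to send the entire corpus as input tokens.
        return call_model("\n".join(corpus) + "\n\n" + question)

    def answer_retrieval_first(question: str, corpus: list[str], k: int = 48) -> str:
        # Only the selected working set rides along; prompt size stays bounded.
        return call_model("\n".join(retrieve_top_k(corpus, question, k)) + "\n\n" + question)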

4.3x lower /v1/ask p95 latency for AMV-L vs TTL in the deterministic local workload.
85.3% fewer retrieval candidates scanned by AMV-L vs TTL, from 3,822 to 563 average candidates.
210K workload operations across TTL, AMV-L, and LRU runs, with 0 reported workload errors.
48 retrieved chunks per answer in the AMV-L run, averaging 4,790 prompt tokens.

Methodology

What was measured

The measured portion comes from the local Supavector runtime telemetry run in ../supavector/telemetry/events_ttl_amvl_lru.ndjson. The cost portion is a transparent model using official provider token prices and visible prompt-size assumptions.

Same workload replayed: Each policy run used seed 1337, 50,000 writes, 10,000 recall requests, 10,000 ask requests, k=24 for recall, k=48 for answers, and concurrency 12.
Policies compared: TTL baseline with a large warm sample, AMV-L with a value-aware lifecycle and a small warm sample, and LRU with recency-only warm selection.
Cost assumptions shown: Default calculator values are 50,000 monthly requests, 100,000 full-context input tokens, 5,000 retrieval-first input tokens, and 800 output tokens per answer.
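For reference, a minimal sketch of how these replay parameters could be encoded. The structure and field names are illustrative assumptions, not the schema consumed by scripts/run_ttl_amvl_eval.sh.

    # Illustrative only: field names and structure are assumptions, not the
    # configuration format used by scripts/run_ttl_amvl_eval.sh.
    REPLAY_CONFIG = {
        "seed": 1337,
        "writes": 50_000,
        "recall_requests": 10_000,
        "ask_requests": 10_000,
        "k_recall": 24,
        "k_ask": 48,
        "concurrency": 12,
        "policies": ["ttl", "amv_l", "lru"],
    }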

Traditional alternatives

What developers usually compare against

The relevant comparison is not only vector search. Developers also compare against manually pasting files into ChatGPT, Claude, or Grok, or building a raw LLM API flow that dumps every possible document into every request.

Approach: Consumer chat app (ChatGPT, Claude, Grok)
  How it works: Upload or paste files into a chat workspace and ask questions manually.
  Developer impact: Fast for one-off analysis, but not a governed production API with project tokens, source sync, access policy, audit trails, or repeatable deployment surfaces.
  User impact: Useful for personal work, but users may re-upload content, hit context or file limits, and get inconsistent source coverage across sessions.

Approach: Full-context API prompt (dump all relevant text every call)
  How it works: Send a large document set or conversation transcript as input tokens to the model on every answer request.
  Developer impact: Lowest integration complexity, highest repeated input-token cost, more context-window pressure, and weaker source lifecycle controls.
  User impact: Can be slow and expensive at scale, especially when most of the prompt is irrelevant to the specific question.

Approach: Vector database only (DIY RAG plumbing)
  How it works: Store embeddings, run similarity search, then assemble prompts in custom application code.
  Developer impact: Better token efficiency, but the team still owns source sync, hidden collections, prompt construction, portal delivery, billing, RBAC, and telemetry.
  User impact: Quality depends on the custom retrieval layer, citation handling, freshness, and whether stale or unauthorized chunks are filtered correctly.

Approach: Supavector Agent Memory (retrieval-first operating layer)
  How it works: Index source content once, use Memory search/ask/chat/code APIs, and route the same governed memory through Studio, portals, embeds, and backend calls.
  Developer impact: Bounded prompt size, reusable Memory objects, source and access policy controls, usage telemetry, and a smaller model-token blast radius.
  User impact: Answers use selected evidence instead of a whole corpus dump, with lower wait time and less manual document handling.
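As a concrete reference for the retrieval-first flow, here is a minimal sketch of an answer request against the /v1/ask endpoint used in the measured runs. Only the endpoint path and k=48 come from this report; the base URL, auth header, and payload field names are illustrative assumptions, not the documented schema.

    import requests

    # Hypothetical request shape: the /v1/ask path and k=48 come from the
    # benchmark configuration; base URL, auth header, and payload fields are
    # illustrative assumptions, not the documented Supavector schema.
    SUPAVECTOR_URL = "https://api.example.com"  # placeholder base URL
    PROJECT_TOKEN = "sv_project_token"          # placeholder project token

    def ask(question: str, k: int = 48) -> dict:
        """Send one retrieval-first answer request and return the JSON body."""
        response = requests.post(
            f"{SUPAVECTOR_URL}/v1/ask",
            headers={"Authorization": f"Bearer {PROJECT_TOKEN}"},
            json={"query": question, "k": k},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()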

Measured local results

AMV-L keeps retrieval bounded without the TTL scan penalty

AMV-L is not just a cost story. In the local replay, the value-aware lifecycle reduced the measured retrieval candidate set from 3,822 to 563 average scanned candidates versus TTL and cut answer p95 latency from 5.54s to 1.29s.

Policy | Avg candidates scanned | /v1/ask p95 | /v1/ask p99 | /v1/recall p95 | Run duration
TTL baseline (large warm sample) | 3,822 | 5,544 ms | 6,396 ms | 2,379 ms | 129.3 min
AMV-L (value-aware lifecycle) | 563 | 1,290 ms | 1,540 ms | 455 ms | 31.6 min
LRU (recency only) | 219 | 1,553 ms | 6,402 ms | 464 ms | 30.6 min
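The p95 and p99 columns are tail percentiles over per-request ask latencies. As a reference point, a nearest-rank percentile helper of the kind that could produce these figures from the replayed request latencies; this is an illustrative sketch, not the benchmark's actual aggregation code.

    import math

    def percentile(values: list[float], pct: float) -> float:
        """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
        ordered = sorted(values)
        rank = math.ceil(pct / 100 * len(ordered))
        return ordered[max(rank, 1) - 1]

    # Example with toy latencies in milliseconds, not benchmark data.
    ask_latencies_ms = [220.0, 340.0, 410.0, 980.0, 1290.0]
    p95 = percentile(ask_latencies_ms, 95)
    p99 = percentile(ask_latencies_ms, 99)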

Candidate scan pressure

Average memory_candidates.vector_search_scanned_count across recall and ask retrieval events.

TTL baseline: 3,822
AMV-L: 563
LRU: 219
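A rough sketch of how that average could be recomputed from the telemetry file named in the methodology. The metric name and file path come from this report; the per-event field layout (a top-level "policy" field and a nested "memory_candidates" object) is an assumption.

    import json
    from collections import defaultdict

    # Recompute average scanned candidates per policy from the telemetry events.
    # Field layout is assumed; metric name and path are taken from this report.
    samples: dict[str, list[int]] = defaultdict(list)

    with open("../supavector/telemetry/events_ttl_amvl_lru.ndjson") as f:
        for line in f:
            event = json.loads(line)
            scanned = event.get("memory_candidates", {}).get("vector_search_scanned_count")
            if scanned is not None:
                samples[event.get("policy", "unknown")].append(scanned)

    for policy, counts in samples.items():
        print(policy, round(sum(counts) / len(counts), 2))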

LRU scanned fewer candidates, but its ask p99 tail was 6.4s in this run. AMV-L is the more defensible default when the claim is balanced latency, bounded retrieval, and value-aware memory quality rather than scan count alone.

Cost model

Full-context prompts make every query pay for the whole corpus again

The default scenario compares 100K input tokens per request against a 5K retrieval-first prompt. Output is held constant at 800 tokens, so the savings come from reducing repeated input tokens, not from assuming shorter answers.

Model-only calculation: ((input_tokens / 1M) * input_price) + ((output_tokens / 1M) * output_price), multiplied by monthly requests. It excludes prompt caching, batch discounts, vector storage, embedding/indexing, server-side tools, and provider subscriptions. OpenAI rates use standard processing below the long-context surcharge threshold published on the API pricing page.
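A small sketch of that model-only calculation, using placeholder per-million-token prices; the real figures come from the provider pricing pages referenced in the sources.

    def monthly_model_cost(
        monthly_requests: int,
        input_tokens: int,
        output_tokens: int,
        input_price_per_m: float,   # USD per 1M input tokens (placeholder)
        output_price_per_m: float,  # USD per 1M output tokens (placeholder)
    ) -> float:
        """Model-only token cost: excludes caching, batch discounts, storage, embeddings, and tools."""
        per_request = (
            (input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m
        )
        return per_request * monthly_requests

    # Default scenario from this report, with placeholder prices (not provider quotes).
    full_context = monthly_model_cost(50_000, 100_000, 800, input_price_per_m=2.50, output_price_per_m=10.00)
    retrieval_first = monthly_model_cost(50_000, 5_000, 800, input_price_per_m=2.50, output_price_per_m=10.00)
    reduction = 1 - retrieval_first / full_context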

Per-model comparison: provider model, input and output prices, full-context monthly cost, retrieval-first monthly cost, monthly savings, and reduction.

Monthly token-cost spread

Full-context monthly spend charted against the retrieval-first spend for the same model at the default volume.

Why the percentage stays high

At 100K input tokens, a full-context call sends 20x the input of a 5K retrieval-first call, a 95% input reduction. Output stays fixed at 800 tokens, so the overall reduction lands slightly below that 95%, with the gap widest for models with expensive output tokens.

  • OpenAI GPT-5.4: $13,100 full-context vs $1,225 retrieval-first at the default volume.
  • Claude Sonnet 4.6: $15,600 full-context vs $1,350 retrieval-first at the default volume.
  • xAI Grok 4.3: $6,350 full-context vs $413 retrieval-first at the default volume.

Product impact

What this means for developers and users

Developers care about

  • Predictable unit economics: fewer repeated input tokens per answer request.
  • Latency headroom: retrieval candidate sets stay bounded as the corpus grows.
  • Context headroom: the model receives relevant chunks instead of a whole transcript or source dump.
  • Operational controls: source sync, access policy, API tokens, telemetry, billing, and portal surfaces are part of the same Memory object.

Users care about

  • Less waiting: the assistant does not need to drag a full document pile through every answer.
  • More relevant answers: the prompt is built from selected evidence rather than whatever happened to fit in the context window.
  • Fresher knowledge: source sync updates the indexed memory without asking users to re-upload files manually.
  • Safer sharing: public portals and embeds can use the same governed memory without exposing builder controls.

Defensible claims

Claims that match the evidence

These are intentionally phrased as measured or modeled claims, with the boundary conditions included.

Technically supported

  • In the local deterministic run, AMV-L reduced average retrieval candidates scanned by 85.3% versus the TTL baseline.
  • In the same run, AMV-L reduced /v1/ask p95 latency from 5,544 ms to 1,290 ms, a 4.3x improvement.
  • For the default provider-pricing scenario, replacing a 100K-token full-context prompt with a 5K-token retrieval-first prompt reduces model token spend by 90.6-93.5% across the listed OpenAI, Anthropic, and xAI models.
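The two measured ratios follow directly from the ledger values further down; a quick arithmetic check:

    # Quick check of the measured claims, using the values from the review ledger.
    ttl_scanned, amvl_scanned = 3822.07, 563.48
    ttl_ask_p95_ms, amvl_ask_p95_ms = 5544.42, 1289.56

    scan_reduction = 1 - amvl_scanned / ttl_scanned      # ~0.853 -> 85.3% fewer candidates
    latency_speedup = ttl_ask_p95_ms / amvl_ask_p95_ms   # ~4.30x lower ask p95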

Caveats to keep

  • The AMV-L run is a local deterministic benchmark, not a hosted production SLA.
  • Token costs exclude vector storage, embedding/indexing, provider tool calls, prompt caching, batch discounts, long-context surcharges, regional processing uplifts, and negotiated enterprise pricing.
  • If the answer genuinely needs the entire corpus every time, retrieval-first reduces less. The advantage is strongest when most questions need a small, relevant slice of a larger knowledge base.

Technical review ledger

Claim type, basis, and boundary

This section separates measured benchmark claims from modeled cost claims and architectural product claims so reviewers can verify the evidence path.

Claim: AMV-L reduces retrieval candidate scanning by 85.3% versus TTL.
  Type: Measured
  Basis: Average memory_candidates.vector_search_scanned_count: TTL 3,822.07, AMV-L 563.48, across 20,000 candidate events per policy.
  Boundary: Local deterministic workload, seed 1337, fallback embeddings enabled, not a hosted production SLA.

Claim: AMV-L lowers /v1/ask p95 latency by 4.30x versus TTL.
  Type: Measured
  Basis: TTL ask p95 5,544.42 ms divided by AMV-L ask p95 1,289.56 ms in the same replay.
  Boundary: Measured on the local runtime profile and workload mix used by scripts/run_ttl_amvl_eval.sh.

Claim: Retrieval-first prompting cuts model token spend by 90.6-93.5% in the default scenario.
  Type: Modeled
  Basis: Official standard token prices, 50,000 monthly answers, 100K full-context input tokens, 5K retrieval input tokens, 800 output tokens.
  Boundary: Model-only token cost; excludes cache discounts, batch discounts, vector operations, tool calls, regional uplifts, and negotiated pricing.

Claim: Memory is a cleaner production boundary than DIY prompt and vector plumbing.
  Type: Architectural
  Basis: Memory APIs cover create, sources, sync, write, search, ask, boolean_ask, chat, code, status, and actions in the local docs.
  Boundary: Architecture claim; production quality still depends on source quality, chunking, retrieval tuning, evaluation, and access policy.

Sources

Data and pricing references