DeltaStore TSC
Solstice AI Studio  ·  KV Cache Compression for Production LLM Serving
Justin Meister
Founder & Chief Architect
[email protected]
solsticestudio.ai/tsc_papers
Every LLM inference provider pays for GPU memory that's mostly redundant. The KV cache — the per-user memory that grows with context length — is 80% shared across users serving the same model. We compress the whole fleet, not just one user. Result: 10.6× proven single-snapshot, 113× at serving level, $1.35M/yr saved at 10K users.
10.6×
Single Snapshot
113×
Serving Level
$1.35M
Annual Savings (10K)
3 → 4
GPUs: 2K → 10K Users
80%
Top-1 Match Rate
Single-Snapshot Compression — Proven on TinyLlama-1.1B

Validated on WikiText-2 with real model inference. Not synthetic benchmarks. Two orthogonal techniques that multiply — eviction reduces sequence length, quantization reduces bit width.

Technique                        | Ratio | Quality
---------------------------------|-------|----------------------
Rotated K4/V2 quantization only  | 6.4×  | KL=0.01
+ PreserveEarly eviction (30%)   | 7.1×  | KL=0.12 (acceptable)
+ PreserveEarly eviction (10%)   | 10.6× | KL=0.49 (marginal)
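To make the eviction half of the table concrete, here is a minimal sketch of attention-aware token eviction in the H2O/Scissorhands style with the PreserveEarly idea layered on top. The function name and the exact schedule (a hard exemption for the first 20% of layers) are our illustrative assumptions, not the published implementation.

```python
import numpy as np

def preserve_early_evict(attn_scores, keep_frac, n_layers, protect_frac=0.2):
    """Per-layer token keep-masks: evict by cumulative attention score,
    but leave the earliest layers untouched (the PreserveEarly idea).

    attn_scores:  (n_layers, seq_len) cumulative attention mass per token.
    keep_frac:    fraction of tokens kept in evictable layers.
    protect_frac: fraction of early layers exempt from eviction
                  (assumed schedule; the paper's exact rule may differ).
    """
    n_protected = int(np.ceil(protect_frac * n_layers))
    seq_len = attn_scores.shape[1]
    n_keep = max(1, int(keep_frac * seq_len))
    masks = np.ones((n_layers, seq_len), dtype=bool)
    for layer in range(n_protected, n_layers):
        # Keep only the top-n_keep tokens by accumulated attention score.
        top = np.argsort(attn_scores[layer])[-n_keep:]
        mask = np.zeros(seq_len, dtype=bool)
        mask[top] = True
        masks[layer] = mask
    return masks

# Toy run: 10 layers, 100 tokens, keep 30% in evictable layers.
rng = np.random.default_rng(0)
scores = rng.random((10, 100))
masks = preserve_early_evict(scores, keep_frac=0.30, n_layers=10)
print(masks[:2].all())   # first 20% of layers keep every token -> True
print(masks[5].sum())    # evictable layers keep 30 of 100 tokens
```

Protecting early layers matters because their KV entries feed every later layer's attention; evicting there compounds error through the whole stack.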

Google TurboQuant = 6× lossless. Our 6.4× quant-only baseline matches. The additional 66% comes from attention-aware token eviction (H2O/Scissorhands family) with a novel PreserveEarly schedule that protects the first 20% of layers.
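The quantization baseline can be sketched in a few lines: an orthonormal Hadamard rotation spreads outliers across the head dimension, then per-token symmetric quantization stores keys at 4 bits and values at 2 bits. This is a generic QuaRot/TurboQuant-style sketch under assumed shapes, not DeltaStore's actual kernel.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def quantize_sym(x, bits):
    # Per-row (per-token) symmetric quantization to signed integers.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q, scale

def rotate_quant_kv(K, V, k_bits=4, v_bits=2):
    # Rotate along the head dimension, then quantize at K4/V2.
    H = hadamard(K.shape[-1])
    Kq, ks = quantize_sym(K @ H, k_bits)
    Vq, vs = quantize_sym(V @ H, v_bits)
    return (Kq, ks), (Vq, vs), H

def dequant(q, scale, H):
    return (q * scale) @ H.T  # H orthonormal, so H^-1 = H^T

# Toy check: 128 cached tokens, head_dim 64 (assumed shapes).
rng = np.random.default_rng(0)
K = rng.standard_normal((128, 64)).astype(np.float32)
(Kq, ks), _, H = rotate_quant_kv(K, K.copy())
err = np.abs(dequant(Kq, ks, H) - K).mean()
```

K4 + V2 averages 3 bits per element against fp16's 16, which is where a ~6× quantization-only ratio comes from before scales and metadata overhead.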

Serving-Level Architecture — The Real Breakthrough

Single-snapshot compression uses known techniques. The novel contribution is compressing across the serving fleet, not just one user at a time.

Component                    | Multiplier | Notes
-----------------------------|------------|------------------------------------------------------------------
Shared prefix dedup          | 3–5×       | System prompt KV stored once, regardless of user count. Typically 60%+ of cache.
Temporal delta coding        | 2–3×       | Multi-turn conversations share a prefix; only the delta from the last turn is stored.
Rotated quantization (K4/V2) | 6.4×       | Hadamard rotation + symmetric quantization on per-user deltas.

Combined: (3–5×) prefix × (2–3×) temporal × 6.4× quant = 32–113× at the serving level. The three components compress orthogonal dimensions (who shares the cache, when it was written, and how many bits each value takes), so the multipliers stack cleanly.
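The two novel components can be sketched as a content-addressed store: the shared-prefix KV lives once under a hash of its token ids, and each conversation appends only the KV delta for its newest turn. Class and method names here are illustrative, not the product's API.

```python
import hashlib

class DeltaKVStore:
    """Sketch of cross-user prefix dedup + temporal delta coding."""

    def __init__(self):
        self.shared = {}   # prefix hash -> shared-prefix KV blob (stored once)
        self.deltas = {}   # conversation id -> list of per-turn KV deltas

    @staticmethod
    def _key(token_ids):
        # Content address: identical system prompts hash to the same key.
        return hashlib.sha256(bytes(token_ids)).hexdigest()

    def put_prefix(self, token_ids, kv_blob):
        key = self._key(token_ids)
        self.shared.setdefault(key, kv_blob)  # dedup: first writer wins
        return key

    def append_turn(self, conv_id, kv_delta):
        # Temporal delta coding: store only the KV for the new tokens,
        # never the whole re-grown cache.
        self.deltas.setdefault(conv_id, []).append(kv_delta)

    def assemble(self, conv_id, prefix_key):
        return [self.shared[prefix_key]] + self.deltas.get(conv_id, [])

store = DeltaKVStore()
sys_prompt = list(range(0, 200, 2))  # toy token ids
key = store.put_prefix(sys_prompt, "<2048-token prefix KV>")
for user in ("u1", "u2", "u3"):
    store.put_prefix(sys_prompt, "<2048-token prefix KV>")  # no extra copy
    store.append_turn(user, f"<{user} turn-1 delta>")
print(len(store.shared))  # 1: prefix stored once for all users
```

Quantization then applies only to the per-user deltas, which is why the three multipliers compound instead of overlapping.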

GPU Scaling Model — Sub-Linear Cost Growth

LLaMA-3-7B serving concurrent users on A100 80GB GPUs ($2/hr spot). Shared prefix = 2048-token system prompt. Per-user context = 4096 tokens.
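For intuition on the memory being fought over, here is back-of-envelope KV sizing for a 7B-class model with assumed dimensions (32 layers, hidden 4096, fp16). These are illustrative figures only; the table below reflects the full serving model, not this arithmetic.

```python
# Assumed 7B-class dims; a real deployment also holds weights and activations.
LAYERS, HIDDEN, BYTES = 32, 4096, 2  # fp16 = 2 bytes/element

def kv_bytes(tokens):
    return tokens * LAYERS * HIDDEN * BYTES * 2  # x2 for K and V

per_token = kv_bytes(1)              # 512 KiB of KV per cached token
user_raw = kv_bytes(4096) / 2**30    # 2.0 GiB per 4096-token user, uncompressed
prefix_raw = kv_bytes(2048) / 2**30  # 1.0 GiB for the shared 2048-token prefix
# With the prefix deduped fleet-wide and the remainder quantized at 6.4x:
user_compressed = kv_bytes(4096 - 2048) / 6.4 / 2**30
print(per_token // 1024)                 # KiB per token
print(user_raw, round(user_compressed, 3))
```

The takeaway: half of each user's uncompressed cache is the shared prefix, so dedup alone removes it from every user but one before quantization even runs.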

GPU Count Comparison — bar chart of uncompressed vs. DeltaStore GPU counts at 2K, 5K, and 10K users (same data as the table below).
Fleet Size   | Without | With TSC | Saved/Yr
-------------|---------|----------|---------
2,000 users  | 7 GPUs  | 3 GPUs   | $504K
5,000 users  | 17 GPUs | 3 GPUs   | $886K
10,000 users | 34 GPUs | 4 GPUs   | $1.35M

Why sub-linear: The shared system prompt KV cache (stored once) dominates memory. Going from 2K to 5K users = same 3 GPUs. Going to 10K adds just 1 GPU. Uncompressed scaling is linear — every user needs full KV allocation.
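The sub-linear shape falls out of a one-line memory model: fleet memory = shared_prefix + N × per_user_delta, so the shared term amortizes as N grows while the uncompressed case scales as N × full_cache. The sizes and GPU capacity below are illustrative stand-ins (using the headline 113× for the per-user delta), not the exact inputs behind the table above.

```python
import math

GPU_GB = 80          # A100 capacity (ignoring weights, for illustration)
SHARED_GB = 1.0      # 2048-token prefix KV, stored once fleet-wide
USER_RAW_GB = 2.0    # full 4096-token cache per user, uncompressed
USER_DELTA_GB = USER_RAW_GB / 113  # per-user remainder at the 113x headline

def gpus(total_gb):
    return math.ceil(total_gb / GPU_GB)

for n in (2_000, 5_000, 10_000):
    raw = gpus(n * USER_RAW_GB)                 # linear: every user pays full
    comp = gpus(SHARED_GB + n * USER_DELTA_GB)  # sub-linear: shared amortizes
    print(n, raw, comp)
```

Whatever the absolute counts, the shape is the point: the uncompressed column grows linearly with users while the compressed column barely moves.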

Cost per user: drops from $252/yr (uncompressed, 2K users) to $135/yr (compressed, 10K users). Because uncompressed scaling is linear, its per-user cost stays at roughly $252/yr at any fleet size; at 10K users DeltaStore cuts per-user GPU cost by about 46%.

Honest Framing — What's Known vs. Novel
Technique                              | Status
---------------------------------------|--------------------------------------------
Rotated symmetric quantization (K4/V2) | Known — QuaRot, TurboQuant (Google)
Attention-aware token eviction         | Known — H2O, Scissorhands (2023)
PreserveEarly eviction schedule        | Our variant — protects first 20% of layers
Cross-user KV prefix dedup             | Novel architecture
Temporal delta coding for KV           | Novel architecture
Serving-fleet compression stack        | Novel system design

TurboQuant compresses one user at a time. DeltaStore compresses the whole serving fleet. That's the difference between 6× and 113×.

What Was Explored and Eliminated

A 4-part research series tested 18 benchmark configurations; every claim is validated on real models with real data. Full methodology at solsticestudio.ai/tsc_papers.

Quality Tiers — Match Compression to Use Case
Tier         | Ratio | KL Div | Top-1 | Use Case
-------------|-------|--------|-------|----------------------------------------------------------------------
Conservative | 6.4×  | 0.01   | 98%   | Medical, legal, financial — zero tolerance for drift
Balanced     | 7.1×  | 0.12   | 90%   | Customer service, enterprise chat — small quality trade for big savings
Aggressive   | 10.6× | 0.49   | 80%   | Code completion, summarization, search — throughput over precision

All tiers stack with serving-level architecture. Conservative 6.4× snapshot becomes 32–50× at fleet level. Operators choose their quality/cost trade-off per deployment.
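A per-deployment tier choice might look like the hypothetical config below. The field names and routing rules are ours for illustration; the eviction settings just mirror the table's labels (quant-only, 30%, 10%) without asserting their internal semantics.

```python
# Hypothetical deployment config; not a shipped DeltaStore API.
TIERS = {
    "conservative": {"k_bits": 4, "v_bits": 2, "evict_pct": None},  # 6.4x, quant only
    "balanced":     {"k_bits": 4, "v_bits": 2, "evict_pct": 30},    # 7.1x
    "aggressive":   {"k_bits": 4, "v_bits": 2, "evict_pct": 10},    # 10.6x
}

def tier_for(use_case):
    # Routing follows the table's use-case column.
    if use_case in {"medical", "legal", "financial"}:
        return "conservative"
    if use_case in {"code", "summarization", "search"}:
        return "aggressive"
    return "balanced"

print(tier_for("legal"), TIERS[tier_for("legal")])
```

Because all three tiers keep the same K4/V2 quantization, switching tiers only toggles the eviction knob, so an operator can move a workload between tiers without re-encoding the shared prefix.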

10.6×
Proven Snapshot
Validated on real models. 77% better than Google TurboQuant (6×). Reproducible benchmarks published.
113×
Serving Level
Cross-user dedup + temporal deltas + quantization. Sub-linear GPU scaling means costs flatten as you grow.
$1.35M
Annual Savings
At 10K concurrent users on LLaMA-3-7B. 30 fewer GPUs. Per-user cost drops about 46% vs uncompressed ($252/yr to $135/yr).