Validated on WikiText-2 with real model inference, not synthetic benchmarks. Two orthogonal techniques that multiply: eviction reduces sequence length, quantization reduces bit width.
| Technique | Compression ratio | Quality (KL divergence) |
|---|---|---|
| Rotated K4/V2 quantization only | 6.4× | KL=0.01 |
| + PreserveEarly eviction (30%) | 7.1× | KL=0.12 (acceptable) |
| + PreserveEarly eviction (10%) | 10.6× | KL=0.49 (marginal) |
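For concreteness, here is a minimal NumPy sketch of what "rotated K4/V2" means mechanically: an orthonormal Hadamard rotation along the head dimension to spread outliers, followed by per-token symmetric quantization at 4 bits for keys and 2 bits for values. This is a simplified illustration of the QuaRot-style recipe, not our production kernel; the function names and grouping are ours.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def quantize_symmetric(x: np.ndarray, bits: int):
    """Per-token symmetric quantization: one scale per row of x."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Toy per-head KV slices: (tokens, head_dim)
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 128)).astype(np.float32)
V = rng.standard_normal((16, 128)).astype(np.float32)

H = hadamard(128)                      # rotation spreads outliers across channels
Kq, Ks = quantize_symmetric(K @ H, 4)  # keys at 4 bits
Vq, Vs = quantize_symmetric(V @ H, 2)  # values at 2 bits

# Because H is orthogonal, attention dot products survive if Q is rotated too.
K_hat = (Kq * Ks) @ H.T
print("mean |K - K_hat|:", float(np.abs(K - K_hat).mean()))
```

Per-token scales keep dequantization cheap at attention time; a production kernel would also pack the 4- and 2-bit codes, which this sketch skips.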
Google's TurboQuant reports ~6× lossless; our 6.4× quant-only baseline matches it. The additional 66% of compression (6.4× → 10.6×) comes from attention-aware token eviction (the H2O/Scissorhands family) with a novel PreserveEarly schedule that protects the first 20% of layers, sketched below.
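A sketch of how a PreserveEarly schedule could sit on top of H2O-style heavy-hitter eviction. The function name, signature, and the keep-ratio/recency-window parameters are illustrative assumptions, not our exact API; the one load-bearing idea is the early-layer guard.

```python
import numpy as np

def preserve_early_keep(attn_mass, layer_idx, num_layers,
                        keep_ratio=0.3, preserve_frac=0.2, recent=8):
    """Pick which cached tokens survive eviction at one layer.

    attn_mass: (num_tokens,) accumulated attention each cached token has
               received (the H2O/Scissorhands heavy-hitter statistic).
    Layers in the first `preserve_frac` of the stack keep every token.
    """
    n = len(attn_mass)
    if layer_idx < int(preserve_frac * num_layers):
        return np.arange(n)                       # PreserveEarly: no eviction here
    budget = max(int(keep_ratio * n), recent)
    keep = set(range(max(0, n - recent), n))      # always keep a recency window
    for idx in np.argsort(attn_mass)[::-1]:       # then the heaviest hitters
        if len(keep) >= budget:
            break
        keep.add(int(idx))
    return np.array(sorted(keep))
```

Tightening the eviction budget is what moves a deployment between the 7.1× and 10.6× rows of the table above.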
Single-snapshot compression uses known techniques. The novel contribution is compressing across the serving fleet, not just one user at a time.
| Component | Multiplier | Mechanism |
|---|---|---|
| Shared prefix dedup | 3–5× | System prompt KV stored once, regardless of user count. Typically 60%+ of cache. |
| Temporal delta coding | 2–3× | Multi-turn conversations share a prefix; only the delta from the last turn is stored. |
| Rotated quantization (K4/V2) | 6.4× | Hadamard rotation + symmetric quantization on per-user deltas. |
Combined: 3–5× prefix × 2–3× temporal × 6.4× quant = 32–113× at the serving level, depending on workload mix. These compress orthogonal dimensions and stack cleanly.
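A minimal sketch of how the three layers could compose at serving time. The class and method names are ours for illustration, not DeltaStore's actual interface; the compressor slot is where the rotated K4/V2 quantizer from above would plug in.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class FleetKVStore:
    """Fleet-level KV storage: shared prefixes deduped by content hash,
    per-user turns stored as compressed deltas."""
    compress: Callable[[Any], Any] = lambda kv: kv   # e.g. rotated K4/V2 quantizer
    prefixes: dict = field(default_factory=dict)     # prefix hash -> prefix KV
    turns: dict = field(default_factory=dict)        # user_id -> [turn deltas]

    def put_prefix(self, prefix_tokens: bytes, prefix_kv: Any) -> str:
        """Store the system-prompt KV once; every user reuses it by hash."""
        h = hashlib.sha256(prefix_tokens).hexdigest()
        if h not in self.prefixes:                   # dedup across the fleet
            self.prefixes[h] = self.compress(prefix_kv)
        return h

    def append_turn(self, user_id: str, turn_kv: Any) -> None:
        """Temporal delta coding: store only the KV the new turn added."""
        self.turns.setdefault(user_id, []).append(self.compress(turn_kv))

    def materialize(self, user_id: str, prefix_hash: str) -> list:
        """Reassemble one user's cache: shared prefix + their turn deltas."""
        return [self.prefixes[prefix_hash]] + self.turns.get(user_id, [])
```

The sub-linear GPU scaling below falls out of `put_prefix`: the dominant shared block is written once, no matter how many users arrive.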
LLaMA-3-8B serving concurrent users on A100 80GB GPUs ($2/hr spot). Shared prefix = 2048-token system prompt. Per-user context = 4096 tokens.
| Fleet size | Without TSC | With TSC | Saved/yr |
|---|---|---|---|
| 2,000 users | 7 GPUs | 3 GPUs | $504K |
| 5,000 users | 17 GPUs | 3 GPUs | $886K |
| 10,000 users | 34 GPUs | 4 GPUs | $1.35M |
Why sub-linear: the shared system-prompt KV cache (stored once) dominates memory, and per-user deltas are small once quantized. Going from 2K to 5K users needs the same 3 GPUs; going to 10K adds just one more. Uncompressed scaling is linear, because every user needs a full KV allocation.
Cost per user: drops from $252/yr (uncompressed, 2K) to $135/yr (compressed, 10K). Since uncompressed scaling is linear, its per-user cost stays flat at $252/yr at any fleet size, so at 10K users DeltaStore cuts per-user GPU cost by 46%.
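The arithmetic behind those GPU counts, as a back-of-envelope calculator. Every parameter here is an assumption for illustration (per-token KV bytes, effective temporal ratio); it shows the shape of the scaling rather than reproducing the table exactly, since it ignores batching, paging, and weight memory.

```python
def fleet_kv_gb(users: int, prefix_tokens: int = 2048, ctx_tokens: int = 4096,
                kv_bytes_per_token: int = 128 * 1024,   # assumed fp16 footprint
                quant: float = 6.4, temporal: float = 2.5) -> float:
    """Fleet-wide KV footprint in GB with the full stack applied."""
    shared = prefix_tokens * kv_bytes_per_token / quant          # stored once
    per_user = ((ctx_tokens - prefix_tokens) * kv_bytes_per_token
                / (quant * temporal))                            # delta per user
    return (shared + users * per_user) / 1e9

for n in (2_000, 5_000, 10_000):
    uncompressed = n * 4096 * 128 * 1024 / 1e9   # every user gets a full cache
    print(f"{n:>6} users: {uncompressed:6.0f} GB uncompressed "
          f"-> {fleet_kv_gb(n):4.0f} GB with the stack")
```

Dividing each column by the usable HBM per GPU turns these footprints into GPU counts; the compressed column grows slowly because only the small per-user delta scales with fleet size.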
| Technique | Status |
|---|---|
| Rotated symmetric quantization (K4/V2) | Known — QuaRot, TurboQuant (Google) |
| Attention-aware token eviction | Known — H2O, Scissorhands (2023) |
| PreserveEarly eviction schedule | Our variant — protects first 20% of layers |
| Cross-user KV prefix dedup | Novel architecture |
| Temporal delta coding for KV | Novel architecture |
| Serving-fleet compression stack | Novel system design |
TurboQuant compresses one user at a time. DeltaStore compresses the whole serving fleet. That's the difference between 6× and 113×.
4-part research series with 18 benchmark configurations tested. Every claim is validated on real models with real data. Full methodology at solsticestudio.ai/tsc_papers.
| Tier | Compression ratio | KL div. | Top-1 agreement | Use case |
|---|---|---|---|---|
| Conservative | 6.4× | 0.01 | 98% | Medical, legal, financial — zero tolerance for drift |
| Balanced | 7.1× | 0.12 | 90% | Customer service, enterprise chat — small quality trade for big savings |
| Aggressive | 10.6× | 0.49 | 80% | Code completion, summarization, search — throughput over precision |
All tiers stack with serving-level architecture. Conservative 6.4× snapshot becomes 32–50× at fleet level. Operators choose their quality/cost trade-off per deployment.
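Operationally, the tiers reduce to a small preset table. A hypothetical sketch of how a deployment might encode them (the field names and config shape are ours, not a shipped API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    key_bits: int        # K quantization width
    value_bits: int      # V quantization width
    eviction_pct: int    # PreserveEarly setting from the table above (0 = off)

TIERS = {
    "conservative": TierConfig(key_bits=4, value_bits=2, eviction_pct=0),
    "balanced":     TierConfig(key_bits=4, value_bits=2, eviction_pct=30),
    "aggressive":   TierConfig(key_bits=4, value_bits=2, eviction_pct=10),
}

def pick_tier(workload: str) -> TierConfig:
    """Map a workload class to a preset; regulated domains stay conservative."""
    regulated = {"medical", "legal", "financial"}
    return TIERS["conservative"] if workload in regulated else TIERS["balanced"]
```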