DeltaStore TSC
Solstice AI Studio  ·  KV Cache Compression for Production LLM Serving
Justin Meister
Founder & Chief Architect
[email protected]
solsticestudio.ai/tsc_papers
Every LLM inference provider pays for GPU memory that's mostly redundant. The KV cache — the per-user memory that grows with context length — is 80% shared across users serving the same model. We compress the whole fleet, not just one user. Result: 10.6× proven single-snapshot, 113× at serving level, $1.35M/yr saved at 10K users.
10.6×
Single Snapshot
113×
Serving Level
$1.35M
Annual Savings (10K)
3 → 4
GPUs: 2K → 10K Users
80%
Top-1 Match Rate
Single-Snapshot Compression — Proven on TinyLlama-1.1B

Validated on WikiText-2 with real model inference. Not synthetic benchmarks. Two orthogonal techniques that multiply — eviction reduces sequence length, quantization reduces bit width.

Technique                        | Ratio | Quality
---------------------------------|-------|----------------------
Rotated K4/V2 quantization only  | 6.4×  | KL=0.01
+ PreserveEarly eviction (30%)   | 7.1×  | KL=0.12 (acceptable)
+ PreserveEarly eviction (10%)   | 10.6× | KL=0.49 (marginal)
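To make the eviction half of the table concrete, here is a minimal sketch of attention-aware token eviction in the H2O/Scissorhands style with the PreserveEarly idea layered on top. The function name and the exact schedule (a hard exemption for the first 20% of layers) are our illustrative assumptions, not the published implementation.

```python
import numpy as np

def preserve_early_evict(attn_scores, keep_frac, n_layers, protect_frac=0.2):
    """Per-layer token keep-masks: evict by cumulative attention score,
    but leave the earliest layers untouched (the PreserveEarly idea).

    attn_scores:  (n_layers, seq_len) cumulative attention mass per token.
    keep_frac:    fraction of tokens kept in evictable layers.
    protect_frac: fraction of early layers exempt from eviction
                  (assumed schedule; the paper's exact rule may differ).
    """
    n_protected = int(np.ceil(protect_frac * n_layers))
    seq_len = attn_scores.shape[1]
    n_keep = max(1, int(keep_frac * seq_len))
    masks = np.ones((n_layers, seq_len), dtype=bool)
    for layer in range(n_protected, n_layers):
        # Keep only the top-n_keep tokens by accumulated attention score.
        top = np.argsort(attn_scores[layer])[-n_keep:]
        mask = np.zeros(seq_len, dtype=bool)
        mask[top] = True
        masks[layer] = mask
    return masks

# Toy run: 10 layers, 100 tokens, keep 30% in evictable layers.
rng = np.random.default_rng(0)
scores = rng.random((10, 100))
masks = preserve_early_evict(scores, keep_frac=0.30, n_layers=10)
print(masks[:2].all())   # first 20% of layers keep every token -> True
print(masks[5].sum())    # evictable layers keep 30 of 100 tokens
```

Protecting early layers matters because their KV entries feed every later layer's attention; evicting there compounds error through the whole stack.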

Google TurboQuant = 6× lossless. Our 6.4× quant-only baseline matches. The additional 66% comes from attention-aware token eviction (H2O/Scissorhands family) with a novel PreserveEarly schedule that protects the first 20% of layers.
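The quantization baseline can be sketched in a few lines: an orthonormal Hadamard rotation spreads outliers across the head dimension, then per-token symmetric quantization stores keys at 4 bits and values at 2 bits. This is a generic QuaRot/TurboQuant-style sketch under assumed shapes, not DeltaStore's actual kernel.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def quantize_sym(x, bits):
    # Per-row (per-token) symmetric quantization to signed integers.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q, scale

def rotate_quant_kv(K, V, k_bits=4, v_bits=2):
    # Rotate along the head dimension, then quantize at K4/V2.
    H = hadamard(K.shape[-1])
    Kq, ks = quantize_sym(K @ H, k_bits)
    Vq, vs = quantize_sym(V @ H, v_bits)
    return (Kq, ks), (Vq, vs), H

def dequant(q, scale, H):
    return (q * scale) @ H.T  # H orthonormal, so H^-1 = H^T

# Toy check: 128 cached tokens, head_dim 64 (assumed shapes).
rng = np.random.default_rng(0)
K = rng.standard_normal((128, 64)).astype(np.float32)
(Kq, ks), _, H = rotate_quant_kv(K, K.copy())
err = np.abs(dequant(Kq, ks, H) - K).mean()
```

K4 + V2 averages 3 bits per element against fp16's 16, which is where a ~6× quantization-only ratio comes from before scales and metadata overhead.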

Serving-Level Architecture — The Real Breakthrough

Single-snapshot compression uses known techniques. The novel contribution is compressing across the serving fleet, not just one user at a time.

Component                    | Multiplier | Notes
-----------------------------|------------|------------------------------------------------------------------
Shared prefix dedup          | 3–5×       | System prompt KV stored once, regardless of user count. Typically 60%+ of cache.
Temporal delta coding        | 2–3×       | Multi-turn conversations share a prefix; only the delta from the last turn is stored.
Rotated quantization (K4/V2) | 6.4×       | Hadamard rotation + symmetric quantization on per-user deltas.

Combined: (3–5×) prefix × (2–3×) temporal × 6.4× quant = 32–113× at the serving level. The three components compress orthogonal dimensions (who shares the cache, when it was written, and how many bits each value takes), so the multipliers stack cleanly.
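The two novel components can be sketched as a content-addressed store: the shared-prefix KV lives once under a hash of its token ids, and each conversation appends only the KV delta for its newest turn. Class and method names here are illustrative, not the product's API.

```python
import hashlib

class DeltaKVStore:
    """Sketch of cross-user prefix dedup + temporal delta coding."""

    def __init__(self):
        self.shared = {}   # prefix hash -> shared-prefix KV blob (stored once)
        self.deltas = {}   # conversation id -> list of per-turn KV deltas

    @staticmethod
    def _key(token_ids):
        # Content address: identical system prompts hash to the same key.
        return hashlib.sha256(bytes(token_ids)).hexdigest()

    def put_prefix(self, token_ids, kv_blob):
        key = self._key(token_ids)
        self.shared.setdefault(key, kv_blob)  # dedup: first writer wins
        return key

    def append_turn(self, conv_id, kv_delta):
        # Temporal delta coding: store only the KV for the new tokens,
        # never the whole re-grown cache.
        self.deltas.setdefault(conv_id, []).append(kv_delta)

    def assemble(self, conv_id, prefix_key):
        return [self.shared[prefix_key]] + self.deltas.get(conv_id, [])

store = DeltaKVStore()
sys_prompt = list(range(0, 200, 2))  # toy token ids
key = store.put_prefix(sys_prompt, "<2048-token prefix KV>")
for user in ("u1", "u2", "u3"):
    store.put_prefix(sys_prompt, "<2048-token prefix KV>")  # no extra copy
    store.append_turn(user, f"<{user} turn-1 delta>")
print(len(store.shared))  # 1: prefix stored once for all users
```

Quantization then applies only to the per-user deltas, which is why the three multipliers compound instead of overlapping.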

GPU Scaling Model — Sub-Linear Cost Growth

LLaMA-3-7B serving concurrent users on A100 80GB GPUs ($2/hr spot). Shared prefix = 2048-token system prompt. Per-user context = 4096 tokens.
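For intuition on the memory being fought over, here is back-of-envelope KV sizing for a 7B-class model with assumed dimensions (32 layers, hidden 4096, fp16). These are illustrative figures only; the table below reflects the full serving model, not this arithmetic.

```python
# Assumed 7B-class dims; a real deployment also holds weights and activations.
LAYERS, HIDDEN, BYTES = 32, 4096, 2  # fp16 = 2 bytes/element

def kv_bytes(tokens):
    return tokens * LAYERS * HIDDEN * BYTES * 2  # x2 for K and V

per_token = kv_bytes(1)              # 512 KiB of KV per cached token
user_raw = kv_bytes(4096) / 2**30    # 2.0 GiB per 4096-token user, uncompressed
prefix_raw = kv_bytes(2048) / 2**30  # 1.0 GiB for the shared 2048-token prefix
# With the prefix deduped fleet-wide and the remainder quantized at 6.4x:
user_compressed = kv_bytes(4096 - 2048) / 6.4 / 2**30
print(per_token // 1024)                 # KiB per token
print(user_raw, round(user_compressed, 3))
```

The takeaway: half of each user's uncompressed cache is the shared prefix, so dedup alone removes it from every user but one before quantization even runs.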

GPU Count Comparison — bar chart of uncompressed vs. DeltaStore GPU counts at 2K, 5K, and 10K users (same data as the table below).
Fleet Size   | Without | With TSC | Saved/Yr
-------------|---------|----------|---------
2,000 users  | 7 GPUs  | 3 GPUs   | $504K
5,000 users  | 17 GPUs | 3 GPUs   | $886K
10,000 users | 34 GPUs | 4 GPUs   | $1.35M

Why sub-linear: The shared system prompt KV cache (stored once) dominates memory. Going from 2K to 5K users = same 3 GPUs. Going to 10K adds just 1 GPU. Uncompressed scaling is linear — every user needs full KV allocation.
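The sub-linear shape falls out of a one-line memory model: fleet memory = shared_prefix + N × per_user_delta, so the shared term amortizes as N grows while the uncompressed case scales as N × full_cache. The sizes and GPU capacity below are illustrative stand-ins (using the headline 113× for the per-user delta), not the exact inputs behind the table above.

```python
import math

GPU_GB = 80          # A100 capacity (ignoring weights, for illustration)
SHARED_GB = 1.0      # 2048-token prefix KV, stored once fleet-wide
USER_RAW_GB = 2.0    # full 4096-token cache per user, uncompressed
USER_DELTA_GB = USER_RAW_GB / 113  # per-user remainder at the 113x headline

def gpus(total_gb):
    return math.ceil(total_gb / GPU_GB)

for n in (2_000, 5_000, 10_000):
    raw = gpus(n * USER_RAW_GB)                 # linear: every user pays full
    comp = gpus(SHARED_GB + n * USER_DELTA_GB)  # sub-linear: shared amortizes
    print(n, raw, comp)
```

Whatever the absolute counts, the shape is the point: the uncompressed column grows linearly with users while the compressed column barely moves.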

Cost per user: drops from $252/yr (uncompressed, 2K users) to $135/yr (compressed, 10K users). Because uncompressed scaling is linear, its per-user cost stays at roughly $252/yr at any fleet size; at 10K users DeltaStore cuts per-user GPU cost by about 46%.

Honest Framing — What's Known vs. Novel
Technique                              | Status
---------------------------------------|--------------------------------------------
Rotated symmetric quantization (K4/V2) | Known — QuaRot, TurboQuant (Google)
Attention-aware token eviction         | Known — H2O, Scissorhands (2023)
PreserveEarly eviction schedule        | Our variant — protects first 20% of layers
Cross-user KV prefix dedup             | Novel architecture
Temporal delta coding for KV           | Novel architecture
Serving-fleet compression stack        | Novel system design

TurboQuant compresses one user at a time. DeltaStore compresses the whole serving fleet. That's the difference between 6× and 113×.

What Was Explored and Eliminated

A 4-part research series tested 18 benchmark configurations; every claim is validated on real models with real data. Full methodology at solsticestudio.ai/tsc_papers.

Quality Tiers — Match Compression to Use Case
Tier         | Ratio | KL Div | Top-1 | Use Case
-------------|-------|--------|-------|----------------------------------------------------------------------
Conservative | 6.4×  | 0.01   | 98%   | Medical, legal, financial — zero tolerance for drift
Balanced     | 7.1×  | 0.12   | 90%   | Customer service, enterprise chat — small quality trade for big savings
Aggressive   | 10.6× | 0.49   | 80%   | Code completion, summarization, search — throughput over precision

All tiers stack with serving-level architecture. Conservative 6.4× snapshot becomes 32–50× at fleet level. Operators choose their quality/cost trade-off per deployment.
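A per-deployment tier choice might look like the hypothetical config below. The field names and routing rules are ours for illustration; the eviction settings just mirror the table's labels (quant-only, 30%, 10%) without asserting their internal semantics.

```python
# Hypothetical deployment config; not a shipped DeltaStore API.
TIERS = {
    "conservative": {"k_bits": 4, "v_bits": 2, "evict_pct": None},  # 6.4x, quant only
    "balanced":     {"k_bits": 4, "v_bits": 2, "evict_pct": 30},    # 7.1x
    "aggressive":   {"k_bits": 4, "v_bits": 2, "evict_pct": 10},    # 10.6x
}

def tier_for(use_case):
    # Routing follows the table's use-case column.
    if use_case in {"medical", "legal", "financial"}:
        return "conservative"
    if use_case in {"code", "summarization", "search"}:
        return "aggressive"
    return "balanced"

print(tier_for("legal"), TIERS[tier_for("legal")])
```

Because all three tiers keep the same K4/V2 quantization, switching tiers only toggles the eviction knob, so an operator can move a workload between tiers without re-encoding the shared prefix.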

10.6×
Proven Snapshot
Validated on real models. 77% better than Google TurboQuant (6×). Reproducible benchmarks published.
113×
Serving Level
Cross-user dedup + temporal deltas + quantization. Sub-linear GPU scaling means costs flatten as you grow.
$1.35M
Annual Savings
At 10K concurrent users on LLaMA-3-7B. 30 fewer GPUs. Per-user cost drops about 46% vs uncompressed ($252/yr to $135/yr).