A four-part investigation into compressing transformer KV caches — from keyframe + delta encoding to attention-aware eviction and serving-fleet architecture. From first principles to 113× serving-level compression with $1.35M annual savings at 10K users.
The field has focused on model weight compression (GPTQ, AWQ, QuIP#). Those methods are valuable, but weights are a one-time, fixed cost. The KV cache is the variable cost that grows with context length, conversation history, and concurrent users, and it is the actual bottleneck for long-context production inference.
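To make that variable cost concrete, here is a back-of-the-envelope estimate. The layer count, head count, head dimension, and context length below are illustrative values for a 7B-class decoder, not measurements from this series.

```python
# Rough KV-cache sizing for a 7B-class decoder in fp16.
# Illustrative dimensions (assumed, not taken from the series):
n_layers, n_kv_heads, head_dim, bytes_per_val = 32, 32, 128, 2

# Both keys and values are cached, hence the leading factor of 2.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
print(f"per token: {kv_bytes_per_token / 2**20:.2f} MiB")        # 0.50 MiB

# One user holding an 8K-token conversation:
context_len = 8192
kv_bytes_per_user = kv_bytes_per_token * context_len
print(f"per user:  {kv_bytes_per_user / 2**30:.2f} GiB")         # 4.00 GiB

# 2K concurrent users, uncompressed:
print(f"2K users:  {2000 * kv_bytes_per_user / 2**40:.2f} TiB")  # ~7.81 TiB
```

The weights of a 7B-parameter model are roughly 14 GB in fp16 and never grow; under these assumptions the cache overtakes them after only a handful of concurrent users.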
Weight quantization (GPTQ, AWQ, QuIP#): Compresses model weights. Requires calibration and retraining. Applied once at deployment. Operates on a fixed storage axis. State of the art for weight quantization.

TSC KV compression: Compresses the KV cache at runtime. No model modification. Works on any pretrained model. Serving-level architecture (cross-user dedup + temporal deltas) delivers sub-linear GPU scaling (sketched below).
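The cross-user dedup is the mechanism behind the sub-linear scaling claim. The sketch below shows the general idea: content-addressed KV blocks shared across users with a common prefix. The block size, hashing scheme, and class names are illustrative assumptions, not DeltaStore's actual implementation.

```python
# Minimal sketch of cross-user prefix deduplication. KV blocks are
# content-addressed by the tokens they cover (plus their full prefix), so
# users who share a system prompt or conversation prefix reference the same
# stored block instead of each holding a private copy. Block size, hashing,
# and naming here are illustrative; DeltaStore's real format may differ.
import hashlib
from typing import Dict, List, Tuple

BLOCK_TOKENS = 64  # tokens covered by one KV block (assumed)

class PrefixDedupStore:
    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}   # content hash -> stored KV bytes
        self.refcount: Dict[str, int] = {}   # how many sequences reference a block

    @staticmethod
    def _key(prefix: Tuple[int, ...], block: Tuple[int, ...]) -> str:
        # The hash covers the whole prefix up to and including this block, so two
        # blocks with identical tokens but different histories never collide.
        return hashlib.sha256(repr((prefix, block)).encode()).hexdigest()

    def put(self, token_ids: List[int], kv_blocks: List[bytes]) -> List[str]:
        """Register a sequence's KV blocks; each distinct block is stored once."""
        handles = []
        for i, kv in enumerate(kv_blocks):
            prefix = tuple(token_ids[: i * BLOCK_TOKENS])
            block = tuple(token_ids[i * BLOCK_TOKENS:(i + 1) * BLOCK_TOKENS])
            key = self._key(prefix, block)
            if key not in self.blocks:        # only the first user pays for storage
                self.blocks[key] = kv
            self.refcount[key] = self.refcount.get(key, 0) + 1
            handles.append(key)
        return handles

    def release(self, handles: List[str]) -> None:
        """Drop a sequence's references; free blocks nobody uses anymore."""
        for key in handles:
            self.refcount[key] -= 1
            if self.refcount[key] == 0:
                del self.refcount[key], self.blocks[key]
```

With a shared system prompt, every concurrent user references the same stored prefix blocks rather than duplicating them, which is where the sub-linear growth in GPU memory comes from.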
The two techniques compress different things and stack freely. A QuIP#-compressed model running with TSC KV compression gets both savings simultaneously, on independent storage axes with no interference.
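A toy memory model makes that independence explicit. The sizes and compression ratios below are placeholders chosen for round numbers, not results from the series.

```python
# GPU memory decomposes into a weights term and a KV-cache term; a weight
# quantizer scales only the first, a KV compressor only the second.
# All numbers are illustrative assumptions, not measurements.
WEIGHTS_FP16_GB = 14.0   # ~7B parameters at 2 bytes each
KV_CACHE_GB = 4.0        # one 8K-token user at fp16 (see the estimate above)

def gpu_memory_gb(weight_ratio: float, kv_ratio: float) -> float:
    """Total memory after a weight compressor and a KV compressor are applied."""
    return WEIGHTS_FP16_GB / weight_ratio + KV_CACHE_GB / kv_ratio

print(gpu_memory_gb(1, 1))    # 18.00 -- no compression
print(gpu_memory_gb(8, 1))    #  5.75 -- 2-bit weight quantization alone
print(gpu_memory_gb(1, 10))   # 14.40 -- KV compression alone
print(gpu_memory_gb(8, 10))   #  2.15 -- both: each saving lands on its own term
```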
At the serving level, a LLaMA-3-7B deployment serving 2K concurrent users requires 7 GPUs without compression versus 3 GPUs with DeltaStore, and scaling to 10K users adds only one additional GPU thanks to sub-linear cost growth from shared-prefix deduplication. Annual savings: $504K at 2K users, $1.35M at 10K users.
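For readers who want to sanity-check the dollar figures, the calculation below reproduces the 2K-user number from the GPU counts quoted above. The flat $14.40-per-GPU-hour rate is an assumption chosen to be consistent with the quoted $504K; the series' exact cost model is not restated here.

```python
# Back-of-the-envelope annual savings from retiring GPUs at a flat hourly rate.
# The rate is an assumption, not a figure from the series.
HOURS_PER_YEAR = 24 * 365        # 8,760
GPU_HOURLY_USD = 14.40           # assumed blended rate (hardware, power, ops)

def annual_savings_usd(gpus_without: int, gpus_with: int,
                       rate: float = GPU_HOURLY_USD) -> float:
    return (gpus_without - gpus_with) * rate * HOURS_PER_YEAR

# 2K concurrent users: 7 GPUs uncompressed vs. 3 with DeltaStore.
print(f"${annual_savings_usd(7, 3):,.0f}")   # $504,576 -- in line with the quoted $504K
```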