TECHNICAL REPORT SERIES · 2026

Temporal State Compression
Research Series

A four-part investigation into compressing transformer KV caches, from keyframe + delta encoding to attention-aware eviction and serving-fleet architecture, plus two companion reports on energy impact and production-scale validation. From first principles to 113× serving-level compression and $1.35M in annual savings at 10K users.

113× serving-level compression · 10.6× single-snapshot proven · $1.35M annual savings at 10K users · 6 papers in series
PART I · March 2026
Temporal State Compression for Transformer KV Caches:
63× Lossy Compression with Zero Measurable Quality Loss
Introduces TSC and the core prefer_append_for_growing insight. Fixes the root bug (max_delta_error calibration) that caused every step to fall back to keyframes. Validates 7.98× lossless and 63× lossy on GPT-2 across WikiText-2, with top-1=1.0000 and ppl_delta=−0.03.
7.98× lossless · 63× lossy · top-1 = 1.0000 · ppl_delta = −0.03 · 10.5× over QuIP# · seq 128–1024 validated
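
To make the keyframe + delta mechanism concrete, the sketch below shows the core decision in NumPy: append-only growth is stored as just the new rows, small in-place changes as a sparse delta, and anything else as a fresh keyframe. This is a minimal illustration, not the paper's implementation; the function name, the 2-D per-layer view of K or V, and the max_delta_error default are all assumptions.

```python
import numpy as np

def encode_kv_step(prev_kv: np.ndarray, new_kv: np.ndarray,
                   max_delta_error: float = 1e-3):
    """Illustrative keyframe-or-delta decision for one KV cache update.

    prev_kv: reconstructed tensor from the previous step, shape (seq_old, dim)
    new_kv:  tensor after the current step,               shape (seq_new, dim)
    """
    seq_old = prev_kv.shape[0]

    # prefer_append_for_growing: if the cache only grew, store just the
    # appended rows instead of re-encoding the whole tensor.
    if new_kv.shape[0] > seq_old and np.allclose(new_kv[:seq_old], prev_kv,
                                                 atol=max_delta_error):
        return "append", new_kv[seq_old:].astype(np.float16)

    # Same length: try a sparse delta against the previous reconstruction.
    if new_kv.shape[0] == seq_old:
        delta = new_kv - prev_kv
        if np.abs(delta).max() <= max_delta_error:
            return "skip", None                  # below tolerance, store nothing
        mask = np.abs(delta) > max_delta_error
        return "delta", (np.nonzero(mask), delta[mask].astype(np.float16))

    # Anything else (shrinkage, reordering) falls back to a full keyframe.
    return "keyframe", new_kv.astype(np.float16)
```

A mis-calibrated max_delta_error, the Part I root bug, makes the append and delta branches unreachable, so every step degenerates to a full keyframe.
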
PART II · March 2026
Beyond 84×: A Vector Quantization Quality Framework
for Transformer KV Cache Compression
Investigates stacking VQ on TSC keyframes to reach extreme ratios. Identifies root cause of 1034× being unusable (per-head KV magnitude distributions). Builds a three-stage quality framework: per-head normalization, selective fp16 fallback, and Product Quantization. Establishes 84× (TSC + H2O eviction) and 219× (PQ) as defensible operating points.
84× + eviction · 219× PQ (KL=0.05) · 1034× archival · KV magnitude root cause · 14× over QuIP# · H2O eviction at 75% retention
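
The per-head magnitude issue and its fix can be sketched in a few lines: normalize each head by its own scale before vector quantization, and keep the rare extreme-scale heads in fp16 instead of quantizing them. A minimal sketch under assumptions; the 4× median cutoff, the tensor layout, and the function name are illustrative rather than the framework's actual criteria.

```python
import numpy as np

def normalize_per_head(kv: np.ndarray):
    """Illustrative per-head normalization with selective fp16 fallback.

    kv: keyframe tensor for one layer, shape (num_heads, seq_len, head_dim).
    Returns heads ready for VQ/PQ, fp16-fallback heads, scales, and the mask.
    """
    # Per-head scale: heads with very different magnitudes would otherwise
    # share one codebook and dominate its quantization error.
    scales = np.abs(kv).max(axis=(1, 2), keepdims=True) + 1e-8
    normalized = kv / scales                       # every head now in [-1, 1]

    # Heads whose scale is far above the median stay in fp16 (assumed 4x cutoff).
    head_scales = scales.squeeze()
    fallback = head_scales > 4.0 * np.median(head_scales)

    vq_input = normalized[~fallback]               # sent to the PQ codebooks
    fp16_heads = kv[fallback].astype(np.float16)   # kept at full precision
    return vq_input, fp16_heads, scales, fallback
```
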
PART III · March 2026
TSC at Scale: Modern Architecture Validation,
the Cross-Layer Weight Question, and a Definitive
Production Compression Hierarchy
Extends validation to TinyLlama-1.1B (RoPE + RMSNorm + GQA — the Gemma/LLaMA-3/Qwen-2 architecture class) at seq=1024. The selective VQ tier reaches 125× at KL=0.057 with top-1=1.000 on modern architecture. Also closes the weight compression question: a systematic probe finds inter-layer weight delta/weight ratios of 1.25–1.44×, conclusively ruling out cross-layer delta compression. Closes with the definitive production tier hierarchy and the complete TSC vs QuIP# comparison.
125× at KL=0.057 · top-1 = 1.000 · seq=1024 validated · TinyLlama-1.1B (Gemma class) · 20.8× over QuIP# · Weight delta: negative result · DynamicCache portability
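
The cross-layer weight question can be probed in a few lines: compare each layer's projection weights to the next layer's and take the norm ratio. The sketch below assumes a Hugging Face checkpoint (the TinyLlama chat checkpoint id and module names are assumptions based on the LLaMA layer layout, not the report's exact probe); the report finds ratios of 1.25–1.44× with a probe of this kind, which is why inter-layer delta coding of weights cannot win.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative cross-layer weight-delta probe: ratios near or above 1.0 mean
# consecutive layers' weights differ as much as the weights themselves, so
# delta coding across layers has nothing to compress.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16)
layers = model.model.layers

for name in ["self_attn.q_proj", "self_attn.k_proj", "mlp.gate_proj"]:
    ratios = []
    for i in range(len(layers) - 1):
        w_cur = layers[i].get_submodule(name).weight.float()
        w_next = layers[i + 1].get_submodule(name).weight.float()
        ratios.append((torch.norm(w_next - w_cur) / torch.norm(w_cur)).item())
    print(f"{name}: delta/weight ratio min={min(ratios):.2f} max={max(ratios):.2f}")
```
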
PART IV · April 2026
The Honest Audit: From 63× Claims to 10.6× Proven,
and Why Serving-Level Architecture Changes Everything
A rigorous audit of Parts I–III compression claims using real model validation (TinyLlama-1.1B on WikiText-2). Proves 10.6× single-snapshot via attention-aware eviction + rotated K4/V2 quantization. Rules out SVD factorization (error compounding) and global importance (per-layer dominates). Reframes the contribution: the novel value is serving-fleet architecture — cross-user KV dedup + temporal delta coding yields 32–113× with sub-linear GPU scaling and $1.35M annual savings at 10K users.
10.6× snapshot · 113× serving · $1.35M savings (10K) · Sub-linear GPU scaling · SVD ruled out · Honest framing · PreserveEarly schedule
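
The serving-level mechanism is easiest to see as a content-addressed block store: KV blocks are keyed by the token ids that produced them, so the system prompt and other shared prefixes are stored once across the fleet rather than once per user. The class below is a hypothetical sketch (name, block size, and hashing scheme are assumptions) showing the dedup idea, not DeltaStore's actual design.

```python
import hashlib
import numpy as np

class PrefixDedupStore:
    """Hypothetical content-addressed store for KV cache blocks."""

    def __init__(self, block_tokens: int = 64):
        self.block_tokens = block_tokens   # positions per stored block
        self.blocks = {}                   # digest -> stored KV block
        self.refcount = {}                 # how many users reference each block

    def put(self, token_ids: list[int], kv_block: np.ndarray) -> str:
        # Key the block on the tokens that produced it: identical prefixes
        # across users collapse to a single stored block.
        digest = hashlib.sha256(
            np.asarray(token_ids, dtype=np.int32).tobytes()).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = kv_block.astype(np.float16)  # or TSC-compressed
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> np.ndarray:
        return self.blocks[digest]
```

Per-user state then reduces to a list of block digests plus temporal deltas for the unique conversation tail, which is where the sub-linear GPU scaling comes from.
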
GREEN DIVIDEND · April 2026 ✦ New
The Green Dividend: How KV Cache Compression
Eliminates 88% of GPU Energy Waste
Every GPU that DeltaStore TSC removes from a serving fleet stops drawing 300–700 watts of power, 24/7/365. At 10K users, eliminating 30 A100s saves 102 MWh and 40 tonnes of CO₂ per year (93 tonnes on H100s). Quantifies energy, carbon, and water savings from single deployments to industry scale (10M users = 30,000 GPUs eliminated). Fully cited methodology with EPA, EIA, and NVIDIA sources.
30 GPUs eliminated · 93t CO₂ avoided (H100) · 102 MWh saved (A100) · 20 cars off the road · 184,500L water saved · Industry-scale projection
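
The headline energy and carbon numbers follow from a straightforward power-draw calculation. The inputs below are assumptions, roughly 390 W sustained per A100 and a US-grid carbon intensity of about 0.39 kg CO₂/kWh, chosen only because they land near the report's figures; the report's own methodology is the cited EPA/EIA/NVIDIA one.

```python
# Back-of-the-envelope energy and carbon math for eliminated GPUs.
gpus_removed = 30
avg_draw_kw = 0.39            # ~390 W sustained per A100 (assumed, within the 300-700 W range)
hours_per_year = 24 * 365
grid_kg_co2_per_kwh = 0.39    # assumed US-grid carbon intensity

energy_mwh = gpus_removed * avg_draw_kw * hours_per_year / 1000
co2_tonnes = energy_mwh * 1000 * grid_kg_co2_per_kwh / 1000

print(f"{energy_mwh:.0f} MWh/yr, {co2_tonnes:.0f} t CO2/yr")   # ~102 MWh, ~40 t CO2
```
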
SCALE VALIDATION · April 2026 ✦ New
Scale Validation: TSC Compression Holds
from 7B to 72B Parameters
Production-scale benchmark on NVIDIA H100: Mistral 7B, Qwen2.5 32B, and Qwen2 72B. Compression ratios are architecture-independent: the 4.1× Lossless and 6.2× Conservative tiers hold across all three models, and larger models compress better at conservative tiers. A full sweep of 12 configurations maps the Pareto frontier. Includes a deep dive on why a 6.2× snapshot becomes 19–62× at serving level through prefix dedup and temporal deltas.
3 models (7B–72B) · 6.2× → 62× serving · H100 80GB validated · 12 configs per model · 97% KV memory reduction · $4.3M/yr at enterprise scale
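
The jump from a 6.2× snapshot to 19–62× at serving level is a sharing effect, not extra quantization: whatever fraction of each user's cache is a shared, deduplicated prefix gets stored once across the fleet. Below is a toy model of that multiplier; the sharing fractions are pure assumptions used to show the shape of the effect, not measurements from the report.

```python
def serving_level_ratio(snapshot_ratio: float, shared_prefix_frac: float,
                        users_sharing: int) -> float:
    """Toy model: effective fleet-level compression when a fraction of each
    user's KV cache is a prefix stored once across `users_sharing` users and
    everything is also compressed at `snapshot_ratio`."""
    stored_per_user = ((1.0 - shared_prefix_frac)            # unique history
                       + shared_prefix_frac / users_sharing  # deduplicated prefix
                       ) / snapshot_ratio
    return 1.0 / stored_per_user

# Assumed sharing fractions:
print(f"{serving_level_ratio(6.2, 0.70, 2000):.0f}x")   # ~21x with 70% shared prefix
print(f"{serving_level_ratio(6.2, 0.90, 2000):.0f}x")   # ~62x with 90% shared prefix
```
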

Why KV Cache Compression Matters More Than Weight Compression

The field has focused on model weight compression (GPTQ, AWQ, QuIP#). These are valuable — but weights are a one-time fixed cost. The KV cache is the variable cost that grows with context length, conversation history, and concurrent users. It is the actual bottleneck for long-context production inference.
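
The size of the problem is easy to quantify: an fp16 KV cache stores two tensors (K and V) per layer, per KV head, per position. A quick calculation with assumed 7B-class dimensions (32 layers, 32 KV heads, head_dim 128) shows why the cache, not the weights, dominates long-context serving.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # K and V for every layer, KV head, position, and head dimension
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 7B-class dimensions: 32 layers, 32 KV heads, head_dim 128, fp16
per_user_gb = kv_cache_bytes(32, 32, 128, seq_len=8192) / 2**30
print(f"{per_user_gb:.1f} GB per user at 8K context")         # ~4 GB
print(f"{per_user_gb * 100:.0f} GB across 100 concurrent users")
```

At that point the cache for 100 users (roughly 400 GB) is more than an order of magnitude larger than the roughly 14 GB of fp16 weights, and unlike the weights it keeps growing with every turn of every conversation.
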

QuIP# (6×)

Compresses model weights. Requires calibration and retraining. Applied once at deployment. Operates on a fixed storage axis. State of the art for weight quantization.

TSC + DeltaStore (10.6–113×)

Compresses KV cache at runtime. No model modification. Works on any pretrained model. Serving-level architecture (cross-user dedup + temporal deltas) delivers sub-linear GPU scaling.

They compose: QuIP# model + TSC KV cache

These compress different things and stack freely. A QuIP#-compressed model running with TSC KV compression gets both savings simultaneously — on independent storage axes with no interference.

At the serving level, a LLaMA-3-7B deployment serving 2K concurrent users requires 7 GPUs without compression vs 3 GPUs with DeltaStore — and scaling to 10K users adds only 1 additional GPU thanks to sub-linear cost growth from shared prefix deduplication. Annual savings: $504K at 2K users, $1.35M at 10K users.
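
The dollar figures reduce to GPUs eliminated times a fully loaded annual cost per GPU. The cost constant below is an assumption (roughly $45K per A100-year, about a $5/hour cloud-equivalent rate) used only to show the arithmetic; the report's savings come from its own cost model.

```python
def annual_savings(gpus_eliminated: int, cost_per_gpu_year: float) -> float:
    """Fleet savings = GPUs eliminated x fully loaded annual cost per GPU."""
    return gpus_eliminated * cost_per_gpu_year

# Assumed fully loaded cost of ~$45K per A100-year (hardware, power, hosting):
print(f"${annual_savings(30, 45_000):,.0f} per year at 10K users")   # -> $1,350,000
```
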