New KV Packet method eliminates recomputation overhead in LLM caching, enabling faster inference on Llama-3.1 and Qwen2.5 models
arXiv cs.LG · April 16, 2026
AI Summary
• KV Packet treats cached documents as immutable "packets" and attaches lightweight trainable soft-token adapters that handle context shifts without recomputing KV states
• Achieves near-zero recomputation FLOPs and lower Time-to-First-Token (TTFT) latency than recomputation-based methods such as CacheBlend, EPIC, and SAM-KV
• Trains the adapters with self-supervised distillation to bridge context discontinuities, removing the non-negligible recomputation overhead incurred by previous approaches
• Demonstrates effectiveness on the Llama-3.1 and Qwen2.5 large language models, improving inference performance
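The summary above can be sketched in toy form: frozen per-document KV caches are concatenated as-is, with a small learned "bridge" inserted at each packet boundary instead of recomputing KV states. This is a minimal illustration of the idea only; the class and function names (`Packet`, `SoftTokenAdapter`, `assemble_context`) are hypothetical, not from the paper, and real KV states would be tensors trained via the distillation objective.

```python
# Hypothetical sketch of the KV "packet" idea: each cached document keeps
# its precomputed KV states frozen, and a small learned soft-token adapter
# is inserted at each packet boundary instead of recomputing KV states.
# All names are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field
from typing import List, Tuple

KV = Tuple[float, float]  # toy stand-in for one token's (key, value) pair

@dataclass(frozen=True)
class Packet:
    doc_id: str
    kv_states: Tuple[KV, ...]  # immutable: never recomputed after caching

@dataclass
class SoftTokenAdapter:
    # Trainable parameters bridging the discontinuity between two packets;
    # in the paper these would be learned via self-supervised distillation.
    bridge_kv: List[KV] = field(default_factory=lambda: [(0.0, 0.0)])

def assemble_context(packets: List[Packet],
                     adapters: List[SoftTokenAdapter]) -> List[KV]:
    """Concatenate frozen packet KV caches, inserting one adapter's
    soft-token KV states at each packet boundary (zero recomputation)."""
    assert len(adapters) == max(len(packets) - 1, 0)
    out: List[KV] = []
    for i, pkt in enumerate(packets):
        out.extend(pkt.kv_states)       # reuse cached states verbatim
        if i < len(adapters):
            out.extend(adapters[i].bridge_kv)  # learned boundary bridge
    return out

# Usage: two cached documents joined by one learned bridge adapter.
a = Packet("doc_a", ((1.0, 1.0), (2.0, 2.0)))
b = Packet("doc_b", ((3.0, 3.0),))
ctx = assemble_context([a, b], [SoftTokenAdapter()])
print(len(ctx))  # 2 cached + 1 bridge + 1 cached = 4
```

The point of the sketch is the cost profile: assembling the context is pure concatenation, so no forward pass over the cached documents is needed, which is where the near-zero FLOPs and lower TTFT come from.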