
Dnotitia released STAR-KV, a new approach to compressing the KV cache, the GPU memory that holds the attention keys and values for context an AI language model has already processed. It achieves up to 20x compression and faster inference through low-rank compression and GPU optimization. KV cache has emerged as a major bottleneck in AI infrastructure, consuming about 81% of GPU memory in some long-context scenarios, so this technique could help reduce inference costs as AI systems process increasingly large amounts of context.
What happened
Dnotitia Inc., a South Korean company, released the paper and source code for STAR-KV, a low-rank compression method for KV cache (temporary GPU memory used by AI language models). The technology was developed with UC San Diego's VVIP Lab and was selected as a Spotlight paper at ICML 2026, a tier that covers about 2.2% of reviewed submissions and about 8.4% of accepted papers.
Why it matters
KV cache has become a critical bottleneck in AI infrastructure—when a LLaMA-3.1-8B model processes a 128K-token context at batch size 4, the KV cache accounts for about 81% of total GPU memory. As AI systems handle longer documents, conversation histories, and external tool outputs, KV cache compression is increasingly seen as core infrastructure technology for processing long context at lower cost.
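The 81% figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented LLaMA-3.1-8B shape (32 layers, 8 key/value heads, head dimension 128), 16-bit storage for both cache and weights, and about 8.03 billion parameters. It ignores activations and framework overhead, so it is an estimate and not necessarily the paper's own accounting. The function name `kv_cache_bytes` is illustrative.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size: one key and one value tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed LLaMA-3.1-8B shape; 128K tokens taken as 128 * 1024.
kv = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                    seq_len=128 * 1024, batch_size=4)

# Assumed weight footprint: ~8.03e9 parameters at 2 bytes each.
weights = 8.03e9 * 2

# Share of (weights + KV cache) taken by the cache alone.
share = kv / (kv + weights)

print(f"KV cache: {kv / GIB:.1f} GiB")
print(f"Weights:  {weights / GIB:.1f} GiB")
print(f"KV share of weights + cache: {share:.1%}")
```

Under these assumptions the cache comes to 64 GiB against roughly 15 GiB of weights, which lands at about 81% and matches the reported share.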
What to watch
STAR-KV compressed the full KV cache by up to 20x using low-rank compression combined with mixed-precision quantization, and achieved attention computation speed increases of up to 6.9x and overall generation throughput increases of up to 3.1x. The paper and source code are available on arXiv and GitHub.
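To see how low-rank compression and quantization can compound into ratios of this order, here is a minimal sketch of the storage arithmetic. The rank, bit width, and the `compression_ratio` helper are hypothetical choices for illustration and are not STAR-KV's actual configuration or method. It assumes each token's cached vector of size `dim` is projected to `rank` values stored at `bits` bits, with one shared 16-bit projection matrix.

```python
def compression_ratio(seq_len, dim, rank, bits, baseline_bits=16):
    """Ratio of uncompressed to compressed storage, in bits.

    Hypothetical model: each token keeps `rank` values at `bits` bits,
    plus one dim x rank projection matrix shared across all tokens.
    """
    original = seq_len * dim * baseline_bits
    compressed = seq_len * rank * bits + dim * rank * baseline_bits
    return original / compressed


seq_len, dim = 128 * 1024, 128

# Low rank alone: 128 -> 32 dimensions, still 16-bit values.
rank_only = compression_ratio(seq_len, dim, rank=32, bits=16)

# Low rank plus 4-bit quantization of the reduced values.
rank_and_quant = compression_ratio(seq_len, dim, rank=32, bits=4)

print(f"low rank only:        {rank_only:.2f}x")
print(f"low rank + 4-bit:     {rank_and_quant:.2f}x")
```

In this toy setting the rank reduction gives a bit under 4x and adding 4-bit quantization brings it to a bit under 16x, since the two savings multiply while the shared projection matrix adds a small fixed overhead.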