Dnotitia's STAR-KV Achieves up to 20x KV Cache Compression

Yahoo Finance AI · 7h ago · 3 min read

Key takeaway

Dnotitia released STAR-KV, a new approach to compressing the KV cache, the temporary GPU memory that stores context for AI language models during inference. It achieves up to 20x compression and faster inference through low-rank compression and GPU optimization. The KV cache has emerged as a major bottleneck in AI infrastructure, consuming about 81% of GPU memory in some long-context scenarios, so the technique could help reduce inference costs as AI systems process increasingly large amounts of context.

3 Key Points

What happened
Dnotitia Inc., a South Korean company, released the paper and source code for STAR-KV, a low-rank compression method for the KV cache (temporary GPU memory used by AI language models). The technology was developed with UC San Diego's VVIP Lab and was selected as a Spotlight paper at ICML 2026, a designation given to about 2.2% of reviewed submissions and about 8.4% of accepted papers.
Why it matters
The KV cache has become a critical bottleneck in AI infrastructure. When a LLaMA-3.1-8B model processes a 128K-token context at batch size 4, the KV cache accounts for about 81% of total GPU memory. As AI systems handle longer documents, conversation histories, and external tool outputs, KV cache compression is increasingly seen as core infrastructure for processing long context at lower cost.
What to watch
STAR-KV compressed the full KV cache by up to 20x using low-rank compression combined with mixed-precision quantization, and achieved attention computation speed increases of up to 6.9x and overall generation throughput increases of up to 3.1x. The paper and source code are available on arXiv and GitHub.
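
The 81% memory share cited under "Why it matters" can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is an illustration, not a calculation taken from the paper. It assumes the publicly documented LLaMA-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128), 16-bit storage, and about 8.03 billion parameters. None of these values appear in the article, and activations and framework overhead are ignored, so the share is computed against weights plus cache only.

```python
# Assumed LLaMA-3.1-8B configuration (from the public model card, not this article).
NUM_LAYERS = 32
NUM_KV_HEADS = 8        # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2     # assumed FP16/BF16 storage
NUM_PARAMS = 8.03e9     # approximate parameter count


def kv_cache_bytes(context_tokens: int, batch_size: int) -> int:
    """Bytes needed to store keys and values for every layer and token."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens * batch_size


def kv_cache_share(context_tokens: int, batch_size: int) -> float:
    """KV cache as a fraction of (model weights + KV cache)."""
    kv = kv_cache_bytes(context_tokens, batch_size)
    weights = NUM_PARAMS * BYTES_PER_VALUE
    return kv / (kv + weights)


if __name__ == "__main__":
    tokens, batch = 128 * 1024, 4
    kv_gib = kv_cache_bytes(tokens, batch) / 2**30
    share = kv_cache_share(tokens, batch)
    print(f"KV cache: {kv_gib:.0f} GiB, share of weights + cache: {share:.0%}")
```

Under these assumptions the cache comes to 64 GiB against roughly 16 GB of weights, which lands at about 81% and is consistent with the figure quoted above.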

FAQ

How much faster is STAR-KV compared to existing methods?
STAR-KV increased attention computation speed by up to 6.9x and overall generation throughput by up to 3.1x. The method also showed higher accuracy than major existing KV cache compression methods.
What compression ratio does STAR-KV achieve?
Low-rank compression alone reduced the KV cache by up to 75%. When combined with the mixed-precision quantization method proposed in the paper, STAR-KV compressed the full KV cache by up to 20x.
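
To relate the two compression figures, a 75% reduction is equivalent to a 4x ratio. The sketch below works out what the quantization stage would have to contribute to reach 20x overall. It rests on two assumptions the article does not confirm: that the two stages multiply, and that the uncompressed cache uses 16-bit values. The function names are illustrative only, and the resulting bit width is an implied average, not a setting reported by the authors.

```python
def reduction_to_ratio(reduction: float) -> float:
    """Convert a fractional size reduction (0.75 = 75% smaller) to an Nx ratio."""
    return 1.0 / (1.0 - reduction)


def implied_quantization_ratio(total_ratio: float, low_rank_reduction: float) -> float:
    """Ratio the quantization stage must supply, assuming the stages multiply."""
    return total_ratio / reduction_to_ratio(low_rank_reduction)


if __name__ == "__main__":
    low_rank = reduction_to_ratio(0.75)               # figure from the article
    quant = implied_quantization_ratio(20.0, 0.75)    # 20x total from the article
    avg_bits = 16 / quant                             # assumes a 16-bit baseline
    print(f"low-rank: {low_rank:.0f}x, quantization: {quant:.0f}x, "
          f"implied average width: {avg_bits:.1f} bits")
```

Under these assumptions, quantization would need to deliver about 5x on top of the low-rank stage, or an average of roughly 3.2 bits per stored value.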
