AIToday

Khazad: Open-Source LLM Cache Cuts API Costs by ~50%

Hacker News · 10h ago · 5 min read

Key takeaway

Khazad is an open-source semantic cache that sits between Python applications and LLM APIs, replaying cached responses for semantically similar prompts instead of making redundant API calls. At a 0.50 cache hit rate, it cuts API call volume and spend by ~50% each, and responses served from the cache are ~96% faster than a live API call. It works transparently with OpenAI, Anthropic, Azure OpenAI, and compatible proxies, making it useful for teams running high-traffic FAQ bots, support tools, and development environments.
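
The figures above can be sanity-checked with simple arithmetic. The sketch below is not part of Khazad; the function name and parameters are hypothetical, and it assumes every request costs the same and that cache lookups add no meaningful cost or latency. Under those assumptions, the share of calls (and spend) avoided equals the hit rate, while the average latency reduction across all requests is slightly lower than the hit rate, because cache hits are only ~96% faster rather than instantaneous.

```python
def blended_savings(hit_rate, hit_latency_reduction=0.96):
    """Estimate overall savings from a cache with the given hit rate.

    Assumptions (not from Khazad itself): uniform cost per request,
    and negligible cost and latency for the cache lookup.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")

    # Every hit is one upstream API call (and its cost) avoided.
    calls_avoided = hit_rate

    # Hits take (1 - reduction) of the normal latency; misses take all of it.
    mean_latency_fraction = (
        hit_rate * (1.0 - hit_latency_reduction) + (1.0 - hit_rate)
    )
    return {
        "calls_avoided": calls_avoided,
        "mean_latency_reduction": 1.0 - mean_latency_fraction,
    }


result = blended_savings(hit_rate=0.50)
print(f"calls avoided: {result['calls_avoided']:.0%}")
print(f"mean latency reduction: {result['mean_latency_reduction']:.0%}")
```

At a 0.50 hit rate this gives 50% of calls avoided and a mean latency reduction of about 48% across all requests.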


3 Key Points

  • What happened

    Khazad, an open-source semantic cache for LLM API calls, intercepts HTTP traffic at the transport layer and serves cached responses to semantically similar prompts, using Redis Vector Sets for the similarity lookup. At a 0.50 hit rate, it delivers ~50% fewer API calls and ~50% lower spend, and responses served from the cache are ~96% faster. Existing call sites need no changes beyond a single init() call.

  • Why it matters

    For teams running high-volume, repetitive LLM traffic (FAQ bots, support assistants, RAG systems, dev/test environments), Khazad offers cost and latency savings without rewriting application code. It supports multiple LLM providers (OpenAI, Anthropic, Azure OpenAI, and OpenAI-compatible proxies like Ollama and vLLM) through a single Python init() call.

  • What to watch

    Khazad requires Python ≥3.10 and Redis 8 with Vector Sets support. It is httpx-only, so SDKs built on requests, aiohttp, or boto3 (AWS Bedrock) are not intercepted. Start with threshold=0.90 to control false positives, and treat the Redis instance with the same security care as application logs, since prompts are embedded and responses are stored in clear text.
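
To make the threshold idea concrete, here is a minimal in-memory sketch of a semantic cache lookup. It is an illustration only, not Khazad's implementation: Khazad stores embeddings in Redis Vector Sets and hooks into httpx, whereas this toy version scans a Python list, and every name in it (ToySemanticCache, put, get, ttl_seconds) is hypothetical. The embeddings are hand-written two-dimensional vectors standing in for the output of a real embedding model.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Optional


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class ToySemanticCache:
    """Hypothetical stand-in for a semantic cache; linear scan, no Redis."""

    threshold: float = 0.90
    ttl_seconds: Optional[float] = None
    _entries: list = field(default_factory=list)  # (embedding, response, stored_at)

    def put(self, embedding, response, now=None):
        stored_at = time.monotonic() if now is None else now
        self._entries.append((list(embedding), response, stored_at))

    def get(self, embedding, now=None):
        """Return (response, score); response is None on a miss."""
        now = time.monotonic() if now is None else now
        best_score, best_response = -1.0, None
        for cached_embedding, response, stored_at in self._entries:
            # Skip entries older than the TTL, if one is set.
            if self.ttl_seconds is not None and now - stored_at > self.ttl_seconds:
                continue
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Only a match at or above the threshold counts as a hit.
        if best_score >= self.threshold:
            return best_response, best_score
        return None, best_score


cache = ToySemanticCache(threshold=0.90, ttl_seconds=60)
cache.put([1.0, 0.0], "Reset it from Settings.", now=0.0)

hit, hit_score = cache.get([0.95, 0.05], now=10.0)      # similar prompt: hit
miss, miss_score = cache.get([0.0, 1.0], now=10.0)      # unrelated prompt: miss
expired, _ = cache.get([1.0, 0.0], now=100.0)           # past the TTL: miss
```

A higher threshold trades hit rate for safety: fewer prompts qualify as "similar enough", so fewer wrong answers are replayed. The TTL bounds how long a stored response can be served, which also limits how long sensitive text stays retrievable.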

FAQ

Which LLM providers and SDKs does Khazad support?
Khazad covers SDKs built on httpx, including OpenAI, Anthropic, Gemini via google-genai, Mistral, and most OpenAI-compatible proxies (vLLM, Ollama, LiteLLM). SDKs using requests, aiohttp, or boto3 (AWS Bedrock) are not intercepted.

What are the privacy and security considerations?
Prompts are embedded and responses are stored in clear text in Redis. If prompts may contain PII or secrets, set a ttl, enable Redis AUTH/TLS, and treat the Redis instance with the same care as application logs.

What configuration should I start with?
Start at threshold=0.90 and raise it if you see wrong cache hits. Watch avg_hit_similarity in get_stats(): if it sits near your threshold, your traffic may be too diverse to cache safely.
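
The last piece of advice can be turned into a simple check. The sketch below does not call Khazad's get_stats(); it recomputes an average from a plain list of hit similarity scores, and the function name and the 0.02 margin are assumptions chosen for illustration rather than values recommended by the project.

```python
from statistics import mean


def threshold_margin_report(hit_similarities, threshold, min_margin=0.02):
    """Flag when the average hit similarity sits close to the threshold.

    min_margin is an arbitrary illustrative cutoff, not a Khazad default.
    """
    if not hit_similarities:
        return {"avg_hit_similarity": None, "margin": None, "too_close": False}
    avg = mean(hit_similarities)
    margin = avg - threshold
    return {
        "avg_hit_similarity": avg,
        "margin": margin,
        "too_close": margin < min_margin,
    }


# Hits that barely clear the threshold suggest borderline matches.
borderline = threshold_margin_report([0.91, 0.92, 0.90], threshold=0.90)

# Hits well above the threshold suggest genuinely repetitive traffic.
comfortable = threshold_margin_report([0.97, 0.99, 0.98], threshold=0.90)
```

If the report flags the average as too close, the source's guidance applies: raise the threshold, or accept that the traffic may be too varied to cache safely.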
