
Khazad is an open-source semantic cache that sits between Python applications and LLM APIs, replaying semantically similar cached responses instead of making redundant API calls. At a 0.50 cache hit rate, it reduces API call volume by ~50%, speeds up responses by ~96% on cache hits, and lowers costs by ~50%. It works transparently with OpenAI, Anthropic, Azure OpenAI, and OpenAI-compatible proxies, making it useful for teams running high-traffic FAQ bots, support tools, and development environments.
What happened
Khazad, an open-source semantic cache for LLM API calls, intercepts HTTP traffic at the transport layer and serves semantically equivalent cached responses via Redis Vector Sets. At a 0.50 hit rate, it delivers ~50% fewer API calls, ~96% faster responses on cache hits, and ~50% lower spend; it works transparently with zero changes to application code.
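The relationship between hit rate and savings can be sketched with simple arithmetic. The function below is an illustration only, not part of Khazad: the name `estimate_savings` and all input values (a 1000 ms uncached call, a 40 ms cache hit, a $0.01 call cost) are assumptions chosen to match the ~96% per-hit speedup quoted above, and the model ignores the cost of embedding prompts and querying Redis.

```python
def estimate_savings(hit_rate, miss_latency_ms, hit_latency_ms, cost_per_call):
    """Estimate per-request averages for a cache with the given hit rate.

    Hypothetical helper for illustration; embedding and Redis lookup
    overhead are deliberately left out of the model.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return {
        # Every hit is one upstream API call that never happens.
        "api_calls_avoided": hit_rate,
        # Blended latency across hits and misses.
        "avg_latency_ms": hit_rate * hit_latency_ms + miss_rate * miss_latency_ms,
        # Only misses are billed by the provider.
        "avg_cost_per_request": miss_rate * cost_per_call,
        # Speedup seen on an individual cache hit.
        "speedup_on_hit": 1.0 - hit_latency_ms / miss_latency_ms,
    }


# Assumed inputs, not measurements from Khazad.
result = estimate_savings(
    hit_rate=0.50,
    miss_latency_ms=1000.0,
    hit_latency_ms=40.0,
    cost_per_call=0.01,
)
print(result)
```

Under these assumptions, half the calls and half the spend disappear, while the average latency across all requests drops by less than the per-hit figure, because misses still pay the full round trip.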
Why it matters
For teams running high-volume, repetitive LLM traffic (FAQ bots, support assistants, RAG systems, dev/test environments), Khazad offers cost and latency savings without rewriting application code. It supports multiple LLM providers (OpenAI, Anthropic, Azure OpenAI, and OpenAI-compatible proxies like Ollama and vLLM) through a single Python init() call.
What to watch
Khazad requires Python ≥3.10 and Redis 8 with Vector Sets support. It is httpx-only, so SDKs built on requests, aiohttp, or boto3 (AWS Bedrock) are not intercepted. Start with threshold=0.90 to control false positives, and treat the Redis instance with the same security care as application logs, since prompts are embedded and responses are stored in clear text.
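To see why the similarity threshold controls false positives, here is a minimal in-memory sketch of a threshold-gated semantic lookup. It is not Khazad's implementation: the class `ToySemanticCache`, the word-count `embed` function, and the linear scan are all stand-ins assumed for illustration, where the real tool uses an embedding model and Redis Vector Sets.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real system uses a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical cache: returns a stored response only above the threshold."""

    def __init__(self, threshold=0.90):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def get(self, prompt):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold counts as a miss; the caller would hit the real API.
        return best_response if best_score >= self.threshold else None


strict = ToySemanticCache(threshold=0.90)
strict.put("how do I reset my password", "Use the reset link.")
near_match = strict.get("how do I reset my password please")
unrelated = strict.get("how do I delete my account")

loose = ToySemanticCache(threshold=0.60)
loose.put("how do I reset my password", "Use the reset link.")
false_positive = loose.get("how do I delete my account")
```

With the stricter threshold, the near-duplicate prompt is served from the cache and the account-deletion prompt misses. With the looser threshold, the account-deletion prompt receives the password-reset answer, which is the kind of false positive a conservative starting value is meant to avoid.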