What AI Cost Firewall Does
AI Cost Firewall acts as a smart gateway for OpenAI-compatible chat-completion requests.
Instead of sending every request directly to an LLM provider, it checks whether the same or a similar request has already been answered.
Responsibilities
- validate requests
- normalize requests
- check exact cache
- check semantic cache
- forward upstream when needed
- store cache entries
- estimate token and cost savings
- expose metrics (Prometheus metrics and Grafana dashboards for cache diagnostics and cost visibility))
- handle readiness, shutdown, and reload behavior
- route cache misses to OpenAI-compatible upstreams
OpenAI-compatible gateway
Supported endpoint:
POST /v1/chat/completions
Existing applications can point their OpenAI-compatible client to AI Cost Firewall.
Two cache layers
| Layer | Backend | Purpose |
|---|---|---|
| Exact cache | Redis / Valkey | Reuse identical normalized requests |
| Semantic cache | Qdrant | Reuse semantically similar prompts |
Exact cache hits are fastest and have no embedding lookup cost. Semantic cache hits require embedding lookup but can reuse answers for similar prompts.
Deployment Patterns
AI Cost Firewall supports practical OpenAI-compatible deployment patterns including:
- OpenAI cloud deployments
- fully local Ollama deployments
- hybrid OpenAI + local embedding deployments
- OpenRouter routing deployments
- self-hosted vLLM deployments
Runnable examples are available under:
deploy/examples/