What AI Cost Firewall Does

AI Cost Firewall acts as a smart gateway for OpenAI-compatible chat-completion requests.

Instead of sending every request directly to an LLM provider, it checks whether the same or a similar request has already been answered.

Responsibilities

validate requests
normalize requests
check exact cache
check semantic cache
forward upstream when needed
store cache entries
estimate token and cost savings
expose metrics (Prometheus metrics and Grafana dashboards for cache diagnostics and cost visibility))
handle readiness, shutdown, and reload behavior
route cache misses to OpenAI-compatible upstreams

OpenAI-compatible gateway

Supported endpoint:

POST /v1/chat/completions

Existing applications can point their OpenAI-compatible client to AI Cost Firewall.

Two cache layers

Layer	Backend	Purpose
Exact cache	Redis / Valkey	Reuse identical normalized requests
Semantic cache	Qdrant	Reuse semantically similar prompts

Exact cache hits are fastest and have no embedding lookup cost. Semantic cache hits require embedding lookup but can reuse answers for similar prompts.

Deployment Patterns

AI Cost Firewall supports practical OpenAI-compatible deployment patterns including:

OpenAI cloud deployments
fully local Ollama deployments
hybrid OpenAI + local embedding deployments
OpenRouter routing deployments
self-hosted vLLM deployments

Runnable examples are available under:

deploy/examples/

Responsibilities​

OpenAI-compatible gateway​

Two cache layers​

Deployment Patterns​

Responsibilities

OpenAI-compatible gateway

Two cache layers

Deployment Patterns