Skip to main content

What AI Cost Firewall Does

AI Cost Firewall acts as a smart gateway for OpenAI-compatible chat-completion requests.

Instead of sending every request directly to an LLM provider, it checks whether the same or a similar request has already been answered.

Responsibilities

  • validate requests
  • normalize requests
  • check exact cache
  • check semantic cache
  • forward upstream when needed
  • store cache entries
  • estimate token and cost savings
  • expose metrics (Prometheus metrics and Grafana dashboards for cache diagnostics and cost visibility))
  • handle readiness, shutdown, and reload behavior
  • route cache misses to OpenAI-compatible upstreams

OpenAI-compatible gateway

Supported endpoint:

POST /v1/chat/completions

Existing applications can point their OpenAI-compatible client to AI Cost Firewall.

Two cache layers

LayerBackendPurpose
Exact cacheRedis / ValkeyReuse identical normalized requests
Semantic cacheQdrantReuse semantically similar prompts

Exact cache hits are fastest and have no embedding lookup cost. Semantic cache hits require embedding lookup but can reuse answers for similar prompts.

Deployment Patterns

AI Cost Firewall supports practical OpenAI-compatible deployment patterns including:

  • OpenAI cloud deployments
  • fully local Ollama deployments
  • hybrid OpenAI + local embedding deployments
  • OpenRouter routing deployments
  • self-hosted vLLM deployments

Runnable examples are available under:

deploy/examples/