Skip to main content

What is AI Cost Firewall?

AI Cost Firewall v0.2.0 is a pilot-ready OpenAI-compatible API gateway for caching, cost control, and operational visibility.

It sits between your application and an upstream LLM provider. Applications send requests to AI Cost Firewall instead of calling the provider directly. The firewall checks whether a response can be reused from cache and forwards only necessary requests upstream.

It supports practical OpenAI-compatible upstream and embedding endpoints, including cloud APIs, local gateways, and self-hosted model servers.

Why it exists

LLM applications often send repeated or semantically similar prompts. Without caching, every request can become an upstream API call, token usage, added latency, and additional cost.

AI Cost Firewall reduces this waste with two cache layers:

  1. Exact cache — reuses responses for identical normalized requests.
  2. Semantic cache — reuses responses for similar prompts when similarity is high enough.

Core capabilities

  • OpenAI-compatible /v1/chat/completions endpoint
  • Redis / Valkey exact caching
  • Qdrant semantic caching
  • Prometheus metrics and Grafana dashboards
  • strict configuration validation
  • model allowlist behavior through model_price
  • request size limits
  • readiness and liveness endpoints
  • graceful shutdown and hot reload via SIGHUP
  • semantic cache lifecycle control
  • ready-to-run Docker Compose deployment examples
  • OpenAI-compatible provider patterns for OpenAI, Ollama, LM Studio, vLLM, LiteLLM, and OpenRouter
  • /version endpoint for release and compatibility introspection

v0.2.0 compatibility model

v0.2.0 intentionally stays within the OpenAI-compatible API ecosystem. It supports OpenAI-compatible chat and embedding providers through the same flat configuration model.

It does not introduce native provider-specific API integrations or provider-specific configuration blocks. This keeps pilot deployments predictable while still covering common OpenAI-compatible providers and gateways.

The runtime exposes /version so operators can confirm the running release and compatibility assumptions.

Deployment examples

Ready-to-run pilot deployment patterns are available under:

deploy/examples/

These include OpenAI cloud, local Ollama, hybrid OpenAI + local embeddings, OpenRouter, and a full local stack with dashboards.

  1. What it does
  2. Quick Start with Docker
  3. Request flow
  4. Configuration overview
  5. Runtime overview
  6. Metrics